Expand Variables class to read s3 urls #464

rwegener2 · 2023-10-31T15:50:58Z

Goal

This PR generalizes the Variables class to display available variables from an s3 path. If an s3 path is given the Variables class uses the product and version to display available variables; it does not try to walk through the file and build a the variables tree. If a user tries to request data from outside the NSIDC bucket a warning is raised indicating that available variable lists may not be accurate.

How it was done

So simply! I'm excited about how easy this was, because it shows how worth it the efforts were to 1) create an auth Mixin and 2) make Variables an independent class 🙂

The biggest changes were in is2ref.py, in which the extract_product and extract_version functions were generalized to read from s3. They now accept an optional auth argument.

How it can be tested

from icepyx.core.variables import Variables

s3_path = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL09/006/2019/11/30/ATL09_20191130215436_09930501_006_01.h5'
v = Variables(path=s3_path)
print(v.product)
print(v.version)
v.avail()

As a template for anyone ever writing future tests, the extract_product and extract_version functions should also be tested, for example as:

import earthaccess
from icepyx.core.is2ref import extract_product

auth = earthaccess.login()
s3_path = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL09/006/2019/11/30/ATL09_20191130215436_09930501_006_01.h5'
extract_product(s3_path, auth=auth)

Co-authored-by: Wei Ji <[email protected]>

Co-authored-by: Jessica Scheick <[email protected]>

Refactor Variables class to be user facing functionality

rwegener2 · 2023-11-07T13:16:40Z

This comment is to start documenting timing for s3 reads. Below I compare wall time across 3 ways of initializing a Variables instance and listing the available variables.

Times

Syntax	Description	Instantiation Time	`.avail()` Time
`v = Variables(path=local_path)`	Initializing from a local file	448 ms ± 24.4 ms	215 ns ± 3.52 ns
`v = Variables(product='ATL09', version='006')`	Initializing with a user given product/version	200 ms ± 24.1 ms	220 ns ± 0.798 ns
`v = Variables(path=s3_path)`	Initializing with an s3 filepath	6.24 s ± 203 ms	219 ns ± 8.26 ns

Notes: Timing done using the %%timeit magic. Earthdata Login already established for all runs, so authentication time is not included in any of the times

Highlights

Initializing a Variables object with an s3 path is ~14 times slower than a with a local file, but the .avail() call is just as fast as local because product and version are known. I assume the most time intensive part of initializing a Variables class with an s3 url is reading the product and version from the s3 file.

I still think we should merge this PR

This is not super hopeful news off the bat, but I still think we should merge this and work on variable reads. The alternate approach to reading the product from the file is to use the filepath, but that had it's own set of dev challenges and also restricts icepyx to only NSIDC s3 data. I'm curious to get a more holistic view of read times before trying to optimize any particular piece.

rwegener2 · 2023-11-07T13:25:04Z

Alright @JessicaS11 this is ready for review! My only uncertainty is that Travis is hanging again 😕
Let me know what you think!

icepyx/core/is2ref.py

icepyx/core/validate_inputs.py

JessicaS11 · 2023-11-20T20:41:25Z

@rwegener2 This is super close. Can you confirm it all still looks good from your perspective after all the merge conflicts, rebasing, etc.? I think we just need to update the line in our variables example to note that you can create a Variables object from local file, s3 url, or product+version.

UPDATE: I randomly tried to read in a file while on this branch, and it forced me to authenticate even though I was reading a local file. Not sure if that's something we inadvertently introduced here, but figured it was the logical place to record it until one of us digs in!

rwegener2 · 2023-11-21T21:49:36Z

Thanks for helping move this to the finish line @JessicaS11! I think this looks ready to merge.

I randomly tried to read in a file while on this branch, and it forced me to authenticate even though I was reading a local file.

TLDR; This was unintended and it is not desired. As a fix I added an if statement in the init of Variables that checks if the path is s3 before authenticating.

Good catch. So it seems like one of the design elements I was most proud of in the auth module is working against us right now. We setup auth.py such that self.auth could exist anywhere as None, thus, but the actual authentication would only happen when self.auth was called. This makes authentication available anywhere but delays the process of authentication until it is actually needed.
By passing self.auth into the self._product = is2ref.extract_product(self._path, auth=self.auth) line of the Variables init, we are instructing the auth module to authenticate even though authentication isn't needed. This is a result of making it so the extract_product function works on either a local or s3 file. If we take auth=self.auth out then the s3 reads that need auth won't work.

The fix I am pushing adds an if statement in the Variables init to check if the path is an s3 path and creates a local variable that is either self.auth or None. I'm not totally sold that's the best option, but I wasn't in love with any of my other ideas either. I considered:

split validate_product into two functions - local and s3 versions. Call them separately in the variables init
make validate product a class method of variables (and also read)
make validate product a class method of auth
make the auth class an attribute of the classes instead of integrating it directly as a mixin

It's possible one of these other ones is better, but my feeling right now is to wait on one of these bigger restructures to see if there are other methods that get created as the cloud features of icepyx develop. The needs of those other functions could help influence how we approach these situations.

review-notebook-app · 2023-11-27T19:05:42Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

JessicaS11 · 2023-11-27T19:11:03Z

The fix I am pushing adds an if statement in the Variables init to check if the path is an s3 path and creates a local variable that is either self.auth or None.

This makes sense to me, in combination with

It's possible one of these other ones is better, but my feeling right now is to wait on one of these bigger restructures to see if there are other methods that get created as the cloud features of icepyx develop. The needs of those other functions could help influence how we approach these situations.

It's possible that some decorators could ultimately help us with this kind of issue - especially if the "check" consistently stays as "does this start with "s3"?".

I added a few notes in the vars example. It looks like this is ready to go!

* expand extract_product and extract_version to check for s3 url * add cloud notes to variables notebook --------- Co-authored-by: Jessica Scheick <[email protected]>

This reverts commit 8d7db9c.

rwegener2 and others added 30 commits August 1, 2023 17:07

mvp remove intake from Read

9d09ff9

Merge branch 'development' into refactor_intake

e5458a1

delete is2cat and references

24f6a42

remove extra comments

b13b847

update doc strings

0779b80

update tests

1cfbf72

update documentation for removing intake

de61d87

update approach paragraph

9f06611

remove one more instance of catalog from the docs

d019b9a

clear jupyter history

156ea89

Update icepyx/core/read.py

b26ca4e

Co-authored-by: Wei Ji <[email protected]>

remove intake and related modules

ce1ca76

Merge branch 'development' into read_arguments

fd00aeb

mvp with new read parameters

431af78

clean up remainder of file and remove extraneous comments

612662e

maintain backward compatibility and combine arguments

c16a003

update to new error message

7648078

update docs

4cfbfdb

glob kwargs and list error

f7f823b

formatting updates

203f3ad

Apply suggestions from code review

10d1591

Co-authored-by: Jessica Scheick <[email protected]>

remove num_files

0b23d1e

fix docs test typo

6f5bead

trying again to fix the build

035ee5a

add feedback to docs page

903c351

Merge branch 'development' into read_arguments

d842bde

fix typo

5e06de9

Merge branch 'development' into read_arguments

9ca29f1

Merge branch 'development' into read_arguments

e8e35ad

mvp for making a standalone variables class

e3566f8

Variables as an independent class (#451)

d5747fa

Refactor Variables class to be user facing functionality

Base automatically changed from indep_vars to development November 7, 2023 12:17

resolve merge conflicts

2e84bbc

make warning clearer

1e0bc69

rwegener2 marked this pull request as ready for review November 7, 2023 13:19

rwegener2 requested a review from JessicaS11 November 7, 2023 13:20

move function to trigger travis run

150d763

JessicaS11 reviewed Nov 11, 2023

View reviewed changes

icepyx/core/is2ref.py Outdated Show resolved Hide resolved

JessicaS11 reviewed Nov 11, 2023

View reviewed changes

icepyx/core/validate_inputs.py Show resolved Hide resolved

JessicaS11 force-pushed the development branch from d5747fa to 142b7ab Compare November 15, 2023 18:36

JessicaS11 and others added 4 commits November 20, 2023 15:08

Merge branch 'development' into s3_vars

ddd2540

move extract product and version functions back to end of file

d44dca0

remove duplicate imports and some comments

0837bd7

accidentally removed numpy...

c77de88

fix auth requirement for local files

454e30b

rwegener2 requested a review from JessicaS11 November 22, 2023 14:45

JessicaS11 and others added 2 commits November 27, 2023 13:50

Merge branch 'development' into s3_vars

f84bb32

add cloud notes to variables notebook

ef7aa1c

JessicaS11 approved these changes Nov 27, 2023

View reviewed changes

rwegener2 merged commit d26422e into development Nov 27, 2023
5 checks passed

rwegener2 deleted the s3_vars branch November 27, 2023 21:25

JessicaS11 added a commit that referenced this pull request Jan 5, 2024

Expand Variables class to read s3 urls (#464)

8d7db9c

* expand extract_product and extract_version to check for s3 url * add cloud notes to variables notebook --------- Co-authored-by: Jessica Scheick <[email protected]>

JessicaS11 added a commit that referenced this pull request Jan 5, 2024

Revert "Expand Variables class to read s3 urls (#464)"

e3e7652

This reverts commit 8d7db9c.

JessicaS11 linked an issue Jan 24, 2024 that may be closed by this pull request

Make Variables an independent class #450

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expand Variables class to read s3 urls #464

Expand Variables class to read s3 urls #464

rwegener2 commented Oct 31, 2023 •

edited

Loading

rwegener2 commented Nov 7, 2023 •

edited

Loading

rwegener2 commented Nov 7, 2023

JessicaS11 commented Nov 20, 2023 •

edited

Loading

rwegener2 commented Nov 21, 2023

review-notebook-app bot commented Nov 27, 2023

JessicaS11 commented Nov 27, 2023

Expand Variables class to read s3 urls #464

Expand Variables class to read s3 urls #464

Conversation

rwegener2 commented Oct 31, 2023 • edited Loading

Goal

How it was done

How it can be tested

rwegener2 commented Nov 7, 2023 • edited Loading

Times

Highlights

I still think we should merge this PR

rwegener2 commented Nov 7, 2023

JessicaS11 commented Nov 20, 2023 • edited Loading

rwegener2 commented Nov 21, 2023

review-notebook-app bot commented Nov 27, 2023

JessicaS11 commented Nov 27, 2023

rwegener2 commented Oct 31, 2023 •

edited

Loading

rwegener2 commented Nov 7, 2023 •

edited

Loading

JessicaS11 commented Nov 20, 2023 •

edited

Loading