Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand Variables class to read s3 urls #464

Merged
merged 79 commits into from
Nov 27, 2023
Merged

Expand Variables class to read s3 urls #464

merged 79 commits into from
Nov 27, 2023

Conversation

rwegener2
Copy link
Contributor

@rwegener2 rwegener2 commented Oct 31, 2023

Goal

This PR generalizes the Variables class to display available variables from an s3 path. If an s3 path is given the Variables class uses the product and version to display available variables; it does not try to walk through the file and build a the variables tree. If a user tries to request data from outside the NSIDC bucket a warning is raised indicating that available variable lists may not be accurate.

How it was done

So simply! I'm excited about how easy this was, because it shows how worth it the efforts were to 1) create an auth Mixin and 2) make Variables an independent class 🙂

The biggest changes were in is2ref.py, in which the extract_product and extract_version functions were generalized to read from s3. They now accept an optional auth argument.

How it can be tested

from icepyx.core.variables import Variables

s3_path = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL09/006/2019/11/30/ATL09_20191130215436_09930501_006_01.h5'
v = Variables(path=s3_path)
print(v.product)
print(v.version)
v.avail()

As a template for anyone ever writing future tests, the extract_product and extract_version functions should also be tested, for example as:

import earthaccess
from icepyx.core.is2ref import extract_product

auth = earthaccess.login()
s3_path = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL09/006/2019/11/30/ATL09_20191130215436_09930501_006_01.h5'
extract_product(s3_path, auth=auth)

Refactor Variables class to be user facing functionality
Base automatically changed from indep_vars to development November 7, 2023 12:17
@rwegener2
Copy link
Contributor Author

rwegener2 commented Nov 7, 2023

This comment is to start documenting timing for s3 reads. Below I compare wall time across 3 ways of initializing a Variables instance and listing the available variables.

Times

Syntax Description Instantiation Time .avail() Time
v = Variables(path=local_path) Initializing from a local file 448 ms ± 24.4 ms 215 ns ± 3.52 ns
v = Variables(product='ATL09', version='006') Initializing with a user given product/version 200 ms ± 24.1 ms 220 ns ± 0.798 ns
v = Variables(path=s3_path) Initializing with an s3 filepath 6.24 s ± 203 ms 219 ns ± 8.26 ns

Notes: Timing done using the %%timeit magic. Earthdata Login already established for all runs, so authentication time is not included in any of the times

Highlights

Initializing a Variables object with an s3 path is ~14 times slower than a with a local file, but the .avail() call is just as fast as local because product and version are known. I assume the most time intensive part of initializing a Variables class with an s3 url is reading the product and version from the s3 file.

I still think we should merge this PR

This is not super hopeful news off the bat, but I still think we should merge this and work on variable reads. The alternate approach to reading the product from the file is to use the filepath, but that had it's own set of dev challenges and also restricts icepyx to only NSIDC s3 data. I'm curious to get a more holistic view of read times before trying to optimize any particular piece.

@rwegener2 rwegener2 marked this pull request as ready for review November 7, 2023 13:19
@rwegener2 rwegener2 requested a review from JessicaS11 November 7, 2023 13:20
@rwegener2
Copy link
Contributor Author

Alright @JessicaS11 this is ready for review! My only uncertainty is that Travis is hanging again 😕
Let me know what you think!

icepyx/core/is2ref.py Outdated Show resolved Hide resolved
@JessicaS11
Copy link
Member

JessicaS11 commented Nov 20, 2023

@rwegener2 This is super close. Can you confirm it all still looks good from your perspective after all the merge conflicts, rebasing, etc.? I think we just need to update the line in our variables example to note that you can create a Variables object from local file, s3 url, or product+version.

UPDATE: I randomly tried to read in a file while on this branch, and it forced me to authenticate even though I was reading a local file. Not sure if that's something we inadvertently introduced here, but figured it was the logical place to record it until one of us digs in!

@rwegener2
Copy link
Contributor Author

Thanks for helping move this to the finish line @JessicaS11! I think this looks ready to merge.

I randomly tried to read in a file while on this branch, and it forced me to authenticate even though I was reading a local file.

TLDR; This was unintended and it is not desired. As a fix I added an if statement in the init of Variables that checks if the path is s3 before authenticating.

Good catch. So it seems like one of the design elements I was most proud of in the auth module is working against us right now. We setup auth.py such that self.auth could exist anywhere as None, thus, but the actual authentication would only happen when self.auth was called. This makes authentication available anywhere but delays the process of authentication until it is actually needed.
By passing self.auth into the self._product = is2ref.extract_product(self._path, auth=self.auth) line of the Variables init, we are instructing the auth module to authenticate even though authentication isn't needed. This is a result of making it so the extract_product function works on either a local or s3 file. If we take auth=self.auth out then the s3 reads that need auth won't work.

The fix I am pushing adds an if statement in the Variables init to check if the path is an s3 path and creates a local variable that is either self.auth or None. I'm not totally sold that's the best option, but I wasn't in love with any of my other ideas either. I considered:

  • split validate_product into two functions - local and s3 versions. Call them separately in the variables init
  • make validate product a class method of variables (and also read)
  • make validate product a class method of auth
  • make the auth class an attribute of the classes instead of integrating it directly as a mixin

It's possible one of these other ones is better, but my feeling right now is to wait on one of these bigger restructures to see if there are other methods that get created as the cloud features of icepyx develop. The needs of those other functions could help influence how we approach these situations.

@rwegener2 rwegener2 requested a review from JessicaS11 November 22, 2023 14:45
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@JessicaS11
Copy link
Member

The fix I am pushing adds an if statement in the Variables init to check if the path is an s3 path and creates a local variable that is either self.auth or None.

This makes sense to me, in combination with

It's possible one of these other ones is better, but my feeling right now is to wait on one of these bigger restructures to see if there are other methods that get created as the cloud features of icepyx develop. The needs of those other functions could help influence how we approach these situations.

It's possible that some decorators could ultimately help us with this kind of issue - especially if the "check" consistently stays as "does this start with "s3"?".

I added a few notes in the vars example. It looks like this is ready to go!

@rwegener2 rwegener2 merged commit d26422e into development Nov 27, 2023
5 checks passed
@rwegener2 rwegener2 deleted the s3_vars branch November 27, 2023 21:25
JessicaS11 added a commit that referenced this pull request Jan 5, 2024
* expand extract_product and extract_version to check for s3 url

* add cloud notes to variables notebook

---------

Co-authored-by: Jessica Scheick <[email protected]>
JessicaS11 added a commit that referenced this pull request Jan 5, 2024
@JessicaS11 JessicaS11 linked an issue Jan 24, 2024 that may be closed by this pull request
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make Variables an independent class
3 participants