Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand icepyx to read s3 data #468

Merged
merged 97 commits into from
Jan 4, 2024
Merged

Expand icepyx to read s3 data #468

merged 97 commits into from
Jan 4, 2024

Conversation

rwegener2
Copy link
Contributor

@rwegener2 rwegener2 commented Nov 7, 2023

Goal

NSIDC has put IS-2 data in the cloud, so we would like icepyx to be able to access that data for users. ☁️💙

How it was done

The EarthdataAuth class was added as a mixin to the Read class. The product and version are read from the file the same way they are read in using a local file. The python standard glob module does not accept s3 paths, so s3 paths are not glob-able (i.e. the user must always give the full exact filepath). There is an s3 version of glob, so this could be implemented in the future. Mixed lists of s3 paths and local paths are not allowed. If a user tries to load more than two files or three data variables at the same time they are given a warning that discourages continuing and requires user confirmation to proceed.

Very loose timing suggests that it takes about 5 minutes per data variable per file to load s3 data (testing on the dem_h variable of an ATL09 file). Running two files took 11 minutes.

How to test it

from icepyx.core.read import Read

s3_path = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL09/006/2020/06/18/ATL09_20200618112352_12820701_006_01.h5'
reader = Read(s3_path)
reader.vars.append(var_list=['dem_h'])
reader.load()

@rwegener2
Copy link
Contributor Author

Alright this PR is ready for re-review. Changes since the last review:

  • addressing this catch from @JessicaS11 about asking for authentication even for local reads
  • fix a small error about parsing bytes for reading ATL03
  • update the cloud tutorial notebook

@rwegener2 rwegener2 requested a review from JessicaS11 December 21, 2023 14:36
icepyx/core/read.py Outdated Show resolved Hide resolved
icepyx/core/read.py Outdated Show resolved Hide resolved
Comment on lines +437 to +444
if self.is_s3 is True and len(self._filelist) > 2:
warnings.warn(
"Processing more than two s3 files can take a prohibitively long time. "
"Approximate access time (using `.load()`) can exceed 6 minutes per data "
"variable.",
stacklevel=2,
)
_confirm_proceed()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rwegener2 I wanted to trigger the _confirm_proceed() so I passed in a list of s3 urls... I got the confirmation, but the warning never showed (and the confirmation is confusing without it). Wondering if it makes sense to kick the warning into the function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, interesting. The warning appeared when I was running it. That was a few commits back, though, so 🤷‍♀️
I think it's fine to add the warning to the _confirm_proceed() function. My guess is that either way we'll need to increase the stacklevel to 3 or 4 if the warning isn't showing up at level 2.

Copy link
Member

@JessicaS11 JessicaS11 Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seem to recall I was messing with the warnings settings in my hub for a presentation recently... since this is an important one, I added a cell to turn warnings on (changing the stack level didn't help in my case). Let me know if you think we should take a different approach. (EDIT: I left the warning where it was since the message depends on what triggers the confirmation.) Otherwise, this should be good to go.

@JessicaS11 JessicaS11 merged commit 4cf8841 into development Jan 4, 2024
2 checks passed
@JessicaS11 JessicaS11 deleted the s3_read branch January 4, 2024 15:02
JessicaS11 added a commit that referenced this pull request Jan 5, 2024
JessicaS11 added a commit that referenced this pull request Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants