Expand icepyx to read s3 data #468

Merged
merged 97 commits into from
Jan 4, 2024
Changes from 95 commits
Commits
9d09ff9
mvp remove intake from Read
rwegener2 Aug 1, 2023
e5458a1
Merge branch 'development' into refactor_intake
rwegener2 Aug 29, 2023
24f6a42
delete is2cat and references
rwegener2 Aug 29, 2023
b13b847
remove extra comments
rwegener2 Aug 30, 2023
0779b80
update doc strings
rwegener2 Aug 30, 2023
1cfbf72
update tests
rwegener2 Aug 30, 2023
de61d87
update documentation for removing intake
rwegener2 Aug 30, 2023
9f06611
update approach paragraph
rwegener2 Aug 30, 2023
d019b9a
remove one more instance of catalog from the docs
rwegener2 Aug 30, 2023
156ea89
clear jupyter history
rwegener2 Aug 30, 2023
b26ca4e
Update icepyx/core/read.py
rwegener2 Sep 1, 2023
ce1ca76
remove intake and related modules
rwegener2 Sep 1, 2023
fd00aeb
Merge branch 'development' into read_arguments
rwegener2 Sep 4, 2023
431af78
mvp with new read parameters
rwegener2 Sep 5, 2023
612662e
clean up remainder of file and remove extraneous comments
rwegener2 Sep 5, 2023
c16a003
maintain backward compatibility and combine arguments
rwegener2 Sep 5, 2023
7648078
update to new error message
rwegener2 Sep 5, 2023
4cfbfdb
update docs
rwegener2 Sep 8, 2023
f7f823b
glob kwargs and list error
rwegener2 Sep 8, 2023
203f3ad
formatting updates
rwegener2 Sep 8, 2023
10d1591
Apply suggestions from code review
rwegener2 Sep 12, 2023
0b23d1e
remove num_files
rwegener2 Sep 12, 2023
6f5bead
fix docs test typo
rwegener2 Sep 12, 2023
035ee5a
trying again to fix the build
rwegener2 Sep 12, 2023
903c351
add feedback to docs page
rwegener2 Sep 12, 2023
d842bde
Merge branch 'development' into read_arguments
rwegener2 Sep 13, 2023
5e06de9
fix typo
rwegener2 Sep 14, 2023
9ca29f1
Merge branch 'development' into read_arguments
rwegener2 Sep 14, 2023
e8e35ad
Merge branch 'development' into read_arguments
rwegener2 Sep 18, 2023
e3566f8
mvp for making a standalone variables class
rwegener2 Sep 18, 2023
1d53341
update QUEST and GenQuery classes for argo integration (#441)
JessicaS11 Sep 25, 2023
44fd8cc
clean comments
rwegener2 Oct 3, 2023
69dce54
split data_source into seperate arguments
rwegener2 Oct 16, 2023
72e1e37
clean dev notes
rwegener2 Oct 16, 2023
83d24fb
update docstrings
rwegener2 Oct 16, 2023
a187328
little fixes
rwegener2 Oct 17, 2023
dce23f9
upgrade Variables to an stand alone import
rwegener2 Oct 17, 2023
3561be8
update example notebooks
rwegener2 Oct 17, 2023
d13ac33
hide get_latest_version
rwegener2 Oct 17, 2023
593b9d1
update api docs
rwegener2 Oct 17, 2023
d03f9fb
temporarily disable OpenAltimetry API tests (#459)
JessicaS11 Oct 18, 2023
ee8b79f
fix spot number calculation (#458)
JessicaS11 Oct 18, 2023
a1a723d
Fix a broken link in IS2_data_access.ipynb (#456)
whyjz Oct 18, 2023
d86cc9e
update Read input arguments (#444)
rwegener2 Oct 18, 2023
aedbcce
enable QUEST kwarg handling (#452)
JessicaS11 Oct 19, 2023
652a815
remove variables from components section
rwegener2 Oct 20, 2023
120694a
fix error dropping components
rwegener2 Oct 20, 2023
b4d59d6
move latest_version to is2ref
rwegener2 Oct 20, 2023
3f4bfa7
current status
rwegener2 Oct 23, 2023
a29e756
mvp updated read class
rwegener2 Oct 23, 2023
3f3cb1f
Merge branch 'development' into indep_vars
rwegener2 Oct 23, 2023
4f8e95a
add error message if no vars.wanted
rwegener2 Oct 26, 2023
73f929e
docs: add rwegener2 as a contributor for bug, code, and 6 more (#460)
allcontributors[bot] Oct 26, 2023
a56a9c8
docs: add jpswinski as a contributor for review (#461)
allcontributors[bot] Oct 26, 2023
d0838c8
update query class to append required vars
rwegener2 Oct 26, 2023
8127dbf
remove local filepaths
rwegener2 Oct 27, 2023
b3341c1
clean extraneous comments
rwegener2 Oct 27, 2023
a184d9c
Merge branch 'development' into indep_vars
rwegener2 Oct 27, 2023
bdcc9bd
docs: add whyjz as a contributor for tutorial (#462)
allcontributors[bot] Oct 27, 2023
8ff1e70
Update icepyx/core/query.py
rwegener2 Oct 31, 2023
3217565
remove redundant lines
rwegener2 Oct 31, 2023
c0e5f4e
Merge branch 'development' into indep_vars
rwegener2 Oct 31, 2023
defc76f
respond to review
rwegener2 Oct 31, 2023
5de9173
Merge branch 'indep_vars' of https://github.com/icesat2py/icepyx into…
rwegener2 Oct 31, 2023
4bf2ba8
add forgotten docstring from previous PR
rwegener2 Oct 31, 2023
11625ec
allow Variables to read s3urls
rwegener2 Oct 31, 2023
fb90b0c
add newest icepyx citations (#455)
JessicaS11 Nov 2, 2023
0f18d56
add warning if user is accessing data outside NSIDC bucket
rwegener2 Nov 2, 2023
d5747fa
Variables as an independent class (#451)
rwegener2 Nov 7, 2023
2e84bbc
resolve merge conflicts
rwegener2 Nov 7, 2023
1e0bc69
make warning clearer
rwegener2 Nov 7, 2023
c1a0f99
mvp for s3 data reads
rwegener2 Nov 7, 2023
7801b33
cleaning mvp
rwegener2 Nov 7, 2023
6c4109a
Merge branch 'development' into s3_read
JessicaS11 Dec 5, 2023
6a9da53
implement user warnings and clean up
rwegener2 Dec 12, 2023
9dcfd5a
Merge branch 'development' into s3_read
rwegener2 Dec 12, 2023
861fb83
Update icepyx/core/read.py
rwegener2 Dec 20, 2023
f3191a2
fix local read auth requirement
rwegener2 Dec 20, 2023
66ca27f
fix bytes error for version parsing
rwegener2 Dec 20, 2023
b0828d8
expand cloud data access notebook
rwegener2 Dec 20, 2023
a579ef6
final edits to cloud tutorial
rwegener2 Dec 21, 2023
1051671
Merge branch 'development' into s3_read
rwegener2 Dec 21, 2023
f0260f1
Merge branch 'development' into s3_read
JessicaS11 Dec 22, 2023
1a7f028
apply black formatting
JessicaS11 Dec 22, 2023
59776fc
fix import statements
JessicaS11 Dec 22, 2023
8b5e071
fix few text items
JessicaS11 Dec 22, 2023
5d9397c
remove provider kwarg from get_s3fs_session
JessicaS11 Jan 2, 2024
15d6081
Merge branch 'development' into s3_read
JessicaS11 Jan 2, 2024
db7dfe5
fix local read bug
rwegener2 Jan 3, 2024
3e8af7f
fix file linting warnings
JessicaS11 Jan 3, 2024
5b91c85
move warning about proceeding with cloud reads
JessicaS11 Jan 4, 2024
4b7ed93
change warning stack level
JessicaS11 Jan 4, 2024
aad73a9
change warning stack level
JessicaS11 Jan 4, 2024
5d86f72
put stack warning level back to 2
JessicaS11 Jan 4, 2024
fb6e2a1
add warnings filter to notebook
JessicaS11 Jan 4, 2024
79a1a5f
put warning back since it depends on what triggers the confirm procee…
JessicaS11 Jan 4, 2024
307bc04
GitHub action UML generation auto-update
JessicaS11 Jan 4, 2024
282 changes: 238 additions & 44 deletions doc/source/example_notebooks/IS2_cloud_data_access.ipynb
@@ -12,35 +12,59 @@
"## Notes\n",
"1. ICESat-2 data became publicly available on the cloud on 29 September 2022. Thus, access methods and example workflows are still being developed by NSIDC, and the underlying code in icepyx will need to be updated now that these data (and the associated metadata) are available. We appreciate your patience and contributions (e.g. reporting bugs, sharing your code, etc.) during this transition!\n",
"2. This example and the code it describes are part of ongoing development. Current limitations to using these features are described throughout the example, as appropriate.\n",
"3. You **MUST** be working within an AWS instance. Otherwise, you will get a permissions error.\n",
"4. Cloud authentication is still more user-involved than we'd like. We're working to address this - let us know if you'd like to join the conversation!"
"3. You **MUST** be working within an AWS instance. Otherwise, you will get a permissions error."
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"## Querying for data and finding s3 urls"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import earthaccess\n",
"import icepyx as ipx"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Make sure the user sees important warnings if they try to read a lot of data from the cloud\n",
"import warnings\n",
"warnings.filterwarnings(\"always\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"user_expressions": []
},
"source": [
"Create an icepyx Query object"
"We will start the way we often do: by creating an icepyx Query object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# bounding box\n",
"# \"producerGranuleId\": \"ATL03_20191130221008_09930503_004_01.h5\",\n",
"short_name = 'ATL03'\n",
"spatial_extent = [-45, 58, -35, 75]\n",
"date_range = ['2019-11-30','2019-11-30']"
@@ -49,25 +73,32 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"reg=ipx.Query(short_name, spatial_extent, date_range)"
"reg = ipx.Query(short_name, spatial_extent, date_range)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"tags": [],
"user_expressions": []
},
"source": [
"## Get the granule s3 urls\n",
"You must specify `cloud=True` to get the needed s3 urls.\n",
"This function returns a list containing the list of the granule IDs and a list of the corresponding urls."
"### Get the granule s3 urls\n",
"\n",
"With this query object you can get a list of available granules. The `avail_granules` function returns a list containing two lists: the granule IDs and the corresponding urls. Use `cloud=True` to get the needed s3 urls."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"gran_ids = reg.avail_granules(ids=True, cloud=True)\n",
@@ -80,19 +111,114 @@
"user_expressions": []
},
"source": [
"## Log in to Earthdata and generate an s3 token\n",
"You can use icepyx's existing login functionality to generate your s3 data access token, which will be valid for *one* hour. The icepyx module will renew the token for you after an hour, but if viewing your token over the course of several hours you may notice the values will change.\n",
"## Determining variables of interest"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"There are several ways to view available variables. One is to use the existing Query object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"reg.order_vars.avail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"Another way is to use the variables module:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ipx.Variables(product=short_name).avail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"We can also do this using a specific s3 filepath from the Query object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ipx.Variables(path=gran_ids[1][0]).avail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"From any of these methods we can see that `h_ph` is a variable for this data product, so we will read that variable in the next step."
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"#### A Note on listing variables using s3 urls\n",
"\n",
"We can use the Variables module with an s3 url to explore available data variables the same way we do with local files. An important difference, however, is how the list of available variables is created. When reading a local file, the Variables module traverses the entire file and searches for the variables present in that file. This method is too time intensive with s3 data, so instead the product and version are read from the file and all possible variables associated with that product/version are reported as available. As long as you are using the NSIDC-provided s3 paths obtained via Earthdata search and the Query object, these lists will be the same."
]
},
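The note above depends on the s3 url pointing at NSIDC-provided data. A minimal sketch of how one might check that before listing variables is given here. The helper name `in_nsidc_bucket` is hypothetical and is not part of icepyx, and the bucket name is an assumption taken from the example url that appears elsewhere in this notebook.

```python
from urllib.parse import urlparse

# Assumption: bucket name copied from the example s3 url in this notebook;
# it is not read from icepyx itself.
NSIDC_BUCKET = "nsidc-cumulus-prod-protected"


def in_nsidc_bucket(s3url: str) -> bool:
    """Return True if the url uses the s3 scheme and targets the assumed NSIDC bucket.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(s3url)
    return parsed.scheme == "s3" and parsed.netloc == NSIDC_BUCKET


example_url = (
    "s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/"
    "ATL03_20191130221008_09930503_004_01.h5"
)
is_nsidc = in_nsidc_bucket(example_url)
```

For urls outside that bucket, the list of variables reported as available may not match what the file actually contains, since the list is derived from the product and version alone.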
{
"cell_type": "markdown",
"metadata": {
"tags": [],
"user_expressions": []
},
"source": [
"#### A Note on authentication\n",
"\n",
"Notice that accessing cloud data requires two layers of authentication: 1) authenticating with your Earthdata Login and 2) authenticating for cloud access. These both happen behind the scenes, without the need for users to provide any explicit commands.\n",
"\n",
"Icepyx uses earthaccess to generate your s3 data access token, which will be valid for *one* hour. Icepyx will also renew the token for you after an hour, so if viewing your token over the course of several hours you may notice the values will change.\n",
"\n",
"If you do want to see your s3 credentials, you can access them using:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# uncommenting the line below will print your temporary login credentials\n",
"# uncommenting the line below will print your temporary aws login credentials\n",
"# reg.s3login_credentials"
]
},
@@ -111,68 +237,136 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"user_expressions": []
},
"source": [
"## Set up your s3 file system using your credentials"
"## Choose a data file and access the data\n",
"\n",
"**Note: If you get a PermissionDenied Error when trying to read in the data, you may not be sending your request from an AWS hub in us-west-2. We're currently working on how to alert users if they will not be able to access ICESat-2 data in the cloud for this reason.**\n",
"\n",
"We are ready to read our data! We do this by creating a reader object and using the s3 url returned from the Query object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"s3 = earthaccess.get_s3fs_session(daac='NSIDC', provider=reg.s3login_credentials)"
"# the first index, [1], gets us into the list of s3 urls\n",
"# the second index, [0], gets us the first entry in that list.\n",
"s3url = gran_ids[1][0]\n",
"# s3url = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'"
]
},
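The double index `gran_ids[1][0]` can be easier to follow once the nested list is unpacked. The sketch below uses a mock of the structure described in this notebook (a list of granule IDs followed by a list of s3 urls), so it runs without a network connection or Earthdata login. The mock values are the granule ID and url shown in the notebook comments; the names `ids`, `s3urls`, and `url_by_id` are illustrative only.

```python
# Mock of the value returned by reg.avail_granules(ids=True, cloud=True),
# assuming the [list_of_ids, list_of_s3_urls] layout described in the notebook.
gran_ids = [
    ["ATL03_20191130221008_09930503_004_01.h5"],
    [
        "s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/"
        "ATL03_20191130221008_09930503_004_01.h5"
    ],
]

# Unpack the two inner lists instead of indexing with [0] and [1]
ids, s3urls = gran_ids

# Pair each granule ID with its url so a file can be looked up by name
url_by_id = dict(zip(ids, s3urls))

# Equivalent to gran_ids[1][0]
s3url = s3urls[0]
```

Looking a url up by granule ID avoids relying on list position when a query returns more than one granule.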
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"tags": [],
"user_expressions": []
},
"source": [
"## Select an s3 url and access the data\n",
"Data read in capabilities for cloud data are coming soon in icepyx (targeted Spring 2023). Stay tuned and we'd love for you to join us and contribute!\n",
"\n",
"**Note: If you get a PermissionDenied Error when trying to read in the data, you may not be sending your request from an AWS hub in us-west2. We're currently working on how to alert users if they will not be able to access ICESat-2 data in the cloud for this reason**"
"Create the Read object"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# the first index, [1], gets us into the list of s3 urls\n",
"# the second index, [0], gets us the first entry in that list.\n",
"s3url = gran_ids[1][0]\n",
"# s3url = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'"
"reader = ipx.Read(s3url)"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"This reader object gives us yet another way to view available variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import h5py\n",
"import numpy as np"
"reader.vars.avail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"Next, we append our desired variable to the `wanted_vars` list:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"reader.vars.append(var_list=['h_ph'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"Finally, we load the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%time f = h5py.File(s3.open(s3url,'rb'),'r')"
"%%time\n",
"\n",
"# This may take 5-10 minutes\n",
"reader.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"user_expressions": []
},
"source": [
"### Some important caveats\n",
"\n",
"While the cloud data reading is functional within icepyx, it is very slow. Approximate timing shows it takes ~6 minutes of load time per variable per file from s3. Because of this you will receive a warning if you try to load either more than three variables or two files at once.\n",
"\n",
"The slow load speed is a demonstration of the many steps involved in making cloud data actionable - the data supply chain needs optimized source data, efficient low level data readers, and high level libraries which are enabled to use the fastest low level data readers. Not all of these pieces are fully developed right now, but the progress being made is exciting and there is lots of room for contribution!"
]
},
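To make the timing caveat concrete, here is a rough back-of-the-envelope sketch. It is not icepyx's implementation: the function name `estimate_load_minutes` is hypothetical, the ~6 minute figure is the approximate value quoted in the notebook, and the thresholds assume the warning applies to more than three variables or more than two files.

```python
import warnings

# Approximate figure quoted in the notebook text; real timings will vary.
MINUTES_PER_VAR_PER_FILE = 6
# Assumed reading of the warning thresholds described in the notebook.
MAX_VARS = 3
MAX_FILES = 2


def estimate_load_minutes(n_files: int, n_vars: int) -> int:
    """Estimate s3 load time in minutes and warn on large requests.

    Hypothetical helper for illustration only.
    """
    if n_vars > MAX_VARS or n_files > MAX_FILES:
        warnings.warn(
            f"Loading {n_vars} variable(s) from {n_files} file(s) may be very slow.",
            UserWarning,
            stacklevel=2,
        )
    return n_files * n_vars * MINUTES_PER_VAR_PER_FILE


# One variable from one file, as in the notebook example
single_estimate = estimate_load_minutes(n_files=1, n_vars=1)
```

Under these assumptions the estimate grows with the product of files and variables, which is why restricting the variable list before calling `load()` matters so much for cloud reads.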
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"#### Credits\n",
"* notebook by: Jessica Scheick\n",
"* historic source material: [is2-nsidc-cloud.py](https://gist.github.com/bradlipovsky/80ab6a7aff3d3524b9616a9fc176065e#file-is2-nsidc-cloud-py-L28) by Brad Lipovsky"
"* notebook by: Jessica Scheick and Rachel Wegener"
]
}
],
@@ -192,7 +386,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.10.13"
}
},
"nbformat": 4,