
Read arguments #444

Merged · 37 commits · Oct 18, 2023
Changes from 30 commits

Commits
9d09ff9
mvp remove intake from Read
rwegener2 Aug 1, 2023
e5458a1
Merge branch 'development' into refactor_intake
rwegener2 Aug 29, 2023
24f6a42
delete is2cat and references
rwegener2 Aug 29, 2023
b13b847
remove extra comments
rwegener2 Aug 30, 2023
0779b80
update doc strings
rwegener2 Aug 30, 2023
1cfbf72
update tests
rwegener2 Aug 30, 2023
de61d87
update documentation for removing intake
rwegener2 Aug 30, 2023
9f06611
update approach paragraph
rwegener2 Aug 30, 2023
d019b9a
remove one more instance of catalog from the docs
rwegener2 Aug 30, 2023
156ea89
clear jupyter history
rwegener2 Aug 30, 2023
b26ca4e
Update icepyx/core/read.py
rwegener2 Sep 1, 2023
ce1ca76
remove intake and related modules
rwegener2 Sep 1, 2023
fd00aeb
Merge branch 'development' into read_arguments
rwegener2 Sep 4, 2023
431af78
mvp with new read parameters
rwegener2 Sep 5, 2023
612662e
clean up remainder of file and remove extraneous comments
rwegener2 Sep 5, 2023
c16a003
maintain backward compatibility and combine arguments
rwegener2 Sep 5, 2023
7648078
update to new error message
rwegener2 Sep 5, 2023
4cfbfdb
update docs
rwegener2 Sep 8, 2023
f7f823b
glob kwargs and list error
rwegener2 Sep 8, 2023
203f3ad
formatting updates
rwegener2 Sep 8, 2023
10d1591
Apply suggestions from code review
rwegener2 Sep 12, 2023
0b23d1e
remove num_files
rwegener2 Sep 12, 2023
6f5bead
fix docs test typo
rwegener2 Sep 12, 2023
035ee5a
trying again to fix the build
rwegener2 Sep 12, 2023
903c351
add feedback to docs page
rwegener2 Sep 12, 2023
d842bde
Merge branch 'development' into read_arguments
rwegener2 Sep 13, 2023
5e06de9
fix typo
rwegener2 Sep 14, 2023
9ca29f1
Merge branch 'development' into read_arguments
rwegener2 Sep 14, 2023
e8e35ad
Merge branch 'development' into read_arguments
rwegener2 Sep 18, 2023
4bcc518
Merge branch 'development' into read_arguments
JessicaS11 Sep 26, 2023
b2c2735
depreciate -> deprecate
JessicaS11 Oct 9, 2023
6b953f9
Apply suggestions from code review
rwegener2 Oct 10, 2023
45704a4
elaborate on multiple products warning
rwegener2 Oct 10, 2023
2bf2808
clarify glob section
rwegener2 Oct 10, 2023
1242881
test product name error
rwegener2 Oct 10, 2023
5f8589a
Merge branch 'development' into read_arguments
JessicaS11 Oct 18, 2023
e18cf7a
GitHub action UML generation auto-update
JessicaS11 Oct 18, 2023
180 changes: 127 additions & 53 deletions doc/source/example_notebooks/IS2_data_read-in.ipynb
@@ -63,9 +63,8 @@
"metadata": {},
@JessicaS11 (Member) commented on Oct 9, 2023:

Suggestion:

glob will not by default search all of the subdirectories for matching filepaths, but it has the ability to do so. If you would like to search recursively, you can achieve this by:

  1. passing the recursive=True argument into glob_kwargs (shown below)
  2. using /**/ in the filepath to match any level of nested folders (not shown below)
  3. using glob directly to create a list of filepaths (shown below)


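For reference, a compact sketch of the recursive-search approaches discussed in this suggestion, assuming hypothetical directory paths (Python's `glob` only expands `**` across directory levels when `recursive=True` is also set):

```python
import glob
import icepyx as ipx

# Option A: let icepyx run glob internally, forwarding recursive=True via glob_kwargs
reader = ipx.Read('/path/to/**/*.h5', glob_kwargs={'recursive': True})

# Option B: run glob yourself and pass the resulting list of filepaths to Read
list_of_files = glob.glob('/path/to/**/*.h5', recursive=True)
reader = ipx.Read(list_of_files)
```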

@rwegener2 (Contributor, Author) replied:

I streamlined this explanation, like your comment suggests. Let me know if it still isn't clear!

"outputs": [],
"source": [
"path_root = '/full/path/to/your/data/'\n",
"pattern = \"processed_ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5\"\n",
"reader = ipx.Read(path_root, \"ATL06\", pattern) # or ipx.Read(filepath, \"ATLXX\") if your filenames match the default pattern"
"path_root = '/full/path/to/your/ATL06_data/'\n",
"reader = ipx.Read(path_root)"
]
},
{
@@ -111,10 +110,9 @@
"\n",
"Reading in ICESat-2 data with icepyx happens in a few simple steps:\n",
"1. Let icepyx know where to find your data (this might be local files or urls to data in cloud storage)\n",
"2. Tell icepyx how to interpret the filename format\n",
"3. Create an icepyx `Read` object\n",
"4. Make a list of the variables you want to read in (does not apply for gridded products)\n",
"5. Load your data into memory (or read it in lazily, if you're using Dask)\n",
"2. Create an icepyx `Read` object\n",
"3. Make a list of the variables you want to read in (does not apply for gridded products)\n",
"4. Load your data into memory (or read it in lazily, if you're using Dask)\n",
"\n",
"We go through each of these steps in more detail in this notebook."
]
@@ -168,21 +166,18 @@
{
"cell_type": "markdown",
"id": "e8da42c1",
"metadata": {},
"metadata": {
"user_expressions": []
},
"source": [
"### Step 1: Set data source path\n",
"\n",
"Provide a full path to the data to be read in (i.e. opened).\n",
"Currently accepted inputs are:\n",
"* a directory\n",
"* a single file\n",
"\n",
"All files to be read in *must* have a consistent filename pattern.\n",
"If a directory is supplied as the data source, all files in any subdirectories that match the filename pattern will be included.\n",
"\n",
"S3 bucket data access is currently under development, and requires you are registered with NSIDC as a beta tester for cloud-based ICESat-2 data.\n",
"icepyx is working to ensure a smooth transition to working with remote files.\n",
"We'd love your help exploring and testing these features as they become available!"
"* a string path to directory - all files from the directory will be opened\n",
"* a string path to single file - one file will be opened\n",
"* a list of filepaths - all files in the list will be opened\n",
"* a glob string (see [glob](https://docs.python.org/3/library/glob.html)) - any files matching the glob pattern will be opened"
]
},
{
@@ -208,86 +203,145 @@
{
"cell_type": "code",
"execution_count": null,
"id": "e683ebf7",
"id": "fac636c2-e0eb-4e08-adaa-8f47623e46a1",
"metadata": {},
"outputs": [],
"source": [
"# urlpath = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'"
"# list_of_files = ['/my/data/ATL06/processed_ATL06_20190226005526_09100205_006_02.h5', \n",
"# '/my/other/data/ATL06/processed_ATL06_20191202102922_10160505_006_01.h5']"
]
},
{
"cell_type": "markdown",
"id": "92743496",
"id": "ba3ebeb0-3091-4712-b0f7-559ddb95ca5a",
"metadata": {
"user_expressions": []
},
"source": [
"### Step 2: Create a filename pattern for your data files\n",
"#### Glob Strings\n",
"\n",
"[glob](https://docs.python.org/3/library/glob.html) is a Python library which allows users to list files in their file systems whose paths match a given pattern. Icepyx uses the glob library to give users greater flexibility over their input file lists.\n",
"\n",
"glob works using `*` and `?` as wildcard characters, where `*` matches any number of characters and `?` matches a single character. For example:\n",
"\n",
"Files provided by NSIDC typically match the format `\"ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5\"` where the parameters in curly brackets indicate a parameter name (left of the colon) and character length or format (right of the colon).\n",
"Some of this information is used during data opening to help correctly read and label the data within the data structure, particularly when multiple files are opened simultaneously.\n",
"* `/this/path/*.h5`: refers to all `.h5` files in the `/this/path` folder (Example matches: \"/this/path/processed_ATL03_20191130221008_09930503_006_01.h5\" or \"/this/path/myfavoriteicsat-2file.h5\")\n",
"* `/this/path/*ATL07*.h5`: refers to all `.h5` files in the `/this/path` folder that have ATL07 in the filename. (Example matches: \"/this/path/ATL07-02_20221012220720_03391701_005_01.h5\" or \"/this/path/processed_ATL07.h5\")\n",
"* `/this/path/ATL??/*.h5`: refers to all `.h5` files that are in a subfolder of `/this/path` and a subdirectory of `ATL` followed by any 2 characters (Example matches: \"/this/path/ATL03/processed_ATL03_20191130221008_09930503_006_01.h5\", \"/this/path/ATL06/myfile.h5\")\n",
"\n",
"By default, icepyx will assume your filenames follow the default format.\n",
"However, you can easily read in other ICESat-2 data files by supplying your own filename pattern.\n",
"For instance, `pattern=\"ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5\"`. A few example patterns are provided below."
"See the glob documentation or other online explainer tutorials for more in depth explanation, or advanced glob paths such as character classes and ranges."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7318abd0",
"metadata": {},
"outputs": [],
"cell_type": "markdown",
"id": "20286c76-5632-4420-b2c9-a5a6b1952672",
"metadata": {
"user_expressions": []
},
"source": [
"#### Recursive Directory Search"
]
},
{
"cell_type": "markdown",
"id": "632bd1ce-2397-4707-a63f-9d5d2fc02fbc",
"metadata": {
"user_expressions": []
},
"source": [
"glob will not by default search all of the subdirectories for matching filepaths, but it has the ability to do so. To search recursively you need to 1) use `/**/` in the filepath to match any level of nested folders and 2) use the `recursive=True` argument. \n",
"\n",
"If you would like to search recursively, you can achieve this by either:\n",
"1. passing the `recursive` argument into `glob_kwargs`\n",
"2. using glob directly to create a list of filepaths"
]
},
{
"cell_type": "markdown",
"id": "da0cacd8-9ddc-4c31-86b6-167d850b989e",
"metadata": {
"user_expressions": []
},
"source": [
"# pattern = 'ATL06-{datetime:%Y%m%d%H%M%S}-Sample.h5'\n",
"# pattern = 'ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5'"
"Method 1: passing the `recursive` argument into `glob_kwargs`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f43e8664",
"id": "e276b876-9ec7-4991-8520-05c97824b896",
"metadata": {},
"outputs": [],
"source": [
"# pattern = \"ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5\""
"ipx.Read('/path/to/**/folder', glob_kwargs={'recursive': True})"
]
},
{
"cell_type": "markdown",
"id": "f5a1e85e-fc4a-405f-9710-0cb61b827f2c",
"metadata": {
"user_expressions": []
},
"source": [
"You can use `glob_kwargs` for any additional argument to Python's builtin `glob.glob` that you would like to pass in via icepyx."
]
},
{
"cell_type": "markdown",
"id": "76de9539-710c-49f6-9e9e-238849382c33",
"metadata": {
"user_expressions": []
},
"source": [
"Method 2: using glob directly to create a list of filepaths"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "992a77fb",
"id": "be79b0dd-efcf-4d50-bdb0-8e3ae8e8e38c",
"metadata": {},
"outputs": [],
"source": [
"# grid_pattern = \"ATL{product:2}_GL_0311_{res:3}m_{version:3}_{revision:2}.nc\""
"import glob"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6aec1a70",
"metadata": {},
"id": "5d088571-496d-479a-9fb7-833ed7e98676",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pattern = \"processed_ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5\""
"list_of_files = glob.glob('/path/to/**/folder', recursive=True)\n",
"ipx.Read(list_of_files)"
]
},
{
"cell_type": "markdown",
"id": "4275b04c",
"id": "08df2874-7c54-4670-8f37-9135ea296ff5",
"metadata": {
"user_expressions": []
},
"source": [
"### Step 3: Create an icepyx read object\n",
"```{admonition} Read Module Update\n",
"Previously, icepyx required two additional conditions: 1) a `product` argument and 2) that your files either matched the default `filename_pattern` or that the user provided their own `filename_pattern`. These two requirements have been removed. `product` is now read directly from the file metadata (the root group's `short_name` attribute). Flexibility to specify multiple files via the `filename_pattern` has been replaced with the [glob string](https://docs.python.org/3/library/glob.html) feature, and by allowing a list of filepaths as an argument.\n",
"\n",
"The `Read` object has two required inputs:\n",
"- `path` = a string with the full file path or full directory path to your hdf5 (.h5) format files.\n",
"- `product` = the data product you're working with, also known as the \"short name\".\n",
"The `product` and `filename_pattern` arguments have been maintained for backwards compatibility, but will be fully removed in icepyx version 1.0.0.\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "4275b04c",
"metadata": {
"user_expressions": []
},
"source": [
"### Step 2: Create an icepyx read object\n",
"\n",
"The `Read` object also accepts the optional keyword input:\n",
"- `pattern` = a formatted string indicating the filename pattern required for Intake's path_as_pattern argument."
"Using the `data_source` described in Step 1, we can create our Read object."
]
},
{
@@ -299,7 +353,17 @@
},
"outputs": [],
"source": [
"reader = ipx.Read(data_source=path_root, product=\"ATL06\", filename_pattern=pattern) # or ipx.Read(filepath, \"ATLXX\") if your filenames match the default pattern"
"reader = ipx.Read(data_source=path_root)"
]
},
{
"cell_type": "markdown",
"id": "7b2acfdb-75eb-4c64-b583-2ab19326aaee",
"metadata": {
"user_expressions": []
},
"source": [
"The Read object now contains the list of matching files that will eventually be loaded into Python. You can inspect its properties, such as the files that were located or the identified product, directly on the Read object."
]
},
{
@@ -309,7 +373,17 @@
"metadata": {},
"outputs": [],
"source": [
"reader._filelist"
"reader.filelist"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7455ee3f-f9ab-486e-b4c7-2fa2314d4084",
"metadata": {},
"outputs": [],
"source": [
"reader.product"
]
},
{
@@ -319,7 +393,7 @@
"user_expressions": []
},
"source": [
"### Step 4: Specify variables to be read in\n",
"### Step 3: Specify variables to be read in\n",
"\n",
"To load your data into memory or prepare it for analysis, icepyx needs to know which variables you'd like to read in.\n",
"If you've used icepyx to download data from NSIDC with variable subsetting (which is the default), then you may already be familiar with the icepyx `Variables` module and how to create and modify lists of variables.\n",
@@ -426,7 +500,7 @@
"user_expressions": []
},
"source": [
"### Step 5: Loading your data\n",
"### Step 4: Loading your data\n",
"\n",
"Now that you've set up all the options, you're ready to read your ICESat-2 data into memory!"
]
@@ -541,9 +615,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "general",
"display_name": "icepyx-dev",
"language": "python",
"name": "general"
"name": "icepyx-dev"
},
"language_info": {
"codemirror_mode": {
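Taken together, the notebook changes above describe the simplified `Read` interface this PR introduces. A minimal sketch of the accepted call forms, assuming hypothetical local paths and filenames:

```python
import icepyx as ipx

# Any of the accepted data_source forms can be passed directly to Read:
reader = ipx.Read('/full/path/to/your/ATL06_data/')   # a directory
reader = ipx.Read('/this/path/*ATL07*.h5')            # a glob string
reader = ipx.Read(['/my/data/granule_a.h5',           # a list of filepaths
                   '/my/other/data/granule_b.h5'])

# The product is read from file metadata rather than passed as an argument;
# the matched files and the identified product are exposed as properties:
print(reader.filelist)
print(reader.product)
```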
2 changes: 2 additions & 0 deletions doc/source/user_guide/documentation/read.rst
@@ -19,6 +19,8 @@ Attributes
.. autosummary::
:toctree: ../../_icepyx/

Read.filelist
Read.product
Read.vars


5 changes: 3 additions & 2 deletions icepyx/core/is2ref.py
@@ -15,6 +15,7 @@ def _validate_product(product):
"""
Confirm a valid ICESat-2 product was specified
"""
error_msg = "A valid product string was not provided. Check user input, if given, or file metadata."
if isinstance(product, str):
product = str.upper(product)
assert product in [
@@ -40,9 +41,9 @@
"ATL20",
"ATL21",
"ATL23",
], "Please enter a valid product"
], error_msg
else:
raise TypeError("Please enter a product string")
raise TypeError(error_msg)
return product


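The `is2ref.py` change above routes both validation failures through one shared error message. A small sketch of the resulting behavior, calling the private helper directly (something ordinary users would not normally do):

```python
from icepyx.core.is2ref import _validate_product

_validate_product("atl06")   # input is uppercased and checked; returns "ATL06"

# Invalid inputs now raise with the shared message
# "A valid product string was not provided. Check user input, if given, or file metadata."
_validate_product("ATL99")   # raises AssertionError (not a valid ICESat-2 product)
_validate_product(None)      # raises TypeError (not a string)
```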