update Read input arguments (#444)
* add filelist and product properties to Read object
* deprecate filename_pattern and product class Read inputs
* transition to data_source input as a string (including glob string) or list
* update tutorial with changes and user guidance for using glob

---------
Co-authored-by: Jessica Scheick <[email protected]>
rwegener2 authored and JessicaS11 committed Nov 15, 2023
1 parent 9727e3e commit e591d83
Showing 7 changed files with 353 additions and 183 deletions.
182 changes: 129 additions & 53 deletions doc/source/example_notebooks/IS2_data_read-in.ipynb
@@ -63,9 +63,8 @@
"metadata": {},
"outputs": [],
"source": [
"path_root = '/full/path/to/your/data/'\n",
"pattern = \"processed_ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5\"\n",
"reader = ipx.Read(path_root, \"ATL06\", pattern) # or ipx.Read(filepath, \"ATLXX\") if your filenames match the default pattern"
"path_root = '/full/path/to/your/ATL06_data/'\n",
"reader = ipx.Read(path_root)"
]
},
{
@@ -111,10 +110,9 @@
"\n",
"Reading in ICESat-2 data with icepyx happens in a few simple steps:\n",
"1. Let icepyx know where to find your data (this might be local files or urls to data in cloud storage)\n",
"2. Tell icepyx how to interpret the filename format\n",
"3. Create an icepyx `Read` object\n",
"4. Make a list of the variables you want to read in (does not apply for gridded products)\n",
"5. Load your data into memory (or read it in lazily, if you're using Dask)\n",
"2. Create an icepyx `Read` object\n",
"3. Make a list of the variables you want to read in (does not apply for gridded products)\n",
"4. Load your data into memory (or read it in lazily, if you're using Dask)\n",
"\n",
"We go through each of these steps in more detail in this notebook."
]
@@ -168,21 +166,18 @@
{
"cell_type": "markdown",
"id": "e8da42c1",
"metadata": {},
"metadata": {
"user_expressions": []
},
"source": [
"### Step 1: Set data source path\n",
"\n",
"Provide a full path to the data to be read in (i.e. opened).\n",
"Currently accepted inputs are:\n",
"* a directory\n",
"* a single file\n",
"\n",
"All files to be read in *must* have a consistent filename pattern.\n",
"If a directory is supplied as the data source, all files in any subdirectories that match the filename pattern will be included.\n",
"\n",
"S3 bucket data access is currently under development, and requires you are registered with NSIDC as a beta tester for cloud-based ICESat-2 data.\n",
"icepyx is working to ensure a smooth transition to working with remote files.\n",
"We'd love your help exploring and testing these features as they become available!"
"* a string path to a directory - all files from the directory will be opened\n",
"* a string path to a single file - one file will be opened\n",
"* a list of filepaths - all files in the list will be opened\n",
"* a glob string (see [glob](https://docs.python.org/3/library/glob.html)) - any files matching the glob pattern will be opened"
]
},
{
@@ -208,86 +203,147 @@
{
"cell_type": "code",
"execution_count": null,
"id": "e683ebf7",
"id": "fac636c2-e0eb-4e08-adaa-8f47623e46a1",
"metadata": {},
"outputs": [],
"source": [
"# urlpath = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'"
"# list_of_files = ['/my/data/ATL06/processed_ATL06_20190226005526_09100205_006_02.h5', \n",
"# '/my/other/data/ATL06/processed_ATL06_20191202102922_10160505_006_01.h5']"
]
},
{
"cell_type": "markdown",
"id": "92743496",
"id": "ba3ebeb0-3091-4712-b0f7-559ddb95ca5a",
"metadata": {
"user_expressions": []
},
"source": [
"### Step 2: Create a filename pattern for your data files\n",
"#### Glob Strings\n",
"\n",
"[glob](https://docs.python.org/3/library/glob.html) is a Python library that allows users to list files in their file systems whose paths match a given pattern. icepyx uses the glob library to give users greater flexibility over their input file lists.\n",
"\n",
"glob works using `*` and `?` as wildcard characters, where `*` matches any number of characters and `?` matches a single character. For example:\n",
"\n",
"Files provided by NSIDC typically match the format `\"ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5\"` where the parameters in curly brackets indicate a parameter name (left of the colon) and character length or format (right of the colon).\n",
"Some of this information is used during data opening to help correctly read and label the data within the data structure, particularly when multiple files are opened simultaneously.\n",
"* `/this/path/*.h5`: refers to all `.h5` files in the `/this/path` folder (Example matches: \"/this/path/processed_ATL03_20191130221008_09930503_006_01.h5\" or \"/this/path/myfavoriteicesat-2file.h5\")\n",
"* `/this/path/*ATL07*.h5`: refers to all `.h5` files in the `/this/path` folder that have ATL07 in the filename. (Example matches: \"/this/path/ATL07-02_20221012220720_03391701_005_01.h5\" or \"/this/path/processed_ATL07.h5\")\n",
"* `/this/path/ATL??/*.h5`: refers to all `.h5` files in any subdirectory of `/this/path` whose name is `ATL` followed by any two characters (Example matches: \"/this/path/ATL03/processed_ATL03_20191130221008_09930503_006_01.h5\" or \"/this/path/ATL06/myfile.h5\")\n",
"\n",
"By default, icepyx will assume your filenames follow the default format.\n",
"However, you can easily read in other ICESat-2 data files by supplying your own filename pattern.\n",
"For instance, `pattern=\"ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5\"`. A few example patterns are provided below."
"See the glob documentation or other online tutorials for a more in-depth explanation, or for advanced glob patterns such as character classes and ranges."
]
},
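As a quick aside, the wildcard rules above can be checked with the standard library alone. The sketch below (file paths invented purely for illustration) uses `fnmatch`, which applies the same `*`/`?` matching rules glob uses, to show which paths each pattern would select:

```python
import fnmatch

# Hypothetical file paths, invented purely to illustrate the matching rules.
paths = [
    "/this/path/processed_ATL03_20191130221008_09930503_006_01.h5",
    "/this/path/ATL07-02_20221012220720_03391701_005_01.h5",
    "/this/path/notes.txt",
]

# '*' matches any number of characters; '?' matches exactly one.
h5_files = [p for p in paths if fnmatch.fnmatch(p, "/this/path/*.h5")]
atl07_files = [p for p in paths if fnmatch.fnmatch(p, "/this/path/*ATL07*.h5")]

print(h5_files)    # both .h5 paths
print(atl07_files) # only the ATL07 path
```

Unlike `glob`, `fnmatch` never touches the filesystem (and does not treat `/` specially), which makes it a convenient way to try out a pattern against example names before handing the real glob string to icepyx.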
{
"cell_type": "code",
"execution_count": null,
"id": "7318abd0",
"metadata": {},
"outputs": [],
"cell_type": "markdown",
"id": "20286c76-5632-4420-b2c9-a5a6b1952672",
"metadata": {
"user_expressions": []
},
"source": [
"#### Recursive Directory Search"
]
},
{
"cell_type": "markdown",
"id": "632bd1ce-2397-4707-a63f-9d5d2fc02fbc",
"metadata": {
"user_expressions": []
},
"source": [
"glob will not search subdirectories for matching filepaths by default, but it has the ability to do so.\n",
"\n",
"If you would like to search recursively, you can achieve this by either:\n",
"1. passing the `recursive` argument into `glob_kwargs` and including `/**/` in your filepath\n",
"2. using glob directly to create a list of filepaths\n",
"\n",
"Each of these two methods is shown below."
]
},
{
"cell_type": "markdown",
"id": "da0cacd8-9ddc-4c31-86b6-167d850b989e",
"metadata": {
"user_expressions": []
},
"source": [
"# pattern = 'ATL06-{datetime:%Y%m%d%H%M%S}-Sample.h5'\n",
"# pattern = 'ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5'"
"Method 1: passing the `recursive` argument into `glob_kwargs`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f43e8664",
"id": "e276b876-9ec7-4991-8520-05c97824b896",
"metadata": {},
"outputs": [],
"source": [
"# pattern = \"ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5\""
"ipx.Read('/path/to/**/folder', glob_kwargs={'recursive': True})"
]
},
{
"cell_type": "markdown",
"id": "f5a1e85e-fc4a-405f-9710-0cb61b827f2c",
"metadata": {
"user_expressions": []
},
"source": [
"You can use `glob_kwargs` to pass any additional arguments of Python's built-in `glob.glob` through icepyx."
]
},
{
"cell_type": "markdown",
"id": "76de9539-710c-49f6-9e9e-238849382c33",
"metadata": {
"user_expressions": []
},
"source": [
"Method 2: using glob directly to create a list of filepaths"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "992a77fb",
"id": "be79b0dd-efcf-4d50-bdb0-8e3ae8e8e38c",
"metadata": {},
"outputs": [],
"source": [
"# grid_pattern = \"ATL{product:2}_GL_0311_{res:3}m_{version:3}_{revision:2}.nc\""
"import glob"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6aec1a70",
"metadata": {},
"id": "5d088571-496d-479a-9fb7-833ed7e98676",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pattern = \"processed_ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5\""
"list_of_files = glob.glob('/path/to/**/folder', recursive=True)\n",
"ipx.Read(list_of_files)"
]
},
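To see concretely what `recursive=True` changes, here is a self-contained sketch (directory and file names invented) that builds a throwaway tree with the standard library and globs it both ways:

```python
import glob
import os
import tempfile

# Build a throwaway tree: root/top.h5, root/ATL06/a.h5, root/ATL06/2019/b.h5
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "ATL06", "2019"))
for rel in ["top.h5", "ATL06/a.h5", "ATL06/2019/b.h5"]:
    open(os.path.join(root, *rel.split("/")), "w").close()

pattern = os.path.join(root, "**", "*.h5")

# Without recursive=True, '**' behaves like a plain '*' (one directory level).
flat = glob.glob(pattern)
# With recursive=True, '**' matches zero or more nested directories.
deep = glob.glob(pattern, recursive=True)

print(len(flat), len(deep))  # 1 3
```

Only the recursive call finds files at the top level and two levels deep; the non-recursive call sees just the single-level match.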
{
"cell_type": "markdown",
"id": "4275b04c",
"id": "08df2874-7c54-4670-8f37-9135ea296ff5",
"metadata": {
"user_expressions": []
},
"source": [
"### Step 3: Create an icepyx read object\n",
"```{admonition} Read Module Update\n",
"Previously, icepyx required two additional inputs: 1) a `product` argument and 2) that your files either matched the default `filename_pattern` or that you provided your own `filename_pattern`. These two requirements have been removed. `product` is now read directly from the file metadata (the root group's `short_name` attribute). The flexibility to specify multiple files via a `filename_pattern` has been replaced by the [glob string](https://docs.python.org/3/library/glob.html) feature and by allowing a list of filepaths as an argument.\n",
"\n",
"The `Read` object has two required inputs:\n",
"- `path` = a string with the full file path or full directory path to your hdf5 (.h5) format files.\n",
"- `product` = the data product you're working with, also known as the \"short name\".\n",
"The `product` and `filename_pattern` arguments have been maintained for backwards compatibility, but will be fully removed in icepyx version 1.0.0.\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "4275b04c",
"metadata": {
"user_expressions": []
},
"source": [
"### Step 2: Create an icepyx read object\n",
"\n",
"The `Read` object also accepts the optional keyword input:\n",
"- `pattern` = a formatted string indicating the filename pattern required for Intake's path_as_pattern argument."
"Using the `data_source` described in Step 1, we can create our Read object."
]
},
{
@@ -299,7 +355,17 @@
},
"outputs": [],
"source": [
"reader = ipx.Read(data_source=path_root, product=\"ATL06\", filename_pattern=pattern) # or ipx.Read(filepath, \"ATLXX\") if your filenames match the default pattern"
"reader = ipx.Read(data_source=path_root)"
]
},
{
"cell_type": "markdown",
"id": "7b2acfdb-75eb-4c64-b583-2ab19326aaee",
"metadata": {
"user_expressions": []
},
"source": [
"The `Read` object now contains a list of the matching files that will eventually be loaded into Python. You can inspect the files that were located and the identified product via the `filelist` and `product` properties."
]
},
{
@@ -309,7 +375,17 @@
"metadata": {},
"outputs": [],
"source": [
"reader._filelist"
"reader.filelist"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7455ee3f-f9ab-486e-b4c7-2fa2314d4084",
"metadata": {},
"outputs": [],
"source": [
"reader.product"
]
},
{
@@ -319,7 +395,7 @@
"user_expressions": []
},
"source": [
"### Step 4: Specify variables to be read in\n",
"### Step 3: Specify variables to be read in\n",
"\n",
"To load your data into memory or prepare it for analysis, icepyx needs to know which variables you'd like to read in.\n",
"If you've used icepyx to download data from NSIDC with variable subsetting (which is the default), then you may already be familiar with the icepyx `Variables` module and how to create and modify lists of variables.\n",
@@ -426,7 +502,7 @@
"user_expressions": []
},
"source": [
"### Step 5: Loading your data\n",
"### Step 4: Loading your data\n",
"\n",
"Now that you've set up all the options, you're ready to read your ICESat-2 data into memory!"
]
@@ -541,9 +617,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "general",
"display_name": "icepyx-dev",
"language": "python",
"name": "general"
"name": "icepyx-dev"
},
"language_info": {
"codemirror_mode": {