diff --git a/.github/workflows/main.yaml b/.github/workflows/main.yaml index d39ecb1a..9736c134 100644 --- a/.github/workflows/main.yaml +++ b/.github/workflows/main.yaml @@ -30,7 +30,7 @@ jobs: with: path: _build # NOTE: change key to "jupyterbook-DATE" to force rebuilding cache - key: jupyterbook-20230626 + key: jupyterbook-20230707 - name: Install Conda environment with Micromamba uses: mamba-org/setup-micromamba@v1 diff --git a/.github/workflows/preview.yaml b/.github/workflows/preview.yaml index 36e52d50..42e95e97 100644 --- a/.github/workflows/preview.yaml +++ b/.github/workflows/preview.yaml @@ -21,7 +21,7 @@ jobs: with: path: _build # NOTE: change key to "jupyterbook-DATE" to force rebuilding cache - key: jupyterbook-20230626 + key: jupyterbook-20230707 - name: Install Conda environment with Micromamba uses: mamba-org/setup-micromamba@v1 diff --git a/_config.yml b/_config.yml index 78978501..6ec14e9d 100644 --- a/_config.yml +++ b/_config.yml @@ -79,5 +79,6 @@ sphinx: rediraffe_redirects: scipy-tutorial/00_overview.ipynb: overview/get-started.md workshops/scipy2022/README.md: overview/fundamental-path/README.md + fundamentals/02.1_working_with_labeled_data.ipynb: fundamentals/02.1_indexing_Basic.ipynb bibtex_reference_style: author_year # or label, super, \supercite diff --git a/_toc.yml b/_toc.yml index 6e8ea89e..976b745a 100644 --- a/_toc.yml +++ b/_toc.yml @@ -19,7 +19,7 @@ parts: - file: fundamentals/01.1_io - file: fundamentals/02_labeled_data.md sections: - - file: fundamentals/02.1_working_with_labeled_data + - file: fundamentals/02.1_indexing_Basic.ipynb - file: fundamentals/02.2_manipulating_dimensions - file: fundamentals/03_computation.md sections: @@ -37,6 +37,10 @@ parts: - caption: Intermediate chapters: - file: intermediate/01-high-level-computation-patterns + - file: intermediate/indexing/indexing + sections: + - file: intermediate/indexing/advanced-indexing.ipynb + - file: intermediate/indexing/boolean-masking-indexing.ipynb - file: 
intermediate/xarray_and_dask - file: intermediate/xarray_ecosystem - file: intermediate/hvplot diff --git a/advanced/backends/backends.md b/advanced/backends/backends.md index a7312b44..a53df382 100644 --- a/advanced/backends/backends.md +++ b/advanced/backends/backends.md @@ -22,7 +22,7 @@ Xarray bundles several backends internally for the following formats: External Backends that use the new backend API (xarray >= v0.18.0) that allows to add support for backend without any change to Xarray - [cfgrib](https://github.com/ecmwf/cfgrib) - GRIB -- [tiledb](https://pythonrepo.com/repo/TileDB-Inc-TileDB-xarray) - TileDB +- [tiledb](https://github.com/TileDB-Inc/TileDB-CF-Py) - TileDB - [rioxarray](https://corteva.github.io/rioxarray/stable/) - GeoTIFF, JPEG-2000, ESRI-hdr, etc (via GDAL) - [xarray-sentinel](https://github.com/bopen/xarray-sentinel) - Sentinel-1 SAFE - ... diff --git a/fundamentals/02.1_indexing_Basic.ipynb b/fundamentals/02.1_indexing_Basic.ipynb new file mode 100644 index 00000000..8fc5d018 --- /dev/null +++ b/fundamentals/02.1_indexing_Basic.ipynb @@ -0,0 +1,741 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Indexing and Selecting Data\n", + "\n", + "## Learning Objectives\n", + "\n", + "- Understanding the difference between position and label-based indexing\n", + "- Select data by position using `.isel` with values or slices\n", + "- Select data by label using `.sel` with values or slices\n", + "- Use nearest-neighbor lookups with `.sel`\n", + "- Select timeseries data by date/time with values or slices\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "Xarray offers extremely flexible indexing routines that combine the best features of NumPy and Pandas for data selection.\n", + "\n", + "The most basic way to access elements of a `DataArray` object is to use Python’s `[]` syntax, such as `array[i, j]`, where `i` and `j` are both integers.\n", + "\n", + 
"As xarray objects can store coordinates corresponding to each dimension of an array, label-based indexing is also possible (e.g. `.sel(latitude=0)`, similar to `pandas.DataFrame.loc`). In label-based indexing, the element position `i` is automatically looked-up from the coordinate values.\n", + "\n", + "By leveraging the labeled dimensions and coordinates provided by Xarray, users can effortlessly access, subset, and manipulate data along multiple axes, enabling complex operations such as slicing, masking, and aggregating data based on specific criteria. \n", + "\n", + "This indexing and selection capability of Xarray not only enhances data exploration and analysis workflows but also promotes reproducibility and efficiency by providing a convenient interface for working with multi-dimensional data structures.\n", + "\n", + "## Quick Overview \n", + "\n", + "In total, xarray supports four different kinds of indexing, as described below and summarized in this table:\n", + "\n", + "| Dimension lookup | Index lookup | `DataArray` syntax | `Dataset` syntax |\n", + "| ---------------- | ------------ | ---------------------| ---------------------|\n", + "| Positional | By integer | `da[:,0]` | *not available* |\n", + "| Positional | By label | `da.loc[:,'IA']` | *not available* |\n", + "| By name | By integer | `da.isel(space=0)` or `da[dict(space=0)]` | `ds.isel(space=0)` or `ds[dict(space=0)]` |\n", + "| By name | By label | `da.sel(space='IA')` or `da.loc[dict(space='IA')]` | `ds.sel(space='IA')` or `ds.loc[dict(space='IA')]` |\n", + "\n", + "\n", + "----------\n", + "\n", + "In this tutorial, first we cover the positional indexing and label-based indexing, next we will cover more advanced techniques such as nearest neighbor lookups. 
\n", + "\n", + "First, let's import packages: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import xarray as xr\n", + "\n", + "xr.set_options(display_expand_attrs=False, display_expand_data=False);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we’ll use the air temperature tutorial dataset from the [National Centers for Environmental Prediction](https://www.weather.gov/ncep/). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.tutorial.load_dataset(\"air_temperature\")\n", + "ds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da = ds[\"air\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Position-based Indexing\n", + "\n", + "Indexing a `DataArray` directly works (mostly) just like it does for NumPy `ndarray`s, except that the returned object is always another `DataArray`:\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### NumPy Positional Indexing\n", + "\n", + "When working with NumPy, indexing is done by position (slices/ranges/scalars).\n", + "\n", + "For example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "np_array = ds[\"air\"].data # numpy array\n", + "np_array.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Indexing is 0-based in NumPy:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "np_array[1, 0, 0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similarly, we can select a range in NumPy:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extract a time-series for one spatial 
location\n", + "np_array[:, 20, 40]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jp-MarkdownHeadingCollapsed": true, + "tags": [] + }, + "source": [ + "### Positional Indexing with Xarray" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Xarray offers extremely flexible indexing routines that combine the best\n", + "features of NumPy and pandas for data selection." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "#### NumPy style indexing with Xarray\n", + "\n", + "NumPy style indexing with scalars and slices works the same way with Xarray, and it also preserves labels and metadata. \n", + "\n", + "This approach, however, does not take advantage of the dimension names and coordinate information present in an Xarray object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da[:, 20, 40]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{caution}\n", + "Positional indexing deviates from the NumPy behavior when indexing with multiple arrays. \n", + "```\n", + "We can show this with an example: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "np_array[:, [0, 1], [0, 1]].shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da[:, [0, 1], [0, 1]].shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note how the shape of the `DataArray` result differs from that of the `numpy.ndarray`: NumPy pairs the two index lists element by element, while Xarray indexes each dimension independently.\n", + "\n", + "```{tip}\n", + "However, users can still achieve NumPy-like pointwise indexing across multiple labeled dimensions by using Xarray vectorized indexing techniques. 
We will delve further into this topic in the advanced indexing notebook.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Positional Indexing Using Dimension Names" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Remembering the axis order can be challenging even with 2D arrays:\n", + "- is `np_array[0,3]` the first row and fourth column or first column and fourth row? \n", + "- or did I store these samples by row or by column when I saved the data?! \n", + "\n", + "The difficulty is compounded with added dimensions. \n", + "\n", + "Xarray objects eliminate much of the mental overhead by allowing indexing using dimension names instead of axis numbers:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.isel(lat=20, lon=40).plot();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Slicing works the same way:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.isel(time=slice(0, 20), lat=20, lon=40).plot();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{note}\n", + "Using the `isel` method, the user can choose/slice the specific elements from a Dataset or DataArray.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But what if I wanted to select data only for 2014? How would I know the indices for it? Xarray reduces this complexity by introducing label-based indexing. 
\n", + "\n", + "## Label-based Indexing\n", + "\n", + "To select data by coordinate labels instead of integer indices, we can use the same syntax, using `sel` instead of `isel`:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For example, let's select the data for one day, 2014-01-01, at Lat 25 N and Lon 210 E using `sel`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "da.sel(time=\"2014-01-01\", lat=25, lon=210).plot();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's select all data for the year 2014 at Lat 50 N and Lon 200 E:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "da.sel(lat=50.0, lon=200.0, time=\"2014\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similarly, we can select a date range by passing Python's built-in `slice` object to `sel`. Unlike positional slices, label-based slices include both endpoints: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate slicing\n", + "da.sel(time=slice(\"2014-02-14\", \"2014-12-13\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Xarray also supports pandas-style label-based indexing through `.loc`. 
To index by label this way, use the `loc` attribute:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.loc[\"2014-02-14\":\"2014-12-13\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.time" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dropping using `drop_sel`\n", + "\n", + "If instead of selecting data we want to drop it, we can use the `drop_sel` method with syntax similar to `sel`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.drop_sel(lat=50.0, lon=200.0, time=\"2014\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So far, everything above requires us to specify exact coordinate values, but what if we don't have the exact values? We can use nearest neighbor lookups to address this issue:\n", + "\n", + "## Nearest Neighbor Lookups\n", + "\n", + "The label-based selection method `sel()` supports the `method` and `tolerance` keyword arguments. 
The `method` parameter enables nearest neighbor (inexact) lookups through the methods `pad`, `backfill`, or `nearest`:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "da.sel(lat=52.25, lon=251.8998, method=\"nearest\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `tolerance` argument limits the maximum distance for valid matches with an inexact lookup:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.sel(lat=52.25, lon=251.8998, method=\"nearest\", tolerance=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{tip}\n", + "All of these indexing methods work on the dataset too!\n", + "```\n", + "\n", + "We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.sel(lat=52.25, lon=251.8998, method=\"nearest\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises\n", + "\n", + "Practice the syntax you’ve learned so far:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{exercise}\n", + ":label: indexing-1\n", + "\n", + "Select the first 30 entries of `latitude` and the 30th to 40th entries of `longitude`:\n", + "```\n", + "\n", + "````{solution} indexing-1\n", + ":class: dropdown\n", + "```python\n", + "ds.isel(lat=slice(None, 30), lon=slice(30, 40))\n", + "```\n", + "\n", + "````" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{exercise}\n", + ":label: indexing-2\n", + "\n", + "Select all data at 75 degrees north and between Jan 1, 2013 and Oct 15, 2013:\n", + "```\n", + "````{solution} indexing-2\n", + ":class: dropdown\n", + "```python\n", + "ds.sel(lat=75, 
time=slice(\"2013-01-01\", \"2013-10-15\"))\n", + "```\n", + "````" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{exercise}\n", + ":label: indexing-3\n", + "\n", + "Remove all entries at 260 and 270 degrees longitude:\n", + "\n", + "```\n", + "````{solution} indexing-3\n", + ":class: dropdown\n", + "```python\n", + "ds.drop_sel(lon=[260, 270])\n", + "```\n", + "````" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Datetime Indexing\n", + "\n", + "\n", + "Datetime indexing is a critical feature when working with time series data, which is common in many fields, including finance, economics, and environmental sciences. It allows you to select data points, or a series of data points, that match certain date or time criteria. This is essential for time-series analysis, where the date or time information associated with each data point can be as important as the data point itself.\n", + "\n", + "Let's see some of the techniques to perform datetime indexing in Xarray:\n", + "\n", + "### Selecting data based on a single datetime\n", + "\n", + "Let's say we have a Dataset `ds` and we want to select data at a particular date and time, for instance, '2013-01-01' at 6AM. We can do this by using the `sel` (select) method, like so:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.sel(time='2013-01-01 06:00')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default, datetime selection will return a range of values that match the provided string. For example, 
`time=\"2013-01-01\"` will return all timestamps for that day (4 of them here):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.sel(time='2013-01-01')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use this feature to select all points in a month" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds.sel(time=\"2014-May\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "or a year" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds.sel(time=\"2014\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Selecting data for a range of dates\n", + "\n", + "Now, let's say we want to select data between a certain range of dates. We can still use the `sel` method, but this time we will combine it with slice:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# This will return a subset of the dataset corresponding to the entire year of 2013.\n", + "ds.sel(time=slice('2013-01-01', '2013-12-31'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{note}\n", + "\n", + "The slice function takes two arguments, start and stop, to make a slice that includes these endpoints. When we use `slice` with the `sel` method, it provides an efficient way to select a range of dates. The above example shows the usage of slice for datetime indexing.\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Indexing with a DatetimeIndex or date string list\n", + "\n", + "Another technique is to use a list of datetime objects or date strings for indexing. 
For example, you could select data for specific, non-contiguous dates like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dates = ['2013-07-09', '2013-10-11', '2013-12-24']\n", + "ds.sel(time=dates)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Fancy indexing based on year, month, day, or other datetime components\n", + "\n", + "In addition to the basic datetime indexing techniques, Xarray also supports \"fancy\" indexing options, which can provide more flexibility and efficiency in your data analysis tasks. You can directly access datetime components such as year, month, day, hour, etc. using the `.dt` accessor. Here is an example of selecting all data points from July across all years:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.sel(time=ds.time.dt.month == 7)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Or, if you wanted to select data from a specific day of each month, you could use:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.sel(time=ds.time.dt.day == 15)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "In total, Xarray supports four different kinds of indexing, as described below and summarized in this table:\n", + "\n", + "\n", + "| Dimension lookup | Index lookup | `DataArray` syntax | `Dataset` syntax |\n", + "| ---------------- | ------------ | ---------------------| ---------------------|\n", + "| Positional | By integer | `da[:,0]` | *not available* |\n", + "| Positional | By label | `da.loc[:,'IA']` | *not available* |\n", + "| By name | By integer | `da.isel(space=0)` or `da[dict(space=0)]` | `ds.isel(space=0)` or `ds[dict(space=0)]` |\n", + "| By name | By label | `da.sel(space='IA')` or 
`da.loc[dict(space='IA')]` | `ds.sel(space='IA')` or `ds.loc[dict(space='IA')]` |\n", + "\n", + "\n", + "For enhanced indexing capabilities across all methods, you can utilize DataArray objects as an indexer. For more detailed information, please see the Advanced Indexing notebook.\n", + "\n", + "\n", + "## More Resources\n", + "\n", + "- [Xarray Docs - Indexing and Selecting Data](https://docs.xarray.dev/en/stable/indexing.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": true, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/fundamentals/02.1_working_with_labeled_data.ipynb b/fundamentals/02.1_working_with_labeled_data.ipynb deleted file mode 100644 index 716d8265..00000000 --- a/fundamentals/02.1_working_with_labeled_data.ipynb +++ /dev/null @@ -1,319 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "# Working with labeled data\n", - "\n", - "Learning goals:\n", - "\n", - "- Use different forms of indexing to select data based on position and\n", - " coordinates\n", - "- Select datetime ranges\n", - "\n", - "Scientific data is inherently *labeled*. For example, time series data includes timestamps that label individual periods or points in time, spatial data has coordinates (e.g. longitude, latitude, elevation), and model or laboratory experiments are often identified by unique identifiers. 
In this notebook we'll see that labeled dimensions make code much easier to understand!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd\n", - "import xarray as xr" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We'll start by comparing common indexing operations with a `numpy` array and equivalent `xarray` DataArray:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# axis0: x, axis1: y\n", - "np_array = np.arange(10).reshape(2, 5)\n", - "np_array" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "da = xr.DataArray(np_array, dims=(\"x\", \"y\"))\n", - "da" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Position-based indexing\n", - "\n", - "### Indexing\n", - "\n", - "Recall that *indexing* is selecting a value from an array based on its position" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "np_array[0, 3]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "da.isel(x=0, y=3) # or da[{\"x\": 0, \"y\": 3}]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Slicing\n", - "\n", - "And *slicing* retrieves a range of values" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "np_array[:2, 1:]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "da.isel(x=slice(None, 2), y=slice(1, None))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Label-based indexing\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Remembering the axis order 
can be challenging even with 2D arrays (is np_array[0,3] the first row and third column *or first column and third row*? or did I store these samples by row or by column when I saved the data?!). The difficulty is compounded with added dimensions. Xarray objects eliminate much of the mental overhead by adding coordinate labels:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "arr = xr.DataArray(\n", - " data=np.arange(48).reshape(4, 2, 6),\n", - " dims=(\"u\", \"v\", \"time\"),\n", - " coords={\n", - " \"u\": [-3.2, 2.1, 5.3, 6.5],\n", - " \"v\": [-1, 2.6],\n", - " \"time\": pd.date_range(\"2009-01-05\", periods=6, freq=\"M\"),\n", - " },\n", - ")\n", - "arr" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To select data by coordinate **labels** instead of *integer indices* we can use the\n", - "same syntax, using `sel` instead of `isel`:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "arr.sel(u=5.3, time=\"2009-04-30\") # or arr.loc[{\"u\": 5.3, \"time\": \"2009-04-30\"}]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "this will require us to specify exact coordinate values. 
If we don't have those, we can use the `method` parameter (see `Dataset.sel` for documentation):" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "arr.sel(u=5, time=\"2009-04-28\", method=\"nearest\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can also select multiple values:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "arr.sel(u=[-3.2, 6.5], time=slice(\"2009-02-28\", \"2009-05-31\"))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If instead of selecting data we want to drop it, we can use `drop_sel`:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "arr.drop_sel(u=[-3.2, 6.5])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercises\n", - "\n", - "Practice the syntax you've learned with the xarray tutorial dataset! " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds = xr.tutorial.open_dataset(\"air_temperature\")\n", - "ds" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "1. Select the first 30 entries of latitude and 20th to 40th entries of longitude\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "tags": [ - "hide-output" - ] - }, - "outputs": [], - "source": [ - "ds.isel(lat=slice(None, 30), lon=slice(20, 40))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "2. 
Select all data at 75 degree north and between Jan 1, 2013 and Oct 15, 2013\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "tags": [ - "hide-output" - ] - }, - "outputs": [], - "source": [ - "ds.sel(lat=75, time=slice(\"2013-01-01\", \"2013-10-15\"))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "3. Remove all entries at 260 and 270 degrees" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "tags": [ - "hide-output" - ] - }, - "outputs": [], - "source": [ - "ds.drop_sel(lon=[260, 270])" - ] - } - ], - "metadata": { - "interpreter": { - "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/intermediate/01-high-level-computation-patterns.ipynb b/intermediate/01-high-level-computation-patterns.ipynb index ba429693..72aaa688 100644 --- a/intermediate/01-high-level-computation-patterns.ipynb +++ b/intermediate/01-high-level-computation-patterns.ipynb @@ -83,7 +83,7 @@ "\n", "\n", "\n", - "```{Note}\n", + "```{note}\n", "the documentation links in this tutorial point to the DataArray implementations of each function, but they are also available for DataSet objects.\n", "```\n" ] diff --git a/intermediate/indexing/advanced-indexing.ipynb b/intermediate/indexing/advanced-indexing.ipynb new file mode 100644 index 00000000..0f57cae1 --- /dev/null +++ b/intermediate/indexing/advanced-indexing.ipynb @@ -0,0 +1,277 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Advanced Indexing\n", + "\n", + "## Learning Objectives\n", + "\n", + "* Orthogonal 
vs. Vectorized and Pointwise Indexing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "In the previous notebooks, we learned basic forms of indexing with Xarray (positional and name-based dimensions, integer and label-based indexing), datetime indexing, and nearest neighbor lookups. In this tutorial, we will learn how Xarray indexing is different from NumPy and how to do vectorized/pointwise indexing using Xarray. \n", + "First, let's import the packages needed for this notebook: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import xarray as xr\n", + "\n", + "\n", + "xr.set_options(display_expand_attrs=False)\n", + "np.set_printoptions(threshold=10, edgeitems=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook, we’ll use the air temperature tutorial dataset from the National Centers for Environmental Prediction. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.tutorial.load_dataset(\"air_temperature\")\n", + "da = ds.air\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Orthogonal Indexing \n", + "\n", + "As we learned in the previous tutorial, positional indexing deviates from the behavior exhibited by NumPy when indexing with multiple arrays. However, Xarray pointwise indexing supports indexing along multiple labeled dimensions using list-like objects, similar to NumPy indexing behavior.\n", + "\n", + "If you only provide integers, slices, or unlabeled arrays (arrays without dimension names, such as `np.ndarray` or `list`, but not `DataArray()`), indexing can be understood as orthogonal (i.e. along independent axes, instead of using NumPy’s broadcasting rules to vectorize indexers). 
\n", + "\n", + "*Orthogonal* or *outer* indexing considers one-dimensional arrays in the same way as slices when deciding the output shapes. The principle of outer or orthogonal indexing is that the result mirrors the effect of independently indexing along each dimension with integer or boolean arrays, treating both the indexed and indexing arrays as one-dimensional. This method of indexing is analogous to vector indexing in programming languages like MATLAB, Fortran, and R, where each indexer component *independently* selects along its corresponding dimension. \n", + "\n", + "For example : " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.isel(time=0, lat=[2, 4, 10, 13], lon=[1, 6, 7]).plot(); # -- orthogonal indexing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For more flexibility, you can supply `DataArray()` objects as indexers. Dimensions on resultant arrays are given by the ordered union of the indexers’ dimensions:\n", + "\n", + "For example, in the example below we do orthogonal indexing using `DataArray()` objects. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "target_lat = xr.DataArray([31, 41, 42, 42], dims=\"degrees_north\")\n", + "target_lon = xr.DataArray([200, 201, 202, 205], dims=\"degrees_east\")\n", + "\n", + "da.sel(lat=target_lat, lon=target_lon, method=\"nearest\") # -- orthogonal indexing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the above example, you can see how the output shape is `time` x `lats` x `lons`. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "But what if we would like to find the information from the nearest grid cell to a collection of specified points (for example, weather stations or tower data)?\n", + "\n", + "## Vectorized or Pointwise Indexing\n", + "\n", + "Like NumPy and pandas, Xarray supports indexing many array elements at once in a\n", + "*vectorized* manner. \n", + "\n", + "**Vectorized indexing** or **Pointwise Indexing** using `DataArrays()` can be used to extract information from the nearest grid cells of interest, for example, the nearest climate model grid cells to a collection of specified weather station latitudes and longitudes.\n", + "\n", + "```{hint}\n", + "To trigger vectorized indexing behavior, you will need to provide the selection dimensions with a new shared output dimension name. \n", + "```\n", + "\n", + "In the example below, the selections of the closest latitude and longitude are renamed to an output dimension named `points`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define target latitude and longitude (where weather stations might be)\n", + "lat_points = xr.DataArray([31, 41, 42, 42], dims=\"points\")\n", + "lon_points = xr.DataArray([200, 201, 202, 205], dims=\"points\")\n", + "lat_points" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lon_points" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, retrieve data at the grid cells nearest to the target latitudes and longitudes (weather stations):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.sel(lat=lat_points, lon=lon_points, method=\"nearest\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "👆 Please notice how the shape of our `DataArray` is `time` x `points`, 
extracting time series for each weather station. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.sel(lat=lat_points, lon=lon_points, method=\"nearest\").dims" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{attention}\n", + "Please note that slices and sequences/arrays without named dimensions are treated as if they had the same dimension as the one they index.\n", + "```\n", + "\n", + "For example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da.sel(lat=[20, 30, 40], lon=lon_points, method=\"nearest\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{warning}\n", + "If an indexer is a `DataArray()`, its coordinates should not conflict with the selected subpart of the target array (except for the explicitly indexed dimensions with `.loc`/`.sel`). Otherwise, `IndexError` will be raised!\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Analogously, label-based pointwise indexing is also possible with the `.sel()` method:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da = xr.DataArray(\n", + " np.random.rand(4, 3),\n", + " [\n", + " (\"time\", pd.date_range(\"2000-01-01\", periods=4)),\n", + " (\"space\", [\"IA\", \"IL\", \"IN\"]),\n", + " ],\n", + ")\n", + "times = xr.DataArray(pd.to_datetime([\"2000-01-03\", \"2000-01-02\", \"2000-01-01\"]), dims=\"new_time\")\n", + "\n", + "\n", + "# -- get data for each (state, time) pair along the shared `new_time` dimension:\n", + "da.sel(space=xr.DataArray([\"IA\", \"IL\", \"IN\"], dims=[\"new_time\"]), time=times)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Additional Resources\n", + "\n", + "- [Xarray Docs - Indexing and Selecting 
Data](https://docs.xarray.dev/en/stable/indexing.html)\n" + ] + } + ], + "metadata": { + 
"language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": true, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/intermediate/indexing/boolean-masking-indexing.ipynb b/intermediate/indexing/boolean-masking-indexing.ipynb new file mode 100644 index 00000000..821dba20 --- /dev/null +++ b/intermediate/indexing/boolean-masking-indexing.ipynb @@ -0,0 +1,501 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Boolean Indexing & Masking\n", + "\n", + "## Learning Objectives\n", + "\n", + "* The concept of boolean masks\n", + "* Dropping/Masking data using `where`\n", + "* Using `isin` for creating a boolean mask" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "*Boolean masking*, known as *boolean indexing*, is a functionality in Python that enables the filtering of values based on a specific condition.\n", + "\n", + "A boolean mask refers to a binary array or a boolean-valued (`True`/`False`) array that is used as a *filter* to select specific elements from another array. The boolean mask acts as a criterion or condition, where each element in the mask corresponds to an element in the target array. An element in the target array is selected when the corresponding `mask` value is `True`. \n", + "\n", + "Xarray provides different capabilities to allow filtering and boolean indexing. 
In this notebook, we will learn more about them.\n", + "\n", + "First, let's import the packages needed for this notebook: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import cartopy.crs as ccrs\n", + "import numpy as np\n", + "import xarray as xr\n", + "from matplotlib import pyplot as plt\n", + "import matplotlib as mpl\n", + "\n", + "xr.set_options(display_expand_attrs=False)\n", + "np.set_printoptions(threshold=10, edgeitems=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this tutorial, we’ll use the Regional Arctic System Model (RASM) example dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.tutorial.load_dataset(\"rasm\").isel(time=0)\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this dataset, the logical coordinates are `x` and `y`, while the physical coordinates are `xc` and `yc`, which represent the longitudes and latitudes of the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(ds.xc.attrs)\n", + "print(ds.yc.attrs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da = ds.Tair\n", + "da" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Masking with `where()`\n", + "\n", + "Indexing methods on Xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. \n", + "\n", + "By applying `.where()`, the original data's shape is maintained, with values masked based on a Boolean condition. 
Values that satisfy the condition (`True`) are returned unchanged, while values that do not meet the condition (`False`) are replaced with a fill value (`nan` unless another value is given).\n", + "\n", + "In the example below, we replace all `nan` values with `-9999`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Let's replace the missing values (nan) with some placeholder\n", + "ds.Tair.where(ds.Tair.notnull(), -9999)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, in the example above `.where()` preserved the **shape** of the original data by masking the values with a boolean condition. \n", + "\n", + "Most uses of `.where()` check whether or not specific data values are less than or greater than a constant value. \n", + "\n", + "The data values specified in the boolean condition of `.where()` can be any of the following:\n", + "\n", + "* a `DataArray`\n", + "* a `Dataset`\n", + "* a function\n", + "\n", + "In the following example, we make use of `.where()` to mask all temperatures below 0°C.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da_masked = da.where(da >= 0)\n", + "\n", + "# -- making both plots for comparison:\n", + "fig, axes = plt.subplots(ncols=2, figsize=(15, 5))\n", + "\n", + "# -- for reference (without masking):\n", + "da.plot(ax=axes[0], vmin=-30, vmax=30, cmap=mpl.cm.RdBu_r)\n", + "\n", + "# -- masked DataArray\n", + "da_masked.plot(ax=axes[1], vmin=-30, vmax=30, cmap=mpl.cm.RdBu_r);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{tip}\n", + "By default, Xarray sets the masked values to `nan`. But as we saw in the first example, we can set them to other values too. 
\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{exercise}\n", + ":label: boolean-2\n", + "\n", + "Using the syntax you’ve learned so far, mask all the points with latitudes below 60° N.\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# write your answer here!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "````{solution} boolean-2\n", + ":class: dropdown\n", + "```python\n", + "da_masked = da.where(da.yc >= 60)\n", + "da_masked.plot();\n", + "```\n", + "````" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As mentioned above, by default `where` maintains the original size of the data. You can use the option `drop=True` to clip coordinate elements that are fully masked:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da_masked = da.where(da.yc > 60, drop=True)\n", + "da_masked.plot();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Please note that in this dataset, the variables `xc` (longitude) and `yc` (latitude) are two-dimensional scalar fields.\n", + "\n", + "When we plot the data variable `Tair`, by default we get the logical coordinates (i.e. `x` and `y`) on the axes, as shown in the example above. \n", + "\n", + "In order to visualize the data on a conventional latitude-longitude grid, we can take advantage of Xarray’s ability to apply `cartopy` map projections." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure(figsize=(14, 6))\n", + "ax = plt.axes(projection=ccrs.PlateCarree())\n", + "ax.set_global()\n", + "ds.Tair.plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x=\"xc\", y=\"yc\", add_colorbar=False)\n", + "ax.coastlines()\n", + "ax.set_ylim([20, 90]);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using `where` with Multiple Conditions\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Boolean conditions passed to `.where()` can be combined with the element-wise operators `&` (and) and `|` (or). Wrap each comparison in parentheses, because these operators bind more tightly than comparison operators. This allows for specifying multiple masking conditions within a single `.where()` statement.\n", + "\n", + "We can select data for one specific region using a bounding box. For example, here we want to access data for a region over Alaska:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# -- define a region\n", + "min_lon = 190\n", + "min_lat = 55\n", + "max_lon = 230\n", + "max_lat = 85" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First we have to create our boolean masks:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mask_lon = (ds.xc >= min_lon) & (ds.xc <= max_lon)\n", + "mask_lat = (ds.yc >= min_lat) & (ds.yc <= max_lat)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we can use the boolean masks to filter the data for that region: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da_masked = da.where(mask_lon & mask_lat, drop=True)\n", + "\n", + "da_masked.plot();" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": {}, 
+ "outputs": [], + "source": [ + "plt.figure(figsize=(5, 5))\n", + "ax = plt.axes(projection=ccrs.PlateCarree())\n", + "ax.set_global()\n", + "da_masked.plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x=\"xc\", y=\"yc\", add_colorbar=False)\n", + "ax.coastlines()\n", + "ax.set_ylim([50, 80])\n", + "ax.set_xlim([-180, -120]);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "If we load air temperature dataset from NCEP, we could use `sel` method for selecting a region:\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "````{exercise}\n", + ":label: boolean-1\n", + "\n", + "If we load air temperature dataset from NCEP, we could use `sel` method for selecting a region:\n", + "\n", + "```python\n", + "ds = xr.tutorial.open_dataset(\"air_temperature\")\n", + "ds_region = ds.sel(lat=slice(75,50), lon=slice(250,300))\n", + "\n", + "ds_region.air.plot();\n", + "```\n", + "Can you use a similar method as above using `sel` to crop a region using the RASM dataset? Why?\n", + "\n", + "````\n", + "\n", + "````{solution} boolean-1\n", + ":class: dropdown\n", + "This method will not work here as the dimensions are different from coordinates here. Specifically, the variables xc (longitude) and yc (latitude) are two-dimensional scalar fields, which differ from the logical coordinates represented by x and y.\n", + "\n", + "So the code below will not give the correct answer!\n", + "```python\n", + "cropped_ds = ds.sel(x=slice(min_lat,max_lat), y=slice(min_lon,max_lon))\n", + "cropped_ds.Tair.plot()\n", + "```\n", + "````\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using `xr.where` with a Function\n", + "\n", + "We can use `xr.where` with a function as a condition too. 
For example, here we convert the temperature to Kelvin and keep only the values greater than 280 K, replacing all others with 0:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define a function to use as a condition\n", + "def is_greater_than_threshold(x, threshold=300):\n", + " # function to convert temp to K\n", + " # and compare with threshold\n", + " x = x + 273.15\n", + " return x > threshold\n", + "\n", + "\n", + "# Apply the condition using xarray.where()\n", + "masked_data = xr.where(is_greater_than_threshold(da, 280), da, 0)\n", + "\n", + "masked_data.plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Selecting Values with `isin`\n", + "\n", + "To check whether elements of an Xarray object are equal to a single value, you can compare with the equality operator `==` (e.g., `arr == 3`). \n", + "\n", + "To check multiple values, we use `isin()`:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is a simple example: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x_da = xr.DataArray([1, 2, 3, 4, 5], dims=[\"x\"])\n", + "\n", + "# -- flag points with values equal to 2 or 4:\n", + "x_da.isin([2, 4])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{tip}\n", + "`isin()` works particularly well with `where()` to support indexing by arrays that are not already labels of an array. \n", + "```\n", + "\n", + "For example, we have another `DataArray` that displays the status flags of the data-collecting device for our data. \n", + "\n", + "Here, flags with values 0 and -1 signify a malfunction, implying that the resulting data may not be accurate, while positive values indicate that the device was functioning correctly."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "flags = xr.DataArray(np.random.randint(-1, 5, da.shape), dims=da.dims, coords=da.coords)\n", + "flags" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we want to only see the data for points where our measurement device is working correctly: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "da_masked = da.where(flags.isin([1, 2, 3, 4, 5]), drop=True)\n", + "da_masked.plot();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{warning}\n", + "Please note that when done repeatedly, this type of indexing is significantly slower than using `sel()`. \n", + "\n", + "Use `sel` instead of `where` as much as possible.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Additional Resources\n", + "\n", + "- [Xarray Docs - Indexing and Selecting Data](https://docs.xarray.dev/en/stable/indexing.html)\n" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": true, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/intermediate/indexing/indexing.md b/intermediate/indexing/indexing.md new file mode 100644 index 00000000..72dae2a4 --- /dev/null +++ b/intermediate/indexing/indexing.md @@ -0,0 +1,5 @@ +# Indexing + +```{tableofcontents} + +``` diff --git a/overview/fundamental-path/README.md 
b/overview/fundamental-path/README.md index 7c768049..2dd373f0 100644 --- a/overview/fundamental-path/README.md +++ b/overview/fundamental-path/README.md @@ -31,9 +31,7 @@ _Below are links to sections of this website that are part of this journey_: ``` ```{dropdown} Working with Labeled Data -{doc}`../../fundamentals/02.1_working_with_labeled_data` - - +{doc}`../../fundamentals/02.1_indexing_Basic` ``` ```{dropdown} Computation diff --git a/workshops/scipy2023/README.md b/workshops/scipy2023/README.md index 1ff7b87c..aa967eef 100644 --- a/workshops/scipy2023/README.md +++ b/workshops/scipy2023/README.md @@ -59,7 +59,11 @@ Once your codespace is launched, the following happens: ``` ```{dropdown} Indexing +-{doc}`../../fundamentals/02.1_indexing_Basic` +-{doc}`../../intermediate/indexing/boolean-masking-indexing` + +-{doc}`../../intermediate/indexing/advanced-indexing` ``` ```{dropdown} Computational Patterns