diff --git a/fundamentals/01_data_structures.md b/fundamentals/01_data_structures.md index 5add1dab..04b1b907 100644 --- a/fundamentals/01_data_structures.md +++ b/fundamentals/01_data_structures.md @@ -1,5 +1,69 @@ # Data Structures +Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”) +are an essential part of computational science. They are encountered in a wide +range of fields, including physics, astronomy, geoscience, bioinformatics, +engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/) +provides the fundamental data structure and API for working with raw ND arrays. +However, real-world datasets are usually more than just raw numbers; they have +labels which encode information about how the array values map to locations in +space, time, etc. + +The N-dimensional nature of xarray’s data structures makes it suitable for +dealing with multi-dimensional scientific data, and its use of dimension names +instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much +more manageable than the raw numpy ndarray: with xarray, you don’t need to keep +track of the order of an array’s dimensions or insert dummy dimensions of size 1 +to align arrays (e.g., using np.newaxis). + +The immediate payoff of using xarray is that you’ll write less code. The +long-term payoff is that you’ll understand what you were thinking when you come +back to look at it weeks or months later. + +## Example: Weather forecast + +Here is an example of how we might structure a dataset for a weather forecast: + + + +You'll notice multiple data variables (temperature, precipitation), coordinate +variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these +fit into Xarray's data structures below. + +Xarray doesn’t just keep track of labels on arrays – it uses them to provide a +powerful and concise interface. For example: + +- Apply operations over dimensions by name: `x.sum('time')`. 
+ +- Select values by label (or logical location) instead of integer location: + `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`. + +- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions + (array broadcasting) based on dimension names, not shape. + +- Easily use the split-apply-combine paradigm with groupby: + `x.groupby('time.dayofyear').mean()`. + +- Database-like alignment based on coordinate labels that smoothly handles + missing values: `x, y = xr.align(x, y, join='outer')`. + +- Keep track of arbitrary metadata in the form of a Python dictionary: + `x.attrs`. + +## Example: Mosquito genetics + +Although the Xarray library was originally developed with Earth Science datasets in mind, the data structures work well across many other domains! For example, below is a side-by-side view of a data schematic on the left and the Xarray Dataset representation on the right taken from a mosquito genetics analysis: + + + +The data can be stored as a 3-dimensional array, where one dimension of the array corresponds to positions (**variants**) within a reference genome, another dimension corresponds to the individual mosquitoes that were sequenced (**samples**), and a third dimension corresponds to the number of genomes within each individual (**ploidy**). + +You can explore this dataset in detail via the [training course in data analysis for genomic surveillance of African malaria vectors](https://anopheles-genomic-surveillance.github.io/workshop-5/module-1-xarray.html)! + +## Explore on your own + +The following collection of notebooks provides interactive code examples for working with example datasets and constructing Xarray data structures manually. 
+ ```{tableofcontents} ``` diff --git a/fundamentals/01_datastructures.ipynb b/fundamentals/01_datastructures.ipynb index 655a1795..e3131c77 100644 --- a/fundamentals/01_datastructures.ipynb +++ b/fundamentals/01_datastructures.ipynb @@ -9,59 +9,12 @@ "In this lesson, we cover the basics of Xarray data structures. Our\n", "learning goals are as follows. By the end of the lesson, we will be able to:\n", "\n", + ":::{admonition} Learning Goals\n", "- Understand the basic data structures (`DataArray` and `Dataset` objects) in Xarray\n", - "\n", - "---\n", - "\n", - "## Introduction\n", - "\n", - "Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)\n", - "are an essential part of computational science. They are encountered in a wide\n", - "range of fields, including physics, astronomy, geoscience, bioinformatics,\n", - "engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)\n", - "provides the fundamental data structure and API for working with raw ND arrays.\n", - "However, real-world datasets are usually more than just raw numbers; they have\n", - "labels which encode information about how the array values map to locations in\n", - "space, time, etc.\n", - "\n", - "Here is an example of how we might structure a dataset for a weather forecast:\n", - "\n", - "\n", - "\n", - "You'll notice multiple data variables (temperature, precipitation), coordinate\n", - "variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these\n", - "fit into Xarray's data structures below.\n", - "\n", - "Xarray doesn’t just keep track of labels on arrays – it uses them to provide a\n", - "powerful and concise interface. 
For example:\n", - "\n", - "- Apply operations over dimensions by name: `x.sum('time')`.\n", - "\n", - "- Select values by label (or logical location) instead of integer location:\n", - " `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.\n", - "\n", - "- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions\n", - " (array broadcasting) based on dimension names, not shape.\n", - "\n", - "- Easily use the split-apply-combine paradigm with groupby:\n", - " `x.groupby('time.dayofyear').mean()`.\n", - "\n", - "- Database-like alignment based on coordinate labels that smoothly handles\n", - " missing values: `x, y = xr.align(x, y, join='outer')`.\n", - "\n", - "- Keep track of arbitrary metadata in the form of a Python dictionary:\n", - " `x.attrs`.\n", - "\n", - "The N-dimensional nature of xarray’s data structures makes it suitable for\n", - "dealing with multi-dimensional scientific data, and its use of dimension names\n", - "instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much\n", - "more manageable than the raw numpy ndarray: with xarray, you don’t need to keep\n", - "track of the order of an array’s dimensions or insert dummy dimensions of size 1\n", - "to align arrays (e.g., using np.newaxis).\n", - "\n", - "The immediate payoff of using xarray is that you’ll write less code. The\n", - "long-term payoff is that you’ll understand what you were thinking when you come\n", - "back to look at it weeks or months later.\n" + "- Customize the display of Xarray objects\n", + "- Access variables, coordinates, and arbitrary metadata\n", + "- Transform to tabular Pandas data structures\n", + ":::" ] }, { @@ -72,13 +25,10 @@ "\n", "Xarray provides two data structures: the `DataArray` and `Dataset`. 
The\n", "`DataArray` class attaches dimension names, coordinates and attributes to\n", - "multi-dimensional arrays while `Dataset` combines multiple arrays.\n", + "multi-dimensional arrays while `Dataset` combines multiple DataArrays.\n", "\n", "Both classes are most commonly created by reading data.\n", - "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial.\n", - "\n", - "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n", - "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name." + "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial." 
] }, { @@ -88,7 +38,13 @@ "outputs": [], "source": [ "import numpy as np\n", - "import xarray as xr" + "import xarray as xr\n", + "import pandas as pd\n", + "\n", + "# When working in a Jupyter Notebook you might want to customize Xarray display settings to your liking\n", + "# The following settings reduce the amount of data displayed out by default\n", + "xr.set_options(display_expand_attrs=False, display_expand_data=False)\n", + "np.set_printoptions(threshold=10, edgeitems=2)" ] }, { @@ -97,7 +53,10 @@ "source": [ "### Dataset\n", "\n", - "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n" + "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n", + "\n", + "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n", + "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name." ] }, { @@ -147,14 +106,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### What is all this anyway? (String representations)\n", + "#### HTML vs text representations\n", "\n", "Xarray has two representation types: `\"html\"` (which is only available in\n", "notebooks) and `\"text\"`. To choose between them, use the `display_style` option.\n", "\n", "So far, our notebook has automatically displayed the `\"html\"` representation (which we will continue using).\n", - "The `\"html\"` representation is interactive, allowing you to collapse sections (left arrows) and\n", - "view attributes and values for each value (right hand sheet icon and data symbol)." 
+ "The `\"html\"` representation is interactive, allowing you to collapse sections (▶) and\n", + "view attributes and values for each value (📄 and ≡)." ] }, { @@ -180,7 +139,7 @@ "- an unordered list of *coordinates* or dimensions with coordinates with one item\n", " per line. Each item has a name, one or more dimensions in parentheses, a dtype\n", " and a preview of the values. Also, if it is a dimension coordinate, it will be\n", - " marked with a `*`.\n", + " printed in **bold** font.\n", "- an alphabetically sorted list of *dimensions without coordinates* (if there are any)\n", "- an unordered list of *attributes*, or metadata" ] @@ -379,15 +338,6 @@ "methods on `xarray` objects:\n" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd" - ] - }, { "cell_type": "code", "execution_count": null, @@ -429,8 +379,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**to_series**: This will always convert `DataArray` objects to\n", - "`pandas.Series`, using a `MultiIndex` for higher dimensions\n" + "### to_series\n", + "This will always convert `DataArray` objects to `pandas.Series`, using a `MultiIndex` for higher dimensions\n" ] }, { @@ -446,9 +396,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**to_dataframe**: This will always convert `DataArray` or `Dataset`\n", - "objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named\n", - "for this.\n" + "### to_dataframe\n", + "\n", + "This will always convert `DataArray` or `Dataset` objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named for this. Since columns in a `DataFrame` need to have the same index, they are\n", + "broadcasted." 
] }, { @@ -459,23 +410,6 @@ "source": [ "ds.air.to_dataframe()" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Since columns in a `DataFrame` need to have the same index, they are\n", - "broadcasted.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds.to_dataframe()" - ] } ], "metadata": { diff --git a/workshops/scipy2024/index.ipynb b/workshops/scipy2024/index.ipynb index 14e7172d..de817e0b 100644 --- a/workshops/scipy2024/index.ipynb +++ b/workshops/scipy2024/index.ipynb @@ -20,7 +20,7 @@ ":::{admonition} Learning Goals\n", "- Orient yourself to Xarray resources to continue on your Xarray journey!\n", "- Effectively use Xarray’s multidimensional indexing and computational patterns\n", - "- Understand how Xarray can wrap other array types in the scientific Python ecosystem\n", + "- Understand how Xarray integrates with other libraries in the scientific Python ecosystem\n", "- Learn how to leverage Xarray’s powerful backend and extension capabilities to customize workflows and open a variety of scientific datasets\n", ":::\n", "\n", @@ -35,7 +35,7 @@ "| Introduction and Setup | 1:30 (10 min) | --- | \n", "| The Xarray Data Model | 1:40 (40 min) | [Data structures](../../fundamentals/01_datastructures.ipynb)
[Basic Indexing](../../fundamentals/02.1_indexing_Basic.ipynb) | \n", "| *10 minute Break* \n", - "| Indexing & Computational Patterns | 2:30 (50 min) | [Advanced Indexing](../../intermediate/indexing/indexing.md)
[Computation Patterns](../../intermediate/01-high-level-computation-patterns.ipynb)
| \n", + "| Indexing & Computational Patterns | 2:30 (50 min) | [Advanced Indexing](../../intermediate/indexing/indexing.md)
[Computational Patterns](../../intermediate/01-high-level-computation-patterns.ipynb)
| \n", "| *10 minute Break* | \n", "| Xarray Integrations and Extensions | 3:30 (50 min) | [The Xarray Ecosystem](../../intermediate/xarray_ecosystem.ipynb) | \n", "| *10 minute Break* | \n", @@ -81,6 +81,14 @@ "- Max Jones (CarbonPlan)\n", "- Wietze Suijker (Space Intelligence)" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": {