diff --git a/fundamentals/01_data_structures.md b/fundamentals/01_data_structures.md
index 5add1dab..04b1b907 100644
--- a/fundamentals/01_data_structures.md
+++ b/fundamentals/01_data_structures.md
@@ -1,5 +1,69 @@
# Data Structures
+Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)
+are an essential part of computational science. They are encountered in a wide
+range of fields, including physics, astronomy, geoscience, bioinformatics,
+engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)
+provides the fundamental data structure and API for working with raw ND arrays.
+However, real-world datasets are usually more than just raw numbers; they have
+labels which encode information about how the array values map to locations in
+space, time, etc.
+
+The N-dimensional nature of Xarray’s data structures makes it suitable for
+dealing with multi-dimensional scientific data, and its use of dimension names
+instead of axis numbers (`dim='time'` instead of `axis=0`) makes such arrays much
+more manageable than the raw NumPy ndarray: with Xarray, you don’t need to keep
+track of the order of an array’s dimensions or insert dummy dimensions of size 1
+to align arrays (e.g., using `np.newaxis`).
+
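As a rough illustration of the point above, the sketch below subtracts a per-time offset from a small 2D array, first with plain NumPy and then with Xarray. The array values and the dimension names `time` and `space` are made up for this example.

```python
import numpy as np
import xarray as xr

# Hypothetical data: a (time, space) field and one offset per time step.
values = np.arange(6.0).reshape(2, 3)
offset = np.array([10.0, 20.0])

# NumPy: we have to remember that time is axis 0 and insert a dummy axis by hand.
np_result = values - offset[:, np.newaxis]
np_mean = values.mean(axis=0)

# Xarray: dimensions are matched by name, so no dummy axis is needed.
da = xr.DataArray(values, dims=("time", "space"))
da_offset = xr.DataArray(offset, dims="time")
xr_result = da - da_offset
xr_mean = da.mean(dim="time")
```

Both versions compute the same numbers; the Xarray version simply does not depend on knowing which axis is which.
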
+The immediate payoff of using Xarray is that you’ll write less code. The
+long-term payoff is that you’ll understand what you were thinking when you come
+back to look at that code weeks or months later.
+
+## Example: Weather forecast
+
+Here is an example of how we might structure a dataset for a weather forecast:
+
+
+
+You'll notice multiple data variables (temperature, precipitation), coordinate
+variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these
+fit into Xarray's data structures below.
+
+Xarray doesn’t just keep track of labels on arrays – it uses them to provide a
+powerful and concise interface. For example:
+
+- Apply operations over dimensions by name: `x.sum('time')`.
+
+- Select values by label (or logical location) instead of integer location:
+ `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.
+
+- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions
+ (array broadcasting) based on dimension names, not shape.
+
+- Easily use the split-apply-combine paradigm with groupby:
+ `x.groupby('time.dayofyear').mean()`.
+
+- Database-like alignment based on coordinate labels that smoothly handles
+ missing values: `x, y = xr.align(x, y, join='outer')`.
+
+- Keep track of arbitrary metadata in the form of a Python dictionary:
+ `x.attrs`.
+
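The operations listed above can be tried on a small synthetic array. This is a minimal sketch: the arrays `x` and `y`, their values, and the coordinate labels are invented for illustration and are not part of the tutorial datasets.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical labeled array: 4 daily time steps at 2 made-up locations.
time = pd.date_range("2014-01-01", periods=4)
x = xr.DataArray(
    np.arange(8.0).reshape(4, 2),
    dims=("time", "space"),
    coords={"time": time, "space": ["a", "b"]},
    attrs={"units": "degC"},
)
# A second array that only partly overlaps with x along "space".
y = xr.DataArray([1.0, 2.0], dims="space", coords={"space": ["b", "c"]})

# Apply an operation over a dimension by name.
total = x.sum("time")

# Select by label instead of integer position.
first_day = x.sel(time="2014-01-01")
also_first_day = x.loc["2014-01-01"]

# Arithmetic broadcasts by dimension name, not by shape.
anomaly = x - x.mean("space")

# Split-apply-combine with groupby.
daily_mean = x.groupby("time.dayofyear").mean()

# Outer alignment on coordinate labels; missing entries become NaN.
x_aligned, y_aligned = xr.align(x, y, join="outer")

# Arbitrary metadata lives in a plain dictionary.
units = x.attrs["units"]
```

After the outer alignment, both arrays share the `space` labels `a`, `b`, and `c`, with NaN wherever one of them had no value.
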
+## Example: Mosquito genetics
+
+Although the Xarray library was originally developed with Earth Science datasets in mind, its data structures work well across many other domains! For example, below is a side-by-side view of a data schematic on the left and the corresponding Xarray Dataset representation on the right, taken from a mosquito genetics analysis:
+
+
+
+The data can be stored as a 3-dimensional array, where one dimension of the array corresponds to positions (**variants**) within a reference genome, another dimension corresponds to the individual mosquitoes that were sequenced (**samples**), and a third dimension corresponds to the number of genomes within each individual (**ploidy**).
+
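To make the 3-dimensional layout concrete, here is a minimal sketch that builds a toy array with the same three dimension names. The random allele calls, the array sizes, and the variable name `call_genotype` are assumptions made for this example and are not taken from the real dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical toy genotype calls: 4 variants x 3 samples x 2 genome copies.
# 0 = reference allele, 1 = alternate allele (random values, not real data).
rng = np.random.default_rng(0)
calls = rng.integers(0, 2, size=(4, 3, 2), dtype="int8")

genotypes = xr.DataArray(
    calls,
    dims=("variants", "samples", "ploidy"),
    name="call_genotype",  # illustrative name only
)

# Number of alternate alleles carried by each sample at each variant.
alt_count = genotypes.sum("ploidy")

# Alternate allele frequency per variant, across all samples and genome copies.
alt_freq = genotypes.mean(dim=["samples", "ploidy"])
```

Because the dimensions are named, reductions such as "sum over ploidy" read the same way regardless of how the axes are ordered in memory.
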
+You can explore this dataset in detail via the [training course in data analysis for genomic surveillance of African malaria vectors](https://anopheles-genomic-surveillance.github.io/workshop-5/module-1-xarray.html)!
+
+## Explore on your own
+
+The following collection of notebooks provides interactive code examples for working with example datasets and constructing Xarray data structures manually.
+
```{tableofcontents}
```
diff --git a/fundamentals/01_datastructures.ipynb b/fundamentals/01_datastructures.ipynb
index 655a1795..e3131c77 100644
--- a/fundamentals/01_datastructures.ipynb
+++ b/fundamentals/01_datastructures.ipynb
@@ -9,59 +9,12 @@
"In this lesson, we cover the basics of Xarray data structures. Our\n",
"learning goals are as follows. By the end of the lesson, we will be able to:\n",
"\n",
+ ":::{admonition} Learning Goals\n",
"- Understand the basic data structures (`DataArray` and `Dataset` objects) in Xarray\n",
- "\n",
- "---\n",
- "\n",
- "## Introduction\n",
- "\n",
- "Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)\n",
- "are an essential part of computational science. They are encountered in a wide\n",
- "range of fields, including physics, astronomy, geoscience, bioinformatics,\n",
- "engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)\n",
- "provides the fundamental data structure and API for working with raw ND arrays.\n",
- "However, real-world datasets are usually more than just raw numbers; they have\n",
- "labels which encode information about how the array values map to locations in\n",
- "space, time, etc.\n",
- "\n",
- "Here is an example of how we might structure a dataset for a weather forecast:\n",
- "\n",
- "\n",
- "\n",
- "You'll notice multiple data variables (temperature, precipitation), coordinate\n",
- "variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these\n",
- "fit into Xarray's data structures below.\n",
- "\n",
- "Xarray doesn’t just keep track of labels on arrays – it uses them to provide a\n",
- "powerful and concise interface. For example:\n",
- "\n",
- "- Apply operations over dimensions by name: `x.sum('time')`.\n",
- "\n",
- "- Select values by label (or logical location) instead of integer location:\n",
- " `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.\n",
- "\n",
- "- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions\n",
- " (array broadcasting) based on dimension names, not shape.\n",
- "\n",
- "- Easily use the split-apply-combine paradigm with groupby:\n",
- " `x.groupby('time.dayofyear').mean()`.\n",
- "\n",
- "- Database-like alignment based on coordinate labels that smoothly handles\n",
- " missing values: `x, y = xr.align(x, y, join='outer')`.\n",
- "\n",
- "- Keep track of arbitrary metadata in the form of a Python dictionary:\n",
- " `x.attrs`.\n",
- "\n",
- "The N-dimensional nature of xarray’s data structures makes it suitable for\n",
- "dealing with multi-dimensional scientific data, and its use of dimension names\n",
- "instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much\n",
- "more manageable than the raw numpy ndarray: with xarray, you don’t need to keep\n",
- "track of the order of an array’s dimensions or insert dummy dimensions of size 1\n",
- "to align arrays (e.g., using np.newaxis).\n",
- "\n",
- "The immediate payoff of using xarray is that you’ll write less code. The\n",
- "long-term payoff is that you’ll understand what you were thinking when you come\n",
- "back to look at it weeks or months later.\n"
+ "- Customize the display of Xarray objects\n",
+ "- Access variables, coordinates, and arbitrary metadata\n",
+ "- Transform to tabular Pandas data structures\n",
+ ":::"
]
},
{
@@ -72,13 +25,10 @@
"\n",
"Xarray provides two data structures: the `DataArray` and `Dataset`. The\n",
"`DataArray` class attaches dimension names, coordinates and attributes to\n",
- "multi-dimensional arrays while `Dataset` combines multiple arrays.\n",
+ "multi-dimensional arrays while `Dataset` combines multiple DataArrays.\n",
"\n",
"Both classes are most commonly created by reading data.\n",
- "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial.\n",
- "\n",
- "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n",
- "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name."
+ "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial."
]
},
{
@@ -88,7 +38,13 @@
"outputs": [],
"source": [
"import numpy as np\n",
- "import xarray as xr"
+ "import xarray as xr\n",
+ "import pandas as pd\n",
+ "\n",
+ "# When working in a Jupyter Notebook, you might want to customize Xarray display settings to your liking.\n",
+ "# The following settings reduce the amount of data displayed by default.\n",
+ "xr.set_options(display_expand_attrs=False, display_expand_data=False)\n",
+ "np.set_printoptions(threshold=10, edgeitems=2)"
]
},
{
@@ -97,7 +53,10 @@
"source": [
"### Dataset\n",
"\n",
- "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n"
+ "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n",
+ "\n",
+ "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n",
+ "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name."
]
},
{
@@ -147,14 +106,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "#### What is all this anyway? (String representations)\n",
+ "#### HTML vs text representations\n",
"\n",
"Xarray has two representation types: `\"html\"` (which is only available in\n",
"notebooks) and `\"text\"`. To choose between them, use the `display_style` option.\n",
"\n",
"So far, our notebook has automatically displayed the `\"html\"` representation (which we will continue using).\n",
- "The `\"html\"` representation is interactive, allowing you to collapse sections (left arrows) and\n",
- "view attributes and values for each value (right hand sheet icon and data symbol)."
+ "The `\"html\"` representation is interactive, allowing you to collapse sections (▶) and\n",
+ "view attributes and values for each variable (📄 and ≡)."
]
},
{
@@ -180,7 +139,7 @@
"- an unordered list of *coordinates* or dimensions with coordinates with one item\n",
" per line. Each item has a name, one or more dimensions in parentheses, a dtype\n",
" and a preview of the values. Also, if it is a dimension coordinate, it will be\n",
- " marked with a `*`.\n",
+ " printed in **bold** font.\n",
"- an alphabetically sorted list of *dimensions without coordinates* (if there are any)\n",
"- an unordered list of *attributes*, or metadata"
]
@@ -379,15 +338,6 @@
"methods on `xarray` objects:\n"
]
},
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd"
- ]
- },
{
"cell_type": "code",
"execution_count": null,
@@ -429,8 +379,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "**to_series**: This will always convert `DataArray` objects to\n",
- "`pandas.Series`, using a `MultiIndex` for higher dimensions\n"
+ "### to_series\n",
+ "This will always convert `DataArray` objects to `pandas.Series`, using a `MultiIndex` for higher dimensions.\n"
]
},
{
@@ -446,9 +396,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "**to_dataframe**: This will always convert `DataArray` or `Dataset`\n",
- "objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named\n",
- "for this.\n"
+ "### to_dataframe\n",
+ "\n",
+ "This will always convert `DataArray` or `Dataset` objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named for this. Since all columns in a `DataFrame` must share the same index, the variables are\n",
+ "broadcast to a common set of dimensions."
]
},
{
@@ -459,23 +410,6 @@
"source": [
"ds.air.to_dataframe()"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Since columns in a `DataFrame` need to have the same index, they are\n",
- "broadcasted.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "ds.to_dataframe()"
- ]
}
],
"metadata": {
diff --git a/workshops/scipy2024/index.ipynb b/workshops/scipy2024/index.ipynb
index 14e7172d..de817e0b 100644
--- a/workshops/scipy2024/index.ipynb
+++ b/workshops/scipy2024/index.ipynb
@@ -20,7 +20,7 @@
":::{admonition} Learning Goals\n",
"- Orient yourself to Xarray resources to continue on your Xarray journey!\n",
"- Effectively use Xarray’s multidimensional indexing and computational patterns\n",
- "- Understand how Xarray can wrap other array types in the scientific Python ecosystem\n",
+ "- Understand how Xarray integrates with other libraries in the scientific Python ecosystem\n",
"- Learn how to leverage Xarray’s powerful backend and extension capabilities to customize workflows and open a variety of scientific datasets\n",
":::\n",
"\n",
@@ -35,7 +35,7 @@
"| Introduction and Setup | 1:30 (10 min) | --- | \n",
"| The Xarray Data Model | 1:40 (40 min) | [Data structures](../../fundamentals/01_datastructures.ipynb) [Basic Indexing](../../fundamentals/02.1_indexing_Basic.ipynb) | \n",
"| *10 minute Break* \n",
- "| Indexing & Computational Patterns | 2:30 (50 min) | [Advanced Indexing](../../intermediate/indexing/indexing.md) [Computation Patterns](../../intermediate/01-high-level-computation-patterns.ipynb) | \n",
+ "| Indexing & Computational Patterns | 2:30 (50 min) | [Advanced Indexing](../../intermediate/indexing/indexing.md) [Computational Patterns](../../intermediate/01-high-level-computation-patterns.ipynb) | \n",
"| *10 minute Break* | \n",
"| Xarray Integrations and Extensions | 3:30 (50 min) | [The Xarray Ecosystem](../../intermediate/xarray_ecosystem.ipynb) | \n",
"| *10 minute Break* | \n",
@@ -81,6 +81,14 @@
"- Max Jones (CarbonPlan)\n",
"- Wietze Suijker (Space Intelligence)"
]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1",
+ "metadata": {},
+ "outputs": [],
+ "source": []
}
],
"metadata": {