update student version with curriculum book changes
actions-user committed Oct 2, 2024
1 parent e33ceb5 commit 5cb4c7e
Showing 10 changed files with 175,483 additions and 174,940 deletions.
43 changes: 38 additions & 5 deletions README.md
@@ -1,10 +1,43 @@
# GeoSMART Curriculum Jupyter Book (ESS 469/569)

[![Deploy](https://github.com/geo-smart/mlgeo-book/actions/workflows/deploy.yaml/badge.svg)](https://github.com/geo-smart/mlgeo-book/actions/workflows/deploy.yaml)
[![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://geo-smart.github.io/mlgeo-book)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/geo-smart/mlgeo-book/HEAD?urlpath=lab)
[![Deploy](https://github.com/geo-smart/mlgeo-instructor/actions/workflows/deploy.yaml/badge.svg)](https://github.com/geo-smart/mlgeo-instructor/actions/workflows/deploy.yaml)
[![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://geo-smart.github.io/mlgeo-instructor)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/geo-smart/mlgeo-instructor/HEAD?urlpath=lab)
[![GeoSMART Library Badge](book/img/curricula_badge.svg)](https://geo-smart.github.io/curriculum)
[![Student Version](book/img/student_version_badge.svg)](https://geo-smart.github.io/mlgeo-book/)

## About
## Repository Overview

This repository stores configuration for GeoSMART curriculum content, specifically the student version of the book. This version of the book should never be directly edited, as the student version is automatically generated on push.
This repository stores configuration for GeoSMART curriculum content, specifically the teacher version of the book. Only this version of the book should ever be edited, as the student version is automatically generated on push by GitHub Actions.

## Making Changes

Edit the book content by modifying the `_config.yml`, `_toc.yml` and `*.ipynb` files in the `book` directory. The book is hosted on GitHub Pages and will be automatically updated on push, and the student book will also be created automatically on push.

Before pushing, set up a conda environment and build the book locally to make sure that it will also build with GitHub Actions. We accept rendered notebooks, but some oddities, such as kernels other than Python, will make the build crash. So we recommend that contributors first build the book with the added notebooks.

```sh
conda env create -f ./conda/environment.yml
conda activate curriculum_book

```

To modify the exact differences between this book and the student book, edit `.github/workflows/clean_book.py`. When you push, a GitHub Action clones the repo and runs this Python file, which modifies certain parts of the `*.ipynb` file contents and then pushes to the student repo. To edit the student repo's README, edit `STUDENT_README.md`. The GitHub Actions workflow also automatically replaces `README.md` with `STUDENT_README.md` in the student repo.

### `Student Response Sections`

One modification made by the `clean_book.py` workflow is to clear sections marked for student response. Code cells marked for student response may contain code in the teacher version of the book, but will have their code removed and replaced with a TODO comment in the student version.

To mark a code cell to be cleared, insert a markdown cell directly preceding it with the following content:

````markdown
```{admonition} Student response section
This section is left for the student to complete.
```
````
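
The actual contents of `clean_book.py` are not shown here, so the following is only a minimal sketch of how such a clearing step could work, assuming notebooks are read as plain JSON dictionaries. The names `MARKER`, `TODO`, and `clear_student_cells` are hypothetical and chosen for illustration; the real script may differ.

```python
import json

# Hypothetical constants: the real clean_book.py may use different text.
MARKER = "{admonition} Student response section"
TODO = "# TODO: complete this section"


def clear_student_cells(notebook: dict) -> dict:
    """Return a copy of the notebook with marked code cells cleared.

    A code cell is cleared when the cell directly before it is a
    markdown cell containing the student-response admonition.
    """
    cleaned = json.loads(json.dumps(notebook))  # deep copy via JSON round-trip
    previous_was_marker = False
    for cell in cleaned.get("cells", []):
        # "source" may be a string or a list of strings; join handles both.
        source = "".join(cell.get("source", []))
        if cell.get("cell_type") == "code" and previous_was_marker:
            cell["source"] = [TODO]
            cell["outputs"] = []
            cell["execution_count"] = None
        previous_was_marker = (
            cell.get("cell_type") == "markdown" and MARKER in source
        )
    return cleaned


# Small in-memory notebook used to exercise the function.
FENCE = "`" * 3
example = {
    "cells": [
        {
            "cell_type": "markdown",
            "source": [
                FENCE + "{admonition} Student response section\n",
                "This section is left for the student to complete.\n",
                FENCE,
            ],
        },
        {"cell_type": "code", "source": ["answer = 42"],
         "outputs": [], "execution_count": 1},
        {"cell_type": "code", "source": ["print('kept')"],
         "outputs": [], "execution_count": 2},
    ]
}
student = clear_student_cells(example)
```

In this sketch only the code cell directly after the marker is cleared; later code cells are left untouched, and the teacher notebook itself is not modified.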

## Serving Locally

Activate the `curriculum_book` conda environment (or any conda environment that has the necessary jupyter book dependencies). Navigate to the root folder of the curriculum book repository in anaconda prompt, then run `python server.py`.

On startup, the server runs `jb build book` to build all changes to the notebooks and create the compiled HTML. Pass the `--no-build` flag (or its `--nb` shorthand) to skip building any changes you've made to the notebooks. In that case, you can run `python server.py --nb` from any terminal with Python installed.
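
The source of `server.py` is not included here. As a rough illustration of the flag behaviour described above, here is a sketch using the standard `argparse` module. The function names `parse_args` and `startup_commands` are assumptions made for this example, not the actual implementation.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --nb is an alias for --no-build."""
    parser = argparse.ArgumentParser(description="Serve the compiled book locally.")
    parser.add_argument(
        "--no-build", "--nb",
        dest="no_build",
        action="store_true",
        help="skip `jb build book` and serve the existing HTML",
    )
    return parser.parse_args(argv)


def startup_commands(args):
    """List the commands a server like this would run before serving."""
    return [] if args.no_build else [["jb", "build", "book"]]


default_run = startup_commands(parse_args([]))
skip_run = startup_commands(parse_args(["--nb"]))
```

Returning the command list instead of running it keeps the sketch free of side effects; a real server would likely pass the command to `subprocess.run` and then start serving the HTML.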
10 changes: 10 additions & 0 deletions STUDENT_README.md
@@ -0,0 +1,10 @@
# GeoSMART Curriculum Jupyter Book (ESS 469/569)

[![Deploy](https://github.com/geo-smart/mlgeo-book/actions/workflows/deploy.yaml/badge.svg)](https://github.com/geo-smart/mlgeo-book/actions/workflows/deploy.yaml)
[![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://geo-smart.github.io/mlgeo-book)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/geo-smart/mlgeo-book/HEAD?urlpath=lab)
[![GeoSMART Library Badge](book/img/curricula_badge.svg)](https://geo-smart.github.io/curriculum)

## About

This repository stores configuration for GeoSMART curriculum content, specifically the student version of the book. This version of the book should never be directly edited, as the student version is automatically generated on push.
1 change: 0 additions & 1 deletion book/Chapter1-GettingStarted/1.5_version_control_git.md
@@ -1,6 +1,5 @@
# 1.5 Version Control & GitHub

---
## Version Control
Version control is a system that organizes and tracks versions of code.

8 changes: 4 additions & 4 deletions book/Chapter1-GettingStarted/1.6_data_gallery.md
@@ -9,7 +9,7 @@ We provide a series of small, curated data set for the course. These data set ar
To download similar data, see the MLGeo-dataset repository (https://github.com/UW-MLGEO/MLGeo-dataset).


# How to Download a File from a MLGeo-dataset Repository
## How to Download a File from a MLGeo-dataset Repository

To download a file from a GitHub repository, follow these steps:

@@ -37,15 +37,15 @@ The collection of data aims to represent the diversity of data sets encountered

The data includes time series of various time scales (from the second to 100 ka). For the class, the data is stored in CSV files, but such data is typically stored in CSV, Arrow, H5, NetCDF, TileDB, mseed, and other discipline-specific formats.

<!-- For Vscode -->
![Geoscientific Temporal Data](geocast-alldata.png)

<!-- For Jupyter Book -->

```{figure} geocast-alldata.png
:width: 400px
---
name: Geoscientific Temporal Data
alt: PhD Comics, Version Control of a PhD
alt: Geoscientific Temporal Data
---
Figure 1: Geoscience Temporal Data: the x-axis represents normalized time; the y-axis shows normalized time series, offset by their index in the data set. The data sets include extreme events, dynamic seismic waves, rising CO2, and seasonal patterns over 15+ years such as hydrological and weather signals.
```
*Figure 1: Geoscience Temporal Data: the x-axis represents normalized time; the y-axis shows normalized time series, offset by their index in the data set. The data sets include extreme events, dynamic seismic waves, rising CO2, and seasonal patterns over 15+ years such as hydrological and weather signals.*
72 changes: 61 additions & 11 deletions book/Chapter2-DataManipulation/2.1_Data_Definitions.md
@@ -1,22 +1,72 @@
# 2.1 Data Definitions

Geoscientific data is particularly diverse: point measurements of soil moisture, high-rate seismogram time series (1000 samples per second), rasterized Landsat imagery, and geospatial and temporal simulated geophysical fields.
Data is foundational to geosciences, allowing us to observe, model, and predict natural processes. Understanding the various types of data, their formats, and how they are structured is key to using them effectively in research and applications. In this lecture, we will discuss data modalities encountered in geoscience, typical data formats, arrays, and data frames. Geoscientific data is particularly diverse: point measurements of soil moisture, high-rate seismogram time series (1000 samples per second), rasterized Landsat imagery, and geospatial and temporal simulated geophysical fields.


## The data modality
Modality refers to the field, or genre of measurements. Different modalities may be seismograms, GPS displacement time series, surface air temperature time series. All of them are point-based measurements, share the same data type (1D arrays), could be saved in the same data format (e.g., CSV file), but sense different physical fields.
<!-- For Vscode -->
![Geoscientific Temporal Data](Dalle-geoscientific-data.png)

**The data type** refers to the type of an object. Geoscientific data can be *numeric* (floats, integers), from which you can calculate things, or *categorical* (i.e., qualitative or nominal).
<!-- For Jupyter Book -->
```{figure} Dalle-geoscientific-data.png
:width: 400px
---
name: Geoscientific Data AI-Art
alt: Geoscientific Data AI-Art
---
AI-Art from Dall-e: geoscientific data with dataframes, geospatial, and temporal data.
```
*AI-Art from Dall-e: geoscientific data with dataframes, geospatial, and temporal data.*

**The data format** refers to the specific type of parsing schema in a file (H5, CSV, JSON). It can be binary (H5), using standard character encodings (CSV, JSON), compressed (H5, Parquet), ... more details in Chapter 2.5.

The difference in dimensionality among geoscientific data challenges the design of machine learning models across disciplines. In most machine-learning practice, data modalities are classified by **dimension**. One example is a geophysical model that combines satellite imagery (2D in space) with time series (1D in time) from point-based sensor measurements to predict an output.
---
## Data modality in Geosciences
In geosciences, data come in multiple modalities depending on the source, nature of the measurements, and intended applications:

## Data Frames
A **DataFrame** is a tabular data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. Dataframes are relational databases. Their *data schema* defines how data is organized within the dataframe: it defines the column names to specific values.

Data frames can be saved in row based file formats (Comma Separated Value CSV) or column-based formats (Parquet).
* **In-situ Data**: Measurements taken directly at the site of interest. In-situ data often comes as time series. Examples include:
* Temperature readings from weather stations.
* Seismic wave data from seismographs.
* Soil moisture content from field sensors.
* **Remote Sensing Data**: Collected from instruments not in direct contact with the object of study, often using satellites, drones, or aircraft. *Geospatial Data* are tied to specific locations on Earth’s surface, often represented as maps or grids (e.g., GIS data). Examples include:
* Spectral data (e.g., multispectral or hyperspectral images) from satellites.
* Topography data using LiDAR or radar systems.
* Sea surface temperature from satellites.
* **Model Data**: Simulated data generated from computational models. For example:
* Climate models predicting future temperatures or precipitation.
* Hydrological models simulating water flow in river basins.
* **Geophysical Data**: Subsurface measurements derived through indirect methods like seismic surveys, gravity, or magnetic studies.


## Data Formats in Geosciences
Geoscientific data is typically stored in formats that optimize storage, access, and sharing. Common formats include:

**NetCDF** (Network Common Data Form): Commonly used for multidimensional scientific data, such as atmospheric, oceanic, or climate model outputs. It efficiently stores array-based data with metadata.

**HDF** (Hierarchical Data Format): Similar to NetCDF but more general, used for large datasets including satellite imagery.

**CSV** (Comma-Separated Values): A simple format for tabular data. It's human-readable and widely supported across software, but less efficient for large or multidimensional datasets.

**GeoTIFF**: A popular format for raster geospatial data, often used in remote sensing and GIS applications.

**Shapefiles**: A vector data format for geographic information system (GIS) software, which contains geometric locations and attribute information of spatial features.

Most of these formats are not cloud-optimized; we will next explore newer formats designed to accommodate large cloud storage systems.
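
To make the CSV entry above concrete, here is a small sketch using only the Python standard library. The column names and values are made up for illustration. It shows that CSV stores everything as human-readable text, so numeric values must be parsed after reading.

```python
import csv
import io

# Illustrative CSV content; the station names and readings are invented.
csv_text = (
    "date,station,temperature_c\n"
    "2024-01-01,A,1.5\n"
    "2024-01-02,A,2.0\n"
)

# DictReader uses the header row as the keys of each record.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every field arrives as a string, so convert explicitly.
temperatures = [float(row["temperature_c"]) for row in rows]
```

Binary formats such as NetCDF or HDF store typed arrays together with metadata, which avoids this parsing step and is one reason they scale better to large or multidimensional data.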

## Arrays
An array is a fundamental data structure used to store collections of values, often representing multidimensional data (e.g., gridded spatial data). Arrays in geosciences typically represent data like temperature, pressure, or rainfall on a grid.

Typical Dimensions of Arrays:
* 1D Arrays: A single sequence of data, such as temperature measurements over time at one location.
* 2D Arrays: Often represent gridded spatial data (e.g., a map of precipitation over a region).
* 3D Arrays: Can include additional dimensions, such as time or depth. For instance, a 3D array could represent temperature at various depths and over time for a given region, a 3D Earth model of geophysical properties such as seismic wavespeed, or time-varying snapshots of seismic wavefields, among others.
* 4D Arrays: Add even more complexity, such as a time-varying 3D grid (e.g., atmospheric data changing over space and time) or time-lapse images of subsurface properties.
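
The dimensions listed above can be sketched with NumPy arrays. The grid sizes and axis ordering `(time, depth, lat, lon)` below are arbitrary choices for illustration, not a convention required by any particular data set.

```python
import numpy as np

# Arbitrary illustrative sizes for each axis.
n_time, n_depth, n_lat, n_lon = 12, 5, 4, 8

series_1d = np.arange(n_time, dtype=float)            # one value per time step
grid_2d = np.ones((n_lat, n_lon))                     # a map over a region
volume_3d = np.ones((n_depth, n_lat, n_lon))          # a 3D model snapshot
field_4d = np.ones((n_time, n_depth, n_lat, n_lon))   # a time-varying 3D grid

# Averaging over the time axis reduces the 4D field to a 3D volume.
time_mean = field_4d.mean(axis=0)

# Fixing depth, latitude and longitude indices leaves a 1D time series.
point_series = field_4d[:, 0, 2, 3]
```

Reducing or indexing along an axis lowers the dimension by one, which is how a 4D model output is turned into the maps and time series typically used for plotting.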

## Data Frames
A data frame is a two-dimensional, tabular data structure, commonly used in data analysis. Data frames can be thought of as equivalent to a spreadsheet or database table, where:

* Each **column** represents a variable or feature (e.g., date, location, temperature).
* Each **row** corresponds to an observation or data point.
Data frames are popular in programming environments like R and Python (via the Pandas library) because they offer flexibility in handling mixed data types (numerical, categorical, etc.) and are ideal for statistical analysis and data manipulation.
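
As a minimal sketch of the row and column structure described above, the following builds a small Pandas data frame. The station names and temperature values are invented for illustration.

```python
import pandas as pd

# Each column is a variable; each row is one observation.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
        "station": ["A", "A", "B"],           # categorical variable
        "temperature_c": [1.5, 2.0, -0.5],    # numerical variable
    }
)

# Mixed column types allow grouping by a category and
# summarizing a numerical column in one step.
mean_by_station = df.groupby("station")["temperature_c"].mean()
```

Here `mean_by_station` is a Pandas Series indexed by station name, showing how categorical and numerical columns work together in a single table.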

[Lecture Slides](../../img/Google_Slides_Logo.svg)[! (https://docs.google.com/presentation/d/1PVu8vbYtX0G4W41TB537Irm5V845E4uPsIrWQRfoQB0/edit?usp=sharing)
See Lecture Slides.
## Lecture Slides
[![Lecture Slides](../img/Google_Slides_Logo.svg)](https://docs.google.com/presentation/d/1PVu8vbYtX0G4W41TB537Irm5V845E4uPsIrWQRfoQB0/edit?usp=sharing)