Commit

update student version with curriculum book changes
actions-user committed May 29, 2023
1 parent 880cffe commit 38037e6
Showing 30 changed files with 175,135 additions and 175,030 deletions.
4 changes: 2 additions & 2 deletions book/Chapter1-GettingStarted/1.1_python_environment.md
@@ -17,7 +17,7 @@ A Python virtual environment is an isolated working copy of a specific version o

- You may have on your computer different Python codes with different versions of packages
- You give your code to a friend
- Some of your packages may depend on other packages, with a specific version. How to make sure you have the right version of everything?
- Some of your packages may depend on other packages, with a specific version. How do you make sure you have the right version of everything?

## How to deal with this?
Install [miniconda](https://docs.conda.io/en/latest/miniconda.html).
@@ -34,7 +34,7 @@ Install [miniconda](https://docs.conda.io/en/latest/miniconda.html).

... to check the conda version and make sure it's installed.

env list
conda env list

... to list the available environments (the starred * environment is the currently active environment).

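As a sketch of the conda workflow described above (the environment name `mlgeo` and the pinned Python version are illustrative, not course requirements):

```shell
# Create a new environment; name and Python version are examples
conda create -y -n mlgeo python=3.10

# Activate it (the shell prompt then shows the active environment)
conda activate mlgeo

# The starred entry in the listing is now "mlgeo"
conda env list

# Return to the base environment when finished
conda deactivate
```

The same cycle applies to any environment created from an `environment.yml` file.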
25 changes: 13 additions & 12 deletions book/Chapter1-GettingStarted/1.2_jupyter_environment.md
@@ -30,63 +30,64 @@ Used for:
* Text on RStudio files

### Basic Markdown commands
Headings <br>
***
Headings <br>

\# Heading level 1 <br>
\#\# Heading level 2 <br>
\#\#\# Heading level 3 <br>
***

Paragraphs: Leave a blank line
***

This is my first paragraph.

This is my second paragraph.
***


Bold text<br>
***

\*\*This is my bold text\*\*
***

Italic Text<br>
***

\*This is my italic text\*
***

Bold and italic text
***

\*\*\*This is my bold, italic text\*\*\*
***

Strikethrough text
***

\~\~Scratched Text\~\~
***

Markdown supports **HTML** text. For instance, one can <u>underline</u> a <u>text</u>
***

\<\u\>text\<\/\u\>
***

Line break: use `<br>`<br>
***

This is my first line.<br>
This is my second line.
***


Ordered list
***

1. First item
2. Second item
3. Third item
4. Fourth item
***

Unordered list
***

\- First item <br>
\- Second item <br>
\- Third item <br>
@@ -96,12 +97,12 @@ Unordered list
Link URL such as the course [Github](https://github.com/UW-ESS-DS/ESS490-590-Autmn22)

[Github](https://github.com/UW-ESS-DS/ESS490-590-Autmn22)

***

Insert images such as <img src="../img/GeoSMART_logo.svg" width="200"/>, use

<img src="glass.png" width="200"/>

***


LateX in the code cells
47 changes: 27 additions & 20 deletions book/Chapter1-GettingStarted/1.3_version_control_git.md
@@ -122,9 +122,11 @@ To add files to your repository:
(*) Every time you start working on your repository, make sure you have the up-to-date version in your local environment. In the CLI, this means:
git pull
To modify ``mycode.py``, use the command:
git checkout mycode.py
@@ -148,33 +150,38 @@ Then copy the ``environment.yml`` file from the MLGeo2022 course into your own r
2. **From the GitHub Desktop app**:
* Click `Add` on the top of the left sidebar
* Click `Create new repository` and choose: Name, short description, local path (avoid a home directory on other cloud (Dropbox or Github) to reduce headaches), initialize with a README, choose license, choose Git Ignore (most likely programming language)
* Open in Visual Studio Code or other prefered editor.
* Open in Visual Studio Code or other preferred editor.
* Check that the GitHub page is up to date
* Fetch, Push, Pull, etc
3. **From the CLI**:
Initialize a local directory as a git repository:
git init
git add *
git commit -m "my first commit"
git push
* Initialize a local directory as a git repository:
git init
git add *
git commit -m "my first commit"
git push
* Be aware of the need to use passwords or tokens. Your configuration may also be incomplete, so re-run the configuration steps listed above.
* You can now start writing code and documentation. GitHub uses _staging_ as terminology to keep track of the new changes that could be _committed_ to the remote server. To add the new script:
git add newfile.py
git commit -m "add new script"
git push
Be aware of the need to use passwords or tokens. Your configuration may also be incomplete, so re run configuration listed above.
* If you made some code modification but prefer using the version that is on the GitHub, you can unstage and reset the file from the last version:
git reset HEAD newfile.py
git checkout newfile.py
You can now start writing codes and documentation. GitHub uses _staging_ as terminology to keep track of the new changes that could be _committed_ to the remove server. To add the new script:
git add newfile.py
git commit -m "add new script"
git push
* To do the same for the entire repository:
If you made some code modification but prefer using the version that is on the GitHub, you can unstage and reset the file from the last version:
git reset HEAD newfile.py
git checkout newfile.py
To do the same for the entire repos:
git reset HEAD~
git checkout
git status
git reset HEAD~
git checkout
git status
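The staging and unstaging commands above can be sketched end to end (the repository and file names are illustrative; `git push` is omitted here because it requires a configured remote and credentials):

```shell
# Create a fresh repository and identify yourself to git
git init demo && cd demo
git config user.email "you@example.com" && git config user.name "Your Name"

# Stage and commit a first script
echo "print('hello')" > newfile.py
git add newfile.py
git commit -m "add new script"

# Modify the file, stage it, then change your mind
echo "print('oops')" >> newfile.py
git add newfile.py          # staged by mistake
git reset HEAD newfile.py   # unstage: index goes back to the last commit
git checkout newfile.py     # restore the committed version in the working tree
git status                  # working tree is clean again
```

After the `reset`/`checkout` pair, `newfile.py` is back to its last committed content.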
## Work as a team with GitHub
@@ -200,7 +207,7 @@ alt: GitHub project management, From: Earth Lab
Pull requests using GitHub: Found in [EarthDataScience](https://www.earthdatascience.org/courses/intro-to-earth-data-science/git-github/github-collaboration/). Source: Earth Lab, Alana Faller
```

* Use **GitHub Issues** to post bugs or performance issues, so that the contributors can keep track and address them. There are templates to posting issues, and online [discussions](https://medium.com/nyc-planning-digital/writing-a-proper-github-issue-97427d62a20f) about it. The main takeway are:
* Use **GitHub Issues** to post bugs or performance issues, so that the contributors can keep track of and address them. There are templates for posting issues, and online [discussions](https://medium.com/nyc-planning-digital/writing-a-proper-github-issue-97427d62a20f) about it. The main takeaways are:
- Avoid duplication, check if somebody else has had the same issue
- Use a template

@@ -219,7 +226,7 @@ Further discussions. [here](https://rewind.com/blog/best-practices-for-using-git

## Publish your software

If the software will be used for furture research and would be cited by the community, publish your software on Zenodo. GitHub provides guidelines [here](https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content).
If the software will be used for future research and would be cited by the community, publish your software on Zenodo. GitHub provides guidelines [here](https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content).


:::{note}
@@ -1,6 +1,6 @@
# 1.4 Computing Environments

We will discuss on how to work with various environments
We will discuss how to work with various environments

## Local environment.

@@ -33,7 +33,7 @@ HPC systems have 1) a compute cluster, 2) a scratch file system (temporary), and

Institutions may have their own HPC systems. At UW, the system is called [Hyak](https://hyak.uw.edu/).

National HPC resources require specific request for allocation. Requests are typically done using [ACCESS](https://allocations.access-ci.org/) or [TACC](https://www.tacc.utexas.edu/).
National HPC resources require a specific request for allocation. Requests are typically done using [ACCESS](https://allocations.access-ci.org/) or [TACC](https://www.tacc.utexas.edu/).


HPC can deploy virtual cloud systems to allow horizontal scaling and cloudstore-like file systems {cite:p}`.`.
@@ -60,7 +60,7 @@ Here is an example of a Google Colab:

### AWS

AWS is the amazon services for cloud. It is the cloud leader. Chapter 7 details access and usage of these resources.
AWS is the Amazon cloud service and the current cloud leader. Chapter 7 details access and usage of these resources.

Their JupyterHub for machine learning is run out of [Sagemaker Studio](https://aws.amazon.com/sagemaker/). The first 250 hours of use (within the first 2 months) are *free*.

@@ -73,7 +73,7 @@ Some specific data sets that could be used in this book:
* **Seismic Data**
- Southern California Seismic Network. [Here](https://aws.amazon.com/marketplace/pp/prodview-c4rk5lxymj43i?sr=0-99&ref_=beagle&applicationId=AWSMPContessa).
- Distributed Acoustic Sensing (DAS) PoroTomo experiment. [Here](https://aws.amazon.com/marketplace/pp/prodview-qd7w6cbnmssl2?sr=0-41&ref_=beagle&applicationId=AWSMPContessa).
- OpenEEW: low cost seismometers distributed in populated areas. [Here](https://aws.amazon.com/marketplace/pp/prodview-ot34yes3afyhq?sr=0-1&ref_=beagle&applicationId=AWSMPContessa)
- OpenEEW: low cost seismometers distributed in populated areas. [Here](https://aws.amazon.com/marketplace/pp/prodview-ot34yes3afyhq?sr=0-1&ref_=beagle&applicationId=AWSMPContessa).

* **Oceanography Data**

5 changes: 3 additions & 2 deletions book/Chapter2-DataManipulation/2.10_MLready_data.ipynb
@@ -1,13 +1,14 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2.10 ML-ready data\n",
"\n",
"\n",
"Preparing and pre-processing data to integrate in machine learning workflow is the fundamental towards good machine learning project.\n",
"Preparing and pre-processing data to integrate into a machine learning workflow is fundamental to a good machine learning project.\n",
"\n",
"- Organize the data in machine-readable formats and data structures that can be manipulated automatically in the ML workflow:\n",
" * arrange data in numpy arrays, Xarrays, or pandas. \n",
@@ -16,7 +17,7 @@
" * extract statistical, temporal, or spectral features (use tsfresh, tsfel, ...)\n",
" * transform the data into Fourier or Wavelet space (use scipy fft or cwt module)\n",
" * reduce dimension by taking the PCA or ICA of the data. Save these features into file or metadata (use scikit-learn PCA or FastICA module). \n",
" * explore the dimensionality of the remaining feature space. Find correlations among features (use plotly interactie plotting, seaborn scatterplot visualization, or the pandas.corr matrix)\n",
" * explore the dimensionality of the remaining feature space. Find correlations among features (use plotly interactive plotting, seaborn scatterplot visualization, or the pandas.corr matrix)\n",
" * Further reduce the dimension using:\n",
" + Feature *selection* finds the dimensions that explain the data without loss of information and ends with a smaller dimensionality of the input data. A *forward selection* approach starts with the one variable that decreases the error the most and adds variables one by one. A *backward selection* starts with all variables and removes them one by one.\n",
" + Feature *extraction* finds a new set of dimensions as a combination of the original dimensions. They can be supervised or unsupervised depending on the output information. \n",
4 changes: 2 additions & 2 deletions book/Chapter2-DataManipulation/2.1_Data_Definitions.md
Expand Up @@ -9,10 +9,10 @@ Modality refers to the field, or genre of measurements. Different modalities may

**The data format** refers to the specific type of parsing schema in a file (H5, CSV, JSON). It can be binary (H5), using standard character encodings (CSV, JSON), compressed (H5, Parquet), ... more details in Chapter 2.5.

The difference in dimensionalities among geoscientific data challenges the design of machine learning models across disciplines. For most machine-learning practictioner, data modalities are classified as **dimension**. One example is a geophysical model that uses sattelite imagery (2D in space) with time series (1D in time) from point-based sensor measurements to predict an output.
The difference in dimensionalities among geoscientific data challenges the design of machine learning models across disciplines. For most machine-learning practices, data modalities are classified by **dimension**. One example is a geophysical model that uses satellite imagery (2D in space) with time series (1D in time) from point-based sensor measurements to predict an output.

## Data Frames
A **DataFrame** is a tabular data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. Dataframes are relational databases. Their *data schema* defines how data is organized within the dataframe: it maps column names to specific value types.

Data frames can be saved in row based file formacs (Comma Separated Value CSV) or column-based formats (Parquet).
Data frames can be saved in row-based file formats (Comma Separated Value, CSV) or column-based formats (Parquet).
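As a hedged illustration of the two formats (assuming pandas is installed; `to_parquet` additionally needs the optional pyarrow or fastparquet dependency, so it is guarded here):

```python
import pandas as pd

# A small data frame; its schema maps column names to value types
df = pd.DataFrame({"station": ["A", "B"], "lat": [46.8, 47.6], "count": [3, 5]})

# Row-based format: each line of the CSV file is one record
df.to_csv("stations.csv", index=False)

# Column-based format: Parquet stores each column contiguously
try:
    df.to_parquet("stations.parquet")
except ImportError:
    pass  # optional pyarrow/fastparquet dependency not installed

# Reading the CSV back recovers the same table
round_trip = pd.read_csv("stations.csv")
print(round_trip.shape)  # → (2, 3)
```

Row-based files are convenient for appending records; column-based files compress better and read faster when only a few columns are needed.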

13 changes: 8 additions & 5 deletions book/Chapter2-DataManipulation/2.2_Numpy_arrays.ipynb
@@ -13,6 +13,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "47124ce9",
"metadata": {},
@@ -22,7 +23,7 @@
"\n",
"Sequence of data can be stored in python *lists*. Lists are very flexible; data of identical type can be appended to list on the fly.\n",
"\n",
"Numpy arrays area multi-dimensional objects of specific data types (floats, strings, integers, ...). ! Numpy arrays should be declared first ! Allocating memory of the data ahead of time can save computational time. Numpy arrays support arithmetic operations.\n",
"Numpy arrays are multi-dimensional objects of specific data types (floats, strings, integers, ...). Numpy arrays should be declared first! Allocating memory for the data ahead of time can save computational time. Numpy arrays support arithmetic operations.\n",
"There are numerous tutorials to get help. https://www.earthdatascience.org/courses/intro-to-earth-data-science/scientific-data-structures-python/numpy-arrays/"
]
},
@@ -36,8 +37,8 @@
"# import module\n",
"import numpy as np\n",
"\n",
"# define an array of dimension one from.\n",
"#this is a list of floats:\n",
"# define an array of one dimension\n",
"# this is a list of floats:\n",
"a=[0.7 , 0.75, 1.85]\n",
"# convert a list to a numpy array\n",
"a_nparray = np.array(a)"
@@ -95,11 +96,12 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "178b20ce",
"metadata": {},
"source": [
"## Introduction fo Matplotlib\n",
"## Introduction to Matplotlib\n",
"\n",
"**Some tips from Software-Carpentry**\n",
"\n",
@@ -253,11 +255,12 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comparing 1D and 2D arrays\n",
"Comparing data often means calculating a distance or dissimilarity between the two data. Similarity is equialent to proximity of two data.\n",
"Comparing data often means calculating a distance or dissimilarity between the two data. Similarity is equivalent to the proximity of two data.\n",
"\n",
"**Euclidean distance**\n",
"\n",
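The Euclidean distance mentioned above can be sketched in plain numpy (the arrays are illustrative examples, not data from the notebook):

```python
import numpy as np

# Two 1-D arrays ("signals") to compare
a = np.array([0.0, 1.0, 2.0])
b = np.array([0.0, 1.0, 4.0])

# Euclidean distance: square root of the sum of squared differences
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # → 2.0

# The same result using numpy's built-in norm
assert np.isclose(dist, np.linalg.norm(a - b))
```

A small distance means the two arrays are similar; identical arrays have distance zero.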