Lecture notes and notebooks for statistical data analysis and machine learning in Earth science

leonard-seydoux/earth-data-science-2024

Earth data science

This repository contains the material for the class Earth data science delivered at the Institut de Physique du Globe de Paris for master students. This course is a legacy of the course of the same name by Antoine Lucas. The lectures are taught by Léonard Seydoux and the practicals by Antoine Lucas, Alexandre Fournier, Éléonore Stutzmann, and Léonard Seydoux.

The goal of this course is to introduce students to the basics of scientific computing and to the use of Python for solving geophysical problems. The course mostly consists of practical sessions where students learn how to use Python to solve problems related to the Earth sciences with statistical and machine learning methods. The course and notebooks rely on the Python libraries scikit-learn, pandas, and PyTorch, and on the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

The course contains 8 hours of lectures followed by 20 hours of practical sessions based on Jupyter notebooks. The lecture notes are available in the lectures folder and the practicals in the labs folder. You can find an introductory README file in each folder.

Lectures

The lectures fit within two sessions of four hours each. The subfolders of the lectures folder contain the following lectures.

  1. Introduction to machine learning. This section introduces the use cases of machine learning in the Earth sciences and the basic concepts of supervised and unsupervised learning.
  2. Definitions. This section introduces the basic definitions of machine learning, including the various notations and the different types of learning.
  3. Supervised machine learning: regression. This section introduces the concept of regression and the different metrics used to evaluate the performance of a regression model.
  4. Supervised machine learning: classification. This section introduces the concept of classification and the different metrics used to evaluate the performance of a classification model.
  5. Deep learning: the multilayer perceptron. This section introduces the concept of deep learning and the multilayer perceptron.
  6. Deep learning: convolutional neural networks. This section introduces the concept of convolutional neural networks.
  7. Applications. This section introduces the different applications of machine learning in the Earth sciences.
  8. Unsupervised learning. This section introduces the concept of unsupervised learning, building heavily on the concepts seen previously.
  9. Notebooks. A brief introduction to the Jupyter notebooks and the Python programming language.

Labs

The following list of labs is proposed in the different subfolders of the labs folder.

  1. Self-evaluation (1 hour). This lab is a self-evaluation of your Python skills. It is required to enroll in the course. A small solution will be delivered at the beginning of the lab session.
  2. River sensor calibration (4 hours). This lab lets you perform a first simple machine learning task: the calibration of a river sensor with supervised learning, where the goal is to predict the suspended sediment concentration from the turbidity of the water.
  3. Earthquake location (~4 hours). In this lab, we will use Bayesian inference to locate the earthquake that occurred near the city of Le Teil in November 2019. We will also play around with prior distributions and see how they affect the posterior distribution.
  4. Lidar data classification (~8 hours). In this lab, we will classify the points of lidar point clouds into different classes using supervised machine learning tools. Since this is a more complex task, we will take more time to complete it.
  5. Deep learning (~4 hours). In this lab, we will explore several deep learning architectures to perform several supervised tasks, including digit recognition and volcano monitoring.

The solutions to the labs will be released progressively during the course in the corresponding folders. Note that the provided solutions are not necessarily the best ones. The main idea of these sessions is for you to be curious and to look for the solutions that best fit your needs and your understanding of the problem. Some of you may complete the tasks faster than others; we encourage you to help your peers during the labs and to explore aspects of the problems that the labs do not cover.
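To give a flavor of the river sensor calibration lab, here is a minimal sketch of a supervised regression that predicts a concentration from a turbidity reading. It is not the lab solution: the data are synthetic, the linear law and its coefficients (true_slope, true_intercept) are made up for illustration, and NumPy's polyfit is used only to keep the sketch short, whereas the lab itself may rely on other tools such as scikit-learn.

```python
import numpy as np

# Synthetic data, made up for illustration only (not the course dataset).
rng = np.random.default_rng(seed=0)
turbidity = rng.uniform(0.0, 100.0, size=200)  # arbitrary units
true_slope, true_intercept = 2.0, 5.0  # hypothetical calibration law
noise = rng.normal(0.0, 1.0, size=200)
concentration = true_slope * turbidity + true_intercept + noise

# Hold out the last 50 samples to evaluate the model on unseen data.
x_train, x_test = turbidity[:150], turbidity[150:]
y_train, y_test = concentration[:150], concentration[150:]

# Fit a straight line by ordinary least squares.
# polyfit returns the coefficients from the highest degree to the lowest.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Evaluate with the root-mean-square error on the held-out samples.
y_pred = slope * x_test + intercept
rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
print(f"slope={slope:.2f}, intercept={intercept:.2f}, rmse={rmse:.2f}")
```

Evaluating on held-out samples rather than on the training samples is the design choice to notice here: it measures how well the calibration generalizes to readings the model has not seen.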

Running the Jupyter labs

Python environment

The easiest way to run most notebooks of this course is to create a new Anaconda environment with the following set of commands. We decided not to go with an environment file to allow for more flexibility in Python versions.

The following lines create a new environment called earth-data-science without any package installed. Then, we install the most constrained package first (namely, obspy), which pulls in the latest compatible versions of python, numpy and scipy. Finally, we install the rest of the packages.

conda create -n earth-data-science
conda activate earth-data-science
conda install -c conda-forge obspy
conda install -c conda-forge numpy scipy matplotlib pandas jupyter scikit-learn cartopy ipywidgets rasterio seaborn
pip install tqdm 
pip install laspy

Once this is done, you must select the kernel earth-data-science in Jupyter to run the notebooks. Please inform your instructor if you have any problem with this.
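As a quick sanity check, the following sketch can be run in a notebook cell to see whether the selected kernel finds the packages installed by the commands above. The helper name missing_packages and the REQUIRED list are assumptions written for this example, not part of the course material; the list uses import names, which sometimes differ from the package names (scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec

# Import names of the packages installed by the commands above.
# Note that scikit-learn is imported as "sklearn".
REQUIRED = [
    "obspy", "numpy", "scipy", "matplotlib", "pandas", "sklearn",
    "cartopy", "ipywidgets", "rasterio", "seaborn", "tqdm", "laspy",
]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

If some packages are reported missing, the notebook is most likely running with a different kernel than earth-data-science.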

Execution

The notebooks can be run either locally or on a remote server. The remote server is available at https://charline.ipgp.fr, where you can log in with your IPGP credentials. Once logged in, you can use git clone to download the notebooks from this repository (e.g. git clone https://github.com/leonard-seydoux/earth-data-science.git).