This project contains tools and scripts used to analyze and interpret seismic data for first break picking. These have been developed by Mila in collaboration with the Geological Survey of Canada (GSC), which is part of Natural Resources Canada (NRCan). We also provide links below to our multi-survey seismic dataset, which is generously hosted by NRCan.
Two publications are associated with this repository:
"Deep Learning Benchmark for First Break Detection from Hardrock Seismic Reflection Data", St-Charles et al., GEOPHYSICS, 2023: [Open Access link]
```bibtex
@article{stcharles2023hardpicks_preprint,
  title={Deep Learning Benchmark for First Break Detection from Hardrock Seismic Reflection Data},
  author={St-Charles, Pierre-Luc and Rousseau, Bruno and Ghosn, Joumana and Bellefleur, Gilles and Schetselaar, Ernst},
  journal={Geophysics},
  volume={89},
  number={1},
  pages={1--68},
  year={2023},
  publisher={Society of Exploration Geophysicists}
}
```
"A Multi-Survey Dataset and Benchmark for First Break Picking in Hard Rock Seismic Exploration", St-Charles et al., ML4PS 2021 (Neurips 2021 Workshop): [PDF link]
```bibtex
@inproceedings{stcharles2021hardpicks_workshop,
  title={A multi-survey dataset and benchmark for first break picking in hard rock seismic exploration},
  author={St-Charles, Pierre-Luc and Rousseau, Bruno and Ghosn, Joumana and Bellefleur, Gilles and Schetselaar, Ernst},
  booktitle={Proc. of the 2021 NeurIPS Workshop on Machine Learning for the Physical Sciences (ML4PS)},
  year={2021}
}
```
Before downloading any data, make sure you read and understand the data licensing terms below.
Mila and Natural Resources Canada have obtained licences from Glencore Canada Corporation and Trevali Mining Corporation to distribute field seismic data from the Brunswick 3D and Halfmile Lake 3D seismic surveys, respectively, under a Creative Commons Attribution 4.0 International License (CC BY 4.0). These datasets are in the Hierarchical Data Format (HDF5) and have first arrival labels included in trace headers.
The Lalor 3D and Sudbury 3D seismic data are distributed under the Open Government Licence – Canada. Canada grants to the licensee a non-exclusive, fully paid, royalty-free right and licence to exercise all intellectual property rights in the data. This includes the right to use, incorporate, sublicense (with further right of sublicensing), modify, improve, further develop, and distribute the Data; and to manufacture or distribute derivative products.
The formatting of these datasets is similar to that of the other two.
Please use the following attribution statement wherever applicable:
Contains information licensed under the Open Government Licence – Canada.
The HDF5 files are hosted on AWS, and can be downloaded directly:
We demonstrate how to parse and display the raw data in this notebook.
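If you just want a quick look at a file's contents outside of the notebook, a minimal `h5py` sketch along the following lines will do. The file path is a placeholder and no specific group or dataset names are assumed; the notebook documents the actual layout.

```python
import h5py

# Placeholder path: substitute the actual downloaded HDF5 file name.
survey_path = "/path/to/survey.hdf5"

with h5py.File(survey_path, "r") as f:
    # Recursively print every group and dataset along with its shape and dtype.
    def describe(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
        else:
            print(f"{name}/ (group)")
    f.visititems(describe)
```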
The cross-validation folds used in the NeurIPS 2021 ML4PS workshop paper are as follows:

Fold | Train | Valid | Test |
---|---|---|---|
Sudbury | Halfmile, Lalor | Brunswick | Sudbury |
Brunswick | Sudbury, Halfmile | Lalor | Brunswick |
Halfmile | Lalor, Brunswick | Sudbury | Halfmile |
Lalor | Brunswick, Sudbury | Halfmile | Lalor |
For the GEOPHYSICS 2023 version, refer to Table 3 of the paper.
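If it is convenient to reuse the ML4PS fold assignments above in your own scripts, they can be transcribed as a plain Python mapping; this is only a transcription of the table, not an object provided by the `hardpicks` package.

```python
# ML4PS 2021 cross-validation folds, keyed by the name of the test survey.
ML4PS_FOLDS = {
    "Sudbury": {"train": ["Halfmile", "Lalor"], "valid": ["Brunswick"], "test": ["Sudbury"]},
    "Brunswick": {"train": ["Sudbury", "Halfmile"], "valid": ["Lalor"], "test": ["Brunswick"]},
    "Halfmile": {"train": ["Lalor", "Brunswick"], "valid": ["Sudbury"], "test": ["Halfmile"]},
    "Lalor": {"train": ["Brunswick", "Sudbury"], "valid": ["Halfmile"], "test": ["Lalor"]},
}
```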
We thank Glencore Canada Corporation and Trevali Mining Corporation for providing access and allowing us to include and distribute the Brunswick 3D and Halfmile 3D seismic data as part of this benchmark dataset. We also thank E. Adam, S. Cheraghi, and A. Malehmir for providing first breaks for the Brunswick, Halfmile, and Sudbury data.
The `hardpicks` package is provided as a reference for researchers to see how we implemented and trained the models used in the experiments described in our papers. It is NOT a production-ready codebase, and it requires a decent understanding of Python and PyTorch to dig into and use.

For in-depth examples on how to parse the proposed dataset, on how to use different parts of the `hardpicks` package, and for information about the hyperparameters used in configuration files, refer to the notebooks present in the `examples` subdirectory here.
With the proposed `hardpicks` package and its API, the entry point to train deep neural networks is the following script:

`hardpicks/main.py`

This script takes in many command-line arguments to describe the model that should be trained and where the data is located. Most of the required arguments come in the form of YAML configuration files; some examples of these can be found in the `examples` subdirectory here.
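As a rough illustration of how such a configuration file could be inspected before launching a run, the sketch below simply loads it with PyYAML and prints the top-level entries; the path used is one of the example configs shipped in the repository, and no specific keys are assumed.

```python
import yaml  # provided by the PyYAML package

# Example config from the repository; any other YAML config can be substituted.
config_path = "data/fbp/folds/foldA.yaml"

with open(config_path, "r") as f:
    config = yaml.safe_load(f)

# Print the top-level hyperparameter names and their values.
for key, value in config.items():
    print(f"{key}: {value!r}")
```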
The following tables describe at a high level the content of the code base.
Folder | Description |
---|---|
`config/` | Configuration files and utilities for experiment definitions and project management. |
`data/` | Contains the directory structure where raw data will be parsed as well as other task-specific data files (e.g. bad gather lists, fold configurations, mlflow analysis scripts, ...). |
`docs/` | Scripts used to generate HTML documentation for the project. |
`examples/` | Many examples on how to execute the code on different platforms. |
`hardpicks/` | Python package that contains all the modules and utilities to perform deep learning model training and evaluation. |
`tests/` | Battery of unit tests. |
Some utility scripts in the code base may be of particular interest to new users:
Script | Description |
---|---|
`hardpicks/analysis/fbp/bad_gathers/dataframe_analyzer.py` | GUI inspection tool used to identify poorly annotated line gathers for first break picking. |
`./linting_test.sh` | Used in our Continuous Integration (CI) pipeline to ensure code quality. Not directly related to the project. |
All functionality related to the training or evaluation of predictive models is implemented as part of the `hardpicks` package. A brief description of its subpackages is provided below. For more information on these, visit the package's README pages.
Library Subpackages | Description |
---|---|
`hardpicks.analysis` | Contains standalone scripts and utilities used for the generation of plots and tables used for data analysis. Some of these scripts may be outdated, as they are typically not updated each time datasets and preprocessing techniques change. More information here. |
`hardpicks.data` | Contains classes and functions used for data loading. This subpackage is used by both analysis scripts and model training/evaluation scripts. More information here. |
`hardpicks.metrics` | Contains evaluation utilities used to compute metrics and produce reports during/after model training. More information here. |
`hardpicks.models` | Contains modules and layer implementations used to construct predictive models as well as optimizer and scheduler implementations used for training. More information here. |
`hardpicks.utils` | Contains generic utility functions used across all other subpackages. More information here. |
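Once the package is installed, the subpackages listed above can be imported directly; the sketch below uses only the subpackage names from the table and assumes no specific classes or functions.

```python
# Import the top-level package and the subpackages described above.
import hardpicks
from hardpicks import analysis, data, metrics, models, utils

# Listing a subpackage's modules is one way to start exploring the code base.
import pkgutil
for module_info in pkgutil.iter_modules(data.__path__):
    print(f"hardpicks.data.{module_info.name}")
```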
Copyright (c) 2023 Mila - Institut Québécois d'Intelligence Artificielle.
This software package is licensed under Apache 2.0 terms. See the LICENSE file for more information. For the license acknowledgements of 3rd-party dependencies, refer to the ACKNOWLEDGEMENTS file.
This project relies on Conda to create a virtual environment and manage dependencies. See https://anaconda.org/ for details.
Create the conda environment (this might take some time, as most packages are pinned!):

```bash
conda env create -f environment.yml
```

Activate the environment:

```bash
conda activate hardpicks-dev
```

Install the project for development:

```bash
python setup.py develop
```
Note that this command registers `main.py` as the entry point, so the corresponding `main` script is available on the executable path after installation.
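To sanity-check the installation, something along these lines should work, assuming the entry point is registered as described above; the `--help` output depends on the script's argument parser.

```bash
# Confirm the package can be imported and that the entry point is on the PATH.
python -c "import hardpicks; print(hardpicks.__file__)"
main --help
```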
The code base is unit tested using the `pytest` library. To run the tests (from the root folder):

```bash
pytest
```
This does not require the presence of GPUs.
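During development it can be handy to run only part of the suite; the commands below use standard `pytest` options and do not assume anything specific about the tests in this repository.

```bash
# Run the suite verbosely, stopping at the first failure.
pytest -v -x

# Run only the tests whose names match a keyword expression (e.g. data-related tests).
pytest -k "data"
```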
Note that the code should already be installed at this point. The `main.py` script can be invoked as follows:
```bash
# DATA_BASE_DIR: location of the data; OUTPUT: output directory; CONFIG: model and training configuration parameters.
# MLFLOW / TENSORBOARD: where to direct mlflow/tensorboard logs (OPTIONAL).
# --gpu: which GPU to train on (relevant when many jobs run on a multi-GPU machine).
main --data $DATA_BASE_DIR \
    --output $OUTPUT \
    --mlflow-output=$MLFLOW \
    --tensorboard-output=$TENSORBOARD \
    --config $CONFIG \
    --gpu "0" \
    --disable-progressbar >& $CONFIG_DIR/$LOG_FILENAME &
```
Note that code execution is in principle possible on a CPU, but is extremely slow for serious training.
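Before launching a long run, a quick way to check whether a usable GPU is visible to PyTorch is the snippet below; this is plain PyTorch, not part of the `hardpicks` package.

```python
import torch

# True if at least one CUDA-capable GPU is visible to PyTorch.
print(torch.cuda.is_available())
# Names of the visible devices, if any.
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```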
Running a hyperparameter search with Orion can be done as follows.
```bash
orion -v hunt --config $ORION_CONFIG main \
    --data $DATA_BASE_DIR \
    --output $OUTPUT/'{exp.working_dir}/{exp.name}_{trial.id}/' \
    --mlflow-output=$MLFLOW \
    --tensorboard-output=$TENSORBOARD \
    --config $CONFIG \
    --gpu "0" \
    --disable-progressbar >& $CONFIG_DIR/$LOG_FILENAME &
```
For a concrete example of the Orion config file, see `data/fbp/folds/orion_config.yaml`. Also, the model and training configuration file should indicate which parameters Orion should search over; for a concrete example, see `data/fbp/folds/foldA.yaml`.
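For reference, Orion can pick up priors declared inside the configuration file passed to the script using its `orion~` notation; a minimal sketch could look like the following, where the parameter names are purely illustrative and not taken from `foldA.yaml`.

```yaml
# Hypothetical hyperparameters annotated with Orion priors.
learning_rate: 'orion~loguniform(1e-5, 1e-2)'
batch_size: 'orion~choices([16, 32, 64])'
```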
To automatically generate the documentation for this project, `cd` to the `docs` folder, then run:

```bash
make html
```

To view the documentation locally, open `docs/_build/html/index.html` in a browser.