This repository is meant as an evolving guide/log of the collaborative research undertaken within the Climate Data Science Lab (CDSLab).
We hope that this living document can serve as inspiration to other groups who want to try alternative ways of conducting science, and that a variety of feedback will continuously improve it.
TBW
To execute a collaborative project we need to define how to organize the two main components of a research project: Work and Code/Data.
The Work Structure determines when to have meetings, what to define as milestones, and, as a bonus, how to celebrate successful milestones. The Code/Data Structure sets rules on how the data needed for a particular project is generated, processed, and checked.
As an initial experiment we will try to organize our work around fixed work intervals (often called 'sprints').
A sprint will last 2 weeks and start and end with a synchronous meeting.
- Sprint Planning:
- Review/adjust tasks
- Prioritize tasks for this sprint
- Distribute tasks among members (where possible, assign pairs to a task).
- Sprint Review:
- Demo of progress/findings/failures to all members.
- Feedback on the work structure: How is it going? How could we improve?
Within each sprint, members are expected to commit a certain amount of their weekly hours to the sprint, but are free to organize when they work. This leaves freedom to accommodate different work styles (e.g. an hour every other day vs. a full-day hack).
- How often (or how rarely) do we have to check in (or stand up) to keep things flowing nicely, but not bog our days down even more with meetings?
We believe that most modern science projects consist, at their core, of code which generates and analyzes data.
The basic building blocks of a 'science project' in this context are:
- Data
- Code
- Publications (paper, blogpost, report)
Let's start with a very simplistic project.
In this case the code in the repository generates some figures from the data, combines them with some text, and we have a paper 🤗.
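As a purely hypothetical illustration, the entire 'code' of such a project could be a single script that loads the data, produces a figure, and saves it for inclusion in the manuscript. The file names and variable names below are made up for the sake of the example.

```python
# make_figure1.py -- hypothetical, minimal figure script for a simplistic
# single-repository project. File and variable names are assumptions.
import xarray as xr
import matplotlib.pyplot as plt

# Load the (already processed) data shipped with the repository.
ds = xr.open_dataset("data/timeseries.nc")  # assumed file name

# Produce a single figure from the data.
fig, ax = plt.subplots()
ds["temperature"].plot(ax=ax)  # assumed variable name
ax.set_title("Figure 1: Example time series")

# Save the figure so it can be compiled into the paper.
fig.savefig("figures/figure1.png", dpi=300)
```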
The reality of most science projects is not that simple. In many cases, to get to a published result, projects depend on several datasets and require heavy processing (often creating intermediate data in the process) before the final paper can be written up. Furthermore, these intermediate data might actually be used in several papers. It is thus useful to separate the concept of a 'paper' repository from a 'project' repository:
These two types of repositories have very different requirements:
- The 'paper' repository should contain lightweight code to produce figures and compile a document from these figures. Ideally this repository could be sent to a journal as is, and would be fully reproducible from the reduced data generated in the 'project' repo. This repository does not necessarily need to contain all the bells and whistles of a python package.
- The 'project' repository is the main workhorse of a scientific project. This repository needs to contain organized, reproducible pipelines to transform the raw data into the output needed to produce the figures in the 'paper' repo.
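To make the role of the 'project' repository concrete, here is a sketch of what a single pipeline stage could look like. The store paths, variable names, and the reduction applied are all assumptions made for illustration only.

```python
# pipeline/steps.py -- sketch of one pipeline stage in the 'project' repo.
# All paths and the processing logic are hypothetical.
import xarray as xr

RAW_STORE = "gs://example-bucket/raw/sst.zarr"                        # assumed ARCO input
INTERMEDIATE_STORE = "gs://example-bucket/proc/sst_annual_mean.zarr"  # assumed output


def compute_annual_mean(raw_store: str = RAW_STORE,
                        out_store: str = INTERMEDIATE_STORE) -> None:
    """Transform raw data into a reduced intermediate product."""
    ds = xr.open_zarr(raw_store)
    # Reduce the raw data: a simple annual mean as a stand-in for whatever
    # heavy processing the real project requires.
    annual = ds.groupby("time.year").mean()
    annual.to_zarr(out_store, mode="w")


if __name__ == "__main__":
    compute_annual_mean()
```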
We assume that all 'raw' data in this project is already ARCO (Analysis-Ready, Cloud-Optimized) data. Working with local files on a supercomputer likely leads to different requirements, which we will not consider here.
The key task of the project repo is to represent and execute some sort of pipeline which transforms data in one or more stages (steps).
There are a variety of workflow 'engines' available. Let's try to outline our requirements and make an informed decision about which system best suits our needs.
- Reproducible Processing Pipeline.
- Compute only when necessary.
Basically we need two types of triggers that lead to a recomputation of any target (see the sketch after this list):
- Data-based triggers: e.g. if a target store is not available (or doesn't fulfill some quality control) it needs to be regenerated.
- Code-based triggers: if anything in the code that produces the target store is changed, it needs to be reproduced. Ideally this would detect any imported modules and check these for changes too. If that is not possible we need to somehow prohibit imports...
- 'CI for data' - run automated checks on the data produced.
- Easily reproduce an older stage of the pipeline, e.g. for debugging.
- Agnostic to the choice of cloud storage/compute used.
- Versioning: This is trivial for the code, since we will use git for all of our development. But how can we tie stored data to e.g. a certain commit of the repo?
- We do not want to use notebooks in our pipeline. Notebooks can be used for exploration and final visualization, but any code that enters the actual pipeline needs to be refactored into modules/scripts.
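As a rough sketch of how the two trigger types (and a minimal 'CI for data' check) could be wired up by hand before delegating this to one of the engines below, assuming the targets are zarr stores; the helper names and the attribute conventions are invented for illustration.

```python
# pipeline/triggers.py -- sketch of data-based and code-based triggers.
# Helper names and attribute conventions are assumptions; a real workflow
# engine would handle most of this for us.
import hashlib
import inspect
import subprocess

import xarray as xr


def code_hash(func) -> str:
    """Hash the source of the function that produces a target.

    Note: this only sees the function body itself, not any imported
    modules -- exactly the limitation discussed above.
    """
    return hashlib.sha256(inspect.getsource(func).encode()).hexdigest()


def git_commit() -> str:
    """Current commit of the repo, so stored data can be tied back to the code."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def passes_quality_control(ds: xr.Dataset) -> bool:
    """Minimal stand-in for 'CI for data': check the output is not all NaN."""
    return bool(ds.notnull().any().to_array().all())


def needs_recompute(store: str, producer) -> bool:
    """Return True if either a data-based or a code-based trigger fires."""
    try:
        ds = xr.open_zarr(store)
    except Exception:
        return True  # data-based trigger: store missing or unreadable
    if not passes_quality_control(ds):
        return True  # data-based trigger: store fails quality control
    # Code-based trigger: the producing code changed since the store was written.
    return ds.attrs.get("code_hash") != code_hash(producer)
```

The corresponding write step would then record `code_hash(producer)` and `git_commit()` in the zarr attributes of each target, which also gives a crude answer to the versioning question above.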
- Manually setting up GitHub Actions
- Snakemake
- Prefect
- Xarray-Beam
- ...