Virus logic merge (#563)
* WIP merge viruses

* Allow frontend/server to restart on crash

* Fix genome downloads

* Fix linter errors

* WIP subtype select

* RSV ingest adjustments for new combined logic

* combined deployment - all 3 viruses

* Adapt RSV references/features to new format

* Conditional sequence preprocessing dependent on virus

* Add descriptions to flu references

* Fix bug where deleting subtype URL field affects selected location nodes

* Add subtype selectors

* Add example data tarballs, packaging script

* Custom coordinate support

* Additional custom sequences/custom coordinates support

* Unify all virus workflows

* Add back RSV preprocess step, update README, rename SARS2 ingest workflows

* Bump version to 2.7.0-dev2
atc3 authored Jul 26, 2022
1 parent 423bcf2 commit 632f57d
Showing 112 changed files with 10,727 additions and 4,310 deletions.
63 changes: 51 additions & 12 deletions README.md
@@ -1,3 +1,5 @@
![](https://covidcg.org/cg_logo_v13.png)

## COVID-19 CG (CoV Genetics)

**Article now up at eLife: [https://doi.org/10.7554/eLife.63409](https://doi.org/10.7554/eLife.63409)**
@@ -48,7 +50,15 @@ $ docker-compose up -d # Run all services
$ docker-compose down # Shut down all services when finished
```

**NOTE**: When starting from a fresh database, the server will automatically seed the database with data from the `example_data_genbank` folder. This process may take a few minutes as ~50K genomes are loaded into the database.
The default deployment (`docker-compose.yml`) will run all 3 sites at the same time (sars2, rsv, and flu). For virus-specific sites, see `docker-compose.sars2.yml`, etc. Run a specific deployment with:

```bash
docker compose -f docker-compose.sars2.yml build
docker compose -f docker-compose.sars2.yml up -d
...
```

**NOTE**: When starting from a fresh database, the server will automatically seed the database with data from the `example_data_genbank` folder. Data provided with the repository is in raw/gzipped form and needs to be unarchived and processed before the data can be loaded into the database. Please see the [Analysis Pipeline](#analysis-pipeline) section for instructions on processing this data.

### Dependency changes

@@ -187,21 +197,37 @@ For OSX M1 chips, use the alternative environment `environment_osx-arm64.yaml`.

### Ingestion

Three ingestion workflows are currently available, `workflow_genbank_ingest`, `workflow_custom_ingest`, and `workflow_gisaid_ingest`.
Currently available ingest workflows are:

SARS2:

- `workflow_sars2_gisaid_ingest`
- `workflow_sars2_genbank_ingest`
- `workflow_sars2_custom_ingest`

RSV:

- `workflow_rsv_genbank_ingest`
- `workflow_rsv_custom_ingest`

Flu:

**NOTE: While the GISAID ingestion pipeline is provided as open-source, it is intended only for internal use**.
- `workflow_flu_genbank_ingest`
- `workflow_flu_custom_ingest`

Both `workflow_genbank_ingest` and `workflow_gisaid_ingest` are designed to automatically download and process data from their respective data source. The `workflow_custom_ingest` can be used for analyzing and visualizing your own in-house SARS-CoV-2 data. More details are available in README files within each ingestion pipeline's folder. Each ingestion workflow is parametrized by its own config file . i.e., `config/config_genbank.yaml` for the GenBank workflow.
**NOTE: While the GISAID ingestion pipelines are provided as open-source, they are intended only for internal use**.

For example, you can run the GenBank ingestion pipeline with:
GenBank ingest pipelines are designed to automatically download and process data from GenBank.

"Custom" ingest pipelines can be used for analyzing and visualizing in-house data. More details are available in README files within each ingestion pipeline's folder. Each ingestion workflow is parametrized by its own config file, e.g., `config/config_sars2_genbank.yaml` for the SARS-CoV-2 GenBank workflow.

For example, you can run the SARS-CoV-2 GenBank ingestion pipeline with:

```bash
$ cd workflow_genbank_ingest
$ snakemake --use-conda
$ cd workflow_sars2_genbank_ingest
$ snakemake --use-conda # Conda required specifically for SARS2 GenBank ingest in order to run Pangolin lineage assignments
```
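
The workflow and config names listed above appear to follow the patterns `workflow_<virus>_<source>_ingest` and `config/config_<virus>_<source>.yaml`. The sketch below shows how a script might build both names from a virus and data source. The helper `ingest_paths` and the `AVAILABLE_INGESTS` table are hypothetical and not part of the repository. The config pattern is inferred from the file names that appear in this commit, so not every combination is confirmed to have a config file.

```python
from pathlib import PurePosixPath

# Ingest workflows listed in the README above; anything else is rejected.
AVAILABLE_INGESTS = {
    "sars2": ("gisaid", "genbank", "custom"),
    "rsv": ("genbank", "custom"),
    "flu": ("genbank", "custom"),
}


def ingest_paths(virus: str, source: str) -> tuple[str, str]:
    """Return (workflow folder, config file) for a virus/source pair.

    Hypothetical helper: the naming pattern is inferred from the names
    listed above and is not an API provided by the repository.
    """
    if source not in AVAILABLE_INGESTS.get(virus, ()):
        raise ValueError(f"no ingest workflow listed for {virus}/{source}")
    workflow = f"workflow_{virus}_{source}_ingest"
    config = PurePosixPath("config") / f"config_{virus}_{source}.yaml"
    return workflow, str(config)
```

For example, `ingest_paths("sars2", "genbank")` yields the folder and config file used in the commands above, while an unlisted pair such as `("flu", "gisaid")` raises `ValueError`.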

Both `workflow_genbank_ingest` and `workflow_gisaid_ingest` are designed to be run regularly, and attempt to chunk data in a way that minimizes expensive reprocessing/realignment in the downstream main analysis step. The `workflow_custom_ingest` pipeline does not attempt to chunk data to minimize expensive reprocessing but this can be accomplished outside of covidcg by dividing up your sequence data into separate FASTA files.

### Main Analysis

The main data analysis pipeline is located in `workflow_main`. It requires data, in a data folder, from the ingestion pipeline. The data folder is defined in the `config/config_[workflow].yaml` file. The path to the config file is required for the main workflow, as it needs to know what kind of data to expect (as described in the config files).
@@ -210,12 +236,25 @@ For example, if you ingested data from GenBank, run the main analysis pipeline w

```bash
cd workflow_main
snakemake --configfile ../config/config_genbank.yaml
snakemake --configfile ../config/config_sars2_genbank.yaml
```

This pipeline will align sequences to the reference sequence with `minimap2`, extract mutations on both the NT and AA level, and combine all metadata and mutation information into one file: `data_package.json.gz`.
This pipeline will align sequences to the reference sequence with `minimap2`, extract mutations on both the NT and AA level, and combine all metadata and mutation information. The output data can be uploaded to a PostgreSQL database with `workflow_main/scripts/push_to_database.py`, or you can use the output files directly for your own analyses.

### Example data

Example data from GenBank is provided for all viruses, and is located in gzipped tarballs inside the `example_data_genbank` folder. Data for some viruses is truncated by submission date in order to lighten data load and speed up development on smaller machines.

To extract the data:

```bash
$ cd example_data_genbank
$ tar -xzf sars2.tar.gz
$ tar -xzf rsv.tar.gz
$ tar -xzf flu.tar.gz
```

The output data can be uploaded to a PostgreSQL database with `workflow_main/scripts/push_to_database.py`. Or, you can use the output files directly for your own analyses.
These tarballs contain only raw sequences and metadata, and mimic the output from their respective ingest pipelines. Once the files are extracted, run the main analysis workflow described above.
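
When the extraction step needs to be scripted, for example on a machine without `tar`, the same thing can be done with Python's standard `tarfile` module. This is a minimal sketch: the helper name `extract_example_data` is hypothetical, and it assumes the archives are named `<virus>.tar.gz` as in the commands above. Only extract archives you trust.

```python
import tarfile
from pathlib import Path


def extract_example_data(folder, viruses=("sars2", "rsv", "flu")):
    """Unpack <virus>.tar.gz archives found in `folder`, in place.

    Hypothetical helper mirroring the `tar -xzf` commands above; returns
    the names of the archives that were actually extracted.
    """
    folder = Path(folder)
    extracted = []
    for virus in viruses:
        archive = folder / f"{virus}.tar.gz"
        if not archive.is_file():
            continue  # skip viruses whose tarball is not present
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=folder)
        extracted.append(archive.name)
    return extracted
```

Calling `extract_example_data("example_data_genbank")` from the project root would unpack whichever of the three tarballs are present and report which ones it handled.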

---

8 changes: 4 additions & 4 deletions build/build.sh
@@ -9,10 +9,10 @@ fi

# SARS2

gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg",_CONFIGFILE="config/config_gisaid.yaml",_TAG_NAME="${CG_VERSION}", . && \
gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-genbank",_CONFIGFILE="config/config_genbank.yaml",_TAG_NAME="${CG_VERSION}" . && \
gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-private",_CONFIGFILE="config/config_gisaid_private.yaml",_TAG_NAME="${CG_VERSION}" . && \
gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-alpha",_CONFIGFILE="config/config_alpha.yaml",_TAG_NAME="${CG_VERSION}", .
gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg",_CONFIGFILE="config/config_sars2_gisaid.yaml",_TAG_NAME="${CG_VERSION}", . && \
gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-genbank",_CONFIGFILE="config/config_sars2_genbank.yaml",_TAG_NAME="${CG_VERSION}" . && \
gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-private",_CONFIGFILE="config/config_sars2_gisaid_private.yaml",_TAG_NAME="${CG_VERSION}" . && \
gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-alpha",_CONFIGFILE="config/config_sars2_alpha.yaml",_TAG_NAME="${CG_VERSION}", .

# RSV

3 changes: 2 additions & 1 deletion config/config_flu_genbank.yaml
@@ -2,6 +2,7 @@
# GLOBAL
# ------------------

# Virus this config is written for
virus: "flu"

# Path to folder with downloaded and processed data
@@ -108,7 +109,7 @@ mutation_partition_break: "M"
# which is structured as "user1:pass1,user2:pass2,..."
login_required: false

dev_hostname: "http://localhost:5001"
dev_hostname: "http://localhost:5003"
prod_hostname: "https://flu.genbank.pathmut.org"
# prod_hostname: "http://localhost:8080"

2 changes: 1 addition & 1 deletion config/config_flu_gisaid.yaml
@@ -119,7 +119,7 @@ mutation_partition_break: "Y"
# which is structured as "user1:pass1,user2:pass2,..."
login_required: false

dev_hostname: "http://localhost:5001"
dev_hostname: "http://localhost:5003"
prod_hostname: "https://flu.gisaid.pathmut.org"
# prod_hostname: "http://localhost:8080"

129 changes: 0 additions & 129 deletions config/config_genbank.yaml

This file was deleted.

18 changes: 15 additions & 3 deletions config/config_rsv_custom.yaml
@@ -18,7 +18,7 @@ static_data_folder: "static_data"
example_data_folder: "example_data_genbank"

# Database for this virus
postgres_db: "rsvg_dev"
postgres_db: "rsv_custom_dev"
# ------------------
# INGEST
# ------------------
@@ -30,6 +30,8 @@ chunk_size: 100000
# ANALYSIS
# --------------------

segments: ["1"]

# Mutations with less than this number of global occurrences will be ignored
mutation_count_threshold: 3

@@ -52,15 +54,25 @@ group_cols:
description: ""
show_collapse_options: false

# AZ report options
report_gene: F
report_group_col: subtype
report_group_references:
A: KX858757.1
B: KX858756.1

# Surveillance plot options
# see: workflow_main/scripts/surveillance.py
surv_group_col: "subtype"
surv_start_date: "1996-01-01"
surv_start_date: "1956-01-01"
surv_period: "Y"
surv_min_combo_count: 50
surv_min_single_count: 50
surv_start_date_days_ago: 90
surv_end_date_days_ago: 30
surv_group_references:
A: KX858757.1
B: KX858756.1

# ---------------
# DATABASE
@@ -83,7 +95,7 @@ mutation_partition_break: "Y"
# which is structured as "user1:pass1,user2:pass2,..."
login_required: false

dev_hostname: "http://localhost:5001"
dev_hostname: "http://localhost:5002"
prod_hostname: "http://localhost:8080"

# ----------------------
21 changes: 14 additions & 7 deletions config/config_rsv_genbank.yaml
@@ -19,7 +19,7 @@ static_data_folder: "static_data/rsv"
example_data_folder: "example_data_genbank/rsv"

# Database for this virus
postgres_db: "rsvg_dev"
postgres_db: "rsv_genbank_dev"

# ------------------
# INGEST
@@ -32,6 +32,8 @@ chunk_size: 100000
# ANALYSIS
# --------------------

segments: ["1"]

# Mutations with less than this number of global occurrences will be ignored
mutation_count_threshold: 3

@@ -59,11 +61,13 @@ group_cols:
title: "Genotype"
description: ""
show_collapse_options: false
subtype:
name: "subtype"
title: "Subtype"
description: ""
show_collapse_options: false

# AZ report options
report_gene: F
report_group_col: subtype
report_group_references:
A: KX858757.1
B: KX858756.1

# Surveillance plot options
# see: workflow_main/scripts/surveillance.py
@@ -74,6 +78,9 @@ surv_min_combo_count: 50
surv_min_single_count: 50
surv_start_date_days_ago: 90
surv_end_date_days_ago: 30
surv_group_references:
A: KX858757.1
B: KX858756.1

# ---------------
# DATABASE
@@ -96,7 +103,7 @@ mutation_partition_break: "Y"
# which is structured as "user1:pass1,user2:pass2,..."
login_required: true

dev_hostname: "http://localhost:5001"
dev_hostname: "http://localhost:5002"
prod_hostname: "https://rsv.pathmut.org"
# prod_hostname: "http://localhost:8080"

5 changes: 3 additions & 2 deletions config/config_alpha.yaml → config/config_sars2_alpha.yaml
@@ -11,14 +11,15 @@ data_folder: "data"

# Path to folder with genome information (reference.fasta, genes.json, proteins.json)
# This path is relative to the project root
static_data_folder: "static_data"
static_data_folder: "static_data/sars2"

# Path to folder with data to use in development
# This path is relative to the project root
example_data_folder: "example_data_genbank"

# Database for this virus
postgres_db: "cg_dev"
postgres_db: "cg_alpha_dev"

# ------------------
# INGEST
# ------------------