Virus logic merge (#563)

* WIP merge viruses
* Allow frontend/server to restart on crash
* Fix genome downloads
* Fix linter errors
* WIP subtype select
* RSV ingest adjustments for new combined logic
* combined deployment - all 3 viruses
* Adapt RSV references/features to new format
* Conditional sequence preprocessing dependent on virus
* Add descriptions to flu references
* Fix bug where deleting subtype URL field affects selected location nodes
* Add subtype selectors
* Add example data tarballs, packaging script
* Custom coordinate support
* Additional custom sequences/custom coordinates support
* Unify all virus workflows
* Add back RSV preprocess step, update README, rename SARS2 ingest workflows
* Bump version to 2.7.0-dev2
vector-engineering · Jul 26, 2022 · 632f57d
1 parent 423bcf2
commit 632f57d

Showing 112 changed files with 10,727 additions and 4,310 deletions.
diff --git a/README.md b/README.md
@@ -1,3 +1,5 @@
+![](https://covidcg.org/cg_logo_v13.png)
+
 ## COVID-19 CG (CoV Genetics)
 
 **Article now up at eLife: [https://doi.org/10.7554/eLife.63409](https://doi.org/10.7554/eLife.63409)**
@@ -48,7 +50,15 @@ $ docker-compose up -d # Run all services
 $ docker-compose down # Shut down all services when finished
 ```
 
-**NOTE**: When starting from a fresh database, the server will automatically seed the database with data from the `example_data_genbank` folder. This process may take a few minutes as ~50K genomes are loaded into the database.
+The default deployment (`docker-compose.yml`) will run all 3 sites at the same time (sars2, rsv, and flu). For virus-specific sites, see `docker-compose.sars2.yml`, etc. Run a specific deployment with:
+
+```bash
+docker compose -f docker-compose.sars2.yml build
+docker compose -f docker-compose.sars2.yml up -d
+...
+```
+
+**NOTE**: When starting from a fresh database, the server will automatically seed the database with data from the `example_data_genbank` folder. Data provided with the repository is in raw/gzipped form and needs to be unarchived and processed before the data can be loaded into the database. Please see the [Analysis Pipeline](#analysis-pipeline) section for instructions on processing this data.
 
 ### Dependency changes
 
@@ -187,21 +197,37 @@ For OSX M1 chips, use the alternative environment `environment_osx-arm64.yaml`.
 
 ### Ingestion
 
-Three ingestion workflows are currently available, `workflow_genbank_ingest`, `workflow_custom_ingest`, and `workflow_gisaid_ingest`.
+Currently available ingest workflows are:
+
+SARS2:
+
+- `workflow_sars2_gisaid_ingest`
+- `workflow_sars2_genbank_ingest`
+- `workflow_sars2_custom_ingest`
+
+RSV:
+
+- `workflow_rsv_genbank_ingest`
+- `workflow_rsv_custom_ingest`
+
+Flu:
 
-**NOTE: While the GISAID ingestion pipeline is provided as open-source, it is intended only for internal use**.
+- `workflow_flu_genbank_ingest`
+- `workflow_flu_custom_ingest`
 
-Both `workflow_genbank_ingest` and `workflow_gisaid_ingest` are designed to automatically download and process data from their respective data source. The `workflow_custom_ingest` can be used for analyzing and visualizing your own in-house SARS-CoV-2 data. More details are available in README files within each ingestion pipeline's folder. Each ingestion workflow is parametrized by its own config file . i.e., `config/config_genbank.yaml` for the GenBank workflow.
+**NOTE: While the GISAID ingestion pipelines are provided as open source, they are intended only for internal use**.
 
-For example, you can run the GenBank ingestion pipeline with:
+GenBank ingest pipelines are designed to automatically download and process data from their respective data source.
+
+"Custom" ingest pipelines can be used for analyzing and visualizing in-house data. More details are available in the README files within each ingestion pipeline's folder. Each ingestion workflow is parametrized by its own config file, e.g., `config/config_sars2_genbank.yaml` for the SARS-CoV-2 GenBank workflow.
+
+For example, you can run the SARS-CoV-2 GenBank ingestion pipeline with:
 
 ```bash
-$ cd workflow_genbank_ingest
-$ snakemake --use-conda
+$ cd workflow_sars2_genbank_ingest
+$ snakemake --use-conda # Conda required specifically for SARS2 GenBank ingest in order to run Pangolin lineage assignments
 ```
 
-Both `workflow_genbank_ingest` and `workflow_gisaid_ingest` are designed to be run regularly, and attempt to chunk data in a way that minimizes expensive reprocessing/realignment in the downstream main analysis step. The `workflow_custom_ingest` pipeline does not attempt to chunk data to minimize expensive reprocessing but this can be accomplished outside of covidcg by dividing up your sequence data into separate FASTA files.
-
 ### Main Analysis
 
 The main data analysis pipeline is located in `workflow_main`. It requires data, in a data folder, from the ingestion pipeline. The data folder is defined in the `config/config_[workflow].yaml` file. The path to the config file is required for the main workflow, as it needs to know what kind of data to expect (as described in the config files).
@@ -210,12 +236,25 @@ For example, if you ingested data from GenBank, run the main analysis pipeline w
 
 ```bash
 cd workflow_main
-snakemake --configfile ../config/config_genbank.yaml
+snakemake --configfile ../config/config_sars2_genbank.yaml
 ```
 
-This pipeline will align sequences to the reference sequence with `minimap2`, extract mutations on both the NT and AA level, and combine all metadata and mutation information into one file: `data_package.json.gz`.
+This pipeline will align sequences to the reference sequence with `minimap2`, extract mutations at both the NT and AA levels, and combine all metadata and mutation information. The output data can be uploaded to a PostgreSQL database with `workflow_main/scripts/push_to_database.py`, or used directly for your own analyses.
+
+### Example data
+
+Example data from GenBank is provided for all viruses, and is located in gzipped tarballs inside the `example_data_genbank` folder. Data for some viruses is truncated by submission date in order to lighten data load and speed up development on smaller machines.
+
+To extract the data:
+
+```bash
+$ cd example_data_genbank
+$ tar -xzf sars2.tar.gz
+$ tar -xzf rsv.tar.gz
+$ tar -xzf flu.tar.gz
+```
 
-The output data can be uploaded to a PostgreSQL database with `workflow_main/scripts/push_to_database.py`. Or, you can use the output files directly for your own analyses.
+These tarballs contain only raw sequences and metadata, and mimic the output from their respective ingest pipelines. Once the files are extracted, run the main analysis workflow described above.
 
 ---
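
The README changes above pair each ingest workflow folder (`workflow_<virus>_<source>_ingest`) with a config file (`config/config_<virus>_<source>.yaml`). That convention can be sketched as a small lookup. This is only an illustration and not part of the repository: `ingest_paths` and `AVAILABLE` are hypothetical names, and the assumption that every workflow's config follows the same pattern is inferred from the SARS-CoV-2 GenBank example and the config files touched in this commit.

```python
from __future__ import annotations

# Hypothetical helper (not part of the repository): maps a virus and data
# source to the ingest workflow folder and config file named in the README.

# Ingest workflows listed in the README as of this commit.
AVAILABLE = {
    "sars2": ("gisaid", "genbank", "custom"),
    "rsv": ("genbank", "custom"),
    "flu": ("genbank", "custom"),
}


def ingest_paths(virus: str, source: str) -> tuple[str, str]:
    """Return (workflow_folder, config_file) for a virus/source pair.

    Assumes the config file follows config/config_<virus>_<source>.yaml,
    as in the README's SARS-CoV-2 GenBank example.
    """
    if source not in AVAILABLE.get(virus, ()):
        raise ValueError(f"no ingest workflow listed for {virus}/{source}")
    return (
        f"workflow_{virus}_{source}_ingest",
        f"config/config_{virus}_{source}.yaml",
    )


if __name__ == "__main__":
    folder, config = ingest_paths("sars2", "genbank")
    print(folder)  # workflow_sars2_genbank_ingest
    print(config)  # config/config_sars2_genbank.yaml
```

The returned config path is the one that would be passed to `snakemake --configfile` when running `workflow_main`, relative to the project root.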
 

diff --git a/build/build.sh b/build/build.sh
@@ -9,10 +9,10 @@ fi
 
 # SARS2
 
-gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg",_CONFIGFILE="config/config_gisaid.yaml",_TAG_NAME="${CG_VERSION}", . && \
-gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-genbank",_CONFIGFILE="config/config_genbank.yaml",_TAG_NAME="${CG_VERSION}" . && \
-gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-private",_CONFIGFILE="config/config_gisaid_private.yaml",_TAG_NAME="${CG_VERSION}" . && \
-gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-alpha",_CONFIGFILE="config/config_alpha.yaml",_TAG_NAME="${CG_VERSION}", .
+gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg",_CONFIGFILE="config/config_sars2_gisaid.yaml",_TAG_NAME="${CG_VERSION}", . && \
+gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-genbank",_CONFIGFILE="config/config_sars2_genbank.yaml",_TAG_NAME="${CG_VERSION}" . && \
+gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-private",_CONFIGFILE="config/config_sars2_gisaid_private.yaml",_TAG_NAME="${CG_VERSION}" . && \
+gcloud builds submit --config build/cloudbuild.yaml --substitutions=_TARGET="cg-alpha",_CONFIGFILE="config/config_sars2_alpha.yaml",_TAG_NAME="${CG_VERSION}", .
 
 # RSV
 

diff --git a/config/config_flu_genbank.yaml b/config/config_flu_genbank.yaml
@@ -2,6 +2,7 @@
 #       GLOBAL
 # ------------------
 
+# Virus this config is written for
 virus: "flu"
 
 # Path to folder with downloaded and processed data
@@ -108,7 +109,7 @@ mutation_partition_break: "M"
 # which is structured as "user1:pass1,user2:pass2,..."
 login_required: false
 
-dev_hostname: "http://localhost:5001"
+dev_hostname: "http://localhost:5003"
 prod_hostname: "https://flu.genbank.pathmut.org"
 # prod_hostname: "http://localhost:8080"
 

diff --git a/config/config_flu_gisaid.yaml b/config/config_flu_gisaid.yaml
@@ -119,7 +119,7 @@ mutation_partition_break: "Y"
 # which is structured as "user1:pass1,user2:pass2,..."
 login_required: false
 
-dev_hostname: "http://localhost:5001"
+dev_hostname: "http://localhost:5003"
 prod_hostname: "https://flu.gisaid.pathmut.org"
 # prod_hostname: "http://localhost:8080"
 

diff --git a/config/config_genbank.yaml b/config/config_genbank.yaml
diff --git a/config/config_rsv_custom.yaml b/config/config_rsv_custom.yaml
@@ -18,7 +18,7 @@ static_data_folder: "static_data"
 example_data_folder: "example_data_genbank"
 
 # Database for this virus
-postgres_db: "rsvg_dev"
+postgres_db: "rsv_custom_dev"
 # ------------------
 #       INGEST
 # ------------------
@@ -30,6 +30,8 @@ chunk_size: 100000
 #       ANALYSIS
 # --------------------
 
+segments: ["1"]
+
 # Mutations with less than this number of global occurrences will be ignored
 mutation_count_threshold: 3
 
@@ -52,15 +54,25 @@ group_cols:
     description: ""
     show_collapse_options: false
 
+# AZ report options
+report_gene: F
+report_group_col: subtype
+report_group_references:
+  A: KX858757.1
+  B: KX858756.1
+
 # Surveillance plot options
 # see: workflow_main/scripts/surveillance.py
 surv_group_col: "subtype"
-surv_start_date: "1996-01-01"
+surv_start_date: "1956-01-01"
 surv_period: "Y"
 surv_min_combo_count: 50
 surv_min_single_count: 50
 surv_start_date_days_ago: 90
 surv_end_date_days_ago: 30
+surv_group_references:
+  A: KX858757.1
+  B: KX858756.1
 
 # ---------------
 #    DATABASE
@@ -83,7 +95,7 @@ mutation_partition_break: "Y"
 # which is structured as "user1:pass1,user2:pass2,..."
 login_required: false
 
-dev_hostname: "http://localhost:5001"
+dev_hostname: "http://localhost:5002"
 prod_hostname: "http://localhost:8080"
 
 # ----------------------

diff --git a/config/config_rsv_genbank.yaml b/config/config_rsv_genbank.yaml
@@ -19,7 +19,7 @@ static_data_folder: "static_data/rsv"
 example_data_folder: "example_data_genbank/rsv"
 
 # Database for this virus
-postgres_db: "rsvg_dev"
+postgres_db: "rsv_genbank_dev"
 
 # ------------------
 #       INGEST
@@ -32,6 +32,8 @@ chunk_size: 100000
 #       ANALYSIS
 # --------------------
 
+segments: ["1"]
+
 # Mutations with less than this number of global occurrences will be ignored
 mutation_count_threshold: 3
 
@@ -59,11 +61,13 @@ group_cols:
     title: "Genotype"
     description: ""
     show_collapse_options: false
-  subtype:
-    name: "subtype"
-    title: "Subtype"
-    description: ""
-    show_collapse_options: false
+
+# AZ report options
+report_gene: F
+report_group_col: subtype
+report_group_references:
+  A: KX858757.1
+  B: KX858756.1
 
 # Surveillance plot options
 # see: workflow_main/scripts/surveillance.py
@@ -74,6 +78,9 @@ surv_min_combo_count: 50
 surv_min_single_count: 50
 surv_start_date_days_ago: 90
 surv_end_date_days_ago: 30
+surv_group_references:
+  A: KX858757.1
+  B: KX858756.1
 
 # ---------------
 #    DATABASE
@@ -96,7 +103,7 @@ mutation_partition_break: "Y"
 # which is structured as "user1:pass1,user2:pass2,..."
 login_required: true
 
-dev_hostname: "http://localhost:5001"
+dev_hostname: "http://localhost:5002"
 prod_hostname: "https://rsv.pathmut.org"
 # prod_hostname: "http://localhost:8080"
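
The `dev_hostname` edits in these config files move RSV to port 5002 and flu to port 5003, presumably so that the combined deployment can run one development server per virus side by side. A minimal sketch of a collision check follows. The function `find_port_collisions` is hypothetical, and the hostnames are inlined from this diff rather than loaded from the YAML files.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# dev_hostname values per virus as set in this diff (inlined for
# illustration; a real check would read them from the config files).
DEV_HOSTNAMES = {
    "rsv": "http://localhost:5002",
    "flu": "http://localhost:5003",
}


def find_port_collisions(hostnames: dict) -> dict:
    """Group names by (host, port) and return only groups with >1 entry."""
    by_endpoint = defaultdict(list)
    for name, url in hostnames.items():
        parts = urlsplit(url)
        by_endpoint[(parts.hostname, parts.port)].append(name)
    return {
        endpoint: names
        for endpoint, names in by_endpoint.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    # An empty dict means no two viruses share a development endpoint.
    print(find_port_collisions(DEV_HOSTNAMES))
```

The check is keyed by virus rather than by config file because configs for the same virus (for example the RSV GenBank and RSV custom configs) intentionally share a port.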
 

diff --git a/config/config_alpha.yaml b/config/config_sars2_alpha.yaml
@@ -11,14 +11,15 @@ data_folder: "data"
 
 # Path to folder with genome information (reference.fasta, genes.json, proteins.json)
 # This path is relative to the project root
-static_data_folder: "static_data"
+static_data_folder: "static_data/sars2"
 
 # Path to folder with data to use in development
 # This path is relative to the project root
 example_data_folder: "example_data_genbank"
 
 # Database for this virus
-postgres_db: "cg_dev"
+postgres_db: "cg_alpha_dev"
+
 # ------------------
 #       INGEST
 # ------------------