Article now up at eLife: https://doi.org/10.7554/eLife.63409
Table of Contents
- COVID-19 CG (CoV Genetics)
- Installation
- Per-service installation
- Analysis Pipeline
- About the project
- Data enabling COVID CG
- Citing COVID CG
The COVID-19 CG website comprises three services: a PostgreSQL database, a Flask server, and a React frontend. These can be run separately (see Per-service installation), but we recommend using Docker to manage them.
The analysis pipeline for processing raw SARS-CoV-2 genomes is a separate install, described below in Analysis Pipeline.
- Install Docker
- Clone this repository:
git clone https://github.com/vector-engineering/covidcg.git
$ cd covidcg
$ docker-compose build # Build containers
# (Re-builds only necessary if packages or
# dependencies have changed)
$ docker-compose up -d # Run all services
$ docker-compose down # Shut down all services when finished
NOTE: When starting from a fresh database, the server will automatically seed the database with data from the example_data_genbank folder. This process may take a few minutes as ~50K genomes are loaded into the database.
If the JS dependencies change (i.e., a change in package.json), rebuild the frontend container with:
$ docker-compose down
$ docker-compose build --no-cache frontend
$ docker-compose up
A rebuild is also needed if the toolchain changes (webpack*.js or anything in tools/).
For changes to files outside of src (i.e., in config/ or static_data/), the container needs to be restarted but not rebuilt.
If the server dependencies change (i.e., a change in requirements.txt), rebuild the server container with:
$ docker-compose down
$ docker-compose build --no-cache server
$ docker-compose up
To erase the local development database, delete the postgres docker volume with:
$ docker-compose down -v # -v will delete the volume
$ docker-compose up
We recommend developing with Docker and docker-compose. More details on the installation of each service can be found in the respective Dockerfiles in the services/ folder and in the docker-compose.yml file. Running each service separately is not recommended and is not tested on our end, so please submit a GitHub issue if you run into any problems during installation or while running the services.
First, clone this repository: git clone https://github.com/vector-engineering/covidcg.git
Requirements:
- curl
- node.js >= 8.0.0
- npm
This app was built from the react-slingshot example app.
- Install Node 8.0.0 or greater. Need to run multiple versions of Node? Use nvm.
- Install Git.
- Disable safe write in your editor to ensure hot reloading works properly.
- Complete the steps below for your operating system:
macOS
- Install watchman via brew install watchman to avoid an issue that occurs when macOS has no appropriate file-watching service installed.
Linux
- Increase the limit on the number of files Linux will watch:
$ echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -p
- Install watchman via
- Install NPM packages:
$ npm install
- Run the app:
$ CONFIGFILE=config/config_genbank.yaml npm start -s
This will run the automated build process, start up a webserver, and open the application in your default browser. When doing development with this kit, this command will continue watching all your files. Every time you hit save the code is rebuilt, linting runs, and tests run automatically. Note: The -s flag is optional. It enables silent mode which suppresses unnecessary messages during the build.
This development environment was tested with PostgreSQL 12.
Please provide DB connection information to the Flask server with the following environment variables:
- POSTGRES_USER
- POSTGRES_PASSWORD
- POSTGRES_DB
- POSTGRES_HOST
- POSTGRES_PORT
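As a minimal sketch of how these five variables might be consumed, the Python snippet below collects them into one settings dictionary and fails early if any is unset. The function name `read_db_settings` is hypothetical, and the dictionary keys mirror the keyword arguments of `psycopg2.connect`, which is an assumption about the database driver; the actual Flask server may read these variables differently.

```python
import os

# The five variables listed above.
REQUIRED_VARS = (
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
    "POSTGRES_HOST",
    "POSTGRES_PORT",
)


def read_db_settings(env=None):
    """Collect the connection variables, raising if any is missing or empty.

    Hypothetical helper for illustration; not part of the covidcg codebase.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Key names are an assumption (they match psycopg2.connect keywords).
    return {
        "user": env["POSTGRES_USER"],
        "password": env["POSTGRES_PASSWORD"],
        "dbname": env["POSTGRES_DB"],
        "host": env["POSTGRES_HOST"],
        "port": int(env["POSTGRES_PORT"]),
    }
```

Calling `read_db_settings()` with no arguments reads from the current process environment, so a missing variable is reported before the server attempts to connect.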
Requirements:
- Python 3 (>= 3.8) with virtual environments. We recommend conda via miniconda3, but python3 with virtualenv or any other virtual environment provider should also work.
Install dependencies:
$ cd services/server
$ pip install -r requirements.txt
Run server:
$ cd services/server
$ ./serve.sh # Run Flask server in development mode
Data analysis is run with Snakemake, Python scripts, and bioinformatics tools such as bowtie2. Please ensure that the conda environment is configured correctly (see Pipeline Installation).
Data analysis is broken up into two snakemake pipelines: 1) ingestion and 2) main. The ingestion pipeline downloads, chunks, and prepares metadata for the main analysis, and the main pipeline analyzes sequences, extracts SNVs, and compiles data for display in the web application.
The pipeline is configured in the config/config_[workflow].yaml files.
- Clone this repository:
git clone https://github.com/vector-engineering/covidcg.git
- Install miniconda3
- Create conda environment:
$ conda config --add channels bioconda # Add package download locations
$ conda config --add channels conda-forge
$ conda env create -f environment.yml
If the conda environment takes a very long time to resolve, the cause is probably snakemake. In that case, install the packages manually (sorry about this! Please let us know if it happens so we can update our environment file):
# Install python and snakemake first, let conda choose the specific snakemake version
$ conda create -n covid-cg python=3.9 snakemake-minimal
# Activate the new environment so the remaining packages are installed into it
$ conda activate covid-cg
# Install dependencies manually
$ conda install numpy scipy pandas bowtie2 samtools
$ pip install pysam
Three ingestion workflows are currently available: workflow_genbank_ingest, workflow_custom_ingest, and workflow_gisaid_ingest.
NOTE: While the GISAID ingestion pipeline is provided as open-source, it is intended only for internal use.
Both workflow_genbank_ingest and workflow_gisaid_ingest are designed to automatically download and process data from their respective data sources. The workflow_custom_ingest workflow can be used for analyzing and visualizing your own in-house SARS-CoV-2 data. More details are available in the README files within each ingestion pipeline's folder. Each ingestion workflow is parametrized by its own config file, e.g., config/config_genbank.yaml for the GenBank workflow.
For example, you can run the GenBank ingestion pipeline with:
$ cd workflow_genbank_ingest
$ snakemake --use-conda
Both workflow_genbank_ingest and workflow_gisaid_ingest are designed to be run regularly, and attempt to chunk data in a way that minimizes expensive reprocessing/realignment in the downstream main analysis step. The workflow_custom_ingest pipeline does not attempt to chunk data in this way, but the same effect can be achieved outside of covidcg by dividing your sequence data into separate FASTA files.
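One way to divide sequence data into separate FASTA files is sketched below in plain Python. The function name `split_fasta`, the chunk size, and the `chunk_0000.fasta` naming scheme are all illustrative assumptions; check the README in the workflow_custom_ingest folder for the file layout that pipeline actually expects.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, records_per_file=1000):
    """Split one FASTA file into chunks of at most records_per_file records.

    Hypothetical helper for illustration; not part of covidcg.
    Returns the list of chunk paths that were written.
    """
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []        # paths of the chunk files created so far
    out_handle = None   # currently open chunk file, if any
    n_records = 0       # number of FASTA headers seen so far
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    # Start a new chunk every records_per_file headers.
                    if n_records % records_per_file == 0:
                        if out_handle is not None:
                            out_handle.close()
                        chunk_path = out_dir / f"chunk_{len(written):04d}.fasta"
                        out_handle = open(chunk_path, "w")
                        written.append(chunk_path)
                    n_records += 1
                # Lines before the first header are skipped.
                if out_handle is not None:
                    out_handle.write(line)
    finally:
        if out_handle is not None:
            out_handle.close()
    return written
```

Because each record stays in the same chunk as long as the input order is stable, appending new sequences to the end of the source file only changes the last chunk, which is the property that limits downstream reprocessing.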
The main data analysis pipeline is located in workflow_main. It requires data from the ingestion pipeline, stored in a data folder that is defined in the config/config_[workflow].yaml file. The path to the config file is required for the main workflow, as it needs to know what kind of data to expect (as described in the config files).
For example, if you ingested data from GenBank, run the main analysis pipeline with:
$ cd workflow_main
$ snakemake --configfile ../config/config_genbank.yaml
This pipeline will align sequences to the reference sequence with bowtie2, extract SNVs on both the NT and AA level, and combine all metadata and SNV information into one file: data_package.json.gz.
NOTE: bowtie2, the sequence aligner we use, typically uses 8 to 10 GB of RAM per CPU during the alignment step. If the pipeline includes the alignment step, use at most (RAM in GB) / 10 cores. For example, a machine with 128 GB of RAM can run at most 128 / 10 ≈ 12 cores.
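The rule of thumb above can be written as a small calculation. This is only a sketch: the function name `max_alignment_cores` is hypothetical, and the default of 10 GB per core is the upper bound quoted in the note rather than a measured value.

```python
def max_alignment_cores(total_ram_gb, ram_per_core_gb=10):
    """Return the number of cores to use so each gets ram_per_core_gb of RAM.

    Hypothetical helper for illustration; not part of covidcg.
    """
    if total_ram_gb <= 0 or ram_per_core_gb <= 0:
        raise ValueError("RAM values must be positive")
    # Floor division rounds down: 128 GB at 10 GB per core gives 12, not 13.
    # A machine with less RAM than one core's budget still gets 1 core,
    # although alignment may then exceed the available memory.
    return max(1, int(total_ram_gb // ram_per_core_gb))


print(max_alignment_cores(128))  # prints 12
```

The result can then be passed to snakemake through its `--cores` option when running the main workflow.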
To pass this data on to the front-end application, host the data_package.json.gz file on an accessible endpoint, then specify that endpoint in the data_package_url field of the config/config_[workflow].yaml file that you are using.
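Before hosting the file, it can be useful to confirm that it decompresses and parses cleanly. The sketch below uses only the Python standard library; the function name `load_data_package` is hypothetical, and nothing is assumed about the internal structure of the package beyond it being gzip-compressed JSON, as the file extension suggests.

```python
import gzip
import json


def load_data_package(path):
    """Decompress and parse a gzip-compressed JSON file.

    Hypothetical helper for illustration; not part of covidcg.
    Raises an exception if the file is not valid gzip or not valid JSON.
    """
    # "rt" opens the gzip stream in text mode so json.load receives str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_data_package("data_package.json.gz")` returns the parsed object, and a truncated or corrupted file fails here rather than in the browser.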
This project is developed by the Vector Engineering Lab:
- Albert Tian Chen (Broad Institute)
- Kevin Altschuler
- Shing Hei Zhan, PhD (University of British Columbia)
- Alina Yujia Chan, PhD (Broad Institute)
- Ben Deverman, PhD (Broad Institute)
Contact the authors by email: [email protected]
Python/snakemake scripts were run and tested on macOS 10.15.4 (8 threads, 16 GB RAM), Google Cloud Debian 10 (buster) (64 threads, 412 GB RAM), and Windows 10/Ubuntu 20.04 via WSL2 (48 threads, 128 GB RAM).
We are extremely grateful to the GISAID Initiative and all its data contributors, i.e., the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based.
Elbe, S., and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1:33-46. DOI:10.1002/gch2.1018 PMID: 31565258
Users are encouraged to share, download, and further analyze data from this site. Plots can be downloaded as PNG or SVG files, and the data powering the plots and tables can be downloaded as well. Please attribute any data/images to covidcg.org, or cite our manuscript:
Chen AT, Altschuler K, Zhan SH, Chan YA, Deverman BE. COVID-19 CG enables SARS-CoV-2 mutation and lineage tracking by locations and dates of interest. eLife (2021), doi: https://doi.org/10.7554/eLife.63409
Note: When using results from these analyses in your manuscript, ensure that you acknowledge the contributors of data, i.e., We gratefully acknowledge all the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based.
and cite the following reference(s):
Shu, Y., McCauley, J. (2017) GISAID: Global initiative on sharing all influenza data – from vision to reality. EuroSurveillance, 22(13) DOI:10.2807/1560-7917.ES.2017.22.13.30494 PMCID: PMC5388101
COVID-19 CG is distributed under the MIT license.
Please feel free to contribute to this project by opening an issue or pull request in the GitHub repository.