Charité diagnostic pipeline spec

This repo contains a slurm-pipeline specification file (specification.json) and associated scripts for processing Charité diagnostic data.

bih-pipeline metadata

Suppose you have a new run, named 200101 (note that although this looks like a date, you can use any name you like, such as 200101-WURS). However, there must be a run file in the data/runs dir of the bih-pipeline repo with a runId matching 200101 (see the note below on summarize-run.py failing for the reason why).
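
If you want to check that the run file is in place before going further, something like the following in a bih-pipeline checkout should find it (a sketch; it assumes the runId appears literally in the run file, whatever its format):

$ grep -rl 200101 data/runs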

Running on csd3

There are various make commands below. It's good to know about the -n option to make, which prints what would be done without actually doing it. I (Terry) use make -n TARGET often, and so should you!

You need to do all the following on the Cambridge cluster.

Make a subdirectory for the run

$ mkdir projects/charite/200101

Transfer all Illumina FASTQ files from BIH

Into the 200101 directory you just made. You'll need to know where these are on the BIH cluster (I think under /fast/projects/civ-diagnostics/work/raw - TODO: check!). You can remove the files with names that contain *_I1_* and *_I2_*, as those are the sequencing files for the indices.
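
A sketch of the transfer; the bih host alias and the exact remote path are assumptions (see the TODO above):

$ cd projects/charite/200101
$ rsync -av 'bih:/fast/projects/civ-diagnostics/work/raw/200101/*.fastq.gz' .
$ rm -f *_I1_* *_I2_*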

Set up the run

Note that the following is run in the parent directory of the 200101 directory (i.e., the top-level charite directory):

$ cd projects/charite
$ ./setup-run.sh 200101

The setup-run.sh script will put a Makefile (which is in fact a symbolic link to Makefile.toplevel in this repo) into the 200101 directory and also move all the FASTQ files you transferred into per-sample sub-directories, assuming their filenames can be parsed.
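
For orientation, the run directory then looks roughly like this (the sample and FASTQ names are taken from the examples later in this README; the symlink target is abbreviated):

200101/
    Makefile -> .../csd3-pipeline/Makefile.toplevel
    D_200219_4_555_5_isolate_RNA/
        D_200219_4_555_5_isolate_RNA_S4_R1_001.fastq.gz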

Copy the pipeline into each sample sub-directory

Again, in the top-level charite directory:

$ cd projects/charite
$ ./setup-pipeline.sh 200101/[DW]_*

Per-run and/or per-sample settings

You can specify per-run or per-sample settings by making a file run-settings.sh in the top-level directory for the run (e.g., in 200409-SARS-2/run-settings.sh) or by putting a sample-settings.sh file into the directory for a sample (e.g., in 200409-SARS-2/D_200409_3_885_1_swab_RNA/sample-settings.sh). These files can be used to override settings in common.sh. The variables and functions defined in these files will be accessible to pipeline scripts because they all source the common.sh file, which in turn sources the settings files (if they exist).
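
As an illustration only, a minimal sample-settings.sh might look like this (the setting name below is invented for the example, not taken from common.sh):

# 200409-SARS-2/D_200409_3_885_1_swab_RNA/sample-settings.sh
# Sourced by common.sh, so anything set here overrides the defaults.
someSetting=some-value   # hypothetical name, for illustration only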

Easy setting of the sample type

As a convenience (and for backwards compatibility) you can use ./set-sample-type.sh to set the type of a sample. This just results in a line being placed in the sample-settings.sh file.
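
For illustration, the written line might look like the following (the variable name sampleType is an assumption, not confirmed from the script):

sampleType=hcov   # hypothetical variable name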

This only needs to be done if some of the samples should make a human coronavirus (SARS-CoV-2) consensus and BAM file and should only match against one coronavirus sequence. If you don't do this step, the pipeline will generate many massive FASTQ output files containing essentially the same reads.

To set a sample to run the HCoV pipeline:

$ ./set-sample-type.sh hcov DIRNAME [DIRNAME...]

Or to set a sample to run the medmuseum pipeline:

$ ./set-sample-type.sh medmuseum DIRNAME [DIRNAME...]

Or to set a sample to run the standard pipeline:

$ ./set-sample-type.sh standard DIRNAME [DIRNAME...]

Note that standard is the default.

You can see the sample types for all sub-directories via

$ make print-standard
$ make print-hcov
$ make print-medmuseum

You should do this to check that the samples you expect to be run using the hcov pipeline are recognized as such.

In addition to the standard and hcov sample types that you can set, you can also specify whether the reads should be trimmed by 29 bases at the 3' and 5' ends. To do this, use

$ ./set-sample-type.sh trim DIRNAME [DIRNAME...]

The 29-base cut-off is based on the length of the primers used by Julia Schneider for SARS-CoV-2 amplification.
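
Since trimming is set in addition to the sample type, you can apply both to the same sample directory, e.g. (using the sample directory named earlier):

$ ./set-sample-type.sh hcov 200409-SARS-2/D_200409_3_885_1_swab_RNA
$ ./set-sample-type.sh trim 200409-SARS-2/D_200409_3_885_1_swab_RNA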

Put reference files in place for HCoV processing

For runs that are of type hcov, a reference coronavirus sequence is needed. By default, the sequence in /rds/project/djs200/rds-djs200-acorg/bt/root/share/civ/hcov/hcov-reference.fasta will be used. This is currently a symbolic link to the EPI_ISL_402125.fasta file (sequence id hCoV-19/Wuhan-Hu-1/2019|EPI_ISL_402125) in the data/sequences directory of the 2019-nCoV-sequences repo.

If you do not want to use this default reference for a sample, you can put a file (or a symbolic link) called reference.fasta into the individual 006-hcov directory for that sample. You have to do this for each sample that should be aligned against a non-default reference. The reads will be aligned against the reference using Bowtie2 and a consensus will be made from the alignment. There is no need to build a Bowtie2 index for your reference; that will be done automatically. Just put a reference.fasta in place.
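
A sketch of putting a non-default reference in place for one sample; the pipelines/standard/006-hcov layout is taken from the consensus-making section below, and the reference path is illustrative:

$ cd 200101/D_200219_4_555_5_isolate_RNA/pipelines/standard/006-hcov
$ ln -s /path/to/my-reference.fasta reference.fasta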

TODO: improve this by just letting the user give a different value for hcovReference and hcovReferenceIndex in sample-settings.sh. That will need a little work in 006-hcov/hcov.sh and some work on the server to go find the reference.fasta files and put their paths into the settings file.

Start the pipeline

Once you have set the run types (if any are non-standard) and hcov reference files, you can run the pipeline:

$ cd projects/charite/200101
$ make run

Monitoring

Try

$ cd projects/charite/200101
$ make status

to see a list of the slurm-pipeline.{done,error,running} files. You're hoping to see a full set of .done files.
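
If anything shows an error, you can list the failing steps directly with a find one-liner (a convenience sketch, not a pipeline target):

$ find . -name slurm-pipeline.error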

Make HTML and upload it to civnb.info

When the pipeline run is completely done, run:

$ make

This makes the HTML, tars it up, transfers it to civnb.info, and untars it over there. This will only work if there is a run file in the data/runs dir of the bih-pipeline repo with a runId matching 200101 (in the present example). If there is not, the summarize-run.py call will fail.

You'll need to have ssh'd into Cambridge using the -A argument to ssh (and have run ssh-add locally so your key is loaded into the ssh agent) for this to run seamlessly and not ask you for a password.
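
A sketch of the local setup; the CSD3 login hostname below is an assumption, so use whatever host you normally connect to:

$ eval "$(ssh-agent)"         # only if no agent is running yet
$ ssh-add                     # load your key into the agent
$ ssh -A login.hpc.cam.ac.uk  # -A forwards the agent to Cambridge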

View your output

Results are then available at https://civnb.info/diagnostics/.

Making BAM files and consensuses

If you want to make a BAM file and consensus against a specific reference, you can do this once the trimming of the FASTQ for the sample(s) in question has completed.

You then cd into the 006-hcov directory of each sample:

$ cd D_200219_4_555_5_isolate_RNA/pipelines/standard/006-hcov

Then run the hcov.sh script, giving a reference FASTA file as its only argument, or put your reference into reference.fasta and run with no arguments. Note that you might want to run this on an exclusive machine so that the bowtie2 process (and things it launches) can use 32 cores and you don't clog up a login machine:

$ sbatch-run.py --job hcov --time 00:30:00 --exclusive ./hcov.sh
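
To use an explicit reference instead of a reference.fasta file, the FASTA should be passable as a trailing argument, assuming sbatch-run.py hands extra arguments through to the script it runs (an assumption about sbatch-run.py):

$ sbatch-run.py --job hcov --time 00:30:00 --exclusive ./hcov.sh my-reference.fasta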

The hcov.sh script will create a BAM file and a consensus for you, along with various other files. E.g.:

D_200219_4_555_5_isolate_RNA_S4_R1_001-alignment.fasta
D_200219_4_555_5_isolate_RNA_S4_R1_001-consensus.fasta
D_200219_4_555_5_isolate_RNA_S4_R1_001-coverage.txt
D_200219_4_555_5_isolate_RNA_S4_R1_001-read-count.txt
D_200219_4_555_5_isolate_RNA_S4_R1_001-reference-consensus-comparison.txt
D_200219_4_555_5_isolate_RNA_S4_R1_001.bam
D_200219_4_555_5_isolate_RNA_S4_R1_001.bam.bai
D_200219_4_555_5_isolate_RNA_S4_R1_001.vcf.gz
D_200219_4_555_5_isolate_RNA_S4_R1_001.vcf.gz.tbi

You can then do whatever you like with those files.
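
For example, a quick sanity check on the mapping with standard samtools (the filename is from the listing above):

$ samtools flagstat D_200219_4_555_5_isolate_RNA_S4_R1_001.bam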

For convenience, if you do the above on several samples, you can then go to the top level of the run and type

$ make zip-hcov

This will make you a zip file with a name like 200220-nCoV-isolates-hcov.zip (the 200220-nCoV-isolates here is the run id) that you can send to someone. The zip will contain the BAM and consensus files (and more, as above) from all the 006-hcov sub-directories where you made consensuses.
