The scripts in this repository call somatic short variants (SNPs and indels) from tumour and matched-normal BAM files, following GATK's Best Practices Workflow. The workflow and scripts have been specifically optimised to run efficiently and at scale on the National Computational Infrastructure (NCI), Gadi.
This workflow is no longer actively supported or maintained. While you are welcome to use the existing code, please note that no further updates, bug fixes, or support will be provided.
For questions or alternative recommendations for University of Sydney staff and students, please get in touch with [email protected]. You can find alternatives at WorkflowHub.
Thank you for your understanding!
- High compute efficiency
- Scalable through scatter-gather parallelism with `openmpi` and `nci-parallel`
- Checker scripts (no KSU cost)
- Checkpointing and resumption of only failed tasks if required
- Support: this repository is actively monitored, however major updates are not currently planned. We do, however, plan to port this workflow to Nextflow.
- Note (as at 31st January 2024): some performance loss has been observed for the scatter-gather parallelism when running multiple tasks under nci-parallel. Since Nextflow workflows do not rely on this scatter method, we will focus on the Nextflow implementation as the solution.
By default, this pipeline is compatible with BAMs aligned to the GRCh38/hg38 + ALT contigs reference genome. Scatter-gather parallelism has been designed to operate over 3,200 evenly sized genomic intervals (~1 Mb each) across your sample cohort. Some genomic intervals are excluded - these are typically repetitive regions, which can significantly impede compute performance.
Excluded sites are listed in the Delly group's `sv_repeat_telomere_centromere.bed` file, which is included in the `Reference` dataset. The BED file contains:
- telomeres
- centromeres
- chrUn (unplaced)
- chrUn_XXXXX_decoy (decoy)
- chrN_XXXXX_random (unlocalized)
- chrEBV
This pipeline can be implemented after running the Fastq-to-BAM pipeline. The scripts use relative paths, so correct set-up is important.
The minimum requirements and high level directory structure resemble the following:
├── Final_bams
├── <cohort>.config
├── Reference
└── Somatic-ShortV
Output will be created at this directory level. In your high-level directory, run `git clone https://github.com/Sydney-Informatics-Hub/Somatic-ShortV.git`. The `Somatic-ShortV` directory contains the scripts of the workflow. Submit all jobs from within `Somatic-ShortV`.
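A minimal sketch of the set-up, assuming a hypothetical high-level working directory (substitute your own path and project):

```bash
# Move to your high-level directory (hypothetical path - substitute your own)
cd /scratch/ab00/my_cohort_dir

# Clone the workflow scripts alongside Final_bams, Reference and <cohort>.config
git clone https://github.com/Sydney-Informatics-Hub/Somatic-ShortV.git

# Submit all jobs from within Somatic-ShortV
cd Somatic-ShortV
```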
The tools required to run this pipeline include:
- `openmpi/4.1.0`
- `nci-parallel/1.0.0a`
- GATK4. The pipeline is compatible with `gatk/4.1.2.0`, `gatk/4.1.8.1` and `gatk/4.2.1.0`. It has not been tested with other versions of GATK4.
Use `module avail` to check if they are currently globally installed.
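For example, on Gadi you could check and load the required modules like so (versions as listed above; adjust to what is installed on your system):

```bash
# Check which versions are globally installed
module avail openmpi nci-parallel gatk

# Load one tested combination
module load openmpi/4.1.0
module load nci-parallel/1.0.0a
module load gatk/4.2.1.0
```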
If you have used the Fastq-to-BAM pipeline, you can skip the remaining set-up steps.
The steps in this pipeline operate on the samples present in your `<cohort>.config` file.
- See here for a full description
`<cohort>.config` is a TSV file with one row per unique sample, matching the format in the example below. Start header lines with `#`.
- LabSampleIDs are your in-house sample IDs. Input and output files are named with this ID.
- Normal samples should be named `<patientID>-N<normalID>`
- Matched tumour samples should be named `<patientID>-T<tumourID>`. Multiple tumour samples are OK. Only the `<patientID>-T<tumourID>`, `<patientID>-M<tumourID>` and `<patientID>-P<tumourID>` extensions are recognised. `<patientID>` is used to find the normal and tumour samples belonging to a single patient.
Below is an example `<cohort>.config` with samples from two patients. Patient 1 has two matched tumour samples.
#SampleID | LabSampleID | Seq_centre | Library(default=1) |
---|---|---|---|
SAMPLE1 | PATIENT1-N | AGRF | |
SAMPLE2 | PATIENT1-T1 | AGRF | |
SAMPLE3 | PATIENT1-T2 | AGRF | |
SAMPLE4 | PATIENT2-N | AGRF | |
SAMPLE5 | PATIENT2-T | AGRF | |
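The same example as a raw tab-separated file (a sketch only; the header starts with `#`, fields are tab-separated, and a blank Library field uses the default of 1):

```
#SampleID	LabSampleID	Seq_centre	Library(default=1)
SAMPLE1	PATIENT1-N	AGRF
SAMPLE2	PATIENT1-T1	AGRF
SAMPLE3	PATIENT1-T2	AGRF
SAMPLE4	PATIENT2-N	AGRF
SAMPLE5	PATIENT2-T	AGRF
```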
- BAM files should be at the sample level
- BAM and BAI (index) filenames should follow:
  - `<patientID>-N<ID>.final.bam` for normal samples
  - `<patientID>-T<tumourID>.final.bam` for tumour samples
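Following the example config above, `Final_bams` would therefore contain something like the listing below (each BAM accompanied by its BAI index, named as produced by your Fastq-to-BAM run):
├── PATIENT1-N.final.bam
├── PATIENT1-T1.final.bam
├── PATIENT1-T2.final.bam
├── PATIENT2-N.final.bam
└── PATIENT2-T.final.bam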
Ensure you have the `Reference` directory from Fastq-to-BAM. This contains the input data required for Somatic-ShortV, including the human reference genome (hg38 + alternate contigs) and the list of scatter interval files used for scatter-gather parallelism.
This pipeline performs GATK4's Somatic Short Variant Discovery (SNVs and indels) Best Practices workflow for all samples present in `<cohort>.config`.
Adding new samples to a cohort for PoN: If you have sequenced new samples belonging to a previously sequenced cohort and wish to re-create the PoN with all samples, you can skip some of the processing steps for the previously sequenced samples. If you would like to do this:
- Please create a config file containing the new samples only and process these from step 1.
- Follow optional steps from step 5 onwards if you would like to consolidate the new samples with previously processed samples.
Each step in the pipeline generally includes:
- A script to create an input file, where 1 line of input = 1 scatter task
- A `run_parallel.pbs` script, the job you submit to run `task.sh` for each input created
- A checker script and, if required, an additional script to re-submit failed jobs
- The checker scripts tar-archive log files once checks have passed
- Job logs are written to `./Logs` and task logs to `./Logs/<jobname>`
By default, compute resources are optimal for processing 40 normal (35X) and 40 matched tumour (70X) samples. See benchmarking metrics for additional guidance with compute resource requests.
! Submit all jobs from the `Somatic-ShortV` directory
The scripts below run Mutect2 in tumour-only mode for the normal samples to create `sample.pon.index.vcf`, `sample.pon.index.vcf.idx` and `sample.pon.index.vcf.stats` (index is the number given to a unique interval). By default, this job scatters each sample's work into 3,200 genomic intervals. Normal samples should ideally be sequenced on the same platform and chemistry (library kit) as the tumour samples. They are used to filter sequencing artefacts (polymerase slippage tends to occur at the same genomic loci for short-read sequencing technologies) as well as germline variants. Read more about PoN on GATK's website.
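For orientation, a hedged sketch of the kind of per-interval command each task runs (reference path, interval list name and output location are illustrative; the actual command lives in the task script shipped with this repository):

```bash
# Tumour-only Mutect2 on one normal sample over one scattered interval,
# as used for panel-of-normals creation (illustrative paths)
gatk Mutect2 \
    -R ../Reference/hg38.fasta \
    -I ../Final_bams/PATIENT1-N.final.bam \
    -L ../Reference/ShortV_intervals/0001-scattered.interval_list \
    --max-mnp-distance 0 \
    -O PATIENT1-N.pon.0001.vcf
```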
Create task inputs:
sh gatk4_pon_make_input.sh /path/to/cohort.config
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort, then:
qsub gatk4_pon_run_parallel.pbs
The next script checks that the tasks for the previous job completed successfully. Inputs for failed tasks are written to `Inputs/gatk4_pon_missing.inputs` for re-submission. Checks include: the `sample.pon.index.vcf`, `sample.pon.index.vcf.idx` and `sample.pon.index.vcf.stats` files exist and are non-empty; GATK logs are checked for "SUCCESS" or "error" messages.
The checker script can be run on the login node. We advise running it with `nohup` so that the script is not killed if your terminal session ends.
Replace `/path/to/cohort.config` with the path to your `cohort.config` file and run:
nohup sh gatk4_pon_check_sample_parallel.sh /path/to/cohort.config 2> /dev/null 1> ./Logs/gatk4_pon_check.log &
Check the output with `cat ./Logs/gatk4_pon_check.log` to see whether the job was successful. If there are no failed tasks, move on to 2. Gather interval VCFs into sample-level VCFs.
If there are tasks to re-run from step 1 (check the number of tasks to re-run using `wc -l Inputs/gatk4_pon_missing.inputs`), re-run the failed tasks. After adjusting the compute resource requests in `gatk4_pon_missing_run_parallel.pbs` (usually one normal node is sufficient for ~3,200 tasks), submit the job with:
qsub gatk4_pon_missing_run_parallel.pbs
The steps under "Perform checks" above can be repeated until all tasks pass checks.
If all checks pass, the checker script prints task duration and memory per task (genomic interval) to `./Logs/gatk4_pon/sample`. Each sample log directory is archived into `./Logs/gatk4_pon/sample_logs.tar.gz`. Tar archives reduce iNode usage.
Gather the per-interval `sample.pon.index.vcf` and `sample.pon.index.vcf.idx` files into single-sample, gzipped VCFs. `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` are written to `../cohort_PoN`.
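A hedged sketch of the per-sample gather (the real job reads a per-sample arguments file listing the interval VCFs in order, created in the next step; names and paths are illustrative):

```bash
# Gather per-interval PoN VCFs for one sample into a single gzipped VCF
gatk GatherVcfs \
    --arguments_file ./Inputs/gatk4_pon_gathervcfs/PATIENT1-N.args \
    -O ../cohort_PoN/PATIENT1-N.pon.vcf.gz

# Index the gathered VCF
tabix -p vcf ../cohort_PoN/PATIENT1-N.pon.vcf.gz
```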
Create inputs (1 sample per task; an arguments file is created per sample). This can take a while, so we recommend using `nohup` if you have more than 200 samples.
sh gatk4_pon_gathervcfs_make_input.sh /path/to/cohort.config
Or with `nohup` (output is written to `nohup.out`):
nohup sh gatk4_pon_gathervcfs_make_input.sh /path/to/cohort.config 2>&1 &
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort, then:
qsub gatk4_pon_gathervcfs_run_parallel.pbs
The next script checks that the tasks for the previous job completed successfully. Inputs for failed tasks are written to `./Inputs/gatk4_pon_gathervcfs_missing.inputs` for re-submission. The script checks that the `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` files are present and non-empty in `../cohort_PoN`, and that the log files are free of errors.
Run this script on the login node (quick):
sh gatk4_pon_gathervcfs_check.sh /path/to/cohort.config
If all tasks are successful, we recommend backing up `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` to long-term storage and continuing to 3. Consolidate samples with GenomicsDBImport.
Check the number of tasks to re-run using `wc -l ./Inputs/gatk4_pon_gathervcfs_missing.inputs`. Adjust the compute resource requests in `gatk4_pon_gathervcfs_missing_run_parallel.pbs`, then run it:
qsub gatk4_pon_gathervcfs_missing_run_parallel.pbs
Perform checks and re-run failed tasks until all tasks have completed successfully.
The downstream analyses are performed across multiple samples, so sample data is first consolidated with GenomicsDBImport. By default, this occurs at 3,200 genomic intervals. If you are processing a new cohort, start at Consolidate PoN.
To create a new PoN that includes previously processed sample data (e.g. when you want to combine previously processed samples with newly sequenced samples), perform the additional steps below; otherwise, skip this step.
Downstream steps require:
- A `cohort.config` file
- The directory `../cohort_PoN`, containing `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` for each sample in `cohort.config`
These can be created with some simple commands, either by copying data or by using symbolic links. The example below shows how this can be done if you used this pipeline to analyse a separate cohort:
├── Final_bams
├── samplesSet1.config # Previously analysed cohort
├── samplesSet2.config # Currently analysed cohort
├── samplesSet1andSet2.config # Joined cohort - this needs to be created
├── samplesSet1_PoN # Previously analysed cohort
├── samplesSet2_PoN # Currently analysed cohort
├── samplesSet1andSet2_PoN # Joined cohort - this needs to be created
├── Reference
└── Somatic-ShortV
Concatenate the config files of the previously sequenced samples (e.g. `samplesSet1.config`) and the newly sequenced samples (e.g. `samplesSet2.config`) into a new config file (e.g. `samplesSet1andSet2.config`) with:
sh concat_configs.sh samplesSet1andSet2.config samplesSet1.config samplesSet2.config
Create symbolic links for `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` in `../samplesSet1andSet2_PoN` using this script:
sh setup_pon_from_concat_config.sh samplesSet1andSet2.config
Create inputs:
sh gatk4_pon_genomicsdbimport_make_input.sh /path/to/cohort.config
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort (this job's memory requirement scales with cohort size), then:
qsub gatk4_pon_genomicsdbimport_run_parallel.pbs
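A hedged sketch of the per-interval consolidation each scattered task performs (the sample list and workspace path are illustrative; the real task script derives them from your config):

```bash
# Consolidate all normal samples' PoN VCFs for one genomic interval
# into a GenomicsDB workspace (illustrative paths)
gatk GenomicsDBImport \
    -V ../cohort_PoN/PATIENT1-N.pon.vcf.gz \
    -V ../cohort_PoN/PATIENT2-N.pon.vcf.gz \
    -L ../Reference/ShortV_intervals/0001-scattered.interval_list \
    --genomicsdb-workspace-path ./cohort_PoN_GenomicsDBImport/0001
```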
Check the job when it's complete by:
nohup sh gatk4_pon_genomicsdbimport_check.sh /path/to/cohort.config 2> /dev/null 1> ./Logs/gatk4_pon_genomicsdbimport.log &
This takes a couple of minutes. Check the `./Logs/gatk4_pon_genomicsdbimport.log` file. If all tasks were successful, continue to 4. Create PoN per genomic interval.
Check the number of tasks to re-run using `wc -l ./Inputs/gatk4_pon_genomicsdbimport_missing.inputs`. Adjust the compute resource requests in `gatk4_pon_genomicsdbimport_missing_run_parallel.pbs`, then run it:
qsub gatk4_pon_genomicsdbimport_missing_run_parallel.pbs
Perform checks and re-run failed tasks until all tasks have completed successfully.
Here, the hg38 gnomAD resource is used as the germline resource by default (`af-only-gnomad.hg38.vcf.gz`, obtained from GATK's Google Cloud Resource Bucket). You may wish to change this by specifying the resource you wish to use at `germline=` in the `gatk4_cohort_pon_make_input.sh` file.
Create inputs:
sh gatk4_cohort_pon_make_input.sh /path/to/cohort.config
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort, then:
qsub gatk4_cohort_pon_run_parallel.pbs
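A hedged sketch of the per-interval command behind these tasks (paths are illustrative; the germline resource shown is the default described above):

```bash
# Create the cohort PoN for one genomic interval from the GenomicsDB workspace
gatk CreateSomaticPanelOfNormals \
    -R ../Reference/hg38.fasta \
    --germline-resource ../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz \
    -V gendb://cohort_PoN_GenomicsDBImport/0001 \
    -O cohort.0001.pon.vcf.gz
```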
Check that each task for `gatk4_cohort_pon_run_parallel.pbs` ran successfully. This script checks that there is a non-empty VCF and TBI file for each genomic interval that the job operated on, and that there were no error messages in the log files. The script then collects duration and memory used per task (genomic interval) and cleans up by gzip tar-archiving the log files. Run:
sh gatk4_cohort_pon_check.sh /path/to/cohort.config
If there are no failed tasks, move on to 5. Gather intervals and sort into a single, multisample PoN.
Check the number of tasks to re-run using `wc -l ./Inputs/gatk4_cohort_pon_missing.inputs`. Adjust the compute resource requests in the script below, then submit the job:
qsub gatk4_cohort_pon_missing_run_parallel.pbs
Gather the per-interval cohort PoN VCFs and sort them into a single, multisample, sorted and indexed PoN VCF. This job performs these commands as a single task.
Set the config variable and submit the job:
qsub gatk4_cohort_pon_gather_sort.pbs
Check the job with:
sh gatk4_cohort_pon_gather_sort_check.sh
Errors should be investigated before re-running `gatk4_cohort_pon_gather_sort.pbs`.
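A hedged sketch of the two GATK calls this single-task job chains together (the arguments file lists the per-interval cohort PoN VCFs in order; names are illustrative):

```bash
# Gather the per-interval cohort PoN VCFs...
gatk GatherVcfs \
    --arguments_file ./Inputs/gatk4_cohort_pon_gather_sort.args \
    -O ../cohort_cohort_PoN/cohort.pon.vcf.gz

# ...then sort into the final multisample, indexed PoN
gatk SortVcf \
    -I ../cohort_cohort_PoN/cohort.pon.vcf.gz \
    -O ../cohort_cohort_PoN/cohort.sorted.pon.vcf.gz
```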
This is the last step in creating a panel of normals, and you should now have the following outputs for your `<cohort>.config` file:
../<cohort>_cohort_PoN/<cohort>.sorted.pon.vcf.gz
../<cohort>_cohort_PoN/<cohort>.sorted.pon.vcf.gz.tbi
Once you have completed creating your panel of normals, you may begin calling somatic variants with Mutect2 in tumour-matched-normal mode. Variants are filtered downstream. Variants will be called for each unique tumour-normal pair (i.e. if you have 3 tumour samples matching 1 normal sample, 3 VCF files will be produced, one per pair, provided LabSampleIDs follow the Set up guide).
For each unique tumour-normal pair in `<cohort>.config`, write inputs for scattering Mutect2 over 3,201 genomic intervals. One task processes the mitochondrial chromosome, for those who wish to perform mitochondrial analysis.
sh gatk4_mutect2_make_input.sh /path/to/cohort.config
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort.
Note: We recommend setting the walltime limit low (scale compute to 2-4 normal nodes per pair, leaving walltime=02:00:00). While we have implemented strategies to avoid this situation, occasionally a spurious sample pair can have a task that "hangs" (roughly 1 task per 50 pairs). This causes idle CPUs at the tail end of the job. These spurious tasks can be found and re-run in a separate job when checks are performed.
qsub gatk4_mutect2_run_parallel.pbs
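For orientation, a hedged sketch of the per-interval command each scattered task runs for one tumour-normal pair (paths and names are illustrative; the actual command lives in the task script shipped with this repository):

```bash
# Mutect2 in tumour with matched normal mode, for one pair over one interval.
# -normal takes the normal sample's SM tag from the BAM header
# (shown here as the LabSampleID for illustration).
gatk Mutect2 \
    -R ../Reference/hg38.fasta \
    -I ../Final_bams/PATIENT1-T1.final.bam \
    -I ../Final_bams/PATIENT1-N.final.bam \
    -normal PATIENT1-N \
    --germline-resource ../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz \
    --panel-of-normals ../cohort_cohort_PoN/cohort.sorted.pon.vcf.gz \
    --f1r2-tar-gz ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.f1r2.0001.tar.gz \
    -L ../Reference/ShortV_intervals/0001-scattered.interval_list \
    -O ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.unfiltered.0001.vcf.gz
```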
Outputs for each unique tumour-normal pair (index is the interval ID or chrM):
<patient>-T<ID>_<patient>-N<ID>.unfiltered.<index>.vcf.gz
<patient>-T<ID>_<patient>-N<ID>.unfiltered.<index>.vcf.gz.tbi
<patient>-T<ID>_<patient>-N<ID>.unfiltered.<index>.vcf.gz.stats
<patient>-T<ID>_<patient>-N<ID>.f1r2.<index>.tar.gz
The script below checks that the expected output files are present. Log files are checked for "SUCCESS" or error messages. The script takes ~30 seconds per 40 tumour-normal pairs. We recommend using `nohup` if you have a large number of pairs.
sh gatk4_mutect2_check_pair_parallel.sh /path/to/cohort.config
Check the output, or `Inputs/gatk4_mutect2_missing.inputs`, for failed tasks.
If there are tasks to re-run, adjust the compute resource requests in `gatk4_mutect2_missing_run_parallel.pbs`, then submit your job with:
qsub gatk4_mutect2_missing_run_parallel.pbs
Gather the unfiltered Mutect2 interval outputs for each tumour-normal pair. Create inputs with:
sh gatk4_mutect2_gathervcfs_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_mutect2_gathervcfs_run_parallel.pbs`, then submit your job with:
qsub gatk4_mutect2_gathervcfs_run_parallel.pbs
For each tumour-normal pair in your `cohort.config` file, you will now have:
../Mutect2/<patient>-T<ID>_<patient>-N<ID>/<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz
../Mutect2/<patient>-T<ID>_<patient>-N<ID>/<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz.tbi
You will also have input files for the filtering steps, including:
- Per-interval `.stats` files for `MergeMutectStats`
- Per-interval `f1r2` files for `LearnReadOrientationModel`
The following jobs prepare input files for the last part of the Somatic-ShortV pipeline, `FilterMutectCalls`. There are three parts, each of which can be run concurrently:
- `MergeMutectStats`
- `LearnReadOrientationModel`
- `GetPileupSummaries` and `CalculateContamination`
After variant calling with Mutect2, `stats` files are created for each interval. Gather these into a single `stats` file per tumour-normal pair. Stats files contain the number of callable sites (by default, the callable depth is 10).
Create inputs by:
sh gatk4_mergemutectstats_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_mergemutectstats_run_parallel.pbs`, then submit your job with:
qsub gatk4_mergemutectstats_run_parallel.pbs
For each tumour-normal pair in your `cohort.config` file, you will now have:
../Mutect2/<patient>-T<ID>_<patient>-N<ID>/<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz.stats
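A hedged sketch of the per-pair merge each task performs (the real job supplies the full list of interval stats files, typically via an arguments file; two intervals are shown for illustration):

```bash
# Merge per-interval Mutect2 stats into one stats file for the pair
gatk MergeMutectStats \
    -stats ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.unfiltered.0000.vcf.gz.stats \
    -stats ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.unfiltered.0001.vcf.gz.stats \
    -O ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.unfiltered.vcf.gz.stats
```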
This step aims to remove substitution error bias that occurs on a single strand (orientation bias). This particularly applies to FFPE tumour samples and samples sequenced on Illumina NovaSeq machines. It is recommended to run this step even if your samples or sequencing machines are not prone to orientation bias.
Create a job input file and an arguments file for each tumour-normal pair (containing the per-interval f1r2 files created for the pair by Mutect2) with:
sh gatk4_learnreadorientationmodel_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_learnreadorientationmodel_run_parallel.pbs`, then submit your job with:
qsub gatk4_learnreadorientationmodel_run_parallel.pbs
Check that LearnReadOrientationModel ran successfully for each tumour-normal pair. The log files are checked for `SUCCESS` and `error` messages, and the expected output file `../Mutect2/TumourID_NormalID/TumourID_NormalID_read-orientation-model.tar.gz` is checked to exist and be non-empty. Run the check with:
sh gatk4_learnreadorientationmodel_check.sh /path/to/cohort.config
For each tumour-normal pair in your `cohort.config` file, you will now have:
../Mutect2/<patient>-T<ID>_<patient>-N<ID>/<patient>-T<ID>_<patient>-N<ID>_read-orientation-model.tar.gz
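A hedged sketch of the per-pair command (the real job passes all per-interval f1r2 archives, typically via an arguments file; two are shown here):

```bash
# Learn the read-orientation model from the pair's per-interval f1r2 counts
gatk LearnReadOrientationModel \
    -I ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.f1r2.0000.tar.gz \
    -I ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.f1r2.0001.tar.gz \
    -O ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N_read-orientation-model.tar.gz
```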
Under the assumption that known germline variant sites are biallelic, any site that presents multiple variant alleles suggests possible sample contamination. `GetPileupSummaries` uses a sample BAM file and a germline resource containing common biallelic variants. It tabulates pileup metrics for `CalculateContamination`, summarising the counts of reads that support the reference, alternate and other alleles.
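A hedged sketch of the two commands for one tumour sample and its matched normal (paths and output names are illustrative; the default resource is the small ExAC VCF described under Germline resources below):

```bash
common=../Reference/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz

# Tabulate pileups at common biallelic sites for the tumour and its matched normal
gatk GetPileupSummaries \
    -I ../Final_bams/PATIENT1-T1.final.bam \
    -V "$common" -L "$common" \
    -O PATIENT1-T1_pileups.table

gatk GetPileupSummaries \
    -I ../Final_bams/PATIENT1-N.final.bam \
    -V "$common" -L "$common" \
    -O PATIENT1-N_pileups.table

# Estimate contamination (and tumour segmentation) for use in FilterMutectCalls
gatk CalculateContamination \
    -I PATIENT1-T1_pileups.table \
    -matched PATIENT1-N_pileups.table \
    --tumor-segmentation PATIENT1-T1_PATIENT1-N_segments.table \
    -O PATIENT1-T1_PATIENT1-N_contamination.table
```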
Germline resources
The `../Reference` directory, downloadable from and described in Fastq-to-BAM, contains two germline resources that can be used for `GetPileupSummaries`:
- Common biallelic variants from the ExAC dataset (60,706 exomes), lifted over to the hg38 reference genome: `../Reference/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz`
- Common biallelic variants from gnomAD (76,156 whole genomes), mapped to hg38: `../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.common_biallelic.hg38.vcf.gz`. This was created using the optional `SelectVariants` step in the `gatk4_selectvariants.pbs` script.
Due to the vast difference in the number of loci between these two public resources, the methodology has been updated to enable use of `af-only-gnomad.common_biallelic.hg38.vcf.gz` on NCI Gadi. Issues running GetPileupSummaries with this VCF are described here and here. The resource you choose for this step of the workflow ("small_exac" or "gnomad") therefore dictates which scripts (which processing methodology) you use.
Optionally, you can generate your own common biallelic variant resource. If so, please compare the number of loci in your VCF to the above two public resources, and use the script set that best matches the size of your resource.
`gatk4_selectvariants.pbs` takes your public resource VCF and selects common biallelic SNP variants (by default, those with an allele frequency > 0.05).
In the `gatk4_selectvariants.pbs` script, replace `<>` with your resource at `resource=<path/to/public_dataset.vcf.gz>` and your output file at `resource_common=</path/to/output_public_dataset_common_biallelic.vcf.gz>`. Adjust your compute resource requests, then submit your job with:
qsub gatk4_selectvariants.pbs
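A hedged sketch of the selection that `gatk4_selectvariants.pbs` performs (option values follow the defaults described above; the reference path is illustrative and the selection expression uses GATK's JEXL syntax):

```bash
resource=/path/to/public_dataset.vcf.gz
resource_common=/path/to/output_public_dataset_common_biallelic.vcf.gz

# Keep biallelic SNPs with allele frequency > 0.05
gatk SelectVariants \
    -R ../Reference/hg38.fasta \
    -V "$resource" \
    --select-type-to-include SNP \
    --restrict-alleles-to BIALLELIC \
    -select "AF > 0.05" \
    -O "$resource_common"
```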
With your common biallelic germline resource, run `GetPileupSummaries` for all samples in your `cohort.config` file.
IMPORTANT NOTE ON COMMON BIALLELIC VCF RESOURCE
The default germline resource is `small_exac_common_3.hg38.vcf.gz`. To use this resource, follow Option 1: using small resource e.g. "small_exac". If you would like to change this resource, set the `common_biallelic` variable to the path of your chosen resource. If using a large resource, e.g. `af-only-gnomad.common_biallelic.hg38.vcf.gz`, use the script set and methodology described in Option 2: using large resource e.g. "gnomad".
sh gatk4_getpileupsummaries_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_getpileupsummaries_run_parallel.pbs`, then submit your job with:
qsub gatk4_getpileupsummaries_run_parallel.pbs
`GetPileupSummaries` task log files are checked for `SUCCESS` and `error` messages. The script also checks that the expected output `<cohort>_GetPileupSummaries/samples_pileups.table` exists and is not empty.
First, edit the common_biallelic variable in gatk4_getpileupsummaries_check.sh so that it is consistent with what you used in step 15. Then:
sh gatk4_getpileupsummaries_check.sh /path/to/cohort.config
The script will print the number of successful and unsuccessful tasks to the terminal. Failed tasks will be written to `Inputs/gatk4_getpileupsummaries_missing.inputs`. If there are failed tasks to re-run, adjust the compute resource requests in `gatk4_getpileupsummaries_missing_run_parallel.pbs`, then submit your job with:
qsub gatk4_getpileupsummaries_missing_run_parallel.pbs
This section describes a method to compute GATK GetPileupSummaries using scatter-gather parallelism over intervals, rather than over the whole genome at once. The rest of this workflow uses `nci-parallel` to achieve this; however, at the time of writing this script set (31 January 2024), CPU efficiency was observed to be reduced by that method. An alternative approach has therefore been incorporated that splits the genomic intervals into three 'chunks' of intervals per sample, then submits the chunks with a plain bash `for` loop. Three chunks per sample seemed a fair compromise between walltime and not thrashing the scheduler. There are 8 scripts specific to this part of the workflow, with 6 run steps, but do not be dissuaded - it is straightforward.
The first two steps (three scripts in total) need only be done once per 'common_biallelic' resource, e.g. gnomAD. The genome is split by interval (we requested 24 intervals, yet 18 were returned, as 'BALANCING_WITHOUT_INTERVAL_SUBDIVISION' mode was applied). Then, the gnomAD VCF resource is split into 18 smaller VCF files. The original intention was to parallelise over this many splits per sample, but see the note above. The interval list files and interval VCF files are written to appropriate locations within the `Reference` directory, so please make sure it is writable by you before you continue.
The remaining 4 steps (5 scripts) utilise the interval VCF files to execute three GetPileupSummaries jobs per sample, check the job outputs, and then concatenate the scattered interval tables into one GetPileupSummaries table per sample (after which, the scattered interval tables can be deleted).
For all PBS scripts described here, please adjust your NCI project code at `-P` and your `-l storage` requirements.
After ensuring that you have read-write access to your `Reference` directory and its subfolders, run this fast and light script on the login node:
bash gatk4_getpileupsummaries_splitIntervals.sh
This will create interval files in `../Reference/GetPileupSummaries_intervals`.
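For reference, a hedged sketch of the kind of SplitIntervals call the script wraps (scatter count and subdivision mode match the description above; the reference path is illustrative):

```bash
gatk SplitIntervals \
    -R ../Reference/hg38.fasta \
    --scatter-count 24 \
    --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
    -O ../Reference/GetPileupSummaries_intervals
```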
Then, check that the filepath to your large resource is correct within the script `gatk4_getpileupsummaries_split_common_vcf_run.sh`. The current default is `../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz`.
Run the following command to split the VCF according to the newly created 18 interval files:
bash gatk4_getpileupsummaries_split_common_vcf_run.sh
This will submit 18 separate PBS jobs, one per interval. This is very fast on the compute nodes but quite slow on the login node, thus the preference to schedule these small tasks on the cluster.
The output will be new gzipped VCF files and index files within the parent directory of the resource VCF, in a newly created sub-directory named after the VCF file prefix, e.g. `../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38`.
As usual, make the inputs, providing the filepath of your samples config file as first and only command line argument:
bash gatk4_getpileupsummaries_byInterval_make_input.sh <config-filepath>
The created inputs file should contain 3 x N lines, where N is the number of samples. Three 'chunks' of intervals are run per sample rather than scattering over all 18 intervals - see the justification above. The default walltime for the GetPileupSummaries job is 3 hours per job (where one job is one chunk of intervals for one sample). This is sufficient for a very high-coverage BAM (a ~174X sample required 2 hours). You may wish to reduce this to 1 or 2 hours in the script `gatk4_getpileupsummaries_byInterval.pbs`, depending on your sample coverage, in order to minimise potential queue time. The jobs run on the `hugemem` queue, which is a scarcer resource than the `normal` queue, so queue time may be a factor for large cohorts if the `hugemem` queue is in high demand at the time of processing. Changing to the `normal` or `normalbw` queue is generally not recommended, as CPU efficiency and SU usage are poorer, given the need to request more CPUs than GATK can use (1) due to memory requirements.
Once the inputs file is made, and you are happy with the walltime set in the PBS script, submit all chunks for all samples with:
bash gatk4_getpileupsummaries_byInterval_runLoop.sh
Once the jobs have completed, check all jobs with:
bash gatk4_getpileupsummaries_byInterval_check.sh
If all jobs were successful, the following message will be returned:
No issues detected. Please continue with gatk4_getpileupsummaries_byInterval_concat.sh
If issues were detected, a new inputs file for the failed tasks will be written to `Inputs/gatk4_getpileupsummaries.inputs-failed`. A complete set of re-run scripts for failed tasks has not been written. In the event of a small number of failures (e.g. on walltime), you can take a copy of the script `gatk4_getpileupsummaries_byInterval.pbs`, increase the walltime, adjust the inputs file, and supply the correct variables as in the original launcher script `gatk4_getpileupsummaries_byInterval_runLoop.sh`. It is most likely that only one or a handful of the intervals within a 'chunk' failed to complete (where the failure was due to insufficient walltime); in this case, resubmitting just those intervals, rather than the entire chunk, is wise to minimise wasted KSU. Given the planned move of this workflow away from `nci-parallel` and towards Nextflow, time has not been spent automating this level of detail here. If you need help with this, please reach out.
This step takes around 3-6 seconds per sample on the login node. Run:
bash gatk4_getpileupsummaries_byInterval_concat.sh <config-filepath>
This will take the scattered pileup tables for each sample from `../<cohortName>_GetPileupSummaries/scatter/` and concatenate them (avoiding redundant headers) into `../<cohortName>_GetPileupSummaries/<LabSampleID>_pileups.table`.
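A minimal sketch of the per-sample concatenation logic, assuming the scattered tables share an identical header block (the scatter file naming pattern shown is hypothetical; the repository script handles this for you):

```bash
#!/bin/bash
# Concatenate scattered GetPileupSummaries tables for one sample,
# keeping the metadata/header lines from the first table only
sample=PATIENT1-T1
outdir=../cohortName_GetPileupSummaries

first=true
for table in "${outdir}"/scatter/"${sample}"*_pileups.table; do
    if [ "$first" = true ]; then
        cat "$table"                               # keep header block
        first=false
    else
        grep -v -e '^#' -e '^contig' "$table"      # data rows only
    fi
done > "${outdir}/${sample}_pileups.table"
```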
Once you are satisfied that the final sample pileups are complete, you can delete the scatter directory:
rm -rf ../<cohortName>_GetPileupSummaries/scatter/
This shows compute usage for two samples: one with an 81 GB BAM file ('smallSample', approx. 73X raw coverage) and one with a 219 GB BAM ('largeSample', approx. 174X coverage). Each job was run on one Gadi `hugemem` CPU, allowing 28 GB of Java memory for the GATK tool.
The 4-digit strings within the job name column are the intervals processed within that chunked job. Note that the chunks contain different collections of intervals between the two samples, due to testing different interval divisions. The sum KSU and walltime metrics are still valid. The interval chunks shown for the 'large' sample are the current default.
#JobName | CPUs_requested | Mem_requested | Mem_used | CPUtime_mins | Walltime_mins | JobFS_used | Efficiency | Service_units |
---|---|---|---|---|---|---|---|---|
smallSample_0000-0001-0002-0003-0004.o | 1 | 31.0GB | 22.77GB | 62.7 | 63.43 | 685.84KB | 0.99 | 3.17 |
smallSample_0005-0006-0007-0008-0009-0010-0011.o | 1 | 31.0GB | 26.08GB | 64.2 | 64.93 | 688.83KB | 0.99 | 3.25 |
smallSample_0012-0013-0014-0015-0016-0017.o | 1 | 31.0GB | 19.15GB | 44.17 | 44.63 | 685.84KB | 0.99 | 2.23 |
largeSample_0000-0001-0002-0003.o | 1 | 31.0GB | 31.0GB | 88.88 | 89.08 | 684.34KB | 1 | 4.45 |
largeSample_0004-0005-0006-0007-0008-0009.o | 1 | 31.0GB | 31.0GB | 95.98 | 96.22 | 687.33KB | 1 | 4.81 |
largeSample_0010-0011-0012-0013-0014-0015-0016-0017.o | 1 | 31.0GB | 31.0GB | 111.68 | 111.98 | 688.83KB | 1 | 5.6 |
Calculate the fraction of reads coming from cross-sample contamination using `CalculateContamination`, with the pileup tables from `GetPileupSummaries` as inputs. The resulting contamination table is used in `FilterMutectCalls`. Create inputs with:
sh gatk4_calculatecontamination_make_input.sh /path/to/cohort.config
You are finally ready to obtain a filtered set of somatic variants using `FilterMutectCalls`. You will need the inputs from the previous steps, including:
- `<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz` (from Mutect2, using PoN)
- `<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz.stats` (from Mutect2, genomic intervals gathered into a single stats file with `MergeMutectStats`)
- `tumor_segments.table` (from GetPileupSummaries & CalculateContamination)
- `<patient>-T<ID>_<patient>-N<ID>_contamination.table` (from GetPileupSummaries & CalculateContamination)
- `<patient>-T<ID>_<patient>-N<ID>_read-orientation-model.tar.gz` (from Mutect2 & LearnReadOrientationModel)
- Create input files for each task (`FilterMutectCalls` for a single tumour-normal pair) by:
sh gatk4_filtermutectcalls_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_filtermutectcalls_run_parallel.pbs`, then submit your job with:
qsub gatk4_filtermutectcalls_run_parallel.pbs
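A hedged sketch of the per-pair filtering command each task runs (file names follow the inputs listed above; the filtered output name and reference path are illustrative):

```bash
pair=PATIENT1-T1_PATIENT1-N

gatk FilterMutectCalls \
    -R ../Reference/hg38.fasta \
    -V ../Mutect2/${pair}/${pair}.unfiltered.vcf.gz \
    --stats ../Mutect2/${pair}/${pair}.unfiltered.vcf.gz.stats \
    --tumor-segmentation ../Mutect2/${pair}/${pair}_segments.table \
    --contamination-table ../Mutect2/${pair}/${pair}_contamination.table \
    --ob-priors ../Mutect2/${pair}/${pair}_read-orientation-model.tar.gz \
    -O ../Mutect2/${pair}/${pair}.filtered.vcf.gz
```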
The following benchmarks were obtained from processing 20 tumour-normal pairs (34X and 70X, human samples). My apologies that the steps are not in chronological order 😊
#JobName | CPUs_requested | CPUs_used | Mem_requested | Mem_used | CPUtime | CPUtime_mins | Walltime_req | Walltime_used | Walltime_mins | JobFS_req | JobFS_used | Efficiency | Service_units |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gatk4_cohort_pon_144 | 144 | 144 | 4.39TB | 110.13GB | 16:39:15 | 999.25 | 5:00:00 | 0:24:51 | 24.85 | 1.46GB | 9.08MB | 0.28 | 178.92 |
gatk4_cohort_pon_48 | 48 | 48 | 1.46TB | 102.54GB | 17:26:26 | 1046.43 | 10:00:00 | 0:26:04 | 26.07 | 500.0MB | 9.08MB | 0.84 | 62.56 |
gatk4_cohort_pon_96 | 96 | 96 | 2.93TB | 110.97GB | 17:07:23 | 1027.38 | 10:00:00 | 0:26:07 | 26.12 | 1000.0MB | 9.07MB | 0.41 | 125.36 |
gatk4_cohort_pon_gather_sort | 1 | 1 | 18.0GB | 4.28GB | 0:02:18 | 2.3 | 1:00:00 | 0:03:15 | 3.25 | 100.0MB | 0B | 0.71 | 0.49 |
gatk4_getpileupsummaries_exac* | 48 | 48 | 190GB | 145.94GB | 10:42:43 | 642.72 | 15:00:00 | 01:03:58 | 63.97 | 100.0MB | 40.09MB | 0.21 | 102.35 |
gatk4_mutect2_1920 | 1920 | 1920 | 7.5TB | 6.61TB | 1890:56:30 | 113456.5 | 4:00:00 | 1:07:04 | 67.07 | 3.91GB | 32.77MB | 0.88 | 4292.27 |
gatk4_mutect2_2880 | 2880 | 2880 | 11.25TB | 9.81TB | 1963:57:43 | 117837.72 | 2:00:00 | 0:48:33 | 48.55 | 5.86GB | 33.38MB | 0.84 | 4660.8 |
gatk4_mutect2_3840 | 3840 | 3840 | 15.0TB | 13.07TB | 2033:47:24 | 122027.4 | 2:00:00 | 0:39:04 | 39.07 | 7.81GB | 32.77MB | 0.81 | 5000.53 |
gatk4_mutect2_960 | 960 | 960 | 3.75TB | 3.3TB | 1865:51:58 | 111951.97 | 4:00:00 | 2:05:44 | 125.73 | 1.95GB | 32.77MB | 0.93 | 4023.47 |
gatk4_pon_1920 | 1920 | 1920 | 7.5TB | 6.39TB | 1546:53:50 | 92813.83 | 2:00:00 | 0:50:06 | 50.1 | 3.91GB | 21.23MB | 0.96 | 3206.4 |
gatk4_pon_2880 | 2880 | 2880 | 11.25TB | 8.62TB | 1601:44:32 | 96104.53 | 2:00:00 | 0:35:13 | 35.22 | 5.86GB | 21.23MB | 0.95 | 3380.8 |
gatk4_pon_3840 | 3840 | 3840 | 15.0TB | 11.36TB | 1859:29:08 | 111569.13 | 2:00:00 | 0:31:43 | 31.72 | 7.81GB | 21.23MB | 0.92 | 4059.73 |
gatk4_pon_960 | 960 | 960 | 3.75TB | 3.27TB | 1537:26:16 | 92246.27 | 2:00:00 | 1:38:03 | 98.05 | 1.95GB | 21.1MB | 0.98 | 3137.6 |
gatk4_pon_gathervcfs_20 | 20 | 20 | 640.0GB | 42.43GB | 1:04:42 | 64.7 | 2:00:00 | 1:44:51 | 104.85 | 500.0MB | 8.09MB | 0.03 | 104.85 |
gatk4_pon_genomicsdbimport_144 | 144 | 144 | 4.39TB | 698.85GB | 22:28:42 | 1348.7 | 5:00:00 | 0:14:20 | 14.33 | 1.46GB | 8.96MB | 0.65 | 103.2 |
gatk4_pon_genomicsdbimport_48 | 48 | 48 | 1.46TB | 314.08GB | 21:27:48 | 1287.8 | 10:00:00 | 0:30:33 | 30.55 | 500.0MB | 8.95MB | 0.88 | 73.32 |
gatk4_pon_genomicsdbimport_96 | 96 | 96 | 2.93TB | 602.88GB | 21:48:05 | 1308.08 | 10:00:00 | 0:18:14 | 18.23 | 1000.0MB | 8.95MB | 0.75 | 87.52 |
*gatk4_getpileupsummaries_exac was benchmarked with 80 tumour (~70X) and normal (~35X) paired samples, using `common_biallelic=../Reference/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz`.
Chew, T., Willet, C., Samaha, G., Menadue, B. J., Downton, M., Sun, Y., Kobayashi, R., & Sadsad, R. (2021). Somatic-ShortV (Version 1.0) [Computer software]. https://doi.org/10.48546/workflowhub.workflow.148.1
Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities.
Suggested acknowledgements:
NCI Gadi
The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government.
GATK 4: Van der Auwera et al. 2013 https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/0471250953.bi1110s43
OpenMPI: Graham et al. 2015 https://dl.acm.org/doi/10.1007/11752578_29