The scripts in this repository call somatic short variants (SNPs and indels) from tumour and matched-normal BAM files, following GATK's Best Practices Workflow. The workflow and scripts have been specifically optimised to run efficiently and at scale on the National Computational Infrastructure (NCI), Gadi.
This workflow is no longer actively supported or maintained. While you are welcome to use the existing code, please note that no further updates, bug fixes, or support will be provided.
For questions or alternative recommendations for University of Sydney staff and students, please get in touch with [email protected]. You can find alternatives at WorkflowHub.
Thank you for your understanding!
- High compute efficiency
- Scalable through scatter-gather parallelism with `openmpi` and `nci-parallel`
- Checker scripts (no KSU cost)
- Checkpointing and resumption of only failed tasks if required
- Support: this repository is actively monitored, however major updates are not currently planned. We do, however, plan to port this workflow to Nextflow.
- Note (as at 31st January 2024): some performance loss has been observed for the scatter-gather parallelism when running multiple tasks under nci-parallel. Since Nextflow workflows do not rely on this scatter method, we will focus on the Nextflow implementation as the solution.
By default, this pipeline is compatible with BAMs aligned to the GRCh38/hg38 + ALT contigs reference genome. Scatter-gather parallelism has been designed to operate over 3,200 evenly sized genomic intervals (~1 Mb each) across your sample cohort. Some genomic intervals are excluded - these are typically repetitive regions, which can significantly impede compute performance.
Excluded sites are listed in the Delly group's `sv_repeat_telomere_centromere.bed` file, which is included in the `Reference` dataset. The BED file contains:
- telomeres
- centromeres
- chrUn (unplaced)
- chrUn_XXXXX_decoy (decoy)
- chrN_XXXXX_random (unlocalized)
- chrEBV
This pipeline can be implemented after running the Fastq-to-BAM pipeline. The scripts use relative paths, so correct set-up is important.
The minimum requirements and high level directory structure resemble the following:
├── Final_bams
├── <cohort>.config
├── Reference
└── Somatic-ShortV
Output will be created at this directory level. In your high-level directory, run `git clone https://github.com/Sydney-Informatics-Hub/Somatic-ShortV.git`. The `Somatic-ShortV` directory contains the scripts of the workflow. Submit all jobs from within `Somatic-ShortV`.
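A minimal sketch of the set-up, assuming a hypothetical high-level working directory (substitute your own path and project):

```bash
# Move to your high-level directory (hypothetical path - substitute your own)
cd /scratch/ab00/my_cohort_dir

# Clone the workflow scripts alongside Final_bams, Reference and <cohort>.config
git clone https://github.com/Sydney-Informatics-Hub/Somatic-ShortV.git

# Submit all jobs from within Somatic-ShortV
cd Somatic-ShortV
```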
The tools required to run this pipeline include:
- `openmpi/4.1.0`
- `nci-parallel/1.0.0a`
- GATK4. The pipeline is compatible with `gatk/4.1.2.0`, `gatk/4.1.8.1` and `gatk/4.2.1.0`. It has not been tested with other versions of GATK4.
Use `module avail` to check if they are currently globally installed.
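For example, on Gadi you could check and load the required modules like so (versions as listed above; adjust to what is installed on your system):

```bash
# Check which versions are globally installed
module avail openmpi nci-parallel gatk

# Load one tested combination
module load openmpi/4.1.0
module load nci-parallel/1.0.0a
module load gatk/4.2.1.0
```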
If you have used the Fastq-to-BAM pipeline, you can skip the remaining set-up steps.
The steps in this pipeline operate on the samples present in your `<cohort>.config` file.
- See here for a full description
`<cohort>.config` is a TSV file with one row per unique sample, matching the format in the example below. Start header lines with `#`.
- LabSampleIDs are your in-house sample IDs. Input and output files are named with this ID.
- Normal samples should be named `<patientID>-N<normalID>`
- Matched tumour samples should be named `<patientID>-T<tumourID>`. Multiple tumour samples are OK. Only the `<patientID>-T<tumourID>`, `<patientID>-M<tumourID>` and `<patientID>-P<tumourID>` extensions are recognised. `<patientID>` is used to find the normal and tumour samples belonging to a single patient.
Below is an example `<cohort>.config` with samples from two patients. Patient 1 has two matched tumour samples.
#SampleID | LabSampleID | Seq_centre | Library(default=1) |
---|---|---|---|
SAMPLE1 | PATIENT1-N | AGRF | |
SAMPLE2 | PATIENT1-T1 | AGRF | |
SAMPLE3 | PATIENT1-T2 | AGRF | |
SAMPLE4 | PATIENT2-N | AGRF | |
SAMPLE5 | PATIENT2-T | AGRF | |
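The same example as a raw tab-separated file (a sketch only; the header starts with `#`, fields are tab-separated, and a blank Library field uses the default of 1):

```
#SampleID	LabSampleID	Seq_centre	Library(default=1)
SAMPLE1	PATIENT1-N	AGRF
SAMPLE2	PATIENT1-T1	AGRF
SAMPLE3	PATIENT1-T2	AGRF
SAMPLE4	PATIENT2-N	AGRF
SAMPLE5	PATIENT2-T	AGRF
```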
- BAM files should be at the sample level
- BAM and BAI (index) filenames should follow:
  - `<patientID>-N<ID>.final.bam` for normal samples
  - `<patientID>-T<tumourID>.final.bam` for tumour samples
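Following the example config above, `Final_bams` would therefore contain something like the listing below (each BAM accompanied by its BAI index, named as produced by your Fastq-to-BAM run):
├── PATIENT1-N.final.bam
├── PATIENT1-T1.final.bam
├── PATIENT1-T2.final.bam
├── PATIENT2-N.final.bam
└── PATIENT2-T.final.bam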
Ensure you have the `Reference` directory from Fastq-to-BAM. This contains the input data required for Somatic-ShortV, including the human reference genome (hg38 + alternate contigs) and the list of scatter interval files used for scatter-gather parallelism.
This pipeline performs GATK4's Somatic Short Variant Discovery (SNVs and indels) Best Practices workflow for all samples present in `<cohort>.config`.
Adding new samples to a cohort for PoN: If you have sequenced new samples belonging to a previously sequenced cohort and wish to re-create the PoN with all samples, you can skip some of the processing steps for the previously sequenced samples. If you would like to do this:
- Please create a config file containing the new samples only and process these from step 1.
- Follow optional steps from step 5 onwards if you would like to consolidate the new samples with previously processed samples.
Each step in the pipeline generally includes:
- A script to create an input file, where 1 line of input = 1 scatter task
- A `run_parallel.pbs` script, the job you submit to run `task.sh` for each input created
- A checker script and, if required, an additional script to re-submit failed jobs
- The checker scripts tar-archive log files once checks have passed
- Job logs are written to `./Logs` and task logs to `./Logs/<jobname>`
By default, compute resources are optimal for processing 40 normal (35X) and 40 matched tumour (70X) samples. See benchmarking metrics for additional guidance with compute resource requests.
! Submit all jobs from the `Somatic-ShortV` directory
The scripts below run Mutect2 in tumour-only mode for the normal samples to create `sample.pon.index.vcf`, `sample.pon.index.vcf.idx` and `sample.pon.index.vcf.stats` (index is the number given to a unique interval). By default, this job scatters each sample's work into 3,200 genomic intervals. Normal samples should ideally be sequenced on the same platform and chemistry (library kit) as the tumour samples. They are used to filter sequencing artefacts (polymerase slippage tends to occur at the same genomic loci for short-read sequencing technologies) as well as germline variants. Read more about PoN on GATK's website.
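For orientation, a hedged sketch of the kind of per-interval command each task runs (reference path, interval list name and output location are illustrative; the actual command lives in the task script shipped with this repository):

```bash
# Tumour-only Mutect2 on one normal sample over one scattered interval,
# as used for panel-of-normals creation (illustrative paths)
gatk Mutect2 \
    -R ../Reference/hg38.fasta \
    -I ../Final_bams/PATIENT1-N.final.bam \
    -L ../Reference/ShortV_intervals/0001-scattered.interval_list \
    --max-mnp-distance 0 \
    -O PATIENT1-N.pon.0001.vcf
```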
Create task inputs:
sh gatk4_pon_make_input.sh /path/to/cohort.config
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort, then:
qsub gatk4_pon_run_parallel.pbs
The next script checks that the tasks for the previous job completed successfully. Inputs for failed tasks are written to `Inputs/gatk4_pon_missing.inputs` for re-submission. Checks include: the `sample.pon.index.vcf`, `sample.pon.index.vcf.idx` and `sample.pon.index.vcf.stats` files exist and are non-empty; GATK logs are checked for "SUCCESS" or "error" messages.
The checker script can be run on the login node. We advise running it with `nohup` so that the script is not killed if your terminal session ends.
Replace `/path/to/cohort.config` with the path to your `cohort.config` file and run:
nohup sh gatk4_pon_check_sample_parallel.sh /path/to/cohort.config 2> /dev/null 1> ./Logs/gatk4_pon_check.log &
Check the output with `cat ./Logs/gatk4_pon_check.log` to see whether the job was successful. If there are no failed tasks, move on to 2. Gather interval VCFs into sample-level VCFs.
If there are tasks to re-run from step 1 (check the number of tasks to re-run using `wc -l Inputs/gatk4_pon_missing.inputs`), re-run the failed tasks. After adjusting the compute resource requests in `gatk4_pon_missing_run_parallel.pbs` (usually one normal node is sufficient for ~3,200 tasks), submit the job with:
qsub gatk4_pon_missing_run_parallel.pbs
The steps under "Perform checks" above can be repeated until all tasks pass checks.
If all checks pass, the checker script prints task duration and memory per task (genomic interval) to `./Logs/gatk4_pon/sample`. Each sample log directory is archived into `./Logs/gatk4_pon/sample_logs.tar.gz`. Tar archives reduce iNode usage.
Gather the per-interval `sample.pon.index.vcf` and `sample.pon.index.vcf.idx` files into single-sample, gzipped VCFs. `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` are written to `../cohort_PoN`.
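A hedged sketch of the per-sample gather (the real job reads a per-sample arguments file listing the interval VCFs in order, created in the next step; names and paths are illustrative):

```bash
# Gather per-interval PoN VCFs for one sample into a single gzipped VCF
gatk GatherVcfs \
    --arguments_file ./Inputs/gatk4_pon_gathervcfs/PATIENT1-N.args \
    -O ../cohort_PoN/PATIENT1-N.pon.vcf.gz

# Index the gathered VCF
tabix -p vcf ../cohort_PoN/PATIENT1-N.pon.vcf.gz
```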
Create inputs (1 sample per task; an arguments file is created per sample). This can take a while, so we recommend using `nohup` if you have more than 200 samples.
sh gatk4_pon_gathervcfs_make_input.sh /path/to/cohort.config
Or with `nohup` (output is written to `nohup.out`):
nohup sh gatk4_pon_gathervcfs_make_input.sh /path/to/cohort.config 2>&1 &
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort, then:
qsub gatk4_pon_gathervcfs_run_parallel.pbs
The next script checks that the tasks for the previous job completed successfully. Inputs for failed tasks are written to `./Inputs/gatk4_pon_gathervcfs_missing.inputs` for re-submission. The script checks that the `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` files are present and non-empty in `../cohort_PoN`, and that the log files are free of errors.
Run this script on the login node (quick):
sh gatk4_pon_gathervcfs_check.sh /path/to/cohort.config
If all tasks are successful, we recommend backing up `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` to long-term storage and continuing to 3. Consolidate samples with GenomicsDBImport.
Check the number of tasks to re-run using `wc -l ./Inputs/gatk4_pon_gathervcfs_missing.inputs`. Adjust the compute resource requests in `gatk4_pon_gathervcfs_missing_run_parallel.pbs`, then run it:
qsub gatk4_pon_gathervcfs_missing_run_parallel.pbs
Perform checks and re-run failed tasks until all tasks have completed successfully.
The downstream analyses are performed across multiple samples, so sample data is first consolidated with GenomicsDBImport. By default, this occurs at 3,200 genomic intervals. If you are processing a new cohort, start at Consolidate PoN.
To create a new PoN that includes previously processed sample data (e.g. when you want to combine previously processed samples with newly sequenced samples), perform the additional steps below; otherwise, skip this step.
Downstream steps require:
- A `cohort.config` file
- The directory `../cohort_PoN`, containing `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` for each sample in `cohort.config`
These can be created with some simple commands, either by copying data or by using symbolic links. The example below shows how this can be done if you used this pipeline to analyse a separate cohort:
├── Final_bams
├── samplesSet1.config # Previously analysed cohort
├── samplesSet2.config # Currently analysed cohort
├── samplesSet1andSet2.config # Joined cohort - this needs to be created
├── samplesSet1_PoN # Previously analysed cohort
├── samplesSet2_PoN # Currently analysed cohort
├── samplesSet1andSet2_PoN # Joined cohort - this needs to be created
├── Reference
└── Somatic-ShortV
Concatenate the config files of the previously sequenced samples (e.g. `samplesSet1.config`) and the newly sequenced samples (e.g. `samplesSet2.config`) into a new config file (e.g. `samplesSet1andSet2.config`) with:
sh concat_configs.sh samplesSet1andSet2.config samplesSet1.config samplesSet2.config
Create symbolic links for `sample.pon.vcf.gz` and `sample.pon.vcf.gz.tbi` in `../samplesSet1andSet2_PoN` using this script:
sh setup_pon_from_concat_config.sh samplesSet1andSet2.config
Create inputs:
sh gatk4_pon_genomicsdbimport_make_input.sh /path/to/cohort.config
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort (this job's memory requirement scales with cohort size), then:
qsub gatk4_pon_genomicsdbimport_run_parallel.pbs
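A hedged sketch of the per-interval consolidation each scattered task performs (the sample list and workspace path are illustrative; the real task script derives them from your config):

```bash
# Consolidate all normal samples' PoN VCFs for one genomic interval
# into a GenomicsDB workspace (illustrative paths)
gatk GenomicsDBImport \
    -V ../cohort_PoN/PATIENT1-N.pon.vcf.gz \
    -V ../cohort_PoN/PATIENT2-N.pon.vcf.gz \
    -L ../Reference/ShortV_intervals/0001-scattered.interval_list \
    --genomicsdb-workspace-path ./cohort_PoN_GenomicsDBImport/0001
```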
Check the job when it's complete by:
nohup sh gatk4_pon_genomicsdbimport_check.sh /path/to/cohort.config 2> /dev/null 1> ./Logs/gatk4_pon_genomicsdbimport.log &
This takes a couple of minutes. Check the `./Logs/gatk4_pon_genomicsdbimport.log` file. If all tasks were successful, continue to 4. Create PoN per genomic interval.
Check the number of tasks to re-run using `wc -l ./Inputs/gatk4_pon_genomicsdbimport_missing.inputs`. Adjust the compute resource requests in `gatk4_pon_genomicsdbimport_missing_run_parallel.pbs`, then run it:
qsub gatk4_pon_genomicsdbimport_missing_run_parallel.pbs
Perform checks and re-run failed tasks until all tasks have completed successfully.
Here, the hg38 gnomAD resource is used as the germline resource by default (`af-only-gnomad.hg38.vcf.gz`, obtained from GATK's Google Cloud Resource Bucket). You may wish to change this by specifying the resource you wish to use at `germline=` in the `gatk4_cohort_pon_make_input.sh` file.
Create inputs:
sh gatk4_cohort_pon_make_input.sh /path/to/cohort.config
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort, then:
qsub gatk4_cohort_pon_run_parallel.pbs
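A hedged sketch of the per-interval command behind these tasks (paths are illustrative; the germline resource shown is the default described above):

```bash
# Create the cohort PoN for one genomic interval from the GenomicsDB workspace
gatk CreateSomaticPanelOfNormals \
    -R ../Reference/hg38.fasta \
    --germline-resource ../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz \
    -V gendb://cohort_PoN_GenomicsDBImport/0001 \
    -O cohort.0001.pon.vcf.gz
```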
Check that each task for `gatk4_cohort_pon_run_parallel.pbs` ran successfully. This script checks that there is a non-empty VCF and TBI file for each genomic interval that the job operated on, and that there were no error messages in the log files. The script then collects duration and memory used per task (genomic interval) and cleans up by gzip tar-archiving the log files. Run:
sh gatk4_cohort_pon_check.sh /path/to/cohort.config
If there are no failed tasks, move on to 5. Gather intervals and sort into a single, multisample PoN.
Check the number of tasks to re-run using `wc -l ./Inputs/gatk4_cohort_pon_missing.inputs`. Adjust the compute resource requests in the script below, then submit the job:
qsub gatk4_cohort_pon_missing_run_parallel.pbs
Gather the per-interval cohort PoN VCFs and sort them into a single, multisample, sorted and indexed PoN VCF. This job performs these commands as a single task.
Set the config variable and submit the job:
qsub gatk4_cohort_pon_gather_sort.pbs
Check the job with:
sh gatk4_cohort_pon_gather_sort_check.sh
Errors should be investigated before re-running `gatk4_cohort_pon_gather_sort.pbs`.
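A hedged sketch of the two GATK calls this single-task job chains together (the arguments file lists the per-interval cohort PoN VCFs in order; names are illustrative):

```bash
# Gather the per-interval cohort PoN VCFs...
gatk GatherVcfs \
    --arguments_file ./Inputs/gatk4_cohort_pon_gather_sort.args \
    -O ../cohort_cohort_PoN/cohort.pon.vcf.gz

# ...then sort into the final multisample, indexed PoN
gatk SortVcf \
    -I ../cohort_cohort_PoN/cohort.pon.vcf.gz \
    -O ../cohort_cohort_PoN/cohort.sorted.pon.vcf.gz
```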
This is the last step in creating a panel of normals, and you should now have the following outputs for your `<cohort>.config` file:
../<cohort>_cohort_PoN/<cohort>.sorted.pon.vcf.gz
../<cohort>_cohort_PoN/<cohort>.sorted.pon.vcf.gz.tbi
Once you have completed creating your panel of normals, you may begin calling somatic variants with Mutect2 in tumour-matched-normal mode. Variants are filtered downstream. Variants will be called for each unique tumour-normal pair (i.e. if you have 3 tumour samples matching 1 normal sample, 3 VCF files will be produced, one per pair, provided LabSampleIDs follow the Set up guide).
For each unique tumour-normal pair in `<cohort>.config`, write inputs for scattering Mutect2 over 3,201 genomic intervals. One task processes the mitochondrial chromosome, for those who wish to perform mitochondrial analysis.
sh gatk4_mutect2_make_input.sh /path/to/cohort.config
Edit and run the script below to scatter task inputs for parallel processing. Adjust the compute resource requests to suit your cohort.
Note: We recommend setting the walltime limit low (scale compute to 2-4 normal nodes per pair, leaving walltime=02:00:00). While we have implemented strategies to avoid this situation, occasionally a spurious sample pair can have a task that "hangs" (roughly 1 task per 50 pairs). This causes idle CPUs at the tail end of the job. These spurious tasks can be found and re-run in a separate job when checks are performed.
qsub gatk4_mutect2_run_parallel.pbs
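For orientation, a hedged sketch of the per-interval command each scattered task runs for one tumour-normal pair (paths and names are illustrative; the actual command lives in the task script shipped with this repository):

```bash
# Mutect2 in tumour with matched normal mode, for one pair over one interval.
# -normal takes the normal sample's SM tag from the BAM header
# (shown here as the LabSampleID for illustration).
gatk Mutect2 \
    -R ../Reference/hg38.fasta \
    -I ../Final_bams/PATIENT1-T1.final.bam \
    -I ../Final_bams/PATIENT1-N.final.bam \
    -normal PATIENT1-N \
    --germline-resource ../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz \
    --panel-of-normals ../cohort_cohort_PoN/cohort.sorted.pon.vcf.gz \
    --f1r2-tar-gz ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.f1r2.0001.tar.gz \
    -L ../Reference/ShortV_intervals/0001-scattered.interval_list \
    -O ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.unfiltered.0001.vcf.gz
```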
Outputs for each unique tumour-normal pair (index is the interval ID or chrM):
<patient>-T<ID>_<patient>-N<ID>.unfiltered.<index>.vcf.gz
<patient>-T<ID>_<patient>-N<ID>.unfiltered.<index>.vcf.gz.tbi
<patient>-T<ID>_<patient>-N<ID>.unfiltered.<index>.vcf.gz.stats
<patient>-T<ID>_<patient>-N<ID>.f1r2.<index>.tar.gz
The script below checks that the expected output files are present. Log files are checked for "SUCCESS" or error messages. The script takes ~30 seconds per 40 tumour-normal pairs. We recommend using `nohup` if you have a large number of pairs.
sh gatk4_mutect2_check_pair_parallel.sh /path/to/cohort.config
Check the output, or `Inputs/gatk4_mutect2_missing.inputs`, for failed tasks.
If there are tasks to re-run, adjust the compute resource requests in `gatk4_mutect2_missing_run_parallel.pbs`, then submit your job with:
qsub gatk4_mutect2_missing_run_parallel.pbs
Gather the unfiltered Mutect2 interval outputs for each tumour-normal pair. Create inputs with:
sh gatk4_mutect2_gathervcfs_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_mutect2_gathervcfs_run_parallel.pbs`, then submit your job with:
qsub gatk4_mutect2_gathervcfs_run_parallel.pbs
For each tumour-normal pair in your `cohort.config` file, you will now have:
../Mutect2/<patient>-T<ID>_<patient>-N<ID>/<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz
../Mutect2/<patient>-T<ID>_<patient>-N<ID>/<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz.tbi
You will also have input files for the filtering steps, including:
- Per-interval `.stats` files for `MergeMutectStats`
- Per-interval `f1r2` files for `LearnReadOrientationModel`
The following jobs prepare input files for the last part of the Somatic-ShortV pipeline, `FilterMutectCalls`. There are three parts, each of which can be run concurrently:
- `MergeMutectStats`
- `LearnReadOrientationModel`
- `GetPileupSummaries` and `CalculateContamination`
After variant calling with Mutect2, `stats` files are created for each interval. Gather these into a single `stats` file per tumour-normal pair. Stats files contain the number of callable sites (by default, the callable depth is 10).
Create inputs by:
sh gatk4_mergemutectstats_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_mergemutectstats_run_parallel.pbs`, then submit your job with:
qsub gatk4_mergemutectstats_run_parallel.pbs
For each tumour-normal pair in your `cohort.config` file, you will now have:
../Mutect2/<patient>-T<ID>_<patient>-N<ID>/<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz.stats
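A hedged sketch of the per-pair merge each task performs (the real job supplies the full list of interval stats files, typically via an arguments file; two intervals are shown for illustration):

```bash
# Merge per-interval Mutect2 stats into one stats file for the pair
gatk MergeMutectStats \
    -stats ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.unfiltered.0000.vcf.gz.stats \
    -stats ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.unfiltered.0001.vcf.gz.stats \
    -O ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.unfiltered.vcf.gz.stats
```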
This step aims to remove substitution error bias that occurs on a single strand (orientation bias). This particularly applies to FFPE tumour samples and samples sequenced on Illumina NovaSeq machines. It is recommended to run this step even if your samples or sequencing machines are not prone to orientation bias.
Create a job input file and an arguments file for each tumour-normal pair (containing the per-interval f1r2 files created for the pair by Mutect2) with:
sh gatk4_learnreadorientationmodel_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_learnreadorientationmodel_run_parallel.pbs`, then submit your job with:
qsub gatk4_learnreadorientationmodel_run_parallel.pbs
Check that LearnReadOrientationModel ran successfully for each tumour-normal pair. The log files are checked for `SUCCESS` and `error` messages, and the expected output file `../Mutect2/TumourID_NormalID/TumourID_NormalID_read-orientation-model.tar.gz` is checked to exist and be non-empty. Run the check with:
sh gatk4_learnreadorientationmodel_check.sh /path/to/cohort.config
For each tumour-normal pair in your `cohort.config` file, you will now have:
../Mutect2/<patient>-T<ID>_<patient>-N<ID>/<patient>-T<ID>_<patient>-N<ID>_read-orientation-model.tar.gz
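A hedged sketch of the per-pair command (the real job passes all per-interval f1r2 archives, typically via an arguments file; two are shown here):

```bash
# Learn the read-orientation model from the pair's per-interval f1r2 counts
gatk LearnReadOrientationModel \
    -I ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.f1r2.0000.tar.gz \
    -I ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N.f1r2.0001.tar.gz \
    -O ../Mutect2/PATIENT1-T1_PATIENT1-N/PATIENT1-T1_PATIENT1-N_read-orientation-model.tar.gz
```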
Under the assumption that known germline variant sites are biallelic, any site that presents multiple variant alleles suggests possible sample contamination. `GetPileupSummaries` uses a sample BAM file and a germline resource containing common biallelic variants. It tabulates pileup metrics for `CalculateContamination`, summarising the counts of reads that support the reference, alternate and other alleles.
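A hedged sketch of the two commands for one tumour sample and its matched normal (paths and output names are illustrative; the default resource is the small ExAC VCF described under Germline resources below):

```bash
common=../Reference/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz

# Tabulate pileups at common biallelic sites for the tumour and its matched normal
gatk GetPileupSummaries \
    -I ../Final_bams/PATIENT1-T1.final.bam \
    -V "$common" -L "$common" \
    -O PATIENT1-T1_pileups.table

gatk GetPileupSummaries \
    -I ../Final_bams/PATIENT1-N.final.bam \
    -V "$common" -L "$common" \
    -O PATIENT1-N_pileups.table

# Estimate contamination (and tumour segmentation) for use in FilterMutectCalls
gatk CalculateContamination \
    -I PATIENT1-T1_pileups.table \
    -matched PATIENT1-N_pileups.table \
    --tumor-segmentation PATIENT1-T1_PATIENT1-N_segments.table \
    -O PATIENT1-T1_PATIENT1-N_contamination.table
```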
Germline resources
The `../Reference` directory, downloadable from and described in Fastq-to-BAM, contains two germline resources that can be used for `GetPileupSummaries`:
- Common biallelic variants from the ExAC dataset (60,706 exomes), lifted over to the hg38 reference genome: `../Reference/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz`
- Common biallelic variants from gnomAD (76,156 whole genomes), mapped to hg38: `../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.common_biallelic.hg38.vcf.gz`. This was created using the optional `SelectVariants` step in the `gatk4_selectvariants.pbs` script.
Due to the vast difference in the number of loci between these two public resources, the methodology has been updated to enable use of `af-only-gnomad.common_biallelic.hg38.vcf.gz` on NCI Gadi. Issues running GetPileupSummaries with this VCF are described here and here. The resource you choose for this step of the workflow ("small_exac" or "gnomad") therefore dictates which scripts (which processing methodology) you use.
Optionally, you can generate your own common biallelic variant resource. If so, please compare the number of loci in your VCF to the above two public resources, and use the script set that best matches the size of your resource.
`gatk4_selectvariants.pbs` takes your public resource VCF and selects common biallelic SNP variants (by default, those with an allele frequency > 0.05).
In the `gatk4_selectvariants.pbs` script, replace `<>` with your resource at `resource=<path/to/public_dataset.vcf.gz>` and your output file at `resource_common=</path/to/output_public_dataset_common_biallelic.vcf.gz>`. Adjust your compute resource requests, then submit your job with:
qsub gatk4_selectvariants.pbs
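A hedged sketch of the selection that `gatk4_selectvariants.pbs` performs (option values follow the defaults described above; the reference path is illustrative and the selection expression uses GATK's JEXL syntax):

```bash
resource=/path/to/public_dataset.vcf.gz
resource_common=/path/to/output_public_dataset_common_biallelic.vcf.gz

# Keep biallelic SNPs with allele frequency > 0.05
gatk SelectVariants \
    -R ../Reference/hg38.fasta \
    -V "$resource" \
    --select-type-to-include SNP \
    --restrict-alleles-to BIALLELIC \
    -select "AF > 0.05" \
    -O "$resource_common"
```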
With your common biallelic germline resource, run `GetPileupSummaries` for all samples in your `cohort.config` file.
IMPORTANT NOTE ON COMMON BIALLELIC VCF RESOURCE
The default germline resource is `small_exac_common_3.hg38.vcf.gz`. To use this resource, follow Option 1: using small resource e.g. "small_exac". If you would like to change this resource, set the `common_biallelic` variable to the path of your chosen resource. If using a large resource, e.g. `af-only-gnomad.common_biallelic.hg38.vcf.gz`, use the script set and methodology described in Option 2: using large resource e.g. "gnomad".
sh gatk4_getpileupsummaries_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_getpileupsummaries_run_parallel.pbs`, then submit your job with:
qsub gatk4_getpileupsummaries_run_parallel.pbs
`GetPileupSummaries` task log files are checked for `SUCCESS` and `error` messages. The script also checks that the expected output `<cohort>_GetPileupSummaries/samples_pileups.table` exists and is not empty.
First, edit the common_biallelic variable in gatk4_getpileupsummaries_check.sh so that it is consistent with what you used in step 15. Then:
sh gatk4_getpileupsummaries_check.sh /path/to/cohort.config
The script will print the number of successful and unsuccessful tasks to the terminal. Failed tasks will be written to `Inputs/gatk4_getpileupsummaries_missing.inputs`. If there are failed tasks to re-run, adjust the compute resource requests in `gatk4_getpileupsummaries_missing_run_parallel.pbs`, then submit your job with:
qsub gatk4_getpileupsummaries_missing_run_parallel.pbs
This section describes a method to compute GATK GetPileupSummaries using scatter-gather parallelism over intervals, rather than over the whole genome at once. The rest of this workflow uses `nci-parallel` to achieve this; however, at the time of writing this script set (31 January 2024), CPU efficiency was observed to be reduced by that method. An alternative approach has therefore been incorporated that splits the genomic intervals into three 'chunks' of intervals per sample, then submits the chunks with a plain bash `for` loop. Three chunks per sample seemed a fair compromise between walltime and not thrashing the scheduler. There are 8 scripts specific to this part of the workflow, with 6 run steps, but do not be dissuaded - it is straightforward.
The first two steps (three scripts in total) need only be done once per 'common_biallelic' resource, e.g. gnomAD. The genome is split by interval (we requested 24 intervals, yet 18 were returned, as 'BALANCING_WITHOUT_INTERVAL_SUBDIVISION' mode was applied). Then, the gnomAD VCF resource is split into 18 smaller VCF files. The original intention was to parallelise over this many splits per sample, but see the note above. The interval list files and interval VCF files are written to appropriate locations within the `Reference` directory, so please make sure it is writable by you before you continue.
The remaining 4 steps (5 scripts) utilise the interval VCF files to execute three GetPileupSummaries jobs per sample, check the job outputs, and then concatenate the scattered interval tables into one GetPileupSummaries table per sample (after which, the scattered interval tables can be deleted).
For all PBS scripts described here, please adjust your NCI project code at `-P` and your `-l storage` requirements.
After ensuring that you have read-write access to your `Reference` directory and its subfolders, run this fast and light script on the login node:
bash gatk4_getpileupsummaries_splitIntervals.sh
This will create interval files in `../Reference/GetPileupSummaries_intervals`.
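For reference, a hedged sketch of the kind of SplitIntervals call the script wraps (scatter count and subdivision mode match the description above; the reference path is illustrative):

```bash
gatk SplitIntervals \
    -R ../Reference/hg38.fasta \
    --scatter-count 24 \
    --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
    -O ../Reference/GetPileupSummaries_intervals
```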
Then, check that the filepath to your large resource is correct within the script `gatk4_getpileupsummaries_split_common_vcf_run.sh`. The current default is `../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz`.
Run the following command to split the VCF according to the newly created 18 interval files:
bash gatk4_getpileupsummaries_split_common_vcf_run.sh
This will submit 18 separate PBS jobs, one per interval. This is very fast on the compute nodes but quite slow on the login node, thus the preference to schedule these small tasks on the cluster.
The output will be new gzipped VCF files and index files within the parent directory of the resource VCF, in a newly created sub-directory named after the VCF file prefix, e.g. `../Reference/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38`.
As usual, make the inputs, providing the filepath of your samples config file as first and only command line argument:
bash gatk4_getpileupsummaries_byInterval_make_input.sh <config-filepath>
The created inputs file should contain 3 x N lines, where N is the number of samples. Three 'chunks' of intervals are run per sample rather than scattering over all 18 intervals - see the justification above. The default walltime for the GetPileupSummaries job is 3 hours per job (where one job is one chunk of intervals for one sample). This is sufficient for a very high-coverage BAM (a ~174X sample required 2 hours). You may wish to reduce this to 1 or 2 hours in the script `gatk4_getpileupsummaries_byInterval.pbs`, depending on your sample coverage, in order to minimise potential queue time. The jobs run on the `hugemem` queue, which is a scarcer resource than the `normal` queue, so queue time may be a factor for large cohorts if the `hugemem` queue is in high demand at the time of processing. Changing to the `normal` or `normalbw` queue is generally not recommended, as CPU efficiency and SU usage are poorer, given the need to request more CPUs than GATK can use (1) due to memory requirements.
Once the inputs file is made, and you are happy with the walltime set in the PBS script, submit all chunks for all samples with:
bash gatk4_getpileupsummaries_byInterval_runLoop.sh
Once the jobs have completed, check all jobs with:
bash gatk4_getpileupsummaries_byInterval_check.sh
If all jobs were successful, the following message will be returned:
No issues detected. Please continue with gatk4_getpileupsummaries_byInterval_concat.sh
If issues were detected, a new inputs file for the failed tasks will be written to `Inputs/gatk4_getpileupsummaries.inputs-failed`. A complete set of re-run scripts for failed tasks has not been written. In the event of a small number of failures (e.g. on walltime), you can take a copy of the script `gatk4_getpileupsummaries_byInterval.pbs`, increase the walltime, adjust the inputs file, and supply the correct variables as in the original launcher script `gatk4_getpileupsummaries_byInterval_runLoop.sh`. It is most likely that only one or a handful of the intervals within a 'chunk' failed to complete (where the failure was due to insufficient walltime); in this case, resubmitting just those intervals, rather than the entire chunk, is wise to minimise wasted KSU. Given the planned move of this workflow away from `nci-parallel` and towards Nextflow, time has not been spent automating this level of detail here. If you need help with this, please reach out.
This step takes around 3-6 seconds per sample on the login node. Run:
bash gatk4_getpileupsummaries_byInterval_concat.sh <config-filepath>
This will take the scattered pileup tables for each sample from `../<cohortName>_GetPileupSummaries/scatter/` and concatenate them (avoiding redundant headers) into `../<cohortName>_GetPileupSummaries/<LabSampleID>_pileups.table`.
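A minimal sketch of the per-sample concatenation logic, assuming the scattered tables share an identical header block (the scatter file naming pattern shown is hypothetical; the repository script handles this for you):

```bash
#!/bin/bash
# Concatenate scattered GetPileupSummaries tables for one sample,
# keeping the metadata/header lines from the first table only
sample=PATIENT1-T1
outdir=../cohortName_GetPileupSummaries

first=true
for table in "${outdir}"/scatter/"${sample}"*_pileups.table; do
    if [ "$first" = true ]; then
        cat "$table"                               # keep header block
        first=false
    else
        grep -v -e '^#' -e '^contig' "$table"      # data rows only
    fi
done > "${outdir}/${sample}_pileups.table"
```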
Once you are satisfied that the final sample pileups are complete, you can delete the scatter directory:
rm -rf ../<cohortName>_GetPileupSummaries/scatter/
This shows compute usage for two samples: one with an 81 GB BAM file ('smallSample', approx. 73X raw coverage) and one with a 219 GB BAM ('largeSample', approx. 174X coverage). Each job was run on one Gadi `hugemem` CPU, allowing 28 GB of Java memory for the GATK tool.
The 4-digit strings within the job name column are the intervals processed within that chunked job. Note that the chunks contain different collections of intervals between the two samples, due to testing different interval divisions. The sum KSU and walltime metrics are still valid. The interval chunks shown for the 'large' sample are the current default.
#JobName | CPUs_requested | Mem_requested | Mem_used | CPUtime_mins | Walltime_mins | JobFS_used | Efficiency | Service_units |
---|---|---|---|---|---|---|---|---|
smallSample_0000-0001-0002-0003-0004.o | 1 | 31.0GB | 22.77GB | 62.7 | 63.43 | 685.84KB | 0.99 | 3.17 |
smallSample_0005-0006-0007-0008-0009-0010-0011.o | 1 | 31.0GB | 26.08GB | 64.2 | 64.93 | 688.83KB | 0.99 | 3.25 |
smallSample_0012-0013-0014-0015-0016-0017.o | 1 | 31.0GB | 19.15GB | 44.17 | 44.63 | 685.84KB | 0.99 | 2.23 |
largeSample_0000-0001-0002-0003.o | 1 | 31.0GB | 31.0GB | 88.88 | 89.08 | 684.34KB | 1 | 4.45 |
largeSample_0004-0005-0006-0007-0008-0009.o | 1 | 31.0GB | 31.0GB | 95.98 | 96.22 | 687.33KB | 1 | 4.81 |
largeSample_0010-0011-0012-0013-0014-0015-0016-0017.o | 1 | 31.0GB | 31.0GB | 111.68 | 111.98 | 688.83KB | 1 | 5.6 |
Calculate the fraction of reads coming from cross-sample contamination using `CalculateContamination`, with the pileup tables from `GetPileupSummaries` as inputs. The resulting contamination table is used in `FilterMutectCalls`. Create inputs with:
sh gatk4_calculatecontamination_make_input.sh /path/to/cohort.config
You are finally ready to obtain a filtered set of somatic variants using `FilterMutectCalls`. You will need the inputs from the previous steps, including:
- `<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz` (from Mutect2, using PoN)
- `<patient>-T<ID>_<patient>-N<ID>.unfiltered.vcf.gz.stats` (from Mutect2, genomic intervals gathered into a single stats file with `MergeMutectStats`)
- `tumor_segments.table` (from GetPileupSummaries & CalculateContamination)
- `<patient>-T<ID>_<patient>-N<ID>_contamination.table` (from GetPileupSummaries & CalculateContamination)
- `<patient>-T<ID>_<patient>-N<ID>_read-orientation-model.tar.gz` (from Mutect2 & LearnReadOrientationModel)
- Create input files for each task (`FilterMutectCalls` for a single tumour-normal pair) by:
sh gatk4_filtermutectcalls_make_input.sh /path/to/cohort.config
Adjust the compute resource requests in `gatk4_filtermutectcalls_run_parallel.pbs`, then submit your job with:
qsub gatk4_filtermutectcalls_run_parallel.pbs
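A hedged sketch of the per-pair filtering command each task runs (file names follow the inputs listed above; the filtered output name and reference path are illustrative):

```bash
pair=PATIENT1-T1_PATIENT1-N

gatk FilterMutectCalls \
    -R ../Reference/hg38.fasta \
    -V ../Mutect2/${pair}/${pair}.unfiltered.vcf.gz \
    --stats ../Mutect2/${pair}/${pair}.unfiltered.vcf.gz.stats \
    --tumor-segmentation ../Mutect2/${pair}/${pair}_segments.table \
    --contamination-table ../Mutect2/${pair}/${pair}_contamination.table \
    --ob-priors ../Mutect2/${pair}/${pair}_read-orientation-model.tar.gz \
    -O ../Mutect2/${pair}/${pair}.filtered.vcf.gz
```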
The following benchmarks were obtained from processing 20 tumour-normal pairs (34X and 70X, human samples). My apologies that the steps are not in chronological order 😊
#JobName | CPUs_requested | CPUs_used | Mem_requested | Mem_used | CPUtime | CPUtime_mins | Walltime_req | Walltime_used | Walltime_mins | JobFS_req | JobFS_used | Efficiency | Service_units |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gatk4_cohort_pon_144 | 144 | 144 | 4.39TB | 110.13GB | 16:39:15 | 999.25 | 5:00:00 | 0:24:51 | 24.85 | 1.46GB | 9.08MB | 0.28 | 178.92 |
gatk4_cohort_pon_48 | 48 | 48 | 1.46TB | 102.54GB | 17:26:26 | 1046.43 | 10:00:00 | 0:26:04 | 26.07 | 500.0MB | 9.08MB | 0.84 | 62.56 |
gatk4_cohort_pon_96 | 96 | 96 | 2.93TB | 110.97GB | 17:07:23 | 1027.38 | 10:00:00 | 0:26:07 | 26.12 | 1000.0MB | 9.07MB | 0.41 | 125.36 |
gatk4_cohort_pon_gather_sort | 1 | 1 | 18.0GB | 4.28GB | 0:02:18 | 2.3 | 1:00:00 | 0:03:15 | 3.25 | 100.0MB | 0B | 0.71 | 0.49 |
gatk4_getpileupsummaries_exac* | 48 | 48 | 190GB | 145.94GB | 10:42:43 | 642.72 | 15:00:00 | 01:03:58 | 63.97 | 100.0MB | 40.09MB | 0.21 | 102.35 |
gatk4_mutect2_1920 | 1920 | 1920 | 7.5TB | 6.61TB | 1890:56:30 | 113456.5 | 4:00:00 | 1:07:04 | 67.07 | 3.91GB | 32.77MB | 0.88 | 4292.27 |
gatk4_mutect2_2880 | 2880 | 2880 | 11.25TB | 9.81TB | 1963:57:43 | 117837.72 | 2:00:00 | 0:48:33 | 48.55 | 5.86GB | 33.38MB | 0.84 | 4660.8 |
gatk4_mutect2_3840 | 3840 | 3840 | 15.0TB | 13.07TB | 2033:47:24 | 122027.4 | 2:00:00 | 0:39:04 | 39.07 | 7.81GB | 32.77MB | 0.81 | 5000.53 |
gatk4_mutect2_960 | 960 | 960 | 3.75TB | 3.3TB | 1865:51:58 | 111951.97 | 4:00:00 | 2:05:44 | 125.73 | 1.95GB | 32.77MB | 0.93 | 4023.47 |
gatk4_pon_1920 | 1920 | 1920 | 7.5TB | 6.39TB | 1546:53:50 | 92813.83 | 2:00:00 | 0:50:06 | 50.1 | 3.91GB | 21.23MB | 0.96 | 3206.4 |
gatk4_pon_2880 | 2880 | 2880 | 11.25TB | 8.62TB | 1601:44:32 | 96104.53 | 2:00:00 | 0:35:13 | 35.22 | 5.86GB | 21.23MB | 0.95 | 3380.8 |
gatk4_pon_3840 | 3840 | 3840 | 15.0TB | 11.36TB | 1859:29:08 | 111569.13 | 2:00:00 | 0:31:43 | 31.72 | 7.81GB | 21.23MB | 0.92 | 4059.73 |
gatk4_pon_960 | 960 | 960 | 3.75TB | 3.27TB | 1537:26:16 | 92246.27 | 2:00:00 | 1:38:03 | 98.05 | 1.95GB | 21.1MB | 0.98 | 3137.6 |
gatk4_pon_gathervcfs_20 | 20 | 20 | 640.0GB | 42.43GB | 1:04:42 | 64.7 | 2:00:00 | 1:44:51 | 104.85 | 500.0MB | 8.09MB | 0.03 | 104.85 |
gatk4_pon_genomicsdbimport_144 | 144 | 144 | 4.39TB | 698.85GB | 22:28:42 | 1348.7 | 5:00:00 | 0:14:20 | 14.33 | 1.46GB | 8.96MB | 0.65 | 103.2 |
gatk4_pon_genomicsdbimport_48 | 48 | 48 | 1.46TB | 314.08GB | 21:27:48 | 1287.8 | 10:00:00 | 0:30:33 | 30.55 | 500.0MB | 8.95MB | 0.88 | 73.32 |
gatk4_pon_genomicsdbimport_96 | 96 | 96 | 2.93TB | 602.88GB | 21:48:05 | 1308.08 | 10:00:00 | 0:18:14 | 18.23 | 1000.0MB | 8.95MB | 0.75 | 87.52 |
*gatk4_getpileupsummaries_exac was benchmarked with 80 tumour (~70X) and normal (~35X) paired samples, using `common_biallelic=../Reference/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz`.
Chew, T., Willet, C., Samaha, G., Menadue, B. J., Downton, M., Sun, Y., Kobayashi, R., & Sadsad, R. (2021). Somatic-ShortV (Version 1.0) [Computer software]. https://doi.org/10.48546/workflowhub.workflow.148.1
Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities.
Suggested acknowledgements:
NCI Gadi
The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government.
GATK 4: Van der Auwera et al. 2013 https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/0471250953.bi1110s43
OpenMPI: Graham et al. 2015 https://dl.acm.org/doi/10.1007/11752578_29