Workflow for analyzing human PacBio whole genome sequencing (WGS) data using Workflow Description Language (WDL).
- Docker images used by this workflow are defined in the wdl-dockerfiles repo. Images are hosted in PacBio's quay.io.
- Common tasks that may be reused within or between workflows are defined in the wdl-common repo.
Workflow entrypoint: workflows/main.wdl
PacBio WGS Variant Pipeline performs read alignment, variant calling, and phasing. Joint-calling of small variants and structural variants for cohorts and optional variant filtering and annotation is also available for HiFi human WGS. The workflow can run using Azure, AWS, GCP, and HPC backends.
We recommend cloning the repo rather than downloading the release package. Some tasks and workflows are pulled in from other repositories. Ensure you have initialized submodules following cloning by running git submodule update --init --recursive
.
The workflow requires at minimum 64 cores and 256 GB of RAM. Ensure that the backend environment you're using has enough quota to run the workflow.
Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and template inputs files for each backend with paths to publicly hosted reference files filled out.
- Select a backend environment
- Configure a workflow execution engine in the chosen environment
- Fill out the inputs JSON file for your cohort
- Run the workflow
The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will largely be determined by the location of your data.
For backend-specific configuration, see the relevant documentation:
An execution engine is required to run workflows. Two popular engines for running WDL-based workflows are miniwdl
and Cromwell
.
Because workflow dependencies are containerized, a container runtime is required. This workflow has been tested with Docker and Singularity container runtimes.
See the backend-specific documentation for details on setting up an engine.
Engine | Azure | AWS | GCP | HPC |
---|---|---|---|---|
miniwdl | Unsupported | Supported via the Amazon Genomics CLI | Unsupported | (SLURM only) Supported via the miniwdl-slurm plugin |
Cromwell | Supported via Cromwell on Azure | Supported via the Amazon Genomics CLI | Supported via Google's Pipelines API | Supported - Configuration varies depending on HPC infrastructure |
The input to a workflow run is defined in JSON format. Template input files with reference dataset information filled out are available for each backend:
Using the appropriate inputs template file, fill in the cohort and sample information (see Workflow Inputs for more information on the input structure).
If using an HPC backend, you will need to download the reference bundle and replace the <local_path_prefix>
in the input template file with the local path to the reference datasets on your HPC.
Run the workflow using the engine and backend that you have configured (miniwdl, Cromwell).
Note that the calls to miniwdl
and Cromwell
assume you are accessing the engine directly on the machine on which it has been deployed. Depending on the backend you have configured, you may be able to submit workflows using different methods (e.g. using trigger files in Azure, or using the Amazon Genomics CLI in AWS).
miniwdl run workflows/main.wdl -i <input_file_path.json>
java -jar <cromwell_jar_path> run workflows/main.wdl -i <input_file_path.json>
If Cromwell is running in server mode, the workflow can be submitted using cURL. Fill in the values of CROMWELL_URL and INPUTS_JSON below, then from the root of the repository, run:
# The base URL (and port, if applicable) of your Cromwell server
CROMWELL_URL=
# The path to your inputs JSON file
INPUTS_JSON=
(cd workflows && zip -r dependencies.zip humanwgs_structs.wdl cohort_analysis/ sample_analysis/ tertiary_analysis/ wdl-common/)
curl -X "POST" \
"${CROMWELL_URL}/api/workflows/v1" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "workflowSource=@workflows/main.wdl" \
-F "workflowInputs=@${INPUTS_JSON};type=application/json" \
-F "workflowDependencies=@workflows/dependencies.zip;type=application/zip"
To specify workflow options, add the following to the request (assuming your options file is a file called options.json
located in the pwd
): -F "[email protected];type=application/json"
.
This section describes the inputs required for a run of the workflow. Typically, only the humanwgs.cohort
and potentially run/backend-specific sections will be filled out by the user for each run of the workflow. Input templates with reference file locations filled out are provided for each backend.
A cohort can include one or more samples. Samples need not be related, but if you plan to run tertiary analysis, it is best to think of a cohort as a family of related samples. We have tested cohorts of up to 5 samples with 30x coverage. Larger cohorts may encounter memory issues during joint calling.
Type | Name | Description | Notes |
---|---|---|---|
String | cohort_id | A unique name for the cohort; used to name outputs | |
Array[Sample] | samples | The set of samples for the cohort. At least one sample must be defined. | |
Array[String] | phenotypes | Human Phenotype Ontology (HPO) phenotypes associated with the cohort. If no particular phenotypes are desired, the root HPO term, "HP:0000001" , can be used. |
Sample information for each sample in the workflow run.
Type | Name | Description | Notes |
---|---|---|---|
String | sample_id | A unique name for the sample; used to name outputs | |
Array[IndexData] | movie_bams | The set of unaligned movie BAMs associated with this sample | |
String? | sex | Sample sex | ["MALE", "FEMALE", null ]. If the sex field is missing or null , sex will be set to unknown. Used to set the expected sex chromosome karyotype for TRGT and HiFiCNV. |
Boolean | affected | Is this sample affected by the phenotype? | [true , false ] |
String? | father_id | Paternal sample_id |
|
String? | mother_id | Maternal sample_id |
Files associated with the reference genome.
These files are hosted publicly in each of the cloud backends; see backends/${backend}/inputs.${backend}.json
.
Type | Name | Description | Notes |
---|---|---|---|
String | name | Reference name; used to name outputs (e.g., "GRCh38") | Note: The workflow currently only supports GRCh38 and provides GCA_000001405.15_GRCh38_no_alt_analysis_set. |
IndexData | fasta | Reference genome and index | |
File | tandem_repeat_bed | Tandem repeat locations used by pbsv to normalize SV representation | |
File | trgt_tandem_repeat_bed | Tandem repeat sites to be genotyped by TRGT | |
IndexData | hificnv_exclude_bed | Compressed BED and index of regions to exclude from calling by HiFiCNV. We recommend cnv.excluded_regions.common_50.hg38.bed.gz. | |
File | hificnv_expected_bed_male | BED of expected copy number for male karyotype for HiFiCNV | |
File | hificnv_expected_bed_female | BED of expected copy number for female karyotype for HiFiCNV | |
File? | gnomad_af | gnomAD v3.1 allele frequences in slivar gnotate format |
required if run_tertiary_analysis is set to true |
File? | hprc_af | Allele frequences in ~100 Human Pangenome Reference Consortium (HPRC) samples in slivar gnotate format |
required if run_tertiary_analysis is set to true |
File? | gff | Ensembl GFF3 reference annotation | required if run_tertiary_analysis is set to true |
Array[IndexData?] | population_vcfs | An array of structural variant population VCFs | required if run_tertiary_analysis is set to true |
Files associated with slivar
annotation. These are required if run_tertiary_analysis
is set to true
.
These files are hosted publicly in each of the cloud backends; see backends/${backend}/inputs.${backend}.json
.
Type | Name | Description | Notes |
---|---|---|---|
File | slivar_js | Additional javascript functions for slivar | |
File | hpo_terms | HPO annotation lookups | |
File | hpo_dag | HPO annotation lookups | |
File | hpo_annotations | HPO annotation lookups | |
File | ensembl_to_hgnc | Ensembl to HGNC gene mapping | |
File | lof_lookup | Loss-of-function scores per gene | |
File | clinvar_lookup | ClinVar annotations per gene |
Type | Name | Description | Notes |
---|---|---|---|
String? | deepvariant_version | Version of deepvariant to use ["1.5.0"] | |
DeepVariantModel? | deepvariant_model | Optional alternate DeepVariant model file to use | |
Int? | pbsv_call_mem_gb | Optionally set RAM (GB) for pbsv_call during cohort analysis | |
Int? | glnexus_mem_gb | Optionally set RAM (GB) for GLnexus during cohort analysis | |
Boolean? | run_tertiary_analysis | Run the optional tertiary analysis steps [false ] |
[true , false ] |
String | backend | Backend where the workflow will be executed | ["Azure", "AWS", "GCP", "HPC"] |
String? | zones | Zones where compute will take place; required if backend is set to 'AWS' or 'GCP'. | |
String? | aws_spot_queue_arn | Queue ARN for the spot batch queue; required if backend is set to 'AWS' and preemptible is set to true |
Determining the AWS queue ARN |
String? | aws_on_demand_queue_arn | Queue ARN for the on demand batch queue; required if backend is set to 'AWS' and preemptible is set to false |
Determining the AWS queue ARN |
String? | container_registry | Container registry where workflow images are hosted. If left blank, PacBio's public Quay.io registry will be used. | |
Boolean | preemptible | If set to true , run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false , on-demand VMs will be used for every task. Ignored if backend is set to HPC. |
[true , false ] |
These files will be output for each sample defined in the cohort.
Type | Name | Description | Notes |
---|---|---|---|
Array[Array[File]] | bam_stats | TSV of length and quality for each read (per input BAM) | |
Array[Array[File]] | read_length_summary | For each input BAM, read length distribution (per input BAM) | |
Array[Array[File]] | read_quality_summary | For each input BAM, read quality distribution (per input BAM) | |
Array[IndexData] | small_variant_gvcfs | Small variants (SNPs and INDELs < 50bp) gVCFs called by DeepVariant (with index) | |
Array[File] | small_variant_vcf_stats | bcftools stats summary statistics for small variants |
|
Array[File] | small_variant_roh_out | Output of bcftools roh using --AF-dflt 0.4 |
|
Array[File] | small_variant_roh_bed | Regions of homozygosity determiend by bcftools roh using --AF-dflt 0.4 |
|
Array[IndexData] | sample_phased_small_variant_vcfs | Small variants called by DeepVariant and phased by HiPhase (with index) | |
Array[IndexData] | sample_phased_sv_vcfs | Structural variants called by pbsv and phased by HiPhase (with index) | |
Array[File] | sample_hiphase_stats | Phase block summary statistics written by HiPhase | |
Array[File] | sample_hiphase_blocks | Phase block list written by HiPhase | |
Array[File] | sample_hiphase_haplotags | Per-read haplotag information, written by HiPhase | |
Array[IndexData] | merged_haplotagged_bam | Aligned (by pbmm2), haplotagged (by HiPhase) reads (with index) | |
Array[File] | haplotagged_bam_mosdepth_summary | mosdepth summary of median depths per chromosome | |
Array[File] | haplotagged_bam_mosdepth_region_bed | mosdepthhttps://github.com/brentp/mosdepth BED of median coverage depth per 500 bp window | |
Array[IndexData] | trgt_repeat_vcf | Tandem repeat genotypes from TRGT (with index) | |
Array[IndexData] | trgt_spanning_reads | Fragments of HiFi reads spanning loci genotyped by TRGT (with index) | |
Array[File] | trgt_dropouts | Regions with insufficient coverage to genotype by TRGT | |
Array[Array[File]] | cpg_pileup_beds | 5mCpG site methylation probability pileups generated by pb-CpG-tools | |
Array[Array[File]] | cpg_pileup_bigwigs | 5mCpG site methylation probability pileups generated by pb-CpG-tools | |
Array[File] | paraphase_output | Output generated by Paraphase | |
Array[IndexData] | paraphase_realigned_bam | Realigned BAM for selected medically relevant genes in segmental duplications (with index), generated by Paraphase | |
Array[Array[File]] | paraphase_vcfs | Phased Variant calls for selected medically relevant genes in segmental duplications, generated by Paraphase | |
Array[IndexData] | hificnv_vcfs | VCF output containing copy number variant calls for the sample from HiFiCNV | |
Array[File] | hificnv_copynum_bedgraphs | Copy number values calculated for each region | |
Array[File] | hificnv_depth_bws | Bigwig file containing the depth measurements from HiFiCNV | |
Array[File] | hificnv_maf_bws | Bigwig file containing the minor allele frequency measurements from DeepVariant, generated by HiFiCNV |
These files will be output if the cohort includes more than one sample.
Type | Name | Description | Notes |
---|---|---|---|
IndexData? | cohort_small_variant_vcf | Small variants called by DeepVariant, joint-called by GLnexus, and phased by HiPhase (with index) | |
IndexData? | cohort_sv_vcf | Structural variants joint-called by pbsv and phased by HiPhase (with index) | |
File? | cohort_hiphase_stats | Phase block summary statistics written by HiPhase | |
File? | cohort_hiphase_blocks | Phase block list written by HiPhase |
These files will be output for each run of the workflow if run_tertiary_analysis
is set to true
. The files that are being annotated will depend on whether the number of samples is equal to or greater than one:
- If the number of samples is equal to one, the files being annotated in this step are the sample small variant VCF and SV VCF.
- If the number of samples is greater than one, the files being annotated in this step are the phased, joint-called small variant VCF and the cohort SV VCF.
Type | Name | Description | Notes |
---|---|---|---|
IndexData? | filtered_small_variant_vcf | Small variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and bcftools csq |
|
IndexData? | compound_het_small_variant_vcf | Compound heterozygotes annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and bcftools csq |
|
File? | filtered_small_variant_tsv | Filtered VCFs are reformatted as a human-readable TSV by slivar tsv |
|
File? | compound_het_small_variant_tsv | Filtered VCFs are reformatted as a human-readable TSV by slivar tsv |
|
IndexData? | filtered_svpack_vcf | Structural variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using svpack | |
File? | filtered_svpack_tsv | Filtered VCFs are reformatted as a human-readable TSV by slivar tsv |
Docker images definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio's quay.io. Docker images used in the workflow are pegged to specific versions by referring to their digests rather than tags.
The Docker image used by a particular step of the workflow can be identified by looking at the docker
key in the runtime
block for the given task. Images can be referenced in the following table by looking for the name after the final /
character and before the @sha256:...
. For example, the image referred to here is "align_hifiasm":
~{runtime_attributes.container_registry}/align_hifiasm@sha256:3968cb<...>b01f80fe
Image | Major tool versions | Links |
---|---|---|
bcftools | Dockerfile | |
deepvariant | User-defined; default is version 1.5.0 | DeepVariant GitHub |
glnexus | GLnexus GitHub | |
hificnv | Dockerfile | |
hiphase | Dockerfile | |
htslib | Dockerfile | |
mosdepth | Dockerfile | |
paraphase | Dockerfile | |
pb-cpg-tools | Dockerfile | |
pbmm2 | Dockerfile | |
pbsv | Dockerfile | |
pyyaml | Dockerfile | |
samtools | Dockerfile | |
slivar | Dockerfile | |
svpack | Dockerfile | |
trgt | Dockerfile |
TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.