Kids First Data Resource Center Pacific Biosciences Long Reads Alignment and Variant Calling Workflow
The Kids First Data Resource Center (KFDRC) Pacific Biosciences (PacBio) Long Reads Alignment and Variant Calling Workflow is a Common Workflow Language (CWL) implementation of several software tools that take read data generated by PacBio long-read sequencers and produce alignments and variant calls. This pipeline was made possible thanks to significant software and support contributions from both Sentieon and Wang Genomics Lab. For more information on our collaborators, check out their websites:
- Sentieon: https://www.sentieon.com/
- Wang Genomics Lab: https://wglab.org/

Software versions:
- samtools head: 1.17
- samtools fastq: 1.15.1
- Sentieon Minimap2: 202112.01
- Sentieon util sort: 202112.01
- Sentieon DNAScope HiFi: 202112.01
- Sentieon LongReadSV: 202112.06
- LongReadSum: 1.2.0
- Sniffles: 2.0.7
- pbsv: 2.9.0

Input files:
- input_unaligned_bam: The primary input of the PacBio Long Reads Workflow is an unaligned BAM and associated index.
- indexed_reference_fasta: Any suitable human reference genome. KFDRC uses Homo_sapiens_assembly38.fasta from Broad Institute.
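
As a rough illustration, the two inputs above could be collected into a CWL job object along these lines. This is a minimal sketch, not the workflow's documented invocation: the helper name `build_job` and the example paths are hypothetical, and it assumes each input is passed as a standard CWL `File` object whose index sits beside the primary file. The optional `biospecimen_name` value is the one described in the process steps.

```python
import json


def build_job(unaligned_bam, reference_fasta, biospecimen_name=None):
    """Build a CWL job object for the workflow inputs (hypothetical helper)."""
    job = {
        # CWL represents file inputs as objects with class "File" and a path.
        "input_unaligned_bam": {"class": "File", "path": unaligned_bam},
        "indexed_reference_fasta": {"class": "File", "path": reference_fasta},
    }
    # Only include the optional sample-name override when one is supplied.
    if biospecimen_name is not None:
        job["biospecimen_name"] = biospecimen_name
    return job


if __name__ == "__main__":
    # Example paths are placeholders, not real file locations.
    job = build_job("sample.unaligned.bam", "Homo_sapiens_assembly38.fasta")
    print(json.dumps(job, indent=2))
```

The resulting JSON could be saved to a file and supplied to a CWL runner as the job input, assuming the runner resolves the index files from the same directory.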

Output files:
- dnascope_small_variants: BGZIP and TABIX indexed VCF containing small variant calls made by Sentieon DNAScope HiFi on the minimap2_aligned_bam.
- longreadsum_bam_metrics: BGZIP TAR containing various metrics collected by LongReadSum from the minimap2_aligned_bam.
- minimap2_aligned_bam: Indexed BAM file containing reads from the input_unaligned_bam aligned to the indexed_reference_fasta.
- pbsv_structural_variants: BGZIP and TABIX indexed VCF containing structural variant calls made by pbsv on the minimap2_aligned_bam.
- sniffles_structural_variants: BGZIP and TABIX indexed VCF containing structural variant calls made by Sniffles on the minimap2_aligned_bam.
- longreadsv_structural_variants: BGZIP and TABIX indexed VCF containing structural variant calls made by Sentieon LongReadSV on the minimap2_aligned_bam.
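
Because small variant calling runs only when the reads are not CLR, the set of outputs depends on the read type. The sketch below makes that explicit. It assumes a skipped step simply produces no output; the function name `expected_outputs` and the `reads_are_clr` flag are hypothetical and are not workflow parameters.

```python
# Outputs produced regardless of read type, per the descriptions above.
ALWAYS_PRODUCED = [
    "minimap2_aligned_bam",
    "longreadsum_bam_metrics",
    "pbsv_structural_variants",
    "sniffles_structural_variants",
    "longreadsv_structural_variants",
]


def expected_outputs(reads_are_clr):
    """List the output names one would expect for a run (illustrative only)."""
    outputs = list(ALWAYS_PRODUCED)
    # DNAScope HiFi small variant calling is skipped for CLR reads.
    if not reads_are_clr:
        outputs.append("dnascope_small_variants")
    return outputs


if __name__ == "__main__":
    print(expected_outputs(reads_are_clr=False))
    print(expected_outputs(reads_are_clr=True))
```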

Workflow steps:
- Read group information (@RG) is harvested from the input_unaligned_bam header using samtools head and grep.
- If the user provides a biospecimen_name input, that value replaces the SM value pulled in the preceding step.
- Align input_unaligned_bam to indexed_reference_fasta with the above @RG information using samtools fastq, Sentieon Minimap2, and Sentieon sort.
- Generate long reads alignment metrics from the minimap2_aligned_bam using LongReadSum.
- Generate structural variant calls from the minimap2_aligned_bam using pbsv.
- Generate structural variant calls from the minimap2_aligned_bam using Sniffles.
- Generate structural variant calls from the minimap2_aligned_bam using Sentieon LongReadSV.
- If the reads are not CLR, generate small variant calls from the minimap2_aligned_bam using Sentieon DNAScope HiFi.
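
The first two steps, harvesting `@RG` lines and optionally overriding the `SM` value, can be sketched in Python as follows. This is not the workflow's implementation, which uses samtools head and grep; it only mirrors the logic on header text that has already been extracted. The function name `harvest_read_groups` is hypothetical, and appending an `SM` field when none exists is an assumption, since the source only describes replacing an existing value.

```python
def harvest_read_groups(header_text, biospecimen_name=None):
    """Return the @RG lines of a SAM header, optionally overriding SM."""
    read_groups = []
    for line in header_text.splitlines():
        # Keep only read group records, as grep would.
        if not line.startswith("@RG"):
            continue
        fields = line.split("\t")
        if biospecimen_name is not None:
            sm_field = "SM:" + biospecimen_name
            if any(f.startswith("SM:") for f in fields):
                # Replace the existing sample name in place.
                fields = [sm_field if f.startswith("SM:") else f for f in fields]
            else:
                # Assumption: add SM when the header has none.
                fields.append(sm_field)
        read_groups.append("\t".join(fields))
    return read_groups


if __name__ == "__main__":
    # Made-up header for illustration only.
    header = "@HD\tVN:1.6\tSO:unknown\n@RG\tID:abc\tPL:PACBIO\tSM:old\n"
    for rg in harvest_read_groups(header, biospecimen_name="BS_123"):
        print(rg)
```

Replacing the field in place keeps the original tag order, so the read group line passed to the aligner differs from the source header only in its sample name.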
- D3b dockerfiles
- Testing Tools:

References:
- KFDRC AWS s3 bucket: s3://kids-first-seq-data/broad-references/
- Cavatica: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
- Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/