Kids First Data Resource Center Pacific Biosciences Long Reads Alignment and Variant Calling Workflow

The Kids First Data Resource Center (KFDRC) Pacific Biosciences (PacBio) Long Reads Alignment and Variant Calling Workflow is a Common Workflow Language (CWL) implementation of several software tools that take read data generated by PacBio long-read sequencers and produce alignment and variant information. This pipeline was made possible thanks to significant software and support contributions from both Sentieon and Wang Genomics Lab. For more information on our collaborators, see their websites.

Relevant Software and Versions

Input Files

  • input_unaligned_bam: The primary input of the PacBio Long Reads Workflow is an unaligned BAM and associated index.
  • indexed_reference_fasta: Any suitable human reference genome. KFDRC uses Homo_sapiens_assembly38.fasta from Broad Institute.

Output Files

  • dnascope_small_variants: BGZIP and TABIX indexed VCF containing small variant calls made by Sentieon DNAScope HiFi on minimap2_aligned_bam.
  • longreadsum_bam_metrics: BGZIP TAR containing various metrics collected by LongReadSum from the minimap2_aligned_bam.
  • minimap2_aligned_bam: Indexed BAM file containing reads from the input_unaligned_bam aligned to the indexed_reference_fasta.
  • pbsv_structural_variants: BGZIP and TABIX indexed VCF containing structural variant calls made by pbsv on the minimap2_aligned_bam.
  • sniffles_structural_variants: BGZIP and TABIX indexed VCF containing structural variant calls made by Sniffles on the minimap2_aligned_bam.
  • longreadsv_structural_variants: BGZIP and TABIX indexed VCF containing structural variant calls made by Sentieon LongReadSV on the minimap2_aligned_bam.

Generalized Process

  1. Read group information (@RG) is harvested from the input_unaligned_bam header using samtools head and grep.
  2. If the user provides a biospecimen_name input, that value replaces the SM value pulled in the preceding step.
  3. Align input_unaligned_bam to indexed_reference_fasta with the above @RG information using samtools fastq, Sentieon Minimap2, and Sentieon sort.
  4. Generate long reads alignment metrics from the minimap2_aligned_bam using LongReadSum.
  5. Generate structural variant calls from the minimap2_aligned_bam using pbsv.
  6. Generate structural variant calls from the minimap2_aligned_bam using Sniffles.
  7. Generate structural variant calls from the minimap2_aligned_bam using Sentieon LongReadSV.
  8. If the reads are not CLR, generate small variant calls from the minimap2_aligned_bam using Sentieon DNAScope HiFi.
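
The read group handling in steps 1 and 2 can be illustrated with a minimal Python sketch. This is not the workflow's own code: the workflow uses samtools head and grep, while the sketch below works on header text that has already been extracted. The function names, the example header values, and the choice to append an SM field when none is present are all assumptions made for illustration.

```python
def extract_read_groups(header_text):
    """Return the @RG lines found in SAM/BAM header text (step 1)."""
    return [line for line in header_text.splitlines() if line.startswith("@RG")]


def replace_sample_name(rg_line, biospecimen_name=None):
    """Replace the SM field of an @RG line when a name is supplied (step 2).

    With no biospecimen_name the line is returned unchanged.
    Assumption: if the line has no SM field, one is appended.
    """
    if biospecimen_name is None:
        return rg_line
    fields = rg_line.split("\t")
    updated = []
    replaced = False
    for field in fields:
        if field.startswith("SM:"):
            updated.append("SM:" + biospecimen_name)
            replaced = True
        else:
            updated.append(field)
    if not replaced:
        updated.append("SM:" + biospecimen_name)
    return "\t".join(updated)


# Hypothetical header text, shaped like the output of `samtools head`.
header = "\n".join([
    "@HD\tVN:1.6\tSO:unknown",
    "@RG\tID:abc123\tPL:PACBIO\tPU:m64000_000000_000000\tSM:original_sample",
    "@PG\tID:ccs\tPN:ccs",
])

read_groups = extract_read_groups(header)
updated_groups = [replace_sample_name(rg, "BS_EXAMPLE") for rg in read_groups]
print(updated_groups[0])
```

The resulting @RG line is the kind of value that would then be passed to the aligner in step 3, so that the aligned BAM carries the intended sample name.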

Basic Info

References