Version 3.1 (#19)

- Improve PMS2/PMS2CL differentiation - Output protein changes at five potentially pathogenic sites in OPN1LW/OPN1MW - Update region definitions for some families - Add a few regions for fusion calling - CYP2D6, GBA, CYP11B1, the CFH gene cluster - Improve VCFs - All gene copies are now in a single VCF file for each region per sample, reported as sample columns in the VCF. - Report boundary coordinates and the truncated status of a haplotype in the VCF. - Report groups of haplotypes on the same chromosome when this information is available.
PacificBiosciences · Apr 17, 2024 · dba17f8 · dba17f8
1 parent ad0b0f3
commit dba17f8
Show file tree

Hide file tree

Showing 49 changed files with 5,068 additions and 706 deletions.
diff --git a/README.md b/README.md
@@ -2,14 +2,15 @@
 
 <h1 align="center">Paraphase</h1>
 
-# Paraphase: HiFi-based caller for highly homologous genes
+<h3 align="center">HiFi-based caller for highly similar paralogous genes</h3>
 
 Many medically relevant genes fall into 'dark' regions where variant calling is limited due to high sequence homology with paralogs or pseudogenes. Paraphase is a Python tool that takes HiFi aligned BAMs as input (whole-genome or enrichment), phases haplotypes for genes of the same family, determines copy numbers and makes phased variant calls. 
 
 ![Paraphase diagram](docs/figures/paraphase_diagram.png)
-Paraphase takes all reads from a gene family, realigns to just the gene of interest and then phases them into haplotypes. This solves the problem of alignment difficulty due to sequence homology and allows us to examine all copies of genes in a gene family and call copy number changes and other variants.
+Paraphase takes all reads from a gene family, realigns to one representative gene of the family and then phases them into haplotypes. This approach bypasses the error-prone process of aligning reads to multiple similar regions and allows us to examine all copies of genes in a gene family. This gene-family-centered approach allows Paraphase to perform well when there is a copy number difference between an individual and the reference, as is often the case in segmental duplications.
+Futhermore, this approach also streamlines sequence comparisons between genes within the same family, making it straightforward to conduct analyses such as identifying non-allelic gene conversions.  
 
-Paraphase supports 161 segmental duplication [regions](docs/regions.md) in GRCh38. Among these, there are 11 medically relevant regions that are also supported in GRCh37/hg19:
+Paraphase supports 160 segmental duplication [regions](docs/regions.md) in GRCh38. Among these, there are 11 medically relevant regions that are also supported in GRCh37/hg19:
 - SMN1/SMN2 (spinal muscular atrophy)
 - RCCX module
   - CYP21A2 (21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia)
@@ -24,6 +25,9 @@ Paraphase supports 161 segmental duplication [regions](docs/regions.md) in GRCh3
 - CFC1 (heterotaxy syndrome)
 - OPN1LW/OPN1MW (color vision deficiencies)
 - HBA1/HBA2 (Alpha-Thalassemia)
+- GBA (Gaucher disease and Parkison's disease)
+- CYP11B1/CYP11B2 (Glucocorticoid-remediable aldosteronism)
+- CFH/CFHR1/CFHR2/CFHR3/CFHR4 (large deletions/duplications, atypical hemolytic uremic syndrome and age-related macular degeneration)
 
 Please check out our [paper](https://www.cell.com/ajhg/fulltext/S0002-9297(23)00001-0) on its application to the gene SMN1 for more details about Paraphase.   
 Chen X, Harting J, Farrow E, et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. The American Journal of Human Genetics. 2023;0(0). doi:10.1016/j.ajhg.2023.01.001
@@ -78,32 +82,28 @@ Please note that the input BAM should be one that's aligned to the ENTIRE refere
 Optional parameters:
 - `-g`: Region(s) to analyze, separated by comma. All supported [regions](docs/regions.md) will be analyzed if not specified. Please use region name, i.e. first column in the doc.
 - `-t`: Number of threads.
+- `-p`: Prefix of output files when the input is a single sample, i.e. use with `-b`. If not provided, prefix will be extracted from the name of the input BAM. 
 - `--genome`: Genome reference build. Default is `38`. If `37` or `19` is specified, Paraphase will run the analysis for GRCh37 or hg19, respectively (note that only 11 medically relevant [regions](docs/regions.md) are supported now for GRCh37/hg19).
-- `gene1only`: If specified, variants calls will be made against the main gene only for SMN1, PMS2, STRC, NCF1 and IKBKG, see [below](#interpreting-the-output).
+- `--gene1only`: If specified, variants calls will be made against the main gene only for SMN1, PMS2, STRC, NCF1 and IKBKG, see more information [here](docs/vcf.md).
 - `--novcf`: If specified, no VCF files will be produced.
 - `--samtools`: path to samtools. If the paths to samtools or minimap2 are not already in the PATH environment variable, they can be provided through the `--samtools` and `--minimap2` parameters.
 - `--minimap2`: path to minimap2
 
 ## Interpreting the output
 
-Paraphase produces a few output files in the directory specified by `-o`, with the sample ID as the prefix.
+Paraphase produces a few output files in the directory specified by `-o`, with the specified or default prefix.
 
-1. `.vcf` in `sampleID_vcfs` folder. A VCF file is written for each haplotype per gene family. There is also a `_variants.vcf` file containing merged variants from all haplotypes for each gene family. Note that this is not a diploid vcf as there are usually more than two copies of genes in a gene family in a sample.
+1. `.vcf` in `${prefix}_paraphase_vcfs` folder. A VCF file is written for each region (gene family). More descriptions on the VCF can be found [here](docs/vcf.md).
 
-As genes of the same family can be highly similar to each other in sequence and not easy to differentiate (at the sequence level or even at the functional level), variant calls are made against one selected "main" gene from the gene family (e.g. the functional gene is selected when the family has a gene and a pseudogene). In this way, all copies of the gene family can be evaluated for pathogenic variants and one can calculate the copy number of the functional genes in the family and hence infer the disease/carrier status.
+2. `.paraphase.bam`: This BAM file can be loaded into IGV for visualization of haplotypes (group reads by `HP` tag and color alignments by `YC` tag). All haplotypes are aligned against the main gene of interest. Tutorials/Examples are provided for medically relevant genes (See below).  
 
-Exceptions are SMN1 (paralog SMN2), PMS2 (pseudogene PMS2CL), STRC (pseudogene STRCP1), NCF1 (pseudogenes NCF1B and NCF1C) and IKBKG (pseudogene IKBKGP1), where gene differentiation is possible. In these families, haplotypes are assigned to each gene in the family, i.e. gene or paralog/pseudogene, and variants are called against the gene (or paralog/pseudogene) for the gene (or paralog/pseudogene) haplotypes, respectively. Variants calls can be made against the main gene only for these five families if `--gene1only` is specified. 
-
-2. `_realigned_tagged.bam`: This BAM file can be loaded into IGV for visualization of haplotypes (group reads by `HP` tag and color alignments by `YC` tag). All haplotypes are aligned against the main gene of interest. Tutorials/Examples are provided for medically relevant genes (See below).  
-
-3. `.json`: Output file summarizing haplotypes and variant calls for each gene family in each sample. In brief, a few generally used fields are explained below.
+3. `.paraphase.json`: Output file summarizing haplotypes and variant calls for each gene family in each sample. In brief, a few generally used fields are explained below.
 - `final_haplotypes`: phased haplotypes for all gene copies in a gene family
 - `total_cn`: total copy number of the family (sum of gene and paralog/pseudogene)
 - `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
 - `haplotype_details`: lists information about each haplotype 
   - `boundary`: the boundary of the region that is resolved on the haplotype. This is useful when a haplotype is only partially phased.
 - `alleles_final`: haplotypes phased into alleles. This is possible when the segmental duplication is in tandem.
-- `region_depth`: median depth of the gene family (include all copies of gene and paralog/pseudogene) 
 
 Tutorials/Examples are provided for interpreting the `json` output and visualizing haplotypes for medically relevant genes listed below: 
 - [SMN1/SMN2](docs/SMN1_SMN2.md)
@@ -116,3 +116,6 @@ Tutorials/Examples are provided for interpreting the `json` output and visualizi
 - [F8](docs/F8.md)
 - [NEB](docs/NEB.md)
 - [NCF1](docs/NCF1.md)
+- [GBA](docs/GBA.md)
+- [CFH gene cluster](docs/CFH.md)
+
diff --git a/docs/CFH.md b/docs/CFH.md
@@ -0,0 +1,25 @@
+# CFH gene cluster
+
+The CFH gene cluster is a ~250kb genomic region that contains several genes CFH/CFHR1/CFHR2/CFHR3/CFHR4. This region is divided into two pairs of homology regions, where unequal crossing overs can lead to large deletions or duplications. These SVs are related to diseases such as atypical hemolytic uremic syndrome and age-related macular degeneration. Some of these SVs are quite common in the population. 
+
+Paraphase resolves gene copies in two homology regions (named `CFH` and `CFHR3` in the config), and summarizes results under `CFHclust` in the `json` file. To analyze this region specifically, use `-g CFH,CFHR3` in the command. Note that only SVs/fusions are called in this region. No VCF is produced, as the sequence similarity is low enough so variant calling should be accurate using standard variant callers.
+
+The `CFH` region contains the end of the CFH gene and the intergenic region between CFH and CFHR3. The `CFHR3` region contains part of CFHR3 all the way to part of CFHR1. In the genome, the order of these homology regions is as follows (also see examples below):
+
+`CFH`, followed by `CFHR3`, followed by `CFH(paralog)`, and followed by `CFHR3(paralog)`.
+
+## Fields in the `json` file
+
+- `fusions_called`: fusions created by deletion or duplication of the region betweeen two breakpoints. Reports the SV type (deletion or duplication) and the breakpoint coordinates.
+
+## Visualizing haplotypes
+
+To visualize phased haplotypes, load the output bam file in IGV, group reads by the `HP` tag and color alignments by `YC` tag. 
+
+Reads in gray are either unassigned or consistent with more than one possible haplotype. When two haplotypes are identical over a region, there can be more than one haplotype consistent with a read, and the read is randomly assigned to a haplotype and colored in gray. 
+
+![CFH example](figures/CFH.png)
+
+- The top panel shows a sample with no CNV. Left is the `CFH` region/module analyzed by Paraphase and the right is the `CFHR3` region. Paraphase resolves four copies for each region. In either region, two of the four copies are shorter with more mismatches, representing the paralogs that can no longer align beyond the end of the homology region.
+- The middle panel shows a sample with a deletion (`CFH_hap1`) in the `CFH` region (left), where the 5' end is longer and the 3' end is shorter. The red arrow marks the deletion breakpoint. The other side of the breakpoint can be found in the `fusions_called` field under `CFHclust` in the `json`. The `CFHR3` region is covered by the deletion so there are also only three copies found in this region. This SV is a deletion of CFHR3+CFHR1.
+- The bottom panel shows a sample with a different deletion (`CFHR3_hap2`) in the `CFHR3` region (right), where the 5' end is longer and the 3' end is shorter. The red arrow marks the deletion breakpoint. The other side of the breakpoint can be found in the `fusions_called` field under `CFHclust` in the `json`. The `CFH` region (the paralogous side) is covered by the deletion so there are also only three copies found in this region. This SV is a deletion of CFHR1+CFHR4 (CFHR4 is the gene downstream of CFHR1).
diff --git a/docs/GBA.md b/docs/GBA.md
@@ -0,0 +1,19 @@
+# GBA
+
+Pathogenic variants in GBA cause Gaucher disease and leads to an increased risk of Parkinson’s disease. GBA has sequence homology with its pseudogene GBAP1 particularly in its last three exons, where gene conversion can bring pseudogene-like variants into GBA. Unequal crossing overs can result in fusion genes between GBA and GBAP1.
+
+## Fields in the `json` file
+
+- `fusions_called`: fusions created by deletion or duplication of the region betweeen two breakpoints. Reports the SV type (deletion or duplication) and the breakpoints.
+
+## Visualizing haplotypes
+
+To visualize phased haplotypes, load the output bam file in IGV, group reads by the `HP` tag and color alignments by `YC` tag. 
+
+Reads in gray are either unassigned or consistent with more than one possible haplotype. When two haplotypes are identical over a region, there can be more than one haplotype consistent with a read, and the read is randomly assigned to a haplotype and colored in gray. 
+
+![GBA example](figures/GBA.png)
+
+- The top panel shows a sample with two copies of GBA and two copies of GBAP1. The GBAP1 copies are shorter because they can no longer align beyond the end of the homology region.
+- The middle panel shows a sample with a deletion (`hap3`), where the 5' end is consistent with the shorter GBAP1 and the 3' end is consistent with the longer GBA. The red arrow marks the deletion breakpoint. The other side of the breakpoint can be found in the `fusions_called` field in the `json`.
+- The bottom panel shows a sample with a duplicaton (`hap2`), where the 5' end is consistent with the longer GBA and the 3' end is consistent with the shorter GBAP1. The red arrow marks the duplication breakpoint. The other side of the breakpoint can be found in the `fusions_called` field in the `json`.
diff --git a/docs/figures/CFH.png b/docs/figures/CFH.png
diff --git a/docs/figures/GBA.png b/docs/figures/GBA.png