forked from awilfert/PSAP-pipeline
PSAP_OUTPUT_GUIDE.txt
Output from PSAP pipeline
FILENAME.avinput.hg19_multianno.txt
This file is the output of the ANNOVAR software. It contains annotations for gene name, amino acid change, genomic function, and exonic function (if applicable) according to GencodeV19, allele frequencies from 3 population genetic databases (1000 Genomes Sep 2014, ESP6500, and ExAC (Mac63k_Freq)), and CADD scores for all variants in the provided VCF file. No analyses are performed to generate this file.
FILENAME_popScore.txt
This file contains PSAP values for the candidate variants under each genetic model. Generating this file requires that FILENAME.avinput.hg19_multianno.txt exists, and uses FILENAME.avinput (THIS FILE IS GENERATED BY ANNOVAR) and FILENAME.ped (THIS FILE IS PROVIDED BY THE USER) as input. Before the data are annotated with PSAP values, the input file is cleaned. Specifically, genes that are known to be problematic for the pipeline are removed; variants with major allele frequency discrepancies are removed (eg. missing from the 61,000 exome population genetic database (ExAC, annotated as Mac63k_Freq), but present at greater than 1% allele frequency in EVS (6,500 exomes, annotated as esp6500si) or 1000 Genomes (2,500 exomes, annotated as 1000g2014sep)); variants with missing CADD scores that are not indels are removed; variants that are not located within the CDS are removed (the data used to create the PSAP null distributions do not contain variants outside of the GencodeV19 CDS, so predictions for these variants would be unreliable); and variants with Mendelian inconsistencies are removed (eg. if mom and dad are both homozygous for the reference allele, the child cannot have the alternative allele; de novo variants, however, are allowed). Please note, de novo variants cannot be accurately identified from a VCF file and will require further validation.
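The allele frequency discrepancy filter described above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the dictionary keys reuse the annotation labels named in this guide (Mac63k_Freq, esp6500si, 1000g2014sep), while representing a missing frequency as None and the helper name has_frequency_discrepancy are assumptions made for the example. The variant records are made up.

```python
def has_frequency_discrepancy(variant, threshold=0.01):
    """Return True if a variant is missing from ExAC but common elsewhere.

    Assumption: a missing frequency is stored as None. The keys mirror the
    annotation labels mentioned in this guide; real column names may differ.
    """
    missing_in_exac = variant.get("Mac63k_Freq") is None
    other_freqs = [variant.get("esp6500si"), variant.get("1000g2014sep")]
    common_elsewhere = any(f is not None and f > threshold for f in other_freqs)
    return missing_in_exac and common_elsewhere


# Made-up example records (not real variants).
variants = [
    {"id": "v1", "Mac63k_Freq": None, "esp6500si": 0.05, "1000g2014sep": None},
    {"id": "v2", "Mac63k_Freq": 0.002, "esp6500si": 0.05, "1000g2014sep": 0.04},
    {"id": "v3", "Mac63k_Freq": None, "esp6500si": None, "1000g2014sep": 0.001},
]

kept = [v["id"] for v in variants if not has_frequency_discrepancy(v)]
print(kept)  # ['v2', 'v3']
```

Only v1 is dropped: it is absent from ExAC yet above 1% in EVS. v3 is also absent from ExAC, but it is rare in the other databases, so it is kept.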
In the family analysis three columns are generated for each member of the family. All columns corresponding to an individual are labeled according to the individual IDs used in the VCF header and ped files. The genotypes for the individual can be found in the column with the individual's ID (eg. INDV01). The disease model considered for the variant listed can be found under the Dz.Model column with the individual's ID (eg. Dz.Model.INDV01). The PSAP value for the variant, as determined by the genetic model considered and the gene, can be found under the popScore column with the individual's ID (eg. popScore.INDV01). Each member of the family is annotated with PSAP values for each disease model, so it is possible to have three PSAP values for a given gene in an individual (eg. a family of 3 individuals can have as many as 9 popScore annotations in a gene: one for each genetic model in each person).
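When parsing the family output, it can help to build the expected column names from the individual IDs. The sketch below follows the naming pattern given above (ID, Dz.Model.ID, popScore.ID); the helper name family_columns and the column order are assumptions for illustration only.

```python
def family_columns(individual_ids):
    """Build the three per-individual column names described in this guide.

    Assumption: the columns appear in the order genotype, disease model,
    popScore for each individual; the real file may order them differently.
    """
    columns = []
    for ind in individual_ids:
        columns += [ind, f"Dz.Model.{ind}", f"popScore.{ind}"]
    return columns


cols = family_columns(["INDV01", "INDV02", "INDV03"])
print(cols[:3])   # ['INDV01', 'Dz.Model.INDV01', 'popScore.INDV01']
print(len(cols))  # 9
```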
In the individual analysis only three columns are generated, and a separate output file is generated for each individual. The genotypes for the individual can be found in the Geno column. The disease model considered for the variant listed can be found under the Dz.Model column. The PSAP value for the variant, as determined by the genetic model considered and the gene, can be found under the popScore column. Please note, all disease models are considered, so it is possible to have three PSAP values for a given gene in an individual.
The three genetic models considered by the PSAP pipeline are homozygous recessive (REC-hom), compound heterozygous recessive (REC-chet), and heterozygous dominant (DOM-het). The candidate variant for the homozygous recessive model is the homozygous variant in the gene with the largest CADD score. This model is considered for all genes with at least one homozygous SNV. The candidate variant for the compound heterozygous recessive model is the pair of heterozygous variants with the largest CADD scores in the gene. For this model to be considered, the gene must contain at least two variants. This model is evaluated by looking at the probability of observing the smaller of the two CADD scores, so only the second variant in the pair will be annotated as REC-chet in this file. The other variant in the pair will be evaluated under the heterozygous dominant model and will be annotated as DOM-het. The candidate variant for the heterozygous dominant model is the heterozygous variant with the largest CADD score in the gene. This model is considered for all genes with at least one heterozygous SNV.
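The selection rules above can be sketched for a single gene as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the record layout (id, genotype, cadd), the genotype labels "het" and "hom", and the function name pick_candidates are all hypothetical. The sketch requires two heterozygous variants for REC-chet, and it does not distinguish SNVs from indels.

```python
def pick_candidates(gene_variants):
    """Pick one candidate variant per genetic model for a single gene.

    Assumption: each variant is a dict with 'id', 'genotype' ('het' or
    'hom'), and 'cadd'. These names are illustrative only.
    """
    by_cadd = lambda v: v["cadd"]
    homs = sorted((v for v in gene_variants if v["genotype"] == "hom"),
                  key=by_cadd, reverse=True)
    hets = sorted((v for v in gene_variants if v["genotype"] == "het"),
                  key=by_cadd, reverse=True)

    candidates = {}
    if homs:
        # Homozygous variant with the largest CADD score.
        candidates["REC-hom"] = homs[0]["id"]
    if hets:
        # Heterozygous variant with the largest CADD score.
        candidates["DOM-het"] = hets[0]["id"]
    if len(hets) >= 2:
        # The pair is scored on the smaller of its two CADD scores, so the
        # second-ranked heterozygous variant carries the REC-chet label.
        candidates["REC-chet"] = hets[1]["id"]
    return candidates


# Made-up variants in one gene.
gene = [
    {"id": "a", "genotype": "het", "cadd": 25.0},
    {"id": "b", "genotype": "het", "cadd": 18.0},
    {"id": "c", "genotype": "het", "cadd": 5.0},
    {"id": "d", "genotype": "hom", "cadd": 12.0},
]
print(pick_candidates(gene))
# {'REC-hom': 'd', 'DOM-het': 'a', 'REC-chet': 'b'}
```

Variant "a" is the top heterozygous variant and is labeled DOM-het, while its partner "b" is labeled REC-chet, which matches the annotation behavior described above.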
Once the candidate variants for each genetic model are identified, the PSAP values for those variants are calculated according to the null distribution for that gene and genetic model. The null distributions aim to determine the probability of observing the variant in the given gene and genetic model according to the variant's CADD score. For example, the PSAP value for a heterozygous variant in the gene PTEN with a CADD score of 20 under the heterozygous dominant model would represent the probability of observing a CADD score of 20 as the maximum CADD score for a heterozygous variant in the PTEN gene in the non-disease population.
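To make the idea of a gene-specific null distribution concrete, the sketch below computes an empirical probability from a sample of maximum CADD scores. It rests on several assumptions that this guide does not confirm: that the null distribution is available as a list of per-individual maximum CADD scores for one gene and model, and that the probability is read as "a maximum score at least as large as the observed one". The guide does not describe how the pipeline stores or queries its null distributions, and the numbers are invented toy values.

```python
def empirical_psap(null_max_cadd, observed_cadd):
    """Fraction of null maximum CADD scores at least as large as observed.

    Assumption: the null is a plain list of maximum CADD scores seen in
    unaffected individuals for one gene and one genetic model.
    """
    if not null_max_cadd:
        raise ValueError("null distribution is empty")
    as_extreme = sum(1 for score in null_max_cadd if score >= observed_cadd)
    return as_extreme / len(null_max_cadd)


# Toy null sample for one gene under one model (invented values).
null_sample = [3.2, 8.1, 12.5, 15.0, 20.0, 22.4, 9.9, 1.0, 17.3, 25.1]
p = empirical_psap(null_sample, 20.0)
print(p)  # 0.3
```

In this toy sample three of ten scores reach 20 or more, so a CADD score of 20 would be unremarkable. A real null distribution would need many more observations to resolve probabilities as small as those the pipeline reports.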
If you believe incomplete penetrance may be present in your data, you should use this file to identify candidate variants by looking for variants in your affected individuals with a PSAP value less than or equal to 10^-5. This threshold has a false positive rate of 5% (ie. 5% of unaffected individuals will have a variant with a popScore at or below this threshold).
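Applying the threshold above to parsed rows might look like the following minimal sketch. The row layout and the gene names are hypothetical; only the popScore label and the 10^-5 cutoff come from this guide.

```python
PSAP_THRESHOLD = 1e-5  # cutoff suggested in this guide


def passes_threshold(pop_score, threshold=PSAP_THRESHOLD):
    """Return True if a popScore is present and at or below the cutoff."""
    return pop_score is not None and pop_score <= threshold


# Made-up rows; gene names are placeholders.
rows = [
    {"gene": "GENE_A", "popScore": 3e-6},
    {"gene": "GENE_B", "popScore": 1e-5},
    {"gene": "GENE_C", "popScore": 2e-3},
    {"gene": "GENE_D", "popScore": None},
]

hits = [r["gene"] for r in rows if passes_threshold(r["popScore"])]
print(hits)  # ['GENE_A', 'GENE_B']
```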
FILENAME.report.txt
This file contains the output from the candidate variant analysis. Generating this file requires that FILENAME_popScore.txt exists, and uses the FILENAME.ped and FILENAME.avinput files as input. This file contains variants that are shared among affected individuals (shared variants must have the same genotype) and are not present in unaffected individuals in a genotype that would violate the mode of inheritance for the genetic model considered (eg. for the compound heterozygous model neither variant in the pair can be present in the unaffected individuals as a compound heterozygous pair or as a homozygous pair). Once the variants shared among affected individuals are identified, their data are collapsed into a single column, which is labeled with the individual ID of one of the affected individuals. The remaining columns correspond to each of the unaffected individuals. When the shared variants are filtered against the unaffected individuals, all compound heterozygous recessive variants will be present as the pair of variants (eg. two variants will be annotated as REC-chet in a gene and both variants will have the same PSAP value). If one of the two variants is also a valid candidate for the DOM-het model, it will also be present on a separate line, annotated as DOM-het, with the PSAP value for that model. A flag column is added that indicates whether a variant resides in a gene that is poorly covered in the ExAC database (1 or 100) and may therefore have a poorly calibrated null distribution, resides in a known dominant disease-causing gene (2 or 20), or resides in a known recessive disease-causing gene (3). The candidate variants in the file are ordered according to the popScore of the affected individual(s), from the smallest (least likely to be observed in the non-disease population) to the largest (most likely to be observed in the non-disease population).
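The flag codes and the ordering described above can be sketched as a small lookup and sort. The code values come from this guide. The assumptions are that each row carries a single flag code, the dictionary keys "gene", "flag", and "popScore", and the helper names; the guide does not say how the flag column is encoded when more than one condition applies.

```python
# Flag codes as listed in this guide.
FLAG_MEANINGS = {
    "1": "poorly covered in ExAC",
    "100": "poorly covered in ExAC",
    "2": "known dominant disease gene",
    "20": "known dominant disease gene",
    "3": "known recessive disease gene",
}


def describe_flag(flag):
    """Translate a flag code into a description (assumes one code per row)."""
    return FLAG_MEANINGS.get(str(flag).strip(), "no flag or unrecognized code")


def order_candidates(rows):
    """Sort rows from the smallest popScore to the largest."""
    return sorted(rows, key=lambda r: r["popScore"])


# Made-up report rows; gene names are placeholders.
report = [
    {"gene": "GENE_A", "flag": 20, "popScore": 4e-4},
    {"gene": "GENE_B", "flag": 3, "popScore": 2e-6},
    {"gene": "GENE_C", "flag": "", "popScore": 7e-5},
]

for row in order_candidates(report):
    print(row["gene"], row["popScore"], describe_flag(row["flag"]))
```

The rows print in the order GENE_B, GENE_C, GENE_A, matching the smallest-to-largest popScore ordering of the report.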