Python script that integrates sequencing depth information of polymorphisms in variant call format (VCF) files and SuperMASSA software for quantitative genotype calling in polyploids.
Besides Python (v.2.7+) and Matplotlib, you will need the SuperMASSA source code, which is available here. SuperMASSA implements a Bayesian network to address the dosage calling problem taking genetic models into consideration, such as full-sib family and Hardy-Weinberg Equilibrium expected frequencies. Please see the original paper for details and cite it together with this one if you use it in a study that ends up published.
Please note that VCF2SM is not compatible with Python 3.
[-h]
-i INPUT -o OUTPUT -S SMSCRIPT -I INFERENCE
[-g GENO_PATTERN [GENO_PATTERN ...] | -r GENO_RANGE [GENO_RANGE ...]]
[-1 PAR1_PATTERN [PAR1_PATTERN ...] | -k PAR1_RANGE [PAR1_RANGE ...]]
[-2 PAR2_PATTERN [PAR2_PATTERN ...] | -l PAR2_RANGE [PAR2_RANGE ...]]
[--sF SF] [--eF EF] [-a ALLELE_DEPTH]
[-d MINIMUM_DEPTH] [-D MAXIMUM_DEPTH] [-e] [-M PLOIDY_RANGE]
[-V SIGMA_RANGE] [-f PLOIDY_FILTER] [-p POST]
[-n NAIVE_REPORTING] [-c CALLRATE] [-t THREADS]
The mandatory arguments include: i) the input/output files; ii) path to the SuperMASSA script; and iii) the inference mode.
VCF file with exact allele depth counts generated by software such as Tassel4-Poly, FreeBayes and GATK.
VCF file with quantitative genotype calls, i.e., depicting reference and alternative allele dosages.
Data from this paper is available here. You can download it and run it straight from your terminal as follows (no compilation needed):
python VCF2SM.py -i NewPlusOldCalls.headed.vcf -o NewPlusOldCalls.headed_poly.vcf -S ./src/SuperMASSA.py -I hw -a RA/AA -r 1:84 -d 15 -D 500 -M 4:6 -f 4 -p 0.80 -n 0.90 -c 0.75 -t 1
A more detailed explanation of each option is provided below:
-
Inference modes for SuperMASSA include the Hardy-Weinberg model (
-I hw
, as used above), F1 biparental crosses (-I f1
) and a general ploidy model, with no assumptions about segregation proportions or allele frequency (-I ploidy
). -
By default,
VCF2SM
searches for allele depth information in theAD
field, which is used by the TASSEL-GBS pipeline. If this information is present in other field(s), the argument-a
can be used to change this behavior. In the example above, theRA
andAA
fields contain information about the reference and alternative allele depths, respectively. -
It is common in genotyping-by-sequencing projects to include samples from different populations in the same sequencing run. If separate inference is intended for each population, the
-r
argument specifies a range of the indices to be used by SuperMASSA. As an example, we run the script for the first 84 individuals. Alternatively, one can specify a string pattern of the sample names to match. For instance, if the VCF file contains sample names starting with either PRJ1 or PRJ2, the option-g PRJ1
will use genotypes from the first project only. -
The minimum and maximum average allele depths per variant site are specified with the
-d
and-D
options, respectively. Polymorphisms that do not meet these criteria are filtered out. -
By default, ploidies from 2 to 16 are tested with SuperMASSA. This can be modified with the
-M
argument. Because the example dataset is from autotetraploid potato, we fit ploidies 4 and 6 (in this case, a higher than expected ploidy level is useful to capture variants that do not fit well with the tetraploid model). -
Combined with the argument above, we choose to only keep sites with an estimated ploidy of four, using the
-f 4
option. -
A posterior probability filter for each site is applied with the argument
-p 0.80
. Variants with lower probabilities are filtered out. -
SuperMASSA assigns a naive posterior probability to each individual in the population. In the example, we use
-n 0.90
to only keep samples with an associated probability of 0.90 or higher. -
To exclude loci with excessive missing data, we require a call rate of 0.75 or higher (
-c 0.75
). Note that only samples passing the naive reporting probability filter are considered as valid calls, i.e., samples that have allele depth information but do not fit well to any genotype group are treated as missing. -
Finally, the number of threads to be used in the parallel processing of sites is defined with
-t
.
VCF2SM provides some additional arguments that were not used in the previous example. These are briefly described next:
-
By default, SuperMASSA uses a greedy algorithm to fit the various models. If exact inference is desired, the
-e
option can be used, but please note that this increases runtime substantially. -
The range of variances to test can be changed with the
-V
argument, which is simply passed verbatim to SuperMASSA. -
Similarly to the sample range or pattern argument, the options
-1
and-2
, or-k
and-l
, can be used to specify index ranges or name patterns corresponding to both parents in F1 segregating progenies. -
If the variants are split in multiple input VCF files, the arguments
--sF
and--eF
specify the starting and ending file indices. In this case, the input and output file paths must contain a+
character, such as:-i in_path/input_chr+.vcf -o out_path/output_chr+.vcf --sF 1 --eF 10
(ten VCF files are expected).
Short | Long | Description | Details |
---|---|---|---|
-i |
--input |
Path to VCF input file | |
-o |
--output |
Path to VCF output file | |
-S |
--SMscript |
Path to the SuperMASSA script | |
-I |
--inference |
Inference mode for SuperMASSA | f1 or hw for respective F1 segregant or Hardy-Weinberg model; ploidy for non-model based ploidy estimation |
-g |
--geno_pattern |
Pattern(s) of genotype IDs to include as samples | |
-r |
--geno_range |
Indices of genotypes to include as samples | List or range in the form 1:150 |
-1 |
--par1_pattern |
Pattern(s) of genotype IDs for parent 1 | |
-k |
--par1_range |
Indices of genotypes to include as parent 1 | List or range in the form 1:5 |
-2 |
--par2_pattern |
Pattern(s) of genotype IDs for parent 2 | |
-l |
--par2_range |
Indices of genotypes to include as parent 2 | List or range in the form 1:5 |
--sF |
Starting file | ||
--eF |
Ending file | ||
-a |
--allele_depth |
VCF field(s) to get allele depths from | Default = AD . It also supports two fields in the RA/AA format |
-d |
--minimum_depth |
Minimum average depth per sample for variant site to be processed (not including parents, for F1 progenies) | Default = 0 |
-D |
--maximum_depth |
Maximum average depth per sample for variant site to be processed (not including parents, for F1 progenies) | Default = inf |
-e |
--exact |
Perform exact inference | Default is FALSE (performs approximate inference) |
-M |
--ploidy_range |
Ploidy range to test with SuperMASSA | Default = 2:16 |
-V |
--sigma_range |
Sigma range for SuperMASSA | Default = 0.01:1:0.05 (range in the form low:high:step) |
-f |
--ploidy_filter |
Range of ploidies to include in the output | Default = NULL (keep all ploidy levels tested) |
-p |
--post |
Minimum posterior probability to keep variant | Default = 0.0 |
-n |
--naive_reporting |
Naive posterior reporting threshold | Default = 0.0 |
-c |
--callrate |
Minimum call rate to keep variant | Default = 0.0 |
-t |
--threads |
Maximum number of threads to use | Default = 1 |
Pereira GS, Garcia AAF, Margarido GRA. (2018) A fully automated pipeline for quantitative genotype calling from next generation sequencing data in autopolyploids. BMC Bioinformatics 19:398. https://doi.org/10.1186/s12859-018-2433-6.
Serang O, Mollinari M, Garcia AAF. (2012) Efficient Exact Maximum a Posteriori Computation for Bayesian SNP Genotyping in Polyploids. PLoS ONE 7(2): e30906. https://doi.org/10.1371/journal.pone.0030906