VCF2SM

Python script that integrates sequencing depth information of polymorphisms in variant call format (VCF) files and SuperMASSA software for quantitative genotype calling in polyploids.

Dependencies

Besides Python (v.2.7+) and Matplotlib, you will need the SuperMASSA source code, which is available here. SuperMASSA implements a Bayesian network to address the dosage calling problem taking genetic models into consideration, such as full-sib family and Hardy-Weinberg Equilibrium expected frequencies. Please see the original paper for details and cite it together with this one if you use it in a study that ends up published.

Please note that VCF2SM is not compatible with Python 3.

Usage

[-h]
-i INPUT -o OUTPUT -S SMSCRIPT -I INFERENCE
[-g GENO_PATTERN [GENO_PATTERN ...] | -r GENO_RANGE [GENO_RANGE ...]]
[-1 PAR1_PATTERN [PAR1_PATTERN ...] | -k PAR1_RANGE [PAR1_RANGE ...]]
[-2 PAR2_PATTERN [PAR2_PATTERN ...] | -l PAR2_RANGE [PAR2_RANGE ...]]
[--sF SF] [--eF EF] [-a ALLELE_DEPTH]
[-d MINIMUM_DEPTH] [-D MAXIMUM_DEPTH] [-e] [-M PLOIDY_RANGE]
[-V SIGMA_RANGE] [-f PLOIDY_FILTER] [-p POST]
[-n NAIVE_REPORTING] [-c CALLRATE] [-t THREADS]

The mandatory arguments include: i) the input/output files; ii) path to the SuperMASSA script; and iii) the inference mode.

Input

VCF file with exact allele depth counts generated by software such as Tassel4-Poly, FreeBayes and GATK.

Output

VCF file with quantitative genotype calls, i.e., depicting reference and alternative allele dosages.

Example command

Data from this paper is available here. You can download it and run it straight from your terminal as follows (no compilation needed):

python VCF2SM.py -i NewPlusOldCalls.headed.vcf -o NewPlusOldCalls.headed_poly.vcf -S ./src/SuperMASSA.py -I hw -a RA/AA -r 1:84 -d 15 -D 500 -M 4:6 -f 4 -p 0.80 -n 0.90 -c 0.75 -t 1

A more detailed explanation of each option is provided below:

Inference modes for SuperMASSA include the Hardy-Weinberg model (-I hw, as used above), F1 biparental crosses (-I f1) and a general ploidy model, with no assumptions about segregation proportions or allele frequency (-I ploidy).
By default, VCF2SM searches for allele depth information in the AD field, which is used by the TASSEL-GBS pipeline. If this information is present in other field(s), the argument -a can be used to change this behavior. In the example above, the RA and AA fields contain information about the reference and alternative allele depths, respectively.
It is common in genotyping-by-sequencing projects to include samples from different populations in the same sequencing run. If separate inference is intended for each population, the -r argument specifies a range of the indices to be used by SuperMASSA. As an example, we run the script for the first 84 individuals. Alternatively, one can specify a string pattern of the sample names to match. For instance, if the VCF file contains sample names starting with either PRJ1 or PRJ2, the option -g PRJ1 will use genotypes from the first project only.
The minimum and maximum average allele depths per variant site are specified with the -d and -D options, respectively. Polymorphisms that do not meet these criteria are filtered out.
By default, ploidies from 2 to 16 are tested with SuperMASSA. This can be modified with the -M argument. Because the example dataset is from autotetraploid potato, we fit ploidies 4 and 6 (in this case, a higher than expected ploidy level is useful to capture variants that do not fit well with the tetraploid model).
Combined with the argument above, we choose to only keep sites with an estimated ploidy of four, using the -f 4 option.
A posterior probability filter for each site is applied with the argument -p 0.80. Variants with lower probabilities are filtered out.
SuperMASSA assigns a naive posterior probability to each individual in the population. In the example, we use -n 0.90 to only keep samples with an associated probability of 0.90 or higher.
To exclude loci with excessive missing data, we require a call rate of 0.75 or higher (-c 0.75). Note that only samples passing the naive reporting probability filter are considered as valid calls, i.e., samples that have allele depth information but do not fit well to any genotype group are treated as missing.
Finally, the number of threads to be used in the parallel processing of sites is defined with -t.

More Options

VCF2SM provides some additional arguments that were not used in the previous example. These are briefly described next:

By default, SuperMASSA uses a greedy algorithm to fit the various models. If exact inference is desired, the -e option can be used, but please note that this increases runtime substantially.
The range of variances to test can be changed with the -V argument, which is simply passed verbatim to SuperMASSA.
Similarly to the sample range or pattern argument, the options -1 and -2, or -k and -l, can be used to specify index ranges or name patterns corresponding to both parents in F1 segregating progenies.
If the variants are split in multiple input VCF files, the arguments --sF and --eF specify the starting and ending file indices. In this case, the input and output file paths must contain a + character, such as: -i in_path/input_chr+.vcf -o out_path/output_chr+.vcf --sF 1 --eF 10 (ten VCF files are expected).

Arguments

Short	Long	Description	Details
`-i`	`--input`	Path to VCF input file
`-o`	`--output`	Path to VCF output file
`-S`	`--SMscript`	Path to the SuperMASSA script
`-I`	`--inference`	Inference mode for SuperMASSA	`f1` or `hw` for respective F1 segregant or Hardy-Weinberg model; `ploidy` for non-model based ploidy estimation
`-g`	`--geno_pattern`	Pattern(s) of genotype IDs to include as samples
`-r`	`--geno_range`	Indices of genotypes to include as samples	List or range in the form `1:150`
`-1`	`--par1_pattern`	Pattern(s) of genotype IDs for parent 1
`-k`	`--par1_range`	Indices of genotypes to include as parent 1	List or range in the form `1:5`
`-2`	`--par2_pattern`	Pattern(s) of genotype IDs for parent 2
`-l`	`--par2_range`	Indices of genotypes to include as parent 2	List or range in the form `1:5`
	`--sF`	Starting file
	`--eF`	Ending file
`-a`	`--allele_depth`	VCF field(s) to get allele depths from	Default = `AD`. It also supports two fields in the `RA/AA` format
`-d`	`--minimum_depth`	Minimum average depth per sample for variant site to be processed (not including parents, for F1 progenies)	Default = `0`
`-D`	`--maximum_depth`	Maximum average depth per sample for variant site to be processed (not including parents, for F1 progenies)	Default = `inf`
`-e`	`--exact`	Perform exact inference	Default is `FALSE` (performs approximate inference)
`-M`	`--ploidy_range`	Ploidy range to test with SuperMASSA	Default = `2:16`
`-V`	`--sigma_range`	Sigma range for SuperMASSA	Default = `0.01:1:0.05` (range in the form low:high:step)
`-f`	`--ploidy_filter`	Range of ploidies to include in the output	Default = NULL (keep all ploidy levels tested)
`-p`	`--post`	Minimum posterior probability to keep variant	Default = `0.0`
`-n`	`--naive_reporting`	Naive posterior reporting threshold	Default = `0.0`
`-c`	`--callrate`	Minimum call rate to keep variant	Default = `0.0`
`-t`	`--threads`	Maximum number of threads to use	Default = `1`

Cite

Pereira GS, Garcia AAF, Margarido GRA. (2018) A fully automated pipeline for quantitative genotype calling from next generation sequencing data in autopolyploids. BMC Bioinformatics 19:398. https://doi.org/10.1186/s12859-018-2433-6.

Serang O, Mollinari M, Garcia AAF. (2012) Efficient Exact Maximum a Posteriori Computation for Bayesian SNP Genotyping in Polyploids. PLoS ONE 7(2): e30906. https://doi.org/10.1371/journal.pone.0030906

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
LICENSE		LICENSE
README.md		README.md
VCF2SM.py		VCF2SM.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VCF2SM

Dependencies

Usage

Input

Output

Example command

More Options

Arguments

Cite

About

Releases

Packages

Contributors 2

Languages

License

guilherme-pereira/vcf2sm

Folders and files

Latest commit

History

Repository files navigation

VCF2SM

Dependencies

Usage

Input

Output

Example command

More Options

Arguments

Cite

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages