Command-line application to classify variants in any VCF (Variant Call Format) file based on a decision tree.
Generate a personal access token in GitHub with at least the scope "read:packages".
Then add a settings.xml to your Maven .m2 folder, or edit it if you already have one. It should contain the following:
<?xml version="1.0"?>
<settings xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/SETTINGS/1.0.0">
<activeProfiles>
<activeProfile>github</activeProfile>
</activeProfiles>
<profiles>
<profile>
<id>github</id>
<repositories>
<repository>
<id>central</id>
<url>https://repo1.maven.org/maven2</url>
</repository>
<repository>
<id>github</id>
<url>https://maven.pkg.github.com/molgenis/vip-utils</url>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>
</profile>
</profiles>
<servers>
<server>
<id>github</id>
<username>[YOUR GITHUB USERNAME]</username>
<password>[YOUR PERSONAL ACCESS TOKEN]</password>
</server>
</servers>
</settings>
- Java 21
usage: java -jar vcf-decision-tree.jar -i <arg> -c <arg> [-m <arg>] [-o <arg>] [-f]
[-s] [-l] [-p] [-d] [-pb <arg>] [-pd <arg>] [-ph <arg>] [-m <arg>]
-i,--input <arg> VEP* annotated input VCF file.
-m,--metadata <arg> VCF metadata file (.json).
-c,--config <arg> Input decision tree file (.json).
-o,--output <arg> Output VCF file (.vcf or .vcf.gz).
-f,--force Override the output file if it already exists.
-s,--strict Throw exception if field from the decision tree
is missing entirely in the input VCF.
-l,--labels Write decision tree outcome labels to output VCF
file.
-p,--path Write decision tree node path to output VCF
file.
-d,--debug Enable debug mode (additional logging).
-pb,--probands <arg> Comma-separated list of proband names.
-pd,--pedigree <arg> Comma-separated list of pedigree files (.ped).
-ph,--phenotypes <arg> Comma-separated list of sample-phenotypes (e.g.
HP:123 or HP:123;HP:234 or
sample0/HP:123,sample1/HP:234). Phenotypes are
CURIE formatted (prefix:reference) and separated
by a semicolon.
-m,--mode <arg> Run mode: 'variant' (default) or 'sample',
'sample' mode classifies provided probands, or
all samples if no probands given.
usage: java -jar vcf-decision-tree.jar -v
-v,--version Print version.
*: VEP (Ensembl Variant Effect Predictor)
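The `--phenotypes` value format described above can be illustrated with a small parser. This is only a sketch in Python based on the help text; the function name `parse_phenotypes` and the choice to store entries without a sample prefix under the key `None` are assumptions made for illustration, not the tool's own parsing logic.

```python
from typing import Dict, List, Optional


def parse_phenotypes(arg: str) -> Dict[Optional[str], List[str]]:
    """Parse a --phenotypes style value into {sample_id: [curie, ...]}.

    Entries without a "sample/" prefix are stored under the key None
    (an assumption of this sketch).
    """
    result: Dict[Optional[str], List[str]] = {}
    for entry in arg.split(","):
        sample, slash, phenotypes = entry.partition("/")
        if not slash:
            # No sample prefix: the whole entry is the phenotype list.
            sample, phenotypes = None, entry
        curies = phenotypes.split(";")
        for curie in curies:
            prefix, colon, reference = curie.partition(":")
            if not (prefix and colon and reference):
                raise ValueError(f"not a CURIE (prefix:reference): {curie!r}")
        result.setdefault(sample, []).extend(curies)
    return result
```

For example, `parse_phenotypes("sample0/HP:123,sample1/HP:234")` maps each sample name to its own list of phenotypes.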
java -jar vcf-decision-tree.jar -i my.vcf -m field_metadata.json -c decision_tree.json -o out.vcf
java -jar vcf-decision-tree.jar -i my.vcf.gz -m field_metadata.json -c decision_tree.json -o out.vcf.gz
java -jar vcf-decision-tree.jar -i my.vcf.gz -m field_metadata.json -c decision_tree.json -o out.vcf.gz -f -l -p
java -jar vcf-decision-tree.jar -v
Each variant is classified using a decision tree which consists of decision nodes and leaf nodes.
Decision nodes perform a test on the variant. The test result determines the outcome, which consists of the next node to process and, optionally, a label. Leaf nodes are terminal nodes that determine the class for a variant.
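The traversal can be sketched in a few lines of Python. The tree below is hypothetical: the node ids, field names, thresholds, class names and the dictionary layout are invented for illustration and do not reflect the tool's JSON configuration format (see the example files in src/test/resources for that).

```python
from typing import Any, Dict, List, Tuple

# Hypothetical tree: node ids, field names, thresholds and class names are
# invented for this sketch and are not taken from the tool's JSON format.
TREE: Dict[str, Any] = {
    "root": "is_rare",
    "nodes": {
        "is_rare": {
            "type": "decision",
            "test": lambda variant: variant["allele_frequency"] < 0.01,
            "outcomes": {
                True: {"next": "is_high_impact", "label": "rare"},
                False: {"next": "exit_benign"},
            },
        },
        "is_high_impact": {
            "type": "decision",
            "test": lambda variant: variant["impact"] == "HIGH",
            "outcomes": {
                True: {"next": "exit_pathogenic", "label": "high_impact"},
                False: {"next": "exit_uncertain"},
            },
        },
        "exit_benign": {"type": "leaf", "class": "B"},
        "exit_uncertain": {"type": "leaf", "class": "VUS"},
        "exit_pathogenic": {"type": "leaf", "class": "P"},
    },
}


def classify(variant: Dict[str, Any], tree: Dict[str, Any] = TREE) -> Tuple[str, List[str], List[str]]:
    """Walk the tree from the root; return (class, node path, labels)."""
    node_id = tree["root"]
    path: List[str] = []
    labels: List[str] = []
    while True:
        node = tree["nodes"][node_id]
        path.append(node_id)
        if node["type"] == "leaf":
            # Leaf nodes are terminal and determine the class.
            return node["class"], path, labels
        # Decision nodes test the variant and select an outcome.
        outcome = node["outcomes"][node["test"](variant)]
        if "label" in outcome:
            labels.append(outcome["label"])
        node_id = outcome["next"]
```

The returned path and labels correspond conceptually to what the `-p` and `-l` options write to the output file.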
COMMON, INFO, FORMAT
Any field in the VEP value can be used. If a field is unknown to the tool, it is interpreted as a single-value string field.
This fieldtype uses the information provided by htsjdk about the Genotype FORMAT field.
Allowed values are:
- ALLELES: The alleles (list of strings) present in the genotype.
- ALLELE_NUM: The allele numbers corresponding with the index in the VCF ALT field.
- TYPE: The htsjdk genotype type, possible values: MIXED, HET, HOM_REF, HOM_VAR, NO_CALL, UNAVAILABLE.
- CALLED: Boolean indicating whether the genotype for this sample is called.
- MIXED: Boolean indicating whether the genotype comprises both calls and no-calls.
- NON_INFORMATIVE: Boolean that is true if all PLs for the sample are 0.
- PHASED: Boolean indicating whether the genotype is phased.
- PLOIDY: The ploidy of the genotype as an integer, null if no call is present.
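To make the TYPE values concrete, the sketch below derives them from a GT string in Python. It approximates how htsjdk determines the genotype type and is not the library's code; the helper names `parse_gt` and `genotype_type` are hypothetical, and the allele numbering (0 for REF, 1 and up for ALT alleles, `.` for a no-call) follows the VCF GT convention.

```python
import re
from typing import List, Optional


def parse_gt(gt: str) -> List[Optional[int]]:
    """Split a GT string such as '0/1' or '1|1' into allele numbers.

    A no-call ('.') becomes None.
    """
    return [None if allele == "." else int(allele) for allele in re.split(r"[/|]", gt)]


def genotype_type(allele_nums: List[Optional[int]]) -> str:
    """Approximate the htsjdk genotype type for a list of allele numbers."""
    if not allele_nums:
        return "UNAVAILABLE"
    called = {allele for allele in allele_nums if allele is not None}
    saw_no_call = any(allele is None for allele in allele_nums)
    if saw_no_call:
        # Only no-calls: NO_CALL; calls mixed with no-calls: MIXED.
        return "MIXED" if called else "NO_CALL"
    if len(called) > 1:
        return "HET"
    return "HOM_REF" if called == {0} else "HOM_VAR"
```

Under these assumptions, CALLED corresponds to any type other than NO_CALL and UNAVAILABLE, and PHASED to the use of `|` as the separator in the GT string.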
This fieldtype is used to query properties of the samples, like phenotypes and pedigree information which are provided outside the VCF.
Allowed values are:
- ID: The sample identifier.
- AFFECTED_STATUS: The affected status of the sample, possible values: AFFECTED, UNAFFECTED, MISSING.
- SEX: The sex of the sample, possible values: MALE, FEMALE, UNKNOWN.
- FATHER_ID: The identifier for the father sample.
- MOTHER_ID: The identifier for the mother sample.
- FAMILY_ID: The identifier for the family.
- PHENOTYPES: The list of phenotypes for the sample.
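As an illustration of where most of these values can come from, the following Python sketch maps one line of a pedigree file to the fields listed above. It assumes the common six-column PED layout (family, individual, father, mother, sex, phenotype) with `0` for an unknown parent; the function name `parse_ped_line` and the exact mapping of codes are assumptions, since the tool's own parsing rules are not described here.

```python
from typing import Dict, Optional

# Code-to-value mappings assumed from the common PED conventions.
SEX = {"1": "MALE", "2": "FEMALE"}
AFFECTED_STATUS = {"1": "UNAFFECTED", "2": "AFFECTED"}


def parse_ped_line(line: str) -> Dict[str, Optional[str]]:
    """Map one whitespace-separated PED line to sample field values."""
    family_id, sample_id, father_id, mother_id, sex, phenotype = line.split()[:6]
    return {
        "ID": sample_id,
        "AFFECTED_STATUS": AFFECTED_STATUS.get(phenotype, "MISSING"),
        "SEX": SEX.get(sex, "UNKNOWN"),
        # '0' conventionally means the parent is not in the pedigree.
        "FATHER_ID": None if father_id == "0" else father_id,
        "MOTHER_ID": None if mother_id == "0" else mother_id,
        "FAMILY_ID": family_id,
    }
```

PHENOTYPES is not part of a PED line; it is supplied separately through the `-ph` option.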
In variant classification mode the tool outputs a classification per VEP value (CSQ). This classification is added to the VEP value under the key "VIPC".
Optionally, labels and the path through the tree can be annotated on the VEP value as well.
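Reading the classification back from an output file could look like the Python sketch below. It assumes the usual VEP convention in which the CSQ INFO header description ends with `Format: ` followed by pipe-separated field names, and each CSQ entry is a pipe-separated record, with entries separated by commas. The function names and the sample data are hypothetical.

```python
from typing import List


def csq_field_names(description: str) -> List[str]:
    """Extract the CSQ field names from the INFO header description."""
    return description.split("Format: ", 1)[1].strip().split("|")


def vipc_per_csq(csq_value: str, field_names: List[str]) -> List[str]:
    """Return the VIPC class of each VEP value in a CSQ INFO value."""
    vipc_index = field_names.index("VIPC")
    return [entry.split("|")[vipc_index] for entry in csq_value.split(",")]
```

For instance, with field names `Allele|Consequence|SYMBOL|VIPC`, the call returns one class per CSQ entry, in the order in which the entries appear.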
In sample classification mode the tool outputs, per sample, a comma-separated list of classifications in which each index corresponds to the VEP value index. This list is added to the FORMAT fields under the key "VIPC_S".
Optionally, labels and the path through the tree can be annotated on the FORMAT fields as well.
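The index correspondence can be made explicit with a short Python sketch that reads VIPC_S from one sample column. The function name `sample_classes` and the class names in the example are hypothetical; only the key "VIPC_S" and the comma-separated, index-aligned layout come from the description above.

```python
from typing import Dict


def sample_classes(format_keys: str, sample_value: str) -> Dict[int, str]:
    """Map each VEP value index to the sample's class for that VEP value.

    format_keys is the VCF FORMAT column (e.g. 'GT:VIPC_S') and
    sample_value is the matching sample column (e.g. '0/1:U1,U2').
    """
    fields = dict(zip(format_keys.split(":"), sample_value.split(":")))
    return dict(enumerate(fields["VIPC_S"].split(",")))
```

With `sample_classes("GT:VIPC_S", "0/1:U1,U2")`, index 0 refers to the first VEP value of the record and index 1 to the second.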
See src/test/resources/example.json and src/test/resources/example_sample.json.
Variant classifications and optionally their paths and labels are annotated on the input VCF in the VIPC, VIPP and VIPL info fields.
see src/test/resources/example-classified.vcf
see src/test/resources/example-classified_paths-labels.vcf
see src/test/resources/example_sample-classified.vcf
see src/test/resources/example_sample-classified_paths-labels.vcf