diff --git a/index.html b/index.html index 5393408..5ac58e5 100644 --- a/index.html +++ b/index.html @@ -317,6 +317,7 @@
See the examples and the API documentation for diff --git a/search/search_index.json b/search/search_index.json index 816f13a..88f25a5 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to the icaparser Python package","text":"
This Python package provides functions for parsing JSON files created by Illumina's Connected Annotations (ICA) pipeline. ICA annotates mutations with a set of tools and data sources. This package allows you to:
See the examples and the API documentation for further details.
"},{"location":"examples/","title":"Examples","text":""},{"location":"examples/#stripping-very-large-json-files","title":"Stripping very large JSON files","text":"Some JSON files from Illumina TSO panels (for example, TSO500) are not QC filtered and contain all detected genomic variants, irrespective of whether they pass the quality criteria. Such files can get very large, too large to be processed by any JSON parser. If your JSON file does not contain only QC-filtered variants (\"PASS\"), it needs to be stripped (filtered) first before using the icaparser module for further processing.
The code below can be run in Python in a terminal or in a Jupyter notebook. A terminal is recommended.
import icaparser as icap\nicap.strip_json_files(source_dir='../Data/Original', target_dir='../Data/Derived')\n
"},{"location":"examples/#simple-example","title":"Simple example","text":"The code below is the Hello World example for reading and filtering ICA JSON files with default filtering rules. For more sophisticated filtering options, see the API reference.
import icaparser as icap\njson_files = icap.get_dna_json_files('../Data/Derived')\nfirst_file = json_files[0]\n# Get the annotation data sources\nicap.get_data_sources(first_file)\n# Get pipeline run metadata\nicap.get_pipeline_metadata(json_files)\n# Get a mutation table\nmut_table = icap.get_mutation_table_for_files(json_files)\n
"},{"location":"installation/","title":"Installation instructions","text":""},{"location":"installation/#installation-of-the-icaparser-package","title":"Installation of the icaparser package","text":"It is recommended to create a new virtual environment with Python >= 3.9 and to install the icaparser package in that environment. Activate the environment and run:
pip install \"git+https://github.com/Bayer-Group/ica-parser.git#subdirectory=icaparser\"\n
If you want to install a particular development branch, use
pip install \"git+https://github.com/Bayer-Group/ica-parser.git@BRANCHNAME#subdirectory=icaparser\"\n
If you use Jupyter notebooks, the virtual environment should be added as a new Jupyter kernel. See Using Virtual Environments in Jupyter Notebook and Python - Parametric Thoughts for how to do that.
"},{"location":"installation/#installation-of-ipywidgets","title":"Installation of ipywidgets","text":"ipywidgets is required for progress bars in Jupyter. Please refer to the Jupyter or JupyterLab documentation for how to install the widgets. For example:
conda install jupyter # if not installed yet\nconda install jupyterlab_widgets\njupyter labextension install jupyter-matplotlib\njupyter lab build\nexit\n
\u2192 Restart Jupyter
"},{"location":"reference/","title":"API documentation","text":"Parser for JSON files from Illumina Connected Annotations pipeline.
"},{"location":"reference/#icaparser.icaparser.add_gene_types","title":"add_gene_types(positions)
","text":"Adds the gene type to each transcript.
Transcripts will be annotated with the gene type (oncogene, tsg, mixed) by adding a new attribute geneType
. Only transcripts with one of these three gene types get this additional annotation. Other transcripts will not get the geneType
attribute.
Parameters:
positions
(list
) \u2013 list of filtered or unfiltered positions from JSON files.
Returns:
list
\u2013 list of positions with additional annotation of transcripts.
Examples:
>>> import icaparser as icap\n>>> positions = icap.add_gene_types(positions)\n
"},{"location":"reference/#icaparser.icaparser.apply_mutation_classification_rules","title":"apply_mutation_classification_rules(positions, rule_set=get_default_mutation_classification_rules(), gene_type_map=get_default_gene_type_map(), hide_progress=False)
","text":"Applies mutation classification rules to all positions.
Each variant is categorized for each transcript that overlaps with the genomic position of the variant. Each transcript that passes the \"mutated\" or \"uncertain\" mutation classification rules gets a new attribute mutation_status
with the value \"mutated\" or \"uncertain\". The input list of positions is modified by adding the mutation_status
attribute to transcripts, and the modified list of positions is returned as the first element of the returned tuple.
In addition to modifying and returning the list of positions, this function also returns the assembled mutation status after aggregating the impact on all transcripts covering a variant. This is returned as the second item of the returned tuple. The impact depends on the type of gene (\"gof\" or \"lof\"), so the impacts are assembled separately for each gene type.
The impact of a particular mutational variant can be different for different overlapping transcript variants of a gene, and the transcript variants can also belong to different genes. The strongest impact on any overlapping transcript of a gene is defined as the impact of that mutational variant on the gene. The analyst must decide which isoforms are used to classify genes. For example, only canonical transcripts may be considered. Alternatively, all transcripts or a subset of transcripts may be used. Therefore, it is necessary to first apply transcript-level filters to all genomic positions before this function is called for determining the mutation status of genes.
The returned value is a multi-dimensional dictionary:
sample_id
\u2192 gene
\u2192 gene_type
\u2192 variant_id
\u2192 mutationStatus
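The nested dictionary described above can be walked with plain loops. The following is a minimal sketch, assuming a hand-made sample_muts value with invented sample, gene, and variant identifiers; the real dictionary is the second return value of apply_mutation_classification_rules, and flatten_sample_muts is a hypothetical helper, not part of icaparser.

```python
# Hypothetical stand-in for the second return value of
# apply_mutation_classification_rules; all identifiers are invented.
sample_muts = {
    'sample_1': {
        'KRAS': {'gof': {'variant_a': 'mutated'}},
        'TP53': {'lof': {'variant_b': 'mutated', 'variant_c': 'uncertain'}},
    },
}


def flatten_sample_muts(sample_muts):
    """Flatten sample_id -> gene -> gene_type -> variant_id -> status into rows."""
    rows = []
    for sample_id, genes in sample_muts.items():
        for gene, gene_types in genes.items():
            for gene_type, variants in gene_types.items():
                for variant_id, status in variants.items():
                    rows.append((sample_id, gene, gene_type, variant_id, status))
    return rows


rows = flatten_sample_muts(sample_muts)
for row in rows:
    print(row)
```

Flat rows of this kind are easy to load into a table for counting or grouping.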
Parameters:
positions
(list
) \u2013 list of positions.
rule_set
(dict
, default: get_default_mutation_classification_rules()
) \u2013 rules for classifying \"gof\" and \"lof\" genes. See the default value for an example if a custom rule set is needed.
gene_type_map
(dict
, default: get_default_gene_type_map()
) \u2013 dictionary for mapping gene types to canonical gene types. See the default value for an example if a custom map is needed.
Returns:
tuple[list, dict]
\u2013 A list of positions and a dictionary with assembled and aggregated mutations.
Examples:
>>> import icaparser as icap\n>>> positions, sample_muts = icap.apply_mutation_classification_rules(positions)\n
"},{"location":"reference/#icaparser.icaparser.cleanup_cosmic","title":"cleanup_cosmic(positions)
","text":"Remove Cosmic entries with alleles not matching the variant alleles.
ICA attaches Cosmic entries to variants based on position only, which leads to wrong assignments of Cosmic entries to variants. This function removes all Cosmic entries from a variant for which reference and altered alleles do not match those of the variant.
Filtering is done in place.
Parameters:
positions
(list
) \u2013 list of positions to clean up.
Returns:
list
\u2013 list of positions with cleaned up Cosmic entries.
common_variant_filter(variant, max_af=0.001)
","text":"Get a variant filter based on gnomAD, gnomAD Exome, and 1000 Genomes.
Returns True if none of the maximum allele frequencies from gnomAD, gnomAD Exome, and 1000 Genomes is greater than max_af
. The default value of 0.1 % for the maximum allele frequency corresponds to that of the AACR GENIE project.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
max_af
(float
, default: 0.001
) \u2013 the maximum allele frequency threshold.
Returns:
bool
\u2013 True if this is not a common variant.
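The decision rule described above can be illustrated without the library. This is only a sketch: is_not_common is a hypothetical helper, and the dictionary of per-source maxima is an assumed input, standing in for values that might be obtained with get_max_af for each source.

```python
def is_not_common(max_afs, max_af=0.001):
    """Return True if no per-source maximum allele frequency exceeds max_af."""
    return all(af <= max_af for af in max_afs.values())


# Hypothetical per-source maxima; the numbers are invented for illustration.
rare = {'gnomad': 0.0002, 'gnomadExome': 0.0001, 'oneKg': 0.0}
common = {'gnomad': 0.0002, 'gnomadExome': 0.03, 'oneKg': 0.01}
print(is_not_common(rare))
print(is_not_common(common))
```

A frequency exactly equal to max_af still passes in this sketch, because the rule only rejects values greater than the threshold.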
explode_consequence(mutation_table, inplace=False)
","text":"Explode the VEP consequence column of a mutation table.
Exploding the VEP consequence column with the standard Pandas explode()
function would return consequences as strings, not as ordered categories. This function instead returns a consequence column that is an ordered category. The categories are ordered by their impact.
Exploding means that if a row of the input table has multiple consequences in the consequence column, the list of consequences will be split into single consequences and the output table will have multiple rows with a single consequence per row.
Parameters:
mutation_table
(DataFrame
) \u2013 the mutation table to explode
inplace
(bool
, default: False
) \u2013 if True, then modify the mutation_table in place instead of returning a new object
Returns:
DataFrame
\u2013 new mutation table with exploded consequences.
Examples:
>>> import icaparser as icap\n>>> icap.explode_consequence(mutation_table, inplace=True)\n
>>> mutation_table_exploded = icap.explode_consequence(mutation_table)\n
"},{"location":"reference/#icaparser.icaparser.filter_positions_by_transcripts","title":"filter_positions_by_transcripts(positions, filter_func)
","text":"Filter positions based on a filter function for transcripts.
Apply a filter function to all transcripts of each position. Transcripts not passing the filter are removed from the variants of a position. Variants without any transcript left are removed from a position. Positions without any variants left are removed from the returned list of positions.
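The cascade described above (transcripts, then variants, then positions) can be sketched in plain Python. This is an illustration under assumptions: the keys variants and transcripts and the toy records are hypothetical, and the sketch returns filtered copies, which may differ from how icaparser handles its input.

```python
def filter_positions_by_transcripts_sketch(positions, filter_func):
    """Keep passing transcripts, then drop emptied variants and positions."""
    kept_positions = []
    for position in positions:
        kept_variants = []
        for variant in position.get('variants', []):
            transcripts = [t for t in variant.get('transcripts', []) if filter_func(t)]
            if transcripts:
                kept_variants.append({**variant, 'transcripts': transcripts})
        if kept_variants:
            kept_positions.append({**position, 'variants': kept_variants})
    return kept_positions


# Toy input with an assumed layout; real positions come from the JSON files.
positions = [
    {'position': 1, 'variants': [
        {'vid': 'v1', 'transcripts': [{'isCanonical': True}, {'isCanonical': False}]},
        {'vid': 'v2', 'transcripts': [{'isCanonical': False}]},
    ]},
    {'position': 2, 'variants': [{'vid': 'v3', 'transcripts': [{}]}]},
]
canonical = filter_positions_by_transcripts_sketch(
    positions, lambda t: t.get('isCanonical', False)
)
print(canonical)
```

Only the first position survives here, with one variant and one transcript left.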
Parameters:
positions
(list
) \u2013 list of positions to filter.
filter_func
(Callable[[dict], bool]
) \u2013 function taking a transcript and returning a bool. True means to keep the transcript.
Returns:
list
\u2013 filtered positions.
Examples:
>>> import icaparser as icap\n>>> is_canonical_transcript = lambda x: x.get('isCanonical', False)\n>>> canonical_positions = icap.filter_positions_by_transcripts(\n non_common_positions,\n is_canonical_transcript\n )\n
"},{"location":"reference/#icaparser.icaparser.filter_positions_by_variants","title":"filter_positions_by_variants(positions, filter_func)
","text":"Filter positions based on a filter function for variants.
Apply a filter function to all variants of each position. Variants not passing the filter are removed from a position. Positions without any variants passing the filter are removed from the returned list.
Parameters:
positions
(list
) \u2013 list of positions to filter.
filter_func
(Callable[[dict], bool]
) \u2013 function taking a variant and returning a bool. True means to keep the variant.
Returns:
list
\u2013 filtered positions.
Examples:
>>> import icaparser as icap\n>>> max_af = 0.001\n>>> is_not_common_variant = lambda x: icap.common_variant_filter(x, max_af)\n>>> non_common_positions = icap.filter_positions_by_variants(\n positions,\n is_not_common_variant\n )\n
"},{"location":"reference/#icaparser.icaparser.filter_variants_by_transcripts","title":"filter_variants_by_transcripts(variants, filter_func)
","text":"Filter variants based on a filter function for transcripts.
Apply a filter function to all transcripts of each variant. Transcripts not passing the filter are removed from a variant. Variants without any transcripts passing the filter are removed from the returned list.
Parameters:
variants
(list
) \u2013 list of variants to filter.
filter_func
(Callable[[dict], bool]
) \u2013 function taking a transcript and returning a bool. True means to keep the transcript.
Returns:
list
\u2013 filtered variants.
get_aggregated_mutation_table(positions, sample_muts=None, mutation_classification_rules=get_default_mutation_classification_rules(), mutation_aggregation_rules=get_default_mutation_aggregation_rules(), gene_type_map=get_default_gene_type_map(), hide_progress=False)
","text":"Returns a sample-gene-mutationStatus table.
This function applies mutation classification rules to all mutational variants and aggregates the mutations according to the aggregation rules. This results in a table with one row for each sample-gene pair. The table contains several columns with impacts according to lof and gof rules on allele level and gene level and with one additional column with the maximum impact for both allele and gene level.
Parameters:
positions
(list
) \u2013 list of positions. If sample_muts is also specified, it is assumed that the positions have already been processed previously by apply_mutation_classification_rules
and we do not have to run mutation classification again.
sample_muts
(dict
, default: None
) \u2013 if apply_mutation_classification_rules
has been run before, you can use the second return value of that function as the sample_muts argument. This is helpful for very large datasets because otherwise apply_mutation_classification_rules
will be run again as an internal call within get_aggregated_mutation_table
, which is time consuming for very large data sets. This also means that if sample_muts
is provided as an argument, the mutation_classification_rules
argument is ignored and has no effect.
mutation_classification_rules
(dict
, default: get_default_mutation_classification_rules()
) \u2013 rules for classifying single mutations. See get_default_mutation_classification_rules()
for details.
mutation_aggregation_rules
(dict
, default: get_default_mutation_aggregation_rules()
) \u2013 rules for aggregating mutations. See get_default_mutation_aggregation_rules()
for details.
gene_type_map
(dict
, default: get_default_gene_type_map()
) \u2013 dictionary for mapping gene types to canonical gene types. See get_default_gene_type_map()
for details.
Returns:
DataFrame
\u2013 mutation table.
get_biotype_priority(biotype)
","text":"Get the numeric priority of a biotype.
The numeric priority of a biotype that is returned by this function is the same as defined by vcf2maf.pl by MSKCC. Biotypes are 'protein_coding', 'LRG_gene', 'miRNA', ...
Parameters:
biotype
(str
) \u2013 the biotype for which the priority is to be returned.
Returns:
int
\u2013 the priority, smaller values mean higher priority.
get_clinvar(variant)
","text":"Get a table of all ClinVar annotations for a variant.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
Returns:
DataFrame
\u2013 table with ClinVar annotations.
get_clinvar_max_significance(variant, ordered_significances=_CLINVAR_ORDERED_SIGNIFICANCES)
","text":"Get the maximum significance for all ClinVar annotations of a variant.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
ordered_significances
(list
, default: _CLINVAR_ORDERED_SIGNIFICANCES
) \u2013 ranked order of ClinVar significances.
Returns:
str
\u2013 ClinVar significance of highest rank for the variant.
get_consequences(transcript)
","text":"Get a list of consequences for a transcript.
A list of consequences of a variant for a transcript is returned. If any of the annotated consequences is a combination of single consequences, separated by ampersands (&) or commas, the consequence is split into single consequences.
Parameters:
transcript
(dict
) \u2013 the transcript for which the consequences are to be returned.
Returns:
list
\u2013 the consequences, a list of strings.
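The splitting step described above can be sketched with the standard library. This is an illustration only: split_consequences is a hypothetical helper that works on a plain list of strings, and how icaparser reads the terms from a transcript is not shown here.

```python
import re


def split_consequences(consequences):
    """Split combined consequence terms on ampersands and commas."""
    single = []
    for term in consequences:
        single.extend(part.strip() for part in re.split('[&,]', term) if part.strip())
    return single


# Hypothetical annotation with one combined term.
annotated = ['missense_variant&splice_region_variant', 'intron_variant']
print(split_consequences(annotated))
```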
get_cosmic_max_sample_count(variant, only_allele_specific=True)
","text":"Get the maximum sample count for all Cosmic annotations of a variant.
A variant can have no, one or multiple associated Cosmic identifiers. This function returns the maximum sample count of all Cosmic identifiers. For each Cosmic identifier, sample numbers are summed up across all indications. Returns 0 if no Cosmic identifier exists for this variant.
The 'only_allele_specific' argument is used to exclude Cosmic entries that annotate the same chromosomal location but an allele that is different from the allele of the annotated variant. ICA annotates a variant with all Cosmic entries for that chromosomal location, irrespective of alleles. When counting Cosmic samples, this leads to an overestimation of Cosmic sample counts for a particular variant. Therefore, 'only_allele_specific' is True by default to count only samples from Cosmic entries with matching alleles. Occasionally, it may be desired, though, to count all samples with mutations at a given position, irrespective of allele. For example, several different alleles at a functional site of a gene can lead to function-disrupting mutations, so we want to get the maximum sample count for any allele at that position. One might also think of adding the sample counts for all Cosmic entries annotating a variant, but this does not work due to redundancy of Cosmic entries. Older Cosmic versions often included the same sample in different Cosmic entries. And newer Cosmic versions often have multiple entries for an allele, one for each transcript variant, with the same underlying samples.
Parameters:
variant
(dict
) \u2013 the variant to investigate
only_allele_specific
(bool
, default: True
) \u2013 consider only cosmic entries with alleles matching the allele of the annotated variant
Returns:
int
\u2013 maximum cosmic sample count
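The counting logic described above (sum per Cosmic entry, maximum across entries, optional allele matching) can be sketched as follows. The field names refAllele, altAllele, cosmic, and sampleCounts and the toy variant are assumptions made for this illustration and may not match the ICA JSON layout.

```python
def max_cosmic_sample_count(variant, only_allele_specific=True):
    """Return the largest per-entry sample count, or 0 without Cosmic entries."""
    counts = []
    for entry in variant.get('cosmic', []):
        matches = (entry['refAllele'] == variant['refAllele']
                   and entry['altAllele'] == variant['altAllele'])
        if only_allele_specific and not matches:
            continue
        # Sum the samples of one Cosmic entry across all indications.
        counts.append(sum(entry['sampleCounts'].values()))
    return max(counts, default=0)


# Invented variant with one matching and one non-matching Cosmic entry.
variant = {
    'refAllele': 'C', 'altAllele': 'T',
    'cosmic': [
        {'id': 'entry_1', 'refAllele': 'C', 'altAllele': 'T',
         'sampleCounts': {'lung': 7, 'colon': 5}},
        {'id': 'entry_2', 'refAllele': 'C', 'altAllele': 'A',
         'sampleCounts': {'lung': 40}},
    ],
}
print(max_cosmic_sample_count(variant))
print(max_cosmic_sample_count(variant, only_allele_specific=False))
```

Taking the maximum instead of the sum across entries avoids counting the same samples twice when Cosmic entries are redundant.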
get_data_sources(file)
","text":"Extract a table with annotation data sources from the JSON header.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
Returns:
DataFrame
\u2013 table with annotation data sources and their versions.
get_default_gene_type_map()
","text":"Returns the default gene type map.
The canonical gene types are gof
, lof
, and the union of both. Genes that need to be activated to drive a tumor are of type gof
. Genes that need to be deactivated to drive a tumor are of type lof
. Genes that need to be activated or deactivated depending on the context are of the union of both types. Genes for which it is unknown if they need to be activated or deactivated are also annotated with both types. Genes can be originally annotated with other type names than the canonical ones. The gene type map is used to map these other gene type names to the canonical gene types.
The default map is:
oncogene
\u2192 {\"gof\"}
tsg
\u2192 {\"lof\"}
Act
\u2192 {\"gof\"}
LoF
\u2192 {\"lof\"}
mixed
\u2192 {\"gof\", \"lof\"}
ambiguous
\u2192 {\"gof\", \"lof\"}
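Written out as a plain Python dictionary, the default map listed above looks like the sketch below. The helper canonical_gene_types is hypothetical, and returning an empty set for unknown gene types is an assumption of this sketch, not documented library behavior.

```python
# The default map as listed above, written out as a plain dictionary.
gene_type_map = {
    'oncogene': {'gof'},
    'tsg': {'lof'},
    'Act': {'gof'},
    'LoF': {'lof'},
    'mixed': {'gof', 'lof'},
    'ambiguous': {'gof', 'lof'},
}


def canonical_gene_types(gene_type, gene_type_map=gene_type_map):
    """Map an annotated gene type to its set of canonical gene types."""
    # Returning an empty set for unknown types is an assumption of this sketch.
    return gene_type_map.get(gene_type, set())


print(sorted(canonical_gene_types('mixed')))
print(sorted(canonical_gene_types('tsg')))
```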
Returns:
dict
\u2013 mappings from gene types to canonical gene types.
Examples:
>>> import icaparser as icap\n>>> icap.get_default_gene_type_map()\n
"},{"location":"reference/#icaparser.icaparser.get_default_mutation_aggregation_rules","title":"get_default_mutation_aggregation_rules()
","text":"Returns the default mutation aggregation rules.
Two types of the mutation status of a gene are defined - allele level and gene level:
For gain of function (gof) genes, the classifications at both the allele and gene levels are identical unless there is supplementary information about activating modifications beyond mutations. In contrast, for loss of function (lof) genes, classifications at the allele and gene levels may diverge. For instance, a truncating mutation in a tumor suppressor gene typically disrupts the function of the affected allele. However, other alleles of the same gene may remain functionally active, meaning the gene as a whole can still be operational, unless the mutated allele is a dominant negative variant. For a gene to be considered completely dysfunctional, all its alleles must be impaired, either through additional mutations or other mechanisms such as copy number deletions or hypermethylation. Consequently, a single variant that disrupts function at the allele level does not necessarily imply disruption at the gene level.
For loss of function (lof) genes, the available information often falls short of allowing a reliable estimation of functional effects. As a result, heuristic rules must be employed, and the analyst is tasked with deciding whether to utilize allele-level or gene-level classifications. A lof gene is classified as functionally disrupted at gene level (strong impact) if it harbors at least two mutations, each either of strong impact or of uncertain impact. Should a lof gene possess only one such mutation, it is classified as having an uncertain impact at the gene level, regardless of whether the mutation exhibits a strong impact at the allele level. By differentiating the effects at both the allele and gene levels, we maintain the flexibility to determine in subsequent analyses how to consolidate these categories for further statistical evaluations.
The function returns a dictionary containing two keys: gof and lof. Associated with each key is a function that accepts a dictionary of counts as its input and outputs a tuple comprising two elements: the mutation status at the allele level and at the gene level. The input dictionary of counts is expected to have two keys, mutated and uncertain. The value for each key represents the number of variants within a gene classified as mutated or uncertain, respectively.
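The shape of such a rule dictionary can be sketched as follows. This is an illustration of the heuristics described above, not the library's implementation: the status labels, in particular wt for genes without any hit, and the helper names are assumptions.

```python
def allele_status(counts):
    """Strongest single-variant status: mutated beats uncertain beats wt."""
    if counts.get('mutated', 0) > 0:
        return 'mutated'
    if counts.get('uncertain', 0) > 0:
        return 'uncertain'
    return 'wt'


def gof_rule(counts):
    """For gof genes the allele and gene level statuses are identical."""
    status = allele_status(counts)
    return status, status


def lof_rule(counts):
    """For lof genes two or more hits are needed for a gene level call."""
    hits = counts.get('mutated', 0) + counts.get('uncertain', 0)
    if hits >= 2:
        gene_status = 'mutated'
    elif hits == 1:
        gene_status = 'uncertain'
    else:
        gene_status = 'wt'
    return allele_status(counts), gene_status


aggregation_rules = {'gof': gof_rule, 'lof': lof_rule}
print(aggregation_rules['lof']({'mutated': 1, 'uncertain': 0}))
print(aggregation_rules['lof']({'mutated': 1, 'uncertain': 1}))
print(aggregation_rules['gof']({'mutated': 0, 'uncertain': 2}))
```

A custom rule set with the same two keys and the same tuple return value could then be passed in place of the default.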
Returns:
dict
\u2013 the gof and lof allele level and gene level aggregation rules.
Examples:
>>> import icaparser as icap\n>>> icap.get_default_mutation_aggregation_rules()\n
"},{"location":"reference/#icaparser.icaparser.get_default_mutation_classification_rules","title":"get_default_mutation_classification_rules(cosmic_threshold=10)
","text":"Returns the default rules for classifying mutations.
Defines the default rules for classifying mutations. The returned dictionary has keys \"gof\" and \"lof\", and the respective values are the rule sets for these gene types. Each rule set is a dictionary with the keys \"mutated\" and \"uncertain\". The values for \"mutated\" or \"uncertain\" are dictionaries with three filter functions, a \"position_filter\", a \"variant_filter\", and a \"transcript_filter\". For example, a transcript will be called \"mutated\" if all three filters for \"mutated\" return True, and it will be called \"uncertain\", if all three filter functions for \"uncertain\" return True.
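The nesting described above can be made concrete with a toy rule set. Everything in this sketch is hypothetical: the cosmic_count field, the placeholder filters, and the classify helper, which checks mutated before uncertain by assumption. Only a gof branch is filled in to keep the example short.

```python
def always(_):
    """Placeholder filter that accepts every input."""
    return True


def make_rules(cosmic_threshold=10):
    """Build a toy rule set with the nesting described above."""

    def is_hotspot(variant):
        # cosmic_count is a hypothetical field used only in this sketch.
        return variant.get('cosmic_count', 0) >= cosmic_threshold

    def is_not_hotspot(variant):
        return not is_hotspot(variant)

    def rule(variant_filter):
        return {
            'position_filter': always,
            'variant_filter': variant_filter,
            'transcript_filter': always,
        }

    # Only the gof branch is filled in; a lof branch would have the same shape.
    return {'gof': {'mutated': rule(is_hotspot), 'uncertain': rule(is_not_hotspot)}}


def classify(rules, gene_type, position, variant, transcript):
    """Return the first status whose three filters all return True."""
    for status in ('mutated', 'uncertain'):
        filters = rules[gene_type][status]
        if (filters['position_filter'](position)
                and filters['variant_filter'](variant)
                and filters['transcript_filter'](transcript)):
            return status
    return None


rules = make_rules(cosmic_threshold=10)
print(classify(rules, 'gof', {}, {'cosmic_count': 25}, {}))
print(classify(rules, 'gof', {}, {'cosmic_count': 3}, {}))
```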
These are the default rules returned by this function:
GOF
mutated: non-deleterious hotspot mutations, i.e., mutations whose Cosmic sample count reaches cosmic_threshold.
uncertain: non-deleterious mutations that aren't hotspots, i.e., mutations whose Cosmic sample count is below cosmic_threshold.
LOF
mutated: deleterious mutations (such as truncations, start or stop codon loss).
uncertain: amino acid sequence modifying mutations that are most likely not deleterious. This includes missense mutations and in-frame insertions and deletions.
Parameters:
cosmic_threshold
(int
, default: 10
) \u2013 for \"gof\" genes, this is the \"hotspot threshold\" for Cosmic, i.e., the minimum number of samples in Cosmic having that mutation to consider a mutation a hot spot and, therefore, call the mutation \"mutated\". If the number of Cosmic samples is smaller, the mutation is called \"uncertain\".
Returns:
dict
\u2013 default mutation classification rules.
Examples:
>>> import icaparser as icap\n>>> icap.get_default_mutation_classification_rules()\n>>> icap.get_default_mutation_classification_rules(cosmic_threshold=20)\n
"},{"location":"reference/#icaparser.icaparser.get_dna_json_files","title":"get_dna_json_files(base_dir, pattern='*MergedVariants_Annotated_filtered.json.gz')
","text":"Find DNA annotation JSON files in or below base_dir
.
Searches for ICA DNA annotation JSON files in and below base_dir
. All file names matching pattern
are returned.
Parameters:
base_dir
(str
) \u2013 base directory of the directory subtree in which to search for DNA annotation JSON files.
pattern
(str
, default: '*MergedVariants_Annotated_filtered.json.gz'
) \u2013 file names matching this pattern are returned.
Returns:
list
\u2013 file names.
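A recursive search of this kind can be sketched with pathlib. This is an assumption about the general technique, not the library's code: find_json_files is a hypothetical helper, the sorting is a choice of this sketch, and the demo files are empty placeholders in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_json_files(base_dir, pattern='*MergedVariants_Annotated_filtered.json.gz'):
    """Return sorted paths of all files in or below base_dir matching pattern."""
    return sorted(str(path) for path in Path(base_dir).rglob(pattern))


# Demonstrate on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as base_dir:
    run_dir = Path(base_dir) / 'run_1'
    run_dir.mkdir()
    (run_dir / 'S1_MergedVariants_Annotated_filtered.json.gz').touch()
    (run_dir / 'S1_other.json.gz').touch()
    found = find_json_files(base_dir)
    print([Path(name).name for name in found])
```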
get_gene_type(gene_symbol)
","text":"Get the gene type (oncogene, tsg, mixed) for a gene.
Parameters:
gene_symbol
(str
) \u2013 the gene symbol of the gene.
Returns:
str
\u2013 the gene type.
get_genes(file)
","text":"Extract gene annotation from an ICA JSON file.
The genes
section of ICA JSON files is optional. If this section is not included in the file, an empty list is returned.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
Returns:
list
\u2013 gene annotations.
get_gnomad_exome_max_af(variant, cohorts=['afr', 'amr', 'eas', 'nfe', 'sas'])
","text":"Get the maximum allele frequency for gnomAD Exome.
Get the maximum allele frequency across all major cohorts annotated by gnomAD Exome, excluding bottleneck populations (Ashkenazi Jews and Finnish) and other.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
cohorts
(list
, default: ['afr', 'amr', 'eas', 'nfe', 'sas']
) \u2013 subpopulations to include.
Returns:
float
\u2013 maximum gnomAD Exome allele frequency.
get_gnomad_max_af(variant, cohorts=['afr', 'amr', 'eas', 'nfe', 'sas'])
","text":"Get the maximum allele frequency for gnomAD.
Get the maximum allele frequency across all major cohorts annotated by gnomAD, excluding bottleneck populations (Ashkenazi Jews and Finnish) and other.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
cohorts
(list
, default: ['afr', 'amr', 'eas', 'nfe', 'sas']
) \u2013 subpopulations to include.
Returns:
float
\u2013 maximum gnomAD allele frequency.
get_header(file)
","text":"Extract the header element from an ICA JSON file.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
Returns:
dict
\u2013 header from the JSON file.
get_header_scalars(file)
","text":"Extract a table with all scalar attributes from the JSON header.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
Returns:
DataFrame
\u2013 table of scalar attributes and their values.
get_max_af(variant, source, cohorts=None)
","text":"Get the maximum allele frequency for a particular annotation source.
Get the maximum allele frequency across all cohorts annotated by the annotation source.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
source
(str
) \u2013 the annotation source to use, for example 'gnomad' or 'gnomadExome' or 'oneKg'.
cohorts
(list
, default: None
) \u2013 subpopulations to include; include all if None.
Returns:
float
\u2013 the maximum allele frequency.
Examples:
>>> import icaparser as icap\n>>> icap.get_max_af(variant, 'gnomad')\n
"},{"location":"reference/#icaparser.icaparser.get_multi_sample_positions","title":"get_multi_sample_positions(files, *args, **kwargs)
","text":"Extract all positions for a set of ICA JSON files.
The sample id is stored as an additional new attribute of the samples
element of a position. The samples
element is a list, although ICA usually only creates single sample JSON files.
Parameters:
files
(list
) \u2013 names of the ICA JSON files.
args
(object
, default: ()
) \u2013 extra arguments forwarded to get_positions().
kwargs
(object
, default: {}
) \u2013 extra named arguments forwarded to get_positions().
Returns:
list
\u2013 filtered positions from all files.
Examples:
>>> import icaparser as icap\n>>> positions = icap.get_multi_sample_positions(json_files)\n>>> print(positions[0]['samples'][0]['sampleId'])\n
"},{"location":"reference/#icaparser.icaparser.get_mutation_table_for_files","title":"get_mutation_table_for_files(json_files, max_af=0.001, min_vep_consequence_priority=6, min_cosmic_sample_count=0, only_canonical=False, extra_variant_filters=[], extra_transcript_filters=[])
","text":"Get an annotated table of all filtered transcripts from a list of ICA JSON files.
Load all positions from a list of ICA JSON files and filter them. Positions having any remaining variants and transcripts passing the filter are returned as an annotated table.
Parameters:
json_files
(list
) \u2013 list of ICA JSON files
max_af
(float
, default: 0.001
) \u2013 maximum allele frequency for gnomAD, gnomAD Exome and 1000 Genomes. Only variants with maximum allele frequencies below this threshold will be returned.
min_vep_consequence_priority
(int
, default: 6
) \u2013 only transcripts with a minimum VEP consequence priority not larger than this threshold will be retained. Consequences with priorities <= 6 change the protein sequence, consequences with priorities > 6 do not change the protein sequence.
min_cosmic_sample_count
(int
, default: 0
) \u2013 only variants with a maximum cosmic sample count not lower than this threshold will be retained
only_canonical
(bool
, default: False
) \u2013 if true, only canonical transcripts will be retained
extra_variant_filters
(list
, default: []
) \u2013 any additional filters to apply to variants. Filters shall return True to keep a variant.
extra_transcript_filters
(list
, default: []
) \u2013 any additional filters to apply to transcripts. Filters shall return True to keep a transcript.
Returns:
DataFrame
\u2013 table of annotated mutations and affected transcripts.
Examples:
>>> import icaparser as icap\n>>> extra_transcript_filters = [\n lambda x: x.get('source', '') == 'Ensembl',\n lambda x: x.get('hgnc', '') == 'KRAS'\n ]\n>>> mut_table = icap.get_mutation_table_for_files(\n json_files,\n extra_transcript_filters=extra_transcript_filters\n )\n
"},{"location":"reference/#icaparser.icaparser.get_mutation_table_for_position","title":"get_mutation_table_for_position(position)
","text":"Get an annotated table of all transcripts for a single position.
Returns an annotated table of all transcripts that are affected by a mutation at a position.
Parameters:
position
(dict
) \u2013 the position to investigate.
Returns:
DataFrame
\u2013 table of annotated mutations and affected transcripts.
get_mutation_table_for_positions(positions, hide_progress=False)
","text":"Get an annotated table of all transcripts for all positions.
Returns an annotated table of all transcripts that are affected by a mutation at any of the positions.
Parameters:
positions
(list
) \u2013 the positions to investigate.
Returns:
DataFrame
\u2013 table of annotated mutations and affected transcripts.
get_onekg_max_af(variant)
","text":"Get the maximum allele frequency for the 1000 Genomes Project.
Get the maximum allele frequency across all cohorts annotated by the 1000 Genomes Project.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
Returns:
float
\u2013 maximum 1000 Genomes allele frequency.
get_pipeline_metadata(files)
","text":"Extract a table with metadata of the annotation pipeline runs from the JSON headers.
Parameters:
files
(list
) \u2013 names of the ICA JSON files.
Returns:
DataFrame
\u2013 table with metadata of pipeline runs.
get_position_by_coordinates(positions, chromosome, position)
","text":"Extract a particular position from a position list.
Parameters:
positions
(list
) \u2013 list of input positions.
chromosome
(str
) \u2013 name of the chromosome.
position
(int
) \u2013 numeric position on the chromosome.
Returns:
dict
\u2013 the position for the specified chromosome and numeric position.
Examples:
>>> import icaparser as icap\n>>> icap.get_position_by_coordinates(positions, 'chr1', 204399064)\n
"},{"location":"reference/#icaparser.icaparser.get_positions","title":"get_positions(file, variant_filters=[], transcript_filters=[])
","text":"Extract all positions from an ICA JSON file.
The sample id is stored as an additional new attribute of the samples
element of a position. The samples
element is a list, although ICA usually only creates single sample JSON files.
Parameters:
file
(str
) \u2013 name of the ICA JSON file
variant_filters
(list
, default: []
) \u2013 any filters to apply to variants. Filters shall return True to keep a variant.
transcript_filters
(list
, default: []
) \u2013 any filters to apply to transcripts. Filters shall return True to keep a transcript.
Returns:
list
\u2013 filtered positions from file.
Examples:
>>> import icaparser as icap\n>>> transcript_filters = [\n lambda x: x.get('source', '') == 'Ensembl',\n lambda x: x.get('hgnc', '') == 'KRAS'\n ]\n>>> positions = icap.get_positions(\n json_file,\n transcript_filters=transcript_filters\n )\n>>> print(positions[0]['samples'][0]['sampleId'])\n
"},{"location":"reference/#icaparser.icaparser.get_sample","title":"get_sample(file, suffix='(-D[^.]*)?\\\\.bam')
","text":"Extract the sample name from an ICA JSON file.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
suffix
(str
, default: '(-D[^.]*)?\\\\.bam'
) \u2013 regular expression to remove from the sample name in the JSON file. Defaults to '(-D[^.]*)?.bam'.
Returns:
str
\u2013 name of the sample annotated in the JSON file.
get_strongest_vep_consequence_name(transcript)
","text":"Get the name of the strongest VEP consequence for a transcript.
Parameters:
transcript
(dict
) \u2013 the transcript to investigate.
Returns:
str
\u2013 the consequence.
get_strongest_vep_consequence_priority(transcript)
","text":"Get the strongest priority of VEP consequence for a transcript.
Get the strongest numeric priority of all VEP consequences for a transcript. Smaller numeric priorities mean stronger impact.
Parameters:
transcript
(dict
) \u2013 the transcript to investigate.
Returns:
int
\u2013 the strongest numeric priority, smaller values mean higher priority.
get_strongest_vep_consequence_rank(transcript)
","text":"Get the strongest rank of VEP consequences for a transcript.
Get the strongest numeric rank of all VEP consequences for a transcript. Smaller ranks mean stronger impact.
The priority of consequences is taken into account first. So if two consequences have different priorities, the consequence with the higher priority (lower priority number) will be used, and the rank for this consequence will be returned. If there are multiple consequences with the same priority, the lowest (strongest) rank will be returned.
For clarification: ranks are unique, i.e. all VEP consequences ordered as listed on the VEP documentation page get the row number of this table assigned as rank.
However, several consequences can have the same priority (e.g., stop gained and frameshift have the same priority). Priorities are copied from vcf2maf.pl of MSKCC.
Parameters:
transcript
(dict
) \u2013 the transcript to investigate.
Returns:
int
\u2013 the rank of the VEP consequence with strongest impact.
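The selection order described for this function (priority first, rank as tie-breaker) can be sketched as follows. This is only an illustration, not the package's implementation: the table values and the helper name `strongest_rank` are hypothetical, and the real ranks and priorities come from the VEP consequence table and vcf2maf.pl.

```python
# Hypothetical rank/priority values, for illustration only.
CONSEQUENCE_TABLE = {
    "stop_gained": {"rank": 5, "priority": 1},
    "frameshift_variant": {"rank": 6, "priority": 1},
    "missense_variant": {"rank": 13, "priority": 4},
    "synonymous_variant": {"rank": 21, "priority": 7},
}


def strongest_rank(consequences):
    """Return the rank of the consequence with the strongest impact."""
    # Compare by priority first; among equal priorities the lowest rank wins.
    best = min(
        consequences,
        key=lambda c: (CONSEQUENCE_TABLE[c]["priority"], CONSEQUENCE_TABLE[c]["rank"]),
    )
    return CONSEQUENCE_TABLE[best]["rank"]
```

With these made-up values, `stop_gained` and `frameshift_variant` share a priority, so the lower of their two ranks is returned.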
get_vep_consequence_for_rank(rank)
","text":"Get the VEP consequence term of a numeric rank.
Parameters:
rank
(int
) \u2013 the numeric rank of the consequence term.
Returns:
str
\u2013 the consequence.
get_vep_priority_for_consequence(consequence)
","text":"Get the numeric priority of a VEP consequence term.
The numeric priority of a consequence that is returned by this function is the same as defined by vcf2maf.pl of MSKCC.
Parameters:
consequence
(str
) \u2013 the consequence term of the variant.
Returns: the priority of the consequence, smaller values mean higher priority.
"},{"location":"reference/#icaparser.icaparser.get_vep_rank_for_consequence","title":"get_vep_rank_for_consequence(consequence)
","text":"Get the numeric rank of a VEP consequence term.
The numeric rank of a consequence is the position of the consequence in this list of consequences for the Variant Effect Predictor VEP.
Parameters:
consequence
(str
) \u2013 the consequence term of the variant.
Returns:
int
\u2013 the rank of the consequence, smaller values mean higher rank.
split_multi_sample_json_file(json_file, output_dir)
","text":"Splits a multi-sample JSON file into sample specific JSON files.
This function reads a multi-sample JSON file that was generated by annotating a multi-sample VCF file with ICA and splits it into sample-specific JSON files.
Annotating many single-sample VCF files with ICA is very time consuming because ICA reads all annotation sources for each VCF file, and this dominates the runtime. It is therefore helpful to first merge many single-sample VCF files into one or a small number of multi-sample VCF files (for example, with bcftools merge
), to annotate the multi-sample VCF file with ICA, and then to split the multi-sample JSON output of ICA into single-sample JSON files. These single-sample JSON files are required for the rest of this package.
Parameters:
json_file
(str
) \u2013 the multi-sample json input file.
output_dir
(str
) \u2013 the directory where to write the single sample JSON files. The directory will be created if it does not exist.
Returns:
None
\u2013 None.
strip_json_file(ifname, ofname)
","text":"Reduce the JSON file size by keeping only 'PASS' variants.
JSON files from Illumina's ICA pipeline can be very large because they contain any deviation from the reference genome, irrespective of the quality of the mutation call. Gzip compressed JSON files with sizes in the gigabyte range cannot be processed by JSON packages that read the entire file into memory. It is necessary to first reduce the size of JSON files by removing all variants that do not meet Illumina's quality criteria.
This function reads a single JSON file and creates a single JSON output file by removing all variants that do not pass Illumina's quality criteria.
Parameters:
ifname
(str
) \u2013 name of the input file.
ofname
(str
) \u2013 name of the output file.
Returns:
None
\u2013 None.
strip_json_files(source_dir, target_dir, pattern='*.json.gz')
","text":"Strip all JSON files of a project by keeping only 'PASS' variants.
JSON files from Illumina's ICA pipeline can be very large because they contain any deviation from the reference genome, irrespective of the quality of the mutation call. Gzip compressed JSON files with sizes in the gigabyte range cannot be processed by JSON packages that read the entire file into memory. It is necessary to first reduce the size of JSON files by removing all variants that do not meet Illumina's quality criteria.
This function searches source_dir
recursively for all files matching the pattern
. Each of those files is processed and a stripped version keeping only variants that PASS Illumina's quality criteria is created. The output file has the same name as the input file. The directory structure below source_dir
is replicated in target_dir
. Output files get the suffix '_filtered.json.gz'.
Parameters:
source_dir
(str
) \u2013 directory where to search for input JSON files.
target_dir
(str
) \u2013 directory where to save the stripped output JSON files.
pattern
(str
, default: '*.json.gz'
) \u2013 files matching this pattern will be processed.
Returns:
None
\u2013 None.
Examples:
>>> strip_json_files('../Data/Original', '../Data/Derived')\n
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to the icaparser Python package","text":"This Python package provides functions for parsing JSON files created by Illumina's Connected Annotations (ICA) pipeline. ICA annotates mutations with a set of tools and data sources. This package allows you to:
See the examples and the API documentation for further details.
"},{"location":"examples/","title":"Examples","text":""},{"location":"examples/#stripping-very-large-json-files","title":"Stripping very large JSON files","text":"Some JSON files from Illumina TSO panels (for example, TSO500) are not QC filtered and contain all detected genomic variants, irrespective of whether they pass the quality criteria. Such files can get very large, too large to be processed by any JSON parser. If your JSON file does not contain only QC-filtered variants (\"PASS\"), it needs to be stripped (filtered) first before using the icaparser module for further processing.
The code below can be run in Python in a terminal or in a Jupyter notebook. Terminal is recommended.
import icaparser as icap\nicap.strip_json_files(source_dir='../Data/Original', target_dir='../Data/Derived')\n
"},{"location":"examples/#simple-example","title":"Simple example","text":"The code below is the Hello World example for reading and filtering ICA JSON files with default filtering rules. For more sophisticated filtering options, see the API reference.
import icaparser as icap\njson_files = icap.get_dna_json_files('../Data/Derived')\nfirst_file = json_files[0]\n# Get the annotation data sources\nicap.get_data_sources(first_file)\n# Get pipeline run metadata\nicap.get_pipeline_metadata(json_files)\n# Get a mutation table\nmut_table = icap.get_mutation_table_for_files(json_files)\n
"},{"location":"installation/","title":"Installation instructions","text":""},{"location":"installation/#installation-of-the-icaparser-package","title":"Installation of the icaparser package","text":"It is recommended to create a new virtual environment with Python >= 3.9 and to install the icaparser package in that environment. Activate the environment and run:
pip install \"git+https://github.com/Bayer-Group/ica-parser.git#subdirectory=icaparser\"\n
If you want to install a particular development branch, use
pip install \"git+https://github.com/Bayer-Group/ica-parser.git@BRANCHNAME#subdirectory=icaparser\"\n
If you use Jupyter notebooks, the virtual environment should be added as a new Jupyter kernel. See Using Virtual Environments in Jupyter Notebook and Python - Parametric Thoughts for how to do that.
"},{"location":"installation/#installation-of-ipywidgets","title":"Installation of ipywidgets","text":"Required for progress bars in Jupyter. Please refer to the Jupyter or JupyterLab documentation for how to install the widgets. For example:
conda install jupyter # if not installed yet\nconda install jupyterlab_widgets\njupyter labextension install jupyter-matplotlib\njupyter lab build\nexit\n
\u2192 Restart Jupyter
"},{"location":"reference/","title":"API documentation","text":"Parser for JSON files from Illumina Connected Annotations pipeline.
"},{"location":"reference/#icaparser.icaparser.add_gene_types","title":"add_gene_types(positions)
","text":"Adds the gene type to each transcript.
Transcripts will be annotated with the gene type (oncogene, tsg, mixed) by adding a new attribute geneType
. Only transcripts with one of these three gene types get this additional annotation. Other transcripts will not get the geneType
attribute.
Parameters:
positions
(list
) \u2013 list of filtered or unfiltered positions from JSON files.
Returns:
list
\u2013 list of positions with additional annotation of transcripts.
Examples:
>>> import icaparser as icap\n>>> positions = icap.add_gene_types(positions)\n
"},{"location":"reference/#icaparser.icaparser.apply_mutation_classification_rules","title":"apply_mutation_classification_rules(positions, rule_set=get_default_mutation_classification_rules(), gene_type_map=get_default_gene_type_map(), hide_progress=False)
","text":"Applies mutation classification rules to all positions.
Each variant is categorized for each transcript that overlaps with the genomic position of the variant. Each transcript that passes the \"mutated\" or \"uncertain\" mutation classification rules gets a new attribute mutation_status
with the value \"mutated\" or \"uncertain\". The input list of positions is modified by adding the mutation_status
attribute to transcripts, and the modified list of positions is returned as the first element of the returned tuple.
In addition to modifying and returning the list of positions, this function also returns the assembled mutation status after aggregating the impact on all transcripts covering a variant. This is returned as the second item of the returned tuple. The impact depends on the type of gene (\"gof\" or \"lof\"), so the impacts are assembled separately for each gene type.
The impact of a particular mutational variant can be different for different overlapping transcript variants of a gene, and the transcript variants can also belong to different genes. The strongest impact on any overlapping transcript of a gene is defined as the impact of that mutational variant on the gene. The analyst must decide which isoforms are used to classify genes. For example, only canonical transcripts may be considered. Alternatively, all transcripts or a subset of transcripts may be used. Therefore, it is necessary to first apply transcript-level filters to all genomic positions before this function is called for determining the mutation status of genes.
The returned value is a multi-dimensional dictionary:
sample_id
\u2192 gene
\u2192 gene_type
\u2192 variant_id
\u2192 mutationStatus
Parameters:
positions
(list
) \u2013 list of positions.
rule_set
(dict
, default: get_default_mutation_classification_rules()
) \u2013 rules for classifying \"gof\" and \"lof\" genes. See the default value for an example if a custom rule set is needed.
gene_type_map
(dict
, default: get_default_gene_type_map()
) \u2013 dictionary for mapping gene types to canonical gene types. See the default value for an example if a custom map is needed.
Returns:
tuple[list, dict]
\u2013 A list of positions and a dictionary with assembled and aggregated mutations.
Examples:
>>> import icaparser as icap\n>>> positions, sample_muts = icap.apply_mutation_classification_rules(positions)\n
"},{"location":"reference/#icaparser.icaparser.cleanup_cosmic","title":"cleanup_cosmic(positions)
","text":"Remove Cosmic entries with alleles not matching the variant alleles.
ICA attaches Cosmic entries to variants based on position only, which leads to wrong assignments of Cosmic entries to variants. This function removes all Cosmic entries from a variant for which reference and altered alleles do not match those of the variant.
Filtering is done in place.
Parameters:
positions
(list
) \u2013 list of positions to clean up.
Returns:
list
\u2013 list of positions with cleaned up Cosmic entries.
common_variant_filter(variant, max_af=0.001)
","text":"Get a variant filter based on GnomAD, GnomAD Exome, and 1000 Genomes.
Returns True if none of the maximum allele frequencies from GnomAD, GnomAD exomes and 1000 genomes is greater than max_af
. The default value of 0.1 % for the maximum allele frequency corresponds to that of the AACR GENIE project.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
max_af
(float
, default: 0.001
) \u2013 the maximum allele frequency threshold.
Returns:
bool
\u2013 True if this is not a common variant.
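The threshold logic of this filter can be sketched without any ICA data. The helper below is hypothetical and takes the per-source maximum allele frequencies directly; treating a missing annotation as `None` and skipping it is an assumption, not documented behavior.

```python
def is_not_common(max_afs, max_af=0.001):
    """Return True if no per-source maximum allele frequency exceeds max_af."""
    # Sources without an annotation are passed as None and are skipped (assumption).
    return not any(af is not None and af > max_af for af in max_afs)


# One value per source: gnomAD, gnomAD Exome, 1000 Genomes (made-up numbers).
rare = is_not_common([0.0005, None, 0.001])
common = is_not_common([0.0005, 0.002, 0.0])
```

A frequency exactly equal to `max_af` still passes, matching the "greater than" wording of the description.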
explode_consequence(mutation_table, inplace=False)
","text":"Explode the VEP consequence column of a mutation table.
Exploding the VEP consequence column with the standard Pandas explode()
function would return consequences as strings, not as ordered categories. This function will instead return a consequence column which is an ordered category. The categories are ordered by their impact.
Exploding means that if a row of the input table has multiple consequences in the consequence column, the list of consequences will be split into single consequences and the output table will have multiple rows with a single consequence per row.
Parameters:
mutation_table
(DataFrame
) \u2013 the mutation table to explode
inplace
(bool
, default: False
) \u2013 if True, then modify the mutation_table in place instead of returning a new object
Returns:
DataFrame
\u2013 new mutation table with exploded consequences.
Examples:
>>> import icaparser as icap\n>>> icap.explode_consequence(mutation_table, inplace=True)\n
>>> mutation_table_exploded = icap.explode_consequence(mutation_table)\n
"},{"location":"reference/#icaparser.icaparser.filter_positions_by_transcripts","title":"filter_positions_by_transcripts(positions, filter_func)
","text":"Filter positions based on a filter function for transcripts.
Apply a filter function to all transcripts of each position. Transcripts not passing the filter are removed from the variants of a position. Variants without any transcript left are removed from a position. Positions without any variants left are removed from the returned list of positions.
Parameters:
positions
(list
) \u2013 list of positions to filter.
filter_func
(Callable[[dict], bool]
) \u2013 function taking a transcript and returning a bool. True means to keep the transcript.
Returns:
list
\u2013 filtered positions.
Examples:
>>> is_canonical_transcript = lambda x: x.get('isCanonical', False)\n>>> canonical_positions = icap.filter_positions_by_transcripts(\n non_common_positions,\n is_canonical_transcript\n )\n
"},{"location":"reference/#icaparser.icaparser.filter_positions_by_variants","title":"filter_positions_by_variants(positions, filter_func)
","text":"Filter positions based on a filter function for variants.
Apply a filter function to all variants of each position. Variants not passing the filter are removed from a position. Positions without any variants passing the filter are removed from the returned list.
Parameters:
positions
(list
) \u2013 list of positions to filter.
filter_func
(Callable[[dict], bool]
) \u2013 function taking a variant and returning a bool. True means to keep the variant.
Returns:
list
\u2013 filtered positions.
Examples:
>>> import icaparser as icap\n>>> max_af = 0.001\n>>> is_not_common_variant = lambda x: icap.common_variant_filter(x, max_af)\n>>> non_common_positions = icap.filter_positions_by_variants(\n positions,\n is_not_common_variant\n )\n
"},{"location":"reference/#icaparser.icaparser.filter_variants_by_transcripts","title":"filter_variants_by_transcripts(variants, filter_func)
","text":"Filter variants based on a filter function for transcripts.
Apply a filter function to all transcripts of each variant. Transcripts not passing the filter are removed from a variant. Variants without any transcripts passing the filter are removed from the returned list.
Parameters:
variants
(list
) \u2013 list of variants to filter.
filter_func
(Callable[[dict], bool]
) \u2013 function taking a transcript and returning a bool. True means to keep the transcript.
Returns:
list
\u2013 filtered variants.
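The filtering behavior described for this function can be sketched as below. This is a minimal illustration that assumes each variant is a dictionary with a `transcripts` list; the helper name `filter_variants` and the `vid` values are placeholders, and copying instead of modifying in place is a choice of this sketch only.

```python
def filter_variants(variants, keep):
    """Keep transcripts for which keep(transcript) is True; drop empty variants."""
    filtered = []
    for variant in variants:
        transcripts = [t for t in variant.get("transcripts", []) if keep(t)]
        if transcripts:
            # Shallow copy so the input list is left untouched.
            filtered.append({**variant, "transcripts": transcripts})
    return filtered


variants = [
    {"vid": "v1", "transcripts": [{"isCanonical": True}, {"isCanonical": False}]},
    {"vid": "v2", "transcripts": [{"isCanonical": False}]},
]
canonical_only = filter_variants(variants, lambda t: t.get("isCanonical", False))
```

Filtering positions by transcripts follows the same pattern with one more nesting level: positions whose variants are all removed are dropped as well.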
get_aggregated_mutation_table(positions, sample_muts=None, mutation_classification_rules=get_default_mutation_classification_rules(), mutation_aggregation_rules=get_default_mutation_aggregation_rules(), gene_type_map=get_default_gene_type_map(), hide_progress=False)
","text":"Returns a sample-gene-mutationStatus table.
This function applies mutation classification rules to all mutational variants and aggregates the mutations according to the aggregation rules. This results in a table with one row for each sample-gene pair. The table contains several columns with impacts according to lof and gof rules on allele level and gene level and with one additional column with the maximum impact for both allele and gene level.
Parameters:
positions
(list
) \u2013 list of positions. If sample_muts is also specified, it is assumed that the positions have already been processed previously by apply_mutation_classification_rules
and we do not have to run mutation classification again.
sample_muts
(dict
, default: None
) \u2013 if apply_mutation_classification_rules
has been run before, you can use the second return value of that function as the sample_muts argument. This is helpful for very large datasets because otherwise apply_mutation_classification_rules
will be run again as an internal call within get_aggregated_mutation_table
, which is time consuming for very large data sets. This also means that if sample_muts
is provided as an argument, the mutation_classification_rules
argument is ignored and has no effect.
mutation_classification_rules
(dict
, default: get_default_mutation_classification_rules()
) \u2013 rules for classifying single mutations. See get_default_mutation_classification_rules()
for details.
mutation_aggregation_rules
(dict
, default: get_default_mutation_aggregation_rules()
) \u2013 rules for aggregating mutations. See get_default_mutation_aggregation_rules()
for details.
gene_type_map
(dict
, default: get_default_gene_type_map()
) \u2013 dictionary for mapping gene types to canonical gene types. See get_default_gene_type_map()
for details.
Returns:
DataFrame
\u2013 mutation table.
get_biotype_priority(biotype)
","text":"Get the numeric priority of a biotype.
The numeric priority of a biotype that is returned by this function is the same as defined by vcf2maf.pl by MSKCC. Biotypes are 'protein_coding', 'LRG_gene', 'miRNA', ...
Parameters:
biotype
(str
) \u2013 the biotype for which the priority is to be returned.
Returns:
int
\u2013 the priority, smaller values mean higher priority.
get_clinvar(variant)
","text":"Get a table of all ClinVar annotations for a variant.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
Returns:
DataFrame
\u2013 table with ClinVar annotations.
get_clinvar_max_significance(variant, ordered_significances=_CLINVAR_ORDERED_SIGNIFICANCES)
","text":"Get the maximum significance for all ClinVar annotations of a variant.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
ordered_significances
(list
, default: _CLINVAR_ORDERED_SIGNIFICANCES
) \u2013 ranked order of ClinVar significances.
Returns:
str
\u2013 ClinVar significance of highest rank for the variant.
get_consequences(transcript)
","text":"Get a list of consequences for a transcript.
A list of consequences of a variant for a transcript is returned. If any of the annotated consequences is a combination of single consequences, separated by ampersands (&) or commas, the consequence is split into single consequences.
Parameters:
transcript
(dict
) \u2013 the transcript for which the consequences are to be returned.
Returns:
list
\u2013 the consequences, a list of strings.
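The splitting step can be sketched with the standard library. The helper name `split_consequences` is hypothetical, and the input is assumed to be a plain list of annotated consequence strings.

```python
import re


def split_consequences(annotated):
    """Split combined consequence terms on ampersands and commas."""
    singles = []
    for term in annotated:
        # A combined term such as "a&b" or "a,b" yields one entry per part.
        singles.extend(part.strip() for part in re.split(r"[&,]", term) if part.strip())
    return singles


terms = split_consequences(["missense_variant&splice_region_variant", "intron_variant"])
```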
get_cosmic_max_sample_count(variant, only_allele_specific=True)
","text":"Get the maximum sample count for all Cosmic annotations of a variant.
A variant can have no, one or multiple associated Cosmic identifiers. This function returns the maximum sample count of all Cosmic identifiers. For each Cosmic identifier, sample numbers are summed up across all indications. Returns 0 if no Cosmic identifier exists for this variant.
The 'only_allele_specific' argument is used to exclude Cosmic entries that annotate the same chromosomal location but an allele that is different from the allele of the annotated variant. ICA annotates a variant with all Cosmic entries for that chromosomal location, irrespective of alleles. When counting Cosmic samples, this leads to an overestimation of Cosmic sample counts for a particular variant. Therefore, 'only_allele_specific' is True by default to count only samples from Cosmic entries with matching alleles. Occasionally, it may be desired, though, to count all samples with mutations at a given position, irrespective of allele. For example, several different alleles at a functional site of a gene can lead to function-disrupting mutations, so we want to get the maximum sample count for any allele at that position. One might also think of adding the sample counts for all Cosmic entries annotating a variant, but this does not work due to redundancy of Cosmic entries. Older Cosmic versions often included the same sample in different Cosmic entries. And newer Cosmic versions often have multiple entries for an allele, one for each transcript variant, with the same underlying samples.
Parameters:
variant
(dict
) \u2013 the variant to investigate
only_allele_specific
(bool
, default: True
) \u2013 consider only cosmic entries with alleles matching the allele of the annotated variant
Returns:
int
\u2013 maximum cosmic sample count
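The counting logic described above (sum per Cosmic entry across indications, then take the maximum across entries) can be sketched as follows. The field names `refAllele`, `altAllele`, and `sampleCounts` and the helper itself are assumptions made for this illustration and may not match the actual JSON layout.

```python
def max_cosmic_sample_count(variant, cosmic_entries, only_allele_specific=True):
    """Return the largest per-entry sample count, or 0 if no entry qualifies."""
    counts = []
    for entry in cosmic_entries:
        alleles_match = (
            entry["refAllele"] == variant["refAllele"]
            and entry["altAllele"] == variant["altAllele"]
        )
        if only_allele_specific and not alleles_match:
            continue  # same position, different allele
        # Sum the samples of one entry across all indications.
        counts.append(sum(entry["sampleCounts"].values()))
    return max(counts, default=0)


variant = {"refAllele": "C", "altAllele": "T"}
entries = [
    {"refAllele": "C", "altAllele": "T", "sampleCounts": {"lung": 7, "colon": 5}},
    {"refAllele": "C", "altAllele": "A", "sampleCounts": {"lung": 40}},
]
```

Taking the maximum rather than the sum across entries avoids double counting samples that appear in several redundant Cosmic entries.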
get_data_sources(file)
","text":"Extract a table with annotation data sources from the JSON header.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
Returns:
DataFrame
\u2013 table with annotation data sources and their versions.
get_default_gene_type_map()
","text":"Returns the default gene type map.
The canonical gene types are gof
, lof
, and the union of both. Genes that need to be activated to drive a tumor are of type gof
. Genes that need to be deactivated to drive a tumor are of type lof
. Genes that need to be activated or deactivated depending on the context are of the union of both types. Genes for which it is unknown if they need to be activated or deactivated are also annotated with both types. Genes can be originally annotated with other type names than the canonical ones. The gene type map is used to map these other gene type names to the canonical gene types.
The default map is:
oncogene
\u2192 {\"gof\"}
tsg
\u2192 {\"lof\"}
Act
\u2192 {\"gof\"}
LoF
\u2192 {\"lof\"}
mixed
\u2192 {\"gof\", \"lof\"}
ambiguous
\u2192 {\"gof\", \"lof\"}
Returns:
dict
\u2013 mappings from gene types to canonical gene types.
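The default map listed above can be written out as a plain dictionary of sets. The lookup helper `canonical_gene_types` is hypothetical, and returning an empty set for an unknown gene type is an assumption of this sketch.

```python
# The mapping as listed in the description of the default gene type map.
GENE_TYPE_MAP = {
    "oncogene": {"gof"},
    "tsg": {"lof"},
    "Act": {"gof"},
    "LoF": {"lof"},
    "mixed": {"gof", "lof"},
    "ambiguous": {"gof", "lof"},
}


def canonical_gene_types(gene_type, gene_type_map=GENE_TYPE_MAP):
    """Map an annotated gene type to its set of canonical gene types."""
    # Unknown gene types yield an empty set (assumption of this sketch).
    return gene_type_map.get(gene_type, set())
```

A custom map would follow the same shape: keys are the gene type names used in the annotation, values are subsets of `{"gof", "lof"}`.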
Examples:
>>> import icaparser as icap\n>>> icap.get_default_gene_type_map()\n
"},{"location":"reference/#icaparser.icaparser.get_default_mutation_aggregation_rules","title":"get_default_mutation_aggregation_rules()
","text":"Returns the default mutation aggregation rules.
Two types of the mutation status of a gene are defined - allele level and gene level:
For gain of function (gof) genes, the classifications at both the allele and gene levels are identical unless there is supplementary information about activating modifications beyond mutations. In contrast, for loss of function (lof) genes, classifications at the allele and gene levels may diverge. For instance, a truncating mutation in a tumor suppressor gene typically disrupts the function of the affected allele. However, other alleles of the same gene may remain functionally active, meaning the gene as a whole can still be operational, unless the mutated allele is a dominant negative variant. For a gene to be considered completely dysfunctional, all its alleles must be impaired, either through additional mutations or other mechanisms such as copy number deletions or hypermethylation. Consequently, a single variant that disrupts function at the allele level does not necessarily imply disruption at the gene level.
For loss of function (lof) genes, the available information often falls short of allowing a reliable estimation of functional effects. As a result, heuristic rules must be employed, and the analyst is tasked with deciding whether to utilize allele-level or gene-level classifications. A lof gene is classified as functionally disrupted at gene level (strong impact) if it harbors at least two mutations, each either of strong impact or of uncertain impact. Should a lof gene possess only one such mutation, it is classified as having an uncertain impact at the gene level, regardless of whether the mutation exhibits a strong impact at the allele level. By differentiating the effects at both the allele and gene levels, we maintain the flexibility to determine in subsequent analyses how to consolidate these categories for further statistical evaluations.
The function returns a dictionary containing two keys: gof and lof. Associated with each key is a function that accepts a dictionary of counts as its input and outputs a tuple comprising two elements: the mutation status at the allele level and at the gene level. The input dictionary of counts is expected to have two keys, mutated and uncertain. The value for each key represents the number of variants within a gene classified as mutated or uncertain, respectively.
Returns:
dict
\u2013 the gof and lof allele level and gene level aggregation rules.
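A custom rule set with the shape described above (keys `gof` and `lof`, each a function from a counts dictionary to an allele-level and gene-level status) might look like the sketch below. It follows the heuristics in the text but is not the package's code; the status labels, in particular `not_mutated`, are assumptions.

```python
def _allele_status(counts):
    """Allele-level status derived from the counts dictionary."""
    if counts.get("mutated", 0) > 0:
        return "mutated"
    if counts.get("uncertain", 0) > 0:
        return "uncertain"
    return "not_mutated"  # label chosen for this sketch


def gof_rule(counts):
    # For gof genes the allele-level and gene-level statuses are identical.
    status = _allele_status(counts)
    return status, status


def lof_rule(counts):
    allele = _allele_status(counts)
    hits = counts.get("mutated", 0) + counts.get("uncertain", 0)
    # Two or more strong or uncertain mutations: gene considered disrupted.
    if hits >= 2:
        gene = "mutated"
    elif hits == 1:
        gene = "uncertain"
    else:
        gene = "not_mutated"
    return allele, gene


aggregation_rules = {"gof": gof_rule, "lof": lof_rule}
```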
Examples:
>>> import icaparser as icap\n>>> icap.get_default_mutation_aggregation_rules()\n
"},{"location":"reference/#icaparser.icaparser.get_default_mutation_classification_rules","title":"get_default_mutation_classification_rules(cosmic_threshold=10)
","text":"Returns the default rules for classifying mutations.
Defines the default rules for classifying mutations. The returned dictionary has keys \"gof\" and \"lof\", and the respective values are the rule sets for these gene types. Each rule set is a dictionary with the keys \"mutated\" and \"uncertain\". The values for \"mutated\" or \"uncertain\" are dictionaries with three filter functions, a \"position_filter\", a \"variant_filter\", and a \"transcript_filter\". For example, a transcript will be called \"mutated\" if all three filters for \"mutated\" return True, and it will be called \"uncertain\", if all three filter functions for \"uncertain\" return True.
These are the default rules returned by this function:
GOF
mutated: non-deleterious hotspot mutations.
cosmic_threshold
.uncertain: non-deleterious mutations that aren't hotspots.
cosmic_threshold
.LOF
mutated: deleterious mutations (such as truncations, start or stop codon loss).
uncertain: amino acid sequence modifying mutations that are not most likely deleterious. This includes missense mutations and in-frame insertions and deletions.
Parameters:
cosmic_threshold
(int
, default: 10
) \u2013 for \"gof\" genes, this is the \"hotspot threshold\" for Cosmic, i.e., the minimum number of samples in Cosmic having that mutation to consider a mutation a hot spot and, therefore, call the mutation \"mutated\". If the number of Cosmic samples is smaller, the mutation is called \"uncertain\".
Returns:
dict
\u2013 default mutation classification rules.
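The nested structure of a rule set (gene type, then status, then three filter functions) can be sketched with toy filters. The keys `position_filter`, `variant_filter`, and `transcript_filter` are taken from the description above; the `classify` helper, the toy rules, the `consequence` field, and checking "mutated" before "uncertain" are assumptions of this illustration.

```python
def classify(rule_set, gene_type, position, variant, transcript):
    """Return 'mutated', 'uncertain', or None for one transcript."""
    for status in ("mutated", "uncertain"):
        filters = rule_set[gene_type][status]
        # A status applies only if all three filters return True.
        if (
            filters["position_filter"](position)
            and filters["variant_filter"](variant)
            and filters["transcript_filter"](transcript)
        ):
            return status
    return None


def always(_):
    return True


# Toy rule set for lof genes only; not the package defaults.
toy_rules = {
    "lof": {
        "mutated": {
            "position_filter": always,
            "variant_filter": always,
            "transcript_filter": lambda t: "stop_gained" in t["consequence"],
        },
        "uncertain": {
            "position_filter": always,
            "variant_filter": always,
            "transcript_filter": lambda t: "missense_variant" in t["consequence"],
        },
    }
}
```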
Examples:
>>> import icaparser as icap\n>>> icap.get_default_mutation_classification_rules()\n>>> icap.get_default_mutation_classification_rules(cosmic_threshold=20)\n
"},{"location":"reference/#icaparser.icaparser.get_dna_json_files","title":"get_dna_json_files(base_dir, pattern='*MergedVariants_Annotated_filtered.json.gz')
","text":"Find DNA annotation JSON files in or below base_dir
.
Searches for ICA DNA annotation JSON files in and below base_dir
. All file names matching pattern
are returned.
Parameters:
base_dir
(str
) \u2013 base directory of directory subtree where to search for DNA annotation JSON files.
pattern
(str
, default: '*MergedVariants_Annotated_filtered.json.gz'
) \u2013 file names matching this pattern are returned.
Returns:
list
\u2013 file names.
get_gene_type(gene_symbol)
","text":"Get the gene type (oncogene, tsg, mixed) for a gene.
Parameters:
gene_symbol
(str
) \u2013 the gene symbol of the gene.
Returns:
str
\u2013 the gene type.
get_genes(file)
","text":"Extract gene annotation from an ICA JSON file.
The genes
section of ICA JSON files is optional. If this section is not included in the file, an empty list is returned.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
Returns:
list
\u2013 gene annotations.
get_gnomad_exome_max_af(variant, cohorts=['afr', 'amr', 'eas', 'nfe', 'sas'])
","text":"Get the maximum allele frequency for gnomAD Exome.
Get the maximum allele frequency across all major cohorts annotated by gnomAD Exome, excluding bottleneck populations (Ashkenazi Jews and Finnish) and other.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
cohorts
(list
, default: ['afr', 'amr', 'eas', 'nfe', 'sas']
) \u2013 subpopulations to include.
Returns:
float
\u2013 maximum GnomAD Exome allele frequency.
get_gnomad_max_af(variant, cohorts=['afr', 'amr', 'eas', 'nfe', 'sas'])
","text":"Get the maximum allele frequency for gnomAD.
Get the maximum allele frequency across all major cohorts annotated by gnomAD, excluding bottleneck populations (Ashkenazi Jews and Finnish) and other.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
cohorts
(list
, default: ['afr', 'amr', 'eas', 'nfe', 'sas']
) \u2013 subpopulations to include.
Returns:
float
\u2013 maximum GnomAD allele frequency.
get_header(file)
","text":"Extract the header element from an ICA JSON file.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
Returns:
dict
\u2013 header from the JSON file.
get_header_scalars(file)
","text":"Extract a table with all scalar attributes from the JSON header.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
Returns:
DataFrame
\u2013 table of scalar attributes and their values.
get_max_af(variant, source, cohorts=None)
","text":"Get the maximum allele frequency for a particular annotation source.
Get the maximum allele frequency across all cohorts annotated by the annotation source.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
source
(str
) \u2013 the annotation source to use, for example 'gnomad' or 'gnomadExome' or 'oneKg'.
cohorts
(list
, default: None
) \u2013 subpopulations to include; include all if None.
Returns:
float
\u2013 the maximum allele frequency.
Examples:
>>> import icaparser as icap\n>>> icap.get_max_af(variant, 'gnomad')\n
"},{"location":"reference/#icaparser.icaparser.get_multi_sample_positions","title":"get_multi_sample_positions(files, *args, **kwargs)
","text":"Extract all positions for a set of ICA JSON files.
The sample id is stored as an additional new attribute of the samples
element of a position. The samples
element is a list, although ICA usually only creates single-sample JSON files.
Parameters:
files
(list
) \u2013 names of the ICA JSON files.
args
(object
, default: ()
) \u2013 extra arguments forwarded to get_positions().
kwargs
(object
, default: {}
) \u2013 extra named arguments forwarded to get_positions().
Returns:
list
\u2013 filtered positions from all files.
Examples:
>>> import icaparser as icap\n>>> positions = icap.get_multi_sample_positions(json_files)\n>>> print(positions[0]['samples'][0]['sampleId'])\n
"},{"location":"reference/#icaparser.icaparser.get_mutation_table_for_files","title":"get_mutation_table_for_files(json_files, max_af=0.001, min_vep_consequence_priority=6, min_cosmic_sample_count=0, only_canonical=False, extra_variant_filters=[], extra_transcript_filters=[])
","text":"Get an annotated table of all filtered transcripts from a list of ICA JSON files.
Load all positions from a list of ICA JSON files and filter them. Positions having any remaining variants and transcripts passing the filter are returned as an annotated table.
Parameters:
json_files
(list
) \u2013 list of ICA JSON files
max_af
(float
, default: 0.001
) \u2013 maximum allele frequency for gnomAD, gnomAD Exome and 1000 Genomes. Only variants with maximum allele frequencies below this threshold will be returned.
min_vep_consequence_priority
(int
, default: 6
) \u2013 only transcripts with a minimum VEP consequence priority not larger than this threshold will be retained. Consequences with priorities <= 6 change the protein sequence, consequences with priorities > 6 do not change the protein sequence.
min_cosmic_sample_count
(int
, default: 0
) \u2013 only variants with a maximum COSMIC sample count not lower than this threshold will be retained
only_canonical
(bool
, default: False
) \u2013 if true, only canonical transcripts will be retained
extra_variant_filters
(list
, default: []
) \u2013 any additional filters to apply to variants. Filters shall return True to keep a variant.
extra_transcript_filters
(list
, default: []
) \u2013 any additional filters to apply to transcripts. Filters shall return True to keep a transcript.
Returns:
DataFrame
\u2013 table of annotated mutations and affected transcripts.
Examples:
>>> import icaparser as icap\n>>> extra_transcript_filters = [\n lambda x: x.get('source', '') == 'Ensembl',\n lambda x: x.get('hgnc', '') == 'KRAS'\n ]\n>>> mut_table = icap.get_mutation_table_for_files(\n json_files,\n extra_transcript_filters=extra_transcript_filters\n )\n
"},{"location":"reference/#icaparser.icaparser.get_mutation_table_for_position","title":"get_mutation_table_for_position(position)
","text":"Get an annotated table of all transcripts for a single position.
Returns an annotated table of all transcripts that are affected by a mutation at a position.
Parameters:
position
(dict
) \u2013 the position to investigate.
Returns:
DataFrame
\u2013 table of annotated mutations and affected transcripts.
get_mutation_table_for_positions(positions, hide_progress=False)
","text":"Get an annotated table of all transcripts for all positions.
Returns an annotated table of all transcripts that are affected by a mutation at any of the positions.
Parameters:
positions
(list
) \u2013 the positions to investigate.
Returns:
DataFrame
\u2013 table of annotated mutations and affected transcripts.
get_onekg_max_af(variant)
","text":"Get the maximum allele frequency for the 1000 Genomes Project.
Get the maximum allele frequencies across all cohorts annotated by the 1000 Genomes Project.
Parameters:
variant
(dict
) \u2013 the variant to investigate.
Returns:
float
\u2013 maximum 1000 Genomes allele frequency.
get_pipeline_metadata(files)
","text":"Extract a table with metadata about the annotation pipeline runs from the JSON headers.
Parameters:
files
(list
) \u2013 names of the ICA JSON files.
Returns:
DataFrame
\u2013 table with metadata of pipeline runs.
get_position_by_coordinates(positions, chromosome, position)
","text":"Extract a particular position from a position list.
Parameters:
positions
(list
) \u2013 list of input positions.
chromosome
(str
) \u2013 name of the chromosome.
position
(int
) \u2013 numeric position on the chromosome.
Returns:
dict
\u2013 the position for the specified chromosome and numeric position.
Examples:
>>> import icaparser as icap\n>>> icap.get_position_by_coordinates(positions, 'chr1', 204399064)\n
"},{"location":"reference/#icaparser.icaparser.get_positions","title":"get_positions(file, variant_filters=[], transcript_filters=[])
","text":"Extract all positions from an ICA JSON file.
The sample id is stored as an additional new attribute of the samples
element of a position. The samples
element is a list, although ICA usually only creates single-sample JSON files.
Parameters:
file
(str
) \u2013 name of the ICA JSON file
variant_filters
(list
, default: []
) \u2013 any filters to apply to variants. Filters shall return True to keep a variant.
transcript_filters
(list
, default: []
) \u2013 any filters to apply to transcripts. Filters shall return True to keep a transcript.
Returns:
list
\u2013 filtered positions from file.
Examples:
>>> import icaparser as icap\n>>> transcript_filters = [\n lambda x: x.get('source', '') == 'Ensembl',\n lambda x: x.get('hgnc', '') == 'KRAS'\n ]\n>>> positions = icap.get_positions(\n json_file,\n transcript_filters=transcript_filters\n )\n>>> print(positions[0]['samples'][0]['sampleId'])\n
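How a list of filter functions decides which items to keep can be sketched without the package. The helper `passes_all` is hypothetical, and requiring every filter to return True is an assumption about how the filters are combined.

```python
def passes_all(item: dict, filters: list) -> bool:
    """Keep an item only if every filter returns True (assumed AND semantics)."""
    return all(check(item) for check in filters)


# Made-up transcript records, reduced to the two fields used by the filters.
transcripts = [
    {"source": "Ensembl", "hgnc": "KRAS"},
    {"source": "RefSeq", "hgnc": "KRAS"},
    {"source": "Ensembl", "hgnc": "TP53"},
]

transcript_filters = [
    lambda x: x.get("source", "") == "Ensembl",
    lambda x: x.get("hgnc", "") == "KRAS",
]

kept = [t for t in transcripts if passes_all(t, transcript_filters)]
print(kept)  # [{'source': 'Ensembl', 'hgnc': 'KRAS'}]
```

Using `x.get(key, '')` rather than `x[key]` keeps a filter from raising a `KeyError` on records that lack the field.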
"},{"location":"reference/#icaparser.icaparser.get_sample","title":"get_sample(file, suffix='(-D[^.]*)?\\\\.bam')
","text":"Extract the sample name from an ICA JSON file.
Parameters:
file
(str
) \u2013 name of the ICA JSON file.
suffix
(str
, default: '(-D[^.]*)?\\\\.bam'
) \u2013 regular expression to remove from the sample name in the JSON file. Defaults to '(-D[^.]*)?\\\\.bam'.
Returns:
str
\u2013 name of the sample annotated in the JSON file.
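The effect of the default suffix pattern can be illustrated with the standard `re` module. This is a sketch only: the helper `clean_sample_name` and the sample names are hypothetical, and applying the pattern with `re.sub` is an assumption about how the package removes the suffix.

```python
import re

# The default pattern from the signature: an optional '-D...' part, then '.bam'.
DEFAULT_SUFFIX = r"(-D[^.]*)?\.bam"


def clean_sample_name(raw_name: str, suffix: str = DEFAULT_SUFFIX) -> str:
    """Remove every match of the suffix pattern from the raw sample name."""
    return re.sub(suffix, "", raw_name)


print(clean_sample_name("Sample1-D1.bam"))  # Sample1
print(clean_sample_name("Sample2.bam"))     # Sample2
```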
get_strongest_vep_consequence_name(transcript)
","text":"Get the name of the strongest VEP consequence for a transcript.
Parameters:
transcript
(dict
) \u2013 the transcript to investigate.
Returns:
str
\u2013 the consequence.
get_strongest_vep_consequence_priority(transcript)
","text":"Get the strongest priority of all VEP consequences for a transcript.
Get the strongest numeric priority of all VEP consequences for a transcript. Smaller numeric priorities mean stronger impact.
Parameters:
transcript
(dict
) \u2013 the transcript to investigate.
Returns:
int
\u2013 the strongest numeric priority, smaller values mean higher priority.
get_strongest_vep_consequence_rank(transcript)
","text":"Get the strongest rank of VEP consequences for a transcript.
Get the strongest numeric rank of all VEP consequences for a transcript. Smaller ranks mean stronger impact.
The priority of consequences is taken into account first. So if two consequences have different priorities, the consequence with the higher priority (lower priority number) will be used, and the rank for this consequence will be returned. If there are multiple consequences with the same priority, the lowest (strongest) rank will be returned.
For clarification: ranks are unique, i.e. all VEP consequences ordered as listed on the VEP documentation page get the row number of this table assigned as rank.
However, several consequences can have the same priority (e.g., stop gained and frameshift have the same priority). Priorities are copied from vcf2maf.pl of MSKCC.
Parameters:
transcript
(dict
) \u2013 the transcript to investigate.
Returns:
int
\u2013 the rank of the VEP consequence with strongest impact.
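The selection rule described above (priority first, then rank) can be sketched as a sort on a `(priority, rank)` pair. The table below is purely illustrative: the numbers are hypothetical and are not the real VEP ranks or vcf2maf priorities, and `strongest_rank_sketch` is not part of the package.

```python
# term -> (rank, priority); all numbers are made up for illustration.
CONSEQUENCE_TABLE = {
    "stop_gained": (5, 2),
    "frameshift_variant": (6, 2),   # same priority as stop_gained, weaker rank
    "missense_variant": (13, 6),
    "synonymous_variant": (21, 9),
}


def strongest_rank_sketch(terms: list) -> int:
    """Return the rank of the term with the lowest priority number.

    Ties in priority are broken by the lowest (strongest) rank.
    """
    def sort_key(term: str) -> tuple:
        rank, priority = CONSEQUENCE_TABLE[term]
        return (priority, rank)

    best_term = min(terms, key=sort_key)
    return CONSEQUENCE_TABLE[best_term][0]


print(strongest_rank_sketch(["missense_variant", "frameshift_variant", "stop_gained"]))  # 5
print(strongest_rank_sketch(["synonymous_variant", "missense_variant"]))                 # 13
```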
get_vep_consequence_for_rank(rank)
","text":"Get the VEP consequence term of a numeric rank.
Parameters:
rank
(int
) \u2013 the numeric rank of the consequence term.
Returns:
str
\u2013 the consequence.
get_vep_priority_for_consequence(consequence)
","text":"Get the numeric priority of a VEP consequence term.
The numeric priority of a consequence that is returned by this function is the same as defined by vcf2maf.pl of MSKCC.
Parameters:
consequence
(str
) \u2013 the consequence term of the variant.
Returns: the priority of the consequence, smaller values mean higher priority.
"},{"location":"reference/#icaparser.icaparser.get_vep_rank_for_consequence","title":"get_vep_rank_for_consequence(consequence)
","text":"Get the numeric rank of a VEP consequence term.
The numeric rank of a consequence is the position of the consequence in this list of consequences for the Variant Effect Predictor (VEP).
Parameters:
consequence
(str
) \u2013 the consequence term of the variant.
Returns:
int
\u2013 the rank of the consequence, smaller values mean higher rank.
split_multi_sample_json_file(json_file, output_dir)
","text":"Split a multi-sample JSON file into sample-specific JSON files.
This function reads a multi-sample JSON file that was generated by annotating a multi-sample VCF file with ICA and splits it into sample-specific JSON files.
Annotating many single-sample VCF files with ICA is time consuming because ICA reads all annotation sources for each VCF file, which dominates its runtime. It is therefore helpful to first merge many single-sample VCF files into one or a small number of multi-sample VCF files (for example, with bcftools merge
), then annotate the multi-sample VCF file with ICA, and finally split the multi-sample JSON output of ICA into single-sample JSON files. These single-sample JSON files are required for the rest of this package.
Parameters:
json_file
(str
) \u2013 the multi-sample JSON input file.
output_dir
(str
) \u2013 the directory in which to write the single-sample JSON files. The directory will be created if it does not exist.
Returns:
None
\u2013 None.
strip_json_file(ifname, ofname)
","text":"Reduce the JSON file size by keeping only 'PASS' variants.
JSON files from Illumina's ICA pipeline can be very large because they contain any deviation from the reference genome, irrespective of the quality of the mutation call. Gzip compressed JSON files with sizes in the gigabyte range cannot be processed by JSON packages that read the entire file into memory. It is necessary to first reduce the size of JSON files by removing all variants that do not meet Illumina's quality criteria.
This function reads a single JSON file and creates a single JSON output file by removing all variants that do not pass Illumina's quality criteria.
Parameters:
ifname
(str
) \u2013 name of the input file.
ofname
(str
) \u2013 name of the output file.
Returns:
None
\u2013 None.
strip_json_files(source_dir, target_dir, pattern='*.json.gz')
","text":"Strip all JSON files of a project by keeping only 'PASS' variants.
JSON files from Illumina's ICA pipeline can be very large because they contain any deviation from the reference genome, irrespective of the quality of the mutation call. Gzip compressed JSON files with sizes in the gigabyte range cannot be processed by JSON packages that read the entire file into memory. It is necessary to first reduce the size of JSON files by removing all variants that do not meet Illumina's quality criteria.
This function searches source_dir
recursively for all files matching the pattern
. Each of those files is processed and a stripped version keeping only variants that PASS Illumina's quality criteria is created. The output file has the same name as the input file. The directory structure below source_dir
is replicated in target_dir
. Output files get the suffix '_filtered.json.gz'.
Parameters:
source_dir
(str
) \u2013 directory where to search for input JSON files.
target_dir
(str
) \u2013 directory where to save the stripped output JSON files.
pattern
(str
, default: '*.json.gz'
) \u2013 files matching this pattern will be processed.
Returns:
None
\u2013 None.
Examples:
>>> import icaparser as icap\n>>> icap.strip_json_files('../Data/Original', '../Data/Derived')\n
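The mapping from input to output paths described above can be sketched with `pathlib`. The helper `plan_stripped_paths` is hypothetical and does no filtering. It only shows the recursive search and how the directory structure is replicated. Replacing the `.json.gz` ending with `_filtered.json.gz` is an assumption about how the output name is formed.

```python
import tempfile
from pathlib import Path


def plan_stripped_paths(source_dir, target_dir, pattern="*.json.gz"):
    """Pair every matching input file with its assumed output location."""
    source = Path(source_dir)
    target = Path(target_dir)
    pairs = []
    for in_path in sorted(source.rglob(pattern)):  # recursive search
        sub_dir = in_path.relative_to(source).parent  # replicate sub-directories
        # Assumption: the '.json.gz' ending becomes '_filtered.json.gz'.
        out_name = in_path.name.replace(".json.gz", "_filtered.json.gz")
        pairs.append((in_path, target / sub_dir / out_name))
    return pairs


# Small demonstration in a temporary directory with one made-up input file.
with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "Original"
    (source_dir / "run1").mkdir(parents=True)
    (source_dir / "run1" / "s1.json.gz").touch()
    pairs = plan_stripped_paths(source_dir, Path(tmp) / "Derived")
    relative_outputs = [out.relative_to(tmp).as_posix() for _, out in pairs]

print(relative_outputs)  # ['Derived/run1/s1_filtered.json.gz']
```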
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index cf7cbb8..b1a4ca1 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ