diff --git a/.gitattributes b/.gitattributes
old mode 100644
new mode 100755
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
old mode 100644
new mode 100755
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
old mode 100644
new mode 100755
diff --git a/.gitignore b/.gitignore
new file mode 100755
index 0000000..fa00a23
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,5 @@
+# Ignore filetypes
+*.pyc
+/python2env/
+/.ipynb_checkpoints/
+/.vscode/
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
old mode 100644
new mode 100755
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
old mode 100644
new mode 100755
diff --git a/ChangeLog.md b/ChangeLog.md
new file mode 100644
index 0000000..6caaed0
--- /dev/null
+++ b/ChangeLog.md
@@ -0,0 +1,24 @@
+# NEAT v3.0
+- NEAT gen_reads now runs in Python 3 exclusively. The previous, Python 2 version is stored in the repo as v2.0, but will not be undergoing active development.
+- Converted sequence objects to Biopython Sequence objects to take advantage of the Biopython library.
+- Converted cigar strings to lists. There are now simply functions that convert a generic cigar string to a list and vice versa.
+- Tried to take better advantage of some Biopython libraries, such as for parsing fastas.
+- For now, we've eliminated the "Jobs" option and the merge jobs function. We plan to implement multi-threading instead as a way to speed up NEAT's simulation process.
+- Added a basic bacterial wrapper that will simulate multiple generations of bacteria based on an input fasta, mutate them, and then produce the fastqs, bams, and vcfs for the resultant bacterial population.
+
+
+## TODOs for v3.1
+NEAT is still undergoing active development with many exciting upgrades planned. We also plan to bring the code up to full production scale and will continue to improve the following features (see [Contributing](CONTRIBUTING.md) if you would like to help):
+- Using Python's multithreading libraries, speed up NEAT's gen_reads tool significantly.
+- Take advantage of the pandas library for reading in bed files and other files.
+- Code optimization for all gen_reads files (in the source folder)
+- Further cleanup to PEP8 standards
+- Refactor the code to integrate NEAT's utilities into the package
+- Improvements and standardization for the utilities, across the board
+- VCF compare has some nice features and output, but is very slow to use. Can we improve this utility?
+
+We have many ideas for improvements aimed at better simulating bacteria, but we believe these same improvements will have applications in other species as well:
+- Multiploidy - admittedly, this has nothing to do with bacteria specifically, but it is a feature we would like to implement in gen_reads.
+- Structural Variants - model large-scale structural variants with an eye toward intergenic SVs.
+- Transposable Elements - model transposons within the sequence
+- Repeat regions - will bring a variety of interesting applications
diff --git a/LICENSE.md b/LICENSE.md
old mode 100644
new mode 100755
diff --git a/README.md b/README.md
old mode 100644
new mode 100755
index 985c4b7..535e5f3
--- a/README.md
+++ b/README.md
@@ -1,4 +1,8 @@
-# neat-genreads
+# The NEAT Project v3.0
+Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 3.0. NEAT has now been updated to Python 3 and is moving toward PEP8 standards. There is still lots of work to be done.
See the [ChangeLog](ChangeLog.md) for notes. + +Stay tuned over the coming weeks for exciting updates to NEAT, and learn how to [contribute](CONTRIBUTING.md) yourself. If you'd like to use some of our code, no problem! Just review the [license](LICENSE.md), first. + NEAT-genReads is a fine-grained read simulator. GenReads simulates real-looking data using models learned from specific datasets. There are several supporting utilities for generating models used for simulation. This is an in-progress v2.0 of the software. For a previous stable release please see: [genReads1](https://github.com/zstephens/genReads1) @@ -37,14 +41,20 @@ Table of Contents ## Requirements -* Python 2.7 -* Numpy 1.9.1+ +* Python >= 3.6 +* biopython >= 1.78 +* matplotlib >= 3.3.4 (optional, for plotting utilities) +* matplotlib_venn >= 0.11.6 (optional, for plotting utilities) +* pandas >= 1.2.1 +* numpy >= 1.19.5 +* pysam >= 0.16.0.1 + ## Usage Here's the simplest invocation of genReads using default parameters. This command produces a single ended fastq file with reads of length 101, ploidy 2, coverage 10X, using the default sequencing substitution, GC% bias, and mutation rate models. ``` -python genReads.py -r ref.fa -R 101 -o simulated_data +python gen_reads.py -r ref.fa -R 101 -o simulated_data ``` The most commonly added options are --pe, --bam, --vcf, and -c. @@ -59,30 +69,34 @@ Option | Description -c | Average coverage across the entire dataset. Default: 10 -e | Sequencing error model pickle file -E | Average sequencing error rate. The sequencing error rate model is rescaled to make this the average value. --p | ploidy [2] --t | bed file containing targeted regions; default coverage for targeted regions is 98% of -c option; default coverage outside targeted regions is 2% of -c option +-p | Sample Ploidy, default 2 +-tr | Bed file containing targeted regions; default coverage for targeted regions is 98% of -c option; default coverage outside targeted regions is 2% of -c option +-dr | Bed file with sample regions to discard. -to | off-target coverage scalar [0.02] -m | mutation model pickle file -M | Average mutation rate. The mutation rate model is rescaled to make this the average value. Must be between 0 and 0.3. These random mutations are inserted in addition to the once specified in the -v option. --s | input sample model +-Mb | Bed file containing positional mutation rates +-N | Below this quality score, base-call's will be replaced with N's -v | Input VCF file. Variants from this VCF will be inserted into the simulated sequence with 100% certainty. --pe | Paired-end fragment length mean and standard deviation. To produce paired end data, one of --pe or --pe-model must be specified. --pe-model | Empirical fragment length distribution. Can be generated using [computeFraglen.py](#computefraglenpy). To produce paired end data, one of --pe or --pe-model must be specified. --gc-model | Empirical GC coverage bias distribution. Can be generated using [computeGC.py](#computegcpy) ---job | Jobs IDs for generating reads in parallel ---nnr | save non-N ref regions (for parallel jobs) --bam | Output golden BAM file --vcf | Output golden VCF file +--fa | Output FASTA instead of FASTQ --rng | rng seed value; identical RNG value should produce identical runs of the program, so things like read locations, variant positions, error positions, etc, should all be the same. 
--gz | Gzip output FQ and VCF --no-fastq | Bypass generation of FASTQ read files +--discard-offtarget | Discard reads outside of targeted regions +--rescale-qual | Rescale Quality scores to match -E input +-d | Turn on debugging mode (useful for development) ## Functionality ![Diagram describing the way that genReads simulates datasets](docs/flow_new.png "Diagram describing the way that genReads simulates datasets") -NEAT genReads produces simulated sequencing datasets. It creates FASTQ files with reads sampled from a provided reference genome, using sequencing error rates and mutation rates learned from real sequencing data. The strength of genReads lies in the ability for the user to customize many sequencing parameters, produce 'golden', true positive datasets, and produce types of data that other simulators cannot (e.g. tumour/normal data). +NEAT gen_reads produces simulated sequencing datasets. It creates FASTQ files with reads sampled from a provided reference genome, using sequencing error rates and mutation rates learned from real sequencing data. The strength of genReads lies in the ability for the user to customize many sequencing parameters, produce 'golden', true positive datasets, and produce types of data that other simulators cannot (e.g. tumour/normal data). Features: @@ -111,7 +125,7 @@ The following commands are examples for common types of data to be generated. Th Simulate whole genome dataset with random variants inserted according to the default model. ``` -python genReads.py \ +python gen_reads.py \ -r hg19.fa \ -R 126 \ -o /home/me/simulated_reads \ @@ -124,7 +138,7 @@ python genReads.py \ Simulate a targeted region of a genome, e.g. exome, with -t. ``` -python genReads.py \ +python gen_reads.py \ -r hg19.fa \ -R 126 \ -o /home/me/simulated_reads \ @@ -138,7 +152,7 @@ python genReads.py \ Simulate a whole genome dataset with only the variants in the provided VCF file using -v and -M. ``` -python genReads.py \ +python gen_reads.py \ -r hg19.fa \ -R 126 \ -o /home/me/simulated_reads \ @@ -153,7 +167,7 @@ python genReads.py \ Simulate single-end reads by omitting the --pe option. ``` -python genReads.py \ +python gen_reads.py \ -r hg19.fa \ -R 126 \ -o /home/me/simulated_reads \ @@ -165,7 +179,7 @@ python genReads.py \ Simulate PacBio-like reads by providing an error model. ``` -python genReads.py \ +python gen_reads.py \ -r hg19.fa \ -R 5000 \ -e models/errorModel_pacbio_toy.p \ @@ -177,10 +191,10 @@ python genReads.py \ When possible, simulation can be done in parallel via multiple executions with different --job options. The resultant files will then need to be merged using utilities/mergeJobs.py. 
The following example shows splitting a simulation into 4 separate jobs (which can be run independently): ``` -python genReads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 1 4 -python genReads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 2 4 -python genReads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 3 4 -python genReads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 4 4 +python gen_reads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 1 4 +python gen_reads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 2 4 +python gen_reads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 3 4 +python gen_reads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 4 4 python mergeJobs.py -i /home/me/simulated_reads -o /home/me/simulated_reads_merged -s /path/to/samtools ``` @@ -190,9 +204,9 @@ In future revisions the dependence on SAMtools will be removed. To simulate human WGS 50X, try 50 chunks or less. # Utilities -Several scripts are distributed with genReads that are used to generate the models used for simulation. +Several scripts are distributed with gen_reads that are used to generate the models used for simulation. -## computeGC.py +## compute_gc.py Computes GC% coverage bias distribution from sample (bedrolls genomecov) data. Takes .genomecov files produced by BEDtools genomeCov (with -d option). @@ -212,7 +226,7 @@ python computeGC.py \ -o /path/to/model.p ``` -## computeFraglen.py +## compute_fraglen.py Computes empirical fragment length distribution from sample data. Takes SAM file via stdin: @@ -221,7 +235,7 @@ Takes SAM file via stdin: and creates fraglen.p model in working directory. -## genMutModel.py +## gen_mut_model.py Takes references genome and TSV file to generate mutation models: @@ -234,9 +248,20 @@ python genMutModel.py \ Trinucleotides are identified in the reference genome and the variant file. Frequencies of each trinucleotide transition are calculated and output as a pickle (.p) file. +Option | Description +------ |:---------- +-r | Reference file for organism in FASTA format. Required +-m | Mutation file for organism in VCF format. Required +-o | Path to output file and prefix. Required. +-b | BED file of regions to include +--save-trinuc | Save trinucleotide counts for reference +--human-sample | Use to skip unnumbered scaffolds in human references +--skip-common | Do not save common snps or high mutation areas + + ## genSeqErrorModel.py -Generates sequence error model for genReads.py -e option. +Generates sequence error model for gen_reads.py -e option. This script needs revision, to improve the quality-score model eventually, and to include code to learn sequencing errors from pileup data. ``` @@ -251,6 +276,7 @@ python genSeqErrorModel.py \ -s number of simulation iterations [1000000] \ --plot perform some optional plotting ``` + ## plotMutModel.py Performs plotting and comparison of mutation models generated from genMutModel.py. @@ -268,8 +294,6 @@ Tool for comparing VCF files. 
``` python vcf_compare_OLD.py - --version show program's version number and exit \ - -h, --help show this help message and exit \ -r * Reference Fasta \ -g * Golden VCF \ -w * Workflow VCF \ diff --git a/bacterial_genreads_wrapper.py b/bacterial_genreads_wrapper.py new file mode 100755 index 0000000..5d31aa2 --- /dev/null +++ b/bacterial_genreads_wrapper.py @@ -0,0 +1,260 @@ +#!/usr/bin/env source + +import gen_reads +import argparse +import random +import pathlib +import gzip +import shutil +import sys +import copy +from time import time +# from Bio import SeqIO + + +class Bacterium: + def __init__(self, reference: str, name: str, chrom_names: list): + """ + Class Bacterium for keeping track of all the elements of a bacterium for the purposes of this simulation + :param reference: The str representing the location of the reference file + :param name: The name of this particular bacterium. + :param chrom_names: The list of chromosome names from the progenitor bacterium + """ + self.reference = pathlib.Path(reference) + self.name = name + self.chroms = chrom_names + # Temporarily set the reference as the bacterium's file, until it is analyzed + self.file = pathlib.Path(reference) + self.analyze() + + def __repr__(self): + return str(self.name) + + def __str__(self): + return str(self.name) + + def get_file(self): + return self.file + + def get_chroms(self): + return self.chroms + + def analyze(self): + """ + This function is supposed to just run genreads for the bacterium, but doing so requires some file + manipulation to unzip the file and fix genreads horribly formatted fasta file. + :return: None + """ + args = ['-r', str(self.reference), '-R', '101', '-o', self.name, '--fa', '-c', '1'] + gen_reads.main(args) + self.file = pathlib.Path().absolute() / (self.name + ".fasta.gz") + + # The following workaround is due to the fact that genReads cannot handle gzipped + # fasta files, so we have to unzip it for it to actually work. + unzipped_path = pathlib.Path().absolute() / (self.name + ".fasta") + unzip_file(self.file, unzipped_path) + pathlib.Path.unlink(pathlib.Path().absolute() / (self.name + ".fasta.gz")) # deletes unused zip file + self.file = unzipped_path + # end workaround + + # Now we further have to fix the fasta file, which outputs in a form that doesn't make much sense, + # so that it can be properly analyzed in the next generation by genreads. + temp_name_list = copy.deepcopy(self.chroms) + temp_file = self.file.parents[0] / 'neat_temporary_fasta_file.fa' + temp_file.touch() + chromosome_name = "" + sequence = "" + with self.file.open() as f: + for line in f: + if line.startswith(">"): + for name in temp_name_list: + if name in line: + if chromosome_name != ">" + name + "\n": + if sequence: + temp_file.open('a').write(chromosome_name + sequence) + sequence = "" + chromosome_name = ">" + name + "\n" + temp_name_list.remove(name) + else: + continue + if not chromosome_name: + print("Something went wrong with the generated fasta file.\n") + sys.exit(1) + else: + sequence = sequence + line + temp_file.open('a').write(chromosome_name + sequence) + shutil.copy(temp_file, self.file) + pathlib.Path.unlink(temp_file) + + def sample(self, coverage_value: int, fragment_size: int, fragment_std: int): + """ + This function simple runs genreads on the file associated with this bacterium + :param coverage_value: What depth of coverage to sample the reads at. + :param fragment_size: The mean insert size for the resultant fastqs. 
+        :param fragment_std: The standard deviation of the insert size
+        :return: None
+        """
+        args = ['-r', str(self.file), '-M', '0', '-R', '101', '-o', self.name,
+                '-c', str(coverage_value), '--pe', str(fragment_size), str(fragment_std), '--vcf', '--bam']
+
+        gen_reads.main(args)
+
+    def remove(self):
+        """
+        This function simply deletes the file associated with this bacterium, or raises an error if there is a problem.
+        :return: None
+        """
+        try:
+            pathlib.Path.unlink(self.file)
+        except FileNotFoundError:
+            print('\nThere was a problem deleting a file\n')
+            raise
+
+
+def unzip_file(zipped_file: pathlib.Path, unzipped_file: pathlib.Path):
+    """
+    This unzips a gzipped file, then saves the unzipped file as a new file.
+    :param zipped_file: pathlib.Path object that points to the zipped file
+    :param unzipped_file: pathlib.Path object that points to the unzipped file
+    :return: None
+    """
+    with gzip.open(zipped_file, 'rb') as f_in:
+        with open(unzipped_file, 'wb') as f_out:
+            shutil.copyfileobj(f_in, f_out)
+
+
+def cull(population: list, percentage: float = 0.5) -> list:
+    """
+    The purpose of this function is to cull the bacteria created in the model.
+    :param percentage: percentage of the population to eliminate
+    :param population: the list of members to cull
+    :return: The list of remaining members
+    """
+    cull_amount = round(len(population) * percentage)
+    print("Culling {} members from population".format(cull_amount))
+    for _ in range(cull_amount):
+        selection = random.choice(population)
+        population.remove(selection)
+        selection.remove()
+    return population
+
+
+def initialize_population(reference: str, pop_size: int, chrom_names: list) -> list:
+    """
+    The purpose of this function is to create the initial population of bacteria. All bacteria are stored as
+    Bacterium objects.
+    :param chrom_names: A list of contigs from the original fasta
+    :param reference: string path to the reference fasta file
+    :param pop_size: size of the population to initialize.
+    :return population: returns a list of Bacterium objects.
+    """
+    names = []
+    for j in range(pop_size):
+        names.append("bacterium_0_{}".format(j+1))
+    population = []
+    for i in range(pop_size):
+        new_member = Bacterium(reference, names[i], chrom_names)
+        population.append(new_member)
+    return population
+
+
+def evolve_population(population: list, generation: int) -> list:
+    """
+    This evolves an existing population by doubling it (binary fission), then introducing random mutations to
+    each member of the population.
+    :param generation: Helps determine the starting point of the numbering system so the bacteria have unique names
+    :param population: A list of Bacterium objects representing the bacteria.
+    :return: The new, doubled population of Bacterium objects
+    """
+    children_population = population + population
+    names = []
+    new_population = []
+    for j in range(len(children_population)):
+        names.append("bacterium_{}_{}".format(generation, j+1))
+    for i in range(len(children_population)):
+        child = Bacterium(children_population[i].get_file(), names[i], children_population[i].get_chroms())
+        new_population.append(child)
+    return new_population
+
+
+def sample_population(population: list, target_coverage: int, fragment_size: int, fragment_std: int):
+    """
+    This will create a fastq based on each member of the population.
+    :param target_coverage: The target coverage value for the sample.
+ :param population: a list of bacteria + :return: None + """ + for bacterium in population: + bacterium.sample(target_coverage, fragment_size, fragment_std) + + +def extract_names(reference: str) -> list: + """ + This function attempts to extract the chromosome names from a fasta file + :param reference: The fasta file to analyze + :return: A list of chromosome names + """ + ref_names = [] + absolute_reference_path = pathlib.Path(reference) + if absolute_reference_path.suffix == '.gz': + with gzip.open(absolute_reference_path, 'rt') as ref: + for line in ref: + if line.startswith(">"): + ref_names.append(line[1:].rstrip()) + else: + with open(absolute_reference_path, 'r') as ref: + for line in ref: + if line.startswith(">"): + ref_names.append(line[1:].rstrip()) + if not ref_names: + print("Malformed fasta file. Missing properly formatted chromosome names.\n") + sys.exit(1) + + return ref_names + + +def main(): + parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter, + description="Wrapper for gen_reads.py that simulates multiple generations" + "of bacteria.") + parser.add_argument('-r', type=str, required=True, metavar='reference.fasta', + help="Reference file for organism in fasta format") + parser.add_argument('-g', type=int, required=True, metavar='generations', help="Number of generations to run") + parser.add_argument('-i', type=int, required=True, metavar='initial pop', help="Initial population size") + parser.add_argument('-k', type=float, required=False, metavar='cull pct', + help="Percentage of population to cull each cycle " + "(The default of 0.5 will keep population relatively stable)", + default=0.5) + parser.add_argument('-c', type=int, required=False, default=10, metavar='coverage value', + help='Target coverage value for final set of sampled fastqs') + parser.add_argument('--pe', nargs=2, type=int, required=False, metavar=('', ''), default=(500, 50), + help='Paired-end fragment length mean and std.') + args = parser.parse_args() + + (ref_fasta, init_population_size, generations) = (args.r, args.i, args.g) + cull_percentage = args.k + coverage = args.c + (fragment_size, fragment_std) = args.pe + + chrom_names = extract_names(ref_fasta) + + population = initialize_population(ref_fasta, init_population_size, chrom_names) + + for i in range(generations): + new_population = evolve_population(population, i+1) + + new_population = cull(new_population, cull_percentage) + + # If all elements get culled, then break the loop + if not new_population: + break + + population = new_population + + sample_population(population, coverage, fragment_size, fragment_std) + + +if __name__ == '__main__': + start_time = time() + main() + print(f'Finished bacterial wrapper in {time() - start_time} seconds.') \ No newline at end of file diff --git a/docs/PE_SE_reads.png b/docs/PE_SE_reads.png old mode 100644 new mode 100755 diff --git a/docs/flow_new.png b/docs/flow_new.png old mode 100644 new mode 100755 diff --git a/genReads.py b/genReads.py deleted file mode 100644 index 7b5d0e6..0000000 --- a/genReads.py +++ /dev/null @@ -1,827 +0,0 @@ -#!/usr/bin/env python -# encoding: utf-8 -""" //////////////////////////////////////////////////////////////////////////////// - /// /// - /// genReads.py /// - /// VERSION 2.0: HARDER, BETTER, FASTER, STRONGER! 
/// -/////// ////// - /// Variant and read simulator for benchmarking NGS workflows /// - /// /// - /// Written by: Zach Stephens /// -/////// For: DEPEND Research Group, UIUC /////// - /// Date: May 29, 2015 /// - /// Contact: zstephe2@illinois.edu /// - /// /// -/////////////////////////////////////////////////////////////////////////////// """ - -import os -import sys -import copy -import random -import re -import time -import bisect -import cPickle as pickle -import numpy as np -import argparse - -# absolute path to this script -SIM_PATH = '/'.join(os.path.realpath(__file__).split('/')[:-1]) -sys.path.append(SIM_PATH+'/py/') - -from inputChecking import requiredField, checkFileOpen, checkDir, isInRange -from refFunc import indexRef, readRef, getAllRefRegions, partitionRefRegions, ALLOWED_NUCL -from vcfFunc import parseVCF -from OutputFileWriter import OutputFileWriter, RC, sam_flag -from probability import DiscreteDistribution, mean_ind_of_weighted_list -from SequenceContainer import SequenceContainer, ReadContainer, parseInputMutationModel - -# if coverage val for a given window/position is below this value, consider it effectively zero. -LOW_COV_THRESH = 50 - -"""////////////////////////////////////////////////// -//////////// PARSE INPUT ARGUMENTS //////////// -//////////////////////////////////////////////////""" - - -parser = argparse.ArgumentParser(description='NEAT-genReads V2.0') -parser.add_argument('-r', type=str, required=True, metavar='', help="* ref.fa") -parser.add_argument('-R', type=int, required=True, metavar='', help="* read length") -parser.add_argument('-o', type=str, required=True, metavar='', help="* output prefix") -parser.add_argument('-c', type=float, required=False, metavar='', default=10., help="average coverage") -parser.add_argument('-e', type=str, required=False, metavar='', default=None, help="sequencing error model") -parser.add_argument('-E', type=float, required=False, metavar='', default=-1, help="rescale avg sequencing error rate to this") -parser.add_argument('-p', type=int, required=False, metavar='', default=2, help="ploidy") -parser.add_argument('-t', type=str, required=False, metavar='', default=None, help="targeted_regions.bed") -parser.add_argument('-d', type=str, required=False, metavar='', default=None, help="discard_regions.bed") -parser.add_argument('-to',type=float, required=False, metavar='', default=0.00, help="off-target coverage scalar") -parser.add_argument('-m', type=str, required=False, metavar='', default=None, help="mutation model pickle file") -parser.add_argument('-M', type=float, required=False, metavar='', default=-1, help="rescale avg mutation rate to this (1/bp)") -parser.add_argument('-Mb',type=str, required=False, metavar='', default=None, help="bed file containing positional mut rates") -parser.add_argument('-N', type=int, required=False, metavar='', default=-1, help="below this qual, replace base-calls with 'N's") -#parser.add_argument('-s', type=str, required=False, metavar='', default=None, help="input sample model") -parser.add_argument('-v', type=str, required=False, metavar='', default=None, help="input VCF file") - -parser.add_argument('--pe', nargs=2, type=int, required=False, metavar=('',''), default=(None,None), help='paired-end fragment length mean and std') -parser.add_argument('--pe-model', type=str, required=False, metavar='', default=None, help='empirical fragment length distribution') -#parser.add_argument('--cancer', required=False, action='store_true', default=False, help='produce tumor/normal 
datasets') -#parser.add_argument('-cm', type=str, required=False, metavar='', default=None, help="cancer mutation model directory") -#parser.add_argument('-cp', type=float, required=False, metavar='', default=0.8, help="tumor sample purity") -parser.add_argument('--gc-model', type=str, required=False, metavar='', default=None, help='empirical GC coverage bias distribution') -parser.add_argument('--job', nargs=2, type=int, required=False, metavar=('',''), default=(0,0), help='jobs IDs for generating reads in parallel') -parser.add_argument('--nnr', required=False, action='store_true', default=False, help='save non-N ref regions (for parallel jobs)') -parser.add_argument('--bam', required=False, action='store_true', default=False, help='output golden BAM file') -parser.add_argument('--vcf', required=False, action='store_true', default=False, help='output golden VCF file') -parser.add_argument('--fa', required=False, action='store_true', default=False, help='output FASTA instead of FASTQ') -parser.add_argument('--rng', type=int, required=False, metavar='', default=-1, help='rng seed value; identical RNG value should produce identical runs of the program, so things like read locations, variant positions, error positions, etc, should all be the same.') -parser.add_argument('--gz', required=False, action='store_true', default=False, help='gzip output FQ and VCF') -parser.add_argument('--no-fastq', required=False, action='store_true', default=False, help='bypass fastq generation') -parser.add_argument('--discard-offtarget', required=False, action='store_true', default=False, help='discard reads outside of targeted regions') -parser.add_argument('--force-coverage', required=False, action='store_true', default=False, help='[debug] ignore fancy models, force coverage to be constant') -parser.add_argument('--rescale-qual', required=False, action='store_true', default=False, help='rescale quality scores to match -E input') -args = parser.parse_args() - -# required args -(REFERENCE, READLEN, OUT_PREFIX) = (args.r, args.R, args.o) -# various dataset parameters -(COVERAGE, PLOIDS, INPUT_BED, DISCARD_BED, SE_MODEL, SE_RATE, MUT_MODEL, MUT_RATE, MUT_BED, INPUT_VCF) = (args.c, args.p, args.t, args.d, args.e, args.E, args.m, args.M, args.Mb, args.v) -# cancer params (disabled currently) -#(CANCER, CANCER_MODEL, CANCER_PURITY) = (args.cancer, args.cm, args.cp) -(CANCER, CANCER_MODEL, CANCER_PURITY) = (False, None, 0.8) -(OFFTARGET_SCALAR, OFFTARGET_DISCARD, FORCE_COVERAGE, RESCALE_QUAL) = (args.to, args.discard_offtarget, args.force_coverage, args.rescale_qual) -# important flags -(SAVE_BAM, SAVE_VCF, FASTA_INSTEAD, GZIPPED_OUT, NO_FASTQ) = (args.bam, args.vcf, args.fa, args.gz, args.no_fastq) - -ONLY_VCF = (NO_FASTQ and SAVE_VCF and not(SAVE_BAM)) -if ONLY_VCF: - print 'Only producing VCF output, that should speed things up a bit...' - -# sequencing model parameters -(FRAGMENT_SIZE, FRAGMENT_STD) = args.pe -FRAGLEN_MODEL = args.pe_model -GC_BIAS_MODEL = args.gc_model -N_MAX_QUAL = args.N - -# if user specified no fastq, no bam, no vcf, then inform them of their wasteful ways and exit -if NO_FASTQ == True and SAVE_BAM == False and SAVE_VCF == False: - print '\nError: No files will be written when --no-fastq is specified without --vcf or --bam.' - exit(1) - -# if user specified mean/std, use artificial fragment length distribution, otherwise use -# the empirical model specified. If neither, then we're doing single-end reads. 
-PAIRED_END = False -PAIRED_END_ARTIFICIAL = False -if FRAGMENT_SIZE != None and FRAGMENT_STD != None: - PAIRED_END = True - PAIRED_END_ARTIFICIAL = True -elif FRAGLEN_MODEL != None: - PAIRED_END = True - PAIRED_END_ARTIFICIAL = False - -(MYJOB, NJOBS) = args.job -if MYJOB == 0: - MYJOB = 1 - NJOBS = 1 -SAVE_NON_N = args.nnr - -RNG_SEED = args.rng -if RNG_SEED == -1: - RNG_SEED = random.randint(1,99999999) -random.seed(RNG_SEED) - - -"""************************************************ -**** INPUT ERROR CHECKING -************************************************""" - - -checkFileOpen(REFERENCE,'ERROR: could not open reference',required=True) -checkFileOpen(INPUT_VCF,'ERROR: could not open input VCF',required=False) -checkFileOpen(INPUT_BED,'ERROR: could not open input BED',required=False) -requiredField(OUT_PREFIX,'ERROR: no output prefix provided') -if (FRAGMENT_SIZE == None and FRAGMENT_STD != None) or (FRAGMENT_SIZE != None and FRAGMENT_STD == None): - print '\nError: --pe argument takes 2 space-separated arguments.\n' - exit(1) - - -"""************************************************ -**** LOAD INPUT MODELS -************************************************""" - - -# mutation models -# -MUT_MODEL = parseInputMutationModel(MUT_MODEL,1) -if CANCER: - CANCER_MODEL = parseInputMutationModel(CANCER_MODEL,2) -if MUT_RATE < 0.: - MUT_RATE = None - -# sequencing error model -# -if SE_RATE < 0.: - SE_RATE = None -if SE_MODEL == None: - print 'Using default sequencing error model.' - SE_MODEL = SIM_PATH+'/models/errorModel_toy.p' - SE_CLASS = ReadContainer(READLEN, SE_MODEL, SE_RATE, RESCALE_QUAL) -else: - # probably need to do some sanity checking - SE_CLASS = ReadContainer(READLEN, SE_MODEL, SE_RATE, RESCALE_QUAL) - -# GC-bias model -# -if GC_BIAS_MODEL == None: - print 'Using default gc-bias model.' - GC_BIAS_MODEL = SIM_PATH+'/models/gcBias_toy.p' - [GC_SCALE_COUNT, GC_SCALE_VAL] = pickle.load(open(GC_BIAS_MODEL,'rb')) - GC_WINDOW_SIZE = GC_SCALE_COUNT[-1] -else: - [GC_SCALE_COUNT, GC_SCALE_VAL] = pickle.load(open(GC_BIAS_MODEL,'rb')) - GC_WINDOW_SIZE = GC_SCALE_COUNT[-1] - -# fragment length distribution -# -if PAIRED_END and not(PAIRED_END_ARTIFICIAL): - print 'Using empirical fragment length distribution.' - [potential_vals, potential_prob] = pickle.load(open(FRAGLEN_MODEL,'rb')) - FRAGLEN_VALS = [] - FRAGLEN_PROB = [] - for i in xrange(len(potential_vals)): - if potential_vals[i] > READLEN: - FRAGLEN_VALS.append(potential_vals[i]) - FRAGLEN_PROB.append(potential_prob[i]) - # should probably add some validation and sanity-checking code here... - FRAGLEN_DISTRIBUTION = DiscreteDistribution(FRAGLEN_PROB,FRAGLEN_VALS) - FRAGMENT_SIZE = FRAGLEN_VALS[mean_ind_of_weighted_list(FRAGLEN_PROB)] - -# Indicate not writing FASTQ reads -# -if NO_FASTQ: - print 'Bypassing FASTQ generation...' - -"""************************************************ -**** HARD-CODED CONSTANTS -************************************************""" - - -# target window size for read sampling. how many times bigger than read/frag length -WINDOW_TARGET_SCALE = 100 -# sub-window size for read sampling windows. this is basically the finest resolution -# that can be obtained for targeted region boundaries and GC% bias -SMALL_WINDOW = 20 -# is the mutation model constant throughout the simulation? 
If so we can save a lot of time -CONSTANT_MUT_MODEL = True - - -"""************************************************ -**** DEFAULT MODELS -************************************************""" - - -# fragment length distribution: normal distribution that goes out to +- 6 standard deviations -if PAIRED_END and PAIRED_END_ARTIFICIAL: - print 'Using artificial fragment length distribution. mean='+str(FRAGMENT_SIZE)+', std='+str(FRAGMENT_STD) - if FRAGMENT_STD == 0: - FRAGLEN_DISTRIBUTION = DiscreteDistribution([1],[FRAGMENT_SIZE],degenerateVal=FRAGMENT_SIZE) - else: - potential_vals = range(FRAGMENT_SIZE-6*FRAGMENT_STD,FRAGMENT_SIZE+6*FRAGMENT_STD+1) - FRAGLEN_VALS = [] - for i in xrange(len(potential_vals)): - if potential_vals[i] > READLEN: - FRAGLEN_VALS.append(potential_vals[i]) - FRAGLEN_PROB = [np.exp(-(((n-float(FRAGMENT_SIZE))**2)/(2*(FRAGMENT_STD**2)))) for n in FRAGLEN_VALS] - FRAGLEN_DISTRIBUTION = DiscreteDistribution(FRAGLEN_PROB,FRAGLEN_VALS) - - -"""************************************************ -**** MORE INPUT ERROR CHECKING -************************************************""" - - -isInRange(READLEN, 10,1000000, 'Error: -R must be between 10 and 1,000,000') -isInRange(COVERAGE, 0,1000000, 'Error: -c must be between 0 and 1,000,000') -isInRange(PLOIDS, 1,100, 'Error: -p must be between 1 and 100') -isInRange(OFFTARGET_SCALAR, 0,1, 'Error: -to must be between 0 and 1') -if MUT_RATE != -1 and MUT_RATE != None: - isInRange(MUT_RATE, 0,0.3, 'Error: -M must be between 0 and 0.3') -if SE_RATE != -1 and SE_RATE != None: - isInRange(SE_RATE, 0,0.3, 'Error: -E must be between 0 and 0.3') -if NJOBS != 1: - isInRange(NJOBS, 1,1000, 'Error: --job must be between 1 and 1,000') - isInRange(MYJOB, 1,1000, 'Error: --job must be between 1 and 1,000') - isInRange(MYJOB, 1,NJOBS, 'Error: job id must be less than or equal to number of jobs') -if N_MAX_QUAL != -1: - isInRange(N_MAX_QUAL, 1,40, 'Error: -N must be between 1 and 40') - - -"""************************************************ -**** MAIN() -************************************************""" - - -def main(): - - # index reference - refIndex = indexRef(REFERENCE) - if PAIRED_END: - N_HANDLING = ('random',FRAGMENT_SIZE) - else: - N_HANDLING = ('ignore',READLEN) - indices_by_refName = {refIndex[n][0]:n for n in xrange(len(refIndex))} - - # parse input variants, if present - inputVariants = [] - if INPUT_VCF != None: - if CANCER: - (sampNames, inputVariants) = parseVCF(INPUT_VCF,tumorNormal=True,ploidy=PLOIDS) - tumorInd = sampNames.index('TUMOR') - normalInd = sampNames.index('NORMAL') - else: - (sampNames, inputVariants) = parseVCF(INPUT_VCF,ploidy=PLOIDS) - for k in sorted(inputVariants.keys()): - inputVariants[k].sort() - - # parse input targeted regions, if present - refList = [n[0] for n in refIndex] - inputRegions = {} - if INPUT_BED != None: - f = open(INPUT_BED,'r') - for line in f: - [myChr,pos1,pos2] = line.strip().split('\t')[:3] - if myChr not in inputRegions: - inputRegions[myChr] = [-1] - inputRegions[myChr].extend([int(pos1),int(pos2)]) - f.close() - # some validation - nInBedOnly = 0 - nInRefOnly = 0 - for k in refList: - if k not in inputRegions: - nInRefOnly += 1 - for k in inputRegions.keys(): - if not k in refList: - nInBedOnly += 1 - del inputRegions[k] - if nInRefOnly > 0: - print 'Warning: Reference contains sequences not found in targeted regions BED file.' - if nInBedOnly > 0: - print 'Warning: Targeted regions BED file contains sequence names not found in reference (regions ignored).' 
- # parse discard bed similarly - discardRegions = {} - if DISCARD_BED != None: - f = open(DISCARD_BED,'r') - for line in f: - [myChr,pos1,pos2] = line.strip().split('\t')[:3] - if myChr not in discardRegions: - discardRegions[myChr] = [-1] - discardRegions[myChr].extend([int(pos1),int(pos2)]) - f.close() - - # parse input mutation rate rescaling regions, if present - mutRateRegions = {} - mutRateValues = {} - if MUT_BED != None: - with open(MUT_BED,'r') as f: - for line in f: - [myChr,pos1,pos2,metaData] = line.strip().split('\t')[:4] - mutStr = re.findall(r"MUT_RATE=.*?(?=;)",metaData+';') - (pos1,pos2) = (int(pos1),int(pos2)) - if len(mutStr) and (pos2-pos1) > 1: - # mutRate = #_mutations / length_of_region, let's bound it by a reasonable amount - mutRate = max([0.0,min([float(mutStr[0][9:]),0.3])]) - if myChr not in mutRateRegions: - mutRateRegions[myChr] = [-1] - mutRateValues[myChr] = [0.0] - mutRateRegions[myChr].extend([pos1,pos2]) - mutRateValues.extend([mutRate*(pos2-pos1)]*2) - - # initialize output files (part I) - bamHeader = None - if SAVE_BAM: - bamHeader = [copy.deepcopy(refIndex)] - vcfHeader = None - if SAVE_VCF: - vcfHeader = [REFERENCE] - - # If processing jobs in parallel, precompute the independent regions that can be process separately - if NJOBS > 1: - parallelRegionList = getAllRefRegions(REFERENCE,refIndex,N_HANDLING,saveOutput=SAVE_NON_N) - (myRefs, myRegions) = partitionRefRegions(parallelRegionList,refIndex,MYJOB,NJOBS) - if not len(myRegions): - print 'This job id has no regions to process, exiting...' - exit(1) - for i in xrange(len(refIndex)-1,-1,-1): # delete reference not used in our job - if not refIndex[i][0] in myRefs: - del refIndex[i] - # if value of NJOBS is too high, let's change it to the maximum possible, to avoid output filename confusion - corrected_nJobs = min([NJOBS,sum([len(n) for n in parallelRegionList.values()])]) - else: - corrected_nJobs = 1 - - # initialize output files (part II) - if CANCER: - OFW = OutputFileWriter(OUT_PREFIX+'_normal',paired=PAIRED_END,BAM_header=bamHeader,VCF_header=vcfHeader,gzipped=GZIPPED_OUT,noFASTQ=NO_FASTQ,FASTA_instead=FASTA_INSTEAD) - OFW_CANCER = OutputFileWriter(OUT_PREFIX+'_tumor',paired=PAIRED_END,BAM_header=bamHeader,VCF_header=vcfHeader,gzipped=GZIPPED_OUT,jobTuple=(MYJOB,corrected_nJobs),noFASTQ=NO_FASTQ,FASTA_instead=FASTA_INSTEAD) - else: - OFW = OutputFileWriter(OUT_PREFIX,paired=PAIRED_END,BAM_header=bamHeader,VCF_header=vcfHeader,gzipped=GZIPPED_OUT,jobTuple=(MYJOB,corrected_nJobs),noFASTQ=NO_FASTQ,FASTA_instead=FASTA_INSTEAD) - OUT_PREFIX_NAME = OUT_PREFIX.split('/')[-1] - - - """************************************************ - **** LET'S GET THIS PARTY STARTED... 
- ************************************************""" - - - readNameCount = 1 # keep track of the number of reads we've sampled, for read-names - unmapped_records = [] - - for RI in xrange(len(refIndex)): - - # read in reference sequence and notate blocks of Ns - (refSequence,N_regions) = readRef(REFERENCE,refIndex[RI],N_HANDLING) - - # if we're processing jobs in parallel only take the regions relevant for the current job - if NJOBS > 1: - for i in xrange(len(N_regions['non_N'])-1,-1,-1): - if not (refIndex[RI][0],N_regions['non_N'][i][0],N_regions['non_N'][i][1]) in myRegions: - del N_regions['non_N'][i] - - # count total bp we'll be spanning so we can get an idea of how far along we are (for printing progress indicators) - total_bp_span = sum([n[1]-n[0] for n in N_regions['non_N']]) - currentProgress = 0 - currentPercent = 0 - havePrinted100 = False - - # prune invalid input variants, e.g variants that: - # - try to delete or alter any N characters - # - don't match the reference base at their specified position - # - any alt allele contains anything other than allowed characters - validVariants = [] - nSkipped = [0,0,0] - if refIndex[RI][0] in inputVariants: - for n in inputVariants[refIndex[RI][0]]: - span = (n[0],n[0]+len(n[1])) - rseq = str(refSequence[span[0]-1:span[1]-1]) # -1 because going from VCF coords to array coords - anyBadChr = any((nn not in ALLOWED_NUCL) for nn in [item for sublist in n[2] for item in sublist]) - if rseq != n[1]: - nSkipped[0] += 1 - continue - elif 'N' in rseq: - nSkipped[1] += 1 - continue - elif anyBadChr: - nSkipped[2] += 1 - continue - #if bisect.bisect(N_regions['big'],span[0])%2 or bisect.bisect(N_regions['big'],span[1])%2: - # continue - validVariants.append(n) - print 'found',len(validVariants),'valid variants for '+refIndex[RI][0]+' in input VCF...' - if any(nSkipped): - print sum(nSkipped),'variants skipped...' - print ' - ['+str(nSkipped[0])+'] ref allele does not match reference' - print ' - ['+str(nSkipped[1])+'] attempting to insert into N-region' - print ' - ['+str(nSkipped[2])+'] alt allele contains non-ACGT characters' - - - # add large random structural variants - # - # TBD!!! - - - # determine sampling windows based on read length, large N regions, and structural mutations. - # in order to obtain uniform coverage, windows should overlap by: - # - READLEN, if single-end reads - # - FRAGMENT_SIZE (mean), if paired-end reads - # ploidy is fixed per large sampling window, - # coverage distributions due to GC% and targeted regions are specified within these windows - samplingWindows = [] - ALL_VARIANTS_OUT = {} - sequences = None - if PAIRED_END: - targSize = WINDOW_TARGET_SCALE*FRAGMENT_SIZE - overlap = FRAGMENT_SIZE - overlap_minWindowSize = max(FRAGLEN_DISTRIBUTION.values) + 10 - else: - targSize = WINDOW_TARGET_SCALE*READLEN - overlap = READLEN - overlap_minWindowSize = READLEN + 10 - - print '--------------------------------' - if ONLY_VCF: - print 'generating vcf...' - else: - print 'sampling reads...' - tt = time.time() - - for i in xrange(len(N_regions['non_N'])): - (pi,pf) = N_regions['non_N'][i] - nTargWindows = max([1,(pf-pi)/targSize]) - bpd = int((pf-pi)/float(nTargWindows)) - #bpd += GC_WINDOW_SIZE - bpd%GC_WINDOW_SIZE - - #print len(refSequence), (pi,pf), nTargWindows - #print structuralVars - - # if for some reason our region is too small to process, skip it! (sorry) - if nTargWindows == 1 and (pf-pi) < overlap_minWindowSize: - #print 'Does this ever happen?' 
- continue - - start = pi - end = min([start+bpd,pf]) - #print '------------------RAWR:', (pi,pf), nTargWindows, bpd - varsFromPrevOverlap = [] - varsCancerFromPrevOverlap = [] - vindFromPrev = 0 - isLastTime = False - havePrinted100 = False - - while True: - - # which inserted variants are in this window? - varsInWindow = [] - updated = False - for j in xrange(vindFromPrev,len(validVariants)): - vPos = validVariants[j][0] - if vPos > start and vPos < end: # update: changed >= to >, so variant cannot be inserted in first position - varsInWindow.append(tuple([vPos-1]+list(validVariants[j][1:]))) # vcf --> array coords - if vPos >= end-overlap-1 and updated == False: - updated = True - vindFromPrev = j - if vPos >= end: - break - - # determine which structural variants will affect our sampling window positions - structuralVars = [] - for n in varsInWindow: - bufferNeeded = max([max([abs(len(n[1])-len(alt_allele)),1]) for alt_allele in n[2]]) # change: added abs() so that insertions are also buffered. - structuralVars.append((n[0]-1,bufferNeeded)) # -1 because going from VCF coords to array coords - - # adjust end-position of window based on inserted structural mutations - buffer_added = 0 - keepGoing = True - while keepGoing: - keepGoing = False - for n in structuralVars: - # adding "overlap" here to prevent SVs from being introduced in overlap regions - # (which can cause problems if random mutations from the previous window land on top of them) - delta = (end-1) - (n[0] + n[1]) - 2 - overlap - if delta < 0: - #print 'DELTA:', delta, 'END:', end, '-->', - buffer_added = -delta - end += buffer_added - ####print end - keepGoing = True - break - next_start = end-overlap - next_end = min([next_start+bpd,pf]) - if next_end-next_start < bpd: - end = next_end - isLastTime = True - - # print progress indicator - #print 'PROCESSING WINDOW:',(start,end), [buffer_added], 'next:', (next_start,next_end), 'isLastTime:', isLastTime - currentProgress += end-start - newPercent = int((currentProgress*100)/float(total_bp_span)) - if newPercent > currentPercent: - if newPercent <= 99 or (newPercent == 100 and not havePrinted100): - sys.stdout.write(str(newPercent)+'% ') - sys.stdout.flush() - currentPercent = newPercent - if currentPercent == 100: - havePrinted100 = True - - skip_this_window = False - - # compute coverage modifiers - coverage_avg = None - coverage_dat = [GC_WINDOW_SIZE,GC_SCALE_VAL,[]] - target_hits = 0 - if INPUT_BED == None: - coverage_dat[2] = [1.0]*(end-start) - else: - if refIndex[RI][0] not in inputRegions: - coverage_dat[2] = [OFFTARGET_SCALAR]*(end-start) - else: - for j in xrange(start,end): - if not(bisect.bisect(inputRegions[refIndex[RI][0]],j)%2): - coverage_dat[2].append(1.0) - target_hits += 1 - else: - coverage_dat[2].append(OFFTARGET_SCALAR) - - # offtarget and we're not interested? 
- if OFFTARGET_DISCARD and target_hits <= READLEN: - coverage_avg = 0.0 - skip_this_window = True - - #print len(coverage_dat[2]), sum(coverage_dat[2]) - if sum(coverage_dat[2]) < LOW_COV_THRESH: - coverage_avg = 0.0 - skip_this_window = True - - # check for small window sizes - if (end-start) < overlap_minWindowSize: - skip_this_window = True - - if skip_this_window: - # skip window, save cpu time - start = next_start - end = next_end - if isLastTime: - break - if end >= pf: - isLastTime = True - varsFromPrevOverlap = [] - continue - - # construct sequence data that we will sample reads from - if sequences == None: - sequences = SequenceContainer(start,refSequence[start:end],PLOIDS,overlap,READLEN,[MUT_MODEL]*PLOIDS,MUT_RATE,onlyVCF=ONLY_VCF) - else: - sequences.update(start,refSequence[start:end],PLOIDS,overlap,READLEN,[MUT_MODEL]*PLOIDS,MUT_RATE) - - # insert variants - sequences.insert_mutations(varsFromPrevOverlap + varsInWindow) - all_inserted_variants = sequences.random_mutations() - #print all_inserted_variants - - # init coverage - if sum(coverage_dat[2]) >= LOW_COV_THRESH: - if PAIRED_END: - coverage_avg = sequences.init_coverage(tuple(coverage_dat),fragDist=FRAGLEN_DISTRIBUTION) - else: - coverage_avg = sequences.init_coverage(tuple(coverage_dat)) - - # unused cancer stuff - if CANCER: - tumor_sequences = SequenceContainer(start,refSequence[start:end],PLOIDS,overlap,READLEN,[CANCER_MODEL]*PLOIDS,MUT_RATE,coverage_dat) - tumor_sequences.insert_mutations(varsCancerFromPrevOverlap + all_inserted_variants) - all_cancer_variants = tumor_sequences.random_mutations() - - # which variants do we need to keep for next time (because of window overlap)? - varsFromPrevOverlap = [] - varsCancerFromPrevOverlap = [] - for n in all_inserted_variants: - if n[0] >= end-overlap-1: - varsFromPrevOverlap.append(n) - if CANCER: - for n in all_cancer_variants: - if n[0] >= end-overlap-1: - varsCancerFromPrevOverlap.append(n) - - # if we're only producing VCF, no need to go through the hassle of generating reads - if ONLY_VCF: - pass - else: - windowSpan = end-start - if PAIRED_END: - if FORCE_COVERAGE: - readsToSample = int((windowSpan*float(COVERAGE))/(2*READLEN))+1 - else: - readsToSample = int((windowSpan*float(COVERAGE)*coverage_avg)/(2*READLEN))+1 - else: - if FORCE_COVERAGE: - readsToSample = int((windowSpan*float(COVERAGE))/READLEN)+1 - else: - readsToSample = int((windowSpan*float(COVERAGE)*coverage_avg)/READLEN)+1 - - # if coverage is so low such that no reads are to be sampled, skip region - # (i.e., remove buffer of +1 reads we add to every window) - if readsToSample == 1 and sum(coverage_dat[2]) < LOW_COV_THRESH: - readsToSample = 0 - - # sample reads - ASDF2_TT = time.time() - for i in xrange(readsToSample): - - isUnmapped = [] - if PAIRED_END: - myFraglen = FRAGLEN_DISTRIBUTION.sample() - myReadData = sequences.sample_read(SE_CLASS,myFraglen) - if myReadData == None: # skip if we failed to find a valid position to sample read - continue - if myReadData[0][0] == None: - isUnmapped.append(True) - else: - isUnmapped.append(False) - myReadData[0][0] += start # adjust mapping position based on window start - if myReadData[1][0] == None: - isUnmapped.append(True) - else: - isUnmapped.append(False) - myReadData[1][0] += start - else: - myReadData = sequences.sample_read(SE_CLASS) - if myReadData == None: # skip if we failed to find a valid position to sample read - continue - if myReadData[0][0] == None: # unmapped read (lives in large insertion) - isUnmapped = [True] - else: - isUnmapped = 
[False] - myReadData[0][0] += start # adjust mapping position based on window start - - # are we discarding offtargets? - outside_boundaries = [] - if OFFTARGET_DISCARD and INPUT_BED != None: - outside_boundaries += [bisect.bisect(inputRegions[refIndex[RI][0]],n[0])%2 for n in myReadData] - outside_boundaries += [bisect.bisect(inputRegions[refIndex[RI][0]],n[0]+len(n[2]))%2 for n in myReadData] - if DISCARD_BED != None: - outside_boundaries += [bisect.bisect(discardRegions[refIndex[RI][0]],n[0])%2 for n in myReadData] - outside_boundaries += [bisect.bisect(discardRegions[refIndex[RI][0]],n[0]+len(n[2]))%2 for n in myReadData] - if len(outside_boundaries) and any(outside_boundaries): - continue - - if NJOBS > 1: - myReadName = OUT_PREFIX_NAME+'-j'+str(MYJOB)+'-'+refIndex[RI][0]+'-r'+str(readNameCount) - else: - myReadName = OUT_PREFIX_NAME+'-'+refIndex[RI][0]+'-'+str(readNameCount) - readNameCount += len(myReadData) - - # if desired, replace all low-quality bases with Ns - if N_MAX_QUAL > -1: - for j in xrange(len(myReadData)): - myReadString = [n for n in myReadData[j][2]] - for k in xrange(len(myReadData[j][3])): - adjusted_qual = ord(myReadData[j][3][k])-SE_CLASS.offQ - if adjusted_qual <= N_MAX_QUAL: - myReadString[k] = 'N' - myReadData[j][2] = ''.join(myReadString) - - # flip a coin, are we forward or reverse strand? - isForward = (random.random() < 0.5) - - # if read (or read + mate for PE) are unmapped, put them at end of bam file - if all(isUnmapped): - if PAIRED_END: - if isForward: - flag1 = sam_flag(['paired','unmapped','mate_unmapped','first','mate_reverse']) - flag2 = sam_flag(['paired','unmapped','mate_unmapped','second','reverse']) - else: - flag1 = sam_flag(['paired','unmapped','mate_unmapped','second','mate_reverse']) - flag2 = sam_flag(['paired','unmapped','mate_unmapped','first','reverse']) - unmapped_records.append((myReadName+'/1',myReadData[0],flag1)) - unmapped_records.append((myReadName+'/2',myReadData[1],flag2)) - else: - flag1 = sam_flag(['unmapped']) - unmapped_records.append((myReadName+'/1',myReadData[0],flag1)) - - myRefIndex = indices_by_refName[refIndex[RI][0]] - - # - # write SE output - # - if len(myReadData) == 1: - if NO_FASTQ != True: - if isForward: - OFW.writeFASTQRecord(myReadName,myReadData[0][2],myReadData[0][3]) - else: - OFW.writeFASTQRecord(myReadName,RC(myReadData[0][2]),myReadData[0][3][::-1]) - if SAVE_BAM: - if isUnmapped[0] == False: - if isForward: - flag1 = 0 - OFW.writeBAMRecord(myRefIndex, myReadName, myReadData[0][0], myReadData[0][1], myReadData[0][2], myReadData[0][3], samFlag=flag1) - else: - flag1 = sam_flag(['reverse']) - OFW.writeBAMRecord(myRefIndex, myReadName, myReadData[0][0], myReadData[0][1], myReadData[0][2], myReadData[0][3], samFlag=flag1) - # - # write PE output - # - elif len(myReadData) == 2: - if NO_FASTQ != True: - OFW.writeFASTQRecord(myReadName,myReadData[0][2],myReadData[0][3],read2=myReadData[1][2],qual2=myReadData[1][3],orientation=isForward) - if SAVE_BAM: - if isUnmapped[0] == False and isUnmapped[1] == False: - if isForward: - flag1 = sam_flag(['paired','proper','first','mate_reverse']) - flag2 = sam_flag(['paired','proper','second','reverse']) - else: - flag1 = sam_flag(['paired','proper','second','mate_reverse']) - flag2 = sam_flag(['paired','proper','first','reverse']) - OFW.writeBAMRecord(myRefIndex, myReadName, myReadData[0][0], myReadData[0][1], myReadData[0][2], myReadData[0][3], samFlag=flag1, matePos=myReadData[1][0]) - OFW.writeBAMRecord(myRefIndex, myReadName, myReadData[1][0], myReadData[1][1], 
myReadData[1][2], myReadData[1][3], samFlag=flag2, matePos=myReadData[0][0]) - elif isUnmapped[0] == False and isUnmapped[1] == True: - if isForward: - flag1 = sam_flag(['paired','first', 'mate_unmapped', 'mate_reverse']) - flag2 = sam_flag(['paired','second', 'unmapped', 'reverse']) - else: - flag1 = sam_flag(['paired','second', 'mate_unmapped', 'mate_reverse']) - flag2 = sam_flag(['paired','first', 'unmapped', 'reverse']) - OFW.writeBAMRecord(myRefIndex, myReadName, myReadData[0][0], myReadData[0][1], myReadData[0][2], myReadData[0][3], samFlag=flag1, matePos=myReadData[0][0]) - OFW.writeBAMRecord(myRefIndex, myReadName, myReadData[0][0], myReadData[1][1], myReadData[1][2], myReadData[1][3], samFlag=flag2, matePos=myReadData[0][0], alnMapQual=0) - elif isUnmapped[0] == True and isUnmapped[1] == False: - if isForward: - flag1 = sam_flag(['paired','first', 'unmapped', 'mate_reverse']) - flag2 = sam_flag(['paired','second', 'mate_unmapped', 'reverse']) - else: - flag1 = sam_flag(['paired','second', 'unmapped', 'mate_reverse']) - flag2 = sam_flag(['paired','first', 'mate_unmapped', 'reverse']) - OFW.writeBAMRecord(myRefIndex, myReadName, myReadData[1][0], myReadData[0][1], myReadData[0][2], myReadData[0][3], samFlag=flag1, matePos=myReadData[1][0], alnMapQual=0) - OFW.writeBAMRecord(myRefIndex, myReadName, myReadData[1][0], myReadData[1][1], myReadData[1][2], myReadData[1][3], samFlag=flag2, matePos=myReadData[1][0]) - else: - print '\nError: Unexpected number of reads generated...\n' - exit(1) - #print 'READS:',time.time()-ASDF2_TT - - if not isLastTime: - OFW.flushBuffers(bamMax=next_start) - else: - OFW.flushBuffers(bamMax=end+1) - - # tally up all the variants that got successfully introduced - for n in all_inserted_variants: - ALL_VARIANTS_OUT[n] = True - - # prepare indices of next window - start = next_start - end = next_end - if isLastTime: - break - if end >= pf: - isLastTime = True - - if currentPercent != 100 and not havePrinted100: - print '100%' - else: - print '' - if ONLY_VCF: - print 'VCF generation completed in', - else: - print 'Read sampling completed in', - print int(time.time()-tt),'(sec)' - - # write all output variants for this reference - if SAVE_VCF: - print 'Writing output VCF...' - for k in sorted(ALL_VARIANTS_OUT.keys()): - currentRef = refIndex[RI][0] - myID = '.' - myQual = '.' - myFilt = 'PASS' - # k[0] + 1 because we're going back to 1-based vcf coords - OFW.writeVCFRecord(currentRef, str(int(k[0])+1), myID, k[1], k[2], myQual, myFilt, k[4]) - - #break - - # write unmapped reads to bam file - if SAVE_BAM and len(unmapped_records): - print 'writing unmapped reads to bam file...' - for umr in unmapped_records: - if PAIRED_END: - OFW.writeBAMRecord(-1, umr[0], 0, umr[1][1], umr[1][2], umr[1][3], samFlag=umr[2], matePos=0, alnMapQual=0) - else: - OFW.writeBAMRecord(-1, umr[0], 0, umr[1][1], umr[1][2], umr[1][3], samFlag=umr[2], alnMapQual=0) - - # close output files - OFW.closeFiles() - if CANCER: - OFW_CANCER.closeFiles() - - -if __name__ == '__main__': - main() - - - diff --git a/gen_reads.py b/gen_reads.py new file mode 100755 index 0000000..1fb5649 --- /dev/null +++ b/gen_reads.py @@ -0,0 +1,902 @@ +#!/usr/bin/env source +# encoding: utf-8 +""" //////////////////////////////////////////////////////////////////////////////// + /// /// + /// gen_reads.source /// + /// VERSION 2.0: HARDER, BETTER, FASTER, STRONGER! 
/// +/////// ////// + /// Variant and read simulator for benchmarking NGS workflows /// + /// /// + /// Written by: Zach Stephens /// +/////// For: DEPEND Research Group, UIUC /////// + /// Date: May 29, 2015 /// + /// Contact: zstephe2@illinois.edu /// + /// /// +/////////////////////////////////////////////////////////////////////////////// """ + +import sys +import copy +import random +import re +import time +import bisect +import pickle +import numpy as np +import argparse +import pathlib +# from Bio import SeqIO + +from source.input_checking import check_file_open, is_in_range +from source.ref_func import index_ref, read_ref +from source.vcf_func import parse_vcf +from source.output_file_writer import OutputFileWriter, reverse_complement, sam_flag +from source.probability import DiscreteDistribution, mean_ind_of_weighted_list +from source.SequenceContainer import SequenceContainer, ReadContainer, parse_input_mutation_model + +""" +Some constants needed for analysis +""" + +# target window size for read sampling. How many times bigger than read/frag length +WINDOW_TARGET_SCALE = 100 + +# allowed nucleotides +ALLOWED_NUCL = ['A', 'C', 'G', 'T'] + + +def main(raw_args=None): + """////////////////////////////////////////////////// + //////////// PARSE INPUT ARGUMENTS //////////// + //////////////////////////////////////////////////""" + + parser = argparse.ArgumentParser(description='NEAT-genReads V3.0', + formatter_class=argparse.ArgumentDefaultsHelpFormatter,) + parser.add_argument('-r', type=str, required=True, metavar='reference', help="Path to reference fasta") + parser.add_argument('-R', type=int, required=True, metavar='read length', help="The desired read length") + parser.add_argument('-o', type=str, required=True, metavar='output_prefix', + help="Prefix for the output files (can be a path)") + parser.add_argument('-c', type=float, required=False, metavar='coverage', default=10.0, + help="Average coverage, default is 10.0") + parser.add_argument('-e', type=str, required=False, metavar='error_model', default=None, + help="Location of the file for the sequencing error model (omit to use the default)") + parser.add_argument('-E', type=float, required=False, metavar='Error rate', default=-1, + help="Rescale avg sequencing error rate to this, must be between 0.0 and 0.3") + parser.add_argument('-p', type=int, required=False, metavar='ploidy', default=2, + help="Desired ploidy, default = 2") + parser.add_argument('-tr', type=str, required=False, metavar='target.bed', default=None, + help="Bed file containing targeted regions") + parser.add_argument('-dr', type=str, required=False, metavar='discard_regions.bed', default=None, + help="Bed file with regions to discard") + parser.add_argument('-to', type=float, required=False, metavar='off-target coverage scalar', default=0.00, + help="off-target coverage scalar") + parser.add_argument('-m', type=str, required=False, metavar='model.p', default=None, + help="Mutation model pickle file") + parser.add_argument('-M', type=float, required=False, metavar='avg mut rate', default=-1, + help="Rescale avg mutation rate to this (1/bp), must be between 0 and 0.3") + parser.add_argument('-Mb', type=str, required=False, metavar='mut_rates.bed', default=None, + help="Bed file containing positional mut rates") + parser.add_argument('-N', type=int, required=False, metavar='min qual score', default=-1, + help="below this quality score, replace base-calls with N's") + parser.add_argument('-v', type=str, required=False, metavar='vcf.file', default=None, + 
help="Input VCF file of variants to include") + parser.add_argument('--pe', nargs=2, type=int, required=False, metavar=('', ''), default=(None, None), + help='Paired-end fragment length mean and std') + parser.add_argument('--pe-model', type=str, required=False, metavar='', default=None, + help='empirical fragment length distribution') + parser.add_argument('--gc-model', type=str, required=False, metavar='', default=None, + help='empirical GC coverage bias distribution') + parser.add_argument('--bam', required=False, action='store_true', default=False, help='output golden BAM file') + parser.add_argument('--vcf', required=False, action='store_true', default=False, help='output golden VCF file') + parser.add_argument('--fa', required=False, action='store_true', default=False, + help='output FASTA instead of FASTQ') + parser.add_argument('--rng', type=int, required=False, metavar='', default=-1, + help='rng seed value; identical RNG value should produce identical runs of the program, so ' + 'things like read locations, variant positions, error positions, etc, ' + 'should all be the same.') + parser.add_argument('--no-fastq', required=False, action='store_true', default=False, + help='bypass fastq generation') + parser.add_argument('--discard-offtarget', required=False, action='store_true', default=False, + help='discard reads outside of targeted regions') + parser.add_argument('--force-coverage', required=False, action='store_true', default=False, + help='[debug] ignore fancy models, force coverage to be constant') + parser.add_argument('--rescale-qual', required=False, action='store_true', default=False, + help='Rescale quality scores to match -E input') + # TODO implement a broader debugging scheme for subclasses. + parser.add_argument('-d', required=False, action='store_true', default=False, help='Activate Debug Mode') + args = parser.parse_args(raw_args) + + """ + Set variables for processing + """ + + # absolute path to this script + sim_path = pathlib.Path(__file__).resolve().parent + + # if coverage val for a given window/position is below this value, consider it effectively zero. 
+ low_cov_thresh = 50 + + # required args + (reference, read_len, out_prefix) = (args.r, args.R, args.o) + # various dataset parameters + (coverage, ploids, input_bed, discard_bed, se_model, se_rate, mut_model, mut_rate, mut_bed, input_vcf) = \ + (args.c, args.p, args.tr, args.dr, args.e, args.E, args.m, args.M, args.Mb, args.v) + # cancer params (disabled currently) + # (cancer, cancer_model, cancer_purity) = (args.cancer, args.cm, args.cp) + (cancer, cancer_model, cancer_purity) = (False, None, 0.8) + (off_target_scalar, off_target_discard, force_coverage, rescale_qual) = (args.to, + args.discard_offtarget, + args.force_coverage, args.rescale_qual) + # important flags + (save_bam, save_vcf, fasta_instead, no_fastq) = \ + (args.bam, args.vcf, args.fa, args.no_fastq) + + # sequencing model parameters + (fragment_size, fragment_std) = args.pe + (fraglen_model, gc_bias_model) = args.pe_model, args.gc_model + n_max_qual = args.N + + rng_seed = args.rng + + debug = args.d + + """ + INPUT ERROR CHECKING + """ + + # Check that files are real, if provided + check_file_open(reference, 'ERROR: could not open reference, {}'.format(reference), required=True) + check_file_open(input_vcf, 'ERROR: could not open input VCF, {}'.format(input_vcf), required=False) + check_file_open(input_bed, 'ERROR: could not open input BED, {}'.format(input_bed), required=False) + + # if user specified no fastq, not fasta only, and no bam and no vcf, then print error and exit. + if no_fastq and not fasta_instead and not save_bam and not save_vcf: + print('\nERROR: No files would be written.\n') + sys.exit(1) + + if no_fastq: + print('Bypassing FASTQ generation...') + + only_vcf = no_fastq and save_vcf and not save_bam and not fasta_instead + if only_vcf: + print('Only producing VCF output...') + + if (fragment_size is None and fragment_std is not None) or (fragment_size is not None and fragment_std is None): + print('\nERROR: --pe argument takes 2 space-separated arguments.\n') + sys.exit(1) + + # If user specified mean/std, or specified an empirical model, then the reads will be paired_ended + # If not, then we're doing single-end reads. 
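The paired-end decision described in the comment above can be read two ways because `and` binds tighter than `or`. A minimal sketch of that decision with explicit grouping, assuming the FASTA-output check is meant to apply to both ways of requesting paired-end data (variable names follow the surrounding gen_reads.py code; this is an illustrative reading, not part of the patch):
```
def is_paired_end(fragment_size, fragment_std, fraglen_model, fasta_instead):
    # True when the user gave a mean/std pair or an empirical fragment-length
    # model, and we are not writing FASTA output (assumed intent, not confirmed
    # by this patch).
    requested_pe = (fragment_size is not None and fragment_std is not None) \
                   or (fraglen_model is not None)
    return requested_pe and not fasta_instead

# e.g. is_paired_end(300, 30, None, False) -> True
```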
+ if (fragment_size is not None and fragment_std is not None) or (fraglen_model is not None) and not fasta_instead: + paired_end = True + else: + paired_end = False + + if rng_seed == -1: + rng_seed = random.randint(1, 99999999) + random.seed(rng_seed) + + is_in_range(read_len, 10, 1000000, 'Error: -R must be between 10 and 1,000,000') + is_in_range(coverage, 0, 1000000, 'Error: -c must be between 0 and 1,000,000') + is_in_range(ploids, 1, 100, 'Error: -p must be between 1 and 100') + is_in_range(off_target_scalar, 0, 1, 'Error: -to must be between 0 and 1') + + if se_rate != -1: + is_in_range(se_rate, 0, 0.3, 'Error: -E must be between 0 and 0.3') + else: + se_rate = None + + if n_max_qual != -1: + is_in_range(n_max_qual, 1, 40, 'Error: -N must be between 1 and 40') + + """ + LOAD INPUT MODELS + """ + + # mutation models + mut_model = parse_input_mutation_model(mut_model, 1) + if cancer: + cancer_model = parse_input_mutation_model(cancer_model, 2) + if mut_rate < 0.: + mut_rate = None + + if mut_rate != -1 and mut_rate is not None: + is_in_range(mut_rate, 0.0, 1.0, 'Error: -M must be between 0 and 0.3') + + # sequencing error model + if se_model is None: + print('Using default sequencing error model.') + se_model = sim_path / 'models/errorModel_toy.p' + se_class = ReadContainer(read_len, se_model, se_rate, rescale_qual) + else: + # probably need to do some sanity checking + se_class = ReadContainer(read_len, se_model, se_rate, rescale_qual) + + # GC-bias model + if gc_bias_model is None: + print('Using default gc-bias model.') + gc_bias_model = sim_path / 'models/gcBias_toy.p' + try: + [gc_scale_count, gc_scale_val] = pickle.load(open(gc_bias_model, 'rb')) + except IOError: + print("\nProblem reading the default gc-bias model.\n") + sys.exit(1) + gc_window_size = gc_scale_count[-1] + else: + try: + [gc_scale_count, gc_scale_val] = pickle.load(open(gc_bias_model, 'rb')) + except IOError: + print("\nProblem reading the gc-bias model.\n") + sys.exit(1) + gc_window_size = gc_scale_count[-1] + + # Assign appropriate values to the needed variables if we're dealing with paired-ended data + if paired_end: + # Empirical fragment length distribution, if input model is specified + if fraglen_model is not None: + print('Using empirical fragment length distribution.') + try: + [potential_values, potential_prob] = pickle.load(open(fraglen_model, 'rb')) + except IOError: + print('\nProblem loading the empirical fragment length model.\n') + sys.exit(1) + + fraglen_values = [] + fraglen_probability = [] + for i in range(len(potential_values)): + if potential_values[i] > read_len: + fraglen_values.append(potential_values[i]) + fraglen_probability.append(potential_prob[i]) + + # TODO add some validation and sanity-checking code here... + fraglen_distribution = DiscreteDistribution(fraglen_probability, fraglen_values) + fragment_size = fraglen_values[mean_ind_of_weighted_list(fraglen_probability)] + + # Using artificial fragment length distribution, if the parameters were specified + # fragment length distribution: normal distribution that goes out to +- 6 standard deviations + elif fragment_size is not None and fragment_std is not None: + print( + 'Using artificial fragment length distribution. 
mean=' + str(fragment_size) + ', std=' + str( + fragment_std)) + if fragment_std == 0: + fraglen_distribution = DiscreteDistribution([1], [fragment_size], degenerate_val=fragment_size) + else: + potential_values = range(fragment_size - 6 * fragment_std, fragment_size + 6 * fragment_std + 1) + fraglen_values = [] + for i in range(len(potential_values)): + if potential_values[i] > read_len: + fraglen_values.append(potential_values[i]) + fraglen_probability = [np.exp(-(((n - float(fragment_size)) ** 2) / (2 * (fragment_std ** 2)))) for n in + fraglen_values] + fraglen_distribution = DiscreteDistribution(fraglen_probability, fraglen_values) + + """ + Process Inputs + """ + + # index reference: [(0: chromosome name, 1: byte index where the contig seq begins, + # 2: byte index where the next contig begins, 3: contig seq length), + # (repeat for every chrom)] + # TODO check to see if this might work better as a dataframe or biopython object + ref_index = index_ref(reference) + + # TODO check if this index can work, maybe it's faster + # ref_index2 = SeqIO.index(reference, 'fasta') + + if paired_end: + n_handling = ('random', fragment_size) + else: + n_handling = ('ignore', read_len) + + indices_by_ref_name = {ref_index[n][0]: n for n in range(len(ref_index))} + ref_list = [n[0] for n in ref_index] + + # parse input variants, if present + # TODO read this in as a pandas dataframe + input_variants = [] + if input_vcf is not None: + if cancer: + (sample_names, input_variants) = parse_vcf(input_vcf, tumor_normal=True, ploidy=ploids) + # TODO figure out what these were going to be used for + tumor_ind = sample_names.index('TUMOR') + normal_ind = sample_names.index('NORMAL') + else: + (sample_names, input_variants) = parse_vcf(input_vcf, ploidy=ploids) + for k in sorted(input_variants.keys()): + input_variants[k].sort() + + # parse input targeted regions, if present + # TODO convert bed to pandas dataframe + input_regions = {} + if input_bed is not None: + try: + with open(input_bed, 'r') as f: + for line in f: + [my_chr, pos1, pos2] = line.strip().split('\t')[:3] + if my_chr not in input_regions: + input_regions[my_chr] = [-1] + input_regions[my_chr].extend([int(pos1), int(pos2)]) + except IOError: + print("\nProblem reading input target BED file.\n") + sys.exit(1) + + # some validation + n_in_bed_only = 0 + n_in_ref_only = 0 + for k in ref_list: + if k not in input_regions: + n_in_ref_only += 1 + for k in input_regions.keys(): + if k not in ref_list: + n_in_bed_only += 1 + del input_regions[k] + if n_in_ref_only > 0: + print('Warning: Reference contains sequences not found in targeted regions BED file.') + if n_in_bed_only > 0: + print( + 'Warning: Targeted regions BED file contains sequence names not found in reference (regions ignored).') + + # parse discard bed similarly + # TODO convert to pandas dataframe + discard_regions = {} + if discard_bed is not None: + try: + with open(discard_bed, 'r') as f: + for line in f: + [my_chr, pos1, pos2] = line.strip().split('\t')[:3] + if my_chr not in discard_regions: + discard_regions[my_chr] = [-1] + discard_regions[my_chr].extend([int(pos1), int(pos2)]) + except IOError: + print("\nProblem reading discard BED file.\n") + sys.exit(1) + + # parse input mutation rate rescaling regions, if present + # TODO convert to pandas dataframe + mut_rate_regions = {} + mut_rate_values = {} + if mut_bed is not None: + try: + with open(mut_bed, 'r') as f: + for line in f: + [my_chr, pos1, pos2, meta_data] = line.strip().split('\t')[:4] + mut_str = 
re.findall(r"mut_rate=.*?(?=;)", meta_data + ';') + (pos1, pos2) = (int(pos1), int(pos2)) + if len(mut_str) and (pos2 - pos1) > 1: + # mut_rate = #_mutations / length_of_region, let's bound it by a reasonable amount + mut_rate = max([0.0, min([float(mut_str[0][9:]), 0.3])]) + if my_chr not in mut_rate_regions: + mut_rate_regions[my_chr] = [-1] + mut_rate_values[my_chr] = [0.0] + mut_rate_regions[my_chr].extend([pos1, pos2]) + # TODO figure out what the next line is supposed to do and fix + mut_rate_values.extend([mut_rate * (pos2 - pos1)] * 2) + except IOError: + print("\nProblem reading mutational BED file.\n") + sys.exit(1) + + # initialize output files (part I) + bam_header = None + if save_bam: + # TODO wondering if this is actually needed in the bam_header + bam_header = [copy.deepcopy(ref_index)] + vcf_header = None + if save_vcf: + vcf_header = [reference] + + # initialize output files (part II) + # TODO figure out how to do this more efficiently. Write the files at the end. + if cancer: + output_file_writer = OutputFileWriter(out_prefix + '_normal', paired=paired_end, bam_header=bam_header, + vcf_header=vcf_header, + no_fastq=no_fastq, fasta_instead=fasta_instead) + output_file_writer_cancer = OutputFileWriter(out_prefix + '_tumor', paired=paired_end, bam_header=bam_header, + vcf_header=vcf_header, + no_fastq=no_fastq, fasta_instead=fasta_instead) + else: + output_file_writer = OutputFileWriter(out_prefix, paired=paired_end, bam_header=bam_header, + vcf_header=vcf_header, + no_fastq=no_fastq, + fasta_instead=fasta_instead) + # Using pathlib to make this more machine agnostic + out_prefix_name = pathlib.Path(out_prefix).name + + """ + LET'S GET THIS PARTY STARTED... + """ + # keep track of the number of reads we've sampled, for read-names + read_name_count = 1 + unmapped_records = [] + + for chrom in range(len(ref_index)): + + # read in reference sequence and notate blocks of Ns + (ref_sequence, n_regions) = read_ref(reference, ref_index[chrom], n_handling) + + # count total bp we'll be spanning so we can get an idea of how far along we are + # (for printing progress indicators) + total_bp_span = sum([n[1] - n[0] for n in n_regions['non_N']]) + current_progress = 0 + current_percent = 0 + have_printed100 = False + + """Prune invalid input variants, e.g variants that: + - try to delete or alter any N characters + - don't match the reference base at their specified position + - any alt allele contains anything other than allowed characters""" + valid_variants_from_vcf = [] + n_skipped = [0, 0, 0] + if ref_index[chrom][0] in input_variants: + for n in input_variants[ref_index[chrom][0]]: + span = (n[0], n[0] + len(n[1])) + r_seq = str(ref_sequence[span[0] - 1:span[1] - 1]) # -1 because going from VCF coords to array coords + # Checks if there are any invalid nucleotides in the vcf items + any_bad_nucl = any((nn not in ALLOWED_NUCL) for nn in [item for sublist in n[2] for item in sublist]) + # Ensure reference sequence matches the nucleotide in the vcf + if r_seq != n[1]: + n_skipped[0] += 1 + continue + # Ensure that we aren't trying to insert into an N region + elif 'N' in r_seq: + n_skipped[1] += 1 + continue + # Ensure that we don't insert any disallowed characters + elif any_bad_nucl: + n_skipped[2] += 1 + continue + # If it passes the above tests, append to valid variants list + valid_variants_from_vcf.append(n) + + print('found', len(valid_variants_from_vcf), 'valid variants for ' + + ref_index[chrom][0] + ' in input VCF...') + if any(n_skipped): + print(sum(n_skipped), 'variants 
skipped...') + print(' - [' + str(n_skipped[0]) + '] ref allele does not match reference') + print(' - [' + str(n_skipped[1]) + '] attempting to insert into N-region') + print(' - [' + str(n_skipped[2]) + '] alt allele contains non-ACGT characters') + + # TODO add large random structural variants + + # determine sampling windows based on read length, large N regions, and structural mutations. + # in order to obtain uniform coverage, windows should overlap by: + # - read_len, if single-end reads + # - fragment_size (mean), if paired-end reads + # ploidy is fixed per large sampling window, + # coverage distributions due to GC% and targeted regions are specified within these windows + all_variants_out = {} + sequences = None + if paired_end: + target_size = WINDOW_TARGET_SCALE * fragment_size + overlap = fragment_size + overlap_min_window_size = max(fraglen_distribution.values) + 10 + else: + target_size = WINDOW_TARGET_SCALE * read_len + overlap = read_len + overlap_min_window_size = read_len + 10 + + print('--------------------------------') + if only_vcf: + print('generating vcf...') + elif fasta_instead: + print('generating mutated fasta...') + else: + print('sampling reads...') + tt = time.time() + # start the progress bar + print("[", end='', flush=True) + + # Applying variants to non-N regions + for i in range(len(n_regions['non_N'])): + (initial_position, final_position) = n_regions['non_N'][i] + number_target_windows = max([1, (final_position - initial_position) // target_size]) + base_pair_distance = int((final_position - initial_position) / float(number_target_windows)) + + # if for some reason our region is too small to process, skip it! (sorry) + if number_target_windows == 1 and (final_position - initial_position) < overlap_min_window_size: + continue + + start = initial_position + end = min([start + base_pair_distance, final_position]) + vars_from_prev_overlap = [] + vars_cancer_from_prev_overlap = [] + v_index_from_prev = 0 + is_last_time = False + + while True: + # which inserted variants are in this window? + vars_in_window = [] + updated = False + for j in range(v_index_from_prev, len(valid_variants_from_vcf)): + variants_position = valid_variants_from_vcf[j][0] + # update: changed <= to <, so variant cannot be inserted in first position + if start < variants_position < end: + # vcf --> array coords + vars_in_window.append(tuple([variants_position - 1] + list(valid_variants_from_vcf[j][1:]))) + if variants_position >= end - overlap - 1 and updated is False: + updated = True + v_index_from_prev = j + if variants_position >= end: + break + + # determine which structural variants will affect our sampling window positions + structural_vars = [] + for n in vars_in_window: + # change: added abs() so that insertions are also buffered. 
+ buffer_needed = max([max([abs(len(n[1]) - len(alt_allele)), 1]) for alt_allele in n[2]]) + # -1 because going from VCF coords to array coords + structural_vars.append((n[0] - 1, buffer_needed)) + + # adjust end-position of window based on inserted structural mutations + keep_going = True + while keep_going: + keep_going = False + for n in structural_vars: + # adding "overlap" here to prevent SVs from being introduced in overlap regions + # (which can cause problems if random mutations from the previous window land on top of them) + delta = (end - 1) - (n[0] + n[1]) - 2 - overlap + if delta < 0: + buffer_added = -delta + end += buffer_added + keep_going = True + break + next_start = end - overlap + next_end = min([next_start + base_pair_distance, final_position]) + if next_end - next_start < base_pair_distance: + end = next_end + is_last_time = True + + # print progress indicator + if debug: + print(f'PROCESSING WINDOW: {(start, end), [buffer_added]}, ' + f'next: {(next_start, next_end)}, isLastTime: {is_last_time}') + current_progress += end - start + new_percent = int((current_progress * 100) / float(total_bp_span)) + if new_percent > current_percent: + if new_percent <= 99 or (new_percent == 100 and not have_printed100): + if new_percent % 10 == 1: + print('-', end='', flush=True) + current_percent = new_percent + if current_percent == 100: + have_printed100 = True + + skip_this_window = False + + # compute coverage modifiers + coverage_avg = None + coverage_dat = [gc_window_size, gc_scale_val, []] + target_hits = 0 + if input_bed is None: + coverage_dat[2] = [1.0] * (end - start) + else: + if ref_index[chrom][0] not in input_regions: + coverage_dat[2] = [off_target_scalar] * (end - start) + else: + for j in range(start, end): + if not (bisect.bisect(input_regions[ref_index[chrom][0]], j) % 2): + coverage_dat[2].append(1.0) + target_hits += 1 + else: + coverage_dat[2].append(off_target_scalar) + + # off-target and we're not interested? 
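The per-position target check above relies on a parity trick: each BED track is flattened into a sorted list `[-1, start1, end1, start2, end2, ...]`, so an even `bisect` insertion index means the position falls inside a region. A small self-contained sketch with toy coordinates (not from any real BED file):
```
import bisect

# Two hypothetical target regions, [100, 200) and [500, 800), flattened the
# same way gen_reads.py builds input_regions / discard_regions.
track = [-1, 100, 200, 500, 800]

def on_target(pos):
    # Even insertion index -> inside a region; odd -> outside all regions.
    return bisect.bisect(track, pos) % 2 == 0

assert on_target(150) and on_target(600)
assert not on_target(50) and not on_target(300)
```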
+ if off_target_discard and target_hits <= read_len: + coverage_avg = 0.0 + skip_this_window = True + + # print len(coverage_dat[2]), sum(coverage_dat[2]) + if sum(coverage_dat[2]) < low_cov_thresh: + coverage_avg = 0.0 + skip_this_window = True + + # check for small window sizes + if (end - start) < overlap_min_window_size: + skip_this_window = True + + if skip_this_window: + # skip window, save cpu time + start = next_start + end = next_end + if is_last_time: + break + if end >= final_position: + is_last_time = True + vars_from_prev_overlap = [] + continue + + # construct sequence data that we will sample reads from + if sequences is None: + sequences = SequenceContainer(start, ref_sequence[start:end], ploids, overlap, read_len, + [mut_model] * ploids, mut_rate, only_vcf=only_vcf) + if [cigar for cigar in sequences.all_cigar[0] if len(cigar) != 100] or \ + [cig for cig in sequences.all_cigar[1] if len(cig) != 100]: + print("There's a cigar that's off.") + # pdb.set_trace() + sys.exit(1) + else: + sequences.update(start, ref_sequence[start:end], ploids, overlap, read_len, [mut_model] * ploids, + mut_rate) + if [cigar for cigar in sequences.all_cigar[0] if len(cigar) != 100] or \ + [cig for cig in sequences.all_cigar[1] if len(cig) != 100]: + print("There's a cigar that's off.") + # pdb.set_trace() + sys.exit(1) + + # insert variants + sequences.insert_mutations(vars_from_prev_overlap + vars_in_window) + all_inserted_variants = sequences.random_mutations() + # print all_inserted_variants + + # init coverage + if sum(coverage_dat[2]) >= low_cov_thresh: + if paired_end: + coverage_avg = sequences.init_coverage(tuple(coverage_dat), frag_dist=fraglen_distribution) + else: + coverage_avg = sequences.init_coverage(tuple(coverage_dat)) + + # unused cancer stuff + if cancer: + tumor_sequences = SequenceContainer(start, ref_sequence[start:end], ploids, overlap, read_len, + [cancer_model] * ploids, mut_rate, coverage_dat) + tumor_sequences.insert_mutations(vars_cancer_from_prev_overlap + all_inserted_variants) + all_cancer_variants = tumor_sequences.random_mutations() + + # which variants do we need to keep for next time (because of window overlap)? 
+ vars_from_prev_overlap = [] + vars_cancer_from_prev_overlap = [] + for n in all_inserted_variants: + if n[0] >= end - overlap - 1: + vars_from_prev_overlap.append(n) + if cancer: + for n in all_cancer_variants: + if n[0] >= end - overlap - 1: + vars_cancer_from_prev_overlap.append(n) + + # if we're only producing VCF, no need to go through the hassle of generating reads + if only_vcf: + pass + else: + window_span = end - start + + if paired_end: + if force_coverage: + reads_to_sample = int((window_span * float(coverage)) / (2 * read_len)) + 1 + else: + reads_to_sample = int((window_span * float(coverage) * coverage_avg) / (2 * read_len)) + 1 + else: + if force_coverage: + reads_to_sample = int((window_span * float(coverage)) / read_len) + 1 + else: + reads_to_sample = int((window_span * float(coverage) * coverage_avg) / read_len) + 1 + + # if coverage is so low such that no reads are to be sampled, skip region + # (i.e., remove buffer of +1 reads we add to every window) + if reads_to_sample == 1 and sum(coverage_dat[2]) < low_cov_thresh: + reads_to_sample = 0 + + # sample reads + for k in range(reads_to_sample): + + is_unmapped = [] + if paired_end: + my_fraglen = fraglen_distribution.sample() + my_read_data = sequences.sample_read(se_class, my_fraglen) + # skip if we failed to find a valid position to sample read + if my_read_data is None: + continue + if my_read_data[0][0] is None: + is_unmapped.append(True) + else: + is_unmapped.append(False) + # adjust mapping position based on window start + my_read_data[0][0] += start + if my_read_data[1][0] is None: + is_unmapped.append(True) + else: + is_unmapped.append(False) + my_read_data[1][0] += start + else: + my_read_data = sequences.sample_read(se_class) + # skip if we failed to find a valid position to sample read + if my_read_data is None: + continue + # unmapped read (lives in large insertion) + if my_read_data[0][0] is None: + is_unmapped = [True] + else: + is_unmapped = [False] + # adjust mapping position based on window start + my_read_data[0][0] += start + + # are we discarding offtargets? + outside_boundaries = [] + if off_target_discard and input_bed is not None: + outside_boundaries += [bisect.bisect(input_regions[ref_index[chrom][0]], n[0]) % 2 for n + in my_read_data] + outside_boundaries += [ + bisect.bisect(input_regions[ref_index[chrom][0]], n[0] + len(n[2])) % 2 for n in + my_read_data] + if discard_bed is not None: + outside_boundaries += [bisect.bisect(discard_regions[ref_index[chrom][0]], n[0]) % 2 for + n in my_read_data] + outside_boundaries += [ + bisect.bisect(discard_regions[ref_index[chrom][0]], n[0] + len(n[2])) % 2 for n in + my_read_data] + if len(outside_boundaries) and any(outside_boundaries): + continue + + my_read_name = out_prefix_name + '-' + ref_index[chrom][0] + '-' + str(read_name_count) + read_name_count += len(my_read_data) + + # if desired, replace all low-quality bases with Ns + if n_max_qual > -1: + for j in range(len(my_read_data)): + my_read_string = [n for n in my_read_data[j][2]] + for m in range(len(my_read_data[j][3])): + adjusted_qual = ord(my_read_data[j][3][m]) - se_class.off_q + if adjusted_qual <= n_max_qual: + my_read_string[m] = 'N' + my_read_data[j][2] = ''.join(my_read_string) + + # flip a coin, are we forward or reverse strand? 
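The read-writing blocks that follow compose SAM flags from named properties via `sam_flag` (imported from source.output_file_writer). For reference, a tiny stand-in showing the standard SAM flag bits behind those names and the familiar 99/147 proper-pair combination; the bit values mirror the `sam_flag` mapping visible in the removed py/OutputFileWriter.py later in this patch:
```
# SAM flag bits for the names used with sam_flag() in this file.
SAM_BITS = {'paired': 1, 'proper': 2, 'unmapped': 4, 'mate_unmapped': 8,
            'reverse': 16, 'mate_reverse': 32, 'first': 64, 'second': 128}

def flag_of(names):
    return sum(SAM_BITS[n] for n in set(names))

# Forward read 1 of a proper pair (mate on the reverse strand) and its mate:
assert flag_of(['paired', 'proper', 'first', 'mate_reverse']) == 99
assert flag_of(['paired', 'proper', 'second', 'reverse']) == 147
```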
+ is_forward = (random.random() < 0.5) + + # if read (or read + mate for PE) are unmapped, put them at end of bam file + if all(is_unmapped): + if paired_end: + if is_forward: + flag1 = sam_flag(['paired', 'unmapped', 'mate_unmapped', 'first', 'mate_reverse']) + flag2 = sam_flag(['paired', 'unmapped', 'mate_unmapped', 'second', 'reverse']) + else: + flag1 = sam_flag(['paired', 'unmapped', 'mate_unmapped', 'second', 'mate_reverse']) + flag2 = sam_flag(['paired', 'unmapped', 'mate_unmapped', 'first', 'reverse']) + unmapped_records.append((my_read_name + '/1', my_read_data[0], flag1)) + unmapped_records.append((my_read_name + '/2', my_read_data[1], flag2)) + else: + flag1 = sam_flag(['unmapped']) + unmapped_records.append((my_read_name + '/1', my_read_data[0], flag1)) + + my_ref_index = indices_by_ref_name[ref_index[chrom][0]] + + # write SE output + if len(my_read_data) == 1: + if not no_fastq: + if is_forward: + output_file_writer.write_fastq_record(my_read_name, my_read_data[0][2], + my_read_data[0][3]) + else: + output_file_writer.write_fastq_record(my_read_name, + reverse_complement(my_read_data[0][2]), + my_read_data[0][3][::-1]) + if save_bam: + if is_unmapped[0] is False: + if is_forward: + flag1 = 0 + output_file_writer.write_bam_record(my_ref_index, my_read_name, + my_read_data[0][0], + my_read_data[0][1], my_read_data[0][2], + my_read_data[0][3], + output_sam_flag=flag1) + else: + flag1 = sam_flag(['reverse']) + output_file_writer.write_bam_record(my_ref_index, my_read_name, + my_read_data[0][0], + my_read_data[0][1], my_read_data[0][2], + my_read_data[0][3], + output_sam_flag=flag1) + # write PE output + elif len(my_read_data) == 2: + if no_fastq is not True: + output_file_writer.write_fastq_record(my_read_name, my_read_data[0][2], + my_read_data[0][3], + read2=my_read_data[1][2], + qual2=my_read_data[1][3], + orientation=is_forward) + if save_bam: + if is_unmapped[0] is False and is_unmapped[1] is False: + if is_forward: + flag1 = sam_flag(['paired', 'proper', 'first', 'mate_reverse']) + flag2 = sam_flag(['paired', 'proper', 'second', 'reverse']) + else: + flag1 = sam_flag(['paired', 'proper', 'second', 'mate_reverse']) + flag2 = sam_flag(['paired', 'proper', 'first', 'reverse']) + output_file_writer.write_bam_record(my_ref_index, my_read_name, my_read_data[0][0], + my_read_data[0][1], my_read_data[0][2], + my_read_data[0][3], + output_sam_flag=flag1, + mate_pos=my_read_data[1][0]) + output_file_writer.write_bam_record(my_ref_index, my_read_name, my_read_data[1][0], + my_read_data[1][1], my_read_data[1][2], + my_read_data[1][3], + output_sam_flag=flag2, mate_pos=my_read_data[0][0]) + elif is_unmapped[0] is False and is_unmapped[1] is True: + if is_forward: + flag1 = sam_flag(['paired', 'first', 'mate_unmapped', 'mate_reverse']) + flag2 = sam_flag(['paired', 'second', 'unmapped', 'reverse']) + else: + flag1 = sam_flag(['paired', 'second', 'mate_unmapped', 'mate_reverse']) + flag2 = sam_flag(['paired', 'first', 'unmapped', 'reverse']) + output_file_writer.write_bam_record(my_ref_index, my_read_name, my_read_data[0][0], + my_read_data[0][1], my_read_data[0][2], + my_read_data[0][3], + output_sam_flag=flag1, mate_pos=my_read_data[0][0]) + output_file_writer.write_bam_record(my_ref_index, my_read_name, my_read_data[0][0], + my_read_data[1][1], my_read_data[1][2], + my_read_data[1][3], + output_sam_flag=flag2, mate_pos=my_read_data[0][0], + aln_map_quality=0) + elif is_unmapped[0] is True and is_unmapped[1] is False: + if is_forward: + flag1 = sam_flag(['paired', 'first', 'unmapped', 
'mate_reverse']) + flag2 = sam_flag(['paired', 'second', 'mate_unmapped', 'reverse']) + else: + flag1 = sam_flag(['paired', 'second', 'unmapped', 'mate_reverse']) + flag2 = sam_flag(['paired', 'first', 'mate_unmapped', 'reverse']) + output_file_writer.write_bam_record(my_ref_index, my_read_name, my_read_data[1][0], + my_read_data[0][1], my_read_data[0][2], + my_read_data[0][3], + output_sam_flag=flag1, mate_pos=my_read_data[1][0], + aln_map_quality=0) + output_file_writer.write_bam_record(my_ref_index, my_read_name, my_read_data[1][0], + my_read_data[1][1], my_read_data[1][2], + my_read_data[1][3], + output_sam_flag=flag2, mate_pos=my_read_data[1][0]) + else: + print('\nError: Unexpected number of reads generated...\n') + sys.exit(1) + + if not is_last_time: + output_file_writer.flush_buffers(bam_max=next_start) + else: + output_file_writer.flush_buffers(bam_max=end + 1) + + # tally up all the variants that got successfully introduced + for n in all_inserted_variants: + all_variants_out[n] = True + + # prepare indices of next window + start = next_start + end = next_end + if is_last_time: + break + if end >= final_position: + is_last_time = True + + print(']', flush=True) + + if only_vcf: + print('VCF generation completed in ', end='') + else: + print('Read sampling completed in ', end='') + print(int(time.time() - tt), '(sec)') + + # write all output variants for this reference + if save_vcf: + print('Writing output VCF...') + for k in sorted(all_variants_out.keys()): + current_ref = ref_index[chrom][0] + my_id = '.' + my_quality = '.' + my_filter = 'PASS' + # k[0] + 1 because we're going back to 1-based vcf coords + output_file_writer.write_vcf_record(current_ref, str(int(k[0]) + 1), my_id, k[1], k[2], my_quality, + my_filter, k[4]) + + # write unmapped reads to bam file + if save_bam and len(unmapped_records): + print('writing unmapped reads to bam file...') + for umr in unmapped_records: + if paired_end: + output_file_writer.write_bam_record(-1, umr[0], 0, umr[1][1], umr[1][2], umr[1][3], output_sam_flag=umr[2], + mate_pos=0, + aln_map_quality=0) + else: + output_file_writer.write_bam_record(-1, umr[0], 0, umr[1][1], umr[1][2], umr[1][3], output_sam_flag=umr[2], + aln_map_quality=0) + + # close output files + output_file_writer.close_files() + if cancer: + output_file_writer_cancer.close_files() + + +if __name__ == '__main__': + main() diff --git a/mergeJobs.py b/mergeJobs.py deleted file mode 100644 index ab23e91..0000000 --- a/mergeJobs.py +++ /dev/null @@ -1,129 +0,0 @@ -#!/usr/bin/env python -import os -import argparse - -def getListOfFiles(inDir,pattern): - return [inDir+n for n in os.listdir(inDir) if (pattern in n and os.path.getsize(inDir+n))] - -TEMP_IND = 0 -def stripVCF_header(fn): - global TEMP_IND - f = open(fn,'r') - ftn = fn+'_temp'+str(TEMP_IND) - f_t = open(ftn,'w') - hasHeader = False - for line in f: - if line[0] == '#': - if not hasHeader: - TEMP_IND += 1 - hasHeader = True - elif hasHeader: - f_t.write(line) - else: - break - f_t.close() - f.close() - if hasHeader: - return ftn - else: - os.system('rm '+ftn) - return fn - -def catListOfFiles(l,outName,gzipped=False): - for n in l: - if n[-3:] == '.gz' or n[-5:] == '.gzip': - gzipped = True - if gzipped: - for n in l: - if not n[-3:] == '.gz' and not n[-5:] == '.gzip': - print '\nError: Found a mixture of compressed and decompressed files with the specified prefix. 
Abandoning ship...\n' - for m in l: - print m - print '' - exit(1) - cmd = 'cat '+' '.join(sorted(l))+' > '+outName+'.gz' - else: - cmd = 'cat '+' '.join(sorted(l))+' > '+outName - print cmd - os.system(cmd) - -def catBams(l,outName,samtools_exe): - l_sort = sorted(l) - tmp = outName+'.tempHeader.sam' - os.system(samtools_exe+' view -H '+l_sort[0]+' > '+tmp) - cmd = samtools_exe+' cat -h '+tmp+' '+' '.join(l_sort)+' | '+samtools_exe+' sort - > '+outName - print cmd - os.system(cmd) - cmd = samtools_exe+' index '+outName - print cmd - os.system(cmd) - os.system('rm '+tmp) - - -##################################### -# main() # -##################################### - -def main(): - - parser = argparse.ArgumentParser(description='mergeJobs.py') - parser.add_argument('-i', type=str, required=True, metavar='', nargs='+', help="* input prefix: [prefix_1] [prefix_2] ...") - parser.add_argument('-o', type=str, required=True, metavar='', help="* output prefix") - parser.add_argument('-s', type=str, required=True, metavar='', help="* /path/to/samtools") - parser.add_argument('--no-job', required=False, action='store_true', help='files do not have .job suffix', default=False) - - args = parser.parse_args() - (INP, OUP, SAMTOOLS, NO_JOB) = (args.i, args.o, args.s, args.no_job) - - inDir = '/'.join(INP[0].split('/')[:-1])+'/' - if inDir == '/': - inDir = './' - #print inDir - - INP_LIST = [] - for n in INP: - if n[-1] == '/': - n = n[:-1] - INP_LIST.append(n.split('/')[-1]) - listing_r1 = [] - listing_r2 = [] - listing_b = [] - listing_v = [] - pat_r1 = '_read1.fq' + (NO_JOB == False)*'.job' - pat_r2 = '_read2.fq' + (NO_JOB == False)*'.job' - pat_gb = '_golden.bam' + (NO_JOB == False)*'.job' - pat_gv = '_golden.vcf' + (NO_JOB == False)*'.job' - for n in INP_LIST: - listing_r1 += getListOfFiles(inDir,n+pat_r1) - listing_r2 += getListOfFiles(inDir,n+pat_r2) - listing_b += getListOfFiles(inDir,n+pat_gb) - if len(listing_v): # remove headers from vcf files that aren't the first being processed - initList = getListOfFiles(inDir,n+pat_gv) - listing_v += [stripVCF_header(n) for n in initList] - else: - listing_v += getListOfFiles(inDir,n+pat_gv) - - # - # merge fq files - # - if len(listing_r1): - catListOfFiles(listing_r1,OUP+'_read1.fq') - if len(listing_r2): - catListOfFiles(listing_r2,OUP+'_read2.fq') - - # - # merge golden alignments, if present - # - if len(listing_b): - catBams(listing_b,OUP+'_golden.bam',SAMTOOLS) - - # - # merge golden vcfs, if present - # - if len(listing_v): - catListOfFiles(listing_v,OUP+'_golden.vcf') - - -if __name__ == "__main__": - main() - diff --git a/models/MutModel_BRCA_US_ICGC.p b/models/MutModel_BRCA_US_ICGC.p index 0450b94..0f946fc 100644 Binary files a/models/MutModel_BRCA_US_ICGC.p and b/models/MutModel_BRCA_US_ICGC.p differ diff --git a/models/MutModel_CLLE-ES_ICGC.p b/models/MutModel_CLLE-ES_ICGC.p index e14a039..b9dc37f 100644 Binary files a/models/MutModel_CLLE-ES_ICGC.p and b/models/MutModel_CLLE-ES_ICGC.p differ diff --git a/models/MutModel_NA12878_noIndel.p b/models/MutModel_NA12878_noIndel.p index d7372e9..4e9d75a 100644 Binary files a/models/MutModel_NA12878_noIndel.p and b/models/MutModel_NA12878_noIndel.p differ diff --git a/models/MutModel_SKCM-US_ICGC.p b/models/MutModel_SKCM-US_ICGC.p index ea9c4fc..9bcac55 100644 Binary files a/models/MutModel_SKCM-US_ICGC.p and b/models/MutModel_SKCM-US_ICGC.p differ diff --git a/models/errorModel_pacbio_toy.p b/models/errorModel_pacbio_toy.p index 1988777..ba376c0 100644 Binary files a/models/errorModel_pacbio_toy.p 
and b/models/errorModel_pacbio_toy.p differ diff --git a/models/errorModel_toy.p b/models/errorModel_toy.p index a172db8..111440c 100644 Binary files a/models/errorModel_toy.p and b/models/errorModel_toy.p differ diff --git a/models/fraglenModel_toy.p b/models/fraglenModel_toy.p index 50aa253..02883c1 100644 Binary files a/models/fraglenModel_toy.p and b/models/fraglenModel_toy.p differ diff --git a/models/gcBias_toy.p b/models/gcBias_toy.p index f4630a5..627497d 100644 Binary files a/models/gcBias_toy.p and b/models/gcBias_toy.p differ diff --git a/models/gcBias_uniform.p b/models/gcBias_uniform.p index 52c5b86..a8ad93d 100644 Binary files a/models/gcBias_uniform.p and b/models/gcBias_uniform.p differ diff --git a/py/OutputFileWriter.py b/py/OutputFileWriter.py deleted file mode 100644 index 87dc7c1..0000000 --- a/py/OutputFileWriter.py +++ /dev/null @@ -1,285 +0,0 @@ -import sys -import os -import re -import gzip -from struct import pack - -from biopython_modified_bgzf import BgzfWriter - -BAM_COMPRESSION_LEVEL = 6 - -# return the reverse complement of a string -RC_DICT = {'A':'T','C':'G','G':'C','T':'A','N':'N'} -def RC(s): - return ''.join(RC_DICT[n] for n in s[::-1]) - -# SAMtools reg2bin function -def reg2bin(a,b): - b -= 1 - if (a>>14 == b>>14): return ((1<<15)-1)/7 + (a>>14) - if (a>>17 == b>>17): return ((1<<12)-1)/7 + (a>>17) - if (a>>20 == b>>20): return ((1<<9)-1)/7 + (a>>20) - if (a>>23 == b>>23): return ((1<<6)-1)/7 + (a>>23) - if (a>>26 == b>>26): return ((1<<3)-1)/7 + (a>>26) - return 0 - -# takes list of strings, returns numerical flag -def sam_flag(l): - outVal = 0 - l = list(set(l)) - for n in l: - if n == 'paired': outVal += 1 - elif n == 'proper': outVal += 2 - elif n == 'unmapped': outVal += 4 - elif n == 'mate_unmapped': outVal += 8 - elif n == 'reverse': outVal += 16 - elif n == 'mate_reverse': outVal += 32 - elif n == 'first': outVal += 64 - elif n == 'second': outVal += 128 - elif n == 'not_primary': outVal += 256 - elif n == 'low_quality': outVal += 512 - elif n == 'duplicate': outVal += 1024 - elif n == 'supplementary': outVal += 2048 - return outVal - -CIGAR_PACKED = {'M':0, 'I':1, 'D':2, 'N':3, 'S':4, 'H':5, 'P':6, '=':7, 'X':8} -SEQ_PACKED = {'=':0, 'A':1, 'C':2, 'M':3, 'G':4, 'R':5, 'S':6, 'V':7, - 'T':8, 'W':9, 'Y':10,'H':11,'K':12,'D':13,'B':14,'N':15} - -BUFFER_BATCH_SIZE = 1000 # write out to file after this many reads - -# -# outFQ = path to output FASTQ prefix -# paired = True for PE reads, False for SE -# BAM_header = [refIndex] -# VCF_header = [path_to_ref] -# gzipped = True for compressed FASTQ/VCF, False for uncompressed -# -class OutputFileWriter: - def __init__(self, outPrefix, paired=False, BAM_header=None, VCF_header=None, gzipped=False, jobTuple=(1,1), noFASTQ=False, FASTA_instead=False): - - jobSuffix = '' - if jobTuple[1] > 1: - jsl = len(str(jobTuple[1])) - jsb = '0'*(jsl-len(str(jobTuple[0]))) - jobSuffix = '.job'+jsb+str(jobTuple[0])+'of'+str(jobTuple[1]) - - self.FASTA_instead = FASTA_instead - if FASTA_instead: - fq1 = outPrefix+'_read1.fa'+jobSuffix - fq2 = outPrefix+'_read2.fa'+jobSuffix - else: - fq1 = outPrefix+'_read1.fq'+jobSuffix - fq2 = outPrefix+'_read2.fq'+jobSuffix - bam = outPrefix+'_golden.bam'+jobSuffix - vcf = outPrefix+'_golden.vcf'+jobSuffix - - self.noFASTQ = noFASTQ - if not self.noFASTQ: - if gzipped: - self.fq1_file = gzip.open(fq1+'.gz', 'wb') - else: - self.fq1_file = open(fq1,'w') - - self.fq2_file = None - if paired: - if gzipped: - self.fq2_file = gzip.open(fq2+'.gz', 'wb') - else: - self.fq2_file = 
open(fq2,'w') - - # - # VCF OUTPUT - # - self.vcf_file = None - if VCF_header != None: - if gzipped: - self.vcf_file = gzip.open(vcf+'.gz', 'wb') - else: - self.vcf_file = open(vcf, 'wb') - - # WRITE VCF HEADER (if parallel: only for first job) - if jobTuple[0] == 1: - self.vcf_file.write('##fileformat=VCFv4.1\n') - self.vcf_file.write('##reference='+VCF_header[0]+'\n') - self.vcf_file.write('##INFO=\n') - self.vcf_file.write('##INFO=\n') - #self.vcf_file.write('##INFO=\n') - self.vcf_file.write('##INFO=\n') - self.vcf_file.write('##INFO=\n') - self.vcf_file.write('##INFO=\n') - self.vcf_file.write('##INFO=\n') - self.vcf_file.write('##ALT=\n') - self.vcf_file.write('##ALT=\n') - self.vcf_file.write('##ALT=\n') - self.vcf_file.write('##ALT=\n') - self.vcf_file.write('##ALT=\n') - self.vcf_file.write('##ALT=\n') - self.vcf_file.write('##ALT=\n') - self.vcf_file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n') - - # - # BAM OUTPUT - # - self.bam_file = None - if BAM_header != None: - self.bam_file = BgzfWriter(bam, 'w', compresslevel=BAM_COMPRESSION_LEVEL) - - # WRITE BAM HEADER (if parallel: only for first job) - if True or jobTuple[0] == 1: - self.bam_file.write("BAM\1") - header = '@HD\tVN:1.5\tSO:coordinate\n' - for n in BAM_header[0]: - header += '@SQ\tSN:'+n[0]+'\tLN:'+str(n[3])+'\n' - header += '@RG\tID:NEAT\tSM:NEAT\tLB:NEAT\tPL:NEAT\n' - headerBytes = len(header) - numRefs = len(BAM_header[0]) - self.bam_file.write(pack(''+readName+'/1\n'+r1+'\n') - if read2 != None: - self.fq2_buffer.append('>'+readName+'/2\n'+r2+'\n') - else: - self.fq1_buffer.append('@'+readName+'/1\n'+r1+'\n+\n'+q1+'\n') - if read2 != None: - self.fq2_buffer.append('@'+readName+'/2\n'+r2+'\n+\n'+q2+'\n') - - def writeVCFRecord(self, chrom, pos, idStr, ref, alt, qual, filt, info): - self.vcf_file.write(str(chrom)+'\t'+str(pos)+'\t'+str(idStr)+'\t'+str(ref)+'\t'+str(alt)+'\t'+str(qual)+'\t'+str(filt)+'\t'+str(info)+'\n') - - def writeBAMRecord(self, refID, readName, pos_0, cigar, seq, qual, samFlag, matePos=None, alnMapQual=70): - - myBin = reg2bin(pos_0,pos_0+len(seq)) - #myBin = 0 # or just use a dummy value, does this actually matter? 
- - myMapQual = alnMapQual - cig_letters = re.split(r"\d+",cigar)[1:] - cig_numbers = [int(n) for n in re.findall(r"\d+",cigar)] - cig_ops = len(cig_letters) - next_refID = refID - if matePos == None: - next_pos = 0 - my_tlen = 0 - else: - next_pos = matePos - if next_pos > pos_0: - my_tlen = next_pos - pos_0 + len(seq) - else: - my_tlen = next_pos - pos_0 - len(seq) - - encodedCig = '' - for i in xrange(cig_ops): - encodedCig += pack('= BUFFER_BATCH_SIZE or len(self.bam_buffer) >= BUFFER_BATCH_SIZE) or (len(self.fq1_buffer) and lastTime) or (len(self.bam_buffer) and lastTime): - # fq - if not self.noFASTQ: - self.fq1_file.write(''.join(self.fq1_buffer)) - if len(self.fq2_buffer): - self.fq2_file.write(''.join(self.fq2_buffer)) - # bam - if len(self.bam_buffer): - bam_data = sorted(self.bam_buffer) - if lastTime: - self.bam_file.write(''.join([n[2] for n in bam_data])) - self.bam_buffer = [] - else: - ind_to_stop_at = 0 - for i in xrange(0,len(bam_data)): - # if we are from previous reference, or have coordinates lower than next window position, it's safe to write out to file - if bam_data[i][0] != bam_data[-1][0] or bam_data[i][1] < bamMax: - ind_to_stop_at = i+1 - else: - break - self.bam_file.write(''.join([n[2] for n in bam_data[:ind_to_stop_at]])) - ####print 'BAM WRITING:',ind_to_stop_at,'/',len(bam_data) - if ind_to_stop_at >= len(bam_data): - self.bam_buffer = [] - else: - self.bam_buffer = bam_data[ind_to_stop_at:] - self.fq1_buffer = [] - self.fq2_buffer = [] - - - def closeFiles(self): - self.flushBuffers(lastTime=True) - if not self.noFASTQ: - self.fq1_file.close() - if self.fq2_file != None: - self.fq2_file.close() - if self.vcf_file != None: - self.vcf_file.close() - if self.bam_file != None: - self.bam_file.close() - - - - diff --git a/py/SequenceContainer.py b/py/SequenceContainer.py deleted file mode 100644 index 34bf922..0000000 --- a/py/SequenceContainer.py +++ /dev/null @@ -1,1025 +0,0 @@ -import random -import copy -import re -import os -import bisect -import cPickle as pickle -import numpy as np - -from probability import DiscreteDistribution, poisson_list, quantize_list -from neat_cigar import CigarString - -MAX_ATTEMPTS = 100 # max attempts to insert a mutation into a valid position -MAX_MUTFRAC = 0.3 # the maximum percentage of a window that can contain mutations - -NUCL = ['A','C','G','T'] -TRI_IND = {'AA':0, 'AC':1, 'AG':2, 'AT':3, 'CA':4, 'CC':5, 'CG':6, 'CT':7, - 'GA':8, 'GC':9, 'GG':10, 'GT':11, 'TA':12, 'TC':13, 'TG':14, 'TT':15} -NUC_IND = {'A':0, 'C':1, 'G':2, 'T':3} -ALL_TRI = [NUCL[i]+NUCL[j]+NUCL[k] for i in xrange(len(NUCL)) for j in xrange(len(NUCL)) for k in xrange(len(NUCL))] -ALL_IND = {ALL_TRI[i]:i for i in xrange(len(ALL_TRI))} - -# DEBUG -IGNORE_TRINUC = False - -# percentile resolution used for fraglen quantizing -COV_FRAGLEN_PERCENTILE = 10. 
-LARGE_NUMBER = 9999999999 - -# -# Container for reference sequences, applies mutations -# -class SequenceContainer: - def __init__(self, xOffset, sequence, ploidy, windowOverlap, readLen, mutationModels=[], mutRate=None, onlyVCF=False): - # initialize basic variables - self.onlyVCF = onlyVCF - self.init_basicVars(xOffset, sequence, ploidy, windowOverlap, readLen) - # initialize mutation models - self.init_mutModels(mutationModels, mutRate) - # sample the number of variants that will be inserted into each ploid - self.init_poisson() - self.indelsToAdd = [n.sample() for n in self.ind_pois] - self.snpsToAdd = [n.sample() for n in self.snp_pois] - # initialize trinuc snp bias - self.init_trinucBias() - - def init_basicVars(self, xOffset, sequence, ploidy, windowOverlap, readLen): - self.x = xOffset - self.ploidy = ploidy - self.readLen = readLen - self.sequences = [bytearray(sequence) for n in xrange(self.ploidy)] - self.seqLen = len(sequence) - self.indelList = [[] for n in xrange(self.ploidy)] - self.snpList = [[] for n in xrange(self.ploidy)] - self.allCigar = [[] for n in xrange(self.ploidy)] - self.FM_pos = [[] for n in xrange(self.ploidy)] - self.FM_span = [[] for n in xrange(self.ploidy)] - # blackList[ploid][pos] = 0 safe to insert variant here - # blackList[ploid][pos] = 1 indel inserted here - # blackList[ploid][pos] = 2 snp inserted here - # blackList[ploid][pos] = 3 invalid position for various processing reasons - self.blackList = [np.zeros(self.seqLen,dtype=' int(self.readLen/2.): - trCov_vals[i] = [0.0]*int(self.readLen/2) + trCov_vals[i][:-int(self.readLen/2.)] - # fill in missing indices - trCov_vals[i].extend([0.0]*(len(self.sequences[i])-len(trCov_vals[i]))) - - # - covvec = np.cumsum([trCov_vals[i][nnn]*gcCov_vals[i][nnn] for nnn in xrange(len(trCov_vals[i]))]) - coverage_vals = [] - for j in xrange(0,max_coord): - coverage_vals.append(covvec[j+self.readLen] - covvec[j]) - #avg_out.append(np.mean(coverage_vals)/float(self.readLen)) - avg_out.append(np.mean(coverage_vals)/float(min([self.readLen, max_coord]))) - #print avg_out, np.mean(avg_out) - - if fragDist == None: - #print '++++', max_coord, len(self.sequences[i]), len(self.allCigar[i]), len(coverage_vals) - self.coverage_distribution.append(DiscreteDistribution(coverage_vals,range(len(coverage_vals)))) - - # fragment length nightmare - else: - currentThresh = 0. 
- index_list = [0] - for j in xrange(len(fragDist.cumP)): - if fragDist.cumP[j] >= currentThresh + COV_FRAGLEN_PERCENTILE/100.0: - currentThresh = fragDist.cumP[j] - index_list.append(j) - flq = [fragDist.values[nnn] for nnn in index_list] - if fragDist.values[-1] not in flq: - flq.append(fragDist.values[-1]) - flq.append(LARGE_NUMBER) - - self.fraglens_indMap = {} - for j in fragDist.values: - bInd = bisect.bisect(flq,j) - if abs(flq[bInd-1] - j) <= abs(flq[bInd] - j): - self.fraglens_indMap[j] = flq[bInd-1] - else: - self.fraglens_indMap[j] = flq[bInd] - - self.coverage_distribution.append({}) - for flv in sorted(list(set(self.fraglens_indMap.values()))): - buffer_val = self.readLen - for j in fragDist.values: - if self.fraglens_indMap[j] == flv and j > buffer_val: - buffer_val = j - max_coord = min([len(self.sequences[i])-buffer_val-1, len(self.allCigar[i])-buffer_val+self.readLen-2]) - #print 'BEFORE:', len(self.sequences[i])-buffer_val - #print 'AFTER: ', len(self.allCigar[i])-buffer_val+self.readLen-2 - #print 'AFTER2:', max_coord - coverage_vals = [] - for j in xrange(0,max_coord): - coverage_vals.append(covvec[j+self.readLen] - covvec[j] + covvec[j+flv] - covvec[j+flv-self.readLen]) - - # EXPERIMENTAL - #quantized_covVals = quantize_list(coverage_vals) - #self.coverage_distribution[i][flv] = DiscreteDistribution([n[2] for n in quantized_covVals],[(n[0],n[1]) for n in quantized_covVals]) - - # TESTING - #import matplotlib.pyplot as mpl - #print len(coverage_vals),'-->',len(quantized_covVals) - #mpl.figure(0) - #mpl.plot(range(len(coverage_vals)),coverage_vals) - #for qcv in quantized_covVals: - # mpl.plot([qcv[0],qcv[1]+1],[qcv[2],qcv[2]],'r') - #mpl.show() - #exit(1) - - self.coverage_distribution[i][flv] = DiscreteDistribution(coverage_vals,range(len(coverage_vals))) - - return np.mean(avg_out) - - def init_mutModels(self,mutationModels,mutRate): - if mutationModels == []: - ml = [copy.deepcopy(DEFAULT_MODEL_1) for n in xrange(self.ploidy)] - self.modelData = ml[:self.ploidy] - else: - if len(mutationModels) != self.ploidy: - print '\nError: Number of mutation models recieved is not equal to specified ploidy\n' - exit(1) - self.modelData = copy.deepcopy(mutationModels) - - # do we need to rescale mutation frequencies? - mutRateSum = sum([n[0] for n in self.modelData]) - self.mutRescale = mutRate - if self.mutRescale == None: - self.mutScalar = 1.0 - else: - self.mutScalar = float(self.mutRescale)/(mutRateSum/float(len(self.modelData))) - - # how are mutations spread to each ploid, based on their specified mut rates? 
- self.ploidMutFrac = [float(n[0])/mutRateSum for n in self.modelData] - self.ploidMutPrior = DiscreteDistribution(self.ploidMutFrac,range(self.ploidy)) - - # init mutation models - # - # self.models[ploid][0] = average mutation rate - # self.models[ploid][1] = p(mut is homozygous | mutation occurs) - # self.models[ploid][2] = p(mut is indel | mut occurs) - # self.models[ploid][3] = p(insertion | indel occurs) - # self.models[ploid][4] = distribution of insertion lengths - # self.models[ploid][5] = distribution of deletion lengths - # self.models[ploid][6] = distribution of trinucleotide SNP transitions - # self.models[ploid][7] = p(trinuc mutates) - self.models = [] - for n in self.modelData: - self.models.append([self.mutScalar*n[0],n[1],n[2],n[3],DiscreteDistribution(n[5],n[4]),DiscreteDistribution(n[7],n[6]),[]]) - for m in n[8]: - self.models[-1][6].append([DiscreteDistribution(m[0],NUCL), - DiscreteDistribution(m[1],NUCL), - DiscreteDistribution(m[2],NUCL), - DiscreteDistribution(m[3],NUCL)]) - self.models[-1].append([m for m in n[9]]) - - def init_poisson(self): - ind_l_list = [self.seqLen*self.models[i][0]*self.models[i][2]*self.ploidMutFrac[i] for i in xrange(len(self.models))] - snp_l_list = [self.seqLen*self.models[i][0]*(1.-self.models[i][2])*self.ploidMutFrac[i] for i in xrange(len(self.models))] - k_range = range(int(self.seqLen*MAX_MUTFRAC)) - self.ind_pois = [poisson_list(k_range,ind_l_list[n]) for n in xrange(len(self.models))] - self.snp_pois = [poisson_list(k_range,snp_l_list[n]) for n in xrange(len(self.models))] - - def init_trinucBias(self): - # compute mutation positional bias given trinucleotide strings of the sequence (ONLY AFFECTS SNPs) - # - # note: since indels are added before snps, it's possible these positional biases aren't correctly utilized - # at positions affected by indels. At the moment I'm going to consider this negligible. - trinuc_snp_bias = [[0. for n in xrange(self.seqLen)] for m in xrange(self.ploidy)] - self.trinuc_bias = [None for n in xrange(self.ploidy)] - for p in xrange(self.ploidy): - for i in xrange(self.winBuffer+1,self.seqLen-1): - trinuc_snp_bias[p][i] = self.models[p][7][ALL_IND[str(self.sequences[p][i-1:i+2])]] - self.trinuc_bias[p] = DiscreteDistribution(trinuc_snp_bias[p][self.winBuffer+1:self.seqLen-1],range(self.winBuffer+1,self.seqLen-1)) - - def update(self, xOffset, sequence, ploidy, windowOverlap, readLen, mutationModels=[], mutRate=None): - # if mutation model is changed, we have to reinitialize it... 
- if ploidy != self.ploidy or mutRate != self.mutRescale or mutationModels != []: - self.ploidy = ploidy - self.mutRescale = mutRate - self.init_mutModels(mutationModels, mutRate) - # if sequence length is different than previous window, we have to redo snp/indel poissons - if len(sequence) != self.seqLen: - self.seqLen = len(sequence) - self.init_poisson() - # basic vars - self.init_basicVars(xOffset, sequence, ploidy, windowOverlap, readLen) - self.indelsToAdd = [n.sample() for n in self.ind_pois] - self.snpsToAdd = [n.sample() for n in self.snp_pois] - # initialize trinuc snp bias - if not IGNORE_TRINUC: - self.init_trinucBias() - - def insert_mutations(self, inputList): - for inpV in inputList: - whichPloid = [] - wps = inpV[4][0] - if wps == None: # if no genotype given, assume heterozygous and choose a single ploid based on their mut rates - whichPloid.append(self.ploidMutPrior.sample()) - whichAlt = [0] - else: - if '/' in wps or '|' in wps: - if '/' in wps: - splt = wps.split('/') - else: - splt = wps.split('|') - whichPloid = [] - whichAlt = [] - for i in xrange(len(splt)): - if splt[i] == '1': - whichPloid.append(i) - # assume we're just using first alt for inserted variants? - whichAlt = [0 for n in whichPloid] - else: # otherwise assume monoploidy - whichPloid = [0] - whichAlt = [0] - - # ignore invalid ploids - for i in xrange(len(whichPloid)-1,-1,-1): - if whichPloid[i] >= self.ploidy: - del whichPloid[i] - - for i in xrange(len(whichPloid)): - p = whichPloid[i] - myAlt = inpV[2][whichAlt[i]] - myVar = (inpV[0]-self.x,inpV[1],myAlt) - #inLen = max([len(inpV[1]),len(myAlt)]) - inLen = len(inpV[1]) - - if myVar[0] < 0 or myVar[0] >= len(self.blackList[p]): - print '\nError: Attempting to insert variant out of window bounds:' - print myVar, '--> blackList[0:'+str(len(self.blackList[p]))+']\n' - exit(1) - if len(inpV[1]) == 1 and len(myAlt) == 1: - if self.blackList[p][myVar[0]]: - continue - self.snpList[p].append(myVar) - self.blackList[p][myVar[0]] = 2 - else: - indel_failed = False - for k in xrange(myVar[0],myVar[0]+inLen): - if k >= len(self.blackList[p]): - indel_failed = True - continue - if self.blackList[p][k]: - indel_failed = True - continue - if indel_failed: - continue - for k in xrange(myVar[0],myVar[0]+inLen): - self.blackList[p][k] = 1 - self.indelList[p].append(myVar) - - def random_mutations(self): - - # add random indels - all_indels = [[] for n in self.sequences] - for i in xrange(self.ploidy): - for j in xrange(self.indelsToAdd[i]): - if random.random() <= self.models[i][1]: # insert homozygous indel - whichPloid = range(self.ploidy) - else: # insert heterozygous indel - whichPloid = [self.ploidMutPrior.sample()] - - # try to find suitable places to insert indels - eventPos = -1 - for attempt in xrange(MAX_ATTEMPTS): - eventPos = random.randint(self.winBuffer,self.seqLen-1) - for p in whichPloid: - if self.blackList[p][eventPos]: - eventPos = -1 - if eventPos != -1: - break - if eventPos == -1: - continue - - if random.random() <= self.models[i][3]: # insertion - inLen = self.models[i][4].sample() - # sequence content of random insertions is uniformly random (change this later, maybe) - inSeq = ''.join([random.choice(NUCL) for n in xrange(inLen)]) - refNucl = chr(self.sequences[i][eventPos]) - myIndel = (eventPos,refNucl,refNucl+inSeq) - else: # deletion - inLen = self.models[i][5].sample() - if eventPos+inLen+1 >= len(self.sequences[i]): # skip if deletion too close to boundary - continue - if inLen == 1: - inSeq = chr(self.sequences[i][eventPos+1]) - else: 
- inSeq = str(self.sequences[i][eventPos+1:eventPos+inLen+1]) - refNucl = chr(self.sequences[i][eventPos]) - myIndel = (eventPos,refNucl+inSeq,refNucl) - - # if event too close to boundary, skip. if event conflicts with other indel, skip. - skipEvent = False - if eventPos+len(myIndel[1]) >= self.seqLen-self.winBuffer-1: - skipEvent = True - if skipEvent: - continue - for p in whichPloid: - for k in xrange(eventPos,eventPos+inLen+1): - if self.blackList[p][k]: - skipEvent = True - if skipEvent: - continue - - for p in whichPloid: - for k in xrange(eventPos,eventPos+inLen+1): - self.blackList[p][k] = 1 - all_indels[p].append(myIndel) - - # add random snps - all_snps = [[] for n in self.sequences] - for i in xrange(self.ploidy): - for j in xrange(self.snpsToAdd[i]): - if random.random() <= self.models[i][1]: # insert homozygous SNP - whichPloid = range(self.ploidy) - else: # insert heterozygous SNP - whichPloid = [self.ploidMutPrior.sample()] - - # try to find suitable places to insert snps - eventPos = -1 - for attempt in xrange(MAX_ATTEMPTS): - # based on the mutation model for the specified ploid, choose a SNP location based on trinuc bias - # (if there are multiple ploids, choose one at random) - if IGNORE_TRINUC: - eventPos = random.randint(self.winBuffer+1,self.seqLen-2) - else: - ploid_to_use = whichPloid[random.randint(0,len(whichPloid)-1)] - eventPos = self.trinuc_bias[ploid_to_use].sample() - for p in whichPloid: - if self.blackList[p][eventPos]: - eventPos = -1 - if eventPos != -1: - break - if eventPos == -1: - continue - - refNucl = chr(self.sequences[i][eventPos]) - context = str(chr(self.sequences[i][eventPos-1])+chr(self.sequences[i][eventPos+1])) - # sample from tri-nucleotide substitution matrices to get SNP alt allele - newNucl = self.models[i][6][TRI_IND[context]][NUC_IND[refNucl]].sample() - mySNP = (eventPos,refNucl,newNucl) - - for p in whichPloid: - all_snps[p].append(mySNP) - self.blackList[p][mySNP[0]] = 2 - - # combine random snps with inserted snps, remove any snps that overlap indels - for p in xrange(len(all_snps)): - all_snps[p].extend(self.snpList[p]) - all_snps[p] = [n for n in all_snps[p] if self.blackList[p][n[0]] != 1] - - # MODIFY REFERENCE STRING: SNPS - for i in xrange(len(all_snps)): - for j in xrange(len(all_snps[i])): - vPos = all_snps[i][j][0] - - if all_snps[i][j][1] != chr(self.sequences[i][vPos]): - print '\nError: Something went wrong!\n', all_snps[i][j], chr(self.sequences[i][vPos]),'\n' - exit(1) - else: - self.sequences[i][vPos] = all_snps[i][j][2] - - # organize the indels we want to insert - for i in xrange(len(all_indels)): - all_indels[i].extend(self.indelList[i]) - all_indels_ins = [sorted([list(m) for m in n]) for n in all_indels] - - # MODIFY REFERENCE STRING: INDELS - adjToAdd = [[] for n in xrange(self.ploidy)] - for i in xrange(len(all_indels_ins)): - rollingAdj = 0 - tempSymbolString = ['M' for n in self.sequences[i]] - # there's an off-by-one error somewhere in the position sampling routines.. 
this might fix it - tempSymbolString.append('M') - for j in xrange(len(all_indels_ins[i])): - vPos = all_indels_ins[i][j][0] + rollingAdj - vPos2 = vPos + len(all_indels_ins[i][j][1]) - rollingAdj += len(all_indels_ins[i][j][2])-len(all_indels_ins[i][j][1]) - - if all_indels_ins[i][j][1] != str(self.sequences[i][vPos:vPos2]): - print '\nError: Something went wrong!\n', all_indels_ins[i][j], [vPos,vPos2], str(self.sequences[i][vPos:vPos2]),'\n' - exit(1) - else: - # alter reference sequence - self.sequences[i] = self.sequences[i][:vPos] + bytearray(all_indels_ins[i][j][2]) + self.sequences[i][vPos2:] - # notate indel positions for cigar computation - d = len(all_indels_ins[i][j][2]) - len(all_indels_ins[i][j][1]) - if d > 0: - tempSymbolString = tempSymbolString[:vPos+1] + ['I']*d + tempSymbolString[vPos2+1:] - elif d < 0: - tempSymbolString[vPos+1] = 'D'*abs(d)+'M' - - # precompute cigar strings - for j in xrange(len(tempSymbolString)-self.readLen): - self.allCigar[i].append(CigarString(listIn=tempSymbolString[j:j+self.readLen]).getString()) - - # create some data structures we will need later: - # --- self.FM_pos[ploid][pos]: position of the left-most matching base (IN REFERENCE COORDINATES, i.e. corresponding to the unmodified reference genome) - # --- self.FM_span[ploid][pos]: number of reference positions spanned by a read originating from this coordinate - MD_soFar = 0 - for j in xrange(len(tempSymbolString)): - self.FM_pos[i].append(MD_soFar) - # fix an edge case with deletions - if 'D' in tempSymbolString[j]: - self.FM_pos[i][-1] += tempSymbolString[j].count('D') - # compute number of ref matches for each read - span_dif = len([n for n in tempSymbolString[j:j+self.readLen] if 'M' in n]) - self.FM_span[i].append(self.FM_pos[i][-1] + span_dif) - MD_soFar += tempSymbolString[j].count('M') + tempSymbolString[j].count('D') - - # tally up all the variants we handled... - countDict = {} - all_variants = [sorted(all_snps[i]+all_indels[i]) for i in xrange(self.ploidy)] - for i in xrange(len(all_variants)): - for j in xrange(len(all_variants[i])): - all_variants[i][j] = tuple([all_variants[i][j][0]+self.x])+all_variants[i][j][1:] - t = tuple(all_variants[i][j]) - if t not in countDict: - countDict[t] = [] - countDict[t].append(i) - - # - # TODO: combine multiple variants that happened to occur at same position into single vcf entry? 
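The deleted block above pre-computes a per-read CIGAR string for every read start by collapsing a per-base symbol list (where a deletion of length d is folded into the following match as 'DD...DM'), alongside the FM_pos/FM_span reference-coordinate tables. As a rough standalone sketch of that collapsing step, with a hypothetical helper name rather than the project's CigarString class:

```python
def collapse_symbols(symbols):
    """Collapse a per-base symbol list into a CIGAR-style string (sketch)."""
    out = []
    for sym in symbols:
        for ch in sym:                      # 'DDM' contributes two D's and one M
            if out and out[-1][1] == ch:
                out[-1][0] += 1
            else:
                out.append([1, ch])
    return ''.join(str(n) + ch for n, ch in out)

symbols = ['M', 'M', 'I', 'I', 'M', 'DDM', 'M']   # 2 matches, 2 inserted bases, a 2-base deletion
print(collapse_symbols(symbols))                  # -> '2M2I1M2D2M'
```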
- # - - output_variants = [] - for k in sorted(countDict.keys()): - output_variants.append(k+tuple([len(countDict[k])/float(self.ploidy)])) - ploid_string = ['0' for n in xrange(self.ploidy)] - for k2 in [n for n in countDict[k]]: - ploid_string[k2] = '1' - output_variants[-1] += tuple(['WP='+'/'.join(ploid_string)]) - return output_variants - - - def sample_read(self, sequencingModel, fragLen=None): - - # choose a ploid - myPloid = random.randint(0,self.ploidy-1) - - # stop attempting to find a valid position if we fail enough times - MAX_READPOS_ATTEMPTS = 100 - attempts_thus_far = 0 - - # choose a random position within the ploid, and generate quality scores / sequencing errors - readsToSample = [] - if fragLen == None: - rPos = self.coverage_distribution[myPloid].sample() - - # sample read position and call function to compute quality scores / sequencing errors - rDat = self.sequences[myPloid][rPos:rPos+self.readLen] - (myQual, myErrors) = sequencingModel.getSequencingErrors(rDat) - readsToSample.append([rPos,myQual,myErrors,rDat]) - - else: - rPos1 = self.coverage_distribution[myPloid][self.fraglens_indMap[fragLen]].sample() - - # EXPERIMENTAL - #coords_to_select_from = self.coverage_distribution[myPloid][self.fraglens_indMap[fragLen]].sample() - #rPos1 = random.randint(coords_to_select_from[0],coords_to_select_from[1]) - - rPos2 = rPos1 + fragLen - self.readLen - rDat1 = self.sequences[myPloid][rPos1:rPos1+self.readLen] - rDat2 = self.sequences[myPloid][rPos2:rPos2+self.readLen] - (myQual1, myErrors1) = sequencingModel.getSequencingErrors(rDat1) - (myQual2, myErrors2) = sequencingModel.getSequencingErrors(rDat2,isReverseStrand=True) - readsToSample.append([rPos1,myQual1,myErrors1,rDat1]) - readsToSample.append([rPos2,myQual2,myErrors2,rDat2]) - - # error format: - # myError[i] = (type, len, pos, ref, alt) - - # examine sequencing errors to-be-inserted. - # - remove deletions that don't have enough bordering sequence content to "fill in" - # if error is valid, make the changes to the read data - rOut = [] - for r in readsToSample: - try: - myCigar = self.allCigar[myPloid][r[0]] - except IndexError: - print 'Index error when attempting to find cigar string.' 
- print myPloid, len(self.allCigar[myPloid]), r[0] - if fragLen != None: - print (rPos1, rPos2) - print fragLen, self.fraglens_indMap[fragLen] - exit(1) - totalD = sum([error[1] for error in r[2] if error[0] == 'D']) - totalI = sum([error[1] for error in r[2] if error[0] == 'I']) - availB = len(self.sequences[myPloid]) - r[0] - self.readLen - 1 - # add buffer sequence to fill in positions that get deleted - r[3] += self.sequences[myPloid][r[0]+self.readLen:r[0]+self.readLen+totalD] - expandedCigar = [] - extraCigar = [] - adj = 0 - sse_adj = [0 for n in xrange(self.readLen + max(sequencingModel.errP[3]))] - anyIndelErr = False - - # sort by letter (D > I > S) such that we introduce all indel errors before substitution errors - # secondarily, sort by index - arrangedErrors = {'D':[],'I':[],'S':[]} - for error in r[2]: - arrangedErrors[error[0]].append((error[2],error)) - sortedErrors = [] - for k in sorted(arrangedErrors.keys()): - sortedErrors.extend([n[1] for n in sorted(arrangedErrors[k])]) - - skipIndels = False - - - #FIXED TdB 05JUN2018 - #Moved this outside the for error loop, since it messes up the CIGAR string when more than one deletion is in the same read - extraCigarVal = [] - #END FIXED TdB - - for error in sortedErrors: - eLen = error[1] - ePos = error[2] - if error[0] == 'D' or error[0] == 'I': - anyIndelErr = True - - #FIXED TdB 05JUN2018 - #Moved this OUTSIDE the for error loop, since it messes up the CIGAR string when more than one deletion is in the same read - #extraCigarVal = [] - #END FIXED TdB - - if totalD > availB: # if not enough bases to fill-in deletions, skip all indel erors - continue - if expandedCigar == []: - expandedCigar = CigarString(stringIn=myCigar).getList() - fillToGo = totalD - totalI + 1 - if fillToGo > 0: - try: - extraCigarVal = CigarString(stringIn=self.allCigar[myPloid][r[0]+fillToGo]).getList()[-fillToGo:] - except IndexError: # applying the deletions we want requires going beyond region boundaries. skip all indel errors - skipIndels = True - - - if skipIndels: - continue - - # insert deletion error into read and update cigar string accordingly - if error[0] == 'D': - myadj = sse_adj[ePos] - pi = ePos+myadj - pf = ePos+myadj+eLen+1 - if str(r[3][pi:pf]) == str(error[3]): - r[3] = r[3][:pi+1] + r[3][pf:] - expandedCigar = expandedCigar[:pi+1] + expandedCigar[pf:] - if pi+1 == len(expandedCigar): # weird edge case with del at very end of region. Make a guess and add a "M" - expandedCigar.append('M') - expandedCigar[pi+1] = 'D'*eLen + expandedCigar[pi+1] - else: - print '\nError, ref does not match alt while attempting to insert deletion error!\n' - exit(1) - adj -= eLen - for i in xrange(ePos,len(sse_adj)): - sse_adj[i] -= eLen - - # insert insertion error into read and update cigar string accordingly - else: - myadj = sse_adj[ePos] - if chr(r[3][ePos+myadj]) == error[3]: - r[3] = r[3][:ePos+myadj] + error[4] + r[3][ePos+myadj+1:] - expandedCigar = expandedCigar[:ePos+myadj] + ['I']*eLen + expandedCigar[ePos+myadj:] - else: - print '\nError, ref does not match alt while attempting to insert insertion error!\n' - print '---',chr(r[3][ePos+myadj]), '!=', error[3] - exit(1) - adj += eLen - for i in xrange(ePos,len(sse_adj)): - sse_adj[i] += eLen - - else: # substitution errors, much easier by comparison... 
- if chr(r[3][ePos+sse_adj[ePos]]) == error[3]: - r[3][ePos+sse_adj[ePos]] = error[4] - else: - print '\nError, ref does not match alt while attempting to insert substitution error!\n' - exit(1) - - if anyIndelErr: - if len(expandedCigar): - relevantCigar = (expandedCigar+extraCigarVal)[:self.readLen] - myCigar = CigarString(listIn=relevantCigar).getString() - - r[3] = r[3][:self.readLen] - - rOut.append([self.FM_pos[myPloid][r[0]],myCigar,str(r[3]),str(r[1])]) - - # rOut[i] = (pos, cigar, read_string, qual_string) - return rOut - - -# -# Container for read data, computes quality scores and positions to insert errors -# -class ReadContainer: - def __init__(self, readLen, errorModel, reScaledError, rescaleQual=False): - - self.readLen = readLen - self.rescaleQ = rescaleQual - - errorDat = pickle.load(open(errorModel,'rb')) - self.UNIFORM = False - if len(errorDat) == 4: # uniform-error SE reads (e.g. PacBio) - self.UNIFORM = True - [Qscores,offQ,avgError,errorParams] = errorDat - self.uniform_qscore = min([max(Qscores), int(-10.*np.log10(avgError)+0.5)]) - print 'Reading in uniform sequencing error model... (q='+str(self.uniform_qscore)+'+'+str(offQ)+', p(err)={0:0.2f}%)'.format(100.*avgError) - if len(errorDat) == 6: # only 1 q-score model present, use same model for both strands - [initQ1,probQ1,Qscores,offQ,avgError,errorParams] = errorDat - self.PE_MODELS = False - elif len(errorDat) == 8: # found a q-score model for both forward and reverse strands - #print 'Using paired-read quality score profiles...' - [initQ1,probQ1,initQ2,probQ2,Qscores,offQ,avgError,errorParams] = errorDat - self.PE_MODELS = True - if len(initQ1) != len(initQ2) or len(probQ1) != len(probQ2): - print '\nError: R1 and R2 quality score models are of different length.\n' - exit(1) - - self.qErrRate = [0.]*(max(Qscores)+1) - for q in Qscores: - self.qErrRate[q] = 10.**(-q/10.) - self.offQ = offQ - - # errorParams = [SSE_PROB, SIE_RATE, SIE_PROB, SIE_VAL, SIE_INS_FREQ, SIE_INS_NUCL] - self.errP = errorParams - self.errSSE = [DiscreteDistribution(n,NUCL) for n in self.errP[0]] - self.errSIE = DiscreteDistribution(self.errP[2],self.errP[3]) - self.errSIN = DiscreteDistribution(self.errP[5],NUCL) - - # adjust sequencing error frequency to match desired rate - if reScaledError == None: - self.errorScale = 1.0 - else: - self.errorScale = reScaledError/avgError - if self.rescaleQ == False: - print 'Warning: Quality scores no longer exactly representative of error probability. Error model scaled by {0:.3f} to match desired rate...'.format(self.errorScale) - if self.UNIFORM: - if reScaledError <= 0.: - self.uniform_qscore = max(Qscores) - else: - self.uniform_qscore = min([max(Qscores), int(-10.*np.log10(reScaledError)+0.5)]) - print ' - Uniform quality score scaled to match specified error rate (q='+str(self.uniform_qscore)+'+'+str(self.offQ)+', p(err)={0:0.2f}%)'.format(100.*reScaledError) - - if self.UNIFORM == False: - # adjust length to match desired read length - if self.readLen == len(initQ1): - self.qIndRemap = range(self.readLen) - else: - print 'Warning: Read length of error model ('+str(len(initQ1))+') does not match -R value ('+str(self.readLen)+'), rescaling model...' 
- self.qIndRemap = [max([1,len(initQ1)*n/readLen]) for n in xrange(readLen)] - - # initialize probability distributions - self.initDistByPos1 = [DiscreteDistribution(initQ1[i],Qscores) for i in xrange(len(initQ1))] - self.probDistByPosByPrevQ1 = [None] - for i in xrange(1,len(initQ1)): - self.probDistByPosByPrevQ1.append([]) - for j in xrange(len(initQ1[0])): - if np.sum(probQ1[i][j]) <= 0.: # if we don't have sufficient data for a transition, use the previous qscore - self.probDistByPosByPrevQ1[-1].append(DiscreteDistribution([1],[Qscores[j]],degenerateVal=Qscores[j])) - else: - self.probDistByPosByPrevQ1[-1].append(DiscreteDistribution(probQ1[i][j],Qscores)) - - if self.PE_MODELS: - self.initDistByPos2 = [DiscreteDistribution(initQ2[i],Qscores) for i in xrange(len(initQ2))] - self.probDistByPosByPrevQ2 = [None] - for i in xrange(1,len(initQ2)): - self.probDistByPosByPrevQ2.append([]) - for j in xrange(len(initQ2[0])): - if np.sum(probQ2[i][j]) <= 0.: # if we don't have sufficient data for a transition, use the previous qscore - self.probDistByPosByPrevQ2[-1].append(DiscreteDistribution([1],[Qscores[j]],degenerateVal=Qscores[j])) - else: - self.probDistByPosByPrevQ2[-1].append(DiscreteDistribution(probQ2[i][j],Qscores)) - - def getSequencingErrors(self, readData, isReverseStrand=False): - - qOut = [0]*self.readLen - sErr = [] - - if self.UNIFORM: - myQ = [self.uniform_qscore + self.offQ for n in xrange(self.readLen)] - qOut = ''.join([chr(n) for n in myQ]) - for i in xrange(self.readLen): - if random.random() < self.errorScale*self.qErrRate[self.uniform_qscore]: - sErr.append(i) - else: - - if self.PE_MODELS and isReverseStrand: - myQ = self.initDistByPos2[0].sample() - else: - myQ = self.initDistByPos1[0].sample() - qOut[0] = myQ - - for i in xrange(1,self.readLen): - if self.PE_MODELS and isReverseStrand: - myQ = self.probDistByPosByPrevQ2[self.qIndRemap[i]][myQ].sample() - else: - myQ = self.probDistByPosByPrevQ1[self.qIndRemap[i]][myQ].sample() - qOut[i] = myQ - - if isReverseStrand: - qOut = qOut[::-1] - - for i in xrange(self.readLen): - if random.random() < self.errorScale * self.qErrRate[qOut[i]]: - sErr.append(i) - - if self.rescaleQ == True: # do we want to rescale qual scores to match rescaled error? - qOut = [max([0, int(-10.*np.log10(self.errorScale*self.qErrRate[n])+0.5)]) for n in qOut] - qOut = [min([int(self.qErrRate[-1]), n]) for n in qOut] - qOut = ''.join([chr(n + self.offQ) for n in qOut]) - else: - qOut = ''.join([chr(n + self.offQ) for n in qOut]) - - if self.errorScale == 0.0: - return (qOut,[]) - - sOut = [] - nDelSoFar = 0 - # don't allow indel errors to occur on subsequent positions - prevIndel = -2 - # don't allow other sequencing errors to occur on bases removed by deletion errors - delBlacklist = [] - - for ind in sErr[::-1]: # for each error that we're going to insert... 
- - # determine error type - isSub = True - if ind != 0 and ind != self.readLen-1-max(self.errP[3]) and abs(ind-prevIndel) > 1: - if random.random() < self.errP[1]: - isSub = False - - # errorOut = (type, len, pos, ref, alt) - - if isSub: # insert substitution error - myNucl = chr(readData[ind]) - newNucl = self.errSSE[NUC_IND[myNucl]].sample() - sOut.append(('S',1,ind,myNucl,newNucl)) - else: # insert indel error - indelLen = self.errSIE.sample() - if random.random() < self.errP[4]: # insertion error - myNucl = chr(readData[ind]) - newNucl = myNucl + ''.join([self.errSIN.sample() for n in xrange(indelLen)]) - sOut.append(('I',len(newNucl)-1,ind,myNucl,newNucl)) - elif ind < self.readLen-2-nDelSoFar: # deletion error (prevent too many of them from stacking up) - myNucl = str(readData[ind:ind+indelLen+1]) - newNucl = chr(readData[ind]) - nDelSoFar += len(myNucl)-1 - sOut.append(('D',len(myNucl)-1,ind,myNucl,newNucl)) - for i in xrange(ind+1,ind+indelLen+1): - delBlacklist.append(i) - prevIndel = ind - - # remove blacklisted errors - for i in xrange(len(sOut)-1,-1,-1): - if sOut[i][2] in delBlacklist: - del sOut[i] - - return (qOut,sOut) - - - -"""************************************************ -**** DEFAULT MUTATION MODELS -************************************************""" - - -# parse mutation model pickle file -def parseInputMutationModel(model=None, whichDefault=1): - if whichDefault == 1: - outModel = [copy.deepcopy(n) for n in DEFAULT_MODEL_1] - elif whichDefault == 2: - outModel = [copy.deepcopy(n) for n in DEFAULT_MODEL_2] - else: - print '\nError: Unknown default mutation model specified\n' - exit(1) - - if model != None: - pickle_dict = pickle.load(open(model,"rb")) - outModel[0] = pickle_dict['AVG_MUT_RATE'] - outModel[2] = 1. - pickle_dict['SNP_FREQ'] - - insList = pickle_dict['INDEL_FREQ'] - if len(insList): - insCount = sum([insList[k] for k in insList.keys() if k >= 1]) - delCount = sum([insList[k] for k in insList.keys() if k <= -1]) - insVals = [k for k in sorted(insList.keys()) if k >= 1] - insWght = [insList[k]/float(insCount) for k in insVals] - delVals = [k for k in sorted([abs(k) for k in insList.keys() if k <= -1])] - delWght = [insList[-k]/float(delCount) for k in delVals] - else: # degenerate case where no indel stats are provided - insCount = 1 - delCount = 1 - insVals = [1] - insWght = [1.0] - delVals = [1] - delWght = [1.0] - outModel[3] = insCount/float(insCount + delCount) - outModel[4] = insVals - outModel[5] = insWght - outModel[6] = delVals - outModel[7] = delWght - - trinuc_trans_prob = pickle_dict['TRINUC_TRANS_PROBS'] - for k in sorted(trinuc_trans_prob.keys()): - myInd = TRI_IND[k[0][0]+k[0][2]] - (k1,k2) = (NUC_IND[k[0][1]],NUC_IND[k[1][1]]) - outModel[8][myInd][k1][k2] = trinuc_trans_prob[k] - for i in xrange(len(outModel[8])): - for j in xrange(len(outModel[8][i])): - for l in xrange(len(outModel[8][i][j])): - # if trinuc not present in input mutation model, assign it uniform probability - if float(sum(outModel[8][i][j])) < 1e-12: - outModel[8][i][j] = [0.25,0.25,0.25,0.25] - else: - outModel[8][i][j][l] /= float(sum(outModel[8][i][j])) - - trinuc_mut_prob = pickle_dict['TRINUC_MUT_PROB'] - which_have_we_seen = {n:False for n in ALL_TRI} - trinuc_mean = np.mean(trinuc_mut_prob.values()) - for trinuc in trinuc_mut_prob.keys(): - outModel[9][ALL_IND[trinuc]] = trinuc_mut_prob[trinuc] - which_have_we_seen[trinuc] = True - for trinuc in which_have_we_seen.keys(): - if which_have_we_seen[trinuc] == False: - outModel[9][ALL_IND[trinuc]] = trinuc_mean - - 
return outModel - - -# parse mutation model files, returns default model if no model directory is specified -# -# OLD FUNCTION THAT PROCESSED OUTDATED TEXTFILE MUTATION MODELS -def parseInputMutationModel_deprecated(prefix=None, whichDefault=1): - if whichDefault == 1: - outModel = [copy.deepcopy(n) for n in DEFAULT_MODEL_1] - elif whichDefault == 2: - outModel = [copy.deepcopy(n) for n in DEFAULT_MODEL_2] - else: - print '\nError: Unknown default mutation model specified\n' - exit(1) - - if prefix != None: - if prefix[-1] != '/': - prefix += '/' - if not os.path.isdir(prefix): - '\nError: Input mutation model directory not found:',prefix,'\n' - exit(1) - - print 'Reading in mutation model...' - listing1 = [n for n in os.listdir(prefix) if n[-5:] == '.prob'] - listing2 = [n for n in os.listdir(prefix) if n[-7:] == '.trinuc'] - listing = sorted(listing1) + sorted(listing2) - for l in listing: - f = open(prefix+l,'r') - fr = [n.split('\t') for n in f.read().split('\n')] - f.close() - - if '_overall.prob' in l: - myIns = None - myDel = None - for dat in fr[1:]: - if len(dat) == 2: - if dat[0] == 'insertion': - myIns = float(dat[1]) - elif dat[0] == 'deletion': - myDel = float(dat[1]) - if myIns != None and myDel != None: - outModel[2] = myIns + myDel - outModel[3] = myIns / (myIns + myDel) - print '-',l - - if '_insLength.prob' in l: - insVals = {} - for dat in fr[1:]: - if len(dat) == 2: - insVals[int(dat[0])] = float(dat[1]) - if len(insVals): - outModel[4] = sorted(insVals.keys()) - outModel[5] = [insVals[n] for n in outModel[4]] - print '-',l - - if '_delLength.prob' in l: - delVals = {} - for dat in fr[1:]: - if len(dat) == 2: - delVals[int(dat[0])] = float(dat[1]) - if len(delVals): - outModel[6] = sorted(delVals.keys()) - outModel[7] = [delVals[n] for n in outModel[6]] - print '-',l - - if '.trinuc' == l[-7:]: - context_ind = TRI_IND[l[-10]+l[-8]] - p_matrix = [[-1,-1,-1,-1],[-1,-1,-1,-1],[-1,-1,-1,-1],[-1,-1,-1,-1]] - for i in xrange(len(p_matrix)): - for j in xrange(len(fr[i])): - p_matrix[i][j] = float(fr[i][j]) - anyNone = False - for i in xrange(len(p_matrix)): - for j in xrange(len(p_matrix[i])): - if p_matrix[i][j] == -1: - anyNone = True - if not anyNone: - outModel[8][context_ind] = copy.deepcopy(p_matrix) - print '-',l - - return outModel - -###################### -# DEFAULT VALUES # -###################### - -DEFAULT_1_OVERALL_MUT_RATE = 0.001 -DEFAULT_1_HOMOZYGOUS_FREQ = 0.010 -DEFAULT_1_INDEL_FRACTION = 0.05 -DEFAULT_1_INS_VS_DEL = 0.6 -DEFAULT_1_INS_LENGTH_VALUES = [1,2,3,4,5,6,7,8,9,10] -DEFAULT_1_INS_LENGTH_WEIGHTS = [0.4, 0.2, 0.1, 0.05, 0.05, 0.05, 0.05, 0.034, 0.033, 0.033] -DEFAULT_1_DEL_LENGTH_VALUES = [1,2,3,4,5] -DEFAULT_1_DEL_LENGTH_WEIGHTS = [0.3,0.2,0.2,0.2,0.1] -example_matrix_1 = [[0.0, 0.15, 0.7, 0.15], - [0.15, 0.0, 0.15, 0.7], - [0.7, 0.15, 0.0, 0.15], - [0.15, 0.7, 0.15, 0.0]] -DEFAULT_1_TRI_FREQS = [copy.deepcopy(example_matrix_1) for n in xrange(16)] -DEFAULT_1_TRINUC_BIAS = [1./float(len(ALL_TRI)) for n in ALL_TRI] -DEFAULT_MODEL_1 = [DEFAULT_1_OVERALL_MUT_RATE, - DEFAULT_1_HOMOZYGOUS_FREQ, - DEFAULT_1_INDEL_FRACTION, - DEFAULT_1_INS_VS_DEL, - DEFAULT_1_INS_LENGTH_VALUES, - DEFAULT_1_INS_LENGTH_WEIGHTS, - DEFAULT_1_DEL_LENGTH_VALUES, - DEFAULT_1_DEL_LENGTH_WEIGHTS, - DEFAULT_1_TRI_FREQS, - DEFAULT_1_TRINUC_BIAS] - -DEFAULT_2_OVERALL_MUT_RATE = 0.002 -DEFAULT_2_HOMOZYGOUS_FREQ = 0.200 -DEFAULT_2_INDEL_FRACTION = 0.1 -DEFAULT_2_INS_VS_DEL = 0.3 -DEFAULT_2_INS_LENGTH_VALUES = [1,2,3,4,5,6,7,8,9,10] -DEFAULT_2_INS_LENGTH_WEIGHTS = [0.1, 0.1, 0.2, 0.05, 
0.05, 0.05, 0.05, 0.05, 0.05, 0.05] -DEFAULT_2_DEL_LENGTH_VALUES = [1,2,3,4,5] -DEFAULT_2_DEL_LENGTH_WEIGHTS = [0.3,0.2,0.2,0.2,0.1] -example_matrix_2 = [[0.0, 0.15, 0.7, 0.15], - [0.15, 0.0, 0.15, 0.7], - [0.7, 0.15, 0.0, 0.15], - [0.15, 0.7, 0.15, 0.0]] -DEFAULT_2_TRI_FREQS = [copy.deepcopy(example_matrix_2) for n in xrange(16)] -DEFAULT_2_TRINUC_BIAS = [1./float(len(ALL_TRI)) for n in ALL_TRI] -DEFAULT_MODEL_2 = [DEFAULT_2_OVERALL_MUT_RATE, - DEFAULT_2_HOMOZYGOUS_FREQ, - DEFAULT_2_INDEL_FRACTION, - DEFAULT_2_INS_VS_DEL, - DEFAULT_2_INS_LENGTH_VALUES, - DEFAULT_2_INS_LENGTH_WEIGHTS, - DEFAULT_2_DEL_LENGTH_VALUES, - DEFAULT_2_DEL_LENGTH_WEIGHTS, - DEFAULT_2_TRI_FREQS, - DEFAULT_2_TRINUC_BIAS] - - diff --git a/py/biopython_modified_bgzf.py b/py/biopython_modified_bgzf.py deleted file mode 100755 index 4e84afe..0000000 --- a/py/biopython_modified_bgzf.py +++ /dev/null @@ -1,103 +0,0 @@ -#!/usr/bin/env python -# Copyright 2010-2013 by Peter Cock. -# All rights reserved. -# This code is part of the Biopython distribution and governed by its -# license. Please see the LICENSE file that should have been included -# as part of this package. - -""" ############################################################################ -####### ####### -####### 06/02/2015: ####### -####### - I picked out the bits and pieces of code needed ####### -####### to write BAM files, removed python 3.0 compatibility ####### -####### ####### -############################################################################ """ - -import zlib -import struct - -_bgzf_header = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00" -_bgzf_eof = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00" - -class BgzfWriter(object): - - def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6): - if fileobj: - assert filename is None - handle = fileobj - else: - if "w" not in mode.lower() \ - and "a" not in mode.lower(): - raise ValueError("Must use write or append mode, not %r" % mode) - if "a" in mode.lower(): - handle = open(filename, "ab") - else: - handle = open(filename, "wb") - self._text = "b" not in mode.lower() - self._handle = handle - self._buffer = b"" - self.compresslevel = compresslevel - - def _write_block(self, block): - start_offset = self._handle.tell() - assert len(block) <= 65536 - # Giving a negative window bits means no gzip/zlib headers, -15 used in samtools - c = zlib.compressobj(self.compresslevel, - zlib.DEFLATED, - -15, - zlib.DEF_MEM_LEVEL, - 0) - compressed = c.compress(block) + c.flush() - del c - assert len(compressed) < 65536, "TODO - Didn't compress enough, try less data in this block" - crc = zlib.crc32(block) - # Should cope with a mix of Python platforms... - if crc < 0: - crc = struct.pack("= 65536: - self._write_block(self._buffer[:65536]) - self._buffer = self._buffer[65536:] - - def flush(self): - while len(self._buffer) >= 65536: - self._write_block(self._buffer[:65535]) - self._buffer = self._buffer[65535:] - self._write_block(self._buffer) - self._buffer = b"" - self._handle.flush() - - def close(self): - """Flush data, write 28 bytes empty BGZF EOF marker, and close the BGZF file.""" - if self._buffer: - self.flush() - # samtools will look for a magic EOF marker, just a 28 byte empty BGZF block, - # and if it is missing warns the BAM file may be truncated. In addition to - # samtools writing this block, so too does bgzip - so we should too. 
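The comment above explains why the writer appends a 28-byte empty BGZF block: samtools treats its absence as a sign of a truncated BAM. As a small illustration (standard library only; the BAM filename is a placeholder, not an output of this tool), a finished file can be checked for that marker like so:

```python
# 28-byte empty BGZF block that samtools/bgzip expect at the end of a BAM file
BGZF_EOF = (b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43"
            b"\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00")

def has_bgzf_eof(path):
    with open(path, "rb") as handle:
        handle.seek(-28, 2)          # seek to 28 bytes before end of file
        return handle.read(28) == BGZF_EOF

# print(has_bgzf_eof("example_golden.bam"))   # placeholder path
```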
- self._handle.write(_bgzf_eof) - self._handle.flush() - self._handle.close() - - def __enter__(self): - return self - - def __exit__(self, type, value, traceback): - self.close() - - -if __name__ == "__main__": - pass diff --git a/py/inputChecking.py b/py/inputChecking.py deleted file mode 100644 index 30cddaf..0000000 --- a/py/inputChecking.py +++ /dev/null @@ -1,29 +0,0 @@ -import os -import sys - -def requiredField(s,errString): - if s == None: - print '\n'+errString+'\n' - exit(1) - -def checkFileOpen(fn,errString,required=False): - if required or fn != None: - if fn == None: - print '\n'+errString+'\n' - exit(1) - else: - try: - open(fn,'r') - except: - print '\n'+errString+'\n' - exit(1) - -def checkDir(dir,errString): - if not os.path.isdir(dir): - print '\n'+errString+'\n' - exit(1) - -def isInRange(val,lb,ub,errString): - if val < lb or val > ub: - print '\n'+errString+'\n' - exit(1) diff --git a/py/neat_cigar.py b/py/neat_cigar.py deleted file mode 100644 index 80d79b2..0000000 --- a/py/neat_cigar.py +++ /dev/null @@ -1,108 +0,0 @@ -import re - -class CigarString: - def __init__(self, stringIn=None, listIn=None): - - if stringIn == None and listIn == None: - print '\nError: CigarString object not initialized.\n' - exit(1) - - self.cigarData = [] - - if stringIn != None: - self.joinCigar(j_stringIn=stringIn) - - if listIn != None: - self.joinCigar(j_listIn=listIn) - - - def stringToList(self, s): - - cigarDat = [] - letters = re.split(r"\d+",s)[1:] - numbers = [int(n) for n in re.findall(r"\d+",s)] - dReserve = 0 - for i in xrange(len(letters)): - if letters[i] == 'D': - dReserve = numbers[i] - if letters[i] == 'M' or letters[i] == 'I': - if dReserve: - cigarDat += ['D'*dReserve+letters[i]] + [letters[i]]*(int(numbers[i])-1) - else: - cigarDat += [letters[i]]*int(numbers[i]) - dReserve = 0 - return cigarDat - - - def listToString(self, l): - - symbols = '' - currentSym = l[0] - currentCount = 1 - if 'D' in currentSym: - currentSym = currentSym[-1] - for k in xrange(1,len(l)): - nextSym = l[k] - if len(nextSym) == 1 and nextSym == currentSym: - currentCount += 1 - else: - symbols += str(currentCount) + currentSym - if 'D' in nextSym: - symbols += str(nextSym.count('D')) + 'D' - currentSym = nextSym[-1] - else: - currentSym = nextSym - currentCount = 1 - symbols += str(currentCount) + currentSym - return symbols - - def getList(self): - - return self.cigarData - - - def getString(self): - - return self.listToString(self.cigarData) - - - def joinCigar(self, j_stringIn=None, j_listIn=None): - - if j_stringIn == None and j_listIn == None: - print '\nError: Invalid join operation in CigarString\n' - exit(1) - - if j_stringIn != None: - self.cigarData += self.stringToList(j_stringIn) - - if j_listIn != None: - self.cigarData += j_listIn - - - def insertCigarElement(self, pos, i_stringIn=None, i_listIn=None): - - if i_stringIn == None and i_listIn == None: - print '\nError: Invalid insertion operation in CigarString\n' - exit(1) - - if pos < 0 or pos >= len(self.cigarData): - print '\nError: Invalid insertion position in CigarString\n' - exit(1) - - if i_stringIn != None: - self.cigarData = self.cigarData[:pos] + self.stringToList(i_stringIn) + self.cigarData[pos:] - - if i_listIn != None: - self.cigarData = self.cigarData[:pos] + i_listIn + self.cigarData[pos:] - - -if __name__ == '__main__': - print 'testing CigarString class...' 
- - str1 = '50M10D7I23M' - str2 = '10I25M' - iPos = 20 - myCigar = CigarString(stringIn=str1) - myCigar.insertCigarElement(iPos,i_stringIn=str2) - print str1,'+',str2,'[inserted at position',str(iPos)+']','=',myCigar.getString() - diff --git a/py/probability.py b/py/probability.py deleted file mode 100644 index 789e98c..0000000 --- a/py/probability.py +++ /dev/null @@ -1,146 +0,0 @@ -import math -import random -import bisect -import copy -import numpy as np - -LOW_PROB_THRESH = 1e-12 - -def mean_ind_of_weighted_list(l): - myMid = sum(l)/2.0 - mySum = 0.0 - for i in xrange(len(l)): - mySum += l[i] - if mySum >= myMid: - return i - -class DiscreteDistribution: - def __init__(self, weights, values, degenerateVal=None, method='bisect'): - - # some sanity checking - if not len(weights) or not len(values): - print '\nError: weight or value vector given to DiscreteDistribution() are 0-length.\n' - asdf = intentional_crash[0] - exit(1) - - self.method = method - sumWeight = float(sum(weights)) - - # if probability of all input events is 0, consider it degenerate and always return the first value - if sumWeight < LOW_PROB_THRESH: - self.degenerate = values[0] - else: - self.weights = [n/sumWeight for n in weights] - self.values = copy.deepcopy(values) - if len(self.values) != len(self.weights): - print '\nError: length and weights and values vectors must be the same.\n' - exit(1) - self.degenerate = degenerateVal - # prune values with probability too low to be worth using [DOESN'T REALLY IMPROVE PERFORMANCE] - ####if self.degenerate != None: - #### for i in xrange(len(self.weights)-1,-1,-1): - #### if self.weights[i] < LOW_PROB_THRESH: - #### del self.weights[i] - #### del self.values[i] - #### if len(self.weights) == 0: - #### print '\nError: probability distribution has no usable values.\n' - #### exit(1) - - if self.method == 'alias': - K = len(self.weights) - q = np.zeros(K) - J = np.zeros(K, dtype=np.int) - smaller = [] - larger = [] - for kk, prob in enumerate(self.weights): - q[kk] = K*prob - if q[kk] < 1.0: - smaller.append(kk) - else: - larger.append(kk) - while len(smaller) > 0 and len(larger) > 0: - small = smaller.pop() - large = larger.pop() - J[small] = large - q[large] = (q[large] + q[small]) - 1.0 - if q[large] < 1.0: - smaller.append(large) - else: - larger.append(large) - - self.a1 = len(J)-1 - self.a2 = J.tolist() - self.a3 = q.tolist() - - elif self.method == 'bisect': - self.cumP = np.cumsum(self.weights).tolist()[:-1] - self.cumP.insert(0,0.) 
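The DiscreteDistribution constructor above ends by building a cumulative-probability vector for its default 'bisect' sampling method: normalize the weights, accumulate them, then binary-search a uniform draw. A minimal standalone sketch of that idea, using hypothetical names rather than the class itself:

```python
import bisect
import random

def make_sampler(weights, values):
    total = float(sum(weights))
    cum_p = []
    running = 0.0
    for w in weights[:-1]:            # cumulative probabilities; last bin is implicit
        running += w / total
        cum_p.append(running)
    cum_p.insert(0, 0.)               # leading 0., mirroring the setup above
    def sample():
        r = random.random()
        return values[bisect.bisect(cum_p, r) - 1]
    return sample

draw = make_sampler([0.7, 0.2, 0.1], ['A', 'C', 'G'])
print(draw())   # 'A' roughly 70% of the time
```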
- - def __str__(self): - return str(self.weights)+' '+str(self.values)+' '+self.method - - def sample(self): - - if self.degenerate != None: - return self.degenerate - - else: - - if self.method == 'alias': - r1 = random.randint(0,self.a1) - r2 = random.random() - if r2 < self.a3[r1]: - return self.values[r1] - else: - return self.values[self.a2[r1]] - - elif self.method == 'bisect': - r = random.random() - return self.values[bisect.bisect(self.cumP,r)-1] - - -# takes k_range, lambda, [0,1,2,..], returns a DiscreteDistribution object with the corresponding to a poisson distribution -MIN_WEIGHT = 1e-12 -def poisson_list(k_range,l): - if l < MIN_WEIGHT: - return DiscreteDistribution([1],[0],degenerateVal=0) - logFactorial_list = [0.0] - for k in k_range[1:]: - logFactorial_list.append(np.log(float(k))+logFactorial_list[k-1]) - w_range = [np.exp(k*np.log(l) - l - logFactorial_list[k]) for k in k_range] - w_range = [n for n in w_range if n >= MIN_WEIGHT] - if len(w_range) <= 1: - return DiscreteDistribution([1],[0],degenerateVal=0) - return DiscreteDistribution(w_range,k_range[:len(w_range)]) - -# quantize a list of values into blocks -MIN_PROB = 1e-12 -QUANT_BLOCKS = 10 -def quantize_list(l): - suml = float(sum(l)) - ls = sorted([n for n in l if n >= MIN_PROB*suml]) - if len(ls) == 0: - return None - qi = [] - for i in xrange(QUANT_BLOCKS): - #qi.append(ls[int((i)*(len(ls)/float(QUANT_BLOCKS)))]) - qi.append(ls[0]+(i/float(QUANT_BLOCKS))*(ls[-1]-ls[0])) - qi.append(1e12) - runningList = [] - prevBi = None - previ = None - for i in xrange(len(l)): - if l[i] >= MIN_PROB*suml: - bi = bisect.bisect(qi,l[i]) - #print i, l[i], qi[bi-1] - if prevBi != None: - if bi == prevBi and previ == i-1: - runningList[-1][1] += 1 - else: - runningList.append([i,i,qi[bi-1]]) - else: - runningList.append([i,i,qi[bi-1]]) - prevBi = bi - previ = i - return runningList - diff --git a/py/refFunc.py b/py/refFunc.py deleted file mode 100644 index 6ade0c8..0000000 --- a/py/refFunc.py +++ /dev/null @@ -1,210 +0,0 @@ -import sys -import time -import os -import random - -OK_CHR_ORD = {ord('A'):True,ord('C'):True,ord('G'):True,ord('T'):True,ord('U'):True} -ALLOWED_NUCL = ['A','C','G','T'] - -# -# Index reference fasta -# -def indexRef(refPath): - - tt = time.time() - - fn = None - if os.path.isfile(refPath+'i'): - print 'found index '+refPath+'i' - fn = refPath+'i' - if os.path.isfile(refPath+'.fai'): - print 'found index '+refPath+'.fai' - fn = refPath+'.fai' - - ref_inds = [] - if fn != None: - fai = open(fn,'r') - for line in fai: - splt = line[:-1].split('\t') - seqLen = int(splt[1]) - offset = int(splt[2]) - lineLn = int(splt[3]) - nLines = seqLen/lineLn - if seqLen%lineLn != 0: - nLines += 1 - ref_inds.append((splt[0],offset,offset+seqLen+nLines,seqLen)) - fai.close() - return ref_inds - - sys.stdout.write('index not found, creating one... 
') - sys.stdout.flush() - refFile = open(refPath,'r') - prevR = None - prevP = None - seqLen = 0 - while 1: - data = refFile.readline() - if not data: - ref_inds.append( (prevR, prevP, refFile.tell()-len(data), seqLen) ) - break - if data[0] == '>': - if prevP != None: - ref_inds.append( (prevR, prevP, refFile.tell()-len(data), seqLen) ) - seqLen = 0 - prevP = refFile.tell() - prevR = data[1:-1] - else: - seqLen += len(data)-1 - refFile.close() - - print '{0:.3f} (sec)'.format(time.time()-tt) - return ref_inds - - -# -# Read in sequence data from reference fasta -# -# N_unknowns = True --> all ambiguous characters will be treated as Ns -# N_handling = (mode,params) -# - ('random',read/frag len) --> all regions of Ns smaller than read or fragment -# length (whichever is bigger) will be replaced -# with uniformly random nucleotides -# - ('allChr',read/frag len, chr) --> same as above, but replaced instead with a string -# of 'chr's -# - ('ignore') --> do not alter nucleotides in N regions -# -def readRef(refPath,ref_inds_i,N_handling,N_unknowns=True,quiet=False): - - tt = time.time() - if not quiet: - sys.stdout.write('reading '+ref_inds_i[0]+'... ') - sys.stdout.flush() - - refFile = open(refPath,'r') - refFile.seek(ref_inds_i[1]) - myDat = ''.join(refFile.read(ref_inds_i[2]-ref_inds_i[1]).split('\n')) - myDat = bytearray(myDat.upper()) - - # find N regions - # data explanation: myDat[N_atlas[0][0]:N_atlas[0][1]] = solid block of Ns - prevNI = 0 - nCount = 0 - N_atlas = [] - for i in xrange(len(myDat)): - if myDat[i] == ord('N') or (N_unknowns and myDat[i] not in OK_CHR_ORD): - if nCount == 0: - prevNI = i - nCount += 1 - if i == len(myDat)-1: - N_atlas.append((prevNI,prevNI+nCount)) - else: - if nCount > 0: - N_atlas.append((prevNI,prevNI+nCount)) - nCount = 0 - - # handle N base-calls as desired - N_info = {} - N_info['all'] = [] - N_info['big'] = [] - N_info['non_N'] = [] - if N_handling[0] == 'random': - for region in N_atlas: - N_info['all'].extend(region) - if region[1]-region[0] <= N_handling[1]: - for i in xrange(region[0],region[1]): - myDat[i] = random.choice(ALLOWED_NUCL) - else: - N_info['big'].extend(region) - elif N_handling[0] == 'allChr' and N_handling[2] in OK_CHR_ORD: - for region in N_atlas: - N_info['all'].extend(region) - if region[1]-region[0] <= N_handling[1]: - for i in xrange(region[0],region[1]): - myDat[i] = N_handling[2] - else: - N_info['big'].extend(region) - elif N_handling[0] == 'ignore': - for region in N_atlas: - N_info['all'].extend(region) - N_info['big'].extend(region) - else: - print '\nERROR: UNKNOWN N_HANDLING MODE\n' - exit(1) - - habitableRegions = [] - if N_info['big'] == []: - N_info['non_N'] = [(0,len(myDat))] - else: - for i in xrange(0,len(N_info['big']),2): - if i == 0: - habitableRegions.append((0,N_info['big'][0])) - else: - habitableRegions.append((N_info['big'][i-1],N_info['big'][i])) - habitableRegions.append((N_info['big'][-1],len(myDat))) - for n in habitableRegions: - if n[0] != n[1]: - N_info['non_N'].append(n) - - if not quiet: - print '{0:.3f} (sec)'.format(time.time()-tt) - return (myDat,N_info) - -# -# find all non-N regions in reference sequence ahead of time, for computing jobs in parallel -# -def getAllRefRegions(refPath,ref_inds,N_handling,saveOutput=False): - outRegions = {} - fn = refPath+'.nnr' - if os.path.isfile(fn) and not(saveOutput): - print 'found list of preidentified non-N regions...' 
- f = open(fn,'r') - for line in f: - splt = line.strip().split('\t') - if splt[0] not in outRegions: - outRegions[splt[0]] = [] - outRegions[splt[0]].append((int(splt[1]),int(splt[2]))) - f.close() - return outRegions - else: - print 'enumerating all non-N regions in reference sequence...' - for RI in xrange(len(ref_inds)): - (refSequence,N_regions) = readRef(refPath,ref_inds[RI],N_handling,quiet=True) - refName = ref_inds[RI][0] - outRegions[refName] = [n for n in N_regions['non_N']] - if saveOutput: - f = open(fn,'w') - for k in outRegions.keys(): - for n in outRegions[k]: - f.write(k+'\t'+str(n[0])+'\t'+str(n[1])+'\n') - f.close() - return outRegions - -# -# find which of the non-N regions are going to be used for this job -# -def partitionRefRegions(inRegions,ref_inds,myjob,njobs): - - totSize = 0 - for RI in xrange(len(ref_inds)): - refName = ref_inds[RI][0] - for region in inRegions[refName]: - totSize += region[1] - region[0] - sizePerJob = int(totSize/float(njobs)-0.5) - - regionsPerJob = [[] for n in xrange(njobs)] - refsPerJob = [{} for n in xrange(njobs)] - currentInd = 0 - currentCount = 0 - for RI in xrange(len(ref_inds)): - refName = ref_inds[RI][0] - for region in inRegions[refName]: - regionsPerJob[currentInd].append((refName,region[0],region[1])) - refsPerJob[currentInd][refName] = True - currentCount += region[1] - region[0] - if currentCount >= sizePerJob: - currentCount = 0 - currentInd = min([currentInd+1,njobs-1]) - - relevantRefs = refsPerJob[myjob-1].keys() - relevantRegs = regionsPerJob[myjob-1] - return (relevantRefs,relevantRegs) diff --git a/py/vcfFunc.py b/py/vcfFunc.py deleted file mode 100644 index c9fb4b8..0000000 --- a/py/vcfFunc.py +++ /dev/null @@ -1,188 +0,0 @@ -import sys -import time -import os -import re -import random - -INCLUDE_HOMS = False -INCLUDE_FAIL = False -CHOOSE_RANDOM_PLOID_IF_NO_GT_FOUND = True - -def parseLine(splt,colDict,colSamp): - - # check if we want to proceed.. - ra = splt[colDict['REF']] - aa = splt[colDict['ALT']] - # enough columns? - if len(splt) != len(colDict): - return None - # exclude homs / filtered? - if not(INCLUDE_HOMS) and (aa == '.' or aa == '' or aa == ra): - return None - if not(INCLUDE_FAIL) and (splt[colDict['FILTER']] != 'PASS' and splt[colDict['FILTER']] != '.'): - return None - - # default vals - alt_alleles = [aa] - alt_freqs = [] - - gt_perSamp = [] - - # any alt alleles? - alt_split = aa.split(',') - if len(alt_split) > 1: - alt_alleles = alt_split - - # check INFO for AF - af = None - if 'INFO' in colDict and ';AF=' in ';'+splt[colDict['INFO']]: - info = splt[colDict['INFO']]+';' - af = re.findall(r"AF=.*?(?=;)",info)[0][3:] - if af != None: - af_splt = af.split(',') - while(len(af_splt) < len(alt_alleles)): # are we lacking enough AF values for some reason? - af_splt.append(af_splt[-1]) # phone it in. - if len(af_splt) != 0 and af_splt[0] != '.' and af_splt[0] != '': # missing data, yay - alt_freqs = [float(n) for n in af_splt] - else: - alt_freqs = [None]*max([len(alt_alleles),1]) - - gt_perSamp = None - # if available (i.e. 
we simulated it) look for WP in info - if len(colSamp) == 0 and 'INFO' in colDict and 'WP=' in splt[colDict['INFO']]: - info = splt[colDict['INFO']]+';' - gt_perSamp = [re.findall(r"WP=.*?(?=;)",info)[0][3:]] - else: - # if no sample columns, check info for GT - if len(colSamp) == 0 and 'INFO' in colDict and 'GT=' in splt[colDict['INFO']]: - info = splt[colDict['INFO']]+';' - gt_perSamp = [re.findall(r"GT=.*?(?=;)",info)[0][3:]] - elif len(colSamp): - fmt = ':'+splt[colDict['FORMAT']]+':' - if ':GT:' in fmt: - gtInd = fmt.split(':').index('GT') - gt_perSamp = [splt[colSamp[iii]].split(':')[gtInd-1] for iii in xrange(len(colSamp))] - for i in xrange(len(gt_perSamp)): - gt_perSamp[i] = gt_perSamp[i].replace('.','0') - if gt_perSamp == None: - gt_perSamp = [None]*max([len(colSamp),1]) - - return (alt_alleles, alt_freqs, gt_perSamp) - - - -def parseVCF(vcfPath,tumorNormal=False,ploidy=2): - - tt = time.time() - print '--------------------------------' - sys.stdout.write('reading input VCF...\n') - sys.stdout.flush() - - colDict = {} - colSamp = [] - nSkipped = 0 - nSkipped_becauseHash = 0 - allVars = {} # [ref][pos] - sampNames = [] - alreadyPrintedWarning = False - f = open(vcfPath,'r') - for line in f: - - if line[0] != '#': - if len(colDict) == 0: - print '\n\nERROR: VCF has no header?\n'+vcfPath+'\n\n' - f.close() - exit(1) - splt = line.strip().split('\t') - plOut = parseLine(splt,colDict,colSamp) - if plOut == None: - nSkipped += 1 - else: - (aa, af, gt) = plOut - - # make sure at least one allele somewhere contains the variant - if tumorNormal: - gtEval = gt[:2] - else: - gtEval = gt[:1] - if None in gtEval: - if CHOOSE_RANDOM_PLOID_IF_NO_GT_FOUND: - if not alreadyPrintedWarning: - print 'Warning: Found variants without a GT field, assuming heterozygous...' - alreadyPrintedWarning = True - for i in xrange(len(gtEval)): - tmp = ['0']*ploidy - tmp[random.randint(0,ploidy-1)] = '1' - gtEval[i] = '/'.join(tmp) - else: - # skip because no GT field was found - nSkipped += 1 - continue - isNonReference = False - for gtVal in gtEval: - if gtVal != None: - if '1' in gtVal: - isNonReference = True - if not isNonReference: - # skip if no genotype actually contains this variant - nSkipped += 1 - continue - - chrom = splt[0] - pos = int(splt[1]) - ref = splt[3] - # skip if position is <= 0 - if pos <= 0: - nSkipped += 1 - continue - - # hash variants to avoid inserting duplicates (there are some messy VCFs out there...) - if chrom not in allVars: - allVars[chrom] = {} - if pos not in allVars[chrom]: - allVars[chrom][pos] = (pos,ref,aa,af,gtEval) - else: - nSkipped_becauseHash += 1 - - else: - if line[1] != '#': - cols = line[1:-1].split('\t') - for i in xrange(len(cols)): - if 'FORMAT' in colDict: - colSamp.append(i) - colDict[cols[i]] = i - if len(colSamp): - sampNames = cols[-len(colSamp):] - if len(colSamp) == 1: - pass - elif len(colSamp) == 2 and tumorNormal: - print 'Detected 2 sample columns in input VCF, assuming tumor/normal.' - else: - print 'Warning: Multiple sample columns present in input VCF. By default genReads uses only the first column.' 
- else: - sampNames = ['Unknown'] - if tumorNormal: - #tumorInd = sampNames.index('TUMOR') - #normalInd = sampNames.index('NORMAL') - if 'NORMAL' not in sampNames or 'TUMOR' not in sampNames: - print '\n\nERROR: Input VCF must have a "NORMAL" and "TUMOR" column.\n' - f.close() - - varsOut = {} - for r in allVars.keys(): - varsOut[r] = [list(allVars[r][k]) for k in sorted(allVars[r].keys())] - # prune unnecessary sequence from ref/alt alleles - for i in xrange(len(varsOut[r])): - while len(varsOut[r][i][1]) > 1 and all([n[-1] == varsOut[r][i][1][-1] for n in varsOut[r][i][2]]) and all([len(n) > 1 for n in varsOut[r][i][2]]): - varsOut[r][i][1] = varsOut[r][i][1][:-1] - varsOut[r][i][2] = [n[:-1] for n in varsOut[r][i][2]] - varsOut[r][i] = tuple(varsOut[r][i]) - - print 'found',sum([len(n) for n in allVars.values()]),'valid variants in input vcf.' - print ' *',nSkipped,'variants skipped: (qual filtered / ref genotypes / invalid syntax)' - print ' *',nSkipped_becauseHash,'variants skipped due to multiple variants found per position' - print '--------------------------------' - return (sampNames, varsOut) - - - diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..0836577 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,6 @@ +numpy==1.19.5 +biopython==1.78 +matplotlib==3.3.4 +matplotlib_venn==0.11.6 +pandas==1.2.1 +pysam==0.16.0.1 \ No newline at end of file diff --git a/source/SequenceContainer.py b/source/SequenceContainer.py new file mode 100755 index 0000000..fe9fc0d --- /dev/null +++ b/source/SequenceContainer.py @@ -0,0 +1,1124 @@ +import random +import copy +import pathlib +import bisect +import pickle +import sys + +import numpy as np +from Bio.Seq import Seq + +from source.neat_cigar import CigarString +from source.probability import DiscreteDistribution, poisson_list + +# TODO This whole file is in desperate need of refactoring + +""" +Constants needed for analysis +""" +MAX_ATTEMPTS = 100 # max attempts to insert a mutation into a valid position +MAX_MUTFRAC = 0.3 # the maximum percentage of a window that can contain mutations + +NUCL = ['A', 'C', 'G', 'T'] +TRI_IND = {'AA': 0, 'AC': 1, 'AG': 2, 'AT': 3, 'CA': 4, 'CC': 5, 'CG': 6, 'CT': 7, + 'GA': 8, 'GC': 9, 'GG': 10, 'GT': 11, 'TA': 12, 'TC': 13, 'TG': 14, 'TT': 15} +NUC_IND = {'A': 0, 'C': 1, 'G': 2, 'T': 3} +ALL_TRI = [NUCL[i] + NUCL[j] + NUCL[k] for i in range(len(NUCL)) for j in range(len(NUCL)) for k in range(len(NUCL))] +ALL_IND = {ALL_TRI[i]: i for i in range(len(ALL_TRI))} + +# DEBUG +IGNORE_TRINUC = False + +# percentile resolution used for fraglen quantizing +COV_FRAGLEN_PERCENTILE = 10. +LARGE_NUMBER = 9999999999 + +""" +DEFAULT MUTATION MODELS +""" + +DEFAULT_1_OVERALL_MUT_RATE = 0.001 +DEFAULT_1_HOMOZYGOUS_FREQ = 0.010 +DEFAULT_1_INDEL_FRACTION = 0.05 +DEFAULT_1_INS_VS_DEL = 0.6 +DEFAULT_1_INS_LENGTH_VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] +DEFAULT_1_INS_LENGTH_WEIGHTS = [0.4, 0.2, 0.1, 0.05, 0.05, 0.05, 0.05, 0.034, 0.033, 0.033] +DEFAULT_1_DEL_LENGTH_VALUES = [1, 2, 3, 4, 5] +DEFAULT_1_DEL_LENGTH_WEIGHTS = [0.3, 0.2, 0.2, 0.2, 0.1] +example_matrix_1 = [[0.0, 0.15, 0.7, 0.15], + [0.15, 0.0, 0.15, 0.7], + [0.7, 0.15, 0.0, 0.15], + [0.15, 0.7, 0.15, 0.0]] +DEFAULT_1_TRI_FREQS = [copy.deepcopy(example_matrix_1) for _ in range(16)] +DEFAULT_1_TRINUC_BIAS = [1. 
/ float(len(ALL_TRI)) for _ in ALL_TRI] +DEFAULT_MODEL_1 = [DEFAULT_1_OVERALL_MUT_RATE, + DEFAULT_1_HOMOZYGOUS_FREQ, + DEFAULT_1_INDEL_FRACTION, + DEFAULT_1_INS_VS_DEL, + DEFAULT_1_INS_LENGTH_VALUES, + DEFAULT_1_INS_LENGTH_WEIGHTS, + DEFAULT_1_DEL_LENGTH_VALUES, + DEFAULT_1_DEL_LENGTH_WEIGHTS, + DEFAULT_1_TRI_FREQS, + DEFAULT_1_TRINUC_BIAS] + +DEFAULT_2_OVERALL_MUT_RATE = 0.002 +DEFAULT_2_HOMOZYGOUS_FREQ = 0.200 +DEFAULT_2_INDEL_FRACTION = 0.1 +DEFAULT_2_INS_VS_DEL = 0.3 +DEFAULT_2_INS_LENGTH_VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] +DEFAULT_2_INS_LENGTH_WEIGHTS = [0.1, 0.1, 0.2, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05] +# noinspection DuplicatedCode +DEFAULT_2_DEL_LENGTH_VALUES = [1, 2, 3, 4, 5] +DEFAULT_2_DEL_LENGTH_WEIGHTS = [0.3, 0.2, 0.2, 0.2, 0.1] +example_matrix_2 = [[0.0, 0.15, 0.7, 0.15], + [0.15, 0.0, 0.15, 0.7], + [0.7, 0.15, 0.0, 0.15], + [0.15, 0.7, 0.15, 0.0]] +DEFAULT_2_TRI_FREQS = [copy.deepcopy(example_matrix_2) for _ in range(16)] +DEFAULT_2_TRINUC_BIAS = [1. / float(len(ALL_TRI)) for _ in ALL_TRI] +DEFAULT_MODEL_2 = [DEFAULT_2_OVERALL_MUT_RATE, + DEFAULT_2_HOMOZYGOUS_FREQ, + DEFAULT_2_INDEL_FRACTION, + DEFAULT_2_INS_VS_DEL, + DEFAULT_2_INS_LENGTH_VALUES, + DEFAULT_2_INS_LENGTH_WEIGHTS, + DEFAULT_2_DEL_LENGTH_VALUES, + DEFAULT_2_DEL_LENGTH_WEIGHTS, + DEFAULT_2_TRI_FREQS, + DEFAULT_2_TRINUC_BIAS] + + +class SequenceContainer: + """ + Container for reference sequences, applies mutations + """ + + def __init__(self, x_offset, sequence, ploidy, window_overlap, read_len, mut_models=None, mut_rate=None, + only_vcf=False): + + # initialize basic variables + self.only_vcf = only_vcf + self.x = x_offset + self.ploidy = ploidy + self.read_len = read_len + self.sequences = [Seq(str(sequence)) for _ in range(self.ploidy)] + self.seq_len = len(sequence) + self.indel_list = [[] for _ in range(self.ploidy)] + self.snp_list = [[] for _ in range(self.ploidy)] + self.all_cigar = [[] for _ in range(self.ploidy)] + self.fm_pos = [[] for _ in range(self.ploidy)] + self.fm_span = [[] for _ in range(self.ploidy)] + + # Blacklist explanation: + # black_list[ploid][pos] = 0 safe to insert variant here + # black_list[ploid][pos] = 1 indel inserted here + # black_list[ploid][pos] = 2 snp inserted here + # black_list[ploid][pos] = 3 invalid position for various processing reasons + self.black_list = [np.zeros(self.seq_len, dtype=' int(self.read_len / 2.): + tr_cov_vals[i] = [0.0] * int(self.read_len // 2) + tr_cov_vals[i][:-int(self.read_len / 2.)] + # fill in missing indices + tr_cov_vals[i].extend([0.0] * (len(self.sequences[i]) - len(tr_cov_vals[i]))) + + # + coverage_vector = np.cumsum([tr_cov_vals[i][nnn] * + gc_cov_vals[i][nnn] for nnn in range(len(tr_cov_vals[i]))]) + coverage_vals = [] + # TODO if max_coord is <=0, this is a problem + for j in range(0, max_coord): + coverage_vals.append(coverage_vector[j + self.read_len] - coverage_vector[j]) + # Below is Zach's attempt to fix this. The commented out line is the original + # avg_out.append(np.mean(coverage_vals) / float(self.read_len)) + avg_out.append(np.mean(coverage_vals)/float(min([self.read_len, max_coord]))) + # Debug statement + # print(f'{avg_out}, {np.mean(avg_out)}') + + if frag_dist is None: + # Debug statement + # print(f'++++, {max_coord}, {len(self.sequences[i])}, ' + # f'{len(self.all_cigar[i])}, {len(coverage_vals)}') + self.coverage_distribution.append(DiscreteDistribution(coverage_vals, range(len(coverage_vals)))) + + # fragment length nightmare + else: + current_thresh = 0. 
+ index_list = [0] + for j in range(len(frag_dist.cum_prob)): + if frag_dist.cum_prob[j] >= current_thresh + COV_FRAGLEN_PERCENTILE / 100.0: + current_thresh = frag_dist.cum_prob[j] + index_list.append(j) + flq = [frag_dist.values[nnn] for nnn in index_list] + if frag_dist.values[-1] not in flq: + flq.append(frag_dist.values[-1]) + flq.append(LARGE_NUMBER) + + self.fraglen_ind_map = {} + for j in frag_dist.values: + b_ind = bisect.bisect(flq, j) + if abs(flq[b_ind - 1] - j) <= abs(flq[b_ind] - j): + self.fraglen_ind_map[j] = flq[b_ind - 1] + else: + self.fraglen_ind_map[j] = flq[b_ind] + + self.coverage_distribution.append({}) + for flv in sorted(list(set(self.fraglen_ind_map.values()))): + buffer_val = self.read_len + for j in frag_dist.values: + if self.fraglen_ind_map[j] == flv and j > buffer_val: + buffer_val = j + max_coord = min([len(self.sequences[i]) - buffer_val - 1, + len(self.all_cigar[i]) - buffer_val + self.read_len - 2]) + # print 'BEFORE:', len(self.sequences[i])-buffer_val + # print 'AFTER: ', len(self.all_cigar[i])-buffer_val+self.read_len-2 + # print 'AFTER2:', max_coord + coverage_vals = [] + for j in range(0, max_coord): + coverage_vals.append( + coverage_vector[j + self.read_len] - coverage_vector[j] + coverage_vector[j + flv] - + coverage_vector[ + j + flv - self.read_len]) + + # EXPERIMENTAL + # quantized_cov_vals = quantize_list(coverage_vals) + # self.coverage_distribution[i][flv] = \ + # DiscreteDistribution([n[2] for n in quantized_cov_vals], + # [(n[0], n[1]) for n in quantized_cov_vals]) + + # TESTING + # import matplotlib.pyplot as mpl + # print len(coverage_vals),'-->',len(quantized_cov_vals) + # mpl.figure(0) + # mpl.plot(range(len(coverage_vals)), coverage_vals) + # for qcv in quantized_cov_vals: + # mpl.plot([qcv[0], qcv[1]+1], [qcv[2],qcv[2]], 'r') + # mpl.show() + # sys.exit(1) + + self.coverage_distribution[i][flv] = DiscreteDistribution(coverage_vals, + range(len(coverage_vals))) + + return np.mean(avg_out) + + def init_poisson(self): + ind_l_list = [self.seq_len * self.models[i][0] * self.models[i][2] * self.ploid_mut_frac[i] for i in + range(len(self.models))] + snp_l_list = [self.seq_len * self.models[i][0] * (1. - self.models[i][2]) * self.ploid_mut_frac[i] for i in + range(len(self.models))] + k_range = range(int(self.seq_len * MAX_MUTFRAC)) + # return (indel_poisson, snp_poisson) + # TODO These next two lines are really slow. Maybe there's a better way + return [poisson_list(k_range, ind_l_list[n]) for n in range(len(self.models))], \ + [poisson_list(k_range, snp_l_list[n]) for n in range(len(self.models))] + + def update(self, x_offset, sequence, ploidy, window_overlap, read_len, mut_models=None, mut_rate=None): + # if mutation model is changed, we have to reinitialize it... 
+ if ploidy != self.ploidy or mut_rate != self.mut_rescale or mut_models is not None: + self.ploidy = ploidy + self.mut_rescale = mut_rate + self.update_mut_models(mut_models, mut_rate) + # if sequence length is different than previous window, we have to redo snp/indel poissons + if len(sequence) != self.seq_len: + self.seq_len = len(sequence) + self.indel_poisson, self.snp_poisson = self.init_poisson() + # basic vars + self.update_basic_vars(x_offset, sequence, ploidy, window_overlap, read_len) + self.indels_to_add = [n.sample() for n in self.indel_poisson] + self.snps_to_add = [n.sample() for n in self.snp_poisson] + # initialize trinuc snp bias + if not IGNORE_TRINUC: + self.update_trinuc_bias() + + def insert_mutations(self, input_list): + for input_variable in input_list: + which_ploid = [] + wps = input_variable[4][0] + + # if no genotype given, assume heterozygous and choose a single ploid based on their mut rates + if wps is None: + which_ploid.append(self.ploid_mut_prior.sample()) + which_alt = [0] + else: + if '/' in wps or '|' in wps: + if '/' in wps: + splt = wps.split('/') + else: + splt = wps.split('|') + which_ploid = [] + for i in range(len(splt)): + if splt[i] == '1': + which_ploid.append(i) + # assume we're just using first alt for inserted variants? + which_alt = [0 for _ in which_ploid] + # otherwise assume monoploidy + else: + which_ploid = [0] + which_alt = [0] + + # ignore invalid ploids + for i in range(len(which_ploid) - 1, -1, -1): + if which_ploid[i] >= self.ploidy: + del which_ploid[i] + + for i in range(len(which_ploid)): + p = which_ploid[i] + my_alt = input_variable[2][which_alt[i]] + my_var = (input_variable[0] - self.x, input_variable[1], my_alt) + # This is a potential fix implemented by Zach in a previous commit. He left the next line in. 
+ # in_len = max([len(input_variable[1]), len(my_alt)]) + in_len = len(input_variable[1]) + + if my_var[0] < 0 or my_var[0] >= len(self.black_list[p]): + print('\nError: Attempting to insert variant out of window bounds:') + print(my_var, '--> blackList[0:' + str(len(self.black_list[p])) + ']\n') + sys.exit(1) + if len(input_variable[1]) == 1 and len(my_alt) == 1: + if self.black_list[p][my_var[0]]: + continue + self.snp_list[p].append(my_var) + self.black_list[p][my_var[0]] = 2 + else: + indel_failed = False + for k in range(my_var[0], my_var[0] + in_len): + if k >= len(self.black_list[p]): + indel_failed = True + continue + if self.black_list[p][k]: + indel_failed = True + continue + if indel_failed: + continue + for k in range(my_var[0], my_var[0] + in_len): + self.black_list[p][k] = 1 + self.indel_list[p].append(my_var) + + def random_mutations(self): + + # add random indels + all_indels = [[] for _ in self.sequences] + for i in range(self.ploidy): + for j in range(self.indels_to_add[i]): + # insert homozygous indel + if random.random() <= self.models[i][1]: + which_ploid = range(self.ploidy) + # insert heterozygous indel + else: + which_ploid = [self.ploid_mut_prior.sample()] + + # try to find suitable places to insert indels + event_pos = -1 + for attempt in range(MAX_ATTEMPTS): + event_pos = random.randint(self.win_buffer, self.seq_len - 1) + for p in which_ploid: + if self.black_list[p][event_pos]: + event_pos = -1 + if event_pos != -1: + break + if event_pos == -1: + continue + + # insertion + if random.random() <= self.models[i][3]: + in_len = self.models[i][4].sample() + # sequence content of random insertions is uniformly random (change this later, maybe) + in_seq = ''.join([random.choice(NUCL) for _ in range(in_len)]) + ref_nucl = self.sequences[i][event_pos] + my_indel = (event_pos, ref_nucl, ref_nucl + in_seq) + # deletion + else: + in_len = self.models[i][5].sample() + # skip if deletion too close to boundary + if event_pos + in_len + 1 >= len(self.sequences[i]): + continue + if in_len == 1: + in_seq = self.sequences[i][event_pos + 1] + else: + in_seq = str(self.sequences[i][event_pos + 1:event_pos + in_len + 1]) + ref_nucl = self.sequences[i][event_pos] + my_indel = (event_pos, ref_nucl + in_seq, ref_nucl) + + # if event too close to boundary, skip. if event conflicts with other indel, skip. 
+ skip_event = False + if event_pos + len(my_indel[1]) >= self.seq_len - self.win_buffer - 1: + skip_event = True + if skip_event: + continue + for p in which_ploid: + for k in range(event_pos, event_pos + in_len + 1): + if self.black_list[p][k]: + skip_event = True + if skip_event: + continue + + for p in which_ploid: + for k in range(event_pos, event_pos + in_len + 1): + self.black_list[p][k] = 1 + all_indels[p].append(my_indel) + + # add random snps + all_snps = [[] for _ in self.sequences] + for i in range(self.ploidy): + for j in range(self.snps_to_add[i]): + # insert homozygous SNP + if random.random() <= self.models[i][1]: + which_ploid = range(self.ploidy) + # insert heterozygous SNP + else: + which_ploid = [self.ploid_mut_prior.sample()] + + # try to find suitable places to insert snps + event_pos = -1 + for attempt in range(MAX_ATTEMPTS): + # based on the mutation model for the specified ploid, choose a SNP location based on trinuc bias + # (if there are multiple ploids, choose one at random) + if IGNORE_TRINUC: + event_pos = random.randint(self.win_buffer + 1, self.seq_len - 2) + else: + ploid_to_use = which_ploid[random.randint(0, len(which_ploid) - 1)] + event_pos = self.trinuc_bias[ploid_to_use].sample() + for p in which_ploid: + if self.black_list[p][event_pos]: + event_pos = -1 + if event_pos != -1: + break + if event_pos == -1: + continue + + ref_nucl = self.sequences[i][event_pos] + context = str(self.sequences[i][event_pos - 1]) + str(self.sequences[i][event_pos + 1]) + # sample from tri-nucleotide substitution matrices to get SNP alt allele + new_nucl = self.models[i][6][TRI_IND[context]][NUC_IND[ref_nucl]].sample() + my_snp = (event_pos, ref_nucl, new_nucl) + + for p in which_ploid: + all_snps[p].append(my_snp) + self.black_list[p][my_snp[0]] = 2 + + # combine random snps with inserted snps, remove any snps that overlap indels + for p in range(len(all_snps)): + all_snps[p].extend(self.snp_list[p]) + all_snps[p] = [n for n in all_snps[p] if self.black_list[p][n[0]] != 1] + + # MODIFY REFERENCE STRING: SNPS + for i in range(len(all_snps)): + temp = self.sequences[i].tomutable() + for j in range(len(all_snps[i])): + v_pos = all_snps[i][j][0] + + if all_snps[i][j][1] != temp[v_pos]: + print('\nError: Something went wrong!\n', all_snps[i][j], temp[v_pos], '\n') + print(all_snps[i][j]) + print(self.sequences[i][v_pos]) + sys.exit(1) + else: + temp[v_pos] = all_snps[i][j][2] + self.sequences[i] = temp.toseq() + + # organize the indels we want to insert + for i in range(len(all_indels)): + all_indels[i].extend(self.indel_list[i]) + all_indels_ins = [sorted([list(m) for m in n]) for n in all_indels] + + # MODIFY REFERENCE STRING: INDELS + for i in range(len(all_indels_ins)): + rolling_adj = 0 + temp_symbol_list = CigarString.string_to_list(str(len(self.sequences[i])) + "M") + + for j in range(len(all_indels_ins[i])): + v_pos = all_indels_ins[i][j][0] + rolling_adj + v_pos2 = v_pos + len(all_indels_ins[i][j][1]) + indel_length = len(all_indels_ins[i][j][2]) - len(all_indels_ins[i][j][1]) + rolling_adj += indel_length + + if all_indels_ins[i][j][1] != str(self.sequences[i][v_pos:v_pos2]): + print('\nError: Something went wrong!\n', all_indels_ins[i][j], [v_pos, v_pos2], + str(self.sequences[i][v_pos:v_pos2]), '\n') + sys.exit(1) + else: + # alter reference sequence + self.sequences[i] = self.sequences[i][:v_pos] + Seq(all_indels_ins[i][j][2]) + \ + self.sequences[i][v_pos2:] + # notate indel positions for cigar computation + if indel_length > 0: + temp_symbol_list = 
temp_symbol_list[:v_pos + 1] + ['I'] * indel_length \ + + temp_symbol_list[v_pos2 + 1:] + elif indel_length < 0: + temp_symbol_list[v_pos + 1] = "D" * abs(indel_length) + "M" + + # pre-compute cigar strings + for j in range(len(temp_symbol_list) - self.read_len): + self.all_cigar[i].append(temp_symbol_list[j:j + self.read_len]) + + # create some data structures we will need later: + # --- self.fm_pos[ploid][pos]: position of the left-most matching base (IN REFERENCE COORDINATES, i.e. + # corresponding to the unmodified reference genome) + # --- self.fm_span[ploid][pos]: number of reference positions spanned by a read originating from + # this coordinate + md_so_far = 0 + for j in range(len(temp_symbol_list)): + self.fm_pos[i].append(md_so_far) + # fix an edge case with deletions + if 'D' in temp_symbol_list[j]: + self.fm_pos[i][-1] += temp_symbol_list[j].count('D') + # compute number of ref matches for each read + # This line gets hit a lot and is relatively slow. Might look for an improvement + span_dif = len([n for n in temp_symbol_list[j: j + self.read_len] if 'M' in n]) + self.fm_span[i].append(self.fm_pos[i][-1] + span_dif) + md_so_far += temp_symbol_list[j].count('M') + temp_symbol_list[j].count('D') + + # tally up all the variants we handled... + count_dict = {} + all_variants = [sorted(all_snps[i] + all_indels[i]) for i in range(self.ploidy)] + for i in range(len(all_variants)): + for j in range(len(all_variants[i])): + all_variants[i][j] = tuple([all_variants[i][j][0] + self.x]) + all_variants[i][j][1:] + t = tuple(all_variants[i][j]) + if t not in count_dict: + count_dict[t] = [] + count_dict[t].append(i) + + # TODO: combine multiple variants that happened to occur at same position into single vcf entry? + + output_variants = [] + for k in sorted(count_dict.keys()): + output_variants.append(k + tuple([len(count_dict[k]) / float(self.ploidy)])) + ploid_string = ['0' for _ in range(self.ploidy)] + for k2 in [n for n in count_dict[k]]: + ploid_string[k2] = '1' + output_variants[-1] += tuple(['WP=' + '/'.join(ploid_string)]) + return output_variants + + def sample_read(self, sequencing_model, frag_len=None): + + # choose a ploid + my_ploid = random.randint(0, self.ploidy - 1) + + # stop attempting to find a valid position if we fail enough times + MAX_READPOS_ATTEMPTS = 100 + attempts_thus_far = 0 + + # choose a random position within the ploid, and generate quality scores / sequencing errors + reads_to_sample = [] + if frag_len is None: + r_pos = self.coverage_distribution[my_ploid].sample() + + # sample read position and call function to compute quality scores / sequencing errors + r_dat = self.sequences[my_ploid][r_pos:r_pos + self.read_len] + (my_qual, my_errors) = sequencing_model.get_sequencing_errors(r_dat) + reads_to_sample.append([r_pos, my_qual, my_errors, r_dat]) + + else: + r_pos1 = self.coverage_distribution[my_ploid][self.fraglen_ind_map[frag_len]].sample() + + # EXPERIMENTAL + # coords_to_select_from = self.coverage_distribution[my_ploid][self.fraglens_ind_map[frag_len]].sample() + # r_pos1 = random.randint(coords_to_select_from[0],coords_to_select_from[1]) + + r_pos2 = r_pos1 + frag_len - self.read_len + r_dat1 = self.sequences[my_ploid][r_pos1:r_pos1 + self.read_len] + r_dat2 = self.sequences[my_ploid][r_pos2:r_pos2 + self.read_len] + (my_qual1, my_errors1) = sequencing_model.get_sequencing_errors(r_dat1) + (my_qual2, my_errors2) = sequencing_model.get_sequencing_errors(r_dat2, is_reverse_strand=True) + reads_to_sample.append([r_pos1, my_qual1, my_errors1, r_dat1]) + 
reads_to_sample.append([r_pos2, my_qual2, my_errors2, r_dat2]) + + # error format: + # myError[i] = (type, len, pos, ref, alt) + + """ + examine sequencing errors to-be-inserted. + - remove deletions that don't have enough bordering sequence content to "fill in" + if error is valid, make the changes to the read data + """ + read_out = [] + for read in reads_to_sample: + try: + my_cigar = self.all_cigar[my_ploid][read[0]] + except IndexError: + print('Index error when attempting to find cigar string.') + print(my_ploid, len(self.all_cigar[my_ploid]), read[0]) + if frag_len is not None: + print((r_pos1, r_pos2)) + print(frag_len, self.fraglen_ind_map[frag_len]) + sys.exit(1) + total_d = sum([error[1] for error in read[2] if error[0] == 'D']) + total_i = sum([error[1] for error in read[2] if error[0] == 'I']) + avail_b = len(self.sequences[my_ploid]) - read[0] - self.read_len - 1 + + # add buffer sequence to fill in positions that get deleted + read[3] += self.sequences[my_ploid][read[0] + self.read_len:read[0] + self.read_len + total_d] + # this is leftover code and a patch for a method that isn't used. There is probably a better + # way to structure this than with a boolean + first_time = True + adj = 0 + sse_adj = [0 for _ in range(self.read_len + max(sequencing_model.err_p[3]))] + any_indel_err = False + + # sort by letter (D > I > S) such that we introduce all indel errors before substitution errors + # secondarily, sort by index + arranged_errors = {'D': [], 'I': [], 'S': []} + for error in read[2]: + arranged_errors[error[0]].append((error[2], error)) + sorted_errors = [] + for k in sorted(arranged_errors.keys()): + sorted_errors.extend([n[1] for n in sorted(arranged_errors[k])]) + + skip_indels = False + + # FIXED TdB 05JUN2018 + # Moved this outside the for error loop, since it messes up the CIGAR string when + # more than one deletion is in the same read + extra_cigar_val = [] + # END FIXED TdB + + for error in sorted_errors: + e_len = error[1] + e_pos = error[2] + if error[0] == 'D' or error[0] == 'I': + any_indel_err = True + + # FIXED TdB 05JUN2018 + # Moved this OUTSIDE the for error loop, since it messes up the CIGAR string + # when more than one deletion is in the same read + # extra_cigar_val = [] + # END FIXED TdB + + if total_d > avail_b: # if not enough bases to fill-in deletions, skip all indel erors + continue + if first_time: + # Again, this whole first time thing is a workaround for the previous + # code, which is simplified. May need to fix this all at some point + first_time = False + fill_to_go = total_d - total_i + 1 + if fill_to_go > 0: + try: + extra_cigar_val = self.all_cigar[my_ploid][read[0] + fill_to_go][-fill_to_go:] + except IndexError: + # Applying the deletions we want requires going beyond region boundaries. + # Skip all indel errors + skip_indels = True + + if skip_indels: + continue + + # insert deletion error into read and update cigar string accordingly + if error[0] == 'D': + my_adj = sse_adj[e_pos] + pi = e_pos + my_adj + pf = e_pos + my_adj + e_len + 1 + if str(read[3][pi:pf]) == str(error[3]): + read[3] = read[3][:pi + 1] + read[3][pf:] + my_cigar = my_cigar[:pi + 1] + my_cigar[pf:] + # weird edge case with del at very end of region. Make a guess and add a "M" + if pi + 1 == len(my_cigar): + my_cigar.append('M') + + try: + my_cigar[pi + 1] = 'D' * e_len + my_cigar[pi + 1] + except IndexError: + print("Bug!! 
Index error on expanded cigar") + sys.exit(1) + + else: + print('\nError, ref does not match alt while attempting to insert deletion error!\n') + sys.exit(1) + adj -= e_len + for i in range(e_pos, len(sse_adj)): + sse_adj[i] -= e_len + + # insert insertion error into read and update cigar string accordingly + else: + my_adj = sse_adj[e_pos] + if str(read[3][e_pos + my_adj]) == error[3]: + read[3] = read[3][:e_pos + my_adj] + error[4] + read[3][e_pos + my_adj + 1:] + my_cigar = my_cigar[:e_pos + my_adj] + ['I'] * e_len + my_cigar[e_pos + my_adj:] + else: + print('\nError, ref does not match alt while attempting to insert insertion error!\n') + print('---', chr(read[3][e_pos + my_adj]), '!=', error[3]) + sys.exit(1) + adj += e_len + for i in range(e_pos, len(sse_adj)): + sse_adj[i] += e_len + + else: # substitution errors, much easier by comparison... + if str(read[3][e_pos + sse_adj[e_pos]]) == error[3]: + temp = read[3].tomutable() + temp[e_pos + sse_adj[e_pos]] = error[4] + read[3] = temp.toseq() + else: + print('\nError, ref does not match alt while attempting to insert substitution error!\n') + sys.exit(1) + + if any_indel_err: + if len(my_cigar): + my_cigar = (my_cigar + extra_cigar_val)[:self.read_len] + + read[3] = read[3][:self.read_len] + + read_out.append([self.fm_pos[my_ploid][read[0]], my_cigar, read[3], str(read[1])]) + + # read_out[i] = (pos, cigar, read_string, qual_string) + return read_out + + +class ReadContainer: + """ + Container for read data: computes quality scores and positions to insert errors + """ + + def __init__(self, read_len, error_model, rescaled_error, rescale_qual=False): + + self.read_len = read_len + self.rescale_qual = rescale_qual + + model_path = pathlib.Path(error_model) + try: + error_dat = pickle.load(open(model_path, 'rb'), encoding="bytes") + except IOError: + print("\nProblem opening the sequencing error model.\n") + sys.exit(1) + + self.uniform = False + + # uniform-error SE reads (e.g., PacBio) + if len(error_dat) == 4: + self.uniform = True + [q_scores, off_q, avg_error, error_params] = error_dat + self.uniform_q_score = min([max(q_scores), int(-10. * np.log10(avg_error) + 0.5)]) + print('Reading in uniform sequencing error model... (q=' + str(self.uniform_q_score) + '+' + str( + off_q) + ', p(err)={0:0.2f}%)'.format(100. * avg_error)) + + # only 1 q-score model present, use same model for both strands + elif len(error_dat) == 6: + [init_q1, prob_q1, q_scores, off_q, avg_error, error_params] = error_dat + self.pe_models = False + + # found a q-score model for both forward and reverse strands + elif len(error_dat) == 8: + [init_q1, prob_q1, init_q2, prob_q2, q_scores, off_q, avg_error, error_params] = error_dat + self.pe_models = True + if len(init_q1) != len(init_q2) or len(prob_q1) != len(prob_q2): + print('\nError: R1 and R2 quality score models are of different length.\n') + sys.exit(1) + + # This serves as a sanity check for the input model + else: + print('\nError: Something wrong with error model.\n') + sys.exit(1) + + self.q_err_rate = [0.] * (max(q_scores) + 1) + for q in q_scores: + self.q_err_rate[q] = 10. ** (-q / 10.) 
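The loop above is just the standard Phred relation, p_err = 10^(-q/10), applied to every quality score in the model. A quick sanity check with the textbook correspondences:

```
# Standard Phred correspondence used to fill q_err_rate above:
for q in (10, 20, 30, 40):
    print(q, 10. ** (-q / 10.))   # 0.1, 0.01, 0.001, 0.0001
```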
+ self.off_q = off_q + self.err_p = error_params + # Selects a new nucleotide based on the error model + self.err_sse = [DiscreteDistribution(n, NUCL) for n in self.err_p[0]] + # allows for selection of indel length based on the parameters of the model + self.err_sie = DiscreteDistribution(self.err_p[2], self.err_p[3]) + # allows for indel insertion based on the length above and the probability from the model + self.err_sin = DiscreteDistribution(self.err_p[5], NUCL) + + # adjust sequencing error frequency to match desired rate + if rescaled_error is None: + self.error_scale = 1.0 + else: + self.error_scale = rescaled_error / avg_error + if not self.rescale_qual: + print('Warning: Quality scores no longer exactly representative of error probability. ' + 'Error model scaled by {0:.3f} to match desired rate...'.format(self.error_scale)) + if self.uniform: + if rescaled_error <= 0.: + self.uniform_q_score = max(q_scores) + else: + self.uniform_q_score = min([max(q_scores), int(-10. * np.log10(rescaled_error) + 0.5)]) + print(' - Uniform quality score scaled to match specified error rate (q=' + str( + self.uniform_qscore) + '+' + str(self.off_q) + ', p(err)={0:0.2f}%)'.format(100. * rescaled_error)) + + if not self.uniform: + # adjust length to match desired read length + if self.read_len == len(init_q1): + self.q_ind_remap = range(self.read_len) + else: + print('Warning: Read length of error model (' + str(len(init_q1)) + ') does not match -R value (' + str( + self.read_len) + '), rescaling model...') + self.q_ind_remap = [max([1, len(init_q1) * n // read_len]) for n in range(read_len)] + + # initialize probability distributions + self.init_dist_by_pos_1 = [DiscreteDistribution(init_q1[i], q_scores) for i in range(len(init_q1))] + self.prob_dist_by_pos_by_prev_q1 = [None] + for i in range(1, len(init_q1)): + self.prob_dist_by_pos_by_prev_q1.append([]) + for j in range(len(init_q1[0])): + # if we don't have sufficient data for a transition, use the previous quality score + if np.sum(prob_q1[i][j]) <= 0.: + self.prob_dist_by_pos_by_prev_q1[-1].append( + DiscreteDistribution([1], [q_scores[j]], degenerate_val=q_scores[j])) + else: + self.prob_dist_by_pos_by_prev_q1[-1].append(DiscreteDistribution(prob_q1[i][j], q_scores)) + + # If paired-end, initialize probability distributions for the other strand + if self.pe_models: + self.init_dist_by_pos_2 = [DiscreteDistribution(init_q2[i], q_scores) for i in range(len(init_q2))] + self.prob_dist_by_pos_by_prev_q2 = [None] + for i in range(1, len(init_q2)): + self.prob_dist_by_pos_by_prev_q2.append([]) + for j in range(len(init_q2[0])): + if np.sum(prob_q2[i][ + j]) <= 0.: # if we don't have sufficient data for a transition, use the previous qscore + self.prob_dist_by_pos_by_prev_q2[-1].append( + DiscreteDistribution([1], [q_scores[j]], degenerate_val=q_scores[j])) + else: + self.prob_dist_by_pos_by_prev_q2[-1].append(DiscreteDistribution(prob_q2[i][j], q_scores)) + + def get_sequencing_errors(self, read_data, is_reverse_strand=False): + """ + Inserts errors of type substitution, insertion, or deletion into read_data, and assigns a quality score + based on the container model. + + :param read_data: sequence to insert errors into + :param is_reverse_strand: whether to treat this as the reverse strand or not + :return: modified sequence and associate quality scores + """ + + # TODO this is one of the slowest methods in the code. Need to investigate how to speed this up. 
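Before the implementation, a brief note on the return format, with invented values for illustration: this method hands back an ASCII-encoded quality string plus a list of error tuples that sample_read then applies to the read.

```
# Sketch of the return values (positions, bases, and off_q = 33 are illustrative only):
#   quality string: one ASCII character per base, chr(q + off_q)
#   error list:     (type, length, position, ref, alt) tuples, e.g.
#       ('S', 1, 57, 'A', 'G')        substitution at read position 57
#       ('I', 2, 90, 'T', 'TCA')      2-base insertion after position 90
#       ('D', 3, 12, 'ACGT', 'A')     3-base deletion starting after position 12
print(chr(30 + 33))   # '?' -- a q30 base with a fastq-style offset of 33
```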
+ q_out = [0] * self.read_len + s_err = [] + + if self.uniform: + my_q = [self.uniform_q_score + self.off_q] * self.read_len + q_out = ''.join([chr(n) for n in my_q]) + for i in range(self.read_len): + if random.random() < self.error_scale * self.q_err_rate[self.uniform_q_score]: + s_err.append(i) + else: + if self.pe_models and is_reverse_strand: + my_q = self.init_dist_by_pos_2[0].sample() + else: + my_q = self.init_dist_by_pos_1[0].sample() + q_out[0] = my_q + + # Every time this is hit, we loop the entire read length twice. I feel like these two loops + # Could be combined into one fairly easily. The culprit seems to bee too many hits to the sample() method. + for i in range(1, self.read_len): + if self.pe_models and is_reverse_strand: + my_q = self.prob_dist_by_pos_by_prev_q2[self.q_ind_remap[i]][my_q].sample() + else: + my_q = self.prob_dist_by_pos_by_prev_q1[self.q_ind_remap[i]][my_q].sample() + q_out[i] = my_q + + if is_reverse_strand: + q_out = q_out[::-1] + + for i in range(self.read_len): + if random.random() < self.error_scale * self.q_err_rate[q_out[i]]: + s_err.append(i) + + if self.rescale_qual: # do we want to rescale qual scores to match rescaled error? + q_out = [max([0, int(-10. * np.log10(self.error_scale * self.q_err_rate[n]) + 0.5)]) for n in q_out] + q_out = [min([int(self.q_err_rate[-1]), n]) for n in q_out] + q_out = ''.join([chr(n + self.off_q) for n in q_out]) + else: + q_out = ''.join([chr(n + self.off_q) for n in q_out]) + + if self.error_scale == 0.0: + return q_out, [] + + s_out = [] + n_del_so_far = 0 + # don't allow indel errors to occur on subsequent positions + prev_indel = -2 + # don't allow other sequencing errors to occur on bases removed by deletion errors + del_blacklist = [] + + # Need to check into this loop, to make sure it isn't slowing us down. + # The culprit seems to bee too many hits to the sample() method. This has a few of those calls. + for ind in s_err[::-1]: # for each error that we're going to insert... 
+ + # determine error type + is_sub = True + if ind != 0 and ind != self.read_len - 1 - max(self.err_p[3]) and abs(ind - prev_indel) > 1: + if random.random() < self.err_p[1]: + is_sub = False + + # insert substitution error + if is_sub: + my_nucl = str(read_data[ind]) + new_nucl = self.err_sse[NUC_IND[my_nucl]].sample() + s_out.append(('S', 1, ind, my_nucl, new_nucl)) + + # insert indel error + else: + indel_len = self.err_sie.sample() + + # insertion error + if random.random() < self.err_p[4]: + my_nucl = str(read_data[ind]) + new_nucl = my_nucl + ''.join([self.err_sin.sample() for n in range(indel_len)]) + s_out.append(('I', len(new_nucl) - 1, ind, my_nucl, new_nucl)) + + # deletion error (prevent too many of them from stacking up) + elif ind < self.read_len - 2 - n_del_so_far: + my_nucl = str(read_data[ind:ind + indel_len + 1]) + new_nucl = str(read_data[ind]) + n_del_so_far += len(my_nucl) - 1 + s_out.append(('D', len(my_nucl) - 1, ind, my_nucl, new_nucl)) + for i in range(ind + 1, ind + indel_len + 1): + del_blacklist.append(i) + prev_indel = ind + + # remove blacklisted errors + for i in range(len(s_out) - 1, -1, -1): + if s_out[i][2] in del_blacklist: + del s_out[i] + + return q_out, s_out + + +# parse mutation model pickle file +def parse_input_mutation_model(model=None, which_default=1): + if which_default == 1: + out_model = [copy.deepcopy(n) for n in DEFAULT_MODEL_1] + elif which_default == 2: + out_model = [copy.deepcopy(n) for n in DEFAULT_MODEL_2] + else: + print('\nError: Unknown default mutation model specified\n') + sys.exit(1) + + if model is not None: + pickle_dict = pickle.load(open(model, "rb")) + out_model[0] = pickle_dict['AVG_MUT_RATE'] + out_model[2] = 1. - pickle_dict['SNP_FREQ'] + + ins_list = pickle_dict['INDEL_FREQ'] + if len(ins_list): + ins_count = sum([ins_list[k] for k in ins_list.keys() if k >= 1]) + del_count = sum([ins_list[k] for k in ins_list.keys() if k <= -1]) + ins_vals = [k for k in sorted(ins_list.keys()) if k >= 1] + ins_weight = [ins_list[k] / float(ins_count) for k in ins_vals] + del_vals = [k for k in sorted([abs(k) for k in ins_list.keys() if k <= -1])] + del_weight = [ins_list[-k] / float(del_count) for k in del_vals] + else: # degenerate case where no indel stats are provided + ins_count = 1 + del_count = 1 + ins_vals = [1] + ins_weight = [1.0] + del_vals = [1] + del_weight = [1.0] + out_model[3] = ins_count / float(ins_count + del_count) + out_model[4] = ins_vals + out_model[5] = ins_weight + out_model[6] = del_vals + out_model[7] = del_weight + + trinuc_trans_prob = pickle_dict['TRINUC_TRANS_PROBS'] + for k in sorted(trinuc_trans_prob.keys()): + my_ind = TRI_IND[k[0][0] + k[0][2]] + (k1, k2) = (NUC_IND[k[0][1]], NUC_IND[k[1][1]]) + out_model[8][my_ind][k1][k2] = trinuc_trans_prob[k] + for i in range(len(out_model[8])): + for j in range(len(out_model[8][i])): + for l in range(len(out_model[8][i][j])): + # if trinuc not present in input mutation model, assign it uniform probability + if float(sum(out_model[8][i][j])) < 1e-12: + out_model[8][i][j] = [0.25, 0.25, 0.25, 0.25] + else: + out_model[8][i][j][l] /= float(sum(out_model[8][i][j])) + + trinuc_mut_prob = pickle_dict['TRINUC_MUT_PROB'] + which_have_we_seen = {n: False for n in ALL_TRI} + trinuc_mean = np.mean(list(trinuc_mut_prob.values())) + for trinuc in trinuc_mut_prob.keys(): + out_model[9][ALL_IND[trinuc]] = trinuc_mut_prob[trinuc] + which_have_we_seen[trinuc] = True + for trinuc in which_have_we_seen.keys(): + if not which_have_we_seen[trinuc]: + out_model[9][ALL_IND[trinuc]] = 
trinuc_mean + + return out_model diff --git a/source/__init__.py b/source/__init__.py new file mode 100755 index 0000000..d42441e --- /dev/null +++ b/source/__init__.py @@ -0,0 +1,8 @@ +# -*- coding: utf-8 -*- +""" +Created on Mon Nov 9 10:41:07 2020 + +@author: membry2 +""" + +from source.probability import * \ No newline at end of file diff --git a/source/input_checking.py b/source/input_checking.py new file mode 100755 index 0000000..dc27fff --- /dev/null +++ b/source/input_checking.py @@ -0,0 +1,69 @@ +""" +This file contains several standard functions that will be used throughout the program. Each function checks input +and issues an error if there is something wrong. +""" + +import pathlib +import sys + + +def required_field(variable_to_test: any, err_string: str) -> None: + """ + If required field variable_to_test is empty, issues an error. Otherwise this does nothing + + :param variable_to_test: Any input type + :param err_string: A string with the error message + :return: None + """ + if variable_to_test is None: + print('\n' + err_string + '\n') + sys.exit(1) + + +def check_file_open(filename: str, err_string: str, required: bool = False) -> None: + """ + Checks that the filename is not empty and that it is indeed a file + + :param filename: file name, string + :param err_string: string of the error if it is not a file + :param required: If not required, skips the check + :return: None + """ + if required or filename is not None: + if filename is None: + print('\n' + err_string + '\n') + sys.exit(1) + else: + try: + pathlib.Path(filename).resolve(strict=True) + except FileNotFoundError: + print('\n' + err_string + '\n') + sys.exit(1) + + +def check_dir(directory: str, err_string: str) -> None: + """ + Checks that directory exists and is a directory + :param directory: string of the directory path + :param err_string: string of the error in case it is not a directory or doesn't exist + :return: None + """ + if not pathlib.Path(directory).is_dir(): + print('\n' + err_string + '\n') + raise NotADirectoryError + + +def is_in_range(value: float, lower_bound: float, upper_bound: float, err_string: str) -> None: + """ + Checks that value is between the lower bound and upper bound, and if not prints an error message + (err_string) and exits the program. + + :param value: float for the value + :param lower_bound: float for the upper bound + :param upper_bound: float for the lower bound + :param err_string: string of the error message to print if the value is out of range + :return: None + """ + if value < lower_bound or value > upper_bound: + print('\n' + err_string + '\n') + sys.exit(1) diff --git a/source/neat_cigar.py b/source/neat_cigar.py new file mode 100644 index 0000000..e1d1ef2 --- /dev/null +++ b/source/neat_cigar.py @@ -0,0 +1,79 @@ +from itertools import groupby + + +class CigarString: + """" + Now we're testing out a list method of CigarString that we're hoping is faster + """ + _read_consuming_ops = ("M", "I", "S", "=", "X") + _ref_consuming_ops = ("M", "D", "N", "=", "X") + + @staticmethod + def items(string_in: str) -> iter: + """ + iterator for cigar string items + :return: Creates an iterator object + """ + if string_in == "*": + yield 0, None + raise StopIteration + cig_iter = groupby(string_in, lambda c: c.isdigit()) + for g, n in cig_iter: + yield int("".join(n)), "".join(next(cig_iter)[1]) + + @staticmethod + def string_to_list(string_in: str) -> list: + """ + This will convert a cigar string into a list of elements + :param string_in: a valid cigar string. 
+ :return: a list version of that string. + """ + cigar_dat = [] + d_reserve = 0 + for item in CigarString.items(string_in): + if item[1] == 'D': + d_reserve = item[0] + if item[1] in ['M', 'I']: + if d_reserve: + cigar_dat += ['D' * d_reserve + item[1]] + [item[1]] * (item[0] - 1) + else: + cigar_dat += [item[1]] * item[0] + d_reserve = 0 + return cigar_dat + + @staticmethod + def list_to_string(input_list: list) -> str: + """ + Convert a cigar string in list format to a standard cigar string + :param input_list: Cigar string in list format + :return: cigar string in string format + """ + + symbols = '' + current_sym = input_list[0] + current_count = 1 + if 'D' in current_sym: + current_sym = current_sym[-1] + for k in range(1, len(input_list)): + next_sym = input_list[k] + if len(next_sym) == 1 and next_sym == current_sym: + current_count += 1 + else: + symbols += str(current_count) + current_sym + if 'D' in next_sym: + symbols += str(next_sym.count('D')) + 'D' + current_sym = next_sym[-1] + else: + current_sym = next_sym + current_count = 1 + symbols += str(current_count) + current_sym + return symbols + + +if __name__ == '__main__': + cigar = "10M1I3D1M" + lst = CigarString.string_to_list(cigar) + print(lst) + st = CigarString.list_to_string(lst) + print(st) + \ No newline at end of file diff --git a/source/output_file_writer.py b/source/output_file_writer.py new file mode 100755 index 0000000..87e1347 --- /dev/null +++ b/source/output_file_writer.py @@ -0,0 +1,325 @@ +from struct import pack +import Bio.bgzf as bgzf +import pathlib +import re + +from source.neat_cigar import CigarString + +BAM_COMPRESSION_LEVEL = 6 + + +def reverse_complement(dna_string) -> str: + """ + Return the reverse complement of a string from a DNA strand. Found this method that is slightly faster than + biopython. Thanks to this stack exchange post: + https://bioinformatics.stackexchange.com/questions/3583/what-is-the-fastest-way-to-get-the-reverse-complement-of-a-dna-sequence-in-pytho + :param dna_string: string of DNA, either in string or Seq format + :return: the reverse complement of the above string in either string or MutableSeq format + """ + if type(dna_string) != str: + dna_string.reverse_complement() + return dna_string + else: + tab = str.maketrans("ACTGN", "TGACN") + + return dna_string.translate(tab)[::-1] + + +# SAMtools reg2bin function +def reg2bin(beg: int, end: int): + """ + Finds the largest superset bin of region. 
Numeric values taken from hts-specs + Note: description of this function taken from source code for bamnostic.bai + (https://bamnostic.readthedocs.io/en/latest/_modules/bamnostic/bai.html) + :param beg: inclusive beginning position of region + :param end: exclusive end position of region + :return: distinct bin ID or largest superset bin of region + """ + end -= 1 + if beg >> 14 == end >> 14: + return ((1 << 15) - 1) // 7 + (beg >> 14) + if beg >> 17 == end >> 17: + return ((1 << 12) - 1) // 7 + (beg >> 17) + if beg >> 20 == end >> 20: + return ((1 << 9) - 1) // 7 + (beg >> 20) + if beg >> 23 == end >> 23: + return ((1 << 6) - 1) // 7 + (beg >> 23) + if beg >> 26 == end >> 26: + return ((1 << 3) - 1) // 7 + (beg >> 26) + return 0 + + +# takes list of strings, returns numerical flag +def sam_flag(string_list: list) -> int: + out_val = 0 + string_list = list(set(string_list)) + for n in string_list: + if n == 'paired': + out_val += 1 + elif n == 'proper': + out_val += 2 + elif n == 'unmapped': + out_val += 4 + elif n == 'mate_unmapped': + out_val += 8 + elif n == 'reverse': + out_val += 16 + elif n == 'mate_reverse': + out_val += 32 + elif n == 'first': + out_val += 64 + elif n == 'second': + out_val += 128 + elif n == 'not_primary': + out_val += 256 + elif n == 'low_quality': + out_val += 512 + elif n == 'duplicate': + out_val += 1024 + elif n == 'supplementary': + out_val += 2048 + return out_val + + +CIGAR_PACKED = {'M': 0, 'I': 1, 'D': 2, 'N': 3, 'S': 4, 'H': 5, 'P': 6, '=': 7, 'X': 8} +SEQ_PACKED = {'=': 0, 'A': 1, 'C': 2, 'M': 3, 'G': 4, 'R': 5, 'S': 6, 'V': 7, + 'T': 8, 'W': 9, 'Y': 10, 'H': 11, 'K': 12, 'D': 13, 'B': 14, 'N': 15} + +# TODO figure out an optimum batch size +BUFFER_BATCH_SIZE = 8000 # write out to file after this many reads + + +# TODO find a better way to write output files +class OutputFileWriter: + def __init__(self, out_prefix, paired=False, bam_header=None, vcf_header=None, + no_fastq=False, fasta_instead=False): + + self.fasta_instead = fasta_instead + # TODO Eliminate paired end as an option for fastas. Plan is to create a write fasta method. 
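As an aside on the sam_flag helper defined above: it simply sums the standard SAM FLAG bits for the named properties, so a mapped, properly paired read pair produces the familiar 99/147 combination. A small sketch, assuming the source package is importable:

```
from source.output_file_writer import sam_flag

print(sam_flag(['paired', 'proper', 'first', 'mate_reverse']))   # 1 + 2 + 64 + 32 = 99
print(sam_flag(['paired', 'proper', 'second', 'reverse']))       # 1 + 2 + 128 + 16 = 147
```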
+ if self.fasta_instead: + fq1 = pathlib.Path(out_prefix + '.fasta.gz') + fq2 = None + else: + fq1 = pathlib.Path(out_prefix + '_read1.fq.gz') + fq2 = pathlib.Path(out_prefix + '_read2.fq.gz') + bam = pathlib.Path(out_prefix + '_golden.bam') + vcf = pathlib.Path(out_prefix + '_golden.vcf.gz') + + # TODO Make a fasta-specific method + self.no_fastq = no_fastq + if not self.no_fastq: + self.fq1_file = bgzf.open(fq1, 'w') + + self.fq2_file = None + if paired: + self.fq2_file = bgzf.open(fq2, 'w') + + # VCF OUTPUT + self.vcf_file = None + if vcf_header is not None: + self.vcf_file = bgzf.open(vcf, 'wb') + + # WRITE VCF HEADER + self.vcf_file.write('##fileformat=VCFv4.1\n'.encode('utf-8')) + reference = '##reference=' + vcf_header[0] + '\n' + self.vcf_file.write(reference.encode('utf-8')) + self.vcf_file.write('##INFO=\n'.encode('utf-8')) + self.vcf_file.write( + '##INFO=\n'.encode('utf-8')) + self.vcf_file.write( + '##INFO=\n'.encode( + 'utf-8')) + self.vcf_file.write( + '##INFO=\n'.encode( + 'utf-8')) + self.vcf_file.write( + '##INFO=\n'.encode('utf-8')) + self.vcf_file.write( + '##INFO=\n'.encode( + 'utf-8')) + self.vcf_file.write('##ALT=\n'.encode('utf-8')) + self.vcf_file.write('##ALT=\n'.encode('utf-8')) + self.vcf_file.write('##ALT=\n'.encode('utf-8')) + self.vcf_file.write('##ALT=\n'.encode('utf-8')) + self.vcf_file.write('##ALT=\n'.encode('utf-8')) + self.vcf_file.write('##ALT=\n'.encode('utf-8')) + self.vcf_file.write('##ALT=\n'.encode('utf-8')) + # TODO add sample to vcf output + self.vcf_file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'.encode('utf-8')) + + # BAM OUTPUT + self.bam_file = None + if bam_header is not None: + self.bam_file = bgzf.BgzfWriter(bam, 'w', compresslevel=BAM_COMPRESSION_LEVEL) + + # WRITE BAM HEADER + self.bam_file.write("BAM\1") + header = '@HD\tVN:1.5\tSO:coordinate\n' + for n in bam_header[0]: + header += '@SQ\tSN:' + n[0] + '\tLN:' + str(n[3]) + '\n' + header += '@RG\tID:NEAT\tSM:NEAT\tLB:NEAT\tPL:NEAT\n' + header_bytes = len(header) + num_refs = len(bam_header[0]) + self.bam_file.write(pack('' + read_name + '/1\n' + str(read1) + '\n') + if read2 is not None: + self.fq2_buffer.append('>' + read_name + '/2\n' + str(read2) + '\n') + else: + self.fq1_buffer.append('@' + read_name + '/1\n' + str(read1) + '\n+\n' + quality1 + '\n') + if read2 is not None: + self.fq2_buffer.append('@' + read_name + '/2\n' + str(read2) + '\n+\n' + quality2 + '\n') + + def write_vcf_record(self, chrom, pos, id_str, ref, alt, qual, filt, info): + self.vcf_file.write( + str(chrom) + '\t' + str(pos) + '\t' + str(id_str) + '\t' + str(ref) + '\t' + str(alt) + '\t' + str( + qual) + '\t' + str(filt) + '\t' + str(info) + '\n') + + def write_bam_record(self, ref_id, read_name, pos_0, cigar, seq, qual, output_sam_flag, + mate_pos=None, aln_map_quality=70): + + my_bin = reg2bin(pos_0, pos_0 + len(seq)) + # my_bin = 0 # or just use a dummy value, does this actually matter? 
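For context on the bin computed in write_bam_record below: reg2bin walks the standard SAM binning levels, so a region contained in a single 16 kb window lands in one of the finest bins (4681 + (beg >> 14)), while a region straddling a 16 kb boundary falls back to a coarser level. A quick check, again assuming the package is importable:

```
from source.output_file_writer import reg2bin

print(reg2bin(0, 100))        # 4681  (fits in the first 16 kb window)
print(reg2bin(16384, 16484))  # 4682  (fits in the second 16 kb window)
print(reg2bin(0, 20000))      # 585   (straddles a 16 kb boundary, falls back to the 128 kb level)
```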
+ + my_map_quality = aln_map_quality + cigar_string = CigarString.list_to_string(cigar) + cig_letters = re.split(r"\d+", cigar_string)[1:] + cig_numbers = [int(n) for n in re.findall(r"\d+", cigar_string)] + cig_ops = len(cig_letters) + next_ref_id = ref_id + if mate_pos is None: + next_pos = 0 + my_t_len = 0 + else: + next_pos = mate_pos + if next_pos > pos_0: + my_t_len = next_pos - pos_0 + len(seq) + else: + my_t_len = next_pos - pos_0 - len(seq) + + encoded_cig = bytearray() + for i in range(cig_ops): + encoded_cig.extend(pack('= BUFFER_BATCH_SIZE or len(self.bam_buffer) >= BUFFER_BATCH_SIZE) or ( + len(self.fq1_buffer) and last_time) or (len(self.bam_buffer) and last_time): + # fq + if not self.no_fastq: + self.fq1_file.write(''.join(self.fq1_buffer)) + if len(self.fq2_buffer): + self.fq2_file.write(''.join(self.fq2_buffer)) + # bam + if len(self.bam_buffer): + bam_data = sorted(self.bam_buffer) + if last_time: + self.bam_file.write(b''.join([n[2] for n in bam_data])) + self.bam_buffer = [] + else: + ind_to_stop_at = 0 + for i in range(0, len(bam_data)): + # if we are from previous reference, or have coordinates lower + # than next window position, it's safe to write out to file + if bam_data[i][0] != bam_data[-1][0] or bam_data[i][1] < bam_max: + ind_to_stop_at = i + 1 + else: + break + self.bam_file.write(b''.join([n[2] for n in bam_data[:ind_to_stop_at]])) + # Debug statement + # print(f'BAM WRITING: {ind_to_stop_at}/{len(bam_data)}') + if ind_to_stop_at >= len(bam_data): + self.bam_buffer = [] + else: + self.bam_buffer = bam_data[ind_to_stop_at:] + self.fq1_buffer = [] + self.fq2_buffer = [] + + def close_files(self): + self.flush_buffers(last_time=True) + if not self.no_fastq: + self.fq1_file.close() + if self.fq2_file is not None: + self.fq2_file.close() + if self.vcf_file is not None: + self.vcf_file.close() + if self.bam_file is not None: + self.bam_file.close() diff --git a/source/probability.py b/source/probability.py new file mode 100755 index 0000000..12cc076 --- /dev/null +++ b/source/probability.py @@ -0,0 +1,160 @@ +import random +import bisect +import copy +import sys +from typing import Union + +import numpy as np + +LOW_PROB_THRESH = 1e-12 + + +def mean_ind_of_weighted_list(candidate_list: list) -> int: + """ + Returns the index of the mean of a weighted list + + :param candidate_list: weighted list + :return: index of mean + """ + my_mid = sum(candidate_list) / 2.0 + my_sum = 0.0 + for i in range(len(candidate_list)): + my_sum += candidate_list[i] + if my_sum >= my_mid: + return i + + +class DiscreteDistribution: + def __init__(self, weights, values, degenerate_val=None, method='bisect'): + + # some sanity checking + if not len(weights) or not len(values): + print('\nError: weight or value vector given to DiscreteDistribution() are 0-length.\n') + sys.exit(1) + + self.method = method + sum_weight = float(sum(weights)) + + # if probability of all input events is 0, consider it degenerate and always return the first value + if sum_weight < LOW_PROB_THRESH: + self.degenerate = values[0] + else: + self.weights = [n / sum_weight for n in weights] + # TODO This line is slowing things down and seems unnecessary. Are these "values + # possibly some thing from another class? 
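For orientation, a minimal usage sketch of the class being defined here, with invented weights and values: under the default 'bisect' method, sample() returns each value with probability proportional to its weight.

```
import random
from source.probability import DiscreteDistribution

random.seed(0)
dist = DiscreteDistribution(weights=[0.1, 0.2, 0.7], values=['A', 'B', 'C'])
draws = [dist.sample() for _ in range(10000)]
print(draws.count('C') / len(draws))   # roughly 0.7
```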
+ self.values = copy.deepcopy(values) + if len(self.values) != len(self.weights): + print('\nError: length and weights and values vectors must be the same.\n') + exit(1) + self.degenerate = degenerate_val + + if self.method == 'alias': + len_weights = len(self.weights) + prob_vector = np.zeros(len_weights) + count_vector = np.zeros(len_weights, dtype=np.int) + smaller = [] + larger = [] + for kk, prob in enumerate(self.weights): + prob_vector[kk] = len_weights * prob + if prob_vector[kk] < 1.0: + smaller.append(kk) + else: + larger.append(kk) + while len(smaller) > 0 and len(larger) > 0: + small = smaller.pop() + large = larger.pop() + count_vector[small] = large + prob_vector[large] = (prob_vector[large] + prob_vector[small]) - 1.0 + if prob_vector[large] < 1.0: + smaller.append(large) + else: + larger.append(large) + + self.a1 = len(count_vector) - 1 + self.a2 = count_vector.tolist() + self.a3 = prob_vector.tolist() + + elif self.method == 'bisect': + self.cum_prob = np.cumsum(self.weights).tolist()[:-1] + self.cum_prob.insert(0, 0.) + + else: + print("\nUnknown discreet distribution method.\n") + + def __str__(self): + return str(self.weights) + ' ' + str(self.values) + ' ' + self.method + + def sample(self) -> Union[int, float]: + """ + This is one of the slowest parts of the code. Or it just gets hit the most times. Will need + to investigate at some point. + :return: Since this function is selecting an item from a list, and the list could theoretically be anything, + then in a broad sense this function returns a list item or a generic object. But I'm fairly confident that most + of these uses will be lists of ints or floats, but will investigate further + """ + + if self.degenerate is not None: + return self.degenerate + + else: + + if self.method == 'alias': + random1 = random.randint(0, self.a1) + random2 = random.random() + if random2 < self.a3[random1]: + return self.values[random1] + else: + return self.values[self.a2[random1]] + + elif self.method == 'bisect': + r = random.random() + return self.values[bisect.bisect(self.cum_prob, r) - 1] + + +# takes k_range, lambda, [0,1,2,..], returns a DiscreteDistribution object +# with the corresponding to a poisson distribution + +def poisson_list(k_range, input_lambda): + min_weight = 1e-12 + if input_lambda < min_weight: + return DiscreteDistribution([1], [0], degenerate_val=0) + log_factorial_list = [0.0] + for k in k_range[1:]: + log_factorial_list.append(np.log(float(k)) + log_factorial_list[k - 1]) + w_range = [np.exp(k * np.log(input_lambda) - input_lambda - log_factorial_list[k]) for k in k_range] + w_range = [n for n in w_range if n >= min_weight] + if len(w_range) <= 1: + return DiscreteDistribution([1], [0], degenerate_val=0) + return DiscreteDistribution(w_range, k_range[:len(w_range)]) + + +# quantize a list of values into blocks +def quantize_list(list_to_quantize): + min_prob = 1e-12 + quant_blocks = 10 + sum_list = float(sum(list_to_quantize)) + sorted_list = sorted([n for n in list_to_quantize if n >= min_prob * sum_list]) + if len(sorted_list) == 0: + return None + qi = [] + for i in range(quant_blocks): + # qi.append(sorted_list[int((i)*(len(sorted_list)/float(quant_blocks)))]) + qi.append(sorted_list[0] + (i / float(quant_blocks)) * (sorted_list[-1] - sorted_list[0])) + qi.append(1e12) + running_list = [] + prev_bi = None + prev_i = None + for i in range(len(list_to_quantize)): + if list_to_quantize[i] >= min_prob * sum_list: + bi = bisect.bisect(qi, list_to_quantize[i]) + # print i, l[i], qi[bi-1] + if prev_bi is not 
None: + if bi == prev_bi and prev_i == i - 1: + running_list[-1][1] += 1 + else: + running_list.append([i, i, qi[bi - 1]]) + else: + running_list.append([i, i, qi[bi - 1]]) + prev_bi = bi + prev_i = i + return running_list diff --git a/source/ref_func.py b/source/ref_func.py new file mode 100755 index 0000000..1c817d2 --- /dev/null +++ b/source/ref_func.py @@ -0,0 +1,251 @@ +import sys +import time +import os +import gzip +import pathlib +import random +from Bio.Seq import Seq +from Bio import SeqIO + +OK_CHR_ORD = {'A': True, 'C': True, 'G': True, 'T': True, 'U': True} +ALLOWED_NUCL = ['A', 'C', 'G', 'T'] + + +def index_ref(reference_path: str) -> list: + """ + Index reference fasta + :param reference_path: string path to the reference + :return: reference index in list from + """ + tt = time.time() + + absolute_reference_location = pathlib.Path(reference_path) + + # sanity check + if not absolute_reference_location.is_file(): + print("\nProblem reading the reference fasta file.\n") + sys.exit(1) + + index_filename = None + + # check if the reference file already exists + if absolute_reference_location.with_suffix('.fai').is_file(): + print('found index ' + str(absolute_reference_location.with_suffix('.fai'))) + index_filename = absolute_reference_location.with_suffix('.fai') + elif absolute_reference_location.with_suffix(absolute_reference_location.suffix + '.fai').is_file(): + print('found index ' + + str(absolute_reference_location.with_suffix(absolute_reference_location.suffix + '.fai'))) + index_filename = absolute_reference_location.with_suffix(absolute_reference_location.suffix + '.fai') + else: + pass + + ref_indices = [] + if index_filename is not None: + fai = open(index_filename, 'r') + for line in fai: + splt = line[:-1].split('\t') + # Defined as the number of bases in the contig + seq_len = int(splt[1]) + # Defined as the byte index where the contig sequence begins + offset = int(splt[2]) + # Defined as bases per line in the Fasta file + line_ln = int(splt[3]) + n_lines = seq_len // line_ln + if seq_len % line_ln != 0: + n_lines += 1 + # Item 3 in this gives you the byte position of the next contig, I believe + ref_indices.append((splt[0], offset, offset + seq_len + n_lines, seq_len)) + fai.close() + return ref_indices + + print('Index not found, creating one... ') + if absolute_reference_location.suffix == ".gz": + ref_file = gzip.open(absolute_reference_location, 'rt') + else: + ref_file = open(absolute_reference_location, 'r') + prev_r = None + prev_p = None + seq_len = 0 + + while True: + data = ref_file.readline() + if not data: + ref_indices.append((prev_r, prev_p, ref_file.tell() - len(data), seq_len)) + break + elif data[0] == '>': + if prev_p is not None: + ref_indices.append((prev_r, prev_p, ref_file.tell() - len(data), seq_len)) + seq_len = 0 + prev_p = ref_file.tell() + prev_r = data[1:-1] + else: + seq_len += len(data) - 1 + ref_file.close() + + print('{0:.3f} (sec)'.format(time.time() - tt)) + return ref_indices + + +def read_ref(ref_path, ref_inds_i, n_handling, n_unknowns=True, quiet=False): + tt = time.time() + if not quiet: + print('reading ' + ref_inds_i[0] + '... 
') + + absolute_reference_path = pathlib.Path(ref_path) + if absolute_reference_path.suffix == '.gz': + ref_file = gzip.open(absolute_reference_path, 'rt') + else: + ref_file = open(absolute_reference_path, 'r') + + # TODO convert to SeqIO containers + # for seq_record in SeqIO.parse(ref_file, "fasta"): + # pass + + + ref_file.seek(ref_inds_i[1]) + my_dat = ''.join(ref_file.read(ref_inds_i[2] - ref_inds_i[1]).split('\n')) + my_dat = Seq(my_dat.upper()) + # Mutable seqs have a number of disadvantages. I'm going to try making them immutable and see if that helps + # my_dat = my_dat.tomutable() + + # find N regions + # data explanation: my_dat[n_atlas[0][0]:n_atlas[0][1]] = solid block of Ns + prev_ni = 0 + n_count = 0 + n_atlas = [] + for i in range(len(my_dat)): + if my_dat[i] == 'N' or (n_unknowns and my_dat[i] not in OK_CHR_ORD): + if n_count == 0: + prev_ni = i + n_count += 1 + if i == len(my_dat) - 1: + n_atlas.append((prev_ni, prev_ni + n_count)) + else: + if n_count > 0: + n_atlas.append((prev_ni, prev_ni + n_count)) + n_count = 0 + + # handle N base-calls as desired + # TODO this seems to randomly replace an N with a base. Is this necessary? How to do this in an immutable seq? + n_info = {'all': [], 'big': [], 'non_N': []} + if n_handling[0] == 'random': + for region in n_atlas: + n_info['all'].extend(region) + if region[1] - region[0] <= n_handling[1]: + for i in range(region[0], region[1]): + temp = my_dat.tomutable() + temp[i] = random.choice(ALLOWED_NUCL) + my_dat = temp.toseq() + else: + n_info['big'].extend(region) + elif n_handling[0] == 'allChr' and n_handling[2] in OK_CHR_ORD: + for region in n_atlas: + n_info['all'].extend(region) + if region[1] - region[0] <= n_handling[1]: + for i in range(region[0], region[1]): + temp = my_dat.tomutable() + temp[i] = n_handling[2] + my_dat = temp.toseq() + else: + n_info['big'].extend(region) + elif n_handling[0] == 'ignore': + for region in n_atlas: + n_info['all'].extend(region) + n_info['big'].extend(region) + else: + print('\nERROR: UNKNOWN N_HANDLING MODE\n') + sys.exit(1) + + habitable_regions = [] + if not n_info['big']: + n_info['non_N'] = [(0, len(my_dat))] + else: + for i in range(0, len(n_info['big']), 2): + if i == 0: + habitable_regions.append((0, n_info['big'][0])) + else: + habitable_regions.append((n_info['big'][i - 1], n_info['big'][i])) + habitable_regions.append((n_info['big'][-1], len(my_dat))) + for n in habitable_regions: + if n[0] != n[1]: + n_info['non_N'].append(n) + + ref_file.close() + + if not quiet: + print('{0:.3f} (sec)'.format(time.time() - tt)) + + return my_dat, n_info + + +def get_all_ref_regions(ref_path, ref_inds, n_handling, save_output=False): + """ + Find all non-N regions in reference sequence ahead of time, for computing jobs in parallel + + :param ref_path: + :param ref_inds: + :param n_handling: + :param save_output: + :return: + """ + out_regions = {} + fn = ref_path + '.nnr' + if os.path.isfile(fn) and not (save_output): + print('found list of preidentified non-N regions...') + f = open(fn, 'r') + for line in f: + splt = line.strip().split('\t') + if splt[0] not in out_regions: + out_regions[splt[0]] = [] + out_regions[splt[0]].append((int(splt[1]), int(splt[2]))) + f.close() + return out_regions + else: + print('enumerating all non-N regions in reference sequence...') + for RI in range(len(ref_inds)): + (ref_sequence, N_regions) = read_ref(ref_path, ref_inds[RI], n_handling, quiet=True) + ref_name = ref_inds[RI][0] + out_regions[ref_name] = [n for n in N_regions['non_N']] + if save_output: + 
f = open(fn, 'w') + for k in out_regions.keys(): + for n in out_regions[k]: + f.write(k + '\t' + str(n[0]) + '\t' + str(n[1]) + '\n') + f.close() + return out_regions + + +def partition_ref_regions(in_regions, ref_inds, my_job, n_jobs): + """ + Find which of the non-N regions are going to be used for this job + + :param in_regions: + :param ref_inds: + :param my_job: + :param n_jobs: + :return: + """ + tot_size = 0 + for RI in range(len(ref_inds)): + ref_name = ref_inds[RI][0] + for region in in_regions[ref_name]: + tot_size += region[1] - region[0] + size_per_job = int(tot_size / float(n_jobs) - 0.5) + + regions_per_job = [[] for n in range(n_jobs)] + refs_per_job = [{} for n in range(n_jobs)] + current_ind = 0 + current_count = 0 + for RI in range(len(ref_inds)): + ref_name = ref_inds[RI][0] + for region in in_regions[ref_name]: + regions_per_job[current_ind].append((ref_name, region[0], region[1])) + refs_per_job[current_ind][ref_name] = True + current_count += region[1] - region[0] + if current_count >= size_per_job: + current_count = 0 + current_ind = min([current_ind + 1, n_jobs - 1]) + + relevant_refs = refs_per_job[my_job - 1].keys() + relevant_regs = regions_per_job[my_job - 1] + return relevant_refs, relevant_regs diff --git a/source/test_write.py b/source/test_write.py new file mode 100755 index 0000000..aaabdf7 --- /dev/null +++ b/source/test_write.py @@ -0,0 +1,98 @@ +#!/bin/bash/python + +import pathlib +import gzip +from timeit import default_timer as timer + + +class OutputFileWriter1: + def __init__(self, out_prefix, gzipped, fasta): + start = timer() + self.buffer = [] + + if gzipped: + file = pathlib.Path(out_prefix + '.fasta.gz') + else: + file = pathlib.Path(out_prefix + '.fasta') + + if gzipped: + self.file = gzip.open(file, 'wb') + else: + self.file = open(file, 'w') + + fasta_path = pathlib.Path(fasta) + + for i in range(100): + with open(fasta_path, 'r') as file: + for line in file: + self.file.write(line) + + self.file.close() + + end = timer() + + print("It took {} seconds!".format(end-start)) + + +class OutputFileWriter2: + def __init__(self, out_prefix, gzipped, fasta): + start = timer() + + if gzipped: + self.file = pathlib.Path(out_prefix + '.fasta.gz') + else: + self.file = pathlib.Path(out_prefix + '.fasta') + + lines = [] + fasta_path = pathlib.Path(fasta) + for i in range(100): + with open(fasta_path, 'r') as file: + for line in file: + lines.append(line) + if i % 10 == 0: + if gzipped: + with gzip.open(self.file, 'wb') as f: + f.write("\n".join(lines)) + lines = [] + else: + with open(self.file, 'w') as f: + f.write("\n".join(lines)) + lines = [] + + end = timer() + + print("It took {} seconds!".format(end - start)) + + +class OutputFileWriter3: + def __init__(self, out_prefix, gzipped, fasta): + start = timer() + + if gzipped: + self.file = pathlib.Path(out_prefix + '.fasta.gz') + else: + self.file = pathlib.Path(out_prefix + '.fasta') + + lines = [] + fasta_path = pathlib.Path(fasta) + for i in range(100): + with open(fasta_path, 'r') as file: + for line in file: + lines.append(line) + + if gzipped: + with gzip.open(self.file, 'wb') as f: + f.write("\n".join(lines)) + else: + with open(self.file, 'w') as f: + f.write("\n".join(lines)) + + end = timer() + + print("It took {} seconds!".format(end - start)) + + +fasta = '/home/joshfactorial/Documents/neat_data/chr21.fasta' +OutputFileWriter1('test1', False, fasta) +OutputFileWriter2('test2', False, fasta) +# OutputFileWriter3('test3', False, fasta) diff --git a/source/vcf_func.py b/source/vcf_func.py new 
file mode 100755 index 0000000..ba233b2 --- /dev/null +++ b/source/vcf_func.py @@ -0,0 +1,190 @@ +import sys +import time +import os +import re +import random + +def parse_line(vcf_line, col_dict, col_samp): + # these were in the original. Not sure the point other than debugging. + include_homs = False + include_fail = False + + # check if we want to proceed... + reference_allele = vcf_line[col_dict['REF']] + alternate_allele = vcf_line[col_dict['ALT']] + # enough columns? + if len(vcf_line) != len(col_dict): + return None + # exclude homs / filtered? + if not include_homs and alternate_allele == '.' or alternate_allele == '' or alternate_allele == reference_allele: + return None + if not include_fail and vcf_line[col_dict['FILTER']] != 'PASS' and vcf_line[col_dict['FILTER']] != '.': + return None + + # default vals + alt_alleles = [alternate_allele] + alt_freqs = [] + + gt_per_samp = [] + + # any alt alleles? + alt_split = alternate_allele.split(',') + if len(alt_split) > 1: + alt_alleles = alt_split + + # check INFO for AF + af = None + if 'INFO' in col_dict and ';AF=' in ';' + vcf_line[col_dict['INFO']]: + info = vcf_line[col_dict['INFO']] + ';' + af = re.findall(r"AF=.*?(?=;)", info)[0][3:] + if af is not None: + af_splt = af.split(',') + while (len(af_splt) < len(alt_alleles)): # are we lacking enough AF values for some reason? + af_splt.append(af_splt[-1]) # phone it in. + if len(af_splt) != 0 and af_splt[0] != '.' and af_splt[0] != '': # missing data, yay + alt_freqs = [float(n) for n in af_splt] + else: + alt_freqs = [None] * max([len(alt_alleles), 1]) + + gt_per_samp = None + # if available (i.e. we simulated it) look for WP in info + if len(col_samp) == 0 and 'INFO' in col_dict and 'WP=' in vcf_line[col_dict['INFO']]: + info = vcf_line[col_dict['INFO']] + ';' + gt_per_samp = [re.findall(r"WP=.*?(?=;)", info)[0][3:]] + else: + # if no sample columns, check info for GT + if len(col_samp) == 0 and 'INFO' in col_dict and 'GT=' in vcf_line[col_dict['INFO']]: + info = vcf_line[col_dict['INFO']] + ';' + gt_per_samp = [re.findall(r"GT=.*?(?=;)", info)[0][3:]] + elif len(col_samp): + fmt = ':' + vcf_line[col_dict['FORMAT']] + ':' + if ':GT:' in fmt: + gt_ind = fmt.split(':').index('GT') + gt_per_samp = [vcf_line[col_samp[iii]].split(':')[gt_ind - 1] for iii in range(len(col_samp))] + for i in range(len(gt_per_samp)): + gt_per_samp[i] = gt_per_samp[i].replace('.', '0') + if gt_per_samp is None: + gt_per_samp = [None] * max([len(col_samp), 1]) + + return alt_alleles, alt_freqs, gt_per_samp + + +def parse_vcf(vcf_path, tumor_normal=False, ploidy=2): + # this var was in the orig. May have just been a debugging thing. + # I think this is trying to implement a check on GT + choose_random_ploid_if_no_gt_found = True + + tt = time.time() + print('--------------------------------') + print('reading input VCF...\n', flush=True) + + col_dict = {} + col_samp = [] + n_skipped = 0 + n_skipped_because_hash = 0 + all_vars = {} # [ref][pos] + samp_names = [] + printed_warning = False + f = open(vcf_path, 'r') + for line in f: + + if line[0] != '#': + if len(col_dict) == 0: + print('\n\nERROR: VCF has no header?\n' + vcf_path + '\n\n') + f.close() + exit(1) + splt = line.strip().split('\t') + pl_out = parse_line(splt, col_dict, col_samp) + if pl_out is None: + n_skipped += 1 + else: + (aa, af, gt) = pl_out + + # make sure at least one allele somewhere contains the variant + if tumor_normal: + gt_eval = gt[:2] + else: + gt_eval = gt[:1] + # For some reason this had an additional "if True" inserted. 
I guess it was supposed to be an option + # the user could set but was never implemented. + if None in gt_eval: + if choose_random_ploid_if_no_gt_found: + if not printed_warning: + print('Warning: Found variants without a GT field, assuming heterozygous...') + printed_warning = True + for i in range(len(gt_eval)): + tmp = ['0'] * ploidy + tmp[random.randint(0, ploidy - 1)] = '1' + gt_eval[i] = '/'.join(tmp) + else: + # skip because no GT field was found + n_skipped += 1 + continue + non_reference = False + for gtVal in gt_eval: + if gtVal is not None: + if '1' in gtVal: + non_reference = True + if not non_reference: + # skip if no genotype actually contains this variant + n_skipped += 1 + continue + + chrom = splt[0] + pos = int(splt[1]) + ref = splt[3] + # skip if position is <= 0 + if pos <= 0: + n_skipped += 1 + continue + + # hash variants to avoid inserting duplicates (there are some messy VCFs out there...) + if chrom not in all_vars: + all_vars[chrom] = {} + if pos not in all_vars[chrom]: + all_vars[chrom][pos] = (pos, ref, aa, af, gt_eval) + else: + n_skipped_because_hash += 1 + + else: + if line[1] != '#': + cols = line[1:-1].split('\t') + for i in range(len(cols)): + if 'FORMAT' in col_dict: + col_samp.append(i) + col_dict[cols[i]] = i + if len(col_samp): + samp_names = cols[-len(col_samp):] + if len(col_samp) == 1: + pass + elif len(col_samp) == 2 and tumor_normal: + print('Detected 2 sample columns in input VCF, assuming tumor/normal.') + else: + print( + 'Warning: Multiple sample columns present in input VCF. By default genReads uses ' + 'only the first column.') + else: + samp_names = ['Unknown'] + if tumor_normal: + # tumorInd = samp_names.index('TUMOR') + # normalInd = samp_names.index('NORMAL') + if 'NORMAL' not in samp_names or 'TUMOR' not in samp_names: + print('\n\nERROR: Input VCF must have a "NORMAL" and "TUMOR" column.\n') + f.close() + + vars_out = {} + for r in all_vars.keys(): + vars_out[r] = [list(all_vars[r][k]) for k in sorted(all_vars[r].keys())] + # prune unnecessary sequence from ref/alt alleles + for i in range(len(vars_out[r])): + while len(vars_out[r][i][1]) > 1 and all([n[-1] == vars_out[r][i][1][-1] for n in vars_out[r][i][2]]) \ + and all([len(n) > 1 for n in vars_out[r][i][2]]): + vars_out[r][i][1] = vars_out[r][i][1][:-1] + vars_out[r][i][2] = [n[:-1] for n in vars_out[r][i][2]] + vars_out[r][i] = tuple(vars_out[r][i]) + + print('found', sum([len(n) for n in all_vars.values()]), 'valid variants in input vcf.') + print(' *', n_skipped, 'variants skipped: (qual filtered / ref genotypes / invalid syntax)') + print(' *', n_skipped_because_hash, 'variants skipped due to multiple variants found per position') + print('--------------------------------') + return samp_names, vars_out diff --git a/utilities/README.md b/utilities/README.md old mode 100644 new mode 100755 index 1864d6c..590bfe1 --- a/utilities/README.md +++ b/utilities/README.md @@ -1,29 +1,47 @@ -# computeGC.py +# compute_gc.py Takes .genomecov files produced by BEDtools genomeCov (with -d option). 
``` bedtools genomecov - -d \ - -ibam normal.bam \ + -d \ + -ibam normal.bam \ -g reference.fa ``` ``` -python computeGC.py \ - -r reference.fa \ - -i genomecovfile \ - -w [sliding window length] \ - -o /path/to/model.p +python computeGC.py \ + -r reference.fasta \ + -i genomecov \ + -w [sliding window length] \ + -o /path/to/output ``` -# computeFraglen.py +The main function in this file processes the inputs (reference.fasta, genome.cov, window length), and outputs a GC count for the sequence in the form of a pickle file at the location and with the name from the path the user provides with the -o command. -Takes SAM file via stdin: -./samtools view toy.bam | python computeFraglen.py -and creates fraglen.p model in working directory. +# compute_fraglen.py + +Takes SAM or BAM files and uses console commands for processing: + +``` +python computeGC.py \ + -i path to sam file \ + -o path/to/output +``` + +The main function in this file will save a pickle (.p) in the location and with the name from the path the user provides with the -o command. + +**Please be aware that pysam is not usable on Windows, so any BAM file will need to be turned into a SAM file using samtools beforehand.** + +To convert a BAM file to a SAM file using samtools, use the following command: + +``` +samtools view nameof.bam > nameof.sam +``` + + # genMutModel.py diff --git a/utilities/__init__.py b/utilities/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/utilities/computeFraglen.py b/utilities/computeFraglen.py deleted file mode 100644 index 21a691b..0000000 --- a/utilities/computeFraglen.py +++ /dev/null @@ -1,88 +0,0 @@ -# -# -# Compute Fragment Length Model for genReads.py -# computeFraglen.py -# -# -# Usage: samtools view normal.bam | python computeFraglen.py -# -# - -import sys -import fileinput -import cPickle as pickle -import numpy as np - -FILTER_MAPQUAL = 10 # only consider reads that are mapped with at least this mapping quality -FILTER_MINREADS = 100 # only consider fragment lengths that have at least this many read pairs supporting it -FILTER_MEDDEV_M = 10 # only consider fragment lengths this many median deviations above the median - -def quick_median(countDict): - midPoint = sum(countDict.values())/2 - mySum = 0 - myInd = 0 - sk = sorted(countDict.keys()) - while mySum < midPoint: - mySum += countDict[sk[myInd]] - if mySum >= midPoint: - break - myInd += 1 - return myInd - -def median_deviation_from_median(countDict): - myMedian = quick_median(countDict) - deviations = {} - for k in sorted(countDict.keys()): - d = abs(k-myMedian) - deviations[d] = countDict[k] - return quick_median(deviations) - -if len(sys.argv) != 1: - print "Usage: samtools view normal.bam | python computeFraglen.py" - exit(1) - -all_tlens = {} -PRINT_EVERY = 100000 -BREAK_AFTER = 1000000 -i = 0 -for line in fileinput.input(): - splt = line.strip().split('\t') - samFlag = int(splt[1]) - myRef = splt[2] - mapQual = int(splt[4]) - mateRef = splt[6] - myTlen = abs(int(splt[8])) - - if samFlag&1 and samFlag&64 and mapQual > FILTER_MAPQUAL: # if read is paired, and is first in pair, and is confidently mapped... 
- if mateRef == '=' or mateRef == myRef: # and mate is mapped to same reference - if myTlen not in all_tlens: - all_tlens[myTlen] = 0 - all_tlens[myTlen] += 1 - i += 1 - if i%PRINT_EVERY == 0: - print '---',i, quick_median(all_tlens), median_deviation_from_median(all_tlens) - #for k in sorted(all_tlens.keys()): - # print k, all_tlens[k] - - #if i > BREAK_AFTER: - # break - - -med = quick_median(all_tlens) -mdm = median_deviation_from_median(all_tlens) - -outVals = [] -outProbs = [] -for k in sorted(all_tlens.keys()): - if k > 0 and k < med + FILTER_MEDDEV_M * mdm: - if all_tlens[k] >= FILTER_MINREADS: - print k, all_tlens[k] - outVals.append(k) - outProbs.append(all_tlens[k]) -countSum = float(sum(outProbs)) -outProbs = [n/countSum for n in outProbs] - -print '\nsaving model...' -pickle.dump([outVals, outProbs],open('fraglen.p','wb')) - - diff --git a/utilities/computeGC.py b/utilities/computeGC.py deleted file mode 100644 index ea7ea8e..0000000 --- a/utilities/computeGC.py +++ /dev/null @@ -1,115 +0,0 @@ -# -# -# computeGC.py -# Compute GC and coverage model for genReads.py -# -# Takes output file from bedtools genomecov to generate GC/coverage model -# -# Usage: bedtools genomecov -d -ibam input.bam -g reference.fa > genomeCov.dat -# python computeGC.py -r reference.fa -i genomeCov.dat -W [sliding window length] -o output_name.p -# -# -# - -import time -import sys -import argparse -import numpy as np -import cPickle as pickle - -parser = argparse.ArgumentParser(description='computeGC.py') -parser.add_argument('-i', type=str, required=True, metavar='', help="* input.genomecov") -parser.add_argument('-r', type=str, required=True, metavar='', help="* reference.fa") -parser.add_argument('-o', type=str, required=True, metavar='', help="* output.p") -parser.add_argument('-w', type=int, required=False, metavar='', help="sliding window length [50]", default=50) -args = parser.parse_args() - -(IN_GCB, REF_FILE, WINDOW_SIZE, OUT_P) = (args.i, args.r, args.w, args.o) - -GC_BINS = {n:[] for n in range(WINDOW_SIZE+1)} - -print 'reading ref...' -allRefs = {} -f = open(REF_FILE,'r') -for line in f: - if line[0] == '>': - refName = line.strip()[1:] - allRefs[refName] = [] - print refName - #if refName == 'chr2': - # break - else: - allRefs[refName].append(line.strip()) -f.close() - -print 'capitalizing ref...' -for k in sorted(allRefs.keys()): - print k - allRefs[k] = ''.join(allRefs[k]) - allRefs[k] = allRefs[k].upper() - -print 'reading genomecov file...' 
-tt = time.time() -f = open(IN_GCB,'r') -currentLine = 0 -currentRef = None -currentCov = 0 -linesProcessed = 0 -PRINT_EVERY = 1000000 -STOP_AFTER = 1000000 -for line in f: - splt = line.strip().split('\t') - if linesProcessed%PRINT_EVERY == 0: - print linesProcessed - linesProcessed += 1 - - #if linesProcessed > STOP_AFTER: - # break - - if currentLine == 0: - currentRef = splt[0] - sPos = int(splt[1])-1 - - if currentRef not in allRefs: - continue - - currentLine += 1 - currentCov += float(splt[2]) - - if currentLine == WINDOW_SIZE: - currentLine = 0 - seq = allRefs[currentRef][sPos:sPos+WINDOW_SIZE] - if 'N' not in seq: - gc_count = seq.count('G') + seq.count('C') - GC_BINS[gc_count].append(currentCov) - currentCov = 0 - -f.close() - -runningTot = 0 -allMean = 0.0 -for k in sorted(GC_BINS.keys()): - if len(GC_BINS[k]) == 0: - print '{0:0.2%}'.format(k/float(WINDOW_SIZE)), 0.0, 0 - GC_BINS[k] = 0 - else: - myMean = np.mean(GC_BINS[k]) - myLen = len(GC_BINS[k]) - print '{0:0.2%}'.format(k/float(WINDOW_SIZE)), myMean, myLen - allMean += myMean * myLen - runningTot += myLen - GC_BINS[k] = myMean - -avgCov = allMean/float(runningTot) -print 'AVERAGE COVERAGE =',avgCov - -y_out = [] -for k in sorted(GC_BINS.keys()): - GC_BINS[k] /= avgCov - y_out.append(GC_BINS[k]) - -print 'saving model...' -pickle.dump([range(WINDOW_SIZE+1),y_out],open(OUT_P,'wb')) - -print time.time()-tt,'(sec)' - diff --git a/utilities/compute_fraglen.py b/utilities/compute_fraglen.py new file mode 100755 index 0000000..5eb6b9c --- /dev/null +++ b/utilities/compute_fraglen.py @@ -0,0 +1,170 @@ +# +# +# Compute Fragment Length Model for gen_reads.source +# compute_fraglen.source +# +# +# Usage: samtools view normal.bam | source compute_fraglen.source +# +# +# Upgraded 5/6/2020 to match Python 3 standards and refactored for easier reading + +import pickle +import argparse +import platform + +os = platform.system() +if os !='Windows': + import pysam + + +def median(datalist: list) -> float: + """ + Finds the median of a list of data. For this function, the data are expected to be a list of + numbers, either float or int. + :param datalist: the list of data to find the median of. This should be a set of numbers. + :return: The median of the set + >>> median([2]) + 2 + >>> median([2183, 2292, 4064, 4795, 7471, 12766, 14603, 15182, 16803, 18704, 21504, 21677, 23347, 23586, 24612, 24878, 25310, 25993, 26448, 28018, 28352, 28373, 28786, 30037, 31659, 31786, 33487, 33531, 34442, 39138, 39718, 39815, 41518, 41934, 43301]) + 25993 + >>> median([1,2,4,6,8,12,14,15,17,21]) + 10.0 + """ + # using integer division here gives the index of the midpoint, due to zero-based indexing. + midpoint = len(datalist)//2 + + # Once we've found the midpoint, we calculate the median, which is just the middle value if there are an + # odd number of values, or the average of the two middle values if there are an even number + if len(datalist) % 2 == 0: + median = (datalist[midpoint] + datalist[midpoint-1])/2 + else: + median = datalist[midpoint] + return median + + + +def median_absolute_deviation(datalist: list) -> float: + """ + Calculates the absolute value of the median deviation from the median for each element of of a datalist. + Then returns the median of these values. 
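+    The result is the median absolute deviation (MAD), a robust measure of spread that is less
+    sensitive to outlier fragment lengths than a standard deviation would be.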
+ :param datalist: A list of data to find the MAD of + :return: index of median of the deviations + >>> median_absolute_deviation([2183, 2292, 4064, 4795, 7471, 12766, 14603, 15182, 16803, 18704, 21504, 21677, 23347, 23586, 24612, 24878, 25310, 25993, 26448, 28018, 28352, 28373, 28786, 30037, 31659, 31786, 33487, 33531, 34442, 39138, 39718, 39815, 41518, 41934, 43301]) + 7494 + >>> median_absolute_deviation([1,2,4,6,8,12,14,15,17,21]) + 5.5 + >>> median_absolute_deviation([0,2]) + 1.0 + """ + my_median = median(datalist) + deviations = [] + for item in datalist: + # We take the absolute difference between the value and the median + X_value = abs(item - my_median) + # This creates a dataset that is the absolute deviations about the median + deviations.append(X_value) + # The median of the absolute deviations is the median absolute deviation + return median(sorted(deviations)) + + +def count_frags(file: str) -> list: + """ + Takes a sam or bam file input and creates a list of the number of reads that are paired, + first in the pair, confidently mapped and whose pair is mapped to the same reference + :param file: A sam input file + :return: A list of the tlens from the bam/sam file + """ + FILTER_MAPQUAL = 10 # only consider reads that are mapped with at least this mapping quality + count_list = [] + # Check if the file is sam or bam and decide how to open based on that + if file[-4:] == ".sam": + file_to_parse = open(file, 'r') + elif file[-4:] == ".bam": + print("WARNING: Must have pysam installed to read bam files. Pysam does not work on Windows OS.") + if os != 'Windows': + file_to_parse = pysam.AlignmentFile(file, 'rb') + else: + raise Exception("Your machine is running Windows. Please convert any BAM files to SAM files using samtools prior to input") + else: + print("Unknown file type, file extension must be bam or sam") + exit(1) + + for item in file_to_parse: + # Need to convert bam iterable objects into strings for the next part + line = str(item) + # Skip all comments and headers + if line[0] == '#' or line[0] == '@': + continue + splt = line.strip().split('\t') + sam_flag = int(splt[1]) + my_ref = splt[2] + map_qual = int(splt[4]) + mate_ref = splt[6] + my_tlen = abs(int(splt[8])) + # if read is paired, and is first in pair, and is confidently mapped... + if sam_flag & 1 and sam_flag & 64 and map_qual > FILTER_MAPQUAL: + # and mate is mapped to same reference + if mate_ref == '=' or mate_ref == my_ref: + count_list.append(my_tlen) + count_list = sorted(count_list) + file_to_parse.close() + return count_list + + +def compute_probs(datalist: list) -> (list, list): + """ + Computes the probabilities for fragments with at least 100 pairs supporting it and that are at least 10 median + deviations from the median. 
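+    More precisely, a fragment length is kept only if it is positive, at most FILTER_MEDDEV_M median
+    absolute deviations above the median, and supported by at least FILTER_MINREADS read pairs; the
+    surviving lengths and their normalized counts form the empirical fragment length distribution.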
+ :param datalist: A list of fragments with counts + :return: A list of values that meet the criteria and a list of their associated probabilities + """ + FILTER_MINREADS = 100 # only consider fragment lengths that have at least this many read pairs supporting it + FILTER_MEDDEV_M = 10 # only consider fragment lengths this many median deviations above the median + values = [] + probabilities = [] + med = median(datalist) + mad = median_absolute_deviation(datalist) + + for item in list(set(datalist)): + if 0 < item <= med + FILTER_MEDDEV_M * mad: + data_count = datalist.count(item) + if data_count >= FILTER_MINREADS: + values.append(item) + probabilities.append(data_count) + count_sum = float(sum(probabilities)) + probabilities = [n / count_sum for n in probabilities] + return values, probabilities + + +def main(): + """ + Main function takes 2 arguments: + input - a path to a sam or bam file input. Note that sam files can be formed by applying samtools to a bam file + in the follawing way: samtools view nameof.bam > nameof.sam + + output - the string prefix of the output. The actual output will be the prefix plus ".p" at the end + for pickle file. The list of values and list of probabilities are dumped as a list of lists + into a pickle file on completion of the analysis + + :return: None + """ + parser = argparse.ArgumentParser(description="compute_fraglen.source", + formatter_class=argparse.ArgumentDefaultsHelpFormatter,) + parser.add_argument('-i', type=str, metavar="input", required=True, default=None, + help="Sam file input (samtools view name.bam > name.sam)") + parser.add_argument('-o', type=str, metavar="output", required=True, default=None, help="Prefix for output") + + args = parser.parse_args() + input_file = args.i + output_prefix = args.o + output = output_prefix + '.p' + + all_tlens = count_frags(input_file) + print('\nSaving model...') + out_vals, out_probs = compute_probs(all_tlens) + pickle.dump([out_vals, out_probs], open(output, 'wb')) + print('\nModel successfully saved.') + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/utilities/compute_gc.py b/utilities/compute_gc.py new file mode 100755 index 0000000..697cdec --- /dev/null +++ b/utilities/compute_gc.py @@ -0,0 +1,161 @@ +# +# +# compute_gc.source +# Compute GC and coverage model for gen_reads.source +# +# Takes output file from bedtools genomecov to generate GC/coverage model +# +# Usage: bedtools genomecov -d -ibam input.bam -g reference.fa > genomeCov.dat +# source compute_gc.source -r reference.fa -i genomeCov.dat -w [sliding window length] -o output_name.p +# +# +# Updated to Python 3 standards + +import time +import argparse +import numpy as np +import pickle +from Bio import SeqIO + + +def process_fasta(file: str) -> dict: + """ + Takes a fasta file, converts it into a dictionary of upper case sequences. 
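+    Sequences are keyed by record ID, so these keys must match the chromosome names in the genomecov file.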
Does some basic error checking, + like the file is readable and the reference dictionary is not empty + :param file: path to a fasta file + :return: dictionary form of the sequences indexed by chromosome + """ + ref_dict = {} + + try: + # reads in fasta file, converts sequence to upper case + ref_dict = {rec.id: rec.seq.upper() for rec in SeqIO.parse(file, "fasta")} + except UnicodeDecodeError: + # if the file isn't readable, this exception should catch it + print("Input file incorrect: -r should specify the reference fasta") + exit(1) + + if not ref_dict: + # if the file was readable by SeqIO but wasn't a fasta file, this should catch it + print("Input file incorrect: -r should specify the reference fasta") + exit(1) + + return ref_dict + + +def process_genomecov(file: str, ref_dict: dict, window: int) -> dict: + """ + Takes a genomecov file and converts it into a dictionary made up of 'window' sized sections + that record the number of GCs and the coverage measure for each section. + :param file: path to a genomecov file + :param ref_dict: dictionary created from using the process_fasta function + :param window: Length of each section of base pairs to count in the reference dictionary + :return: dictionary form of genomecov file based on window size and ref_dict data + """ + gc_bins = {n: [] for n in range(window + 1)} + + # variables needed to parse coverage file + current_line = 0 + current_ref = None + current_cov = 0 + lines_processed = 0 + + f = open(file, 'r') + for line in f: + splt = line.strip().split('\t') + lines_processed += 1 + if current_line == 0: + current_ref = splt[0] + current_pos = int(splt[1]) - 1 + + if current_ref not in ref_dict: + continue + + current_line += 1 + current_cov += float(splt[2]) + + if current_line == window: + current_line = 0 + seq = str(ref_dict[current_ref][current_pos:current_pos + window]) + if 'N' not in seq: + gc_count = seq.count('G') + seq.count('C') + gc_bins[gc_count].append(current_cov) + current_cov = 0 + + f.close() + return gc_bins + + +def calculate_coverage(bin_dict: dict, window: int) -> float: + """ + Takes the dictionary created in process_genomecov and finds the average coverage value. + Also ouputs the average coverage value for each window, along with the number of entries in that window. + :param bin_dict: dictionary created from using the process_genomecov function + :param window: Length of each section of base pairs to count, + should be the same as the window value in process_genomecov + :return: Average coverage value for the whole sample, along with average coverage values for each window. + """ + running_total = 0 + all_mean = 0.0 + for k in sorted(bin_dict.keys()): + if len(bin_dict[k]) == 0: + print('{0:0.2%}'.format(k / float(window)), 0.0, 0) + bin_dict[k] = 0 + else: + my_mean = np.mean(bin_dict[k]) + my_len = len(bin_dict[k]) + print('{0:0.2%}'.format(k / float(window)), my_mean, my_len) + all_mean += my_mean * my_len + running_total += my_len + bin_dict[k] = my_mean + + return all_mean / float(running_total) + + +def main(): + """ + Reads in arguments and processes the inputs to a GC count for the sequence. + Parameters: + -i is the genome coverage input file (genomecov) + -r is the reference file (fasta) + -o is the prefix for the output + -w is the sliding window length. 
The default is 50, but you can declare any reasonable integer + :return: None + """ + parser = argparse.ArgumentParser(description='compute_gc.source', + formatter_class=argparse.ArgumentDefaultsHelpFormatter,) + parser.add_argument('-i', type=str, required=True, metavar='input', help="input.genomecov") + parser.add_argument('-r', type=str, required=True, metavar='reference', help="reference.fasta") + parser.add_argument('-o', type=str, required=True, metavar='output prefix', + help="prefix for output (/path/to/output)") + parser.add_argument('-w', type=int, required=False, metavar='sliding window', + help="sliding window length [50]", default=50) + args = parser.parse_args() + + (in_gcb, ref_file, window_size, out_p) = (args.i, args.r, args.w, args.o) + + print('Reading ref...') + allrefs = process_fasta(ref_file) + + tt = time.time() + print('Reading genome coverage file...') + gc_bins = process_genomecov(in_gcb, allrefs, window_size) + + print("Calculating average coverage...") + average_coverage = calculate_coverage(gc_bins, window_size) + + print('AVERAGE COVERAGE =', average_coverage) + + y_out = [] + for k in sorted(gc_bins.keys()): + gc_bins[k] /= average_coverage + y_out.append(gc_bins[k]) + + print('saving model...') + pickle.dump([range(window_size + 1), y_out], open(out_p, 'wb')) + + print(time.time() - tt, '(sec)') + + +if __name__ == "__main__": + main() diff --git a/utilities/deprecated/FindNucleotideContextOnReference.healthy.pl b/utilities/deprecated/FindNucleotideContextOnReference.healthy.pl deleted file mode 100755 index ee982cb..0000000 --- a/utilities/deprecated/FindNucleotideContextOnReference.healthy.pl +++ /dev/null @@ -1,508 +0,0 @@ -#!/usr/bin/perl - -use strict; -use Math::Round; - - -if ($#ARGV < 1) { - print "parameter mismatch\nTo run type this command:\nperl $0 fastahack reference input_pos_file output_file human_gff_file\n\n"; - - print " first argument = full path to fastahack\n"; - print " second argument = full path to reference genome\n"; - print " third argument = input file with arbitrary number of columns, but 1st col=chromosome name and 2nd col=position\n"; - print " fourth argument = output file with three columns: chromosome name, position of the center nucleotide, and the thre-nucleotide context for that position\n"; - print " fifth argument = full path to human gff file\n\n\n"; - exit 1; -} - - -my $Fastahack=$ARGV[0]; -my $Reference=$ARGV[1]; -open(InputPositions, '<', $ARGV[2]) || die("Could not open file!"); -open(OutputTrinucleotideContext, '>', $ARGV[3]) || die("Could not open file!"); -open(HumanGFF, '<', $ARGV[4]) || die("Could not open file!"); - - - -################ read in one coordinate at a time and execute fastahack on it - -# reading the header -my $head = ; -$head =~ s/\n|\r//; -print OutputTrinucleotideContext "$head\tContext\n"; -my $gffHead = ; -chomp $gffHead; - -# creating trinucleotide context data hash, insertion and deletion counts -my %trinucleotide_context_data; -my %context_tally_across_mutated_to; -my %gff_hash; -my $gffMatch; -my %location; -# my %genotype_hash; -my %insertion_hash; -my %deletion_hash; -my $insertion_total; -my $deletion_total; -my $zygotes_total; -my %annotation_hash; -my $annotation_total; -my %exonic_consequence_hash; -my $intronic; -my $exonic; -my $intergenic; - -# reading the positional information -my $line_count = 1; -while () { - $_ =~ s/\n|\r//; - #print "$_\n"; - my @line = split('\t', $_); - - # getting the chromosome and coordinate fields from input file - # fastahack will need to the 
chromosome and coordinate to read the information from the reference - my $chromosome = $line[0]; - my $coordinate = $line[1]; - - # get coordinates of first and last character in the context - my $start_region = $coordinate - 1; - my $end_region = $coordinate + 1; - - # if the coordinate is the very first letter on the chromosome, then do not read before that position - # the context becomes 2 letter code, as opposed to a trinucleotide - if ( $start_region == 0 ) { - $start_region = 1; - $end_region = 2; - } - - #print "$Fastahack -r $chromosome:$start_region..$end_region $Reference\n"; - my $context = `$Fastahack -r $chromosome:$start_region..$end_region $Reference`; - - # capitalize context letters - $context = uc($context); - - #### IF USING CONTROLLED DATA, split germline column into germline allele and mutated_to allele - # my @germline = split ('/', $line[6]); - - # if germline allele does not equal reference allele, print "start_region germline allele end_region" - # specifically, replace the middle letter of the context with the germline allele - #print "$germline[0], $germline[1]\n"; - # if ($germline[0] ne $germline[1]) { - # print "germline/reference mismatch, line number $line_count\n"; - # if ($coordinate != 1) { - # substr($context,1,1)= $germline[1]; - # } - # else { - # substr($context,0,1)= $germline[1]; - # } - # } - - print OutputTrinucleotideContext "$_\t$context"; - - - ############################### - # new section: forming the data structure - ############################### - - # to create N_N contexts for data structure, context_code is defined as the trinucleotide context with a blank middle allele - my $context_code=$context; - $context_code =~ s/\n|\r//; - substr($context_code,1,1) = "_"; - - # create variables for mutated_from and mutated_to nucleotides - my $mutated_from = $line[3]; - my $mutated_to = $line[4]; - - # creating genotype variable from column 10 of VCF - my $genotype = $line[9]; - - # incrementing each genotype - # $genotype_hash{$genotype} = $genotype_hash{$genotype} + 1; - - # splitting heterozygosity by comma, defining heterozygosity total - my @zygotes = split (',', $mutated_to); - my $zygotes_length = scalar(@zygotes); - - # identify heterozygosity, choose one at random to use. 
Count heterozygosity instances - if ($zygotes_length > 1) { - my $zygotesRand = $zygotes_length*rand(); - my $zygotesRound = round($zygotesRand) - 1; - $zygotes_total = $zygotes_total + 1; - - $mutated_to = $zygotes[$zygotesRound]; - # print "@zygotes\t$mutated_to\n"; - } - # print "@zygotes\t$mutated_to\n"; - - # my $round_rand_test = round(rand()); - # print $round_rand_test; - - # define length of insertions and deletions - my $insertion_length; - my $deletion_length; - if ($mutated_from eq "-") { - $insertion_length = length( $mutated_to ); - } - else { - $insertion_length = length( $mutated_to ) - 1; - } - if ($mutated_to eq "-") { - $deletion_length = length( $mutated_from ); - } - else { - $deletion_length = length( $mutated_from ) - 1; - } - - # context_codes are totalled - $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to} = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to} + 1; - $context_tally_across_mutated_to{$context_code}{$mutated_from} = $context_tally_across_mutated_to{$context_code}{$mutated_from} + 1; - - # insertion and deletion lengths are totalled - if ($insertion_length > $deletion_length) { - $insertion_hash{$insertion_length} = $insertion_hash{$insertion_length} + 1; - } - if ($deletion_length > $insertion_length) { - $deletion_hash{$deletion_length} = $deletion_hash{$deletion_length} + 1; - } - - # total insertions and deletions - if ($insertion_length != $deletion_length) { - if ($insertion_length > $deletion_length) { - $insertion_total = $insertion_total + 1; - } - elsif ($deletion_length > $insertion_length) { - $deletion_total = $deletion_total + 1; - } - } - - # Find variant annotation and exonic consequence in ANNOVAR outfile - my $annotation = $line[7]; - if ( $annotation =~ /Func.refGene=(.{1,30});Gene\.refGene/ ) { - # print "$1\n"; - $annotation_hash{$1}++; - $annotation_total++; - } - if ( $annotation =~ /ExonicFunc.refGene=(.{1,30});AAChange\.refGene/ ) { - # print "$1\n"; - $exonic_consequence_hash{$1}++; - } - if ( $annotation =~ /Func.refGene=.{0,15}intronic\;/ ) { - $intronic++; - } - if ( $annotation !~ /Func.refGene=ncRNA_exonic/ ) { - if ( $annotation =~ /Func.refGene=.{0,15}exonic\;/ ) { - $exonic++; - } - } - if ( $annotation =~ /Func.refGene=.{0,15}intergenic\;/ ) { - $intergenic++; - } - elsif ( $annotation =~ /Func.refGene=.{0,15}ncRNA_splicing\;/ ) { - $intergenic++; - } - elsif ( $annotation =~ /Func.refGene=.{0,15}upstream\;/ ) { - $intergenic++; - } - elsif ( $annotation =~ /Func.refGene=.{0,15}downstream\;/ ) { - $intergenic++; - } - - $location{$coordinate}++; - # Reading input gff file, incrementing gff variant region hash - # while () { - # $_ =~ s/\n|\r//; - # my @line = split('\t', $_); - # my $region_name = "$line[3]-$line[4]"; - # if ($coordinate >= $line[3] && $coordinate <= $line[4]) { - # $gff_hash{$region_name}++; - # $gffMatch++; - # print "$coordinate $region_name\n"; - # } - # } - #print "$region_name, $gff_hash{$region_name}\n"; - - - # to keep track of progress - # 1000000 for LARGE dbsnp vcfs, 10000 for smaller vcf/tsv tumor mutation files - unless ($line_count%10000) { - print "processed $line_count lines\n"; - } - $line_count++; -} -# end working through the input file - -# print total number of mutations -my $mutation_total = $line_count; -print "Number of Mutations -- $mutation_total\n"; - - -################### Reading the input gff and creating custom BED file #################### - -my $gffBED = "vars.bed"; -open(my $bed_handle, '>', $gffBED) || die("Could not open 
file!"); - -# Print BED file Header -print $bed_handle "START\tEND\tVariant_Frequency\n"; - -# Reading input gff file, incrementing gff variant region hash -while () { - $_ =~ s/\n|\r//; - my @line = split('\t', $_); - my $region_name = "$line[3]-$line[4]"; - my $region_length = $line[4] - $line[3]; - my $region_freq = 0; - foreach my $coordinate (sort(keys %location)) { - if ($coordinate >= $line[3] && $coordinate <= $line[4]) { - $gff_hash{$region_name}++; - $gffMatch++; - # print "$coordinate $region_name\n"; - } - } - if ($gff_hash{$region_name} == 0) { - print $bed_handle "$line[3]\t$line[4]\t$region_freq\n"; - } - if ($gff_hash{$region_name} > 0) { - $region_freq = $gff_hash{$region_name} / $region_length; - print $bed_handle "$line[3]\t$line[4]\t$region_freq\n"; - print "Region $region_name variant frequency -- $region_freq\n"; - print "Total variants in region $region_name -- $gff_hash{$region_name}\n"; - } -} - #print "$region_name, $gff_hash{$region_name}\n"; - -print "GFF Match -- $gffMatch\n"; - - -######################### open files for writing ########################## - - -# my $genotype_name = "zygosity.prob"; -# open(my $genotype_handle, '>', $genotype_name) || die("Could not open file!"); - -my $insertion_file_name = "SSM_insLength.prob"; -open(my $insertion_prob_handle, '>', $insertion_file_name) || die("Could not open file!"); - -my $deletion_file_name = "SSM_delLength.prob"; -open(my $deletion_prob_handle, '>', $deletion_file_name) || die("Could not open file!"); - -my $overall_file_name = "SSM_overall.prob"; -open(my $overall_prob_handle, '>', $overall_file_name) || die("Could not open file!"); - -my $heterozygosity_file_name = "heterozygosity.prob"; -open(my $heterozygosity_prob_handle, '>', $heterozygosity_file_name) || die("Could not open file!"); - -my $annotation_file_name = "annofreq.prob"; -open(my $annotation_handle, '>', $annotation_file_name) || die("Could not open file!"); - -my $exonic_con_file_name = "exonic_consequences.prob"; -open(my $exonic_con_handle, '>', $exonic_con_file_name) || die("Could not open file!"); - -my $intronic_file_name = "intronic_vars.prob"; -open(my $intronic_handle, '>', $intronic_file_name) || die("Could not open file!"); - -my $exonic_file_name = "exonic_vars.prob"; -open(my $exonic_handle, '>', $exonic_file_name) || die ("Could not open file!"); - -my $intergenic_file_name = "intergenic_vars.prob"; -open(my $intergenic_handle, '>', $intergenic_file_name) || die ("Could not open file!"); - - -######################### Calculate frequency models ####################### - - -# calculate zygosity ratio frequency, print to file -# foreach my $genotype (sort(keys %genotype_hash)) { - # my $zygosity_frequency; - # $zygosity_frequency = $genotype_hash{$genotype}/$mutation_total; - # print $genotype_handle "$genotype\t$zygosity_frequency\n"; - # print "Genotype, $genotype -- $genotype_hash{$genotype}\n"; -# } - -# print annotation and exonic consequence frequencies -foreach $1 (sort(keys %annotation_hash)) { - my $annotation_frequency; - $annotation_frequency = $annotation_hash{$1}/$mutation_total; - print "$1 -- $annotation_hash{$1}, $annotation_frequency\n"; - print $annotation_handle "$1\t$annotation_frequency\n"; -} -foreach $1 (sort(keys %exonic_consequence_hash)) { - my $exonic_con_freq; - if ( $1 ne "." 
) { - $exonic_con_freq = $exonic_consequence_hash{$1}/$mutation_total; - print "Exonic Consequence: $1 -- $exonic_consequence_hash{$1}, $exonic_con_freq\n"; - print $exonic_con_handle "$1\t$exonic_con_freq\n"; - } -} - -# Calculating exonic, intronic, and intergenic frequencies, printing to files -my $intronic_freq; -my $exonic_freq; -my $intergenic_freq; -$intronic_freq = $intronic/$mutation_total; -$exonic_freq = $exonic/$mutation_total; -$intergenic_freq = $intergenic/$mutation_total; -print $intronic_handle "$intronic_freq\n"; -print $exonic_handle "$exonic_freq\n"; -print $intergenic_handle "$intergenic_freq\n"; - -print "Intronic -- $intronic\nExonic -- $exonic\nIntergenic -- $intergenic\n"; -#print "Total Annotations -- $annotation_total\n"; - -# print overall likelihood file headers -print $overall_prob_handle "mutation_type\tprobability\n"; - -# print insertions and deletion probabilities out of all mutations -my $insertion_prob_all = $insertion_total / $mutation_total; -my $deletion_prob_all = $deletion_total / $mutation_total; -print $overall_prob_handle "insertion\t$insertion_prob_all\ndeletion\t$deletion_prob_all\n"; -# print $overall_prob_handle "Deletion Probability -- $deletion_prob_all\n"; - -# print InDel totals -print "Insertions $insertion_total\n"; -print "Deletions $deletion_total\n"; - -# print insertion and deletion headers -print $insertion_prob_handle "insertion_length\tprobability\n"; -print $deletion_prob_handle "deletion_length\tprobability\n"; - -# calculate InDel length totals and probability out of total number of insertions/deletions. Print probabilities to file. -foreach my $insertion_length (sort(keys %insertion_hash)) { - my $insertion_probability; - $insertion_probability = $insertion_hash{$insertion_length}/$insertion_total; - print $insertion_prob_handle "$insertion_length\t$insertion_probability\n"; - # print "Insertion, $insertion_length, total , $insertion_hash{$insertion_length}\n"; -} -foreach my $deletion_length (sort(keys %deletion_hash)) { - my $deletion_probability; - $deletion_probability = $deletion_hash{$deletion_length}/$deletion_total; - print $deletion_prob_handle "$deletion_length\t$deletion_probability\n"; - # print "Deletion, $deletion_length, total, $deletion_hash{$deletion_length}\n"; -} - -# print heterozygosity frequency to file -my $zygote_frequency = $zygotes_total / $mutation_total; -print $heterozygosity_prob_handle "$zygote_frequency\n"; - -print "heterozygous alleles -- $zygotes_total\n"; - - -# define nucleotide array -my @nucleotides = ("A", "C", "G", "T"); - -foreach my $nt1 (@nucleotides) { - foreach my $nt3 (@nucleotides) { - - # define the output file name and open it for writing - my $trinucleotide_SNP_probability_file_name = "Context".$nt1."-".$nt3.".trinuc"; - open(my $trinuc_prob_handle, '>', $trinucleotide_SNP_probability_file_name) || die("Could not open file!"); - - - # print trinucleotide contexts and corresponding totals for every mutated_to nucleotide - my $context_code=$nt1."_".$nt3; - - #foreach my $mutated_from_nucl_key (keys %{ $trinucleotide_context_data{$context_code} }) { - foreach my $mutated_from (@nucleotides) { - # define the "mutated_to" keys in trinuc context hash - # my $mutated_to_nucl_key; - - # the sum is only across mutated_to, and will be redefined for each mutated_from - my $context_sum_across_mutated_to = 0; - my $context_sum_across_indel = 0; - - # print "\nRaw counts for mutated_from $mutated_from \n"; - - - # foreach $mutated_to_nucl_key (keys %{ 
$trinucleotide_context_data{$context_code}{$mutated_from_nucl_key} }) { - foreach my $mutated_to (@nucleotides) { - my $mutated_from_length = length( $mutated_from ); - my $mutated_to_length = length( $mutated_to ); - if ( $mutated_from_length == 1 ) { - if ( $mutated_from ne "-" ) { - if ( $mutated_to_length == 1 ) { - if ( $mutated_to ne "-" ) { - # print "$context_code, $mutated_from_nucl_key, $mutated_to_nucl_key -- $trinucleotide_context_data{$context_code}{$mutated_from_nucl_key}{$mutated_to_nucl_key}\n"; - $context_sum_across_mutated_to = $context_sum_across_mutated_to + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end if statement - else { - $context_sum_across_indel = $context_sum_across_indel + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end else statement - }# end if statement - else { - $context_sum_across_indel = $context_sum_across_indel + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end else statement - }# end if statement - else { - $context_sum_across_indel = $context_sum_across_indel + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end else statement - }# end if statement - else { - $context_sum_across_indel = $context_sum_across_indel + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end else statement - # print "$context_code, $mutated_from, $mutated_to-- $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}\n"; - }# end of loop over mutated_to - - # print "\nProbabilities for mutated_from $mutated_from:\n"; - - - foreach my $mutated_to (@nucleotides) { - #foreach $mutated_to_nucl_key (keys %{ $trinucleotide_context_data{$context_code}{$mutated_from_nucl_key} }) { - my $mutated_from_length = length( $mutated_from); - my $mutated_to_length = length( $mutated_to); - if ( $mutated_from_length == 1 ) { - if ( $mutated_from ne "-" ) { - if ( $mutated_to_length == 1 ) { - if ( $mutated_to ne "-" ) { - my $SNP_probability; - if ( $context_sum_across_mutated_to == 0 ) { - $SNP_probability = 0; - } - else { - $SNP_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_mutated_to; - } - if ( $mutated_to eq "T" ) { - print $trinuc_prob_handle "$SNP_probability"; - } - else { - # print "$context_code, $mutated_from, $mutated_to, context_sum_across_mutated_to=$context_sum_across_mutated_to -- $SNP_probability\n"; - print $trinuc_prob_handle "$SNP_probability\t"; - } - }# end of if statement - else { - my $indel_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_indel; - # print $indel_prob_handle "$context_code, $mutated_from, $mutated_to, context_sum_across_indel=$context_sum_across_indel -- $indel_probability\n"; - }# end else statement - }# end of if statement - else { - # my $indel_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_indel; - # print $indel_prob_handle "$context_code, $mutated_from, $mutated_to, context_sum_across_indel=$context_sum_across_indel -- $indel_probability\n"; - }# end else statement - }# end of if statement - else { - # my $indel_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_indel; - # print $indel_prob_handle "$context_code, $mutated_from, $mutated_to, context_sum_across_indel=$context_sum_across_indel -- $indel_probability\n"; - }# end else statement - }# end of if 
statement - else { - my $indel_probability; - if ( $context_sum_across_indel = 0 ) { - $indel_probability = 0; - } - else { - # $indel_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_indel; - # print $indel_prob_handle "$context_code, $mutated_from, $mutated_to, context_sum_across_indel=$context_sum_across_indel -- $indel_probability\n"; - } - }# end else statement - }# end of loop over mutated_to - print $trinuc_prob_handle "\n"; - - }# end of loop over mutated_from - - # print "\n\n"; - - - }# end loop over nt3 -}# end loop over nt1 - - - - diff --git a/utilities/deprecated/FindNucleotideContextOnReference.pl b/utilities/deprecated/FindNucleotideContextOnReference.pl deleted file mode 100755 index 61fed2c..0000000 --- a/utilities/deprecated/FindNucleotideContextOnReference.pl +++ /dev/null @@ -1,307 +0,0 @@ -#!/usr/bin/perl - -use strict; - - -if ($#ARGV < 1) { - print "parameter mismatch\nTo run type this command:\nperl $0 fastahack reference input_pos_file output_file\n\n"; - - print " first argument = full path to fastahack\n"; - print " second argument = full path to reference genome\n"; - print " third argument = input file with arbitrary number of columns, but 1st col=chromosome name and 2nd col=position\n"; - print " fourth argument = output file with three columns: chromosome name, position of the center nucleotide, and the thre-nucleotide context for that position\n\n\n"; - exit 1; -} - - -my $Fastahack=$ARGV[0]; -my $Reference=$ARGV[1]; -open(InputPositions, '<', $ARGV[2]) || die("Could not open file!"); -open(OutputTrinucleotideContext, '>', $ARGV[3]) || die("Could not open file!"); - - - - -################ read in one coordinate at a time and execute fastahack on it - -# reading the header -my $head = ; -$head =~ s/\n|\r//; -print OutputTrinucleotideContext "$head\tContext\n"; - -# creating trinucleotide context data hash, insertion and deletion counts -my %trinucleotide_context_data; -my %context_tally_across_mutated_to; -my %insertion_hash; -my %deletion_hash; -my $insertion_total; -my $deletion_total; - - -# reading the positional information -my $line_count = 1; -while () { - $_ =~ s/\n|\r//; - #print "$_\n"; - my @line = split('\t', $_); - - # getting the chromosome and coordinate fields from input file - # fastahack will need to the chromosome and coordinate to read the information from the reference - my $chromosome = $line[0]; - my $coordinate = $line[1]; - - # get coordinates of first and last character in the context - my $start_region = $coordinate - 1; - my $end_region = $coordinate + 1; - - # if the coordinate is the very first letter on the chromosome, then do not read before that position - # the context becomes 2 letter code, as opposed to a trinucleotide - if ( $start_region == 0 ) { - $start_region = 1; - $end_region = 2; - } - - #print "$Fastahack -r $chromosome:$start_region..$end_region $Reference\n"; - my $context = `$Fastahack -r $chromosome:$start_region..$end_region $Reference`; - - # capitalize context letters - $context = uc($context); - - # split germline column into germline allele and mutated_to allele - # my @germline = split ('/', $line[6]); - - # if germline allele does not equal reference allele, print "start_region germline allele end_region" - # specifically, replace the middle letter of the context with the germline allele - #print "$germline[0], $germline[1]\n"; - # if ($germline[0] ne $germline[1]) { - # print "germline/reference mismatch, line number $line_count\n"; - # if 
($coordinate != 1) { - # substr($context,1,1)= $germline[1]; - # } - # else { - # substr($context,0,1)= $germline[1]; - # } - # } - - print OutputTrinucleotideContext "$_\t$context"; - - - - ############################### - # new section: forming the data structure - ############################### - - # to create N_N contexts for data structure, context_code is defined as the trinucleotide context with a blank middle allele - my $context_code=$context; - $context_code =~ s/\n|\r//; - substr($context_code,1,1) = "_"; - - # create variables for mutated_from and mutated_to nucleotides - my $mutated_from = $line[9]; - my $mutated_to = $line[10]; - - # define length of insertions and deletions - # if ($mutated_from eq "-") { - my $insertion_length = length( $mutated_to ); - # } - - # if ($mutated_to eq "-") { - my $deletion_length = length( $mutated_from ); - # } - - # context_codes are totalled - $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to} = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to} + 1; - $context_tally_across_mutated_to{$context_code}{$mutated_from} = $context_tally_across_mutated_to{$context_code}{$mutated_from} + 1; - - # insertion and deletion lengths are totalled - if ($mutated_from eq "-") { - $insertion_hash{$insertion_length} = $insertion_hash{$insertion_length} + 1; - } - if ($mutated_to eq "-") { - $deletion_hash{$deletion_length} = $deletion_hash{$deletion_length} + 1; - } - - # total insertions and deletions - if ($mutated_from eq "-") { - $insertion_total = $insertion_total + 1; - } - if ($mutated_to eq "-") { - $deletion_total = $deletion_total + 1; - } - - - - # to keep track of progress - unless ($line_count%10000) { - print "processed $line_count lines\n"; - } - $line_count++; -} -# end working through the input file - -# print total number of mutations -my $mutation_total = $line_count - 1; -print "Number of Mutations -- $mutation_total\n"; - -# define the output file name for insertions, deletions, and overall likelihoods. Open files for writing -my $insertion_file_name = "Leukemia_OPEN_insLength.prob"; -open(my $insertion_prob_handle, '>', $insertion_file_name) || die("Could not open file!"); - -my $deletion_file_name = "Leukemia_OPEN_delLength.prob"; -open(my $deletion_prob_handle, '>', $deletion_file_name) || die("Could not open file!"); - -my $overall_file_name = "Leukemia_OPEN_overall.prob"; -open(my $overall_prob_handle, '>', $overall_file_name) || die("Could not open file!"); - -# print overall likelihood file headers -print $overall_prob_handle "mutation_type\tprobability\n"; - -# print insertions and deletion probabilities out of all mutations -my $insertion_prob_all = $insertion_total / $mutation_total; -my $deletion_prob_all = $deletion_total / $mutation_total; -print $overall_prob_handle "insertion\t$insertion_prob_all\ndeletion\t$deletion_prob_all\n"; -# print $overall_prob_handle "Deletion Probability -- $deletion_prob_all\n"; - -# print InDel totals -print "Insertions $insertion_total\n"; -print "Deletions $deletion_total\n"; - -# print insertion and deletion headers -print $insertion_prob_handle "insertion_length\tprobability\n"; -print $deletion_prob_handle "deletion_length\tprobability\n"; - -# calculate InDel length totals and probability out of total number of insertions/deletions. Print probabilities to file. 
-foreach my $insertion_length (sort(keys %insertion_hash)) { - my $insertion_probability; - $insertion_probability = $insertion_hash{$insertion_length}/$insertion_total; - print $insertion_prob_handle "$insertion_length\t$insertion_probability\n"; - print "Insertion, $insertion_length, total , $insertion_hash{$insertion_length}\n"; -} -foreach my $deletion_length (sort(keys %deletion_hash)) { - my $deletion_probability; - $deletion_probability = $deletion_hash{$deletion_length}/$deletion_total; - print $deletion_prob_handle "$deletion_length\t$deletion_probability\n"; - print "Deletion, $deletion_length, total, $deletion_hash{$deletion_length}\n"; -} - - -# define nucleotide array -my @nucleotides = ("A", "C", "G", "T"); - -foreach my $nt1 (@nucleotides) { - foreach my $nt3 (@nucleotides) { - - # define the output file name and open it for writing - my $trinucleotide_SNP_probability_file_name = "Leukemia_OPEN_".$nt1."-".$nt3.".trinuc"; - open(my $trinuc_prob_handle, '>', $trinucleotide_SNP_probability_file_name) || die("Could not open file!"); - - - # print trinucleotide contexts and corresponding totals for every mutated_to nucleotide - my $context_code=$nt1."_".$nt3; - - #foreach my $mutated_from_nucl_key (keys %{ $trinucleotide_context_data{$context_code} }) { - foreach my $mutated_from (@nucleotides) { - # define the "mutated_to" keys in trinuc context hash - # my $mutated_to_nucl_key; - - # the sum is only across mutated_to, and will be redefined for each mutated_from - my $context_sum_across_mutated_to = 0; - my $context_sum_across_indel = 0; - - # print "\nRaw counts for mutated_from $mutated_from \n"; - - - # foreach $mutated_to_nucl_key (keys %{ $trinucleotide_context_data{$context_code}{$mutated_from_nucl_key} }) { - foreach my $mutated_to (@nucleotides) { - my $mutated_from_length = length( $mutated_from ); - my $mutated_to_length = length( $mutated_to ); - if ( $mutated_from_length == 1 ) { - if ( $mutated_from ne "-" ) { - if ( $mutated_to_length == 1 ) { - if ( $mutated_to ne "-" ) { - # print "$context_code, $mutated_from_nucl_key, $mutated_to_nucl_key -- $trinucleotide_context_data{$context_code}{$mutated_from_nucl_key}{$mutated_to_nucl_key}\n"; - $context_sum_across_mutated_to = $context_sum_across_mutated_to + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end if statement - else { - $context_sum_across_indel = $context_sum_across_indel + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end else statement - }# end if statement - else { - $context_sum_across_indel = $context_sum_across_indel + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end else statement - }# end if statement - else { - $context_sum_across_indel = $context_sum_across_indel + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end else statement - }# end if statement - else { - $context_sum_across_indel = $context_sum_across_indel + $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}; - }# end else statement - # print "$context_code, $mutated_from, $mutated_to-- $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}\n"; - }# end of loop over mutated_to - - # print "\nProbabilities for mutated_from $mutated_from:\n"; - - - foreach my $mutated_to (@nucleotides) { - #foreach $mutated_to_nucl_key (keys %{ $trinucleotide_context_data{$context_code}{$mutated_from_nucl_key} }) { - my $mutated_from_length = length( $mutated_from); - my $mutated_to_length = length( 
$mutated_to); - if ( $mutated_from_length == 1 ) { - if ( $mutated_from ne "-" ) { - if ( $mutated_to_length == 1 ) { - if ( $mutated_to ne "-" ) { - my $SNP_probability; - if ( $context_sum_across_mutated_to == 0 ) { - $SNP_probability = 0; - } - else { - $SNP_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_mutated_to; - } - if ( $mutated_to eq "T" ) { - print $trinuc_prob_handle "$SNP_probability"; - } - else { - # print "$context_code, $mutated_from, $mutated_to, context_sum_across_mutated_to=$context_sum_across_mutated_to -- $SNP_probability\n"; - print $trinuc_prob_handle "$SNP_probability\t"; - } - }# end of if statement - else { - my $indel_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_indel; - # print $indel_prob_handle "$context_code, $mutated_from, $mutated_to, context_sum_across_indel=$context_sum_across_indel -- $indel_probability\n"; - }# end else statement - }# end of if statement - else { - # my $indel_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_indel; - # print $indel_prob_handle "$context_code, $mutated_from, $mutated_to, context_sum_across_indel=$context_sum_across_indel -- $indel_probability\n"; - }# end else statement - }# end of if statement - else { - # my $indel_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_indel; - # print $indel_prob_handle "$context_code, $mutated_from, $mutated_to, context_sum_across_indel=$context_sum_across_indel -- $indel_probability\n"; - }# end else statement - }# end of if statement - else { - my $indel_probability; - if ( $context_sum_across_indel = 0 ) { - $indel_probability = 0; - } - else { - # $indel_probability = $trinucleotide_context_data{$context_code}{$mutated_from}{$mutated_to}/$context_sum_across_indel; - # print $indel_prob_handle "$context_code, $mutated_from, $mutated_to, context_sum_across_indel=$context_sum_across_indel -- $indel_probability\n"; - } - }# end else statement - }# end of loop over mutated_to - print $trinuc_prob_handle "\n"; - - }# end of loop over mutated_from - - # print "\n\n"; - - - }# end loop over nt3 -}# end loop over nt1 - - - - diff --git a/utilities/deprecated/README.md b/utilities/deprecated/README.md deleted file mode 100644 index 01613ce..0000000 --- a/utilities/deprecated/README.md +++ /dev/null @@ -1,16 +0,0 @@ -#Deprecated Perl Scripts -These scripts were updated and rewritten in python to improve ease of use and speed. Usage and a quick description of the deprecated scripts can be found below. Please use genMutModel.py to generate mutation models. - -##FindNucleotideContextOnReference.pl -This script takes in VCF files and generates variant frequency models for NEAT. Coordinates for each variant are located within the HG19 human reference. The corresponding trinucleotide context around that location on the reference is returned into a new column. - -## Running the Script -The script requires 5 arguments to be entered after the full path to FindNucleotideContextOnReference.healthy.pl - -``` -1. Full path to Fastahack -2. Full path to Reference Genome -3. Full path to input VCF -4. Full path to output file -5. 
Full path to human GFF -``` diff --git a/utilities/genMutModel.py b/utilities/genMutModel.py deleted file mode 100644 index 6aa1097..0000000 --- a/utilities/genMutModel.py +++ /dev/null @@ -1,581 +0,0 @@ -#!/usr/bin/env python - -import sys -import os -import re -import bisect -import pickle -import argparse -import numpy as np - -# absolute path to the directory above this script -SIM_PATH = '/'.join(os.path.realpath(__file__).split('/')[:-2]) -sys.path.append(SIM_PATH+'/py/') - -from refFunc import indexRef - -REF_WHITELIST = [str(n) for n in xrange(1,30)] + ['x','y','X','Y','mt','Mt','MT'] -REF_WHITELIST += ['chr'+n for n in REF_WHITELIST] -VALID_NUCL = ['A','C','G','T'] -VALID_TRINUC = [VALID_NUCL[i]+VALID_NUCL[j]+VALID_NUCL[k] for i in xrange(len(VALID_NUCL)) for j in xrange(len(VALID_NUCL)) for k in xrange(len(VALID_NUCL))] -# if parsing a dbsnp vcf, and no CAF= is found in info tag, use this as default val for population freq -VCF_DEFAULT_POP_FREQ = 0.00001 - - -######################################################### -# VARIOUS HELPER FUNCTIONS # -######################################################### - - -# given a reference index, grab the sequence string of a specified reference -def getChrFromFasta(refPath,ref_inds,chrName): - for i in xrange(len(ref_inds)): - if ref_inds[i][0] == chrName: - ref_inds_i = ref_inds[i] - break - refFile = open(refPath,'r') - refFile.seek(ref_inds_i[1]) - myDat = ''.join(refFile.read(ref_inds_i[2]-ref_inds_i[1]).split('\n')) - return myDat - -# cluster a sorted list -def clusterList(l,delta): - outList = [[l[0]]] - prevVal = l[0] - currentInd = 0 - for n in l[1:]: - if n-prevVal <= delta: - outList[currentInd].append(n) - else: - currentInd += 1 - outList.append([]) - outList[currentInd].append(n) - prevVal = n - return outList - -def list_2_countDict(l): - cDict = {} - for n in l: - if n not in cDict: - cDict[n] = 0 - cDict[n] += 1 - return cDict - -def getBedTracks(fn): - f = open(fn,'r') - trackDict = {} - for line in f: - splt = line.strip().split('\t') - if splt[0] not in trackDict: - trackDict[splt[0]] = [] - trackDict[splt[0]].extend([int(splt[1]),int(splt[2])]) - f.close() - return trackDict - -def getTrackLen(trackDict): - totSum = 0 - for k in trackDict.keys(): - for i in xrange(0,len(trackDict[k]),2): - totSum += trackDict[k][i+1] - trackDict[k][i] + 1 - return totSum - -def isInBed(track,ind): - myInd = bisect.bisect(track,ind) - if myInd&1: - return True - if myInd < len(track): - if track[myInd-1] == ind: - return True - return False - -## return the mean distance to the median of a cluster -#def mean_dist_from_median(c): -# centroid = np.median([n for n in c]) -# dists = [] -# for n in c: -# dists.append(abs(n-centroid)) -# return np.mean(dists) -# -## get median value from counting dictionary -#def quick_median(countDict): -# midPoint = sum(countDict.values())/2 -# mySum = 0 -# myInd = 0 -# sk = sorted(countDict.keys()) -# while mySum < midPoint: -# mySum += countDict[sk[myInd]] -# if mySum >= midPoint: -# break -# myInd += 1 -# return myInd -# -## get median deviation from median of counting dictionary -#def median_deviation_from_median(countDict): -# myMedian = quick_median(countDict) -# deviations = {} -# for k in sorted(countDict.keys()): -# d = abs(k-myMedian) -# deviations[d] = countDict[k] -# return quick_median(deviations) - - -################################################# -# PARSE INPUT OPTIONS # -################################################# - - -parser = argparse.ArgumentParser(description='genMutModel.py') 
-parser.add_argument('-r', type=str, required=True, metavar='', help="* ref.fa") -parser.add_argument('-m', type=str, required=True, metavar='', help="* mutations.tsv [.vcf]") -parser.add_argument('-o', type=str, required=True, metavar='', help="* output.p") -parser.add_argument('-bi', type=str, required=False, metavar='', default=None, help="only_use_these_regions.bed") -parser.add_argument('-be', type=str, required=False, metavar='', default=None, help="exclude_these_regions.bed") -parser.add_argument('--save-trinuc', required=False,action='store_true', default=False, help='save trinuc counts for ref') -parser.add_argument('--no-whitelist',required=False,action='store_true', default=False, help='allow any non-standard ref') -parser.add_argument('--skip-common', required=False,action='store_true', default=False, help='do not save common snps + high mut regions') -args = parser.parse_args() -(REF, TSV, OUT_PICKLE, SAVE_TRINUC, NO_WHITELIST, SKIP_COMMON) = (args.r, args.m, args.o, args.save_trinuc, args.no_whitelist, args.skip_common) - -MYBED = None -if args.bi != None: - print 'only considering variants in specified bed regions...' - MYBED = (getBedTracks(args.bi),True) -elif args.be != None: - print 'only considering variants outside of specified bed regions...' - MYBED = (getBedTracks(args.be),False) - -if TSV[-4:] == '.vcf': - IS_VCF = True -elif TSV[-4:] == '.tsv': - IS_VCF = False -else: - print '\nError: Unknown format for mutation input.\n' - exit(1) - - -##################################### -# main() # -##################################### - - -def main(): - - ref_inds = indexRef(REF) - refList = [n[0] for n in ref_inds] - - # how many times do we observe each trinucleotide in the reference (and input bed region, if present)? - TRINUC_REF_COUNT = {} - TRINUC_BED_COUNT = {} - printBedWarning = True - # [(trinuc_a, trinuc_b)] = # of times we observed a mutation from trinuc_a into trinuc_b - TRINUC_TRANSITION_COUNT = {} - # total count of SNPs - SNP_COUNT = 0 - # overall SNP transition probabilities - SNP_TRANSITION_COUNT = {} - # total count of indels, indexed by length - INDEL_COUNT = {} - # tabulate how much non-N reference sequence we've eaten through - TOTAL_REFLEN = 0 - # detect variants that occur in a significant percentage of the input samples (pos,ref,alt,pop_fraction) - COMMON_VARIANTS = [] - # tabulate how many unique donors we've encountered (this is useful for identifying common variants) - TOTAL_DONORS = {} - # identify regions that have significantly higher local mutation rates than the average - HIGH_MUT_REGIONS = [] - - # load and process variants in each reference sequence individually, for memory reasons... - for refName in refList: - - if (refName not in REF_WHITELIST) and (not NO_WHITELIST): - print refName,'is not in our whitelist, skipping...' - continue - - print 'reading reference "'+refName+'"...' - refSequence = getChrFromFasta(REF,ref_inds,refName).upper() - TOTAL_REFLEN += len(refSequence) - refSequence.count('N') - - # list to be used for counting variants that occur multiple times in file (i.e. in multiple samples) - VDAT_COMMON = [] - - - """ ########################################################################## - ### COUNT TRINUCLEOTIDES IN REF ### - ########################################################################## """ - - - if MYBED != None: - if printBedWarning: - print "since you're using a bed input, we have to count trinucs in bed region even if you specified a trinuc count file for the reference..." 
- printBedWarning = False - if refName in MYBED[0]: - refKey = refName - elif ('chr' in refName) and (refName not in MYBED[0]) and (refName[3:] in MYBED[0]): - refKey = refName[3:] - elif ('chr' not in refName) and (refName not in MYBED[0]) and ('chr'+refName in MYBED[0]): - refKey = 'chr'+refName - if refKey in MYBED[0]: - subRegions = [(MYBED[0][refKey][n],MYBED[0][refKey][n+1]) for n in xrange(0,len(MYBED[0][refKey]),2)] - for sr in subRegions: - for i in xrange(sr[0],sr[1]+1-2): - trinuc = refSequence[i:i+3] - if not trinuc in VALID_TRINUC: - continue # skip if trinuc contains invalid characters, or not in specified bed region - if trinuc not in TRINUC_BED_COUNT: - TRINUC_BED_COUNT[trinuc] = 0 - TRINUC_BED_COUNT[trinuc] += 1 - - if not os.path.isfile(REF+'.trinucCounts'): - print 'counting trinucleotides in reference...' - for i in xrange(len(refSequence)-2): - if i%1000000 == 0 and i > 0: - print i,'/',len(refSequence) - #break - trinuc = refSequence[i:i+3] - if not trinuc in VALID_TRINUC: - continue # skip if trinuc contains invalid characters - if trinuc not in TRINUC_REF_COUNT: - TRINUC_REF_COUNT[trinuc] = 0 - TRINUC_REF_COUNT[trinuc] += 1 - else: - print 'skipping trinuc counts (for whole reference) because we found a file...' - - - """ ########################################################################## - ### READ INPUT VARIANTS ### - ########################################################################## """ - - - print 'reading input variants...' - f = open(TSV,'r') - isFirst = True - for line in f: - - if IS_VCF and line[0] == '#': - continue - if isFirst: - if IS_VCF: - # hard-code index values based on expected columns in vcf - (c1,c2,c3,m1,m2,m3) = (0,1,1,3,3,4) - else: - # determine columns of fields we're interested in - splt = line.strip().split('\t') - (c1,c2,c3) = (splt.index('chromosome'),splt.index('chromosome_start'),splt.index('chromosome_end')) - (m1,m2,m3) = (splt.index('reference_genome_allele'),splt.index('mutated_from_allele'),splt.index('mutated_to_allele')) - (d_id) = (splt.index('icgc_donor_id')) - isFirst = False - continue - - splt = line.strip().split('\t') - # we have -1 because tsv/vcf coords are 1-based, and our reference string index is 0-based - [chrName,chrStart,chrEnd] = [splt[c1],int(splt[c2])-1,int(splt[c3])-1] - [allele_ref,allele_normal,allele_tumor] = [splt[m1].upper(),splt[m2].upper(),splt[m3].upper()] - if IS_VCF: - if len(allele_ref) != len(allele_tumor): - # indels in tsv don't include the preserved first nucleotide, so lets trim the vcf alleles - [allele_ref,allele_normal,allele_tumor] = [allele_ref[1:],allele_normal[1:],allele_tumor[1:]] - if not allele_ref: allele_ref = '-' - if not allele_normal: allele_normal = '-' - if not allele_tumor: allele_tumor = '-' - # if alternate alleles are present, lets just ignore this variant. I may come back and improve this later - if ',' in allele_tumor: - continue - vcf_info = ';'+splt[7]+';' - else: - [donor_id] = [splt[d_id]] - - # if we encounter a multi-np (i.e. 3 nucl --> 3 different nucl), let's skip it for now... - if ('-' not in allele_normal and '-' not in allele_tumor) and (len(allele_normal) > 1 or len(allele_tumor) > 1): - print 'skipping a complex variant...' - continue - - # to deal with '1' vs 'chr1' references, manually change names. this is hacky and bad. 
- if 'chr' not in chrName: - chrName = 'chr'+chrName - if 'chr' not in refName: - refName = 'chr'+refName - # skip irrelevant variants - if chrName != refName: - continue - - # if variant is outside the regions we're interested in (if specified), skip it... - if MYBED != None: - refKey = refName - if not refKey in MYBED[0] and refKey[3:] in MYBED[0]: # account for 1 vs chr1, again... - refKey = refKey[3:] - if refKey not in MYBED[0]: - inBed = False - else: - inBed = isInBed(MYBED[0][refKey],chrStart) - if inBed != MYBED[1]: - continue - - # we want only snps - # so, no '-' characters allowed, and chrStart must be same as chrEnd - if '-' not in allele_normal and '-' not in allele_tumor and chrStart == chrEnd: - trinuc_ref = refSequence[chrStart-1:chrStart+2] - if not trinuc_ref in VALID_TRINUC: - continue # skip ref trinuc with invalid characters - # only consider positions where ref allele in tsv matches the nucleotide in our reference - if allele_ref == trinuc_ref[1]: - trinuc_normal = refSequence[chrStart-1] + allele_normal + refSequence[chrStart+1] - trinuc_tumor = refSequence[chrStart-1] + allele_tumor + refSequence[chrStart+1] - if not trinuc_normal in VALID_TRINUC or not trinuc_tumor in VALID_TRINUC: - continue # skip if mutation contains invalid char - key = (trinuc_normal,trinuc_tumor) - if key not in TRINUC_TRANSITION_COUNT: - TRINUC_TRANSITION_COUNT[key] = 0 - TRINUC_TRANSITION_COUNT[key] += 1 - SNP_COUNT += 1 - key2 = (allele_normal,allele_tumor) - if key2 not in SNP_TRANSITION_COUNT: - SNP_TRANSITION_COUNT[key2] = 0 - SNP_TRANSITION_COUNT[key2] += 1 - - if IS_VCF: - myPopFreq = VCF_DEFAULT_POP_FREQ - if ';CAF=' in vcf_info: - cafStr = re.findall(r";CAF=.*?(?=;)",vcf_info)[0] - if ',' in cafStr: - myPopFreq = float(cafStr[5:].split(',')[1]) - VDAT_COMMON.append((chrStart,allele_ref,allele_normal,allele_tumor,myPopFreq)) - else: - VDAT_COMMON.append((chrStart,allele_ref,allele_normal,allele_tumor)) - TOTAL_DONORS[donor_id] = True - else: - print '\nError: ref allele in variant call does not match reference.\n' - exit(1) - - # now let's look for indels... - if '-' in allele_normal: len_normal = 0 - else: len_normal = len(allele_normal) - if '-' in allele_tumor: len_tumor = 0 - else: len_tumor = len(allele_tumor) - if len_normal != len_tumor: - indel_len = len_tumor - len_normal - if indel_len not in INDEL_COUNT: - INDEL_COUNT[indel_len] = 0 - INDEL_COUNT[indel_len] += 1 - - if IS_VCF: - myPopFreq = VCF_DEFAULT_POP_FREQ - if ';CAF=' in vcf_info: - cafStr = re.findall(r";CAF=.*?(?=;)",vcf_info)[0] - if ',' in cafStr: - myPopFreq = float(cafStr[5:].split(',')[1]) - VDAT_COMMON.append((chrStart,allele_ref,allele_normal,allele_tumor,myPopFreq)) - else: - VDAT_COMMON.append((chrStart,allele_ref,allele_normal,allele_tumor)) - TOTAL_DONORS[donor_id] = True - f.close() - - # if we didn't find anything, skip ahead along to the next reference sequence - if not len(VDAT_COMMON): - print 'Found no variants for this reference, moving along...' 
- continue - - # - # identify common mutations - # - percentile_var = 95 - if IS_VCF: - minVal = np.percentile([n[4] for n in VDAT_COMMON],percentile_var) - for k in sorted(VDAT_COMMON): - if k[4] >= minVal: - COMMON_VARIANTS.append((refName,k[0],k[1],k[3],k[4])) - VDAT_COMMON = {(n[0],n[1],n[2],n[3]):n[4] for n in VDAT_COMMON} - else: - N_DONORS = len(TOTAL_DONORS) - VDAT_COMMON = list_2_countDict(VDAT_COMMON) - minVal = int(np.percentile(VDAT_COMMON.values(),percentile_var)) - for k in sorted(VDAT_COMMON.keys()): - if VDAT_COMMON[k] >= minVal: - COMMON_VARIANTS.append((refName,k[0],k[1],k[3],VDAT_COMMON[k]/float(N_DONORS))) - - # - # identify areas that have contained significantly higher random mutation rates - # - dist_thresh = 2000 - percentile_clust = 97 - qptn = 1000 - # identify regions with disproportionately more variants in them - VARIANT_POS = sorted([n[0] for n in VDAT_COMMON.keys()]) - clustered_pos = clusterList(VARIANT_POS,dist_thresh) - byLen = [(len(clustered_pos[i]),min(clustered_pos[i]),max(clustered_pos[i]),i) for i in xrange(len(clustered_pos))] - #byLen = sorted(byLen,reverse=True) - #minLen = int(np.percentile([n[0] for n in byLen],percentile_clust)) - #byLen = [n for n in byLen if n[0] >= minLen] - candidate_regions = [] - for n in byLen: - bi = int((n[1]-dist_thresh)/float(qptn))*qptn - bf = int((n[2]+dist_thresh)/float(qptn))*qptn - candidate_regions.append((n[0]/float(bf-bi),max([0,bi]),min([len(refSequence),bf]))) - minVal = np.percentile([n[0] for n in candidate_regions],percentile_clust) - for n in candidate_regions: - if n[0] >= minVal: - HIGH_MUT_REGIONS.append((refName,n[1],n[2],n[0])) - # collapse overlapping regions - for i in xrange(len(HIGH_MUT_REGIONS)-1,0,-1): - if HIGH_MUT_REGIONS[i-1][2] >= HIGH_MUT_REGIONS[i][1] and HIGH_MUT_REGIONS[i-1][0] == HIGH_MUT_REGIONS[i][0]: - avgMutRate = 0.5*HIGH_MUT_REGIONS[i-1][3]+0.5*HIGH_MUT_REGIONS[i][3] # not accurate, but I'm lazy - HIGH_MUT_REGIONS[i-1] = (HIGH_MUT_REGIONS[i-1][0], HIGH_MUT_REGIONS[i-1][1], HIGH_MUT_REGIONS[i][2], avgMutRate) - del HIGH_MUT_REGIONS[i] - - # - # if we didn't count ref trinucs because we found file, read in ref counts from file now - # - if os.path.isfile(REF+'.trinucCounts'): - print 'reading pre-computed trinuc counts...' - f = open(REF+'.trinucCounts','r') - for line in f: - splt = line.strip().split('\t') - TRINUC_REF_COUNT[splt[0]] = int(splt[1]) - f.close() - # otherwise, save trinuc counts to file, if desired - elif SAVE_TRINUC: - if MYBED != None: - print 'unable to save trinuc counts to file because using input bed region...' - else: - print 'saving trinuc counts to file...' - f = open(REF+'.trinucCounts','w') - for trinuc in sorted(TRINUC_REF_COUNT.keys()): - f.write(trinuc+'\t'+str(TRINUC_REF_COUNT[trinuc])+'\n') - f.close() - - # - # if using an input bed region, make necessary adjustments to trinuc ref counts based on the bed region trinuc counts - # - if MYBED != None: - if MYBED[1] == True: # we are restricting our attention to bed regions, so ONLY use bed region trinuc counts - TRINUC_REF_COUNT = TRINUC_BED_COUNT - else: # we are only looking outside bed regions, so subtract bed region trinucs from entire reference trinucs - for k in TRINUC_REF_COUNT.keys(): - if k in TRINUC_BED_COUNT: - TRINUC_REF_COUNT[k] -= TRINUC_BED_COUNT[k] - - # if for some reason we didn't find any valid input variants, exit gracefully... - totalVar = SNP_COUNT + sum(INDEL_COUNT.values()) - if totalVar == 0: - print '\nError: No valid variants were found, model could not be created. 
(Are you using the correct reference?)\n' - exit(1) - - """ ########################################################################## - ### COMPUTE PROBABILITIES ### - ########################################################################## """ - - - #for k in sorted(TRINUC_REF_COUNT.keys()): - # print k, TRINUC_REF_COUNT[k] - # - #for k in sorted(TRINUC_TRANSITION_COUNT.keys()): - # print k, TRINUC_TRANSITION_COUNT[k] - - # frequency that each trinuc mutated into anything else - TRINUC_MUT_PROB = {} - # frequency that a trinuc mutates into another trinuc, given that it mutated - TRINUC_TRANS_PROBS = {} - # frequency of snp transitions, given a snp occurs. - SNP_TRANS_FREQ = {} - - for trinuc in sorted(TRINUC_REF_COUNT.keys()): - myCount = 0 - for k in sorted(TRINUC_TRANSITION_COUNT.keys()): - if k[0] == trinuc: - myCount += TRINUC_TRANSITION_COUNT[k] - TRINUC_MUT_PROB[trinuc] = myCount / float(TRINUC_REF_COUNT[trinuc]) - for k in sorted(TRINUC_TRANSITION_COUNT.keys()): - if k[0] == trinuc: - TRINUC_TRANS_PROBS[k] = TRINUC_TRANSITION_COUNT[k] / float(myCount) - - for n1 in VALID_NUCL: - rollingTot = sum([SNP_TRANSITION_COUNT[(n1,n2)] for n2 in VALID_NUCL if (n1,n2) in SNP_TRANSITION_COUNT]) - for n2 in VALID_NUCL: - key2 = (n1,n2) - if key2 in SNP_TRANSITION_COUNT: - SNP_TRANS_FREQ[key2] = SNP_TRANSITION_COUNT[key2] / float(rollingTot) - - # compute average snp and indel frequencies - SNP_FREQ = SNP_COUNT/float(totalVar) - AVG_INDEL_FREQ = 1.-SNP_FREQ - INDEL_FREQ = {k:(INDEL_COUNT[k]/float(totalVar))/AVG_INDEL_FREQ for k in INDEL_COUNT.keys()} - if MYBED != None: - if MYBED[1] == True: - AVG_MUT_RATE = totalVar/float(getTrackLen(MYBED[0])) - else: - AVG_MUT_RATE = totalVar/float(TOTAL_REFLEN - getTrackLen(MYBED[0])) - else: - AVG_MUT_RATE = totalVar/float(TOTAL_REFLEN) - - # - # if values weren't found in data, appropriately append null entries - # - printTrinucWarning = False - for trinuc in VALID_TRINUC: - trinuc_mut = [trinuc[0]+n+trinuc[2] for n in VALID_NUCL if n != trinuc[1]] - if trinuc not in TRINUC_MUT_PROB: - TRINUC_MUT_PROB[trinuc] = 0. - printTrinucWarning = True - for trinuc2 in trinuc_mut: - if (trinuc,trinuc2) not in TRINUC_TRANS_PROBS: - TRINUC_TRANS_PROBS[(trinuc,trinuc2)] = 0. - printTrinucWarning = True - if printTrinucWarning: - print 'Warning: Some trinucleotides transitions were not encountered in the input dataset, probabilities of 0.0 have been assigned to these events.' 
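The normalization in the block above (and in the rewritten gen_mut_model.py later in this diff) turns raw counts into the conditional probabilities stored in the model: divide each trinucleotide's outgoing transition total by its reference count to get p(trinuc mutates), then divide each individual transition by that total to get the transition probabilities conditioned on a mutation. A minimal, self-contained sketch with made-up counts (the dictionary names mirror the script; the numbers do not come from any real data):

```
# Hypothetical counts, for illustration only; the arithmetic mirrors the script.
trinuc_ref_count = {'ACA': 1000, 'ACG': 400}              # occurrences in the reference
trinuc_transition_count = {('ACA', 'AGA'): 6,             # observed trinuc_a -> trinuc_b mutations
                           ('ACA', 'ATA'): 4,
                           ('ACG', 'ATG'): 2}

trinuc_mut_prob = {}     # p(trinuc mutates at all)
trinuc_trans_probs = {}  # p(trinuc_a -> trinuc_b | trinuc_a mutated)
for trinuc, ref_count in trinuc_ref_count.items():
    outgoing = {k: v for k, v in trinuc_transition_count.items() if k[0] == trinuc}
    total = sum(outgoing.values())
    trinuc_mut_prob[trinuc] = total / float(ref_count)
    for key, count in outgoing.items():
        trinuc_trans_probs[key] = count / float(total)

# trinuc_mut_prob['ACA'] == 0.01 and trinuc_trans_probs[('ACA', 'AGA')] == 0.6
```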
- - # - # print some stuff - # - for k in sorted(TRINUC_MUT_PROB.keys()): - print 'p('+k+' mutates) =',TRINUC_MUT_PROB[k] - - for k in sorted(TRINUC_TRANS_PROBS.keys()): - print 'p('+k[0]+' --> '+k[1]+' | '+k[0]+' mutates) =',TRINUC_TRANS_PROBS[k] - - for k in sorted(INDEL_FREQ.keys()): - if k > 0: - print 'p(ins length = '+str(abs(k))+' | indel occurs) =',INDEL_FREQ[k] - else: - print 'p(del length = '+str(abs(k))+' | indel occurs) =',INDEL_FREQ[k] - - for k in sorted(SNP_TRANS_FREQ.keys()): - print 'p('+k[0]+' --> '+k[1]+' | SNP occurs) =',SNP_TRANS_FREQ[k] - - #for n in COMMON_VARIANTS: - # print n - - #for n in HIGH_MUT_REGIONS: - # print n - - print 'p(snp) =',SNP_FREQ - print 'p(indel) =',AVG_INDEL_FREQ - print 'overall average mut rate:',AVG_MUT_RATE - print 'total variants processed:',totalVar - - # - # save variables to file - # - if SKIP_COMMON: - OUT_DICT = {'AVG_MUT_RATE':AVG_MUT_RATE, - 'SNP_FREQ':SNP_FREQ, - 'SNP_TRANS_FREQ':SNP_TRANS_FREQ, - 'INDEL_FREQ':INDEL_FREQ, - 'TRINUC_MUT_PROB':TRINUC_MUT_PROB, - 'TRINUC_TRANS_PROBS':TRINUC_TRANS_PROBS} - else: - OUT_DICT = {'AVG_MUT_RATE':AVG_MUT_RATE, - 'SNP_FREQ':SNP_FREQ, - 'SNP_TRANS_FREQ':SNP_TRANS_FREQ, - 'INDEL_FREQ':INDEL_FREQ, - 'TRINUC_MUT_PROB':TRINUC_MUT_PROB, - 'TRINUC_TRANS_PROBS':TRINUC_TRANS_PROBS, - 'COMMON_VARIANTS':COMMON_VARIANTS, - 'HIGH_MUT_REGIONS':HIGH_MUT_REGIONS} - pickle.dump( OUT_DICT, open( OUT_PICKLE, "wb" ) ) - - -if __name__ == "__main__": - main() - - - - diff --git a/utilities/genSeqErrorModel.py b/utilities/genSeqErrorModel.py old mode 100644 new mode 100755 index 7860178..85a6a59 --- a/utilities/genSeqErrorModel.py +++ b/utilities/genSeqErrorModel.py @@ -1,300 +1,306 @@ -#!/usr/bin/env python +#!/usr/bin/env source # # -# genSeqErrorModel.py -# Computes sequencing error model for genReads.py +# genSeqErrorModel.source +# Computes sequencing error model for gen_reads.source # # -# Usage: python genSeqErrorModel.py -i input_reads.fq -o path/to/output_name.p +# Usage: source genSeqErrorModel.source -i input_reads.fq -o path/to/output_name.p # # +# Python 3 ready -import os -import sys -import gzip -import random import numpy as np import argparse -import cPickle as pickle - -# absolute path to this script -SIM_PATH = '/'.join(os.path.realpath(__file__).split('/')[:-2])+'/py/' -sys.path.append(SIM_PATH) - -from probability import DiscreteDistribution - -def parseFQ(inf): - print 'reading '+inf+'...' - if inf[-3:] == '.gz': - print 'detected gzip suffix...' - f = gzip.open(inf,'r') - else: - f = open(inf,'r') - - IS_SAM = False - if inf[-4:] == '.sam': - print 'detected sam input...' - IS_SAM = True - - rRead = 0 - actual_readlen = 0 - qDict = {} - while True: - - if IS_SAM: - data4 = f.readline() - if not len(data4): - break - try: - data4 = data4.split('\t')[10] - except IndexError: - break - # need to add some input checking here? Yup, probably. - else: - data1 = f.readline() - data2 = f.readline() - data3 = f.readline() - data4 = f.readline() - if not all([data1,data2,data3,data4]): - break - - if actual_readlen == 0: - if inf[-3:] != '.gz' and not IS_SAM: - totalSize = os.path.getsize(inf) - entrySize = sum([len(n) for n in [data1,data2,data3,data4]]) - print 'estimated number of reads in file:',int(float(totalSize)/entrySize) - actual_readlen = len(data4)-1 - print 'assuming read length is uniform...' 
- print 'detected read length (from first read found):',actual_readlen - priorQ = np.zeros([actual_readlen,RQ]) - totalQ = [None] + [np.zeros([RQ,RQ]) for n in xrange(actual_readlen-1)] - - # sanity-check readlengths - if len(data4)-1 != actual_readlen: - print 'skipping read with unexpected length...' - continue - - for i in range(len(data4)-1): - q = ord(data4[i])-offQ - qDict[q] = True - if i == 0: - priorQ[i][q] += 1 - else: - totalQ[i][prevQ,q] += 1 - priorQ[i][q] += 1 - prevQ = q - - rRead += 1 - if rRead%PRINT_EVERY == 0: - print rRead - if MAX_READS > 0 and rRead >= MAX_READS: - break - f.close() - - # some sanity checking again... - QRANGE = [min(qDict.keys()),max(qDict.keys())] - if QRANGE[0] < 0: - print '\nError: Read in Q-scores below 0\n' - exit(1) - if QRANGE[1] > RQ: - print '\nError: Read in Q-scores above specified maximum:',QRANGE[1],'>',RQ,'\n' - exit(1) - - print 'computing probabilities...' - probQ = [None] + [[[0. for m in xrange(RQ)] for n in xrange(RQ)] for p in xrange(actual_readlen-1)] - for p in xrange(1,actual_readlen): - for i in xrange(RQ): - rowSum = float(np.sum(totalQ[p][i,:]))+PROB_SMOOTH*RQ - if rowSum <= 0.: - continue - for j in xrange(RQ): - probQ[p][i][j] = (totalQ[p][i][j]+PROB_SMOOTH)/rowSum - - initQ = [[0. for m in xrange(RQ)] for n in xrange(actual_readlen)] - for i in xrange(actual_readlen): - rowSum = float(np.sum(priorQ[i,:]))+INIT_SMOOTH*RQ - if rowSum <= 0.: - continue - for j in xrange(RQ): - initQ[i][j] = (priorQ[i][j]+INIT_SMOOTH)/rowSum - - if PLOT_STUFF: - mpl.rcParams.update({'font.size': 14, 'font.weight':'bold', 'lines.linewidth': 3}) - - mpl.figure(1) - Z = np.array(initQ).T - X, Y = np.meshgrid( range(0,len(Z[0])+1), range(0,len(Z)+1) ) - mpl.pcolormesh(X,Y,Z,vmin=0.,vmax=0.25) - mpl.axis([0,len(Z[0]),0,len(Z)]) - mpl.yticks(range(0,len(Z),10),range(0,len(Z),10)) - mpl.xticks(range(0,len(Z[0]),10),range(0,len(Z[0]),10)) - mpl.xlabel('Read Position') - mpl.ylabel('Quality Score') - mpl.title('Q-Score Prior Probabilities') - mpl.colorbar() - - mpl.show() - - VMIN_LOG = [-4,0] - minVal = 10**VMIN_LOG[0] - qLabels = [str(n) for n in range(QRANGE[0],QRANGE[1]+1) if n%5==0] - print qLabels - qTicksx = [int(n)+0.5 for n in qLabels] - qTicksy = [(RQ-int(n))-0.5 for n in qLabels] - - for p in xrange(1,actual_readlen,10): - currentDat = np.array(probQ[p]) - for i in xrange(len(currentDat)): - for j in xrange(len(currentDat[i])): - currentDat[i][j] = max(minVal,currentDat[i][j]) - - # matrix indices: pcolormesh plotting: plot labels and axes: - # - # y ^ ^ - # --> x | y | - # x | --> --> - # v y x - # - # to plot a MxN matrix 'Z' with rowNames and colNames we need to: - # - # pcolormesh(X,Y,Z[::-1,:]) # invert x-axis - # # swap x/y axis parameters and labels, remember x is still inverted: - # xlim([yMin,yMax]) - # ylim([M-xMax,M-xMin]) - # xticks() - # - - mpl.figure(p+1) - Z = np.log10(currentDat) - X, Y = np.meshgrid( range(0,len(Z[0])+1), range(0,len(Z)+1) ) - mpl.pcolormesh(X,Y,Z[::-1,:],vmin=VMIN_LOG[0],vmax=VMIN_LOG[1],cmap='jet') - mpl.xlim([QRANGE[0],QRANGE[1]+1]) - mpl.ylim([RQ-QRANGE[1]-1,RQ-QRANGE[0]]) - mpl.yticks(qTicksy,qLabels) - mpl.xticks(qTicksx,qLabels) - mpl.xlabel('\n' + r'$Q_{i+1}$') - mpl.ylabel(r'$Q_i$') - mpl.title('Q-Score Transition Frequencies [Read Pos:'+str(p)+']') - cb = mpl.colorbar() - cb.set_ticks([-4,-3,-2,-1,0]) - cb.set_ticklabels([r'$10^{-4}$',r'$10^{-3}$',r'$10^{-2}$',r'$10^{-1}$',r'$10^{0}$']) - - #mpl.tight_layout() - mpl.show() - - print 'estimating average error rate via simulation...' 
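The simulation announced above treats a read's quality string as a Markov chain: the first Q is drawn from the position-0 prior, each later Q is drawn from a transition row conditioned on the previous Q, and a score's error rate is 10^(-Q/10). A self-contained sketch of that idea using plain numpy with toy matrices (not the script's DiscreteDistribution class, and not real model values):

```
import numpy as np

# Toy three-score alphabet (Q = 0, 10, 20), purely illustrative.
q_scores = [0, 10, 20]
init_prior = [0.1, 0.3, 0.6]                      # p(Q at read position 0)
transition = np.array([[0.7, 0.2, 0.1],           # row i: p(next Q | previous Q = q_scores[i])
                       [0.2, 0.6, 0.2],
                       [0.1, 0.2, 0.7]])

rng = np.random.default_rng(0)
read_len = 10
q = rng.choice(len(q_scores), p=init_prior)
sampled = [q_scores[q]]
for _ in range(read_len - 1):
    q = rng.choice(len(q_scores), p=transition[q])
    sampled.append(q_scores[q])

# Phred convention: p(error) = 10 ** (-Q / 10); averaging over sampled bases
# gives the same flavor of estimate the script reports as AVG ERROR RATE.
avg_error = np.mean([10.0 ** (-s / 10.0) for s in sampled])
print(sampled, avg_error)
```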
- Qscores = range(RQ) - #print (len(initQ), len(initQ[0])) - #print (len(probQ), len(probQ[1]), len(probQ[1][0])) - - initDistByPos = [DiscreteDistribution(initQ[i],Qscores) for i in xrange(len(initQ))] - probDistByPosByPrevQ = [None] - for i in xrange(1,len(initQ)): - probDistByPosByPrevQ.append([]) - for j in xrange(len(initQ[0])): - if np.sum(probQ[i][j]) <= 0.: # if we don't have sufficient data for a transition, use the previous qscore - probDistByPosByPrevQ[-1].append(DiscreteDistribution([1],[Qscores[j]],degenerateVal=Qscores[j])) - else: - probDistByPosByPrevQ[-1].append(DiscreteDistribution(probQ[i][j],Qscores)) - - countDict = {} - for q in Qscores: - countDict[q] = 0 - for samp in xrange(1,N_SAMP+1): - if samp%PRINT_EVERY == 0: - print samp - myQ = initDistByPos[0].sample() - countDict[myQ] += 1 - for i in xrange(1,len(initQ)): - myQ = probDistByPosByPrevQ[i][myQ].sample() - countDict[myQ] += 1 - - totBases = float(sum(countDict.values())) - avgError = 0. - for k in sorted(countDict.keys()): - eVal = 10.**(-k/10.) - #print k, eVal, countDict[k] - avgError += eVal * (countDict[k]/totBases) - print 'AVG ERROR RATE:',avgError - - return (initQ, probQ, avgError) - -parser = argparse.ArgumentParser(description='genSeqErrorModel.py') -parser.add_argument('-i', type=str, required=True, metavar='', help="* input_read1.fq (.gz) / input_read1.sam") -parser.add_argument('-o', type=str, required=True, metavar='', help="* output.p") -parser.add_argument('-i2', type=str, required=False, metavar='', default=None, help="input_read2.fq (.gz) / input_read2.sam") -parser.add_argument('-p', type=str, required=False, metavar='', default=None, help="input_alignment.pileup") -parser.add_argument('-q', type=int, required=False, metavar='', default=33, help="quality score offset [33]") -parser.add_argument('-Q', type=int, required=False, metavar='', default=41, help="maximum quality score [41]") -parser.add_argument('-n', type=int, required=False, metavar='', default=-1, help="maximum number of reads to process [all]") -parser.add_argument('-s', type=int, required=False, metavar='', default=1000000, help="number of simulation iterations [1000000]") -parser.add_argument('--plot', required=False, action='store_true', default=False, help='perform some optional plotting') -args = parser.parse_args() - -(INF, OUF, offQ, maxQ, MAX_READS, N_SAMP) = (args.i, args.o, args.q, args.Q, args.n, args.s) -(INF2, PILEUP) = (args.i2, args.p) - -RQ = maxQ+1 - -INIT_SMOOTH = 0. -PROB_SMOOTH = 0. -PRINT_EVERY = 10000 -PLOT_STUFF = args.plot -if PLOT_STUFF: - print 'plotting is desired, lets import matplotlib...' - import matplotlib.pyplot as mpl +import sys +import pickle +import matplotlib.pyplot as mpl +import pathlib +import pysam +from functools import reduce + +# enables import from neighboring package +sys.path.append(str(pathlib.Path(__file__).resolve().parents[1])) + +from source.probability import DiscreteDistribution + + +def parse_file(input_file, real_q, off_q, max_reads, n_samp, plot_stuff): + init_smooth = 0. + prob_smooth = 0. 
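The aligned-input branch a few lines below totals the reads in a BAM/SAM by summing the mapped and unmapped columns of pysam.idxstats output. A sketch of the same tally written with plain int() parsing, assuming the usual four-column idxstats layout (reference name, reference length, mapped reads, unmapped reads); the helper name is made up:

```
import pysam

def count_reads_from_idxstats(bam_path):
    # bam_path should point to an indexed BAM/SAM; columns 3 and 4 of each
    # idxstats line are the mapped and unmapped read counts for that contig.
    total = 0
    for line in pysam.idxstats(bam_path).strip().split('\n'):
        fields = line.split('\t')
        total += int(fields[2]) + int(fields[3])
    return total
```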
+ + # Takes a gzip or sam file and returns the simulation's average error rate, + print('reading ' + input_file + '...') + is_aligned = False + lines_to_read = 0 + try: + if input_file[-4:] == '.bam' or input_file[-4:] == '.sam': + print('detected aligned file....') + stats = pysam.idxstats(input_file).strip().split('\n') + lines_to_read = reduce(lambda x, y: x + y, [eval('+'.join(l.rstrip('\n').split('\t')[2:])) for l in stats]) + f = pysam.AlignmentFile(input_file) + is_aligned = True + else: + print('detected fastq file....') + with pysam.FastxFile(input_file) as f: + for _ in f: + lines_to_read += 1 + f = pysam.FastxFile(input_file) + except FileNotFoundError: + print("Check input file. Must be fastq, gzipped fastq, or bam/sam file.") + sys.exit(1) + + actual_readlen = 0 + q_dict = {} + current_line = 0 + quarters = lines_to_read // 4 + + if is_aligned: + g = f.fetch() + else: + g = f + + for read in g: + if is_aligned: + qualities_to_check = read.query_alignment_qualities + else: + qualities_to_check = read.get_quality_array() + if actual_readlen == 0: + actual_readlen = len(qualities_to_check) - 1 + print('assuming read length is uniform...') + print('detected read length (from first read found):', actual_readlen) + prior_q = np.zeros([actual_readlen, real_q]) + total_q = [None] + [np.zeros([real_q, real_q]) for n in range(actual_readlen - 1)] + + # sanity-check readlengths + if len(qualities_to_check) - 1 != actual_readlen: + print('skipping read with unexpected length...') + continue + + for i in range(actual_readlen): + q = qualities_to_check[i] + q_dict[q] = True + prev_q = q + if i == 0: + prior_q[i][q] += 1 + else: + total_q[i][prev_q, q] += 1 + prior_q[i][q] += 1 + + current_line += 1 + if current_line % quarters == 0: + print(f'{(current_line/lines_to_read)*100:.0f}%') + if 0 < max_reads <= current_line: + break + + f.close() + + # some sanity checking again... + q_range = [min(q_dict.keys()), max(q_dict.keys())] + if q_range[0] < 0: + print('\nError: Read in Q-scores below 0\n') + exit(1) + if q_range[1] > real_q: + print('\nError: Read in Q-scores above specified maximum:', q_range[1], '>', real_q, '\n') + exit(1) + + print('computing probabilities...') + prob_q = [None] + [[[0. for m in range(real_q)] for n in range(real_q)] for p in range(actual_readlen - 1)] + for p in range(1, actual_readlen): + for i in range(real_q): + row_sum = float(np.sum(total_q[p][i, :])) + prob_smooth * real_q + if row_sum <= 0.: + continue + for j in range(real_q): + prob_q[p][i][j] = (total_q[p][i][j] + prob_smooth) / row_sum + + init_q = [[0. 
for m in range(real_q)] for n in range(actual_readlen)] + for i in range(actual_readlen): + row_sum = float(np.sum(prior_q[i, :])) + init_smooth * real_q + if row_sum <= 0.: + continue + for j in range(real_q): + init_q[i][j] = (prior_q[i][j] + init_smooth) / row_sum + + if plot_stuff: + mpl.rcParams.update({'font.size': 14, 'font.weight': 'bold', 'lines.linewidth': 3}) + + mpl.figure(1) + Z = np.array(init_q).T + X, Y = np.meshgrid(range(0, len(Z[0]) + 1), range(0, len(Z) + 1)) + mpl.pcolormesh(X, Y, Z, vmin=0., vmax=0.25) + mpl.axis([0, len(Z[0]), 0, len(Z)]) + mpl.yticks(range(0, len(Z), 10), range(0, len(Z), 10)) + mpl.xticks(range(0, len(Z[0]), 10), range(0, len(Z[0]), 10)) + mpl.xlabel('Read Position') + mpl.ylabel('Quality Score') + mpl.title('Q-Score Prior Probabilities') + mpl.colorbar() + + mpl.show() + + v_min_log = [-4, 0] + min_val = 10 ** v_min_log[0] + q_labels = [str(n) for n in range(q_range[0], q_range[1] + 1) if n % 5 == 0] + print(q_labels) + q_ticks_x = [int(n) + 0.5 for n in q_labels] + q_ticks_y = [(real_q - int(n)) - 0.5 for n in q_labels] + + for p in range(1, actual_readlen, 10): + current_data = np.array(prob_q[p]) + for i in range(len(current_data)): + for j in range(len(current_data[i])): + current_data[i][j] = max(min_val, current_data[i][j]) + + # matrix indices: pcolormesh plotting: plot labels and axes: + # + # y ^ ^ + # --> x | y | + # x | --> --> + # v y x + # + # to plot a MxN matrix 'Z' with rowNames and colNames we need to: + # + # pcolormesh(X,Y,Z[::-1,:]) # invert x-axis + # # swap x/y axis parameters and labels, remember x is still inverted: + # xlim([yMin,yMax]) + # ylim([M-xMax,M-xMin]) + # xticks() + # + + mpl.figure(p + 1) + z = np.log10(current_data) + x, y = np.meshgrid(range(0, len(Z[0]) + 1), range(0, len(Z) + 1)) + mpl.pcolormesh(x, y, z[::-1, :], vmin=v_min_log[0], vmax=v_min_log[1], cmap='jet') + mpl.xlim([q_range[0], q_range[1] + 1]) + mpl.ylim([real_q - q_range[1] - 1, real_q - q_range[0]]) + mpl.yticks(q_ticks_y, q_labels) + mpl.xticks(q_ticks_x, q_labels) + mpl.xlabel('\n' + r'$Q_{i+1}$') + mpl.ylabel(r'$Q_i$') + mpl.title('Q-Score Transition Frequencies [Read Pos:' + str(p) + ']') + cb = mpl.colorbar() + cb.set_ticks([-4, -3, -2, -1, 0]) + cb.set_ticklabels([r'$10^{-4}$', r'$10^{-3}$', r'$10^{-2}$', r'$10^{-1}$', r'$10^{0}$']) + + # mpl.tight_layout() + mpl.show() + + print('estimating average error rate via simulation...') + q_scores = range(real_q) + # print (len(init_q), len(init_q[0])) + # print (len(prob_q), len(prob_q[1]), len(prob_q[1][0])) + + init_dist_by_pos = [DiscreteDistribution(init_q[i], q_scores) for i in range(len(init_q))] + prob_dist_by_pos_by_prev_q = [None] + for i in range(1, len(init_q)): + prob_dist_by_pos_by_prev_q.append([]) + for j in range(len(init_q[0])): + if np.sum(prob_q[i][j]) <= 0.: # if we don't have sufficient data for a transition, use the previous qscore + prob_dist_by_pos_by_prev_q[-1].append(DiscreteDistribution([1], [q_scores[j]], degenerate_val=q_scores[j])) + else: + prob_dist_by_pos_by_prev_q[-1].append(DiscreteDistribution(prob_q[i][j], q_scores)) + + count_dict = {} + for q in q_scores: + count_dict[q] = 0 + lines_to_sample = len(range(1, n_samp + 1)) + samp_quarters = lines_to_sample // 4 + for samp in range(1, n_samp + 1): + if samp % samp_quarters == 0: + print(f'{(samp/lines_to_sample)*100:.0f}%') + my_q = init_dist_by_pos[0].sample() + count_dict[my_q] += 1 + for i in range(1, len(init_q)): + my_q = prob_dist_by_pos_by_prev_q[i][my_q].sample() + count_dict[my_q] += 1 + + tot_bases = 
float(sum(count_dict.values())) + avg_err = 0. + for k in sorted(count_dict.keys()): + eVal = 10. ** (-k / 10.) + # print k, eVal, count_dict[k] + avg_err += eVal * (count_dict[k] / tot_bases) + print('AVG ERROR RATE:', avg_err) + + return init_q, prob_q, avg_err + def main(): + parser = argparse.ArgumentParser(description='genSeqErrorModel.py') + parser.add_argument('-i', type=str, required=True, metavar='', help="* input_read1.fq (.gz) / input_read1.sam") + parser.add_argument('-o', type=str, required=True, metavar='', help="* output.p") + parser.add_argument('-i2', type=str, required=False, metavar='', default=None, + help="input_read2.fq (.gz) / input_read2.sam") + parser.add_argument('-p', type=str, required=False, metavar='', default=None, help="input_alignment.pileup") + parser.add_argument('-q', type=int, required=False, metavar='', default=33, help="quality score offset [33]") + parser.add_argument('-Q', type=int, required=False, metavar='', default=41, help="maximum quality score [41]") + parser.add_argument('-n', type=int, required=False, metavar='', default=-1, + help="maximum number of reads to process [all]") + parser.add_argument('-s', type=int, required=False, metavar='', default=1000000, + help="number of simulation iterations [1000000]") + parser.add_argument('--plot', required=False, action='store_true', default=False, + help='perform some optional plotting') + args = parser.parse_args() + + (infile, outfile, off_q, max_q, max_reads, n_samp) = (args.i, args.o, args.q, args.Q, args.n, args.s) + (infile2, pile_up) = (args.i2, args.p) + + real_q = max_q + 1 + + plot_stuff = args.plot + + q_scores = range(real_q) + if infile2 is None: + (init_q, prob_q, avg_err) = parse_file(infile, real_q, off_q, max_reads, n_samp, plot_stuff) + else: + (init_q, prob_q, avg_err1) = parse_file(infile, real_q, off_q, max_reads, n_samp, plot_stuff) + (init_q2, prob_q2, avg_err2) = parse_file(infile2, real_q, off_q, max_reads, n_samp, plot_stuff) + avg_err = (avg_err1 + avg_err2) / 2. + + # + # embed some default sequencing error parameters if no pileup is provided + # + if pile_up == None: + + print('Using default sequencing error parameters...') + + # sequencing substitution transition probabilities + sse_prob = [[0., 0.4918, 0.3377, 0.1705], + [0.5238, 0., 0.2661, 0.2101], + [0.3754, 0.2355, 0., 0.3890], + [0.2505, 0.2552, 0.4942, 0.]] + # if a sequencing error occurs, what are the odds it's an indel? + sie_rate = 0.01 + # sequencing indel error length distribution + sie_prob = [0.999, 0.001] + sie_val = [1, 2] + # if a sequencing indel error occurs, what are the odds it's an insertion as opposed to a deletion? + sie_ins_freq = 0.4 + # if a sequencing insertion error occurs, what's the probability of it being an A, C, G, T... + sie_ins_nucl = [0.25, 0.25, 0.25, 0.25] + + # + # otherwise we need to parse a pileup and compute statistics! 
+ # + else: + print('\nPileup parsing coming soon!\n') + exit(1) + + err_params = [sse_prob, sie_rate, sie_prob, sie_val, sie_ins_freq, sie_ins_nucl] + + # + # finally, let's save our output model + # + outfile = pathlib.Path(outfile).with_suffix(".p") + print('saving model...') + if infile2 is None: + pickle.dump([init_q, prob_q, q_scores, off_q, avg_err, err_params], open(outfile, 'wb')) + else: + pickle.dump([init_q, prob_q, init_q2, prob_q2, q_scores, off_q, avg_err, err_params], open(outfile, 'wb')) - Qscores = range(RQ) - if INF2 == None: - (initQ, probQ, avgError) = parseFQ(INF) - else: - (initQ, probQ, avgError1) = parseFQ(INF) - (initQ2, probQ2, avgError2) = parseFQ(INF2) - avgError = (avgError1+avgError2)/2. - - # - # embed some default sequencing error parameters if no pileup is provided - # - if PILEUP == None: - - print 'Using default sequencing error parameters...' - - # sequencing substitution transition probabilities - SSE_PROB = [[0., 0.4918, 0.3377, 0.1705 ], - [0.5238, 0., 0.2661, 0.2101 ], - [0.3754, 0.2355, 0., 0.3890 ], - [0.2505, 0.2552, 0.4942, 0. ]] - # if a sequencing error occurs, what are the odds it's an indel? - SIE_RATE = 0.01 - # sequencing indel error length distribution - SIE_PROB = [0.999,0.001] - SIE_VAL = [1,2] - # if a sequencing indel error occurs, what are the odds it's an insertion as opposed to a deletion? - SIE_INS_FREQ = 0.4 - # if a sequencing insertion error occurs, what's the probability of it being an A, C, G, T... - SIE_INS_NUCL = [0.25, 0.25, 0.25, 0.25] - - # - # otherwise we need to parse a pileup and compute statistics! - # - else: - print '\nPileup parsing coming soon!\n' - exit(1) - - errorParams = [SSE_PROB, SIE_RATE, SIE_PROB, SIE_VAL, SIE_INS_FREQ, SIE_INS_NUCL] - - # - # finally, let's save our output model - # - print 'saving model...' 
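For downstream use, the single-ended model above is pickled as a flat list; the element order comes straight from the pickle.dump call in the new main() (the paired-end variant adds init_q2 and prob_q2 as the third and fourth entries). A short sketch of how a consumer might unpack it; 'seq_error_model.p' is a hypothetical filename:

```
import pickle

# Hypothetical path; element order matches the single-ended dump above.
with open('seq_error_model.p', 'rb') as model_file:
    init_q, prob_q, q_scores, off_q, avg_err, err_params = pickle.load(model_file)

sse_prob, sie_rate, sie_prob, sie_val, sie_ins_freq, sie_ins_nucl = err_params
print('average sequencing error rate stored in the model:', avg_err)
```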
- if INF2 == None: - pickle.dump([initQ,probQ,Qscores,offQ,avgError,errorParams],open(OUF,'wb')) - else: - pickle.dump([initQ,probQ,initQ2,probQ2,Qscores,offQ,avgError,errorParams],open(OUF,'wb')) if __name__ == '__main__': - main() + main() diff --git a/utilities/gen_mut_model.py b/utilities/gen_mut_model.py new file mode 100755 index 0000000..905863a --- /dev/null +++ b/utilities/gen_mut_model.py @@ -0,0 +1,505 @@ +#!/usr/bin/env source + +# Python 3 ready + +import os +import re +import pickle +import argparse +import numpy as np +from Bio import SeqIO +import pandas as pd + + +######################################################### +# VARIOUS HELPER FUNCTIONS # +######################################################### + + +def cluster_list(list_to_cluster: list, delta: float) -> list: + """ + Clusters a sorted list + :param list_to_cluster: a sorted list + :param delta: the value to compare list items to + :return: a clustered list of values + """ + out_list = [[list_to_cluster[0]]] + previous_value = list_to_cluster[0] + current_index = 0 + for item in list_to_cluster[1:]: + if item - previous_value <= delta: + out_list[current_index].append(item) + else: + current_index += 1 + out_list.append([]) + out_list[current_index].append(item) + previous_value = item + return out_list + + +##################################### +# main() # +##################################### + + +def main(): + # Some constants we'll need later + REF_WHITELIST = [str(n) for n in range(1, 30)] + ['x', 'y', 'X', 'Y', 'mt', 'Mt', 'MT'] + REF_WHITELIST += ['chr' + n for n in REF_WHITELIST] + VALID_NUCL = ['A', 'C', 'G', 'T'] + VALID_TRINUC = [VALID_NUCL[i] + VALID_NUCL[j] + VALID_NUCL[k] for i in range(len(VALID_NUCL)) for j in + range(len(VALID_NUCL)) for k in range(len(VALID_NUCL))] + # if parsing a dbsnp vcf, and no CAF= is found in info tag, use this as default val for population freq + VCF_DEFAULT_POP_FREQ = 0.00001 + + parser = argparse.ArgumentParser(description='gen_mut_model.source', + formatter_class=argparse.ArgumentDefaultsHelpFormatter,) + parser.add_argument('-r', type=str, required=True, metavar='/path/to/reference.fasta', + help="Reference file for organism in fasta format") + parser.add_argument('-m', type=str, required=True, metavar='/path/to/mutations.vcf', + help="Mutation file for organism in VCF format") + parser.add_argument('-o', type=str, required=True, metavar='/path/to/output/and/prefix', + help="Name of output file (final model will append \'.p\')") + parser.add_argument('-b', type=str, required=False, metavar='Bed file of regions to include ' + '(use bedtools complement if you have a ' + 'bed of exclusion areas)', default=None, + help="only_use_these_regions.bed") + parser.add_argument('--save-trinuc', required=False, action='store_true', default=False, + help='save trinucleotide counts for reference') + parser.add_argument('--human-sample', required=False, action='store_true', default=False, + help='To skip unnumbered scaffolds in human references') + parser.add_argument('--skip-common', required=False, action='store_true', default=False, + help='Do not save common snps + high mut regions') + args = parser.parse_args() + + (ref, vcf, out_pickle, save_trinuc, skip_common) = ( + args.r, args.m, args.o, args.save_trinuc, args.skip_common) + + is_human = args.human_sample + + # how many times do we observe each trinucleotide in the reference (and input bed region, if present)? 
+ TRINUC_REF_COUNT = {} + # [(trinuc_a, trinuc_b)] = # of times we observed a mutation from trinuc_a into trinuc_b + TRINUC_TRANSITION_COUNT = {} + # total count of SNPs + SNP_COUNT = 0 + # overall SNP transition probabilities + SNP_TRANSITION_COUNT = {} + # total count of indels, indexed by length + INDEL_COUNT = {} + # tabulate how much non-N reference sequence we've eaten through + TOTAL_REFLEN = 0 + # detect variants that occur in a significant percentage of the input samples (pos,ref,alt,pop_fraction) + COMMON_VARIANTS = [] + # tabulate how many unique donors we've encountered (this is useful for identifying common variants) + TOTAL_DONORS = {} + # identify regions that have significantly higher local mutation rates than the average + HIGH_MUT_REGIONS = [] + + # Process bed file, + is_bed = False + my_bed = None + if args.b is not None: + print('Processing bed file...') + try: + my_bed = pd.read_csv(args.b, sep='\t', header=None, index_col=None) + is_bed = True + except ValueError: + print('Problem parsing bed file. Ensure bed file is tab separated, standard bed format') + + my_bed = my_bed.rename(columns={0: 'chrom', 1: 'start', 2: 'end'}) + # Adding a couple of columns we'll need for later calculations + my_bed['coords'] = list(zip(my_bed.start, my_bed.end)) + my_bed['track_len'] = my_bed.end - my_bed.start + 1 + + # Process reference file + print('Processing reference...') + try: + reference = SeqIO.to_dict(SeqIO.parse(ref, "fasta")) + except ValueError: + print("Problems parsing reference file. Ensure reference is in proper fasta format") + + # simplify naming and filter out actual human genomes from scaffolding + ref_dict = {} + for key in reference.keys(): + key_split = key.split("|") + if is_human: + if key_split[0] in REF_WHITELIST: + ref_dict[key_split[0]] = reference[key] + else: + continue + else: + ref_dict[key_split[0]] = reference[key] + + ref_list = list(ref_dict.keys()) + + # Process VCF file. First check if it's been entered as a TSV + if vcf[-3:] == 'tsv': + print("Warning! TSV file must follow VCF specifications.") + + # Pre-parsing to find all the matching chromosomes between ref and vcf + print('Processing VCF file...') + try: + variants = pd.read_csv(vcf, sep='\t', comment='#', index_col=None, header=None) + except ValueError: + print("VCF must be in standard VCF format with tab-separated columns") + + # Narrow chromosomes to those matching the reference + # This is in part to make sure the names match + variant_chroms = variants[0].to_list() + variant_chroms = list(set(variant_chroms)) + matching_chromosomes = [] + for ref_name in ref_list: + if ref_name not in variant_chroms: + continue + else: + matching_chromosomes.append(ref_name) + + # Check to make sure there are some matches + if not matching_chromosomes: + print("Found no chromosomes in common between VCF and Fasta. Please fix the chromosome names and try again") + exit(1) + + # Double check that there are matches + try: + matching_variants = variants[variants[0].isin(matching_chromosomes)] + except ValueError: + print("Problem matching variants with reference.") + + if matching_variants.empty: + print("There is no overlap between reference and variant file. 
This could be a chromosome naming problem") + exit(1) + + + # Rename header in dataframe for processing + matching_variants = matching_variants.rename(columns={0: "CHROM", 1: 'chr_start', 2: 'ID', 3: 'REF', 4: 'ALT', + 5: 'QUAL', 6: 'FILTER', 7: 'INFO'}) + + # Change the indexing by -1 to match source format indexing (0-based) + matching_variants['chr_start'] = matching_variants['chr_start'] - 1 + matching_variants['chr_end'] = matching_variants['chr_start'] + + # Process the variant table + indices_to_indels = \ + matching_variants.loc[matching_variants.ALT.apply(len) != matching_variants.REF.apply(len)].index + + # indels in vcf don't include the preserved first nucleotide, so lets trim the vcf alleles + ref_values_to_change = matching_variants.loc[indices_to_indels, 'REF'].copy().str[1:] + alt_values_to_change = matching_variants.loc[indices_to_indels, 'ALT'].copy().str[1:] + matching_variants.loc[indices_to_indels, "REF"] = ref_values_to_change + matching_variants.loc[indices_to_indels, "ALT"] = alt_values_to_change + matching_variants.replace('', '-', inplace=True) + + # If multi-alternate alleles are present, lets just ignore this variant. I may come back and improve this later + indices_to_ignore = matching_variants[matching_variants['ALT'].str.contains(',')].index + matching_variants = matching_variants.drop(indices_to_ignore) + + # if we encounter a multi-np (i.e. 3 nucl --> 3 different nucl), let's skip it for now... + + # Alt and Ref contain no dashes + no_dashes = matching_variants[ + ~matching_variants['REF'].str.contains('-') & ~matching_variants['ALT'].str.contains('-')].index + # Alt and Ref lengths are greater than 1 + long_variants = matching_variants[ + (matching_variants['REF'].apply(len) > 1) & (matching_variants['ALT'].apply(len) > 1)].index + complex_variants = list(set(no_dashes) & set(long_variants)) + matching_variants = matching_variants.drop(complex_variants) + + # This is solely to make regex easier later, since we can't predict where in the line a string will be + new_info = ';' + matching_variants['INFO'].copy() + ';' + matching_variants['INFO'] = new_info + + # Now we check that the bed and vcf have matching regions + # This also checks that the vcf and bed have the same naming conventions and cuts out scaffolding. + if is_bed: + bed_chroms = list(set(my_bed['chrom'])) + matching_bed_keys = list(set(bed_chroms) & set(variant_chroms)) + try: + matching_bed = my_bed[my_bed['chrom'].isin(matching_bed_keys)] + except ValueError: + print('Problem matching bed chromosomes to variant file.') + + if matching_bed.empty: + print("There is no overlap between bed and variant file. 
" + "This could be a chromosome naming problem") + exit(1) + + # Count Trinucleotides in reference, based on bed or not + print('Counting trinucleotides in reference...') + + if is_bed: + print("since you're using a bed input, we have to count trinucs in bed region even if " + "you already have a trinuc count file for the reference...") + for ref_name in matching_chromosomes: + sub_bed = matching_bed[matching_bed['chrom'] == ref_name] + sub_regions = sub_bed['coords'].to_list() + for sr in sub_regions: + sub_seq = ref_dict[ref_name][sr[0]: sr[1]].seq + for trinuc in VALID_TRINUC: + if trinuc not in TRINUC_REF_COUNT: + TRINUC_REF_COUNT[trinuc] = 0 + TRINUC_REF_COUNT[trinuc] += sub_seq.count_overlap(trinuc) + + elif not os.path.isfile(ref + '.trinucCounts'): + for ref_name in matching_chromosomes: + sub_seq = ref_dict[ref_name].seq + for trinuc in VALID_TRINUC: + if trinuc not in TRINUC_REF_COUNT: + TRINUC_REF_COUNT[trinuc] = 0 + TRINUC_REF_COUNT[trinuc] += sub_seq.count_overlap(trinuc) + else: + print('Found trinucCounts file, using that.') + + # Load and process variants in each reference sequence individually, for memory reasons... + print('Creating mutational model...') + for ref_name in matching_chromosomes: + # Count the number of non-N nucleotides for the reference + TOTAL_REFLEN += len(ref_dict[ref_name].seq) - ref_dict[ref_name].seq.count('N') + + # list to be used for counting variants that occur multiple times in file (i.e. in multiple samples) + VDAT_COMMON = [] + + # Create a view that narrows variants list to current ref + variants_to_process = matching_variants[matching_variants["CHROM"] == ref_name].copy() + ref_sequence = str(ref_dict[ref_name].seq) + + # we want only snps + # so, no '-' characters allowed, and chrStart must be same as chrEnd + snp_df = variants_to_process[~variants_to_process.index.isin(indices_to_indels)] + snp_df = snp_df.loc[snp_df['chr_start'] == snp_df['chr_end']] + if is_bed: + bed_to_process = matching_bed[matching_bed['chrom'] == ref_name].copy() + # TODO fix this line (need the intersection of these two, I think) + snp_df = bed_to_process.join(snp_df) + + if not snp_df.empty: + # only consider positions where ref allele in vcf matches the nucleotide in our reference + for index, row in snp_df.iterrows(): + trinuc_to_analyze = str(ref_sequence[row.chr_start - 1: row.chr_start + 2]) + if trinuc_to_analyze not in VALID_TRINUC: + continue + if row.REF == trinuc_to_analyze[1]: + trinuc_ref = trinuc_to_analyze + trinuc_alt = trinuc_to_analyze[0] + snp_df.loc[index, 'ALT'] + trinuc_to_analyze[2] + if trinuc_alt not in VALID_TRINUC: + continue + key = (trinuc_ref, trinuc_alt) + if key not in TRINUC_TRANSITION_COUNT: + TRINUC_TRANSITION_COUNT[key] = 0 + TRINUC_TRANSITION_COUNT[key] += 1 + SNP_COUNT += 1 + key2 = (str(row.REF), str(row.ALT)) + if key2 not in SNP_TRANSITION_COUNT: + SNP_TRANSITION_COUNT[key2] = 0 + SNP_TRANSITION_COUNT[key2] += 1 + + my_pop_freq = VCF_DEFAULT_POP_FREQ + if ';CAF=' in snp_df.loc[index, 'INFO']: + caf_str = re.findall(r";CAF=.*?(?=;)", row.INFO)[0] + if ',' in caf_str: + my_pop_freq = float(caf_str[5:].split(',')[1]) + VDAT_COMMON.append( + (row.chr_start, row.REF, row.REF, row.ALT, my_pop_freq)) + else: + print('\nError: ref allele in variant call does not match reference.\n') + exit(1) + + # now let's look for indels... 
+ indel_df = variants_to_process[variants_to_process.index.isin(indices_to_indels)] + if not indel_df.empty: + for index, row in indel_df.iterrows(): + if "-" in row.REF: + len_ref = 0 + else: + len_ref = len(row.REF) + if "-" in row.ALT: + len_alt = 0 + else: + len_alt = len(row.ALT) + if len_ref != len_alt: + indel_len = len_alt - len_ref + if indel_len not in INDEL_COUNT: + INDEL_COUNT[indel_len] = 0 + INDEL_COUNT[indel_len] += 1 + + my_pop_freq = VCF_DEFAULT_POP_FREQ + if ';CAF=' in row.INFO: + caf_str = re.findall(r";CAF=.*?(?=;)", row.INFO)[0] + if ',' in caf_str: + my_pop_freq = float(caf_str[5:].split(',')[1]) + VDAT_COMMON.append((row.chr_start, row.REF, row.REF, row.ALT, my_pop_freq)) + + # if we didn't find anything, skip ahead along to the next reference sequence + if not len(VDAT_COMMON): + print('Found no variants for this reference.') + continue + + # identify common mutations + percentile_var = 95 + min_value = np.percentile([n[4] for n in VDAT_COMMON], percentile_var) + for k in sorted(VDAT_COMMON): + if k[4] >= min_value: + COMMON_VARIANTS.append((ref_name, k[0], k[1], k[3], k[4])) + VDAT_COMMON = {(n[0], n[1], n[2], n[3]): n[4] for n in VDAT_COMMON} + + # identify areas that have contained significantly higher random mutation rates + dist_thresh = 2000 + percentile_clust = 97 + scaler = 1000 + # identify regions with disproportionately more variants in them + VARIANT_POS = sorted([n[0] for n in VDAT_COMMON.keys()]) + clustered_pos = cluster_list(VARIANT_POS, dist_thresh) + by_len = [(len(clustered_pos[i]), min(clustered_pos[i]), max(clustered_pos[i]), i) for i in + range(len(clustered_pos))] + # Not sure what this was intended to do or why it is commented out. Leaving it here for now. + # by_len = sorted(by_len,reverse=True) + # minLen = int(np.percentile([n[0] for n in by_len],percentile_clust)) + # by_len = [n for n in by_len if n[0] >= minLen] + candidate_regions = [] + for n in by_len: + bi = int((n[1] - dist_thresh) / float(scaler)) * scaler + bf = int((n[2] + dist_thresh) / float(scaler)) * scaler + candidate_regions.append((n[0] / float(bf - bi), max([0, bi]), min([len(ref_dict[ref_name]), bf]))) + minimum_value = np.percentile([n[0] for n in candidate_regions], percentile_clust) + for n in candidate_regions: + if n[0] >= minimum_value: + HIGH_MUT_REGIONS.append((ref_name, n[1], n[2], n[0])) + # collapse overlapping regions + for i in range(len(HIGH_MUT_REGIONS) - 1, 0, -1): + if HIGH_MUT_REGIONS[i - 1][2] >= HIGH_MUT_REGIONS[i][1] and HIGH_MUT_REGIONS[i - 1][0] == \ + HIGH_MUT_REGIONS[i][0]: + # Might need to research a more accurate way to get the mutation rate for this region + avg_mut_rate = 0.5 * HIGH_MUT_REGIONS[i - 1][3] + 0.5 * HIGH_MUT_REGIONS[i][ + 3] + HIGH_MUT_REGIONS[i - 1] = ( + HIGH_MUT_REGIONS[i - 1][0], HIGH_MUT_REGIONS[i - 1][1], HIGH_MUT_REGIONS[i][2], avg_mut_rate) + del HIGH_MUT_REGIONS[i] + + # if we didn't count ref trinucs because we found file, read in ref counts from file now + if os.path.isfile(ref + '.trinucCounts'): + print('reading pre-computed trinuc counts...') + f = open(ref + '.trinucCounts', 'r') + for line in f: + splt = line.strip().split('\t') + TRINUC_REF_COUNT[splt[0]] = int(splt[1]) + f.close() + # otherwise, save trinuc counts to file, if desired + elif save_trinuc: + if is_bed: + print('unable to save trinuc counts to file because using input bed region...') + else: + print('saving trinuc counts to file...') + f = open(ref + '.trinucCounts', 'w') + for trinuc in sorted(TRINUC_REF_COUNT.keys()): + f.write(trinuc + '\t' + 
str(TRINUC_REF_COUNT[trinuc]) + '\n') + f.close() + + # if for some reason we didn't find any valid input variants, exit gracefully... + total_var = SNP_COUNT + sum(INDEL_COUNT.values()) + if total_var == 0: + print( + '\nError: No valid variants were found, model could not be created. (Are you using the correct reference?)\n') + exit(1) + + + ### COMPUTE PROBABILITIES + + # frequency that each trinuc mutated into anything else + TRINUC_MUT_PROB = {} + # frequency that a trinuc mutates into another trinuc, given that it mutated + TRINUC_TRANS_PROBS = {} + # frequency of snp transitions, given a snp occurs. + SNP_TRANS_FREQ = {} + + for trinuc in sorted(TRINUC_REF_COUNT.keys()): + my_count = 0 + for k in sorted(TRINUC_TRANSITION_COUNT.keys()): + if k[0] == trinuc: + my_count += TRINUC_TRANSITION_COUNT[k] + TRINUC_MUT_PROB[trinuc] = my_count / float(TRINUC_REF_COUNT[trinuc]) + for k in sorted(TRINUC_TRANSITION_COUNT.keys()): + if k[0] == trinuc: + TRINUC_TRANS_PROBS[k] = TRINUC_TRANSITION_COUNT[k] / float(my_count) + + for n1 in VALID_NUCL: + rolling_tot = sum([SNP_TRANSITION_COUNT[(n1, n2)] for n2 in VALID_NUCL if (n1, n2) in SNP_TRANSITION_COUNT]) + for n2 in VALID_NUCL: + key2 = (n1, n2) + if key2 in SNP_TRANSITION_COUNT: + SNP_TRANS_FREQ[key2] = SNP_TRANSITION_COUNT[key2] / float(rolling_tot) + + # compute average snp and indel frequencies + SNP_FREQ = SNP_COUNT / float(total_var) + AVG_INDEL_FREQ = 1. - SNP_FREQ + INDEL_FREQ = {k: (INDEL_COUNT[k] / float(total_var)) / AVG_INDEL_FREQ for k in INDEL_COUNT.keys()} + + if is_bed: + track_sum = float(my_bed['track_len'].sum()) + AVG_MUT_RATE = total_var / track_sum + else: + AVG_MUT_RATE = total_var / float(TOTAL_REFLEN) + + # if values weren't found in data, appropriately append null entries + print_trinuc_warning = False + for trinuc in VALID_TRINUC: + trinuc_mut = [trinuc[0] + n + trinuc[2] for n in VALID_NUCL if n != trinuc[1]] + if trinuc not in TRINUC_MUT_PROB: + TRINUC_MUT_PROB[trinuc] = 0. + print_trinuc_warning = True + for trinuc2 in trinuc_mut: + if (trinuc, trinuc2) not in TRINUC_TRANS_PROBS: + TRINUC_TRANS_PROBS[(trinuc, trinuc2)] = 0. 
+ print_trinuc_warning = True + if print_trinuc_warning: + print( + 'Warning: Some trinucleotides transitions were not encountered in the input dataset, ' + 'probabilities of 0.0 have been assigned to these events.') + + # + # print some stuff + # + for k in sorted(TRINUC_MUT_PROB.keys()): + print('p(' + k + ' mutates) =', TRINUC_MUT_PROB[k]) + + for k in sorted(TRINUC_TRANS_PROBS.keys()): + print('p(' + k[0] + ' --> ' + k[1] + ' | ' + k[0] + ' mutates) =', TRINUC_TRANS_PROBS[k]) + + for k in sorted(INDEL_FREQ.keys()): + if k > 0: + print('p(ins length = ' + str(abs(k)) + ' | indel occurs) =', INDEL_FREQ[k]) + else: + print('p(del length = ' + str(abs(k)) + ' | indel occurs) =', INDEL_FREQ[k]) + + for k in sorted(SNP_TRANS_FREQ.keys()): + print('p(' + k[0] + ' --> ' + k[1] + ' | SNP occurs) =', SNP_TRANS_FREQ[k]) + + + print('p(snp) =', SNP_FREQ) + print('p(indel) =', AVG_INDEL_FREQ) + print('overall average mut rate:', AVG_MUT_RATE) + print('total variants processed:', total_var) + + # + # save variables to file + # + if skip_common: + OUT_DICT = {'AVG_MUT_RATE': AVG_MUT_RATE, + 'SNP_FREQ': SNP_FREQ, + 'SNP_TRANS_FREQ': SNP_TRANS_FREQ, + 'INDEL_FREQ': INDEL_FREQ, + 'TRINUC_MUT_PROB': TRINUC_MUT_PROB, + 'TRINUC_TRANS_PROBS': TRINUC_TRANS_PROBS} + else: + OUT_DICT = {'AVG_MUT_RATE': AVG_MUT_RATE, + 'SNP_FREQ': SNP_FREQ, + 'SNP_TRANS_FREQ': SNP_TRANS_FREQ, + 'INDEL_FREQ': INDEL_FREQ, + 'TRINUC_MUT_PROB': TRINUC_MUT_PROB, + 'TRINUC_TRANS_PROBS': TRINUC_TRANS_PROBS, + 'COMMON_VARIANTS': COMMON_VARIANTS, + 'HIGH_MUT_REGIONS': HIGH_MUT_REGIONS} + pickle.dump(OUT_DICT, open(out_pickle, "wb")) + + +if __name__ == "__main__": + main() diff --git a/utilities/generate_random_dna.py b/utilities/generate_random_dna.py new file mode 100644 index 0000000..39e176a --- /dev/null +++ b/utilities/generate_random_dna.py @@ -0,0 +1,25 @@ +import random + + +def generate_random_dna(lnth: int, seed: int = None) -> str: + """ + Takes a parameter length and returns a randomly generated DNA string of that length + :param lnth: how long of a string to generate + :param seed: Optional seed to produce reproducibly random results + :return: randomly generated string + """ + set = ["A", "G", "C", "T"] + if seed: + random.seed(seed) + else: + random.seed() + ret = "" + for i in range(lnth): + ret += random.choice(set) + return ret + + +if __name__ == '__main__': + print(generate_random_dna(10)) + print(generate_random_dna(10, 1)) + print(generate_random_dna(10, 1)) diff --git a/utilities/plotMutModel.py b/utilities/plotMutModel.py old mode 100644 new mode 100755 index 8d1e83d..e94207f --- a/utilities/plotMutModel.py +++ b/utilities/plotMutModel.py @@ -1,10 +1,11 @@ -#!/usr/bin/env python +#!/usr/bin/env source # # a quick script for comparing mutation models # -# python plotMutModel.py -i model1.p [model2.p] [model3.p]... -l legend_label1 [legend_label2] [legend_label3]... -o path/to/pdf_plot_prefix +# source plotMutModel.source -i model1.p [model2.p] [model3.p]... -l legend_label1 [legend_label2] [legend_label3]... -o path/to/pdf_plot_prefix # +# Python 3 ready import sys import pickle @@ -15,198 +16,206 @@ import matplotlib.cm as cmx import argparse -#mpl.rc('text',usetex=True) -#mpl.rcParams['text.latex.preamble']=[r"\usepackage{amsmath}"] - -parser = argparse.ArgumentParser(description='Plot and compare mutation models from genMutModel.py Usage: python plotMutModel.py -i model1.p [model2.p] [model3.p]... -l legend_label1 [legend_label2] [legend_label3]... 
-o path/to/pdf_plot_prefix') -parser.add_argument('-i', type=str, required=True, metavar='', nargs='+', help="* mutation_model_1.p [mutation_model_2.p] [mutation_model_3] ...") -parser.add_argument('-l', type=str, required=True, metavar='', nargs='+', help="* legend labels: model1_name [model2_name] [model3_name]...") -parser.add_argument('-o', type=str, required=True, metavar='', help="* output pdf prefix") +# mpl.rc('text',usetex=True) +# mpl.rcParams['text.latex.preamble']=[r"\usepackage{amsmath}"] + +parser = argparse.ArgumentParser(description='Plot and compare mutation models from gen_mut_model.source Usage: ' + 'source plotMutModel.source -i model1.p [model2.p] [model3.p]... ' + '-l legend_label1 [legend_label2] [legend_label3]... ' + '-o path/to/pdf_plot_prefix', + formatter_class=argparse.ArgumentDefaultsHelpFormatter,) +parser.add_argument('-i', type=str, required=True, metavar='', nargs='+', + help="* mutation_model_1.p [mutation_model_2.p] [mutation_model_3] ...") +parser.add_argument('-l', type=str, required=True, metavar='', nargs='+', + help="* legend labels: model1_name [model2_name] [model3_name]...") +parser.add_argument('-o', type=str, required=True, metavar='', help="* output pdf prefix") args = parser.parse_args() +def get_color(i, N, colormap='jet'): + cm = mpl.get_cmap(colormap) + c_norm = colors.Normalize(vmin=0, vmax=N + 1) + scalar_map = cmx.ScalarMappable(norm=c_norm, cmap=cm) + color_val = scalar_map.to_rgba(i) + return color_val + + +def is_in_bed(track, ind): + if ind in track: + return True + elif bisect.bisect(track, ind) % 1 == 1: + return True + else: + return False + + +def get_bed_overlap(track, ind_s, ind_e): + if ind_s in track: + my_ind = track.index(ind_s) + return min([track[my_ind + 1] - ind_s + 1, ind_e - ind_s + 1]) + else: + my_ind = bisect.bisect(track, ind_s) + if my_ind % 1 and my_ind < len(track) - 1: + return min([track[my_ind + 1] - ind_s + 1, ind_e - ind_s + 1]) + return 0 -def getColor(i,N,colormap='jet'): - cm = mpl.get_cmap(colormap) - cNorm = colors.Normalize(vmin=0, vmax=N+1) - scalarMap = cmx.ScalarMappable(norm=cNorm, cmap=cm) - colorVal = scalarMap.to_rgba(i) - return colorVal - -def isInBed(track,ind): - if ind in track: - return True - elif bisect.bisect(track,ind)%1 == 1: - return True - else: - return False - -def getBedOverlap(track,ind_s,ind_e): - if ind_s in track: - myInd = track.index(ind_s) - return min([track[myInd+1]-ind_s+1,ind_e-ind_s+1]) - else: - myInd = bisect.bisect(track,ind_s) - if myInd%1 and myInd < len(track)-1: - return min([track[myInd+1]-ind_s+1,ind_e-ind_s+1]) - return 0 # a waaaaaaay slower version of the above function ^^ -#def getTrackOverlap(track1,track2): -# otrack = [0 for n in xrange(max(track1+track2)+1)] -# for i in xrange(0,len(track1),2): -# for j in xrange(track1[i],track1[i+1]+1): +# def getTrackOverlap(track1,track2): +# otrack = [0 for n in range(max(track1+track2)+1)] +# for i in range(0,len(track1),2): +# for j in range(track1[i],track1[i+1]+1): # otrack[j] = 1 # ocount = 0 -# for i in xrange(0,len(track2),2): -# for j in xrange(track2[i],track2[i+1]+1): +# for i in range(0,len(track2),2): +# for j in range(track2[i],track2[i+1]+1): # if otrack[j]: # ocount += 1 # return ocount -OUP = args.o +OUP = args.o LAB = args.l -#print LAB -INP = args.i +# print LAB +INP = args.i N_FILES = len(INP) -mpl.rcParams.update({'font.size': 13, 'font.weight':'bold', 'lines.linewidth': 3}) +mpl.rcParams.update({'font.size': 13, 'font.weight': 'bold', 'lines.linewidth': 3}) 
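The track helpers above work on a flattened, sorted list of alternating interval starts and ends ([start1, end1, start2, end2, ...]); a position is inside an interval exactly when it equals a stored endpoint or when bisect would insert it at an odd index. A minimal illustration of that convention, written with a modulo-2 parity check and independent of the script's own is_in_bed:

```
import bisect

def in_flat_track(track, pos):
    # track: sorted, flattened [start1, end1, start2, end2, ...]
    # An odd insertion index means pos falls between a start and its end.
    if pos in track:
        return True
    return bisect.bisect(track, pos) % 2 == 1

flat_track = [100, 200, 500, 650]        # two intervals: [100, 200] and [500, 650]
print(in_flat_track(flat_track, 150))    # True  (inside the first interval)
print(in_flat_track(flat_track, 300))    # False (between the intervals)
```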
################################################# # # BASIC STATS # ################################################# -mpl.figure(0,figsize=(12,10)) +mpl.figure(0, figsize=(12, 10)) -mpl.subplot(2,2,1) -colorInd = 0 +mpl.subplot(2, 2, 1) +color_ind = 0 for fn in INP: - myCol = getColor(colorInd,N_FILES) - colorInd += 1 - DATA_DICT = pickle.load( open( fn, "rb" ) ) - [AVG_MUT_RATE, SNP_FREQ, INDEL_FREQ] = [DATA_DICT['AVG_MUT_RATE'], DATA_DICT['SNP_FREQ'], DATA_DICT['INDEL_FREQ']] - mpl.bar([colorInd-1],[AVG_MUT_RATE],1.,color=myCol) -mpl.xlim([-1,N_FILES+1]) + my_col = get_color(color_ind, N_FILES) + color_ind += 1 + DATA_DICT = pickle.load(open(fn, "rb"), encoding="utf-8") + [AVG_MUT_RATE, SNP_FREQ, INDEL_FREQ] = [DATA_DICT['AVG_MUT_RATE'], DATA_DICT['SNP_FREQ'], DATA_DICT['INDEL_FREQ']] + mpl.bar([color_ind - 1], [AVG_MUT_RATE], 1., color=my_col) +mpl.xlim([-1, N_FILES + 1]) mpl.grid() -mpl.xticks([],[]) +mpl.xticks([], []) mpl.ylabel('Frequency') mpl.title('Overall mutation rate (1/bp)') -mpl.subplot(2,2,2) -colorInd = 0 +mpl.subplot(2, 2, 2) +color_ind = 0 for fn in INP: - myCol = getColor(colorInd,N_FILES) - colorInd += 1 - DATA_DICT = pickle.load( open( fn, "rb" ) ) - [AVG_MUT_RATE, SNP_FREQ, INDEL_FREQ] = [DATA_DICT['AVG_MUT_RATE'], DATA_DICT['SNP_FREQ'], DATA_DICT['INDEL_FREQ']] - mpl.bar([colorInd-1],[SNP_FREQ],1.,color=myCol) - mpl.bar([colorInd-1],[1.-SNP_FREQ],1.,color=myCol,bottom=[SNP_FREQ],hatch='/') -mpl.axis([-1,N_FILES+1,0,1.2]) + my_col = get_color(color_ind, N_FILES) + color_ind += 1 + DATA_DICT = pickle.load(open(fn, "rb"), encoding='utf-8') + [AVG_MUT_RATE, SNP_FREQ, INDEL_FREQ] = [DATA_DICT['AVG_MUT_RATE'], DATA_DICT['SNP_FREQ'], DATA_DICT['INDEL_FREQ']] + mpl.bar([color_ind - 1], [SNP_FREQ], 1., color=my_col) + mpl.bar([color_ind - 1], [1. 
- SNP_FREQ], 1., color=my_col, bottom=[SNP_FREQ], hatch='/') +mpl.axis([-1, N_FILES + 1, 0, 1.2]) mpl.grid() -mpl.xticks([],[]) -mpl.yticks([0,.2,.4,.6,.8,1.],[0,0.2,0.4,0.6,0.8,1.0]) +mpl.xticks([], []) +mpl.yticks([0, .2, .4, .6, .8, 1.], [0, 0.2, 0.4, 0.6, 0.8, 1.0]) mpl.ylabel('Frequency') mpl.title('SNP freq [ ] & indel freq [//]') -mpl.subplot(2,1,2) -colorInd = 0 -legText = LAB +mpl.subplot(2, 1, 2) +color_ind = 0 +leg_text = LAB for fn in INP: - myCol = getColor(colorInd,N_FILES) - colorInd += 1 - DATA_DICT = pickle.load( open( fn, "rb" ) ) - [AVG_MUT_RATE, SNP_FREQ, INDEL_FREQ] = [DATA_DICT['AVG_MUT_RATE'], DATA_DICT['SNP_FREQ'], DATA_DICT['INDEL_FREQ']] - x = sorted(INDEL_FREQ.keys()) - y = [INDEL_FREQ[n] for n in x] - mpl.plot(x,y,color=myCol) - #legText.append(fn) + my_col = get_color(color_ind, N_FILES) + color_ind += 1 + DATA_DICT = pickle.load(open(fn, "rb")) + [AVG_MUT_RATE, SNP_FREQ, INDEL_FREQ] = [DATA_DICT['AVG_MUT_RATE'], DATA_DICT['SNP_FREQ'], DATA_DICT['INDEL_FREQ']] + x = sorted(INDEL_FREQ.keys()) + y = [INDEL_FREQ[n] for n in x] + mpl.plot(x, y, color=my_col) +# leg_text.append(fn) mpl.grid() mpl.xlabel('Indel size (bp)', fontweight='bold') mpl.ylabel('Frequency') mpl.title('Indel frequency by size (- deletion, + insertion)') -mpl.legend(legText) -#mpl.show() -mpl.savefig(OUP+'_plot1_mutRates.pdf') +mpl.legend(leg_text) +# mpl.show() +mpl.savefig(OUP + '_plot1_mutRates.pdf') ################################################# # # TRINUC PRIOR PROB # ################################################# -mpl.figure(1,figsize=(14,6)) -colorInd = 0 -legText = LAB +mpl.figure(1, figsize=(14, 6)) +color_ind = 0 +leg_text = LAB for fn in INP: - myCol = getColor(colorInd,N_FILES) - colorInd += 1 - DATA_DICT = pickle.load( open( fn, "rb" ) ) - TRINUC_MUT_PROB = DATA_DICT['TRINUC_MUT_PROB'] - - x = range(colorInd-1,len(TRINUC_MUT_PROB)*N_FILES,N_FILES) - xt = sorted(TRINUC_MUT_PROB.keys()) - y = [TRINUC_MUT_PROB[k] for k in xt] - markerline, stemlines, baseline = mpl.stem(x,y,'-.') - mpl.setp(markerline, 'markerfacecolor', myCol) - mpl.setp(markerline, 'markeredgecolor', myCol) - mpl.setp(baseline, 'color', myCol, 'linewidth', 0) - mpl.setp(stemlines, 'color', myCol, 'linewidth', 3) - if colorInd == 1: - mpl.xticks(x,xt,rotation=90) - #legText.append(fn) + my_col = get_color(color_ind, N_FILES) + color_ind += 1 + DATA_DICT = pickle.load(open(fn, "rb")) + TRINUC_MUT_PROB = DATA_DICT['TRINUC_MUT_PROB'] + + x = range(color_ind - 1, len(TRINUC_MUT_PROB) * N_FILES, N_FILES) + xt = sorted(TRINUC_MUT_PROB.keys()) + y = [TRINUC_MUT_PROB[k] for k in xt] + markerline, stemlines, baseline = mpl.stem(x, y, '-.') + mpl.setp(markerline, 'markerfacecolor', my_col) + mpl.setp(markerline, 'markeredgecolor', my_col) + mpl.setp(baseline, 'color', my_col, 'linewidth', 0) + mpl.setp(stemlines, 'color', my_col, 'linewidth', 3) + if color_ind == 1: + mpl.xticks(x, xt, rotation=90) +# leg_text.append(fn) mpl.grid() mpl.ylabel('p(trinucleotide mutates)') -mpl.legend(legText) -#mpl.show() -mpl.savefig(OUP+'_plot2_trinucPriors.pdf') +mpl.legend(leg_text) +# mpl.show() +mpl.savefig(OUP + '_plot2_trinucPriors.pdf') ################################################# # # TRINUC TRANS PROB # ################################################# -plotNum = 3 +plot_num = 3 for fn in INP: - fig = mpl.figure(plotNum,figsize=(12,10)) - DATA_DICT = pickle.load( open( fn, "rb" ) ) - TRINUC_TRANS_PROBS = DATA_DICT['TRINUC_TRANS_PROBS'] - - xt2 = [m[3] for m in sorted([(n[0],n[2],n[1],n) for n in xt])] - reverse_dict = 
{xt2[i]:i for i in xrange(len(xt2))} - Z = np.zeros((64,64)) - L = [['' for n in xrange(64)] for m in xrange(64)] - for k in TRINUC_TRANS_PROBS: - i = reverse_dict[k[0]] - j = reverse_dict[k[1]] - Z[i][j] = TRINUC_TRANS_PROBS[k] - - HARDCODED_LABEL = ['A_A','A_C','A_G','A_T', - 'C_A','C_C','C_G','C_T', - 'G_A','G_C','G_G','G_T', - 'T_A','T_C','T_G','T_T'] - - for pi in xrange(16): - mpl.subplot(4,4,pi+1) - Z2 = Z[pi*4:(pi+1)*4,pi*4:(pi+1)*4] - X, Y = np.meshgrid( range(0,len(Z2[0])+1), range(0,len(Z2)+1) ) - im = mpl.pcolormesh(X,Y,Z2[::-1,:],vmin=0.0,vmax=0.5) - mpl.axis([0,4,0,4]) - mpl.xticks([0.5,1.5,2.5,3.5],['A','C','G','T']) - mpl.yticks([0.5,1.5,2.5,3.5],['T','G','C','A']) - mpl.text(1.6, 1.8, HARDCODED_LABEL[pi], color='white') - - # colorbar haxx - fig.subplots_adjust(right=0.8) - cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7]) - cb = fig.colorbar(im,cax=cbar_ax) - cb.set_label(r"p(X$Y_1$Z->X$Y_2$Z | X_Z mutates)") - - #mpl.tight_layout() - #mpl.figtext(0.24,0.94,'Trinucleotide Mutation Frequency',size=20) - #mpl.show() - mpl.savefig(OUP+'_plot'+str(plotNum)+'_trinucTrans.pdf') - plotNum += 1 + fig = mpl.figure(plot_num, figsize=(12, 10)) + DATA_DICT = pickle.load(open(fn, "rb")) + TRINUC_TRANS_PROBS = DATA_DICT['TRINUC_TRANS_PROBS'] + + xt2 = [m[3] for m in sorted([(n[0], n[2], n[1], n) for n in xt])] + reverse_dict = {xt2[i]: i for i in range(len(xt2))} + Z = np.zeros((64, 64)) + L = [['' for n in range(64)] for m in range(64)] + for k in TRINUC_TRANS_PROBS: + i = reverse_dict[k[0]] + j = reverse_dict[k[1]] + Z[i][j] = TRINUC_TRANS_PROBS[k] + + HARDCODED_LABEL = ['A_A', 'A_C', 'A_G', 'A_T', + 'C_A', 'C_C', 'C_G', 'C_T', + 'G_A', 'G_C', 'G_G', 'G_T', + 'T_A', 'T_C', 'T_G', 'T_T'] + + for pi in range(16): + mpl.subplot(4, 4, pi + 1) + Z2 = Z[pi * 4:(pi + 1) * 4, pi * 4:(pi + 1) * 4] + X, Y = np.meshgrid(range(0, len(Z2[0]) + 1), range(0, len(Z2) + 1)) + im = mpl.pcolormesh(X, Y, Z2[::-1, :], vmin=0.0, vmax=0.5) + mpl.axis([0, 4, 0, 4]) + mpl.xticks([0.5, 1.5, 2.5, 3.5], ['A', 'C', 'G', 'T']) + mpl.yticks([0.5, 1.5, 2.5, 3.5], ['T', 'G', 'C', 'A']) + mpl.text(1.6, 1.8, HARDCODED_LABEL[pi], color='white') + + # colorbar haxx + fig.subplots_adjust(right=0.8) + cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7]) + cb = fig.colorbar(im, cax=cbar_ax) + cb.set_label(r"p(X$Y_1$Z->X$Y_2$Z | X_Z mutates)") + + # mpl.tight_layout() + # mpl.figtext(0.24,0.94,'Trinucleotide Mutation Frequency',size=20) + # mpl.show() + mpl.savefig(OUP + '_plot' + str(plot_num) + '_trinucTrans.pdf') + plot_num += 1 ################################################# # @@ -214,65 +223,66 @@ def getBedOverlap(track,ind_s,ind_e): # ################################################# track_byFile_byChr = [{} for n in INP] -bp_total_byFile = [0 for n in INP] -colorInd = 0 +bp_total_byFile = [0 for n in INP] +color_ind = 0 for fn in INP: - DATA_DICT = pickle.load( open( fn, "rb" ) ) - HIGH_MUT_REGIONS = DATA_DICT['HIGH_MUT_REGIONS'] - for region in HIGH_MUT_REGIONS: - if region[0] not in track_byFile_byChr[colorInd]: - track_byFile_byChr[colorInd][region[0]] = [] - track_byFile_byChr[colorInd][region[0]].extend([region[1],region[2]]) - bp_total_byFile[colorInd] += region[2]-region[1]+1 - colorInd += 1 + DATA_DICT = pickle.load(open(fn, "rb")) + HIGH_MUT_REGIONS = DATA_DICT['HIGH_MUT_REGIONS'] + for region in HIGH_MUT_REGIONS: + if region[0] not in track_byFile_byChr[color_ind]: + track_byFile_byChr[color_ind][region[0]] = [] + track_byFile_byChr[color_ind][region[0]].extend([region[1], region[2]]) + 
bp_total_byFile[color_ind] += region[2] - region[1] + 1 + color_ind += 1 bp_overlap_count = [[0 for m in INP] for n in INP] -for i in xrange(N_FILES): - bp_overlap_count[i][i] = bp_total_byFile[i] - for j in xrange(i+1,N_FILES): - for k in track_byFile_byChr[i].keys(): - if k in track_byFile_byChr[j]: - for ii in xrange(len(track_byFile_byChr[i][k][::2])): - bp_overlap_count[i][j] += getBedOverlap(track_byFile_byChr[j][k],track_byFile_byChr[i][k][ii*2],track_byFile_byChr[i][k][ii*2+1]) - -print '' -print 'HIGH_MUT_REGION OVERLAP BETWEEN '+str(N_FILES)+' MODELS...' -for i in xrange(N_FILES): - for j in xrange(i,N_FILES): - nDissimilar = (bp_overlap_count[i][i]-bp_overlap_count[i][j]) + (bp_overlap_count[j][j]-bp_overlap_count[i][j]) - if bp_overlap_count[i][j] == 0: - percentageV = 0.0 - else: - percentageV = bp_overlap_count[i][j]/float(bp_overlap_count[i][j]+nDissimilar) - print 'overlap['+str(i)+','+str(j)+'] = '+str(bp_overlap_count[i][j])+' bp ({0:.3f}%)'.format(percentageV*100.) -print '' +for i in range(N_FILES): + bp_overlap_count[i][i] = bp_total_byFile[i] + for j in range(i + 1, N_FILES): + for k in track_byFile_byChr[i].keys(): + if k in track_byFile_byChr[j]: + for ii in range(len(track_byFile_byChr[i][k][::2])): + bp_overlap_count[i][j] += get_bed_overlap(track_byFile_byChr[j][k], track_byFile_byChr[i][k][ii * 2], + track_byFile_byChr[i][k][ii * 2 + 1]) + +print('') +print('HIGH_MUT_REGION OVERLAP BETWEEN ' + str(N_FILES) + ' MODELS...') +for i in range(N_FILES): + for j in range(i, N_FILES): + n_dissimilar = (bp_overlap_count[i][i] - bp_overlap_count[i][j]) + ( + bp_overlap_count[j][j] - bp_overlap_count[i][j]) + if bp_overlap_count[i][j] == 0: + percentage_v = 0.0 + else: + percentage_v = bp_overlap_count[i][j] / float(bp_overlap_count[i][j] + n_dissimilar) + print('overlap[' + str(i) + ',' + str(j) + '] = ' + str(bp_overlap_count[i][j]) + ' bp ({0:.3f}%)'.format( + percentage_v * 100.)) +print('') ################################################# # # COMMON VARIANTS # ################################################# -setofVars = [set([]) for n in INP] -colorInd = 0 +set_of_vars = [set([]) for n in INP] +color_ind = 0 for fn in INP: - DATA_DICT = pickle.load( open( fn, "rb" ) ) - COMMON_VARIANTS = DATA_DICT['COMMON_VARIANTS'] - for n in COMMON_VARIANTS: - setofVars[colorInd].add(n) - colorInd += 1 - -print '' -print 'COMMON_VARIANTS OVERLAP BETWEEN '+str(N_FILES)+' MODELS...' -for i in xrange(N_FILES): - for j in xrange(i,N_FILES): - overlapCount = len(setofVars[i].intersection(setofVars[j])) - nDissimilar = (len(setofVars[i])-overlapCount) + (len(setofVars[j])-overlapCount) - if overlapCount == 0: - percentageV = 0.0 - else: - percentageV = overlapCount/float(overlapCount+nDissimilar) - print 'overlap['+str(i)+','+str(j)+'] = '+str(overlapCount)+' variants ({0:.3f}%)'.format(percentageV*100.) 
-print '' - - - + DATA_DICT = pickle.load(open(fn, "rb")) + COMMON_VARIANTS = DATA_DICT['COMMON_VARIANTS'] + for n in COMMON_VARIANTS: + set_of_vars[color_ind].add(n) + color_ind += 1 + +print('') +print('COMMON_VARIANTS OVERLAP BETWEEN ' + str(N_FILES) + ' MODELS...') +for i in range(N_FILES): + for j in range(i, N_FILES): + overlap_count = len(set_of_vars[i].intersection(set_of_vars[j])) + n_dissimilar = (len(set_of_vars[i]) - overlap_count) + (len(set_of_vars[j]) - overlap_count) + if overlap_count == 0: + percentage_v = 0.0 + else: + percentage_v = overlap_count / float(overlap_count + n_dissimilar) + print('overlap[' + str(i) + ',' + str(j) + '] = ' + str(overlap_count) + ' variants ({0:.3f}%)'.format( + percentage_v * 100.)) +print('') diff --git a/utilities/repickle.py b/utilities/repickle.py new file mode 100755 index 0000000..461eb06 --- /dev/null +++ b/utilities/repickle.py @@ -0,0 +1,22 @@ +import sys +import os +import pathlib +import glob +import pickle + + +def main(): + dir_to_repickle = pathlib.Path(sys.argv[1]) + + if not dir_to_repickle.is_dir(): + print("Input is not a directory.") + sys.exit(1) + + os.chdir(dir_to_repickle) + for file in glob.glob("*.p"): + data = pickle.load(open(file, 'rb'), encoding="bytes") + pickle.dump(data, open(file, "wb")) + + +if __name__ == "__main__": + main() diff --git a/utilities/validateBam.py b/utilities/validateBam.py old mode 100644 new mode 100755 index 47da72e..aff1ada --- a/utilities/validateBam.py +++ b/utilities/validateBam.py @@ -1,84 +1,88 @@ -#!/usr/bin/env python +#!/usr/bin/env source + +# Python 3 ready import sys import os import gzip from struct import unpack -BAM_EOF = ['1f', '8b', '08', '04', '00', '00', '00', '00', '00', 'ff', '06', '00', '42', '43', '02', '00', '1b', '00', '03', '00', '00', '00', '00', '00', '00', '00', '00', '00'] +BAM_EOF = ['1f', '8b', '08', '04', '00', '00', '00', '00', '00', 'ff', '06', '00', '42', '43', '02', '00', '1b', '00', + '03', '00', '00', '00', '00', '00', '00', '00', '00', '00'] + + +def get_bytes(fmt, amt): + if fmt == '>16)&65535 - mapq = (bmqnl>>8)&255 - lrn = bmqnl&255 - print '-- bmqnl:', bmqnl, '(bin='+str(binv)+', mapq='+str(mapq)+', l_readname+1='+str(lrn)+')' - flgnc = getBytes('>16)&65535 - ncig = flgnc&65535 - print '-- flgnc:', flgnc, '(flag='+str(flag)+', ncig='+str(ncig)+')' - print '-- l_seq:', getBytes('> 16) & 65535 + mapq = (bmqnl >> 8) & 255 + lrn = bmqnl & 255 + print('-- bmqnl:', bmqnl, '(bin=' + str(binv) + ', mapq=' + str(mapq) + ', l_readname+1=' + str(lrn) + ')') + flgnc = get_bytes('> 16) & 65535 + ncig = flgnc & 65535 + print('-- flgnc:', flgnc, '(flag=' + str(flag) + ', ncig=' + str(ncig) + ')') + print('-- l_seq:', get_bytes('?@ABCDEFGHIJ' ALLOWED_NUCL = 'ACGTN' -def validate4lines(l1,l2,l3,l4): - failed = 0 - # make sure lines contain correct delimiters - if l1[0] != '@' or l1[-2] != '/' or l3[0] != '+': - failed = 1 - # make sure seq len == qual length - if len(l2) != len(l4): - failed = 2 - # make sure seq string contains only valid characters - for n in l2: - if n not in ALLOWED_NUCL: - failed = 3 - # make sure qual string contains only valid characters - for n in l4: - if n not in ALLOWED_QUAL: - failed = 4 - if failed: - print '\nError: malformed lines:' - if failed == 1: print ' ---- invalid delimiters\n' - elif failed == 2: print ' ---- seq len != qual len\n' - elif failed == 3: print ' ---- seq contains invalid characters\n' - elif failed == 4: print ' ---- qual contains invalid characters\n' - print l1+'\n'+l2+'\n'+l3+'\n'+l4+'\n' - exit(1) -f1 = 
open(sys.argv[1],'r') -(l1_r1, l2_r1, l3_r1, l4_r1) = get4lines(f1) +def validate_4_lines(l1, l2, l3, l4): + failed = 0 + # make sure lines contain correct delimiters + if l1[0] != '@' or l1[-2] != '/' or l3[0] != '+': + failed = 1 + # make sure seq len == qual length + if len(l2) != len(l4): + failed = 2 + # make sure seq string contains only valid characters + for n in l2: + if n not in ALLOWED_NUCL: + failed = 3 + # make sure qual string contains only valid characters + for n in l4: + if n not in ALLOWED_QUAL: + failed = 4 + if failed: + print('\nError: malformed lines:') + if failed == 1: + print(' ---- invalid delimiters\n') + elif failed == 2: + print(' ---- seq len != qual len\n') + elif failed == 3: + print(' ---- seq contains invalid characters\n') + elif failed == 4: + print(' ---- qual contains invalid characters\n') + print(l1 + '\n' + l2 + '\n' + l3 + '\n' + l4 + '\n') + exit(1) + + +f1 = open(sys.argv[1], 'r') +(l1_r1, l2_r1, l3_r1, l4_r1) = get_4_lines(f1) f2 = None if len(sys.argv) == 3: - f2 = open(sys.argv[2],'r') - (l1_r2, l2_r2, l3_r2, l4_r2) = get4lines(f2) + f2 = open(sys.argv[2], 'r') + (l1_r2, l2_r2, l3_r2, l4_r2) = get_4_lines(f2) while l1_r1: - # check line syntax - validate4lines(l1_r1,l2_r1,l3_r1,l4_r1) - if f2 != None: - validate4lines(l1_r2,l2_r2,l3_r2,l4_r2) - # make sure seq id is same for r1/r2 - if l1_r1[:-1] != l1_r2[:-1]: - print '\nError: mismatched r1/r2 name:\n' - print l1_r1+'\n'+l1_r2+'\n' - exit(1) + # check line syntax + validate_4_lines(l1_r1, l2_r1, l3_r1, l4_r1) + if f2 != None: + validate_4_lines(l1_r2, l2_r2, l3_r2, l4_r2) + # make sure seq id is same for r1/r2 + if l1_r1[:-1] != l1_r2[:-1]: + print('\nError: mismatched r1/r2 name:\n') + print(l1_r1 + '\n' + l1_r2 + '\n') + exit(1) - # grab next 4 lines... - (l1_r1, l2_r1, l3_r1, l4_r1) = get4lines(f1) - if f2 != None: - (l1_r2, l2_r2, l3_r2, l4_r2) = get4lines(f2) + # grab next 4 lines... + (l1_r1, l2_r1, l3_r1, l4_r1) = get_4_lines(f1) + if f2 != None: + (l1_r2, l2_r2, l3_r2, l4_r2) = get_4_lines(f2) if f2 != None: - f2.close() + f2.close() f1.close() -print '\nPASSED WITH FLYING COLORS. GOOD DAY.\n' - +print('\nPASSED WITH FLYING COLORS. GOOD DAY.\n') diff --git a/utilities/vcf_compare_OLD.py b/utilities/vcf_compare_OLD.py old mode 100644 new mode 100755 index d451ee6..910c5ce --- a/utilities/vcf_compare_OLD.py +++ b/utilities/vcf_compare_OLD.py @@ -1,9 +1,11 @@ -#!/usr/bin/env python +#!/usr/bin/env source # encoding: utf-8 +# Python 3 ready + """ ************************************************** -vcf_compare.py +vcf_compare.source - compare vcf file produced by workflow to golden vcf produced by simulator @@ -14,708 +16,737 @@ ************************************************** """ import sys -import os import copy import time import bisect import re import numpy as np -import optparse - - -EV_BPRANGE = 50 # how far to either side of a particular variant location do we want to check for equivalents? 
- -DEFAULT_QUAL = -666 # if we can't find a qual score, use this instead so we know it's missing - -MAX_VAL = 9999999999999 # an unreasonably large value that no reference fasta could concievably be longer than - -DESC = """%prog: vcf comparison script.""" -VERS = 0.1 - -PARSER = optparse.OptionParser('python %prog [options] -r -g -w ',description=DESC,version="%prog v"+str(VERS)) - -PARSER.add_option('-r', help='* Reference Fasta', dest='REFF', action='store', metavar='') -PARSER.add_option('-g', help='* Golden VCF', dest='GVCF', action='store', metavar='') -PARSER.add_option('-w', help='* Workflow VCF', dest='WVCF', action='store', metavar='') -PARSER.add_option('-o', help='* Output Prefix', dest='OUTF', action='store', metavar='') -PARSER.add_option('-m', help='Mappability Track', dest='MTRK', action='store', metavar='') -PARSER.add_option('-M', help='Maptrack Min Len', dest='MTMM', action='store', metavar='') -PARSER.add_option('-t', help='Targetted Regions', dest='TREG', action='store', metavar='') -PARSER.add_option('-T', help='Min Region Len', dest='MTRL', action='store', metavar='') -PARSER.add_option('-c', help='Coverage Filter Threshold [%default]', dest='DP_THRESH', default=15, action='store', metavar='') -PARSER.add_option('-a', help='Allele Freq Filter Threshold [%default]', dest='AF_THRESH', default=0.3, action='store', metavar='') - -PARSER.add_option('--vcf-out', help="Output Match/FN/FP variants [%default]", dest='VCF_OUT', default=False, action='store_true') -PARSER.add_option('--no-plot', help="No plotting [%default]", dest='NO_PLOT', default=False, action='store_true') -PARSER.add_option('--incl-homs', help="Include homozygous ref calls [%default]", dest='INCL_H', default=False, action='store_true') -PARSER.add_option('--incl-fail', help="Include calls that failed filters [%default]", dest='INCL_F', default=False, action='store_true') -PARSER.add_option('--fast', help="No equivalent variant detection [%default]", dest='FAST', default=False, action='store_true') - -(OPTS,ARGS) = PARSER.parse_args() - -REFERENCE = OPTS.REFF -GOLDEN_VCF = OPTS.GVCF -WORKFLOW_VCF = OPTS.WVCF -OUT_PREFIX = OPTS.OUTF -MAPTRACK = OPTS.MTRK -MIN_READLEN = OPTS.MTMM -BEDFILE = OPTS.TREG -DP_THRESH = int(OPTS.DP_THRESH) -AF_THRESH = float(OPTS.AF_THRESH) - -VCF_OUT = OPTS.VCF_OUT -NO_PLOT = OPTS.NO_PLOT -INCLUDE_HOMS = OPTS.INCL_H -INCLUDE_FAIL = OPTS.INCL_F -FAST = OPTS.FAST +import argparse + +from Bio.Seq import Seq + +EV_BPRANGE = 50 # how far to either side of a particular variant location do we want to check for equivalents? 
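+# EV_BPRANGE is the window (in reference bp) that the equivalence check in main() rebuilds
+# around each false-positive call to test whether it is the same haplotype as a missed
+# golden variant written with a different representation.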
+
+DEFAULT_QUAL = -666  # if we can't find a qual score, use this instead so we know it's missing
+
+MAX_VAL = 9999999999999  # an unreasonably large value that no reference fasta could conceivably be longer than
+
+DESC = """%prog: vcf comparison script."""
+VERS = 0.1
+
+parser = argparse.ArgumentParser(usage='python %(prog)s [options] -r -g -w ',
+                                 description=DESC,
+                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter, )
+parser.add_argument('--version', action='version', version='%(prog)s v' + str(VERS))
+
+parser.add_argument('-r', help='* Reference Fasta', dest='reff', action='store', metavar='')
+parser.add_argument('-g', help='* Golden VCF', dest='golden_vcf', action='store', metavar='')
+parser.add_argument('-w', help='* Workflow VCF', dest='workflow_vcf', action='store', metavar='')
+parser.add_argument('-o', help='* Output Prefix', dest='outfile', action='store', metavar='')
+parser.add_argument('-m', help='Mappability Track', dest='map_track', action='store', metavar='')
+parser.add_argument('-M', help='Maptrack Min Len', dest='map_track_min_len', action='store', metavar='')
+parser.add_argument('-t', help='Targeted Regions', dest='target_reg', action='store', metavar='')
+parser.add_argument('-T', help='Min Region Len', dest='min_reg_len', action='store', metavar='')
+parser.add_argument('-c', help='Coverage Filter Threshold [%default]', dest='dp_thresh', default=15, action='store',
+                    metavar='')
+parser.add_argument('-a', help='Allele Freq Filter Threshold [%default]', dest='af_thresh', default=0.3, action='store',
+                    metavar='')
+
+parser.add_argument('--vcf-out', help="Output Match/FN/FP variants [%default]", dest='vcf_out', default=False,
+                    action='store_true')
+parser.add_argument('--no-plot', help="No plotting [%default]", dest='no_plot', default=False, action='store_true')
+parser.add_argument('--incl-homs', help="Include homozygous ref calls [%default]", dest='include_homs', default=False,
+                    action='store_true')
+parser.add_argument('--incl-fail', help="Include calls that failed filters [%default]", dest='include_fail',
+                    default=False,
+                    action='store_true')
+parser.add_argument('--fast', help="No equivalent variant detection [%default]", dest='fast', default=False,
+                    action='store_true')
+
+opts = parser.parse_args()
+
+reference = opts.reff
+golden_vcf = opts.golden_vcf
+workflow_vcf = opts.workflow_vcf
+out_prefix = opts.outfile
+maptrack = opts.map_track
+min_read_len = opts.map_track_min_len
+bedfile = opts.target_reg
+dp_thresh = int(opts.dp_thresh)
+af_thresh = float(opts.af_thresh)
+
+vcf_out = opts.vcf_out
+no_plot = opts.no_plot
+include_homs = opts.include_homs
+include_fail = opts.include_fail
+fast = opts.fast
 
 if len(sys.argv[1:]) == 0:
-    PARSER.print_help()
-    exit(1)
+    parser.print_help()
+    exit(1)
 
-if OPTS.MTRL != None:
-    MINREGIONLEN = int(OPTS.MTRL)
+if opts.min_reg_len is not None:
+    min_region_len = int(opts.min_reg_len)
 else:
-    MINREGIONLEN = None
+    min_region_len = None
 
-if MIN_READLEN == None:
-    MIN_READLEN = 0
+if min_read_len is None:
+    min_read_len = 0
 else:
-    MIN_READLEN = int(MIN_READLEN)
-
-if REFERENCE == None:
-    print 'Error: No reference provided.'
-    exit(1)
-if GOLDEN_VCF == None:
-    print 'Error: No golden VCF provided.'
-    exit(1)
-if WORKFLOW_VCF == None:
-    print 'Error: No workflow VCF provided.'
-    exit(1)
-if OUT_PREFIX == None:
-    print 'Error: No output prefix provided.'
- exit(1) -if (BEDFILE != None and MINREGIONLEN == None) or (BEDFILE == None and MINREGIONLEN != None): - print 'Error: Both -t and -T must be specified' - exit(1) - -if NO_PLOT == False: - import matplotlib - matplotlib.use('Agg') - import matplotlib.pyplot as mpl - from matplotlib_venn import venn2, venn3 - import warnings - warnings.filterwarnings("ignore", category=UserWarning, module='matplotlib_venn') + min_read_len = int(min_read_len) + +if reference is None: + print('Error: No reference provided.') + sys.exit(1) +if golden_vcf is None: + print('Error: No golden VCF provided.') + sys.exit(1) +if workflow_vcf is None: + print('Error: No workflow VCF provided.') + sys.exit(1) +if out_prefix is None: + print('Error: No output prefix provided.') + sys.exit(1) +if (bedfile is not None and min_region_len is None) or (bedfile is None and min_region_len is not None): + print('Error: Both -t and -T must be specified') + sys.exit(1) + +if no_plot is False: + import matplotlib + + matplotlib.use('Agg') + import matplotlib.pyplot as mpl + from matplotlib_venn import venn2, venn3 + import warnings + + warnings.filterwarnings("ignore", category=UserWarning, module='matplotlib_venn') AF_STEPS = 20 -AF_KEYS = np.linspace(0.0,1.0,AF_STEPS+1) +AF_KEYS = np.linspace(0.0, 1.0, AF_STEPS + 1) + def quantize_AF(af): - if af >= 1.0: - return AF_STEPS - elif af <= 0.0: - return 0 - else: - return int(af*AF_STEPS) - -VCF_HEADER = '##fileformat=VCFv4.1\n##reference='+REFERENCE+'##INFO=\n##INFO=\n' - -DP_TOKENS = ['DP','DPU','DPI'] # in the order that we'll look for them - -def parseLine(splt,colDict,colSamp): - - # check if we want to proceed.. - ra = splt[colDict['REF']] - aa = splt[colDict['ALT']] - if not(INCLUDE_HOMS) and (aa == '.' or aa == '' or aa == ra): - return None - if not(INCLUDE_FAIL) and (splt[colDict['FILTER']] != 'PASS' and splt[colDict['FILTER']] != '.'): - return None - - # default vals - cov = None - qual = DEFAULT_QUAL - alt_alleles = [] - alt_freqs = [None] - - # any alt alleles? - alt_split = aa.split(',') - if len(alt_split) > 1: - alt_alleles = alt_split - - # cov - for dp_tok in DP_TOKENS: - # check INFO for DP first - if 'INFO' in colDict and dp_tok+'=' in splt[colDict['INFO']]: - cov = int(re.findall(re.escape(dp_tok)+r"=[0-9]+",splt[colDict['INFO']])[0][3:]) - # check FORMAT/SAMPLE for DP second: - elif 'FORMAT' in colDict and len(colSamp): - format = splt[colDict['FORMAT']]+':' - if ':'+dp_tok+':' in format: - dpInd = format.split(':').index(dp_tok) - cov = int(splt[colSamp[0]].split(':')[dpInd]) - if cov != None: - break - - # check INFO for AF first - af = None - if 'INFO' in colDict and ';AF=' in ';'+splt[colDict['INFO']]: - info = splt[colDict['INFO']]+';' - af = re.findall(r"AF=.*?(?=;)",info)[0][3:] - # check FORMAT/SAMPLE for AF second: - elif 'FORMAT' in colDict and len(colSamp): - format = splt[colDict['FORMAT']]+':' - if ':AF:' in format: - afInd = splt[colDict['FORMAT']].split(':').index('AF') - af = splt[colSamp[0]].split(':')[afInd] - - if af != None: - af_splt = af.split(',') - while(len(af_splt) < len(alt_alleles)): # are we lacking enough AF values for some reason? - af_splt.append(af_splt[-1]) # phone it in. - if len(af_splt) != 0 and af_splt[0] != '.' 
and af_splt[0] != '': # missing data, yay - alt_freqs = [float(n) for n in af_splt] - else: - alt_freqs = [None]*max([len(alt_alleles),1]) - - # get QUAL if it's interesting - if 'QUAL' in colDict and splt[colDict['QUAL']] != '.': - qual = float(splt[colDict['QUAL']]) - - return (cov, qual, alt_alleles, alt_freqs) - - -def parseVCF(VCF_FILENAME,refName,targRegionsFl,outFile,outBool): - v_Hashed = {} - v_posHash = {} - v_Alts = {} - v_Cov = {} - v_AF = {} - v_Qual = {} - v_TargLen = {} - nBelowMinRLen = 0 - line_unique = 0 # number of lines in vcf file containing unique variant - hash_coll = 0 # number of times we saw a hash collision ("per line" so non-unique alt alleles don't get counted multiple times) - var_filtered = 0 # number of variants excluded due to filters (e.g. hom-refs, qual) - var_merged = 0 # number of variants we merged into another due to having the same position specified - colDict = {} - colSamp = [] - for line in open(VCF_FILENAME,'r'): - if line[0] != '#': - if len(colDict) == 0: - print '\n\nError: VCF has no header?\n'+VCF_FILENAME+'\n\n' - exit(1) - splt = line[:-1].split('\t') - if splt[0] == refName: - - var = (int(splt[1]),splt[3],splt[4]) - targInd = bisect.bisect(targRegionsFl,var[0]) - - if targInd%2 == 1: - targLen = targRegionsFl[targInd]-targRegionsFl[targInd-1] - if (BEDFILE != None and targLen >= MINREGIONLEN) or BEDFILE == None: - - pl_out = parseLine(splt,colDict,colSamp) - if pl_out == None: - var_filtered += 1 - continue - (cov, qual, aa, af) = pl_out - - if var not in v_Hashed: - - vpos = var[0] - if vpos in v_posHash: - if len(aa) == 0: - aa = [var[2]] - aa.extend([n[2] for n in v_Hashed.keys() if n[0] == vpos]) - var_merged += 1 - v_posHash[vpos] = 1 - - if len(aa): - allVars = [(var[0],var[1],n) for n in aa] - for i in xrange(len(allVars)): - v_Hashed[allVars[i]] = 1 - #if allVars[i] not in v_Alts: - # v_Alts[allVars[i]] = [] - #v_Alts[allVars[i]].extend(allVars) - v_Alts[allVars[i]] = allVars - else: - v_Hashed[var] = 1 - - if cov != None: - v_Cov[var] = cov - v_AF[var] = af[0] # only use first AF, even if multiple. fix this later? 
- v_Qual[var] = qual - v_TargLen[var] = targLen - line_unique += 1 - - else: - hash_coll += 1 - - else: - nBelowMinRLen += 1 - else: - if line[1] != '#': - cols = line[1:-1].split('\t') - for i in xrange(len(cols)): - if 'FORMAT' in colDict: - colSamp.append(i) - colDict[cols[i]] = i - if VCF_OUT and outBool: - outBool = False - outFile.write(line) - - return (v_Hashed, v_Alts, v_Cov, v_AF, v_Qual, v_TargLen, nBelowMinRLen, line_unique, var_filtered, var_merged, hash_coll) - - -def condenseByPos(listIn): - varListOfInterest = [n for n in listIn] - indCount = {} - for n in varListOfInterest: - c = n[0] - if c not in indCount: - indCount[c] = 0 - indCount[c] += 1 - #nonUniqueDict = {n:[] for n in sorted(indCount.keys()) if indCount[n] > 1} # the python 2.7 way - nonUniqueDict = {} - for n in sorted(indCount.keys()): - if indCount[n] > 1: - nonUniqueDict[n] = [] - delList = [] - for i in xrange(len(varListOfInterest)): - if varListOfInterest[i][0] in nonUniqueDict: - nonUniqueDict[varListOfInterest[i][0]].append(varListOfInterest[i]) - delList.append(i) - delList = sorted(delList,reverse=True) - for di in delList: - del varListOfInterest[di] - for v in nonUniqueDict.values(): - var = (v[0][0],v[0][1],','.join([n[2] for n in v[::-1]])) - varListOfInterest.append(var) - return varListOfInterest + if af >= 1.0: + return AF_STEPS + elif af <= 0.0: + return 0 + else: + return int(af * AF_STEPS) + + +VCF_HEADER = '##fileformat=VCFv4.1\n##reference=' + reference + '##INFO=\n##INFO=\n' + +DP_TOKENS = ['DP', 'DPU', 'DPI'] # in the order that we'll look for them + + +def parse_line(splt, col_dict, col_samp): + # check if we want to proceed.. + ra = splt[col_dict['REF']] + aa = splt[col_dict['ALT']] + if not (include_homs) and (aa == '.' or aa == '' or aa == ra): + return None + if not (include_fail) and (splt[col_dict['FILTER']] != 'PASS' and splt[col_dict['FILTER']] != '.'): + return None + + # default vals + cov = None + qual = DEFAULT_QUAL + alt_alleles = [] + alt_freqs = [None] + + # any alt alleles? + alt_split = aa.split(',') + if len(alt_split) > 1: + alt_alleles = alt_split + + # cov + for dp_tok in DP_TOKENS: + # check INFO for DP first + if 'INFO' in col_dict and dp_tok + '=' in splt[col_dict['INFO']]: + cov = int(re.findall(re.escape(dp_tok) + r"=[0-9]+", splt[col_dict['INFO']])[0][3:]) + # check FORMAT/SAMPLE for DP second: + elif 'FORMAT' in col_dict and len(col_samp): + format = splt[col_dict['FORMAT']] + ':' + if ':' + dp_tok + ':' in format: + dp_ind = format.split(':').index(dp_tok) + cov = int(splt[col_samp[0]].split(':')[dp_ind]) + if cov is not None: + break + + # check INFO for AF first + af = None + if 'INFO' in col_dict and ';AF=' in ';' + splt[col_dict['INFO']]: + info = splt[col_dict['INFO']] + ';' + af = re.findall(r"AF=.*?(?=;)", info)[0][3:] + # check FORMAT/SAMPLE for AF second: + elif 'FORMAT' in col_dict and len(col_samp): + format = splt[col_dict['FORMAT']] + ':' + if ':AF:' in format: + af_ind = splt[col_dict['FORMAT']].split(':').index('AF') + af = splt[col_samp[0]].split(':')[af_ind] + + if af is not None: + af_splt = af.split(',') + while (len(af_splt) < len(alt_alleles)): # are we lacking enough AF values for some reason? + af_splt.append(af_splt[-1]) # phone it in. + if len(af_splt) != 0 and af_splt[0] != '.' 
and af_splt[0] != '': # missing data, yay + alt_freqs = [float(n) for n in af_splt] + else: + alt_freqs = [None] * max([len(alt_alleles), 1]) + + # get QUAL if it's interesting + if 'QUAL' in col_dict and splt[col_dict['QUAL']] != '.': + qual = float(splt[col_dict['QUAL']]) + + return (cov, qual, alt_alleles, alt_freqs) + + +def parse_vcf(vcf_filename, ref_name, targ_regions_FL, out_file, out_bool): + v_hashed = {} + v_pos_hash = {} + v_alts = {} + v_cov = {} + v_af = {} + v_qual = {} + v_targ_len = {} + n_below_min_r_len = 0 + line_unique = 0 # number of lines in vcf file containing unique variant + hash_coll = 0 # number of times we saw a hash collision ("per line" so non-unique alt alleles don't get counted multiple times) + var_filtered = 0 # number of variants excluded due to filters (e.g. hom-refs, qual) + var_merged = 0 # number of variants we merged into another due to having the same position specified + col_dict = {} + col_samp = [] + for line in open(vcf_filename, 'r'): + if line[0] != '#': + if len(col_dict) == 0: + print('\n\nError: VCF has no header?\n' + vcf_filename + '\n\n') + exit(1) + splt = line[:-1].split('\t') + if splt[0] == ref_name: + + var = (int(splt[1]), splt[3], splt[4]) + targ_ind = bisect.bisect(targ_regions_FL, var[0]) + + if targ_ind % 2 == 1: + targ_Len = targ_regions_FL[targ_ind] - targ_regions_FL[targ_ind - 1] + if (bedfile is not None and targ_Len >= min_region_len) or bedfile is None: + + pl_out = parse_line(splt, col_dict, col_samp) + if pl_out is None: + var_filtered += 1 + continue + (cov, qual, aa, af) = pl_out + + if var not in v_hashed: + + v_pos = var[0] + if v_pos in v_pos_hash: + if len(aa) == 0: + aa = [var[2]] + aa.extend([n[2] for n in v_hashed.keys() if n[0] == v_pos]) + var_merged += 1 + v_pos_hash[v_pos] = 1 + + if len(aa): + all_vars = [(var[0], var[1], n) for n in aa] + for i in range(len(all_vars)): + v_hashed[all_vars[i]] = 1 + # if all_vars[i] not in v_alts: + # v_alts[all_vars[i]] = [] + # v_alts[all_vars[i]].extend(all_vars) + v_alts[all_vars[i]] = all_vars + else: + v_hashed[var] = 1 + + if cov is not None: + v_cov[var] = cov + v_af[var] = af[0] # only use first AF, even if multiple. fix this later? 
+ v_qual[var] = qual + v_targ_len[var] = targ_Len + line_unique += 1 + + else: + hash_coll += 1 + + else: + n_below_min_r_len += 1 + else: + if line[1] != '#': + cols = line[1:-1].split('\t') + for i in range(len(cols)): + if 'FORMAT' in col_dict: + col_samp.append(i) + col_dict[cols[i]] = i + if vcf_out and out_bool: + out_bool = False + out_file.write(line) + + return ( + v_hashed, v_alts, v_cov, v_af, v_qual, v_targ_len, n_below_min_r_len, line_unique, var_filtered, var_merged, + hash_coll) + + +def condense_by_pos(list_in): + var_list_of_interest = [n for n in list_in] + ind_count = {} + for n in var_list_of_interest: + c = n[0] + if c not in ind_count: + ind_count[c] = 0 + ind_count[c] += 1 + # non_unique_dict = {n:[] for n in sorted(ind_count.keys()) if ind_count[n] > 1} # the source 2.7 way + non_unique_dict = {} + for n in sorted(ind_count.keys()): + if ind_count[n] > 1: + non_unique_dict[n] = [] + del_list = [] + for i in range(len(var_list_of_interest)): + if var_list_of_interest[i][0] in non_unique_dict: + non_unique_dict[var_list_of_interest[i][0]].append(var_list_of_interest[i]) + del_list.append(i) + del_list = sorted(del_list, reverse=True) + for di in del_list: + del var_list_of_interest[di] + for v in non_unique_dict.values(): + var = (v[0][0], v[0][1], ','.join([n[2] for n in v[::-1]])) + var_list_of_interest.append(var) + return var_list_of_interest def main(): - - ref = [] - f = open(REFERENCE,'r') - nLines = 0 - prevR = None - prevP = None - ref_inds = [] - sys.stdout.write('\nindexing reference fasta... ') - sys.stdout.flush() - tt = time.time() - while 1: - nLines += 1 - data = f.readline() - if not data: - ref_inds.append( (prevR, prevP, f.tell()-len(data)) ) - break - if data[0] == '>': - if prevP != None: - ref_inds.append( (prevR, prevP, f.tell()-len(data)) ) - prevP = f.tell() - prevR = data[1:-1] - print '{0:.3f} (sec)'.format(time.time()-tt) - #ref_inds = [('chrM', 6, 16909), ('chr1', 16915, 254252549), ('chr2', 254252555, 502315916), ('chr3', 502315922, 704298801), ('chr4', 704298807, 899276169), ('chr5', 899276175, 1083809741), ('chr6', 1083809747, 1258347116), ('chr7', 1258347122, 1420668559), ('chr8', 1420668565, 1569959868), ('chr9', 1569959874, 1713997574), ('chr10', 1713997581, 1852243023), ('chr11', 1852243030, 1989949677), ('chr12', 1989949684, 2126478617), ('chr13', 2126478624, 2243951900), ('chr14', 2243951907, 2353448438), ('chr15', 2353448445, 2458030465), ('chr16', 2458030472, 2550192321), ('chr17', 2550192328, 2633011443), ('chr18', 2633011450, 2712650243), ('chr19', 2712650250, 2772961813), ('chr20', 2772961820, 2837247851), ('chr21', 2837247858, 2886340351), ('chr22', 2886340358, 2938671016), ('chrX', 2938671022, 3097046994), ('chrY', 3097047000, 3157608038)] - - ztV = 0 # total golden variants - ztW = 0 # total workflow variants - znP = 0 # total perfect matches - zfP = 0 # total false positives - znF = 0 # total false negatives - znE = 0 # total equivalent variants detected - zgF = 0 # total golden variants that were filtered and excluded - zgR = 0 # total golden variants that were excluded for being redundant - zgM = 0 # total golden variants that were merged into a single position - zwF = 0 # total workflow variants that were filtered and excluded - zwR = 0 # total workflow variants that were excluded for being redundant - zwM = 0 # total workflow variants that were merged into a single position - if BEDFILE != None: - zbM = 0 - - mappability_vs_FN = {0:0, 1:0} # [0] = # of FNs that were in mappable regions, [1] = # of FNs that were in 
unmappable regions - coverage_vs_FN = {} # [C] = # of FNs that were covered by C reads - alleleBal_vs_FN = {} # [AF] = # of FNs that were heterozygous genotypes with allele freq AF (quantized to multiples of 1/AF_STEPS) - for n in AF_KEYS: - alleleBal_vs_FN[n] = 0 - - # - # read in mappability track - # - mappability_tracks = {} # indexed by chr string (e.g. 'chr1'), has boolean array - prevRef = '' - relevantRegions = [] - if MAPTRACK != None: - mtf = open(MAPTRACK,'r') - for line in mtf: - splt = line.strip().split('\t') - if prevRef != '' and splt[0] != prevRef: - # do stuff - if len(relevantRegions): - myTrack = [0]*(relevantRegions[-1][1]+100) - for r in relevantRegions: - for ri in xrange(r[0],r[1]): - myTrack[ri] = 1 - mappability_tracks[prevRef] = [n for n in myTrack] - # - relevantRegions = [] - if int(splt[3]) >= MIN_READLEN: - relevantRegions.append((int(splt[1]),int(splt[2]))) - prevRef = splt[0] - mtf.close() - # do stuff - if len(relevantRegions): - myTrack = [0]*(relevantRegions[-1][1]+100) - for r in relevantRegions: - for ri in xrange(r[0],r[1]): - myTrack[ri] = 1 - mappability_tracks[prevRef] = [n for n in myTrack] - # - - # - # init vcf output, if desired - # - vcfo2 = None - vcfo3 = None - global vcfo2_firstTime - global vcfo3_firstTime - vcfo2_firstTime = False - vcfo3_firstTime = False - if VCF_OUT: - vcfo2 = open(OUT_PREFIX+'_FN.vcf','w') - vcfo3 = open(OUT_PREFIX+'_FP.vcf','w') - vcfo2_firstTime = True - vcfo3_firstTime = True - - # - # data for plotting FN analysis - # - set1 = [] - set2 = [] - set3 = [] - varAdj = 0 - - # - # - # For each sequence in reference fasta... - # - # - for n_RI in ref_inds: - - refName = n_RI[0] - if FAST == False: - f.seek(n_RI[1]) - print 'reading '+refName+'...', - myDat = f.read(n_RI[2]-n_RI[1]).split('\n') - myLen = sum([len(m) for m in myDat]) - if sys.version_info >= (2,7): - print '{:,} bp'.format(myLen) - else: - print '{0:} bp'.format(myLen) - inWidth = len(myDat[0]) - if len(myDat[-1]) == 0: # if last line is empty, remove it. - del myDat[-1] - if inWidth*(len(myDat)-1)+len(myDat[-1]) != myLen: - print 'fasta column-width not consistent.' - print myLen, inWidth*(len(myDat)-1)+len(myDat[-1]) - for i in xrange(len(myDat)): - if len(myDat[i]) != inWidth: - print i, len(myDat[i]), inWidth - exit(1) - - myDat = bytearray(''.join(myDat)).upper() - myLen = len(myDat) - - # - # Parse relevant targeted regions - # - targRegionsFl = [] - if BEDFILE != None: - bedfile = open(BEDFILE,'r') - for line in bedfile: - splt = line.split('\t') - if splt[0] == refName: - targRegionsFl.extend((int(splt[1]),int(splt[2]))) - bedfile.close() - else: - targRegionsFl = [-1,MAX_VAL+1] - - # - # Parse vcf files - # - sys.stdout.write('comparing variation in '+refName+'... 
') - sys.stdout.flush() - tt = time.time() - - (correctHashed, correctAlts, correctCov, correctAF, correctQual, correctTargLen, correctBelowMinRLen, correctUnique, gFiltered, gMerged, gRedundant) = parseVCF(GOLDEN_VCF, refName, targRegionsFl, vcfo2, vcfo2_firstTime) - (workflowHashed, workflowAlts, workflowCov, workflowAF, workflowQual, workflowTarLen, workflowBelowMinRLen, workflowUnique, wFiltered, wMerged, wRedundant) = parseVCF(WORKFLOW_VCF, refName, targRegionsFl, vcfo3, vcfo3_firstTime) - zgF += gFiltered - zgR += gRedundant - zgM += gMerged - zwF += wFiltered - zwR += wRedundant - zwM += wMerged - - # - # Deduce which variants are FP / FN - # - solvedInds = {} - for var in correctHashed.keys(): - if var in workflowHashed or var[0] in solvedInds: - correctHashed[var] = 2 - workflowHashed[var] = 2 - solvedInds[var[0]] = True - for var in correctHashed.keys()+workflowHashed.keys(): - if var[0] in solvedInds: - correctHashed[var] = 2 - workflowHashed[var] = 2 - nPerfect = len(solvedInds) - - # correctHashed[var] = 1: were not found - # = 2: should be discluded because we were found - # = 3: should be discluded because an alt was found - notFound = [n for n in sorted(correctHashed.keys()) if correctHashed[n] == 1] - FPvariants = [n for n in sorted(workflowHashed.keys()) if workflowHashed[n] == 1] - - # - # condense all variants who have alternate alleles and were *not* found to have perfect matches - # into a single variant again. These will not be included in the candidates for equivalency checking. Sorry! - # - notFound = condenseByPos(notFound) - FPvariants = condenseByPos(FPvariants) - - # - # tally up some values, if there are no golden variants lets save some CPU cycles and move to the next ref - # - totalGoldenVariants = nPerfect + len(notFound) - totalWorkflowVariants = nPerfect + len(FPvariants) - if totalGoldenVariants == 0: - zfP += len(FPvariants) - ztW += totalWorkflowVariants - print '{0:.3f} (sec)'.format(time.time()-tt) - continue - - # - # let's check for equivalent variants - # - if FAST == False: - delList_i = [] - delList_j = [] - regionsToCheck = [] - for i in xrange(len(FPvariants)): - pos = FPvariants[i][0] - regionsToCheck.append((max([pos-EV_BPRANGE-1,0]),min([pos+EV_BPRANGE,len(myDat)-1]))) - - for n in regionsToCheck: - refSection = myDat[n[0]:n[1]] - - fpWithin = [] - for i in xrange(len(FPvariants)): - m = FPvariants[i] - if (m[0] > n[0] and m[0] < n[1]): - fpWithin.append((m,i)) - fpWithin = sorted(fpWithin) - adj = 0 - altSection = copy.deepcopy(refSection) - for (m,i) in fpWithin: - lr = len(m[1]) - la = len(m[2]) - dpos = m[0]-n[0]+adj - altSection = altSection[:dpos-1] + m[2] + altSection[dpos-1+lr:] - adj += la-lr - - nfWithin = [] - for j in xrange(len(notFound)): - m = notFound[j] - if (m[0] > n[0] and m[0] < n[1]): - nfWithin.append((m,j)) - nfWithin = sorted(nfWithin) - adj = 0 - altSection2 = copy.deepcopy(refSection) - for (m,j) in nfWithin: - lr = len(m[1]) - la = len(m[2]) - dpos = m[0]-n[0]+adj - altSection2 = altSection2[:dpos-1] + m[2] + altSection2[dpos-1+lr:] - adj += la-lr - - if altSection == altSection2: - for (m,i) in fpWithin: - if i not in delList_i: - delList_i.append(i) - for (m,j) in nfWithin: - if j not in delList_j: - delList_j.append(j) - - nEquiv = 0 - for i in sorted(list(set(delList_i)),reverse=True): - del FPvariants[i] - for j in sorted(list(set(delList_j)),reverse=True): - del notFound[j] - nEquiv += 1 - nPerfect += nEquiv - - # - # Tally up errors and whatnot - # - ztV += totalGoldenVariants - ztW += totalWorkflowVariants 
- znP += nPerfect - zfP += len(FPvariants) - znF += len(notFound) - if FAST == False: - znE += nEquiv - if BEDFILE != None: - zbM += correctBelowMinRLen - - # - # try to identify a reason for FN variants: - # - - venn_data = [[0,0,0] for n in notFound] # [i] = (unmappable, low cov, low het) - for i in xrange(len(notFound)): - var = notFound[i] - - noReason = True - - # mappability? - if MAPTRACK != None: - if refName in mappability_tracks and var[0] < len(mappability_tracks[refName]): - if mappability_tracks[refName][var[0]]: - mappability_vs_FN[1] += 1 - venn_data[i][0] = 1 - noReason = False - else: - mappability_vs_FN[0] += 1 - - # coverage? - if var in correctCov: - c = correctCov[var] - if c != None: - if c not in coverage_vs_FN: - coverage_vs_FN[c] = 0 - coverage_vs_FN[c] += 1 - if c < DP_THRESH: - venn_data[i][1] = 1 - noReason = False - - # heterozygous genotype messing things up? - #if var in correctAF: - # a = correctAF[var] - # if a != None: - # a = AF_KEYS[quantize_AF(a)] - # if a not in alleleBal_vs_FN: - # alleleBal_vs_FN[a] = 0 - # alleleBal_vs_FN[a] += 1 - # if a < AF_THRESH: - # venn_data[i][2] = 1 - - # no reason? - if noReason: - venn_data[i][2] += 1 - - for i in xrange(len(notFound)): - if venn_data[i][0]: set1.append(i+varAdj) - if venn_data[i][1]: set2.append(i+varAdj) - if venn_data[i][2]: set3.append(i+varAdj) - varAdj += len(notFound) - - # - # if desired, write out vcf files. - # - notFound = sorted(notFound) - FPvariants = sorted(FPvariants) - if VCF_OUT: - for line in open(GOLDEN_VCF,'r'): - if line[0] != '#': - splt = line.split('\t') - if splt[0] == refName: - var = (int(splt[1]),splt[3],splt[4]) - if var in notFound: - vcfo2.write(line) - for line in open(WORKFLOW_VCF,'r'): - if line[0] != '#': - splt = line.split('\t') - if splt[0] == refName: - var = (int(splt[1]),splt[3],splt[4]) - if var in FPvariants: - vcfo3.write(line) - - print '{0:.3f} (sec)'.format(time.time()-tt) - - # - # close vcf output - # - print '' - if VCF_OUT: - print OUT_PREFIX+'_FN.vcf' - print OUT_PREFIX+'_FP.vcf' - vcfo2.close() - vcfo3.close() - - # - # plot some FN stuff - # - if NO_PLOT == False: - nDetected = len(set(set1+set2+set3)) - set1 = set(set1) - set2 = set(set2) - set3 = set(set3) - - if len(set1): s1 = 'Unmappable' - else: s1 = '' - if len(set2): s2 = 'DP < '+str(DP_THRESH) - else: s2 = '' - #if len(set3): s3 = 'AF < '+str(AF_THRESH) - if len(set3): s3 = 'Unknown' - else: s3 = '' - - mpl.figure(0) - tstr1 = 'False Negative Variants (Missed Detections)' - #tstr2 = str(nDetected)+' / '+str(znF)+' FN variants categorized' - tstr2 = '' - if MAPTRACK != None: - v = venn3([set1, set2, set3], (s1, s2, s3)) - else: - v = venn2([set2, set3], (s2, s3)) - mpl.figtext(0.5,0.95,tstr1,fontdict={'size':14,'weight':'bold'},horizontalalignment='center') - mpl.figtext(0.5,0.03,tstr2,fontdict={'size':14,'weight':'bold'},horizontalalignment='center') - - ouf = OUT_PREFIX+'_FNvenn.pdf' - print ouf - mpl.savefig(ouf) - - # - # spit out results to console - # - print '\n**********************************\n' - if BEDFILE != None: - print 'ONLY CONSIDERING VARIANTS FOUND WITHIN TARGETED REGIONS\n\n' - print 'Total Golden Variants: ',ztV,'\t[',zgF,'filtered,',zgM,'merged,',zgR,'redundant ]' - print 'Total Workflow Variants:',ztW,'\t[',zwF,'filtered,',zwM,'merged,',zwR,'redundant ]' - print '' - if ztV > 0 and ztW > 0: - print 'Perfect Matches:',znP,'({0:.2f}%)'.format(100.*float(znP)/ztV) - print 'FN variants: ',znF,'({0:.2f}%)'.format(100.*float(znF)/ztV) - print 'FP variants: 
',zfP#,'({0:.2f}%)'.format(100.*float(zfP)/ztW) - if FAST == False: - print '\nNumber of equivalent variants denoted differently between the two vcfs:',znE - if BEDFILE != None: - print '\nNumber of golden variants located in targeted regions that were too small to be sampled from:',zbM - if FAST: - print "\nWarning! Running with '--fast' means that identical variants denoted differently between the two vcfs will not be detected! The values above may be lower than the true accuracy." - #if NO_PLOT: - if True: - print '\n#unmappable: ',len(set1) - print '#low_coverage:',len(set2) - print '#unknown: ',len(set3) - print '\n**********************************\n' - - - + global bedfile + ref = [] + f = open(reference, 'r') + n_lines = 0 + prev_r = None + prev_p = None + ref_inds = [] + sys.stdout.write('\nindexing reference fasta... ') + sys.stdout.flush() + tt = time.time() + while 1: + n_lines += 1 + data = f.readline() + if not data: + ref_inds.append((prev_r, prev_p, f.tell() - len(data))) + break + if data[0] == '>': + if prev_p is not None: + ref_inds.append((prev_r, prev_p, f.tell() - len(data))) + prev_p = f.tell() + prev_r = data[1:-1] + print('{0:.3f} (sec)'.format(time.time() - tt)) + # ref_inds = [('chrM', 6, 16909), ('chr1', 16915, 254252549), ('chr2', 254252555, 502315916), ('chr3', 502315922, 704298801), ('chr4', 704298807, 899276169), ('chr5', 899276175, 1083809741), ('chr6', 1083809747, 1258347116), ('chr7', 1258347122, 1420668559), ('chr8', 1420668565, 1569959868), ('chr9', 1569959874, 1713997574), ('chr10', 1713997581, 1852243023), ('chr11', 1852243030, 1989949677), ('chr12', 1989949684, 2126478617), ('chr13', 2126478624, 2243951900), ('chr14', 2243951907, 2353448438), ('chr15', 2353448445, 2458030465), ('chr16', 2458030472, 2550192321), ('chr17', 2550192328, 2633011443), ('chr18', 2633011450, 2712650243), ('chr19', 2712650250, 2772961813), ('chr20', 2772961820, 2837247851), ('chr21', 2837247858, 2886340351), ('chr22', 2886340358, 2938671016), ('chrX', 2938671022, 3097046994), ('chrY', 3097047000, 3157608038)] + + zt_v = 0 # total golden variants + zt_w = 0 # total workflow variants + zn_p = 0 # total perfect matches + zf_p = 0 # total false positives + zn_f = 0 # total false negatives + zn_e = 0 # total equivalent variants detected + zgF = 0 # total golden variants that were filtered and excluded + zgR = 0 # total golden variants that were excluded for being redundant + zgM = 0 # total golden variants that were merged into a single position + zwF = 0 # total workflow variants that were filtered and excluded + zwR = 0 # total workflow variants that were excluded for being redundant + zwM = 0 # total workflow variants that were merged into a single position + if bedfile is not None: + zb_m = 0 + + mappability_vs_FN = {0: 0, + 1: 0} # [0] = # of FNs that were in mappable regions, [1] = # of FNs that were in unmappable regions + coverage_vs_FN = {} # [C] = # of FNs that were covered by C reads + allele_bal_vs_FN = {} # [AF] = # of FNs that were heterozygous genotypes with allele freq AF (quantized to multiples of 1/AF_STEPS) + for n in AF_KEYS: + allele_bal_vs_FN[n] = 0 + + # + # read in mappability track + # + mappability_tracks = {} # indexed by chr string (e.g. 
'chr1'), has boolean array + prev_Ref = '' + relevant_regions = [] + if maptrack is not None: + mtf = open(maptrack, 'r') + for line in mtf: + splt = line.strip().split('\t') + if prev_Ref != '' and splt[0] != prev_Ref: + # do stuff + if len(relevant_regions): + my_track = [0] * (relevant_regions[-1][1] + 100) + for r in relevant_regions: + for ri in range(r[0], r[1]): + my_track[ri] = 1 + mappability_tracks[prev_Ref] = [n for n in my_track] + # + relevant_regions = [] + if int(splt[3]) >= min_read_len: + relevant_regions.append((int(splt[1]), int(splt[2]))) + prev_Ref = splt[0] + mtf.close() + # do stuff + if len(relevant_regions): + my_track = [0] * (relevant_regions[-1][1] + 100) + for r in relevant_regions: + for ri in range(r[0], r[1]): + my_track[ri] = 1 + mappability_tracks[prev_Ref] = [n for n in my_track] + # + + # + # init vcf output, if desired + # + vcfo2 = None + vcfo3 = None + global vcfo2_first_time + global vcfo3_first_time + vcfo2_first_time = False + vcfo3_first_time = False + if vcf_out: + vcfo2 = open(out_prefix + '_FN.vcf', 'w') + vcfo3 = open(out_prefix + '_FP.vcf', 'w') + vcfo2_first_time = True + vcfo3_first_time = True + + # + # data for plotting FN analysis + # + set1 = [] + set2 = [] + set3 = [] + var_adj = 0 + + # + # + # For each sequence in reference fasta... + # + # + for n_RI in ref_inds: + + ref_name = n_RI[0] + if not fast: + f.seek(n_RI[1]) + print('reading ' + ref_name + '...', end=' ') + my_dat = f.read(n_RI[2] - n_RI[1]).split('\n') + my_len = sum([len(m) for m in my_dat]) + if sys.version_info >= (2, 7): + print('{:,} bp'.format(my_len)) + else: + print('{0:} bp'.format(my_len)) + in_width = len(my_dat[0]) + if len(my_dat[-1]) == 0: # if last line is empty, remove it. + del my_dat[-1] + if in_width * (len(my_dat) - 1) + len(my_dat[-1]) != my_len: + print('fasta column-width not consistent.') + print(my_len, in_width * (len(my_dat) - 1) + len(my_dat[-1])) + for i in range(len(my_dat)): + if len(my_dat[i]) != in_width: + print(i, len(my_dat[i]), in_width) + exit(1) + + my_dat = Seq(''.join(my_dat)).upper().tomutable() + my_len = len(my_dat) + + # + # Parse relevant targeted regions + # + targ_regions_fl = [] + if bedfile is not None: + bedfile = open(bedfile, 'r') + for line in bedfile: + splt = line.split('\t') + if splt[0] == ref_name: + targ_regions_fl.extend((int(splt[1]), int(splt[2]))) + bedfile.close() + else: + targ_regions_fl = [-1, MAX_VAL + 1] + + # + # Parse vcf files + # + sys.stdout.write('comparing variation in ' + ref_name + '... 
') + sys.stdout.flush() + tt = time.time() + + (correct_hashed, correct_alts, correct_cov, correct_AF, correct_qual, correct_targ_len, correct_below_min_R_len, + correct_unique, g_filtered, g_merged, g_redundant) = parse_vcf(golden_vcf, ref_name, targ_regions_fl, vcfo2, + vcfo2_first_time) + (workflow_hashed, workflow_alts, workflow_COV, workflow_AF, workflow_qual, workflow_tar_len, + workflow_below_min_R_len, + workflow_unique, w_filtered, w_merged, w_redundant) = parse_vcf(workflow_vcf, ref_name, targ_regions_fl, vcfo3, + vcfo3_first_time) + zgF += g_filtered + zgR += g_redundant + zgM += g_merged + zwF += w_filtered + zwR += w_redundant + zwM += w_merged + + # + # Deduce which variants are FP / FN + # + solved_inds = {} + for var in correct_hashed.keys(): + if var in workflow_hashed or var[0] in solved_inds: + correct_hashed[var] = 2 + workflow_hashed[var] = 20 + solved_inds[var[0]] = True + for var in list(correct_hashed.keys()) + list(workflow_hashed.keys()): + if var[0] in solved_inds: + correct_hashed[var] = 2 + workflow_hashed[var] = 2 + n_perfect = len(solved_inds) + + # correct_hashed[var] = 1: were not found + # = 2: should be discluded because we were found + # = 3: should be discluded because an alt was found + not_found = [n for n in sorted(correct_hashed.keys()) if correct_hashed[n] == 1] + fp_variants = [n for n in sorted(workflow_hashed.keys()) if workflow_hashed[n] == 1] + + # + # condense all variants who have alternate alleles and were *not* found to have perfect matches + # into a single variant again. These will not be included in the candidates for equivalency checking. Sorry! + # + not_found = condense_by_pos(not_found) + fp_variants = condense_by_pos(fp_variants) + + # + # tally up some values, if there are no golden variants lets save some CPU cycles and move to the next ref + # + tot_golden_variants = n_perfect + len(not_found) + total_workflow_variants = n_perfect + len(fp_variants) + if tot_golden_variants == 0: + zf_p += len(fp_variants) + zt_w += total_workflow_variants + print('{0:.3f} (sec)'.format(time.time() - tt)) + continue + + # + # let's check for equivalent variants + # + if fast == False: + del_list_i = [] + del_list_j = [] + regions_to_check = [] + for i in range(len(fp_variants)): + pos = fp_variants[i][0] + regions_to_check.append((max([pos - EV_BPRANGE - 1, 0]), min([pos + EV_BPRANGE, len(my_dat) - 1]))) + + for n in regions_to_check: + ref_section = my_dat[n[0]:n[1]] + + FP_within = [] + for i in range(len(fp_variants)): + m = fp_variants[i] + if n[0] < m[0] < n[1]: + FP_within.append((m, i)) + FP_within = sorted(FP_within) + adj = 0 + alt_section = copy.deepcopy(ref_section) + for (m, i) in FP_within: + lr = len(m[1]) + la = len(m[2]) + d_pos = m[0] - n[0] + adj + alt_section = alt_section[:d_pos - 1] + m[2] + alt_section[d_pos - 1 + lr:] + adj += la - lr + + nf_within = [] + for j in range(len(not_found)): + m = not_found[j] + if n[0] < m[0] < n[1]: + nf_within.append((m, j)) + nf_within = sorted(nf_within) + adj = 0 + alt_section2 = copy.deepcopy(ref_section) + for (m, j) in nf_within: + lr = len(m[1]) + la = len(m[2]) + d_pos = m[0] - n[0] + adj + alt_section2 = alt_section2[:d_pos - 1] + m[2] + alt_section2[d_pos - 1 + lr:] + adj += la - lr + + if alt_section == alt_section2: + for (m, i) in FP_within: + if i not in del_list_i: + del_list_i.append(i) + for (m, j) in nf_within: + if j not in del_list_j: + del_list_j.append(j) + + n_equiv = 0 + for i in sorted(list(set(del_list_i)), reverse=True): + del fp_variants[i] + for j in 
sorted(list(set(del_list_j)), reverse=True): + del not_found[j] + n_equiv += 1 + n_perfect += n_equiv + + # + # Tally up errors and whatnot + # + zt_v += tot_golden_variants + zt_w += total_workflow_variants + zn_p += n_perfect + zf_p += len(fp_variants) + zn_f += len(not_found) + if fast is False: + zn_e += n_equiv + if bedfile is not None: + zb_m += correct_below_min_R_len + + # + # try to identify a reason for FN variants: + # + + venn_data = [[0, 0, 0] for n in not_found] # [i] = (unmappable, low cov, low het) + for i in range(len(not_found)): + var = not_found[i] + + no_reason = True + + # mappability? + if maptrack is not None: + if ref_name in mappability_tracks and var[0] < len(mappability_tracks[ref_name]): + if mappability_tracks[ref_name][var[0]]: + mappability_vs_FN[1] += 1 + venn_data[i][0] = 1 + no_reason = False + else: + mappability_vs_FN[0] += 1 + + # coverage? + if var in correct_cov: + c = correct_cov[var] + if c is not None: + if c not in coverage_vs_FN: + coverage_vs_FN[c] = 0 + coverage_vs_FN[c] += 1 + if c < dp_thresh: + venn_data[i][1] = 1 + no_reason = False + + # heterozygous genotype messing things up? + # if var in correct_AF: + # a = correct_AF[var] + # if a != None: + # a = AF_KEYS[quantize_AF(a)] + # if a not in allele_bal_vs_FN: + # allele_bal_vs_FN[a] = 0 + # allele_bal_vs_FN[a] += 1 + # if a < AF_THRESH: + # venn_data[i][2] = 1 + + # no reason? + if no_reason: + venn_data[i][2] += 1 + + for i in range(len(not_found)): + if venn_data[i][0]: + set1.append(i + var_adj) + if venn_data[i][1]: + set2.append(i + var_adj) + if venn_data[i][2]: + set3.append(i + var_adj) + var_adj += len(not_found) + + # + # if desired, write out vcf files. + # + not_found = sorted(not_found) + fp_variants = sorted(fp_variants) + if vcf_out: + for line in open(golden_vcf, 'r'): + if line[0] != '#': + splt = line.split('\t') + if splt[0] == ref_name: + var = (int(splt[1]), splt[3], splt[4]) + if var in not_found: + vcfo2.write(line) + for line in open(workflow_vcf, 'r'): + if line[0] != '#': + splt = line.split('\t') + if splt[0] == ref_name: + var = (int(splt[1]), splt[3], splt[4]) + if var in fp_variants: + vcfo3.write(line) + + print('{0:.3f} (sec)'.format(time.time() - tt)) + + # + # close vcf output + # + print('') + if vcf_out: + print(out_prefix + '_FN.vcf') + print(out_prefix + '_FP.vcf') + vcfo2.close() + vcfo3.close() + + # + # plot some FN stuff + # + if no_plot == False: + n_detected = len(set(set1 + set2 + set3)) + set1 = set(set1) + set2 = set(set2) + set3 = set(set3) + + if len(set1): + s1 = 'Unmappable' + else: + s1 = '' + if len(set2): + s2 = 'DP < ' + str(dp_thresh) + else: + s2 = '' + # if len(set3): s3 = 'AF < '+str(AF_THRESH) + if len(set3): + s3 = 'Unknown' + else: + s3 = '' + + mpl.figure(0) + tstr1 = 'False Negative Variants (Missed Detections)' + # tstr2 = str(n_detected)+' / '+str(zn_f)+' FN variants categorized' + tstr2 = '' + if maptrack is not None: + v = venn3([set1, set2, set3], (s1, s2, s3)) + else: + v = venn2([set2, set3], (s2, s3)) + mpl.figtext(0.5, 0.95, tstr1, fontdict={'size': 14, 'weight': 'bold'}, horizontalalignment='center') + mpl.figtext(0.5, 0.03, tstr2, fontdict={'size': 14, 'weight': 'bold'}, horizontalalignment='center') + + ouf = out_prefix + '_FNvenn.pdf' + print(ouf) + mpl.savefig(ouf) + + # + # spit out results to console + # + print('\n**********************************\n') + if bedfile is not None: + print('ONLY CONSIDERING VARIANTS FOUND WITHIN TARGETED REGIONS\n\n') + print('Total Golden Variants: ', zt_v, '\t[', zgF, 
'filtered,', zgM, 'merged,', zgR, 'redundant ]') + print('Total Workflow Variants:', zt_w, '\t[', zwF, 'filtered,', zwM, 'merged,', zwR, 'redundant ]') + print('') + if zt_v > 0 and zt_w > 0: + print('Perfect Matches:', zn_p, '({0:.2f}%)'.format(100. * float(zn_p) / zt_v)) + print('FN variants: ', zn_f, '({0:.2f}%)'.format(100. * float(zn_f) / zt_v)) + print('FP variants: ', zf_p) # ,'({0:.2f}%)'.format(100.*float(zf_p)/zt_w) + if not fast: + print('\nNumber of equivalent variants denoted differently between the two vcfs:', zn_e) + if bedfile is not None: + print('\nNumber of golden variants located in targeted regions that were too small to be sampled from:', zb_m) + if fast: + print( + "\nWarning! Running with '--fast' means that identical variants denoted differently between the two vcfs will not be detected! The values above may be lower than the true accuracy.") + # if NO_PLOT: + if True: + print('\n#unmappable: ', len(set1)) + print('#low_coverage:', len(set2)) + print('#unknown: ', len(set3)) + print('\n**********************************\n') if __name__ == '__main__': - main() + main()
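When vcf_compare_OLD.py is run without --fast, the equivalence check above rebuilds an EV_BPRANGE window of the reference around each false-positive call, applies the workflow calls and the missed golden calls to that window separately, and counts the calls as equivalent when the two rebuilt sequences match. A minimal sketch of that idea, assuming 1-based positions and non-overlapping (pos, ref, alt) tuples; apply_variants is an illustrative helper, not the script's own code:

```python
def apply_variants(ref_window, window_start, variants):
    """Apply (pos, ref, alt) tuples to a reference slice; pos is 1-based."""
    seq, shift = ref_window, 0
    for pos, ref, alt in sorted(variants):
        i = pos - window_start + shift
        assert seq[i:i + len(ref)] == ref  # sanity-check the anchor bases
        seq = seq[:i] + alt + seq[i + len(ref):]
        shift += len(alt) - len(ref)
    return seq


ref = 'ACGTACGT'
golden = [(4, 'TA', 'T')]    # deletes the A at position 5, anchored on the preceding T
workflow = [(5, 'AC', 'C')]  # same deletion, anchored on the following C
print(apply_variants(ref, 1, golden) == apply_variants(ref, 1, workflow))  # True
```

Here both call sets delete the same A but anchor it on different reference bases, so a plain position/allele comparison would report one false negative and one false positive for what is really a perfect match.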