- Introduction
- Settings
- Models
- Readers
- Toolkit
- Annotation tools
- Assembly tools
- Tulip
- TRs nomenclature
- Sequence
- Various for files
- Various
- Kmers
- Edit distance
- Repbase
- Trace
- Statistics
- Parsing
- TRF
- TRs datasets
- SRA datasets
- Sequence patterns
- Blast
- ESA
- NCBI genomes
- Networks
- Jellyfish
- Trees
- Classification of TRs into types
- Working with Entrez NCBI databases
The framework consists of the following parts:
- data models
- reading and writing data according to the models
- tools for working with these data
In settings.py:
SETTINGS_FILENAME = "settings.yaml"
NGRAM_LENGTH = 23
NGRAM_N = 100000000
Settings can be read and saved:
from trseeker.settings import load_settings
from trseeker.settings import save_settings
settings_dict = load_settings()
save_settings(settings_dict)
The settings contain the following parameters:
trseeker:
  os: linux64
  root_dir: /root/Dropbox/workspace/trseeker
  work_dir: /home
blast_settings:
  blast_location:
  blast_location_NIX: /home/ncbi-blast-2.2.26+/bin/legacy_blast.pl bl2seq
  blast_location_WIN: c:\Program Files\NCBI\blast-2.2.26+\bin\legacy_blast.pl bl2seq.exe
  jellyfish_location: jellyfish
  repbase_db_folder: /home/rebase_blast_db/repbase
  blast_e: 1e-20
  blast_b: 20000000
  blast_v: 20000000
trf_settings:
  trf_location: /root/trf404.linux64
  trf_match: 2
  trf_mismatch: 5
  trf_indel: 7
  trf_p: 80
  trf_q: 10
  trf_threshold: 50
  trf_length: 2000
  trf_param_postfix: 2.5.7.80.10.50.2000
  trf_masked_file: False
  trf_flanked_data: True
  trf_data_file: True
  trf_nohtml: True
  overlapping_cutoff: 10
  overlapping_cutoff_proc: 30
  overlapping_gc_diff: 0.05
ngrams_settings:
  ngram_length: 23
  ngram_m: 10000000
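Individual parameters can then be read from the nested dictionary returned by load_settings(); a minimal sketch using the keys listed above:

from trseeker.settings import load_settings

settings = load_settings()
trf_location = settings["trf_settings"]["trf_location"]
kmer_length = settings["ngrams_settings"]["ngram_length"]
print(trf_location, kmer_length)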
from trseeker.models.sequence_model import SequenceModel
Attributes:
- seq_gi (int)
- seq_ref
- seq_description
- seq_sequence
- seq_length (int)
- seq_gc (float)
- seq_revcom, reverse complement
- seq_gapped (int)
- seq_chr
- seq_head
- seq_start_position (int)
- seq_end_position (int)
Properties
- length (self.seq_length)
- sequence (self.seq_sequence)
- fasta
- header
- contige_coverage, if a "cov" key is present in the name
print seq_obj.fasta
>>> ">[seq_ref]\n[seq_sequence]\n"
- sa_input
print seq_obj.sa_input
>>> "[seq_sequence]$"
- ncbi_fasta
print seq_obj.ncbi_fasta
>>> ">gi|[seq_gi]|ref|[seq_ref]|[seq_description]\n[seq_sequence]\n"
Methods:
- add_sequence_revcom()
- set_dna_sequence(self, title, sequence, description=None)
self.seq_ref = title
- set_ncbi_sequence(self, head, sequence)
Chromosome name is "?" or set with parse_chromosome_name(head).
(self.seq_gi, self.seq_ref, self.seq_description) = parse_fasta_head(head)
- set_gbff_sequence(self, head, sequence)
Head is a dictionary with gi, ref, description keys.
Chromosome name is "?" or set with parse_chromosome_name(head["description"]).
The sequence is cleaned with the clear_sequence(s) function: it is lowercased and all non-DNA characters are replaced with n. If the sequence contains n, it is considered gapped.
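A minimal usage sketch (the title and sequence below are illustrative):

from trseeker.models.sequence_model import SequenceModel

seq_obj = SequenceModel()
seq_obj.set_dna_sequence("seq1", "ACTGACTGNACTG")
print(seq_obj.length)
print(seq_obj.fasta)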
from trseeker.models.trf_model import TRModel
Attributes:
- project, project name
- id (float)
- trf_id (int)
- trf_type
- trf_family
- trf_family_prob (float)
- trf_l_ind (int)
- trf_r_ind (int)
- trf_period (int)
- trf_n_copy (float)
- trf_pmatch (float)
- trf_pvar (float)
- trf_entropy (float)
- trf_consensus
- trf_array
- trf_array_gc (float)
- trf_consensus_gc (float)
- trf_gi
- trf_head
- trf_param
- trf_array_length (int)
- trf_chr
- trf_joined (int)
- trf_superfamily
- trf_superfamily_self
- trf_superfamily_ref
- trf_family
- trf_subfamily
- trf_subsubfamily
- trf_family_network
- trf_family_self
- trf_family_ref
- trf_hor (int)
- trf_n_chrun (int)
- trf_chr_refgenome
- trf_bands_refgenome
- trf_repbase
- trf_strand
Methods
- set_project_data(project), set self.project to given project
- set_raw_trf(head, body, line), head, body and line from TRF parser
- get_index_repr()
print trf_obj.get_index_repr()
'''
Tab-delimited string ending with a \n symbol:
trf_id
trf_period
trf_array_length
trf_pvar
trf_gi
trf_l_ind
trf_r_ind
trf_chr
'''
- get_numerical_repr()
print trf_obj.get_numerical_repr()
>>> [trf_period]\t[trf_array_length]\t[trf_array_gc]\n
- get_fasta_repr(), where head is trf_obj.trf_id and sequence is trf_obj.trf_array
- or fasta property
- get_monomer_fasta_repr(), where head is trf_obj.trf_id and sequence is trf_obj.trf_consensus
- get_family_repr()
- get_gff3_string(self, chromosome=True, trs_type="complex_tandem_repeat", probability=1000, tool="PySatDNA", prefix=None, properties={"id":"trf_id", "family": "trf_family_self"})
print trf_obj.get_family_repr()
'''
Tab-delimited string ending with a \n symbol:
trf_id
trf_period
trf_array_length
trf_array_gc
trf_pvar
trf_gi
trf_l_ind
trf_r_ind
trf_chr
trf_repbase
trf_superfamily
trf_family
trf_subfamily
'''
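For example, a TRF table can be streamed with sc_iter_tab_file (see the Readers section) and written as GFF3; a hedged sketch, assuming get_gff3_string() returns a complete GFF3 line (file names are illustrative):

from trseeker.seqio.tab_file import sc_iter_tab_file
from trseeker.models.trf_model import TRModel

with open("trs.gff3", "w") as fh:
    for trf_obj in sc_iter_tab_file("all_trs.trf", TRModel):
        # get_gff3_string() is assumed to end with a newline
        fh.write(trf_obj.get_gff3_string())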
For a network slice, one more index is added - gid (group id).
from trseeker.models.trf_model import NetworkSliceModel
slice_obj = NetworkSliceModel()
Model for keeping classification data.
Attributes:
- project
- id (int)
- trf_id (int)
- trf_period (int)
- trf_array_length (int)
- trf_array_gc
- trf_type
- trf_family
- trf_subfamily
- trf_family_prob (float)
- trf_family_kmer
- trf_subfamily_kmer
- trf_family_self
- class_ssr
- class_tssr
- class_sl
- class_good
- class_micro
- class_100bp
- class_perfect
- class_x4
- class_entropy
- class_gc
- trf_consensus
class_obj = TRsClassificationModel()
class_obj.set_with_trs(trf_obj)
print class_obj.network_head()
from trseeker.models.organism_model import OrganismModel
Attributes:
- organism_taxon
- organism_common_name
- organism_acronym
- organism_description
- organism_wgs_projects
- organism_genome_assemblies
from trseeker.models.dataset_model import DatasetModel
Attributes:
- dataset_taxon
- dataset_id
- dataset_sources
- dataset_description
- dataset_gc (float)
- dataset_length (int)
- dataset_trs_n (int)
- dataset_trs_length (int)
- dataset_trs_mean_gc (float)
- dataset_trs_fraq (float)
from trseeker.models.blast_model import BlastResultModel
Attributes:
- query_id (int)
- query_gi (int)
- query_ref
- subject_id
- subject_gi(int)
- subject_ref
- query_start (int)
- query_end (int)
- subject_start (int)
- subject_end (int)
- evalue (float)
- bit_score (float)
- score (int)
- alignment_length (int)
- proc_identity (float)
- identical (int)
- mismatches (int)
- positives (int)
- gap_opens (int)
- gaps (int)
- proc_positives (float)
- frames
- query_frame (int)
- subject_frame (int)
- fraction_of_query (float)
Additional functions:
- read_blast_file(blast_file, length), return subject_ref -> list of matches (BlastResultModel models).
from trseeker.models.blast_model import read_blast_file
ref_to_blast_obj = read_blast_file(file_name)
from trseeker.models.chromosome_model import ChromosomeModel
Attributes:
- chr_genome
- chr_number
- chr_taxon
- chr_prefix
- chr_gpid
- chr_acronym
- chr_contigs
- chr_length
- chr_mean_gc
- chr_trs_all
- chr_trs_3000
- chr_trs_all_proc
- chr_trs_3000_proc
- chr_trs_all_length
- chr_trs_3000_length
- genome_gaps
- chr_sum_gc
from trseeker.models.wgs_model import WGSModel
Attributes:
- wgs_prefix
- wgs_taxon
- wgs_gpid
- wgs_acronym
- wgs_contigs (int)
- wgs_length (int)
- wgs_mean_gc (float)
- wgs_trs_all (int)
- wgs_trs_3000 (int)
- wgs_trs_1000 (int)
- wgs_trs_500 (int)
- wgs_trs_all_proc (float)
- wgs_trs_3000_proc (float)
- wgs_trs_1000_proc (float)
- wgs_trs_500_proc (float)
- wgs_trs_all_length (int)
- wgs_trs_3000_length (int)
- wgs_trs_1000_length (int)
- wgs_trs_500_length (int)
- wgs_sum_gc (float)
Methods:
Clear trf information (set to 0):
wgs_obj.clear_trf()
from trseeker.models.genome_model import GenomeModel
- genome_taxon
- genome_prefix
- genome_gpid
- genome_acronym
- genome_chromosomes
- genome_contigs
- genome_length
- genome_mean_gc
- genome_trs_all
- genome_trs_3000
- genome_trs_all_proc
- genome_trs_3000_proc
- genome_trs_all_length
- genome_trs_3000_length
- genome_gaps
- genome_sum_gc
from trseeker.models.genome_model import RepeatMaskerAnnotation
Container for RepeatMasker track. For example from http://hgdownload.cse.ucsc.edu/goldenPath/felCat5/database/rmsk.txt.gz
Table fields:
`bin` smallint(5) unsigned NOT NULL,
`swScore` int(10) unsigned NOT NULL,
`milliDiv` int(10) unsigned NOT NULL,
`milliDel` int(10) unsigned NOT NULL,
`milliIns` int(10) unsigned NOT NULL,
`genoName` varchar(255) NOT NULL,
`genoStart` int(10) unsigned NOT NULL,
`genoEnd` int(10) unsigned NOT NULL,
`genoLeft` int(11) NOT NULL,
`strand` char(1) NOT NULL,
`repName` varchar(255) NOT NULL,
`repClass` varchar(255) NOT NULL,
`repFamily` varchar(255) NOT NULL,
`repStart` int(11) NOT NULL,
`repEnd` int(11) NOT NULL,
`repLeft` int(11) NOT NULL,
`id` char(1) NOT NULL,
from trseeker.models.ngram_model import NgramModel
ngram_obj = NgramModel(seq_f, seq_r)
ngram_obj.add_tr(trf_obj, tf)
print ngram_obj
>>> '[fseq]\t[rseq]\t[tf]\t[df]\t[len taxons]\t[len fams]\n'
print ngram_obj.get_families()
>>> ???
Attributes
- seq_r
- seq_f
- tf (float)
- df (int)
- taxons (set)
- trs (set)
- families (dict)
from trseeker.models.ngrams_model import KmerIndexModel
Attributes:
- kmer
- rkmer
- tf (int)
- df (int)
- docs (list of int)
- freqs (list of float)
from trseeker.models.ngrams_model import KmerSliceModel
- kmer
- local_tf (int)
- df (int)
- f_trs
- f_wgs
- f_mwgs
- f_trace
- f_sra
- f_ref1
- f_ref2
- f_caroli
from trseeker.models.ngrams_model import SliceTreeModel
- deep (int)
- size (int)
- blast_fams (list of str)
- maxdf (int)
- nmaxdf (int)
- pmaxdf (float)
- gc_var (float)
- units (list of int)
- trs (list of int)
- kmers (list of SliceTreeModel)
Yield KmerIndexModel:
sc_ngram_reader(file_name)
Yield (ngram id, [(seq id, tf), ...]):
sc_ngram_trid_reader(file_name)
Read kmer index data as list:
Parameters:
- index file
- microsatellite kmer dictionary, kmer->index
read_kmer_index(ngram_index_file, micro_kmers, cutoff=1)
from trseeker.seqio.tab_file import TabDelimitedFileIO
reader = TabDelimitedFileIO(skip_first=False, format_func=None, delimiter='\t', skip_startswith='#')
reader.read_from_file(file_name)
All functions from the parent class AbstractFileIO (PyExp package) are available.
Useful functions:
- sc_iter_tab_file(input_file, data_type, remove_starts_with=None)
- sc_iter_simple_tab_file(input_file)
- sc_read_dictionary(dict_file, value_func=None)
- sc_read_simple_tab_file(input_file)
- sc_write_model_to_tab_file(output_file, objs)
from trseeker.seqio.tab_file import sc_iter_tab_file
for wgs_obj in sc_iter_tab_file(file_name, WGSModel):
    print wgs_obj.wgs_taxon
from trseeker.seqio.tab_file import sc_iter_simple_tab_file
for item in sc_iter_simple_tab_file(file_name):
    print item[0]
from trseeker.seqio.tab_file import sc_read_dictionary
for data in sc_read_dictionary(file_name):
    for key, value in data.items():
        print key, value
from trseeker.seqio.tab_file import sc_read_simple_tab_file
for data in sc_read_simple_tab_file(file_name):
    for item in data:
        print "id = ", item[0]
from trseeker.seqio.block_file import AbstractBlockFileIO
reader = AbstractBlockFileIO(token, **args)
for (head, body, start, next) in reader.read_online(file_name):
    print head
All functions from the parent class AbstractFileIO (PyExp package) are available.
from trseeker.seqio.fasta_file import FastaFileIO
reader = FastaFileIO()
- sc_iter_fasta(file_name)
- fasta_reader(file_name), same as sc_iter_fasta
- sc_iter_fasta_simple(file_name)
- save_all_seq_with_exact_substring(fasta_file, substring, output_file)
- sort_fasta_file_by_length(file_name)
from trseeker.seqio.fasta_file import sc_iter_fasta
for seq_obj in sc_iter_fasta(file_name):
    print seq_obj.sequence
from trseeker.seqio.fasta_file import sc_iter_fasta_simple
for (gi, sequence) in sc_iter_fasta_simple(file_name):
    print gi
from trseeker.seqio.fasta_file import save_all_seq_with_exact_substring
save_all_seq_with_exact_substring(fasta_file, substring, output_file)
Sort by length and save fasta file:
sort_fasta_file_by_length(file_name)
from trseeker.seqio.gbff_file import GbffFileIO
reader = GbffFileIO()
- sc_iter_gbff(file_name)
- sc_iter_gbff_simple(file_name)
- sc_parse_gbff_in_folder(gbbf_folder, fasta_folder, fasta_postfix, mask)
from trseeker.seqio.gbff_file import sc_iter_gbff
for seq_obj in sc_iter_gbff(file_name):
    print seq_obj.length
from trseeker.seqio.gbff_file import sc_iter_gbff_simple
for (gi, sequence) in sc_iter_gbff_simple(file_name):
    print gi
from trseeker.seqio.gbff_file import sc_parse_gbff_in_folder
sc_parse_gbff_in_folder("/home/user/", "/home/user/fasta", "fa", "mouse")
from trseeker.seqio.frt_io import AbstractFtpIO
reader = AbstractFtpIO(ftp_address=address)
reader.connect()
reader.cd(['/home', 'user', 'data'])
reader.ls()
>>> ['readme.txt', 'data.fa']
file_size = reader.size(file_name)
reader.get(file, output_file)
reader.unzip(file_name)
Download from NCBI ftp with aspera:
download_with_aspera_from_ncbi(source, destination)
from trseeker.seqio.ncbi_ftp import NCBIFtpIO
reader = NCBIFtpIO()
reader.download_wgs_fasta(wgs_list, file_suffix, output_folder, unzip=False)
reader.download_all_wgs_in_fasta(output_folder, aspera=False, get_size=False, unzip=False)
reader.download_all_wgs_in_gbff(output_folder)
reader.download_chromosomes_in_fasta(ftp_address, name, output_folder)
reader.download_trace_data(taxon, output_folder)
reader.download_sra_from_ddbj(ftp_address, output_folder)
reader.download_with_aspera(local_path, remove_server, remote_path)
Not implemented yet.
from trseeker.seqio.mongodb_reader import MongoDBReader
reader = MongoDBReader()
db = reader.get_trdb_conn()
- sc_parse_raw_trf_folder(trf_raw_folder, output_trf_file, project=None)
from trseeker.seqio.trf_file import TRFFileIO
from trseeker.seqio.trf_file import sc_parse_raw_trf_folder
for trf_obj in TRFFileIO(file_name, filter=True):
    print trf_obj.trf_id
sc_parse_raw_trf_folder(trf_raw_folder, output_trf_file, project="mouse_genome")
While TRF data is read, the records are filtered with the following rules:
- All nested fields of shorter length are removed.
- If fields overlap, nothing is currently done; previously, if
overlap_proc_diff >= settings["trf_settings"]["overlapping_cutoff_proc"]
gc_dif <= settings["trf_settings"]["overlapping_gc_diff"]
the fields were joined into one field, otherwise they were treated as separate fields.
- If fields coincide, the one with the larger trf_pmatch is kept.
TODO: remove overlapping, since it apparently reflects, among other things, assembler errors and will interfere with correct classification. Move this into a separate step.
### TRs file
from trseeker.seqio.tr_file import *
Functions:
- save_trs_dataset(trs_dataset, output_file)
- save_trs_class_dataset(tr_class_data, output_file)
- get_trfid_obj_dict(trf_large_file)
- get_classification_dict(fam_kmer_index_file)
Save trf file as fasta file:
# don't save trf_obj if trf_family is ALPHA
save_trs_as_fasta(trf_file, fasta_file, add_project=False, skip_alpha=False)
from trseeker.seqio.tr_file import get_trfid_obj_dict
trid2trfobj = get_trfid_obj_dict(trf_large_file)
- get_all_trf_objs(trf_large_file)
- get_all_class_objs(trf_class_file)
- get_class_objs_dict(trf_class_file)
from trseeker.seqio.tr_file import get_all_trf_objs
from trseeker.seqio.tr_file import get_all_class_objs
trf_obj_list = get_all_trf_objs(trf_large_file)
class_obj_list = get_all_class_objs(trf_class_file)
- get_trf_objs_dict(trf_large_file)
trf_obj_dics = get_trf_objs_dict(trf_large_file)
- read_trid2meta(file_name)
Load trid to full index dictionary as string:
read_trid2ngrams(annotation_ngram_folder, trf_large_file)
TODO: rewrite this with AbstractReaders
from trseeker.seqio.ngram_file import save_ngram_index
save_ngram_index(ngram_index_file,
                 hash2id,
                 result_tf,
                 result_df,
                 result_rtf,
                 result_rdf,
                 seen_rev,
                 hash2rev,
                 ngram2kmerann)
save_ngram_pos_index(ngram_trids_file, id2trids, id2trid2tf)
save_distance_data(dist_file, distances)
from trseeker.seqio.blast_file import get_blast_result
get_blast_result(blast_file, length, gap_size=1000, min_align=500, min_length=2400, format_function=None, min_score=90)
The following parameters are used for filtering:
- cutoff - TRs array length.
- match and min_match from blast_match_proc settings
- min_align is the minimal length of an alignment (array_length / min_align)
- blast_gap_size is the size of the gap between two matched regions (array_length * min_match)
- min_length is the minimal total length (array_length * match)
Analysis steps:
- A ref_gi to BlastResultModel dictionary is extracted from the blast output files.
- All nested matches are removed.
- Overlapping matches are joined.
- Joined matches shorter than min_align are dropped.
- Matches separated by a gap shorter than gap_size are joined.
- Joined gapped matches shorter than min_align are dropped.
- Results are formatted with the given format_function.
Dataset updating functions:
update_with_repbase_blast_result(trs_dataset, annotation_self_folder, filters)
# inside function
# result is semicolon-delimited
trs_dataset[i].trf_repbase = result
# filters example
filters = {
    "blast_gap_size": 300,
    "min_align": 1000,
    "min_length": 3000,
}
update_with_self_blast_result(trs_dataset, annotation_self_folder, filters_obj)
# inside function
# result comma-delimited
# inside get_filters(array_length) generate parameters by array_length:
# filters = filters_obj.get_filters(trf_obj.trf_array_length)
trs_dataset[i].trf_family_self = result
update_with_ref_blast_result(trs_dataset, annotation_self_folder, filters)
# inside function
# comma-delimited data
trs_dataset[i].trf_family_ref = result
TODO: refactor to a more generic form with setattr(...)
Self and ref annotations are comma-delimited; the Repbase annotation is semicolon-delimited. Available format functions:
- format_function_repbase(blast_dataset)
- format_function_self(blast_dataset) [DEFAULT]
While parsing blast results:
- all inner matches are skipped
- overlapping matches are joined
- all alignments shorter than the min_align parameter are skipped
- gapped matches are joined if the gap is smaller than the gap_size parameter
TODO: move FastqObj to models
from trseeker.seqio.sra_file import FastqObj
fastq_obj = FastqObj(head, seq, strain, qual_str, phred33=False)
print fastq_obj.fastq
print fastq_obj.fastq_with_error(error)
print fastq_obj.sequence
print fastq_obj.fasta
print fastq_obj.trimmed_fastq
print fastq_obj.gc
Properties:
- head
- seq
- qual
- strain
- id
- trimmed_seq
- trimmed_qual
- qv: list of Q - phredX
- adaptor_positions: positions of adapter
- adapter_contamination
- qual_junk
- status
- parts
Methods related to trimming:
- trim(), NotImplemented
- trim_by_quality_cutoff(cutoff=30)
- trim_exact_adaptor(adaptor, verbose=False), NotImplemented
- trim_inexact_adaptor(adaptor, verbose=False, num_errors=2), NotImplemented
Additional functions:
- fastq_reader
for fastq_obj in fastq_reader(fastq_file, phred33=False):
    print fastq_obj.seq
Compute intersection between two datasets. Dataset format: list of tuples (chrName, startPos, endPos, feature_id).
from trseeker.tools.annotation import *
compute_intersection_intervals(data_a, data_b)
Usage example:
data_b = open("cat_good_trf.gff3").readlines()
data_b = [x.strip().split("\t") for x in data_b]
data_b = [(x[0], int(x[3]), int(x[4]), x[-1]) for x in data_b]
for x in compute_intersection_intervals(data_a, data_b):
    print x
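For completeness, data_a can be built the same way from a second GFF3 track (the file name is illustrative):

data_a = open("cat_genes.gff3").readlines()
data_a = [x.strip().split("\t") for x in data_a]
data_a = [(x[0], int(x[3]), int(x[4]), x[-1]) for x in data_a]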
from trseeker.tools.assembly_tools import *
N50_contig_length, N50, shortest_seq, longest_seq = get_N50(lengths)
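A minimal usage sketch with illustrative contig lengths:

from trseeker.tools.assembly_tools import get_N50

lengths = [5000, 3000, 2000, 1000, 500]
N50_contig_length, N50, shortest_seq, longest_seq = get_N50(lengths)
print(N50, shortest_seq, longest_seq)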
Format and write network data to Tulip format from distance_file with the given cutoff (default=90).
- distance_file: (node_a, node_b, distance)
- tulip_file_output: network in tulip format
- cutoff: distance cutoff
from trseeker.tools.tulip_tools import *
format_distances_to_tulip(distance_file, tulip_file_output, cutoff=90)
TODO: rename package
from trseeker.tools.trs_groups import *
Get the next family index. Indexes are limited to Latin alphabet characters; beyond that, X1, X2 and so on are returned:
letter = get_index(i)
Get most frequent element of list:
element = get_popular(s)
Get the unit and letter for a family. The unit is picked by the most common pmatch value. seen_units is a dictionary mapping unit_size to frequency; it is needed for choosing the letter:
unit, letter, seen_units = get_family_name(ids, seen_units)
Join families with common members, sort by family size:
join_families_with_common(families)
from trseeker.tools.sequence_tools import *
Return the reverse complement of a sequence:
revcom = get_revcomp(sequence)
Get shift variants for TR monomer:
sequences = get_revcomp(sequence)
Return the normalized sequence using the following rules (see the sketch after this list):
- if T > A then return reverse complement
- if A == T and G > C then return reverse complement
- else return sequence
sequence = fix_strand(sequence)
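A standalone sketch of this normalization rule (illustrative, not the library implementation):

def fix_strand_sketch(sequence):
    # Normalize the strand according to the rules above:
    # take the reverse complement when T > A, or when A == T and G > C.
    complement = {"a": "t", "t": "a", "g": "c", "c": "g", "n": "n"}
    s = sequence.lower()
    revcom = "".join(complement[ch] for ch in reversed(s))
    a, t, g, c = s.count("a"), s.count("t"), s.count("g"), s.count("c")
    if t > a:
        return revcom
    if a == t and g > c:
        return revcom
    return s

print(fix_strand_sketch("TTTA"))  # -> "taaa" (T > A, so the reverse complement is returned)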
Count GC content:
gc = get_gc(sequence)
Check for N|n in the sequence. Returns "n" (with N) or "w" (whole):
bool_value = check_gapped(sequence)
Return subsequence:
get_subseq(seq, start, end)
Clear sequence (ACTGN alphabet):
- lower case
- remove any gaps
- remove all letters except atgcn
sequence = clear_sequence(sequence)
Return a list of sequences split by the pattern, with end appended. The pattern is a regexp:
restriction(sequence, pattern, end=None)
Return number of fragments after restriction of sequence with given regexp pattern:
restriction_fragments_n(sequence, pattern)
Check whether two sequences of the same length are cyclic (rotated) variants of the same tandem repeat:
check_cyclic_repeats(seq_a, seq_b)
Return sequence with n mutations:
random_mutation(seq, n, alphabet="actgn +")
Return consensus string for given list of strings:
get_consensus(strs)
Take the minimal sequence from the lexicographically sorted rotations of the sequence and its reverse complement. Example for ACT: the rotations of ACT and of its reverse complement are ACT, AGT, CTA, GTA, TAC, TAG, and the minimal one, ACT, is kept.
Find all possible multimers, e.g. replace GTAGTAGTA consensus sequence with ACT.
Return:
- list of sorted TRs
- list of (df, consensus) pairs sorted by array number
remove_consensus_redundancy(trf_objs)
from trseeker.tools.seqfile import *
- save_list(file_name, data)
- save_dict(file_name, dictionary)
- save_sorted_dict(d, file, by_value=True, reverse=True, min_value=None, key_type=None)
- count_lines(file)
- sort_file_by_int_field(file_name, field)
from trseeker.tools.other_tools import *
Sort dictionary by key. Return a list of (k, v) pairs:
sort_dictionary_by_key(d, reverse=False)
Cast list elements to strings:
as_strings(alist)
Convert a dictionary to a list of (key, value) pairs:
dict2list(adict)
Sort dictionary by value. Return a list of (v, k) pairs:
sort_dictionary_by_value(d, reverse=False)
Remove duplicates from list and sort it:
remove_duplicates_and_sort(data)
remove_duplicates(data)
Remove redundant or empty elements from the given list:
remove_redundancy(alist)
Remove nested fragments; data format is [start, end, ...]:
clear_fragments_redundancy(data, extend=False, same_case_func=None)
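A standalone sketch of nested-fragment removal under the [start, end, ...] convention (illustrative, not the library code):

def remove_nested_fragments(fragments):
    # Keep only fragments that are not fully contained in a previously kept one.
    fragments = sorted(fragments, key=lambda f: (f[0], -f[1]))
    result, max_end = [], -1
    for frag in fragments:
        if frag[1] <= max_end:
            continue  # nested inside an already kept fragment
        result.append(frag)
        max_end = frag[1]
    return result

print(remove_nested_fragments([(10, 100), (20, 50), (90, 200)]))
# -> [(10, 100), (90, 200)]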
from trseeker.tools.ngrams_tools import *
Yields all ngrams of length n from given text:
generate_ngrams(text, n=12)
Yields all windows of length n (with the given step) from given text:
generate_window(text, n=12, step=None)
Returns m most frequent (ngram of length n, tf) tuples for given text:
get_ngrams(text, m=None, n=23, k=None, skip_n=False)
Returns m most frequent (ngram of length n, fraction of possible ngrams) tuples for given text:
get_ngrams_freq(text, m=None, n=23, k=None)
Returns a feature set {'ngram':'ngram',...} of m most frequent ngram of length n for given text:
get_ngrams_feature_set(text, m=5, n=12)
Returns a distance between two ngram sets where distance is a sum(min(ngram_a, ngram_b) for each common ngram). Format for ngrams_a and ngrams_b is a dictionary {ngram:n, ...}:
get_ngram_freq_distance(ngrams_a, ngrams_b)
Returns a distance between two ngram sets where distance is a len(). Format for ngrams_a and ngrams_b is a dictionary {ngram:n, ...}:
get_ngram_common_distance(ngrams_a, ngrams_b)
Return repeatness coefficient. From 0 (e.g. polyA) to 1 (unique sequence):
get_repeatness_coefficent(length, k, kmern)
# Inside this function:
# kmern * (N-1) / (N**2)
Return expressiveness coefficient.
get_expessiveness_coefficent(kmern, k)
# Inside this function:
# kmern * 1. / 4^k
Update tf and df data with k-mers from the given sequence (it will be lowercased):
from trseeker.tools.ngrams_tools import count_kmer_tfdf
k = 23
for sequence in sequences:
    tf_dict, df_dict, local_tf, local_df = count_kmer_tfdf(sequence, tf_dict, df_dict, k)
Compute the max df and the number and percent of sequences containing a given ngram. Returns (maxdf, nmaxdf, pmaxdf, ngram_seqs):
(maxdf, nmaxdf, pmaxdf, ngram_seqs) = get_df_stats_for_list(data, k, kmer2df)
Get tf and df frequencies for kmer list:
get_kmer_tf_df_for_data(data, k, docids=False)
Get list of (kmer, revkmer, tf, df, docids) for given data:
process_list_to_kmer_index(data, k, docids=True, cutoff=None)
Get list of (kmer, revkmer, tf, df, docids) for multifasta file:
compute_kmer_index_for_fasta_file(file_name, index_file, k=23)
Get a list of (kmer, revkmer, tf, df, docids) for a TRs file, filtered by sequence complexity as defined by the get_zlib_complexity function:
compute_kmer_index_for_trf_file(file_name, index_file, k=23, max_complexity=None, min_complexity=None)
Compute kmer coverage and set of kmers:
(m, kmers) = get_sequence_kmer_coverage(sequence, kmers, k)
from trseeker.tools.edit_distance import *
Return list of element (d) positions in given list:
get_pos(L, d)
Return edit distance between two given strings. Edit distance dynamic programming implementation. Length of second sequence does not change similarity value:
- monomer mode: double short monomer sequence if len2/len1 < 2
- full_info: boolean, return full information
Return: percent of ED similarity, or if full_info (distance, number of d, positions of d, S matrix)
get_ed_similarity(s1, s2, monomer_mode=False, full_info=False, verbose=False)
Return two sequences edit distance full information.
- s1: sequence of tandem repeat monomer
- s2: sequence of tandem repeat monomer
- verbose: boolean
- monomer mode: double short monomer sequence if len2/len1 < 2.
Return: ("max_sim Nmax_sim %sim pos_list", pos_list, all_result, length seq1, length seq2)
get_edit_distance_info(s1, s2, verbose=False, monomer_mode=False)
Return the ED value for two sequences:
get_edit_distance(s1, s2)
Get last row of ED matrix between two strings:
get_edit_distance_row(s1, s2)
Get Hamming distance: the number of corresponding symbols that differ in the given strings:
hamming_distance(s1, s2)
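A standalone sketch of the Hamming distance for equal-length strings (illustrative):

def hamming_distance_sketch(s1, s2):
    # Count positions where the two strings differ.
    assert len(s1) == len(s2)
    return sum(1 for a, b in zip(s1, s2) if a != b)

print(hamming_distance_sketch("ACTG", "ACGG"))  # -> 1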
TODO: move to Readers
from trseeker.tools.repbase_tools import join_repbase_files
The function joins all Repbase fasta files into one large fasta file and reformats headers for compatibility with NCBI tools:
join_repbase_files(input_folder, output_file)
from trseeker.tools.trace_tools import unclip_trace_file
Unclip. Remove vector flanks from Trace data.
unclip_trace_file(fasta_file, clip_file, uncliped_file)
Read clip dictionary from clip_file_name gz archive:
get_clip_data(clip_file_name)
from trseeker.tools.statistics import *
Calculate sigma; returns sum(abs(xi - mean)):
get_sigma(data)
get_mean(data)
get_standard_deviation(variance)
t_test(sample_mean, dist_mean, variance, N)
get_element_frequences(data)
Return a dictionary containing simple statistics for a given list. Dictionary keys: mean, variance, sigma, sample deviation:
get_simple_statistics(data)
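A minimal usage sketch with illustrative data, assuming the returned dictionary keys match the names listed above:

from trseeker.tools.statistics import get_simple_statistics

data = [1.0, 2.0, 2.0, 3.0, 4.0]
stats = get_simple_statistics(data)
print(stats["mean"], stats["variance"], stats["sigma"])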
from trseeker.tools.parsers import *
parse_fasta_head(fa_head)
parse_chromosome_name(head)
trf_parse_line(line)
trf_parse_param(line)
trf_parse_head(line)
get_wgs_prefix_from_ref(ref)
get_wgs_prefix_from_head(head)
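For example, parse_fasta_head splits an NCBI-style header of the form shown for the ncbi_fasta property above; a hedged sketch (the header is illustrative and the exact return format may differ):

from trseeker.tools.parsers import parse_fasta_head

head = ">gi|123456|ref|NC_000001.1|Mus musculus chromosome 1"
(gi, ref, description) = parse_fasta_head(head)
print(gi, ref, description)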
from trseeker.tools.trf_tools import *
trf_search(file_name)
trf_search_in_dir(folder, verbose=True, file_suffix=".fa", output_folder=None)
Parallel TRF search:
trf_worker(args)
trf_search_in_dir_parallel(folder, verbose=True, file_suffix=".fa", output_folder=None, threads=1)
Create an output TRF file containing tandem repeats with array length greater than the given cutoff. The function returns the number of tandem repeats in the output file:
trf_filter_by_array_length(trf_file, output_file, cutoff)
Create an output TRF file containing tandem repeats with monomer (unit) length greater than the given cutoff. The function returns the number of tandem repeats in the output file:
trf_filter_by_monomer_length(trf_file, output_file, cutoff)
Create an output TRF file containing tandem repeats whose GI is not in the given GI list (GI is the first value in a TRF row; see the TRF and FA specifications):
trf_filter_exclude_by_gi_list(trf_file, output_file, gi_list_to_exclude)
Write a tab-delimited representation of a TRF file. Representation: numerical|index|agc_apm|with_monomer|family:
trf_representation(trf_file, trf_output, representation)
Write statistics data: field, N:
trf_write_field_n_data(trf_file, file_output, field, field_format="%s")
Write statistics data: field_a, field_b:
trf_write_two_field_data(trf_file, file_output, field_a, field_b)
Function prints chr, all trs, 3000 trs, 10000 trs:
count_trs_per_chrs(all_trf_file)
Function prints number of items with given fasta head fragment:
count_trf_subset_by_head(trf_file, head_value)
Several fasta heads are impossible to parse, so it is simpler to fix them post factum:
fix_chr_names(trf_file, temp_file_name=None, case=None)
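A typical filtering chain with the functions above (file names and cutoffs are illustrative):

from trseeker.tools.trf_tools import trf_filter_by_array_length, trf_filter_by_monomer_length

n_long_arrays = trf_filter_by_array_length("all.trf", "arrays_3000.trf", 3000)
n_long_units = trf_filter_by_monomer_length("arrays_3000.trf", "arrays_3000_unit100.trf", 100)
print(n_long_arrays, n_long_units)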
from trseeker.tools.trs_dataset import *
- COLORS dictionary
- COLORS_50 dictionary
COLORS_50 = {
    "denim": ("#1560bd", (21, 96, 189)),
}
# TODO: finish it
- get_colors(family)
- get_colors_rgb(family)
- create_mathematice_dataset_by_family(trs_dataset, path_to_mathematica_folder, min_borders, max_borders)
from trseeker.tools.sra_tools import *
Iterate over fastaq data.
sra_fastaq_reader(file_name)
Iterate over fasta SRA data.
sra_fasta_reader(file_name)
- read_fastaq_freq_to_memory(file_name)
- fastaq_to_fasta(file_name, output_file)
- write_fastaq_repeats(input_file, output_file, min_tf=1000)
- seq_to_bin(seq)
- bin_to_seq(bseq)
Write reduced SRA fasta data without repeated reads:
write_reduced_fasta(input_file, output_reduced)
Write ngrams data from SRA fasta data.
write_ngrams(input_file, output_ngram, NGRAM_N)
from trseeker.tools.sequence_patterns import *
Available patterns:
- MARS1
- MARS2
- CENPB
- PRDB9
Functions:
Translate a DNA15 sequence into a regular expression:
re_translate(sequence)
Return a list of sequences where one letter is changed to N:
re_get_plus_minus(sequence)
re_get_mutations(sequence)
Return a list of patterns, the first not mutated and the second mutated:
get_double_pattern(pattern_static, pattern_dynamic)
- get_mutated_pattern_twice(pattern)
- get_mutated_pattern_trice(pattern)
- pattern_search(name, sequence, pattern_function, pattern_function_params)
from trseeker.tools.blast_tools import *
- blastn(database, query, output, e_value=None)
- create_db(fasta_file, output, verbose=False, title=None)
- alias_tool(dblist, output, title)
- bl2seq(input1, input2, output)
- create_db_for_genome(file_pattern=None, chromosome_list=None, output=None, title=None)
- get_gi_list(gi_score_file, score_limit=90)
- get_all_gi_from_blast(blast_file, mode="gi")
- get_all_blast_obj_from_blast(blast_file, mode="ref")
- blastn_search_for_trs(trf_large_file, db, annotation_self_folder, temp_file, skip_by_family=None, is_huge_alpha=False)
- bl2seq_search_for_trs(trf_large_file, annotation_bl2seq_folder, temp_file)
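A minimal usage sketch for building a database and running blastn with the functions above (file names and the e-value are illustrative):

from trseeker.tools.blast_tools import create_db, blastn

create_db("genome.fa", "genome_db", title="genome")
blastn("genome_db", "trs_consensus.fa", "blast_output.txt", e_value=1e-20)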
from trseeker.tools.sa_tools import *
Fasta to SA input. A text file of sequences delimited by the $ symbol:
fasta_to_sa_input(fasta_file, sa_input_file, index_file_name=None, file_name=None, start_id=None, increment=False)
SA input file to fasta file:
sa_input_to_fasta(input_file, output_file)
The function precomputes and pickles the doc2id and id2doc dictionaries:
pickle_dictionary_for_docid_trid(sa_doc_index, doc_to_trf_file, trf_to_doc_file)
Function filters and writes sa data with given min tf and df:
filter_sa_dataset(sa_file, output_file, min_tf, min_df)
Yields corpus texts. Corpus is $ delimited text file:
iterate_sa_corpus(corpus)
from trseeker.tools.ncbi_genomes import *
Scrape finished WGS genome projects with the given SubGroup name:
load_genome_projects(subgroup)
Scrape finished WGS genome projects excluding the given SubGroup names:
load_genome_projects_exclude(notsubgroups)
Print data for initiating PySatDNA projects:
print_add_project_data(genome_projects, pid_type, pr_type)
from trseeker.tools.network_tools import *
Compute network slices using graph_tool library or networkx.
compute_network(network_file, output_file_pattern, id2nodename)
- create_graph_on_fly(network_file)
- load_graph(ml_file)
- analyse_network_graph_tool(G, output_file_pattern, trf_large_index_file)
- write_classification_graph_tool(output_file, components, id2nodename)
- write_classification_neworkx(output_file, components, trid2meta)
- create_graphml(network_file, ml_file, id2nodename)
- load_networkx(network_file)
- init_graph_networkx(network_data, start=0, precise=1, id2nodename=None)
- analyse_networkx(G, network_data, output_file_pattern, id2nodename)
from trseeker.tools.ncbi_annotation_tools import *
Read an ideogram file and return a dict chr -> list.
get_ideogram_dict(idiogram_file, mode="NCBI")
get_gene_list(file_gene_list, color='#000000', left_padding=30, gene_group_label='C57BL/6J')
# NB: Not implemented yet.
from trseeker.tools.jellyfish_tools import *
# jellyfish settings:
location = settings["blast_settings"]["jellyfish_location"]
Count kmers with Jellyfish
count_kmers(input_fasta, ouput_prefix, k, mintf=None)
merge_kmers(folder, ouput_prefix, ouput_file)
stats_kmers(db_file, stats_file)
histo_kmers(db_file, histo_file)
dump_kmers(db_file, fasta_file, dumpmintf)
query_kmers(db_file, query_hashes, both_strands=True, verbose=True)
get_kmer_db_and_fasta(folder, input_file, kmers_file, k=23, mintf=None)
query_and_write_coverage_histogram(db_file, query_sequence, output_file, k=23)
# Save coverage histogram into output_file for given query_sequence.
Shortcut:
from trseeker.tools.jellyfish_tools import sc_compute_kmer_data
sc_compute_kmer_data(fasta_file, jellyfish_data_folder, jf_db, jf_dat, k, mintf, dumpmintf)
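For example, with illustrative paths and parameters:

from trseeker.tools.jellyfish_tools import sc_compute_kmer_data

sc_compute_kmer_data("reads.fa",      # fasta_file
                     "/tmp/jf",       # jellyfish_data_folder
                     "reads.jf",      # jf_db
                     "reads.dat",     # jf_dat
                     23,              # k
                     2,               # mintf
                     2)               # dumpmintf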
Classification of TRs into types.
from trseeker.tools.trs_types import *
Available types:
trs_types = {
    "all": 0,
    "micro": 0,
    "perfect": 0,
    "too_short": 0,
    "100bp": 0,
    "gc": 0,
    "x4": 0,
    "entropy": 0,
    "ok": 0,
}
Usage:
trs_types, class_objs, trf_objs = get_trs_types(trf_all_file, trf_all_class_file, settings)
Used settings:
settings["other"]["trs_types"]["possible_array_length"]
settings["other"]["trs_types"]["microsat_monomer"]
settings["other"]["trs_types"]["min_array_length"]
settings["other"]["trs_types"]["min_gc"]
settings["other"]["trs_types"]["max_gc"]
settings["other"]["trs_types"]["min_copy_number"]
settings["other"]["trs_types"]["min_entropy"]
from trseeker.tools.tree_tools import *
Stores tree-relevant data associated with nodes.
node = NodeData(taxon=None, branchlength=1.0,
                support=None, comment=None, bootstrap=None,
                taxon_name=None, distance=None, elements=None)
Stores a list of nodes that are linked together.
chain = Chain()
# Return a list of all node ids
ids = chain.all_ids()
# Attaches node to another: (self, node, prev)
chain.add(node, prev=None)
# Deletes node from chain and relinks successors to predecessor: collapse(self, id).
chain.collapse(id)
# Kills a node from chain without caring to what it is connected
chain.kill(id)
# Disconnects node from his predecessor
chain.unlink(id)
# Connects son to parent
chain.link(id)
# Check if grandchild is a subnode of parent
chain.is_parent_of(parent, grandchild)
# Returns a list of all node_ids between two nodes (excluding start, including end)
chain.trace(start, finish)
A single node
# Represents a node with one predecessor and multiple successors
node = Node(data=None)
node.set_id(id)
node.set_data(data)
id = node.get_id()
# Returns a list of the node's successors
id = node.get_succ()
# Returns the id of the node's predecessor
id = node.get_prev()
node.set_prev()
# Adds a node id to the node's successors
node.add_succ(id)
# Removes a node id from the node's successors
node.remove_succ(id)
# Sets the node's successors
node.set_succ(id)
Represents a tree using a chain of nodes with one predecessor (=ancestor) and multiple successors (=subclades).
t = Tree(weight=1.0, rooted=False,
         name="", data=NodeData, max_support=1.0)
t.is_leaf(node)
t.is_subtree(node)
t.is_terminal(node)
t.is_internal(node)
t.is_preterminal(node)
t.has_support(node=None)
l = t.get_node_to_one_d(tree)
t.is_identical(tree)
node = t.node(id)
nodes = t.get_terminals()
n = t.count_terminals(node=None)
node = t.common_ancestor(node1, node2)
d = t.distance(nod1, node2)
dist_dict = t.get_distance_for_all_from_node(id)
bl = t.sum_branchlength(root=None, node=None)
ids = t.split(parent_id=None, n=2, branchlength=1.0)
# Bootstrap tree with given cutoff value. Nodes with support below the cutoff will be collapsed
t.bootstrap(value)
# Prunes a terminal taxon from the tree
t.prune(taxon)
# Return a list of all OTUs downwards from a node
t.get_taxa(node_id=None)
ids = t.search_taxon(taxon)
t.to_string(support_as_branchlengths=False, branchlengths_only=False, plain=True, plain_newick=False, ladderize=None)
t.make_info_string(data, terminal=False)
t.ladderize_nodes(nodes, ladderize=None)
t.newickize(node, ladderize=None)
Functions:
Get list of (slice_id, file_path):
slice_files = get_slice_files(folder)
Read tree slices:
slice_files = read_tree_slices(slice_files)
from trseeker.tools.entrez_datanase import *
Download all proteins according to a given query into output file and then return these proteins as fasta text.
fasta_data = download_proteins_from_ncbi(query, output_file, email, batch=500, verbose_step=1000)
Download items from given database according to query, rettype, and retmode. Save output to output_file and return as text.
data = download_items_from_ncbi(query,
                                database,
                                output_file,
                                email,
                                rettype="fasta",
                                retmode="text",
                                batch=500,
                                verbose_step=1000)
Get items from given database according to query, rettype, and retmode.
items = get_items_from_ncbi(query,
                            database,
                            email,
                            rettype="fasta",
                            retmode="text",
                            batch=500,
                            verbose_step=1000)
Get RNA SRA datasets from NCBI according to taxid.
Output includes LIBRARY_SOURCE, STUDY_ABSTRACT, DESIGN_DESCRIPTION, PRIMARY_ID, DESCRIPTION, LINKS.
And the second part of the output contains full xml data.
data, _ = get_rna_sra_datasets_by_taxid(taxid, email, batch=500, verbose_step=1)
all_links = {}
for *item, links in datasets.values():
    links = dict([(url.split("/")[-1], url) for url in links])
    for sra in links:
        all_links[sra] = links[sra]
Download and unpack SRA files from NCBI according to taxid.
download_rna_sra_datasets_by_taxid(taxid, email, output_folder, threads=30, batch=500, verbose_step=1)
Download genomes and annotation from NCBI according to taxid. Return refseq and genbank datasets.
download_genome_assemblies_and_annotation_from_ncbi(taxid, output_folder, threads=30, only_refseq=True)