Collection of short Python scripts for bioinformatics

Data scraping

Tandem repeats search with TRF

Search for tandem repeats in a folder of FASTA files:

python parallel_trf.py input_folder output_folder mask threads

Example:

python parallel_trf.py ~/human_genome/fasta ~/human_genome/trf fa 20
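The internals of parallel_trf.py are not shown in this README. As a rough sketch of the idea only, assuming that mask is a file-name suffix and that the trf binary is on PATH, one TRF process per FASTA file could be launched from a thread pool. The function names are hypothetical, and the parameter set 2 7 7 80 10 50 500 is the one commonly recommended for TRF, not necessarily what the script uses.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed TRF parameters: match, mismatch, delta, PM, PI, minscore, maxperiod.
TRF_PARAMS = ["2", "7", "7", "80", "10", "50", "500"]


def build_trf_command(fasta_path):
    """Build a TRF command line; -d writes a .dat file, -h suppresses HTML output."""
    return ["trf", str(fasta_path), *TRF_PARAMS, "-d", "-h"]


def run_trf_on_folder(input_folder, output_folder, mask, threads):
    """Run TRF on every file in input_folder whose name ends with mask (assumed suffix)."""
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    fasta_files = sorted(
        p.resolve() for p in Path(input_folder).iterdir() if p.name.endswith(mask)
    )

    def run_one(fasta_path):
        # TRF writes its output files into the current working directory,
        # so each process is started inside the output folder.
        # The return code is deliberately not checked in this sketch.
        subprocess.run(build_trf_command(fasta_path), cwd=out_dir)
        return fasta_path

    # Threads are enough here: the heavy work happens in the trf subprocesses.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(run_one, fasta_files))
```

With these assumptions, run_trf_on_folder("fasta", "trf", "fa", 20) would mirror the command-line example above.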

Illumina run statistics

Compute and plot the distribution of paired-end (PE) fragment lengths:

python fragments_length_from_sam.py -o image_file -i sam_file

Functions for SAM files

Count unmapped reads:

from PyBioSnippets.sam.sam_functions import count_unmapped

(mapped, unmapped) = count_unmapped(sam_file)
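For readers unfamiliar with how unmapped reads are marked: in the SAM format, bit 0x4 of the FLAG column (the second column) means the read is unmapped. The sketch below shows that check on an iterable of SAM lines. The name count_unmapped_sketch is hypothetical and this is not necessarily how count_unmapped is implemented.

```python
def count_unmapped_sketch(sam_lines):
    """Return (mapped, unmapped) record counts from an iterable of SAM lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:  # FLAG bit 0x4: segment unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped
```

An open file handle can be passed directly, since file objects iterate over lines.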

Save unmapped reads from SAM file to fasta file:

from PyBioSnippets.sam.sam_functions import save_unmapped_to_fasta

save_unmapped_to_fasta(sam_file, fasta_file)
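A minimal sketch of the same idea, assuming unmapped reads are selected by FLAG bit 0x4 and written with the read name (column 1) as the FASTA header and SEQ (column 10) as the sequence. The function name is hypothetical and save_unmapped_to_fasta itself may behave differently, for example in how it names records.

```python
import io


def unmapped_to_fasta_sketch(sam_lines, fasta_handle):
    """Write unmapped SAM records to fasta_handle; return how many were written."""
    written = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:  # FLAG bit 0x4: segment unmapped
            # fields[0] is QNAME, fields[9] is SEQ ('*' if the sequence is not stored).
            fasta_handle.write(f">{fields[0]}\n{fields[9]}\n")
            written += 1
    return written


# Small in-memory demo with one mapped and one unmapped record.
demo_sam = [
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\n",
]
demo_out = io.StringIO()
n_written = unmapped_to_fasta_sketch(demo_sam, demo_out)
```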

Compute fragment length statistics for the first l lines:

python fragments_length_from_sam.py -o stat.png -i data.sam -l 100000
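As an illustration of where fragment lengths come from in a SAM file: the TLEN column (the ninth column) holds the signed observed template length. The sketch below collects it for properly paired reads (FLAG bit 0x2), keeping only positive values so each pair is counted once. The function name and the max_lines parameter (meant to mirror -l) are hypothetical, and whether the script counts header lines toward -l is not stated here.

```python
from statistics import mean, median


def fragment_lengths_sketch(sam_lines, max_lines=None):
    """Collect positive TLEN values of properly paired reads."""
    lengths = []
    for i, line in enumerate(sam_lines):
        # Assumption: the line limit counts every line, headers included.
        if max_lines is not None and i >= max_lines:
            break
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        # 0x2: read mapped in a proper pair. The leftmost mate carries the
        # positive TLEN, so this keeps exactly one value per pair.
        if flag & 0x2 and tlen > 0:
            lengths.append(tlen)
    return lengths


def summarize_sketch(lengths):
    """Return (mean, median) of the lengths, or None for an empty list."""
    if not lengths:
        return None
    return mean(lengths), median(lengths)
```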

Count FLAG values in a given SAM file:

python hiseq/sam_stats.py -i data.sam
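Counting FLAG values amounts to tallying the second column of every alignment line. A minimal sketch with collections.Counter follows. The function name is hypothetical and the output format of sam_stats.py is not described here.

```python
from collections import Counter


def count_flags_sketch(sam_lines):
    """Return a Counter mapping each FLAG value to its number of records."""
    return Counter(
        int(line.split("\t")[1])
        for line in sam_lines
        if line.strip() and not line.startswith("@")  # skip blank and header lines
    )
```

Calling most_common() on the result lists the FLAG values from most to least frequent.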

Fastq operations

Join split HiSeq files:

python hiseq/join_fastq.py --remove False --input some_folder --mask read_L001_R1

Fix overly long quality strings in corrupted HiSeq files:

fix_uncorrect_long_quality(fastq_file, corrected_fastq_output)
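What the fix does exactly is not documented here. One plausible reading, shown as a sketch only, is that a quality line longer than its sequence line is truncated to the sequence length. The function name is hypothetical, and the code assumes strict four-line FASTQ records.

```python
def trim_long_quality_sketch(fastq_lines):
    """Yield (header, seq, plus, qual) tuples with qual cut to len(seq)."""
    stripped = (line.rstrip("\n") for line in fastq_lines)
    # Zipping the same iterator four times groups lines into 4-line records;
    # an incomplete trailing record is silently dropped.
    for header, seq, plus, qual in zip(stripped, stripped, stripped, stripped):
        yield header, seq, plus, qual[:len(seq)]
```

Records whose quality string already matches the sequence length pass through unchanged.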

Iterator for paired-end files:

for read_obj1, read_obj2 in iter_pe_data(fastq_file1, fastq_file2):
	do_something()
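A paired-end iterator can be sketched as two FASTQ parsers advanced in lockstep. The Read tuple and both function names below are hypothetical, and the read objects returned by iter_pe_data may expose different attributes. The sketch assumes four-line records and that both files list the mates in the same order.

```python
from collections import namedtuple

# Hypothetical read container; not the repository's read class.
Read = namedtuple("Read", ["name", "seq", "qual"])


def iter_fastq_sketch(lines):
    """Yield Read tuples from an iterable of FASTQ lines (4 lines per record)."""
    stripped = (line.rstrip("\n") for line in lines)
    for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
        yield Read(header[1:], seq, qual)  # drop the leading '@'


def iter_pe_sketch(lines1, lines2):
    """Yield (read1, read2) pairs; stops at the end of the shorter input."""
    yield from zip(iter_fastq_sketch(lines1), iter_fastq_sketch(lines2))
```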

Convert fastq to fasta:

python hiseq/fastq_to_fasta.py -i data.fastq -o data.fasta

Kmers analysis

Compute k-mer frequency percentages for a coverage plot:

python compute_kmer_coverage.py input_file output_file
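The input format and the exact statistic are not described here. One common reading of a k-mer coverage plot, sketched below under that assumption, is the percentage of distinct k-mers that occur 1, 2, 3, ... times. The function name and the choice to count k-mers directly from sequences are assumptions. The script may instead read precomputed counts.

```python
from collections import Counter


def kmer_multiplicity_percent_sketch(sequences, k):
    """Map each multiplicity to the percentage of distinct k-mers having it."""
    kmer_counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer_counts[seq[i:i + k]] += 1
    # Histogram: how many distinct k-mers were seen exactly m times.
    histogram = Counter(kmer_counts.values())
    total = sum(histogram.values())
    return {m: 100.0 * n / total for m, n in sorted(histogram.items())}
```

Plotting multiplicity on the x-axis against percentage on the y-axis gives the coverage curve.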

PacBio analysis

Convert bax.h5 files into fasta and fastq files.

ls | grep bax.h5 | xargs -n 1 --max-procs 64 python baxh5_to_fastq.py

cat *fasta > pacbio.fasta

cat *fastq > pacbio.fastq

Chromosome statistics

Get a dictionary of chromosome lengths:

chr2length = get_chromosome_lengths(reference_multifasta)
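A minimal sketch of how such a dictionary could be built from a multi-FASTA file, assuming the chromosome name is the first whitespace-separated token of each header line. The function name is hypothetical and get_chromosome_lengths may derive names differently.

```python
def chromosome_lengths_sketch(fasta_lines):
    """Return {sequence name: length} from an iterable of multi-FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first token after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequence may span several lines
    return lengths
```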