Identifies and processes deletions from Illumina sequencing reads.
Tom Christy 2021
[email protected]
Python 2.7 including modules numpy, matplotlib, argparse, re,
ShapeMapper-V2, available at
In addition to having ShapeMapper in your path, you should also add ShapeMapper's internals/bin
folder to your path.
Flash, available at
BWA, available at
All programs should be installed and added to the path.
Clone this package to your home directory with this command: git clone
Within the now downloaded ShapeJumper_V1.0 folder is the script
Copy this script to your working directory.
Fasta file - a text file containing one or more reference DNA sequences of the RNAs probed in the SHAPE-JuMP experiment.
Example Format:
Fastq files - text files obtained from the sequencer. Fastq files can be paired or unpaired. Fastq files for both the crosslinked and control samples are required.
Fastq file format:
@Read Name
Sequence
+
Quality (PHRED scores)
See for an explanation of Quality scores generated by Illumina sequencing
See to decode ASCII symbols of PHRED scores
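As a rough illustration of what decoding the quality line involves, the sketch below converts each ASCII symbol to a PHRED score and a score to an error probability. It assumes the Phred+33 encoding used by current Illumina software (older pipelines used an offset of 64); the function names are hypothetical and not part of ShapeJumper.

```python
def decode_phred(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer PHRED scores.

    Assumes Phred+33 encoding by default; pass offset=64 for old Illumina data.
    """
    return [ord(symbol) - offset for symbol in quality_string]


def error_probability(phred_score):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10.0)


# 'I' is ASCII 73, so under Phred+33 it encodes a score of 40.
scores = decode_phred("II5+!")
print(scores)
print([error_probability(score) for score in scores])
```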
Copy the script to your current working directory. The directory you run this script in is where output will be generated.
Paired End Reads
bash -f referenceSequence.fasta -t1 crosslinkRead1.fastq -t2 crosslinkread2.fastq -c1 controlRead1.fastq -c2 controlRead2.fastq
Unpaired Reads
bash -f referenceSequence.fasta -t1 crosslinkUnPairedReads.fastq -c1 controlUnPairedReads.fastq
Input file names: The fastq files are assumed to have Illumina generated file names for paired end reads and FLASH generated filenames for unpaired reads.
Paired end read file names should contain a sample name followed by \_S##\_L001\_R#\_001.fastq
where ## is an Illumina generated sample number and # is the read number, 1 or 2.
Unpaired reads should contain the sample name followed by .extendedFrags.fastq
Failure to follow naming conventions will result in truncated and misnamed output files.
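To check file names against these conventions before a run, something like the following sketch could be used. It is not part of ShapeJumper: the function name and the regular expression are assumptions based only on the naming rules described above, and the unpaired file name in the example is hypothetical.

```python
import re

# Paired end: <sample>_S<number>_L001_R<1 or 2>_001.fastq
PAIRED_PATTERN = re.compile(
    r"^(?P<sample>.+)_S\d+_L001_R(?P<read>[12])_001\.fastq$"
)
# Unpaired (FLASH merged): <sample>.extendedFrags.fastq
UNPAIRED_SUFFIX = ".extendedFrags.fastq"


def parse_fastq_name(file_name):
    """Return (sample_name, read_number); read_number is None for unpaired reads.

    Raises ValueError when the name follows neither convention.
    """
    match = PAIRED_PATTERN.match(file_name)
    if match:
        return match.group("sample"), int(match.group("read"))
    if file_name.endswith(UNPAIRED_SUFFIX) and len(file_name) > len(UNPAIRED_SUFFIX):
        return file_name[:-len(UNPAIRED_SUFFIX)], None
    raise ValueError("File name does not follow the naming conventions: %s" % file_name)


print(parse_fastq_name("TBIA-RNaseP_S1_L001_R1_001.fastq"))
print(parse_fastq_name("TBIA-RNaseP.extendedFrags.fastq"))  # hypothetical name
```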
The bash script may also be run with the following optional flags
-sc, --structureCassette
This option removes any deletions that start or end in a structure cassette. In addition, 14 is subtracted from both the start and stop sites of every remaining deletion, so that the numbering reflects a deletion's position in the target RNA regardless of the structure cassette. This option is very useful for downstream analysis that maps deletions onto secondary or tertiary structures.
-s, --shift5Prime SHIFTINTEGER
Change the shift applied to the 5' end of deletions from its default value of 2. Zero and negative numbers are accepted.
-a, --keepAmbiguousDeletions
Include ambiguous deletions in the final output. By default they are removed.
-n, --normalizeByBothEnds
Normalize deletion counts by determining the median read depth of the 5 nucleotides at BOTH ends of the deletion and dividing the deletion count by the square root of the product of these medians. By default, deletions are normalized by the median read depth of the 5 nucleotides at the 3' end only.
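The arithmetic behind the two normalization modes can be sketched as follows. This is only an illustration of the formulas described for -n, not the ShapeJumper code itself: the function names are hypothetical, the depth values are made up, and the caller is assumed to supply the read depths of the 5 nucleotides at each end (which 5 positions the scripts actually use is not specified here). Zero depths are not handled.

```python
import math


def median(values):
    """Median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[middle])
    return (ordered[middle - 1] + ordered[middle]) / 2.0


def normalize_3prime_only(count, depths_3prime):
    """Default mode: count divided by the median depth at the 3' end."""
    return count / median(depths_3prime)


def normalize_both_ends(count, depths_5prime, depths_3prime):
    """-n mode: count divided by sqrt(median 5' depth * median 3' depth)."""
    return count / math.sqrt(median(depths_5prime) * median(depths_3prime))


# Made-up read depths for the 5 nucleotides at each end of one deletion.
depths_5prime = [90, 100, 100, 110, 120]   # median 100
depths_3prime = [380, 400, 400, 410, 420]  # median 400

print(normalize_3prime_only(8, depths_3prime))
print(normalize_both_ends(8, depths_5prime, depths_3prime))
```

Using the geometric mean of the two medians means an uneven depth profile across the deletion influences the rate from both ends rather than from the 3' end alone.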
Individual python scripts also have additional options. More experienced users can easily alter their execution in the bash script to include these options.
Run the python scripts without arguments to see all possible options.
Example files are provided from a SHAPE-JuMP experiment on the RNase P catalytic domain, a small RNA with an available structure in the PDB: 3DHS.
RNA was crosslinked with TBIA and IA was used as a mono-adduct control.
Note: All read files have been compressed. To extract the files, use the command tar -xzf fileName.tar.gz
Reference Fasta File
Contains the 268 nucleotide reference DNA sequence for the RNase P catalytic domain.
Also included are the 5' and 3' structure cassettes. Most in vitro studies of small RNAs use transcripts with structure cassettes to aid in library prep.
The 5' structure cassette is 14 nucleotides long, the 3' cassette is 43 nucleotides.
See for a more in depth explanation of structure cassettes.
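To make the effect of the -sc option concrete for these cassette lengths, here is a minimal sketch of the filtering and renumbering it describes. It is not the ShapeJumper implementation: the function name is hypothetical, coordinates are assumed to be 1-based positions in the full reference (cassettes included), and the reference length in the example is made up.

```python
FIVE_PRIME_CASSETTE = 14   # length of the 5' structure cassette
THREE_PRIME_CASSETTE = 43  # length of the 3' structure cassette


def remove_cassettes(deletions, reference_length):
    """Drop deletions touching a cassette and renumber the rest.

    deletions: list of (start, stop) pairs, 1-based, cassettes included.
    reference_length: length of the full reference, cassettes included.
    """
    first_target = FIVE_PRIME_CASSETTE + 1
    last_target = reference_length - THREE_PRIME_CASSETTE
    kept = []
    for start, stop in deletions:
        start_in_target = first_target <= start <= last_target
        stop_in_target = first_target <= stop <= last_target
        if start_in_target and stop_in_target:
            # Shift both sites so position 1 is the first target RNA nucleotide.
            kept.append((start - FIVE_PRIME_CASSETTE, stop - FIVE_PRIME_CASSETTE))
    return kept


# Hypothetical 100 nucleotide reference: target RNA spans positions 15 to 57.
example = [(10, 50), (20, 40), (30, 60)]
print(remove_cassettes(example, reference_length=100))
```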
Paired End Reads
TBIA-RNaseP_S1_L001_R1_001.fastq TBIA-RNaseP_S1_L001_R2_001.fastq
IA-RNaseP_S2_L001_R1_001.fastq IA-RNaseP_S2_L001_R2_001.fastq
25,000 paired end reads from a SHAPE-JuMP experiment on RNase P catalytic domain. TBIA samples were crosslinked, IA is the mono-adduct control.
Unpaired Reads
25,000 reads from the same SHAPE-JuMP experiment on RNase P, but the paired end mates have already been merged with FLASH into single, unpaired reads.
Deletion Text File
The final output from successful execution of ShapeJumper will be stored in a text file ending in _ProcessedDeletions.txt
and starting with the name of the crosslinked sample.
Deletions in this file have been fully processed: rates have been normalized and control deletion rates have been subtracted. Ambiguous deletions and those with inserts in the deletion longer than 10 nucleotides have been removed. Exact edge matching at deletion sites has been enforced. The 5' deletion start sites have been shifted 2 nucleotides downstream. If selected, numbering has been adjusted to account for structure cassettes.
Total Reads Aligned:11450 Total Deletions:728.0
rnasep 123 184 0.0017623086
rnasep 81 94 0.0013324450
The first line is a header giving the total number of reads aligned in the crosslinked sample. Total Deletions is the raw count of deletions longer than 10 nucleotides observed in the crosslinked sample, before any downstream filtering such as removal of ambiguous deletions.
Subsequent lines follow the same 4 column format:
- Column 1 = Reference name from fasta file matching alignment. Samples with multiple reference sequences may contain multiple names.
- Column 2 = Deletion start site. Numbering is relative to reference fasta sequence.
- Column 3 = Deletion stop site.
- Column 4 = Deletion rate frequency. This value is normalized by read depth.
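For downstream analysis, the format described above can be read with a few lines of Python. The sketch below is an assumption-based helper, not part of ShapeJumper: the function and field names are hypothetical, and columns are assumed to be separated by whitespace (tabs or spaces).

```python
import re
from collections import namedtuple

Deletion = namedtuple("Deletion", ["reference", "start", "stop", "rate"])

HEADER_PATTERN = re.compile(
    r"Total Reads Aligned:\s*(\d+)\s+Total Deletions:\s*([\d.]+)"
)


def parse_processed_deletions(lines):
    """Parse an iterable of lines from a _ProcessedDeletions.txt file.

    Returns (total_reads_aligned, total_deletions, list_of_Deletion).
    """
    lines = iter(lines)
    header = next(lines)
    match = HEADER_PATTERN.search(header)
    if match is None:
        raise ValueError("Unexpected header line: %r" % header)
    total_reads = int(match.group(1))
    total_deletions = float(match.group(2))

    deletions = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        reference, start, stop, rate = fields
        deletions.append(Deletion(reference, int(start), int(stop), float(rate)))
    return total_reads, total_deletions, deletions


example_lines = [
    "Total Reads Aligned:11450 Total Deletions:728.0",
    "rnasep 123 184 0.0017623086",
    "rnasep 81 94 0.0013324450",
]
total_reads, total_deletions, deletions = parse_processed_deletions(example_lines)
print(total_reads, total_deletions)
for deletion in deletions:
    print(deletion.reference, deletion.start, deletion.stop, deletion.rate)
```

An open file handle can be passed in place of the list, since the function only iterates over lines.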
During execution a folder named ShapeJumperIntermediateFiles will be generated. It contains all the files generated during ShapeJumper execution, for both the crosslink and control inputs. All files are text files.
File Name Guide:
= Pair mate merged reads post FLASH.
.notCombined_1.fastq
= Reads that were unable to be merged by FLASH. A _2 file is generated for read 2.
.sam
= Alignments generated by BWA-MEM from extendedFrags fastq files.
_NoFlash.sam
= Alignments of reads that did not merge by FLASH.
_Merged.sam
= The alignment files from both FLASH merged and unmerged reads concatenated together into one file.
_deletions.txt
= Set of deletions identified in the Merged.sam file. Each deletion found is recorded with the nucleotide coordinates of the deletion and the frequency at which it is found.
_normalizedDels.txt
= Deletions with counts normalized by read depth. If the option to renumber deletion sites to account for structure cassettes was selected, that is implemented here.
_Subtracted.txt
= Set of crosslinked sample deletions with the normalized rates of the control sample subtracted from their normalized rates.
_DelReadNames.txt
= Stores every read name of sequences containing one or more identified deletions. Can be useful for follow up/in-depth analysis.
.fa.amb, .fa.ann, .fa.bwt, .fa.pac,
= index files generated by BWA-MEM
Note: Depending on options selected, some files will not be present.