MosaicViewer_HTT is a pipeline for schematic visualization of alleles with somatic mosaicism. Because of mosaicism, long sequencing reads cannot be collapsed into an accurate consensus sequence, so repeat annotation can only be performed on each read individually. MosaicViewer_HTT integrates a tool for repeat annotation of noisy long reads, aligns reads to the left and right flanking regions, and generates "simplified" reads that make alternative motifs easier to identify in IGV. The pipeline has only been used for HTT alleles so far, but its applicability can be extended with minor modifications.
Prerequisites
- Miniconda3.
Tested with conda 4.10.3.
which conda
should return the path to the executable. If you don't have Miniconda3 installed, you can download and install it with:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod 755 Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
- A FASTQ file containing reads from one sample.
- A FASTA file containing the reference sequence (e.g. hg38).
- Coordinates of the flanking regions (e.g. the regions flanking the repeat).
Installation
git clone https://github.com/MaestSi/MosaicViewer_HTT.git
cd MosaicViewer_HTT
chmod 755 *
./install.sh
The installer creates a conda environment named MosaicViewer_env, where seqtk, minimap2, samtools, NoiseCancellingRepeatFinder, BBMap and R with the Biostrings package are installed. It also creates a second conda environment named NanoFilt_env, where NanoFilt is installed. Afterwards, open the config_MosaicViewer.sh file with a text editor and set the variables PIPELINE_DIR and MINICONDA_DIR to the values suggested by the installation step.
Before running the pipeline, open the config_MosaicViewer.sh file with a text editor and set all the variables. The in-silico PCR primers and the flanking regions used for left or right alignment are extracted based on the reference coordinates set in this file.
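As a quick sanity check before launching the pipeline, the two directory variables mentioned above can be verified from the shell. The following is a minimal sketch, not part of the pipeline: the function name check_config_dirs is hypothetical, and it assumes only that PIPELINE_DIR and MINICONDA_DIR are ordinary shell variables that should point to existing directories.

```shell
# Hypothetical helper (not shipped with MosaicViewer_HTT): verify that the
# directory variables named in the README are set and point to directories.
# The body runs in a subshell so its working variables do not leak out.
check_config_dirs() (
  status=0
  for name in PIPELINE_DIR MINICONDA_DIR; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "ERROR: $name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "ERROR: $name does not point to a directory: $value" >&2
      status=1
    fi
  done
  # Exit status of the subshell becomes the function's return status.
  exit "$status"
)
```

Assuming config_MosaicViewer.sh only assigns variables, it could be loaded into the current shell with `. ./config_MosaicViewer.sh` and then checked with `check_config_dirs`; a non-zero return status means at least one variable needs fixing.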
MosaicViewer.sh
Usage: ./MosaicViewer.sh
Note: the file config_MosaicViewer.sh must be in the same directory as MosaicViewer.sh. The pipeline currently supports CAG, CGG and CAA repeat motifs.
Outputs:
- $SAMPLE_NAME"_trimmed_"$SIDE".bam": bam file containing expanded reads aligned to $GENE_NAME"_masked_reference_"$SIDE".fasta"
- $SAMPLE_NAME"_trimmed_simplified_"$SIDE"_final.bam": bam file containing simplified version of expanded reads aligned to $GENE_NAME"_masked_reference_"$SIDE".fasta", where the sequence of each identified repeat has been replaced with a single repeated nucleotide (CAG -> C; CGG -> GGG; CAA -> AAA; other -> N)
- Other temporary files
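The naming pattern of the two BAM outputs listed above can be expanded for a concrete sample. This is a small sketch only: the function name expected_bams is hypothetical, and the values used for the sample name and side ("sample1", "right") are assumptions, since the actual values come from config_MosaicViewer.sh.

```shell
# Hypothetical helper: print the two main BAM file names described in the
# README for a given sample name ($1) and alignment side ($2).
expected_bams() {
  printf '%s_trimmed_%s.bam\n' "$1" "$2"
  printf '%s_trimmed_simplified_%s_final.bam\n' "$1" "$2"
}

# Hypothetical helper: report which of the expected BAM files exist
# in the current directory.
report_bams() {
  expected_bams "$1" "$2" | while IFS= read -r bam; do
    if [ -f "$bam" ]; then
      echo "found: $bam"
    else
      echo "missing: $bam"
    fi
  done
}
```

For instance, `report_bams sample1 right` run from the output directory would list both expected files and flag any that are absent, which can help confirm a run finished before loading the results into IGV.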
For example, this is what the right alignment of trimmed reads (with or without colouring based on annotated repeats) looks like.