Author: Miles Woodcock-Girard, Walker Lab, UIC Department of Biological Sciences
As-Sembl-y pipeline for tr-ans-criptomes
Semblans is a tool that enables the automatic assembly of de novo transcriptomes for non-model organisms.
The easiest way to install Semblans is via Docker, or to download the latest pre-built binaries here
By integrating several external packages and leveraging the performance of C++ data streaming, Semblans streamlines the necessary pre-processing, quality control, assembly, and post-assembly steps, allowing a hands-off assembly process without loss of versatility. The following diagram shows a graphical workflow of the pipeline. The reference proteome has been omitted for simplicity, but is used by Diamond during the BLASTX / BLASTP steps of postprocess:
All documentation for Semblans can be found in the wiki
Semblans will install most of the dependencies it requires, but make sure you have working installations of:
On Ubuntu this can be done by running:
sudo apt update
sudo apt install bowtie2 jellyfish salmon samtools python3-numpy
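After installing, a quick check that the tools are actually reachable can save a failed run later. The following is a minimal sketch, not part of Semblans itself: `missing_tools` is a hypothetical helper, and the command names are assumed to match the apt package names above.

```shell
# Print the name of every given command that cannot be found on PATH.
# (missing_tools is an illustrative helper, not a Semblans command.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names are assumed to match the apt package names above.
missing_tools bowtie2 jellyfish salmon samtools

# NumPy is a Python module rather than a command, so test it by importing it.
python3 -c 'import numpy' 2>/dev/null || echo "python3-numpy"
```

If nothing is printed, all five dependencies were found.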
The easiest way to install Semblans is via Docker, or to download the latest binaries here.
If instead the user wishes to build from source, they must clone this repository, navigate to the Semblans root directory and then call:
./install.sh
Please allow several minutes for Semblans to set up the necessary packages.
By default, Semblans will not retrieve the PantherDB functional protein database for sequence annotation. **If the user intends to utilize Semblans' annotation functionality, they should instead call the following installation command:**
./install.sh --with-panther
Be aware that the PantherDB database is large (~17GB compressed; ~80GB uncompressed), and can take some time to download.
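Given those sizes, it may be worth checking free disk space before starting the download. The sketch below assumes the database is stored on the same filesystem as the current directory, and the 100 GB threshold is an assumed margin (roughly the uncompressed size plus the compressed archive), not a figure from Semblans. `has_free_gb` is a hypothetical helper.

```shell
# Succeed if the filesystem holding directory $2 has at least $1 GB free.
# (has_free_gb is an illustrative helper, not a Semblans command.)
has_free_gb() {
    need_kb=$(( $1 * 1024 * 1024 ))
    # df -Pk prints one POSIX-format line per filesystem, sizes in 1 KiB blocks;
    # column 4 of the data line is the available space.
    free_kb=$(df -Pk "$2" | awk 'NR==2 {print $4}')
    [ "$free_kb" -ge "$need_kb" ]
}

# 100 GB is an assumed margin: ~80GB uncompressed plus the ~17GB archive.
if has_free_gb 100 .; then
    echo "enough free space for PantherDB"
else
    echo "less than 100 GB free on this filesystem"
fi
```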
Included with Semblans is a directory called 'examples'. This directory contains a very small short read dataset ("ChloroSubSet") for testing/verifying functionality of the Semblans pipeline. To test, uncompress the data from ChloroSubSet.tar.gz. The user should then ensure they have a reference proteome, as one is necessary for several of the pipeline's postprocessing stages. Links to broad, kingdom-level reference proteomes are hosted at the bottom of this document. In this example, I use the kingdom-level plant proteome. Once prepared, the user may call:
semblans \
--left ChloroSubSet_1.fq \
--right ChloroSubSet_2.fq \
--prefix ChloroSubSet \
--ref-proteome ensembl_plant.pep.all.fa \
--threads 4 \
--ram 10
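Before launching the run above, it can help to confirm that the reads and the reference proteome are where the command expects them. This is a minimal sketch: `missing_files` is a hypothetical helper, and the file names are simply those used in the example command. Where the .fq files end up after extraction depends on the archive's internal layout, which is not assumed here.

```shell
# Print each named file that does not exist; return non-zero if any is missing.
# (missing_files is an illustrative helper, not a Semblans command.)
missing_files() {
    status=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            printf 'missing: %s\n' "$f"
            status=1
        fi
    done
    return "$status"
}

# Unpack the example reads if the archive is in the current directory.
# If the check below still reports the reads as missing, list the archive
# contents with `tar -tzf ChloroSubSet.tar.gz` to see where they were extracted.
if [ -f ChloroSubSet.tar.gz ]; then
    tar -xzf ChloroSubSet.tar.gz
fi

missing_files ChloroSubSet_1.fq ChloroSubSet_2.fq ensembl_plant.pep.all.fa || true
```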
Some users may experience issues, particularly during the Trinity transcript assembly phase. Common errors and solutions are hosted on our GitHub wiki page. As cataloguing these is an ongoing process, we urge users to post an issue on the Semblans repository page detailing their problem if it persists or is not addressed by the wiki.
Reference peptide sets (gzipped FASTA) for the postprocess step:
Ensembl animal reference (3.1 GiB) [Option 1 | Option 2]