A small Snakemake workflow to extract a set of tandem poly(A) sites (= sites on transcripts with more than one known poly(A) site) from the PolyASite atlas (BED) and a genomic annotation (GTF). The output bed files from this pipeline can be used to run PAQR.
Go to the desired directory/folder on your file system, then clone/get the repository and move into the respective directory with:
git clone git@github.com:zavolanlab/tandem-pas.git
cd tandem-pas
Workflow dependencies can be conveniently installed with the Conda package manager. We recommend that you install Miniconda for your system (Linux). Be sure to select the Python 3 option.
For improved reproducibility and reusability of the workflow, each individual step of the workflow runs either in its own Singularity container or in its own Conda virtual environment. As a consequence, running this workflow has very few individual dependencies. The container execution requires Singularity to be installed on the system where the workflow is executed. As the functional installation of Singularity requires root privileges, and Conda currently only provides Singularity for Linux architectures, the installation instructions are slightly different depending on your system/setup:
If you do not have root privileges on the machine you want to run the workflow on or if you do not have a Linux machine, please install Singularity separately and in privileged mode, depending on your system. You may have to ask an authorized person (e.g., a systems administrator) to do that. This will almost certainly be required if you want to run the workflow on a high-performance computing (HPC) cluster.
After installing Singularity, or should you choose not to use containerization but only conda environments, install the remaining dependencies with:
conda env create -f install/environment.yml
If you have a Linux machine and root privileges (e.g., if you plan to run the workflow on your own computer), you can execute the following command to include Singularity in the Conda environment:
conda env create -f install/environment.root.yml
Activate the Conda environment with:
conda activate tandem_pas
This repository contains a small test dataset that lets users test their installation. To start the test run (using Conda environments), navigate to the root of the cloned repository, make sure the Conda environment tandem_pas is activated, and execute the following command:
bash execute/run_local_conda_test.sh
The file "configs/config.yaml" contains all information about parameter values, data locations, file names and so on. During a run, all steps of the pipeline retrieve their parameter values from this file. It follows the YAML syntax (see the YAML documentation for details), which makes it easy to read and edit. The main principles are:
- everything that comes after a # symbol is considered a comment and will not be interpreted
- parameters are given as key-value pairs, with key being the name and value the value of the parameter
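To make the two principles concrete, here is a minimal sketch that writes a tiny, made-up YAML file and prints only its active key-value lines. The file content is an assumption for illustration (only biotype_key is a key named in this README); the real entries live in configs/config.yaml.

```shell
# Hypothetical two-line config, written to a temporary file for illustration only
demo_cfg="$(mktemp)"
cat > "$demo_cfg" <<'EOF'
# this whole line is a comment and is not interpreted
biotype_key: "transcript_biotype"  # key, colon, value; this trailing text is a comment
EOF

# Keep only active lines: drop blank lines and lines that start with '#'
# (trailing comments on a key-value line are left in place)
active_lines="$(grep -Ev '^[[:space:]]*(#|$)' "$demo_cfg")"
echo "$active_lines"
rm -f "$demo_cfg"
```

Running the same grep command on configs/config.yaml gives a quick overview of the entries that are actually in effect.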
Some entries require your editing (e.g., file paths, or whether you want a tandem PAS file for unstranded or stranded data), while most can be left unchanged. The config file contains all parameters used in the pipeline, and the comments explain their meaning. If you need to change paths, please make sure to use relative instead of absolute paths.
Download the PolyASite atlas and gene annotations (e.g., from Ensembl) for your organism and specify their paths in config.yaml. Please be mindful of the biotype_key key in the configuration file, which should be set according to the annotation type provided to the pipeline (e.g., "transcript_biotype" for Ensembl GTF files, "transcript_type" for GENCODE GTF files).
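If you are unsure which attribute name your annotation uses, a quick look into the GTF can help. The following is a minimal sketch, not part of the pipeline; the function name gtf_biotype_keys is made up for this example, and it assumes an uncompressed GTF file.

```shell
# Print the transcript biotype attribute name(s) found in an uncompressed GTF:
# "transcript_biotype" (Ensembl style) and/or "transcript_type" (GENCODE style).
# Prints nothing if neither attribute occurs in the file.
gtf_biotype_keys() {
    grep -Eo 'transcript_(bio)?type' "$1" | sort -u
}

# Usage (the path is a placeholder):
# gtf_biotype_keys path/to/annotation.gtf
```

The name that is printed is the value to use for biotype_key.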
NOTE: the pipeline will only work if the poly(A) sites file and the annotation GTF use the same chromosome naming scheme. For PolyASite 2.0 derived files, this means that Ensembl annotations have to be used (lacking the leading "chr"). However, as GENCODE annotations are based on Ensembl, you could possibly - AT YOUR OWN RISK - remove the "chr" from GENCODE annotations before running the pipeline with
awk -F'\t' -v OFS='\t' '{ sub(/^chr/, "", $1); print }' GENCODE.gtf > GENCODE.CHR_REMOVED.gtf
Conversely, if you need GENCODE-style annotations, you could add the leading "chr" to the PolyASite atlas file.
# Example for mouse
sed -E -e 's/^([0-9]+|[XY])/chr\1/' -e 's/^MT/chrM/' atlas.clusters.2.0.GRCm38.96.bed > atlas.clusters.2.0.GRCm38.96.mod.bed
# Example for human (MT is renamed to chrM, as in the mouse example)
sed -E -e 's/^([0-9]+|[XY])/chr\1/' -e 's/^MT/chrM/' atlas.clusters.2.0.GRCh38.96.bed > atlas.clusters.2.0.GRCh38.96.wchr.bed
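Whichever direction you convert, it may be worth checking that the two files really share a naming scheme before starting a run. A minimal sketch of such a check follows; the function name chroms_missing_from_gtf is made up for this example, and it assumes uncompressed, tab-separated files whose first column holds the chromosome name.

```shell
# List chromosome names that occur in a BED file (arg 1) but not in a GTF (arg 2).
# Empty output means every chromosome in the BED also appears in the GTF.
chroms_missing_from_gtf() {
    bed_chroms="$(mktemp)"
    gtf_chroms="$(mktemp)"
    cut -f1 "$1" | sort -u > "$bed_chroms"
    # Skip GTF header lines, which start with '#'
    grep -v '^#' "$2" | cut -f1 | sort -u > "$gtf_chroms"
    # comm needs sorted input; -23 keeps only lines unique to the first file
    comm -23 "$bed_chroms" "$gtf_chroms"
    rm -f "$bed_chroms" "$gtf_chroms"
}

# Usage (both paths are placeholders):
# chroms_missing_from_gtf path/to/atlas.bed path/to/annotation.gtf
```

A long list of names in the output (e.g., every chromosome) points to mismatched naming schemes, while a few unplaced scaffolds may be harmless.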
If you are using poly(A) site annotations from a source other than PolyASite, make sure to provide proper BED format and create a PolyASite-like ID for each site in column 4 (format: "chr:site:strand", where site is the representative site of the cluster or, for a single site, its single-nucleotide position).
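As a rough illustration of building such IDs, the sketch below overwrites column 4 of a six-column BED file. It rests on assumptions that you should verify against your data and a few lines of the PolyASite atlas: each line describes a single-nucleotide site, the BED end coordinate (column 3) is the position to use as site, and the strand is in column 6. The function name add_polyasite_like_id is made up for this example.

```shell
# Replace column 4 of a tab-separated BED file with "chr:site:strand".
# ASSUMPTION: the site position is the BED end coordinate (column 3) and
# the strand is in column 6; adjust if your data follow another convention.
add_polyasite_like_id() {
    awk -F'\t' -v OFS='\t' '{ $4 = $1 ":" $3 ":" $6; print }' "$1"
}

# Usage (both paths are placeholders):
# add_polyasite_like_id path/to/sites.bed > path/to/sites.with_ids.bed
```

For clusters spanning several nucleotides, the representative site has to come from your own data, so this sketch does not apply as is.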
Go to the root folder of this repo and make sure you have the Conda environment tandem_pas activated. For your convenience, the directory execute contains bash scripts that can be used to start local runs, using either Singularity or Conda, and a Slurm cluster run, using Singularity.