rG4-seeker is a pipeline for processing and analyzing rG4-seq data (https://www.nature.com/articles/nmeth.3965)
Requirements
- 16GB RAM
- At least 30GB free hard disk space
- At least 1 CPU core (recommended: a 4-core desktop CPU)
- Linux operating system (tested on Ubuntu 16.04 and 18.04)
- Python 3
Install rG4-seeker:
    git clone https://github.com/TF-Chan-Lab/rG4-seeker
    cd rG4-seeker
    pip install .
Prepare input files for rG4-seeker
Reference genome in FASTA format
One or more gene annotations in GFF3 or GTF format
Genome-aligned, coordinate-sorted rG4-seq reads in BAM format
- A BAM index is required (it can be generated with the “samtools index” command)
- Currently, only output from the STAR aligner is supported/tested
- Support for the HISAT2 and TopHat2 aligners will be added in the future
A working directory for rG4-seeker to write its intermediate/output files
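The input requirements listed above can be checked before starting a run. The following is a minimal sketch and not part of rG4-seeker: the function name `check_inputs` is hypothetical, and it assumes that the BAM index is stored next to the BAM file as either `<name>.bam.bai` or `<name>.bai`.

```python
import os


def check_inputs(fasta, annotations, bams, working_dir):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Every input file has to exist on disk.
    for path in [fasta, *annotations, *bams]:
        if not os.path.isfile(path):
            problems.append(f"missing input file: {path}")
    # Assumption: the index sits beside the BAM as <file>.bam.bai or <file>.bai
    for bam in bams:
        candidates = [bam + ".bai", os.path.splitext(bam)[0] + ".bai"]
        if not any(os.path.isfile(c) for c in candidates):
            problems.append(f"missing BAM index for: {bam}")
    # The working directory must exist and be writable.
    if not os.path.isdir(working_dir) or not os.access(working_dir, os.W_OK):
        problems.append(f"working directory not writable: {working_dir}")
    return problems
```

Calling `check_inputs("genome.fa", ["genes.gff3"], ["K-rep1.bam"], "work/")` (hypothetical paths) returns one message per failed check, which can be printed before rG4-seeker is launched.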
Construct a configuration file
- rG4-seeker reads program settings and location of input files via a configuration file (config.ini)
- A template configuration file (example.ini) is provided
- Please refer to the ‘configuration file format’ section for modifying the configuration file
Run rG4-seeker:
    rG4-seeker config.ini
Obtain results:
- RTS (reverse transcriptase stalling) sites identified by rG4-seeker are reported in 2 CSV files, which can be opened directly in spreadsheet editors such as Microsoft Excel.
- The 2 CSV files have identical content except for the level of detail in the sequence diagram:
  - [SAMPLE_NAME].rG4_list.full_combined.csv: the sequence diagram shows the aggregated RTS sites from all rG4-seq K+ and K+/PDS replicate datasets
  - [SAMPLE_NAME].rG4_list.full_combined_breakdown.csv: the sequence diagram shows the RTS sites identified in each individual dataset
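Besides a spreadsheet editor, the result files can be inspected with a script. The column layout of the CSV files is not described here, so the sketch below makes no assumption about column names and only reports the header and the number of data rows. The function name `summarize_csv` is hypothetical.

```python
import csv


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with one header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count
```

With SAMPLE_NAME set to HeLa-rG4seq, for example, the file to pass would be `HeLa-rG4seq.rG4_list.full_combined.csv` inside the working directory.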
An example dataset hela-2016-chr20.tar.gz, derived from the published HeLa rG4-seq dataset (Kwok et al. 2016), is available for testing new rG4-seeker installations. The run takes approximately 60 minutes.
The example dataset extracts into a directory containing all input files and a configuration file, and can be run directly:
    tar -axzf hela-2016-chr20.tar.gz
    cd hela-2016-chr20
    rG4-seeker hela-2016-chr20.ini
Configuration file format
- rG4-seeker reads configuration files with the Python 3 configparser module
- An rG4-seeker configuration file consists of 4 kinds of sections ([global], [genome], [annotation_n] and [replicate_n]), each led by a [section] header and followed by option/value entries separated by ‘=’
- Unless specified, all options are required
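Because the file is a standard configparser file, it can be loaded and inspected with a few lines of Python. The sketch below is illustrative only and does not reproduce rG4-seeker's own parsing code: the helper `expected_sections` is hypothetical, and it assumes sections are numbered from 1, as in the examples in this document. Note that configparser treats option names case-insensitively by default.

```python
import configparser

# A shortened, illustrative configuration; a real file needs all sections.
EXAMPLE = """
[global]
SAMPLE_NAME = HeLa-rG4seq
NO_OF_ANNOTATIONS = 2
NO_OF_REPLICATES = 2
"""


def expected_sections(config):
    """List the section names implied by the counts in [global]."""
    n_annotations = config.getint("global", "NO_OF_ANNOTATIONS")
    n_replicates = config.getint("global", "NO_OF_REPLICATES")
    sections = ["global", "genome"]
    sections += [f"annotation_{i}" for i in range(1, n_annotations + 1)]
    sections += [f"replicate_{i}" for i in range(1, n_replicates + 1)]
    return sections


config = configparser.ConfigParser()
config.read_string(EXAMPLE)  # use config.read("config.ini") for a file on disk
missing = [s for s in expected_sections(config) if not config.has_section(s)]
print("missing sections:", missing)
```

Running the same check on a complete configuration file should report an empty list.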
[global] section
- Global parameters are configured in this section
| Options | Values | Remarks |
| --- | --- | --- |
| WORKING_DIR | Path to the working directory for rG4-seeker to place intermediate/output files | |
| SAMPLE_NAME | An arbitrary identifier for the rG4-seq sample (e.g. HeLa-rG4seq) | The identifier will be the prefix for all output files |
| THREADS | No. of CPU threads rG4-seeker can use | |
| NO_OF_ANNOTATIONS | No. of gene annotation sets to use | rG4-seeker can simultaneously use multiple annotation sets (e.g. GENCODE and RefSeq) |
| NO_OF_REPLICATES | No. of replicates present in the rG4-seq dataset | |
| HAVE_KPDS_CONDITION | ‘True’ or ‘False’, indicating whether the K+/PDS condition is present in the rG4-seq dataset | |
| ALIGNER | Name of the short read aligner used (e.g. ‘STAR’) | Currently only the ‘STAR’ aligner is supported |
| READS_TYPE | ‘SE’ or ‘PE’, corresponding to single-end or paired-end Illumina read types | |
Example configuration for [global] section:
    [global]
    WORKING_DIR = /home/user/rg4seeker_working_dir/
    SAMPLE_NAME = HeLa-rG4seq
    THREADS = 8
    NO_OF_ANNOTATIONS = 2
    NO_OF_REPLICATES = 2
    HAVE_KPDS_CONDITION = True
    ALIGNER = STAR
    READS_TYPE = SE
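The constraints on the [global] options can be checked before a run. The following is a sketch under stated assumptions, not rG4-seeker's own validation: `validate_global` is a hypothetical helper, and it accepts only the literal values documented for each option (rG4-seeker itself may be more lenient, for instance about letter case).

```python
import configparser

EXAMPLE = """
[global]
WORKING_DIR = /home/user/rg4seeker_working_dir/
SAMPLE_NAME = HeLa-rG4seq
THREADS = 8
NO_OF_ANNOTATIONS = 2
NO_OF_REPLICATES = 2
HAVE_KPDS_CONDITION = True
ALIGNER = STAR
READS_TYPE = SE
"""

# Documented literal values for the enumerated options.
ALLOWED = {
    "HAVE_KPDS_CONDITION": ("True", "False"),
    "ALIGNER": ("STAR",),
    "READS_TYPE": ("SE", "PE"),
}


def validate_global(config):
    """Return a list of problems found in the [global] section."""
    section = config["global"]
    problems = []
    for option in ("WORKING_DIR", "SAMPLE_NAME"):
        if not section.get(option):
            problems.append(f"{option} is missing or empty")
    for option in ("THREADS", "NO_OF_ANNOTATIONS", "NO_OF_REPLICATES"):
        try:
            # getint returns None for a missing option, so the comparison raises TypeError
            if section.getint(option) < 1:
                problems.append(f"{option} must be at least 1")
        except (TypeError, ValueError):
            problems.append(f"{option} is missing or not an integer")
    for option, values in ALLOWED.items():
        if section.get(option) not in values:
            problems.append(f"{option} must be one of {values}")
    return problems


config = configparser.ConfigParser()
config.read_string(EXAMPLE)
print(validate_global(config))
```

A misspelled option name such as THEADS is reported as a missing THREADS entry, which makes this kind of typo easy to spot.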
[genome] section
- The reference genome to use is specified in this section
| Options | Values | Remarks |
| --- | --- | --- |
| GENOME_FASTA | Path to the reference genome sequence in FASTA format | The FASTA file must be in uncompressed format |
| GENOME_FASTA_FAI | Path to the fai index file of the reference genome sequence | A fai index can be generated using samtools |
Example configuration for [genome] section:
    [genome]
    GENOME_FASTA = /home/user/references/GRCh38.primary_assembly.genome.fa
    GENOME_FASTA_FAI = /home/user/references/GRCh38.primary_assembly.genome.fa.fai
[annotation_n] section
- The gene annotation set(s) to use are specified in this section
| Options | Values | Remarks |
| --- | --- | --- |
| ANNOTATION_NAME | An identifier for the gene annotation (e.g. GENCODE) | |
| ANNOTATION_GFF | Path to the annotation GFF3/GTF file | The GFF3/GTF file can be compressed (in .gz format) |
Note: Please provide one [annotation_n] section per annotation set, so that the number of sections matches NO_OF_ANNOTATIONS
Example configuration for the [annotation_n] sections when 2 annotation sets are used:
    [annotation_1]
    ANNOTATION_NAME = Gencode
    ANNOTATION_GFF = /home/user/references/gencode.v29.primary_assembly.annotation.gff3.gz

    [annotation_2]
    ANNOTATION_NAME = RefSeq
    ANNOTATION_GFF = /home/user/references/GRCh38.RefSeqGeneAnnotation.gff.gz
[replicate_n] section
- The rG4-seq datasets to use (in format of aligned reads) are specified in this section
| Options | Values | Remarks |
| --- | --- | --- |
| LI_BAM_FILE | Path to the BAM file containing aligned reads from rG4-seq (Li+ condition) | |
| K_BAM_FILE | Path to the BAM file containing aligned reads from rG4-seq (K+ condition) | |
| KPDS_BAM_FILE | Path to the BAM file containing aligned reads from rG4-seq (K+/PDS condition) | Required if ‘HAVE_KPDS_CONDITION’ is set as ‘True’ |
Note: Please provide one [replicate_n] section per rG4-seq replicate, so that the number of sections matches NO_OF_REPLICATES
Example configuration for the [replicate_n] sections when NO_OF_REPLICATES = 2 and HAVE_KPDS_CONDITION = True:
    [replicate_1]
    LI_BAM_FILE = /home/user/HeLa-rG4Seq/Li-rep1.Aligned.sortedByCoord.out.bam
    K_BAM_FILE = /home/user/HeLa-rG4Seq/K-rep1.Aligned.sortedByCoord.out.bam
    KPDS_BAM_FILE = /home/user/HeLa-rG4Seq/KPDS-rep1.Aligned.sortedByCoord.out.bam

    [replicate_2]
    LI_BAM_FILE = /home/user/HeLa-rG4Seq/Li-rep2.Aligned.sortedByCoord.out.bam
    K_BAM_FILE = /home/user/HeLa-rG4Seq/K-rep2.Aligned.sortedByCoord.out.bam
    KPDS_BAM_FILE = /home/user/HeLa-rG4Seq/KPDS-rep2.Aligned.sortedByCoord.out.bam
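When a dataset has many replicates, the [replicate_n] sections can be generated rather than typed by hand. The sketch below is one possible approach: `add_replicates` is a hypothetical helper, and the BAM file naming pattern is an assumption copied from the example above that will need adapting to the actual file names.

```python
import configparser
import io


def add_replicates(config, bam_dir, n_replicates, have_kpds):
    """Add [replicate_1] ... [replicate_n] sections to a ConfigParser object."""
    for i in range(1, n_replicates + 1):
        section = f"replicate_{i}"
        # Assumed naming pattern: <condition>-rep<i>.Aligned.sortedByCoord.out.bam
        config[section] = {
            "LI_BAM_FILE": f"{bam_dir}/Li-rep{i}.Aligned.sortedByCoord.out.bam",
            "K_BAM_FILE": f"{bam_dir}/K-rep{i}.Aligned.sortedByCoord.out.bam",
        }
        if have_kpds:
            # Only needed when HAVE_KPDS_CONDITION is True
            config[section]["KPDS_BAM_FILE"] = (
                f"{bam_dir}/KPDS-rep{i}.Aligned.sortedByCoord.out.bam"
            )
    return config


config = configparser.ConfigParser()
config.optionxform = str  # keep option names in upper case when writing
add_replicates(config, "/home/user/HeLa-rG4Seq", 2, True)

buffer = io.StringIO()
config.write(buffer)  # pass an open file instead to write the sections to disk
print(buffer.getvalue())
```

Setting `optionxform = str` is a stylistic choice here: without it configparser writes option names in lower case, which it reads back identically but which no longer matches the spelling used in this document.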
rG4-seeker is also available as a Docker image
Installation
Install Docker following the instructions on the Docker homepage https://docs.docker.com/
Download the rG4-seeker Docker image rg4_seeker.docker.tar.gz
Import rG4-seeker Docker image:
    sudo docker load -i rg4_seeker.docker.tar.gz
    sudo docker run rg4_seeker
Usage
When using the Docker version of rG4-seeker, we strongly recommend putting all input files (genome/annotation/reads) and the configuration file in the same working directory to simplify setup.
Running rG4-seeker from Docker:
    cd working_dir
    sudo docker run -v [working_dir]:[working_dir] rg4_seeker [abs_path_to_config.ini]
Note: The ‘-v’ option allows dockerized programs to read/write files outside its container, and is required for rG4-seeker to access input files / write result files.
Running the example data
Download the example dataset hela-2016-chr20.tar.gz, derived from the published HeLa rG4-seq dataset (Kwok et al. 2016)
Decompress the example dataset and enter the working directory:
    tar -axzf hela-2016-chr20.tar.gz
    cd hela-2016-chr20
Update the configuration file with the current working directory:
    awk -v repl="$PWD/" '{ sub(/\.\//, repl); print }' hela-2016-chr20.ini > hela-2016-chr20.docker.ini
Run rG4-seeker:
    sudo docker run -v $PWD:$PWD rg4_seeker $PWD/hela-2016-chr20.docker.ini
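The path rewriting and the Docker invocation can also be scripted in Python. This is a rough equivalent of the awk step and the final command above, given as a sketch: the helper names `absolutize` and `docker_command` are hypothetical, and the configuration text in the demonstration is made up, since the contents of hela-2016-chr20.ini are not reproduced here.

```python
import shlex


def absolutize(config_text, base_dir):
    """Replace the first literal './' on each line with '<base_dir>/'."""
    prefix = base_dir.rstrip("/") + "/"
    lines = config_text.split("\n")
    return "\n".join(line.replace("./", prefix, 1) for line in lines)


def docker_command(working_dir, config_path):
    """Build the docker invocation as an argument list.

    The same directory is mounted at the same path inside the container,
    so absolute paths in the configuration file stay valid.
    """
    return ["sudo", "docker", "run", "-v", f"{working_dir}:{working_dir}",
            "rg4_seeker", config_path]


# Hypothetical configuration lines with relative paths
text = "WORKING_DIR = ./\nGENOME_FASTA = ./chr20.fa\n"
workdir = "/data/hela-2016-chr20"
print(absolutize(text, workdir))
print(shlex.join(docker_command(workdir, workdir + "/hela-2016-chr20.docker.ini")))
```

The argument list returned by `docker_command` can be passed to `subprocess.run` directly, which avoids shell quoting problems when the working directory contains spaces.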