rG4-seeker

rG4-seeker is a pipeline for processing and analyzing rG4-seq data (https://www.nature.com/articles/nmeth.3965)

Requirements

16 GB RAM
At least 30 GB free hard disk space
At least one CPU core (recommended: a 4-core desktop CPU)
Linux operating system (tested on Ubuntu 16.04 and 18.04)
Python 3

Installation

git clone https://github.com/TF-Chan-Lab/rG4-seeker
cd rG4-seeker
pip install .

Usage

Prepare input files for rG4-seeker
- Reference genome in FASTA format
- One or more gene annotations in GFF3 or GTF format
- Genome-aligned, coordinate-sorted rG4-seq reads in BAM format
  - Each BAM file must be indexed (the index is generated by the “samtools index” command)
  - Currently, only output from the STAR aligner is supported/tested
  - Support for the HISAT2 and TopHat2 aligners will be added in the future
- A working directory for rG4-seeker to write its intermediate/output files
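
The BAM indexing requirement above can be checked before starting a run. The following is a minimal sketch, not part of rG4-seeker: the helper names `find_bam_index` and `unindexed_bams` are hypothetical, and it assumes the index sits next to the BAM file under one of the two names commonly produced by samtools (`sample.bam.bai` or `sample.bai`).

```python
from pathlib import Path


def find_bam_index(bam_path):
    """Return the path of an existing index for bam_path, or None.

    Assumption: the index is stored next to the BAM file as
    <name>.bam.bai or <name>.bai.
    """
    bam = Path(bam_path)
    candidates = [
        Path(str(bam) + ".bai"),  # sample.bam.bai
        bam.with_suffix(".bai"),  # sample.bai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def unindexed_bams(bam_paths):
    """Return the BAM paths that have no index file next to them."""
    return [path for path in bam_paths if find_bam_index(path) is None]
```

Any path returned by `unindexed_bams` still needs “samtools index” to be run on it.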
Construct a configuration file
- rG4-seeker reads program settings and the locations of input files from a configuration file (config.ini)
- A template configuration file (example.ini) is provided
- Please refer to the ‘Configuration file formatting’ section for modifying the configuration file
Run rG4-seeker:
```
rG4-seeker config.ini
```

Obtain results:

RTS (reverse transcriptase stalling) sites identified by rG4-seeker are reported in 2 CSV files, which can be opened directly in spreadsheet editors such as Microsoft Excel.

The 2 CSV files have identical content except for the level of detail in the sequence diagram:

[SAMPLE_NAME].rG4_list.full_combined.csv	The sequence diagram shows the aggregated RTS sites from all rG4-seq K+ replicates and K+/PDS replicate datasets
[SAMPLE_NAME].rG4_list.full_combined_breakdown.csv	The sequence diagram shows the RTS sites identified in each individual dataset
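
For scripted post-processing, the two result file names can be derived from SAMPLE_NAME. The sketch below is illustrative only: the helper names are hypothetical, it assumes the CSV files are written directly into WORKING_DIR (they may be placed elsewhere), and it makes no assumption about the column names, reading only whatever header row is present.

```python
import csv
from pathlib import Path


def result_csv_paths(working_dir, sample_name):
    """Build the two expected result file paths.

    Assumption: results are written directly into working_dir.
    """
    base = Path(working_dir)
    return [
        base / f"{sample_name}.rG4_list.full_combined.csv",
        base / f"{sample_name}.rG4_list.full_combined_breakdown.csv",
    ]


def read_header(csv_path):
    """Return the first row of a CSV file, or an empty list if the file is empty."""
    with open(csv_path, newline="") as handle:
        return next(csv.reader(handle), [])
```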

Example dataset

An example dataset, hela-2016-chr20.tar.gz, derived from the published HeLa rG4-seq dataset (Kwok et al. 2016), is available for testing new rG4-seeker installations. It runs for approximately 60 minutes.
The example dataset extracts into a directory containing all input files and a configuration file, and can be run directly:
```
tar -axzf hela-2016-chr20.tar.gz
cd hela-2016-chr20
rG4-seeker hela-2016-chr20.ini
```

Configuration file formatting

rG4-seeker reads configuration files with the Python 3 configparser module.
An rG4-seeker configuration file consists of 4 types of sections, each led by a [section] header and followed by option/value entries separated by ‘=’.
Unless otherwise specified, all options are required.
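
To illustrate the file format, the sketch below parses a small configuration fragment with the standard-library configparser module. It shows how the format behaves in general and is not taken from rG4-seeker's own code; the `load_config` helper and the fragment are for illustration only.

```python
import configparser

EXAMPLE = """
[global]
SAMPLE_NAME = HeLa-rG4seq
THREADS = 8
HAVE_KPDS_CONDITION = True
"""


def load_config(text):
    """Parse configuration text and return the ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


config = load_config(EXAMPLE)
print(config.sections())                       # ['global']
# By default configparser treats option names case-insensitively.
print(config["global"]["SAMPLE_NAME"])         # HeLa-rG4seq
print(config["global"].getint("THREADS"))      # 8
print(config["global"].getboolean("HAVE_KPDS_CONDITION"))  # True
```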

[global] section

Global parameters are configured in this section

Options	Values	Remarks
WORKING_DIR	Path to the working directory for rG4-seeker to place intermediate/output files
SAMPLE_NAME	An arbitrary identifier for the rG4-seq sample (e.g. HeLa-rG4seq)	The identifier will be the prefix for all output files
THREADS	No. of CPU threads rG4-seeker can use
NO_OF_ANNOTATIONS	No. of gene annotation sets to use	rG4-seeker can simultaneously use multiple annotation sets (e.g. GENCODE and RefSeq)
NO_OF_REPLICATES	No. of replicates present in the rG4-seq dataset
HAVE_KPDS_CONDITION	‘True’ or ‘False’, indicating whether the K+/PDS condition is present in the rG4-seq dataset
ALIGNER	Name of the short-read aligner used (e.g. ‘STAR’)	Currently only the ‘STAR’ aligner is supported.
READS_TYPE	‘SE’ or ‘PE’, corresponding to single-end or paired-end Illumina read types

Example configuration for [global] section:
[global]
WORKING_DIR = /home/user/rg4seeker_working_dir/
SAMPLE_NAME = HeLa-rG4seq
THREADS = 8
NO_OF_ANNOTATIONS = 2
NO_OF_REPLICATES = 2
HAVE_KPDS_CONDITION = True
ALIGNER = STAR
READS_TYPE = SE
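
A [global] section can be sanity-checked before a long run. The following is a hedged sketch of such a check, based only on the option descriptions in this section. The function `check_global_section` is hypothetical and does not reflect how rG4-seeker itself validates its input.

```python
import configparser

REQUIRED_GLOBAL_OPTIONS = (
    "WORKING_DIR", "SAMPLE_NAME", "THREADS", "NO_OF_ANNOTATIONS",
    "NO_OF_REPLICATES", "HAVE_KPDS_CONDITION", "ALIGNER", "READS_TYPE",
)


def check_global_section(parser):
    """Return a list of human-readable problems found in [global]."""
    if not parser.has_section("global"):
        return ["missing [global] section"]
    section = parser["global"]
    problems = [f"missing option {name}"
                for name in REQUIRED_GLOBAL_OPTIONS if name not in section]
    # Count-like options should be positive integers.
    for name in ("THREADS", "NO_OF_ANNOTATIONS", "NO_OF_REPLICATES"):
        value = section.get(name)
        if value is not None and not (value.isdigit() and int(value) >= 1):
            problems.append(f"{name} should be a positive integer, got {value!r}")
    # Options with a fixed set of documented values.
    allowed = {
        "HAVE_KPDS_CONDITION": ("True", "False"),
        "ALIGNER": ("STAR",),
        "READS_TYPE": ("SE", "PE"),
    }
    for name, choices in allowed.items():
        value = section.get(name)
        if value is not None and value not in choices:
            problems.append(f"{name} should be one of {choices}, got {value!r}")
    return problems
```

An empty list means that no problem was detected by these particular checks.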

[genome] section

The reference genome to use is specified in this section

Options	Values	Remarks
GENOME_FASTA	Path to the reference genome sequence in FASTA format	The FASTA file must be in uncompressed format
GENOME_FASTA_FAI	Path to the fai index file of the reference genome sequence	A fai index can be generated using samtools

Example configuration for [genome] section:
[genome]
GENOME_FASTA = /home/user/references/GRCh38.primary_assembly.genome.fa
GENOME_FASTA_FAI = /home/user/references/GRCh38.primary_assembly.genome.fa.fai
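
A fai index (as written by `samtools faidx`) is a tab-separated text file whose first two columns are the sequence name and its length. The sketch below reads those two columns, for example to confirm that the index lists the expected chromosomes. The helper `read_fai` is hypothetical and the sample line is illustrative, not taken from the example dataset.

```python
def read_fai(lines):
    """Map sequence names to lengths from the lines of a .fai index.

    Only the first two tab-separated columns (name, length) are used.
    """
    lengths = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


# Illustrative index line: name, length, offset, bases per line, bytes per line.
sample = ["chr20\t64444167\t7\t60\t61\n"]
print(read_fai(sample))
```

With a real index, the lines can be supplied by opening the file given in GENOME_FASTA_FAI and passing the file object to `read_fai`.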

[annotation_n] section

The gene annotation set(s) to use are specified in this section

Options	Values	Remarks
ANNOTATION_NAME	An identifier for the gene annotation (e.g. GENCODE)
ANNOTATION_GFF	Path to the annotation GFF3/GTF file	The GFF3/GTF file can be compressed (in .gz format)

Note: Please provide multiple [annotation_n] sections matching the number of annotation sets

Example configuration for [annotation_n] sections when 2 annotation sets are used:

[annotation_1]
ANNOTATION_NAME = Gencode
ANNOTATION_GFF = /home/user/references/gencode.v29.primary_assembly.annotation.gff3.gz

[annotation_2]
ANNOTATION_NAME = RefSeq
ANNOTATION_GFF = /home/user/references/GRCh38.RefSeqGeneAnnotation.gff.gz

[replicate_n] section

The rG4-seq datasets to use (in format of aligned reads) are specified in this section

Options	Values	Remarks
LI_BAM_FILE	Path to the BAM file containing aligned reads from rG4-seq (Li+ condition)
K_BAM_FILE	Path to the BAM file containing aligned reads from rG4-seq (K+ condition)
KPDS_BAM_FILE	Path to the BAM file containing aligned reads from rG4-seq (K+/PDS condition)	Required if ‘HAVE_KPDS_CONDITION’ is set to ‘True’

Note: Please provide multiple [replicate_n] sections matching the number of rG4-seq replicates

Example configuration for [replicate_n] sections when NO_OF_REPLICATES = 2 and HAVE_KPDS_CONDITION = True:
[replicate_1]
LI_BAM_FILE = /home/user/HeLa-rG4Seq/Li-rep1.Aligned.sortedByCoord.out.bam
K_BAM_FILE = /home/user/HeLa-rG4Seq/K-rep1.Aligned.sortedByCoord.out.bam
KPDS_BAM_FILE = /home/user/HeLa-rG4Seq/KPDS-rep1.Aligned.sortedByCoord.out.bam

[replicate_2]
LI_BAM_FILE = /home/user/HeLa-rG4Seq/Li-rep2.Aligned.sortedByCoord.out.bam
K_BAM_FILE = /home/user/HeLa-rG4Seq/K-rep2.Aligned.sortedByCoord.out.bam
KPDS_BAM_FILE = /home/user/HeLa-rG4Seq/KPDS-rep2.Aligned.sortedByCoord.out.bam
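
The numbered sections have to agree with the counts declared in [global]. The sketch below checks that agreement, along with the conditional KPDS_BAM_FILE requirement. It is an illustration built from the rules described above. The function `check_numbered_sections` is hypothetical and is not rG4-seeker's own validation.

```python
import configparser


def check_numbered_sections(parser):
    """Return problems with [annotation_n] / [replicate_n] sections."""
    problems = []
    settings = parser["global"]
    n_annotations = settings.getint("NO_OF_ANNOTATIONS")
    n_replicates = settings.getint("NO_OF_REPLICATES")
    have_kpds = settings.getboolean("HAVE_KPDS_CONDITION")

    # One [annotation_n] section is expected per annotation set.
    for i in range(1, n_annotations + 1):
        if not parser.has_section(f"annotation_{i}"):
            problems.append(f"missing [annotation_{i}]")

    # KPDS_BAM_FILE is only required when the K+/PDS condition is present.
    required = ["LI_BAM_FILE", "K_BAM_FILE"]
    if have_kpds:
        required.append("KPDS_BAM_FILE")
    for i in range(1, n_replicates + 1):
        name = f"replicate_{i}"
        if not parser.has_section(name):
            problems.append(f"missing [{name}]")
            continue
        for option in required:
            if not parser.has_option(name, option):
                problems.append(f"[{name}] lacks {option}")
    return problems
```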

Docker image distribution

rG4-seeker is also available as a Docker image
Installation
1. Install Docker following instructions on Docker homepage https://docs.docker.com/
2. Download the rG4-seeker Docker image rg4_seeker.docker.tar.gz
3. Import rG4-seeker Docker image:
```
sudo docker load -i rg4_seeker.docker.tar.gz
sudo docker run rg4_seeker
```

Usage

When using the Docker version of rG4-seeker, we strongly recommend putting all input files (Genome/Annotation/Reads) and the configuration file in the same working directory, so that a single ‘-v’ mount covers every path.
Running rG4-seeker from Docker:
cd working_dir
sudo docker run -v [working_dir]:[working_dir] rg4_seeker [abs_path_to_config.ini]


* Note: The ‘-v’ option allows a dockerized program to read/write files outside its container, and is required for rG4-seeker to access input files and write result files.
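
When the Docker command is launched from a script, the same working directory must appear on both sides of the ‘-v’ mount and in the configuration file path. The sketch below only assembles the argument list shown above and does not run Docker. The helper `docker_run_command` and the example paths are hypothetical.

```python
import os
import shlex


def docker_run_command(working_dir, config_name):
    """Assemble the docker invocation for a config file inside working_dir."""
    workdir = os.path.abspath(working_dir)
    config_path = os.path.join(workdir, config_name)
    return [
        "sudo", "docker", "run",
        "-v", f"{workdir}:{workdir}",  # same path on the host and in the container
        "rg4_seeker",
        config_path,                   # absolute path to the configuration file
    ]


# Print the command line; pass the list to subprocess.run to execute it.
print(shlex.join(docker_run_command("/data/run1", "config.ini")))
```

Mounting the directory at the same path inside the container keeps the absolute paths in the configuration file valid on both sides.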

Running the example data

Download the example dataset hela-2016-chr20.tar.gz
Decompress the example dataset and enter the working directory:
tar -axzf hela-2016-chr20.tar.gz
cd hela-2016-chr20
Update the configuration file with the current working directory:
cat hela-2016-chr20.ini | awk -v srch="./" -v repl="$PWD/" '{ sub(srch,repl,$0); print $0 }' >hela-2016-chr20.docker.ini
Run rG4-seeker:
sudo docker run -v $PWD:$PWD rg4_seeker $PWD/hela-2016-chr20.docker.ini
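
The path rewrite performed by the awk command above can also be done in Python. The sketch below replaces the first literal "./" on each line with an absolute directory prefix. It assumes, as the awk command implies, that the example configuration file uses "./"-relative paths. The helper `absolutize` is hypothetical.

```python
def absolutize(lines, base_dir):
    """Replace the first literal "./" on each line with base_dir + "/".

    Caveat: a path starting with "../" also contains "./" and would be
    rewritten incorrectly; this sketch assumes only "./" paths occur.
    """
    prefix = base_dir.rstrip("/") + "/"
    return [line.replace("./", prefix, 1) for line in lines]


# In-memory demonstration with a hypothetical directory.
example = ["[global]\n", "WORKING_DIR = ./\n", "SAMPLE_NAME = HeLa\n"]
for line in absolutize(example, "/home/user/hela-2016-chr20"):
    print(line, end="")
```

For the example dataset, the lines would come from hela-2016-chr20.ini, the base directory from `os.getcwd()`, and the result would be written to hela-2016-chr20.docker.ini.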
