Merge pull request #93 from CCBR/v0.10.0-dev
V0.10.0 dev ... to be v0.10.1
kopardev authored Dec 23, 2023
2 parents 02f1d14 + ea537d2 commit 6b03415
Showing 37 changed files with 3,288 additions and 1,377 deletions.
Binary file not shown.
185 changes: 181 additions & 4 deletions README.md
@@ -1,8 +1,22 @@
# CHARLIE
![img](https://img.shields.io/github/issues/CCBR/CHARLIE?style=for-the-badge)![img](https://img.shields.io/github/forks/CCBR/CHARLIE?style=for-the-badge)![img](https://img.shields.io/github/stars/CCBR/CHARLIE?style=for-the-badge)![img](https://img.shields.io/github/license/CCBR/CHARLIE?style=for-the-badge)

**C**ircrnas in **H**ost **A**nd vi**R**uses ana**L**ysis p**I**p**E**line

### Table of Contents
- [CHARLIE - **C**ircrnas in **H**ost **A**nd vi**R**uses ana**L**ysis p**I**p**E**line](#charlie)
- [Table of Contents](#table-of-contents)
- [1. Introduction](#1-introduction)
- [2. Flowchart](#2-flowchart)
- [3. Software Dependencies](#3-software-dependencies)
- [4. Usage](#4-usage)
- [5. License](#5-license)
- [6. Testing](#6-testing)
- [6.1 Test data](#61-test-data)
- [6.2 Expected output](#62-expected-output)

### 1. Introduction

**C**ircrnas in **H**ost **A**nd vi**R**uses ana**L**ysis p**I**p**E**line

Things to know about CHARLIE:

@@ -29,14 +43,46 @@ This circularRNA detection pipeline uses CIRCExplorer2, CIRI2 and many other too
> Note: BWA<sup>1</sup>, BWA<sup>2</sup> denote 2 different alignment parameters, etc.
Flowchart:
### 2. Flowchart
![](docs/images/CHARLIE_v0.8.x.png)

For complete documentation with tutorial go [here](https://ccbr.github.io/CCBR_circRNA_DAQ/).
For complete documentation with tutorial go [here](https://CCBR.github.io/CHARLIE/).

> DISCLAIMER: New circRNA tools have been added to CHARLIE and the documentation is currently out of date!

### 3. Software Dependencies

The following versions of various bioinformatics tools are used within CHARLIE:

| tool | version |
| ------------- | --------- |
| blat | 3.5 |
| bedtools | 2.30.0 |
| bowtie | 2-2.5.1 |
| bowtie | 1.3.1 |
| bwa | 0.7.17 |
| circexplorer2 | 2.3.8 |
| cufflinks | 2.2.1 |
| cutadapt | 4.4 |
| fastqc | 0.11.9 |
| hisat | 2.2.2.1 |
| java | 18.0.1.1 |
| multiqc | 1.9 |
| parallel | 20231122 |
| perl | 5.34 |
| picard | 2.27.3 |
| python | 2.7 |
| python | 3.8 |
| sambamba | 0.8.2 |
| samtools | 1.16.1 |
| STAR | 2.7.6a |
| stringtie | 2.2.1 |
| ucsc | 450 |
| R | 4.0.5 |
| novocraft | 4.03.05 |


### 4. Usage

```bash
% ./charlie
@@ -132,3 +178,134 @@ VersionInfo:

##########################################################################################
```
### 5. License
MIT License
Copyright (c) 2021 Vishal Koparde
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
### 6. Testing
#### Init
Run init mode:
```bash
bash <path to charlie> -w=<path to output dir> -m=init
```
This will create the folder provided by `-w=`. The user should have write permission to this folder.
#### Dry-run
Test data (1 paired-end subsample and 1 single-end subsample) have been included under the `.tests/dummy_fastqs` folder. After running in `-m=init`, `samples.tsv` should be edited to point to copies of the above-mentioned samples, with the column headers:
- sampleName
- path_to_R1_fastq
- path_to_R2_fastq
Column `path_to_R2_fastq` will be blank in case of single-end samples.
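The three-column layout can be sketched as follows. This is only an illustration, assuming the file is tab-separated (as the `.tsv` extension suggests); `WORKDIR` and `FASTQ_DIR` are hypothetical stand-ins for the folder passed via `-w=` and for wherever the test FASTQs were copied.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Stand-ins: WORKDIR is the folder passed via -w=, FASTQ_DIR holds the copied test FASTQs.
# A temporary directory is used when WORKDIR is unset so the sketch can run anywhere.
workdir="${WORKDIR:-$(mktemp -d)}"
fastq_dir="${FASTQ_DIR:-/path/to/dummy_fastqs}"
samples_tsv="${workdir}/samples.tsv"

# Write the header and one row per sample, tab-separated.
# The single-end sample keeps an empty third column.
{
  printf 'sampleName\tpath_to_R1_fastq\tpath_to_R2_fastq\n'
  printf '%s\t%s\t%s\n' "GI1_N" "${fastq_dir}/GI1_N.R1.fastq.gz" "${fastq_dir}/GI1_N.R2.fastq.gz"
  printf '%s\t%s\t%s\n' "GI1_T" "${fastq_dir}/GI1_T.R1.fastq.gz" ""
} > "${samples_tsv}"

# Sanity check: every row must have exactly three tab-separated fields.
awk -F'\t' 'NF != 3 { printf "line %d has %d fields\n", NR, NF; bad = 1 } END { exit bad }' "${samples_tsv}"
echo "wrote ${samples_tsv}"
```

Keeping the trailing tab on the single-end row means every line has the same number of fields, which makes the file easier to validate.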
After editing `samples.tsv`, perform a dry run:
```bash
bash <path to charlie> -w=<path to output dir> -m=dryrun
```
This will create the reference fasta and gtf file based on the selections made in the `config.yaml`.
#### Run
If `-m=dryrun` was successful, then simply do `-m=run`. The output will look something like this:
```
... ... skipping ~1000 lines
...
...
Job stats:
job count min threads max threads
--------------------------------------------- ------- ----------
all 1 1 1
annotate_clear_output 2 1 1
circExplorer 2 2 2
circExplorer_bwa 2 2 2
circrnafinder 2 1 1
ciri 2 56 56
clear 2 2 2
create_bowtie2_index 1 1 1
create_bwa_index 1 1 1
create_circExplorer_BSJ_bam 2 4 4
create_circExplorer_linear_spliced_bams 2 56 56
create_circExplorer_merged_found_counts_table 2 1 1
create_hq_bams 2 1 1
create_index 1 56 56
create_master_counts_file 1 1 1
cutadapt 2 56 56
dcc 2 4 4
dcc_create_samplesheets 2 1 1
estimate_duplication 2 1 1
fastqc 2 4 4
find_circ 2 56 56
find_circ_align 2 56 56
merge_SJ_tabs 1 2 2
merge_alignment_stats 1 1 1
merge_genecounts 1 1 1
merge_per_sample 2 1 1
star1p 2 56 56
star2p 2 56 56
star_circrnafinder 2 56 56
total 52 1 56
Reasons:
(check individual jobs above for details)
input files updated by another job:
alignment_stats, all, annotate_clear_output, circExplorer, circExplorer_bwa, circrnafinder, ciri, clear, create_circExplorer_BSJ_bam, create_circExplorer_linear_spliced_bams, create_circExplorer_merged_found_counts_table, create_hq_bams, create_master_counts_file, dcc, dcc_create_samplesheets, estimate_duplication, fastqc, find_circ, find_circ_align, merge_SJ_tabs, merge_alignment_stats, merge_genecounts, merge_per_sample, star1p, star2p, star_circrnafinder
missing output files:
alignment_stats, annotate_clear_output, circExplorer, circExplorer_bwa, circrnafinder, ciri, clear, create_bowtie2_index, create_bwa_index, create_circExplorer_BSJ_bam, create_circExplorer_linear_spliced_bams, create_circExplorer_merged_found_counts_table, create_hq_bams, create_index, create_master_counts_file, cutadapt, dcc, dcc_create_samplesheets, estimate_duplication, fastqc, find_circ, find_circ_align, merge_SJ_tabs, merge_alignment_stats, merge_genecounts, merge_per_sample, star1p, star2p, star_circrnafinder
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Running...
14743440
```
##### 6.1 Test Data
The `.tests/dummy_fastqs` folder in the repo contains the test dataset:
```bash
% tree .tests/dummy_fastqs
.tests/dummy_fastqs
├── GI1_N.R1.fastq.gz
├── GI1_N.R2.fastq.gz
└── GI1_T.R1.fastq.gz
```
`GI1_N` is a PE sample while `GI1_T` is a SE sample.
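As an illustration of the paired-end versus single-end distinction, the sketch below infers the layout from whether a matching R2 file exists. It recreates the three file names above as empty placeholders in a temporary directory, so it does not depend on the repository being cloned. The `layout_of` helper is hypothetical and not part of CHARLIE.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Empty placeholder files that mimic the names in .tests/dummy_fastqs.
fastq_dir="$(mktemp -d)"
touch "${fastq_dir}/GI1_N.R1.fastq.gz" "${fastq_dir}/GI1_N.R2.fastq.gz" "${fastq_dir}/GI1_T.R1.fastq.gz"

# Print PE when an R2 file sits next to the given R1 file, SE otherwise.
layout_of() {
  local r1="$1"
  local r2="${r1%.R1.fastq.gz}.R2.fastq.gz"
  if [ -f "${r2}" ]; then echo "PE"; else echo "SE"; fi
}

# List each sample name with its inferred layout.
for r1 in "${fastq_dir}"/*.R1.fastq.gz; do
  printf '%s\t%s\n' "$(basename "${r1}" .R1.fastq.gz)" "$(layout_of "${r1}")"
done
```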
##### 6.2 Expected Output
Expected output from the sample data is stored under `.tests/expected_output`.
More details about running test data can be found [here](https://ccbr.github.io/CHARLIE/tutorial).
> DISCLAIMER:
>
> CHARLIE is built to be run only on [BIOWULF](https://hpc.nih.gov). A newer HPC-agnostic version of CHARLIE is planned for 2024.
42 changes: 34 additions & 8 deletions charlie
@@ -7,6 +7,11 @@
set -eo pipefail
module purge

# decide trigger
trigger="mtime"
# trigger="input"
# trigger="code"

##########################################################################################
# functions
##########################################################################################
@@ -36,10 +41,11 @@ GIT_COMMIT_TAG=$(get_git_commitid_tag $PIPELINE_HOME)
##########################################################################################

CLUSTER_SBATCH_CMD="sbatch --parsable --cpus-per-task {cluster.threads} -p {cluster.partition} -t {cluster.time} --mem {cluster.mem} --job-name {cluster.name} --output {cluster.output} --error {cluster.error}"
if [ "$HOSTNAME" == "biowulf.nih.gov" ];then
# if [ "$HOSTNAME" == "biowulf.nih.gov" ];then
# if [ "$SLURM_CLUSTER_NAME" == "biowulf" ];then
EXTRA_SINGULARITY_BINDS="/lscratch"
CLUSTER_SBATCH_CMD="$CLUSTER_SBATCH_CMD --gres {cluster.gres}"
fi
# fi
PYTHONVERSION="3.7"
SNAKEMAKEVERSION="7.19.1"
# SNAKEMAKEVERSION="5.24.1"
@@ -63,7 +69,7 @@ function usage() { cat << EOF
##########################################################################################
Welcome to charlie(v0.9.0)
Welcome to charlie(v0.10.0-dev)
_______ __ __ _______ ______ ___ ___ _______
| || | | || _ || _ | | | | | | |
| || |_| || |_| || | || | | | | | ___|
@@ -293,6 +299,12 @@ function dryrun() {
run "--dry-run" | tee ${WORKDIR}/dryrun.${timestamp}.log
}

function touch() {
runcheck
timestamp=$(date +"%y%m%d%H%M%S")
run "--touch" | tee ${WORKDIR}/touch.${timestamp}.log
}

##########################################################################################
# UNLOCK
##########################################################################################
@@ -367,7 +379,11 @@ function create_runinfo {
echo "Pipeline Dir: $PIPELINE_HOME" > ${WORKDIR}/runinfo.yaml
echo "Git Commit/Tag: $GIT_COMMIT_TAG" >> ${WORKDIR}/runinfo.yaml
userlogin=$(whoami)
username=$(finger $userlogin|grep ^Login|awk -F"Name: " '{print $2}')
if [[ `which finger 2>/dev/null` ]];then
username=$(finger $userlogin |grep ^Login | awk -F"Name: " '{print $2}');
elif [[ `which lslogins 2>/dev/null` ]];then
username=$(lslogins -u $userlogin | grep ^Geco | awk -F": " '{print $2}' | awk '{$1=$1;print}');
else username="";fi
echo "Login: $userlogin" >> ${WORKDIR}/runinfo.yaml
echo "Name: $username" >> ${WORKDIR}/runinfo.yaml
g=$(groups)
@@ -435,7 +451,7 @@ function run() {
--configfile $CONFIGFILE \
--cores all \
--rerun-incomplete \
--rerun-triggers input \
--rerun-triggers $trigger \
--retries 2 \
--keep-going \
--stats ${WORKDIR}/snakemake.stats \
@@ -457,7 +473,7 @@ function run() {
#SBATCH --job-name="charlie"
#SBATCH --mem=40g
#SBATCH --partition="ccr,norm"
#SBATCH --time=96:00:00
#SBATCH --time=48:00:00
#SBATCH --cpus-per-task=2
module load python/$PYTHONVERSION
@@ -479,7 +495,7 @@ snakemake -s $SNAKEFILE \
--cluster-status $CLUSTERSTATUSCMD \
-j 500 \
--rerun-incomplete \
--rerun-triggers input \
--rerun-triggers $trigger \
--retries 2 \
--keep-going \
--stats ${WORKDIR}/snakemake.stats \
@@ -496,8 +512,17 @@ EOF

sbatch ${WORKDIR}/submit_script.sbatch

elif [ "$1" == "--touch" ];then

snakemake $1 -s $SNAKEFILE \
--directory $WORKDIR \
--configfile $CONFIGFILE \
--cores 1

else # dry-run and unlock

echo $CLUSTER_SBATCH_CMD

snakemake $1 -s $SNAKEFILE \
--directory $WORKDIR \
--use-envmodules \
@@ -508,7 +533,7 @@ snakemake $1 -s $SNAKEFILE \
--cluster "$CLUSTER_SBATCH_CMD" \
-j 500 \
--rerun-incomplete \
--rerun-triggers input \
--rerun-triggers $trigger \
--keep-going \
--reason \
--stats ${WORKDIR}/snakemake.stats
@@ -597,6 +622,7 @@ function main(){
run) runslurm && exit 0;;
runlocal) runlocal && exit 0;;
reset) reset && exit 0;;
touch) touch && exit 0;;
dry) dryrun && exit 0;; # hidden option
local) runlocal && exit 0;; # hidden option
reconfig) reconfig && exit 0;; # hidden option for debugging
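
The hunks above replace the hard-coded `--rerun-triggers input` with a `$trigger` variable set near the top of the script. As a rough sketch, assuming Snakemake 7.x, where `--rerun-triggers` accepts `mtime`, `params`, `input`, `software-env` and `code`, the value could be checked before it reaches the command line. The `is_valid_trigger` helper and the `TRIGGER` environment override are hypothetical and not part of `charlie`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical override; the script in this commit hard-codes trigger="mtime".
trigger="${TRIGGER:-mtime}"

# Return 0 when the argument is one of the values that Snakemake 7.x
# accepts for --rerun-triggers, 1 otherwise.
is_valid_trigger() {
  case "$1" in
    mtime|params|input|software-env|code) return 0 ;;
    *) return 1 ;;
  esac
}

if is_valid_trigger "${trigger}"; then
  echo "--rerun-triggers ${trigger}"
else
  echo "unsupported rerun trigger: ${trigger}" >&2
  exit 1
fi
```

Failing early on a mistyped value avoids discovering the problem only after the Slurm submission script has been written.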
11 changes: 7 additions & 4 deletions config/config.yaml
@@ -27,8 +27,9 @@ run_nclscan: False
nclscan_config: "WORKDIR/nclscan.config"
#
# Should we also run find_circ? True or False WITHOUT quotes
run_findcirc: True

run_findcirc: False
# findcirc_params: "--noncanonical --allhits" # this gives way too many circRNAs
findcirc_params: "--noncanonical"


# select references .... host + viruses(comma-separated):
@@ -90,15 +91,17 @@ tools: "PIPELINE_HOME/resources/tools.yaml"
cluster: "WORKDIR/cluster.json"

adapters: "PIPELINE_HOME/resources/TruSeq_and_nextera_adapters.consolidated.fa"
circexplorer_bsj_circRNA_min_reads: 2 # in addition to "known" and "low-conf" circRNAs identified by circexplorer, we also include those found in back_spliced.bed file but not classified as known/low-conf only if the number of reads supporting the BSJ call is greater than this number
minreadcount: 2 # this is used to filter circRNAs while creating the per-sample counts table
circexplorer_bsj_circRNA_min_reads: 3 # in addition to "known" and "low-conf" circRNAs identified by circexplorer, we also include those found in back_spliced.bed file but not classified as known/low-conf only if the number of reads supporting the BSJ call is greater than this number
minreadcount: 3 # this is used to filter circRNAs while creating the per-sample counts table
flanksize: 18 # 18bp flank on either side of the BSJ .. used by multiple BSJ callers
dcc_strandedness: "-ss" # "-ss" for stranded library and "--nonstrand" for unstranded
cutadapt_min_length: 15
cutadapt_n: 5
cutadapt_max_n: 0.5
cutadapt_O: 5
cutadapt_q: 20
high_confidence_core_callers: "circExplorer,circExplorer_bwa"
high_confidence_core_callers_plus_n: 1

ciri_perl_script: "/data/CCBR_Pipeliner/bin/CIRI_v2.0.6/CIRI2.pl"
nclscan_dir: "/data/CCBR_Pipeliner/bin/NCLscan-1.7.0"
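
The comment on `circexplorer_bsj_circRNA_min_reads` says a BSJ call is kept only if its supporting read count is greater than the configured number. Read literally, raising the value from 2 to 3 means a call now needs at least 4 supporting reads. The sketch below illustrates that strict comparison on a made-up two-column input; the column layout and the `filter_bsj` helper are assumptions for illustration and do not reflect the real `back_spliced.bed` format or the pipeline's own filtering code.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Threshold as set in config.yaml after this commit.
min_reads=3

# Hypothetical input: tab-separated BSJ id and supporting read count.
# This is NOT the real back_spliced.bed layout; it only illustrates the comparison.
filter_bsj() {
  awk -F'\t' -v min="$1" '$2 > min'
}

# Only rows with a count strictly greater than min_reads pass the filter.
kept="$(printf 'bsj_a\t2\nbsj_b\t3\nbsj_c\t4\n' | filter_bsj "${min_reads}")"
echo "${kept}"
```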
4 changes: 2 additions & 2 deletions config/samples.tsv
@@ -1,3 +1,3 @@
sampleName path_to_R1_fastq path_to_R2_fastq
GI1_N /data/CCBR_Pipeliner/test_datasets/circRNA/human/GI1_N_ss.R1.fastq.gz /data/CCBR_Pipeliner/test_datasets/circRNA/human/GI1_N_ss.R2.fastq.gz
GI1_T /data/CCBR_Pipeliner/test_datasets/circRNA/human/GI1_T_ss.R1.fastq.gz
GI1_N /data/CCBR_Pipeliner/testdata/circRNA/human/GI1_N_ss.R1.fastq.gz /data/CCBR_Pipeliner/testdata/circRNA/human/GI1_N_ss.R2.fastq.gz
GI1_T /data/CCBR_Pipeliner/testdata/circRNA/human/GI1_T_ss.R1.fastq.gz
7 changes: 4 additions & 3 deletions docs/flowchart.md
@@ -1,8 +1,9 @@
# circRNA DAQ Pipeline
# CHARLIE

![img](https://img.shields.io/github/issues/kopardev/circRNA?style=for-the-badge)![img](https://img.shields.io/github/forks/kopardev/circRNA?style=for-the-badge)![img](https://img.shields.io/github/stars/kopardev/circRNA?style=for-the-badge)![img](https://img.shields.io/github/license/kopardev/circRNA?style=for-the-badge)

Flowchart for [v0.3.3](https://github.com/kopardev/circRNA/releases/tag/v0.3.3):
Flowchart

![img](circRNA_v0.3.3.png)
![img](images/CHARLIE_v0.8.x.png)

> DISCLAIMER: This chart is for v0.8.x and may be slightly outdated.