Documentation for bcbio: bcbio-nextgen readthedocs
-
One time only: follow the set-up instructions for Rory's bcbio.rnaseq:
- Install lein (I installed lein in ~/bin)
- Add the lein location to the path in ~/.bashrc:
export PATH=~/bin:$PATH
- I could not get pandoc installed
-
Make directory structure
cd path-to-consult-folder
mkdir analysis meta config data
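The same layout can be created in one repeatable step. A minimal sketch, assuming a hypothetical helper named make_consult_dirs (not part of bcbio); mkdir -p makes it safe to run again on an existing folder.

```shell
# Create the four project folders under a consult directory.
# make_consult_dirs is a hypothetical helper name, not part of bcbio.
# The body runs in a subshell, so the cd does not affect the caller.
make_consult_dirs() (
    consult_dir=$1
    mkdir -p "$consult_dir" || exit 1
    cd "$consult_dir" || exit 1
    # -p: no error if a folder already exists
    mkdir -p analysis meta config data
)
```

Usage: make_consult_dirs path-to-consult-folder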
-
Download fastq files from facility to data folder
-
Download fastq files from a non-password protected url
-
wget --mirror url
(for each file of each sample in each lane)
-
Rory's code to concatenate files for the same sample across multiple lanes:
barcodes="BC1 BC2 BC3 BC4"
for barcode in $barcodes
do
  # braces keep the shell from reading "$barcode_" as a different variable
  find folder -name "${barcode}_*R1.fastq.gz" -exec cat {} \; > data/${barcode}_R1.fastq.gz
  find folder -name "${barcode}_*R2.fastq.gz" -exec cat {} \; > data/${barcode}_R2.fastq.gz
done
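Joining the per-lane files with cat works because a gzip file may hold several compressed members that decompress as one stream. A minimal sketch on dummy data shows this; every path and read name below is made up for illustration. Sorting the matches is an added assumption: it keeps the lane order identical for the R1 and R2 files, which find alone does not promise.

```shell
# Demonstrate on dummy data that per-lane gzip files can be joined with cat.
# All paths and read names below are made up for illustration.
workdir=$(mktemp -d)
mkdir -p "$workdir/lanes/L001" "$workdir/lanes/L002" "$workdir/data"

# One tiny fake R1 file per lane for barcode BC1.
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/lanes/L001/BC1_L001_R1.fastq.gz"
printf '@read2\nTGCA\n+\nIIII\n' | gzip -c > "$workdir/lanes/L002/BC1_L002_R1.fastq.gz"

barcode=BC1
# Sort the matches so the lane order is the same on every run.
find "$workdir/lanes" -name "${barcode}_*R1.fastq.gz" | sort | xargs cat \
    > "$workdir/data/${barcode}_R1.fastq.gz"

# Count the lines of the merged file after decompression.
merged_lines=$(gzip -dc "$workdir/data/${barcode}_R1.fastq.gz" | wc -l | tr -d ' ')
echo "lines in merged file: $merged_lines"
```

The merged file should contain the reads of lane 1 followed by the reads of lane 2.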
-
-
Download from a password-protected FTP site, such as Dana Farber's (-P sets the destination folder):
wget -r <FTP address of folder> --user <username> --password <pwd> -P <destination>
-
Download fastq files from BioPolymers:
rsync -avr [email protected]:./folder_name .
--OR--
sftp [email protected]
cd <correct folder>
mget *.tab
mget *.bz2
-
Download from the Broad using Aspera:
- To download data I use this script.
-
-
Create the metadata in Excel. Create sym links by building the commands with CONCATENATE("ln -s ", $A2, " ", $D2), where column A holds the path to where the files are stored and column D holds the name of the sym link. Parts of a column can be extracted by splitting on delimiters with Text to Columns in the Data tab.
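The links can also be made straight from a text export, without pasting commands out of Excel. A minimal sketch, assuming a tab-delimited file with two columns and no header (path to the stored file, then name of the link); make_links and this file layout are assumptions for illustration.

```shell
# Create symbolic links from a two-column, tab-delimited list:
#   column 1 = path to the stored file, column 2 = name of the link.
# make_links and the file layout are illustrative assumptions.
make_links() (
    list=$1
    dest=$2
    mkdir -p "$dest" || exit 1
    tab=$(printf '\t')
    while IFS="$tab" read -r src name; do
        # Skip blank or incomplete lines.
        [ -n "$src" ] && [ -n "$name" ] || continue
        ln -s "$src" "$dest/$name"
    done < "$list"
)
```

Usage: make_links links.tsv data (ln -s reports an error for any link name that already exists).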
-
Save Excel as text and replace ^M with new lines in vim:
:%s/<Ctrl-V><Ctrl-M>/\r/g
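The same conversion can be done from the shell with tr. A minimal sketch, assuming the export uses bare carriage returns (^M) as line ends; if the file has Windows-style CR+LF endings this would leave blank lines instead. cr_to_lf is a hypothetical helper name.

```shell
# Convert bare carriage returns (shown as ^M in vim) to newlines.
# cr_to_lf is a hypothetical helper; it reads stdin and writes stdout.
cr_to_lf() {
    tr '\r' '\n'
}
```

Usage: cr_to_lf < metadata_from_excel.txt > metadata.txt (both file names are placeholders).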
-
Settings for bcbio: make sure you have the following settings in your ~/.bashrc file:
unset PYTHONHOME
unset PYTHONPATH
module load stats/R/3.2.1
module load dev/perl/5.18.1
export PATH=/opt/bcbio/centos/bin:$PATH
-
Within the meta folder, add your comma-separated metadata file (projectname_rnaseq.csv)
- first column is samplename: the names of the fastq files as they appear in the directory, without the extension (no .fastq, or no R#.fastq for paired-end reads)
- second column is description: the unique names you want the samples to be called by
- the column entitled samplegroup holds your sample groups
- FOR CHIP-SEQ you need additional columns:
  - phenotype: chip or input for each sample
  - batch: batch1, batch2, batch3, ... for grouping each input with its appropriate chip(s)
- additional specifics regarding the metadata file: http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration
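As an illustration, a metadata file for four paired-end samples might look like the sketch below. The sample names, descriptions and group labels are invented; only the column names samplename, description and samplegroup come from the notes above. The two checks (header and unique descriptions) are a suggested sanity test, not something bcbio requires you to run.

```shell
# Write a small example metadata file and run two sanity checks on it.
# Sample names, descriptions and groups are invented for illustration.
meta_file=$(mktemp)
cat > "$meta_file" <<'EOF'
samplename,description,samplegroup
BC1,ctrl_rep1,control
BC2,ctrl_rep2,control
BC3,treat_rep1,treated
BC4,treat_rep2,treated
EOF

# The first two header fields should be samplename and description.
header=$(head -n 1 "$meta_file")
case $header in
    samplename,description*) echo "header ok" ;;
    *) echo "unexpected header: $header" >&2 ;;
esac

# Descriptions must be unique: collect any that occur more than once.
duplicates=$(tail -n +2 "$meta_file" | cut -d, -f2 | sort | uniq -d)
[ -z "$duplicates" ] || echo "duplicate descriptions: $duplicates" >&2
```

Here BC1 would match the files BC1_R1.fastq.gz and BC1_R2.fastq.gz in the data folder.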
-
Within the config folder, add your custom Illumina template
- Example template for human RNA-seq using Illumina prepared samples (genome_build for mouse = mm10), saved as star-illumina-rnaseq.yaml:
details:
  - analysis: RNA-seq
    genome_build: hg19
    algorithm:
      aligner: star
      quality_format: Standard
      trim_reads: False
      strandedness: firststrand
upload:
  dir: ../results
- The list of available genomes can be found by running bcbio_setup_genome.py
- strandedness options: unstranded, firststrand, secondstrand
- Additional parameters can be found: http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration
- Best practice templates can be found: https://github.com/chapmanb/bcbio-nextgen/tree/master/config/templates
-
Within the data folder, add all the fastq files you want to analyze.
-
Go to analysis folder and create the full Illumina instructions using the Illumina template created in Set-up: step #6.
bsub -Is -n 4 -q interactive bash    # start interactive job
cd path-to-folder/*_RNAseq/analysis    # change directories to analysis folder
bcbio_nextgen.py -w template ../config/star-illumina-rnaseq.yaml ../meta/*-rnaseq.csv ../data/*fastq.gz    # run command to create the full yaml file
-
Create script for running the job (in analysis folder)
#!/bin/sh
#BSUB -q priority
#BSUB -J *-rnaseq
#BSUB -o *-rnaseq.out
#BSUB -N
#BSUB -u "[email protected]"
#BSUB -n 1
#BSUB -R "rusage[mem=8024]"
#BSUB -W 50:00
#
date
bcbio_nextgen.py ../config/*-rnaseq.yaml -n 64 -t ipython -s lsf -q parallel '-rW=90:00' -r mincores=2 -rminconcores=2 --retries 3 --timeout 580
date
-
Go to the work folder and start the job; make sure you are in an interactive session
cd path-to-folder/*-rnaseq/analysis/*-rnaseq/work
bsub < ../../runJob-*-rnaseq.lsf
-
The bam files will be located here:
path-to-folder/*-rnaseq/analysis/*-rnaseq/work/align/SAMPLENAME/NAME_*-rnaseq_star/
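To list every alignment at once, the directory pattern above can be searched with find. A minimal sketch, assuming the alignments end in .bam and sit somewhere under the align folder of the work directory; list_bams is a hypothetical helper name.

```shell
# List all BAM files under a bcbio work directory, one per line, sorted.
# list_bams is a hypothetical helper; $1 is the path to the work folder.
list_bams() {
    find "$1/align" -type f -name '*.bam' | sort
}
```

Usage: list_bams path-to-folder/*-rnaseq/analysis/*-rnaseq/work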
-
Extracting interesting region (example)
-
samtools view -h -b sample1.bam "chr2:176927474-177089906" > sample1_hox.bam
-
samtools index sample1_hox.bam
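To pull the same region from several BAM files, the two commands can be generated in a loop. The sketch below only prints the commands (a dry run) so they can be checked first; nothing is run, and samtools does not need to be installed to try it. region_cmds is a hypothetical helper, and the _hox suffix is copied from the example above.

```shell
# Print the samtools commands that extract one region from each BAM file.
# region_cmds is a hypothetical helper; nothing is executed here.
# Arguments: the region, then one or more BAM file names (no spaces).
region_cmds() (
    region=$1
    shift
    for bam in "$@"; do
        # sample1.bam -> sample1_hox.bam
        out="${bam%.bam}_hox.bam"
        printf 'samtools view -h -b %s "%s" > %s\n' "$bam" "$region" "$out"
        printf 'samtools index %s\n' "$out"
    done
)
```

Usage: region_cmds "chr2:176927474-177089906" sample1.bam sample2.bam prints four lines; once they look right, pipe the output to sh to run them.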
-
- Report creation and creating project_summary.csv
source ~/.bashrc
cd ~/bcbio.rnaseq
bsub -Is -q interactive bash
lein run summarize path-to-project-summary.yaml -f "~batch+panel"
-
Copy the results/-rnaseq/ folder and the results/-rnaseq/summary/qc-summary.Rmd file to your local computer:
scp -r [email protected]:path-to-folder/*-rnaseq/analysis/*-rnaseq/results/ date_*-rnaseq/ .
-
Within R Studio:
- load library(knitrBootstrap)
- put three dashes at the top and bottom of the knitrBootstrap specifics
- Copy over header info for knitrBootstrap
- Alter paths to files
sshfs [email protected]:/n/data1/cores/bcbio ~/bcbio -o volname=bcbio -o follow_symlinks