diff --git a/README.md b/README.md
index c69904c..5055ab1 100644
--- a/README.md
+++ b/README.md
@@ -2,13 +2,13 @@
 
 Processing DamID-seq data involves extending single-end reads, aligning the reads to the genome and determining the coverage, similar to processing regular ChIP-seq datasets. However, as DamID data is represented as a log2 ratio of (Dam-fusion/Dam), normalisation of the sample and Dam-only control is necessary and adding pseudocounts to mitigate the effect of background counts is highly recommended.
 
-We use a single pipeline script to handle sequence alignment, read extension, binned counts, normalisation, pseudocount addition and final ratio file generation. The script uses FASTQ or BAM files as input, and outputs the final log2 ratio files in GFF or bedGraph format. These files can easily be converted to TDF for viewing in [IGV](http://www.broadinstitute.org/software/igv/) with the provided [gff2tdf.pl](http://github.com/owenjm/damid_pipeline/blob/master/gff2tdf.pl?raw=true) script (see below).
+[damidseq_pipeline](https://github.com/owenjm/damidseq_pipeline/tarball/master) is a single script that automatically handles sequence alignment, read extension, binned counts, normalisation, pseudocount addition and final ratio file generation. The script uses FASTQ or BAM files as input, and outputs the final log2 ratio files in GFF or bedGraph format. These files can easily be converted to TDF for viewing in [IGV](http://www.broadinstitute.org/software/igv/) with the provided [gff2tdf.pl](http://github.com/owenjm/damid_pipeline/blob/master/gff2tdf.pl?raw=true) script (see below).
 
 ### Download
 
 Download the latest version of the pipeline script and associated files:
-* [As a zipfile](https://github.com/owenjm/damidseq_pipeline/zipball/master)
 * [As a tarball](https://github.com/owenjm/damidseq_pipeline/tarball/master)
+* [As a zipfile](https://github.com/owenjm/damidseq_pipeline/zipball/master)
 
 Prebuilt GATC fragment files used by the script are available for the following genomes:
 * [*Drosophila melanogaster* r5.57](https://github.com/owenjm/damidseq_pipeline/raw/gh-pages/pipeline_gatc_files/Dmel_r5.57.GATC.gff.gz)
@@ -26,16 +26,16 @@ Prebuilt GATC fragment files used by the script are available for the following
 
 ### Installation
 
-1. Unzip the pipeline script zip file, make the damid_pipeline.pl file executable and place it in your path
+1. Extract the pipeline script archive, make the damid_pipeline file executable and place it in your path
 1. Install [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
 1. Obtain Bowtie 2 indices provided by [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) or [Illumina's iGenome](http://support.illumina.com/sequencing/sequencing_software/igenome.html)
 
 Alternatively, build the Bowtie 2 index files manually:
 
- 1. Download the latest FASTA genome primary_assembly (or toplevel) file from [Ensembl](ftp.ensembl.org/pub/current_fasta/)
+ 1. Download the latest FASTA genome primary_assembly (or toplevel) file from [Ensembl](http://ftp.ensembl.org/pub/current_fasta/)
  e.g. [the current release for *Mus musculus*](http://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz)
- (alternatively, for *Drosophila*, download from the [Flybase FTP site](http://ftp.flybase.net/releases/current/)
- e.g. [*D. melanogaster* release 5.57](http://ftp.flybase.net/releases/FB2014_03/dmel_r5.57/fasta/dmel-all-chromosome-r5.57.fasta.gz))
+ (alternatively, for *Drosophila*, download from the Flybase FTP site (ftp://ftp.flybase.net/releases/current/)
+ e.g. ftp://ftp.flybase.net/releases/FB2014_03/dmel_r5.57/fasta/dmel-all-chromosome-r5.57.fasta.gz )
  1. Extract the .gz file
  1. Run bowtie2-build in the directory containing the extracted .fasta file. For the examples above:
 
@@ -69,7 +69,7 @@ In order to run correctly, the script needs to know the locations of two paths, 
 
 In order to setup the pipeline to process the *D. melanogaster* genome, for example, the first-run command would be:
 
- damidseq_pipeline.pl --gatc_frag_file=path/to/Dmel_r5.57.GATC.gff.gz --bowtie2_genome_dir=path/to/dmel_r5.57/dmel_r.5.57
+ damidseq_pipeline --gatc_frag_file=path/to/Dmel_r5.57.GATC.gff.gz --bowtie2_genome_dir=path/to/dmel_r5.57/dmel_r.5.57
 
 If these paths do not already exist and the script is run with these options and correct values, the paths will be saved for all future runs unless overridden on the command-line.
 
@@ -83,13 +83,13 @@ The script will by default determine sample names from the file names, and expec
 
 To see all available options, run the script with --help command-line option:
 
- damidseq_pipeline.pl --help
+ damidseq_pipeline --help
 
 This will give you a list of adjustable parameters and their default and current values if applicable. We recommend keeping these at the default value in most cases; however, these can be modified on the command-line with --option=value (no spaces).
 
 To save modified values for all future runs, run the script with the parameter you wish to change together with the --save_defaults command-line option:
 
- damidseq_pipeline.pl --save_defaults
+ damidseq_pipeline --save_defaults
 
 If bowtie2 and samtools are not in your path, you can specify these on the command-line also.
 
@@ -105,12 +105,12 @@ Either file can be converted to .tdf format for viewing in [IGV](http://www.broa
 
 If the user expects to process data from multiple genomes, separate genome specifications can be saved by using the --save_defaults=[name] along with the --bowtie2_genome_dir and --gatc_frag_file options (and any other custom options that the user wishes to set as default for this genome, e.g. the bin width). For e.g.:
 
- damidseq_pipeline.pl --save_defaults=fly --gatc_frag_file=path/to/Dmel_r5.57.GATC.gff.gz --bowtie2_genome_dir=path/to/dmel_r5.57/dmel_r.5.57
- damidseq_pipeline.pl --save_defaults=mouse --bins=500 --gatc_frag_file=path/to/MmGRCm38.GATC.gff.gz --bowtie2_genome_dir=path/to/Mm_GRCm38/GRCm38
+ damidseq_pipeline --save_defaults=fly --gatc_frag_file=path/to/Dmel_r5.57.GATC.gff.gz --bowtie2_genome_dir=path/to/dmel_r5.57/dmel_r.5.57
+ damidseq_pipeline --save_defaults=mouse --bins=500 --gatc_frag_file=path/to/MmGRCm38.GATC.gff.gz --bowtie2_genome_dir=path/to/Mm_GRCm38/GRCm38
 
 Once set up, different genome definitions can be quickly loaded using the --load_defaults=[name] option, e.g.:
 
- damidseq_pipeline.pl --load_defaults=fly
+ damidseq_pipeline --load_defaults=fly
 
 All currently saved genome definitions can be listed using --load_defaults=list.