v1.5.3

owenjm · Nov 6, 2022 · 5d34672 · 5d34672
1 parent 21fb45f
commit 5d34672
Show file tree

Hide file tree

Showing 3 changed files with 435 additions and 144 deletions.
diff --git a/README.md b/README.md
@@ -1,51 +1,50 @@
 # Introduction
 
-Processing DamID-seq data involves extending single-end reads, aligning the reads to the genome and determining the coverage, similar to processing regular ChIP-seq datasets. However, as DamID data is represented as a log2 ratio of (Dam-fusion/Dam), normalisation of the sample and Dam-only control is necessary and adding pseudocounts to mitigate the effect of background counts is highly recommended.
+[damidseq_pipeline](https://github.com/owenjm/damidseq_pipeline/releases) is a single script that automatically handles sequence alignment, read extension, binned counts, normalisation, pseudocount addition and final ratio file generation. The script uses FASTQ or BAM files as input, and outputs the final log2 ratio files in bedGraph format.
 
-[damidseq_pipeline](https://github.com/owenjm/damidseq_pipeline/tarball/master) is a single script that automatically handles sequence alignment, read extension, binned counts, normalisation, pseudocount addition and final ratio file generation. The script uses FASTQ or BAM files as input, and outputs the final log2 ratio files in bedGraph (or optionally GFF) format.
-
-The output ratio files can easily be converted to TDF for viewing in [IGV](http://www.broadinstitute.org/software/igv/) using igvtools.  The files can be processed for peak calling using [find_peaks](http://github.com/owenjm/find_peaks) or, if using RNA pol II DamID, transcribed genes can be determined using [polii.gene.call](http://github.com/owenjm/polii.gene.call).
+## Features
+* Fully automated processing of NGS DamID-seq datasets, from FASTQ input to bedGraph output
+* Handles both single- and paired-end datasets
+* Can be used with either FASTQ or pre-aligned BAM input files
+* Multiple methods of normalisation provided
+* As of v1.5.3 and greater, can also handle and process ChIP-seq NGS data
 
 ## Citation
 
 If you find this software useful, please cite:
 
-Marshall OJ and Brand AH. (2015) damidseq_pipeline: an automated pipeline for processing DamID sequencing datasets. *Bioinformatics.* Oct 15;31(20):3371-3.  doi: 10.1093/bioinformatics/btv386. ([pubmed](http://www.ncbi.nlm.nih.gov/pubmed/26112292); [full text, open access](http://bioinformatics.oxfordjournals.org/content/early/2015/07/13/bioinformatics.btv386.long))
-
-# Download
+Marshall OJ and Brand AH. (2015) damidseq_pipeline: an automated pipeline for processing DamID sequencing datasets. *Bioinformatics.* 31(20): 3371--3.
+([pubmed](http://www.ncbi.nlm.nih.gov/pubmed/26112292); [full text, open access](https://academic.oup.com/bioinformatics/article/31/20/3371/196153))
 
-**Important: if using a version of the pipeline less than v1.3, please update as older versions did not correctly account for reads on the minus strand during read extension.**  The difference in the final output files is minimal (the Spearman's correlation between files generated with v1.2.6 or less or v1.3 is >0.95) but files generated by the new method should be more accurate.
+# Download and installation
 
-Download the latest version of the pipeline script and associated files:
-* [As a tarball](https://github.com/owenjm/damidseq_pipeline/tarball/master)
-* [As a zipfile](https://github.com/owenjm/damidseq_pipeline/zipball/master)
+[Download the latest version](https://github.com/owenjm/damidseq_pipeline/releases) of the pipeline script and associated files.
 
 Prebuilt GATC fragment files used by the script are available for the following genomes:
 * [*Drosophila melanogaster* r5.57](https://github.com/owenjm/damidseq_pipeline/raw/gh-pages/pipeline_gatc_files/Dmel_r5.57.GATC.gff.gz)
 * [*D. melanogaster* r6](https://github.com/owenjm/damidseq_pipeline/raw/gh-pages/pipeline_gatc_files/Dmel_BDGP6.GATC.gff.gz)
 * [*Mus musculus* GRCm38](https://github.com/owenjm/damidseq_pipeline/raw/gh-pages/pipeline_gatc_files/MmGRCm38.GATC.gff.gz) or
 * [Human GRCh38](https://github.com/owenjm/damidseq_pipeline/raw/gh-pages/pipeline_gatc_files/HsGRCh38.GATC.gff.gz).
 
-# Requirements
+## Requirements
 
 * [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) v2.1 or above (and appropriate genome indices -- see below) (not required if using pre-aligned BAM files)
 * [SAMtools](http://samtools.sourceforge.net) v0.1.9 or above
 * a GFF file containing all GATC sites in the genome (a file for the Drosophila genome is provided in the script .zip archive)
 * a *nix operating system (e.g. linux, Mac OSX) with Perl v5.10 or greater installed. We recommend using Ubuntu Linux in a virtual machine if using Windows.
 * Sequencing data in FASTQ or BAM format
 
-# Installation
+## Installation
 
 1. Extract the pipeline script archive, make the damid_pipeline file executable and place it in your path
 1. Install [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
 1. Obtain Bowtie 2 indices provided by [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) or [Illumina's iGenome](http://support.illumina.com/sequencing/sequencing_software/igenome.html)
 
     Alternatively, build the Bowtie 2 index files manually:
     1. Download the latest FASTA genome primary_assembly (or toplevel) file from [Ensembl](http://ftp.ensembl.org/pub/current_fasta/)
-        e.g. [the current release for *Mus musculus*](http://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz)
+        e.g. [the current release for *Mus musculus*](http://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/)
 
-        (alternatively, for *Drosophila*, download from the Flybase FTP site (ftp://ftp.flybase.net/releases/current/)
-         e.g. ftp://ftp.flybase.net/releases/FB2014_03/dmel_r5.57/fasta/dmel-all-chromosome-r5.57.fasta.gz )
+        (alternatively, for *Drosophila*, download from [the Flybase FTP site](ftp://ftp.flybase.net/releases/current/)
     1. Extract the .gz file
     1. Run bowtie2-build in the directory containing the extracted .fasta file. For the examples above:
 
@@ -65,69 +64,66 @@ Prebuilt GATC fragment files used by the script are available for the following
 
             perl gatc.track.maker.pl --name=dmel_r5.57 dmel-all-chromosome-r5.57.fasta
 
-## Running the script the first time
+## Initial setup
 
-In order to run correctly, the script needs to know the locations of two paths, specified using the following command-line options:
+In order to run correctly, damidseq_pipeline needs to know the locations of two paths, specified using the following command-line options (it will prompt you for these if you fail to specify them):
 
 1. The directory and basename of the bowtie2 index files (obtained or built in step 3 above)
-    (specified with the --bowtie2_genome_dir option)
+    (specified with the `--bowtie2_genome_dir` option)
         e.g. in the example above, use
 
         --bowtie2_genome_dir=[path_to_.bt2_files]/dmel_r5.57
 1. The GATC fragment .gff file (downloaded from the pre-built files listed in step 5, or built following the instructions above)
-    (specified with the --gatc_frag_file option)
+    (specified with the `--gatc_frag_file` option)
 
 In order to setup the pipeline to process the *D. melanogaster* genome, for example, the first-run command would be:
 
     damidseq_pipeline --gatc_frag_file=path/to/Dmel_r5.57.GATC.gff.gz --bowtie2_genome_dir=path/to/dmel_r5.57/dmel_r.5.57
 
-If these paths do not already exist and the script is run with these options and correct values, the paths will be saved for all future runs unless overridden on the command-line.
-
-(To clear all user-saved values, run the script with the --reset_defaults option.)
+Once run once with these options and correct values, the paths will be saved for all future runs unless overridden on the command-line. **You do not need to specify these options each time**
+{: .notice--info}
 
-# Running the script
+(To clear all user-saved values, including these values, run with the `--reset_defaults` option.)
 
-Run the script in a directory with sequencing files in FASTQ or BAM format.  The default behaviour is to process all files in FASTQ format, and if none are found, all files in BAM format.  Alternatively, individual files may be specified on the command line if the user does not wish to process all available files present in the directory (for example, if the sequencing lane contained multiple replicates).
+# Using damidseq_pipeline
 
-The script will by default determine sample names from the file names, and expects a single filename to start with "Dam" in order to automatically assign the Dam-only control.
+Run damidseq_pipeline in a directory containing sequencing files in FASTQ or BAM format.  The default behaviour is to process all files in FASTQ format, and if none are found, all files in BAM format.  
 
-To see all available options, run the script with --help command-line option:
-
-    damidseq_pipeline --help
-
-This will give you a list of adjustable parameters and their default and current values if applicable. We recommend keeping these at the default value in most cases; however, these can be modified on the command-line with --option=value (no spaces).
-
-To save modified values for all future runs, run the script with the parameter you wish to change together with the --save_defaults command-line option:
-
-    damidseq_pipeline --save_defaults
+Alternatively, individual files may be specified on the command line if the user does not wish to process all available files present in the directory (for example, if the sequencing lane contained multiple replicates).
 
-If bowtie2 and samtools are not in your path, you can specify these on the command-line also.
+## Sample names
+Sample names are assigned from filenames.  If a single filename being processed begins with "Dam", this will be assigned as the Dam-only control.  
 
-## Output and downstream data processing
+If no sample filename, or multiple filenames, begin with "Dam", use `--dam=[filename]` to specify the Dam-only control sample manually.  If a Dam-only control cannot be automatically determined, damidseq_pipeline will exit and prompt you to specify one.
 
-The final output will be a single ratio file: Sample-vs-DAM.gatc.bedgraph. The .gatc.bedgraph file represents the data at GATC fragment resolution (based on the reference genome) and should be used for all subsequent analysis.
+## Paired-end sequencing files
+To process paired-end FASTQ files, use the `--paired` option and the pipeline will search for, and match, paired reads.
 
-The [bedGraph format](http://genome.ucsc.edu/goldenpath/help/bedgraph.html) is used by default.  The pipeline script can output the final ratio files in [GFF format](http://www.ensembl.org/info/website/upload/gff.html) instead if the --output_format=gff command-line switch is used.
+BAM files generated from paired-end data are automatically detected and processed, without requiring this option.
 
-### Visualising the DNA binding profiles
+## Processing ChIP-seq data
+As of v1.5.3, damidseq_pipeline can also handle ChIP-seq data via the `--chipseq` flag.  This option will remove PCR duplicate reads, only process uniquely mapping reads, and output binned coverage tracks in RPM (reads per million mapped reads).
 
-The bedgraph output file can be can be converted to .tdf format via igvtools, either using the graphical interface provided by [IGV](http://www.broadinstitute.org/software/igv/) or via the command-line.  A legacy [gff2tdf.pl](http://github.com/owenjm/damid_pipeline/blob/master/gff2tdf.pl?raw=true) script is also provided for converting GFF files (the default output file format prior to v1.4) to TDF.
+Warning -- do not use this option with DamID-seq data.  DamID-seq is all about the PCR duplicates!
+{: .notice--warning}
 
-### Calling significant peaks from the data
+## Other options
+To see all available options, run the script with `--help` command-line option:
 
-The [find_peaks](http://github.com/owenjm/find_peaks) software will process the output .gatc.bedgraph ratio file and call significant peaks present in the dataset.  Please see the find_peaks page for more details.
+    damidseq_pipeline --help
 
-### Calling transcribed genes from RNA pol II datasets
+This will give you a list of adjustable parameters and their default and current values if applicable. We recommend keeping these at the default value in most cases; however, these can be modified on the command-line with `--option=value` (no spaces).
 
-The [polii.gene.call](http://github.com/owenjm/polii.gene.call) Rscript will call transcribed genes (i.e. gene bodies with significantly enriched pol II occupancy) from the output .gatc.bedgraph file.  Please see the polii.gene.call page for more details.
+## Saving default option values
+To save modified values for all future runs, run the script with the parameter you wish to change together with the `--save_defaults` command-line option:
 
-### Other useful scripts and utilities
+    damidseq_pipeline --save_defaults
 
-A collection of useful R and Perl scripts for comparing and analysing DamID-seq data is maintained at [http://github.com/owenjm/damid_misc].
+If bowtie2 and samtools are not in your path, you can specify these on the command-line also.
 
 ## Working with multiple genomes
 
-If the user expects to process data from multiple genomes, separate genome specifications can be saved by using the --save_defaults=[name] along with the --bowtie2_genome_dir and --gatc_frag_file options (and any other custom options that the user wishes to set as default for this genome, e.g. the bin width).  For e.g.:
+If the user expects to process data from multiple genomes, separate genome specifications can be saved by using the `--save_defaults=[name]` along with the `--bowtie2_genome_dir` and `--gatc_frag_file` options (and any other custom options that the user wishes to set as default for this genome, e.g. the bin width).  For e.g.:
 
     damidseq_pipeline --save_defaults=fly --gatc_frag_file=path/to/Dmel_r5.57.GATC.gff.gz --bowtie2_genome_dir=path/to/dmel_r5.57/dmel_r.5.57
     damidseq_pipeline --save_defaults=mouse --bins=500 --gatc_frag_file=path/to/MmGRCm38.GATC.gff.gz --bowtie2_genome_dir=path/to/Mm_GRCm38/GRCm38
@@ -138,17 +134,28 @@ Once set up, different genome definitions can be quickly loaded using the --load
 
 All currently saved genome definitions can be listed using --load_defaults=list.
 
-## Processing FASTQ files with adaptor codes
+## Example dataset
 
-The damidseq_pipeline will attempt to automatically determine the sample name from the filenames provided.  
+An example set of two small (3000 reads each) fastq.gz files and an index.txt file are provided in the zip archive (as "example.zip"), or you can [download these separately](http://github.com/owenjm/damid_pipeline/blob/master/example.zip?raw=true). Running the pipeline script on these files should successfully produce a polII-vs-Dam.gatc.bedgraph ratio file as output.
 
-Some sequencing facilities, however, may provide only adaptor index IDs rather than sample names.  If sequencing filenames are provided with index IDs rather than names, a file "index.txt" in the working directory can be provided with the format of [index] [sample_name], eg:
+# Output and downstream data processing
 
-    A6 Dam
-    A12 polII 
+The final output will be a single ratio file: Sample-vs-Dam.gatc.bedgraph. The .gatc.bedgraph file represents the data at GATC fragment resolution (based on the reference genome) and should be used for all subsequent analysis.
 
-where A6 is the sequencing adaptor index. The sample name cannot contain spaces and there has to be one (and only one) sample called "Dam"; the adaptor index must to be referenced in the fastq filename (e.g. for "A6", either "Index6" or "A006" are expected in the filename). Please see the [provided example](http://github.com/owenjm/damid_pipeline/blob/master/example.zip?raw=true) for an illustration of the index.txt file format and matching file names.
+The [bedGraph format](http://genome.ucsc.edu/goldenpath/help/bedgraph.html) is used by default.  The pipeline script can output the final ratio files in [GFF format](http://www.ensembl.org/info/website/upload/gff.html) instead if the `--output_format=gff` command-line switch is used.
 
-## Example dataset
+### Visualising the DNA binding profiles
 
-An example set of two small (3000 reads each) fastq.gz files and an index.txt file are provided in the zip archive (as "example.zip"), or you can [download these separately](http://github.com/owenjm/damid_pipeline/blob/master/example.zip?raw=true). Running the pipeline script on these files should successfully produce a polII-vs-Dam.gatc.bedgraph ratio file as output.
+The bedgraph output files can be can viewed directly in genome browsers such as [IGV](http://www.broadinstitute.org/software/igv/).  For publication-quality figures, we recommend [pyGenomeTracks](https://pygenometracks.readthedocs.io/).
+
+### Calling significant peaks from the data
+
+The [find_peaks](http://github.com/owenjm/find_peaks) software will process the output .gatc.bedgraph ratio file and call significant peaks present in the dataset.  Please see the find_peaks page for more details.
+
+### Calling transcribed genes from RNA pol II datasets
+
+The [polii.gene.call](http://github.com/owenjm/polii.gene.call) Rscript will call transcribed genes (i.e. gene bodies with significantly enriched pol II occupancy) from the output .gatc.bedgraph file.  Please see the polii.gene.call page for more details.
+
+### Other useful scripts and utilities
+
+A collection of useful R and Perl scripts for comparing and analysing DamID-seq data [is maintained here](http://github.com/owenjm/damid_misc).
diff --git a/changelog b/changelog
@@ -1,3 +1,15 @@
+
+[v1.5.3]
+*  New feature: --coverage flag outputs coverage bedGraph track of each sample (useful with Dam-only tracks for CATaDa analysis).  --just_coverage exits after coverage tracks are written.
+*  New feature: Kernel density normalisation routines sped up ~10x via Inline::C.  This is enabled by default, but will only work if the Inline::C module is installed; damidseq_pipeline will fall back on the Perl coded routine if this is not present.  If you want to control explicitly, do so via the --kde_inline flag.
+*  New feature: ChIP-seq-specific processing routines added: --remdups removes PCR duplicates, --unique only maps unique reads.  Enable all of these with --chipseq (also turns of GATC-level binning, GATC-based read extension and will only output coverage tracks not ratios)
+*  New feature: New normalisation method, "rawbins".  This writes the normalisation data to TSV format and runs a user-provided script for custom normalisation.  Your script should read the data file and output the normalisation factor.  Advanced users only, use with caution.
+*  New feature: consider all occupancy as RPM (reads per million mapped reads), set as default.  Control with --scores_as_rpm flag.
+*  Changed: bowtie2 command handling refactored more elegantly
+*  Added: ability to add custom flags to bowtie2 command with --bowtie2_add_flags
+*  Fixed: small changes to read mapping code for more accurate mapping
+*  Changed: filename filters are now turned off by default (use --no_file_filters=0 to turn back on)
+
 [v1.4.6]
 *  New feature: paired-end reads are now handled natively at the fastq level.  Use the --paired flag to enable, and your paired reads should be automagically detected and aligned
 *  Changed: BAM file format is now used natively without SAM intermediates at all levels (also fixes the recent samtools version handling bug)