
Juicer version 1.5
* Added DNAse mode; specify the restriction site as "none" to enable
* Added exclude fragment flag; use "-x" to run Juicer without creating fragment-delimited maps (this saves 8x space and a lot of time)
* Collisions (more than two loci in one read) are now detected and written to output
* Added another stage to relaunch at, and updated documentation to make this mode more clear
* Added diploid Hi-C creation scripts
* Added new flags for easier portability to LSF from AWS/OpenLava
* Updated documentation; added version information to output
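A sketch of how the new flags from this release might be combined (my own illustration, not from the commit: the `juicer.sh` path and `-g hg19` follow the AWS README below; only `-s none` for DNase mode and `-x` for skipping fragment-delimited maps are confirmed by the notes above):

```shell
#!/bin/bash
# Sketch: assemble a Juicer 1.5 invocation using the new flags.
# -s none  -> DNase mode (no restriction site)
# -x       -> skip fragment-delimited maps (saves ~8x space and time)
JUICER=/opt/juicer/scripts/juicer.sh   # path as in the AWS README
cmd="$JUICER -g hg19 -s none -x"
echo "$cmd"
```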
nchernia committed Jul 28, 2016
1 parent 429d6a0 commit 1c414dd
Showing 64 changed files with 2,927 additions and 960 deletions.
55 changes: 37 additions & 18 deletions AWS/README.md
# Quick Start

Extensive instructions, including how to create and launch an instance, are
available at <http://aidenlab.org/juicer/aws.html>

Below is the set of instructions we gave to reviewers. You can obtain the
MBR19 test data set at
<http://juicerawsmirror.s3.amazonaws.com/opt/juicer/work/MBR19/fastq/chr19_R1.fastq.gz>
<http://juicerawsmirror.s3.amazonaws.com/opt/juicer/work/MBR19/fastq/chr19_R2.fastq.gz>
The HIC003 test data set is available at
<http://juicerawsmirror.s3.amazonaws.com/opt/juicer/work/HIC003/fastq/HIC003_S2_L001_R1_001.fastq.gz>
<http://juicerawsmirror.s3.amazonaws.com/opt/juicer/work/HIC003/fastq/HIC003_S2_L001_R2_001.fastq.gz>

You will need to launch your own instance, use your own Juicer_AWS.pem, and
use the public IP associated with your instance.

------------------------------------------------------------------------------
We've provided a step-by-step guide to showcase some of the features of
Juicer. If you run into problems, see below for more detailed documentation.
This example runs on Amazon Web Services, but you can install the pipeline
on any LSF, Univa Grid Engine, or SLURM cluster.

1. Make sure you're in the top-level directory, with the Juicer_AWS.pem file.
2. You were given an anonymous IP address. At a command line prompt, type:
` ssh -i Juicer_AWS.pem ubuntu@<given IP address>`
3. This will log you into an AWS instance that contains all the software
needed to run the pipeline. Type
`cd /opt/juicer/work/`
4. We will run the pipeline on a test dataset of a single chromosome of the
primary+replicate map from (Rao & Huntley et al., 2014). Type:
`cd MBR19`
5. Run the Juicer pipeline on the raw data, which is stored in the fastq
directory:
`/opt/juicer/scripts/juicer.sh -g hg19 -s MboI`
6. You will see a series of messages sending jobs to the cluster. Do not
kill the script or close the server connection until you see:
`“(-: Finished adding all jobs... please wait while processing.”`
7. At this point you can close the connection and come back later.
To see the progress of the pipeline as it works, type:
`bjobs -w`
8. Eventually the bjobs command will report “No unfinished job found”. Type:
`tail lsf.out`
You should see “(-: Pipeline successfully completed (-:”
9. Results are available in the aligned directory. The Hi-C maps are in
inter.hic (for MAPQ > 0) and inter_30.hic (for MAPQ >= 30). The Hi-C maps
can be loaded in Juicebox and explored. They can also be used for
automatic feature annotation and to extract matrices at specific
annotation of contact domains (identified using the Arrowhead algorithm). The formats
of these files are described in the Juicebox tutorial online; both files can be loaded
into Juicebox as a 2D annotation.
10. To download a file (e.g. inter.hic) from AWS to load into Juicebox, type:

```
sftp -i Juicer_AWS.pem ubuntu@<given IP address>
cd /opt/juicer/work/MBR19/aligned
get inter.hic
get inter_30.hic
get ... (each of hiccups, apa, motifs, and arrowhead output files)
```

11. You can also run the pipeline on a genome-wide dataset that is lower resolution. Type
`cd /opt/juicer/work/HIC003`
Then
`/opt/juicer/scripts/juicer.sh -g hg19 -s MboI`
Again the pipeline will run. The results will be available in the aligned directory.
Because this is not a deeply sequenced map, loop lists and domain lists will not be
produced.
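The interactive `sftp` session in step 10 can also be driven non-interactively with a batch file, which is handy when fetching several outputs at once. This is my own sketch, not part of the official instructions; the remote paths match step 10, and the actual `sftp` call is left commented out because it needs a live instance and your `Juicer_AWS.pem`:

```shell
#!/bin/bash
# Write an sftp batch file listing the downloads from step 10.
cat > fetch.batch <<'EOF'
cd /opt/juicer/work/MBR19/aligned
get inter.hic
get inter_30.hic
EOF
# Run it against your instance (requires Juicer_AWS.pem and the given IP):
# sftp -b fetch.batch -i Juicer_AWS.pem ubuntu@<given IP address>
cat fetch.batch
```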
204 changes: 204 additions & 0 deletions AWS/scripts/LibraryComplexity.java
/*
* The MIT License (MIT)
*
* Copyright (c) 2011-2015 Broad Institute, Aiden Lab
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*
* Based on PicardTools
* Juicer version 1.5
*/
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.File;
class LibraryComplexity {
public static void main(String[] args) {
if (args.length != 2 && args.length != 3 && args.length != 1) {
System.out.println("Usage: java LibraryComplexity <directory> [stats file]");
System.out.println("     : java LibraryComplexity <unique> <pcr> <opt>");
System.exit(0);
}
NumberFormat nf = NumberFormat.getInstance(Locale.US);

BufferedReader reader = null;
long readPairs = 0;
long uniqueReadPairs = 0;
long opticalDups = 0;
long totalReadPairs = 0;
if (args.length == 2 || args.length==1) {
try {
File f = new File(args[0] + "/opt_dups.txt");
if (f.exists()) {
reader = new BufferedReader(new FileReader(f));
while (reader.readLine() != null) opticalDups++;
reader.close();
}
f = new File(args[0] + "/merged_nodups.txt");
if (f.exists()) {
reader = new BufferedReader(new FileReader(f));
while (reader.readLine() != null) uniqueReadPairs++;
reader.close();
}
f = new File(args[0] + "/dups.txt");
if (f.exists()) {
reader = new BufferedReader(new FileReader(args[0] + "/dups.txt"));
while (reader.readLine() != null) readPairs++;
reader.close();
readPairs += uniqueReadPairs;
}
String fname = "inter.txt";
if (args.length == 2) fname = args[1];
f = new File(args[0] + "/" + fname);
if (f.exists()) {
reader = new BufferedReader(new FileReader(args[0] + "/" + fname));
String line = reader.readLine();
boolean done = false;
while (line != null && !done) {
if (line.contains("Sequenced Read")) {
String[] parts = line.split(":");
try {
totalReadPairs = nf.parse(parts[1].trim()).longValue();
}
catch (ParseException e) {
totalReadPairs = 0;
}
done = true;
}
line = reader.readLine();
}
reader.close();
}
} catch (IOException error) {
System.err.println("Problem counting lines in merged_nodups and dups");
System.exit(1);
}
}
else {
try {
// parse as long: read-pair counts can exceed Integer.MAX_VALUE
uniqueReadPairs = Long.parseLong(args[0]);
readPairs = Long.parseLong(args[1]);
opticalDups = Long.parseLong(args[2]);
}
catch (NumberFormatException error) {
System.err.println("When called with three arguments, all three must be integers");
System.exit(1);
}
readPairs += uniqueReadPairs;
}
NumberFormat decimalFormat = NumberFormat.getPercentInstance();
decimalFormat.setMinimumFractionDigits(2);
decimalFormat.setMaximumFractionDigits(2);

// legacy estimate (only used by the commented-out print below); guard avoids division by zero when there are no duplicates
long result2 = (readPairs > uniqueReadPairs) ? readPairs*readPairs/(2*(readPairs-uniqueReadPairs)) : 0;

System.out.print("Unique Reads: " + NumberFormat.getInstance().format(uniqueReadPairs) + " ");
if (totalReadPairs > 0) {
System.out.println("(" + decimalFormat.format(uniqueReadPairs/(double)totalReadPairs) + ")");
}
else {
System.out.println();
}

System.out.print("PCR Duplicates: " + nf.format(readPairs - uniqueReadPairs) + " ");
if (totalReadPairs > 0) {
System.out.println("(" + decimalFormat.format((readPairs - uniqueReadPairs)/(double)totalReadPairs) + ")");
}
else {
System.out.println();
}
System.out.print("Optical Duplicates: " + nf.format(opticalDups) + " ");
if (totalReadPairs > 0) {
System.out.println("(" + decimalFormat.format(opticalDups/(double)totalReadPairs) + ")");
}
else {
System.out.println();
}
long result;
try {
result = estimateLibrarySize(readPairs, uniqueReadPairs);
}
catch (NullPointerException | IllegalStateException e) {
System.err.println("Library complexity undefined when total = " + readPairs + " and unique = " + uniqueReadPairs);
return;
}
System.out.println("Library Complexity Estimate: " + nf.format(result));
// System.out.println("Library complexity (old): " + nf.format(result2));

}

/**
* Estimates the size of a library based on the number of paired end molecules
* observed and the number of unique pairs observed.
* <br>
* Based on the Lander-Waterman equation that states:<br>
* C/X = 1 - exp( -N/X )<br>
* where<br>
* X = number of distinct molecules in library<br>
* N = number of read pairs<br>
* C = number of distinct fragments observed in read pairs<br>
*/
public static Long estimateLibrarySize(final long readPairs,
final long uniqueReadPairs) {
final long readPairDuplicates = readPairs - uniqueReadPairs;

if (readPairs > 0 && readPairDuplicates > 0) {
long n = readPairs;
long c = uniqueReadPairs;

double m = 1.0, M = 100.0;

if (c >= n || f(m*c, c, n) < 0) {
throw new IllegalStateException("Invalid values for pairs and unique pairs: " + n + ", " + c);
}

while( f(M*c, c, n) >= 0 ) {
m = M;
M *= 10.0;
}

double r = (m+M)/2.0;
double u = f( r * c, c, n );
int i = 0;
while (u != 0 && i < 1000) {
if (u > 0) m = r;
else M = r;
r = (m+M)/2.0;
u = f( r * c, c, n );
i++;
}
if (i == 1000) {
System.err.println("Iterated 1000 times, returning estimate");
}

return (long) (c * (m+M)/2.0);
}
else {
return null;
}
}

/** Method that is used in the computation of estimated library size. */
private static double f(double x, double c, double n) {
return c/x - 1 + Math.exp(-n/x);
}
}
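The bisection in `estimateLibrarySize` solves the Lander-Waterman relation C/X = 1 - exp(-N/X) for the library size X. A rough shell/awk port (my own sketch, not part of the commit) makes the numerics easy to poke at from the command line:

```shell
#!/bin/bash
# Sketch: solve c/x = 1 - exp(-n/x) for x by bisection, mirroring
# estimateLibrarySize in LibraryComplexity.java.
# $1 = total read pairs (n), $2 = unique read pairs (c).
estimate_library_size() {
  awk -v n="$1" -v c="$2" 'BEGIN {
    if (n <= 0 || c <= 0 || c >= n) { print "NA"; exit 1 }
    m = 1.0; M = 100.0
    # Grow the bracket until f(M*c) < 0, where f(x) = c/x - 1 + exp(-n/x)
    while (c/(M*c) - 1 + exp(-n/(M*c)) >= 0) { m = M; M *= 10.0 }
    # Bisect on the multiplier of c, as the Java code does
    for (i = 0; i < 1000; i++) {
      r = (m + M) / 2.0
      u = c/(r*c) - 1 + exp(-n/(r*c))
      if (u > 0) m = r; else M = r
    }
    printf "%d\n", c * (m + M) / 2.0
  }'
}
# 100 read pairs, 90 unique: estimated library size comes out near 5x the input
estimate_library_size 100 90
```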
12 changes: 9 additions & 3 deletions AWS/scripts/check.sh
#!/bin/bash

##########
#The MIT License (MIT)
#
#
# Sanity check once pipeline is complete to be sure number of lines adds up, deduping
# proceeded appropriately, and that files were created

# Juicer version 1.5
if [ -z "$ARG1" ]
then
# Start by checking the statistics to see if they add up
if [ $check1 -eq 0 ] || [ -z "$res1" ]
then
echo "***! Error! The statistics do not add up. Alignment likely failed to complete on one or more files. Run relaunch_prep.sh"
echo "Stats don't add up. Check ${outputdir} for results"
exit 100
fi
fi
total2=`ls -l ${outputdir}/merged_nodups.txt ${outputdir}/dups.txt ${outputdir}/opt_dups.txt | awk '{sum = sum + $5}END{print sum}'`
if [ -z "$total" ] || [ -z "$total2" ] || [ "$total" -ne "$total2" ]
then
echo "***! Error! The sorted file and dups/no dups files do not add up, or were empty. Merge or dedupping likely failed, restart pipeline with -S merge or -S dedup"
echo "Dups don't add up. Check ${outputdir} for results"
exit 100
fi

wctotal=`cat ${splitdir}/*_linecount.txt | awk '{sum+=$1}END{print sum/4}'`
check2=`cat ${splitdir}/*norm*res* | awk '{s2+=$2;}END{print s2}'`
check4=`cat ${splitdir}/*norm*res* | awk '{s4+=$4;}END{print s4}'`
# the below, check4, is still buggy and needs to be corrected
check4=`cat ${splitdir}/*norm*res* | awk '{s+=$3+$4+$5+$6;}END{print s}'`

if [ -z "$ARG1" ]
then
if [ $wctotal -ne $check2 ]
then
echo "***! Error! The number of reads in the fastqs (${wctotal}) is not the same as the number of reads reported in the stats (${check2}), likely due to a failure during alignment"
echo "Reads don't add up. Check ${outputdir} for results"
exit 100
fi
else
if [ $wctotal -ne $calctotal ]
then
echo "***! Error! The number of reads in the fastqs (${wctotal}) is not the same as the number of alignable reads reported in the stats (${check4}) plus unmapped/ambiguous (${unmapamb}), likely due to a failure during alignment. Run relaunch_prep.sh"
echo "Reads don't add up. Check ${outputdir} for results"
exit 100
else
echo "Total: $wctotal"
if [ -f ${outputdir}/inter.hic ] && [ -s ${outputdir}/inter.hic ] && [ -f ${outputdir}/inter_30.hic ] && [ -s ${outputdir}/inter_30.hic ]
then
echo "(-: Pipeline successfully completed (-:";
echo "Run cleanup.sh to remove the splits directory";
echo "Check ${outputdir} for results"
else
echo "***! Error! Either inter.hic or inter_30.hic were not created"
echo "Either inter.hic or inter_30.hic were not created. Check ${outputdir} for results"
exit 100
fi
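The `wctotal` computation in check.sh relies on the FASTQ convention of four lines per read: the `*_linecount.txt` files hold raw line counts and the awk sum divides by 4. A toy sketch of that arithmetic (file names are illustrative, not the pipeline's actual splits):

```shell
#!/bin/bash
# Sketch: count reads in a FASTQ as total lines / 4, the same arithmetic
# check.sh applies to the *_linecount.txt files.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > "$tmp/toy.fastq"
wc -l < "$tmp/toy.fastq" > "$tmp/toy_linecount.txt"
reads=$(cat "$tmp"/*_linecount.txt | awk '{sum+=$1} END {print sum/4}')
echo "$reads"
```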
8 changes: 8 additions & 0 deletions AWS/scripts/chimeric_blacklist.awk
# one non-ligation junction end is far from the others, are sent to fname2.
#
# awk -v fname1="norm_chimera" -v fname2="abnorm_chimera" -v fname3="unmapped" -f chimeric_blacklist.awk
# Juicer version 1.5

# returns absolute value
function abs(value)
count_unmapped++;
}
}
else if (count == 1) {
# this actually shouldn't happen, but it happens with alternate aligners on occasion
count_abnorm++;
for (i in c) {
print c[i] > fname2;
}
}
# reset variables
delete c;
count=1;
2 changes: 1 addition & 1 deletion AWS/scripts/cleanup.sh
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
##########

# Script to clean up big repetitive files and zip fastqs. Run after you are
# sure the pipeline ran successfully. Run from top directory (HIC001 e.g.).
# Juicer version 1.5
total=`ls -l aligned/merged_sort.txt | awk '{print $5}'`
total2=`ls -l aligned/merged_nodups.txt aligned/dups.txt aligned/opt_dups.txt | awk '{sum = sum + $5}END{print sum}'`
if [ "$total" -eq "$total2" ]
then
    rm aligned/merged_sort.txt
    rm -r splits
else
    echo "Problem: The sum of merged_nodups and the dups files is not the same size as merged_sort.txt"
fi
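cleanup.sh's safety check compares byte sizes: merged_sort.txt must equal the sum of the deduped and duplicate files before anything is deleted. A toy sketch of that size comparison (tiny stand-in files in a temp directory, and no deletion step):

```shell
#!/bin/bash
# Sketch: verify the parts sum to the whole by byte size, mirroring
# cleanup.sh's guard before it removes merged_sort.txt.
tmp=$(mktemp -d)
printf 'AAAA' > "$tmp/merged_nodups.txt"     # 4 bytes
printf 'BB'   > "$tmp/dups.txt"              # 2 bytes
printf 'CC'   > "$tmp/opt_dups.txt"          # 2 bytes
printf 'AAAABBCC' > "$tmp/merged_sort.txt"   # 8 bytes
total=$(ls -l "$tmp/merged_sort.txt" | awk '{print $5}')
total2=$(ls -l "$tmp/merged_nodups.txt" "$tmp/dups.txt" "$tmp/opt_dups.txt" | awk '{sum += $5} END {print sum}')
echo "$total $total2"
```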