Merge pull request #24 from AdmiralenOla/development

1.3.4
AdmiralenOla · Jun 16, 2016 · c1d7322 · c1d7322
2 parents d1e90b1 + 27f7ab7
commit c1d7322
Show file tree

Hide file tree

Showing 3 changed files with 50 additions and 34 deletions.
diff --git a/README.md b/README.md
@@ -15,13 +15,18 @@ Scoary is designed to take the gene_presence_absence.csv file from [Roary] (http
 - [License] (#license)
 - [Etymology] (#etymology)
 - [Bugs] (#bugs)
+- [FAQ] (#faq)
 - [Coming soon] (#coming-soon)
 - [Acknowledgements] (#acknowledgements)
 - [Feedback] (#feedback)
 - [Citation] (#citation)
 - [Contact] (#contact)
 
 ## What's new?
+v1.3.4 (16th Jun 2016)
+- Scoary no longer crashes when using Scipy 0.16 instead of 0.17.
+- More information about what's going on is printed. (Useful for very large datasets that take long to analyze)
+
 v1.3.3 (9th Jun 2016)
 - BUG FIX: Tree calculation had been broken since 1.3.2 yesterday. Sorry about that.
 
@@ -74,8 +79,8 @@ v1.1 (29th Mar 2016)
 
 ## Dependencies
 
-- Python (Tested in 2.7 and 3.5)
-- [SciPy 0.17 or greater.] (http://www.scipy.org/install.html) Note: 0.16 or lower will **not** work!
+- Python (Tested with versions 2.7 and 3.5)
+- [SciPy] (http://www.scipy.org/install.html) (Tested with versions 0.16 and 0.17)
 
 
 ## Installation
@@ -174,7 +179,7 @@ optional arguments:
   -p P_VALUE_CUTOFF, --p_value_cutoff P_VALUE_CUTOFF
                         P-value cut-off. SCOARY will not report genes with
                         higher p-values than this. Set to 1.0 to report all
-                        genes. Default = 0.05
+                        genes. Accepts standard form (e.g. 1E-8). Default = 0.05
   -c {Individual,Bonferroni,Benjamini-Hochberg}, --correction {Individual,Bonferroni,Benjamini-Hochberg}
                         Instead of cutting off at the individual test p-value
                         (option -p), use the indicated corrected p-value for
@@ -194,11 +199,7 @@ optional arguments:
                         indexing)
   --delimiter DELIMITER
                         The delimiter between cells in the gene
-                        presence/absence and trait files. NOTE: Even though
-                        commas are the default they might mess with the
-                        annotation column, and it is therefore recommended to
-                        save your files using semicolon instead.
-                        SCOARY will output files delimited by semicolon
+                        presence/absence and trait files.
   --version             Display Scoary version, and exit.
 
 ```
@@ -218,6 +219,9 @@ This will restrict the current analysis to isolates 1,2,4 and 9, and will omit a
 #### The -s parameter
 The **-s** parameter is used to indicate to Scoary which column in the gene_presence_absence.csv file is the _first_ column representing an isolate. By default it is set to 15 (1-based indexing).
 
+#### The -p, -m and -c parameters
+These parameters control your output. **-m** sets a hard cut-off on the number of hits reported. With **-p** you can set that no gene with a higher p-value will be reported. (Tip: Set this to 1.0 to report every single gene). You can mix these parameters with **-c**. If you only wanted genes with a Bonferroni-adjusted p-value < 1E-10 you could use _-p 1E-10 -c Bonferroni_.
+
 #### The -u flag
 Calling Scoary with the **-u** flag will cause it to write a newick file of the UPGMA tree that is calculated internally. The tree is based on pairwise Hamming distances in the gene_presence_absence matrix.
 
@@ -249,10 +253,16 @@ Scoary is freely available under a GPLv3 license.
 Scoary is an anagram of "scoring" and "Roary", the pan-genome pipeline. It was named as an homage to Roary.
 
 ## Bugs
-Known bugs:
-- I'm currently (8th Jun 2016) not aware of any bugs.
 - Please report bugs here (Issues) or to me directly at [email protected]
 
+## FAQ
+- **How can you justify p=0.5 in your pairwise comparisons method? Is this species-specific?**
+
+The reasoning is as follows: Scoary first finds the maximum number of independent contrasting pairs in a phylogenetic tree, irrespective of gene-trait status. Thus, AB-ab pairs should be equally likely as Ab-aB pairs if your null hypothesis is true. Your null hypothesis in this case, is that there is no detectable association between A/a and B/b. If AB-ab pairs are much more common than Ab-aB pairs then you can be confident that the true p was not 0.5. And if this is the case then then there seems to be an association between your A/a (your gene) and your B/b (your phenotype). A justification for this way of testing can be found in Read and Nee, 1995.
+- **Why is my "Best_pairwise_comp_p" higher than my "Worst_pairwise_comp_p"?**
+
+The "best" and "worst" labels are attached to the odds ratio of the gene in the non-population structure-corrected analysis. For example, you may find an odds ratio of 2.0 for a particular gene, meaning presence of the gene was tied to presence of the phenotype. But when you inspect your pairwise comparisons p-values you see that the "best" p-value was 0.2 and the "worst" was 1.0E-5. This means that in your phylogenetic tree, an enrichment of Ab-aB pairs was more common. In other words, the presence of this gene actually seems associated to a _silencing_ of the phenotype, in spite of your original odds ratio. Note that the odds ratio can be inflated for example by sampling of very closely related isolates.
+
 ## Coming soon
 Please feel free to suggest improvements, point out bugs or methods that could be better optimized.
 

diff --git a/scoary/__init__.py b/scoary/__init__.py
@@ -1 +1 @@
-__version__ = '1.3.3'
+__version__ = '1.3.4'
diff --git a/scoary/methods.py b/scoary/methods.py
@@ -49,7 +49,8 @@ def main():
     parser.add_argument('-p', '--p_value_cutoff',
                         help='P-value cut-off. SCOARY will not report genes '
                         'with higher p-values than this. Set to 1.0 to report '
-                        'all genes. Default = 0.05',
+                        'all genes. Accepts standard form (e.g. 1E-8). '
+                        'Default = 0.05',
                         default=0.05,
                         type=float)
     parser.add_argument('-c', '--correction',
@@ -83,11 +84,7 @@ def main():
                         action='store_true')
     parser.add_argument('--delimiter',
                         help='The delimiter between cells in the gene '
-                        'presence/absence and trait files. NOTE: Even though '
-                        'commas are the default they might mess with the '
-                        'annotation column, and it is therefore recommended to '
-                        'save your files using semicolon instead. '
-                        'SCOARY will output files delimited by semicolon',
+                        'presence/absence and trait files. ',
                         default=',',
                         type=str)
     parser.add_argument('--version', help='Display Scoary version, and exit.',
@@ -110,19 +107,21 @@ def main():
             # this actually means all isolates are allowed
             # and included in the analysis
             allowed_isolates = None
-
+
+        print("Reading gene presence absence file")    
         genedic_and_matrix = Csv_to_dic_Roary(genes,
                                               args.delimiter,
                                               startcol=args.start_col - 1,
                                               allowed_isolates=allowed_isolates)
         genedic = genedic_and_matrix["Roarydic"]
         zeroonesmatrix = genedic_and_matrix["Zero_ones_matrix"]
         strains = genedic_and_matrix["Strains"]
-
+        print("Creating Hamming distance matrix based on gene presence/absence")
         TDM = CreateTriangularDistanceMatrix(zeroonesmatrix, strains)
         QT = PopulateQuadTreeWithDistances(TDM)
+        print("Building UPGMA tree from distance matrix")
         upgmatree = upgma(QT)
-
+        print("Reading traits file")
         traitsdic = Csv_to_dic(traits, args.delimiter, allowed_isolates)
 
         print("Finished loading files into memory.")
@@ -154,8 +153,10 @@ def CreateTriangularDistanceMatrix(zeroonesmatrix, strainnames):
     triangular matrix of pairwise Hamming distances.
     The distance d(i,i) is set to 1 for all i.
     """
-
-    hamming_distances = list(spatial.distance.pdist(zeroonesmatrix, 'hamming'))
+    try:
+        hamming_distances = list(spatial.distance.pdist(zeroonesmatrix, 'hamming'))
+    except TypeError:
+        sys.exit("Could not locate scipy.spatial.distance.pdist. Perhaps you have an old version of SciPy installed?")
     nstrains = int((1 + (1 + 8*len(hamming_distances))**0.5)/2)
     TriangularDistanceMatrix = []
     Strain_names = []
@@ -215,7 +216,11 @@ def Csv_to_dic_Roary(genefile, delimiter, startcol=0, allowed_isolates=None):
 
     for line in csvfile:
         q = line
-        r[q[genecol]] = {"Non-unique Gene name": q[nugcol], "Annotation": q[anncol]} if roaryfile else {}
+        try:
+            r[q[genecol]] = {"Non-unique Gene name": q[nugcol], "Annotation": q[anncol]} if roaryfile else {}
+        except IndexError:
+            sys.exit("ERROR: Could not read gene presence absence file. Verify that this file is a proper Roary file "
+            "using the specified delimiter (default is ',').")
         # The zero_ones_line represents the presence (1) or absence (0) of a gene. It is used for calculating distances between strains.
         zero_ones_line = []
 
@@ -258,7 +263,7 @@ def Csv_to_dic(csvfile, delimiter, allowed_isolates):
             name_trait = p[""]
             del p[""]
         elif "Name" in p:
-            name_trait = p["name"]
+            name_trait = p["Name"]
             del p["Name"]
         else:
             sys.exit("Make sure the top-left cell in the traits file is either empty or 'Name'. Do not include empty rows")
@@ -414,7 +419,7 @@ def StoreResults(Results, max_hits, p_cutoff, correctionmethod, upgmatree, GTC):
     A method for storing the results. Calls StoreTraitResult for each trait column in the input file
     """
     for Trait in Results:
-        print("Storing results: " + Trait)
+        print("\nStoring results: " + Trait)
         StoreTraitResult(Results[Trait], Trait, max_hits, p_cutoff, correctionmethod, upgmatree, GTC)
 
 
@@ -443,7 +448,7 @@ def StoreTraitResult(Trait, Traitname, max_hits, p_cutoff, correctionmethod, upg
             # Start with lowest p-value, the one which has key 0 in sort_instructions
             currentgene = sort_instructions[x]
             if (Trait[currentgene][cut_possibilities[correctionmethod]] > p_cutoff):
-                sys.stdout.write("\r100.00%\n")
+                sys.stdout.write("\r100.00%")
                 sys.stdout.flush()
                 break
 
@@ -452,14 +457,15 @@ def StoreTraitResult(Trait, Traitname, max_hits, p_cutoff, correctionmethod, upg
             max_total_pairs = Max_pairwise_comparisons["Total"]
             max_propairs = Max_pairwise_comparisons["Pro"]
             max_antipairs = Max_pairwise_comparisons["Anti"]
-            best_pairwise_comparison_p = ss.binom_test(max_propairs,
-                                                       max_total_pairs,
-                                                       0.5,
-                                                       alternative="greater")
-            worst_pairwise_comparison_p = ss.binom_test(max_total_pairs-max_antipairs,
-                                                        max_total_pairs,
-                                                        0.5,
-                                                        alternative="greater")
+            try:
+                best_pairwise_comparison_p = ss.binom_test(max_propairs,
+                                                           max_total_pairs,
+                                                           0.5) / 2
+                worst_pairwise_comparison_p = ss.binom_test(max_total_pairs-max_antipairs,
+                                                            max_total_pairs,
+                                                            0.5) / 2
+            except TypeError:
+                sys.exit("There was a problem using scipy.stats.binom_test. Ensure you have a recent distribution of SciPy installed.")
 
             outfile.write('"' + currentgene + '";"' + str(Trait[currentgene]["NUGN"]) + '";"' + str(Trait[currentgene]["Annotation"]) +
             '";"' + str(Trait[currentgene]["tpgp"]) + '";"' + str(Trait[currentgene]["tngp"]) + '";"' + str(Trait[currentgene]["tpgn"]) +