diff --git a/README.md b/README.md index d102c83..24df3a9 100644 --- a/README.md +++ b/README.md @@ -15,6 +15,7 @@ Scoary is designed to take the gene_presence_absence.csv file from [Roary] (http - [License] (#license) - [Etymology] (#etymology) - [Bugs] (#bugs) +- [FAQ] (#faq) - [Coming soon] (#coming-soon) - [Acknowledgements] (#acknowledgements) - [Feedback] (#feedback) @@ -22,6 +23,10 @@ Scoary is designed to take the gene_presence_absence.csv file from [Roary] (http - [Contact] (#contact) ## What's new? +v1.3.4 (16th Jun 2016) +- Scoary no longer crashes when using Scipy 0.16 instead of 0.17. +- More information about what's going on is printed. (Useful for very large datasets that take long to analyze) + v1.3.3 (9th Jun 2016) - BUG FIX: Tree calculation had been broken since 1.3.2 yesterday. Sorry about that. @@ -74,8 +79,8 @@ v1.1 (29th Mar 2016) ## Dependencies -- Python (Tested in 2.7 and 3.5) -- [SciPy 0.17 or greater.] (http://www.scipy.org/install.html) Note: 0.16 or lower will **not** work! +- Python (Tested with versions 2.7 and 3.5) +- [SciPy] (http://www.scipy.org/install.html) (Tested with versions 0.16 and 0.17) ## Installation @@ -174,7 +179,7 @@ optional arguments: -p P_VALUE_CUTOFF, --p_value_cutoff P_VALUE_CUTOFF P-value cut-off. SCOARY will not report genes with higher p-values than this. Set to 1.0 to report all - genes. Default = 0.05 + genes. Accepts standard form (e.g. 1E-8). Default = 0.05 -c {Individual,Bonferroni,Benjamini-Hochberg}, --correction {Individual,Bonferroni,Benjamini-Hochberg} Instead of cutting off at the individual test p-value (option -p), use the indicated corrected p-value for @@ -194,11 +199,7 @@ optional arguments: indexing) --delimiter DELIMITER The delimiter between cells in the gene - presence/absence and trait files. NOTE: Even though - commas are the default they might mess with the - annotation column, and it is therefore recommended to - save your files using semicolon instead. - SCOARY will output files delimited by semicolon + presence/absence and trait files. --version Display Scoary version, and exit. ``` @@ -218,6 +219,9 @@ This will restrict the current analysis to isolates 1,2,4 and 9, and will omit a #### The -s parameter The **-s** parameter is used to indicate to Scoary which column in the gene_presence_absence.csv file is the _first_ column representing an isolate. By default it is set to 15 (1-based indexing). +#### The -p, -m and -c parameters +These parameters control your output. **-m** sets a hard cut-off on the number of hits reported. With **-p** you can set that no gene with a higher p-value will be reported. (Tip: Set this to 1.0 to report every single gene). You can mix these parameters with **-c**. If you only wanted genes with a Bonferroni-adjusted p-value < 1E-10 you could use _-p 1E-10 -c Bonferroni_. + #### The -u flag Calling Scoary with the **-u** flag will cause it to write a newick file of the UPGMA tree that is calculated internally. The tree is based on pairwise Hamming distances in the gene_presence_absence matrix. @@ -249,10 +253,16 @@ Scoary is freely available under a GPLv3 license. Scoary is an anagram of "scoring" and "Roary", the pan-genome pipeline. It was named as an homage to Roary. ## Bugs -Known bugs: -- I'm currently (8th Jun 2016) not aware of any bugs. - Please report bugs here (Issues) or to me directly at olbb@fhi.no +## FAQ +- **How can you justify p=0.5 in your pairwise comparisons method? Is this species-specific?** + +The reasoning is as follows: Scoary first finds the maximum number of independent contrasting pairs in a phylogenetic tree, irrespective of gene-trait status. Thus, AB-ab pairs should be equally likely as Ab-aB pairs if your null hypothesis is true. Your null hypothesis in this case, is that there is no detectable association between A/a and B/b. If AB-ab pairs are much more common than Ab-aB pairs then you can be confident that the true p was not 0.5. And if this is the case then then there seems to be an association between your A/a (your gene) and your B/b (your phenotype). A justification for this way of testing can be found in Read and Nee, 1995. +- **Why is my "Best_pairwise_comp_p" higher than my "Worst_pairwise_comp_p"?** + +The "best" and "worst" labels are attached to the odds ratio of the gene in the non-population structure-corrected analysis. For example, you may find an odds ratio of 2.0 for a particular gene, meaning presence of the gene was tied to presence of the phenotype. But when you inspect your pairwise comparisons p-values you see that the "best" p-value was 0.2 and the "worst" was 1.0E-5. This means that in your phylogenetic tree, an enrichment of Ab-aB pairs was more common. In other words, the presence of this gene actually seems associated to a _silencing_ of the phenotype, in spite of your original odds ratio. Note that the odds ratio can be inflated for example by sampling of very closely related isolates. + ## Coming soon Please feel free to suggest improvements, point out bugs or methods that could be better optimized. diff --git a/scoary/__init__.py b/scoary/__init__.py index 07f744c..ac422f1 100644 --- a/scoary/__init__.py +++ b/scoary/__init__.py @@ -1 +1 @@ -__version__ = '1.3.3' +__version__ = '1.3.4' diff --git a/scoary/methods.py b/scoary/methods.py index 39c0ab3..7ec6962 100644 --- a/scoary/methods.py +++ b/scoary/methods.py @@ -49,7 +49,8 @@ def main(): parser.add_argument('-p', '--p_value_cutoff', help='P-value cut-off. SCOARY will not report genes ' 'with higher p-values than this. Set to 1.0 to report ' - 'all genes. Default = 0.05', + 'all genes. Accepts standard form (e.g. 1E-8). ' + 'Default = 0.05', default=0.05, type=float) parser.add_argument('-c', '--correction', @@ -83,11 +84,7 @@ def main(): action='store_true') parser.add_argument('--delimiter', help='The delimiter between cells in the gene ' - 'presence/absence and trait files. NOTE: Even though ' - 'commas are the default they might mess with the ' - 'annotation column, and it is therefore recommended to ' - 'save your files using semicolon instead. ' - 'SCOARY will output files delimited by semicolon', + 'presence/absence and trait files. ', default=',', type=str) parser.add_argument('--version', help='Display Scoary version, and exit.', @@ -110,7 +107,8 @@ def main(): # this actually means all isolates are allowed # and included in the analysis allowed_isolates = None - + + print("Reading gene presence absence file") genedic_and_matrix = Csv_to_dic_Roary(genes, args.delimiter, startcol=args.start_col - 1, @@ -118,11 +116,12 @@ def main(): genedic = genedic_and_matrix["Roarydic"] zeroonesmatrix = genedic_and_matrix["Zero_ones_matrix"] strains = genedic_and_matrix["Strains"] - + print("Creating Hamming distance matrix based on gene presence/absence") TDM = CreateTriangularDistanceMatrix(zeroonesmatrix, strains) QT = PopulateQuadTreeWithDistances(TDM) + print("Building UPGMA tree from distance matrix") upgmatree = upgma(QT) - + print("Reading traits file") traitsdic = Csv_to_dic(traits, args.delimiter, allowed_isolates) print("Finished loading files into memory.") @@ -154,8 +153,10 @@ def CreateTriangularDistanceMatrix(zeroonesmatrix, strainnames): triangular matrix of pairwise Hamming distances. The distance d(i,i) is set to 1 for all i. """ - - hamming_distances = list(spatial.distance.pdist(zeroonesmatrix, 'hamming')) + try: + hamming_distances = list(spatial.distance.pdist(zeroonesmatrix, 'hamming')) + except TypeError: + sys.exit("Could not locate scipy.spatial.distance.pdist. Perhaps you have an old version of SciPy installed?") nstrains = int((1 + (1 + 8*len(hamming_distances))**0.5)/2) TriangularDistanceMatrix = [] Strain_names = [] @@ -215,7 +216,11 @@ def Csv_to_dic_Roary(genefile, delimiter, startcol=0, allowed_isolates=None): for line in csvfile: q = line - r[q[genecol]] = {"Non-unique Gene name": q[nugcol], "Annotation": q[anncol]} if roaryfile else {} + try: + r[q[genecol]] = {"Non-unique Gene name": q[nugcol], "Annotation": q[anncol]} if roaryfile else {} + except IndexError: + sys.exit("ERROR: Could not read gene presence absence file. Verify that this file is a proper Roary file " + "using the specified delimiter (default is ',').") # The zero_ones_line represents the presence (1) or absence (0) of a gene. It is used for calculating distances between strains. zero_ones_line = [] @@ -258,7 +263,7 @@ def Csv_to_dic(csvfile, delimiter, allowed_isolates): name_trait = p[""] del p[""] elif "Name" in p: - name_trait = p["name"] + name_trait = p["Name"] del p["Name"] else: sys.exit("Make sure the top-left cell in the traits file is either empty or 'Name'. Do not include empty rows") @@ -414,7 +419,7 @@ def StoreResults(Results, max_hits, p_cutoff, correctionmethod, upgmatree, GTC): A method for storing the results. Calls StoreTraitResult for each trait column in the input file """ for Trait in Results: - print("Storing results: " + Trait) + print("\nStoring results: " + Trait) StoreTraitResult(Results[Trait], Trait, max_hits, p_cutoff, correctionmethod, upgmatree, GTC) @@ -443,7 +448,7 @@ def StoreTraitResult(Trait, Traitname, max_hits, p_cutoff, correctionmethod, upg # Start with lowest p-value, the one which has key 0 in sort_instructions currentgene = sort_instructions[x] if (Trait[currentgene][cut_possibilities[correctionmethod]] > p_cutoff): - sys.stdout.write("\r100.00%\n") + sys.stdout.write("\r100.00%") sys.stdout.flush() break @@ -452,14 +457,15 @@ def StoreTraitResult(Trait, Traitname, max_hits, p_cutoff, correctionmethod, upg max_total_pairs = Max_pairwise_comparisons["Total"] max_propairs = Max_pairwise_comparisons["Pro"] max_antipairs = Max_pairwise_comparisons["Anti"] - best_pairwise_comparison_p = ss.binom_test(max_propairs, - max_total_pairs, - 0.5, - alternative="greater") - worst_pairwise_comparison_p = ss.binom_test(max_total_pairs-max_antipairs, - max_total_pairs, - 0.5, - alternative="greater") + try: + best_pairwise_comparison_p = ss.binom_test(max_propairs, + max_total_pairs, + 0.5) / 2 + worst_pairwise_comparison_p = ss.binom_test(max_total_pairs-max_antipairs, + max_total_pairs, + 0.5) / 2 + except TypeError: + sys.exit("There was a problem using scipy.stats.binom_test. Ensure you have a recent distribution of SciPy installed.") outfile.write('"' + currentgene + '";"' + str(Trait[currentgene]["NUGN"]) + '";"' + str(Trait[currentgene]["Annotation"]) + '";"' + str(Trait[currentgene]["tpgp"]) + '";"' + str(Trait[currentgene]["tngp"]) + '";"' + str(Trait[currentgene]["tpgn"]) +