Skip to content

Commit

Permalink
Merge pull request #24 from AdmiralenOla/development
Browse files Browse the repository at this point in the history
1.3.4
  • Loading branch information
AdmiralenOla authored Jun 16, 2016
2 parents d1e90b1 + 27f7ab7 commit c1d7322
Show file tree
Hide file tree
Showing 3 changed files with 50 additions and 34 deletions.
30 changes: 20 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,13 +15,18 @@ Scoary is designed to take the gene_presence_absence.csv file from [Roary] (http
- [License] (#license)
- [Etymology] (#etymology)
- [Bugs] (#bugs)
- [FAQ] (#faq)
- [Coming soon] (#coming-soon)
- [Acknowledgements] (#acknowledgements)
- [Feedback] (#feedback)
- [Citation] (#citation)
- [Contact] (#contact)

## What's new?
v1.3.4 (16th Jun 2016)
- Scoary no longer crashes when using Scipy 0.16 instead of 0.17.
- More information about what's going on is printed. (Useful for very large datasets that take long to analyze)

v1.3.3 (9th Jun 2016)
- BUG FIX: Tree calculation had been broken since 1.3.2 yesterday. Sorry about that.

Expand Down Expand Up @@ -74,8 +79,8 @@ v1.1 (29th Mar 2016)

## Dependencies

- Python (Tested in 2.7 and 3.5)
- [SciPy 0.17 or greater.] (http://www.scipy.org/install.html) Note: 0.16 or lower will **not** work!
- Python (Tested with versions 2.7 and 3.5)
- [SciPy] (http://www.scipy.org/install.html) (Tested with versions 0.16 and 0.17)


## Installation
Expand Down Expand Up @@ -174,7 +179,7 @@ optional arguments:
-p P_VALUE_CUTOFF, --p_value_cutoff P_VALUE_CUTOFF
P-value cut-off. SCOARY will not report genes with
higher p-values than this. Set to 1.0 to report all
genes. Default = 0.05
genes. Accepts standard form (e.g. 1E-8). Default = 0.05
-c {Individual,Bonferroni,Benjamini-Hochberg}, --correction {Individual,Bonferroni,Benjamini-Hochberg}
Instead of cutting off at the individual test p-value
(option -p), use the indicated corrected p-value for
Expand All @@ -194,11 +199,7 @@ optional arguments:
indexing)
--delimiter DELIMITER
The delimiter between cells in the gene
presence/absence and trait files. NOTE: Even though
commas are the default they might mess with the
annotation column, and it is therefore recommended to
save your files using semicolon instead.
SCOARY will output files delimited by semicolon
presence/absence and trait files.
--version Display Scoary version, and exit.
```
Expand All @@ -218,6 +219,9 @@ This will restrict the current analysis to isolates 1,2,4 and 9, and will omit a
#### The -s parameter
The **-s** parameter is used to indicate to Scoary which column in the gene_presence_absence.csv file is the _first_ column representing an isolate. By default it is set to 15 (1-based indexing).

#### The -p, -m and -c parameters
These parameters control your output. **-m** sets a hard cut-off on the number of hits reported. With **-p** you can set that no gene with a higher p-value will be reported. (Tip: Set this to 1.0 to report every single gene). You can mix these parameters with **-c**. If you only wanted genes with a Bonferroni-adjusted p-value < 1E-10 you could use _-p 1E-10 -c Bonferroni_.

#### The -u flag
Calling Scoary with the **-u** flag will cause it to write a newick file of the UPGMA tree that is calculated internally. The tree is based on pairwise Hamming distances in the gene_presence_absence matrix.

Expand Down Expand Up @@ -249,10 +253,16 @@ Scoary is freely available under a GPLv3 license.
Scoary is an anagram of "scoring" and "Roary", the pan-genome pipeline. It was named as an homage to Roary.

## Bugs
Known bugs:
- I'm currently (8th Jun 2016) not aware of any bugs.
- Please report bugs here (Issues) or to me directly at [email protected]

## FAQ
- **How can you justify p=0.5 in your pairwise comparisons method? Is this species-specific?**

The reasoning is as follows: Scoary first finds the maximum number of independent contrasting pairs in a phylogenetic tree, irrespective of gene-trait status. Thus, AB-ab pairs should be equally likely as Ab-aB pairs if your null hypothesis is true. Your null hypothesis in this case, is that there is no detectable association between A/a and B/b. If AB-ab pairs are much more common than Ab-aB pairs then you can be confident that the true p was not 0.5. And if this is the case then then there seems to be an association between your A/a (your gene) and your B/b (your phenotype). A justification for this way of testing can be found in Read and Nee, 1995.
- **Why is my "Best_pairwise_comp_p" higher than my "Worst_pairwise_comp_p"?**

The "best" and "worst" labels are attached to the odds ratio of the gene in the non-population structure-corrected analysis. For example, you may find an odds ratio of 2.0 for a particular gene, meaning presence of the gene was tied to presence of the phenotype. But when you inspect your pairwise comparisons p-values you see that the "best" p-value was 0.2 and the "worst" was 1.0E-5. This means that in your phylogenetic tree, an enrichment of Ab-aB pairs was more common. In other words, the presence of this gene actually seems associated to a _silencing_ of the phenotype, in spite of your original odds ratio. Note that the odds ratio can be inflated for example by sampling of very closely related isolates.

## Coming soon
Please feel free to suggest improvements, point out bugs or methods that could be better optimized.

Expand Down
2 changes: 1 addition & 1 deletion scoary/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = '1.3.3'
__version__ = '1.3.4'
52 changes: 29 additions & 23 deletions scoary/methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,8 @@ def main():
parser.add_argument('-p', '--p_value_cutoff',
help='P-value cut-off. SCOARY will not report genes '
'with higher p-values than this. Set to 1.0 to report '
'all genes. Default = 0.05',
'all genes. Accepts standard form (e.g. 1E-8). '
'Default = 0.05',
default=0.05,
type=float)
parser.add_argument('-c', '--correction',
Expand Down Expand Up @@ -83,11 +84,7 @@ def main():
action='store_true')
parser.add_argument('--delimiter',
help='The delimiter between cells in the gene '
'presence/absence and trait files. NOTE: Even though '
'commas are the default they might mess with the '
'annotation column, and it is therefore recommended to '
'save your files using semicolon instead. '
'SCOARY will output files delimited by semicolon',
'presence/absence and trait files. ',
default=',',
type=str)
parser.add_argument('--version', help='Display Scoary version, and exit.',
Expand All @@ -110,19 +107,21 @@ def main():
# this actually means all isolates are allowed
# and included in the analysis
allowed_isolates = None


print("Reading gene presence absence file")
genedic_and_matrix = Csv_to_dic_Roary(genes,
args.delimiter,
startcol=args.start_col - 1,
allowed_isolates=allowed_isolates)
genedic = genedic_and_matrix["Roarydic"]
zeroonesmatrix = genedic_and_matrix["Zero_ones_matrix"]
strains = genedic_and_matrix["Strains"]

print("Creating Hamming distance matrix based on gene presence/absence")
TDM = CreateTriangularDistanceMatrix(zeroonesmatrix, strains)
QT = PopulateQuadTreeWithDistances(TDM)
print("Building UPGMA tree from distance matrix")
upgmatree = upgma(QT)

print("Reading traits file")
traitsdic = Csv_to_dic(traits, args.delimiter, allowed_isolates)

print("Finished loading files into memory.")
Expand Down Expand Up @@ -154,8 +153,10 @@ def CreateTriangularDistanceMatrix(zeroonesmatrix, strainnames):
triangular matrix of pairwise Hamming distances.
The distance d(i,i) is set to 1 for all i.
"""

hamming_distances = list(spatial.distance.pdist(zeroonesmatrix, 'hamming'))
try:
hamming_distances = list(spatial.distance.pdist(zeroonesmatrix, 'hamming'))
except TypeError:
sys.exit("Could not locate scipy.spatial.distance.pdist. Perhaps you have an old version of SciPy installed?")
nstrains = int((1 + (1 + 8*len(hamming_distances))**0.5)/2)
TriangularDistanceMatrix = []
Strain_names = []
Expand Down Expand Up @@ -215,7 +216,11 @@ def Csv_to_dic_Roary(genefile, delimiter, startcol=0, allowed_isolates=None):

for line in csvfile:
q = line
r[q[genecol]] = {"Non-unique Gene name": q[nugcol], "Annotation": q[anncol]} if roaryfile else {}
try:
r[q[genecol]] = {"Non-unique Gene name": q[nugcol], "Annotation": q[anncol]} if roaryfile else {}
except IndexError:
sys.exit("ERROR: Could not read gene presence absence file. Verify that this file is a proper Roary file "
"using the specified delimiter (default is ',').")
# The zero_ones_line represents the presence (1) or absence (0) of a gene. It is used for calculating distances between strains.
zero_ones_line = []

Expand Down Expand Up @@ -258,7 +263,7 @@ def Csv_to_dic(csvfile, delimiter, allowed_isolates):
name_trait = p[""]
del p[""]
elif "Name" in p:
name_trait = p["name"]
name_trait = p["Name"]
del p["Name"]
else:
sys.exit("Make sure the top-left cell in the traits file is either empty or 'Name'. Do not include empty rows")
Expand Down Expand Up @@ -414,7 +419,7 @@ def StoreResults(Results, max_hits, p_cutoff, correctionmethod, upgmatree, GTC):
A method for storing the results. Calls StoreTraitResult for each trait column in the input file
"""
for Trait in Results:
print("Storing results: " + Trait)
print("\nStoring results: " + Trait)
StoreTraitResult(Results[Trait], Trait, max_hits, p_cutoff, correctionmethod, upgmatree, GTC)


Expand Down Expand Up @@ -443,7 +448,7 @@ def StoreTraitResult(Trait, Traitname, max_hits, p_cutoff, correctionmethod, upg
# Start with lowest p-value, the one which has key 0 in sort_instructions
currentgene = sort_instructions[x]
if (Trait[currentgene][cut_possibilities[correctionmethod]] > p_cutoff):
sys.stdout.write("\r100.00%\n")
sys.stdout.write("\r100.00%")
sys.stdout.flush()
break

Expand All @@ -452,14 +457,15 @@ def StoreTraitResult(Trait, Traitname, max_hits, p_cutoff, correctionmethod, upg
max_total_pairs = Max_pairwise_comparisons["Total"]
max_propairs = Max_pairwise_comparisons["Pro"]
max_antipairs = Max_pairwise_comparisons["Anti"]
best_pairwise_comparison_p = ss.binom_test(max_propairs,
max_total_pairs,
0.5,
alternative="greater")
worst_pairwise_comparison_p = ss.binom_test(max_total_pairs-max_antipairs,
max_total_pairs,
0.5,
alternative="greater")
try:
best_pairwise_comparison_p = ss.binom_test(max_propairs,
max_total_pairs,
0.5) / 2
worst_pairwise_comparison_p = ss.binom_test(max_total_pairs-max_antipairs,
max_total_pairs,
0.5) / 2
except TypeError:
sys.exit("There was a problem using scipy.stats.binom_test. Ensure you have a recent distribution of SciPy installed.")

outfile.write('"' + currentgene + '";"' + str(Trait[currentgene]["NUGN"]) + '";"' + str(Trait[currentgene]["Annotation"]) +
'";"' + str(Trait[currentgene]["tpgp"]) + '";"' + str(Trait[currentgene]["tngp"]) + '";"' + str(Trait[currentgene]["tpgn"]) +
Expand Down

0 comments on commit c1d7322

Please sign in to comment.