Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

classif_table.py undercounts read tallies after reduplication #334

Open
nhoffman opened this issue Mar 17, 2014 · 0 comments
Open

classif_table.py undercounts read tallies after reduplication #334

nhoffman opened this issue Mar 17, 2014 · 0 comments

Comments

@nhoffman
Copy link
Collaborator

In the examples below, I created a data set that either underwent "real" deduplication (using deduplicate_sequences.py) or "fake" deduplication using this script:

#!/usr/bin/env python

"""
Usage seqmagick extract-ids seqs.fasta | pretend_to_deduplicate.py > dedup_info.csv
"""

import sys
import csv

def main():
    writer = csv.writer(sys.stdout)
    for name in sys.stdin:
        name = name.strip()
        writer.writerow([name, name, 1])


main()

In the examples below, the output using "real" deduplication is in output-dedup, "fake" in output.

% grep -c '>' data/seqs.fasta
12279
% src/pplacer/scripts/classif_table.py output/placements.db -m data/seq_info.csv by_taxon.species.csv
% R -q --no-save -e 'sum(read.csv("by_taxon.species.csv")$tally)'
> sum(read.csv("by_taxon.species.csv")$tally)
[1] 12279
% src/pplacer/scripts/classif_table.py output-dedup/placements.db -m data/seq_info.csv by_taxon.species.csv
% R -q --no-save -e 'sum(read.csv("by_taxon.species.csv")$tally)'
> sum(read.csv("by_taxon.species.csv")$tally)
[1] 10849

This manifests in an actual analysis as under-counting for each specimen, for example:

   specimen raw_tally tally yield
1   j14tr10      6560  3800  57.9
2   j14tr11      5719  4787  83.7
3   j14tr12      6201  5925  95.5
4   j14tr17      6674  3124  46.8
5   j14tr18      8023  3349  41.7
6   j14tr19      8206  3101  37.8
7   j14tr22      6349  2512  39.6
8   j14tr23      7627  3040  39.9
9   j14tr24      7509  2516  33.5

As far as I can tell, the behavior is the same in classif_rect.py (the original script from which classif_table.py was derived).

For the time being, the fix appears to be not to deduplicate before aligning and running pplacer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant