Request for update to Taxon Table #93
Comments
Hi @AlexaBennett ,
By default, it does. But there is a rank propagation option: to disable, use
The taxonomy does not perfectly match your desired taxonomy because disabling rank propagation yields the raw NCBI taxonomy (which in this case has no kingdom annotation as in example 2, and contains annotations g__uncultured; s__bacterium), but as shown in the third example you can pick and choose your ranks. Does that meet your need?
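For anyone following along, here is a toy illustration of what rank propagation means for the formatted lineage. It is not RESCRIPt's implementation: the function name `format_lineage`, the rank list, and the prefixes (including `sk__` for superkingdom) are assumptions made for the example, as is the choice to keep a bare prefix for a missing rank when propagation is off.

```python
# Hypothetical rank list and prefixes, chosen for illustration only.
RANKS = [
    ("superkingdom", "sk__"),
    ("phylum", "p__"),
    ("class", "c__"),
    ("genus", "g__"),
    ("species", "s__"),
]


def format_lineage(lineage, ranks=RANKS, propagate=True):
    """Join a {rank: name} lineage into a prefixed taxonomy string.

    With propagate=True, a missing rank inherits the nearest higher
    rank that is present. With propagate=False, a missing rank is
    left as a bare prefix (an assumed formatting choice).
    """
    labels = []
    last_seen = ""
    for rank, prefix in ranks:
        name = lineage.get(rank, "")
        if name:
            last_seen = name
        elif propagate:
            name = last_seen
        labels.append(prefix + name)
    return "; ".join(labels)


example = {"superkingdom": "Bacteria", "phylum": "Proteobacteria",
           "genus": "Escherichia"}
print(format_lineage(example, propagate=True))
print(format_lineage(example, propagate=False))
```

Passing a shorter `ranks` list is the analogue of picking and choosing ranks: only the ranks listed end up in the output string.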
I really like this idea, and it was on our radar at some point. @BenKaehler can you enlighten us as to whether there is a workaround for this and/or what it would take to implement?
Also a great idea. I don't think there would be an issue with adding an option to ignore the sequences, and e.g., We are very open to contributions if you are interested in tackling either of these features @AlexaBennett 😉
I can make it work. I might have to use superkingdom as in your last example. If I am not mistaken, 'superkingdom' is the same as 'domain'? Therefore, it is reasonable to use the latter in a prokaryotic study without loss of information. 🤔
Sure! I have an idea of how to tackle them, but I am only a 3rd-year graduate student coming from a microbiology background. Would there be anyone I may contact if I have questions throughout the process?
Correct, and that sounds reasonable, but I cannot guarantee that this is uniform across the NCBI taxonomy (e.g., that superkingdom annotations are used consistently across entries). It would be worth checking, once you have the data, whether any superkingdom annotations are missing, if that's a problem for your application.
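A quick way to run that check once the taxonomy is in hand might look like the sketch below. It assumes the taxonomy has been loaded into a plain dict of accession ID to semicolon-delimited lineage string, and that superkingdom labels carry an `sk__` prefix; both the helper name `missing_rank` and the prefix are assumptions, so adjust them to match your file.

```python
def missing_rank(taxonomy, prefix="sk__"):
    """Return accession IDs whose lineage has no non-empty label for `prefix`.

    `taxonomy` maps accession ID -> lineage string such as
    "sk__Bacteria; p__Proteobacteria". The default prefix is an
    assumption about how superkingdom is labelled in the output.
    """
    missing = []
    for accession, lineage in taxonomy.items():
        labels = [label.strip() for label in lineage.split(";")]
        # Collect the names attached to the requested prefix, if any.
        values = [label[len(prefix):] for label in labels
                  if label.startswith(prefix)]
        if not any(values):
            missing.append(accession)
    return missing
```

An empty result means every entry has a superkingdom annotation; anything else is the list of accessions to inspect by hand.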
Great! Yes, feel free to ask any questions here that are specific to RESCRIPt/this issue, and for more general development questions you can ask on the QIIME 2 forum. To get you started:
Keep us updated and let us know if you run into problems. Thanks!
Hi @AlexaBennett, very sorry for the slow response. Unless I've missed something, it is fairly straightforward to download a range of accessions. I didn't understand your example, but this one appears to work as expected:
The issue of avoiding downloading the sequences when downloading the taxonomies is a bit trickier. At the moment the script downloads the sequences and the taxids for the given accessions in the same download. It is probably possible to download the metadata around the accession ids without downloading the sequences, but I would have to play around with the download formats. @nbokulich, I'm not sure that this is a good beginner issue. The unit tests on their own would be very challenging for a beginner coder. @AlexaBennett, how badly do you need to be able to download taxa without sequences? As far as I can see the only challenge might be performance, given that you could delete the sequences after you download them. Not saying that it wouldn't be a nice feature to have, just that there is an obvious workaround until we get the opportunity to implement it.
Great news, thanks for the example @BenKaehler ! But I think @AlexaBennett 's request is to trim to a specific position from within a genome. Is this already possible @BenKaehler ?
Oh, gotcha, that is covered by the "Unless I've missed something". Adding that functionality would be a reasonable beginner issue.
Yes, I am interested in the ability to trim to 1+ positions within the genome.
This is the current workaround I am utilizing in the QIIME environment. However, my primary dilemma is a product of having to strip the positions from accession numbers. The resulting taxa file contains only accession IDs and taxa values, so accession numbers from organisms with multiple gene copies have been dereplicated. My current method to handle this is the following:
The script I am currently developing extracts the TaxId from the eSummary (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi). That bypasses having to download the sequences. @BenKaehler, could that work in the current script?
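To make the eSummary idea concrete, here is a minimal sketch of the lookup, not the script mentioned above and not RESCRIPt code. It assumes that `db=nuccore` with `retmode=json` returns a `result` object keyed by UID, and that each record carries `accessionversion` and `taxid` fields; the function names are made up for the example, and the field names should be verified against a real response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(accessions):
    """Build an eSummary request URL for a list of nucleotide accessions."""
    params = {"db": "nuccore", "id": ",".join(accessions), "retmode": "json"}
    return ESUMMARY_URL + "?" + urlencode(params)


def taxids_from_esummary(payload):
    """Map accession.version -> TaxId from a parsed eSummary JSON payload.

    The "result"/"uids"/"accessionversion"/"taxid" keys are assumptions
    about the response layout.
    """
    result = payload.get("result", {})
    taxids = {}
    for uid in result.get("uids", []):
        record = result.get(uid, {})
        accession = record.get("accessionversion")
        taxid = record.get("taxid")
        if accession is not None and taxid is not None:
            taxids[accession] = int(taxid)
    return taxids


def fetch_taxids(accessions):
    """Query eSummary over the network; no sequence data is requested."""
    with urlopen(esummary_url(accessions)) as response:
        return taxids_from_esummary(json.load(response))
```

Keeping the parsing separate from the network call means the parser can be unit tested against a saved response, which may help with the testing burden raised earlier in the thread.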
Request to update get-ncbi-data to format the taxon table as it would appear in SILVA. Currently, the script populates all ranks regardless of their presence in the TaxID lineage.
Example of actual:
Example of desired:
Additionally, would it be possible to allow for specification of an accession number target range? E.g., 'KR045484.1:3-200'. Or, to specify only to retrieve the taxonomy? This utility would be ideal for those of us with a curated list of sequences from an HMM. It is common to retrieve sequences from genomic assemblies. It is also possible to have multiple copies with genetic variation between the genomic regions.
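As a rough sketch of how an identifier like 'KR045484.1:3-200' could be handled, the snippet below splits it into an accession and 1-based coordinates and builds an EFetch URL using the `seq_start` and `seq_stop` parameters. The `accession:start-stop` syntax, the `AccessionRange` type, and the function names are assumptions for illustration; this is not an existing get-ncbi-data option.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# accession, optionally followed by ":start-stop"
_RANGE_RE = re.compile(r"^(?P<acc>[^:\s]+)(?::(?P<start>\d+)-(?P<stop>\d+))?$")


class AccessionRange(NamedTuple):
    accession: str
    start: Optional[int]  # 1-based, inclusive; None means whole record
    stop: Optional[int]


def parse_accession_range(text):
    """Parse 'ACC' or 'ACC:start-stop' into an AccessionRange."""
    match = _RANGE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse accession range: {text!r}")
    if match.group("start") is None:
        return AccessionRange(match.group("acc"), None, None)
    start, stop = int(match.group("start")), int(match.group("stop"))
    if start < 1 or stop < start:
        raise ValueError(f"invalid coordinates in {text!r}")
    return AccessionRange(match.group("acc"), start, stop)


def efetch_url(rng):
    """Build an EFetch FASTA URL, trimming to the range when one is given."""
    params = {"db": "nuccore", "id": rng.accession,
              "rettype": "fasta", "retmode": "text"}
    if rng.start is not None:
        params["seq_start"] = rng.start
        params["seq_stop"] = rng.stop
    return EFETCH_URL + "?" + urlencode(params)
```

Keeping the full 'accession:start-stop' string as the feature ID would also keep multiple copies from the same genome distinct in the resulting taxonomy table.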