Caching seqinfo #26
There's no plan at the moment to take advantage of R's support for caching user data to save NCBI or UCSC assembly/genome information and allow offline use. One concern with a persistent caching solution is the slight possibility that the information provided by NCBI or UCSC for a given assembly/genome changes in the future. But maybe the risk of this actually happening is so low that we shouldn't be too concerned. It could also be mitigated via an expiration mechanism, e.g. NCBI or UCSC chromosome information gets automatically removed from the persistent cache after a couple of months. Also note that even with a persistent caching solution, an internet connection would still be necessary initially, so it doesn't really solve the problem for users on networks that block NCBI/UCSC/Ensembl traffic.
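The expiration idea described above can be sketched in base R. This is only an illustration, not anything GenomeInfoDb provides: the helper name `cached_fetch`, the cache folder name `seqinfoCacheDemo`, and the 60-day default are all assumptions made for the example. It stores a fetched object under R's per-user cache directory and refetches once the stored copy is too old.

```r
# Hypothetical helper (not part of GenomeInfoDb): keep the result of
# `fetch_fun()` on disk and refetch it once the cached copy is older
# than `max_age_days`.
cached_fetch <- function(key, fetch_fun, max_age_days = 60,
                         cache_dir = tools::R_user_dir("seqinfoCacheDemo",
                                                       which = "cache")) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    path <- file.path(cache_dir, paste0(key, ".rds"))
    if (file.exists(path)) {
        age_days <- as.numeric(difftime(Sys.time(), file.mtime(path),
                                        units = "days"))
        if (age_days <= max_age_days)
            return(readRDS(path))  # fresh enough: no network access needed
    }
    value <- fetch_fun()           # missing or expired: fetch again
    saveRDS(value, path)
    value
}
```

A call could look like `cached_fetch("UCSC_hg38", function() GenomeInfoDb::getChromInfoFromUCSC("hg38"))`. As noted above, the first call still needs an internet connection; only later calls within the expiry window would run offline.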
Thanks for taking the time to respond. I understand the risk of sequence information changing. A persistent caching solution would help users who sometimes work offline, and it would prevent some crashes in HPC environments (e.g., a random node is misconfigured or has network problems). Maybe it's too niche a need, but it could also help package developers to be able to insert their own entries into .UCSC_cached_chrom_info and .NCBI_cached_chrom_info for use in these situations. The need for the end user to do simple harmonization of human data (just making chr prefixes and M/MT consistent, without regard for non-primary assembly sequences) is probably pretty widespread.
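For the simple case mentioned above (chr prefixes and M/MT on primary human chromosomes only), a network-free sketch in base R might look like the following. The function `to_ncbi_style` is hypothetical and is not a GenomeInfoDb function; it leaves any name it does not recognize as a primary chromosome unchanged, so scaffolds and alternate loci are not harmonized.

```r
# Hypothetical helper (not part of GenomeInfoDb): convert UCSC-style names
# of primary human chromosomes (chr1..chr22, chrX, chrY, chrM) to NCBI
# style. Names that are not recognized are returned unchanged.
to_ncbi_style <- function(seqnames) {
    primary <- c(as.character(1:22), "X", "Y", "M")
    stripped <- sub("^chr", "", seqnames)
    is_primary <- startsWith(seqnames, "chr") & stripped %in% primary
    out <- seqnames
    out[is_primary] <- stripped[is_primary]  # drop "chr" on primary names
    out[out == "M"] <- "MT"                  # UCSC "chrM" is NCBI "MT"
    out
}

to_ncbi_style(c("chr1", "chrX", "chrM", "2", "MT", "chrUn_GL000220v1"))
# returns "1" "X" "MT" "2" "MT" "chrUn_GL000220v1"
```

This avoids any download, at the cost of ignoring everything outside the primary assembly, which is exactly the trade-off described in the comment above.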
Bingo! And just when we were talking about the possibility of UCSC suddenly changing the chromosome information of their genomes, they just do it! See issue #27. Note that this is not the first time. They already did this last year with hg19 when they decided to base it on GRCh37.p13 instead of GRCh37. This broke many things and created a lot of confusion.
Hi @jeff-mandell, Just to let you know that I implemented an "offline mode" for Note that it's only a partial "offline mode" i.e. it works when called with Cheers,
Thank you, this is nice to have!
@hpages Are there plans to make "offline" assembly metadata available on AnnotationHub, like the Ensembl and UCSC transcript DBs?
There are plans to make some assembly metadata available offline but there's no clear roadmap yet. In particular, whether it's going to be via AnnotationHub or other means has not been decided. Note that the chrom info for some UCSC genomes is already available offline e.g.
Hi, my package uses GenomeInfoDb, and we use the seqlevelsStyle function to clean up user-supplied data and ensure consistent chromosome names (in our case, we go with NCBI style, which means stripping chr prefixes). I can see that what seems like a simple task gets complicated under the hood by the need to download the latest info from NCBI, Ensembl, and UCSC.
I found that .UCSC_cached_chrom_info and .NCBI_cached_chrom_info store the necessary information for seqlevelsStyle throughout a session, but an internet connection is still needed at the start of every new session. This causes a problem for offline users and for users on networks that, for whatever reason, block NCBI, UCSC, or Ensembl traffic (yes, this is really happening). Since seqinfo is such a small amount of data, is there a plan to take advantage of R's support for caching user data to save this information and allow seqlevelsStyle to run offline? Or is there a safe workaround to supply the necessary seqinfo?
I did it this way, but I'm concerned this could cause problems with new GenomeInfoDb releases or if anything changes on the NCBI/UCSC/Ensembl server side.