VDBManagerPathType() returns kptNotFound under stress #25

Open
jgans opened this issue May 8, 2020 · 0 comments
jgans commented May 8, 2020

I have been benchmarking the loading of SRA records on AWS, using the VDB API to stream sequence data only (no quality scores or other information). Similar to the fasterq-dump strategy, I am attempting to read each SRA record in parallel, but using the Message Passing Interface (MPI) instead of just threads. Each MPI rank opens the SRA record and reads its own non-overlapping slice.
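To make the slicing concrete, here is a minimal sketch of how a row range could be divided among ranks. It is not my actual code: `rank_slice` and its parameters are hypothetical names, and it assumes the record's first row ID and row count are already known and that the rank and rank count come from the MPI runtime.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper (not part of the VDB API or MPI).
// Splits rows [first_row, first_row + row_count) into num_ranks contiguous,
// non-overlapping half-open slices and returns {begin, end} for `rank`.
// The first (row_count % num_ranks) ranks each receive one extra row.
std::pair<std::int64_t, std::int64_t> rank_slice(std::int64_t first_row,
                                                 std::uint64_t row_count,
                                                 int rank, int num_ranks)
{
    const std::uint64_t n = static_cast<std::uint64_t>(num_ranks);
    const std::uint64_t r = static_cast<std::uint64_t>(rank);
    const std::uint64_t base = row_count / n;   // rows every rank gets
    const std::uint64_t extra = row_count % n;  // leftover rows to spread
    const std::uint64_t offset = r * base + (r < extra ? r : extra);
    const std::uint64_t length = base + (r < extra ? 1 : 0);
    const std::int64_t begin = first_row + static_cast<std::int64_t>(offset);
    return {begin, begin + static_cast<std::int64_t>(length)};
}
```

In an MPI program, `rank` and `num_ranks` would be the values obtained from `MPI_Comm_rank` and `MPI_Comm_size`, and each rank would read only the rows in its own `[begin, end)` range.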

When the number of parallel MPI ranks grows beyond about 32, I've noticed that VDBManagerPathType() starts returning kptNotFound for about 10% of the MPI processes. I've been able to work around this by waiting 5 seconds and then retrying the call to VDBManagerPathType(). Is there a good way to read an SRA record in parallel, ideally using hundreds of independent but concurrent processes? I am interested in extracting reads from an SRA file as fast as AWS will allow.
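The retry workaround looks roughly like the sketch below. It is written as a generic helper so it compiles without the VDB headers: `retry_path_type`, `probe`, `not_found`, `max_attempts`, and `delay` are all hypothetical names, and the attempt limit is an assumption (the description above only specifies the 5-second wait).

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper (not part of the VDB API).
// Calls `probe` until it returns something other than `not_found`,
// sleeping `delay` between attempts, for at most `max_attempts` calls.
// Returns the last value produced by `probe`.
template <typename Probe>
int retry_path_type(Probe probe, int not_found, int max_attempts,
                    std::chrono::milliseconds delay)
{
    int result = not_found;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        result = probe();
        if (result != not_found) {
            break;  // the path was resolved, stop retrying
        }
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(delay);  // wait before the next try
        }
    }
    return result;
}
```

In the real program, `probe` would be a lambda that wraps the call to VDBManagerPathType() for the accession, `not_found` would be kptNotFound, and `delay` would be `std::chrono::seconds(5)`.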

I had assumed that the data is stored in an S3 bucket and that parallel access would be fine. However, I'm not sure where the data is actually stored, since the srapath command returns:
https://locate.ncbi.nlm.nih.gov/sdlr/sdlr.fcgi?jwt=<long string of characters removed>.
