VDBManagerPathType() returns kptNotFound under stress #25

jgans · 2020-05-08T20:45:03Z

I have been benchmarking the loading of SRA records on AWS, using the VDB API to stream sequence data (no quality scores or other information). Similar to the fasterq-dump strategy, I am attempting to read each SRA record in parallel, but using the Message Passing Interface (MPI) instead of just threads. Each MPI rank opens and reads a non-overlapping slice of an SRA record.
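To make the slicing concrete, here is a minimal sketch of how each rank could compute its own non-overlapping row range. It assumes the first row id and the row count of the table are already known (for example from a call such as VCursorIdRange()) and that the rank and rank count come from MPI_Comm_rank() and MPI_Comm_size(). The `row_slice` type and `slice_for_rank()` function are hypothetical names used for illustration only; they are not part of the VDB API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a contiguous range of row ids owned by one rank. */
typedef struct {
    int64_t  first;  /* first row id in the slice */
    uint64_t count;  /* number of rows in the slice */
} row_slice;

/*
 * Split the rows [first_row, first_row + row_count) into num_ranks
 * contiguous, non-overlapping slices. The first (row_count % num_ranks)
 * ranks each receive one extra row so the sizes differ by at most one.
 */
row_slice slice_for_rank(int64_t first_row, uint64_t row_count,
                         int rank, int num_ranks)
{
    assert(num_ranks > 0 && rank >= 0 && rank < num_ranks);

    uint64_t base  = row_count / (uint64_t)num_ranks;
    uint64_t extra = row_count % (uint64_t)num_ranks;
    uint64_t r     = (uint64_t)rank;

    /* Ranks before this one contribute `base` rows each, plus one extra
       row for every rank index below `extra`. */
    uint64_t offset = r * base + (r < extra ? r : extra);

    row_slice s;
    s.first = first_row + (int64_t)offset;
    s.count = base + (r < extra ? 1u : 0u);
    return s;
}
```

With 10 rows starting at row id 1 and 3 ranks, this gives rank 0 rows 1 to 4, rank 1 rows 5 to 7, and rank 2 rows 8 to 10.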

When the number of parallel MPI ranks exceeds about 32, I've noticed that VDBManagerPathType() starts returning kptNotFound for about 10% of the MPI processes. I've been able to work around this by retrying the call to VDBManagerPathType() after waiting 5 seconds. Is there a good way to read an SRA record in parallel, ideally using hundreds of independent but concurrent processes? I am interested in extracting reads from an SRA file as fast as AWS will allow.
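For reference, the retry workaround looks roughly like the sketch below. It is a generalized version, not my exact code: it doubles the wait after each failure up to a cap, whereas what I actually did was a fixed 5 second wait (setting the base and maximum delay both to 5000 ms reproduces that). The names `probe_fn`, `retry_with_backoff()`, and `flaky_probe()` are hypothetical; in real use the probe callback would wrap the call to VDBManagerPathType() and return nonzero when the result is not kptNotFound.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns nonzero once the path resolves. */
typedef int (*probe_fn)(void *ctx);

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
    struct timespec ts;
    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/*
 * Call probe() up to max_attempts times, sleeping between failed
 * attempts. The delay starts at base_delay_ms and doubles after each
 * failure, capped at max_delay_ms.
 * Returns the attempt number that succeeded (1-based), or 0 if every
 * attempt failed.
 */
int retry_with_backoff(probe_fn probe, void *ctx, int max_attempts,
                       long base_delay_ms, long max_delay_ms)
{
    assert(probe != NULL);

    long delay = base_delay_ms;
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (probe(ctx))
            return attempt;
        if (attempt < max_attempts) {
            sleep_ms(delay);
            delay *= 2;
            if (delay > max_delay_ms)
                delay = max_delay_ms;
        }
    }
    return 0;
}

/* Stand-in probe for demonstration: fails until the counter hits zero. */
int flaky_probe(void *ctx)
{
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        --*remaining_failures;
        return 0;
    }
    return 1;
}
```

A return value of 0 means the path never resolved, so the rank should report an error rather than proceed. Staggering the base delay per rank might also help avoid all ranks retrying at the same moment, but I have not measured that.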

I was assuming that the data is stored in an S3 bucket and that parallel access would be okay. I'm not exactly sure where the data is being stored, since the srapath command returns:
https://locate.ncbi.nlm.nih.gov/sdlr/sdlr.fcgi?jwt=<long string of characters removed>.
