Reference Workflow for Fastq->SRA File (VDB Conversion) #44
NCBI has an automated pipeline for SRA submissions. This pipeline is old in software years; it has been in production for over a decade. It has grown over the years, mostly in response to the evolution of sequencing technology and the changes in data formats that came with it (*). The SRA Toolkit is only part of the pipeline, and we don't know much about the parts to which we don't contribute. I will describe the parts that involve our tools. Keep in mind that a big difference between common bioinformatics formats and SRA is that SRA stores all reads of a spot together as one record, not as separate records. Aligned submissions are in BAM, CRAM, or SAM (collectively referred to as BAM). Unaligned submissions are usually in some form of FASTQ. We have 3 programs for ETL of FASTQ. Each has its own strengths and weaknesses. In order of age, they are:
All of these (and all our loaders) produce a directory containing the columns of the table holding the read data, each column in a subdirectory. (
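To make the "all reads of a spot together as one record" point concrete, here is a minimal Python sketch of how two paired FASTQ files could be folded into per-spot records. This is not the code of any SRA loader. The `Spot` class, its field names, and the `/1` and `/2` name-suffix convention are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from io import StringIO
from typing import Iterator, List, TextIO, Tuple


@dataclass
class Spot:
    """Hypothetical per-spot record: all reads of one spot stored together."""
    name: str
    bases: str               # bases of every read, concatenated
    qualities: str           # quality string, concatenated the same way
    read_starts: List[int]   # offset of each read within `bases`
    read_lengths: List[int]  # length of each read


def parse_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from simple 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return                      # end of file
        if not header.strip():
            continue                    # skip blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()               # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.strip()[1:].split()[0], seq, qual


def spot_key(name: str) -> str:
    """Drop an assumed '/1' or '/2' mate suffix to get the spot name."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def merge_pairs(fwd: TextIO, rev: TextIO) -> Iterator[Spot]:
    """Combine mates from two files (assumed to be in the same order)."""
    for (n1, s1, q1), (n2, s2, q2) in zip(parse_fastq(fwd), parse_fastq(rev)):
        if spot_key(n1) != spot_key(n2):
            raise ValueError(f"mate names do not match: {n1!r} vs {n2!r}")
        yield Spot(
            name=spot_key(n1),
            bases=s1 + s2,
            qualities=q1 + q2,
            read_starts=[0, len(s1)],
            read_lengths=[len(s1), len(s2)],
        )


# Usage: two tiny in-memory FASTQ "files", one record each.
r1 = StringIO("@spot1/1\nACGT\n+\nIIII\n")
r2 = StringIO("@spot1/2\nTTGCA\n+\nHHHHH\n")
spots = list(merge_pairs(r1, r2))
print(spots[0])
```

Two FASTQ records go in and one spot record comes out, with the read boundaries kept as offsets and lengths rather than as separate records. Real submissions are messier (unsorted mates, orphan reads, other naming schemes), which this sketch does not attempt to handle.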
Thank you so much for this detailed answer @durbrow - it’s like an early Christmas present. I’m trying to learn the binary format of .SRA files in order to hardware-accelerate a particular application, and this information is really useful. If you have any other tips on how to easily extract the reads from an SRA file in binary form (e.g., is there an example program to un-kar a file and just iterate over the DNA bases in binary format?), I would appreciate them too. I’ve spent a lot of time studying the ngs SDK FragTest example, but it seems to just give back a StringRef to the bases, which I’m guessing is still not as efficient as parsing the binary data directly. Thanks again for your contributions, and Happy Holidays.
Hello,
Can anyone point me to the code that the SRA project uses to convert a user's submitted FASTQ files to the VDB/.sra format that is then stored in the SRA? I know the format is not fully documented, but understanding this part of the ETL pipeline would really help in understanding the format. Any example code or pipeline would also be helpful.
Thank you.