
Reference Workflow for Fastq->SRA File (VDB Conversion) #44

Open
LRParser opened this issue Dec 24, 2021 · 2 comments

@LRParser

Hello,
Can anyone point me to the code the SRA project uses to convert a user's submitted FASTQ files into the VDB/.sra format that is then stored in the SRA? I know the format is not fully documented, but understanding this part of the ETL pipeline would really help in understanding the format. Any example code or pipeline would also be helpful.
Thank you.

@durbrow
Collaborator

durbrow commented Dec 24, 2021

NCBI has an automated pipeline for SRA submissions. This pipeline is old in software years: it has been in production for over a decade. It has grown over the years, mostly in response to the evolution of sequencing technology and how that shows up as changes to data formats (*). The SRA toolkit is only part of it, and we don't have much knowledge of the parts to which we don't contribute. I will describe the parts that involve our tools.

Keep in mind that a big difference between common bio-informatics formats and SRA is that SRA stores all reads of a spot together as one record and not as separate records.
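To make the data-model difference concrete, here is a minimal Python sketch. The dict layout and field names (`read`, `read_len`) are illustrative only, not the actual VDB schema:

```python
# Sketch: FASTQ keeps each read of a pair as a separate record,
# while SRA groups all reads of a spot into one record.
# The dict layout here is illustrative, not the actual VDB schema.

fastq_records = [
    ("SPOT1/1", "ACGT"),  # forward read (mate 1)
    ("SPOT1/2", "TGCA"),  # reverse read (mate 2)
]

def to_spot(records):
    """Combine mate records sharing a spot name into one spot record."""
    name = records[0][0].rsplit("/", 1)[0]
    reads = [seq for _, seq in records]
    # Conceptually, a spot holds the concatenated bases plus
    # per-read boundaries so the mates can be recovered.
    return {"name": name, "read": "".join(reads),
            "read_len": [len(seq) for seq in reads]}

spot = to_spot(fastq_records)
print(spot)  # {'name': 'SPOT1', 'read': 'ACGTTGCA', 'read_len': [4, 4]}
```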

Aligned submissions are in BAM, CRAM, or SAM (collectively referred to as BAM). bam-load does ETL of BAM. Sometimes unaligned submissions come in BAM, and bam-load handles those too. BAM has unambiguous read-number information, so there is no need to parse read names, but mates can come in any order and be anywhere in the input. bam-load can assemble reads into spots regardless of the order in which they arrive, and it detects duplicates. It does not store read names or extra tags from the input.
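The order-independent assembly described above can be sketched as follows. This is a conceptual illustration of the buffering strategy, not bam-load's actual implementation:

```python
# Sketch: order-independent spot assembly with duplicate detection,
# in the spirit of bam-load. Reads are buffered by name until both
# mates arrive; repeated (name, mate) pairs are skipped.

def assemble_spots(reads):
    """reads: iterable of (name, mate_number, sequence) in any order."""
    pending = {}   # name -> {mate_number: sequence}, awaiting the other mate
    seen = set()   # (name, mate_number) pairs already accepted
    spots = []
    for name, mate, seq in reads:
        if (name, mate) in seen:
            continue                      # duplicate record: ignore
        seen.add((name, mate))
        mates = pending.setdefault(name, {})
        mates[mate] = seq
        if len(mates) == 2:               # both mates present: emit the spot
            spots.append((name, mates[1], mates[2]))
            del pending[name]
    return spots

# Mates arrive far apart and out of order, with one duplicate record:
reads = [("A", 2, "TT"), ("B", 1, "CC"), ("A", 2, "TT"),
         ("A", 1, "GG"), ("B", 2, "AA")]
print(assemble_spots(reads))  # [('A', 'GG', 'TT'), ('B', 'CC', 'AA')]
```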

Unaligned submissions are usually in some form of FASTQ. We have three programs for ETL of FASTQ. Each has its own strengths and weaknesses. In order of age, they are:

1. fastq-load
2. latf-load
3. sharq

fastq-load is fast and produces the smallest output. But it breaks on any read names it can't parse, requires strict ordering of reads for assembling reads into spots (pairing mates), and can't detect duplicate read names. It isn't actively maintained.
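The strict-ordering requirement can be sketched like this: mates must be adjacent in the input, there is no buffering, and out-of-order input is an error. This is a conceptual illustration, not fastq-load's actual code:

```python
# Sketch: strict-order pairing in the style fastq-load requires.
# Mate 1 and mate 2 of each spot must be adjacent in the input;
# there is no buffering and no duplicate detection.

def pair_strict(records):
    """records: list of (name, seq); mates must be interleaved /1, /2."""
    spots = []
    for i in range(0, len(records), 2):
        (n1, s1), (n2, s2) = records[i], records[i + 1]
        base1 = n1.rsplit("/", 1)[0]
        base2 = n2.rsplit("/", 1)[0]
        if base1 != base2:
            # Any deviation from the expected order is fatal here,
            # unlike the buffered approach bam-load and latf-load use.
            raise ValueError(f"mates out of order at record {i}: {n1} vs {n2}")
        spots.append((base1, s1, s2))
    return spots

ok = [("S1/1", "AC"), ("S1/2", "GT"), ("S2/1", "CC"), ("S2/2", "GG")]
print(pair_strict(ok))  # [('S1', 'AC', 'GT'), ('S2', 'CC', 'GG')]
```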

latf-load is much more flexible in parsing read names (it uses flex/bison), can assemble reads into spots regardless of the order in which they arrive, and detects duplicates (functionality from bam-load). However, this makes for a heavyweight and slow process. Until recently, it didn't store read names, only a number for each spot.
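The flexible read-name parsing can be approximated with a couple of patterns. latf-load uses a full flex/bison grammar; the regexes below are my own illustrative reduction covering two common Illumina-style name formats, not its actual rules:

```python
import re

# Sketch: flexible read-name parsing in the spirit of latf-load's
# flex/bison grammar, reduced here to regexes over two common
# Illumina-style formats. Patterns are illustrative only.

PATTERNS = [
    # Casava 1.8+: "<instr>:<run>:<cell>:<lane>:<tile>:<x>:<y> <mate>:<filter>:<flags>:<index>"
    re.compile(r"^(?P<spot>\S+)\s+(?P<mate>[12]):[YN]:\d+:\S*$"),
    # Older style: "<name>/<mate>"
    re.compile(r"^(?P<spot>\S+)/(?P<mate>[12])$"),
]

def parse_name(name):
    """Return (spot_name, mate_number), or raise if no pattern matches."""
    for pat in PATTERNS:
        m = pat.match(name)
        if m:
            return m.group("spot"), int(m.group("mate"))
    raise ValueError(f"unparseable read name: {name!r}")

print(parse_name("M00001:12:000000000-A1B2C:1:1101:15589:1333 1:N:0:ATCACG"))
print(parse_name("SRR000001.5/2"))  # ('SRR000001.5', 2)
```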

sharq is the latest. It is more like fastq-load, but it detects duplicates and is easier to maintain. Unlike the other two, it needs an external tool, general-loader, to actually write the output.

All of these (and all our loaders) produce a directory containing the columns of the table holding the read data, each column in a subdirectory. (bam-load actually produces databases with tables for the reference sequences, the alignment data, and the read data, each in a subdirectory.) Our kar program packs up the directory into a single file that is made available in the archive as an SRR.
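To illustrate the pack-a-directory-tree-into-one-file idea only: kar's on-disk format is not tar, and the paths below are invented, but Python's tarfile makes the concept concrete (one subdirectory per column, packed into a single archive):

```python
import io
import tarfile

# Analogy only: kar's archive format is NOT tar. This sketch just
# illustrates packing a per-column directory tree (one subdirectory
# per column) into a single file and listing it back. Paths invented.

columns = {"READ/data": b"ACGTACGT", "QUALITY/data": b"IIIIIIII"}

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as ar:
    for path, payload in columns.items():
        info = tarfile.TarInfo(name=f"TABLE/col/{path}")
        info.size = len(payload)
        ar.addfile(info, io.BytesIO(payload))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as ar:
    names = sorted(ar.getnames())
print(names)  # ['TABLE/col/QUALITY/data', 'TABLE/col/READ/data']
```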

(*) In spite of the variety of inputs that SRA accepts, SRA endeavors to provide a consistent view of the data. For example, a tool wanting FASTQ out of the SRA should not have to deal with the details of different platforms or the evolution of their file formats. This is the main reason we don't document the details: they change too often.

@LRParser
Author

Thank you so much for this detailed answer @durbrow - it's like an early Christmas present. I'm trying to learn the binary format of .SRA files to hardware-accelerate a particular application, and this information is really useful. If you have any other tips on how to easily extract the reads from an SRA file in binary form (e.g., is there an example program to un-kar a file and just iterate the DNA bases in binary format?), I would also appreciate them. I've spent a lot of time studying the ngs SDK FragTest example, but that seems to just give back a StringRef to the bases, which I'm guessing is still not as optimal as parsing the binary info directly might be. Thanks again for your contributions, and Happy Holidays.
