Add svb compression to slow5 #341

Merged: 3 commits from slow5StreamVByte2 into ioToBio, Sep 16, 2023

Conversation

Koeng101 (Contributor)

Replaces #328, this time merging into ioToBio to take advantage of the new parser format.

@Koeng101 (Contributor, Author) commented Sep 11, 2023

Moving #328 here to merge with ioToBio.

#328 (comment) is relevant - I think this is just looking ahead to blow5 compatibility.

To quote the earlier point about the impact of just the svb compression:

> I've seen a reduction in total database size (storing fastq and slow5 reads) from 12 GB to 7.1 GB using just this one weird trick (database engineers hate him).

This is without implementing the other tricks from blow5.

I also added examples of how to use the compression/decompression functions.
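
As a rough illustration of that usage (not the PR's actual code: the import path, function names, and signatures below are assumptions based on this PR's description, so check them against the merged diff):

```go
package main

import (
	"fmt"

	// Import path assumed for illustration; check the merged PR for the real one.
	"github.com/bebop/poly/bio/slow5"
)

func main() {
	// Raw signal as it sits on a slow5 read: a slice of int16 samples.
	rawSignal := []int16{430, 472, 463, 467, 454, 465}

	// Hypothetical names for the wrappers this PR adds: compress the signal
	// into a control ("mask") byte slice and a packed data byte slice.
	mask, data := slow5.SvbCompressRawSignal(rawSignal)

	// Decompression needs the original sample count plus both byte slices.
	roundTripped := slow5.SvbDecompressRawSignal(len(rawSignal), mask, data)
	fmt.Println(roundTripped)
}
```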

@carreter (Collaborator)

Continuing the thread from over there (#328 (comment)):

> I'm using these functions for taking the rawSignal in and out of an SQL database, so having it be a method doesn't work as well as a raw function.
>
> CompressedRead, I think, would just be a blow5 read rather than a slow5 read, which we SHOULD implement at some point. (There is a whole binary format specifically for doing those compressed reads well, with some real performance improvements.)

If this is the case, I think we should entirely hold off on merging this PR in favor of doing a blow5 PR instead. As it stands, I don't really see a use case for other clients of this package. If someone wants to perform this compression, they can just import your svb package themselves.

@Koeng101 (Contributor, Author)

> Continuing the thread from over there (#328 (comment)):
>
> > I'm using these functions for taking the rawSignal in and out of an SQL database, so having it be a method doesn't work as well as a raw function.
> >
> > CompressedRead, I think, would just be a blow5 read rather than a slow5 read, which we SHOULD implement at some point. (There is a whole binary format specifically for doing those compressed reads well, with some real performance improvements.)
>
> If this is the case, I think we should entirely hold off on merging this PR in favor of doing a blow5 PR instead. As it stands, I don't really see a use case for other clients of this package. If someone wants to perform this compression, they can just import your svb package themselves.

Breaking it down into two parts: (1) blow5 vs. svb, and (2) svb triviality.

I strongly disagree that blow5 fulfills the same use case:

  1. blow5 uses svb, but the intended use case is completely different. These slow5 package functions exist so that you can use slow5 with different data stores (like an SQL database). blow5 cannot fulfill that use case: you want to query on the raw data, so it can't live as the binary format. However, the one column that takes the majority of the space and won't be queried on directly can be compressed with svb to roughly halve the required storage, which is the limiting factor for these kinds of data files. blow5 is for binary storage, probably putting Nanopore runs in an object store, while slow5+svb is much better suited to a document-style database (see the storage sketch after this list).
  2. I am currently storing ~14 actual sequencing runs in a database with this format, with more coming in every week. I have real-world experience both that this works well in practice and that blow5 does not fill this niche.
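
To make point 1 concrete, here is a minimal storage sketch in the spirit described above: queryable fields as ordinary columns, the compressed raw signal as BLOBs. The schema, column names, and SQLite driver are assumptions for illustration, not taken from this PR.

```go
package main

import (
	"database/sql"
	"log"

	// Driver choice is illustrative; any database/sql driver works the same way.
	_ "github.com/mattn/go-sqlite3"
)

// insertRead stores the queryable metadata as ordinary columns and the bulky
// raw signal as two compressed BLOBs (mask + data). The schema is made up for
// this sketch.
func insertRead(db *sql.DB, readID string, lenRawSignal int, mask, data []byte) error {
	_, err := db.Exec(
		`INSERT INTO reads (read_id, len_raw_signal, signal_mask, signal_data) VALUES (?, ?, ?, ?)`,
		readID, lenRawSignal, mask, data,
	)
	return err
}

func main() {
	db, err := sql.Open("sqlite3", "reads.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS reads (
		read_id TEXT PRIMARY KEY,
		len_raw_signal INTEGER,
		signal_mask BLOB,
		signal_data BLOB
	)`)
	if err != nil {
		log.Fatal(err)
	}

	// mask and data would come from the svb compression step; placeholders here.
	if err := insertRead(db, "read_0", 3, []byte{0x00}, []byte{1, 2, 3}); err != nil {
		log.Fatal(err)
	}
}
```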

I also disagree that svb is trivial:

  1. The svb implementation is non-trivial (before I forked, you couldn't actually do it with pre-existing libraries), and here we have it tested in the single context where it is useful. Unlike gzip or zstd, it isn't wrapping an io.Reader/io.Writer. Trivial would be piping an io.Reader; this is two make([]xint) calls plus a type-conversion loop. What is the mask array? Do you need it? Which inputs need to be saved for decompression? Having it here as a function makes an unfamiliar compression/decompression scheme much more readable (a simplified encoder sketch follows this list).
  2. There is messaging in using svb this way and having built-in functions for it. The performance gain is pretty darn great compared to other compression methods, and by including it we are saying "THIS is how you should compress raw reads."
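
To answer the mask-array questions in point 1 concretely, here is a self-contained, simplified scalar sketch of stream-vbyte-style encoding (not the PR's implementation, and not the SIMD-accelerated real thing): the mask stream holds a 2-bit byte-length code per value, the data stream holds only the significant bytes, and decompression needs the element count plus both streams.

```go
package main

import "fmt"

// svbEncode packs a 2-bit length code per value into the control ("mask")
// stream and appends only the significant little-endian bytes to the data
// stream. Simplified scalar sketch, for illustration only.
func svbEncode(values []uint32) (mask, data []byte) {
	mask = make([]byte, (len(values)+3)/4)
	data = make([]byte, 0, len(values)*4)
	for i, v := range values {
		n := 1 // bytes needed to represent v
		switch {
		case v >= 1<<24:
			n = 4
		case v >= 1<<16:
			n = 3
		case v >= 1<<8:
			n = 2
		}
		mask[i/4] |= byte(n-1) << ((i % 4) * 2)
		for b := 0; b < n; b++ {
			data = append(data, byte(v>>(8*b)))
		}
	}
	return mask, data
}

// svbDecode shows which inputs must be saved to reverse the encoding:
// the element count plus both the mask and data streams.
func svbDecode(count int, mask, data []byte) []uint32 {
	values := make([]uint32, count)
	pos := 0
	for i := 0; i < count; i++ {
		n := int(mask[i/4]>>((i%4)*2))&3 + 1
		var v uint32
		for b := 0; b < n; b++ {
			v |= uint32(data[pos+b]) << (8 * b)
		}
		values[i] = v
		pos += n
	}
	return values
}

func main() {
	// slow5 raw signal is []int16; the "type conversion loop" widens it to
	// []uint32 before encoding. Widening via uint16 keeps the 16-bit pattern,
	// so small negative samples stay small (this choice is the sketch's, not
	// necessarily the PR's).
	rawSignal := []int16{430, 472, -3, 467, 454}
	widened := make([]uint32, len(rawSignal))
	for i, s := range rawSignal {
		widened[i] = uint32(uint16(s))
	}

	mask, data := svbEncode(widened)
	fmt.Printf("%d samples -> %d mask bytes + %d data bytes\n",
		len(rawSignal), len(mask), len(data))

	restored := make([]int16, len(widened))
	for i, v := range svbDecode(len(widened), mask, data) {
		restored[i] = int16(uint16(v))
	}
	fmt.Println(restored)
}
```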

> As it stands, I don't really see a use case for other clients of this package.

I specifically talk about the use case for other clients in the context comment.

@carreter (Collaborator)

Thank you for the clarifications re: slow5+svb vs blow5 and svb triviality. I completely missed the context comment, so sorry!

I'm on board with adding this functionality - reviewing now.

Resolved review threads: bio/slow5/slow5.go (2), bio/slow5/slow5_test.go (1, outdated).
Koeng101 merged commit 34de749 into ioToBio on Sep 16, 2023. 3 checks passed.

The delete-merged-branch bot deleted the slow5StreamVByte2 branch on September 16, 2023 at 18:09.