-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
created a super minimal mash function for sketching sequences. #344
Conversation
mash/mash.go
Outdated
The idea is that we can sketch a sequence of data, and then compare the sketch to other sketches to see how similar they are. | ||
This saves a bunch of computation time and memory, because we don't have to compare the entire sequence to another sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to have some context on how this differs from algorithms like minimap2. Computation upfront?
mash/mash.go
Outdated
|
||
import "github.com/spaolacci/murmur3" | ||
|
||
// sketch algorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to have more context for this. Where is the sketch algorithm?
Comments should go up to 80char
For this long of multiline, especially if just context, should have a multiline comment style
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to have multiline comments that start each line with //
, and IMO we should standardize this across the repo to be one way or the other - not a huge deal though.
mash/mash.go
Outdated
} | ||
} | ||
|
||
func (m *Mash) Sketch(sequence string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the NewMash function? If so, a function description string would be nice
mash/mash.go
Outdated
// for each kmer in the window, hash it to a 32 bit number | ||
// keep the minimum hash value of all the kmers in the window up to a given sketch size | ||
// the sketch is a vector of the minimum hash values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this?
mash/mash_test.go
Outdated
func TestMash(t *testing.T) { | ||
fingerprint1 := mash.NewMash(17, 10) | ||
fingerprint1.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA") | ||
|
||
fingerprint2 := mash.NewMash(17, 10) | ||
fingerprint2.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA") | ||
|
||
distance := fingerprint1.Distance(fingerprint2) | ||
if distance != 0 { | ||
t.Errorf("Expected distance to be 0, got %f", distance) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No tests for affirming the algorithm actually works on more complicated sequences or in the null case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Any ideas for a good test case @Koeng101?
adding the help wanted tag because @Koeng101 wants this merged but the following still needs to be done and I won't be able to get around to it for at least a week.
|
Linking this to #356 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've suggested a radically different API for performance and future proof reasons. Let me know what y'all think.
mash/mash.go
Outdated
Sketches []uint32 | ||
} | ||
|
||
func NewMash(kmerSize uint, sketchSize uint) *Mash { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to me that this API could be simplified quite a lot and be made more direct and performant:
// WriteSketches writes `n` sketches into dstSketches by splitting sequence
// into chunks of `kmerSize`.
func WriteSketches(dstSketches []uint32, sequence string, kmerSize int) (n int, _ error) {
}
This would also give us leeway by not exposing leaky abstractions in case in the future we realize we want to be able to configure how we write our Sketches with a Masher
/Mash
type which takes several configuration parameters which may not be apparent to us today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I'd avoid uint
in size definitions to catch programmer errors early. Having sizes as uint can do more harm than good i.e:
sketchSize := uint(len(sketchPool) - totalProcessed)
In the code above the user calculates the sketch size from a pair of int
s. If the user is mistaken and the result of the calculation is negative then uint
converts the negative number into a massively huge number which would cause a out of memory panic.
Compared to negative size panics the OOM panic can affect other goroutines running in the same program.
mash/mash.go
Outdated
// count the number of hashes that are the same | ||
// divide the number of hashes that are the same by the total number of hashes | ||
// the result is the distance between the two sequences | ||
func (m *Mash) Distance(other *Mash) float64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the above change Distance
could be rewritten as
func Distance(a, b []uint32) float64
mash/mash.go
Outdated
func (m *Mash) Distance(other *Mash) float64 { | ||
var sameHashes int | ||
for i := 0; i < len(m.Sketches); i++ { | ||
for j := 0; j < len(other.Sketches); j++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This nested loop could be changed for two separate loops, one where the shorter sketch slice is stored into a map[uint32]struct{}
type and the other where they are checked to see if other long uint32s slice contains entries which are in the map. Would speed up for large sketches.
I really like @soypat changes in this case. They make a lot of sense to me |
…th the largest and second to largest hashes.
This package is meant to help create "sketches" via the mash sketching algorithm. There's plenty of room for optimization here but I've implemented the core idea.
Mash: fast genome and metagenome distance estimation using MinHash.
Ondov, B.D., Treangen, T.J., Melsted, P. et al.
Genome Biol 17, 132 (2016).
https://doi.org/10.1186/s13059-016-0997-x
Mash Screen: high-throughput sequence containment estimation for genome discovery.
Ondov, B., Starrett, G., Sappington, A. et al.
Genome Biol 20, 232 (2019).
https://doi.org/10.1186/s13059-019-1841-x