Improve Hash2Fragment by using a map to validate allowed sequence characters #402

matiasinsaurralde · 2023-11-16T02:08:20Z

Changes in this PR

Improves Hash2Fragment by using a map for more efficient validation of allowed sequence characters, discussion here.
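
To make the comparison concrete, here is a minimal sketch of the two validation styles discussed in this PR. It is not the actual Hash2Fragment code: the alphabet `allowedChars` and the function names `validWithMap` and `validWithContains` are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedChars is a hypothetical alphabet used only for this sketch;
// the set of characters accepted by the real code may differ.
const allowedChars = "ATGC"

// allowedSet is a lookup table built once from allowedChars.
var allowedSet = func() map[rune]struct{} {
	set := make(map[rune]struct{}, len(allowedChars))
	for _, c := range allowedChars {
		set[c] = struct{}{}
	}
	return set
}()

// validWithMap reports whether every character of seq is a key of allowedSet.
func validWithMap(seq string) bool {
	for _, c := range seq {
		if _, ok := allowedSet[c]; !ok {
			return false
		}
	}
	return true
}

// validWithContains performs the same check with a linear scan of allowedChars.
func validWithContains(seq string) bool {
	for _, c := range seq {
		if !strings.ContainsRune(allowedChars, c) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(validWithMap("ATGCATGC"), validWithContains("ATGCATGC")) // true true
	fmt.Println(validWithMap("ATGXATGC"), validWithContains("ATGXATGC")) // false false
}
```

Both functions return the same answer for any input. They differ only in how each character is looked up: a hash lookup for the map, and a scan of a very short string for Contains.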

Why are you making these changes?

General improvement.

Are any changes breaking? (IMPORTANT)

No

Pre-merge checklist

All of these must be satisfied before this PR is considered
ready for merging. Mergeable PRs will be prioritized for review.

New packages/exported functions have docstrings.
New/changed functionality is thoroughly tested.
New/changed functionality has a function giving an example of its usage in the associated test file. See primers/primers_test.go for what this might look like.
Changes are documented in CHANGELOG.md in the [Unreleased] section.
All code is properly formatted and linted.
The PR template is filled out.


TwFlem · 2023-11-16T06:42:31Z

I saw the small alphabet size in the PR and was curious, for another issue I'm working on, what the threshold might be at which a map performs better than a list.

I created some benchmarks just to see what would happen with sequences with different possible alphabets of varying length. These were the results:

Map:

➜  poly git:(scratch) ✗ go test ./bwt  -bench=Map -benchmem
goos: linux
goarch: amd64
pkg: github.com/TimothyStiles/poly/bwt
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkMapSmallAlphaWithNoMistakes-16                       33          36787498 ns/op          226393 B/op       1917 allocs/op
BenchmarkMapSmallAlphaWithSomeMistakes-16                   1101            951373 ns/op          966786 B/op      20057 allocs/op
BenchmarkMapSmallAlphaWithManyMistakes-16                   1791            654668 ns/op          964172 B/op      20035 allocs/op
BenchmarkMapCompleteAlphaWithNoMistakes-16                    30          41200890 ns/op          249033 B/op       2109 allocs/op
BenchmarkMapCompleteAlphaWithSomeMistakes-16                1154           1061187 ns/op          966474 B/op      20054 allocs/op
BenchmarkMapCompleteAlphaWithManyMistakes-16                1746            715031 ns/op          964279 B/op      20036 allocs/op
BenchmarkMapCompleteAlphaOopsAllMistakes-16                 1928            599078 ns/op          963875 B/op      20032 allocs/op

Contains:

➜  poly git:(scratch) ✗ go test ./bwt  -bench=Contains -benchmem
goos: linux
goarch: amd64
pkg: github.com/TimothyStiles/poly/bwt
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkContainsSmallAlphaWithNoMistakes-16                 109          10461112 ns/op           68541 B/op        580 allocs/op
BenchmarkContainsSmallAlphaWithSomeMistakes-16              1536            756558 ns/op          964864 B/op      20041 allocs/op
BenchmarkContainsSmallAlphaWithManyMistakes-16              1892            630320 ns/op          963949 B/op      20033 allocs/op
BenchmarkContainsCompleteAlphaWithNoMistakes-16               96          12359970 ns/op           77822 B/op        659 allocs/op
BenchmarkContainsCompleteAlphaWithSomeMistakes-16           1490            831489 ns/op          965014 B/op      20042 allocs/op
BenchmarkContainsCompleteAlphaWithManyMistakes-16           1742            642530 ns/op          964289 B/op      20036 allocs/op
BenchmarkContainsCompleteAlphaOopsAllMistakes-16            2058            554534 ns/op          963630 B/op      20030 allocs/op

Contains happened to perform significantly better in these benchmarks. I wonder if caching starts to come into play with all the different data involved. I'm including the benchmark code and CPU profile in the next comment.

Also, Contains seems to be more performant in the Small/Complete alphabet with no mistakes case, which seems like it would be the most common case.

*Note: updated because I modified the code to call something that looks like HashToFragment and also saw that the benchmark was inlining the Contains call. Now that the no-inline is gone, the worst case, with BenchmarkContainsCompleteAlphaOopsAllMistakes performing worse than the map equivalent, makes more sense. Before, it was performing far better, which was a little weird.
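
As a rough illustration of how inlining can be controlled in a comparison like this, the sketch below times a Contains-style check and a map-based check over an all-valid sequence using `testing.Benchmark`. It is not the benchmark code linked in this thread: the alphabet, the function names, and the sequence length are assumptions. The `//go:noinline` directive only shows one way to keep the compiler from inlining the per-character check; it is not clear from the note above which variant the posted numbers used.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// alphabet is a hypothetical small alphabet used only for this sketch.
const alphabet = "ATGC"

var alphabetSet = map[byte]bool{'A': true, 'T': true, 'G': true, 'C': true}

// inAlphabetContains scans the alphabet string for c.
//
//go:noinline
func inAlphabetContains(c byte) bool {
	return strings.IndexByte(alphabet, c) >= 0
}

// inAlphabetMap looks c up in the map; missing keys yield false.
//
//go:noinline
func inAlphabetMap(c byte) bool {
	return alphabetSet[c]
}

// countInvalidContains counts the bytes of seq rejected by inAlphabetContains.
func countInvalidContains(seq string) int {
	n := 0
	for i := 0; i < len(seq); i++ {
		if !inAlphabetContains(seq[i]) {
			n++
		}
	}
	return n
}

// countInvalidMap counts the bytes of seq rejected by inAlphabetMap.
func countInvalidMap(seq string) int {
	n := 0
	for i := 0; i < len(seq); i++ {
		if !inAlphabetMap(seq[i]) {
			n++
		}
	}
	return n
}

// sink keeps the compiler from discarding the benchmarked calls.
var sink int

func main() {
	seq := strings.Repeat("ATGC", 2500) // 10,000 valid characters ("no mistakes")

	contains := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = countInvalidContains(seq)
		}
	})
	viaMap := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = countInvalidMap(seq)
		}
	})

	fmt.Println("contains:", contains)
	fmt.Println("map:     ", viaMap)
}
```

The timings depend on the machine and Go version, so no expected output is shown. Removing the two `//go:noinline` lines lets the compiler decide for itself, which is a quick way to see how much inlining affects each variant.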

TwFlem · 2023-11-16T07:09:06Z

CPU profile (the benchmarks were run together for the profile; they were run separately for the results above)

code

Koeng101 · 2023-11-17T15:58:44Z

> Also, Contains seems to be more performant in the Small/Complete alphabet with no mistakes case, which seems like it would be the most common case.

Interesting! @matiasinsaurralde Do you have any thoughts on this? The no-mistakes use case is definitely the most common, by an order of magnitude or more.

matiasinsaurralde · 2023-11-17T16:01:56Z

@Koeng101 I agree with that. Feel free to discard/ignore my suggestion.

Koeng101 · 2023-11-17T16:18:34Z

Closing this pull request because the benchmarks show that Contains is more efficient for our particular use case.

seqhash: improve Hash2Fragment by using a map to validate allowed seq…

3bb4c19

…uence characters

matiasinsaurralde mentioned this pull request Nov 16, 2023

Add seqhash v2 #398

Closed

6 tasks

Koeng101 closed this Nov 17, 2023
