
Improve Hash2Fragment by using a map to validate allowed sequence characters #402

Conversation

@matiasinsaurralde (Contributor) commented Nov 16, 2023

Changes in this PR

  • Improves Hash2Fragment by using a map for more efficient validation of allowed sequence characters, discussion here. A rough sketch of the idea is shown below.

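As a minimal, hypothetical sketch of the map-based check this PR proposes (the function name, signature, and ATGC alphabet are assumptions for illustration, not the actual Hash2Fragment code):

```go
package main

import "fmt"

// allowed is a set of the characters the sequence validator accepts.
// The ATGC alphabet here is an assumption for the sketch; the real
// Hash2Fragment code may accept a different character set.
var allowed = map[rune]struct{}{
	'A': {}, 'T': {}, 'G': {}, 'C': {},
}

// isValidSequence reports whether every character of seq is in the allowed set.
// The map lookup is O(1) per character regardless of alphabet size.
func isValidSequence(seq string) bool {
	for _, r := range seq {
		if _, ok := allowed[r]; !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isValidSequence("ATGC")) // true
	fmt.Println(isValidSequence("ATGX")) // false
}
```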
Why are you making these changes?

General improvement.

Are any changes breaking? (IMPORTANT)

No

Pre-merge checklist

All of these must be satisfied before this PR is considered
ready for merging. Mergeable PRs will be prioritized for review.

  • New packages/exported functions have docstrings.
  • New/changed functionality is thoroughly tested.
  • New/changed functionality has a function giving an example of its usage in the associated test file. See primers/primers_test.go for what this might look like.
  • Changes are documented in CHANGELOG.md in the [Unreleased] section.
  • All code is properly formatted and linted.
  • The PR template is filled out.

@matiasinsaurralde mentioned this pull request Nov 16, 2023
@TwFlem (Contributor) commented Nov 16, 2023

I saw the small alphabet size in the PR and was curious what the threshold might be at which a map performs better than a list, for another issue I'm working on.

I created some benchmarks just to see what would happen with sequences drawn from possible alphabets of varying length. These were the results:

Map:

➜  poly git:(scratch) ✗ go test ./bwt  -bench=Map -benchmem
goos: linux
goarch: amd64
pkg: github.com/TimothyStiles/poly/bwt
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkMapSmallAlphaWithNoMistakes-16                       33          36787498 ns/op          226393 B/op       1917 allocs/op
BenchmarkMapSmallAlphaWithSomeMistakes-16                   1101            951373 ns/op          966786 B/op      20057 allocs/op
BenchmarkMapSmallAlphaWithManyMistakes-16                   1791            654668 ns/op          964172 B/op      20035 allocs/op
BenchmarkMapCompleteAlphaWithNoMistakes-16                    30          41200890 ns/op          249033 B/op       2109 allocs/op
BenchmarkMapCompleteAlphaWithSomeMistakes-16                1154           1061187 ns/op          966474 B/op      20054 allocs/op
BenchmarkMapCompleteAlphaWithManyMistakes-16                1746            715031 ns/op          964279 B/op      20036 allocs/op
BenchmarkMapCompleteAlphaOopsAllMistakes-16                 1928            599078 ns/op          963875 B/op      20032 allocs/op

Contains:

➜  poly git:(scratch) ✗ go test ./bwt  -bench=Contains -benchmem
goos: linux
goarch: amd64
pkg: github.com/TimothyStiles/poly/bwt
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkContainsSmallAlphaWithNoMistakes-16                 109          10461112 ns/op           68541 B/op        580 allocs/op
BenchmarkContainsSmallAlphaWithSomeMistakes-16              1536            756558 ns/op          964864 B/op      20041 allocs/op
BenchmarkContainsSmallAlphaWithManyMistakes-16              1892            630320 ns/op          963949 B/op      20033 allocs/op
BenchmarkContainsCompleteAlphaWithNoMistakes-16               96          12359970 ns/op           77822 B/op        659 allocs/op
BenchmarkContainsCompleteAlphaWithSomeMistakes-16           1490            831489 ns/op          965014 B/op      20042 allocs/op
BenchmarkContainsCompleteAlphaWithManyMistakes-16           1742            642530 ns/op          964289 B/op      20036 allocs/op
BenchmarkContainsCompleteAlphaOopsAllMistakes-16            2058            554534 ns/op          963630 B/op      20030 allocs/op

Contains happened to perform significantly better in these benchmarks. I wonder if caching starts to come into play with all the different data involved. I'm including the benchmark code and CPU profile in the next comment.

Also, Contains seems to be more performant in the Small/Complete alphabet with no mistakes case, which seems like it would be the most common case.

*Note: updated because I modified the code to call something that looks like HashToFragment and also saw that the benchmark was inlining the Contains call. Now that the inlining is gone, the worst case, with BenchmarkContainsCompleteAlphaOopsAllMistakes performing worse than the map equivalent, makes more sense. Before, it was performing far better, which was a little weird.

@TwFlem (Contributor) commented Nov 16, 2023

CPU profile (the benchmarks were run together for the profile; they were run separately for the results above):

[attachment: CPU profile, map vs. Contains, inseqHashFn]

[attachment: benchmark code]
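The attached benchmark code isn't reproduced on this page. Below is a rough, hypothetical sketch of how a map-vs-Contains comparison like this could be set up; the package name, helper functions, benchmark names, and input sizes are assumptions for illustration, and the actual benchmarks (which exercised something closer to HashToFragment and produced the allocation counts above) will differ:

```go
package validation

import (
	"strings"
	"testing"
)

const smallAlphabet = "ATGC"

// testSequence is a "no mistakes" input: every character is in the alphabet.
var testSequence = strings.Repeat("ATGC", 10000)

// validateWithMap checks each character of seq against a set of allowed runes.
func validateWithMap(seq string, allowed map[rune]struct{}) bool {
	for _, r := range seq {
		if _, ok := allowed[r]; !ok {
			return false
		}
	}
	return true
}

// validateWithContains checks each character of seq with a linear scan of the alphabet string.
func validateWithContains(seq, alphabet string) bool {
	for _, r := range seq {
		if !strings.ContainsRune(alphabet, r) {
			return false
		}
	}
	return true
}

func BenchmarkMapSmallAlphaNoMistakes(b *testing.B) {
	allowed := make(map[rune]struct{}, len(smallAlphabet))
	for _, r := range smallAlphabet {
		allowed[r] = struct{}{}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		validateWithMap(testSequence, allowed)
	}
}

func BenchmarkContainsSmallAlphaNoMistakes(b *testing.B) {
	for i := 0; i < b.N; i++ {
		validateWithContains(testSequence, smallAlphabet)
	}
}
```

Saved as a standalone _test.go file, a sketch like this can be run with `go test -bench=. -benchmem`.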

@Koeng101 (Contributor) commented

> Also, Contains seems to be more performant in the Small/Complete alphabet with no mistakes case, which seems like it would be the most common case.

Interesting! @matiasinsaurralde Do you have any thoughts on this? The no-mistakes use case is definitely the most common, by an order of magnitude or more.

@matiasinsaurralde (Contributor, Author) commented

@Koeng101 Agree with that, feel free to discard/ignore my suggestion

@Koeng101 (Contributor) commented

Closing this pull request, since the benchmarks show better efficiency with Contains for our particular use case.
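For reference, here is a minimal, hypothetical sketch of the Contains-style check being kept (the constant name, function name, and ATGC alphabet are illustrative assumptions, not the actual code in poly):

```go
package main

import (
	"fmt"
	"strings"
)

// allowedCharacters is the accepted alphabet; ATGC is assumed here for
// illustration and may not match the actual Hash2Fragment alphabet.
const allowedCharacters = "ATGC"

// isValidSequence reports whether every character of seq appears in allowedCharacters.
func isValidSequence(seq string) bool {
	for _, r := range seq {
		if !strings.ContainsRune(allowedCharacters, r) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isValidSequence("ATGC")) // true
	fmt.Println(isValidSequence("ATGN")) // false
}
```

With an alphabet of only a few characters, the linear scan touches a handful of bytes that stay in cache and avoids the hashing overhead of a map lookup, which may help explain why Contains came out ahead in the benchmarks above.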

@Koeng101 closed this Nov 17, 2023