Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAGRA - runtime dispatch of distance functions #324

Conversation

achirkin
Copy link
Contributor

@achirkin achirkin commented Sep 16, 2024

Move the TEAM_SIZE and DATASET_BLOCK_DIM template parameters of CAGRA search kernels (single_cta and multi_cta versions) to runtime by introducing a switch-case statement inside the compute distance components (hence the name: runtime dispatch of the distance implementation). As a result:

  • 4x less search kernel instances against branch-24.10, but each instance is slightly larger
  • There's a small performance penalty due to larger kernel code (and the register usage is bounded by the most register-expensive branch)
  • [implementation limitation] only one DATASET_BLOCK_DIM per TEAM_SIZE is possible, i.e. the team size defines both (though this is how it currently is anyway).

@achirkin
Copy link
Contributor Author

achirkin commented Sep 16, 2024

This is an alternative to #296 for reducing the binary size with minimal changes to the code. Here's how they compare:

enh-cagra-runtime-distance-dispatch (#324)

performance: up to 12% QPS slowdown on deep-100M and 5% QPS slowdown on wiki-all
libcuvs.so size: 791 -> 657 MB
parallel build time in CI (NB: caching is not disabled):
- total: 709 seconds
- neighbors/cagra_search_xxx: 9 min
- neighbors/detail/cagra_xxx (kernels): 2-3 mins

Sheets: https://docs.google.com/spreadsheets/d/1sm-Qs7upa1JP706TnhX7_M2PFj57ZBNUIRk4F44ZvO0/edit?gid=347774484#gid=347774484

enh-cagra-separable-compilation (#296)

performance: up to 13% QPS slowdown on deep-100M and up to 8% QPS slowdown on wiki-all (low-itopk CAGRA-Q cases being the worst, with others seemingly slightly better than (#324)
libcuvs.so size: 791 -> 624 MB
parallel build time in CI (NB: caching is not disabled):
- total: 478 seconds
- neighbors/cagra_search_xxx: 4 min
- neighbors/detail/cagra_xxx (kernels): 1-3 mins
- neighbors/detail/compute_distance_xxx: 50 sec (many of them)

Sheets: https://docs.google.com/spreadsheets/d/1b_yTRcWE9K8SpEVGhzW_UZa2pTLibFSF8CFc2W3bJqM/edit?gid=347774484#gid=347774484

@achirkin
Copy link
Contributor Author

achirkin commented Sep 16, 2024

To sum it up:

@achirkin achirkin marked this pull request as ready for review September 17, 2024 06:46
@achirkin achirkin requested review from a team as code owners September 17, 2024 06:46
@achirkin
Copy link
Contributor Author

Performance update:
After a few optimizations for the worst cases #296, its performance is generally better than this PR on deep-100M dataset (dim = 96) and slightly worse on wiki-all dataset (dim = 768, only VPQ distance tested due to the index size). In the worst case, #296 leads to QPS drop of 6-7%, with plenty of cases actually getting a decent speedup (against both this PR and base-24.10). The caveat here is that some optimizations in #296 are potentially transferable to the base branch, so this benchmark is not entirely apples-to-apples anymore.

Latest benchmarks (includes k = 10 and k = 100 for both datasets):
https://docs.google.com/spreadsheets/d/191f1sYsAUwPidncV4xDzgNtOhKx9sp6jJFvOfpB-nxs

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Sep 20, 2024
@achirkin achirkin closed this Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

2 participants