Improve performance of hash_character_ngrams using warp-per-string kernel #16212

davidwendt · 2024-07-08T13:11:04Z

Description

Improves the performance of nvtext::hash_character_ngrams by using a warp-per-string kernel instead of a thread-per-string kernel.
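
The difference between the two work assignments can be illustrated with a host-side C++ emulation. This is only a sketch, not the kernel from this PR: `warp_size = 32`, the helper `is_begin_utf8_char`, and both function names are assumptions made for illustration, and the "lanes" run one after another here rather than concurrently as they would on a GPU.

```cpp
#include <cstddef>
#include <string>
#include <vector>

constexpr int warp_size = 32;  // assumed warp width (hypothetical constant for this sketch)

// A UTF-8 byte begins a character unless it is a continuation byte (10xxxxxx).
inline bool is_begin_utf8_char(unsigned char byte) { return (byte & 0xC0) != 0x80; }

// Thread-per-string: a single worker walks the whole string serially.
int count_chars_thread_per_string(std::string const& str)
{
  int count = 0;
  for (unsigned char byte : str) { count += is_begin_utf8_char(byte); }
  return count;
}

// Warp-per-string (emulated): lane k visits bytes k, k + 32, k + 64, ...
// and keeps a private partial count. On a GPU the lanes run concurrently and
// the partial counts are combined with a warp-wide reduction.
int count_chars_warp_per_string(std::string const& str)
{
  std::vector<int> lane_counts(warp_size, 0);
  for (int lane = 0; lane < warp_size; ++lane) {
    for (std::size_t i = lane; i < str.size(); i += warp_size) {
      lane_counts[lane] += is_begin_utf8_char(static_cast<unsigned char>(str[i]));
    }
  }
  int total = 0;  // stands in for the warp-wide sum
  for (int c : lane_counts) { total += c; }
  return total;
}
```

Both functions return the same character count for any input. The warp-per-string form splits the per-string loop across 32 lanes, which is consistent with the larger gains reported for wider strings.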

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

davidwendt · 2024-07-09T23:57:21Z

Benchmark for hash_character_ngrams

## [0] NVIDIA RTX A6000

|  rows  | width | ngrams |   Ref Time |   Cmp Time |           Diff |   %Diff | Speedup |
|--------|-------|--------|------------|------------|----------------|---------|---------|
|  1024  |   128 |   5    | 222.331 us |  48.466 us |    -173.864 us | -78.20% |  4.6x  |
|  4096  |   128 |   5    | 220.053 us |  57.080 us |    -162.974 us | -74.06% |  3.9x  |
|  8192  |   128 |   5    | 215.480 us |  68.233 us |    -147.247 us | -68.33% |  3.2x  |
| 16364  |   128 |   5    | 220.708 us |  93.588 us |    -127.119 us | -57.60% |  2.4x  |
| 32768  |   128 |   5    | 230.986 us | 145.089 us |     -85.898 us | -37.19% |  1.6x  |
| 262144 |   128 |   5    |   2.924 ms | 913.710 us |   -2010.059 us | -68.75% |  3.2x  |
|  1024  |   512 |   5    | 807.259 us |  63.721 us |    -743.538 us | -92.11% | 12.7x  |
|  4096  |   512 |   5    | 819.646 us |  93.860 us |    -725.786 us | -88.55% |  8.7x  |
|  8192  |   512 |   5    | 819.803 us | 132.703 us |    -687.100 us | -83.81% |  6.2x  |
| 16364  |   512 |   5    | 873.233 us | 215.296 us |    -657.937 us | -75.34% |  4.1x  |
| 32768  |   512 |   5    | 930.522 us | 386.793 us |    -543.730 us | -58.43% |  2.4x  |
| 262144 |   512 |   5    |  39.315 ms |   2.810 ms |  -36504.542 us | -92.85% | 14.0x  |
|  1024  |  2048 |   5    |   3.269 ms | 130.677 us |   -3138.038 us | -96.00% | 25.0x  |
|  4096  |  2048 |   5    |   3.595 ms | 252.422 us |   -3342.432 us | -92.98% | 14.2x  |
|  8192  |  2048 |   5    |   3.471 ms | 406.363 us |   -3064.841 us | -88.29% |  8.5x  |
| 16364  |  2048 |   5    |   3.602 ms | 735.013 us |   -2866.837 us | -79.59% |  4.9x  |
| 32768  |  2048 |   5    |   3.776 ms |   1.387 ms |   -2388.923 us | -63.26% |  2.7x  |
| 262144 |  2048 |   5    | 234.571 ms |  10.527 ms | -224043.462 us | -95.51% | 22.3x  |
|  1024  |   128 |  10    | 296.423 us |  49.777 us |    -246.646 us | -83.21% |  6.0x  |
|  4096  |   128 |  10    | 291.558 us |  61.260 us |    -230.297 us | -78.99% |  4.8x  |
|  8192  |   128 |  10    | 286.159 us |  74.268 us |    -211.892 us | -74.05% |  3.9x  |
| 16364  |   128 |  10    | 291.627 us | 104.296 us |    -187.331 us | -64.24% |  2.8x  |
| 32768  |   128 |  10    | 300.675 us | 165.557 us |    -135.118 us | -44.94% |  1.8x  |
| 262144 |   128 |  10    |   2.761 ms |   1.082 ms |   -1678.584 us | -60.81% |  2.6x  |
|  1024  |   512 |  10    |   1.186 ms |  70.354 us |   -1115.780 us | -94.07% | 16.9x  |
|  4096  |   512 |  10    |   1.169 ms | 109.403 us |   -1059.529 us | -90.64% | 10.7x  |
|  8192  |   512 |  10    |   1.190 ms | 157.289 us |   -1032.764 us | -86.78% |  7.6x  |
| 16364  |   512 |  10    |   1.224 ms | 261.513 us |    -962.974 us | -78.64% |  4.7x  |
| 32768  |   512 |  10    |   1.293 ms | 479.156 us |    -814.235 us | -62.95% |  2.7x  |
| 262144 |   512 |  10    |  56.334 ms |   3.501 ms |  -52833.185 us | -93.79% | 16.1x  |
|  1024  |  2048 |  10    |   4.883 ms | 159.409 us |   -4723.640 us | -96.74% | 30.6x  |
|  4096  |  2048 |  10    |   5.148 ms | 315.407 us |   -4832.774 us | -93.87% | 16.3x  |
|  8192  |  2048 |  10    |   4.988 ms | 513.009 us |   -4474.621 us | -89.71% |  9.7x  |
| 16364  |  2048 |  10    |   5.134 ms | 924.567 us |   -4209.696 us | -81.99% |  5.6x  |
| 32768  |  2048 |  10    |   5.383 ms |   1.750 ms |   -3632.743 us | -67.49% |  3.1x  |
| 262144 |  2048 |  10    | 349.209 ms |  13.209 ms | -335999.289 us | -96.22% | 26.4x  |
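
The derived columns in the table appear to follow from the two timing columns. A minimal sketch, assuming Diff = Cmp - Ref, %Diff = 100 * (Cmp - Ref) / Ref, and the last column = Ref / Cmp rounded to one decimal; the struct and function names below are hypothetical and not part of the benchmark tooling.

```cpp
#include <cassert>
#include <cmath>

// Derived benchmark columns, assuming the definitions stated above.
struct comparison {
  double diff;          // Cmp - Ref, in the same unit as the inputs
  double percent_diff;  // 100 * (Cmp - Ref) / Ref
  double speedup;       // Ref / Cmp
};

comparison compare_times(double ref_time, double cmp_time)
{
  assert(ref_time > 0.0 && cmp_time > 0.0);
  return {cmp_time - ref_time,
          100.0 * (cmp_time - ref_time) / ref_time,
          ref_time / cmp_time};
}

// Round to one decimal place, as the last table column is displayed.
double round_to_tenth(double value) { return std::round(value * 10.0) / 10.0; }
```

For the first row (222.331 us against 48.466 us) this gives a speedup that rounds to 4.6 and a percentage difference of about -78.20, matching the table under these assumed definitions.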

mhaseeb123 · 2024-07-31T16:45:19Z

cpp/src/text/generate_ngrams.cu

+    return;
+  }
+
+  namespace cg = cooperative_groups;


Thanks for accommodating cg changes @davidwendt. Just to confirm, do the new changes affect the performance results from before in any way?

davidwendt · 2024-07-31

No. It looks like the performance is the same with cooperative groups as with the cub warp-reduce.

mhaseeb123 · 2024-07-31T16:50:24Z

cpp/src/text/generate_ngrams.cu

+  }
+
+  namespace cg = cooperative_groups;
+  auto warp    = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block());


Minor:

Suggested change

-  auto warp = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block());
+  auto const warp = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block());

PointKernel · 2024-07-31T16:52:38Z

cpp/src/text/generate_ngrams.cu

+    }
+  }
+  auto const char_count = cg::reduce(warp, count, cg::plus<int>());
+  if (lane_idx == 0) { d_counts[str_idx] = std::max(0, char_count - ngrams + 1); }


Suggested change

-  if (lane_idx == 0) { d_counts[str_idx] = std::max(0, char_count - ngrams + 1); }
+  if (lane_idx == 0) { d_counts[str_idx] = max(0, char_count - ngrams + 1); }

Non-blocking nit: use CUDA native functions instead of STL ones in device code, since the former are hardware accelerated. cuda::std::max also works.
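
The expression under discussion clamps the n-gram count at zero so that strings shorter than `ngrams` characters produce no output. A host-side sketch of that arithmetic follows; it uses `std::max` because it runs on the CPU, whereas the review comment concerns device code. The function names and the ASCII-only helper are hypothetical and only illustrate the formula.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Number of character n-grams in a string with `char_count` characters:
// one n-gram starts at each position that still has `ngrams` characters left.
int ngram_count(int char_count, int ngrams)
{
  assert(ngrams > 0);
  return std::max(0, char_count - ngrams + 1);
}

// Lists the n-grams of an ASCII string (one byte per character is assumed,
// so byte offsets and character offsets coincide).
std::vector<std::string> ascii_ngrams(std::string const& str, int ngrams)
{
  std::vector<std::string> result;
  int const count = ngram_count(static_cast<int>(str.size()), ngrams);
  for (int i = 0; i < count; ++i) { result.push_back(str.substr(i, ngrams)); }
  return result;
}
```

For example, a 7-character string with `ngrams = 5` yields 3 n-grams, and a 3-character string yields none.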

mhaseeb123 · 2024-07-31T18:18:41Z

cpp/src/text/generate_ngrams.cu

-      }));
-  auto [offsets, total_ngrams] =
-    cudf::detail::make_offsets_child_column(sizes_itr, sizes_itr + input.size(), stream, mr);
+  auto [offsets, total_ngrams] = [&] {


davidwendt · 2024-08-01T21:00:15Z

/merge

Fixes performance regression in `nvtext::generate_character_ngrams` introduced in #16212. Thread-per-row kernel is faster for smaller strings.

Authors:
- David Wendt (https://github.com/davidwendt)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)

URL: #16849

Improve performance of hash_character_ngrams

00de3e5

davidwendt added the 2 - In Progress, libcudf, strings, improvement, and non-breaking labels Jul 8, 2024

davidwendt self-assigned this Jul 8, 2024

davidwendt added 3 commits July 9, 2024 10:49

Merge branch 'branch-24.08' into ngram-hash-wide-chars

2a00e2e

create char-ngram counting kernel

9bb5600

Merge branch 'branch-24.08' into ngram-hash-wide-chars

b4cbafd

davidwendt added 14 commits July 10, 2024 17:03

Merge branch 'branch-24.08' into ngram-hash-wide-chars

20c206e

Merge branch 'branch-24.08' into ngram-hash-wide-chars

3761822

Merge branch 'branch-24.08' into ngram-hash-wide-chars

84f949b

update some var types

06c399b

Merge branch 'branch-24.08' into ngram-hash-wide-chars

f38ded5

Merge branch 'branch-24.08' into ngram-hash-wide-chars

2efb048

fix type of launch parameters

0b08a91

Merge branch 'branch-24.08' into ngram-hash-wide-chars

2496beb

Merge branch 'branch-24.08' into ngram-hash-wide-chars

f7ac0ea

Merge branch 'branch-24.08' into ngram-hash-wide-chars

fbbbe90

Merge branch 'branch-24.08' into ngram-hash-wide-chars

7bffad4

Merge branch 'branch-24.08' into ngram-hash-wide-chars

f7a2689

minor variable changes

5e6b54c

Merge branch 'branch-24.08' into ngram-hash-wide-chars

cae4bc0

davidwendt changed the base branch from branch-24.08 to branch-24.10 July 23, 2024 16:34

davidwendt added 3 commits July 24, 2024 20:29

Merge branch 'branch-24.10' into ngram-hash-wide-chars

f0a9956

Merge branch 'branch-24.10' into ngram-hash-wide-chars

06925ed

use updated grid_1d class

6b7cfd9

davidwendt removed the 2 - In Progress label Jul 26, 2024

davidwendt added the 3 - Ready for Review label Jul 26, 2024

davidwendt marked this pull request as ready for review July 29, 2024 12:41

davidwendt requested a review from a team as a code owner July 29, 2024 12:41

davidwendt requested review from PointKernel and mhaseeb123 July 29, 2024 12:41

bdice reviewed Jul 29, 2024

cpp/src/text/generate_ngrams.cu (outdated, resolved)

cpp/src/text/generate_ngrams.cu (outdated, resolved)

davidwendt added 3 commits July 30, 2024 16:40

Merge branch 'branch-24.10' into ngram-hash-wide-chars

4fbbf07

replace cub-warp-reduce with cg reduce

88476a8

add cast to grid_1d ctor call to prevent overflow

9c6ce7f

mhaseeb123 reviewed Jul 31, 2024

PointKernel approved these changes Jul 31, 2024

use cuda::std::max

b25403f

mhaseeb123 reviewed Jul 31, 2024

mhaseeb123 approved these changes Jul 31, 2024

Merge branch 'branch-24.10' into ngram-hash-wide-chars

899ad1f

rapids-bot (bot) merged commit 05745d0 into rapidsai:branch-24.10 Aug 1, 2024
84 checks passed

davidwendt deleted the ngram-hash-wide-chars branch August 1, 2024 21:00

davidwendt mentioned this pull request Sep 19, 2024

Fix performance regression for generate_character_ngrams #16849

Merged
