Add new nvtext minhash_permuted API #16756

Merged
merged 101 commits into from
Nov 12, 2024
Changes from 97 commits
7f38f21
Improve minhash performance by using more working memory
davidwendt Sep 5, 2024
a653542
Merge branch 'branch-24.10' into perf-minhash-highmem
davidwendt Sep 5, 2024
99f151e
Merge branch 'branch-24.10' into perf-minhash-highmem
davidwendt Sep 9, 2024
76c0367
fix merge conflict
davidwendt Sep 17, 2024
f5e24ac
Merge branch 'branch-24.10' into perf-minhash-highmem
davidwendt Sep 18, 2024
f81b109
Merge branch 'branch-24.10' into perf-minhash-highmem
davidwendt Sep 18, 2024
9700272
Merge branch 'branch-24.10' into perf-minhash-highmem
davidwendt Sep 19, 2024
f35c16d
change to block per string
davidwendt Sep 19, 2024
6dc19ef
Merge branch 'branch-24.10' into perf-minhash-highmem
davidwendt Sep 19, 2024
01500dd
fix sync call
davidwendt Sep 20, 2024
fcac398
Merge branch 'branch-24.10' into perf-minhash-highmem
davidwendt Sep 20, 2024
e700df2
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Sep 23, 2024
d1c0b85
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Sep 24, 2024
1fef924
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Sep 26, 2024
d611acb
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Sep 26, 2024
2fe7153
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Sep 27, 2024
84f248e
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Sep 27, 2024
c362916
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Sep 30, 2024
117467e
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Sep 30, 2024
6e1bfff
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 1, 2024
70948b9
fix benchmark ranges
davidwendt Oct 2, 2024
f329f84
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 2, 2024
fe565f1
Merge branch 'perf-minhash-highmem' of github.com:davidwendt/cudf int…
davidwendt Oct 2, 2024
81a16be
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 2, 2024
ef3b228
minor fixes
davidwendt Oct 2, 2024
1753a40
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 2, 2024
f4181f7
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 2, 2024
b023aa5
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 3, 2024
03d570d
minor cleanups
davidwendt Oct 3, 2024
3f5f5b5
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 3, 2024
07ddef7
Merge branch 'perf-minhash-highmem' of github.com:davidwendt/cudf int…
davidwendt Oct 3, 2024
c4ff137
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 3, 2024
121419c
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 4, 2024
b1363ee
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 7, 2024
f747163
match benchmark to curator parameters
davidwendt Oct 10, 2024
72acce3
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 10, 2024
aa6f3e0
add minhash_permuted API
davidwendt Oct 10, 2024
23a87ed
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 10, 2024
660641e
Merge branch 'perf-minhash-highmem' of github.com:davidwendt/cudf int…
davidwendt Oct 11, 2024
186477e
revert benchmark API call
davidwendt Oct 11, 2024
24c8073
Merge branch 'perf-minhash-highmem' of github.com:davidwendt/cudf int…
davidwendt Oct 11, 2024
38be18b
fix merge conflict
davidwendt Oct 11, 2024
b49950d
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 14, 2024
a7583d0
experimental single-hash permutation
davidwendt Oct 14, 2024
ff18693
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 14, 2024
00e2bee
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 14, 2024
08e6400
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 15, 2024
42e7d7b
enable seed-hash temporary memory
davidwendt Oct 15, 2024
55245ca
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 15, 2024
8d202dd
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 16, 2024
f79631a
dynamic shared memory to static
davidwendt Oct 16, 2024
8df5acf
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 16, 2024
afe3ade
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 16, 2024
a0816b9
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 17, 2024
fda43cc
cleanup variable names, doxygen
davidwendt Oct 17, 2024
97395c8
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 17, 2024
ce90455
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 18, 2024
0f83584
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 18, 2024
d1e3154
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 21, 2024
1f4d441
support for super-wide strings
davidwendt Oct 22, 2024
9a583dc
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 22, 2024
bf6413f
Merge branch 'perf-minhash-highmem' of github.com:davidwendt/cudf int…
davidwendt Oct 23, 2024
d83e9db
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 23, 2024
58f9206
fix threshold-index init logic
davidwendt Oct 23, 2024
d2abafd
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 23, 2024
fef6e0e
use cudf::detail::device_scalar
davidwendt Oct 23, 2024
bc33896
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 23, 2024
e1067b3
fix benchmarks; add gtest
davidwendt Oct 24, 2024
e5744a5
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 24, 2024
767e163
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 25, 2024
41427fd
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 25, 2024
081546f
reinstate original non-permuted code
davidwendt Oct 25, 2024
c4886a4
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 28, 2024
ef7bb46
add gtest for parameter chunking
davidwendt Oct 28, 2024
2a9928a
move pytests to use permuted api
davidwendt Oct 28, 2024
a17f336
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 28, 2024
a4bce0f
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 28, 2024
9548eb6
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 28, 2024
afb173b
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 29, 2024
a39b738
add docstring for new APIs
davidwendt Oct 29, 2024
dfcd3e6
Merge branch 'perf-minhash-highmem' of github.com:davidwendt/cudf int…
davidwendt Oct 29, 2024
d836f34
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 29, 2024
43541e4
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 31, 2024
b66599c
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 31, 2024
ad4c031
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 31, 2024
ad411a8
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Oct 31, 2024
47cf9e4
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Nov 5, 2024
c5122b3
Merge branch 'perf-minhash-highmem' of github.com:davidwendt/cudf int…
davidwendt Nov 5, 2024
7ade810
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Nov 5, 2024
186befd
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Nov 6, 2024
1989c62
change test_minhash to test_minhash_permuted
davidwendt Nov 6, 2024
7fea20f
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Nov 6, 2024
2804015
add deprecation warnings
davidwendt Nov 6, 2024
99557c7
Merge branch 'perf-minhash-highmem' of github.com:davidwendt/cudf int…
davidwendt Nov 7, 2024
494237d
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Nov 7, 2024
7446d53
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Nov 7, 2024
1ed78a4
change DeprecationWarning to FutureWarning
davidwendt Nov 7, 2024
513218a
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Nov 7, 2024
4e3e25d
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Nov 8, 2024
056eb79
fix shared-memory variable type
davidwendt Nov 8, 2024
8c4e6dc
Merge branch 'branch-24.12' into perf-minhash-highmem
davidwendt Nov 8, 2024
4 changes: 2 additions & 2 deletions cpp/benchmarks/CMakeLists.txt
@@ -348,8 +348,8 @@ ConfigureNVBench(BINARYOP_NVBENCH binaryop/binaryop.cpp binaryop/compiled_binary
ConfigureBench(TEXT_BENCH text/subword.cpp)

ConfigureNVBench(
TEXT_NVBENCH text/edit_distance.cpp text/hash_ngrams.cpp text/jaccard.cpp text/ngrams.cpp
text/normalize.cpp text/replace.cpp text/tokenize.cpp text/vocab.cpp
TEXT_NVBENCH text/edit_distance.cpp text/hash_ngrams.cpp text/jaccard.cpp text/minhash.cpp
text/ngrams.cpp text/normalize.cpp text/replace.cpp text/tokenize.cpp text/vocab.cpp
)

# ##################################################################################################
38 changes: 18 additions & 20 deletions cpp/benchmarks/text/minhash.cpp
@@ -20,35 +20,32 @@

#include <nvtext/minhash.hpp>

#include <rmm/device_buffer.hpp>

#include <nvbench/nvbench.cuh>

static void bench_minhash(nvbench::state& state)
{
auto const num_rows = static_cast<cudf::size_type>(state.get_int64("num_rows"));
auto const row_width = static_cast<cudf::size_type>(state.get_int64("row_width"));
auto const hash_width = static_cast<cudf::size_type>(state.get_int64("hash_width"));
auto const seed_count = static_cast<cudf::size_type>(state.get_int64("seed_count"));
auto const parameters = static_cast<cudf::size_type>(state.get_int64("parameters"));
auto const base64 = state.get_int64("hash_type") == 64;

if (static_cast<std::size_t>(num_rows) * static_cast<std::size_t>(row_width) >=
static_cast<std::size_t>(std::numeric_limits<cudf::size_type>::max())) {
state.skip("Skip benchmarks greater than size_type limit");
}

data_profile const strings_profile = data_profile_builder().distribution(
cudf::type_id::STRING, distribution_id::NORMAL, 0, row_width);
auto const strings_table =
create_random_table({cudf::type_id::STRING}, row_count{num_rows}, strings_profile);
cudf::strings_column_view input(strings_table->view().column(0));

data_profile const seeds_profile = data_profile_builder().null_probability(0).distribution(
cudf::type_to_id<cudf::hash_value_type>(), distribution_id::NORMAL, 0, row_width);
auto const seed_type = base64 ? cudf::type_id::UINT64 : cudf::type_id::UINT32;
auto const seeds_table = create_random_table({seed_type}, row_count{seed_count}, seeds_profile);
auto seeds = seeds_table->get_column(0);
seeds.set_null_mask(rmm::device_buffer{}, 0);
data_profile const param_profile = data_profile_builder().no_validity().distribution(
cudf::type_to_id<cudf::hash_value_type>(),
distribution_id::NORMAL,
0u,
std::numeric_limits<cudf::hash_value_type>::max());
auto const param_type = base64 ? cudf::type_id::UINT64 : cudf::type_id::UINT32;
auto const param_table =
create_random_table({param_type, param_type}, row_count{parameters}, param_profile);
auto const parameters_a = param_table->view().column(0);
auto const parameters_b = param_table->view().column(1);

state.set_cuda_stream(nvbench::make_cuda_stream_view(cudf::get_default_stream().value()));

@@ -57,15 +54,16 @@ static void bench_minhash(nvbench::state& state)
state.add_global_memory_writes<nvbench::int32_t>(num_rows); // output are hashes

state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
auto result = base64 ? nvtext::minhash64(input, seeds.view(), hash_width)
: nvtext::minhash(input, seeds.view(), hash_width);
auto result = base64
? nvtext::minhash64_permuted(input, 0, parameters_a, parameters_b, hash_width)
: nvtext::minhash_permuted(input, 0, parameters_a, parameters_b, hash_width);
});
}

NVBENCH_BENCH(bench_minhash)
.set_name("minhash")
.add_int64_axis("num_rows", {1024, 8192, 16364, 131072})
.add_int64_axis("row_width", {128, 512, 2048})
.add_int64_axis("hash_width", {5, 10})
.add_int64_axis("seed_count", {2, 26})
.add_int64_axis("num_rows", {15000, 30000, 60000})
.add_int64_axis("row_width", {6000, 28000, 50000})
.add_int64_axis("hash_width", {12, 24})
.add_int64_axis("parameters", {26, 260})
.add_int64_axis("hash_type", {32, 64});
94 changes: 94 additions & 0 deletions cpp/include/nvtext/minhash.hpp
@@ -94,6 +94,53 @@ namespace CUDF_EXPORT nvtext {
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* @brief Returns the minhash values for each string
*
* This function uses MurmurHash3_x86_32 for the hash algorithm.
*
* The input strings are first hashed using the given `seed` over substrings
* of `width` characters. These hash values are then combined with the `a`
* and `b` parameter values using the following formula:
* ```
* max_hash = max of uint32
* mp = (1 << 61) - 1
* hv[i] = hash value of a substring at i
* pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash
* ```
*
* This calculation is performed on each substring and the minimum value is computed
* as follows:
* ```
* mh[j,i] = min(pv[i]) for all substrings in row j
* and where i=[0,a.size())
* ```
*
* Any null row entries result in corresponding null output rows.
*
* @throw std::invalid_argument if the width < 2
* @throw std::invalid_argument if parameter_a is empty
* @throw std::invalid_argument if `parameter_b.size() != parameter_a.size()`
* @throw std::overflow_error if `parameter_a.size() * input.size()` exceeds the column size limit
*
* @param input Strings column to compute minhash
* @param seed Seed value used for the hash algorithm
* @param parameter_a Values used for the permuted calculation
* @param parameter_b Values used for the permuted calculation
* @param width The character width of substrings to hash for each row
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned column's device memory
* @return List column of minhash values for each string, one value per `parameter_a`/`parameter_b` pair
*/
std::unique_ptr<cudf::column> minhash_permuted(
cudf::strings_column_view const& input,
uint32_t seed,
cudf::device_span<uint32_t const> parameter_a,
cudf::device_span<uint32_t const> parameter_b,
cudf::size_type width,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* @brief Returns the minhash value for each string
*
@@ -159,6 +206,53 @@ namespace CUDF_EXPORT nvtext {
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* @brief Returns the minhash values for each string
*
* This function uses MurmurHash3_x64_128 for the hash algorithm.
*
* The input strings are first hashed using the given `seed` over substrings
* of `width` characters. These hash values are then combined with the `a`
* and `b` parameter values using the following formula:
* ```
* max_hash = max of uint64
* mp = (1 << 61) - 1
* hv[i] = hash value of a substring at i
* pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash
* ```
*
* This calculation is performed on each substring and the minimum value is computed
* as follows:
* ```
* mh[j,i] = min(pv[i]) for all substrings in row j
* and where i=[0,a.size())
* ```
*
* Any null row entries result in corresponding null output rows.
*
* @throw std::invalid_argument if the width < 2
* @throw std::invalid_argument if parameter_a is empty
* @throw std::invalid_argument if `parameter_b.size() != parameter_a.size()`
* @throw std::overflow_error if `parameter_a.size() * input.size()` exceeds the column size limit
*
* @param input Strings column to compute minhash
* @param seed Seed value used for the hash algorithm
* @param parameter_a Values used for the permuted calculation
* @param parameter_b Values used for the permuted calculation
* @param width The character width of substrings to hash for each row
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned column's device memory
* @return List column of minhash values for each string, one value per `parameter_a`/`parameter_b` pair
*/
std::unique_ptr<cudf::column> minhash64_permuted(
cudf::strings_column_view const& input,
uint64_t seed,
cudf::device_span<uint64_t const> parameter_a,
cudf::device_span<uint64_t const> parameter_b,
cudf::size_type width,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* @brief Returns the minhash values for each row of strings per seed
*
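Reviewer note: the formula in the `minhash_permuted` doc comment can be checked against a small host-side reference. The sketch below is not the libcudf kernel. `standin_hash` (FNV-1a) and `minhash_permuted_reference` are hypothetical names introduced here for illustration. The sketch assumes single-byte characters and assumes that a row shorter than `width` is hashed once as a whole. Because it swaps in FNV-1a for MurmurHash3_x86_32, its values will not match libcudf output; only the permute-and-minimum step follows the documented formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// ASSUMPTION: FNV-1a stands in for MurmurHash3_x86_32 so the sketch has no
// dependencies; its values will not match libcudf output.
inline uint32_t standin_hash(char const* data, std::size_t size, uint32_t seed)
{
  uint32_t h = 2166136261u ^ seed;
  for (std::size_t i = 0; i < size; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 16777619u;
  }
  return h;
}

// Host-side reference for a single row (hypothetical helper, not a libcudf function).
// ASSUMPTION: single-byte characters, so `width` characters == `width` bytes.
// ASSUMPTION: a row shorter than `width` is hashed once as a whole.
inline std::vector<uint32_t> minhash_permuted_reference(std::string const& row,
                                                        uint32_t seed,
                                                        std::vector<uint32_t> const& a,
                                                        std::vector<uint32_t> const& b,
                                                        std::size_t width)
{
  // Mirror the argument checks listed in the doc comment.
  if (width < 2) { throw std::invalid_argument("width must be at least 2"); }
  if (a.empty()) { throw std::invalid_argument("parameter_a must not be empty"); }
  if (a.size() != b.size()) { throw std::invalid_argument("parameter_a/b size mismatch"); }

  constexpr uint64_t mp       = (uint64_t{1} << 61) - 1;  // Mersenne prime 2^61 - 1
  constexpr uint64_t max_hash = std::numeric_limits<uint32_t>::max();

  std::size_t const count = row.size() < width ? 1 : row.size() - width + 1;
  assert(count >= 1);

  std::vector<uint32_t> mh(a.size(), std::numeric_limits<uint32_t>::max());
  for (std::size_t s = 0; s < count; ++s) {
    std::size_t const len = std::min(width, row.size() - s);
    // Each substring is hashed once with the single seed ...
    uint64_t const hv = standin_hash(row.data() + s, len, seed);
    // ... then permuted once per (a, b) pair; 32-bit operands cannot overflow uint64_t.
    for (std::size_t k = 0; k < a.size(); ++k) {
      uint64_t const pv = ((hv * a[k] + b[k]) % mp) & max_hash;
      mh[k] = std::min(mh[k], static_cast<uint32_t>(pv));
    }
  }
  return mh;
}
```

A call such as `minhash_permuted_reference("the quick brown fox", 0, a, b, 4)` returns one value per `(a[k], b[k])` pair. With `a[k] == 1` and `b[k] == 0` the permutation is the identity, so that entry equals the minimum raw hash over all substrings, which makes a convenient sanity check. Note that each substring is hashed only once regardless of how many parameter pairs are supplied; the per-pair work is the multiply, add, and modulo.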