Add nvtext::tokenize_with_vocabulary API #13930

davidwendt · 2023-08-21T19:20:12Z

Description

Adds tokenize with vocabulary APIs to libcudf.

struct tokenize_vocabulary{ ... };

std::unique_ptr<tokenize_vocabulary> load_vocabulary(
  cudf::strings_column_view const& input,
  rmm::cuda_stream_view stream,
  rmm::mr::device_memory_resource* mr);

std::unique_ptr<cudf::column> tokenize_with_vocabulary(
  cudf::strings_column_view const& input,
  tokenize_vocabulary const& vocabulary,
  cudf::string_scalar const& delimiter,
  cudf::size_type default_id,
  rmm::cuda_stream_view stream,
  rmm::mr::device_memory_resource* mr);

Returns a lists column of integers in which each token of the input, identified using delimiter, is replaced by its id value: the row index of that token in the input vocabulary column.
If a token is not found in the vocabulary, it is assigned default_id.
The vocabulary can be loaded once using the nvtext::load_vocabulary() API and then used in repeated calls to nvtext::tokenize_with_vocabulary() with different input columns.
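
The mapping described above can be modeled on the CPU in a few lines. The following is a minimal pure-Python sketch of the semantics only, not the libcudf implementation. The function name `tokenize_with_vocabulary_reference` is hypothetical, and three behaviors are assumptions not stated in this description: vocabulary entries are unique, an empty delimiter means "split on whitespace", and empty tokens are dropped.

```python
def tokenize_with_vocabulary_reference(rows, vocabulary, delimiter="", default_id=-1):
    """CPU reference model: map each token to its row index in `vocabulary`."""
    # Token -> row index of the vocabulary column.
    # Assumption: vocabulary entries are unique.
    ids = {word: index for index, word in enumerate(vocabulary)}
    output = []
    for row in rows:
        # Assumption: an empty delimiter means "split on whitespace".
        tokens = row.split() if delimiter == "" else row.split(delimiter)
        # Assumption: empty tokens (from repeated delimiters) are skipped.
        output.append([ids.get(token, default_id) for token in tokens if token])
    return output


vocabulary = ["brown", "the", "dog", "jumps"]
rows = ["the brown dog jumps over the brown cat"]
print(tokenize_with_vocabulary_reference(rows, vocabulary))
# [[1, 0, 2, 3, -1, 1, 0, -1]]
```

Each input row yields one list of ids, so the result has the same number of rows as the input.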

The Python interface is a new class, TokenizeVocabulary, which can be used as follows:

>>> import cudf
>>> from cudf.core.tokenize_vocabulary import TokenizeVocabulary
>>> words = cudf.Series( ['brown', 'the', 'dog', 'jumps'] )
>>> vocab = TokenizeVocabulary(words)
>>> s = cudf.Series( ['the brown dog jumps over the brown cat'] )
>>> print(vocab.tokenize(s))
0    [1, 0, 2, 3, -1, 1, 0, -1]
dtype: list
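
The load-once, reuse-many pattern can also be illustrated without a GPU. The sketch below is a hypothetical pure-Python stand-in (`VocabularyModel` is not part of cuDF). It builds the lookup table once in the constructor and reuses it on every call, using the `tokenize` method name that a later commit in this PR adopts in place of `__call__`. Splitting on whitespace is an assumption.

```python
class VocabularyModel:
    """Hypothetical CPU stand-in for the load-once, tokenize-many pattern."""

    def __init__(self, words):
        # Built once; every later tokenize() call reuses this lookup table.
        self._ids = {word: index for index, word in enumerate(words)}

    def tokenize(self, rows, default_id=-1):
        # Assumption: tokens are separated by whitespace.
        return [[self._ids.get(t, default_id) for t in row.split()] for row in rows]


vocab = VocabularyModel(["brown", "the", "dog", "jumps"])
print(vocab.tokenize(["the dog jumps"]))         # [[1, 2, 3]]
print(vocab.tokenize(["brown cat", "the fox"]))  # [[0, -1], [1, -1]]
```

Building the table in the constructor keeps the per-call cost proportional to the number of tokens in the input rather than the size of the vocabulary.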

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

robertmaynard

Approving CMake changes

nvdbaranec

Looks good. Just a couple of tiny things.

cpp/src/text/vocabulary_tokenize.cu

cpp/tests/text/tokenize_tests.cpp

bdice

Feedback attached. The cuco static map is truly awesome. 👍

cpp/include/nvtext/tokenize.hpp

cpp/src/text/vocabulary_tokenize.cu

python/cudf/cudf/core/tokenize_vocabulary.py

python/cudf/cudf/tests/text/test_text_methods.py

…14163)
Moves `cpp/src/hash/hash_allocator.cuh` to `include/cudf/hashing/detail` so it is more accessible from source files outside `src/hash`. Also found `cpp/src/hash/helper_functions.hpp` used in the same way and moved that one as well. No functional changes, just headers moved and includes fixed up.
Reference: #13930 (comment)
Closes #14143
Authors: David Wendt (https://github.com/davidwendt)
Approvers: Yunsong Wang (https://github.com/PointKernel), Vyas Ramasubramani (https://github.com/vyasr)
URL: #14163

bdice

One suggestion, otherwise LGTM.

python/cudf/cudf/core/tokenize_vocabulary.py

davidwendt · 2023-09-26T22:53:38Z

/merge

davidwendt added 2 commits August 21, 2023 15:17

Add nvext::tokenize_with_vocabulary API

09fc656

Merge branch 'branch-23.10' into fea-vocab-tokenize

ac8288b

davidwendt added the feature request, 2 - In Progress, libcudf, strings, and non-breaking labels Aug 21, 2023

davidwendt self-assigned this Aug 21, 2023

github-actions bot added the CMake label Aug 21, 2023

davidwendt added 2 commits August 23, 2023 08:57

Merge branch 'branch-23.10' into fea-vocab-tokenize

040270d

add python/cython interface to tokenize_with_vocabulary

c4100eb

github-actions bot added the Python label Aug 23, 2023

davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Aug 23, 2023

update doxygen with UB statement

dd92077

davidwendt marked this pull request as ready for review August 23, 2023 18:50

davidwendt requested review from a team as code owners August 23, 2023 18:50

davidwendt requested review from vyasr, shwina and nvdbaranec August 23, 2023 18:50

bdice changed the title ~~Add nvext::tokenize_with_vocabulary API~~ Add nvtext::tokenize_with_vocabulary API Aug 23, 2023

robertmaynard approved these changes Aug 23, 2023

Merge branch 'branch-23.10' into fea-vocab-tokenize

a68cf64

nvdbaranec requested changes Aug 24, 2023

cpp/src/text/vocabulary_tokenize.cu (outdated, resolved)

cpp/tests/text/tokenize_tests.cpp (resolved)

davidwendt added 2 commits August 24, 2023 15:39

Merge branch 'branch-23.10' into fea-vocab-tokenize

445e203

Merge branch 'branch-23.10' into fea-vocab-tokenize

99a4374

davidwendt marked this pull request as draft August 28, 2023 19:20

Merge branch 'branch-23.10' into fea-vocab-tokenize

753d2d0

davidwendt marked this pull request as ready for review September 7, 2023 17:44

davidwendt requested a review from nvdbaranec September 7, 2023 17:44

Merge branch 'branch-23.10' into fea-vocab-tokenize

aff8151

nvdbaranec approved these changes Sep 12, 2023

davidwendt added 2 commits September 13, 2023 09:51

Merge branch 'branch-23.10' into fea-vocab-tokenize

a4ee6ed

Merge branch 'branch-23.10' into fea-vocab-tokenize

71636d9

bdice reviewed Sep 19, 2023

davidwendt added 2 commits September 19, 2023 09:57

Merge branch 'branch-23.10' into fea-vocab-tokenize

54cd282

Merge branch 'branch-23.10' into fea-vocab-tokenize

44cadeb

davidwendt mentioned this pull request Sep 20, 2023

Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail directory #14143

Closed

davidwendt added 2 commits September 20, 2023 14:37

change __call__ to tokenize

1802f49

Merge branch 'branch-23.10' into fea-vocab-tokenize

d504ccc

davidwendt requested a review from bdice September 20, 2023 18:38

davidwendt mentioned this pull request Sep 21, 2023

Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail #14163

Merged

davidwendt added 4 commits September 22, 2023 10:59

Merge branch 'branch-23.10' into fea-vocab-tokenize

bc52507

Merge branch 'branch-23.10' into fea-vocab-tokenize

3c91d2c

Merge branch 'branch-23.10' into fea-vocab-tokenize

155271a

Merge branch 'branch-23.10' into fea-vocab-tokenize

399af22

davidwendt added 3 commits September 25, 2023 11:57

Merge branch 'branch-23.10' into fea-vocab-tokenize

013a766

fix header include

36d4849

Merge branch 'branch-23.10' into fea-vocab-tokenize

e161e1f

bdice approved these changes Sep 26, 2023

python/cudf/cudf/core/tokenize_vocabulary.py (outdated, resolved)

davidwendt added 3 commits September 26, 2023 11:58

Merge branch 'branch-23.10' into fea-vocab-tokenize

9195cc7

remove removes section from class docstring

6202b83

Merge branch 'branch-23.10' into fea-vocab-tokenize

2056b58

rapids-bot bot merged commit b25b292 into rapidsai:branch-23.10 Sep 26, 2023

davidwendt deleted the fea-vocab-tokenize branch September 26, 2023 22:53

VibhuJawa mentioned this pull request Oct 17, 2023

[FEA] Switch cudf.Subwordtokenizer to use the vocab file directly instead of hash_vocab. #14294

Closed
