Add nvtext::tokenize_with_vocabulary API #13930

Merged
merged 33 commits into from
Sep 26, 2023

@davidwendt davidwendt commented Aug 21, 2023

Description

Adds tokenize with vocabulary APIs to libcudf.

struct tokenize_vocabulary{ ... };

std::unique_ptr<tokenize_vocabulary> load_vocabulary(
  cudf::strings_column_view const& input,
  rmm::cuda_stream_view stream,
  rmm::mr::device_memory_resource* mr);

std::unique_ptr<cudf::column> tokenize_with_vocabulary(
  cudf::strings_column_view const& input,
  tokenize_vocabulary const& vocabulary,
  cudf::string_scalar const& delimiter,
  cudf::size_type default_id,
  rmm::cuda_stream_view stream,
  rmm::mr::device_memory_resource* mr);

Returns a lists column of integers in which each token, obtained by splitting the input strings on delimiter, is replaced by its id value: the row index of that token in the input vocabulary column.
If a token is not found in the vocabulary it is assigned default_id.
The vocabulary can be loaded once using the nvtext::load_vocabulary() API and then used in repeated calls to nvtext::tokenize_with_vocabulary() with different input columns.

The Python interface is a new class, TokenizeVocabulary, which can be used as follows:

>>> import cudf
>>> from cudf.core.tokenize_vocabulary import TokenizeVocabulary
>>> words = cudf.Series( ['brown', 'the', 'dog', 'jumps'] )
>>> vocab = TokenizeVocabulary(words)
>>> s = cudf.Series( ['the brown dog jumps over the brown cat'] )
>>> print(vocab(s))
0    [1, 0, 2, 3, -1, 1, 0, -1]
dtype: list
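
To make the id assignment concrete, here is a minimal pure-Python sketch of the semantics described above. It is only a reference model, not the libcudf or cuDF implementation: the class name `ReferenceVocabulary` and its `tokenize` method are hypothetical, and the sketch assumes that the vocabulary has no duplicate entries and that an empty delimiter means "split on whitespace". It also leans on Python's `str.split`, whose handling of consecutive delimiters may differ from the real API.

```python
from typing import Dict, List, Sequence


class ReferenceVocabulary:
    """Hypothetical pure-Python stand-in for a loaded vocabulary (not the cuDF class)."""

    def __init__(self, words: Sequence[str]) -> None:
        # A token's id is the row index of the word in the vocabulary.
        # Assumption: the vocabulary contains no duplicate entries.
        self.ids: Dict[str, int] = {word: row for row, word in enumerate(words)}

    def tokenize(
        self, rows: Sequence[str], delimiter: str = "", default_id: int = -1
    ) -> List[List[int]]:
        # Assumption: an empty delimiter means "split on whitespace".
        sep = delimiter if delimiter else None
        # Tokens missing from the vocabulary map to default_id.
        return [
            [self.ids.get(token, default_id) for token in row.split(sep)]
            for row in rows
        ]


# Build the vocabulary once, then reuse it for several inputs.
vocab = ReferenceVocabulary(["brown", "the", "dog", "jumps"])
first = vocab.tokenize(["the brown dog jumps over the brown cat"])
second = vocab.tokenize(["dog-cat-the"], delimiter="-", default_id=99)
print(first)   # [[1, 0, 2, 3, -1, 1, 0, -1]]
print(second)  # [[2, 99, 1]]
```

The first result matches the list shown in the cuDF example above; the second call reuses the same vocabulary with a different delimiter and default id.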

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Aug 21, 2023
@davidwendt davidwendt self-assigned this Aug 21, 2023
@github-actions github-actions bot added the CMake CMake build issue label Aug 21, 2023
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 23, 2023
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Aug 23, 2023
@davidwendt davidwendt marked this pull request as ready for review August 23, 2023 18:50
@davidwendt davidwendt requested review from a team as code owners August 23, 2023 18:50
@bdice bdice changed the title from "Add nvext::tokenize_with_vocabulary API" to "Add nvtext::tokenize_with_vocabulary API" Aug 23, 2023

@robertmaynard robertmaynard left a comment

Approving CMake changes


@nvdbaranec nvdbaranec left a comment

Looks good. Just a couple of tiny things.

cpp/src/text/vocabulary_tokenize.cu Outdated Show resolved Hide resolved
cpp/tests/text/tokenize_tests.cpp Show resolved Hide resolved
@davidwendt davidwendt marked this pull request as draft August 28, 2023 19:20
@davidwendt davidwendt marked this pull request as ready for review September 7, 2023 17:44

@bdice bdice left a comment

Feedback attached. The cuco static map is truly awesome. 👍

cpp/include/nvtext/tokenize.hpp Outdated Show resolved Hide resolved
cpp/include/nvtext/tokenize.hpp Outdated Show resolved Hide resolved
cpp/src/text/vocabulary_tokenize.cu Outdated Show resolved Hide resolved
cpp/src/text/vocabulary_tokenize.cu Outdated Show resolved Hide resolved
cpp/src/text/vocabulary_tokenize.cu Show resolved Hide resolved
python/cudf/cudf/core/tokenize_vocabulary.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/tokenize_vocabulary.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/tokenize_vocabulary.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/text/test_text_methods.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/text/test_text_methods.py Outdated Show resolved Hide resolved
rapids-bot bot pushed a commit that referenced this pull request Sep 25, 2023
…14163)

Moves `cpp/src/hash/hash_allocator.cuh` to `include/cudf/hashing/detail` so it is more accessible from source files outside `src/hash`.
Also found `cpp/src/hash/helper_functions.hpp` used in the same way and moved that one as well.
No functional changes, just headers moved and includes fixed up.

Reference: #13930 (comment)

Closes #14143

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #14163

@bdice bdice left a comment

One suggestion, otherwise LGTM.

python/cudf/cudf/core/tokenize_vocabulary.py Outdated Show resolved Hide resolved
@davidwendt

/merge

@rapids-bot rapids-bot bot merged commit b25b292 into rapidsai:branch-23.10 Sep 26, 2023
@davidwendt davidwendt deleted the fea-vocab-tokenize branch September 26, 2023 22:53
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API. strings strings issues (C++ and Python)
4 participants