Add nvtext::tokenize_with_vocabulary API #13930
Approving CMake changes
Looks good. Just a couple of tiny things.
Feedback attached. The cuco static map is truly awesome. 👍
…14163) Moves `cpp/src/hash/hash_allocator.cuh` to `include/cudf/hashing/detail` so it may be more accessible from non-src/hash source files. Also found `cpp/src/hash/helper_functions.hpp` used in the same way and moved that one as well. No functional changes, just headers moved and includes fixed up.
Reference: #13930 (comment)
Closes #14143
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #14163
One suggestion, otherwise LGTM.
/merge
Description
Adds tokenize with vocabulary APIs to libcudf.

Returns an integer lists column in which each token, as resolved from the `input` using `delimiter`, is replaced with an id value that is the row index of that token in the input `vocabulary` column. If a token is not found in the `vocabulary`, it is assigned `default_id`.

The vocabulary can be loaded once using the `nvtext::load_vocabulary()` API and then used in repeated calls to `nvtext::tokenize_with_vocabulary()` with different input columns.

The Python interface is a new class, `TokenizeVocabulary`, which can be used like the following:
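As a rough illustration of the semantics described above, here is a plain-Python sketch of the token-to-id mapping. It is not the cudf or libcudf API: the function names merely mirror `nvtext::load_vocabulary()` and `nvtext::tokenize_with_vocabulary()`, and the default `delimiter`, the default `default_id` of -1, the skipping of empty tokens, and the handling of duplicate vocabulary entries are all assumptions made for the example.

```python
def load_vocabulary(vocabulary):
    """Build a token -> row-index lookup once (hypothetical stand-in)."""
    lookup = {}
    for row, token in enumerate(vocabulary):
        # Assumption: if a token appears twice, the first row index is kept.
        lookup.setdefault(token, row)
    return lookup


def tokenize_with_vocabulary(rows, lookup, delimiter=" ", default_id=-1):
    """Return one list of ids per input row (hypothetical stand-in)."""
    result = []
    for text in rows:
        # Assumption: empty tokens from repeated delimiters are skipped.
        tokens = [t for t in text.split(delimiter) if t]
        # Tokens missing from the vocabulary are assigned default_id.
        result.append([lookup.get(t, default_id) for t in tokens])
    return result


# Load the vocabulary once, then reuse it across different inputs.
vocab = load_vocabulary(["brown", "the", "dog", "jumps"])
first = tokenize_with_vocabulary(["the brown dog jumps over the brown cat"], vocab)
second = tokenize_with_vocabulary(["dog jumps", "cat"], vocab)
print(first)   # [[1, 0, 2, 3, -1, 1, 0, -1]]
print(second)  # [[2, 3], [-1]]
```

The lookup is built once and passed to each tokenize call, mirroring the load-once, reuse-many pattern the description gives for the vocabulary.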