Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter #16923
d_scalar.data(), static_cast<size_t>(d_scalar.size())};

// Parse the JSON and get the token stream
auto [d_tokens_gpu, d_token_indices_gpu] = cuio_json::detail::get_token_stream(
Is there a way to verify correctness using the output of read_json?
Yes, the JsonReaderTest.ViableDelimiterNewlineWS gtest in json_test.cpp verifies the correctness of the table created by read_json for a null delimiter.
Since this delimiter bug fix only affects the PDA in the tokenization step, I wanted to add an additional test that directly checks the output of get_token_stream.
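For readers following the thread, here is a minimal host-side sketch of the behaviour under discussion: when the record delimiter is not a newline, a newline outside of a string is classified as ordinary whitespace instead of ending the record. This is not the cudf PDA or get_token_stream; the names SymbolClass, classify and split_records are hypothetical, and the sketch assumes that a delimiter inside a quoted string is kept as part of that string.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical symbol classes for illustration; these are not cudf's names.
enum class SymbolClass { Delimiter, Whitespace, Other };

// Classify a character that appears outside of a quoted string.
SymbolClass classify(char c, char delimiter)
{
  if (c == delimiter) { return SymbolClass::Delimiter; }
  // '\n' reaches this check only when it is not the delimiter,
  // so it is treated like any other whitespace character.
  if (c == ' ' || c == '\t' || c == '\r' || c == '\n') { return SymbolClass::Whitespace; }
  return SymbolClass::Other;
}

// Split the input into records on `delimiter`, dropping whitespace that
// appears outside of quoted strings. Empty records are skipped.
std::vector<std::string> split_records(std::string_view input, char delimiter)
{
  std::vector<std::string> records;
  std::string current;
  bool in_string = false;
  bool escaped   = false;
  for (char c : input) {
    if (in_string) {
      // Inside a string every character is kept verbatim (assumption of this sketch).
      current.push_back(c);
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    switch (classify(c, delimiter)) {
      case SymbolClass::Delimiter:
        if (!current.empty()) {
          records.push_back(current);
          current.clear();
        }
        break;
      case SymbolClass::Whitespace: break;
      case SymbolClass::Other:
        if (c == '"') { in_string = true; }
        current.push_back(c);
        break;
    }
  }
  if (!current.empty()) { records.push_back(current); }
  return records;
}
```

With a null delimiter, an input such as `{"a":\n 1}` followed by `\0` and `{"b":2}\n` yields two complete records, whereas the same first record split on `\n` is cut in two.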
LGTM 👍
/merge
…ith non-newline delimiter (rapidsai#16923)

Addresses rapidsai#16915

Authors:
- Shruti Shivakumar (https://github.com/shrshi)

Approvers:
- Basit Ayantunde (https://github.com/lamarrr)
- Karthikeyan (https://github.com/karthikeyann)
- Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#16923
Description
Addresses #16915
Checklist