Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter #16923

Merged: 7 commits into rapidsai:branch-24.12 on Sep 27, 2024

@shrshi shrshi commented Sep 25, 2024

Description

Addresses #16915

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the "libcudf" label Sep 25, 2024
@shrshi shrshi added the "bug", "cuIO", and "non-breaking" labels Sep 25, 2024
@shrshi shrshi changed the title from "Newline as whitespace chanacter while tokenizing JSONL inputs" to "Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter" Sep 26, 2024
@shrshi shrshi marked this pull request as ready for review September 26, 2024 01:09
@shrshi shrshi requested a review from a team as a code owner September 26, 2024 01:09
@shrshi shrshi requested review from vuule and lamarrr September 26, 2024 01:09
Review comment on cpp/tests/io/json/nested_json_test.cpp (outdated):
d_scalar.data(), static_cast<size_t>(d_scalar.size())};

// Parse the JSON and get the token stream
auto [d_tokens_gpu, d_token_indices_gpu] = cuio_json::detail::get_token_stream(
Contributor

Is there a way to verify correctness using the output of read_json?

Contributor Author

Yes, the JsonReaderTest.ViableDelimiterNewlineWS gtest in json_test.cpp verifies that the table created by read_json is correct for a null delimiter.
Since this delimiter bug fix only affects the pushdown automaton (PDA) in the tokenization step, I also wanted a test that directly checks the output of get_token_stream.
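
For readers unfamiliar with the behavior under discussion, here is a minimal standalone sketch of the idea in the PR title: when the record delimiter is not '\n', a newline outside a string is treated as ordinary whitespace instead of ending a record. This is not the libcudf tokenizer or its PDA. The helpers is_json_whitespace, split_records, and run_example_checks are hypothetical names invented for illustration, and ignoring delimiters inside quoted strings is an assumption of this sketch, not a statement about libcudf.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not a libcudf function): whitespace outside of strings.
// '\n' counts as whitespace only when it is not itself the record delimiter.
inline bool is_json_whitespace(char c, char delimiter)
{
  if (c == delimiter) { return false; }
  return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Hypothetical helper: split `input` into records at `delimiter`, dropping
// whitespace that appears outside of quoted strings. Assumption of this sketch:
// a delimiter character inside a quoted string does not end the record.
inline std::vector<std::string> split_records(std::string_view input, char delimiter)
{
  std::vector<std::string> records;
  std::string current;
  bool in_string = false;
  bool escaped   = false;
  for (char c : input) {
    if (in_string) {
      current.push_back(c);  // keep string contents verbatim
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == delimiter) {
      if (!current.empty()) {
        records.push_back(current);
        current.clear();
      }
      continue;
    }
    if (is_json_whitespace(c, delimiter)) { continue; }
    if (c == '"') { in_string = true; }
    current.push_back(c);
  }
  if (!current.empty()) { records.push_back(current); }
  return records;
}

// Usage: with a null delimiter, the '\n' inside the first record is skipped
// as whitespace and does not split the record.
inline void run_example_checks()
{
  std::string const input = std::string("{\"a\":\n1}") + '\0' + "{\"a\": 2}\n";
  auto const records      = split_records(input, '\0');
  assert(records.size() == 2);
  assert(records[0] == "{\"a\":1}");
  assert(records[1] == "{\"a\":2}");
}
```

With '\n' as the delimiter, the same helper treats each newline as a record boundary, which is the usual JSON Lines behavior.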

@shrshi shrshi requested a review from vuule September 27, 2024 18:24
@karthikeyann karthikeyann left a comment

LGTM 👍

@shrshi shrshi added the "5 - Ready to Merge" label Sep 27, 2024
shrshi commented Sep 27, 2024

/merge

@rapids-bot rapids-bot bot merged commit 6973ef8 into rapidsai:branch-24.12 Sep 27, 2024
100 checks passed
shrshi added a commit to shrshi/cudf that referenced this pull request Sep 27, 2024
…ith non-newline delimiter (rapidsai#16923)

Addresses rapidsai#16915

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Basit Ayantunde (https://github.com/lamarrr)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#16923
raydouglass pushed a commit that referenced this pull request Sep 30, 2024
…ith non-newline delimiter (#16950)

Backporting PR #16923: Parse newline as whitespace character while
tokenizing JSONL inputs

Addresses #16915