Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON quote normalization #14545

Merged
merged 22 commits into from
Jan 3, 2024
Merged

JSON quote normalization #14545

merged 22 commits into from
Jan 3, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Dec 2, 2023

Description

The goal of this PR is to address PR 10004 by supporting parsing of JSON files containing single quotes for field/value strings.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Dec 2, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Dec 2, 2023
@shrshi shrshi added the feature request New feature or request label Dec 2, 2023
@shrshi
Copy link
Contributor Author

shrshi commented Dec 2, 2023

/ok to test

@shrshi shrshi added the non-breaking Non-breaking change label Dec 2, 2023
Copy link
Contributor

@elstehle elstehle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's already looking great. I think one important point is that we make use of a function object for the translation instead of using a translation table. This allows us to simply return the input symbol for all input symbols that we do not need to modify.
I've slightly modified your FST to accommodate that change, which also simplifies the symbol groups amongst which we need to distinguish.

Also, I've run into a corner case with the current symbol-to-symbol-group-id lookup table implementation for this lookup table we need here. So I temporarily replaced it that lookup with the SymbolToSymbolGroup function object to unblock you here. Sorry for the inconvenience. I'll work on a fix in the meanwhile.

cpp/src/io/fst/lookup_tables.cuh Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Dec 7, 2023

Thank you for the detailed feedback, Elias. I have included all your suggested changes in the most recent commit. The solution for the other corner case that we discussed - the escaped single quote within a single-quoted input string - has been handled as well. There are two examples that need further thought - (i) retaining an escaped single-quote within a double-quoted input string, and (ii) ill-formed JSON with extra single quote before closing brace.

@shrshi
Copy link
Contributor Author

shrshi commented Dec 11, 2023

/ok to test

@shrshi shrshi marked this pull request as ready for review December 11, 2023 23:46
@shrshi shrshi requested a review from a team as a code owner December 11, 2023 23:46
Copy link
Contributor

@elstehle elstehle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far! Left a few minor comments.

I assume the next step is to integrate this preprocessing stage into the reader depending on a new reader option?

cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Show resolved Hide resolved
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me. Some comments and questions.

cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Dec 12, 2023

/ok to test

// Base test fixture for tests
struct FstTest : public cudf::test::BaseFixture {};

void run_test(std::string& input, std::string& output)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the plan for this PR in terms of code location? Here, in tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can place this code in io/fst and have the user read their JSON with a pre-process reader option, or have a separate API that reads this file from device buffer and runs this FST as a pre-processing step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't understand. Everything here is in a test file, so there is no new feature or API was added. What does this PR actually offer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is to integrate this pre-processing FST into libcudf in a follow-up PR to enable parsing the single-quote variant of JSON. This PR constructs the normalizing FST and checks if it can correctly handle valid and invalid JSON cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is just a proof-of-concept that is implemented directly into the unit tests then I don't recommend merging it, as the new unit tests do not actually test any existing API. Only when we have a new API implemented (in the follow up PR) then we merge its corresponding unit tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this and decided it would be beneficial to merge this POC before holidays and then build up the API next year.
IMO the unit tests are pretty useful even with POC because of the run_test abstraction, which is the sole entry point. We can easily update the test util code to call the new API from run_test, without even changing each test.

@shrshi
Copy link
Contributor Author

shrshi commented Dec 14, 2023

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Dec 14, 2023

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Dec 15, 2023

/ok to test

cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
cpp/tests/io/fst/quote_normalization_test.cu Outdated Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Dec 15, 2023

/ok to test

Comment on lines 221 to 224
std::cout << "Expected output: " << output << std::endl << "Computed output: ";
for (size_t i = 0; i < output_gpu_size[0]; i++)
std::cout << output_gpu[i];
std::cout << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I think we want to remove printing to stdout to not pollute test logs?

@vuule vuule added the 0 - Waiting on Author Waiting for author to respond to review label Dec 19, 2023
@shrshi
Copy link
Contributor Author

shrshi commented Jan 2, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Jan 2, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Jan 2, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Jan 3, 2024

/merge

@rapids-bot rapids-bot bot merged commit af65d52 into rapidsai:branch-24.02 Jan 3, 2024
67 checks passed
@shrshi shrshi mentioned this pull request Jan 9, 2024
3 tasks
rapids-bot bot pushed a commit that referenced this pull request Jan 24, 2024
The goal of this PR is to address [10004](#10004) by supporting parsing of JSON files containing single quotes for field/value strings. This is a follow-up work to the POC [PR 14545](#14545)

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Andy Grove (https://github.com/andygrove)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #14729
PointKernel pushed a commit to PointKernel/cudf that referenced this pull request Jan 25, 2024
The goal of this PR is to address [10004](rapidsai#10004) by supporting parsing of JSON files containing single quotes for field/value strings. This is a follow-up work to the POC [PR 14545](rapidsai#14545)

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Andy Grove (https://github.com/andygrove)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: rapidsai#14729
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Waiting on Author Waiting for author to respond to review CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants