JSON quote normalization #14545
/ok to test
That's already looking great. I think one important point is that we make use of a function object for the translation instead of using a translation table. This allows us to simply return the input symbol for all input symbols that we do not need to modify.
I've slightly modified your FST to accommodate that change, which also simplifies the symbol groups amongst which we need to distinguish.
Also, I've run into a corner case with the current symbol-to-symbol-group-id lookup table implementation for the lookup table we need here. So I temporarily replaced that lookup with the SymbolToSymbolGroup function object to unblock you here. Sorry for the inconvenience. I'll work on a fix in the meantime.
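To illustrate the function-object approach described in this comment, here is a minimal host-side sketch. The names `dfa_state` and `translate_op` are hypothetical and are not taken from libcudf; the actual FST runs on the GPU and has a different interface. The sketch only shows why a callable is convenient: it can fall through to returning the input symbol, whereas a translation table would need an entry for every (state, symbol) pair.

```cpp
#include <cassert>
#include <string>

// Hypothetical DFA states, for illustration only.
enum class dfa_state { outside_string, in_single_quoted };

// Hypothetical translation function object: maps (state, symbol) to the
// text that should be emitted for that symbol.
struct translate_op {
  std::string operator()(dfa_state state, char symbol) const
  {
    // A single quote that opens or closes a string becomes a double quote.
    if (symbol == '\'') { return "\""; }
    // A bare double quote inside a single-quoted string must be escaped.
    if (state == dfa_state::in_single_quoted && symbol == '"') { return "\\\""; }
    // Every other symbol passes through unchanged; no table entry is needed.
    return std::string(1, symbol);
  }
};
```

Because the default branch simply echoes the symbol, only the symbols that need rewriting have to be distinguished, which is also what keeps the number of symbol groups small.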
Thank you for the detailed feedback, Elias. I have included all your suggested changes in the most recent commit. The other corner case that we discussed, the escaped single quote within a single-quoted input string, has been handled as well. Two examples need further thought: (i) retaining an escaped single quote within a double-quoted input string, and (ii) ill-formed JSON with an extra single quote before the closing brace.
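The corner cases mentioned here can be made concrete with a small host-side reference normalizer. This is a sketch under stated assumptions, not the GPU FST from this PR: the function name `normalize_quotes` and its state names are hypothetical, and the handling of escapes inside double-quoted strings (copied verbatim) is an assumed choice for case (i), which the comment above leaves open.

```cpp
#include <cassert>
#include <string>

// Hypothetical host-side reference; the PR implements this logic as a GPU FST.
std::string normalize_quotes(std::string const& input)
{
  enum class state { outside, in_double, in_double_escape, in_single, in_single_escape };
  state s = state::outside;
  std::string out;
  out.reserve(input.size());
  for (char c : input) {
    switch (s) {
      case state::outside:
        if (c == '"') { s = state::in_double; out += c; }
        else if (c == '\'') { s = state::in_single; out += '"'; }  // opening ' -> "
        else { out += c; }
        break;
      case state::in_double:
        if (c == '\\') { s = state::in_double_escape; }
        else if (c == '"') { s = state::outside; }
        out += c;
        break;
      case state::in_double_escape:
        // Assumption: escapes in double-quoted strings are copied verbatim.
        out += c;
        s = state::in_double;
        break;
      case state::in_single:
        if (c == '\\') { s = state::in_single_escape; }  // decide on next symbol
        else if (c == '\'') { s = state::outside; out += '"'; }  // closing ' -> "
        else if (c == '"') { out += "\\\""; }  // bare " must now be escaped
        else { out += c; }
        break;
      case state::in_single_escape:
        if (c == '\'') { out += '\''; }  // \' no longer needs its escape
        else { out += '\\'; out += c; }  // keep any other escape sequence
        s = state::in_single;
        break;
    }
  }
  if (s == state::in_single_escape) { out += '\\'; }  // keep a dangling backslash
  return out;
}
```

For example, under these assumptions `{'a':'it\'s'}` becomes `{"a":"it's"}`, while input that is already double-quoted is returned unchanged. Ill-formed input such as case (ii) is not diagnosed by this sketch; it is simply passed through the same state machine.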
/ok to test
Looks good so far! Left a few minor comments.
I assume the next step is to integrate this preprocessing stage into the reader depending on a new reader option?
This is looking good to me. Some comments and questions.
/ok to test
// Base test fixture for tests
struct FstTest : public cudf::test::BaseFixture {};

void run_test(std::string& input, std::string& output)
What is the plan for this PR in terms of code location? Here, in tests?
We can place this code in io/fst and have the user read their JSON with a pre-process reader option, or have a separate API that reads this file from a device buffer and runs this FST as a pre-processing step.
Sorry, I don't understand. Everything here is in a test file, so no new feature or API was added. What does this PR actually offer?
The plan is to integrate this pre-processing FST into libcudf in a follow-up PR to enable parsing the single-quote variant of JSON. This PR constructs the normalizing FST and checks if it can correctly handle valid and invalid JSON cases.
If this is just a proof of concept implemented directly in the unit tests, then I don't recommend merging it, as the new unit tests do not actually test any existing API. We should merge the unit tests only once the corresponding new API is implemented (in the follow-up PR).
We discussed this and decided it would be beneficial to merge this POC before the holidays and then build up the API next year.
IMO the unit tests are pretty useful even with the POC because of the run_test abstraction, which is the sole entry point. We can easily update the test util code to call the new API from run_test, without even changing each test.
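The argument about a single entry point can be sketched as follows. This is a hypothetical host-side illustration, not the test code from this PR: `normalize_under_test` is an assumed stand-in whose body is a deliberately naive placeholder, and this `run_test` returns a bool rather than using the test framework's assertions.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-in for the implementation under test. In the POC this
// would be the test-local FST; later it could forward to a public API.
std::string normalize_under_test(std::string input)
{
  // Naive placeholder: swaps every single quote for a double quote.
  std::replace(input.begin(), input.end(), '\'', '"');
  return input;
}

// Sole entry point used by every test case. Swapping the implementation
// only requires changing the call inside this function.
bool run_test(std::string const& input, std::string const& expected)
{
  return normalize_under_test(input) == expected;
}
```

Each test case only supplies an input and an expected output, so replacing the body of `normalize_under_test` with a call to a future API would leave the test cases themselves untouched.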
/ok to test
/ok to test
/ok to test
/ok to test
std::cout << "Expected output: " << output << std::endl << "Computed output: ";
for (size_t i = 0; i < output_gpu_size[0]; i++)
  std::cout << output_gpu[i];
std::cout << std::endl;
suggestion: I think we want to remove printing to stdout to not pollute test logs?
/ok to test
/ok to test
/ok to test
/merge
The goal of this PR is to address [10004](#10004) by supporting parsing of JSON files containing single quotes for field/value strings. This is follow-up work to the POC [PR 14545](#14545).
Authors:
- Shruti Shivakumar (https://github.com/shrshi)
Approvers:
- Andy Grove (https://github.com/andygrove)
- Vyas Ramasubramani (https://github.com/vyasr)
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)
- Robert (Bobby) Evans (https://github.com/revans2)
URL: #14729
Description
The goal of this PR is to address PR 10004 by supporting parsing of JSON files containing single quotes for field/value strings.
Checklist