POC for whitespace removal in input JSON data using FST #14931
Thank you so much for working on this and for putting the FST to use 🙂
I just did an early high-level review of the FST parts. Overall, this already looks good. I left a few minor comments that may help us simplify the logic a bit further.
@elstehle would it be feasible to create a single FST that can perform both quote normalization and whitespace removal, and that can also be configured to do only one of these preprocessing steps? I know the JSON parser FST is configurable to some extent, but I don't know how limited this approach is.
I believe it should be possible to have an FST that does both in a single pass. We'd have to see whether it makes sense to integrate all three options, i.e., (1) whitespace removal, (2) quote normalization, (3) both, into a single FST instance, or whether that would overcomplicate the translation function and make it too branchy. The alternative is to have three separate FST instances, one for each of the three options above.
Thanks. If you're not sure whether this is feasible, it probably makes the most sense to start with separate FSTs.
Just a few minor comments, otherwise looks good to me 👍
A couple suggestions to improve comments. Otherwise LGTM!
 * | state, whitespaces following escaped double quotes inside strings may be removed.
 *
 * NOTE: An important case NOT handled by this FST is that of whitespace following newline
 * characters within a string. For example, `{"a":"x\n y"}` ---FST--> `{"a":"x\ny"}`
The example makes it sound like this FST does that transformation. Maybe write:
- * characters within a string. For example, `{"a":"x\n y"}` ---FST--> `{"a":"x\ny"}`
+ * characters within a string. For example, `{"a":"x\n y"}` is unchanged by this FST. It
+ * does not become `{"a":"x\ny"}`.
With the current FST, we would get the transformation described in the comment, but that is not the expected behavior, i.e., we should not remove whitespace characters within quotes. I think the following would make it clearer:
- * characters within a string. For example, `{"a":"x\n y"}` ---FST--> `{"a":"x\ny"}`
+ * characters within a string. Consider the following example
+ * Input: {"a":"x\n y"}
+ * FST output: {"a":"x\ny"}
+ * Expected output: {"a":"x\n y"}
I'm confused. We are documenting a known bug in the current implementation? Are we intending to fix this before merging?
For compatibility with Spark, we don't need to consider newlines within strings as a part of the string. When reading JSON lines with the option set to recover from invalid lines, I think newline characters present before the end of the record (as in the example `{"a":"x\n y"}`) will result in the parser treating it as an invalid line.
I added the note for the sake of completeness and to clarify the scope of the FST.
Quote normalization is applied to the entire JSON input. Whitespace removal is required only for downstream processing of mixed types (#14865), which should cover a much smaller portion of the data than the entire JSON input. So, this may be the reason for separate FSTs. A per-string FST for whitespace removal could be useful, but only if it does not hurt performance.
Nice and clean FST state table! Great work.
{/* IN_STATE      "       \       \n      <SPC>   OTHER  */
 /* TT_OOS */ {{TT_DQS, TT_OOS, TT_OOS, TT_OOS, TT_OOS}},
 /* TT_DQS */ {{TT_OOS, TT_DEC, TT_OOS, TT_DQS, TT_DQS}},
 /* TT_DEC */ {{TT_DQS, TT_DQS, TT_DQS, TT_DQS, TT_DQS}}}};
Is an error state not expected to occur, since the table has no state for errors?
There is no error state for either quote normalization or whitespace normalization. For invalid JSON inputs (such as the `GroundTruth_InvalidInput` test case), the FST processes them anyway and leaves error handling and recovery to the subsequent parsing FST.
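To make the state table quoted above easier to follow, here is a minimal host-side sketch that walks the same transitions one character at a time. It is not the cuIO implementation: the names `classify`, `SymbolGroup`, and `remove_unquoted_whitespace` are hypothetical, grouping tab with `<SPC>` is an assumption based on the PR description ("unquoted spaces and tabs"), and the translation rule (drop a space or tab only while in `TT_OOS`) is likewise an assumption, since the translation function is not shown in this thread.

```cpp
#include <array>
#include <cassert>
#include <string>

// States from the quoted table: outside of string, inside a double-quoted
// string, and directly after a backslash inside a double-quoted string.
enum State { TT_OOS = 0, TT_DQS = 1, TT_DEC = 2 };

// Symbol groups in the table's column order: " \ \n <SPC> OTHER.
// (Hypothetical names, introduced only for this sketch.)
enum SymbolGroup { SG_QUOTE = 0, SG_BACKSLASH, SG_NEWLINE, SG_SPACE, SG_OTHER };

SymbolGroup classify(char c)
{
  switch (c) {
    case '"': return SG_QUOTE;
    case '\\': return SG_BACKSLASH;
    case '\n': return SG_NEWLINE;
    case ' ':
    case '\t': return SG_SPACE;  // assumption: tab shares the <SPC> column
    default: return SG_OTHER;
  }
}

// Transition table copied from the snippet under review.
constexpr std::array<std::array<State, 5>, 3> transitions{{
  /* TT_OOS */ {{TT_DQS, TT_OOS, TT_OOS, TT_OOS, TT_OOS}},
  /* TT_DQS */ {{TT_OOS, TT_DEC, TT_OOS, TT_DQS, TT_DQS}},
  /* TT_DEC */ {{TT_DQS, TT_DQS, TT_DQS, TT_DQS, TT_DQS}}}};

// Sequential stand-in for the FST. Assumed translation rule: a space or tab
// is dropped only while outside of a string; every other symbol is copied.
// There is no error state, so malformed input is passed through as well.
std::string remove_unquoted_whitespace(const std::string& input)
{
  std::string out;
  State state = TT_OOS;
  for (char c : input) {
    SymbolGroup const sg = classify(c);
    if (!(state == TT_OOS && sg == SG_SPACE)) { out.push_back(c); }
    state = transitions[state][sg];
  }
  return out;
}
```

Under these assumptions, the `GroundTruth_InvalidInput` input `"{\"a\" : \"b }\n{ \"c\" :\t\"d\"}"` maps to `"{\"a\":\"b }\n{\"c\":\"d\"}"`: the unterminated string is ended by the newline transition `TT_DQS -> TT_OOS`, with no error raised. The same transition explains the documented caveat, because for `{"a":"x\n y"}` with a literal newline inside the string, the space after the newline is seen in `TT_OOS` and dropped.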
void run_test(const std::string& input, const std::string& output)
{
  // Prepare cuda stream for data transfers & kernels
  rmm::cuda_stream stream{};
Should this be `cudf::get_default_stream()` for tests?
Great idea! It's better to call `cudf::test::get_default_stream()` here instead of creating a new stream. Fixed.
TEST_F(JsonWSNormalizationTest, GroundTruth_InvalidInput)
{
  std::string input  = "{\"a\" : \"b }\n{ \"c\" :\t\"d\"}";
  std::string output = "{\"a\":\"b }\n{\"c\":\"d\"}";
question (not change suggestion):
Why do some string cases use raw string literals while others use escaped strings?
With raw strings, it's hard to see the positions of spaces and tabs when they are next to each other, especially when editors map tabs to different numbers of spaces. With escaped strings, I think we have more control.
That's a really good answer! I hadn't considered that.
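As a small illustration of the point about literals, the sketch below writes the same two-record input both ways. The variable names are hypothetical and the input uses only spaces and a newline, so that nothing depends on how a tab is rendered. In the raw form the line break is an invisible part of the source text, while the escaped form spells it out as `\n` (and would spell a tab out as `\t`).

```cpp
#include <cassert>
#include <string>

// Raw string literal: the record separator is a literal line break in the
// source, and any tab would look like ordinary spacing in most editors.
const std::string raw_form = R"({"a" : "b }
{ "c" : "d"})";

// Escaped literal: every quote and the newline are written out explicitly.
const std::string escaped_form = "{\"a\" : \"b }\n{ \"c\" : \"d\"}";

// Both literals denote the same sequence of characters.
bool literals_match() { return raw_form == escaped_form; }
```

The two strings compare equal, so the choice between them is purely about how readable the whitespace is in the test source.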
Looks good.
/merge
This work is a follow-up to PR #14931, which provided a proof-of-concept for using an FST to normalize unquoted whitespace. This PR implements the pre-processing FST in cuIO and adds a JSON reader option that needs to be set to true to invoke the normalizer. Addresses feature request #14865
Authors:
- Shruti Shivakumar (https://github.com/shrshi)
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Vukasin Milovanovic (https://github.com/vuule)
- Robert Maynard (https://github.com/robertmaynard)
- Bradley Dice (https://github.com/bdice)
URL: #15033
Description
This PR provides a proof-of-concept for using an FST to remove unquoted spaces and tabs in JSON strings. This is useful in cases where we want to cast a hierarchical JSON object to a string, and it overcomes the challenge of processing mixed types using Spark (#14865).
The FST assumes that the single quotes in the input data have already been normalized (possibly using `normalize_single_quotes`).
Checklist