Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape quotes and newlines when converting strings to json format in to_json #9612

Merged
merged 9 commits into from
Nov 14, 2023

Conversation

andygrove
Copy link
Contributor

Closes #9514

This PR implements the escapeJsonString method which was previously just a placeholder.

@andygrove andygrove self-assigned this Nov 2, 2023
@andygrove andygrove marked this pull request as draft November 2, 2023 17:44
@andygrove andygrove changed the title Escape quotes when converting strings to json format in to_json [WIP] Escape quotes when converting strings to json format in to_json Nov 2, 2023
@andygrove andygrove changed the title [WIP] Escape quotes when converting strings to json format in to_json Escape quotes and newlines when converting strings to json format in to_json Nov 2, 2023
@andygrove andygrove marked this pull request as ready for review November 2, 2023 19:17
// this is a placeholder for implementing string escaping
// https://github.com/NVIDIA/spark-rapids/issues/9514
cv
withResource(cv) { _=>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original function did not close the input column, but this does. Are we sure this or the caller is doing the right thing here? Looks like there's an artificial incref occurring in the calling code that's confusing to follow. Seems like we do not want this method to close its input in practice, forcing callers to artifically incref to work around.

val chars = Seq("\\", "\"")
val escaped = chars.map("\\" + _)
val chars = Seq("\r", "\n", "\\", "\"")
val escaped = Seq("\\r", "\\n", "\\\\", "\\\"")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: less hard-coding

Suggested change
val escaped = Seq("\\r", "\\n", "\\\\", "\\\"")
val escaped = chars.map(org.apache.commons.text.StringEscapeUtils.escapeJava)

@sameerz sameerz added the task Work required that improves the product but is not user facing label Nov 2, 2023
jlowe
jlowe previously approved these changes Nov 3, 2023
@jlowe
Copy link
Member

jlowe commented Nov 3, 2023

build

revans2
revans2 previously approved these changes Nov 3, 2023
.with_special_case('\'a\'') \
.with_special_case('\\\'a\\\''),
pytest.param(StringGen('\u001a', nullable=True), marks=pytest.mark.xfail(
reason='cuDF represents two-digit unicode characters in hex format such as \x1a'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a cudf issue to track this which we could link from here?

Maybe we can use something like this https://github.com/rapidsai/cudf/blob/f97e74f00b7a6bac37c9603def95a11b06cb013f/cpp/src/io/json/write_json.cu#L95-L108

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this is something that users will encounter - this is specific to escaping non-printable ASCII characters, and it would seem unusual to want to represent those in JSON. I wanted to note this issue to explain why we can't do a full StringGen in the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gerashegalov I added a link to a follow-up issue #9705

@andygrove
Copy link
Contributor Author

build

@andygrove andygrove requested review from jlowe and revans2 November 14, 2023 21:36
@andygrove andygrove merged commit 36baf45 into NVIDIA:branch-23.12 Nov 14, 2023
37 checks passed
@andygrove andygrove deleted the to_json_escape branch November 14, 2023 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Escape JSON strings in to_json
5 participants