Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] JSON reader: ignores Java/C++ style comment #10265

Open
wbo4958 opened this issue Feb 10, 2022 · 5 comments
Open

[FEA] JSON reader: ignores Java/C++ style comment #10265

wbo4958 opened this issue Feb 10, 2022 · 5 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@wbo4958
Copy link
Contributor

wbo4958 commented Feb 10, 2022

This is part of FEA of NVIDIA/spark-rapids#9
We have a JSON file

{"name": // name
 "Reynold Xin"}

Spark can parse it when enabling allowComments and multiLine

or

{'name': /* hello */ 'Reynold Xin'}

Spark can parse it when enabling allowComments

We expect there is a configure allowComments to control this behavior.

@wbo4958 wbo4958 added feature request New feature or request Needs Triage Need team to review and classify labels Feb 10, 2022
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz sameerz added the Spark Functionality that helps Spark RAPIDS label Mar 23, 2022
@revans2
Copy link
Contributor

revans2 commented May 16, 2022

This is off by default in Spark so it is nice to be able to support this, but it is not a blocker.

@vuule vuule added the cuIO cuIO issue label Jun 8, 2022
@GregoryKimball GregoryKimball removed the Needs Triage Need team to review and classify label Jun 28, 2022
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Jun 28, 2022
@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment and removed inactive-90d labels Oct 26, 2022
@GregoryKimball
Copy link
Contributor

Do the comments only occur in between field names, values, and symbols : [ {? If so then then adding a comment state might be easier.

Does // comment always end in a \n newline? Can be sure that any / occurring outside a quoted region begins a comment? Single-line comments might be possible, but it would bring some performance penalty from increasing the size of the state machine.

Does a block comment ever contain * characters? If so we would need to add a lookahead for block comments and this would be challenging.

@revans2
Copy link
Contributor

revans2 commented Dec 2, 2022

To be clear this is not a super high priority for us. We have had no one request this yet, but it is a part of Spark. This would be a feature that we could turn on or turn off.

This corresponds to a Jackson JSON parser feature, which is the JSON parser that Spark uses and just exposes some configs for.

https://github.com/FasterXML/jackson-core/blob/0d5c70ff02681b4305f0d4932915e7f025e54e12/src/main/java/com/fasterxml/jackson/core/json/JsonReadFeature.java#L16-L28

Some of the tests for it are at

https://github.com/FasterXML/jackson-core/blob/390472bbdf4722fe058f48bb0eff5865c8d20f73/src/test/java/com/fasterxml/jackson/core/read/CommentParsingTest.java

Do the comments only occur in between field names, values, and symbols : [ {? If so then then adding a comment state might be easier.

Yes, that appears to be correct.

{ "C": 5/*5*/6 } 

results in an error.

Comment parsing within a quoted string is skipped.

{"B":"2"}
{"B":"2//test"}
{"B":"2/*test"}
{"B":"2/*test*/"}

results in

+---------+
|B        |
+---------+
|2        |
|2//test  |
|2/*test  |
|2/*test*/|
+---------+

Does // comment always end in a \n newline?

There are some bugs with this in Spark. For now I would say yes new line \n or the end of the buffer/file. We can standardize on that. The reality is a little more complicated, but it is for features that I don't think we would support, or need to be bug for bug compatible with Spark for.

Can be sure that any / occurring outside a quoted region begins a comment?

From reading the Jackson parser code it appears to be doing exactly that. If it sees a '/' when it is looking to skip white space, a colon, or the end, then it goes into comment skipping mode. If it sees a / by itself that is an error or something after the / that is not another / or a * it is an error.

https://github.com/FasterXML/jackson-core/blob/390472bbdf4722fe058f48bb0eff5865c8d20f73/src/main/java/com/fasterxml/jackson/core/json/UTF8StreamJsonParser.java#L3265-L3282

Single-line comments might be possible, but it would bring some performance penalty from increasing the size of the state machine.

I am not sure what you mean by single line comments? Do you mean something like

{"A": 100}
// This is a single line comment
{"B": 200}
       /* so is this... */
+----+----+
|A   |B   |
+----+----+
|100 |null|
|null|200 |
+----+----+

Does a block comment ever contain * characters? If so we would need to add a lookahead for block comments and this would be challenging.

Yes

{"A": 100 /*****************************************/, "B": 200}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

5 participants