diff --git a/docs/additional-functionality/advanced_configs.md b/docs/additional-functionality/advanced_configs.md
index f577daaf10f..3644b09951e 100644
--- a/docs/additional-functionality/advanced_configs.md
+++ b/docs/additional-functionality/advanced_configs.md
@@ -245,7 +245,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
 spark.rapids.sql.expression.FromUnixTime|`from_unixtime`|Get the string from a unix timestamp|true|None|
 spark.rapids.sql.expression.GetArrayItem| |Gets the field at `ordinal` in the Array|true|None|
 spark.rapids.sql.expression.GetArrayStructFields| |Extracts the `ordinal`-th fields of all array elements for the data with the type of array of struct|true|None|
-spark.rapids.sql.expression.GetJsonObject|`get_json_object`|Extracts a json object from path|true|None|
+spark.rapids.sql.expression.GetJsonObject|`get_json_object`|Extracts a json object from path|false|This is disabled by default because escape sequences are not processed correctly, the input is not validated, and the output is not normalized the same as Spark|
 spark.rapids.sql.expression.GetMapValue| |Gets Value from a Map based on a key|true|None|
 spark.rapids.sql.expression.GetStructField| |Gets the named field of the struct|true|None|
 spark.rapids.sql.expression.GetTimestamp| |Gets timestamps from strings using given pattern.|true|None|
diff --git a/docs/compatibility.md b/docs/compatibility.md
index 8060866dc3b..2644c873e98 100644
--- a/docs/compatibility.md
+++ b/docs/compatibility.md
@@ -441,6 +441,44 @@ parse some variants of `NaN` and `Infinity` even when this option is disabled
 ([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
 Spark version 3.3.0 and later.
 
+### get_json_object
+
+The `GetJsonObject` operator takes a JSON formatted string and a JSON path string as input. The
+code base for this is currently separate from GPU parsing of JSON for files and `FromJsonObject`.
+Because of this the results can be different from each other. Because of several incompatibilities
+and bugs in the GPU version of `GetJsonObject` it will be on the CPU by default. If you are
+aware of the current limitations with the GPU version, you might see a significant performance
+speedup if you enable it by setting `spark.rapids.sql.expression.GetJsonObject` to `true`.
+
+The following is a list of known differences.
+ * [No input validation](https://github.com/NVIDIA/spark-rapids/issues/10218). If the input string
+   is not valid JSON Apache Spark returns a null result, but ours will still try to find a match.
+ * [Escapes are not properly processed for Strings](https://github.com/NVIDIA/spark-rapids/issues/10196).
+   When returning a result for a quoted string Apache Spark will remove the quotes and replace
+   any escape sequences with the proper characters. The escape sequence processing does not happen
+   on the GPU.
+ * [Invalid JSON paths could throw exceptions](https://github.com/NVIDIA/spark-rapids/issues/10212)
+   If a JSON path is not valid Apache Spark returns a null result, but ours may throw an exception
+   and fail the query.
+ * [Non-string output is not normalized](https://github.com/NVIDIA/spark-rapids/issues/10218)
+   When returning a result for things other than strings, a number of things are normalized by
+   Apache Spark, but are not normalized by the GPU, like removing unnecessary white space,
+   parsing and then serializing floating point numbers, turning single quotes to double quotes,
+   and removing unneeded escapes for single quotes.
+
+The following is a list of bugs in either the GPU version or arguably in Apache Spark itself.
+ * https://github.com/NVIDIA/spark-rapids/issues/10219 non-matching quotes in quoted strings
+ * https://github.com/NVIDIA/spark-rapids/issues/10213 array index notation works without root
+ * https://github.com/NVIDIA/spark-rapids/issues/10214 unquoted array index notation is not
+   supported
+ * https://github.com/NVIDIA/spark-rapids/issues/10215 leading spaces can be stripped from named
+   keys.
+ * https://github.com/NVIDIA/spark-rapids/issues/10216 It appears that Spark is flattening some
+   output, which is different from other implementations including the GPU version.
+ * https://github.com/NVIDIA/spark-rapids/issues/10217 a JSON path execution bug
+ * https://issues.apache.org/jira/browse/SPARK-46761 Apache Spark does not allow the `?` character in
+   a quoted JSON path string.
+
 ## Avro
 
 The Avro format read is a very experimental feature which is expected to have some issues, so we disable
diff --git a/docs/supported_ops.md b/docs/supported_ops.md
index 5af0f356627..c23349467b9 100644
--- a/docs/supported_ops.md
+++ b/docs/supported_ops.md
@@ -6856,7 +6856,7 @@ are limited.
 GetJsonObject
 `get_json_object`
 Extracts a json object from path
-None
+This is disabled by default because escape sequences are not processed correctly, the input is not validated, and the output is not normalized the same as Spark
 project
 json
diff --git a/integration_tests/pom.xml b/integration_tests/pom.xml
index 21432f5161b..5f4c41a1b9b 100644
--- a/integration_tests/pom.xml
+++ b/integration_tests/pom.xml
@@ -1,6 +1,6 @@