Update string to float compatibility doc[skip ci] #10156
Conversation
Signed-off-by: Haoyang Li <[email protected]>
represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The
default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.
Our behavior is still inconsistent with Spark for this case, although the behavior does seem to have changed since this documentation was written. Perhaps we need to update this documentation rather than remove it?
The test below was with Spark 3.1.1
```scala
scala> val df = Seq("1.7976931348623158E308", "123").toDF("a").repartition(2)
scala> val df2 = df.withColumn("b", col("a").cast(DataTypes.DoubleType))
scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> df2.show
+--------------------+--------------------+
|                   a|                   b|
+--------------------+--------------------+
|1.797693134862315...|1.797693134862315...|
|                 123|               123.0|
+--------------------+--------------------+

scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> df2.show
24/01/08 18:34:56 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(cast(a#4 as double) as string) AS b#30 will run on GPU
      *Expression <Cast> cast(cast(a#4 as double) as string) will run on GPU
        *Expression <Cast> cast(a#4 as double) will run on GPU
    *Exec <ShuffleExchangeExec> will run on GPU
      *Partitioning <RoundRobinPartitioning> will run on GPU
      ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
        @Expression <AttributeReference> a#4 could run on GPU

+--------------------+--------+
|                   a|       b|
+--------------------+--------+
|1.797693134862315...|Infinity|
|                 123|   123.0|
+--------------------+--------+
```
Updated. Thanks for your test!
Signed-off-by: Haoyang Li <[email protected]>
Seems like an improvement, but I am not really up to speed on this issue.
build
Closes #10037
Casting from string to double on the GPU can sometimes return incorrect results when the string contains high-precision values: Apache Spark rounds the value to the nearest double, whereas the RAPIDS Accelerator truncates it directly.
There are some tests in the issue. It does not look like an easy fix, so I think we can update the compatibility doc first.
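The rounding-versus-truncation difference can be sketched with plain JVM string parsing, which is what Spark's CPU cast ultimately uses. This is a minimal sketch: the "truncation" step below only models dropping excess precision digits before parsing, and is not the actual cuDF implementation.

```scala
object RoundVsTruncate {
  def main(args: Array[String]): Unit = {
    // A string with one more significant digit than a double can represent.
    val s = "1.7976931348623158E308"

    // java.lang.Double.parseDouble rounds to the nearest representable
    // double; this value is below the MaxValue/Infinity midpoint, so it
    // parses to Double.MaxValue rather than Infinity.
    val rounded = java.lang.Double.parseDouble(s)
    println(rounded == Double.MaxValue)  // true

    // Dropping the extra digit before parsing (a simplified model of
    // truncating excess precision) yields a strictly smaller double.
    val truncated = java.lang.Double.parseDouble("1.797693134862315E308")
    println(truncated < rounded)  // true
  }
}
```

The two parses disagree, which illustrates why truncating high-precision strings can produce a different double than Spark's round-to-nearest behavior.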