Update string to float compatibility doc[skip ci] #10156
Conversation
Signed-off-by: Haoyang Li <[email protected]>
represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The
default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.
Our behavior is still inconsistent with Spark for this case, although the behavior does seem to have changed since this documentation was written. Perhaps we need to update this documentation rather than remove it?
The test below was with Spark 3.1.1
```scala
scala> val df = Seq("1.7976931348623158E308", "123").toDF("a").repartition(2)
scala> val df2 = df.withColumn("b", col("a").cast(DataTypes.DoubleType))
scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> df2.show
+--------------------+--------------------+
|                   a|                   b|
+--------------------+--------------------+
|1.797693134862315...|1.797693134862315...|
|                 123|               123.0|
+--------------------+--------------------+

scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> df2.show
24/01/08 18:34:56 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(cast(a#4 as double) as string) AS b#30 will run on GPU
      *Expression <Cast> cast(cast(a#4 as double) as string) will run on GPU
        *Expression <Cast> cast(a#4 as double) will run on GPU
    *Exec <ShuffleExchangeExec> will run on GPU
      *Partitioning <RoundRobinPartitioning> will run on GPU
      ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
        @Expression <AttributeReference> a#4 could run on GPU

+--------------------+--------+
|                   a|       b|
+--------------------+--------+
|1.797693134862315...|Infinity|
|                 123|   123.0|
+--------------------+--------+
```
Updated. Thanks for your test!
Signed-off-by: Haoyang Li <[email protected]>
Seems like an improvement, but I am not really up to speed on this issue.
build
Closes #10037
Casting from string to double on the GPU can sometimes return incorrect results when the string contains high-precision values: Apache Spark rounds the value to the nearest double, whereas the RAPIDS Accelerator truncates it directly.
There are some tests in the issue. It does not look like an easy fix, so I think we can update the compatibility doc first.
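The rounding-versus-truncation difference can be sketched with plain JVM string parsing, which is what Spark's CPU cast ultimately uses. This is a minimal sketch: the "truncation" step below only models dropping excess precision digits before parsing, and is not the actual cuDF implementation.

```scala
object RoundVsTruncate {
  def main(args: Array[String]): Unit = {
    // A string with one more significant digit than a double can represent.
    val s = "1.7976931348623158E308"

    // java.lang.Double.parseDouble rounds to the nearest representable
    // double; this value is below the MaxValue/Infinity midpoint, so it
    // parses to Double.MaxValue rather than Infinity.
    val rounded = java.lang.Double.parseDouble(s)
    println(rounded == Double.MaxValue)  // true

    // Dropping the extra digit before parsing (a simplified model of
    // truncating excess precision) yields a strictly smaller double.
    val truncated = java.lang.Double.parseDouble("1.797693134862315E308")
    println(truncated < rounded)  // true
  }
}
```

The two parses disagree, which illustrates why truncating high-precision strings can produce a different double than Spark's round-to-nearest behavior.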