Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update string to float compatibility doc[skip ci] #10156

Merged
merged 2 commits into from
Jan 17, 2024

Conversation

thirtiseven
Copy link
Collaborator

Closes #10037

Casting from string to double on the GPU could sometimes return incorrect results if the string contains high precision values. In Apache Spark, the values are rounded to the nearest double, whereas the Rapids accelerator truncates the values directly.

There are some tests in the issue. It looks like not an easy fix, so I think we can update the compatibility doc first.

@thirtiseven thirtiseven self-assigned this Jan 5, 2024
@sameerz sameerz added the documentation Improvements or additions to documentation label Jan 8, 2024
Comment on lines -726 to -727
represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The
default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our behavior is still inconsistent with Spark for this case, although the behavior does seem to have changed since this documentation was written. Perhaps we need to update this documentation rather than remove it?

The test below was with Spark 3.1.1

scala> val df = Seq("1.7976931348623158E308", "123").toDF("a").repartition(2)
scala> val df2 = df.withColumn("b", col("a").cast(DataTypes.DoubleType))
scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> df2.show
+--------------------+--------------------+
|                   a|                   b|
+--------------------+--------------------+
|1.797693134862315...|1.797693134862315...|
|                 123|               123.0|
+--------------------+--------------------+

scala> spark.conf.set("spark.rapids.sql.enabled", true)

scala> df2.show
24/01/08 18:34:56 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(cast(a#4 as double) as string) AS b#30 will run on GPU
      *Expression <Cast> cast(cast(a#4 as double) as string) will run on GPU
        *Expression <Cast> cast(a#4 as double) will run on GPU
    *Exec <ShuffleExchangeExec> will run on GPU
      *Partitioning <RoundRobinPartitioning> will run on GPU
      ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
        @Expression <AttributeReference> a#4 could run on GPU

+--------------------+--------+
|                   a|       b|
+--------------------+--------+
|1.797693134862315...|Infinity|
|                 123|   123.0|
+--------------------+--------+

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks for your test!

Signed-off-by: Haoyang Li <[email protected]>
Copy link
Collaborator

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems an improvement, but I am not really up on this issue.

@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven merged commit 6419da6 into NVIDIA:branch-24.02 Jan 17, 2024
40 checks passed
@thirtiseven thirtiseven deleted the string_to_float_doc branch January 17, 2024 05:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] String to float casting mismatch with cpu in different way from compatibility doc
4 participants