Enable to_date (via gettimestamp and casting timestamp to date) for non-UTC time zones #10100
Conversation
…imestamp) Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
build
Signed-off-by: Navin Kumar <[email protected]>
build
Some performance testing results:
In this case, I set the session timezone to
build
Do we have any idea why, at row num = 20000, we saw lower perf acceleration? Will this perf result vary across different measurements?
That is way too little data to be a good benchmark. 20,000 rows is < 300k of data. I don't know how many row groups there are, but the overhead of launching kernels is likely most of the time being spent here. I would much rather see a test like:
then we can try and isolate just the
and
For UTC on the GPU I get … For the CPU I get … Or we could just compare UTC runs against Iran runs for GPU and separately for CPU to give us an idea of the overhead that Iran adds to the operation.
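The comparison idea above (same conversion under UTC vs. a non-UTC zone, difference ≈ time-zone overhead) can be sketched on the CPU side with plain `java.time`, no Spark needed. This is only an illustration of the measurement approach, not the actual benchmark from this PR; `timeIt` is a hypothetical helper and `Asia/Tehran` stands in for the "Iran" zone.

```java
import java.time.Instant;
import java.time.ZoneId;

public class TzOverheadSketch {
    // Hypothetical helper (not from the PR): convert n consecutive instants
    // to local date-times in the given zone and return elapsed nanoseconds.
    static long timeIt(ZoneId zone, int n) {
        Instant base = Instant.parse("2020-01-01T00:00:00Z");
        long start = System.nanoTime();
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += base.plusSeconds(i).atZone(zone).toLocalDateTime().getSecond();
        }
        long elapsed = System.nanoTime() - start;
        if (acc < 0) throw new IllegalStateException(); // keep the loop live
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        timeIt(ZoneId.of("UTC"), n); // warm-up pass
        long utcNs  = timeIt(ZoneId.of("UTC"), n);
        long iranNs = timeIt(ZoneId.of("Asia/Tehran"), n);
        System.out.printf("UTC:  %.1f ms%n", utcNs / 1e6);
        System.out.printf("Iran: %.1f ms%n", iranNs / 1e6);
    }
}
```

The same subtraction applies on the GPU side: run the query once per zone and attribute the delta to the time-zone handling.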
Using the approach suggested by @revans2, I generated the parquet file using
and then measured the difference between the total time of … Results:
I verified the last result with the Spark UI. Looks like 1483 ms is consistent with the 1.4 s shown in the GpuProject operator here:
Moving this to draft; looks like there is an issue in casting timestamp to date that is causing off-by-1 issues when making it timezone aware.
Signed-off-by: Navin Kumar <[email protected]>
We can use this class to test perf. Add a new case in this suite.
Signed-off-by: Navin Kumar <[email protected]>
…to date Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
Updated results after casting timestamp to date fix for non-UTC timezones. The timezone tested was
Verified the last item (800,000,000 rows) in Spark UI:
build
build
Premerge failure due to some issue with the GPU time zone database and unit tests; filed #10129 to track.
This is blocked until NVIDIA/spark-rapids-jni#1670 is merged.
Also fixes #10006:

```
scala> val df = Seq("2023-12-31 23:59:59").toDF("ts")
scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Shanghai")
scala> df.selectExpr("to_timestamp(ts)").show()
24/01/02 21:30:26 WARN GpuOverrides:
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> toprettystring(to_timestamp(ts))#9 could run on GPU
+-------------------+
|   to_timestamp(ts)|
+-------------------+
|2023-12-31 23:59:59|
+-------------------+

scala> val df2 = df.selectExpr("to_timestamp(ts)")
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> df2.show()
24/01/02 21:31:15 WARN GpuOverrides:
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> toprettystring(to_timestamp(ts))#23 could run on GPU
+-------------------+
|   to_timestamp(ts)|
+-------------------+
|2023-12-31 15:59:59|
+-------------------+
```
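The two `show()` outputs above differ because Spark stores a timestamp as a UTC instant and renders it in the session time zone. A minimal JVM-side sketch of the same arithmetic with `java.time` (no Spark required; Asia/Shanghai is UTC+8):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

public class SessionTzDemo {
    public static void main(String[] args) {
        // "2023-12-31 23:59:59" parsed as wall-clock time in the
        // Asia/Shanghai session time zone; the UTC instant is stored.
        LocalDateTime wallClock = LocalDateTime.of(2023, 12, 31, 23, 59, 59);
        Instant instant = wallClock.atZone(ZoneId.of("Asia/Shanghai")).toInstant();

        // Re-rendering the same instant under a UTC session zone shifts the
        // displayed value back 8 hours, matching the second show() above.
        LocalDateTime asUtc = instant.atZone(ZoneId.of("UTC")).toLocalDateTime();
        System.out.println(asUtc); // 2023-12-31T15:59:59
    }
}
```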
…nsition times Signed-off-by: Navin Kumar <[email protected]>
build
Signed-off-by: Navin Kumar <[email protected]>
build
Signed-off-by: Navin Kumar <[email protected]>
build
Fixes #9927.

This enables `to_date` for non-UTC time zones. `to_date(str, fmt)` is actually an alias for `cast(gettimestamp(str, fmt) as date)`, so this enables casting timestamp to date, and enables non-UTC time zones for `gettimestamp` (which is basically the parent of the same algorithm used in `unix_timestamp`).
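The date-side half of that alias can be sketched with `java.time`: resolving the stored UTC instant to a calendar date in the session time zone, which is also where the off-by-one mentioned earlier comes from. `toDate` is a hypothetical stand-in for the cast, not the PR's actual implementation.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

public class ToDateSketch {
    // Hypothetical stand-in for cast(timestamp as date): resolve the UTC
    // instant a Spark timestamp stores to a date in the session time zone.
    static LocalDate toDate(Instant instant, ZoneId sessionTz) {
        return instant.atZone(sessionTz).toLocalDate();
    }

    public static void main(String[] args) {
        // Late on Dec 31 in UTC, but already Jan 1 in Asia/Shanghai (UTC+8):
        // resolving the date in the wrong zone yields a day-off result.
        Instant instant = Instant.parse("2023-12-31T20:00:00Z");
        System.out.println(toDate(instant, ZoneId.of("UTC")));           // 2023-12-31
        System.out.println(toDate(instant, ZoneId.of("Asia/Shanghai"))); // 2024-01-01
    }
}
```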