Support trunc and date_trunc SQL function #11833

Merged
merged 21 commits into from
Dec 14, 2024
ttnghia
Collaborator

@ttnghia ttnghia commented Dec 6, 2024

This implements override for TruncDate and TruncTimestamp to support the SQL functions trunc and date_trunc.

Closes #11804 and #11860.
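For readers unfamiliar with the two functions, here is a minimal CPU-side sketch of the truncation semantics, assuming Spark's documented behavior (weeks start on Monday, an unrecognized format yields NULL). The helper name `truncate_timestamp` is hypothetical and is not part of Spark or this plugin; the GPU kernel is implemented differently.

```python
from datetime import datetime, timedelta
from typing import Optional


def truncate_timestamp(ts: datetime, fmt: str) -> Optional[datetime]:
    """Hypothetical illustration of date_trunc semantics on a naive timestamp.

    Returns None for an unrecognized format, mirroring the NULL result
    that Spark is assumed to produce in that case.
    """
    unit = fmt.upper()  # format names are matched case-insensitively
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit in ("YEAR", "YYYY", "YY"):
        return midnight.replace(month=1, day=1)
    if unit == "QUARTER":
        first_month = 3 * ((ts.month - 1) // 3) + 1  # 1, 4, 7 or 10
        return midnight.replace(month=first_month, day=1)
    if unit in ("MONTH", "MM", "MON"):
        return midnight.replace(day=1)
    if unit == "WEEK":
        # weekday() is 0 for Monday, so this steps back to the Monday of the week
        return midnight - timedelta(days=ts.weekday())
    if unit in ("DAY", "DD"):
        return midnight
    if unit == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "MINUTE":
        return ts.replace(second=0, microsecond=0)
    if unit == "SECOND":
        return ts.replace(microsecond=0)
    if unit == "MILLISECOND":
        return ts.replace(microsecond=ts.microsecond // 1000 * 1000)
    if unit == "MICROSECOND":
        return ts
    return None  # invalid format


sample = datetime(2024, 12, 14, 21, 9, 30, 123456)  # a Saturday
week_start = truncate_timestamp(sample, "week")
quarter_start = truncate_timestamp(sample, "QUARTER")
```

`trunc` on a date column only makes sense for the units down to the week level; the sub-day units apply to `date_trunc` on timestamps.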

Depends on:

Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia added the labels "feature request" (New feature or request), "SQL" (part of the SQL/Dataframe plugin), and "task" (Work required that improves the product but is not user facing) Dec 6, 2024
@ttnghia ttnghia requested a review from revans2 December 6, 2024 22:08
@ttnghia ttnghia self-assigned this Dec 6, 2024
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia changed the title Implement trunc and date_trunc SQL function Support trunc and date_trunc SQL function Dec 6, 2024
@ttnghia ttnghia marked this pull request as draft December 7, 2024 05:59
@ttnghia ttnghia marked this pull request as ready for review December 7, 2024 07:39
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia linked an issue Dec 9, 2024 that may be closed by this pull request
@ttnghia
Collaborator Author

ttnghia commented Dec 10, 2024

build

Collaborator

@jihoonson jihoonson left a comment

LGTM overall. I left minor comments and questions on the doc.

jihoonson
jihoonson previously approved these changes Dec 10, 2024
Collaborator

@jihoonson jihoonson left a comment

LGTM

Collaborator

@revans2 revans2 left a comment

Is there a follow-on issue to support the format as a Scalar in the kernel?

@revans2
Collaborator

revans2 commented Dec 11, 2024

Performance looks great. I ran:

spark.time(spark.range(10000000000L).selectExpr("max(date_trunc('week', timestamp_micros(id)))").show(false))

GPU time was around 880 ms and CPU time was around 165,600 ms, so the GPU was about 188 times faster. I think this is what happens when we have to deal with time zones on the CPU.
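The speedup figure follows directly from the two timings quoted above. A quick check, using only those two numbers (the variable names are just for illustration):

```python
# Timings quoted in the comment above, in milliseconds.
gpu_ms = 880
cpu_ms = 165_600

# Speedup is the ratio of CPU time to GPU time.
speedup = cpu_ms / gpu_ms
print(f"{speedup:.1f}x")  # prints "188.2x"
```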

Collaborator

@revans2 revans2 left a comment

Without this change, the tests fail in other time zones (for example with TZ='Africa/Casablanca'), because the code falls back to the CPU but the tests don't expect that.

trunc_timestamp_format_gen = StringGen('(?i:YEAR|YYYY|YY|QUARTER|MONTH|MM|MON|WEEK|DAY|DD|HOUR|MINUTE|SECOND|MILLISECOND|MICROSECOND)') \
.with_special_pattern('invalid', weight=50)

@pytest.mark.parametrize('data_gen', [date_gen], ids=idfn)
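As a side note on the generator pattern quoted above: the `(?i:...)` group makes the alternation case-insensitive, so the generated formats cover mixed-case spellings, while the `'invalid'` special pattern exercises the unsupported-format path. A small sketch using only Python's standard `re` module (not the integration-test `StringGen`, whose internals are not shown here; `is_supported_format` is a hypothetical helper name):

```python
import re

# Same alternation as in the quoted generator, compiled with Python's re module.
FORMAT_PATTERN = re.compile(
    r'(?i:YEAR|YYYY|YY|QUARTER|MONTH|MM|MON|WEEK|DAY|DD|HOUR|MINUTE|SECOND|'
    r'MILLISECOND|MICROSECOND)'
)


def is_supported_format(fmt: str) -> bool:
    """Hypothetical helper: True when the whole string is one of the listed formats."""
    return FORMAT_PATTERN.fullmatch(fmt) is not None


checks = {fmt: is_supported_format(fmt) for fmt in ["week", "MiLlIsEcOnD", "MON", "invalid"]}
```

With the `'invalid'` special pattern weighted at 50, a noticeable share of generated rows should hit the unsupported-format path, which is presumably why it was added.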
Collaborator

Sorry, I missed this before. All of the tests need to be marked with

@allow_non_gpu(*non_utc_tz_allow)

Collaborator Author

This should apply only to the timestamp tests, right? Date tests should not care about the time zone.

Collaborator

I think it is only needed for timestamp. But you might want to test it to be sure.

Collaborator Author

Tested the latest commit with TZ=Asia/Shanghai ./integration_tests/run_pyspark_from_build.sh ... and all passed.
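To illustrate why only the timestamp tests are sensitive to the session time zone, here is a minimal sketch using the standard library. It assumes a fixed UTC+8 offset as a stand-in for Asia/Shanghai, and `truncate_to_day` is a hypothetical helper, not plugin code. Truncating the same instant to the day gives different results in different zones, whereas a date value carries no zone at all.

```python
from datetime import date, datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: a fixed +08:00 offset stands in for Asia/Shanghai.
UTC_PLUS_8 = timezone(timedelta(hours=8))


def truncate_to_day(instant: datetime, tz: timezone) -> datetime:
    """Hypothetical helper: truncate an aware instant to local midnight in tz."""
    local = instant.astimezone(tz)
    return local.replace(hour=0, minute=0, second=0, microsecond=0)


instant = datetime(2024, 12, 10, 20, 30, tzinfo=UTC)

utc_day = truncate_to_day(instant, UTC)              # midnight of Dec 10 in UTC
shanghai_day = truncate_to_day(instant, UTC_PLUS_8)  # midnight of Dec 11 at +08:00

# The two results are different instants, so the answer depends on the zone.
zone_dependent = utc_day != shanghai_day

# A date has no time-of-day or zone, so truncating it to the month is zone-free.
month_start = date(2024, 12, 10).replace(day=1)
```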

@revans2
Collaborator

revans2 commented Dec 11, 2024

I also filed #11860. I don't see it as a big deal, but I wanted to capture it.

@ttnghia
Collaborator Author

ttnghia commented Dec 12, 2024

> I also filed #11860. I don't see it as a big deal, but I wanted to capture it.

I've pushed a new JNI PR (NVIDIA/spark-rapids-jni#2687) to optimize the kernel when the format string is given as a scalar.

Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia linked an issue Dec 13, 2024 that may be closed by this pull request
@ttnghia
Collaborator Author

ttnghia commented Dec 14, 2024

build

@ttnghia ttnghia merged commit 33d97e8 into NVIDIA:branch-25.02 Dec 14, 2024
50 checks passed
@ttnghia ttnghia deleted the trunc branch December 14, 2024 21:09
mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Dec 18, 2024
This change updates operatorsScore.csv and supportedExprs.csv to include `TruncDate` and `TruncTimestamp`, for all shim versions.

This seems to have been left out of NVIDIA#11833.

Signed-off-by: MithunR <[email protected]>
mythrocks added a commit that referenced this pull request Dec 19, 2024
…11890)

This change updates `operatorsScore.csv` and `supportedExprs.csv` to include `TruncDate` and `TruncTimestamp`, for all shim versions.

This seems to have been left out of #11833.

Signed-off-by: MithunR <[email protected]>
Labels
"feature request" (New feature or request), "SQL" (part of the SQL/Dataframe plugin), "task" (Work required that improves the product but is not user facing)
Development

Successfully merging this pull request may close these issues.

[FEA] kernel for date_trunc and trunc that has a scalar format
[FEA] Support TruncDate expression
3 participants