
Update the legacy mode check: only take effect when reading date/timestamp column #10074

Merged — 6 commits merged into NVIDIA:branch-24.02 from fix-db-legacy-mode on Jan 10, 2024

Conversation

@res-life (Collaborator) commented Dec 19, 2023

Contributes to #9792.

Databricks (DB) uses legacy mode by default, which differs from regular Spark:

  • When DB writes, the properties in the file metadata are:
Properties:
  org.apache.spark.timeZone: <the TZ you specified>
  org.apache.spark.version: 3.3.0
  org.apache.spark.legacyINT96:           // presence of this key means legacy mode
  org.apache.spark.legacyDateTime:        // presence of this key means legacy mode
  • Regular Spark does not set the legacy-mode or time-zone properties:
Properties:
  org.apache.spark.version: 3.1.1

Currently, Spark-Rapids does not support legacy mode combined with a non-UTC time zone.
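The footer keys above are how a reader can tell that legacy mode was used. A minimal sketch of that detection, with a plain `Map` standing in for the Parquet footer key/value metadata (the helper names here are hypothetical, not the actual plugin code):

```scala
// Hypothetical helpers: decide whether a file was written in legacy rebase
// mode from its footer key/value metadata. Real readers pull these keys
// from the Parquet footer; a Map[String, String] stands in for it here.
def isLegacyDateTime(footerProps: Map[String, String]): Boolean =
  footerProps.contains("org.apache.spark.legacyDateTime")

def writerTimeZone(footerProps: Map[String, String]): Option[String] =
  footerProps.get("org.apache.spark.timeZone")

// A DB-written file: legacy mode plus an explicit writer time zone.
val dbProps = Map(
  "org.apache.spark.version"        -> "3.3.0",
  "org.apache.spark.timeZone"       -> "Asia/Shanghai",
  "org.apache.spark.legacyDateTime" -> "")

// A regular Spark file: no legacy-mode or time-zone keys.
val ossProps = Map("org.apache.spark.version" -> "3.1.1")
```

With these, `isLegacyDateTime(dbProps)` is true while `isLegacyDateTime(ossProps)` is false, matching the two property listings above.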

Solution

Update the check for legacy mode with a non-UTC time zone:

  • If the read schema contains a date/timestamp column, throw an exception.
  • If the read schema does not contain a date/timestamp column, do not throw.

This fixes some of the failing cases, though not all of them.
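The updated check can be sketched as follows (illustrative only; the real logic lives in GpuParquetFileFilterHandler and compares normalized ZoneIds rather than strings):

```scala
// Sketch of the updated legacy-mode check. Before this PR, a non-UTC writer
// time zone alone triggered the exception; now it only fires when the read
// schema actually contains a date/timestamp column.
def checkLegacyMode(hasDateTimeInReadSchema: Boolean, fileTimeZone: String): Unit = {
  if (hasDateTimeInReadSchema && fileTimeZone != "UTC") {
    throw new UnsupportedOperationException(
      s"Reading date/timestamp written in legacy mode with time zone " +
      s"$fileTimeZone is not supported")
  }
}
```

So `checkLegacyMode(hasDateTimeInReadSchema = false, "Asia/Shanghai")` now passes quietly, while `checkLegacyMode(hasDateTimeInReadSchema = true, "Asia/Shanghai")` still throws.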

With this PR, the number of failed cases on DB decreased from 1071 to 123.

I'll post another PR to fix the remaining failed cases.

Signed-off-by: Chong Gao [email protected]

@res-life

build

@res-life

build

@res-life

Failed cases decreased from 1071 to 123 on DB.

@res-life res-life self-assigned this Dec 20, 2023
@res-life res-life marked this pull request as ready for review December 20, 2023 05:41
@sameerz sameerz added the task Work required that improves the product but is not user facing label Dec 29, 2023
@thirtiseven (Collaborator) left a comment

Code LGTM.

Question: I'm a little confused why throwing an exception makes these tests pass; could you list some of the tests that this PR fixes?

@res-life commented Jan 5, 2024

> Code LGTM.
>
> Question: I'm a little confused why throwing an exception makes these tests pass; could you list some of the tests that this PR fixes?

The code below reduces the chance of throwing the exception:

-      if (fileTimeZoneId.normalized() != GpuOverrides.UTC_TIMEZONE_ID) {
+     if (hasDateTimeInReadSchema && fileTimeZoneId.normalized() != GpuOverrides.UTC_TIMEZONE_ID) {

@thirtiseven previously approved these changes Jan 5, 2024
@res-life res-life requested a review from revans2 January 5, 2024 08:57
/**
* Class for helper functions for Date and Timestamp
*/
object DateTypeUtils {
Collaborator:

Since we are checking for both date and time, let's call it DateTimeUtils.

Collaborator:

Wait, since we already have DateUtils.scala, I would recommend putting this DateTimeUtils class code into it, and renaming DateUtils.scala into dateTimeUtils.scala too.

Collaborator (author):

Done.

* @param predicate predicate for a date type.
* @return true if date type or its children have a true predicate.
*/
def hasType(t: DataType, predicate: DataType => Boolean): Boolean = {
Contributor:

This looks like a duplication of org.apache.spark.sql.rapids.execution.TrampolineUtil#dataTypeExistsRecursively. Can we use that instead?
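For reference, a recursive existence check of the shape both helpers share can be sketched over a toy type hierarchy (this ADT is illustrative; Spark's DataType and TrampolineUtil.dataTypeExistsRecursively are the real things):

```scala
// Toy stand-ins for Spark's DataType hierarchy.
sealed trait DataType
case object DateType extends DataType
case object TimestampType extends DataType
case object StringType extends DataType
case class ArrayType(element: DataType) extends DataType
case class StructType(fields: Seq[DataType]) extends DataType

// Returns true if `t` or any nested child type satisfies the predicate,
// mirroring the spirit of dataTypeExistsRecursively.
def existsRecursively(t: DataType, predicate: DataType => Boolean): Boolean =
  predicate(t) || (t match {
    case ArrayType(e)   => existsRecursively(e, predicate)
    case StructType(fs) => fs.exists(f => existsRecursively(f, predicate))
    case _              => false
  })
```

For example, `existsRecursively(StructType(Seq(ArrayType(DateType))), t => t == DateType || t == TimestampType)` finds the nested date column and returns true.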

Collaborator (author):

Done.

@@ -0,0 +1,53 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
Contributor:

Suggested change:
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2024, NVIDIA CORPORATION.

Collaborator (author):

Done.

@res-life commented Jan 9, 2024

build

@res-life res-life force-pushed the fix-db-legacy-mode branch from b83435f to 251d18c Compare January 9, 2024 02:27
@res-life commented Jan 9, 2024

build

@@ -774,9 +774,11 @@ private case class GpuParquetFileFilterHandler(
val clipped = GpuParquetUtils.clipBlocksToSchema(clippedSchema, blocks, isCaseSensitive)
(clipped, clippedSchema)
}

val hasDateTimeInReadSchema = DataTypeUtils.hasDateOrTimestampType(readDataSchema)
Collaborator:

nit: There is an isOrContainsDateOrTimestamp in GpuOverrides. Can we use it?

def isOrContainsDateOrTimestamp(dataType: DataType): Boolean =
TrampolineUtil.dataTypeExistsRecursively(dataType, dt => dt == TimestampType || dt == DateType)

Collaborator (author):

Yes, we can use it. Since it's a nit, let's merge this PR first.

@res-life res-life merged commit 708bbac into NVIDIA:branch-24.02 Jan 10, 2024
40 checks passed
@res-life res-life deleted the fix-db-legacy-mode branch January 10, 2024 06:48