Fully support date/time legacy rebase for nested input [databricks] #9660
Merged: revans2 merged 78 commits into NVIDIA:branch-23.12 from ttnghia:rebase_nested_timestamp on Nov 16, 2023
Commits (78, all by ttnghia):
- c578a64 Add check for nested types
- e368aa6 Add check for nested types
- 7da416b Recursively check for rebasing
- df8f861 Extract common code
- 95d19ee Allow nested type in rebase check
- b426610 Enable nested timestamp in roundtrip test
- 7343b17 Fix another test
- 0d48f57 Merge branch 'check_rebase_nested' into rebase_datatime
- 024e6c9 Enable `LEGACY` rebase in read
- 9a39628 Remove comment
- e686bb0 Change function/class signatures
- b49963e Merge branch 'branch-23.12' into rebase_datatime
- 2c232f8 Complete modification
- ac0f3e4 Misc
- c773794 Add explicit type
- 29df7cd Rename file and add some stuff in DateTimeRebaseHelpers.scala
- 1b5112d Move file and rename class
- 63342a9 Adopt new enum type
- 6b2d795 Add name for the enum classes
- 37aa40b Change exception messages
- d4cdc1b Merge branch 'branch-23.12' into refactor_parquet_scan
- 03f681e Does not yet support legacy rebase in read
- 14f230f Change legacy to corrected mode
- 1b464ec Extract common code
- 0d26d97 Rename functions
- c2504fd Reformat
- edb6c81 Make classes serializable
- ea86e8f Revert "Support rebase checking for nested dates and timestamps (#9617)"
- b14463f Merge branch 'refactor_parquet_scan' into rebase_datatime
- adc8ae2 Implement date time rebase
- 791573c Optimize rebase op
- 54e959f Merge branch 'branch-23.12' into refactor_parquet_scan
- 3f01690 Change comment
- 6d9c20b Merge branch 'refactor_parquet_scan' into rebase_datatime
- 8c63273 Move tests
- 1b1fdc3 Add test for datatime rebase
- e6559ce Various changes
- 74fe84a Various changes
- a455a90 Fix compile errors
- b87493c Fix comments
- 321e516 Fix indentations
- 4bc33be Merge branch 'refactor_parquet_scan' into rebase_datatime
- 4aab36b Change comments and indentations
- 1b4744a Merge branch 'rebase_datatime' into rebase_nested_timestamp
- 70310db Allow nested check for rebase
- c615925 Merge branch 'branch-23.12' into rebase_datatime
- be92368 Write different timestamp types in test
- b09c61f Fix conversion if timestamp is not micros
- 00d96e4 Rename var
- 7d81311 Dont have to down cast after up cast
- 116bf3e Change comment
- 273b2c4 Still cast timestamp to the old type after rebasing
- 996d9d4 Rename test
- 5fd6ef5 Should not transform non-datetime types
- d53ecfa Merge branch 'rebase_datatime' into rebase_nested_timestamp
- 4144655 Fix test
- 5a8b44c Update tests
- a33bfd6 Merge branch 'rebase_datatime' into rebase_nested_timestamp
- e366e5a Enable int96 rebase in write
- 247f47f Change tests
- 8eba053 Complete tests
- bda59ef Revert unrelated changes
- bbcd9d9 Merge branch 'branch-23.12' into int96_rebase_write
- fbe37d7 Merge branch 'branch-23.12' into rebase_datatime
- 4a92d54 Change configs
- 54c53d3 Merge branch 'rebase_datatime' into rebase_nested_timestamp
- 2f30ce9 Merge branch 'int96_rebase_write' into rebase_nested_timestamp
- af817de Merge tests
- 13242f4 Simplify test data
- e1d9f74 Add a new write test
- 82012b6 Add a mixed rebase test
- 76694ad Merge branch 'branch-23.12' into rebase_nested_timestamp
- cbef912 Change tests
- 1474dda Merge branch 'branch-23.12' into rebase_nested_timestamp
- 14487bf Fix `seed` in tests
- 0fff5e6 Rename tests
- 8bfca59 Merge branch 'branch-23.12' into rebase_nested_timestamp
- 61d7d3d Add default seed
The first change simplifies the nested date/time data generators in the integration tests:

```diff
@@ -75,11 +75,12 @@
     TimestampGen(start=datetime(1, 1, 1, tzinfo=timezone.utc),
                  end=datetime(2000, 1, 1, tzinfo=timezone.utc))
         .with_special_case(datetime(1000, 1, 1, tzinfo=timezone.utc), weight=10.0)]
-parquet_datetime_in_struct_gen = [StructGen([['child' + str(ind), sub_gen] for ind, sub_gen in enumerate(parquet_datetime_gen_simple)]),
-                                  StructGen([['child0', StructGen([['child' + str(ind), sub_gen] for ind, sub_gen in enumerate(parquet_datetime_gen_simple)])]])]
-parquet_datetime_in_array_gen = [ArrayGen(sub_gen, max_length=10) for sub_gen in parquet_datetime_gen_simple + parquet_datetime_in_struct_gen] + [
-    ArrayGen(ArrayGen(sub_gen, max_length=10), max_length=10) for sub_gen in parquet_datetime_gen_simple + parquet_datetime_in_struct_gen]
-parquet_nested_datetime_gen = parquet_datetime_gen_simple + parquet_datetime_in_struct_gen + parquet_datetime_in_array_gen
+parquet_datetime_in_struct_gen = [
+    StructGen([['child' + str(ind), sub_gen] for ind, sub_gen in enumerate(parquet_datetime_gen_simple)])]
+parquet_datetime_in_array_gen = [ArrayGen(sub_gen, max_length=10) for sub_gen in
+                                 parquet_datetime_gen_simple + parquet_datetime_in_struct_gen]
+parquet_nested_datetime_gen = parquet_datetime_gen_simple + parquet_datetime_in_struct_gen + \
+                              parquet_datetime_in_array_gen

 parquet_map_gens = parquet_map_gens_sample + [
     MapGen(StructGen([['child0', StringGen()], ['child1', StringGen()]], nullable=False), FloatGen()),
```

Review comment on lines +78 to +83: "Simplify the data generators a bit, since they are too heavy and the tests using them (especially the parquet read tests) become very slow."
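For background, the LEGACY/CORRECTED rebase modes exist because Spark 2.x stored dates and timestamps on the hybrid Julian calendar, while Spark 3.x uses the proleptic Gregorian calendar, and rebasing keeps the local date label the same while shifting the underlying day count. The special-case value `datetime(1000, 1, 1)` in the generators above sits squarely in the range where the two calendars disagree. A minimal, self-contained sketch of the day-level shift via Julian Day Numbers (illustration only, not the plugin's actual rebase code):

```python
def julian_to_jdn(y, m, d):
    """Julian-calendar date -> Julian Day Number (integer arithmetic)."""
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

def gregorian_to_jdn(y, m, d):
    """Proleptic-Gregorian date -> Julian Day Number."""
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return (d + (153 * mm + 2) // 5 + 365 * yy
            + yy // 4 - yy // 100 + yy // 400 - 32045)

# The calendars meet at the 1582 cutover: Julian 1582-10-04 is followed
# directly by Gregorian 1582-10-15.
assert julian_to_jdn(1582, 10, 5) == gregorian_to_jdn(1582, 10, 15)

# The test data's special case, 1000-01-01: the same date label lies on
# physical days 5 apart, so a legacy day count must be shifted by 5 when
# reinterpreted on the proleptic Gregorian calendar.
shift = julian_to_jdn(1000, 1, 1) - gregorian_to_jdn(1000, 1, 1)
assert shift == 5
```

This is why the tests below pin the rebase modes explicitly: without rebasing, a pre-1582 date written under one calendar and read under the other silently moves by several days.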
The second change replaces the old combined roundtrip test with two tests: one asserting that EXCEPTION mode fails the write with `SparkUpgradeException`, and one roundtripping mixed CORRECTED/LEGACY write modes:

```diff
@@ -460,15 +461,35 @@ def generate_map_with_empty_validity(spark, path):
 @datagen_overrides(seed=0, reason='https://github.com/NVIDIA/spark-rapids/issues/9701')
 @pytest.mark.parametrize('data_gen', parquet_nested_datetime_gen, ids=idfn)
 @pytest.mark.parametrize('ts_write', parquet_ts_write_options)
-@pytest.mark.parametrize('ts_rebase_write', ['CORRECTED', 'LEGACY'])
-@pytest.mark.parametrize('ts_rebase_read', ['CORRECTED', 'LEGACY'])
-def test_datetime_roundtrip_with_legacy_rebase(spark_tmp_path, data_gen, ts_write, ts_rebase_write, ts_rebase_read):
+@pytest.mark.parametrize('ts_rebase_write', ['EXCEPTION'])
+def test_parquet_write_fails_legacy_datetime(spark_tmp_path, data_gen, ts_write, ts_rebase_write):
     data_path = spark_tmp_path + '/PARQUET_DATA'
     all_confs = {'spark.sql.parquet.outputTimestampType': ts_write,
                  'spark.sql.legacy.parquet.datetimeRebaseModeInWrite': ts_rebase_write,
-                 'spark.sql.legacy.parquet.int96RebaseModeInWrite': ts_rebase_write,
-                 'spark.sql.legacy.parquet.datetimeRebaseModeInRead': ts_rebase_read,
-                 'spark.sql.legacy.parquet.int96RebaseModeInRead': ts_rebase_read}
+                 'spark.sql.legacy.parquet.int96RebaseModeInWrite': ts_rebase_write}
+    def writeParquetCatchException(spark, data_gen, data_path):
+        with pytest.raises(Exception) as e_info:
+            unary_op_df(spark, data_gen).coalesce(1).write.parquet(data_path)
+        assert e_info.match(r".*SparkUpgradeException.*")
+    with_gpu_session(
+        lambda spark: writeParquetCatchException(spark, data_gen, data_path),
+        conf=all_confs)
+
+@datagen_overrides(seed=0, reason='https://github.com/NVIDIA/spark-rapids/issues/9701')
+@pytest.mark.parametrize('data_gen', parquet_nested_datetime_gen, ids=idfn)
+@pytest.mark.parametrize('ts_write', parquet_ts_write_options)
+@pytest.mark.parametrize('ts_rebase_write', [('CORRECTED', 'LEGACY'), ('LEGACY', 'CORRECTED')])
+@pytest.mark.parametrize('ts_rebase_read', [('CORRECTED', 'LEGACY'), ('LEGACY', 'CORRECTED')])
+def test_parquet_write_roundtrip_datetime_with_legacy_rebase(spark_tmp_path, data_gen, ts_write,
+                                                             ts_rebase_write, ts_rebase_read):
+    data_path = spark_tmp_path + '/PARQUET_DATA'
+    all_confs = {'spark.sql.parquet.outputTimestampType': ts_write,
+                 'spark.sql.legacy.parquet.datetimeRebaseModeInWrite': ts_rebase_write[0],
+                 'spark.sql.legacy.parquet.int96RebaseModeInWrite': ts_rebase_write[1],
+                 # The rebase modes in read configs should be ignored and overridden by the same
+                 # modes in write configs, which are retrieved from the written files.
+                 'spark.sql.legacy.parquet.datetimeRebaseModeInRead': ts_rebase_read[0],
+                 'spark.sql.legacy.parquet.int96RebaseModeInRead': ts_rebase_read[1]}
     assert_gpu_and_cpu_writes_are_equal_collect(
         lambda spark, path: unary_op_df(spark, data_gen).coalesce(1).write.parquet(path),
         lambda spark, path: spark.read.parquet(path),
```
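The rebase settings exercised by these tests are standard Spark SQL configs, each accepting EXCEPTION, CORRECTED, or LEGACY. For reference, one combination from the test matrix written spark-defaults style (values here are illustrative; as the in-diff comment notes, the read-side modes act only as a fallback when the written files already carry the writer's rebase metadata):

```
spark.sql.parquet.outputTimestampType                TIMESTAMP_MICROS
spark.sql.legacy.parquet.datetimeRebaseModeInWrite   CORRECTED
spark.sql.legacy.parquet.int96RebaseModeInWrite      LEGACY
spark.sql.legacy.parquet.datetimeRebaseModeInRead    LEGACY
spark.sql.legacy.parquet.int96RebaseModeInRead       CORRECTED
```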
Review comment: "This deleted test is combined with `test_parquet_read_roundtrip_datetime`."