Make serialization key checks uniform for Hive Text file read/write. #11824
Conversation
This is a minor change, to bring the serialization format checks on the read and write sides of Hive text files in line with each other.

The read side would check whether the serialization key is set to empty (""):
https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/HiveProviderImpl.scala#L155-L161

The write side was checking for the expected default, "1":
https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/GpuHiveFileFormat.scala#L130-L136

This caused confusion in behaviour, as voiced in #11803.

The change in this commit makes the checks uniformly conservative.

Signed-off-by: MithunR <[email protected]>
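For context, here is a minimal Scala sketch of the two checks as described above. The helper names are illustrative, not the plugin's actual code; the real checks live in the files linked above.

```scala
// Illustrative sketch only; see HiveProviderImpl.scala and GpuHiveFileFormat.scala
// (permalinks above) for the real checks.
val serializationKey = "serialization.format"

// Read side: only claims support when the key is set to the empty string.
def readSideSupported(storageProps: Map[String, String]): Boolean =
  storageProps.get(serializationKey).contains("")

// Write side: only claims support when the key is set to the expected default, "1".
def writeSideSupported(storageProps: Map[String, String]): Boolean =
  storageProps.get(serializationKey).contains("1")
```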
Build
Do you know if this fixes #11803?
Looks like I spoke too soon about making these checks the same. Debugging now.
I've had to add an xfail case for Spark 3.2, where creating a table via CTAS doesn't seem to set the "serialization.format" key in the storage properties. That bug seems to have been fixed in subsequent Spark versions. Note that this affects only the test setup.
Build
It now looks like there is no guarantee of "serialization.format" being set. I'll look into setting this programmatically.
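As a hypothetical illustration of what "setting this programmatically" could look like (this is not what the PR ended up doing, and the table name is made up), one could pin the key on an existing Hive text table from a Hive-enabled SparkSession `spark`:

```scala
// Hypothetical example: explicitly set the serde property on an existing table.
spark.sql("ALTER TABLE tbl SET SERDEPROPERTIES ('serialization.format' = '1')")
```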
Ok, I don't think this is the right approach.

If the output table exists prior to the write, there's a guarantee that the "serialization.format" key is set. However, if the table is created via CTAS, there is no guarantee that the key is set at all.

I'm inclined to document this difference in code, and not change the defaults. I'm closing this change.
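A hedged sketch (not taken from the PR) of how one might observe this difference through Spark's catalog API; the table names are illustrative, and the exact behaviour of the CTAS case varies by Spark version, as discussed above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

// Assumes a Hive-enabled SparkSession.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Table created up front: "serialization.format" is expected to be present
// in the storage properties by the time rows are written.
spark.sql("CREATE TABLE pre_created (a INT) STORED AS TEXTFILE")

// Table created via CTAS: the key may be absent on some Spark versions.
spark.sql("CREATE TABLE via_ctas STORED AS TEXTFILE AS SELECT 1 AS a")

def serdeFormat(name: String): Option[String] =
  spark.sessionState.catalog
    .getTableMetadata(TableIdentifier(name))
    .storage.properties.get("serialization.format")

println(serdeFormat("pre_created")) // typically Some("1")
println(serdeFormat("via_ctas"))    // may be None, depending on the version
```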
I have documented the need for the different settings over at #11856.