Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make serialization key checks uniform for Hive Text file read/write. #11824

Closed

Conversation

mythrocks
Copy link
Collaborator

This is a minor change, to bring the serialization format checks on the Read and Write sides of Hive text files in line with each other.

The read side would check if the serialization-key is set to empty (""):

val serializationFormat = storage.properties.getOrElse(serializationKey, "")
if (serializationFormat != ctrlASeparatedFormat) {
willNotWorkOnGpu(s"unsupported serialization format found: " +
s"$serializationFormat, " +
s"only \'^A\' separated text input (i.e. serialization.format=1) " +
s"is currently supported")
}

The write side was checking for the expected default: "1":

val serializationFormat = storage.properties.getOrElse(serializationKey, "1")
if (serializationFormat != ctrlASeparatedFormat) {
meta.willNotWorkOnGpu(s"unsupported serialization format found: " +
s"$serializationFormat, " +
s"only \'^A\' separated text output (i.e. serialization.format=1) " +
s"is currently supported")
}

This caused confusion in behaviour, as voiced in #11803.

The change in this commit makes the checks uniformly conservative.

This is a minor change, to bring the serialization format checks on the Read and Write
sides of Hive text files in line with each other.

The read side would check if the serialization-key is set to empty (""):
https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/HiveProviderImpl.scala#L155-L161

The write side was checking for the expected default: "1":
https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/GpuHiveFileFormat.scala#L130-L136

This caused confusion in behaviour, as voiced in NVIDIA#11803.

The change in this commit makes the checks uniformly conservative.

Signed-off-by: MithunR <[email protected]>
@mythrocks mythrocks self-assigned this Dec 5, 2024
@mythrocks
Copy link
Collaborator Author

Build

revans2
revans2 previously approved these changes Dec 5, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if this fixes #11803?

@mythrocks
Copy link
Collaborator Author

Do you know if this fixes #11803?

I'm afraid this doesn't fix #11803. :/ That error is on the read path, while this fix affects (only the defaults in) the write path.

@mythrocks
Copy link
Collaborator Author

Looks like I spoke too soon about making these checks the same. Debugging now.

@mythrocks
Copy link
Collaborator Author

I've had to add an xfail case for 3.2, where creating a table via CTAS doesn't not seem to set the "serialization.format" key in the storage properties. That bug seems to have been fixed in subsequent Spark versions.

Note that this is only in the test setup.

@mythrocks
Copy link
Collaborator Author

Build

@mythrocks mythrocks requested a review from revans2 December 6, 2024 23:44
@mythrocks
Copy link
Collaborator Author

It now looks like there is no guarantee of serialization.format being set correctly, when constructed via CTAS. The CI failure seems to be with Spark 3.3.0. I was also able to repro this from 3.3.4.

I'll look into setting this programmatically.

@mythrocks
Copy link
Collaborator Author

Ok, I don't think this is the right approach.

If the output table exists prior to the write, there's a guarantee that the serialization.format would have been set.

However, if the table is created via CTAS, there is no guarantee that STORAGE PROPERTIES would be set until the write has completed. From all tracing, it appears that the properties are empty until after the write.

I'm inclined to document this difference in code, and not change the defaults. I'm closing this change.

@mythrocks mythrocks closed this Dec 10, 2024
@mythrocks
Copy link
Collaborator Author

I have documented the need for the different settings over at #11856.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants