-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix issue with auto columns with mix of scalar values and empty arrays #15083
fix issue with auto columns with mix of scalar values and empty arrays #15083
Conversation
types |= DOUBLE_ARRAY_MASK; | ||
break; | ||
default: | ||
throw new ISE("Unsupported nested array type: [%s]", type.asTypeString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw new ISE("Unsupported nested array type: [%s]", type.asTypeString()); | |
throw new ISE("Unsupported nested array type: [%s]", type.getElementType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, i think I prefer it includes the whole type, so if something like ARRAY<ARRAY<LONG>>
ends up here we don't throw an exception like "Unsupported nested array type: [ARRAY<LONG>]"
which would happen if we use the element type and is kind of confusing because we do handle ARRAY<LONG>
and this can be fore a nested path so is maybe ambiguous, but this is primarily a developer exception message so if it happens it means a bug somewhere, so probably not a big deal either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the whole type includes the element type information, then that's even better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix.
We briefly talked about it offline - would it also help to include the column name in the exception message Value not found in string dictionary
, so it's easy to narrow down similar issues in a large dataset?
Description
Fixes a case missed in #14422, where paths with a mix of only a single type of scalar value and empty arrays would incorrectly get treated as single type scalar columns, resulting in serialization failure with errors of the form:
Note that the exception is being thrown from
processArrayField
but the writer is aScalarStringFieldColumnWriter
. This could happen with any combination of scalar type if only encountered empty arrays and no other array values. The problem occurred because the 'type byte' ofFieldTypeInfo
is used to check for "single type" fields to optimize, but was forgetting to check for the presence of the empty array flag. Empty arrays do not set a type at the time of indexing since they don't really have a type similar to nulls, so as to not pollute the type byte with an artificially chosen type. The intention was that at persist type we check for the flag and add an array type of the scalar type if otherwise a single type, to promote it to being mixed type and preserve the array dictionary, or just remain a single type if the single type is a typed array type.The added column to the test data triggered this issue in a number of tests, and I also added some tests for the empty array flag for the field type info tests.
This PR has: