Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issue with auto columns with mix of scalar values and empty arrays #15083

Merged
merged 2 commits into from
Oct 5, 2023

Conversation

clintropolis
Copy link
Member

Description

Fixes a case missed in #14422, where paths with a mix of only a single type of scalar value and empty arrays would incorrectly get treated as single type scalar columns, resulting in serialization failure with errors of the form:

org.apache.druid.error.DruidException: Value not found in string dictionary

	at org.apache.druid.error.DruidException$DruidExceptionBuilder.build(DruidException.java:460)
	at org.apache.druid.error.DruidException$DruidExceptionBuilder.build(DruidException.java:450)
	at org.apache.druid.error.DruidException.defensive(DruidException.java:176)
	at org.apache.druid.segment.nested.DictionaryIdLookup.lookupString(DictionaryIdLookup.java:130)
	at org.apache.druid.segment.nested.ScalarStringFieldColumnWriter.lookupGlobalId(ScalarStringFieldColumnWriter.java:58)
	at org.apache.druid.segment.nested.ScalarStringFieldColumnWriter.lookupGlobalId(ScalarStringFieldColumnWriter.java:33)
	at org.apache.druid.segment.nested.GlobalDictionaryEncodedFieldColumnWriter.addValue(GlobalDictionaryEncodedFieldColumnWriter.java:154)
	at org.apache.druid.segment.nested.NestedDataColumnSerializer$1.processArrayField(NestedDataColumnSerializer.java:132)

Note that the exception is being thrown from processArrayField but the writer is a ScalarStringFieldColumnWriter. This could happen with any combination of scalar type if only encountered empty arrays and no other array values. The problem occurred because the 'type byte' of FieldTypeInfo is used to check for "single type" fields to optimize, but was forgetting to check for the presence of the empty array flag. Empty arrays do not set a type at the time of indexing since they don't really have a type similar to nulls, so as to not pollute the type byte with an artificially chosen type. The intention was that at persist type we check for the flag and add an array type of the scalar type if otherwise a single type, to promote it to being mixed type and preserve the array dictionary, or just remain a single type if the single type is a typed array type.

The added column to the test data triggered this issue in a number of tests, and I also added some tests for the empty array flag for the field type info tests.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

types |= DOUBLE_ARRAY_MASK;
break;
default:
throw new ISE("Unsupported nested array type: [%s]", type.asTypeString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
throw new ISE("Unsupported nested array type: [%s]", type.asTypeString());
throw new ISE("Unsupported nested array type: [%s]", type.getElementType());

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, i think I prefer it includes the whole type, so if something like ARRAY<ARRAY<LONG>> ends up here we don't throw an exception like "Unsupported nested array type: [ARRAY<LONG>]" which would happen if we use the element type and is kind of confusing because we do handle ARRAY<LONG> and this can be fore a nested path so is maybe ambiguous, but this is primarily a developer exception message so if it happens it means a bug somewhere, so probably not a big deal either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the whole type includes the element type information, then that's even better.

Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix.

We briefly talked about it offline - would it also help to include the column name in the exception message Value not found in string dictionary, so it's easy to narrow down similar issues in a large dataset?

@abhishekagarwal87 abhishekagarwal87 merged commit b4bc9b6 into apache:master Oct 5, 2023
74 checks passed
@clintropolis clintropolis deleted the fix-auto-empty-array branch October 5, 2023 04:46
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants