Update `generate_surrogate_key(...)` to return `NULL` when some columns are `NULL` #935

b-per · 2024-07-09T06:06:41Z

Describe the feature

Today, generate_surrogate_key() will never return a NULL value. If my_col contains some NULL, even generate_surrogate_key(["my_col")] will never be NULL

Describe alternatives you've considered

Add a second argument to generate_surrogate_key to list what columns, if NULL should generate a NULL surrogate key.

dbt_utils.generate_surrogate_key(field_list = ['field_a', 'field_b'], nullable_list = ['field_b'] ) would return NULL for the rows where field_b is NULL.

Additional context

People generally put a unique and not_null tests on their keys. If their key is a surrogate key, testing for not_null on generate_surrogate_key will actually use some compute for nothing, and will not check what pepople might be expecting.

Who will this benefit?

All people using generate_surrogate_key and wanting more control on what columns are allowed to be NULL or not.

Are you interested in contributing this feature?

Sure.

The text was updated successfully, but these errors were encountered:

dbeatty10 · 2024-07-11T01:54:18Z

Today, generate_surrogate_key() will never return a NULL value. If my_col contains some NULL, even generate_surrogate_key(["my_col")] will never be NULL

This seems like one of the best things about generate_surrogate_key(), no?

If there is a column generated via generate_surrogate_key(), then the users could theoretically just drop any not_null test for that column as long as they trust the implementation in dbt_utils.

dbt_utils.generate_surrogate_key(field_list = ['field_a', 'field_b'], nullable_list = ['field_b'] ) would return NULL for the rows where field_b is NULL.

Why wouldn't adding a not_null check on field_b work?

b-per · 2024-07-11T05:59:20Z

This seems like one of the best things about generate_surrogate_key(), no?

Depending on the use case, not returning NULL could be a feature but in other cases it could feel more like a bug 😄

dbt_utils.generate_surrogate_key(field_list = ['field_a', 'field_b'], nullable_list = ['field_b'] ) would return NULL for the rows where field_b is NULL.

Why wouldn't adding a not_null check on field_b work?

This is one example with 2 columns, and yes, in that case, it is "easy" to add a test on field_b. But what about sks from many columns where our expectation is that none of those keys should be NULL. Doesn't it feel more natural to raise it at the sk level rather than having to add dozens of fields?

The slight difference as well is that, in the example before field_b might be able to be NULL for some rows ,but in the downstream model, when we add our logic, we might not want field_b to be ever be NULL for the rows we use it in our sk.

Another slight annoyance is that in dbt_project_evalutator, we consider a model being "PK-tested" if it has a unique and a not_null test on one column in the model. So, if we remove the useless not_null it won't flag the model as PK-tested.

dbeatty10 · 2024-07-11T16:03:54Z

I'm going to close this feature request as "not planned".

It's a strength that the generate_surrogate_key can be relied upon to generate non-null values. One large reason is that SQL's three-valued logic has complicated implications and caveats. By guaranteeing non-null values, any downstream consumers can avoid needing complicated logic to test equality. Determining if sk1 is the same as sk2 is as simple as: sk1 = sk2.

In terms of the other considerations you raised:

Users can add any not_null data tests on columns they have transformed.
Users can choose their trade-off if they want to avoid running the redundant not_null data test or if they'd rather get the "PK-tested" stamp from dbt_project_evaluator

b-per added enhancement New feature or request triage labels Jul 9, 2024

dbeatty10 added awaiting_response and removed triage labels Jul 11, 2024

github-actions bot added triage and removed awaiting_response labels Jul 11, 2024

dbeatty10 added wontfix and removed triage labels Jul 11, 2024

dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Jul 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update `generate_surrogate_key(...)` to return `NULL` when some columns are `NULL` #935

Update `generate_surrogate_key(...)` to return `NULL` when some columns are `NULL` #935

b-per commented Jul 9, 2024

dbeatty10 commented Jul 11, 2024

b-per commented Jul 11, 2024

dbeatty10 commented Jul 11, 2024

Update generate_surrogate_key(...) to return NULL when some columns are NULL #935

Update generate_surrogate_key(...) to return NULL when some columns are NULL #935

Comments

b-per commented Jul 9, 2024

Describe the feature

Describe alternatives you've considered

Additional context

Who will this benefit?

Are you interested in contributing this feature?

dbeatty10 commented Jul 11, 2024

b-per commented Jul 11, 2024

dbeatty10 commented Jul 11, 2024

Update `generate_surrogate_key(...)` to return `NULL` when some columns are `NULL` #935

Update `generate_surrogate_key(...)` to return `NULL` when some columns are `NULL` #935