Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update generate_surrogate_key(...) to return NULL when some columns are NULL #935

Closed
b-per opened this issue Jul 9, 2024 · 3 comments
Closed
Labels
enhancement New feature or request wontfix

Comments

@b-per
Copy link
Contributor

b-per commented Jul 9, 2024

Describe the feature

Today, generate_surrogate_key() will never return a NULL value. If my_col contains some NULL, even generate_surrogate_key(["my_col")] will never be NULL

Describe alternatives you've considered

Add a second argument to generate_surrogate_key to list what columns, if NULL should generate a NULL surrogate key.

dbt_utils.generate_surrogate_key(field_list = ['field_a', 'field_b'], nullable_list = ['field_b'] ) would return NULL for the rows where field_b is NULL.

Additional context

People generally put a unique and not_null tests on their keys. If their key is a surrogate key, testing for not_null on generate_surrogate_key will actually use some compute for nothing, and will not check what pepople might be expecting.

Who will this benefit?

All people using generate_surrogate_key and wanting more control on what columns are allowed to be NULL or not.

Are you interested in contributing this feature?

Sure.

@b-per b-per added enhancement New feature or request triage labels Jul 9, 2024
@dbeatty10
Copy link
Contributor

Today, generate_surrogate_key() will never return a NULL value. If my_col contains some NULL, even generate_surrogate_key(["my_col")] will never be NULL

This seems like one of the best things about generate_surrogate_key(), no?

If there is a column generated via generate_surrogate_key(), then the users could theoretically just drop any not_null test for that column as long as they trust the implementation in dbt_utils.

dbt_utils.generate_surrogate_key(field_list = ['field_a', 'field_b'], nullable_list = ['field_b'] ) would return NULL for the rows where field_b is NULL.

Why wouldn't adding a not_null check on field_b work?

@b-per
Copy link
Contributor Author

b-per commented Jul 11, 2024

This seems like one of the best things about generate_surrogate_key(), no?

Depending on the use case, not returning NULL could be a feature but in other cases it could feel more like a bug 😄

dbt_utils.generate_surrogate_key(field_list = ['field_a', 'field_b'], nullable_list = ['field_b'] ) would return NULL for the rows where field_b is NULL.

Why wouldn't adding a not_null check on field_b work?

This is one example with 2 columns, and yes, in that case, it is "easy" to add a test on field_b. But what about sks from many columns where our expectation is that none of those keys should be NULL. Doesn't it feel more natural to raise it at the sk level rather than having to add dozens of fields?

The slight difference as well is that, in the example before field_b might be able to be NULL for some rows ,but in the downstream model, when we add our logic, we might not want field_b to be ever be NULL for the rows we use it in our sk.

Another slight annoyance is that in dbt_project_evalutator, we consider a model being "PK-tested" if it has a unique and a not_null test on one column in the model. So, if we remove the useless not_null it won't flag the model as PK-tested.

@dbeatty10
Copy link
Contributor

I'm going to close this feature request as "not planned".

It's a strength that the generate_surrogate_key can be relied upon to generate non-null values. One large reason is that SQL's three-valued logic has complicated implications and caveats. By guaranteeing non-null values, any downstream consumers can avoid needing complicated logic to test equality. Determining if sk1 is the same as sk2 is as simple as: sk1 = sk2.

In terms of the other considerations you raised:

  • Users can add any not_null data tests on columns they have transformed.
  • Users can choose their trade-off if they want to avoid running the redundant not_null data test or if they'd rather get the "PK-tested" stamp from dbt_project_evaluator

@dbeatty10 dbeatty10 added wontfix and removed triage labels Jul 11, 2024
@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wontfix
Projects
None yet
Development

No branches or pull requests

2 participants