Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include data_type in the output of generate_model_yaml #120

Closed
dbeatty10 opened this issue Mar 22, 2023 · 7 comments
Closed

Include data_type in the output of generate_model_yaml #120

dbeatty10 opened this issue Mar 22, 2023 · 7 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@dbeatty10
Copy link
Contributor

Describe the feature

When generating output, codegen should have an option to include data_type like the following:

    columns:
      - name: account_id
        data_type: string

Furthermore, this option should probably default to true. It possibly shouldn't even be optional, but just always be included in the generated output.

See here for more context: https://github.com/dbt-labs/internal-analytics/pull/1418#discussion_r1144991800

Describe alternatives you've considered

An alternative is to copy-paste data_type by hand into the YAML. These would be copied from an information_schema.columns query or show columns, etc.

Additional context

Side note to spin out into a separate issue: reconciling greenfield YAML from codegen with pre-existing YAML is annoying / time-consuming. Can we write some kind of guidance on easier ways to reconcile two YAML files?

Who will this benefit?

With the launch of data contracts in the dbt-core 1.5.0 release, it's handy to automatically generate the relevant data_types for pre-existing models.

Are you interested in contributing this feature?

Happy to review the PR if someone else wants to take this on.

@linbug
Copy link
Contributor

linbug commented Apr 7, 2023

@dbeatty10 would something like this be suitable? My concern now is that all tests that invoke this macro will need to include logic for all target types, since they use different data type names (e.g. text in postgres becomes string in bigquery). That might be cumbersome. Do you have any feedback/ suggestions?

@dbeatty10
Copy link
Contributor Author

That PR is looking good @linbug !

Absolutely agreed that the testing part looks cumbersome. #76 ran into something similar. We discussed and tried a couple options before settling on an approach.

There are a few options for your PR:

  1. Adopt a similar testing approach as #76
  2. Try some other testing approach
  3. Skip tests altogether

Want to take a look at these options and let me know what you think?

You won't hear me say this too often, but I might be comfortable with skipping tests in this case if it has the least downsides of the options.

@dbeatty10
Copy link
Contributor Author

Always-on vs. optional

In the original issue description, it said:

this option should probably default to true. It possibly shouldn't even be optional, but just always be included in the generated output.

It is easy enough to just add a new include_data_types parameter though (defaulting to true) -- I think we should do it!

Thought process

For users, this would preserve optionality while still putting forward default behavior we're assuming would be best for the greatest number of users.

generate_source already has an include_data_types (optional, default=False) parameter, so it would align those names (even though the default would differ). Maybe we should change the default for generate_source at the same time?

@linbug
Copy link
Contributor

linbug commented Apr 7, 2023

Thanks @dbeatty10 for the quick and detailed responses!

For the testing, I had a look through that previous PR that you linked. I think that adopting the same approach (using text_type_value and integer_type_value macros) is the simplest (and easiest to read) solution for now and would align with what has already been shipped for generate_source. This could always be refactored in the future if we find a better approach. I don't think that skipping tests is an option without removing the existing tests, although perhaps we could adopt something similar to the tests for generate_source, where we make a include_data_types parameter and set it to false in the all tests but one (perhaps generate_model_yaml). There isn't additional value to testing this functionality in more than one test anyway.

How does that sound?

I'm aligned on adding a include_data_types parameter and defaulting to true. I do think that it makes sense to have both generate_source and generate_model_yaml take the same default, and personally I'm in favour of this being true (as long as this doesn't break anything for existing users)! I'd be happy to update that in this or a separate PR.

@linbug
Copy link
Contributor

linbug commented Apr 7, 2023

Something else I just noticed is that in generate_source, data_type is uppercase whereas here I'm lowering. We should be consistent about this too. Any preference? According to the dbt style guide, we should default to lowercase.

@dbeatty10
Copy link
Contributor Author

@linbug Your proposals for the testing sound good -- let's do it as you proposed.

For the common include_data_types parameter within generate_source and generate_model_yaml, let's do the following:

  • default both to true
  • lowercase both to align with the dbt style guide

It's fine if the changes related to generate_source are made within this PR or within a separate PR -- up to you.

Since there will be changes to how generate_source behaves by default, let's make sure to update the README and also call this out in the changelog.

The current version of dbt_codegen is 0.9.0, and we'll make sure to bump the next version to 0.10.0 (or maybe even 1.0.0!) so these changes don't break anything for folks that have the 0.9.x series as an upper bound.

@linbug
Copy link
Contributor

linbug commented Apr 14, 2023

@dbeatty10 I've made those changes now, and signed the CLA. The PR is ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants