Revisions & additions to Model Versions #3232
@@ -9,32 +9,76 @@ description: "Version models to help with lifecycle management"
This functionality is new in v1.5.
:::

API versioning is a _complex_ problem in software engineering. It's also essential. Our goal is to _overcome obstacles to transform a complex problem into a reality_.
Versioning APIs is a challenging problem in software engineering. The goal of model versions is not to make the problem go away, or pretend it's easier than it is. Rather, we want dbt to provide tools that make it possible to tackle this problem, thoughtfully and head-on, and to develop standard patterns for solving it.

## Related documentation
- [`versions`](resource-properties/versions)
- [`latest_version`](resource-properties/latest-version)
- [`include` & `exclude`](resource-properties/include-exclude)
- [`ref` with `version` argument](ref#versioned-ref)

## Why version a model?

If a model defines a ["contract"](model-contracts) (a set of guarantees for its structure), it's also possible to change that model's contract in a way that "breaks" the previous set of parameters.
If a model defines a ["contract"](model-contracts) (a set of guarantees for its structure), it's also possible to change that model's structure in a way that "breaks" the previous set of guarantees.

One approach is to force every model consumer to immediately handle the breaking change when it's deployed to production. While this may work at smaller organizations or while iterating on an immature set of data models, it doesn’t scale well beyond that.
One approach is to force every model consumer to immediately handle the breaking change as soon as it's deployed to production. This is actually the appropriate answer at many smaller organizations, or while rapidly iterating on a not-yet-mature set of data models. But it doesn’t scale well beyond that.

Instead, the model owner can create a **new version**, opening a window during which consumers can migrate from the old version to the new.
Instead, for mature models at larger organizations, the model owner can create a **new version**, opening a window during which consumers can migrate from the old version to the new.

In the meantime, anywhere that model is used downstream, it can be referenced at a specific version.
In the meantime, anywhere that model is used downstream, it can continue to be referenced at a specific version.

In the future, we intend to also add support for **deprecating models**. Taken together, model versions and deprecation offer a pathway for _sunsetting_ and _migrating_. In the short term, avoid breaking everyone's queries. Over the longer term, older & unmaintained versions go away; they do **not** stick around forever.

## When should you version a model?

Many changes to a model are not breaking, and do not require a new version! Examples include adding a new column, or fixing a bug in modeling logic.

By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers.

Review comment: True, but nothing to do with versioning on its own. Is the implicit recommendation here "when you break your contract, you should bump versions"?

It's also possible to change the model in more subtle ways, for example by recalculating a column in a way that doesn't change its name, data type, or enforceable characteristics but would substantially change the results seen by downstream queriers.

Review comment: Likewise here, it feels like this would benefit from driving the point home: if you are going to surprise your querier, probably bump versions

The process of sunsetting and migrating model versions requires real work, and may require significant coordination across teams. If, instead of using model versions, you opt for non-breaking changes wherever possible, that's a completely legitimate approach. Even so, after a while, you'll find yourself with lots of unused or deprecated columns. Many teams will want to consider a predictable cadence (once or twice a year) for bumping the version of their mature models, and taking the opportunity to remove no-longer-used columns.

Review comment: If anything, I reckon that that's underselling it. You should make non-breaking changes as much as possible, and if you have to make breaking changes to a model, you should try to bunch them all together instead of dribbling out bad news over time. An interesting thing I just found while looking at breaking changes best practices: Our friends at HubSpot have a stated policy for deprecating tables, because they offer Snowflake Data Shares: https://developers.hubspot.com/docs/breaking-change-definition#snowflake-data-share

Review comment: That's a great find! We're taking exactly the same approach, in terms of what we're considering breaking versus non-breaking, and in recommending a clear migration window:

## How is this different from "version control"?

[Version control](git-version-control) allows your team to collaborate simultaneously on a single code repository, manage conflicts between changes, and review changes before deploying into production. In that sense, version control is an essential tool for versioning the deployment of an entire dbt project: always the latest state of the `main` branch, with the ability to "roll back" changes by reverting a commit or pull request. In general, only one version of your project code is deployed into an environment at a time.

Model versions are different. Multiple versions of a model will live in the same code repository at the same time and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned: multiple versions are live simultaneously, and older versions are often eventually sunsetted.

dbt's model `versions` makes it possible to define multiple versions:
- That share the same "reference" name
- While reusing the same top-level properties, highlighting just their differences
## How is this different from just creating a new model?

Honestly, it's only a little bit different! There isn't much magic here, and that's by design.

You've always been able to create a new model, and name it `dim_customers_v2`. Why should you opt for a "real" versioned model instead?

First, the versioned model preserves its _reference name_. Versioned models are `ref`'d by their _model name_, rather than the name of the file that they're defined in. By default, the `ref` resolves to the latest version (as declared by that model's maintainer), but you can also `ref` a specific version of the model, with a `version` keyword.

Review comment: "reference name" is a new concept to me here. Is this the same as "model name"? I think it is from context, but if so then I don't think we want to introduce a brand new term as a one-off

Review comment: I think this needs a more worked example with examples and how the different elements combine:

    models:
      - name: dim_customers
        latest_version: 2
        versions:
          - v: 3
            defined_in: dim_customers_NOT_READY_YET.sql
            ...
          - v: 2
            alias: dim_customers
            ...
          - v: 1
            ...

Review comment: Does

Review comment: That's not the case in the current implementation. Should it be? I think we could do this. (Naive attempt: dbt-labs/dbt-core@b13cd2b) Even if we don't do this, you could do the same thing as with aliases, and keep moving the

    models:
      - name: dim_customers
        latest_version: 2
        versions:
          - v: 3
            defined_in: dim_customers_NOT_READY_YET.sql
            ...
          - v: 2
            # because this is the latest, it should have the canonical file name + alias
            alias: dim_customers
            defined_in: dim_customers
            ...
          - v: 1
            ...

But if that's our strong recommendation - let's just make it the default behavior

Review comment: It feels weird/WET to have to move both the Edit: here's my actual objection:

<File name="models/schema.yml">

```sql
{{ ref('dim_customers') }} -- resolves to latest
{{ ref('dim_customers', version=2) }} -- resolves to v2
```

</File>

Review comment (on the `<File>` line): Suggested change

Review comment: Although if you take the table and sample code above, I think this File block is totally redundant

Second, a versioned model can reuse the majority of its yaml properties and configuration. Each version needs only to say how it _differs_ from the shared set of attributes. This gives you an opportunity to highlight the differences across versions (which are otherwise difficult to detect in models with dozens or hundreds of columns) and to clearly track, in one place, all versions of the model which are currently live.

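As a rough illustration of that second point, the sketch below shows how two versions might share one set of top-level properties, with `v2` declaring only its difference. The column names and data types are assumptions chosen to match the `dim_customers` example used on this page; `include` and `exclude` are the properties linked under Related documentation.

```yml
models:
  - name: dim_customers
    latest_version: 1
    config:
      materialized: table
      contract:
        enforced: true
    # Shared by every version unless a version says otherwise
    columns:
      - name: customer_id      # hypothetical column
        data_type: int
      - name: country_name     # hypothetical column
        data_type: varchar
    versions:
      - v: 1                   # inherits the top-level columns unchanged
      - v: 2
        columns:
          # v2 differs only by dropping country_name
          - include: all
            exclude: [country_name]
```

Reading the `versions` list alone is then enough to see what changed between `v1` and `v2`, without comparing two full column lists.
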
Third, dbt supports `version`-based selection. For example, you could define a [default yaml selector](node-selection/yaml-selectors#default), to avoid running any old model versions in development, even as you continue to run them in production through a sunset and migration period:

```yml
selectors:
  - name: exclude_old_versions
    default: "{{ target.name == 'dev' }}"
    definition:
      method: fqn
      value: "*"
      exclude:
        - method: version
          value: old
```

Review comment (on the `default` line): Does this need

Review comment: It actually doesn't. I think the reason is because yaml treats the string

Review comment (on `value: old`): It's not a straightforward jump to https://deploy-preview-3232--docs-getdbt-com.netlify.app/reference/node-selection/methods#the-version-method to find the valid options, should we link to that as well? or list them inline here?

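The `version` method is not limited to `old`. To our understanding it also accepts `latest`, `prerelease`, and `none` (for unversioned models), but treat that list as an assumption to verify against the `version` method reference. A hypothetical selector that picks up only the latest version of each versioned model, plus all unversioned models, might be sketched as:

```yml
selectors:
  - name: latest_and_unversioned   # hypothetical selector name
    definition:
      union:
        - method: version
          value: latest            # assumed value; check the method reference
        - method: version
          value: none              # assumed value for unversioned models
```
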
Finally, we intend to add support for **deprecating models** in dbt Core v1.6. When you slate a versioned model for deprecation, dbt will be able to provide more helpful warnings to downstream consumers of that model. Rather than just "This model is going away," the warning can say "This older version of the model is going away, and there's a new version coming soon."

Review comment: Not sure if this is the best spot for it - but it'd be great to raise the tradeoff between cost/clutter (from maintaining multiple versions of a model in a warehouse) and providing consumers with enough time to gracefully migrate off old versions. It could be easy to let old versions of models pile up but deprecation dates create an explicit bound for how costly a migration should be for a particular model

Review comment: I read this as "the new version hasn't been developed yet" - do you mean people might deprecate models ahead of when their replacements have been deployed "behind the feature flag" of not being Do we give a warning if

Review comment: Good point - my wording here was not clear. We'll support
As a consumer, I don't think we'd give a warning if you query At first, I thought you were asking something slightly different - should we warn the consumer about an unpinned reference to a model, as soon as a prerelease version becomes available?

## How to create a new version of a model

Review comment: Not for this week, but eventually, we might want to pull the procedure section out into its own page so people who don't need the "why" and other context can more easily find the "how."

Review comment: Heard! A lot of the "how" details are also captured in the reference documentation: https://docs.getdbt.com/reference/resource-properties/versions For now, we should expect that the main audience for these docs is people trying to learn about & understand the new feature. As the concept becomes more established, and people are already convinced they want to use the feature, we can make it even easier to just do the thing

@@ -67,7 +111,7 @@ If you wanted to make a breaking change to the model - for example, removing a c
```yaml
models:
  - name: dim_customers
    latest_version: 2
    latest_version: 1
    config:
      materialized: table
      contract:
@@ -92,19 +136,62 @@ models:

The above configuration will create two models (one for each version), and produce database relations with aliases `dim_customers_v1` and `dim_customers_v2`.

By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. It is possible to override this by setting a `defined_in` property.
By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. (It is possible to override this by setting `defined_in: any_file_name_you_want`, but we strongly encourage you to follow the convention!)

The `latest_version` would be `2` (numerically greatest) if not specified explicitly. In this case, `v1` is specified to still be the latest; `v2` is just a prerelease in early development. When ready to roll out `v2` to everyone by default, bump the `latest_version` to `2` (or remove it from the specification).

You can reconfigure each version independently. For example, if you wanted `dim_customers.v1` to continue populating the database table named `dim_customers` (its original name), you could use the `defined_in` configuration:
### Configuring versioned models

You can reconfigure each version independently. For example, you could materialize `v2` as a table and `v1` as a view:

<File name="models/schema.yml">

```yml
    versions:
      - v: 2
        config:
          materialized: table
      - v: 1
        config:
          materialized: view
```

</File>

Like with all config inheritance, any configs set _within_ the versioned model's definition (`.sql` or `.py` file) will take precedence over the configs set in yaml.

### Configuring database location with `alias`

Following the example, let's say you wanted `dim_customers.v1` to continue populating the database table named `dim_customers`. That's what the table was named previously, and you may have several other dashboards or tools expecting to read its data from `<dbname>.<schemaname>.dim_customers`.

You could use the `alias` configuration:

<File name="models/schema.yml">

```yml
      - v: 1
        defined_in: dim_customers # keep original relation name
        config:
          alias: dim_customers # keep v1 in its original database location
```

</File>

Or, you could define a separate view that always points to the latest version of the model. We recommend this pattern because it's the most transparent and easiest to follow.

<File name="models/dim_customers_view.sql">

```sql
{{ config(alias = 'dim_customers') }}

select * from {{ ref('dim_customers') }}
```

</File>

:::info
Projects which have historically implemented [custom aliases](/docs/build/custom-aliases) by reimplementing the `generate_alias_name` macro will need to update their custom implementations to account for model versions.
If your project has historically implemented [custom aliases](/docs/build/custom-aliases) by reimplementing the `generate_alias_name` macro, and you'd like to start using model versions, you should update your custom implementation to account for model versions. Specifically, we'd encourage you to add [a condition like this one](https://github.com/dbt-labs/dbt-core/blob/ada8860e48b32ac712d92e8b0977b2c3c9749981/core/dbt/include/global_project/macros/get_custom_name/get_custom_alias.sql#L26-L30).

Otherwise, they'll see something like this as soon as they start using versions:
Your existing implementation of `generate_alias_name` should not encounter any errors upon first upgrading to v1.5. It's only when you create your first versioned model that you may see an error like:

```sh
dbt.exceptions.AmbiguousAliasError: Compilation Error
@@ -114,3 +201,32 @@ dbt.exceptions.AmbiguousAliasError: Compilation Error
- model.project_name.model_name.v1 (models/.../model_name.sql)
- model.project_name.model_name.v2 (models/.../model_name_v2.sql)
```

We opted to use `generate_alias_name` for this functionality so that the logic remains accessible to end users, and could be reimplemented with custom logic.

### Optimizing model versions

How you define each model version is completely up to you. While it's easy to start by copy-pasting from one model's SQL definition into another, you should think about _what actually is changing_ from one version to another.

For example, if your new model version is only renaming or removing certain columns, you could define one version as a view on top of the other one:

<File name="models/dim_customers_v2.sql">

```sql
{{ config(materialized = 'view') }}

{% set dim_customers_v1 = ref('dim_customers', v=1) %}

select
{{ dbt_utils.star(from=dim_customers_v1, except=["country_name"]) }}
from {{ dim_customers_v1 }}
```

</File>

Of course, if one model version makes meaningful and substantive changes to logic in another, it may not be possible to optimize it in this way. At that point, human intuition and legibility matter more than the cost of recomputing similar transformations.

We expect to develop more opinionated recommendations as teams start adopting model versions in practice. One recommended pattern we can envision: Prioritize the definition of the `latest_version`, and define other versions (old and prerelease) based on their diffs from the latest. How?
- Define the properties and configuration for the latest version in the top-level model yaml, and the diffs for other versions below (via `include`/`exclude`)
- Where possible, define other versions as `select` transformations, which take the latest version as their starting point
- When bumping the `latest_version`, migrate the SQL and yaml accordingly. In this case, we would see if it's possible to redefine `v1` with respect to `v2`.

Review comment (on the first bullet): 👨🍳 💋

Review comment (on the last bullet): The sample above where country name is removed? I don't quite follow this bit

Review comment: This feels a lil fuzzy right now. Who is the subject here?
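
One way to read the "prioritize the latest version" pattern from the bullets above, sketched as yaml for the state after `latest_version` has been bumped to `2`. The column names and data types are hypothetical, and combining `include: all` with an additional named column is our understanding of how `include` behaves rather than something this page confirms.

```yml
models:
  - name: dim_customers
    latest_version: 2
    # Top-level columns now describe v2, the latest version
    columns:
      - name: customer_id      # hypothetical column
        data_type: int
    versions:
      - v: 2                   # matches the top-level definition
      - v: 1
        columns:
          # v1 is expressed as a diff: everything in v2, plus the column v2 removed
          - include: all
          - name: country_name # hypothetical column
            data_type: varchar
```

Under this layout the old version carries the extra detail, so retiring `v1` later means deleting its entry rather than rewriting the top-level definition.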