Revisions & additions to Model Versions #3232

Merged: 7 commits, Apr 26, 2023
144 changes: 130 additions & 14 deletions website/docs/docs/collaborate/govern/model-versions.md
@@ -9,32 +9,76 @@ description: "Version models to help with lifecycle management"
This functionality is new in v1.5.
:::

- API versioning is a _complex_ problem in software engineering. It's also essential. Our goal is to _overcome obstacles to transform a complex problem into a reality_.
+ Versioning APIs is a challenging problem in software engineering. The goal of model versions is not to make the problem go away, or pretend it's easier than it is. Rather, we want dbt to provide tools that make it possible to tackle this problem, thoughtfully and head-on, and to develop standard patterns for solving it.

## Related documentation
- [`versions`](resource-properties/versions)
- [`latest_version`](resource-properties/latest-version)
- [`include` & `exclude`](resource-properties/include-exclude)
- [`ref` with `version` argument](ref#versioned-ref)

## Why version a model?

- If a model defines a ["contract"](model-contracts) (a set of guarantees for its structure), it's also possible to change that model's contract in a way that "breaks" the previous set of parameters.
+ If a model defines a ["contract"](model-contracts) (a set of guarantees for its structure), it's also possible to change that model's structure in a way that "breaks" the previous set of guarantees.

- One approach is to force every model consumer to immediately handle the breaking change when it's deployed to production. While this may work at smaller organizations or while iterating on an immature set of data models, it doesn’t scale well beyond that.
+ One approach is to force every model consumer to immediately handle the breaking change as soon as it's deployed to production. This is actually the appropriate answer at many smaller organizations, or while rapidly iterating on a not-yet-mature set of data models. But it doesn’t scale well beyond that.

- Instead, the model owner can create a **new version**, during which consumers can migrate from the old version to the new.
+ Instead, for mature models at larger organizations, the model owner can create a **new version**, during which consumers can migrate from the old version to the new.

- In the meantime, anywhere that model is used downstream, it can be referenced at a specific version.
+ In the meantime, anywhere that model is used downstream, it can continue to be referenced at a specific version.

In the future, we intend to also add support for **deprecating models**. Taken together, model versions and deprecation offer a pathway for _sunsetting_ and _migrating_. In the short term, avoid breaking everyone's queries. Over the longer term, older & unmaintained versions go away—they do **not** stick around forever.
Contributor

This feels a lil fuzzy right now.

> In the short term, avoid breaking everyone's queries

Who is the subject here?

  • Are we (dbt Labs) avoiding breaking users' queries by shipping versioning first, while in the long term making it possible for old versions to go away?
  • Or is this an imperative to the reader: once deprecation ships, you should use versions to avoid breaking your own queries in the short term, and use the deprecation window to eventually get rid of unmaintained versions?


## When should you version a model?

Many changes to a model are not breaking, and do not require a new version! Examples include adding a new column, or fixing a bug in modeling logic.

By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers.
Contributor

True, but nothing to do with versioning on its own. Is the implicit recommendation here "when you break your contract, you should bump versions"?


It's also possible to change the model in more subtle ways — by recalculating a column in a way that doesn't change its name, data type, or enforceable characteristics—but would substantially change the results seen by downstream queriers.
Contributor

Likewise here, it feels like this would benefit from driving the point home: if you are going to surprise your querier, probably bump versions


The process of sunsetting and migrating model versions requires real work, and may require significant coordination across teams. If, instead of using model versions, you opt for non-breaking changes wherever possible—that's a completely legitimate approach. Even so, after a while, you'll find yourself with lots of unused or deprecated columns. Many teams will want to consider a predictable cadence (once or twice a year) for bumping the version of their mature models, and taking the opportunity to remove no-longer-used columns.
Contributor

> that's a completely legitimate approach

if anything, I reckon that that's underselling it. You should make non-breaking changes as much as possible, and if you have to make breaking changes to a model, you should try to bunch them all together instead of dribbling out bad news over time.

An interesting thing I just found while looking at breaking changes best practices: Our friends at HubSpot have a stated policy for deprecating tables, because they offer Snowflake Data Shares: https://developers.hubspot.com/docs/breaking-change-definition#snowflake-data-share

Collaborator Author

@jtcohen6, Apr 22, 2023

That's a great find! We're taking exactly the same approach, in terms of what we're considering breaking versus non-breaking, and in recommending a clear migration window:

> In this case, HubSpot will add a new table with the same data to the share so that you can begin using the new name. The old table will continue to exist until the end of the 90 day notice period.


## How is this different from "version control"?

[Version control](git-version-control) allows your team to collaborate simultaneously on a single code repository, manage conflicts between changes, and review changes before deploying into production. In that sense, version control is an essential tool for versioning the deployment of an entire dbt project—always the latest state of the `main` branch, with the ability to "rollback" changes by reverting a commit or pull request. In general, only one version of your project code is deployed into an environment at a time.

Model versions are different. Multiple versions of a model will live in the same code repository at the same time and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned—multiple versions are live simultaneously; older versions are often eventually sunsetted.

dbt's model `versions` makes it possible to define multiple versions:
- That share the same "reference" name
- While reusing the same top-level properties, highlighting just their differences
## How is this different from just creating a new model?

Honestly, it's only a little bit different! There isn't much magic here, and that's by design.

You've always been able to create a new model, and name it `dim_customers_v2`. Why should you opt for a "real" versioned model instead?

First, the versioned model preserves its _reference name_. Versioned models are `ref`'d by their _model name_, rather than the name of the file that they're defined in. By default, the `ref` resolves to the latest version (as declared by that model's maintainer), but you can also `ref` a specific version of the model, with a `version` keyword.
Contributor

"reference name" is a new concept to me here. Is this the same as "model name"? I think it is from context, but if so then I don't think we want to introduce a brand new term as a one-off

Contributor

I think this needs a more worked example with examples and how the different elements combine:

models: 
  - name: dim_customers
    latest_version: 2
    versions:
      - v: 3
        defined_in: dim_customers_NOT_READY_YET.sql
        ...
      - v: 2
        alias: dim_customers
        ...
      - v: 1
        ...      
| v | ref syntax | file name | table name |
|---|---|---|---|
| 3 | `ref('dim_customers', v=3)` | `dim_customers_NOT_READY_YET.sql` | `analytics.dim_customers_v3` |
| 2 | `ref('dim_customers')` or `ref('dim_customers', v=2)` | `dim_customers_v2.sql` | `analytics.dim_customers` |
| 1 | `ref('dim_customers', v=1)` | `dim_customers_v1.sql` | `analytics.dim_customers_v1` |

Contributor

Does latest version of a model get to look for the _vX-less sql file? I don't like that the grid here forces new sql files for every version, so you don't get a nice git diff

Collaborator Author

@jtcohen6, Apr 22, 2023

> Does latest version of a model get to look for the _vX-less sql file?

That's not the case in the current implementation. Should it be? I think we could do this. (Naive attempt: dbt-labs/dbt-core@b13cd2b)

Even if we don't do this, you could do the same thing as with aliases, and keep moving the defined_in property around, so that dim_customers.sql is always your "latest":

models: 
  - name: dim_customers
    latest_version: 2
    versions:
      - v: 3
        defined_in: dim_customers_NOT_READY_YET.sql
        ...
      - v: 2
        # because this is the latest, it should have the canonical file name + alias
        alias: dim_customers
        defined_in: dim_customers
        ...
      - v: 1
        ...      

But if that's our strong recommendation - let's just make it the default behavior

Contributor

@joellabes, Apr 23, 2023

It feels weird/WET to have to move both the alias and defined_in around over time to just get the same name as is already defined in the name key up top.

Edit: here's my actual objection: defined_in feels like it should be a "break glass in case of emergency" property, not something that gets rolled out everywhere. If you're using defined_in, you best have a good reason. Encouraging it to be everywhere cheapens that a bit


<File name="models/schema.yml">
Contributor

Suggested change
- <File name="models/schema.yml">
+ <File name="models/scratchpad.sql">

Contributor

Although if you take the table and sample code above, I think this File block is totally redundant


```sql
{{ ref('dim_customers') }} -- resolves to latest
{{ ref('dim_customers', version=2) }} -- resolves to v2
```

</File>

Second, a versioned model can reuse the majority of its yaml properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. This gives you an opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live.
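
A minimal sketch of what that reuse can look like, assuming a `dim_customers` model whose second version drops a `country_name` column. The column names, descriptions, and data types here are illustrative assumptions, not taken from a real project:

```yaml
models:
  - name: dim_customers
    latest_version: 1
    config:
      materialized: table
      contract:
        enforced: true
    # Shared by every version unless a version says otherwise
    columns:
      - name: customer_id
        description: Primary key (illustrative)
        data_type: int
      - name: country_name
        description: Where this customer lives (illustrative)
        data_type: varchar
    versions:
      # v1 inherits all top-level properties as-is
      - v: 1
      # v2 only states its difference: it no longer has country_name
      - v: 2
        columns:
          - include: all
            exclude: [country_name]
```

Only the `v: 2` entry needs to mention `country_name`; everything else is declared once at the top level.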

Third, dbt supports `version`-based selection. For example, you could define a [default yaml selector](node-selection/yaml-selectors#default), to avoid running any old model versions in development—even as you continue to run them in production through a sunset and migration period:

```yml
selectors:
  - name: exclude_old_versions
    default: "{{ target.name == 'dev' }}"
Contributor

Does this need | as_bool?

Collaborator Author

It actually doesn't. I think the reason is because yaml treats the string "True" as truthy

    definition:
      method: fqn
      value: "*"
      exclude:
        - method: version
          value: old
Contributor

It's not a straightforward jump to https://deploy-preview-3232--docs-getdbt-com.netlify.app/reference/node-selection/methods#the-version-method to find the valid options, should we link to that as well? or list them inline here?

```

Finally, we intend to add support for **deprecating models** in dbt Core v1.6. When you slate a versioned model for deprecation, dbt will be able to provide more helpful warnings to downstream consumers of that model. Rather than just, "This model is going away," it's - "This older version of the model is going away, and there's a new version coming soon."
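
As a rough sketch only: the review discussion on this PR refers to a planned `deprecation_date` property. Assuming it ships under that name and can be set per version (both are assumptions at the time of this PR, since the feature is slated for v1.6), slating an old version for removal might look something like this. The date is a placeholder:

```yaml
models:
  - name: dim_customers
    latest_version: 2
    versions:
      - v: 1
        # Assumed property name; placeholder date for the end of the migration window
        deprecation_date: 2023-12-31 00:00:00.00+00:00
      - v: 2
```

Setting the date on `v1` at the same time as bumping `latest_version` gives consumers an explicit deadline for migrating.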
Contributor

Not sure if this is the best spot for it - but it'd be great to raise the tradeoff between cost/clutter (from maintaining multiple versions of a model in a warehouse) and providing consumers with enough time to gracefully migrate off old versions. It could be easy to let old versions of models pile up but deprecation dates create an explicit bound for how costly a migration should be for a particular model

Contributor

> "This older version of the model is going away, and there's a new version coming soon."

I read this as "the new version hasn't been developed yet" - do you mean people might deprecate models ahead of when their replacements have been deployed "behind the feature flag" of not being latest?

Do we give a warning if latest is v3 and I query v2 by name, even if deprecation isn't scheduled? If so then it's more like

> Rather than just, "This model has been superseded," it's - "This model has been superseded, and you need to act."

Collaborator Author

@jtcohen6, Apr 22, 2023

Good point - my wording here was not clear.

We'll support deprecation_date for both versioned and non-versioned models. I was trying to differentiate here between the experience of downstream consumers, when an upstream producer deprecates a model that's versioned, versus a standalone (non-versioned) model. Again, trying to make the case for - why use versions at all? Why not just create two separate SQL files, my_model_old and my_model_new, and dbt is none the wiser?

> Do we give a warning if latest is v3 and I query v2 by name, even if deprecation isn't scheduled?

As a consumer, I don't think we'd give a warning if you query v2 by name, and it's not scheduled for deprecation. I think we'd strongly encourage producers to put a deprecation date on v2 at the same time that they roll over latest_version: 3.

At first, I thought you were asking something slightly different - should we warn the consumer about an unpinned reference to a model, as soon as a prerelease version becomes available?

12:53:34  4 of 4 START sql view model dbt_jcohen.another_model ........................... [RUN]
12:53:34  FYI!
Found an unpinned reference to versioned model 'my_model' in 'my_dbt_project'.
Resolving to latest version: my_model.v2
A prerelease version 3 is available. It has not yet been marked 'latest' by its maintainer.
When that happens, this reference will resolve to my_model.v3 instead.

  Try out v3: {{ ref(my_dbt_project, my_model, v=3) }}
  Pin to v2:  {{ ref(my_dbt_project, my_model, v=2) }}


## How to create a new version of a model
Collaborator

Not for this week, but eventually, we might want to pull the procedure section out into its own page so people who don't need the "why" and other context can more easily find the "how."

Collaborator Author

Heard! A lot of the "how" details are also captured in the reference documentation: https://docs.getdbt.com/reference/resource-properties/versions

For now, we should expect that the main audience for these docs is people trying to learn about & understand the new feature. As the concept becomes more established, and people are already convinced they want to use the feature, we can make it even easier to just do the thing


@@ -67,7 +111,7 @@ If you wanted to make a breaking change to the model - for example, removing a c
```yaml
models:
  - name: dim_customers
-   latest_version: 2
+   latest_version: 1
    config:
      materialized: table
      contract:
@@ -92,19 +136,62 @@ models:

The above configuration will create two models (one for each version), and produce database relations with aliases `dim_customers_v1` and `dim_customers_v2`.

- By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. It is possible to override this by setting a `defined_in` property.
+ By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. (It is possible to override this by setting `defined_in: any_file_name_you_want`, but we strongly encourage you to follow the convention!)

The `latest_version` would be `2` (numerically greatest) if not specified explicitly. In this case, `v1` is specified to still be the latest; `v2` is just a prerelease in early development. When ready to roll out `v2` to everyone by default, bump the `latest_version` to `2` (or remove it from the specification).
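
As a sketch of that rollout step, assuming the same two versions of `dim_customers` as in the example, promoting `v2` can be a one-line change to the model's yaml:

```yaml
models:
  - name: dim_customers
    # Previously 1, while v2 was still a prerelease.
    # Removing this line would have the same effect, since 2 is numerically greatest.
    latest_version: 2
    versions:
      - v: 1
      - v: 2
```

After this change, an unpinned `ref('dim_customers')` resolves to `v2`, while consumers who pinned `v=1` keep reading the old version.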

- You can reconfigure each version independently. For example, if you wanted `dim_customers.v1` to continue populating the database table named `dim_customers` (its original name), you could use the `defined_in` configuration:
### Configuring versioned models

You can reconfigure each version independently. For example, you could materialize `v2` as a table and `v1` as a view:

<File name="models/schema.yml">

```yml
    versions:
      - v: 2
        config:
          materialized: table
      - v: 1
        config:
          materialized: view
```

</File>

Like with all config inheritance, any configs set _within_ the versioned model's definition (`.sql` or `.py` file) will take precedence over the configs set in yaml.

### Configuring database location with `alias`

Following the example, let's say you wanted `dim_customers.v1` to continue populating the database table named `dim_customers`. That's what the table was named previously, and you may have several other dashboards or tools expecting to read its data from `<dbname>.<schemaname>.dim_customers`.

You could use the `alias` configuration:

<File name="models/schema.yml">

```yml
      - v: 1
-       defined_in: dim_customers # keep original relation name
+       config:
+         alias: dim_customers # keep v1 in its original database location
```

</File>

Or, you could define a separate view that always points to the latest version of the model. We recommend this pattern because it's the most transparent and easiest to follow.

<File name="models/dim_customers_view.sql">

```sql
{{ config(alias = 'dim_customers') }}

select * from {{ ref('dim_customers') }}
```

</File>

:::info
- Projects which have historically implemented [custom aliases](/docs/build/custom-aliases) by reimplementing the `generate_alias_name` macro will need to update their custom implementations to account for model versions.
+ If your project has historically implemented [custom aliases](/docs/build/custom-aliases) by reimplementing the `generate_alias_name` macro, and you'd like to start using model versions, you should update your custom implementation to account for model versions. Specifically, we'd encourage you to add [a condition like this one](https://github.com/dbt-labs/dbt-core/blob/ada8860e48b32ac712d92e8b0977b2c3c9749981/core/dbt/include/global_project/macros/get_custom_name/get_custom_alias.sql#L26-L30).

- Otherwise, they'll see something like this as soon as they start using versions:
+ Your existing implementation of `generate_alias_name` should not encounter any errors upon first upgrading to v1.5. It's only when you create your first versioned model, that you may see an error like:

```sh
dbt.exceptions.AmbiguousAliasError: Compilation Error
@@ -114,3 +201,32 @@ dbt.exceptions.AmbiguousAliasError: Compilation Error
- model.project_name.model_name.v1 (models/.../model_name.sql)
- model.project_name.model_name.v2 (models/.../model_name_v2.sql)
```

We opted to use `generate_alias_name` for this functionality so that the logic remains accessible to end users, and could be reimplemented with custom logic.

### Optimizing model versions

How you define each model version is completely up to you. While it's easy to start by copy-pasting from one model's SQL definition into another, you should think about _what actually is changing_ from one version to another.

For example, if your new model version is only renaming or removing certain columns, you could define one version as a view on top of the other one:

<File name="models/dim_customers_v2.sql">

```sql
{{ config(materialized = 'view') }}

{% set dim_customers_v1 = ref('dim_customers', v=1) %}

select
{{ dbt_utils.star(from=dim_customers_v1, except=["country_name"]) }}
from {{ dim_customers_v1 }}
```

</File>

Of course, if one model version makes meaningful and substantive changes to logic in another, it may not be possible to optimize it in this way. At that point, human intuition and legibility matter more than the cost of recomputing similar transformations.

We expect to develop more opinionated recommendations as teams start adopting model versions in practice. One recommended pattern we can envision: Prioritize the definition of the `latest_version`, and define other versions (old and prerelease) based on their diffs from the latest. How?
- Define the properties and configuration for the latest version in the top-level model yaml, and the diffs for other versions below (via `include`/`exclude`)
Contributor

👨‍🍳 💋

- Where possible, define other versions as `select` transformations, which take the latest version as their starting point
- When bumping the `latest_version`, migrate the SQL and yaml accordingly. In this case, we would see if it's possible to redefine `v1` with respect to `v2`.
Contributor

> in this case

The sample above where country name is removed? I don't quite follow this bit