Revisions & additions to Model Versions #3232

Merged 7 commits on Apr 26, 2023

@jtcohen6 (Collaborator) commented Apr 20, 2023:

Preview: Collaborate with others > Model governance > Model versions

What are you changing in this pull request and why?

We've written the minimal viable reference docs for this feature. I want to offer some more opinionated guidance & framing, and gesture in the direction of some best practices:

  • Don't create a new version for every model change
  • Do actually sunset/deprecate your old model versions

This does require a more personal tone, and a sense of future direction, than a lot of other (more-established) documentation. Very open to feedback.

netlify bot commented Apr 20, 2023:

Deploy Preview for docs-getdbt-com ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | 1a00757 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/docs-getdbt-com/deploys/644900b6a731830008eb5024 |
| 😎 Deploy Preview | https://deploy-preview-3232--docs-getdbt-com.netlify.app |

@github-actions bot added the `content` and `size: medium` labels Apr 20, 2023
```
value: old
```

Finally, we intend to add support for **deprecating models** in dbt Core v1.6. When you slate a versioned model for deprecation, dbt will be able to provide more helpful warnings to downstream consumers of that model. Rather than just, "This model is going away," it's - "This older version of the model is going away, and there's a new version coming soon."
Contributor commented:

Not sure if this is the best spot for it, but it'd be great to raise the tradeoff between cost/clutter (from maintaining multiple versions of a model in a warehouse) and giving consumers enough time to gracefully migrate off old versions. It could be easy to let old versions of models pile up, but deprecation dates create an explicit bound on how costly a migration should be for a particular model.

@joellabes (Contributor) left a comment:

This is wonderful. Nitpicks, pedantry and clarifying questions from me throughout, but I think that vibe-wise this is great!

In the meantime, anywhere that model is used downstream, it can be referenced at a specific version.
In the meantime, anywhere that model is used downstream, it can continue to be referenced at a specific version.

In the future, we intend to also add support for **deprecating models**. Taken together, model versions and deprecation offer a pathway for _sunsetting_ and _migrating_. In the short term, avoid breaking everyone's queries. Over the longer term, older & unmaintained versions go away—they do **not** stick around forever.
Contributor commented:

This feels a lil fuzzy right now.

In the short term, avoid breaking everyone's queries

Who is the subject here?

  • Are we (dbt Labs) avoiding breaking users' queries by shipping versioning first, while in the long term making it possible for old versions to go away?
  • Or is this an imperative to the reader: once deprecation ships, you should use versions to avoid breaking your own queries in the short term, and use the deprecation window to eventually get rid of unmaintained versions?


It's also possible to change the model in more subtle ways — by recalculating a column in a way that doesn't change its name, data type, or enforceable characteristics—but would substantially change the results seen by downstream queriers.

The process of sunsetting and migrating model versions requires real work, and may require significant coordination across teams. If, instead of using model versions, you opt for non-breaking changes wherever possible—that's a completely legitimate approach. Even so, after a while, you'll find yourself with lots of unused or deprecated columns. Many teams will want to consider a predictable cadence (once or twice a year) for bumping the version of their mature models, and taking the opportunity to remove no-longer-used columns.
Contributor commented:

that's a completely legitimate approach

If anything, I reckon that's underselling it. You should make non-breaking changes as much as possible, and if you have to make breaking changes to a model, you should try to bunch them all together instead of dribbling out bad news over time.

An interesting thing I just found while looking at breaking changes best practices: Our friends at HubSpot have a stated policy for deprecating tables, because they offer Snowflake Data Shares: https://developers.hubspot.com/docs/breaking-change-definition#snowflake-data-share

@jtcohen6 (Collaborator Author) commented Apr 22, 2023:

That's a great find! We're taking exactly the same approach, in terms of what we're considering breaking versus non-breaking, and in recommending a clear migration window:

In this case, HubSpot will add a new table with the same data to the share so that you can begin using the new name. The old table will continue to exist until the end of the 90 day notice period.


You've always been able to create a new model, and name it `dim_customers_v2`. Why should you opt for a "real" versioned model instead?

First, the versioned model preserves its _reference name_. Versioned models are `ref`'d by their _model name_, rather than the name of the file that they're defined in. By default, the `ref` resolves to the latest version (as declared by that model's maintainer), but you can also `ref` a specific version of the model, with a `version` keyword.
Contributor commented:

"reference name" is a new concept to me here. Is this the same as "model name"? I think it is from context, but if so then I don't think we want to introduce a brand new term as a one-off

Contributor commented:

I think this needs a more fully worked example that shows how the different elements combine:

models: 
  - name: dim_customers
    latest_version: 2
    versions:
      - v: 3
        defined_in: dim_customers_NOT_READY_YET.sql
        ...
      - v: 2
        alias: dim_customers
        ...
      - v: 1
        ...      
| v | ref syntax | file name | table name |
|---|---|---|---|
| 3 | `ref('dim_customers', v=3)` | `dim_customers_NOT_READY_YET.sql` | `analytics.dim_customers_v3` |
| 2 | `ref('dim_customers')` or `ref('dim_customers', v=2)` | `dim_customers_v2.sql` | `analytics.dim_customers` |
| 1 | `ref('dim_customers', v=1)` | `dim_customers_v1.sql` | `analytics.dim_customers_v1` |
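
To make the table above easier to check, here is a minimal Python sketch that derives the same rows from the YAML spec. It is a toy model of the conventions discussed in this thread, not dbt's resolution logic. It assumes that the alias and file name both default to `<model_name>_v<v>`, that `alias` and `defined_in` override those defaults, and that an unpinned `ref` resolves to `latest_version`. The function name `describe_versions`, the dict layout, and the `analytics` schema default are hypothetical.

```python
def describe_versions(model):
    """Return one row per version: (v, ref syntax, file name, relation name).

    Toy model of the conventions discussed above, not dbt's real logic.
    """
    name = model["name"]
    latest = model["latest_version"]
    schema = model.get("schema", "analytics")  # assumed target schema
    rows = []
    for version in model["versions"]:
        v = version["v"]
        default_name = f"{name}_v{v}"  # convention: <model_name>_v<v>
        file_name = version.get("defined_in", default_name)
        if not file_name.endswith(".sql"):
            file_name += ".sql"
        alias = version.get("alias", default_name)
        pinned = f"ref('{name}', v={v})"
        # Only the latest version can also be reached by an unpinned ref.
        ref_syntax = f"ref('{name}') or {pinned}" if v == latest else pinned
        rows.append((v, ref_syntax, file_name, f"{schema}.{alias}"))
    return rows


dim_customers = {
    "name": "dim_customers",
    "latest_version": 2,
    "versions": [
        {"v": 3, "defined_in": "dim_customers_NOT_READY_YET.sql"},
        {"v": 2, "alias": "dim_customers"},
        {"v": 1},
    ],
}

for row in describe_versions(dim_customers):
    print(row)
```

Each printed tuple corresponds to one row of the table, which makes it clear that only the `alias` and `defined_in` overrides break the default naming pattern.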

Contributor commented:

Does latest version of a model get to look for the _vX-less sql file? I don't like that the grid here forces new sql files for every version, so you don't get a nice git diff

@jtcohen6 (Collaborator Author) commented Apr 22, 2023:

Does latest version of a model get to look for the _vX-less sql file?

That's not the case in the current implementation. Should it be? I think we could do this. (Naive attempt: dbt-labs/dbt-core@b13cd2b)

Even if we don't do this, you could do the same thing as with aliases, and keep moving the defined_in property around, so that dim_customers.sql is always your "latest":

models: 
  - name: dim_customers
    latest_version: 2
    versions:
      - v: 3
        defined_in: dim_customers_NOT_READY_YET.sql
        ...
      - v: 2
        # because this is the latest, it should have the canonical file name + alias
        alias: dim_customers
        defined_in: dim_customers
        ...
      - v: 1
        ...      

But if that's our strong recommendation - let's just make it the default behavior

@joellabes (Contributor) commented Apr 23, 2023:

It feels weird/WET to have to move both the alias and defined_in around over time to just get the same name as is already defined in the name key up top.

Edit: here's my actual objection: defined_in feels like it should be a "break glass in case of emergency" property, not something that gets rolled out everywhere. If you're using defined_in, you best have a good reason. Encouraging it to be everywhere cheapens that a bit


First, the versioned model preserves its _reference name_. Versioned models are `ref`'d by their _model name_, rather than the name of the file that they're defined in. By default, the `ref` resolves to the latest version (as declared by that model's maintainer), but you can also `ref` a specific version of the model, with a `version` keyword.

<File name="models/schema.yml">
Contributor commented:

Suggested change
<File name="models/schema.yml">
<File name="models/scratchpad.sql">

Contributor commented:

Although if you take the table and sample code above, I think this File block is totally redundant

```yml
selectors:
  - name: exclude_old_versions
    default: "{{ target.name == 'dev' }}"
```
Contributor commented:

Does this need | as_bool?

Collaborator Author commented:

It actually doesn't. I think the reason is that YAML treats the string "True" as truthy.

Comment on lines 225 to 235
Or, you could define a separate view that always points to the latest version of the model. We recommend this pattern because it's the most transparent and easiest to follow.

<File name="models/dim_customers_view.yml">

```sql
{{ config(alias = 'dim_customers') }}

select * from {{ ref('dim_customers') }}
```

</File>
Contributor commented:

As I was building my table of examples up top, I decided that if we're not going to do this magically behind the scenes, I think we should just encourage people to move the alias definition around in their YAML as they progress their models. Telling people to shepherd an entire extra model around by hand feels gross.

@jtcohen6 (Collaborator Author) commented Apr 22, 2023:

@joellabes I've had another thought here: It would be possible to implement this as a standard pattern, with a modification to the generate_alias_name macro.

{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
    {%- if custom_alias_name -%}
        {{ return(custom_alias_name | trim) }}
    {%- elif node.version and not node.is_latest_version -%} {# <--- this bit #}
        {{ return(node.name ~ "_v" ~ (node.version | replace(".", "_"))) }}
    {%- else -%} {# latest version has standard behavior #}
        {{ return(node.name) }}
    {%- endif -%}
{%- endmacro %}

This way, whichever version is latest_version, it always lands in the model's "canonical" location.

Should we make that the default behavior? Or should it be something that end users opt into? I'm inclined to be a bit more opinionated, and say this should be the default—really emphasize that the latest version is the thing, and the old/new versions are mechanisms for managing change—but it does add a bit more inconsistency.
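
For readers who find the Jinja hard to scan, the branching in the macro above can be mirrored in plain Python. This is only an illustrative sketch of the same three cases (custom alias, non-latest version, everything else). The Python function signature is hypothetical and is not part of dbt.

```python
def generate_alias_name(name, version=None, is_latest_version=False, custom_alias_name=None):
    """Plain-Python rendering of the macro's branching (illustrative only)."""
    if custom_alias_name:
        # An explicit alias always wins.
        return custom_alias_name.strip()
    if version and not is_latest_version:
        # Old and prerelease versions get a suffix; dots in the version become underscores.
        return f"{name}_v{str(version).replace('.', '_')}"
    # Unversioned models and the latest version land in the canonical location.
    return name


print(generate_alias_name("dim_customers", version=2, is_latest_version=True))
print(generate_alias_name("dim_customers", version=1))
print(generate_alias_name("dim_customers", version=1.5))
```

The first call returns the bare model name, so bumping `latest_version` moves the canonical relation name to the new version without touching any `alias` config.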

Contributor commented:

Yes I definitely agree it should be the default. More to come on the Slack thread

Of course, if one model version makes meaningful and substantive changes to logic in another, it may not be possible to optimize it in this way. At that point, the cost of human intuition and legibility is more important than the cost of recomputing similar transformations.

We expect to develop more opinionated recommendations as teams start adopting model versions in practice. One recommended pattern we can envision: Prioritize the definition of the `latest_version`, and define other versions (old and prerelease) based on their diffs from the latest. How?
- Define the properties and configuration for the latest version in the top-level model yaml, and the diffs for other versions below (via `include`/`exclude`)
Contributor commented:

👨‍🍳 💋

We expect to develop more opinionated recommendations as teams start adopting model versions in practice. One recommended pattern we can envision: Prioritize the definition of the `latest_version`, and define other versions (old and prerelease) based on their diffs from the latest. How?
- Define the properties and configuration for the latest version in the top-level model yaml, and the diffs for other versions below (via `include`/`exclude`)
- Where possible, define other versions as `select` transformations, which take the latest version as their starting point
- When bumping the `latest_version`, migrate the SQL and yaml accordingly. In this case, we would see if it's possible to redefine `v1` with respect to `v2`.
Contributor commented:

in this case

The sample above where country name is removed? I don't quite follow this bit
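
One way to read the "diffs from the latest" idea is sketched below in Python. It assumes a simplified reading of `include`/`exclude`: a version starts from the top-level column list and only states what differs. The function name `resolve_columns` and the `customer_id` column are hypothetical; `country_name` is taken from the sample referred to in the comment above.

```python
def resolve_columns(top_level_columns, include="all", exclude=()):
    """Columns for one version, expressed as a diff from the top-level definition."""
    if include == "all":
        selected = list(top_level_columns)
    else:
        selected = [c for c in top_level_columns if c in include]
    return [c for c in selected if c not in exclude]


top_level = ["customer_id", "country_name"]  # hypothetical column names

# A version that omits country_name only needs to state that one difference.
print(resolve_columns(top_level, exclude=["country_name"]))
```

The design point is that each version's spec stays short: anything not mentioned is inherited from the top-level definition.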


Many changes to a model are not breaking, and do not require a new version! Examples include adding a new column, or fixing a bug in modeling logic.

By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers.
Contributor commented:

True, but nothing to do with versioning on its own. Is the implicit recommendation here "when you break your contract, you should bump versions"?


By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers.

It's also possible to change the model in more subtle ways — by recalculating a column in a way that doesn't change its name, data type, or enforceable characteristics—but would substantially change the results seen by downstream queriers.
Contributor commented:

Likewise here, it feels like this would benefit from driving the point home: if you are going to surprise your querier, probably bump versions

@jtcohen6 (Collaborator Author) commented:

Thank you for the excellent feedback ❤️

My revision includes two significant assumptions:

  1. That we will implement two of the UX changes proposed in UX improvements to model versions dbt-core#7435 (which I can remove, if implementing proves impossible before Thursday):
    • Latest version can be defined in <model_name>.sql (no suffix)
    • Unpinned ref will log if a newer prerelease version is detected
  2. That, for now, the best / recommended approach to handle aliasing is with a hook that creates a view pointing to the latest version, as in this gist
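
The second UX change in item 1 can be sketched roughly as follows. This is a minimal Python illustration of the proposed behavior, not the dbt-core implementation: it assumes numeric version identifiers, falls back to the highest version when `latest_version` is unset, and uses a hypothetical message format.

```python
def resolve_unpinned_ref(name, versions, latest_version=None):
    """Return (resolved_version, note) for ref(name) with no version argument."""
    # Without an explicit latest_version, the highest version counts as latest.
    latest = latest_version if latest_version is not None else max(versions)
    newer = sorted(v for v in versions if v > latest)
    note = None
    if newer:
        # Hypothetical log text: tell the consumer that a prerelease exists.
        note = f"{name}: resolved to v{latest}, but prerelease v{newer[-1]} exists"
    return latest, note


print(resolve_unpinned_ref("dim_customers", [1, 2, 3], latest_version=2))
print(resolve_unpinned_ref("dim_customers", [1, 2]))
```

An unpinned `ref` keeps working against the declared latest version, while the note gives downstream users early notice that a migration is coming.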

@jtcohen6 marked this pull request as ready for review April 24, 2023 02:02
@jtcohen6 requested a review from a team as a code owner April 24, 2023 02:02
@github-actions bot added the `size: large` label and removed the `size: medium` label Apr 24, 2023
@joellabes (Contributor) left a comment:

🔥

Model versions are different. Multiple versions of a model will live in the same code repository at the same time and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned—multiple versions are live simultaneously; older versions are often eventually sunsetted.
**Versioned models are different.** Defining model `versions` is appropriate when there are people, systems, and processes beyond your team's control, inside or outside of dbt. You can neither simply go migrate them all, nor break their queries on a whim. I need to do my part by offering a migration path, with clear diffs and deprecation dates.

Multiple versions of a model will live in the same code repository at the same time, and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned—multiple versions are live simultaneously; older versions are often eventually sunsetted.
Contributor commented:

(but hopefully not more than 2)


Model versions are different. Multiple versions of a model will live in the same code repository at the same time and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned—multiple versions are live simultaneously; older versions are often eventually sunsetted.
**Versioned models are different.** Defining model `versions` is appropriate when there are people, systems, and processes beyond your team's control, inside or outside of dbt. You can neither simply go migrate them all, nor break their queries on a whim. I need to do my part by offering a migration path, with clear diffs and deprecation dates.
Contributor commented:

You can neither simply go migrate them all, nor break their queries on a whim. I need to do my part by offering a migration path, with clear diffs and deprecation dates.

Both of these are the same actor aren't they? Did you just feel weird about telling people they have to do their part?

Collaborator Author commented:

:) I suspect I'm switching frequently between first & second person - worth another reread just for this

Collaborator Author commented:

I've actually been reasonably consistent:

  • "we" = dbt Labs, developers/maintainers of dbt-core
  • "you" = user of dbt, maintainer of a versioned model

**Where are they defined?**


**Where will they be materialized?** By convention, these will create database relations with aliases `dim_customers_v1` and `dim_customers_v2`. We recommend that you also create a view, named `dim_customers`, pointing to the latest version. Check out guidance on an easy & repeatable way to do that.
@joellabes (Contributor) commented Apr 24, 2023:

Check out guidance on an easy & repeatable way to do that.

Where is that guidance? Looks like a link is missing here


**Where will they be materialized?** By convention, these will create database relations with aliases `dim_customers_v1` and `dim_customers_v2`. We recommend that you also create a view, named `dim_customers`, pointing to the latest version. Check out guidance on an easy & repeatable way to do that.

By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. It will also accept `dim_customers.sql` (no suffix) as the definition of the latest version. (It is possible to override this by setting `defined_in: any_file_name_you_want`, but we strongly encourage you to follow the convention!)
Contributor commented:

It is possible to override this by setting defined_in: any_file_name_you_want

You have to include the .sql suffix right?

(Related: if it's optional, what happens if you have any_file_you_want.sql and any_file_you_want.py?)

Collaborator Author commented:

You don't need to include the file extension (and in fact, shouldn't)

(Related: if it's optional, what happens if you have any_file_you_want.sql and any_file_you_want.py?)

Not allowed - model file names still need to be globally unique, independent of the file extension. I think I have a note about this in the reference docs for defined_in - I think I'll add a link there from here.
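
The uniqueness rule described here (file names must be unique regardless of extension) can be checked with a short Python sketch. The helper name `find_name_collisions` and the file paths are hypothetical; this is not how dbt itself performs the check.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_collisions(paths):
    """Group model files by name without extension; return names claimed by more than one file."""
    by_stem = defaultdict(list)
    for path in paths:
        by_stem[PurePosixPath(path).stem].append(path)
    return {stem: files for stem, files in by_stem.items() if len(files) > 1}


files = [
    "models/any_file_you_want.sql",
    "models/any_file_you_want.py",
    "models/dim_customers_v1.sql",
]
print(find_name_collisions(files))
```

Because `.sql` and `.py` files with the same stem collide, a `defined_in` value without an extension can never be ambiguous.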


<File name="models/dim_customers_view.yml">
<!-- TODO: add the macro from my gist to dbt-core. Better as on-run-end or post-hook? -->
Contributor commented:

I'd say post-hook for all the same reasons we used to encourage doing grants in post-hooks - the changes apply immediately instead of having to wait for the entire 3 hour run to complete

@@ -35,24 +35,29 @@ The standard convention for naming model versions is `<model_name>_v<v>`. This h

The version identifier for a version of a model. This value can be numeric (integer or float), or any string.

The value of the version identifier is used to order versions of a model relative to one another. If a versioned model does _not_ explicitly configure a [`latest_version`](resource-properties/latest-version), the highest version number is used as the latest version to resolve `ref` calls to the model without a `version` argument.
The value of the version identifier is used to order versions of a model relative to one another. If a versioned model does _not_ explicitly configure a [`latest_version`](resource-properties/latest_version), the highest version number is used as the latest version to resolve `ref` calls to the model without a `version` argument.

In general, we recommend that you use a simple "major versioning" scheme for your models: `v1`, `v2`, `v3`, etc, where each version represents a breaking change from previous versions. However, you are welcome to use other versioning schemes.
@joellabes (Contributor) commented Apr 24, 2023:

However, you are welcome to use other versioning schemes

as long as they behave correctly when sort()ed. (or however we're actually doing it).

On that note, do we handle people putting `v`s in their YAML? What would happen if I did this?

models:
  - name: dim_customers
    versions: 
      - v: v1
      ...
      - v: 2

Both from a sorting perspective and an alias-creation perspective - would I wind up with dim_customers_vv1 which outranked the 2 in sort order?

Collaborator Author commented:

Yes & yes. I'll add an explicit caution that people should not include v in their version identifier.
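
To see why both answers are "yes", here is a minimal Python sketch. It assumes one plausible ordering rule, numeric comparison when both identifiers parse as numbers and string comparison otherwise, together with the `<model_name>_v<v>` alias convention. The thread does not spell out the real comparison, so treat `compare_versions` and `default_alias` as hypothetical helpers.

```python
from functools import cmp_to_key


def compare_versions(a, b):
    """Compare numerically when both identifiers parse as numbers, else as strings."""
    try:
        left, right = float(a), float(b)
    except (TypeError, ValueError):
        left, right = str(a), str(b)
    return (left > right) - (left < right)


def default_alias(name, v):
    """Conventional alias: <model_name>_v<v>."""
    return f"{name}_v{v}"


ordered = sorted(["v1", 2], key=cmp_to_key(compare_versions))
print(ordered)                               # [2, 'v1']
print(default_alias("dim_customers", "v1"))  # dim_customers_vv1
```

Under this rule the string `"v1"` sorts after `2` and would be treated as the highest version, which is why identifiers should not carry their own `v` prefix.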

@jtcohen6 (Collaborator Author) commented Apr 24, 2023:

As with dbt-labs/dbt-core#7435 (comment), holding off on merging this until we decide whether to vendor create_latest_version_view (recommended post-hook) within dbt-core directly.

Update: Let's opt for this: the macro will come in v1.6; in the meantime, it's a macro you can copy, paste, edit, and attach as a post-hook.

In the meantime, comments from other reviewers still welcome!

@@ -16,7 +16,7 @@ models:
[description](description): <markdown_string>
[docs](/reference/resource-configs/docs):
show: true | false
[latest_version](resource-properties/latest-version): <version_identifier>
[latest_version](resource-properties/latest_version): <version_identifier>
Contributor commented:

I don't understand why an underscore was added here. The actual page is https://docs.getdbt.com/reference/resource-properties/latest-version, and `latest_version` brings the user to a 'page not found'. Suggesting it goes back to `latest-version`.

Suggested change
[latest_version](resource-properties/latest_version): <version_identifier>
[latest_version](resource-properties/latest-version): <version_identifier>

@jtcohen6 (Collaborator Author) commented Apr 24, 2023:

@mirnawong1 The name of this resource property is latest_version (underscore). I looked at some other similar properties/configs, and they all have underscores in their file name / id:

So I renamed the page, and added a redirect for it

) %}

{% set existing_relation = load_relation(new_relation) %}
{{ drop_relation_if_exists(existing_relation) }}
@jtcohen6 (Collaborator Author) commented Apr 24, 2023:

Since the whole point is that this view should be live-queryable from a BI tool... I don't think we want to be dropping it outside of a transaction / atomic operation.

I don't think we have a handy cross-db way to do this, outside of the actual view materialization logic. This might take some fudging.

@runleonarun (Collaborator) left a comment:

Just a few comments now and I will finish reviewing tomorrow!

_redirects Outdated
@@ -278,6 +278,7 @@ docs/dbt-cloud/using-dbt-cloud/cloud-model-timing-tab /docs/deploy/dbt-cloud-job
/docs/artifacts /docs/dbt-cloud/using-dbt-cloud/artifacts 301
/docs/bigquery-configs /reference/resource-configs/bigquery-configs 301
/reference/resource-properties/docs /reference/resource-configs/docs 301
/reference/resource-properties/latest-version /reference/resource-configs/latest_version 301
Collaborator commented:

I think this should be as follows:

Suggested change
/reference/resource-properties/latest-version /reference/resource-configs/latest_version 301
/reference/resource-properties/latest-version /reference/resource-properties/latest_version 301

Collaborator Author commented:

good catch!

@@ -6,7 +6,7 @@ description: "Model contracts define a set of parameters validated during transf
---

:::info New functionality
This functionality is new in v1.5.
This functionality is new in v1.5 — if you have thoughts, weigh into the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6726)!
Collaborator commented:

"Weigh in" might be vague.

Suggested change
This functionality is new in v1.5 — if you have thoughts, weigh into the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6726)!
This functionality is new in v1.5 — if you have feedback, then participate in the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6726)!

@jtcohen6 (Collaborator Author) commented Apr 25, 2023:

👍 replacing with "comment on" "participate in"

@@ -6,39 +6,120 @@ description: "Version models to help with lifecycle management"
---

:::info New functionality
This functionality is new in v1.5.
This functionality is new in v1.5 — if you have thoughts, weigh into the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6736)!
Collaborator commented:

Suggested change
This functionality is new in v1.5 — if you have thoughts, weigh into the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6736)!
This functionality is new in v1.5 — if you have feedback, then participate in the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6736)!

:::

API versioning is a _complex_ problem in software engineering. It's also essential. Our goal is to _overcome obstacles to transform a complex problem into a reality_.
Versioning APIs is a hard problem in software engineering. At the root of the challenge is the fact that the producers and consumers of an API have competing incentives:
Collaborator commented:

Suggested change
Versioning APIs is a hard problem in software engineering. At the root of the challenge is the fact that the producers and consumers of an API have competing incentives:
Versioning APIs is a hard problem in software engineering. The root of the challenge is that the producers and consumers of an API have competing incentives:

runleonarun previously approved these changes Apr 25, 2023

@runleonarun (Collaborator) left a comment:

@jtcohen6 This looks good! I have some wording suggestions to clarify the message here. I also have a few questions that might be worth addressing.

Approving so you can release during your day.

:::

API versioning is a _complex_ problem in software engineering. It's also essential. Our goal is to _overcome obstacles to transform a complex problem into a reality_.
Versioning APIs is a hard problem in software engineering. The root of the challenge is that the producers and consumers of an API have competing incentives:
- Producers of an API need the ability to make changes to its logic. There is a real cost associated with maintaining legacy endpoints forever, but losing the trust of downstream users is far costlier.
Collaborator

I think we can make this pack more punch by writing out some of the passive language:

Suggested change
- Producers of an API need the ability to make changes to its logic. There is a real cost associated with maintaining legacy endpoints forever, but losing the trust of downstream users is far costlier.
- Producers of an API need the ability to modify its logic. Although maintaining legacy endpoints forever incurs a significant expense, it costs more to lose the trust of downstream users.

API versioning is a _complex_ problem in software engineering. It's also essential. Our goal is to _overcome obstacles to transform a complex problem into a reality_.
Versioning APIs is a hard problem in software engineering. The root of the challenge is that the producers and consumers of an API have competing incentives:
- Producers of an API need the ability to make changes to its logic. There is a real cost associated with maintaining legacy endpoints forever, but losing the trust of downstream users is far costlier.
- Consumers of an API need to trust in its stability—their queries will keep working, and won't break without warning. There is a real cost associated with migrating to a newer API version, but unplanned migration is far costlier.
Collaborator

Suggested change
- Consumers of an API need to trust in its stability—their queries will keep working, and won't break without warning. There is a real cost associated with migrating to a newer API version, but unplanned migration is far costlier.
- Consumers of an API need to trust in its stability—their queries will keep working, and won't break without warning. Although migrating to a newer API version incurs an expense, an unplanned migration is far costlier.

- Producers of an API need the ability to make changes to its logic. There is a real cost associated with maintaining legacy endpoints forever, but losing the trust of downstream users is far costlier.
- Consumers of an API need to trust in its stability—their queries will keep working, and won't break without warning. There is a real cost associated with migrating to a newer API version, but unplanned migration is far costlier.

The goal of model versions is not to make the problem go away, nor to pretend it's somehow easier or simpler than it is. Rather, we want dbt to provide tools that make it possible to tackle this problem, thoughtfully and head-on, and to develop standard patterns for solving it.
Collaborator

To be clear, the "tools" are model versioning, right? I'd also suggest flipping the two sentences so that what model versioning does do is not buried.

Suggested change
The goal of model versions is not to make the problem go away, nor to pretend it's somehow easier or simpler than it is. Rather, we want dbt to provide tools that make it possible to tackle this problem, thoughtfully and head-on, and to develop standard patterns for solving it.
The goal of model versions is not to make the problem go away, nor to pretend it's somehow easier or simpler than it is. Rather, model versioning makes it possible to tackle this problem, thoughtfully and head-on, and to develop standard patterns for solving it.


## When should you version a model?

By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers. These changes, when made intentionally, would require a new model version. But many changes are not breaking, and don't require a new version—such as adding a new column, or fixing a bug in an existing column's calculation.
Collaborator

Suggested change
By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers. These changes, when made intentionally, would require a new model version. But many changes are not breaking, and don't require a new version—such as adding a new column, or fixing a bug in an existing column's calculation.
By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers. These changes, when made intentionally, would require a new model version. But when making non-breaking changes, you don't need a new version—such as adding a new column, or fixing a bug in an existing column's calculation.

Collaborator

@jtcohen6 for this sentence, do we want to say that they get the option of creating a new version vs fixing the problem? It feels like that's the undertone here, but we might want to be explicit. Using "require" kind of sounds like dbt will require it, which doesn't seem to be the case later.

These changes, when made intentionally, would require a new model version

Collaborator Author

Good clarifying question. If you make a breaking contract change, dbt will raise an error during CI — and you'd need to "merge on red" (you always can do it). https://docs.getdbt.com/reference/resource-configs/contract#detecting-breaking-changes

  Consider making an additive (non-breaking) change instead, if possible.
  Otherwise, create a new model version: https://docs.getdbt.com/docs/collaborate/govern/model-versions
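
For readers following this thread, here is a minimal sketch of a model with an enforced contract, which is what lets dbt detect the changes under discussion. The file path, model name (`dim_customers`), and column names (`customer_id`, `country_name`) are hypothetical, chosen only for illustration.

```yaml
# models/schema.yml (hypothetical path)
models:
  - name: dim_customers          # hypothetical model name
    config:
      contract:
        enforced: true           # dbt checks the model's columns against this spec when it builds
    columns:
      - name: customer_id        # hypothetical column
        data_type: int
      - name: country_name       # hypothetical column
        data_type: varchar
```

Under this sketch, removing `country_name` or changing its `data_type` would be a breaking change to the contract, while adding a new column (in both the SQL and the YAML) would not.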


When you make updates to a model's source code—its logical definition, in SQL or Python, or related configuration—dbt can [compare your project to previous state](project-state), enabling you to rebuild only models that have changed, and models downstream of a change. In this way, it's possible to develop changes to a model, quickly test in CI, and efficiently deploy into production—all coordinated via your version control system.

**Versioned models are different.** Defining model `versions` is appropriate when there are people, systems, and processes beyond your team's control, inside or outside of dbt. You can neither simply go migrate them all, nor break their queries on a whim. You need to do my part by offering a migration path, with clear diffs and deprecation dates.
Collaborator

Not sure if "do my part" was a typo?

Suggested change
**Versioned models are different.** Defining model `versions` is appropriate when there are people, systems, and processes beyond your team's control, inside or outside of dbt. You can neither simply go migrate them all, nor break their queries on a whim. You need to do my part by offering a migration path, with clear diffs and deprecation dates.
**Versioned models are different.** Defining model `versions` is appropriate when people, systems, and processes beyond your team's control, inside or outside of dbt, depend on your models. You can neither simply go migrate them all, nor break their queries on a whim. You need to offer a migration path, with clear diffs and deprecation dates.

Collaborator Author

*do your part! whoops. I like your suggestion better


You've always been able to copy-paste, create a new model file, and name it `dim_customers_v2.sql`. Why should you opt for a "real" versioned model instead?

As the **producer** of a versioned model:
Collaborator

These benefits are super clear! Love how this section reads!

Collaborator

You might consider bullets instead of steps. Then the reader can focus on the content. Steps usually indicate you need to do something.

Collaborator Author

good call!

:::

API versioning is a _complex_ problem in software engineering. It's also essential. Our goal is to _overcome obstacles to transform a complex problem into a reality_.
Versioning APIs is a hard problem in software engineering. The root of the challenge is that the producers and consumers of an API have competing incentives:
Collaborator

We never actually say how model versioning relates to API versioning. I wonder if we could call that out here before we start talking about the problems with versioning APIs?

Collaborator Author

Added these sentences in between, to try and connect the dots:

When sharing a final dbt model with other teams or systems, that model is operating like an API. When the producer of that model needs to make significant changes, how can they avoid breaking the queries of its users downstream?

| 2 | "latest" | `ref('dim_customers', v=2)` **and** `ref('dim_customers')` | `dim_customers_v2.sql` **or** `dim_customers.sql` | `analytics.dim_customers_v2` **and** `analytics.dim_customers` (recommended) |
| 1 | "old" | `ref('dim_customers', v=1)` | `dim_customers_v1.sql` | `analytics.dim_customers_v1` |

As you'll see in the implementation section below, a versioned model can reuse the majority of its yaml properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. This gives you, as the producer of a versioned model, the opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live.
Collaborator

Suggested change
As you'll see in the implementation section below, a versioned model can reuse the majority of its yaml properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. This gives you, as the producer of a versioned model, the opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live.
As you'll see in the implementation section below, a versioned model can reuse the majority of its YAML properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. This gives you, as the producer of a versioned model, the opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live.
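
As a rough illustration of the paragraph above, the sketch below declares two versions of one model in a single YAML entry, where a version only states how it differs from the shared columns. The model and column names are hypothetical, and the choice of breaking change (v2 dropping `country_name`) is an assumption made for the example.

```yaml
models:
  - name: dim_customers            # hypothetical model name
    latest_version: 2              # an unpinned ref('dim_customers') resolves to v2
    config:
      contract:
        enforced: true
    columns:                       # shared set of attributes
      - name: customer_id          # hypothetical column
        data_type: int
      - name: country_name         # hypothetical column
        data_type: varchar
    versions:
      - v: 1
        # matches the shared columns above, so nothing more is declared
      - v: 2
        # the assumed breaking change: v2 drops country_name
        columns:
          - include: all
            exclude: [country_name]
```

Because only v2 carries a `columns` override, the difference between the two versions is visible at a glance, and both live versions are tracked in one place.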


Try out v3: {{ ref('my_dbt_project', 'my_model', v='3') }}
Pin to v2: {{ ref('my_dbt_project', 'my_model', v='2') }}
```

## How to create a new version of a model
Collaborator

Not for this week, but eventually, we might want to pull the procedure section out into its own page so people who don't need the "why" and other context can more easily find the "how."

Collaborator Author

Heard! A lot of the "how" details are also captured in the reference documentation: https://docs.getdbt.com/reference/resource-properties/versions

For now, we should expect that the main audience for these docs is people trying to learn about & understand the new feature. As the concept becomes more established, and people are already convinced they want to use the feature, we can make it even easier to just do the thing.

@jtcohen6 jtcohen6 merged commit 22ecf93 into current Apr 26, 2023
@jtcohen6 jtcohen6 deleted the jerco/more-on-model-versions branch April 26, 2023 10:49
Labels
content Improvements or additions to content size: large This change will take more than a week to address and might require more than one person
5 participants