From 09545b7b70184e0275f73d3de08f6399aa6d2196 Mon Sep 17 00:00:00 2001 From: graciegoheen Date: Fri, 20 Dec 2024 11:41:57 -0700 Subject: [PATCH] roadmap post december 2024 --- docs/roadmap/2024-12-play-on.md | 207 ++++++++++++++++++++++++++++++++ 1 file changed, 207 insertions(+) create mode 100644 docs/roadmap/2024-12-play-on.md diff --git a/docs/roadmap/2024-12-play-on.md b/docs/roadmap/2024-12-play-on.md new file mode 100644 index 00000000000..e4229e6d9c2 --- /dev/null +++ b/docs/roadmap/2024-12-play-on.md @@ -0,0 +1,207 @@ +# dbt: Play On (December 2024) + +Oh hi there :) + +We’ve had the opportunity to talk a lot about our investments in open source over the past few months: + +- [I [Grace] renewed my vows with the dbt Community in a Vegas wedding ceremony officiated by Elvis](https://youtu.be/DC9sbZBYzpI?si=uK9Aie6Jl-FIHggm) (aka [@jtcohen6](https://github.com/jtcohen6)). +- [We spoke about dbt Labs’ continued commitment to Open Source in the Coalesce Keynote](https://youtu.be/I72yUtrmhbY?si=Iu6s9WXdHnFtyCvi). +- [We wrote about the importance of extensibility for how we build features in dbt Core.](https://www.getdbt.com/blog/dbt-core-v1-9-is-ga#:~:text=Extensibility%20is%20what%20powers%20the%20community) + +All of these points and themes remain incredibly relevant to our team and are fueling our vision as we prepare for the new year: + +- we're committed to defending dbt Core as the open source standard for data transformation, which will remain licensed under Apache 2.0 +- the dbt framework will continue to be shaped by a collaborative effort between you (the community) and us (the maintainers) +- when we add something new to the standard, we are committing to the long term, which means we are intentional about *how* and *when* we expand its breadth - in the meantime, lean into dbt's extensibility (and show us how you're doing it!) + +Now that we’ve gotten into our new rhythm… + +Now that last year’s focus on stability has earned us the right to ship awesome new additions to the dbt framework… + +Now that we’re actually seeing a *much* higher % of projects running the latest & greatest dbt… + +…the next 6-12 months will feel similar to the last. dbt will keep getting better at doing what it does - being a mission-critical piece of your data stack, and a delightful part of your work. To that end, stability always comes first. And, we will be shipping some exciting new features to dbt Core. + +# Oh What A Year It’s Been! + +This year, we released two new minor versions of dbt Core. + +| **Version** | **When** | **Namesake** | **Stuff** | +| --- | --- | --- | --- | +| [v1.8](https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-v1.8) | May | [Julian Abele](https://github.com/dbt-labs/dbt-core/releases/tag/v1.8.0) | Unit testing. Decoupling of dbt-core and adapters. Flags for managing changes to legacy behaviors. | +| [v1.9](https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-v1.9) | December | [Dr. Susan La Flesche Picotte](https://github.com/dbt-labs/dbt-core/releases/tag/v1.9.0) | Microbatch incremental strategy. New configurations and spec for snapshots. Standardizing support for Iceberg. | + +In the [last roadmap post](https://github.com/dbt-labs/dbt-core/blob/main/docs/roadmap/2023-11-dbt-tng.md), we committed to prioritizing all-around interface stability. This included decoupling dbt-core and adapters, introducing [behavior change flags](https://docs.getdbt.com/reference/global-configs/behavior-changes) to give you time to adjust to ~~breaking~~ changes, and improving the stability of metadata artifacts. You can read more about those efforts [here](https://www.notion.so/E-team-offsite-competitive-deep-dive-SQLMesh-SDF-November-2024-129bb38ebda7807ca023ddf198fb1279?pvs=21). + +Stability means you can upgrade with confidence. + +Stability means less disruptions for our adapters and integrations in the dbt ecosystem. + +Stability means we’ve earned the right to **ship some big new features**. + +This past year, we were able to ship some long-awaited additions and enhancements to the dbt Core framework: + +- [**Unit tests**](https://github.com/dbt-labs/dbt-core/discussions/8275) allow you to validate your SQL modeling logic on a small set of static inputs before you materialize your full model in production. +- [**Snapshots**](https://github.com/dbt-labs/dbt-core/discussions/7018) got the glow up they deserved - with new configurations and spec to make capturing your data changes easier to configure, run, and customize. +- [**Microbatch**](https://github.com/dbt-labs/dbt-core/discussions/10672) incremental models enable you to optimize your largest datasets, by transforming your timeseries data in discrete periods with their own SQL queries, rather than all at once. +- [**Iceberg**](https://docs.getdbt.com/blog/icebeg-is-an-implementation-detail) table format (an open standard for storing data and accessing metadata) is standardized across adapters, enabling you to store your data in a way that promises interoperability with multiple compute engines. + +We also were able to close out some highly-upvoted “paper cuts”: + +- `--empty` flag limits the `ref`s and `source`s to zero rows, which you can use for schema-only dry runs that validate your model SQL and run unit tests +- dbt now issues a single (batch) query when calculating source freshness through metadata, instead of executing a query per source +- improvements to `state:modified` help reduce the risk of false positives due to environment-aware logic +- you can now document your data tests with `description`s + +A lot of these features are things that you (the community) have been discussing and experimenting with for years. To everyone who opened an issue, commented on a discussion, joined us for a feedback session, developed a package, or contributed to our code base… however you made your voice heard this year, **thank you** for continuing to care, for continuing to lean in, for continuing to help shape dbt. + +# New Year’s Resolutions + +To start off the new year, we’re focusing on three major areas of development to the dbt framework: + +- **typed macros** - configure Python type annotations for better macro validation +- **catalogs** - first-class support for materializing dbt models into external catalogs, providing a warehouse-agnostic interface for managing data in object storage +- **sample mode** - limit your data to smaller, time-based samples for faster development and CI testing + +## Typed Macros + +In the simplest terms, [macros](https://docs.getdbt.com/docs/build/jinja-macros#macros) are are pieces of code, written in Jinja, that can be reused throughout your project – they are similar to "functions" in other programming languages. + +In practice, you can reference a macro in a model’s SQL (or config block, or hook) to: + +- make your SQL code more DRY by abstracting snippets of SQL into reusable “functions” +- use the results of one query to generate a set of logic +- change the way your project builds based on the current environment + +Or, if you really need to, run arbitrary SQL in your warehouse via `dbt run-operation`. + +Macros can depend on inputs, vars, `env_vars`, or even other macros — *and* there are “special” macros for defining custom generic tests and materializations. + +This immense flexibility is one of the great benefits of macros — you can use them to solve **a lot** of different problems. + +However, on the flip side, this flexibility — where macros can be whatever you want them to be — makes it challenging to validate that your macros are doing what you expect. + +Without a way to define expected types for the inputs and outputs of macros, our adapter maintainers struggle to validate that a built-in macro override will produce the correct output. Furthermore, *any* analytics engineer writing a macro in dbt should be able to validate the expected behavior. + +One of the things we aim to work on in the coming months is an interface for **typed macros.** + +Imagine being able to codify the expected types* for the inputs and outputs for your macros: + +```sql +{% macro cents_to_dollars(column_name: str, scale: int = 2 -> str) %} + ({{ column_name }} / 100)::numeric(16, {{ scale }}) +{% endmacro %} +``` + +**Note: These are a subset of Python types (internal to dbt), not the data types within the data warehouse.* + +Then, we could issue warnings at parse time when usage of a macro violates these types. By adding the ability to configure type expectations, user-created macros become more predicable, and [built-in macros become easier to override](https://github.com/dbt-labs/dbt-core/issues/9164). + +Should this type checking be on by default, or something you opt into? Should we also support dbt-specific types, such as `Relation`? Head over to the [github discussion](https://github.com/dbt-labs/dbt-core/discussions/11158) to participate in the conversation! + +## Catalogs + +In `v1.9`, we shipped a set of standard configs to materialize dbt models in Iceberg table format: + +```sql +{{ + config( + materialized = "table", + table_format = "iceberg", + external_volume = "s3_iceberg_snow" + ) +}} + +... +``` + +Supporting the Iceberg table format was our first step towards empowering users to adopt Iceberg as a standard storage format for their critical datasets. + +In the coming months, we want to add first-class support for “catalogs” in dbt. “Catalogs,” including Glue or Iceberg REST, operate at a level of abstraction above specific Iceberg tables — and they can provide a warehouse-agnostic interface for managing a large number of datasets in object storage. + +Imagine a new top-level `catalogs.yml` that tells dbt about the catalog integrations you want to write to: + +```yaml +catalogs: + - catalog_name: my_first_catalog + write_integrations: + - integration_name: prod_glue_write_integration + external_volume: my_prod_external_volume + table_format: iceberg + catalog_type: glue +``` + +Then, in your model’s configuration, simply specify the `catalog` field: + +```yaml +{{ + config( + catalog = "my_first_catalog" + ) +}} + +... +``` + +These new configurations will enable dbt to template the correct DDL statements for the platform (`CREATE GLUE ICEBERG TABLE` vs. `CREATE ICEBERG TABLE`, etc.). Now, when you run the above model, it will be materialized as an Iceberg table in s3 registered in the AWS Glue catalog. + +We believe the approach of writing datasets to a platform-agnostic storage layer, and registering those datasets with a similarly agnostic *catalog service,* will become important to the technical foundations of the dbt workflow — the [Analytics Development Lifecycle (ADLC)](https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle) — moving forward. Support for materializing Iceberg tables in external catalogs is another step down this path. + +To read more about catalogs and participate in shaping this feature, head over to the [discussion on GitHub](https://github.com/dbt-labs/dbt-core/discussions/11171)! + +## Sample Mode + +We began our [“event_time” journey](https://github.com/dbt-labs/dbt-core/discussions/10672) with microbatch incremental models. Next up is [sample mode](https://github.com/dbt-labs/dbt-core/discussions/10672#:~:text=%F0%9F%8C%80%5BNext%5D%20%E2%80%9CSample%E2%80%9D%20mode%20for%20dev%20%26%20CI%20runs) - we want to support a pattern for speeding up development and CI testing by filtering your dataset to a *time-limited sample.* + +Imagine your model contains a `ref` statements like so: + +```sql +select * from {{ ref('fct_orders') }} +``` + +During standard runs, this compiles to: + +```sql +select * from my_db.my_schema.fct_orders +``` + +But during a *sample* run, dbt could automatically filter down large tables to the last X days of data using the same `event_time` column used for microbatch. The exact syntax here is TBD, but imagine you run something like `dbt run --sample 3`, which then compiles your code to: + +```sql +select * from my_db.my_schema.fct_orders +-- the event_time column in fct_orders is 'order_at' +where order_at > dateadd(-3, day, current_date) +``` + +No more [overriding the `source` or `ref` macro](https://discourse.getdbt.com/t/limiting-dev-runs-with-a-dynamic-date-range/508) to hack together this functionality. + +Built-in support for “sample mode” — filtering to a consistent time-based “slice” across all models — means faster development and faster testing because you’re running on *less* data. This could be configured for a specific dbt invocation, a CI environment, or as a set-it-and-forget-it default for everyone who’s developing on your team’s project. + +We’ll be opening an additional sample-mode-specific GitHub discussion in January, but feel free to queue up your thoughts [here](https://github.com/dbt-labs/dbt-core/discussions/10672) or [here](https://github.com/dbt-labs/dbt-core/issues/8378) in the meantime! + +## and who could forget, paper cuts! + +Those are the big things we plan to tackle in the coming months. As always, we want to tackle some smaller `paper_cut`s as well. Your upvotes and comments help us prioritize these, so please make some noise if there’s something you care a whole lot about. Some that are top of mind for me already: + +- **DRY-er YML**, including: + - [Defining vars configs outside dbt_project.yml](https://github.com/dbt-labs/dbt-core/issues/2955) + - [Add ability to import/include YAML from other files](https://github.com/dbt-labs/dbt-core/issues/9695) +- **Enhancements to model versions and contracts**, including: + - [Automatically create view/clone of latest version](https://github.com/dbt-labs/dbt-core/issues/7442) + - [Support constraints independently from enforcing a full model contract](https://github.com/dbt-labs/dbt-core/issues/10195) +- and [warnings when configs are misspelled](https://github.com/dbt-labs/dbt-core/issues/8942) :) + +# [**Call me**, **beep me** if you wanna reach me](https://www.youtube.com/watch?v=GIgLqN_rAXU) + +If one of *your* New Year’s resolutions is to be more involved in the dbt community, here are some of the many ways you can contribute: + +- open, upvote, and comment on GitHub [issues](https://docs.getdbt.com/community/resources/oss-expectations#issues) +- start a [discussion](https://docs.getdbt.com/community/resources/oss-expectations#discussions) or discourse post when you’ve got a Big Idea +- engage in conversations in the [dbt Community Slack](https://www.getdbt.com/community/join-the-community) +- [contribute code](https://docs.getdbt.com/community/resources/oss-expectations#pull-requests) back to one of our open source repos, for one of our issues tagged `help_wanted` or `good_first_issue`, and our engineering team will work with you to get it over the finish line +- join us on zoom for feedback sessions (which we’ll post about in slack and in the relevant GitHub discussion) +- share your creative solutions in a blog post, at a dbt meetup, or by talking at Coalesce + +Your feedback and thoughts are incredibly valuable to us, so make yourself heard this year. We’re excited to build some awesome new features together. + +[Your loving wife](https://youtu.be/DC9sbZBYzpI?si=xCGWoQDK-w13Fz6U&t=1594), Grace