
Commit

github-actions[bot] authored Oct 8, 2024
2 parents fafb3ae + cb41939 commit 83ddd13
Showing 27 changed files with 314 additions and 170 deletions.
82 changes: 82 additions & 0 deletions website/blog/2024-10-04-iceberg-is-an-implementation-detail.md
@@ -0,0 +1,82 @@
---
title: "Iceberg Is An Implementation Detail"
description: "This blog post talks about Iceberg table support and why it both matters and doesn't"
slug: icebeg-is-an-implementation-detail

authors: [amy_chen]

tags: [table formats, iceberg]
hide_table_of_contents: false

date: 2024-10-04
is_featured: false
---

If you haven’t paid attention to the data industry news cycle, you might have missed the recent excitement centered around an open table format called Apache Iceberg™. It’s one of many open table formats like Delta Lake, Hudi, and Hive. These formats are changing the way data is stored and metadata accessed. They are groundbreaking in many ways.

But I have to be honest: **I don’t care**. But not for the reasons you think.

## What is Iceberg?

To have this conversation, we need to start with the same foundational understanding of Iceberg. Apache Iceberg is a high-performance open table format developed for modern data lakes. It was designed for large-scale datasets, and within the project, there are many ways to interact with it. When people talk about Iceberg, it often means multiple components including but not limited to:

1. Iceberg Table Format - an open-source table format for large-scale data. Tables materialized in the Iceberg table format are stored on a user’s infrastructure, such as an S3 bucket.
2. Iceberg Data Catalog - an open-source metadata management system that tracks the schema, partition, and versions of Iceberg tables.
3. Iceberg REST Protocol (also called the Iceberg REST API) - the protocol engines use to speak to any Iceberg-compatible catalog.

If you have been in the industry for a while, you also know that everything I just wrote above about Iceberg could just as easily have been written about `Hive`, `Hudi`, or `Delta`. This is because they were all designed to solve essentially the same problem. Ryan Blue (creator of Iceberg) and Michael Armbrust (creator of Delta Lake) recently sat down for this [fantastic chat](https://vimeo.com/1012543474) and made two points that resonated with me:

- “We never intended for people to pay attention to this area. It’s something we wanted to fix, but people should be able to not pay attention and just work with their data. Storage systems should just work.”
- “We solve the same challenges with different approaches.”

At the same time, the industry is converging on Apache Iceberg. [Iceberg has the highest availability of read and write support](https://medium.com/sundeck/2024-lakehouse-format-rundown-7edd75015428).

<Lightbox src="/img/blog/2024-10-04-iceberg-blog/2024-10-03-iceberg-support.png" width="85%" title="Credit to Jacques at Sundeck for creating this fantastic chart of all the Iceberg Support" />


Snowflake launched Iceberg support in 2022. Databricks launched Iceberg support via Uniform last year. Microsoft announced Fabric support for Iceberg in September 2024 at Fabric Con. **Customers are demanding interoperability, and vendors are listening**.

Why does this matter? Industry standardization benefits customers. When the industry standardizes, customers get the gift of flexibility. Everyone has a preferred way of working, and with standardization, they can always bring their preferred tools to their organization’s data.

## Just another implementation detail

I’m not saying open table formats aren't important. Their metadata management and performance benefits make them very meaningful and worth paying attention to. Our users are already excited to use them to create data lakes that save on storage costs, add more abstraction from their compute, and more.

But when I’m building data models or focusing on delivering business value through analytics, my primary concern is not *how* the data is stored; it's *how* I can leverage it to generate insights and drive decisions. The analytics development lifecycle is hard enough without having to take every storage detail into account. dbt abstracts away the underlying platform and lets me focus on writing SQL and orchestrating my transformations. Not having to think about how tables are stored or optimized is a feature: I just need to know that when I reference `dim_customers` or `fct_sales`, the correct data is there and ready to use. **It should just work.**

## Sometimes the details do matter

While table formats are an implementation detail for data transformation, Iceberg can impact dbt developers when those details aren’t seamless. Currently, getting started with Iceberg requires a significant amount of upfront configuration and integration work beyond just creating tables.

One of the biggest hurdles is managing Iceberg’s metadata layer. This metadata often needs to be synced with external catalogs, which requires careful setup and ongoing maintenance to prevent inconsistencies. Permissions and access controls add another layer of complexity: because multiple engines can access Iceberg tables, you have to ensure that all systems have the correct access to both the data files and the metadata catalog. Setting up integrations between these engines is also far from seamless; while some engines natively support Iceberg, others require brittle workarounds to keep the metadata synced correctly. This fragmented landscape means you can end up with a web of interconnected components to maintain.

## Fixing it

**Today, we announced official support for the Iceberg table format in dbt.** With dbt supporting the Iceberg table format, it’s one less thing you have to worry about on your journey to adopting Iceberg.

With support for the Iceberg table format, it is now easier to convert dbt models that use proprietary table formats to Iceberg just by updating your configuration. After you have set up your external storage for Iceberg and connected it to your platform, you can jump into your dbt model and update the configuration to look something like this:

<Lightbox src="/img/blog/2024-10-04-iceberg-blog/iceberg_materialization.png" width="85%" title="Iceberg Table Format Support on dbt for Snowflake" />

It is available on these adapters:

- Athena
- Databricks
- Snowflake
- Spark
- Starburst/Trino
- Dremio

Such is the beauty of any open-source project: Iceberg support grew organically, so the implementations vary across adapters. This will change in the coming months as we converge on one dbt standard. That way, no matter which adapter you jump into, the configuration will always be the same.

## dbt, the abstraction layer

dbt is about more than abstracting away the DDL to create and manage objects. It’s also about providing an opinionated approach to managing and optimizing your data. That remains true for our strategy around Iceberg support.

In our dbt-snowflake implementation, we have already started to [enforce best practices centered around how to manage the base location](https://docs.getdbt.com/reference/resource-configs/snowflake-configs#base-location) so you don’t accidentally create technical debt and your Iceberg implementation scales over time. And we aren’t done yet.
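
For illustration, a hedged sketch of where that guidance shows up in configuration: dbt-snowflake manages the base location for you, and an optional subpath can be appended to it. The volume name and subpath below are placeholders:

```yaml
models:
  - name: fct_sales
    config:
      materialized: table
      table_format: iceberg
      external_volume: my_external_volume   # placeholder volume name
      base_location_subpath: sales          # optional; appended to the dbt-managed base location
```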

That said, while we can create the models, there is a *lot* of initial work to get to that stage. dbt developers must still think through implementation details, like how their external volume has been set up or where dbt can access the metadata. We have to make this better.

Given the friction of getting launched on Iceberg, over the coming months, we will enable more capabilities to empower users to adopt Iceberg. It should be easier to read from foreign Iceberg catalogs. It should be easier to mount your volume. It should be easier to manage refreshes. And you should also trust that permissions and governance are consistently enforced.

And this work doesn’t stop at Iceberg. The framework we are building is also compatible with other table formats, ensuring that whatever table format works for you is supported on dbt. This way &mdash; dbt users can also stop caring about table formats. **It’s just another implementation detail.**
2 changes: 1 addition & 1 deletion website/blog/authors.yml
@@ -1,7 +1,7 @@
---
amy_chen:
image_url: /img/blog/authors/achen.png
job_title: Product Ecosystem Manager
job_title: Product Manager
links:
- icon: fa-linkedin
url: https://www.linkedin.com/in/yuanamychen/
9 changes: 9 additions & 0 deletions website/docs/docs/build/incremental-microbatch.md
@@ -30,6 +30,15 @@ A `sessions` model is aggregating and enriching data that comes from two other m

The `page_view_start` column in `page_views` is configured as that model's `event_time`. The `customers` model does not configure an `event_time`. Therefore, each batch of `sessions` will filter `page_views` to the equivalent time-bounded batch, and it will not filter `customers` (a full scan for every batch).

<File name="models/staging/page_views.yml">

```yaml
models:
- name: page_views
config:
event_time: page_view_start
```
</File>
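
For context, here is a sketch of how the downstream `sessions` model itself might be configured for microbatch. The `session_start` column, `begin` date, and `batch_size` are illustrative assumptions:

```yaml
models:
  - name: sessions
    config:
      materialized: incremental
      incremental_strategy: microbatch
      event_time: session_start   # assumed event_time column on the sessions model
      batch_size: day             # each batch covers one day of data
      begin: '2024-10-01'         # assumed earliest date to backfill from
      lookback: 1                 # also reprocess the most recent prior batch on each run
```
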
We run the `sessions` model on October 1, 2024, and then again on October 2. It produces the following queries:

<Tabs>
72 changes: 32 additions & 40 deletions website/docs/docs/build/metricflow-time-spine.md
@@ -21,41 +21,47 @@ To see the generated SQL for the metric and dimension types that use time spine

## Configuring time spine in YAML

Time spine models are normal dbt models with extra configurations that tell dbt and MetricFlow how to use specific columns by defining their properties. Add the [`models` key](/reference/model-properties) for the time spine in your `models/` directory. If your project already includes a calendar table or date dimension, you can configure that table as a time spine. Otherwise, review the [example time-spine tables](#example-time-spine-tables) to create one.

Some things to note when configuring time spine models:

- Add the configurations under the `time_spine` key for that [model's properties](/reference/model-properties), just as you would add a description or tests.
- You only need to configure time-spine models that the Semantic Layer should recognize.
- At a minimum, define a time-spine table for a daily grain.
- You can optionally define additional time-spine tables for different granularities, like hourly. Review the [granularity considerations](#granularity-considerations) when deciding which tables to create.
- If you're looking to specify the grain of a time dimension so that MetricFlow can transform the underlying column to the required granularity, refer to the [Time granularity documentation](/docs/build/dimensions?dimension=time_gran).


:::tip
If you previously used a model called `metricflow_time_spine`, you no longer need to create this specific model. You can now configure MetricFlow to use any date dimension or time spine table already in your project by updating the `model` setting in the Semantic Layer.

If you don’t have a date dimension table, you can still create one by using the code snippet in the [next section](#creating-a-time-spine-table) to build your time spine model.
:::

<Lightbox src="/img/time_spines.png" title="Time spine directory structure" />
### Creating a time spine table

MetricFlow supports granularities ranging from milliseconds to years. Refer to the [Dimensions page](/docs/build/dimensions?dimension=time_gran#time) (time_granularity tab) to find the full list of supported granularities.

To create a time spine table from scratch, add the following code to your dbt project.
This example creates a time spine at an hourly grain and a daily grain: `time_spine_hourly` and `time_spine_daily`.

<VersionBlock firstVersion="1.9">
<File name="models/_models.yml">

```yaml
[models:](/reference/model-properties)
# Hourly time spine
- name: time_spine_hourly
description: my favorite time spine
time_spine:
standard_granularity_column: date_hour # column for the standard grain of your table, must be date time type.
custom_granularities:
- name: fiscal_year
column_name: fiscal_year_column
columns:
- name: date_hour
granularity: hour # set granularity at column-level for standard_granularity_column

# Daily time spine
- name: time_spine_daily
time_spine:
standard_granularity_column: date_day # column for the standard grain of your table
@@ -66,6 +72,8 @@ If you don’t have a date dimension table, you can still create one by using th
</File>
</VersionBlock>
<Lightbox src="/img/time_spines.png" width="50%" title="Time spine directory structure" />
<VersionBlock lastVersion="1.8">
<File name="models/_models.yml">
@@ -91,30 +99,14 @@ models:
</File>
</VersionBlock>
<Expandable alt_header="Understanding time spine and granularity">
- This example configuration shows two time spine models, `time_spine_hourly` and `time_spine_daily`. It sets the time spine configurations under the `time_spine` key.
- The `standard_granularity_column` is the column that maps to one of our [standard granularities](/docs/build/dimensions?dimension=time_gran). This column must be set under the `columns` key and should have a grain that is finer than or equal to that of any custom granularity columns defined in the same model.
- It needs to reference a column defined under the `columns` key, in this case, `date_hour` and `date_day`, respectively.
- It sets the granularity at the column level using the `granularity` key, in this case, `hour` and `day`, respectively.
- MetricFlow will use the `standard_granularity_column` as the join key when joining the time spine table to another source table.
- [The `custom_granularities` field](#custom-calendar) (available in Versionless and dbt v1.9 and higher) lets you specify non-standard time periods like `fiscal_year` or `retail_month` that your organization may use.

</Expandable>

<Expandable alt_header="Creating a time spine table">

A few things to note when creating time spine models:
* MetricFlow will use the time spine with the largest compatible granularity for a given query to ensure the most efficient query possible. For example, if you have a time spine at a monthly grain and query a dimension at a monthly grain, MetricFlow will use the monthly time spine. If you only have a daily time spine, MetricFlow will use the daily time spine and `date_trunc` it to month.
* If query efficiency is more important to you than configuration time or storage constraints, you can add a time spine for each granularity you intend to use. For most engines, the query performance difference should be minimal, and transforming your time spine to a coarser grain at query time shouldn't add significant overhead to your queries.
* We recommend having a time spine at the finest grain used in any of your dimensions to avoid unexpected errors. For example, if you have dimensions at an hourly grain, you should have a time spine at an hourly grain.
</Expandable>

For an example project, refer to our [Jaffle shop](https://github.com/dbt-labs/jaffle-sl-template/blob/main/models/marts/_models.yml) example.

### Considerations when choosing which granularities to create{#granularity-considerations}

@@ -302,9 +294,9 @@ and date_hour < dateadd(day, 30, current_timestamp())

<VersionBlock lastVersion="1.8">

The ability to configure custom calendars, such as a fiscal calendar, is available in [dbt Cloud Versionless](/docs/dbt-versions/upgrade-dbt-version-in-cloud#versionless) or dbt Core [v1.9 and higher](/docs/dbt-versions/core).

To access this feature, [upgrade to Versionless](/docs/dbt-versions/versionless-cloud) or upgrade your dbt Core version to v1.9 or higher.
</VersionBlock>

<VersionBlock firstVersion="1.9">
@@ -337,6 +329,6 @@ models:
</File>

#### Coming soon
Note that features like calculating offsets and period-over-period will be supported soon!

</VersionBlock>