Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] More robust example for incremental runs #5321

Closed
dbeatty10 opened this issue Apr 22, 2024 · 5 comments · Fixed by #5326
Closed

[Core] More robust example for incremental runs #5321

dbeatty10 opened this issue Apr 22, 2024 · 5 comments · Fixed by #5326
Labels
content Improvements or additions to content dbt Core The changes proposed in this issue relate to dbt Core

Comments

@dbeatty10
Copy link
Contributor

dbeatty10 commented Apr 22, 2024

Link to the page(s) on docs.getdbt.com requiring updates

https://docs.getdbt.com/docs/build/incremental-models

Tell us more about this update

The docs currently have the following example:

{% if is_incremental() %}
  where event_time >= (select max(event_time) from {{ this }})
{% endif %}

But this assumes that the the table is not empty.

Improved logic handles the case when the table is empty by using a default date in the distant past:

  where event_time >= (select coalesce(max(event_time), '1900-01-01') from {{ this }})

Reviewers/Stakeholders/SMEs

@b-per

Related GitHub issues

dbt-labs/dbt-core#9997 (comment)

Additional information

Tip

Instead of '1900-01-01', the ideal scenario would be if there were a cross-database way to specify the minimum date / timestamp_tz / timestamp_ntz / timestamp_ltz supported by the database engine. Then we'd just include that in our example instead.

@dbeatty10 dbeatty10 added content Improvements or additions to content dbt Core The changes proposed in this issue relate to dbt Core labels Apr 22, 2024
@BenameurZyad
Copy link

BenameurZyad commented Apr 22, 2024

We used something like this in our case :

where event_time >= ( select coalesce(max(event_time), '' ) from {{ this }} )

@dbeatty10
Copy link
Contributor Author

@BenameurZyad coalesce(max(event_time), '' ) didn't work for me when I tried it out -- which database did you use? And are you certain it worked as intended when the incremental table is empty?

I've got a draft PR to update the docs here: #5326

@BenameurZyad
Copy link

Ouch. Sorry @dbeatty10 !
this is our actual implementation :
coalesce ( select (max(event_time) fom {{ this }} ), '' )

I changed it a bit to align with your logic above without trying, thinking it would be the same. Sorry for the confusion.
This one in this comment works for sure for us in Snowflake.

@dbeatty10
Copy link
Contributor Author

That doesn't look like it would work if I copy-pasted it. i.e., fom vs. from

For sake of completeness and correctness, would you mind copy-pasting your exact logic that you think would work in dbt-snowflake?

@BenameurZyad
Copy link

here it is :


with raw as (
    select
        properties,                 -- this is a json (stored as VARIANT type)
        content,                    -- this is a json (stored as VARIANT type)
        __INGESTED_AT               -- this is a timestamp UTC (stored as TEXT)
    from {{source('********','********')}}

    {% if is_incremental() %}

  -- this filter will only be applied on an incremental run
  -- (uses >= to include records arriving later on the same day as the last run of this model)

    -- this should prevent a bug where the incremental model would not work if the target table is empty
    where __INGESTED_AT >= coalesce((select max(__INGESTED_AT) from {{ this }}), '')

    {% endif %}
),

dbeatty10 added a commit that referenced this issue Apr 24, 2024
[Preview](https://docs-getdbt-com-git-dbeatty10-patch-4-dbt-labs.vercel.app//docs/build/incremental-models#filtering-rows-on-an-incremental-run)

## What are you changing in this pull request and why?

resolves #5321

To ensure that the updated code will work for a broad number of users
without issues, I tested the following example against these data
platforms:
- bigquery
- databricks
- duckdb
- postgres
- redshift
- snowflake

<img width="782" alt="image"
src="https://github.com/dbt-labs/docs.getdbt.com/assets/44704949/0739892e-6f5d-45b8-ac1d-5bbd844cf096">

☝️ Notice the table is empty, like the edge case scenario described in
dbt-labs/dbt-core#9997

<img width="772" alt="image"
src="https://github.com/dbt-labs/docs.getdbt.com/assets/44704949/87ad6438-082f-4d65-9a8b-f97d36497c8e">

☝️ Notice it successfully added new data when it arrived.

<details>
<summary>

### Reprex
</summary>

Create this file:

`models/my_incremental.sql`

```sql
{{ config(materialized="incremental") }}

with

non_empty_cte as (

    select 1 as id, cast('2024-01-01' as date) as event_time

),

empty_cte as (

    select 0 as id, cast('1999-12-31' as date) as event_time
    from non_empty_cte
    where 0=1

)

select *

{% if var("scenario", "empty") == "empty" %}

  from empty_cte

{% else %}

  from non_empty_cte

{% endif %}

{% if is_incremental() %}

  -- this filter will only be applied on an incremental run
  -- (uses >= to include records whose timestamp occurred since the last run of this model)
  where event_time >= (select coalesce(max(event_time), cast('1900-01-01' as date)) from {{ this }})

{% endif %}
```

Assuming a `profiles.yml` with all the relevant profile names, run these
commands:

```shell
dbt run  --profile duckdb -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile duckdb --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile duckdb -s my_incremental --vars '{scenario: empty}'
dbt show --profile duckdb --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile duckdb -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile duckdb --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile postgres -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile postgres --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile postgres -s my_incremental --vars '{scenario: empty}'
dbt show --profile postgres --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile postgres -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile postgres --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile redshift -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile redshift --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile redshift -s my_incremental --vars '{scenario: empty}'
dbt show --profile redshift --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile redshift -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile redshift --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile databricks -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile databricks --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile databricks -s my_incremental --vars '{scenario: empty}'
dbt show --profile databricks --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile databricks -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile databricks --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile snowflake -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile snowflake --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile snowflake -s my_incremental --vars '{scenario: empty}'
dbt show --profile snowflake --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile snowflake -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile snowflake --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile bigquery -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile bigquery --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile bigquery -s my_incremental --vars '{scenario: empty}'
dbt show --profile bigquery --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile bigquery -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile bigquery --inline "select * from {{ ref('my_incremental') }}"
```

</details>

## Checklist
- [x] Review the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
so my content adheres to these guidelines.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
content Improvements or additions to content dbt Core The changes proposed in this issue relate to dbt Core
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants