Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] MicrobatchBuilder generates invalid table names with spaces for hourly batches #11165

Open
2 tasks done
pei0804 opened this issue Dec 18, 2024 · 0 comments
Open
2 tasks done
Labels
bug Something isn't working triage

Comments

@pei0804
Copy link

pei0804 commented Dec 18, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When using incremental_strategy="microbatch" with batch_size="hour", MicrobatchBuilder generates a batch ID that contains spaces and special characters. This leads to invalid temporary table names and SQL syntax errors in database adapters.

Example error from Snowflake adapter:

SQL compilation error: syntax error line 1 at position 150 unexpected '00'.

The error occurs because the generated temporary table name contains spaces and timezone information:

create or replace temporary table [...].model_name__dbt_tmp_20241218 00:00:00+00:00

CleanShot 2024-12-18 at 11 39 54@2x

Expected Behavior

MicrobatchBuilder should generate a valid batch ID without spaces or special characters for hourly batches, similar to how it handles daily batches.

For example:

create or replace temporary table [...].model_name__dbt_tmp_20241218T000000Z

Steps To Reproduce

  1. Create a model with microbatch incremental strategy:
{{
    config(
        materialized="incremental",
        incremental_strategy="microbatch",
        begin="2022-01-01T00:00:00",
        batch_size="hour",
        event_time="_partition_hourly",
        unique_key="_partition_hourly",
        lookback=1,
    )
}}
  1. Run the model with dbt run --select hoge --event-time-start "2024-12-18T00:00:00" --event-time-end "2024-12-18T01:00:00"
  2. Observe the SQL syntax error due to invalid temporary table name

Relevant log output

02:13:16  Using snowflake connection "model.fuga.hoge"
02:13:16  On model.fuga.hoge: create or replace temporary table dbname.schemaname.modelname__dbt_tmp_20241218 00:00:00+00:00
         as
        (

with
    source as (

        select *
        from dbname.schemaname.modelname

    ),

    deduplicated as (
        select *
    from source
    qualify
        row_number() over (
            partition by id
            order by _partition_hourly asc, bid_at asc
        ) = 1
    )

select *
from deduplicated
        )
02:13:17  Snowflake adapter: Snowflake query id: hoge
02:13:17  Snowflake adapter: Snowflake error: 001003 (42000): SQL compilation error:
syntax error line 1 at position 150 unexpected '00'.
02:13:17  Unhandled error while executing
Exception on worker thread. Database Error
  001003 (42000): SQL compilation error:
  syntax error line 1 at position 150 unexpected '00'.
02:13:17  Batch 1 of 1 ERROR creating batch 2024-12-18 00:00:00+00:00 of dbname.schemaname.modelname [ERROR in 0.39s]

Environment

- OS: macOS Sonoma 14.6.1
- Python: 3.12.6
- dbt-core: 1.9.1
- dbt-snowflake: 1.9.0

Which database adapter are you using with dbt?

snowflake

Additional Context

The issue occurs in several parts of the codebase:

  1. First, the batch ID is set in core/dbt/task/run.py:
    https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/task/run.py#L352-L356
self.node.batch = BatchContext(
    id=MicrobatchBuilder.batch_id(batch[0], self.node.config.batch_size),
    event_time_start=batch[0],
    event_time_end=batch[1],
)
  1. This batch ID is then used in the global make_temp_relation macro to create unique temporary table names:
    https://github.com/dbt-labs/dbt-adapters/blob/5407391c5cef22a5c0431daa469d6a8295c026d8/dbt/include/global_project/macros/adapters/relation.sql#L9-L16
{% macro make_temp_relation(base_relation, suffix='__dbt_tmp') %}
  {#-- This ensures microbatch batches get unique temp relations to avoid clobbering --#}
  {% if suffix == '__dbt_tmp' and model.batch %}
    {% set suffix = suffix ~ '_' ~ model.batch.id %}
  {% endif %}
  {{ return(adapter.dispatch('make_temp_relation', 'dbt')(base_relation, suffix)) }}
{% endmacro %}
  1. The problematic batch ID generation occurs in the MicrobatchBuilder class:
    @staticmethod
    def batch_id(start_time: datetime, batch_size: BatchSize) -> str:
    return MicrobatchBuilder.format_batch_start(start_time, batch_size).replace("-", "")
    @staticmethod
    def format_batch_start(batch_start: datetime, batch_size: BatchSize) -> str:
    return str(
    batch_start.date() if (batch_start and batch_size != BatchSize.hour) else batch_start
    )
@staticmethod
def batch_id(start_time: datetime, batch_size: BatchSize) -> str:
    return MicrobatchBuilder.format_batch_start(start_time, batch_size).replace("-", "")

@staticmethod
def format_batch_start(batch_start: datetime, batch_size: BatchSize) -> str:
    return str(
        batch_start.date() if (batch_start and batch_size != BatchSize.hour) else batch_start
    )

For hourly batches (batch_size="hour"), format_batch_start returns str(batch_start) which generates a datetime string like "2024-12-18 00:00:00+00:00". While batch_id removes hyphens, it does not handle spaces and timezone information, resulting in an invalid table name.

For non-hourly batches (day/month/year), it correctly uses batch_start.date() which produces a clean format like "2024-12-18", and after removing hyphens becomes "20241218".

The issue stems from the fact that the batch ID flows from run.py through the Jinja templating system and into SQL table names without proper sanitization for hourly batches. The fix would likely involve modifying the format_batch_start method to ensure hourly timestamps use a database-friendly format (e.g., ISO format "20241218T000000Z") similar to how it handles daily batches.

@pei0804 pei0804 added bug Something isn't working triage labels Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

1 participant