Incorrect `data_type` for BigQuery `REPEATED` fields #190

Thrasi · 2024-11-12T16:25:30Z

Describe the bug

BigQuery supports arrays as REPEATED fields
Running generate_model_yaml for a model with a repeated field of a certain datatype will result in a
schema with data_type: datatype rather than data_type: array<datatype>
Running the model again with contract.enforced=true will show the error:

column_name	definition_type	contract_type	mismatch_reason
repeated_int	ARRAY<INT64>	INT64	data type mismatch

Repeated records should have data_type: array

Steps to reproduce

create a model_with_repeated_field.sql:

select 
    [1,2,3,4] as repeated_int

run it

$ dbt run -s model_with_repeated_field -f

Expected results

version: 2

models:
  - name: model_with_repeated_field
    description: ""
    columns:
      - name: repeated_int
        data_type: array<int64>
        description: ""

Actual results

version: 2

models:
  - name: model_with_repeated_field
    description: ""
    columns:
      - name: repeated_int
        data_type: int64
        description: ""

Screenshots and log output

Running the model with the following yml:

models:
  - name: model_with_repeated_field
    config:
      contract:
        enforced: true
    columns:
      - name: repeated_int
        data_type: int64
        description: ""

System information

packages:
  - package: dbt-labs/codegen
    version: 0.12.1
  - package: dbt-labs/dbt_utils
    version: 1.1.1

Which database are you using dbt with?

The output of dbt --version:

Core:
  - installed: 1.8.8
  - latest:    1.8.8 - Up to date!

Plugins:
  - bigquery: 1.8.3 - Up to date!

The operating system you're using:
MacOS Sequoia Version 15.1

The output of python --version:
Python 3.10.15

Additional context

generate_model_yaml gets the data_type using data_type_format_model

    {% if include_data_types %}
        {% do model_yaml.append('        data_type: ' ~ codegen.data_type_format_model(column)) %}
    {% endif %}

which in turn calls codegen.format_column

{% macro data_type_format_model(column) -%}
  {{ return(adapter.dispatch('data_type_format_model', 'codegen')(column)) }}
{%- endmacro %}

{# format a column data type for a model #}
{% macro default__data_type_format_model(column) %}
    {% set formatted = codegen.format_column(column) %}
    {{ return(formatted['data_type'] | lower) }}
{% endmacro %}

format_column is vendored in macros/vendored/dbt_core/format_column.sql
But it doesn't handle the specific case for repeated fields.

I am tempted to create a default__format_column and a bigquery__format_column to handle BigQuery specifically:

{% macro format_column(column) -%}
  {{ return(adapter.dispatch('format_column', 'codegen')(column)) }}
{%- endmacro %}

{% macro default__format_column(column) -%}
  {% set data_type = column.dtype %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

{% macro bigquery__format_column(column) -%}
  {% if column.mode.lower() == "repeated" %}
    {% set data_type = "array" if column.dtype.lower() == "record" else "array<" ~ column.dtype ~ ">" %}
  {% else %}
    {% set data_type = column.dtype %}
  {% endif %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

and moving it all into helpers.sql while removing the vendored format_column

Alternatively, bigquery__format_column could return a 'mode': column.mode field
and the adapter specific datatype conversion would be taken care of in
bigquery__data_type_format_model and bigquery__data_type_format_source

Are you interested in contributing the fix?

Yes. I would like some input on whether the proposed solution in Additional context is reasonable.

The text was updated successfully, but these errors were encountered:

dbeatty10 · 2024-12-05T23:15:11Z

Thanks for reaching out @Thrasi!

TLDR

Your proposed solution to handle BigQuery REPEATED fields by converting default__format_column and a bigquery__format_column makes sense to me!

Background context

Here's the context of why it was vendored in the first place: #122 (comment)

The aim was to preserve the wide range of require-dbt-version: [">=1.0.0", "<2.0.0"] (rather than needing to raise the lower bound to a dbt-core spec (>= 1.5.0) that includes the format_column macro).

Ultimately, we'd be interested in just ripping these definitions altogether and replacing codegen.format_column(column) with dbt.format_column(column) at some point in the future.

In the meantime

The definition of default__format_column was moved here.

The current definition for bigquery is here.

The only difference in the definition for bigquery is using column.data_type rather than column.dtype.

The good news is that the bigquery of column.data_type looks like it handles the case of REPEATED fields.

Recommended code

So these might do the trick (until we don't care anymore about preserving compatibility in the full [">=1.0.0", "<2.0.0"] range):

{# Vendored from: https://github.com/dbt-labs/dbt-adapters/blob/c7b12aee533184bad391a657d1753539d1dd496a/dbt/include/global_project/macros/relations/column/columns_spec_ddl.sql#L85-L89 #}
{% macro default__format_column(column) -%}
  {% set data_type = column.dtype %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

{# Vendored from: https://github.com/dbt-labs/dbt-bigquery/blob/4d255b2f854d21d5d8871bdaa8d7ab47e7e863a3/dbt/include/bigquery/macros/utils/get_columns_spec_ddl.sql#L1-L5 #}
{% macro bigquery__format_column(column) -%}
  {% set data_type = column.data_type %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

I'd prefer to keep them in the file macros/vendored/dbt_core/format_column.sql to make it clear that logic is just being copy-pasted from elsewhere. (But possibly renaming that file from dbt_core to dbt_adapters and adding comments with hyperlinks to where the code is being copy-pasted from.)

Would welcome a PR if you are still interested in contributing one!

Thrasi · 2024-12-09T22:22:25Z

Great! I will add a test for this too.

Thrasi · 2024-12-11T16:40:42Z

Using the vendored code we get "appropriate" data_type descriptions in the model:
i.e. the test model:

select
  [1, 2] AS repeated_int,
  [
    STRUCT(1 as int_field, [STRUCT("a" as string_field)] as nested_repeated_struct),
    STRUCT(2 AS int_field, [STRUCT("a" as string_field)] as nested_repeated_struct)
  ] as repeated_struct

generates:

version: 2

models:
  - name: repeated_nested
    description: ""
    columns:
      - name: repeated_int
        data_type: array<int64>
        description: ""

      - name: repeated_struct
        data_type: array<struct<`int_field` int64, `nested_repeated_struct` array<struct<`string_field` string>>>>
        description: ""

      - name: repeated_struct.int_field
        data_type: int64
        description: ""

      - name: repeated_struct.nested_repeated_struct
        data_type: array<struct<`string_field` string>>
        description: ""

      - name: repeated_struct.nested_repeated_struct.string_field
        data_type: string
        description: ""

which matches:
the BigQuery schema:

"schema": {
    "fields": [
      {
        "description": "",
        "mode": "REPEATED",
        "name": "repeated_int",
        "type": "INTEGER"
      },
      {
        "description": "",
        "fields": [
          {
            "description": "",
            "name": "int_field",
            "type": "INTEGER"
          },
          {
            "description": "",
            "fields": [
              {
                "description": "",
                "name": "string_field",
                "type": "STRING"
              }
            ],
            "mode": "REPEATED",
            "name": "nested_repeated_struct",
            "type": "RECORD"
          }
        ],
        "mode": "REPEATED",
        "name": "repeated_struct",
        "type": "RECORD"
      }
    ]
  },

So this is positive.
However, running this model with config.contract.enforced: true still fails. The query it attempts to execute is:

SELECT
  *
FROM (
  SELECT
    CAST(NULL AS ARRAY<int64>) AS repeated_int,
    CAST(NULL AS STRUCT<int_field int64, nested_repeated_struct STRUCT<string_field string> string>>> int64, `nested_repeated_struct` ARRAY<STRUCT<`string_field` string>>>>) AS repeated_struct ) AS __dbt_sbq
WHERE
  FALSE
  AND CURRENT_TIMESTAMP() = CURRENT_TIMESTAMP()
LIMIT
  0

The repeated nested column should produce:

CAST(NULL AS ARRAY<STRUCT<int_field int64, nested_repeated_struct ARRAY<STRUCT<string_field string>>>>) AS nested_repeated_struct

to match the schema.

I cannot see the data_type as being incorrect, so this might not be an issue with dbt-codegen specifically but how dbt interacts with the configuration.
This is not an issue when repeated structs have data_type: array in the configuration file. So overriding the explicit data_type of repeated structs as array could help.
Any ideas?

It could be good to add tests enforcing contracts to catch these kind of issues.

dbeatty10 · 2024-12-11T17:12:37Z

Looking good @Thrasi !

What is the YAML that would work when the model contract is enforced?

Per your suggestion about overriding the explicit data_type of repeated structs as array, does this work when contracts are enforced, by any chance?

version: 2

models:
  - name: repeated_nested
    description: ""
    columns:
      - name: repeated_int
        data_type: array<int64>
        description: ""

      - name: repeated_struct
        data_type: array
        description: ""

      - name: repeated_struct.int_field
        data_type: int64
        description: ""

      - name: repeated_struct.nested_repeated_struct
        data_type: array
        description: ""

      - name: repeated_struct.nested_repeated_struct.string_field
        data_type: string
        description: ""

Thrasi · 2024-12-11T17:26:27Z

Yes precisely

dbeatty10 · 2024-12-11T19:19:09Z

@Thrasi Would adding logic similar to what you suggested earlier handle those cases?

See below for a quick shot I took at this, but didn't test out at all.

It uses column.mode to find REPEATED fields and then treats them as a generic array if its dtype is a RECORD.

{# Vendored from: https://github.com/dbt-labs/dbt-bigquery/blob/4d255b2f854d21d5d8871bdaa8d7ab47e7e863a3/dbt/include/bigquery/macros/utils/get_columns_spec_ddl.sql#L1-L5 #}
{# But modified to handle https://github.com/dbt-labs/dbt-codegen/issues/190 #}
{% macro bigquery__format_column(column) -%}
  {% set data_type = column.data_type %}
  {% if column.mode.lower() == "repeated" and column.dtype.lower() == "record" %}
    {% set data_type = "array" %}
  {% endif %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

Thrasi · 2024-12-12T15:16:59Z

Yes this works, it's same logic but much cleaner and uses column.data_type when it can. I've updated the PR.

Thrasi added bug Something isn't working triage labels Nov 12, 2024

dbeatty10 removed the triage label Dec 5, 2024

dbeatty10 changed the title ~~Incorrect data_type for BigQuery REPEATED fields.~~ Incorrect data_type for BigQuery REPEATED fields Dec 5, 2024

Thrasi mentioned this issue Dec 11, 2024

Bigquery repeated data type #236

Merged

6 tasks

dbeatty10 mentioned this issue Dec 11, 2024

Contract data type of nested and repeated data (ARRAY of STRUCTs) in BigQuery dbt-labs/docs.getdbt.com#6645

Open

1 task

dbeatty10 closed this as completed in #236 Dec 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect `data_type` for BigQuery `REPEATED` fields #190

Incorrect `data_type` for BigQuery `REPEATED` fields #190

Thrasi commented Nov 12, 2024 •

edited

Loading

dbeatty10 commented Dec 5, 2024

Thrasi commented Dec 9, 2024

Thrasi commented Dec 11, 2024

dbeatty10 commented Dec 11, 2024

Thrasi commented Dec 11, 2024

dbeatty10 commented Dec 11, 2024

Thrasi commented Dec 12, 2024

Incorrect data_type for BigQuery REPEATED fields #190

Incorrect data_type for BigQuery REPEATED fields #190

Comments

Thrasi commented Nov 12, 2024 • edited Loading

Describe the bug

Steps to reproduce

Expected results

Actual results

Screenshots and log output

System information

Additional context

Are you interested in contributing the fix?

dbeatty10 commented Dec 5, 2024

TLDR

Background context

In the meantime

Recommended code

Thrasi commented Dec 9, 2024

Thrasi commented Dec 11, 2024

dbeatty10 commented Dec 11, 2024

Thrasi commented Dec 11, 2024

dbeatty10 commented Dec 11, 2024

Thrasi commented Dec 12, 2024

Incorrect `data_type` for BigQuery `REPEATED` fields #190

Incorrect `data_type` for BigQuery `REPEATED` fields #190

Thrasi commented Nov 12, 2024 •

edited

Loading