Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect data_type for BigQuery REPEATED fields #190

Closed
1 of 5 tasks
Thrasi opened this issue Nov 12, 2024 · 7 comments · Fixed by #236
Closed
1 of 5 tasks

Incorrect data_type for BigQuery REPEATED fields #190

Thrasi opened this issue Nov 12, 2024 · 7 comments · Fixed by #236
Labels
bug Something isn't working

Comments

@Thrasi
Copy link
Contributor

Thrasi commented Nov 12, 2024

Describe the bug

BigQuery supports arrays as REPEATED fields
Running generate_model_yaml for a model with a repeated field of a certain datatype will result in a
schema with data_type: datatype rather than data_type: array<datatype>
Running the model again with contract.enforced=true will show the error:

column_name definition_type contract_type mismatch_reason
repeated_int ARRAY<INT64> INT64 data type mismatch

Repeated records should have data_type: array

Steps to reproduce

create a model_with_repeated_field.sql:

select 
    [1,2,3,4] as repeated_int

run it

$ dbt run -s model_with_repeated_field -f

Expected results

version: 2

models:
  - name: model_with_repeated_field
    description: ""
    columns:
      - name: repeated_int
        data_type: array<int64>
        description: ""

Actual results

version: 2

models:
  - name: model_with_repeated_field
    description: ""
    columns:
      - name: repeated_int
        data_type: int64
        description: ""

Screenshots and log output

Running the model with the following yml:

models:
  - name: model_with_repeated_field
    config:
      contract:
        enforced: true
    columns:
      - name: repeated_int
        data_type: int64
        description: ""

Screenshot 2024-11-12 at 16 32 13

System information

packages:
  - package: dbt-labs/codegen
    version: 0.12.1
  - package: dbt-labs/dbt_utils
    version: 1.1.1

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.8.8
  - latest:    1.8.8 - Up to date!

Plugins:
  - bigquery: 1.8.3 - Up to date!

The operating system you're using:
MacOS Sequoia Version 15.1

The output of python --version:
Python 3.10.15

Additional context

generate_model_yaml gets the data_type using data_type_format_model

    {% if include_data_types %}
        {% do model_yaml.append('        data_type: ' ~ codegen.data_type_format_model(column)) %}
    {% endif %}

which in turn calls codegen.format_column

{% macro data_type_format_model(column) -%}
  {{ return(adapter.dispatch('data_type_format_model', 'codegen')(column)) }}
{%- endmacro %}

{# format a column data type for a model #}
{% macro default__data_type_format_model(column) %}
    {% set formatted = codegen.format_column(column) %}
    {{ return(formatted['data_type'] | lower) }}
{% endmacro %}

format_column is vendored in macros/vendored/dbt_core/format_column.sql
But it doesn't handle the specific case for repeated fields.

I am tempted to create a default__format_column and a bigquery__format_column to handle BigQuery specifically:

{% macro format_column(column) -%}
  {{ return(adapter.dispatch('format_column', 'codegen')(column)) }}
{%- endmacro %}

{% macro default__format_column(column) -%}
  {% set data_type = column.dtype %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

{% macro bigquery__format_column(column) -%}
  {% if column.mode.lower() == "repeated" %}
    {% set data_type = "array" if column.dtype.lower() == "record" else "array<" ~ column.dtype ~ ">" %}
  {% else %}
    {% set data_type = column.dtype %}
  {% endif %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

and moving it all into helpers.sql while removing the vendored format_column

Alternatively, bigquery__format_column could return a 'mode': column.mode field
and the adapter specific datatype conversion would be taken care of in
bigquery__data_type_format_model and bigquery__data_type_format_source

Are you interested in contributing the fix?

Yes. I would like some input on whether the proposed solution in Additional context is reasonable.

@Thrasi Thrasi added bug Something isn't working triage labels Nov 12, 2024
@dbeatty10
Copy link
Contributor

Thanks for reaching out @Thrasi!

TLDR

Your proposed solution to handle BigQuery REPEATED fields by converting default__format_column and a bigquery__format_column makes sense to me!

Background context

Here's the context of why it was vendored in the first place: #122 (comment)

The aim was to preserve the wide range of require-dbt-version: [">=1.0.0", "<2.0.0"] (rather than needing to raise the lower bound to a dbt-core spec (>= 1.5.0) that includes the format_column macro).

Ultimately, we'd be interested in just ripping these definitions altogether and replacing codegen.format_column(column) with dbt.format_column(column) at some point in the future.

In the meantime

The definition of default__format_column was moved here.

The current definition for bigquery is here.

The only difference in the definition for bigquery is using column.data_type rather than column.dtype.

The good news is that the bigquery of column.data_type looks like it handles the case of REPEATED fields.

Recommended code

So these might do the trick (until we don't care anymore about preserving compatibility in the full [">=1.0.0", "<2.0.0"] range):

{# Vendored from: https://github.com/dbt-labs/dbt-adapters/blob/c7b12aee533184bad391a657d1753539d1dd496a/dbt/include/global_project/macros/relations/column/columns_spec_ddl.sql#L85-L89 #}
{% macro default__format_column(column) -%}
  {% set data_type = column.dtype %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

{# Vendored from: https://github.com/dbt-labs/dbt-bigquery/blob/4d255b2f854d21d5d8871bdaa8d7ab47e7e863a3/dbt/include/bigquery/macros/utils/get_columns_spec_ddl.sql#L1-L5 #}
{% macro bigquery__format_column(column) -%}
  {% set data_type = column.data_type %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

I'd prefer to keep them in the file macros/vendored/dbt_core/format_column.sql to make it clear that logic is just being copy-pasted from elsewhere. (But possibly renaming that file from dbt_core to dbt_adapters and adding comments with hyperlinks to where the code is being copy-pasted from.)

Would welcome a PR if you are still interested in contributing one!

@dbeatty10 dbeatty10 removed the triage label Dec 5, 2024
@dbeatty10 dbeatty10 changed the title Incorrect data_type for BigQuery REPEATED fields. Incorrect data_type for BigQuery REPEATED fields Dec 5, 2024
@Thrasi
Copy link
Contributor Author

Thrasi commented Dec 9, 2024

Great! I will add a test for this too.

@Thrasi Thrasi mentioned this issue Dec 11, 2024
6 tasks
@Thrasi
Copy link
Contributor Author

Thrasi commented Dec 11, 2024

Using the vendored code we get "appropriate" data_type descriptions in the model:
i.e. the test model:

select
  [1, 2] AS repeated_int,
  [
    STRUCT(1 as int_field, [STRUCT("a" as string_field)] as nested_repeated_struct),
    STRUCT(2 AS int_field, [STRUCT("a" as string_field)] as nested_repeated_struct)
  ] as repeated_struct

generates:

version: 2

models:
  - name: repeated_nested
    description: ""
    columns:
      - name: repeated_int
        data_type: array<int64>
        description: ""

      - name: repeated_struct
        data_type: array<struct<`int_field` int64, `nested_repeated_struct` array<struct<`string_field` string>>>>
        description: ""

      - name: repeated_struct.int_field
        data_type: int64
        description: ""

      - name: repeated_struct.nested_repeated_struct
        data_type: array<struct<`string_field` string>>
        description: ""

      - name: repeated_struct.nested_repeated_struct.string_field
        data_type: string
        description: ""

which matches:
the BigQuery schema:

"schema": {
    "fields": [
      {
        "description": "",
        "mode": "REPEATED",
        "name": "repeated_int",
        "type": "INTEGER"
      },
      {
        "description": "",
        "fields": [
          {
            "description": "",
            "name": "int_field",
            "type": "INTEGER"
          },
          {
            "description": "",
            "fields": [
              {
                "description": "",
                "name": "string_field",
                "type": "STRING"
              }
            ],
            "mode": "REPEATED",
            "name": "nested_repeated_struct",
            "type": "RECORD"
          }
        ],
        "mode": "REPEATED",
        "name": "repeated_struct",
        "type": "RECORD"
      }
    ]
  },

So this is positive.
However, running this model with config.contract.enforced: true still fails. The query it attempts to execute is:

SELECT
  *
FROM (
  SELECT
    CAST(NULL AS ARRAY<int64>) AS repeated_int,
    CAST(NULL AS STRUCT<int_field int64, nested_repeated_struct STRUCT<string_field string> string>>> int64, `nested_repeated_struct` ARRAY<STRUCT<`string_field` string>>>>) AS repeated_struct ) AS __dbt_sbq
WHERE
  FALSE
  AND CURRENT_TIMESTAMP() = CURRENT_TIMESTAMP()
LIMIT
  0

The repeated nested column should produce:

CAST(NULL AS ARRAY<STRUCT<int_field int64, nested_repeated_struct ARRAY<STRUCT<string_field string>>>>) AS nested_repeated_struct

to match the schema.

I cannot see the data_type as being incorrect, so this might not be an issue with dbt-codegen specifically but how dbt interacts with the configuration.
This is not an issue when repeated structs have data_type: array in the configuration file. So overriding the explicit data_type of repeated structs as array could help.
Any ideas?

It could be good to add tests enforcing contracts to catch these kind of issues.

@dbeatty10
Copy link
Contributor

Looking good @Thrasi !

What is the YAML that would work when the model contract is enforced?

Per your suggestion about overriding the explicit data_type of repeated structs as array, does this work when contracts are enforced, by any chance?

version: 2

models:
  - name: repeated_nested
    description: ""
    columns:
      - name: repeated_int
        data_type: array<int64>
        description: ""

      - name: repeated_struct
        data_type: array
        description: ""

      - name: repeated_struct.int_field
        data_type: int64
        description: ""

      - name: repeated_struct.nested_repeated_struct
        data_type: array
        description: ""

      - name: repeated_struct.nested_repeated_struct.string_field
        data_type: string
        description: ""

@Thrasi
Copy link
Contributor Author

Thrasi commented Dec 11, 2024

Yes precisely

@dbeatty10
Copy link
Contributor

@Thrasi Would adding logic similar to what you suggested earlier handle those cases?

See below for a quick shot I took at this, but didn't test out at all.

It uses column.mode to find REPEATED fields and then treats them as a generic array if its dtype is a RECORD.

{# Vendored from: https://github.com/dbt-labs/dbt-bigquery/blob/4d255b2f854d21d5d8871bdaa8d7ab47e7e863a3/dbt/include/bigquery/macros/utils/get_columns_spec_ddl.sql#L1-L5 #}
{# But modified to handle https://github.com/dbt-labs/dbt-codegen/issues/190 #}
{% macro bigquery__format_column(column) -%}
  {% set data_type = column.data_type %}
  {% if column.mode.lower() == "repeated" and column.dtype.lower() == "record" %}
    {% set data_type = "array" %}
  {% endif %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

@Thrasi
Copy link
Contributor Author

Thrasi commented Dec 12, 2024

Yes this works, it's same logic but much cleaner and uses column.data_type when it can. I've updated the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants