[CT-1114] Prevent cache inconsistencies during on_schema_change #447

Closed
jtcohen6 opened this issue Aug 31, 2022 · 0 comments · Fixed by #451
Labels
bug Something isn't working

Comments

jtcohen6 (Contributor) commented Aug 31, 2022

As a short-term fix for cache inconsistencies (e.g. #431), let's entirely remove the cache lookup in adapter.get_columns_in_relation:

def get_columns_in_relation(self, relation: Relation) -> List[SparkColumn]:
    cached_relations = self.cache.get_relations(relation.database, relation.schema)
    cached_relation = next(
        (
            cached_relation
            for cached_relation in cached_relations
            if str(cached_relation) == str(relation)
        ),
        None,
    )
    columns = []
    if cached_relation and cached_relation.information:
        columns = self.parse_columns_from_information(cached_relation)
    if not columns:
        # In open-source Delta, the 'show table extended' query output doesn't
        # return the relation's schema. If columns are empty from the cache,
        # use the get_columns_in_relation Spark macro,
        # which executes a 'describe extended tablename' query
        try:
            rows: List[agate.Row] = self.execute_macro(
                GET_COLUMNS_IN_RELATION_RAW_MACRO_NAME, kwargs={"relation": relation}
            )
            columns = self.parse_describe_extended(relation, rows)
        except dbt.exceptions.RuntimeException as e:
            # Spark throws an error when the table doesn't exist, whereas other
            # CDWs just return an empty list; normalize the behavior here
            errmsg = getattr(e, "msg", "")
            if "Table or view not found" in errmsg or "NoSuchTableException" in errmsg:
                pass
            else:
                raise e
    return columns

That should become just:

def get_columns_in_relation(self, relation: Relation) -> List[SparkColumn]:
    columns = []
    try:
        rows: List[agate.Row] = self.execute_macro(
            GET_COLUMNS_IN_RELATION_RAW_MACRO_NAME, kwargs={"relation": relation}
        )
        columns = self.parse_describe_extended(relation, rows)
    except dbt.exceptions.RuntimeException as e:
        # Spark throws an error when the table doesn't exist, whereas other
        # CDWs just return an empty list; normalize the behavior here
        errmsg = getattr(e, "msg", "")
        if "Table or view not found" in errmsg or "NoSuchTableException" in errmsg:
            pass
        else:
            raise e
    return columns

That will be slower in the general case, but it's also guaranteed to return correct results, which is more important.
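To make the failure mode concrete, here is a minimal sketch (with hypothetical names, not the adapter's actual classes) of how a cached column lookup can disagree with the live table once on_schema_change has altered it:

```python
# Toy model of the stale-cache problem: the cache snapshot predates the
# ALTER TABLE that on_schema_change='append_new_columns' performs, so a
# cached lookup returns the pre-ALTER schema.

class RelationCache:
    """Hypothetical cache keyed by relation name, holding column lists."""
    def __init__(self):
        self._columns = {}

    def store(self, relation, columns):
        self._columns[relation] = list(columns)  # snapshot, not a reference

    def get_columns(self, relation):
        return self._columns.get(relation)


cache = RelationCache()
live_table = ["id"]  # columns actually in the warehouse
cache.store("incremental_model", live_table)

# dbt processes the schema change and alters the live table...
live_table.append("color")

# ...but a cached lookup still serves the pre-ALTER snapshot:
stale = cache.get_columns("incremental_model")
assert stale == ["id"]                # cache missed the new column
assert live_table == ["id", "color"]  # live table has both columns

# Always querying the live table (as in the proposed fix) avoids this:
def get_columns_in_relation(relation):
    return list(live_table)  # stands in for 'describe extended'

assert get_columns_in_relation("incremental_model") == ["id", "color"]
```

The trade-off stated above follows directly: every call pays for a live query, but the answer can never lag behind a schema change.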

Reproduction case

{{ config(
  materialized = 'incremental',
  on_schema_change = 'append_new_columns'
) }}

select
  1 as id
  {% if is_incremental() %}
  , 'blue' as color
  {% endif %}
$ dbt run --full-refresh && dbt run

Current behavior: The second dbt run fails. dbt correctly adds the color column to incremental_model while processing schema changes. When it comes time to build the insert statement, it accesses a stale set of columns (just id) from the cache, and the insert fails with:

org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot write to 'spark_catalog.dbt_jcohen.incremental_model', not enough data columns; target table has 2 column(s) but the inserted data has 1 column(s)

Expected behavior: After the second run, incremental_model should contain both columns, including the additional color column.
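The schema-change handling itself works as intended; only the later cached lookup goes wrong. A simplified sketch of the comparison behind append_new_columns (real dbt compares typed Column objects, not bare names) shows why the color column gets added:

```python
# Simplified illustration of the 'append_new_columns' diff: columns present
# in the new select but missing from the existing table are added via
# ALTER TABLE ... ADD COLUMNS.
target_columns = ["id"]           # existing incremental_model
source_columns = ["id", "color"]  # columns produced by the incremental select

columns_to_add = [c for c in source_columns if c not in target_columns]
print(columns_to_add)  # ['color']
```

After that ALTER, the live table has two columns, so an insert built from a stale one-column cache entry is guaranteed to trip the "not enough data columns" error shown above.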

Incidentally, I still had my branch from #433 checked out locally while working on the reproduction case. So I confirmed that the change there resolves this issue, too—but with more complexity, and a big performance boost. For now, we can resolve just this bug as a standalone :)

@jtcohen6 jtcohen6 added the bug Something isn't working label Aug 31, 2022
@github-actions github-actions bot changed the title Prevent cache inconsistencies during on_schema_change [CT-1114] Prevent cache inconsistencies during on_schema_change Aug 31, 2022