[CT-1114] Prevent cache inconsistencies during on_schema_change #447

Closed
jtcohen6 opened this issue Aug 31, 2022 · 0 comments · Fixed by #451
Labels
bug Something isn't working

Comments

jtcohen6 (Contributor) commented Aug 31, 2022

As a short-term fix for cache inconsistencies (e.g. #431), let's entirely remove the cache lookup in adapter.get_columns_in_relation:

def get_columns_in_relation(self, relation: Relation) -> List[SparkColumn]:
    cached_relations = self.cache.get_relations(relation.database, relation.schema)
    cached_relation = next(
        (
            cached_relation
            for cached_relation in cached_relations
            if str(cached_relation) == str(relation)
        ),
        None,
    )
    columns = []
    if cached_relation and cached_relation.information:
        columns = self.parse_columns_from_information(cached_relation)
    if not columns:
        # In open-source Delta, the 'show table extended' query output doesn't
        # return the relation's schema. If columns are empty from the cache,
        # use the get_columns_in_relation Spark macro,
        # which executes a 'describe extended tablename' query
        try:
            rows: List[agate.Row] = self.execute_macro(
                GET_COLUMNS_IN_RELATION_RAW_MACRO_NAME, kwargs={"relation": relation}
            )
            columns = self.parse_describe_extended(relation, rows)
        except dbt.exceptions.RuntimeException as e:
            # Spark throws an error when the table doesn't exist, whereas other
            # CDWs just return an empty list; normalize the behavior here
            errmsg = getattr(e, "msg", "")
            if "Table or view not found" in errmsg or "NoSuchTableException" in errmsg:
                pass
            else:
                raise e
    return columns

That should become just:

def get_columns_in_relation(self, relation: Relation) -> List[SparkColumn]:
    columns = []
    try:
        rows: List[agate.Row] = self.execute_macro(
            GET_COLUMNS_IN_RELATION_RAW_MACRO_NAME, kwargs={"relation": relation}
        )
        columns = self.parse_describe_extended(relation, rows)
    except dbt.exceptions.RuntimeException as e:
        # Spark throws an error when the table doesn't exist, whereas other
        # CDWs just return an empty list; normalize the behavior here
        errmsg = getattr(e, "msg", "")
        if "Table or view not found" in errmsg or "NoSuchTableException" in errmsg:
            pass
        else:
            raise e
    return columns

That will be slower in the general case, but it's also guaranteed to return correct results, which is more important.
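To make the failure mode concrete, here is a minimal sketch (with hypothetical names, not the adapter's actual classes) of how a cached column lookup can disagree with the live table once on_schema_change has altered it:

```python
# Toy model of the stale-cache problem: the cache snapshot predates the
# ALTER TABLE that on_schema_change='append_new_columns' performs, so a
# cached lookup returns the pre-ALTER schema.

class RelationCache:
    """Hypothetical cache keyed by relation name, holding column lists."""
    def __init__(self):
        self._columns = {}

    def store(self, relation, columns):
        self._columns[relation] = list(columns)  # snapshot, not a reference

    def get_columns(self, relation):
        return self._columns.get(relation)


cache = RelationCache()
live_table = ["id"]  # columns actually in the warehouse
cache.store("incremental_model", live_table)

# dbt processes the schema change and alters the live table...
live_table.append("color")

# ...but a cached lookup still serves the pre-ALTER snapshot:
stale = cache.get_columns("incremental_model")
assert stale == ["id"]                # cache missed the new column
assert live_table == ["id", "color"]  # live table has both columns

# Always querying the live table (as in the proposed fix) avoids this:
def get_columns_in_relation(relation):
    return list(live_table)  # stands in for 'describe extended'

assert get_columns_in_relation("incremental_model") == ["id", "color"]
```

The trade-off stated above follows directly: every call pays for a live query, but the answer can never lag behind a schema change.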

Reproduction case

{{ config(
  materialized = 'incremental',
  on_schema_change = 'append_new_columns'
) }}

select
  1 as id
  {% if is_incremental() %}
  , 'blue' as color
  {% endif %}
$ dbt run --full-refresh && dbt run

Current behavior: The second dbt run fails. dbt correctly adds the color column to incremental_model while processing schema changes. When it comes time to build the insert statement, it accesses a stale set of columns (just id) from the cache, and the insert fails with:

org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot write to 'spark_catalog.dbt_jcohen.incremental_model', not enough data columns; target table has 2 column(s) but the inserted data has 1 column(s)

Expected behavior: After the second run, incremental_model should contain both columns, including the additional color column.
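The schema-change handling itself works as intended; only the later cached lookup goes wrong. A simplified sketch of the comparison behind append_new_columns (real dbt compares typed Column objects, not bare names) shows why the color column gets added:

```python
# Simplified illustration of the 'append_new_columns' diff: columns present
# in the new select but missing from the existing table are added via
# ALTER TABLE ... ADD COLUMNS.
target_columns = ["id"]           # existing incremental_model
source_columns = ["id", "color"]  # columns produced by the incremental select

columns_to_add = [c for c in source_columns if c not in target_columns]
print(columns_to_add)  # ['color']
```

After that ALTER, the live table has two columns, so an insert built from a stale one-column cache entry is guaranteed to trip the "not enough data columns" error shown above.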

Incidentally, I still had my branch from #433 checked out locally while working on the reproduction case. So I confirmed that the change there resolves this issue, too—but with more complexity, and a big performance boost. For now, we can resolve just this bug as a standalone :)

@jtcohen6 jtcohen6 added the bug Something isn't working label Aug 31, 2022
@github-actions github-actions bot changed the title Prevent cache inconsistencies during on_schema_change [CT-1114] Prevent cache inconsistencies during on_schema_change Aug 31, 2022