As a short-term fix to cache inconsistencies (e.g. #431), let's remove the cache lookup in adapter.get_columns_in_relation entirely:

dbt-spark/dbt/adapters/spark/impl.py, lines 211 to 241 in 24e796d (the tail of that range is excerpted below):

except dbt.exceptions.RuntimeException as e:
    # spark would throw error when table doesn't exist, where other
    # CDW would just return an empty list, normalizing the behavior here
    errmsg = getattr(e, "msg", "")
    if "Table or view not found" in errmsg or "NoSuchTableException" in errmsg:
        pass
    else:
        raise e
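For context, the part slated for removal is the cache-first branch at the top of that range. The following is a rough method-body sketch of that pattern, shown without the surrounding class or imports; names such as self.cache.get_relations, cached_relation.information, and parse_columns_from_information are approximations of the adapter code at that commit, not an exact copy:

def get_columns_in_relation(self, relation: Relation) -> List[SparkColumn]:
    # Approximate sketch of the cache-first lookup being removed:
    # first try to answer from relations cached earlier in the run...
    cached_relations = self.cache.get_relations(relation.database, relation.schema)
    cached_relation = next(
        (r for r in cached_relations if str(r) == str(relation)), None
    )
    columns: List[SparkColumn] = []
    if cached_relation and cached_relation.information:
        # ...parsing columns out of `show table extended` output captured
        # before any ALTER TABLE ran. This is the data that can go stale.
        columns = self.parse_columns_from_information(cached_relation)
    if not columns:
        # ...and only fall back to the `describe extended` macro path
        # shown in the excerpt above when the cache has nothing usable.
        ...
    return columns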
That whole block should become just:

def get_columns_in_relation(self, relation: Relation) -> List[SparkColumn]:
    columns: List[SparkColumn] = []
    try:
        rows: List[agate.Row] = self.execute_macro(
            GET_COLUMNS_IN_RELATION_RAW_MACRO_NAME, kwargs={"relation": relation}
        )
        columns = self.parse_describe_extended(relation, rows)
    except dbt.exceptions.RuntimeException as e:
        # spark would throw error when table doesn't exist, where other
        # CDW would just return an empty list, normalizing the behavior here
        errmsg = getattr(e, "msg", "")
        if "Table or view not found" in errmsg or "NoSuchTableException" in errmsg:
            pass
        else:
            raise e
    return columns
That will be slower in the general case, but it's also guaranteed to return correct results, which is more important.
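A quick way to exercise that guarantee (a hypothetical check, not from the issue; it assumes an adapter and relation wired up the way dbt's adapter test suites do it):

# Hypothetical check: once the column exists on the warehouse side, a fresh
# lookup sees it immediately, because every call now runs the
# `describe extended` macro instead of reading the run-level cache.
adapter.execute(f"alter table {relation} add columns (color string)")
columns = adapter.get_columns_in_relation(relation)
assert "color" in [c.name for c in columns]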
Reproduction case
{{ config(
    materialized = 'incremental',
    on_schema_change = 'append_new_columns'
) }}

select
    1 as id
    {% if is_incremental() %}
    , 'blue' as color
    {% endif %}
$ dbt run --full-refresh && dbt run
Current behavior: The second dbt run fails. dbt correctly adds the color column to incremental_model while processing schema changes. When it comes time to build the insert statement, it accesses a stale set of columns (just id) from the cache, and the insert fails with:
org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot write to 'spark_catalog.dbt_jcohen.incremental_model', not enough data columns; target table has 2 column(s) but the inserted data has 1 column(s)
Expected behavior: After the second run, incremental_model should contain both columns, including the additional color column.
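To make that sequence concrete, here is a minimal, self-contained sketch (plain Python, not dbt internals) of how a run-level column cache goes stale once the table has been altered mid-run:

# Toy model of the bug: the column cache is filled once at the start of the
# run, the live table is altered afterwards, and the insert statement is
# still built from the cached (stale) column list.
live_table_columns = {"incremental_model": ["id"]}

# Cache populated when the run starts.
column_cache = {name: list(cols) for name, cols in live_table_columns.items()}

# on_schema_change='append_new_columns' adds the new column on the warehouse...
live_table_columns["incremental_model"].append("color")

# ...but the insert is generated from the cache, which was never refreshed.
insert_columns = column_cache["incremental_model"]        # ['id'], stale
target_columns = live_table_columns["incremental_model"]  # ['id', 'color']
assert len(insert_columns) < len(target_columns)  # "not enough data columns"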
Incidentally, I still had my branch from #433 checked out locally while working on the reproduction case. So I confirmed that the change there resolves this issue, too—but with more complexity, and a big performance boost. For now, we can resolve just this bug as a standalone :)
github-actions bot changed the title from "Prevent cache inconsistencies during on_schema_change" to "[CT-1114] Prevent cache inconsistencies during on_schema_change" on Aug 31, 2022.