Describe the bug
Hi everyone, I am currently running into an issue with dbt (0.20.1) in production that seems related to some "concurrent updates" when I run different, not directly related queries at the same time from different models. As a simplified description, my models folder is divided into subfolders, for a total of 117 models.
Then at some point, let's say a Process-A starts model_7 (dbt run --models model_7 on a Server-A), which contains the query table_65.sql. From the beginning of the logs for this process I can see something like:
2021-09-14 08:43:56.293473 (MainThread): Acquiring new spark connection "model.my_proj.table_1".
2021-09-14 08:43:56.385776 (MainThread): Acquiring new spark connection "model.my_proj.table_2".
....
2021-09-14 08:43:56.390074 (MainThread): Acquiring new spark connection "model.my_proj.table_117".
Meanwhile a separate Process-B starts model_9 (dbt run --models model_9 on a Server-B), which requires dropping the table table_82 (table_65 and table_82 are not related).
Back to Process-A, the log now shows (likely caused by Process-B):
2021-09-14 08:44:35.863983 (ThreadPoolExecutor-1_0): Error while running:
/* {"app": "dbt", "dbt_version": "0.20.1", "profile_name": "my_proj_spark", "target_name": "prod", "connection_name": "list_None_my_models"} */
show table extended in my_models like '*'
2021-09-14 08:44:35.864026 (ThreadPoolExecutor-1_0): Database Error
org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'table_82' not found in database 'my_models'
This last log message gets repeated 3 times in a row. For some reason this also triggers the build of table_65 3 times (as I can see from the log):
2021-09-14 08:44:36.189855 (Thread-1): /* {"app": "dbt", "dbt_version": "0.20.1", "profile_name": "my_proj_spark", "target_name": "prod", "node_id": "model.my_proj.table_65"} */
create table model_7.table_65
using parquet
partitioned by (country,month_selected)
location 's3://my_data/....'
as ....
2021-09-14 08:44:36.281855 (Thread-1): /* {"app": "dbt", "dbt_version": "0.20.1", "profile_name": "my_proj_spark", "target_name": "prod", "node_id": "model.my_proj.table_65"} */
create table model_7.table_65
using parquet
partitioned by (country,month_selected)
location 's3://my_data/....'
as ....
2021-09-14 08:44:36.399835 (Thread-1): /* {"app": "dbt", "dbt_version": "0.20.1", "profile_name": "my_proj_spark", "target_name": "prod", "node_id": "model.my_proj.table_65"} */
create table model_7.table_65
using parquet
partitioned by (country,month_selected)
location 's3://my_data/....'
as ....
which understandably makes the process fail with this error:
2021-09-14 08:44:36.944890 (Thread-1): Database Error
org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Table my_proj.table_65 already exists. You need to drop it first.
If I am reading the logs correctly: is there any workaround to avoid multiple triggers of the same build when an unrelated table disappears and fails the analysis? Or is this a dbt bug? Thank you
Steps To Reproduce
Please look at the bug description. I think the process can be reproduced locally by following those steps.
Expected behavior
The expected behaviour is to avoid a failure when unrelated tables get dropped elsewhere.
Screenshots and log output
Please look at the bug description.
System information
Which database are you using dbt with?
- postgres
- redshift
- bigquery
- snowflake
- other (specify: AWS/EMR 6.3)
The output of dbt --version:
installed version: 0.20.1
latest version: 0.20.2
Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
Plugins:
- bigquery: 0.19.2
- snowflake: 0.19.2
- redshift: 0.19.2
- postgres: 0.19.2
- spark: 0.20.1
The operating system you're using:
Ubuntu 20.04 (on AWS/EC2)
The output of python --version:
python --version
Python 3.8.14
Additional context
Running this process on AWS/EMR 6.3 with Glue as "Hive" replacement for Spark.
I found the root of the previous issue (at least I hope). This happens because dbt runs the following command via Spark: show table extended in my_glue_schema like '*'. In my case the same schema is used by several other processes, which can generate tables with "bad" schemas or without a "Location" (S3 path) defined (a view, for example). That makes the highlighted command fail, so an eventual "drop table" or "insert overwrite" won't happen, and only the "create table" step will. Why does dbt need to run show table extended in my_glue_schema like '*'? Can it be skipped somehow?
I have been pointed to this previous issue: #93. That seems to be a major issue for my case, as we cannot run separate parts of a potential single "tree": they may overlap the creation/dropping of tables that are expected in one tree or another.
In environments like Glue/Catalog via EMR it may not be required, though, as the "layer" is the Catalog service itself. Not sure if I am missing the point here. In that case, should that command be "optional"? For example something like:
show table extended in my_glue_schema in ('table_1', 'table_2', ...etc...)
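In the meantime, one possible mitigation under the current behaviour: since the failing metadata scan is scoped to a single schema (show table extended in <schema> like '*'), pointing each independent process at its own target schema keeps one process's scan from seeing tables another process is dropping. A hedged sketch of a profiles.yml along those lines (target names, host, and schemas below are hypothetical placeholders, not from the original report):

```yaml
# Hypothetical profiles.yml sketch: one target schema per process, so the
# "show table extended in <schema> like '*'" scan run for Process-A never
# sees tables that Process-B is concurrently dropping. Host/port/schema
# values are placeholders; adapt to your EMR/thrift setup.
my_proj_spark:
  target: prod_a
  outputs:
    prod_a:            # used by Process-A / Server-A
      type: spark
      method: thrift
      host: emr-master.example.internal
      port: 10001
      schema: my_models_a
    prod_b:            # used by Process-B / Server-B
      type: spark
      method: thrift
      host: emr-master.example.internal
      port: 10001
      schema: my_models_b
```

This only sidesteps the race when the concurrent runs can genuinely live in separate schemas; models that must share a schema would still hit the same scan.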
Sorry for the delay @danfran! I'm going to close this in favor of the discussion and next steps in #228, since I believe a solid resolution there would also avoid this bug.