[CT-1467] [CT-1466] [Regression] v1.3: views instead of temp tables cause parallel sessions to interfere: possible data leak and failure #306
Comments
Thanks for opening @mzayer-coveo. Leaving some quick notes below:
Hi @mzayer-coveo, could you please tell me more about your intended use case? I can't figure out why you would want to run the same project multiple times in parallel.
We have a multi-tenant architecture where each table contains the data of multiple tenants. To avoid accessing/mixing the data of more than one tenant, we have views that filter each tenant's rows based on session variables.
That makes sense. It still goes against the expectation that a single model should not be run multiple times concurrently, but it feels like a valid use case. Anyway, if you don't have hard performance constraints (the reason why we switched methods in 1.3), have you tried using the …
One quick update here: it sounds like, for some users/tables, the switch to views over temp tables is actually a performance regression (#323). If that's the case, we may wish to make this an opt-in configuration instead. In the meantime, it's possible to switch back to using temp tables in your own project by reimplementing one macro:

```sql
{% macro dbt_snowflake_get_tmp_relation_type(strategy, unique_key, language) %}
    -- always table
    {{ return('table') }}
{% endmacro %}
```
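A quick note on applying the override, in case it's useful (this is standard dbt macro resolution, not specific to this issue): a macro defined in your own project shadows the adapter's implementation of the same name, so saving the snippet above in any file under your project's `macros/` directory should be enough to restore the pre-1.3 temp-table behavior for all incremental models.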
Thanks for your replies. Both options are possible workarounds that we have considered, but we preferred to stick with v1.2 for now.
Hi. We think this caused a serious failure in our overnight run. We had this error on 53 different models in one run.
We are struggling to make sense of what actually happened: there were no concurrent processes running, and the model named in the error ran successfully. We are implementing the above macro now to prevent it from happening again (we were using 1.3.1).
Just to add to the parallel execution of models... it could occur when users have, say, 2 jobs that happen to include the same incremental model: for example, one job scheduled hourly and another scheduled every 6 hours.
So the hourly job will encounter this issue every 6 hours (well, depending on which job kicks off first and starts creating the temp view first / drops the temp view first).
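To make the collision concrete, here is a hypothetical crontab expressing that setup (paths, tags, and schedules are made up for illustration):

```bash
# two scheduled jobs whose selections both include the same incremental model
0 * * * *    cd /srv/analytics && dbt run -s tag:hourly      # runs every hour
0 */6 * * *  cd /srv/analytics && dbt run -s tag:six_hourly  # runs every 6 hours
```

Whenever both selections include the same incremental model, the two invocations race on that model's `__dbt_tmp` view at 00:00, 06:00, 12:00, and 18:00.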
In dbt v <= 1.2 a temporary table is created; in dbt v >= 1.3 it is a normal view. To make parallel running possible, a temporary view should be created instead of a normal view. Possible solution: add a `temporary` parameter and code logic into the macro `snowflake__create_view_as(...)`, similarly to how it is done in `snowflake__create_table_as(...)`. In many cases there are multiple incremental merge processes executed in parallel, and in dbt v >= 1.3 this is not possible anymore; see the sketch below.
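For illustration, a minimal sketch of what that suggested change could look like. This is not the macro as it ships in dbt-snowflake; the `temporary` parameter and its default are assumptions:

```sql
{% macro snowflake__create_view_as(relation, sql, temporary=False) -%}
    {#- sketch: thread a temporary flag through, mirroring snowflake__create_table_as -#}
    create or replace {% if temporary %}temporary {% endif %}view {{ relation }} as (
        {{ sql }}
    );
{%- endmacro %}
```

Like a temporary table, a temporary view would be visible only within the session that created it, so parallel sessions could no longer clobber each other's `__dbt_tmp` objects.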
Temporary views?? On Snowflake?? Apparently so: https://community.snowflake.com/s/article/How-To-Use-A-Temporary-View-In-Snowflake (As noted in that article, the word …)
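For instance (object names made up; this mirrors the syntax from the linked article):

```sql
-- session 1: create a session-scoped view
create temporary view orders__dbt_tmp as
    select * from analytics.orders_staging;

select count(*) from orders__dbt_tmp;  -- works within this session

-- session 2 would instead get:
-- SQL compilation error: Object 'ORDERS__DBT_TMP' does not exist or not authorized.
```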
I wrote a bash script that triggers this behavior ✨ reliably ✨! I love an excuse to demo Bash's bizarre built-in parallelized process management capabilities 👩‍🔬

```bash
#!/bin/bash -x
# run job in the "background" where all such processes execute in parallel
dbt run -s orders &
# capture the process id for stopping it: this is to force the regression
job_one=$!
# let the job run until it hits the table creation step but before the drop table step; number depends on your machine
sleep 8
kill -STOP $job_one
# don't progress the script _too_ fast, again variable to your system
sleep 4
# let it spin
dbt run -s orders
# resume the original job; it'll fail some percentage of the time
kill -CONT $job_one
```

And with that, a fix is up!
@VersusFacit, true, you cannot run it like that in parallel. I have a multi-tenant stage setup (DB1, DB2, ...) and I am running commands like this in parallel:

```bash
dbt run -s order --target DB1
dbt run -s order --target DB2
```

dbt creates the temp table (and view) in myDB, and not in myDB1 or myDB2. It is a breaking change: the table used to be created as temporary, and now the view is not created as temporary.
Hey @margushein 👋 thanks for the comment. Let's see if I got everything you wanted me to address.
Just to be clear, the intent of my provided script is to document a way to trigger the underlying regression reliably, since this is a super nasty bug to catch organically. That way, I could be confident adding the fix.

Re: "parallel." 🤔 Correct me if I'm wrong: I believe you use that term to describe simultaneous dbt invocations across different source databases that target the same relation. Where my test case differs is in execution: I set two parallel executions of the same model in motion. Everything just happens to live in the same database! As it turns out, my use case, like yours, will force two processes to depend on the same `__dbt_tmp` object.

Here's another single-tenant parallelized reproduction:

```bash
for i in {1..50}
do
  dbt run -s orders & # & means send the command to the "background" where processes execute in parallel
done
```

My hunch is that single/multi-tenant architecture is not the root of the problem. You can trigger the behavior on either a single- or multi-tenant architecture. It's just extremely unlikely (and ill-advised) that someone would run into this in a single-tenant setup [I myself had to add `kill -STOP`/`-CONT` pauses to force the timing]. So, while I'm not running this "in parallel" database-wise, I am running my test cases "in parallel" process-wise. Right now, I'm confident that's sufficient for triggering this regression. For what it's worth, I did some pretty long-winded before-and-after testing of the fix.

To summarize: the regression is caused by two different processes depending on the same `__dbt_tmp` object. 🙏 Hence, I'm pretty sure I've covered your use case in my testing/solution. ❓ Does that make sense to you? Any lingering concerns?
@VersusFacit, all looks good! Checked the pull request also. Thank you!
Is this a regression in a recent version of dbt-core?
Current Behavior
We are using Prefect to run several instances of the same dbt-snowflake project with different variables in parallel. Almost randomly, the dbt task fails with this error:
```
SQL compilation error: Object 'XYZ__DBT_TMP' does not exist or not authorized.
```
Why?
The way dbt v <= 1.2 works is that it first creates a temporary table based on the dbt model, like:

```sql
create or replace temp table XYZ__dbt_tmp
```

and then merges this temporary table into the target table.
Temporary tables in Snowflake are confined to the specific session in which they are created. If session 1 creates a temp table XYZ, session 2 can't see temp table XYZ, so in theory infinitely many temp tables with the same name can coexist in parallel sessions without interfering with each other. When the session ends, the temporary table gets destroyed.
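To make the session scoping concrete (each block runs on a separate connection, i.e. a separate Snowflake session):

```sql
-- session 1
create or replace temp table XYZ__dbt_tmp as select 1 as id;
select * from XYZ__dbt_tmp;  -- works: returns 1 row

-- session 2
select * from XYZ__dbt_tmp;
-- SQL compilation error: Object 'XYZ__DBT_TMP' does not exist or not authorized.
```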
However, dbt v1.3 has replaced temporary tables with views. So it first creates a view like:

```sql
create or replace view XYZ__dbt_tmp
```

then merges the view into the target table, and finally drops the view like:

```sql
drop view if exists XYZ__dbt_tmp
```
Views ARE NOT confined to sessions, so several parallel sessions trying to create views with the same name interfere with each other. Concretely:

1. Session 1 does `create or replace view XYZ__dbt_tmp`.
2. Session 2 does `create or replace view XYZ__dbt_tmp`, silently replacing session 1's view.
3. Session 1 finishes its merge using the view created by session 2, then does `drop view if exists XYZ__dbt_tmp`.
4. When session 2 wants to merge the XYZ__dbt_tmp view, it doesn't exist anymore, hence the error `SQL compilation error: Object 'XYZ__DBT_TMP' does not exist or not authorized.`
We use session variables to limit sessions to specific datasets. This behavior also causes a data leak: as in step 3 above, a session can merge data from a view created by a different session, i.e. data belonging to a different dataset.
Expected/Previous Behavior
The expected behavior is for the views to behave the same way as temp tables (being confined to a session). For example, the view name could be suffixed with the session id to differentiate between the views created in different sessions.
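For illustration only, a rough sketch of that suffixing idea as a project macro. The macro name and wiring are hypothetical; `current_session()` is the Snowflake function that returns the session id:

```sql
{% macro session_suffixed_tmp_name(base_name) %}
    {# hypothetical helper: make the tmp relation name unique per Snowflake session #}
    {% if execute %}
        {% set session_id = run_query("select current_session()").columns[0].values()[0] %}
        {{ return(base_name ~ '__dbt_tmp_' ~ session_id) }}
    {% else %}
        {{ return(base_name ~ '__dbt_tmp') }}
    {% endif %}
{% endmacro %}
```

Any scheme along these lines just needs the generated name to be stable within a session and unique across sessions.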
Steps To Reproduce
This is not easy to reproduce, but if you trigger several runs of the same dbt project and the timing is right (the drop view in session 1 happens before the merge in session 2), you can recreate it.
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
snowflake
Additional Context
No response