[CT-1122] [CT-1116] [Bug] `adapter.get_relation` for relations in other databases [RA3] #179

jaklan · 2022-09-02T02:52:00Z

Is this a new bug in dbt-core?

I believe this is a new bug in dbt-core
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

adapter.get_relation seems to ignore database argument and use the target database anyway.

Expected Behavior

The right database is used.

Steps To Reproduce

I have a model, defined as below, which checks if it already exists in the same schema, but in another table. If it exists - it prints a log message and reads data from there:

{{ config({ "materialized":"table" }) }}

{%- set relation = adapter.get_relation(
    database="integrated",
    schema=this.schema,
    identifier=this.identifier) -%}

{% if relation is not none %}

{{ log(relation.database ~ '.' ~ relation.schema ~ '.' ~ 
       relation.identifier ~ " already exists", info=true) }}

select * from {{relation}}

{% else %}

... (e.g. creates an empty table or do nothing if exists)

{% endif %}

I have 2 databases: dev and integrated.
In dev I have that table already created, in integrated - not.
The database specified in the active target is dev.

Now, when running dbt run -m "<model_identifier>", I would expect it to do nothing in dev database. But, instead of that I get:

02:41:51  1 of 1 START table model some_schema.some_model ................... [RUN]
02:41:52  integrated.some_schema.some_model already exists
02:41:52  1 of 1 ERROR creating table model some_schema.some_model .......... [ERROR in 0.80s]

02:41:55  Completed with 1 error and 0 warnings:
02:41:55  
02:41:55  Database Error in model <model_identifier>
02:41:55    Schema some_schema does not exist in the database.

so adapter.get_relation claims to be able to find the table in the integrated database - although it doesn't exist there. And because it doesn't exist - the run fails when it tries to actually read the data from there.

PS that's only a dummy example to visualise the issue, please don't focus on the logic itself

Relevant log output

No response

Environment

- OS: macOS Monterey 12.5.1
- Python: 3.9.11
- dbt:
  
  Core:
  - installed: 1.2.1
  - latest:    1.2.1 - Up to date!

  Plugins:
  - redshift: 1.2.1 - Up to date!
  - postgres: 1.2.1 - Up to date!

Which database adapter are you using with dbt?

redshift

Additional Context

No response

The text was updated successfully, but these errors were encountered:

jaklan · 2022-09-02T03:28:48Z

Okay, I get some more info. I have added debug statement:

{%- set relation = adapter.get_relation(
    database="integrated",
    schema=this.schema,
    identifier=this.identifier) -%}

{{ debug() }}

{% if relation is not none %}

{{ log(relation.database ~ '.' ~ relation.schema ~ '.' ~ 
       relation.identifier ~ " already exists", info=true) }}

select * from {{relation}}

and then I got:

     20     context.vars['relation'] = l_0_relation
     21     context.exported_vars.add('relation')
---> 22     yield to_string(environment.call(context, (undefined(name='debug') if l_0_debug is missing else l_0_debug)))
     23     yield '\n'
     24     if (not t_1((undefined(name='relation') if l_0_relation is missing else l_0_relation))):

But now, what's interesting - the debug mode was entered twice: at the beginning of the run (so as I assume during compilation) and during execution.

During the first visit, I got:

ipdb> l_0_relation is None
True

which is correct. But - during the second visit (so the one which started just after 03:13:24 1 of 1 START table model some_schema.some_model ................... [RUN]) I got:

ipdb> l_0_relation
<RedshiftRelation "integrated"."some_schema"."some_model">

so the cause is hidden somewhere here.

jaklan · 2022-09-02T03:37:17Z

Looking at docs that behaviour actually makes sense, because it's related to the difference between the parse phase and the execution phase:
https://docs.getdbt.com/reference/dbt-jinja-functions/execute
and it also explains why l_0_relation was initially None - because adapter.get_relation wasn't triggered. But there's still an issue in the execution phase as it should be evaluated to None also then.

jtcohen6 · 2022-09-06T09:41:31Z

@jaklan Thanks for the thorough writeup, and for detailing your follow-up investigation! An important piece of this puzzle: This is a cross-database query on Redshift. I'm assuming that you're using RA3 nodes.

I have a sense of what's going on here. Could you check the logs that are run during the second visit (execution time)? I would expect dbt to be running a query along these lines, since you're asking it to perform a lookup of "integrated"."some_schema"."some_model", and the schema "integrated"."some_schema" is not already cached:

    select
      'integrated' as database,
      tablename as name,
      schemaname as schema,
      'table' as type
    from pg_tables
    where schemaname ilike 'some_schema'
    union all
    select
      'integrated' as database,
      viewname as name,
      schemaname as schema,
      'view' as type
    from pg_views
    where schemaname ilike 'some_schema'

See the issue? It's trying to access information about "integrated"."some_schema", but by querying pg_tables + pg_views in the current database (target.database). Instead, we'd need this to use a cross-database query, and the Redshift docs tell us that pg_catalog is not supported:

To get metadata across databases, use SVV_ALL* and SVV_REDSHIFT* metadata views. You can't use the three-part notation or external schemas to query cross-database metadata tables or views under information_schema and pg_catalog.

Proposed resolution

In cases where we're trying to access metadata on relations outside the current database, we need redshift__list_relations_without_caching to use svv_tables, instead of the postgres__ version
of the macro (pg_tables + pg_views).

I don't have a good sense of whether accessing that view is faster or slower than the pg_* views, or if there are any other gotchas. If it's faster or the same speed, we could switch to using svv_tables always. If it's slower, we should have conditional logic to use it only when needed (if schema_relation.database != target.database), otherwise pg_* views when available.

I'm going to mark this one a help_wanted, and move it over to dbt-redshift. We'd welcome community contribution for it.

Logistical note: We don't currently have RA3 nodes running in CI, since they're still much more expensive at the lowest end than 2nd-gen node types, but we should take another look at finally setting some up to test this sort of functionality.

jaklan · 2022-09-14T22:06:38Z

Hello @jtcohen6, thanks for the research and the detailed explanation! It definitely seems to be the direct reason and yes, we use RA3 nodes.

Such usage of adapter.get_relation is probably not very common in standard flows (but of course could be pretty misleading when happens - maybe we should at least add some warning as for now?), but it becomes problematic when you utilise defer - because then you might want to use database=this.database or database=ref(some_model).database, as you don't know beforehand in which database given model would be found - the target one or the "fallback" one. With that pitfall - unfortunately it's not possible to achieve, so you always refer to the target db, which can lead to very surprising results.

Regarding the implementation dilemmas - is there at least chance you decide on some recommended approach internally, so it would be more clear from the community perspective what MR would be accepted?

jtcohen6 · 2022-09-20T11:58:24Z

A warning is definitely better than the status quo — though I think the actual resolution to this bug may be quick here as well.

My instinct here is that pg_tables is faster than svv_tables when it's available. This is hinted at in Redshift docs, e.g.:

Amazon Redshift table names are stored in both PG_TABLES and STV_TBL_PERM; where possible, use PG_TABLES to return Amazon Redshift table names.

So I think the right move will be to reimplement redshift__list_relations_without_caching to use pg_* where possible, and svv_tables where necessary. Something like:

{% macro redshift__list_relations_without_caching(schema_relation) %}
  {% if schema_relation.database == target.database %}
    {{ return(postgres__list_relations_without_caching(schema_relation) }}
  {% else %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    select
      table_catalog as database,
      table_name as name,
      table_schema as schema,
      table_type as type
    from svv_tables
    where table_catalog ilike '{{ schema_relation.database }}'
      and table_schema ilike '{{ schema_relation.schema }}'
  {% endcall %}
  {{ return(load_result('list_relations_without_caching').table) }}
{% endmacro %}

You can actually give that a try by copy-pasting that macro code into your own project first — dbt will prefer your version over its own built-in one — to see if that resolves the bug.

jaklan · 2022-11-30T16:31:33Z

Hi @jtcohen6,

I've returned to the issue recently as this a key blocker for us to adopt a local development workflow, which is based on creating proper proxy views in dev database when upstream tables exist in the uat database. dev db is the one specified in the active target, so due to the issue - we can't verify the existence of tables in uat database.

There's also another solution for local development which was presented to us today by dbt Core Team:
https://github.com/LewisDavies/upstream-prod
but as far as I can see - it would be affected by the issue as well:
https://github.com/LewisDavies/upstream-prod/blob/43ede2a1913235c3db7ca8c94d89eadaf04fe127/macros/ref.sql#L36-L40

So, I have tested the custom postgres__list_relations_without_caching macro you proposed, but unfortunately it didn't work due to:

Encountered an error while running operation: Runtime Error
  maximum recursion depth exceeded while calling a Python object

I will try to debug that and find some solution in the upcoming days, but I would really appreciate prioritising that issue - especially when it directly affects approaches recommended by dbt consultants.

b-per · 2022-11-30T16:53:11Z

Hi @jaklan, I believe that the error is because of this

{% macro postgres__list_relations_without_caching(schema_relation) %}
  {% if schema_relation.database == target.database %}
    {{ return(postgres__list_relations_without_caching(schema_relation) }}
  {% else %}
...
{% endmacro %}

The function is calling itself recursively forever if the first if is True.

With a bit of copy/pasting from the original macro it could look like:

{% macro postgres__list_relations_without_caching(schema_relation) %}
  {% if schema_relation.database == target.database %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    select
      '{{ schema_relation.database }}' as database,
      tablename as name,
      schemaname as schema,
      'table' as type
    from pg_tables
    where schemaname ilike '{{ schema_relation.schema }}'
    union all
    select
      '{{ schema_relation.database }}' as database,
      viewname as name,
      schemaname as schema,
      'view' as type
    from pg_views
    where schemaname ilike '{{ schema_relation.schema }}'
  {% endcall %}
  {{ return(load_result('list_relations_without_caching').table) }}
  {% else %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    select
      table_catalog as database,
      table_name as name,
      table_schema as schema,
      table_type as type
    from svv_tables
    where table_catalog ilike '{{ schema_relation.database }}'
      and table_schema ilike '{{ schema_relation.schema }}'
  {% endcall %}
  {{ return(load_result('list_relations_without_caching').table) }}
{% endmacro %}

jtcohen6 · 2022-11-30T16:55:15Z

Sorry! I meant to name the macro in my example above redshift__list_relations_without_caching. I've just edited it to reflect that change.

jaklan · 2022-11-30T17:56:04Z

Thanks @jtcohen6 and @b-per, good catch! I tried both variants you pasted and they don't trigger the error, but they return None for tables which for sure exist in the target db - I will have a further look at that tomorrow.

jaklan · 2022-12-01T05:04:29Z

Back to you guys - I was able to make it work by using svv_redshift_tables instead of svv_tables and adjusting the column names:

{% macro redshift__list_relations_without_caching(schema_relation) %}
  {% if schema_relation.database == target.database %}
    {{ return(postgres__list_relations_without_caching(schema_relation)) }}
  {% else %}
    {% call statement('list_relations_without_caching', fetch_result=True) -%}
      select
        database_name as database,
        table_name as name,
        schema_name as schema,
        table_type as type
      from svv_redshift_tables
      where database_name ilike '{{ schema_relation.database }}'
        and schema_name ilike '{{ schema_relation.schema }}'
    {% endcall %}
    {{ return(load_result('list_relations_without_caching').table) }}
  {% endif %} 
{% endmacro %}

svv_tables only returned schemas and tables for target database, that's why I got None previously. But generally there's still an issue with performance, as querying svv_redshift_tables seems to be pretty slow with the current query.

There's also svv_all_tables which is a superset of svv_redshift_tables and works as well, but it's obviously even slower.

b-per · 2022-12-01T12:28:41Z

Could you share a ballpark figure of how long querying svv_redshift_tables takes in your case?

jaklan · 2022-12-05T17:16:13Z

@b-per

select
table_catalog as database,
table_name as name,
table_schema as schema,
table_type as type
from SVV_TABLES
where table_catalog ilike 'dev'
and table_schema ilike 'some_schema'

takes 1 row(s) fetched - 119ms

select
database_name as database,
table_name as name,
schema_name as schema,
table_type as type
from SVV_REDSHIFT_TABLES
where database_name ilike 'dev'
and schema_name ilike 'some_schema'

takes 1 row(s) fetched - 5.517s

For the context:

select distinct
table_catalog as database
from SVV_TABLES

gives 1 row(s) fetched - 2.621s and

select count(*)
from SVV_TABLES

returns 5179

select distinct
database_name as database
from SVV_REDSHIFT_TABLES

gives 38 row(s) fetched - 5.404s and

select count(*)
from SVV_REDSHIFT_TABLES

returns 28263

github-actions · 2023-06-04T02:15:32Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

jaklan · 2023-06-05T12:56:03Z

Bump to un-stale

dataders · 2024-03-25T15:51:04Z

re-opening bc this was not fixed by #428 only addressed the redshift__list_relations_without_caching macro, not redshift_get_columns_in_relation

jaklan added bug Something isn't working triage labels Sep 2, 2022

github-actions bot changed the title ~~[Bug] adapter.get_relation ignores database argument~~ [CT-1116] [Bug] adapter.get_relation ignores database argument Sep 2, 2022

jtcohen6 self-assigned this Sep 6, 2022

jtcohen6 removed the triage label Sep 6, 2022

jtcohen6 changed the title ~~[CT-1116] [Bug] adapter.get_relation ignores database argument~~ [CT-1116] [Bug] adapter.get_relation for relations in other databases [RA3] Sep 6, 2022

jtcohen6 removed their assignment Sep 6, 2022

jtcohen6 transferred this issue from dbt-labs/dbt-core Sep 6, 2022

jtcohen6 added the help_wanted Extra attention is needed label Sep 6, 2022

github-actions bot changed the title ~~[CT-1116] [Bug] adapter.get_relation for relations in other databases [RA3]~~ [CT-1122] [CT-1116] [Bug] adapter.get_relation for relations in other databases [RA3] Sep 6, 2022

jtcohen6 mentioned this issue Nov 14, 2022

[CT-1482] [Feature] Add support for data share objects #217

Closed

3 tasks

jiezhen-chen mentioned this issue Apr 7, 2023

add support for datashared objects and external tables for redshift l… #399

Closed

6 tasks

jiezhen-chen mentioned this issue May 4, 2023

Override list_relations_without_caching #428

Merged

6 tasks

github-actions bot added the Stale label Jun 4, 2023

github-actions bot removed the Stale label Jun 6, 2023

mikealfare closed this as completed in #428 Jul 13, 2023

This was referenced Mar 21, 2024

generate_base_model() doesn't work across databases on Redshift dbt-labs/dbt-codegen#167

Closed

[Bug] adapter.get_columns_in_relation does not work cross database #736

Open

dataders reopened this Mar 25, 2024

dataders mentioned this issue Mar 25, 2024

enable crossdb column metadata with ra3 nodes #738

Closed

4 tasks

dataders added the ra3_node issues relating to ra3 node support label Mar 26, 2024

dataders mentioned this issue Mar 27, 2024

[Feature] Datasharing: cross-database support #742

Open

dataders mentioned this issue Apr 19, 2024

[Feature] use redshift_connector for metadata wherever possible #776

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-1122] [CT-1116] [Bug] `adapter.get_relation` for relations in other databases [RA3] #179

[CT-1122] [CT-1116] [Bug] `adapter.get_relation` for relations in other databases [RA3] #179

jaklan commented Sep 2, 2022 •

edited

Loading

jaklan commented Sep 2, 2022 •

edited

Loading

jaklan commented Sep 2, 2022 •

edited

Loading

jtcohen6 commented Sep 6, 2022 •

edited

Loading

jaklan commented Sep 14, 2022

jtcohen6 commented Sep 20, 2022 •

edited

Loading

jaklan commented Nov 30, 2022 •

edited

Loading

b-per commented Nov 30, 2022

jtcohen6 commented Nov 30, 2022

jaklan commented Nov 30, 2022

jaklan commented Dec 1, 2022 •

edited

Loading

b-per commented Dec 1, 2022

jaklan commented Dec 5, 2022 •

edited

Loading

github-actions bot commented Jun 4, 2023

jaklan commented Jun 5, 2023

dataders commented Mar 25, 2024

[CT-1122] [CT-1116] [Bug] adapter.get_relation for relations in other databases [RA3] #179

[CT-1122] [CT-1116] [Bug] adapter.get_relation for relations in other databases [RA3] #179

Comments

jaklan commented Sep 2, 2022 • edited Loading

Is this a new bug in dbt-core?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

Which database adapter are you using with dbt?

Additional Context

jaklan commented Sep 2, 2022 • edited Loading

jaklan commented Sep 2, 2022 • edited Loading

jtcohen6 commented Sep 6, 2022 • edited Loading

Proposed resolution

jaklan commented Sep 14, 2022

jtcohen6 commented Sep 20, 2022 • edited Loading

jaklan commented Nov 30, 2022 • edited Loading

b-per commented Nov 30, 2022

jtcohen6 commented Nov 30, 2022

jaklan commented Nov 30, 2022

jaklan commented Dec 1, 2022 • edited Loading

b-per commented Dec 1, 2022

jaklan commented Dec 5, 2022 • edited Loading

github-actions bot commented Jun 4, 2023

jaklan commented Jun 5, 2023

dataders commented Mar 25, 2024

[CT-1122] [CT-1116] [Bug] `adapter.get_relation` for relations in other databases [RA3] #179

[CT-1122] [CT-1116] [Bug] `adapter.get_relation` for relations in other databases [RA3] #179

jaklan commented Sep 2, 2022 •

edited

Loading

jaklan commented Sep 2, 2022 •

edited

Loading

jaklan commented Sep 2, 2022 •

edited

Loading

jtcohen6 commented Sep 6, 2022 •

edited

Loading

jtcohen6 commented Sep 20, 2022 •

edited

Loading

jaklan commented Nov 30, 2022 •

edited

Loading

jaklan commented Dec 1, 2022 •

edited

Loading

jaklan commented Dec 5, 2022 •

edited

Loading