
Add Snowflake-specific implementation of min_or_max #32

Open · wylbee wants to merge 4 commits into main

Conversation

@wylbee commented Apr 7, 2023

Per #26, the generic implementation of the _min_or_max macro does not work correctly on Snowflake. The root cause is that dbt's safe_cast resolves to Snowflake's try_cast, which has limitations that make it unsuitable for this purpose.
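For context, the limitation in question (an assumption based on Snowflake's documented behavior, not text quoted from #26) is that try_cast only accepts a string expression as its source, so a compiled expression along these lines fails on Snowflake:

    -- Illustrative only: try_cast requires a varchar source expression in Snowflake,
    -- so applying it to a non-string column raises a SQL compilation error.
    select try_cast(stream.activity_occurrence as varchar) as activity_occurrence_str
    from analytics.dev_wbrown.account_stream as stream;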

Given that this macro is intended to provide the equivalent of Snowflake's min_by/max_by functions, this PR does two things:

  • Create a version of the macro that uses Snowflake's native min_by/max_by functions to achieve the same result.
  • Add conditional logic that uses the Snowflake-specific macro when running on Snowflake and otherwise falls back to the current production version (a rough sketch of the resulting shape follows).
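A minimal sketch of that shape, assuming the adapter.dispatch pattern adopted later in this review and using a placeholder ts_col for however the macro resolves the ordering timestamp (illustrative only, not the exact diff):

    {# Illustrative sketch; not the exact code in this PR. #}
    {% macro _min_or_max(min_or_max, qualified_col) -%}
        {{ return(adapter.dispatch('_min_or_max', 'dbt_activity_schema')(min_or_max, qualified_col)) }}
    {%- endmacro %}

    {% macro snowflake___min_or_max(min_or_max, qualified_col) -%}
        {# Snowflake ships native min_by/max_by, so no prepend-and-trim workaround is needed.
           ts_col stands in for the ordering timestamp column (assumption). #}
        {{ min_or_max }}_by({{ qualified_col }}, ts_col)
    {%- endmacro %}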

Test 1 - Integration Suite

[screenshot: integration test suite passing on Snowflake]

Test 2 - Prod project

dbt_project.yml vars

  dbt_activity_schema:
      included_columns:
        - activity_id
        - customer
        - ts
        - activity
        - anonymous_customer_id
        - feature_json 
        - link
        - revenue_impact
        - activity_occurrence
        - activity_repeated_at

query

with

    stream_query as (
        {{
            dbt_activity_schema.dataset(
                ref("account_stream"),
                dbt_activity_schema.activity(
                    dbt_activity_schema.all_ever(), "originates_loan"
                ),
                [
                    dbt_activity_schema.activity(
                        dbt_activity_schema.last_before(), "updates_autopay"
                    )
                ],
            )
        }}

    ),

    final as (

        select * from stream_query

    )

select *
from final

compiled query

    create or replace transient table analytics.dev_wbrown.ee_loan_at_origination as (

    with

    stream_query as (

        with

        filter_activity_stream_using_primary_activity as (
            select
                stream.activity_id,
                stream.customer,
                stream.ts,
                stream.activity,
                stream.anonymous_customer_id,
                stream.feature_json,
                stream.link,
                stream.revenue_impact,
                stream.activity_occurrence,
                stream.activity_repeated_at

            from analytics.dev_wbrown.account_stream as stream

            where stream.activity = 'originates_loan'
                and (true)
        ),

        append_and_aggregate__1__last_before as (
            select

                -- Primary Activity Columns
                stream.activity_id,
                stream.customer,
                stream.ts,
                stream.activity,
                stream.anonymous_customer_id,
                stream.feature_json,
                stream.link,
                stream.revenue_impact,
                stream.activity_occurrence,
                stream.activity_repeated_at,

                max_by(appended.activity_id, appended.ts) as last_before_updates_autopay_activity_id,
                max_by(appended.customer, appended.ts) as last_before_updates_autopay_customer,
                max_by(appended.ts, appended.ts) as last_before_updates_autopay_ts,
                max_by(appended.activity, appended.ts) as last_before_updates_autopay_activity,
                max_by(appended.anonymous_customer_id, appended.ts) as last_before_updates_autopay_anonymous_customer_id,
                max_by(appended.feature_json, appended.ts) as last_before_updates_autopay_feature_json,
                max_by(appended.link, appended.ts) as last_before_updates_autopay_link,
                max_by(appended.revenue_impact, appended.ts) as last_before_updates_autopay_revenue_impact,
                max_by(appended.activity_occurrence, appended.ts) as last_before_updates_autopay_activity_occurrence,
                max_by(appended.activity_repeated_at, appended.ts) as last_before_updates_autopay_activity_repeated_at

            from filter_activity_stream_using_primary_activity as stream

            left join analytics.dev_wbrown.account_stream as appended
                on (
                    -- Join on Customer UUID Column
                    appended.customer = stream.customer

                    -- Join the Correct Activity
                    and appended.activity = 'updates_autopay'

                    -- Relationship Specific Join Conditions
                    and (
                        appended.ts <= coalesce(stream.ts, '1900-01-01'::timestamp)
                    )

                    -- Additional Join Condition
                    and ( true )
                )

            group by
                stream.activity_id,
                stream.customer,
                stream.ts,
                stream.activity,
                stream.anonymous_customer_id,
                stream.feature_json,
                stream.link,
                stream.revenue_impact,
                stream.activity_occurrence,
                stream.activity_repeated_at
        ),

        rejoin_aggregated_activities as (
            select
                stream.activity_id,
                stream.customer,
                stream.ts,
                stream.activity,
                stream.anonymous_customer_id,
                stream.feature_json,
                stream.link,
                stream.revenue_impact,
                stream.activity_occurrence,
                stream.activity_repeated_at,

                append_and_aggregate__1__last_before.last_before_updates_autopay_activity_id,
                append_and_aggregate__1__last_before.last_before_updates_autopay_customer,
                append_and_aggregate__1__last_before.last_before_updates_autopay_ts,
                append_and_aggregate__1__last_before.last_before_updates_autopay_activity,
                append_and_aggregate__1__last_before.last_before_updates_autopay_anonymous_customer_id,
                append_and_aggregate__1__last_before.last_before_updates_autopay_feature_json,
                append_and_aggregate__1__last_before.last_before_updates_autopay_link,
                append_and_aggregate__1__last_before.last_before_updates_autopay_revenue_impact,
                append_and_aggregate__1__last_before.last_before_updates_autopay_activity_occurrence,
                append_and_aggregate__1__last_before.last_before_updates_autopay_activity_repeated_at

            from filter_activity_stream_using_primary_activity as stream

            left join append_and_aggregate__1__last_before
                on append_and_aggregate__1__last_before.activity_id = stream.activity_id
        )

        select * from rejoin_aggregated_activities

    ),

    final as (

        select * from stream_query

    )

    select *
    from final
    );

[screenshot: successful run output from the prod project]

@bcodell (Collaborator) commented Apr 7, 2023

Hey @brown5628 - thanks for opening this! For the sake of auditing, would you mind installing your fork in your Snowflake-based project, running a query that uses the dataset macro to validate that the query compiles and runs successfully, and then pasting a screenshot of the dbt CLI logs or the compiled query here for reference? Sorry for the tedious ask, but since the project's CI pipeline only runs on DuckDB and this PR is a Snowflake-specific fix, it'd be nice to have evidence of the code change working as expected.

@wylbee wylbee marked this pull request as ready for review April 14, 2023 19:55
@wylbee (Author) commented Apr 14, 2023

@bcodell Should be ready for review. Thanks for your patience on this one with the slow turnaround time. Quick callouts:

  • I hacked together getting the integration test suite running on Snowflake; the screenshot above shows full passes.
  • I pasted the artifacts you requested from my current implementation, showing the input, the successful output, and the compiled SQL.

Two questions from me:

  1. Does the project have a style guide? I belatedly realized that my IDE reformatted everything using sqlfmt. Happy to turn that off and restate a cleaner diff if that's preferred; just let me know.
  2. Is there utility in maintaining a parallel set of integration tests for non-DuckDB connectors, even if they are not run automatically? If so, I can open a new PR with a cleaned-up version of what I did to get the tests running on Snowflake, kept in a separate folder from the existing integration tests. The changes needed were:
    • In the models, change every instance of
        json_extract({{ dbt_activity_schema.primary() }}.feature_json, 'type') = json_extract({{ dbt_activity_schema.appended() }}.feature_json, 'type')
      to
        parse_json({{ dbt_activity_schema.primary() }}.feature_json):"type" = parse_json({{ dbt_activity_schema.appended() }}.feature_json):"type"
    • Point to the Snowflake profile
    • Remove the model configs in dbt_project.yml

@tnightengale (Owner) left a comment:

Thank you for tackling this! I think we just need to make a small tweak to leverage dbt dispatch functionality. See the comment in the code.

Otherwise, if you wouldn't mind turning off sqlfmt so the diff is cleaner!

100% we should implement a style guide in the near future. Until then, let's refrain from style changes outside the PR scope.

Awesome work though!

@@ -1,56 +1,65 @@
{% macro _min_or_max(min_or_max, qualified_col) %}

{% if target.type == "snowflake" %}
@tnightengale (Owner) commented:

Great work! But I think we ought to change this to use dbt dispatch.

I think we could just abstract this by doing {% set aggregation = get_db_aggregation() %}. And have that get_db_aggregation() macro contain default__ and snowflake__ dispatches.

Sound reasonable?
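A hypothetical sketch of that suggestion (the macro name and signature are assumptions here, not code from the PR):

    {# Hypothetical helper illustrating the suggestion above. #}
    {% macro get_db_aggregation(min_or_max) -%}
        {{ return(adapter.dispatch('get_db_aggregation', 'dbt_activity_schema')(min_or_max)) }}
    {%- endmacro %}

    {% macro default__get_db_aggregation(min_or_max) -%}
        {{ return(min_or_max) }}  {# plain min / max #}
    {%- endmacro %}

    {% macro snowflake__get_db_aggregation(min_or_max) -%}
        {{ return(min_or_max ~ "_by") }}  {# native min_by / max_by #}
    {%- endmacro %}

The calling macro would then do {% set aggregation = get_db_aggregation(min_or_max) %} and emit {{ aggregation }}(...) wherever the aggregate is built.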

@wylbee (Author) replied:

Yep, absolutely! Wasn't aware of that functionality but seems like a perfect fit here. Will make the changes & tee this up!

@wylbee (Author) commented Apr 21, 2023

Hi @tnightengale:

  • Restated the file without sqlfmt so the diff should be cleaner.
  • Struggling a bit with how to implement the dbt dispatch logic. I see how it can be used to replace the if/then logic I was using, and have done so, but I'm not following how to apply this at the {% set aggregation = get_db_aggregation() %} level, since there are other changes to the Snowflake-specific logic. Would you be able to spell out a bit further what you are looking for there? Happy to take another pass, but I recognize that I'm spinning my wheels.

Validation

  1. No row differences between dev and prod pre-change
    [screenshot: dev vs prod comparison]
  2. No row differences between dev and prod post-change
    [screenshot: dev vs prod comparison]

@tnightengale (Owner) left a comment:

So, the _min_or_max macro actually uses a trick to prepend the ts, then aggregate, then trim off the prepended ts column, so that the aggregation abstraction can work the same across sum and count.

So I think that logic needs to be replicated for snowflake as well, as far as I understand.

I don't think this will work as is?
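For reference, the prepend-and-trim pattern described above works roughly like this (an illustrative sketch; the column names and the 23-character prefix width are assumptions, not the project's exact SQL):

    -- Prefix the value with a fixed-width, sortable timestamp string, take max(),
    -- then strip the prefix back off so only the value from the latest row remains.
    select
        substr(
            max(substr(cast(appended.ts as varchar), 1, 23) || appended.link),
            24
        ) as last_before_link
    from account_stream as appended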

{{ return(adapter.dispatch('_min_or_max','dbt_activity_schema')(min_or_max, qualified_col)) }}
{%- endmacro -%}

{% macro default___min_or_max(min_or_max, qualified_col) -%}

{% set aggregation = "min" if min_or_max == "min" else "max" %}
@tnightengale (Owner) commented:

I think this assignment should be via a dispatched macro. And then we can call that aggregation with {{ }} in the relevant place.

@wylbee (Author) commented Apr 26, 2023

@tnightengale Apologies for the back and forth. I think I'm missing something here that would help me address this feedback. Three questions for you:

  1. I follow how the default _min_or_max macro works. Per the discussion in TRY_CAST SQL compilation error #26, this code uses built-in Snowflake functionality to achieve the same end rather than replicating the prepend approach, and in my dbt project and the Snowflake version of the integration tests, the min_by/max_by version produces the expected results. Could you explain further what you mean by "I don't think this will work as is"?
  2. For "So I think that logic needs to be replicated for Snowflake as well, as far as I understand": are you saying that you would prefer a Snowflake version of the prepend logic rather than the native Snowflake approach used here, or something else?
  3. I'm not following the comment on making the assignment via a dispatched macro. The version in this PR is essentially a carbon copy of the example provided in the dbt docs for this functionality. Could you explain more about what you are looking for, or point me to an example on which I can base the refactoring?

Appreciate the feedback to give me what I need to get this all squared away!
