From 425642766d39827b8cce73a2034834f027970d55 Mon Sep 17 00:00:00 2001
From: Joel Labes
Date: Wed, 20 Mar 2024 20:23:49 +1300
Subject: [PATCH] Cortex additional sample code (#5125)

## What are you changing in this pull request and why?

Adding sample code for devblog

---
 website/blog/2024-02-29-cortex-slack.md | 144 +++++++++++++++++++++++-
 1 file changed, 142 insertions(+), 2 deletions(-)

diff --git a/website/blog/2024-02-29-cortex-slack.md b/website/blog/2024-02-29-cortex-slack.md
index 90f954f22d5..188d26628c9 100644
--- a/website/blog/2024-02-29-cortex-slack.md
+++ b/website/blog/2024-02-29-cortex-slack.md
@@ -68,14 +68,154 @@ You probably don't have the exact same use case as I do, but you can imagine a w

- A mobile app developer might pull in app store reviews for sentiment analysis.
- By calculating the vector embeddings for text, deduplicating similar but nonidentical text becomes more tractable (there's a sketch of this idea at the end of the post).

Here's an extract of some of the code, using the [cortex.complete() function](https://docs.snowflake.com/sql-reference/functions/complete-snowflake-cortex) - notice that the whole thing feels just like normal SQL, because it is!

```sql
select trim(
    snowflake.cortex.complete(
        'llama2-70b-chat',
        concat(
            'Write a short, two sentence summary of this Slack thread. Focus on issues raised. Be brief. ',
            text_to_summarize,
            '. The users involved are: ',
            participant_metadata.participant_users::text,
            ''
        )
    )
) as thread_summary,
```
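The upstream model that builds `text_to_summarize` isn't shown in the post. Purely as illustration, here's a minimal sketch of one way such a column could be assembled, collapsing each thread into a single ordered string; the `stg_slack_messages` relation and all of its column names are hypothetical:

```sql
-- hypothetical sketch only: collapse each Slack thread into one string,
-- oldest message first, so the LLM sees the conversation in order
select
    thread_key,
    listagg(user_name || ' said: ' || text, '\n')
        within group (order by sent_at) as text_to_summarize
from {{ ref('stg_slack_messages') }} -- hypothetical staging model
group by thread_key
```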
## Tips for building LLM-powered dbt models

- **Always build incrementally.** Anyone who's interacted with any LLM-powered tool knows that it can take some time to get results back from a request, and that the results can vary from one invocation to another. For speed, cost, and consistency reasons, I implemented both models incrementally, even though the tables are tiny in terms of row count. I also added the [full_refresh: false](https://docs.getdbt.com/reference/resource-configs/full_refresh) config to protect against other full refreshes we run to capture late-arriving facts.
- **Beware of token limits.** Requests that contain [too many tokens](https://docs.snowflake.com/LIMITEDACCESS/cortex-functions#model-restrictions) are truncated, which can lead to unexpected results if the cutoff point is halfway through a message. In future I would first try the llama2-70b model (~4k token limit), and for unsuccessful rows make a second pass using the mistral-7b model (~32k token limit); a sketch of this two-pass approach appears after this list. As with many aspects of LLM-powered workflows, we expect token limits to increase substantially in the near term.
- **Orchestrate defensively, for now.** Because of the above considerations, I've got these steps running in their own dbt Cloud job, triggered by the successful completion of our main project job. I don't want the data team to be freaked out by a failing production run due to my experiments. We use [YAML selectors](/reference/node-selection/yaml-selectors) to define what gets run in our default job; I created a new selector for these models and then added that selector to the default job's exclusion list. Once this becomes more stable, I'll fold it into our normal job.
- **Iterate on your prompt.** In the same way that you gradually iterate on a SQL query, you have to tweak your prompt frequently in development to ensure you're getting the expected results. In general, I started with the shortest command I thought could work and adjusted it based on the results I was seeing. One slightly disappointing part of prompt engineering: I can spend an afternoon working on a problem and, at the end of it, only have a single line of code to check into a commit.
- **Remember that your results are non-deterministic.** For someone who loves to talk about idempotency, having a model whose results vary based on the vibes of some rocks we tricked into dreaming is a bit weird, and requires a bit more defensive coding than you may be used to. For example, one of the prompts I use is classification-focused (identifying the discussion's product area), and normally the result is just the name of that product. But sometimes it will return a little spiel explaining its thinking, so I need to explicitly extract that value from the response instead of unthinkingly accepting whatever I get back. Defining the valid options in a Jinja variable has helped keep them in sync: I can pass them into the prompt and then reuse the same list when extracting the correct answer.

```sql
-- a cut-down list of segments for the sake of readability
{% set segments = ['Warehouse configuration', 'dbt Cloud IDE', 'dbt Core', 'SQL', 'dbt Orchestration', 'dbt Explorer', 'Unknown'] %}

select trim(
    snowflake.cortex.complete(
        'llama2-70b-chat',
        concat(
            'Identify the dbt product segment that this message relates to, out of [{{ segments | join("|") }}]. Your response should be only the segment with no explanation. ',
            text,
            ''
        )
    )
) as product_segment_raw,

-- reusing the segments Jinja variable here
coalesce(regexp_substr(product_segment_raw, '{{ segments | join("|") }}'), 'Unknown') as product_segment
```
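To make that two-pass idea concrete: truncation doesn't raise an error, so rather than detecting failures after the fact, one approximation is to estimate token counts up front (roughly four characters per token) and pick the model accordingly. This is a minimal sketch, not code from the project; the `int_slack_threads` relation, its columns, and the threshold are all hypothetical:

```sql
select
    thread_key,
    trim(
        snowflake.cortex.complete(
            -- stay comfortably inside llama2-70b-chat's ~4k-token window,
            -- otherwise fall back to mistral-7b's ~32k-token window
            case
                when length(text_to_summarize) <= 3500 * 4 then 'llama2-70b-chat'
                else 'mistral-7b'
            end,
            concat(
                'Write a short, two sentence summary of this Slack thread. ',
                text_to_summarize
            )
        )
    ) as thread_summary
from {{ ref('int_slack_threads') }} -- hypothetical upstream model
```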
## Share your experiences

If you're doing anything like this in your work or a side project, I'd love to hear about it in the comment section on Discourse or in the machine-learning-general channel in Slack.

## Appendix: An example complete model

Here's the full model that I'm running to create the overall rollup messages that get posted to Slack, built on top of the row-by-row summaries created in an earlier model:

```sql
{{
    config(
        materialized='incremental',
        unique_key='unique_key',
        full_refresh=false
    )
-}}

{#
    This partition_by list exists to DRY up the columns that are used in different parts of the query.
    The SQL is used in the partition by clauses of the window function aggregates, and the column
    names are used (in conjunction with the SQL) to select the relevant columns out in the final model.
    They could be written out manually, but that creates a lot of places to update when changing from
    day to week truncation, for example.

    Side note: I am still not thrilled with this approach, and would be happy to hear about alternatives!
#}
{%- set partition_by = [
    {'column': 'summary_period', 'sql': 'date_trunc(day, sent_at)'},
    {'column': 'product_segment', 'sql': 'lower(product_segment)'},
    {'column': 'is_further_attention_needed', 'sql': 'is_further_attention_needed'},
] -%}

{% set partition_by_sqls = [] -%}
{% set partition_by_columns = [] -%}

{% for p in partition_by -%}
    {% do partition_by_sqls.append(p.sql) -%}
    {% do partition_by_columns.append(p.column) -%}
{% endfor -%}

with

summaries as (

    select * from {{ ref('fct_slack_thread_llm_summaries') }}
    where not has_townie_participant

),

aggregated as (

    select distinct
        {# Using the columns defined above #}
        {% for p in partition_by -%}
        {{ p.sql }} as {{ p.column }},
        {% endfor -%}

        -- This creates a JSON array, where each element is one thread + its permalink.
        -- Each array is broken down by the partition_by columns defined above, so there's
        -- one summary per time period, product, etc.
        array_agg(
            object_construct(
                'permalink', thread_permalink,
                'thread', thread_summary
            )
        ) over (partition by {{ partition_by_sqls | join(', ') }}) as agg_threads,
        count(*) over (partition by {{ partition_by_sqls | join(', ') }}) as num_records,

        -- The partition columns are the grain of the table, and can be used to create
        -- a unique key for incremental purposes
        {{ dbt_utils.generate_surrogate_key(partition_by_columns) }} as unique_key
    from summaries
    {% if is_incremental() %}
    where unique_key not in (select this.unique_key from {{ this }} as this)
    {% endif %}

),

summarised as (

    select
        *,
        trim(snowflake.cortex.complete(
            'llama2-70b-chat',
            concat(
                'In a few bullets, describe the key takeaways from these threads. For each object in the array, summarise the `thread` field, then provide the Slack permalink URL from the `permalink` field for that element in markdown format at the end of each summary. Do not repeat my request back to me in your response.',
                agg_threads::text
            )
        )) as overall_summary
    from aggregated

),

final as (

    select
        * exclude overall_summary,
        -- The LLM loves to say something like "Sure, here's your summary:" despite
        -- my best efforts, so this strips that line out
        regexp_replace(
            overall_summary, '(^Sure.+:\n*)', ''
        ) as overall_summary
    from summarised

)

select * from final
```
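## Appendix: An embeddings deduplication sketch

Finally, to make the embeddings idea from the top of the post concrete, here's a minimal sketch (not code from the original project). It assumes your account has access to Snowflake's `embed_text_768` Cortex function and the `vector_cosine_similarity` function; the `stg_slack_messages` relation, its columns, and the 0.95 threshold are all hypothetical:

```sql
with embedded as (

    select
        message_id,
        snowflake.cortex.embed_text_768('e5-base-v2', text) as embedding
    from {{ ref('stg_slack_messages') }} -- hypothetical staging model

)

-- self-join to flag pairs of messages that say nearly the same thing.
-- this compares every pair of rows, so it's only sensible for small tables
select
    a.message_id,
    b.message_id as near_duplicate_of,
    vector_cosine_similarity(a.embedding, b.embedding) as similarity
from embedded as a
inner join embedded as b
    on a.message_id < b.message_id
where vector_cosine_similarity(a.embedding, b.embedding) >= 0.95
```

From there, you could keep just one message per near-duplicate pair before passing threads to `complete()`, which also helps with the token limits discussed above.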