# Cortex additional sample code (#5125)
## What are you changing in this pull request and why?
Adding sample code for devblog
joellabes authored Mar 20, 2024 · commit 4256427 (parent 976a908)
1 changed file: website/blog/2024-02-29-cortex-slack.md (142 additions, 2 deletions)
You probably don't have the exact same use case as I do, but you can imagine a world where:
- A mobile app developer might pull in app store reviews for sentiment analysis
- By calculating vector embeddings for text, deduplicating similar but nonidentical text becomes more tractable
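
That last idea can be sketched with Snowflake's embedding and vector functions. This is not part of the pipeline described in this post: `EMBED_TEXT_768` and `VECTOR_COSINE_SIMILARITY` are the current Snowflake function names, and the `reviews` table, `id` column, and 0.95 threshold are illustrative assumptions.

```sql
-- Flag near-duplicate rows by comparing embeddings pairwise.
-- "reviews" and the 0.95 similarity threshold are illustrative assumptions.
with embedded as (

    select
        id,
        text,
        snowflake.cortex.embed_text_768('e5-base-v2', text) as embedding
    from reviews

)

select
    a.id,
    b.id as near_duplicate_id,
    vector_cosine_similarity(a.embedding, b.embedding) as similarity
from embedded as a
join embedded as b
    on a.id < b.id  -- compare each unordered pair once
where vector_cosine_similarity(a.embedding, b.embedding) > 0.95
```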

Here's an extract of some of the code, using the [cortex.complete() function](https://docs.snowflake.com/sql-reference/functions/complete-snowflake-cortex) - notice that the whole thing feels just like normal SQL, because it is!

```sql
select trim(
snowflake.cortex.complete(
'llama2-70b-chat',
concat(
'Write a short, two sentence summary of this Slack thread. Focus on issues raised. Be brief. <thread>',
text_to_summarize,
'</thread>. The users involved are: <users>',
participant_metadata.participant_users::text,
'</users>'
)
)
) as thread_summary,
```
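
Requests that exceed a model's token limit are truncated server-side, which can cut a message off mid-sentence (more on this in the tips below). One crude client-side guard, sketched here, is to cap the text yourself before building the prompt; the 12,000-character cap is an illustrative assumption, not a measured token count:

```sql
select trim(
    snowflake.cortex.complete(
        'llama2-70b-chat',
        concat(
            'Write a short, two sentence summary of this Slack thread. Focus on issues raised. Be brief. <thread>',
            -- crude guard: cap the input so the request stays under the model's
            -- token limit; 12000 characters is an illustrative stand-in
            left(text_to_summarize, 12000),
            '</thread>'
        )
    )
) as thread_summary,
```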

## Tips for building LLM-powered dbt models

- **Always build incrementally.** Anyone who's interacted with any LLM-powered tool knows that it can take some time to get results back from a request, and that the results can vary from one invocation to another. For speed, cost and consistency reasons, I implemented both models incrementally even though in terms of row count the tables are tiny. I also added the [full_refresh: false](https://docs.getdbt.com/reference/resource-configs/full_refresh) config to protect against other full refreshes we run to capture late-arriving facts.
- **Beware of token limits.** Requests that contain [too many tokens](https://docs.snowflake.com/LIMITEDACCESS/cortex-functions#model-restrictions) are truncated, which can lead to unexpected results if the cutoff point is halfway through a message. In the future, I would first try the llama2-70b model (~4k token limit), and for unsuccessful rows make a second pass using the mistral-7b model (~32k token limit). Like many aspects of LLM-powered workflows, we expect token limits to increase substantially in the near term.
- **Orchestrate defensively, for now**. Because of the above considerations, I've got these steps running in their own dbt Cloud job, triggered by the successful completion of our main project job. I don't want the data team to be freaked out by a failing production run due to my experiments. We use [YAML selectors](/reference/node-selection/yaml-selectors) to define what gets run in our default job; I created a new selector for these models and then added that selector to the default job's exclusion list. Once this becomes more stable, I'll fold it into our normal job.
- **Iterate on your prompt.** In the same way as you gradually iterate on a SQL query, you have to tweak your prompt frequently in development to ensure you're getting the expected results. In general, I started with the shortest command I thought could work and tweaked it based on the results I was seeing. One slightly disappointing part of prompt engineering: I can spend an afternoon working on a problem, and at the end of it only have a single line of code to check into a commit.
- **Remember that your results are non-deterministic.** For someone who loves to talk about <Term id="idempotent">idempotency</Term>, having a model whose results vary based on the vibes of some rocks we tricked into dreaming is a bit weird, and requires a bit more defensive coding than you may be used to. For example, one of the prompts I use is classification-focused (identifying the discussion's product area), and normally the result is just the name of that product. But sometimes it will return a little spiel explaining its thinking, so I need to explicitly extract that value from the response instead of unthinkingly accepting whatever I get back. Defining the valid options in a Jinja variable has helped keep them in sync: I can pass them into the prompt and then reuse the same list when extracting the correct answer.

```sql
-- a cut down list of segments for the sake of readability
{% set segments = ['Warehouse configuration', 'dbt Cloud IDE', 'dbt Core', 'SQL', 'dbt Orchestration', 'dbt Explorer', 'Unknown'] %}

select trim(
snowflake.cortex.complete(
'llama2-70b-chat',
concat(
'Identify the dbt product segment that this message relates to, out of [{{ segments | join("|") }}]. Your response should be only the segment with no explanation. <message>',
text,
'</message>'
)
)
) as product_segment_raw,

-- reusing the segments Jinja variable here
coalesce(regexp_substr(product_segment_raw, '{{ segments | join("|") }}'), 'Unknown') as product_segment
```
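
As a concrete sketch of the "orchestrate defensively" tip above, a `selectors.yml` along these lines would carve the experimental models out of the default job. The selector and tag names here are hypothetical, and this assumes the Cortex-powered models are tagged `llm-experiment` in their configs:

```yaml
# selectors.yml — a sketch; selector and tag names are hypothetical
selectors:
  - name: llm_experiments
    description: Only the experimental Cortex-powered models, run in their own job
    definition:
      method: tag
      value: llm-experiment

  - name: default_job
    description: Everything except the experimental LLM models
    definition:
      union:
        - method: fqn
          value: "*"
        - exclude:
            - method: tag
              value: llm-experiment
```

The experimental job runs with `--selector llm_experiments`, triggered on successful completion of the main job, while the main job uses `--selector default_job`.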

## Share your experiences

If you're doing anything like this in your work or a side project, I'd love to hear about it in the comments section on Discourse or in the #machine-learning-general channel in Slack.

## Appendix: An example complete model

Here's the full model that I'm running to create the overall rollup messages that get posted to Slack, built on top of the row-by-row summary in an earlier model:

```sql
{{
config(
materialized='incremental',
unique_key='unique_key',
full_refresh=false
)
-}}


{#
This partition_by list is to DRY up the columns that are used in different parts of the query.
The SQL is used in the partition by components of the window function aggregates, and the column
names are used (in conjunction with the SQL) to select the relevant columns out in the final model.
They could be written out manually, but that creates a lot of places to update when changing from
day to week truncation, for example.

Side note: I am still not thrilled with this approach, and would be happy to hear about alternatives!
#}
{%- set partition_by = [
{'column': 'summary_period', 'sql': 'date_trunc(day, sent_at)'},
{'column': 'product_segment', 'sql': 'lower(product_segment)'},
{'column': 'is_further_attention_needed', 'sql': 'is_further_attention_needed'},
] -%}

{% set partition_by_sqls = [] -%}
{% set partition_by_columns = [] -%}

{% for p in partition_by -%}
{% do partition_by_sqls.append(p.sql) -%}
{% do partition_by_columns.append(p.column) -%}
{% endfor -%}


with

summaries as (

select * from {{ ref('fct_slack_thread_llm_summaries') }}
where not has_townie_participant

),

aggregated as (
select distinct
{# Using the columns defined above #}
{% for p in partition_by -%}
{{ p.sql }} as {{ p.column }},
{% endfor -%}

-- This creates a JSON array, where each element is one thread + its permalink.
-- Each array is broken down by the partition_by columns defined above, so there's
-- one summary per time period and product etc.
array_agg(
object_construct(
'permalink', thread_permalink,
'thread', thread_summary
)
) over (partition by {{ partition_by_sqls | join(', ') }}) as agg_threads,
count(*) over (partition by {{ partition_by_sqls | join(', ') }}) as num_records,

-- The partition columns are the grain of the table, and can be used to create
-- a unique key for incremental purposes
{{ dbt_utils.generate_surrogate_key(partition_by_columns) }} as unique_key
from summaries
{% if is_incremental() %}
where unique_key not in (select this.unique_key from {{ this }} as this)
{% endif %}

),

summarised as (

select
*,
trim(snowflake.cortex.complete(
'llama2-70b-chat',
concat(
'In a few bullets, describe the key takeaways from these threads. For each object in the array, summarise the `thread` field, then provide the Slack permalink URL from the `permalink` field for that element in markdown format at the end of each summary. Do not repeat my request back to me in your response.',
agg_threads::text
)
)) as overall_summary
from aggregated

),

final as (
select
* exclude overall_summary,
-- The LLM loves to say something like "Sure, here's your summary:" despite my best efforts. So this strips that line out
regexp_replace(
overall_summary, '(^Sure.+:\n*)', ''
) as overall_summary

from summarised
)

select * from final
```
