Add knitr caching
hadley committed Nov 21, 2024
1 parent eb4bd4b commit bf36878
Showing 85 changed files with 44 additions and 3 deletions.
22 changes: 21 additions & 1 deletion vignettes/prompt-design.Rmd
@@ -11,7 +11,8 @@ vignette: >
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
eval = elmer:::openai_key_exists() && elmer:::anthropic_key_exists()
eval = elmer:::openai_key_exists() && elmer:::anthropic_key_exists(),
cache = TRUE
)
options(elmer_seed = 1337)
```
@@ -27,6 +28,7 @@ As well as the general advice in this vignette, it's also a good idea to read th
If you have a Claude account, you can use its prompt generator: <https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prompt-generator>. This prompt generator has been specifically tailored for Claude, but I suspect it will help many other LLMs, or at least give you some ideas as to what else you might want to include in your prompt.

```{r setup}
#| cache: false
library(elmer)
```

@@ -56,6 +58,7 @@ However, for the purposes of this vignette, we'll keep the prompts fairly short
Let's explore prompt design for a simple code generation task:

```{r}
#| cache: false
question <- "
How can I compute the mean and median of variables a, b, c, and so on,
all the way up to z, grouped by age and sex.
@@ -69,13 +72,15 @@ I'll use `chat_claude()` for this problem because in our experience it does the
When I don't provide a system prompt, I sometimes get answers in different languages:

```{r}
#| label: code-basic
chat <- chat_claude()
chat$chat(question)
```

I can ensure that I always get R code by providing a system prompt:

```{r}
#| label: code-r
chat <- chat_claude(system_prompt = "
You are an expert R programmer.
")
@@ -87,6 +92,7 @@ Note that I'm using both a system prompt (which defines the general behaviour) a
Since I'm mostly interested in the code, I ask it to drop the explanation and the sample data:

```{r}
#| label: code-r-minimal
chat <- chat_claude(system_prompt = "
You are an expert R programmer.
Just give me the code. I don't want any explanation or sample data.
@@ -97,6 +103,7 @@ chat$chat(question)
With this prompt, I seem to mostly get tidyverse code. So if you want a different style of R code, you should ask for it:

```{r}
#| label: code-styles
chat <- chat_claude(system_prompt = "
You are an expert R programmer who prefers data.table.
Just give me the code. I don't want any explanation or sample data.
@@ -117,6 +124,7 @@ chat$chat(question)
If there's something about the output that you don't like, you can try being more explicit about it. For example, the code isn't styled quite how I like, so I provide more details about what I do want:

```{r}
#| label: code-explicit
chat <- chat_claude(system_prompt = "
You are an expert R programmer who prefers the tidyverse.
Just give me the code. I don't want any explanation or sample data.
@@ -136,6 +144,7 @@ This still doesn't yield exactly the code that I'd write, but it's pretty close.
You could provide a different prompt if you were looking for more explanation of the code:

```{r}
#| label: code-teacher
chat <- chat_claude(system_prompt = "
You are an expert R teacher.
I am a new R user who wants to improve my programming skills.
@@ -151,6 +160,7 @@ chat$chat(question)
You can imagine LLMs as being a sort of an average of the internet at a given point in time. That means they will provide popular answers, which will tend to reflect older coding styles (either because the new features aren't in their index, or the older features are so much more popular). So if you want your code to use specific features that are relatively recent, you might need to provide the examples yourself:

```{r}
#| label: code-new-feature
chat <- chat_claude(system_prompt = "
You are an expert R programmer.
Just give me the code; no explanation in text.
@@ -181,6 +191,7 @@ Providing a rich set of examples is a great way to encourage the output to produ
My overall goal is to turn a list of ingredients, like the following, into a nicely structured JSON that I can then analyse in R (e.g. to compute the total weight, scale the recipe up or down, or to convert the units from volumes to weights).
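
As a rough sketch of the kind of analysis this enables, the snippet below computes a total weight and scales a recipe. The data frame `ingredients_df`, its column names, and its values are hypothetical stand-ins made up for illustration; they are not output from any model.

```r
# Hypothetical parsed ingredients; values are invented for illustration.
ingredients_df <- data.frame(
  name = c("dark brown sugar", "eggs", "butter"),
  quantity = c(150, 2, 113),
  unit = c("g", NA, "g")
)

# Total weight: sum only the rows that are measured in grams.
in_grams <- !is.na(ingredients_df$unit) & ingredients_df$unit == "g"
total_weight <- sum(ingredients_df$quantity[in_grams])

# Scale the recipe by multiplying every quantity by a constant factor.
scale_recipe <- function(df, factor) {
  df$quantity <- df$quantity * factor
  df
}
doubled <- scale_recipe(ingredients_df, 2)
```

Keeping quantities numeric and units in their own column is what makes both operations one-liners.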

```{r}
#| cache: false
ingredients <- "
¾ cup (150g) dark brown sugar
2 large eggs
@@ -199,6 +210,7 @@ ingredients <- "
If you don't have strong feelings about what the data structure should look like, you can start with a very loose prompt and see what you get back. I find this a useful pattern for underspecified problems where a big part of the problem is just defining precisely what problem you want to solve. Seeing the LLM's attempt at coming up with a data structure gives me something to immediately react to, rather than having to start from a blank page.

```{r}
#| label: data-loose
instruct_json <- "
You're an expert baker who also loves JSON. I am going to give you a list of
ingredients and your job is to return nicely structured JSON. Just return the
@@ -216,6 +228,7 @@ chat$chat(ingredients)
This isn't a bad start, but I prefer to cook with weight, so I only want to see volumes if weight isn't available. So I provide a couple of examples of what I'm looking for. I was pleasantly surprised that I can provide the input and output examples in such a loose format.

```{r}
#| label: data-examples
instruct_weight <- r"(
Here are some examples of the sort of output I'm looking for:
@@ -236,6 +249,7 @@ chat$chat(ingredients)
Just providing the examples seems to work remarkably well. But I found it useful to also include a description of what the examples are trying to accomplish. I'm not sure if this helps the LLM or not, but it certainly makes it easier for me to understand the organisation and check that I've covered the key pieces that I'm interested in.

```{r}
#| cache: false
instruct_weight <- r"(
* If an ingredient has both weight and volume, extract only the weight:
@@ -260,6 +274,7 @@ This structure also allows me to give the LLMs a hint about how I want multiple
I then just iterated on this task, looking at the results from different recipes to get a sense of what the LLM was getting wrong. Much of this felt like I was iterating on my understanding of the problem, as I didn't start by knowing exactly how I wanted the data. For example, when I started out I didn't really think about all the various ways that ingredients are specified. For later analysis, I always want quantities to be numbers, even if they were originally fractions, or if the units aren't precise (like a pinch). It also forced me to realise that some ingredients are unitless.

```{r}
#| cache: false
instruct_unit <- r"(
* If the unit uses a fraction, convert it to a decimal.
@@ -296,6 +311,7 @@ You might want to take a look at the [full prompt](https://gist.github.com/hadle
Now that I've iterated to get a data structure that I like, it seems useful to formalise it and tell the LLM exactly what I'm looking for using structured data. This guarantees that the LLM will only return JSON, the JSON will have the fields that you expect, and then elmer will automatically convert it into an R data structure for you.

```{r}
#| label: data-structured
type_ingredient <- type_object(
name = type_string("Ingredient name"),
quantity = type_number(),
@@ -314,6 +330,7 @@ do.call(rbind, lapply(data$ingredients, as.data.frame))
One thing that I'd do next time would also be to include the raw ingredient name in the output. This doesn't make much difference here, in this simple example, but it makes it much easier to align the input and the output and start to develop automated measures of how well my prompt is doing.
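
A minimal sketch of one such automated measure, assuming each extracted ingredient carries an `input` field that echoes the raw text. Both `raw_lines` and `extracted` are hypothetical values invented here, not real model output.

```r
# Hypothetical raw input lines and hypothetical extraction results.
raw_lines <- c("2 large eggs", "1 cup flour", "1 pinch salt")
extracted <- list(
  list(input = "2 large eggs", name = "egg", quantity = 2),
  list(input = "1 pinch of salt", name = "salt", quantity = 1)
)

# Pull out the echoed input text from each extracted record.
echoed <- vapply(extracted, function(x) x$input, character(1))

# Echoed text that does not match any raw line (possible paraphrasing).
unmatched_outputs <- setdiff(echoed, raw_lines)

# Raw lines that no record claims to come from (possible omissions).
missing_inputs <- setdiff(raw_lines, echoed)
```

Exact string matching is deliberately strict; a real check might normalise whitespace or use fuzzy matching instead.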

```{r}
#| cache: false
instruct_weight_input <- r"(
* If an ingredient has both weight and volume, extract only the weight:
@@ -336,6 +353,7 @@ instruct_weight_input <- r"(
I think this is particularly important if you're working with even less structured text. For example, imagine you had this text:

```{r}
#| cache: false
recipe <- r"(
In a large bowl, cream together one cup of softened unsalted butter and a
quarter cup of white sugar until smooth. Beat in an egg and 1 teaspoon of
@@ -351,6 +369,7 @@ recipe <- r"(
Including the input text in the output makes it easier to see if it's doing a good job:

```{r}
#| label: data-unstructured-input
chat <- chat_openai(c(instruct_json, instruct_weight_input))
chat$chat(recipe)
```
@@ -360,6 +379,7 @@ When I ran it while writing this vignette, it seems to be working out the weight
## Token usage

```{r}
#| label: usage
#| type: asis
#| echo: false
knitr::kable(token_usage())
1 change: 1 addition & 0 deletions vignettes/prompt-design_cache/html/__packages
@@ -0,0 +1 @@
elmer
40 binary files not shown; 2 empty files.
23 changes: 21 additions & 2 deletions vignettes/structured-data.Rmd
@@ -11,21 +11,24 @@ vignette: >
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
eval = elmer:::openai_key_exists()
eval = elmer:::openai_key_exists(),
cache = TRUE
)
```

When using an LLM to extract data from text or images, you can ask the chatbot to nicely format it, in JSON or any other format that you like. This will generally work well most of the time, but there's no guarantee that you'll actually get the exact format that you want. In particular, if you're trying to get JSON, you'll find that it's typically surrounded in ```` ```json ````, and you'll occasionally get text that isn't actually valid JSON. To avoid these challenges you can use a recent LLM feature: **structured data** (aka structured output). With structured data, you supply a type specification that exactly defines the object structure that you want and the LLM will guarantee that's what you get back.
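
To make the problem concrete, here is a rough sketch of the manual cleanup you would otherwise need when a reply arrives wrapped in a code fence. The `reply` string and the helper `strip_json_fence()` are hypothetical, written only for this illustration; neither is part of elmer.

```r
# Build the three-backtick fence programmatically to keep this example readable.
fence <- strrep("`", 3)

# Hypothetical raw reply; not output from a real model.
reply <- paste0(fence, "json\n{\"name\": \"Susan\", \"age\": 13}\n", fence)

# Hypothetical helper: drop a leading fence (optionally tagged json) and a
# trailing fence, leaving other text untouched.
strip_json_fence <- function(x) {
  x <- sub(paste0("^\\s*", fence, "(json)?\\s*"), "", x, perl = TRUE)
  sub(paste0("\\s*", fence, "\\s*$"), "", x, perl = TRUE)
}

json_text <- strip_json_fence(reply)
```

Even after stripping the fence, nothing guarantees that `json_text` is valid JSON, which is the gap structured data is meant to close.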

```{r setup}
library(elmer)
#| cache: false
library(elmer)
```

## Structured data basics

To extract structured data you call the `$extract_data()` method instead of the `$chat()` method. You'll also need to define a type specification that describes the structure of the data that you want (more on that shortly). Here's a simple example that extracts two specific values from a string:

```{r}
#| label: basics-text
chat <- chat_openai()
chat$extract_data(
"My name is Susan and I'm 13 years old",
@@ -39,6 +42,7 @@ chat$extract_data(
The same basic idea works with images too:

```{r}
#| label: basics-image
chat$extract_data(
content_image_url("https://www.r-project.org/Rlogo.png"),
spec = type_object(
@@ -58,6 +62,7 @@ To define your desired type specification (also known as a **schema**), you use
* **Arrays** represent any number of values of the same type and are created with `type_array()`. You must always supply the `items` argument which specifies the type of each individual element. Arrays of scalars are very similar to R's atomic vectors:

```{r}
#| cache: false
type_logical_vector <- type_array(items = type_boolean())
type_integer_vector <- type_array(items = type_integer())
type_double_vector <- type_array(items = type_number())
@@ -67,12 +72,14 @@ To define your desired type specification (also known as a **schema**), you use
You can also have arrays of arrays and arrays of objects, which more closely resemble lists with well defined structures:

```{r}
#| cache: false
list_of_integers <- type_array(items = type_integer_vector)
```

* **Objects** represent a collection of named values and are created with `type_object()`. Objects can contain any number of scalars, arrays, and other objects. They are similar to named lists in R.

```{r}
#| cache: false
type_person <- type_object(
name = type_string(),
age = type_integer(),
@@ -83,6 +90,7 @@ To define your desired type specification (also known as a **schema**), you use
As well as the definition of the types, you need to provide the LLM with some information about what you actually want. This is the purpose of the first argument, `description`, which is a string that describes the data that you want. This is a good place to ask nicely for other attributes you'd like the value to possess (e.g. minimum or maximum values, date formats, ...). You aren't guaranteed that these requests will be honoured, but the LLM will usually make a best effort to do so.
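
Because such requests are not guaranteed to be honoured, it can be worth checking the result yourself. The sketch below assumes a hypothetical extracted value `person` and a hypothetical helper `check_person()` with made-up age bounds; none of these come from elmer.

```r
# Hypothetical extracted value; invented for illustration.
person <- list(name = "Susan", age = 13L)

# Hypothetical validator: returns a character vector of problems,
# or an empty vector when every check passes.
check_person <- function(x, min_age = 0, max_age = 120) {
  problems <- character()
  if (!is.character(x$name) || length(x$name) != 1 || !nzchar(x$name)) {
    problems <- c(problems, "name must be a non-empty string")
  }
  if (!is.numeric(x$age) || length(x$age) != 1 ||
      x$age < min_age || x$age > max_age) {
    problems <- c(problems, "age is outside the expected range")
  }
  problems
}

problems <- check_person(person)
```

Returning a vector of messages rather than stopping at the first failure makes it easier to log every issue for a given record.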

```{r}
#| cache: false
type_person <- type_object(
"A person",
name = type_string("Name"),
@@ -100,6 +108,7 @@ The following examples are [closely inspired by the Claude documentation](https:
### Example 1: Article summarisation

```{r}
#| label: examples-summarisation
text <- readLines(system.file("examples/third-party-testing.txt", package = "elmer"))
# url <- "https://www.anthropic.com/news/third-party-testing"
# html <- rvest::read_html(url)
@@ -127,6 +136,7 @@ str(data)
### Example 2: Named entity recognition

```{r}
#| label: examples-named-entity
text <- "
John works at Google in New York. He met with Sarah, the CEO of
Acme Inc., last week in San Francisco.
@@ -151,6 +161,7 @@ str(chat$extract_data(text, spec = named_entities))
### Example 3: Sentiment analysis

```{r}
#| label: examples-sentiment
text <- "
The product was okay, but the customer service was terrible. I probably
won't buy from them again.
@@ -172,6 +183,7 @@ Note that we've asked nicely for the scores to sum to 1, and they do in this exampl
### Example 4: Text classification

```{r}
#| label: examples-classification
text <- "The new quantum computing breakthrough could revolutionize the tech industry."
classification <- type_array(
@@ -213,6 +225,7 @@ do.call(rbind, lapply(data, as.data.frame))
### Example 5: Working with unknown keys

```{r, eval = elmer:::anthropic_key_exists()}
#| label: examples-unknown-keys
characteristics <- type_object(
"All characteristics",
.additional_properties = TRUE
Expand Down Expand Up @@ -242,6 +255,7 @@ This example comes from [Dan Nguyen](https://gist.github.com/dannguyen/faaa56ceb
Even without any descriptions, ChatGPT does pretty well:

```{r}
#| label: examples-image
asset <- type_object(
asset_name = type_string(),
owner = type_string(),
Expand Down Expand Up @@ -274,6 +288,7 @@ By default, all components of an object are required. If you want to make some o
For example, here the LLM hallucinates a date even though there isn't one in the text:

```{r}
#| label: type-required
article_spec <- type_object(
"Information about an article written in markdown",
title = type_string("Article title"),
Expand Down Expand Up @@ -302,6 +317,7 @@ Note that I've used more of an explict prompt here. For this example, I found th
If let the LLM know that the fields are all optional, it'll instead return `NULL` for the missing fields:
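
`NULL` fields need a little care if you later want a data frame row, since a `NULL` column has length zero. A minimal sketch, assuming a hypothetical result `article` and a hypothetical helper `null_to_na()` (neither is part of elmer):

```r
# Hypothetical extraction result with a missing optional field.
article <- list(title = "Example title", author = "A. Author", date = NULL)

# Hypothetical helper: replace NULL entries with NA so every field has length 1.
null_to_na <- function(x) {
  lapply(x, function(v) if (is.null(v)) NA else v)
}

# One row per article, with NA marking the field the LLM left out.
article_row <- as.data.frame(null_to_na(article))
```

Using `NA` keeps the missing value visible in downstream summaries instead of silently dropping the column.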

```{r}
#| label: type-optional
article_spec <- type_object(
"Information about an article written in markdown",
title = type_string("Article title", required = FALSE),
@@ -316,6 +332,7 @@ chat$extract_data(prompt, spec = article_spec)
If you want to define a data frame-like object, you might be tempted to create a definition similar to what R uses: an object (i.e. a named list) containing multiple vectors (i.e. arrays):

```{r}
#| cache: false
my_df_type <- type_object(
name = type_array(items = type_string()),
age = type_array(items = type_integer()),
@@ -327,6 +344,7 @@ my_df_type <- type_object(
This, however, is not quite right because there's no way to specify that each array should have the same length. Instead you need to turn the data structure "inside out", and instead create an array of objects:
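
The difference between the two shapes can be sketched in plain R. The lists `columns` and `records` below are hypothetical values invented for illustration, not model output.

```r
# Column-oriented shape: nothing stops the vectors from having different lengths.
columns <- list(name = c("Alex", "Jamal", "Li"), age = c(42L, 27L))
same_length <- length(unique(lengths(columns))) == 1

# Row-oriented shape ("array of objects"): each record is complete on its own.
records <- list(
  list(name = "Alex", age = 42L),
  list(name = "Jamal", age = 27L)
)

# Turn each record into a one-row data frame, then stack the rows.
people <- do.call(rbind, lapply(records, as.data.frame))
```

With the row-oriented shape, a missing value affects only one record rather than misaligning an entire column.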

```{r}
#| cache: false
my_df_type <- type_array(
items = type_object(
name = type_string(),
@@ -344,6 +362,7 @@ If you're working with OpenAI, you'll need to wrap this in a dummy object becaus
## Token usage

```{r}
#| label: usage
#| type: asis
#| echo: false
knitr::kable(token_usage())
1 change: 1 addition & 0 deletions vignettes/structured-data_cache/html/__packages
@@ -0,0 +1 @@
elmer
36 binary files not shown; 3 empty files.
