LLMs don't have perfect knowledge of Convex, so they require some prompting to help them along. This repo contains a set of prompts for coding a Convex backend, a set of human-curated solutions, and a script for evaluating the LLM's output.
To get started, install the Python and JavaScript dependencies:
pip install pdm
pdm install
npm install -g bun
bun install
echo "ANTHROPIC_API_KEY=<your ANTHROPIC_API_KEY>" > .env
echo "OPENAI_API_KEY=<your OPENAI_API_KEY>" >> .env
Then run the evals, passing the model to evaluate and how many generations to run concurrently:
pdm run python runner/main.py --model=claude-3-5-sonnet-latest --generate-concurrency=1
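The --generate-concurrency flag caps how many generations run at once. The sketch below shows the general pattern such a flag implies (an asyncio semaphore around a hypothetical generate_one coroutine); it is illustrative, not the runner's actual implementation:

```python
import asyncio

async def generate_all(evals, generate_one, concurrency: int = 1):
    """Run generate_one(eval) for every eval, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(e):
        async with semaphore:
            return await generate_one(e)

    return await asyncio.gather(*(bounded(e) for e in evals))
```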
You can also specify a test filter regex:
pdm run python runner/main.py --model=claude-3-5-sonnet-latest --generate-concurrency=1 --test-filter='.*data_modeling.*'
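Under the hood this is just a regex match against eval names. A sketch of the idea, using hypothetical eval paths:

```python
import re

def filter_evals(eval_names, test_filter=None):
    """Keep only eval names matching the optional filter regex."""
    if test_filter is None:
        return list(eval_names)
    pattern = re.compile(test_filter)
    return [name for name in eval_names if pattern.search(name)]

# With --test-filter='.*data_modeling.*', only the second (hypothetical) eval survives.
filter_evals(
    ["000-fundamentals/http_actions_file_storage", "100-data_modeling/relations"],
    r".*data_modeling.*",
)
```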
If you'd like to grade the evaluations again without regenerating them, run:
pdm run python runner/main.py --skip-generation
Grading writes out a JSON report to the output directory, which you can set with --output-dir:
pdm run python runner/main.py --output-dir=output
You can also pretty print the report:
pdm run python print_report.py output/report.json
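If you want to poke at the raw report without the pretty-printer, loading it is just standard JSON handling (the report schema isn't documented here, so this only dumps whatever is in the file):

```python
import json
from pathlib import Path

# Load the grading report and dump it with indentation for a quick look.
# print_report.py gives a nicer summary; this just shows the raw structure.
report = json.loads(Path("output/report.json").read_text())
print(json.dumps(report, indent=2))
```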
There's also a Next.js app for viewing the report:
cd viewer
bun install
bun dev
To add a new eval, run:
pdm run python create_eval.py <name> <category>
For example, adding a new fundamentals eval for using HTTP actions and storage would be:
pdm run python create_eval.py http_actions_file_storage 000-fundamentals
Note that test or category names cannot contain dashes.
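If you script eval creation, that rule is easy to check up front. A small sketch of such a check (not taken from create_eval.py):

```python
def validate_name(name: str) -> None:
    """Reject names containing dashes; underscores are fine."""
    if "-" in name:
        raise ValueError(
            f"{name!r} contains a dash; use underscores instead, "
            "e.g. http_actions_file_storage"
        )

validate_name("http_actions_file_storage")  # passes
validate_name("http-actions")               # raises ValueError
```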