# Stencila Evaluations and Benchmarking
👋 Intro • 🚴 Roadmap • 🛠️ Develop • 🙏 Acknowledgements • 💖 Supporters
## 👋 Intro

Welcome to the repository for Stencila's LLM evaluations and benchmarking. It is in early development and consolidates related code that we previously had in other repos.
## 🚴 Roadmap

We plan the following three main methodologies for evaluating LLMs for science-focussed prompts and tasks. To avoid discontinuities, we are likely to use a weighting approach, in which we gradually increase the weight of the more advanced methodologies as they are developed.
1. Collate external benchmarks and map prompts to each. For example, combine scores from LiveBench's coding benchmark and Aider's code editing benchmark into a single code-quality score and use it for `stencila/create/code-chunk`, `stencila/create/figure-code`, and other code-related prompts (a sketch of this kind of score combination follows this list).
2. Establish a pipeline for evaluating prompts themselves, and which LLMs are best suited to each prompt, using LLM-as-a-jury and other methods for machine-based evaluation.
3. Use data from users' acceptance and refinement of AI suggestions within documents as the basis for human-based evaluations.
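As a concrete illustration of the first methodology and the weighting idea, here is a minimal sketch of combining normalized external benchmark scores into a single category score. The benchmark names, the weights, and the `combine_scores` helper are hypothetical, chosen for illustration rather than taken from the codebase.

```python
# Hypothetical sketch: combine normalized (0..1) external benchmark scores
# into one weighted category score. Names and weights are illustrative only.

CODE_QUALITY_WEIGHTS = {
    "livebench-coding": 0.6,    # LiveBench's coding benchmark
    "aider-code-editing": 0.4,  # Aider's code editing benchmark
}

def combine_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the benchmark scores available for one model.

    Benchmarks missing for a model are skipped and the remaining weights
    are re-normalized, so the result stays within 0..1.
    """
    available = {name: w for name, w in weights.items() if name in scores}
    total = sum(available.values())
    if total == 0:
        raise ValueError("No overlapping benchmarks for this model")
    return sum(scores[name] * w for name, w in available.items()) / total

# A model scoring 0.72 on LiveBench coding and 0.65 on Aider code editing
print(combine_scores(
    {"livebench-coding": 0.72, "aider-code-editing": 0.65},
    CODE_QUALITY_WEIGHTS,
))  # ≈ 0.692
```

The same shape of calculation could later down-weight the collated external benchmarks in favour of the machine- and human-based evaluations as those are developed.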
## 🛠️ Develop

For development, you’ll need to install the following dependencies:

- `uv`
- `just`
Then, the following will get you started with a development environment:

```sh
just init
```
Once `uv` is installed, you can use it to install some additional tools:

```sh
uv tool install ruff
uv tool install pyright
```
The `justfile` has some common development-related commands that you might want to run. For example, the `check` command runs all linting and tests:

```sh
just check
```
To run anything within the virtual environment, you need to use `uv run <command>`.
Alternatively, you can install direnv, and have the virtual environment activated automatically.
See here for more details about using direnv and uv together.
Overview of the current design of the code:
- Data is fetched from the sources defined under `src/evals/benchmarks`, and the raw downloads are saved.
- We then use pydantic classes to validate the incoming data and save it to parquet data frames using `polars`.
- The `tables` folder contains two tables (as CSVs). The first is a set of models with an `id` and their mapping to the model names in the benchmarks we download; a `use` column lets us pick which models we use. The second is a list of prompts, each with an associated `category`.
- We combine the downloaded parquet data frames with the models and prompts tables to generate lists of scores (validated by pydantic) and then save the results to another scoring data frame (see the sketch below).
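To make the stages above concrete, here is a minimal sketch of the validate-then-save step and the join against the models table. The class name `BenchmarkResult`, the `benchmark_name` column, the assumption that `use` is boolean, and the file paths are all illustrative assumptions; the real pydantic classes and paths live under `src/evals`.

```python
# Hypothetical sketch of the pipeline described above; class names, column
# names, and paths are illustrative, not the ones used in src/evals.
import polars as pl
from pydantic import BaseModel


class BenchmarkResult(BaseModel):
    """One row of incoming benchmark data, validated before saving."""

    model: str      # model name as reported by the external benchmark
    benchmark: str  # e.g. "livebench-coding"
    score: float    # raw score as reported by the benchmark


def save_validated(rows: list[dict], path: str) -> None:
    """Validate raw rows with pydantic, then write them to parquet with polars."""
    validated = [BenchmarkResult(**row).model_dump() for row in rows]
    pl.DataFrame(validated).write_parquet(path)


def combine(results_path: str, models_csv: str) -> pl.DataFrame:
    """Join benchmark results with the models table, keeping only models
    whose (assumed boolean) `use` column is set."""
    results = pl.read_parquet(results_path)
    models = pl.read_csv(models_csv)  # columns assumed: id, benchmark_name, use
    return (
        results
        .join(models, left_on="model", right_on="benchmark_name")
        .filter(pl.col("use"))
    )
```

A similar join against the prompts table (on its `category` column) would attach each score to the prompts it applies to.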
Each of these stages can be run from the command line. To see the commands, look in `pyproject.toml` under the `[project.scripts]` section; for example, there is a command for downloading the benchmarks. These commands are also invoked from the `justfile` (`just all`).
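For readers unfamiliar with console-script entry points, the sketch below shows how such a command could be wired up; the script name `evals-download` and the `evals.cli:download` function are hypothetical, so check `pyproject.toml` for the real ones.

```python
# Hypothetical entry-point module (e.g. src/evals/cli.py).
#
# A matching pyproject.toml entry would look like:
#
#   [project.scripts]
#   evals-download = "evals.cli:download"
#
# after which `uv run evals-download` calls the function below.


def download() -> None:
    """Fetch raw data from each configured benchmark source and save it."""
    print("Downloading benchmark data...")


if __name__ == "__main__":
    download()
```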
- By default, all the data just gets saved under a `data` folder in the root of the project.
- The scores are currently normalized to 0..1 (rather than 0-100); see the sketch below.
- There is no output to any SQLite database yet, though there is a schema sketched in `src/evals/orm.py`.
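As an illustration of that normalization, assuming a benchmark that reports scores on a 0-100 scale (the actual code may use per-benchmark ranges):

```python
# Hypothetical helper: linearly rescale a raw benchmark score onto 0..1.
def normalize(raw: float, minimum: float = 0.0, maximum: float = 100.0) -> float:
    return (raw - minimum) / (maximum - minimum)


print(normalize(85.0))  # ≈ 0.85
```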
## 🙏 Acknowledgements

Thank you to the following projects whose code and/or data we rely on:
## 💖 Supporters

We are grateful for the support of the Astera Institute for this work.