llm-eval

This repository contains a reproducible workflow set up using DVC and backed by a JASMIN object store. Before working with the repository, please contact Matt Coole to request access to the JASMIN object store llm-eval-o. Then follow the instructions below.

Requirements

Ollama (llama3.1 and mistral-nemo models)
Python 3.9+

Getting started

Setup

First create a new virtual environment and install the required dependencies:

python -m venv .venv
source .venv/bin/activate
pip install .

Configuration

Next, set up your local DVC configuration with your JASMIN object store access key and secret:

dvc remote modify --local jasmin access_key_id '<ACCESS_KEY_ID>'
dvc remote modify --local jasmin secret_access_key '<KEY_SECRET>'

Getting the data

Pull the data from the object store using DVC:

dvc pull

Working with the pipeline

You should now be ready to re-run the pipeline:

dvc repro

This pipeline is defined in dvc.yaml and can be viewed with the command:

dvc dag

Alternatively, it can be output in Mermaid format for display in Markdown:

dvc dag --md

flowchart TD
	node1["chunk-data"]
	node2["create-embeddings"]
	node3["evaluate"]
	node4["extract-metadata"]
	node5["fetch-metadata"]
	node6["fetch-supporting-docs"]
	node7["generate-testset"]
	node8["run-rag-pipeline"]
	node9["upload-to-docstore"]
	node1-->node2
	node2-->node9
	node4-->node1
	node5-->node4
	node5-->node6
	node6-->node1
	node7-->node8
	node8-->node3
	node9-->node8
	node10["data/evaluation-sets.dvc"]
	node11["data/synthetic-datasets.dvc"]
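
To make the stage dependencies in the flowchart easier to reason about, the following is a minimal Python sketch that copies the edges from the diagram, derives one valid execution order, and lists the stages downstream of a given stage. It is only an illustration of the graph structure: the helper names (build_graph, execution_order, downstream_of) are hypothetical, and DVC itself decides what to re-run from the dependencies and outputs declared in dvc.yaml, not from this code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Edges copied from the flowchart: (upstream stage, downstream stage).
EDGES = [
    ("chunk-data", "create-embeddings"),
    ("create-embeddings", "upload-to-docstore"),
    ("extract-metadata", "chunk-data"),
    ("fetch-metadata", "extract-metadata"),
    ("fetch-metadata", "fetch-supporting-docs"),
    ("fetch-supporting-docs", "chunk-data"),
    ("generate-testset", "run-rag-pipeline"),
    ("run-rag-pipeline", "evaluate"),
    ("upload-to-docstore", "run-rag-pipeline"),
]


def build_graph(edges):
    """Map each stage to the set of stages it depends on."""
    graph = {}
    for upstream, downstream in edges:
        graph.setdefault(upstream, set())
        graph.setdefault(downstream, set()).add(upstream)
    return graph


def execution_order(edges):
    """Return one valid order in which the stages could run."""
    return list(TopologicalSorter(build_graph(edges)).static_order())


def downstream_of(stage, edges):
    """Return every stage that depends, directly or indirectly, on `stage`."""
    affected, frontier = set(), [stage]
    while frontier:
        current = frontier.pop()
        for upstream, downstream in edges:
            if upstream == current and downstream not in affected:
                affected.add(downstream)
                frontier.append(downstream)
    return affected


if __name__ == "__main__":
    print(execution_order(EDGES))
    print(sorted(downstream_of("chunk-data", EDGES)))
```

For example, a change that only affects chunk-data leaves fetch-metadata, extract-metadata, fetch-supporting-docs and generate-testset upstream of or independent from it, so in this model only create-embeddings, upload-to-docstore, run-rag-pipeline and evaluate would follow.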


Note: To re-run the fetch-supporting-docs stage of the pipeline you will need to request access to the Legilo service from the EDS dev team and provide your username and password in a .env file.

Running Experiments

By default, the pipeline runs using the parameters defined in params.yaml. To experiment with varying these parameters, you can change them directly or use DVC experiments.

To run an experiment varying a particular parameter:

dvc exp run -S hp.chunk-size=1000

This will re-run the pipeline but override the value of the hp.chunk-size parameter in params.yaml and set it to 1000. Only the necessary stages of the pipeline should be re-run and the result should appear in your workspace.
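
As a rough illustration of what a dotted override such as hp.chunk-size=1000 means for a nested parameters file, here is a minimal Python sketch. It is not how DVC implements -S; the functions parse_value and apply_override are hypothetical helpers, and the only parameter assumed to exist is hp.chunk-size with a baseline value of 300.

```python
import copy


def parse_value(raw):
    """Convert a raw string to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_override(params, override):
    """Return a copy of `params` with one `dotted.key=value` override applied."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    updated = copy.deepcopy(params)  # leave the baseline parameters untouched
    node = updated
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = parse_value(raw_value)
    return updated


if __name__ == "__main__":
    # Assumed baseline: only hp.chunk-size is modelled here.
    baseline = {"hp": {"chunk-size": 300}}
    print(apply_override(baseline, "hp.chunk-size=1000"))
```

The override is applied to a copy, which mirrors the idea that the baseline values stay available for comparison.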

You can compare the results of your experiment to the results of the baseline run of the pipeline using:

dvc exp diff

Path               Metric              HEAD      workspace    Change
data/metrics.json  answer_correctness  0.049482  0.043685     -0.0057974
data/metrics.json  answer_similarity   0.19793   0.17474      -0.02319
data/metrics.json  context_recall      0.125     0            -0.125
data/metrics.json  faithfulness        0.75      0.69375      -0.05625

Path         Param          HEAD    workspace    Change
params.yaml  hp.chunk-size  300     1000         700
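
The Change column is the workspace value minus the HEAD value. The short Python sketch below reproduces that arithmetic for the metrics in the table; diff_metrics is a hypothetical helper, not part of DVC. Because the inputs here are the rounded values printed above, the last digits of a computed change can differ slightly from the table, which was presumably computed from unrounded values.

```python
def diff_metrics(head, workspace):
    """Return {metric: (head, workspace, change)} for metrics present in both runs."""
    return {
        name: (head[name], workspace[name], workspace[name] - head[name])
        for name in head
        if name in workspace
    }


if __name__ == "__main__":
    # Rounded values copied from the `dvc exp diff` table.
    head = {
        "answer_correctness": 0.049482,
        "answer_similarity": 0.19793,
        "context_recall": 0.125,
        "faithfulness": 0.75,
    }
    workspace = {
        "answer_correctness": 0.043685,
        "answer_similarity": 0.17474,
        "context_recall": 0,
        "faithfulness": 0.69375,
    }
    for name, (old, new, change) in diff_metrics(head, workspace).items():
        print(f"{name:<20} {old:<10g} {new:<10g} {change:+g}")
```

A negative change means the metric dropped in the workspace run; in this example every metric is lower with hp.chunk-size set to 1000 than with the baseline of 300.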

It is also possible to compare the results of all experiments:

dvc exp show --only-changed

Experiments can be removed with the command below. The -A flag removes all experiments; individual experiments can be removed by passing their name or ID instead:

dvc exp remove -A

Experiment Runner

The repository includes a simple shell script that can be used as an experiment runner to test various different models:

./run-experiments.sh

This will run the DVC pipeline with several different LLM models (check the shell script for details) and save the results as experiments.

An experiment for each model defined will be queued and run in the background. To check the status of the experiments:

dvc queue status

To check the output of an experiment that is currently running, use:

dvc queue logs $EXPERIMENT_NAME
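
The contents of run-experiments.sh are not reproduced here, so the following is only a hedged Python sketch of what a queue-based runner of this kind might look like. The model names come from the Requirements section, but the parameter name llm is a hypothetical placeholder (check params.yaml for the real key), and the helper names are illustrative. By default the sketch only prints the commands rather than executing them.

```python
import subprocess

# Model names from the Requirements section. The parameter name "llm" is a
# hypothetical placeholder; the real key is whatever params.yaml defines.
MODELS = ["llama3.1", "mistral-nemo"]
PARAM = "llm"


def queue_commands(models, param=PARAM):
    """Build one `dvc exp run --queue` command per model, plus the start command."""
    commands = [
        ["dvc", "exp", "run", "--queue", "-S", f"{param}={model}"]
        for model in models
    ]
    commands.append(["dvc", "queue", "start"])
    return commands


def run_all(commands, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(queue_commands(MODELS))  # dry run: prints the commands only
```

Calling run_all(queue_commands(MODELS), dry_run=False) inside the repository would actually queue and start the experiments, after which the status and log commands above apply.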

Other Notes

DVC and CML

Notes on the use of Data Version Control and Continuous Machine Learning:

DVC
CML

vLLM

Notes on running models with vLLM:

vLLM
