This repo provides an open-source framework for running evaluations and benchmarks of LLM and RAG systems.
It includes several tools for running iterative tests against your own platform:
- Database server for storing test results
- Abstraction layers for integrating your code into the test framework
- Functions for loading data from a data source, creating a Q/A set, and running tests against your platform
- Customizable evaluations of test results
- Tools for viewing evaluation results
- SQLite server
- Jupyter workbooks for running experiments and viewing results
- Postgres server with pgvector
- Database abstraction
- Examples with local models
- Better batching with progress bars
- API server for database ops
- Front-end for viewing results
- Library of benchmark and evaluation functions
- Profit
To set up the project locally, create a virtual environment and install the dependencies:
python3 -m venv venv
make install
This project provides workbooks that a developer can run locally. They are designed to be run step by step, and you can stop and resume your progress between steps:
- `step_1a_files.ipynb`
- `step_1b_data.ipynb`
- `step_2_test.ipynb`
- `step_3_eval.ipynb`
- `step_4_results.ipynb`
To test your solution, implement the `AbstractGenerator` class from `packages/scripts/src/eval_scripts/generator.py`, then modify `my_generator.py` to import your implementation. This generator will then be run as part of the Response Generation step in `step_2_test.ipynb`.
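
As an illustration, a custom generator in `my_generator.py` might look like the sketch below. The real abstract interface is defined in `packages/scripts/src/eval_scripts/generator.py` and is not reproduced here, so the import path, the `generate` method name and signature, and the `client.answer` call are assumptions made for this example rather than the framework's actual API.

```python
# Hypothetical sketch for my_generator.py. The concrete AbstractGenerator
# interface lives in packages/scripts/src/eval_scripts/generator.py and may
# define different method names or signatures than the `generate` shown here.
from eval_scripts.generator import AbstractGenerator  # module path assumed from the file layout


class MyGenerator(AbstractGenerator):
    """Example generator that wraps your own RAG/LLM pipeline."""

    def __init__(self, client):
        # `client` stands in for whatever handle your platform needs:
        # an SDK client, an HTTP session, a local model, etc.
        self.client = client

    def generate(self, question: str) -> str:
        # Replace this with a call into your retrieval + generation stack.
        # The returned string is the answer text recorded for later evaluation.
        return self.client.answer(question)
```

Whatever the concrete interface turns out to be, the key point is that `my_generator.py` exposes your implementation so that `step_2_test.ipynb` can call into your platform during response generation.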