⚡Quick Start | 🔠Benchmarks | 🤖LLM Generated Code | 📝Citation | 🙏Acknowledgement
EvoEval¹ is a holistic benchmark suite created by evolving HumanEval problems:
- 🔥 Contains 828 new problems across 5 🌠 semantic-altering and 2 ⭐ semantic-preserving benchmarks
- 🔮 Allows evaluation/comparison across different dimensions and problem types (e.g., Difficult, Creative or Tool Use problems). See our visualization tool for ready-to-use comparison.
- 🏆 Complete with leaderboard, groundtruth solutions, robust testcases and evaluation scripts to easily fit into your evaluation pipeline
- 🤖 Generated LLM code samples from >50 different models to save you time in running experiments
¹ Coincidentally similar in pronunciation to 😈 EvilEval
Check out our 📃 paper and webpage for more details!
Directly install the package:
pip install evoeval --upgrade
⏬ Nightly Version
pip install "git+https://github.com/evo-eval/evoeval.git" --upgrade
⏬ Local Repository
git clone https://github.com/evo-eval/evoeval.git
cd evoeval
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
Now you are ready to download EvoEval benchmarks and perform evaluation!
To download our benchmarks, simply use the following code snippet:
from evoeval.data import get_evo_eval
evoeval_benchmark = "EvoEval_difficult" # you can pick from 7 different benchmarks!
problems = get_evo_eval(evoeval_benchmark)
For code generation and evaluation, we adopt the same style as HumanEval+ and HumanEval.
Implement the GEN_SOLUTION function by calling the LLM to produce the complete solution (including the function header and code), then save the samples to {benchmark}_samples.jsonl:
from evoeval.data import get_evo_eval, write_jsonl
evoeval_benchmark = "EvoEval_difficult"
samples = [
dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
for task_id, problem in get_evo_eval(evoeval_benchmark).items()
]
write_jsonl(f"{evoeval_benchmark}_samples.jsonl", samples)
Tip
EvoEval samples.jsonl expects the solution field to contain the complete code implementation. This is slightly different from the original HumanEval, where the solution field only contains the function body.
If you want to follow the HumanEval setup exactly, check out our 🤗 Huggingface datasets, which can be run directly with the HumanEval evaluation script.
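To make the difference between the two formats concrete, here is a minimal sketch that turns a HumanEval-style completion (function body only) into the complete implementation that the solution field expects. The helper name `to_complete_solution`, the toy prompt, and the `Toy/0` task id are illustrative assumptions and are not part of the evoeval package.

```python
def to_complete_solution(prompt: str, body: str) -> str:
    """Join a prompt (signature + docstring) with a generated function body.

    Hypothetical helper for illustration; not part of evoeval.
    """
    # Make sure the body starts on its own line after the prompt.
    if not prompt.endswith("\n"):
        prompt += "\n"
    return prompt + body


# Toy prompt and body, invented for this example.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
body = "    return a + b\n"

solution = to_complete_solution(prompt, body)
sample = {"task_id": "Toy/0", "solution": solution}

# The complete solution is valid Python on its own, unlike the bare body.
namespace = {}
exec(solution, namespace)
print(namespace["add"](2, 3))
```

If your model already returns the full function, no such step is needed and the output can go into the solution field as-is.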
You can use our provided docker image:
docker run --rm -v $(pwd):/app evoeval/evoeval:latest --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl
Or run it locally:
evoeval.evaluate --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl
Or if you are using it as a local repository:
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evoeval/evaluate.py --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl
You should expect to see the following output (when evaluated on GPT-4):
Computing expected output...
Expected outputs computed in 11.24s
Reading samples...
100it [00:00, 164.16it/s]
100%|████████████████████████████████████████████████████████████████| 100/100 [00:07<00:00, 12.77it/s]
EvoEval_difficult
pass@1: 0.520 # for reference GPT-4 solves more than 80% of problems in HumanEval
This shows the pass@1 score for the EvoEval_difficult benchmark. You can use --i-just-wanna-run to recompute the evaluation result.
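For readers unfamiliar with the metric, the sketch below shows the unbiased pass@k estimator popularized by HumanEval, and how it reduces to the fraction of solved problems when there is one greedy sample per problem. Whether the evaluation script computes the score in exactly this way is an assumption; the function name `pass_at_k` and the toy results are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n: samples generated, c: samples passing all tests, k: sample budget.
    """
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With a single greedy sample per problem (n = 1), pass@1 is simply
# the fraction of problems whose sample passes.
results = [True, False, True, True]  # toy per-problem outcomes (assumed)
score = sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)
print(f"pass@1: {score:.3f}")
```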
Note
You can also evaluate LLM solutions in a folder format, where each subfolder contains the LLM solution for one problem in the benchmark.
For example, you can grab the GPT-4 solutions in our v0.1.0 release. After unzipping, you can run the following command:
evoeval.evaluate --dataset EvoEval_difficult --samples gpt-4_temp_0.0/EvoEval_difficult
to obtain the same result as above using .jsonl
EvoEval contains 7 different benchmarks, each with a unique set of problems evolved from the original HumanEval problems. 🌠 denotes semantic-altering benchmarks, while ⭐ denotes semantic-preserving benchmarks:
🌠EvoEval_difficult:
Introduce complexity by adding constraints and requirements, replacing commonly used requirements with less common ones, or adding reasoning steps to the original problem.
🌠EvoEval_creative:
Generate a more creative problem compared to the original through the use of stories or uncommon narratives.
🌠EvoEval_subtle:
Make a subtle and minor change to the original problem such as inverting or replacing a requirement.
🌠EvoEval_combine:
Combine two different problems by integrating the concepts from both problems. To select problems that make sense to combine, we apply a simple heuristic: only problems of the same type are combined, where the type is determined by the types of the input arguments in the original problem.
🌠EvoEval_tool_use:
Produce a new problem containing a main problem and one or more helper functions which can be used to solve it. Each helper function is fully implemented and provides hints or useful functionality for solving the main problem. The main problem does not explicitly reference individual helper functions, and we do not require the model to use the provided helpers.
⭐EvoEval_verbose:
Reword the original docstring to be more verbose. These verbose docstrings can use more descriptive language to illustrate the problem, include detailed explanations of the example output, and provide additional hints.
⭐EvoEval_concise:
Reword the original docstring to be more concise by removing unnecessary details and using concise language. Furthermore, simple examples that are not required to demonstrate edge cases may be removed.
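The same-type heuristic described for EvoEval_combine can be pictured as grouping problems by the annotated types of their input arguments. The sketch below is one possible reading of that heuristic, not the authors' implementation: the function `input_type_signature`, the toy prompts, and the use of argument annotations as the "type" are all assumptions. It requires Python 3.9+ for `ast.unparse`.

```python
import ast
from collections import defaultdict


def input_type_signature(prompt: str) -> tuple:
    """Return the annotated argument types of the first function in a prompt.

    Hypothetical helper; unannotated arguments are reported as "Any".
    """
    tree = ast.parse(prompt)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    return tuple(
        ast.unparse(arg.annotation) if arg.annotation else "Any"
        for arg in func.args.args
    )


# Toy prompts, invented for this example.
prompts = {
    "Toy/0": 'def total(numbers: List[int]) -> int:\n    """Sum the list."""\n',
    "Toy/1": 'def largest(numbers: List[int]) -> int:\n    """Return the maximum."""\n',
    "Toy/2": 'def shout(text: str) -> str:\n    """Upper-case the text."""\n',
}

# Problems that share an input signature are candidates for combination.
groups = defaultdict(list)
for task_id, prompt in prompts.items():
    groups[input_type_signature(prompt)].append(task_id)

for signature, task_ids in groups.items():
    print(signature, task_ids)
```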
For each problem in each EvoEval benchmark, we include the complete groundtruth as well as test cases for functional evaluation.
Note
Problem Structure
{
"task_id": "identifier string for the task",
"entry_point": "name of the function",
"prompt": "function signature with docstring",
"canonical_solution": "groundtruth implementation",
"inputs": "test inputs for each problem",
"parent": "original HumanEval problem it evolved from",
"main": "special field of EvoEval_tool_use to show just the main problem description",
"helpers": "special field of EvoEval_tool_use to show the helper functions"
}
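To illustrate how these fields fit together, here is a rough sketch of a functional check that compares a candidate solution against the groundtruth on each test input. It is not the evoeval evaluation script. It assumes that `inputs` is a list of argument lists and that the groundtruth code is `prompt + canonical_solution`; both are assumptions about the data layout, and the toy record below is invented. It also runs code in-process with no sandboxing or timeouts, so it is only suitable for trusted code.

```python
def run_function(code: str, entry_point: str, args: list):
    """Execute `code` in a fresh namespace and call `entry_point` with `args`."""
    namespace = {}
    exec(code, namespace)
    return namespace[entry_point](*args)


def passes(problem: dict, candidate: str) -> bool:
    """Return True if the candidate matches the groundtruth on every input."""
    # Assumption: groundtruth = prompt (signature + docstring) + body.
    groundtruth = problem["prompt"] + problem["canonical_solution"]
    for args in problem["inputs"]:
        expected = run_function(groundtruth, problem["entry_point"], args)
        try:
            actual = run_function(candidate, problem["entry_point"], args)
        except Exception:
            return False  # a crashing candidate fails the problem
        if actual != expected:
            return False
    return True


# Toy record following the structure above (values invented for illustration).
problem = {
    "task_id": "Toy/0",
    "entry_point": "add",
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "inputs": [[1, 2], [0, 0], [-1, 5]],
}

good = "def add(a, b):\n    return b + a\n"
bad = "def add(a, b):\n    return a - b\n"
print(passes(problem, good), passes(problem, bad))
```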
To view the performance of >50 LLMs on the EvoEval benchmarks, we provide a complete leaderboard as well as a visualization tool to compare the performance of different models.
Further, we also provide all code samples from LLMs on the EvoEval benchmarks:
- See the attachment of our v0.1.0 release.
Each LLM generation is packaged in a zip file named like {model_name}_temp_0.0.zip. You can unzip the folder and obtain the LLM generation for each of our 7 benchmarks + the original HumanEval problems. Note that we only evaluate the greedy output for each LLM.
@article{evoeval,
author = {Xia, Chunqiu Steven and Deng, Yinlin and Zhang, Lingming},
title = {Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM},
year = {2024},
journal = {arXiv preprint},
}
Note
The first two authors contributed equally to this work, with author order determined via Nigiri