| Version | HF repo ID | Splits | Subsets | Metric | Codebase | License |
|---|---|---|---|---|---|---|
| v0.1.3 | bigcode/bigcodebench | full/hard | complete/instruct | pass@k | GitHub | Apache 2.0 |
## Table of Contents
- Evaluate with a Docker container
- Evaluate on your machine (not recommended)
- Citation
- Acknowledgements
## Evaluate with a Docker container

To execute the generated code safely, you can isolate the execution inside a Docker container. A ready-to-use Docker image is available here; the Dockerfile used to build the image can be found here.

First, specify your command. For example, you can use the vLLM backend for generation:
CMD="python -m eval.eval \
--model vllm \
--tasks BigCodeBench \
--model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
--batch_size auto"
Then you can run the evaluation inside the container:
```bash
docker run --gpus all \
    -v $(pwd):/app -t marianna13/evalchemy:latest \
    $CMD
```
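Since the model weights are downloaded inside the container, you may also want to mount your local Hugging Face cache and pass an access token for gated models. The paths and variables below are assumptions (default cache location, root user in the image), not part of the documented setup:

```bash
# Assumed convenience flags: reuse the host's HF cache and forward a token for
# gated models; adjust paths if the image uses a non-root user or a custom HF_HOME.
docker run --gpus all \
    -v $(pwd):/app \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=$HF_TOKEN \
    -t marianna13/evalchemy:latest \
    $CMD
```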
## Evaluate on your machine (not recommended)

🚨 Warning: proceed with caution. The generated code will be executed directly on your machine without isolation.
Install all dependencies:

```bash
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-eval.txt
```
Then run the evaluation:
```bash
python -m eval.eval \
    --model vllm \
    --tasks BigCodeBench \
    --model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
    --batch_size auto
```
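The reported metric is pass@k (see the table above). For intuition, here is a minimal sketch of the standard unbiased estimator from the HumanEval paper, where `n` is the number of completions sampled per task and `c` the number that pass the unit tests; it is illustrative only and not the scoring code used by the harness:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: total completions sampled for a task
    c: completions that pass the task's unit tests
    k: budget of attempts being scored
    """
    if n - c < k:
        # Every size-k subset of samples contains at least one passing completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct -> pass@1 = 0.30, pass@5 ≈ 0.917
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```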
## Citation

```bibtex
@article{zhuo2024bigcodebench,
  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
  author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
  journal={arXiv preprint arXiv:2406.15877},
  year={2024}
}
```
## Acknowledgements

Thanks to the wonderful BigCode team for making their benchmark and code publicly available! 🙏