Skip to content

mlfoundations/evalchemy

Repository files navigation

πŸ§ͺ Evalchemy

A framework for gold standard language model evaluations

alt text

Evalchemy is a unified and easy-to-use toolkit for evaluating language models, focussing on post-trained models. Evalchemy is developed by the DataComp community and Bespoke Labs and builds on the LM-Eval-Harness to provide a unified, easy-to-use platform for language model evaluation. Evalchemy integrates multiple existing benchmarks, such as RepoBench, AlpacaEval, and ZeroEval. We've streamlined the process by:

Key Features

  • Unified Installation: One-step setup for all benchmarks, eliminating dependency conflicts
  • Parallel Evaluation:
    • Data-Parallel: Distribute evaluations across multiple GPUs for faster results
    • Model-Parallel: Handle large models that don't fit on a single GPU
  • Simplified Usage: Run any benchmark with a consistent command-line interface
  • Results Management:
    • Local results tracking with standardized output format
    • Optional database integration for systematic tracking
    • Leaderboard submission capability (requires database setup)

⚑ Quick Start

Installation

We suggest using conda (installation instructions).

# Create and activate conda environment
conda create --name evalchemy python=3.10
conda activate evalchemy

# Clone the repo
git clone [email protected]:mlfoundations/evalchemy.git   
cd evalchemy

# Install dependencies
pip install -e ".[eval]"
pip install -e eval/chat_benchmarks/alpaca_eval

# Log into HuggingFace for datasets and models.
huggingface-cli login

πŸ“š Available Tasks

Built-in Benchmarks

We have recorded reproduced results against published numbers for these benchmarks in reproduced_benchmarks.md.

Basic Usage

Make sure your OPENAI_API_KEY is set in your environment before running evaluations.

python -m eval.eval \
    --model hf \
    --tasks HumanEval,mmlu \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --batch_size 2 \
    --output_path logs

The results will be written out in output_path. If you have jq installed, you can view the results easily after evaluation. Example: jq '.results' logs/Qwen__Qwen2.5-7B-Instruct/results_2024-11-17T17-12-28.668908.json

Args:

  • --model: Which model type or provider is evaluated (example: hf)
  • --tasks: Comma-separated list of tasks to be evaluated.
  • --model_args: Model path and parameters. Comma-separated list of parameters passed to the model constructor. Accepts a string of the format "arg1=val1,arg2=val2,...". You can find the list supported arguments here.
  • --batch_size: Batch size for inference
  • --output_path: Directory to save evaluation results

Example running multiple benchmarks:

python -m eval.eval \
    --model hf \
    --tasks MTBench,WildBench,alpaca_eval \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --batch_size 2 \
    --output_path logs

Config shortcuts:

To be able to reuse commonly used settings without having to manually supply full arguments every time, we support reading eval configs from YAML files. These configs replace the --batch_size, --tasks, and --annoator_model arguments. Some example config files can be found in ./configs. To use these configs, you can use the --config flag as shown below:

python -m eval.eval \
    --model hf \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --output_path logs \
    --config configs/light_gpt4omini0718.yaml

We add several more command examples in eval/examples to help you start using Evalchemy.

πŸ”§ Advanced Usage

Support for different models

Through LM-Eval-Harness, we support all HuggingFace models and are currently adding support for all LM-Eval-Harness models, such as OpenAI and VLLM. For more information on such models, please check out the models page.

To choose a model, simply set 'pretrained=' where the model name can either be a HuggingFace model name or a path to a local model.

Coming Soon

  • Support for vLLM models
  • Support for OpenAI
  • Few-shot prompting for instruction benchmarks.

Multi-GPU Evaluation

For faster evaluation using data parallelism (recommended):

accelerate launch --num-processes <num-gpus> --num-machines <num-nodes> \
    --multi-gpu -m eval.eval \
    --model hf \
    --tasks MTBench,alpaca_eval \
    --model_args 'pretrained=mistralai/Mistral-7B-Instruct-v0.3' \
    --batch_size 2 \
    --output_path logs

Large Model Evaluation

For models that don't fit on a single GPU, use model parallelism:

python -m eval.eval \
    --model hf \
    --tasks MTBench,alpaca_eval \
    --model_args 'pretrained=mistralai/Mistral-7B-Instruct-v0.3,parallelize=True' \
    --batch_size 2 \
    --output_path logs

πŸ’‘ Note: While "auto" batch size is supported, we recommend manually tuning the batch size for optimal performance. The optimal batch size depends on the model size, GPU memory, and the specific benchmark. We used a maximum of 32 and a minimum of 4 (for RepoBench) to evaluate Llama-3-8B-Instruct on 8xH100 GPUs.

Output Log Structure

Our generated logs include critical information about each evaluation to help inform your experiments. We highlight important items in our generated logs.

  • Model Configuration

    • model: Model framework used
    • model_args: Model arguments for the model framework
    • batch_size: Size of processing batches
    • device: Computing device specification
    • annotator_model: Model used for annotation ("gpt-4o-mini-2024-07-18")
  • Seed Configuration

    • random_seed: General random seed
    • numpy_seed: NumPy-specific seed
    • torch_seed: PyTorch-specific seed
    • fewshot_seed: Seed for few-shot examples
  • Model Details

    • model_num_parameters: Number of model parameters
    • model_dtype: Model data type
    • model_revision: Model version
    • model_sha: Model commit hash
  • Version Control

    • git_hash: Repository commit hash
    • date: Unix timestamp of evaluation
    • transformers_version: Hugging Face Transformers version
  • Tokenizer Configuration

    • tokenizer_pad_token: Padding token details
    • tokenizer_eos_token: End of sequence token
    • tokenizer_bos_token: Beginning of sequence token
    • eot_token_id: End of text token ID
    • max_length: Maximum sequence length
  • Model Settings

    • model_source: Model source platform
    • model_name: Full model identifier
    • model_name_sanitized: Sanitized model name for file system usage
    • chat_template: Conversation template
    • chat_template_sha: Template hash
  • Timing Information

    • start_time: Evaluation start timestamp
    • end_time: Evaluation end timestamp
    • total_evaluation_time_seconds: Total duration
  • Hardware Environment

    • PyTorch version and build configuration
    • Operating system details
    • GPU configuration
    • CPU specifications
    • CUDA and driver versions
    • Relevant library versions

Customizing Evaluation

πŸ€– Change Annotator Model

As part of Evalchemy, we want to make swapping in different Language Model Judges for standard benchmarks easy. Currently, we support two judge settings. The first is the default setting, where we use a benchmark's default judge. To activate this, you can either do nothing or pass in

--annotator_model auto

In addition to the default assignments, we support using gpt-4o-mini-2024-07-18 as a judge:

--annotator_model gpt-4o-mini-2024-07-18

We are planning on adding support for different judges in the future!

⏱️ Runtime and Cost Analysis

Evalchemy makes running common benchmarks simple, fast, and versatile! We list the speeds and costs for each benchmark we achieve with Evalchemy for Meta-Llama-3-8B-Instruct on 8xH100 GPUs.

Benchmark Runtime (8xH100) Batch Size Total Tokens Default Judge Cost ($) GPT-4o-mini Judge Cost ($) Notes
MTBench 14:00 32 ~196K 6.40 0.05
WildBench 38:00 32 ~2.2M 30.00 0.43 Using GPT-4-mini judge
RepoBench 46:00 4 - - - Lower batch size due to memory
MixEval 13:00 32 ~4-6M 3.36 0.76 Varies by judge model
AlpacaEval 16:00 32 ~936K 9.40 0.14
HumanEval 4:00 32 - - - No API costs
IFEval 1:30 32 - - - No API costs
ZeroEval 1:44:00 32 - - - Longest runtime
MBPP 6:00 32 - - - No API costs
MMLU 7:00 32 - - - No API costs
ARC 4:00 32 - - - No API costs
DROP 20:00 32 - - - No API costs

Notes:

  • Runtimes measured using 8x H100 GPUs with Meta-Llama-3-8B-Instruct model
  • Batch sizes optimized for memory and speed
  • API costs vary based on judge model choice

Cost-Saving Tips:

  • Use gpt-4o-mini-2024-07-18 judge when possible for significant cost savings
  • Adjust batch size based on available memory
  • Consider using data-parallel evaluation for faster results

πŸ” Special Access Requirements

ZeroEval Access

To run ZeroEval benchmarks, you need to:

  1. Request access to the ZebraLogicBench-private dataset on Hugging Face
  2. Accept the terms and conditions
  3. Log in to your Hugging Face account when running evaluations

πŸ› οΈ Implementing Custom Evaluations

To add a new evaluation system:

  1. Create a new directory under eval/chat_benchmarks/
  2. Implement eval_instruct.py with two required functions:
    • eval_instruct(model): Takes an LM Eval Model, returns results dict
    • evaluate(results): Takes results dictionary, returns evaluation metrics

Adding External Evaluation Repositories

Use git subtree to manage external evaluation code:

# Add external repository
git subtree add --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git main --squash

# Pull updates
git subtree pull --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git main --squash

# Push contributions back
git subtree push --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git contribution-branch

πŸ” Debug Mode

To run evaluations in debug mode, add the --debug flag:

python -m eval.eval \
    --model hf \
    --tasks MTBench \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --batch_size 2 \
    --output_path logs \
    --debug

This is particularly useful when testing new evaluation implementations, debugging model configurations, verifying dataset access, and testing database connectivity.

πŸš€ Performance Tips

  1. Utilize batch processing for faster evaluation:
all_instances.append(
    Instance(
        "generate_until",
        example,
        (
            inputs,
            {
                "max_new_tokens": 1024,
                "do_sample": False,
            },
        ),
        idx,
    )
)

outputs = self.compute(model, all_instances)
  1. Use the LM-eval logger for consistent logging across evaluations

πŸ”§ Troubleshooting

Evalchemy has been tested on CUDA 12.4. If you run into issues like this: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12, try updating your CUDA version:

wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4

πŸ† Leaderboard Integration

To track experiments and evaluations, we support logging results to a PostgreSQL database. Details on the entry schemas and database setup can be found in database/.

Contributing

Thank you to all the contributors for making this project possible! Please follow these instructions on how to contribute.

Citation

If you find Evalchemy useful, please consider citing us!

@software{Evalchemy,
  author = {Guha, Etash and Raoof, Negin and Mercat, Jean and Frankel, Eric and Keh, Sedrick and Grover, Sachin and Smyrnis, George and Vu, Trung and Marten, Ryan and Saad-Falcon, Jon and Choi, Caroline and Arora, Kushal and Merrill, Mike and Deng, Yichuan and Suvarna, Ashima and Bansal, Hritik and Nezhurina, Marianna and Choi, Yejin and Heckel, Reinhard and Oh, Seewong and Hashimoto, Tatsunori and Jitsev, Jenia and Shankar, Vaishaal and Dimakis, Alex and Sathiamoorthy, Mahesh and Schmidt, Ludwig},
  month = nov,
  title = {{Evalchemy}},
  year = {2024}
}