
Multilingual Meta-EVALuation benchmark (MM-Eval)

🤗 MM-Eval | 📄 Paper | 🤗 MMQA

MM-Eval is a multilingual meta-evaluation benchmark consisting of five core subsets (Chat, Reasoning, Safety, Language Hallucination, and Linguistics) covering 18 languages, plus a Language Resource subset spanning 122 languages for a broader analysis of language effects.
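
To inspect the data before running an evaluation, you can load the benchmark with the 🤗 datasets library. The snippet below is a minimal sketch, assuming the benchmark is hosted on the Hugging Face Hub as prometheus-eval/MM-Eval and that each example carries subset and language fields; check the dataset card for the exact split and column names.

# Minimal sketch: inspect MM-Eval with the Hugging Face `datasets` library.
# The column names "subset" and "language" are assumptions; verify them on the dataset card.
from collections import Counter
from datasets import load_dataset

data = load_dataset("prometheus-eval/MM-Eval")  # returns a DatasetDict keyed by split name

for split_name, split in data.items():
    subset_counts = Counter(ex.get("subset", "n/a") for ex in split)
    language_counts = Counter(ex.get("language", "n/a") for ex in split)
    print(split_name, dict(subset_counts))
    print(split_name, f"{len(language_counts)} languages")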

Design Choice
In this work, we minimize the inclusion of translated samples, since mere translation may alter existing preferences through translation errors. Instead, we increase the proportion of linguistically and culturally related instances; as a result, translated samples appear only in the Safety subset. We additionally enrich the dataset with a Linguistics subset designed to evaluate a judge model's ability to accurately comprehend the linguistic characteristics of various languages, and we incorporate hand-crafted, culturally grounded prompts in the Language Hallucination subset. If you are interested, please take a look at MMQA (Multilingual, Multicultural Question Answering).

This is a fork of the RewardBench codebase.

If you use our code in your work, please consider citing both our work and RewardBench. Many thanks to the original authors of RewardBench.

How to Use

You can replicate our experiments by following the process outlined below.

Installation

git clone https://github.com/guijinSON/MM-Eval
cd MM-Eval

pip install -e .

⚠️ Note: pip install reward-bench will not work; install from this repository as shown above.

Evaluating Reward Models

Run the following command to evaluate a reward model on MM-Eval:

python scripts/run_rm.py --model=prometheus-eval/MM-Mistral-7B-RM --custom_dataset_path prometheus-eval/MM-Eval

Ensure your model fits on the GPU. If not, reduce the batch size:

--batch_size 1

For some models, you may also need to add the following flag:

--trust_remote_code
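
For example, a memory-constrained run that combines both options might look like the command below (only pass --trust_remote_code for models that require it):

python scripts/run_rm.py --model=prometheus-eval/MM-Mistral-7B-RM --custom_dataset_path prometheus-eval/MM-Eval --batch_size 1 --trust_remote_code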

Evaluating Proprietary Models

First, export your OpenAI API key:

export OPENAI_API_KEY="{your-api-key}"

Then, run the following command to evaluate on MM-Eval:

python scripts/run_generative.py --model=gpt-4o-mini-2024-07-18 --custom_dataset_path prometheus-eval/MM-Eval

Evaluating Prometheus 2 and Self-Taught Evaluator

For Prometheus 2 and the Self-Taught Evaluator, we use their original implementations instead of the RewardBench codebase. Tutorials for replicating these experiments will be added shortly.

Analysis

Notebooks for replicating the experiments and plots in Section 6 (Analysis) of the paper are available in the analysis folder.

How to Cite

@article{son2024mm,
  title={MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models},
  author={Son, Guijin and Yoon, Dongkeun and Suk, Juyoung and Aula-Blasco, Javier and Aslan, Mano and Kim, Vu Trong and Islam, Shayekh Bin and Prats-Cristi{\`a}, Jaume and Tormo-Ba{\~n}uelos, Luc{\'\i}a and Kim, Seungone},
  journal={arXiv preprint arXiv:2410.17578},
  year={2024}
}
