MM-Eval is a multilingual meta-evaluation benchmark consisting of five core subsets—Chat, Reasoning, Safety, Language Hallucination, and Linguistics—spanning 18 languages and a Language Resource subset spanning 122 languages for a broader analysis of language effects.
Design Choice
In this work, we minimize the inclusion of translated samples, as mere translation may alter existing preferences due to translation errors. Instead, we increase the proportion of linguistically and culturally related instances. Consequently, translated samples are only included in the Safety subset. Additionally, we enrich the dataset with a Linguistics subset designed to evaluate the judge model's ability to comprehend the linguistic characteristics of various languages accurately. Furthermore, we incorporate hand-crafted culturally related prompts in the Language Hallucination subset. If you are interested, please look into MMQA (Multilingual, Multicultural Question Answering).
This is a fork of the RewardBench codebase.
If you use our code in your work, please consider citing both our work and RewardBench. Many thanks to the original authors of RewardBench.
You can replicate our experiments by following the process outlined below.
git clone https://github.com/guijinSON/MM-Eval
cd MM-Eval
pip install -e .
Note that pip install reward-bench will not work; use the editable install above instead.
To evaluate a reward model on MM-Eval, run:
python scripts/run_rm.py --model=prometheus-eval/MM-Mistral-7B-RM --custom_dataset_path prometheus-eval/MM-Eval
Ensure your model fits on the GPU. If not, reduce the batch size:
--batch_size 1
For some models, you may also need to add the following flag:
--trust_remote_code
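As a rough illustration of what a reward-model evaluation on preference pairs measures, the sketch below computes per-subset pairwise accuracy: a pair counts as correct when the chosen response receives a strictly higher score than the rejected one. The record fields (`subset`, `chosen_score`, `rejected_score`) and the helper name `pairwise_accuracy` are hypothetical and do not necessarily match what `scripts/run_rm.py` reads or writes. Counting ties as incorrect is also an assumption.

```python
from collections import defaultdict


def pairwise_accuracy(records):
    """Return {subset: accuracy} over preference pairs.

    A pair is correct when the chosen response scores strictly higher
    than the rejected one (ties count as incorrect; an assumption).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["subset"]] += 1
        if rec["chosen_score"] > rec["rejected_score"]:
            correct[rec["subset"]] += 1
    return {subset: correct[subset] / total[subset] for subset in total}


# Made-up scores for illustration only; these are not MM-Eval results.
example = [
    {"subset": "Chat", "chosen_score": 1.2, "rejected_score": 0.3},
    {"subset": "Chat", "chosen_score": -0.5, "rejected_score": 0.1},
    {"subset": "Safety", "chosen_score": 2.0, "rejected_score": 1.5},
]
print(pairwise_accuracy(example))  # {'Chat': 0.5, 'Safety': 1.0}
```

Grouping by subset rather than reporting a single number keeps a weak subset from being hidden by a strong one.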
To evaluate a generative model through the OpenAI API, first set your OpenAI API key:
export OPENAI_API_KEY="{your-api-key}"
Then run the evaluation on MM-Eval:
python scripts/run_generative.py --model=gpt-4o-mini-2024-07-18 --custom_dataset_path prometheus-eval/MM-Eval
For Prometheus-2 and Self-Taught Evaluator, we use their original implementations instead of the RewardBench codebase. Tutorials to replicate the experiments will be added shortly.
Notebooks for replicating the experiments and plots in Section 6 (Analysis) are in the analysis folder.
@article{son2024mm,
title={MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models},
author={Son, Guijin and Yoon, Dongkeun and Suk, Juyoung and Aula-Blasco, Javier and Aslan, Mano and Kim, Vu Trong and Islam, Shayekh Bin and Prats-Cristi{\`a}, Jaume and Tormo-Ba{\~n}uelos, Luc{\'\i}a and Kim, Seungone},
journal={arXiv preprint arXiv:2410.17578},
year={2024}
}