Multilingual Meta-EVALuation benchmark (MM-Eval)

MM-Eval is a multilingual meta-evaluation benchmark consisting of five core subsets—Chat, Reasoning, Safety, Language Hallucination, and Linguistics—spanning 18 languages and a Language Resource subset spanning 122 languages for a broader analysis of language effects.

Design Choice
In this work, we minimize the inclusion of translated samples, because translation errors can alter the original preference between responses. Instead, we increase the proportion of linguistically and culturally related instances. As a result, translated samples appear only in the Safety subset. We also add a Linguistics subset designed to evaluate how accurately a judge model comprehends the linguistic characteristics of various languages, and we include hand-crafted, culturally related prompts in the Language Hallucination subset. If you are interested, see MMQA (Multilingual, Multicultural Question Answering).

This is a fork of the RewardBench codebase.

If you use our code in your work, please consider citing both our work and RewardBench. Many thanks to the original authors of RewardBench.

How to Use

You can replicate our experiments by following the process outlined below.

Installation

git clone https://github.com/guijinSON/MM-Eval
cd MM-Eval

pip install -e .

⚠️ Installing the upstream package with pip install reward-bench will not work; install from this repository as shown above.

Evaluating Reward Models

Run the following command to evaluate a reward model on MM-Eval:

python scripts/run_rm.py --model=prometheus-eval/MM-Mistral-7B-RM --custom_dataset_path prometheus-eval/MM-Eval

Ensure your model fits on the GPU. If not, reduce the batch size:

--batch_size 1

For some models, you may also need to add the following flag:

--trust_remote_code
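
For context on the number this kind of evaluation reports: a reward model is typically counted as correct on a sample when it scores the chosen response above the rejected one, and accuracy is then averaged per subset. The sketch below illustrates that idea only. It is not the code in scripts/run_rm.py; the function name `pairwise_accuracy`, the field names, and the scores are hypothetical, and treating ties as incorrect is an assumption.

```python
from collections import defaultdict


def pairwise_accuracy(records):
    """Return per-subset accuracy: the share of pairs where chosen outscores rejected."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["subset"]] += 1
        # Assumption: a pair is correct only when the chosen score is strictly higher.
        if rec["score_chosen"] > rec["score_rejected"]:
            correct[rec["subset"]] += 1
    return {subset: correct[subset] / total[subset] for subset in total}


# Made-up scores for illustration only; these are not MM-Eval results.
toy_records = [
    {"subset": "Chat", "score_chosen": 1.8, "score_rejected": 0.4},
    {"subset": "Chat", "score_chosen": -0.2, "score_rejected": 0.9},
    {"subset": "Safety", "score_chosen": 2.1, "score_rejected": 1.0},
]

accuracy = pairwise_accuracy(toy_records)
print(accuracy)  # {'Chat': 0.5, 'Safety': 1.0}
```

The same grouping could be keyed on language instead of subset to compare results across the 18 core languages.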

Evaluating Proprietary Models

First, add your OpenAI API key:

export OPENAI_API_KEY="{your-api-key}"

Then run the following command to evaluate a proprietary model on MM-Eval:

python scripts/run_generative.py --model=gpt-4o-mini-2024-07-18 --custom_dataset_path prometheus-eval/MM-Eval
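
If you launch this command from a script, it can help to fail early when the API key is missing. The following is a minimal sketch, not part of this repository: `build_generative_command` is a hypothetical helper, and the only flags it uses are the ones shown in the command above.

```python
import os
import sys


def build_generative_command(model, dataset="prometheus-eval/MM-Eval", env=None):
    """Return the argv list for scripts/run_generative.py after checking the API key."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "")
    # Reject an unset key or the unfilled "{your-api-key}" placeholder.
    if not key or key == "{your-api-key}":
        raise RuntimeError("Set OPENAI_API_KEY before running scripts/run_generative.py")
    return [
        sys.executable,
        "scripts/run_generative.py",
        f"--model={model}",
        "--custom_dataset_path",
        dataset,
    ]


# A dummy key is passed here so the example runs without a real credential.
command = build_generative_command(
    "gpt-4o-mini-2024-07-18", env={"OPENAI_API_KEY": "sk-example"}
)
print(" ".join(command[1:]))
```

To actually start the evaluation, the returned list can be passed to `subprocess.run(command, check=True)` from the repository root with the real key exported in the environment.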

Evaluating Prometheus 2 and Self-Taught Evaluator

For Prometheus 2 and Self-Taught Evaluator, we use their original implementations instead of the RewardBench codebase. Tutorials for replicating these experiments will be added shortly.

Analysis

Notebooks for replicating the experiments and plots in Section 6 (Analysis) of the paper are in the analysis folder.

How to Cite

@article{son2024mm,
  title={MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models},
  author={Son, Guijin and Yoon, Dongkeun and Suk, Juyoung and Aula-Blasco, Javier and Aslan, Mano and Kim, Vu Trong and Islam, Shayekh Bin and Prats-Cristi{\`a}, Jaume and Tormo-Ba{\~n}uelos, Luc{\'\i}a and Kim, Seungone},
  journal={arXiv preprint arXiv:2410.17578},
  year={2024}
}

