Hello, let me ask a few questions.
Is the evaluation in your paper based on a comparison between the human evaluation dataset provided by Hugging Face (https://huggingface.co/datasets/lmsys/mt_bench_human_judgments) and the outputs of LLMs? If not, would you kindly share the dataset used for the human evaluation in your paper?
If the answer to the first question is yes, I have an additional question. I noticed that the human evaluation was performed on only a subset of the questions, and that the number of evaluators varies across these questions. For example: "question_82_turn1 model1: claude-v1 model2: vicuna-13b-v1.2 judgment: No man" versus "question_82_turn1 model1: vicuna-13b-v1.2 model2: gpt-3.5-turbo judgment: author_4, expert_2, expert_20, expert_24". Did you use the dataset as is, or did you process it before calculating the evaluation scores?
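As an illustration of what I mean by varying evaluator counts, here is a minimal sketch (assuming the dataset's "human" split and the question_id, turn, model_a, model_b, and judge columns as listed on the dataset card; this is only how I inspected the data, not necessarily the processing used in the paper):

```python
# Minimal sketch: count distinct human judges per (question, turn, model pair)
# in lmsys/mt_bench_human_judgments. Split and column names are assumed from
# the dataset card.
from datasets import load_dataset

ds = load_dataset("lmsys/mt_bench_human_judgments", split="human")
df = ds.to_pandas()

judge_counts = (
    df.groupby(["question_id", "turn", "model_a", "model_b"])["judge"]
      .nunique()
      .reset_index(name="num_judges")
)

# Distribution of evaluator counts, plus the least-covered pairs.
print(judge_counts["num_judges"].describe())
print(judge_counts.sort_values("num_judges").head())
```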
The Hugging Face dataset lists "vicuna-13b-v1.2" as one of the evaluated models, but the models retrieved from the FastChat repository include "vicuna-13b-v1.3" rather than "vicuna-13b-v1.2". Which model version was used in your evaluation? If both are correct, could you explain the reason for the version difference?