Inquiry Regarding Your Paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" #3590

Open
daichi-nakatsuji opened this issue Oct 16, 2024 · 0 comments
Hello, I have a few questions.

  1. Is the evaluation in your paper based on a comparison between the human evaluation dataset published on Hugging Face (https://huggingface.co/datasets/lmsys/mt_bench_human_judgments) and the outputs of the LLMs? If not, would you kindly share the dataset used for the human evaluation in your paper?

  2. If the answer to the first question is yes, I have a follow-up question. I noticed that the human evaluation was performed on only a subset of the questions, and that the number of evaluators varies across question and model pairs. For example, for question_82, turn 1, the pair model1: claude-v1 vs. model2: vicuna-13b-v1.2 has no human judgment, while the pair model1: vicuna-13b-v1.2 vs. model2: gpt-3.5-turbo was judged by author_4, expert_2, expert_20, and expert_24 (see the sketch after this list). Did you use the dataset as is, or did you process it before calculating the evaluation scores?

  3. The Hugging Face dataset lists "vicuna-13b-v1.2" as one of the evaluated models, but the models retrieved from the FastChat repository include "vicuna-13b-v1.3" rather than "vicuna-13b-v1.2". Which model version was used in your evaluation? If both are correct, could you explain the reason for the version difference?
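
For reference, here is a minimal sketch (my own, not taken from the paper or the FastChat repository) of how the variation in judge counts mentioned in question 2 can be inspected. It assumes the dataset's published "human" split and field names (question_id, turn, model_a, model_b, judge, winner); please adjust if the schema differs:

```python
from collections import defaultdict

from datasets import load_dataset

# Assumed split name and field names from the dataset card; adjust if the schema differs.
human = load_dataset("lmsys/mt_bench_human_judgments", split="human")

# Count how many distinct human judges voted on each (question, turn, model pair).
judges_per_pair = defaultdict(set)
for row in human:
    pair = tuple(sorted((row["model_a"], row["model_b"])))
    key = (row["question_id"], row["turn"], pair)
    judges_per_pair[key].add(row["judge"])

counts = sorted(len(judges) for judges in judges_per_pair.values())
print("model pairs with at least one human judgment:", len(judges_per_pair))
print("judges per pair, min/max:", counts[0], counts[-1])

# The example from question 2 (question_id assumed to be an integer here):
probe = (82, 1, tuple(sorted(("claude-v1", "vicuna-13b-v1.2"))))
print("question 82, turn 1, claude-v1 vs. vicuna-13b-v1.2 judged by:",
      judges_per_pair.get(probe, set()) or "nobody")
```

A missing key in this grouping means nobody judged that pair, which matches the claude-v1 vs. vicuna-13b-v1.2 example above, while other pairs collect several judges.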
