Hello, let me ask a few questions.
Is the evaluation in your paper based on a comparison between the human evaluation dataset provided by Hugging Face (https://huggingface.co/datasets/lmsys/mt_bench_human_judgments) and the outputs of LLMs? If not, would you kindly share the dataset used for the human evaluation in your paper?
If the answer to the first question is yes, I have an additional question. I noticed that the human evaluation was performed on only a subset of the questions, and that the number of evaluators varies across these questions. For example: "question_82_turn1 model1: claude-v1 model2: vicuna-13b-v1.2 judgment: No man" versus "question_82_turn1 model1: vicuna-13b-v1.2 model2: gpt-3.5-turbo judgment: author_4, expert_2, expert_20, expert_24". Did you use the dataset as is, or did you process it before calculating the evaluation scores?
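As an illustration of what I mean by varying evaluator counts, here is a minimal sketch (assuming the dataset's "human" split and the question_id, turn, model_a, model_b, and judge columns as listed on the dataset card; this is only how I inspected the data, not necessarily the processing used in the paper):

```python
# Minimal sketch: count distinct human judges per (question, turn, model pair)
# in lmsys/mt_bench_human_judgments. Split and column names are assumed from
# the dataset card.
from datasets import load_dataset

ds = load_dataset("lmsys/mt_bench_human_judgments", split="human")
df = ds.to_pandas()

judge_counts = (
    df.groupby(["question_id", "turn", "model_a", "model_b"])["judge"]
      .nunique()
      .reset_index(name="num_judges")
)

# Distribution of evaluator counts, plus the least-covered pairs.
print(judge_counts["num_judges"].describe())
print(judge_counts.sort_values("num_judges").head())
```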
The Hugging Face dataset lists "vicuna-13b-v1.2" as one of the evaluated models, but the models retrieved from the FastChat repository include "vicuna-13b-v1.3" rather than "vicuna-13b-v1.2". Which model version was used in your evaluation? If both are correct, could you explain the reason for the version difference?