about evaluating Simpo-v0.2 by arena-hard #68

Open
jimmy19991222 opened this issue Sep 21, 2024 · 3 comments

Comments

@jimmy19991222

Hi, I tried to evaluate the Llama-3-Instruct-8B-SimPO-v0.2 checkpoint with arena-hard-auto, and I only got

Llama-3-Instruct-8B-SimPO-v0.2 | score: 35.4 | 95% CI: (-3.2, 2.0) | average #tokens: 530

while your paper reports 36.5.

So I am wondering whether my vLLM API server settings are correct:

python3 -m vllm.entrypoints.openai.api_server \
        --model path-to-SimPO-v0.2 \
        --host 0.0.0.0 --port 5001 --served-model-name SimPO-v0.2 \
        --chat-template templates/llama3.jinja
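
For reference, here is a minimal sanity check of that server through its OpenAI-compatible endpoint. This is a sketch assuming the command above is running on port 5001; the api_key value is a placeholder, since vLLM ignores it unless one is configured.

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:5001/v1", api_key="EMPTY")

# Request one short completion from the served model name to confirm
# the endpoint responds and the chat template is applied.
response = client.chat.completions.create(
    model="SimPO-v0.2",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
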
@jimmy19991222
Author

I have checked that there is no '<|eot_id|>' at the end of the generated answers.
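
In case anyone wants to reproduce that check, here is a sketch that scans the generated answers for a trailing '<|eot_id|>'. It assumes arena-hard-auto's model_answer JSONL schema (choices -> turns -> content); the file path is a placeholder for your own run.

import json

# Placeholder path: adjust to your arena-hard-auto answer file.
path = "data/arena-hard-v0.1/model_answer/SimPO-v0.2.jsonl"

bad = 0
with open(path) as f:
    for line in f:
        answer = json.loads(line)
        # Assumed schema: each answer has choices, each with a list of turns.
        for choice in answer.get("choices", []):
            for turn in choice.get("turns", []):
                if turn.get("content", "").rstrip().endswith("<|eot_id|>"):
                    bad += 1
print(f"answers ending with <|eot_id|>: {bad}")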

@jimmy19991222
Author

I found that there was an update to question.jsonl in arena-hard 5 months ago; I don't know if that is the reason: lmarena/arena-hard-auto@d989e6f#diff-9a6dd9530bef3f149817dfb224c99c9d6432597c11a9ce88ffe220ad61c201fb
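
One way to verify whether that commit changed the question set is to diff the question_id fields of the two file versions. A sketch, assuming each JSONL line carries a "question_id" field; the filenames are placeholders for the pre- and post-update copies.

import json

def ids(path):
    # Collect the set of question IDs from one version of the file.
    with open(path) as f:
        return {json.loads(line)["question_id"] for line in f}

old, new = ids("question_old.jsonl"), ids("question_new.jsonl")
print("removed:", sorted(old - new))
print("added:", sorted(new - old))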

@yumeng5
Collaborator

yumeng5 commented Oct 13, 2024

Hi @jimmy19991222

I think your result is reasonably close to our reported one (a difference of ~1 point can probably be attributed to randomness).

Best,
Yu
