about evaluating Simpo-v0.2 by arena-hard #68

Open
jimmy19991222 opened this issue Sep 21, 2024 · 3 comments

Comments

@jimmy19991222

Hi, I tried to evaluate the Llama-3-Instruct-8B-SimPO-v0.2 checkpoint with arena-hard-auto, and I only got

Llama-3-Instruct-8B-SimPO-v0.2 | score: 35.4 | 95% CI: (-3.2, 2.0) | average #tokens: 530

while your paper reports 36.5.

So I am wondering whether my vLLM API server settings are correct:

python3 -m vllm.entrypoints.openai.api_server \
        --model path-to-SimPO-v0.2 \
        --host 0.0.0.0 --port 5001 --served-model-name SimPO-v0.2 \
        --chat-template templates/llama3.jinja
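
For reference, here is a minimal sanity check of that server through its OpenAI-compatible endpoint. This is a sketch assuming the command above is running on port 5001; the api_key value is a placeholder, since vLLM ignores it unless one is configured.

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:5001/v1", api_key="EMPTY")

# Request one short completion from the served model name to confirm
# the endpoint responds and the chat template is applied.
response = client.chat.completions.create(
    model="SimPO-v0.2",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
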
@jimmy19991222
Author

I have checked that there is no '<|eot_id|>' at the end of the generated answers.
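
In case anyone wants to reproduce that check, here is a sketch that scans the generated answers for a trailing '<|eot_id|>'. It assumes arena-hard-auto's model_answer JSONL schema (choices -> turns -> content); the file path is a placeholder for your own run.

import json

# Placeholder path: adjust to your arena-hard-auto answer file.
path = "data/arena-hard-v0.1/model_answer/SimPO-v0.2.jsonl"

bad = 0
with open(path) as f:
    for line in f:
        answer = json.loads(line)
        # Assumed schema: each answer has choices, each with a list of turns.
        for choice in answer.get("choices", []):
            for turn in choice.get("turns", []):
                if turn.get("content", "").rstrip().endswith("<|eot_id|>"):
                    bad += 1
print(f"answers ending with <|eot_id|>: {bad}")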

@jimmy19991222
Author

I found that there was an update to question.jsonl in arena-hard 5 months ago; I don't know if that is the reason: lmarena/arena-hard-auto@d989e6f#diff-9a6dd9530bef3f149817dfb224c99c9d6432597c11a9ce88ffe220ad61c201fb
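
One way to verify whether that commit changed the question set is to diff the question_id fields of the two file versions. A sketch, assuming each JSONL line carries a "question_id" field; the filenames are placeholders for the pre- and post-update copies.

import json

def ids(path):
    # Collect the set of question IDs from one version of the file.
    with open(path) as f:
        return {json.loads(line)["question_id"] for line in f}

old, new = ids("question_old.jsonl"), ids("question_new.jsonl")
print("removed:", sorted(old - new))
print("added:", sorted(new - old))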

@yumeng5
Collaborator

yumeng5 commented Oct 13, 2024

Hi @jimmy19991222

I think your result is reasonably close to our reported one (a difference of ~1 point can probably be attributed to randomness).

Best,
Yu
