Unable to reproduce performance #12
Comments
Try setting --num_train_epochs=6. We've uncommented it in the revised finetune.sh script (it was previously only specified in a comment). You can still stop after the first few epochs.
@angelahzyuan Thanks for your reply, but the performance of the baseline model is not consistent with the results in the paper.
@angelahzyuan I have changed the number of epochs to 6. I tested the model from epoch 1 on
Please check our readme for the following:
@guozhiyao Furthermore, we've retrained the Zephyr-7b-sft-full model with the latest updates from the Alignment Handbook. Our evaluation on ARC indicates an improvement from 57.51 (sft-full) to 60.75 (SPIN-iter0). While different base models and evaluation methods may produce varying results, the overall performance trend remains consistent.
Hello @guozhiyao, I'm one of the authors of the Alignment Handbook. Indeed, we updated some of the configs used to train the SFT model in this PR (huggingface/alignment-handbook#88) because there were some bugs fixed in TRL that changed the learning rate scheduler. If you want to use the original checkpoint we released with the handbook, you can load the model with
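The comment above is truncated, so the exact revision is not shown here. As a minimal sketch, pinning a released revision with transformers typically looks like the snippet below; the revision string is a placeholder, not the commit hash the author refers to.

```python
# Minimal sketch: load the originally released SFT checkpoint by pinning a repo revision.
# The revision value is a PLACEHOLDER, not the actual commit hash referenced above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "alignment-handbook/zephyr-7b-sft-full"
revision = "<original-release-commit-hash>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision, torch_dtype="auto")
```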
One related question for @angelahzyuan - do you happen to know why the GSM8k values on the leaderboard are so different from those shown in your paper? It seems like
Hi, the difference in scores between the leaderboard and our paper is mainly due to the difference in lm-evaluation-harness versions. We used v0.4.0, which produces different evaluation results compared to the older version used by the leaderboard, especially on GSM8k, as we observed.
Great, thanks for the clarification!
Hello @yihedeng9, I noticed that the task names and evaluation metrics in v0.4.0 of lm-evaluation-harness are different from those described on the Open LLM Leaderboard (About -> REPRODUCIBILITY). Could you please provide your evaluation scripts for running the evaluation locally? Thanks!
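For anyone trying to reproduce the numbers locally before the authors share their exact scripts, here is a minimal sketch using lm-evaluation-harness v0.4.0's Python API. It assumes the Open LLM Leaderboard setting for ARC (task arc_challenge, 25-shot, acc_norm reported); the checkpoint path is a placeholder for whichever model you are evaluating.

```python
# Minimal sketch (not the authors' script): score a checkpoint on ARC-Challenge with
# lm-evaluation-harness v0.4.0, using the 25-shot setup the Open LLM Leaderboard describes.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/spin-checkpoint,dtype=bfloat16",  # placeholder checkpoint path
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])  # the leaderboard reports acc_norm for ARC
```

The same run should also be launchable from the command line with python -m lm_eval --model hf --model_args pretrained=...,dtype=bfloat16 --tasks arc_challenge --num_fewshot 25.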
Original issue: I fine-tuned alignment-handbook/zephyr-7b-sft-full with UCLA-AGI/SPIN_iter0 using the default hyper-parameters, and tested the model with HuggingFaceH4/open_llm_leaderboard locally. The result on allenai/ai2_arc is as below, which cannot match the performance in the paper (63.40).