Unable to reproduce performance #12
Comments
Try setting --num_train_epochs=6. We've uncommented it in the revised finetune.sh script (it was previously only specified in a comment). You can still stop after the first few epochs.
@angelahzyuan Thanks for your reply, but the performance of the baseline model is not consistent with the results in the paper.
@angelahzyuan I have changed the number of epochs to 6. I tested the model from epoch 1 on
Please check our readme for the following:
@guozhiyao Furthermore, we've retrained the Zephyr-7b-sft-full model with the latest updates from the Alignment Handbook. Our evaluation on ARC indicates an improvement from 57.51 (sft-full) to 60.75 (SPIN-iter0). While different base models and evaluation methods may produce varying results, the overall performance trend remains consistent.
Hello @guozhiyao, I'm one of the authors of the Alignment Handbook. Indeed, we updated some of the configs used to train the SFT model in this PR (huggingface/alignment-handbook#88) because there were some bugs fixed in TRL that changed the learning rate scheduler. If you want to use the original checkpoint we released with the handbook, you can load the model with
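The comment above is truncated, so the exact revision is not shown here. As a minimal sketch, pinning a released revision with transformers typically looks like the snippet below; the revision string is a placeholder, not the commit hash the author refers to.

```python
# Minimal sketch: load the originally released SFT checkpoint by pinning a repo revision.
# The revision value is a PLACEHOLDER, not the actual commit hash referenced above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "alignment-handbook/zephyr-7b-sft-full"
revision = "<original-release-commit-hash>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision, torch_dtype="auto")
```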
One related question for @angelahzyuan - do you happen to know why the GSM8k values on the leaderboard are so different from those shown in your paper? It seems like
Hi, the difference in scores between the leaderboard and our paper is mainly due to the difference in lm-evaluation-harness versions. We used v0.4.0, which produces different evaluation results compared to the older version used by the leaderboard, especially on GSM8k, as we observed.
Great, thanks for the clarification!
Hello @yihedeng9, I noticed that the task names and evaluation metrics in v0.4.0 of lm-evaluation-harness are different from those described on the Open LLM Leaderboard (About -> REPRODUCIBILITY). Could you please provide your evaluation scripts for running the evaluation locally? Thanks!
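For anyone trying to reproduce the numbers locally before the authors share their exact scripts, here is a minimal sketch using lm-evaluation-harness v0.4.0's Python API. It assumes the Open LLM Leaderboard setting for ARC (task arc_challenge, 25-shot, acc_norm reported); the checkpoint path is a placeholder for whichever model you are evaluating.

```python
# Minimal sketch (not the authors' script): score a checkpoint on ARC-Challenge with
# lm-evaluation-harness v0.4.0, using the 25-shot setup the Open LLM Leaderboard describes.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/spin-checkpoint,dtype=bfloat16",  # placeholder checkpoint path
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])  # the leaderboard reports acc_norm for ARC
```

The same run should also be launchable from the command line with python -m lm_eval --model hf --model_args pretrained=...,dtype=bfloat16 --tasks arc_challenge --num_fewshot 25.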
Original issue: I fine-tuned alignment-handbook/zephyr-7b-sft-full with UCLA-AGI/SPIN_iter0 using the default hyper-parameters, and tested the model with HuggingFaceH4/open_llm_leaderboard locally. The result on allenai/ai2_arc is as below, which cannot match the performance in the paper (63.40).