Hi! I noticed that this method seems to optimize preferences directly on the base model rather than on an SFT model. As far as I know, the base model does not have instruction-following ability, so preference optimization methods such as DPO usually need to undergo SFT training first.
My question is: is the experimental setting in this article reasonable?
We also tried the setting of conducting SFT first, and then doing DPO. Please refer to Table 3 in the Appendix for detailed results regarding this setting.
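For reference, the SFT-then-DPO setting discussed here typically looks like the pipeline below. This is a minimal sketch assuming the Hugging Face TRL library; the base checkpoint name, dataset files, and hyperparameters are placeholders (not the ones used in the paper), and exact trainer argument names vary across TRL versions.

```python
# Sketch of the two-stage SFT -> DPO pipeline, using Hugging Face TRL.
# All paths, dataset files, and hyperparameters below are illustrative placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig

base_name = "meta-llama/Llama-2-7b-hf"  # hypothetical base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_name)

# Stage 1: supervised fine-tuning on instruction data, so the model acquires
# instruction-following ability before preference optimization.
# The dataset is expected to contain a "text" field with formatted examples.
sft_dataset = load_dataset("json", data_files="sft_data.json", split="train")
sft_trainer = SFTTrainer(
    model=AutoModelForCausalLM.from_pretrained(base_name),
    train_dataset=sft_dataset,
    args=SFTConfig(output_dir="sft-checkpoint", num_train_epochs=1),
)
sft_trainer.train()
sft_trainer.save_model("sft-checkpoint")

# Stage 2: DPO on (prompt, chosen, rejected) preference pairs, starting from
# the SFT checkpoint rather than the raw base model.
pref_dataset = load_dataset("json", data_files="preference_pairs.json", split="train")
dpo_trainer = DPOTrainer(
    model=AutoModelForCausalLM.from_pretrained("sft-checkpoint"),
    args=DPOConfig(output_dir="dpo-checkpoint", beta=0.1),
    train_dataset=pref_dataset,
    processing_class=tokenizer,  # older TRL versions use `tokenizer=` instead
)
dpo_trainer.train()
```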