Replies: 2 comments
-
We ran the longest pre-training for SFT-8, but unfortunately more did not mean better in this case - the eval results were pretty bad, see https://tju01.github.io/ilm-eval/. The idea behind adding red_pajama was to continue basic language modelling. It was also used in the 2nd stage in the hope of reducing overfitting, but the effect was not clear and the end result was "mediocre". Just for reference, the old SFT-8 stage-1 dataset config was (bad, don't use):
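(Not the actual config values; just a rough sketch of how the fraction-per-dataset idea plays out when building the sample mix. The dataset names, numbers, and helper below are placeholders, not the SFT-8 stage-1 settings.)

```python
import random

# Placeholder fractions of the final training mix -- NOT the SFT-8 values,
# only an illustration of assigning each dataset a share of the mix.
fractions = {
    "instruction_data": 0.95,
    "red_pajama": 0.05,
}

def build_mix(datasets, fractions, total_size, seed=42):
    """Subsample each dataset to its target share of the final mix."""
    rng = random.Random(seed)
    mix = []
    for name, frac in fractions.items():
        n = int(round(frac * total_size))
        data = datasets[name]
        if n <= len(data):
            mix.extend(rng.sample(data, n))    # sample without replacement
        else:
            mix.extend(rng.choices(data, k=n))  # dataset smaller than its share
    rng.shuffle(mix)
    return mix
```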
-
Thanks for your reply. Also, what is a good fraction of pre-training data to include in the SFT stage? Right now I use 15% pre-training data and 85% instruction-tuning data for SFT. Is there any suggestion for the split between these two parts of the data?
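A minimal sketch of what I mean by that split, assuming both sources are plain lists of examples (the function and names are hypothetical, only the 15% value is my current setting):

```python
import random

def sample_batch(pretrain_data, instruction_data, batch_size,
                 pretrain_frac=0.15, seed=None):
    """Draw a batch where roughly `pretrain_frac` of the examples come from
    the pre-training corpus and the rest from the instruction-tuning data."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        source = pretrain_data if rng.random() < pretrain_frac else instruction_data
        batch.append(rng.choice(source))
    return batch

# e.g. sample_batch(pretrain_examples, instruction_examples, batch_size=32)
```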
-
Hi, I noticed that in the sft-8-datasets config, 5% red_pajama is added during SFT training.
So there are 3 questions I am confused about: