LORA training broken on Mistral Nemo. Massive loss values immediately. #2039
Comments
I guess this is related to issue #2004, but somehow Qwen behaves perfectly fine. I also just tested Llama 3.1 8B and it works perfectly fine.
I believe this will be resolved with transformers 4.46.2, but that version doesn't work with gradient accumulation and FSDP.
@winglian we can wait until 4.47 comes out, right? Hopefully by then they will have incorporated your fix.
Is this broken functionally, or does it still train normally?
I've seen reports that even when the loss values are scaled, it doesn't learn properly, IIRC.
Will just wait for the fix then. :( Thanks.
I've also had massive loss values recently when training older Mistral v0.2 models with high gradient accumulation. It looks like it's not actually my fault but an issue with transformers? Do I understand this correctly? So the solution may be to roll back versions?
There's an upstream transformers fix waiting to be merged, but in the meantime people have reported that 4.46.2 resolves the issue. I'll merge #2064 as a workaround.
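For anyone who wants to try that reported workaround before the upstream fix lands, pinning the package would look something like this (assuming a pip-managed environment; adjust to your own setup):

```bash
# Pin transformers to the version reported to avoid the inflated loss values
pip install "transformers==4.46.2"
```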
Thanks!
Please check that this issue hasn't been reported before.
Expected Behavior
The expected loss value should be around 1.x even at the beginning. This worked fine in older commits of axolotl, but sadly I can't pinpoint which commit changed it, only that it is broken on the recent one.
Current behaviour
The loss value immediately starts at 23-24. Just before training Nemo, I was training Qwen2.5 32B Instruct, which, on the other hand, works perfectly fine.
train/loss is really high:
eval/loss somehow looks alright?
grad/norm also looks really high:
Steps to reproduce
Train Mistral Nemo 12B Instruct using LoRA+. The loss value is immediately wrong. Originally I had RSLoRA and Liger kernels enabled, which worked for Qwen, but the loss was broken for Nemo. So I disabled them so that the config matches what worked for Mistral a few weeks ago, except for the switch to chat_templates. That is the example config I am showing here, and it is still broken. So something is fundamentally broken with Mistral Nemo training now.
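As a rough sketch of the kind of setup described here (not the author's exact config), the relevant axolotl options for a LoRA+ run on Mistral Nemo with chat_template datasets and RSLoRA/Liger disabled would look roughly like this; the model path, dataset path, and all hyperparameter values are hypothetical placeholders, and key names may vary by axolotl version:

```yaml
# Illustrative axolotl-style LoRA+ config sketch; values are placeholders.
base_model: mistralai/Mistral-Nemo-Instruct-2407

adapter: lora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
loraplus_lr_ratio: 16        # LoRA+ learning-rate scaling, if supported by your axolotl version
peft_use_rslora: false       # RSLoRA disabled for this run
# Liger kernel plugin intentionally left out / disabled for this run

chat_template: tokenizer_default   # value depends on your axolotl version
datasets:
  - path: ./data/train.jsonl
    type: chat_template

sequence_len: 8192
sample_packing: true
micro_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1e-4
optimizer: adamw_torch
lr_scheduler: cosine
bf16: true
flash_attention: true
```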
In order to get Mistral to work with the new chat_templates, I also had to modify the chat template in the model's tokenizer_config.json so that it doesn't throw errors when the order of the conversation isn't exactly what Mistral expects. This was never a problem previously with the sharegpt/fastchat method. The chat template in the tokenizer config is changed to this:
Then just run preprocess with the --debug option to tokenize the dataset first, and then run the training.
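For reference, those two steps would look roughly like this with the usual axolotl CLI entry points (the config filename is a placeholder):

```bash
# Tokenize/preprocess the dataset first, with debug output for the chat template
python -m axolotl.cli.preprocess config.yml --debug

# Then launch training
accelerate launch -m axolotl.cli.train config.yml
```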
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
d356740
Acknowledgements