Mistral Nemo 12B training CUDA Out of memory only when enabling EVAL. On 2x3090Ti FSDP. #1813
Please check that this issue hasn't been reported before.
Expected Behavior
Enabling eval should not change memory usage; training should succeed the same way whether eval is enabled or not.
Current behaviour
I can start and train Mistral Nemo 12B just fine, but it crashes with a CUDA out-of-memory error when returning to training after an eval. If I disable eval entirely, training works without issue.
This is the result of training Mistral Nemo 12B Instruct with eval enabled. The error comes up when going back into training after finishing the first eval.
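To narrow down where memory climbs between the eval pass and the resumed training step, a small logging sketch can help. This is a diagnostic aid, not part of the reporter's setup; the helper names `format_bytes` and `log_cuda_memory` are mine, and it assumes PyTorch is installed:

```python
def format_bytes(n: float) -> str:
    # Human-readable size, e.g. 23622320128 -> "22.0 GiB"
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PiB"


def log_cuda_memory(tag: str) -> None:
    """Print allocated vs. reserved CUDA memory; silently no-ops without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated()
        reserved = torch.cuda.memory_reserved()
        print(f"[{tag}] allocated={format_bytes(alloc)} reserved={format_bytes(reserved)}")
```

Calling `log_cuda_memory("before eval")` / `log_cuda_memory("after eval")` (e.g. from a trainer callback) shows whether the eval pass leaves extra blocks reserved on each of the two 3090 Ti GPUs.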
Steps to reproduce
Train Mistral Nemo 12B Instruct using FSDP and LoRA with 8192 context. Then enable evals, and it will fail after the first eval.
I am training Mistral Nemo 12B Instruct with the tokenizer replaced by the one from https://huggingface.co/axolotl-ai-co/Mistral-Nemo-Base-2407-chatml so that there are ChatML tokens for training.
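The reporter's actual config was not captured here; a minimal axolotl config sketching the setup described above might look like the following (model, dataset, and LoRA/FSDP values are illustrative assumptions, not the reporter's settings):

```yaml
# Illustrative sketch only -- values are assumptions, not the reporter's config.
base_model: axolotl-ai-co/Mistral-Nemo-Base-2407-chatml
sequence_len: 8192
adapter: lora
lora_r: 32
lora_alpha: 16
lora_target_linear: true
datasets:
  - path: my_dataset.jsonl      # placeholder dataset
    type: chat_template
val_set_size: 0.05              # setting this to 0 (no eval) avoids the crash
evals_per_epoch: 1
micro_batch_size: 1
gradient_accumulation_steps: 4
bf16: true
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
```

Launched with e.g. `accelerate launch -m axolotl.cli.train config.yml` on the two 3090 Ti GPUs.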
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
78b42a3
Acknowledgements