Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Typically, a run starts at a certain loss, and the loss gradually goes down over training.
Current behaviour
Training the BTLM 3B 8k base model has several issues, including:

- It does not work with `gradient_checkpointing` set to true.
- Train loss stays at 0.0.
- Eval loss is `nan`.
As a result of the first issue, memory usage with `micro_batch_size=1` and a sequence length of 2048 is ~70 GB/GPU on 8x NVIDIA H100s. As a result of the second and third issues, training is useless.

P.S. Flash Attention works. I've tested both with and without it.
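For reference, here is a minimal axolotl config sketch consistent with the setup described above; the dataset entry, optimizer, and other training hyperparameters are placeholders I've assumed, not values from this report:

```yaml
# Sketch of a config matching the reported setup; dataset and training
# hyperparameters below are placeholders, not taken from the original report.
base_model: cerebras/btlm-3b-8k-base
trust_remote_code: true            # BTLM ships custom modeling code on the Hub

datasets:
  - path: my-org/finetune-dataset  # placeholder dataset
    type: alpaca
val_set_size: 0.05

sequence_len: 2048                 # seqlen from the report
micro_batch_size: 1                # as in the report
gradient_accumulation_steps: 1
gradient_checkpointing: true       # reported as not working
flash_attention: true              # reported as working
sample_packing: false              # not yet supported for BTLM (see reply below)

num_epochs: 1
learning_rate: 0.00002
optimizer: adamw_torch
lr_scheduler: cosine
bf16: true
output_dir: ./btlm-out
```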
Steps to reproduce
Possible solution
No idea why this happens; I'm opening this issue so it's brought to attention.
Which Operating Systems are you using?
Python Version
3.10.12
axolotl branch-commit
main/c1921c9acb66c2a8b6542584f62bb02bc543acbf
Acknowledgements
Sample packing is not supported with BTLM yet. Flash Attention support was added for BTLM in #566, which should make supporting packing easier down the line.
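In config terms, that means keeping packing explicitly disabled for BTLM runs for now. A hedged sketch of the relevant flags (values assumed; this does not address the loss issue above):

```yaml
flash_attention: true    # supported for BTLM as of #566
sample_packing: false    # not yet supported for BTLM
```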