Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Typically, a run starts at a certain loss, and the loss gradually goes down over training.
Current behaviour
Training the BTLM 3B 8k base model has several issues, including:

- It does not work with `gradient_checkpointing` set to true.
- Train loss stays at 0.0.
- Eval loss is `nan`.
As a result of the first issue, memory usage with `micro_batch_size=1` and a sequence length of 2048 is ~70 GB/GPU on 8x NVIDIA H100s. As a result of the second and third issues, training is useless.

P.S. Flash Attention works. I've tested both with and without it.
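For reference, here is a minimal axolotl config sketch consistent with the setup described above; the dataset entry, optimizer, and other training hyperparameters are placeholders I've assumed, not values from this report:

```yaml
# Sketch of a config matching the reported setup; dataset and training
# hyperparameters below are placeholders, not taken from the original report.
base_model: cerebras/btlm-3b-8k-base
trust_remote_code: true            # BTLM ships custom modeling code on the Hub

datasets:
  - path: my-org/finetune-dataset  # placeholder dataset
    type: alpaca
val_set_size: 0.05

sequence_len: 2048                 # seqlen from the report
micro_batch_size: 1                # as in the report
gradient_accumulation_steps: 1
gradient_checkpointing: true       # reported as not working
flash_attention: true              # reported as working
sample_packing: false              # not yet supported for BTLM (see reply below)

num_epochs: 1
learning_rate: 0.00002
optimizer: adamw_torch
lr_scheduler: cosine
bf16: true
output_dir: ./btlm-out
```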
Steps to reproduce
Possible solution
No idea why this happens; I'm opening this issue so it's brought to attention.
Which Operating Systems are you using?
Python Version
3.10.12
axolotl branch-commit
main/c1921c9acb66c2a8b6542584f62bb02bc543acbf
Acknowledgements
Sample packing is not supported with BTLM yet. Flash Attention support was added for BTLM in #566, which should make supporting packing easier down the line.
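In config terms, that means keeping packing explicitly disabled for BTLM runs for now. A hedged sketch of the relevant flags (values assumed; this does not address the loss issue above):

```yaml
flash_attention: true    # supported for BTLM as of #566
sample_packing: false    # not yet supported for BTLM
```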