Potential memory leak for axolotl v0.5.2 pretrain streaming datasets with liger kernel #2108
Labels: bug
Please check that this issue hasn't been reported before.
Expected Behavior
I am using axolotl v0.5.2 with liger for Llama 3.2 1B or Llama 3.1 7B continued pretraining. The previous axolotl 0.4 without liger works perfectly with the same parameters, datasets, and GPUs.
Current behaviour
With axolotl v0.5.2 and liger, using the same parameters, datasets, and GPUs, CPU memory keeps increasing until it is exhausted and the training is killed.
Steps to reproduce
Simply run the yaml file for training and wait 2-5 hours; the training will be stopped. I tested both Llama 3.2 1B continued pretraining (2x A40 48 GB) and Llama 3.1 7B continued pretraining (8x H100 90 GB); the results are the same.
Running on the RunPod template: runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
Config yaml
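The original yaml was not preserved in this report. Below is a minimal sketch, assuming the v0.5.x liger plugin keys, of the kind of streaming pretraining setup described above; the dataset path and hyperparameters are placeholders, not the values actually used.

```yaml
# Minimal sketch only -- not the original config. Dataset path and
# hyperparameters are placeholders; liger keys assume the v0.5.x plugin API.
base_model: meta-llama/Llama-3.2-1B

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true

# streaming pretraining corpus -- this is the path where CPU memory keeps growing
pretraining_dataset: my-org/my-pretrain-corpus   # placeholder dataset
sequence_len: 4096
sample_packing: true

micro_batch_size: 1
gradient_accumulation_steps: 8
max_steps: 100000        # streaming datasets need an explicit step count
learning_rate: 2e-5
optimizer: adamw_torch
lr_scheduler: cosine
bf16: auto
flash_attention: true

# typical launch: accelerate launch -m axolotl.cli.train pretrain.yml
```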
Possible solution
Switching from dataset streaming to downloading the dataset keeps CPU memory from increasing: change pretraining_dataset: to datasets: in the config (see the sketch below).
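As a hedged illustration of that workaround (placeholder dataset path and type, not the reporter's actual config), the change amounts to:

```yaml
# Workaround sketch: replace the streaming key
#   pretraining_dataset: my-org/my-pretrain-corpus
# with a regular downloaded dataset entry
datasets:
  - path: my-org/my-pretrain-corpus   # placeholder dataset
    type: completion                  # raw-text continued pretraining
```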
Which Operating Systems are you using?
Python Version
4.10
axolotl branch-commit
main
Acknowledgements