Hello,

I followed the blog post https://zenn.dev/selllous/articles/retnet_tutorial shared in #52 in order to train RetNet, and it seems to work well for small models (< 3B). But I am unable to train retnet_3b without running into memory issues. For now I just want to make it run, but even with a very small batch size and max-tokens I run into issues.

It seems like the backward pass always runs out of memory: the call to optimizer.step() in fairseq_task.py (line 498) exits with:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.96 GiB. GPU 0 has a total capacty of 79.15 GiB of which 7.20 GiB is free. Process 75705 has 71.94 GiB memory in use. Of the allocated memory 65.77 GiB is allocated by PyTorch, and 3.41 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
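For reference, the hint at the end of that message refers to the PYTORCH_CUDA_ALLOC_CONF environment variable. Below is a minimal sketch of trying it, assuming the variable is set before CUDA is first initialized; the 512 MiB split size is only an illustrative guess, not a value taken from this repo.

```python
import os

# Assumption: this must run before the first CUDA allocation, i.e. before the
# training script (e.g. fairseq-train) touches the GPU. The 512 MiB value is
# only an example; it caps the size of cached blocks the allocator will split,
# which can reduce fragmentation at some cost in allocation speed.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after setting the variable so the caching allocator picks it up

print(torch.cuda.is_available())
```

In practice it is usually simpler to export the variable in the shell that launches training, which has the same effect.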
What would you recommend for training a model of this size? Is there a way to train it on one or more A100 GPUs with 80 GiB of memory?

I understand that I might need to partition the model across multiple GPUs, but I am very unfamiliar with this, and any help would be appreciated.
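As background on what "partitioning" can look like in plain PyTorch, here is a generic sketch using FullyShardedDataParallel, which shards parameters, gradients, and optimizer state across GPUs so no single device has to hold the full 3B-parameter state. This is not necessarily the mechanism fairseq uses here, and the placeholder model and hyperparameters are assumptions.

```python
# Generic FSDP sketch: one process per GPU, launched e.g. with `torchrun --nproc_per_node=8 train.py`.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())  # single-node assumption: rank == local GPU index

    model = torch.nn.Transformer().cuda()   # placeholder module; substitute the actual RetNet model
    model = FSDP(model)                     # shards parameters, gradients, and optimizer state across ranks

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # ... usual training loop: forward, loss.backward(), optimizer.step() ...

if __name__ == "__main__":
    main()
```

Fairseq exposes its own distributed-training options, so the sketch above is only meant to illustrate the concept of sharding a model across devices.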