
Gemini badcase #40

Closed
feifeibear opened this issue Apr 20, 2022 · 5 comments

Comments

@feifeibear
Contributor

feifeibear commented Apr 20, 2022

See PR #41.
The launch script:

env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=./configs/palm_8b_zero_gemini_badcase.py
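
For reference, the only knob that differs between the failing and passing runs is the tensor placement policy in the ZeRO/Gemini part of the config. A minimal sketch of what that section of palm_8b_zero_gemini_badcase.py looks like is below; the exact field names and values here are assumptions for illustration, only the 'auto' vs 'cpu' switch is the point:

# Sketch of the ZeRO/Gemini section of the config. Field names and values are
# assumptions, not copied from the real palm_8b_zero_gemini_badcase.py.
zero = dict(
    model_config=dict(
        tensor_placement_policy='auto',   # failing case; 'cpu' makes the run pass
    ),
    optimizer_config=dict(
        gpu_margin_mem_ratio=0.8,         # fraction of the CUDA margin the optimizer may use
    ),
)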

When using 'auto', the run fails in the first iteration's backward pass with the following error log:

[screenshot of the error log]

However, the run passes when using 'cpu'.
This indicates that our Gemini is not robust enough.

feifeibear pinned this issue Apr 20, 2022
@feifeibear
Contributor Author

I suspect the program failed in HybridAdam. ColossalAI wrongly estimated the margin space for the optimizer states (OS) when using Gemini.
@ver217

@feifeibear
Contributor Author

At the first Adam step, we found 3.411 GB of margin memory space on CUDA.
Therefore we moved 3.411 GB of data from CPU to CUDA.

[screenshot of the memory stats]
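
To spell out the suspected failure mode: the estimate above treats the 3.411 GB margin as freely usable, but after fragmentation those free bytes are not one contiguous region, so moving that much OS data from CPU to CUDA can still hit an out-of-memory error. An illustrative sketch of that kind of margin-based move is below (not the actual ColossalAI/Gemini code; the function names are mine):

import torch

# Illustrative only: a margin-based optimizer-state (OS) move, to show why a
# byte-count margin can overshoot. This is not the real Gemini implementation.
def cuda_margin_bytes(device=0):
    free, _total = torch.cuda.mem_get_info(device)           # free bytes seen by the driver
    cached = torch.cuda.memory_reserved(device) - torch.cuda.memory_allocated(device)
    return free + cached                                      # "margin" = bytes not used by model data

def move_os_to_cuda(os_chunks, margin):
    """Greedily move CPU optimizer-state chunks until the margin is filled."""
    moved = 0
    for chunk in os_chunks:                                   # chunk: a CPU tensor holding part of the OS
        nbytes = chunk.numel() * chunk.element_size()
        if moved + nbytes > margin:
            break
        chunk.data = chunk.cuda()                             # can still OOM: the margin is a byte count,
        moved += nbytes                                       # not a guarantee of contiguous free blocks
    return moved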

@ver217
Member

ver217 commented Apr 20, 2022

I solved this issue by adding an environment variable: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:1024.
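
For reference, this is the same launch command as above with the allocator option prepended; max_split_size_mb stops the caching allocator from splitting blocks larger than the given size, which trades some cache reuse for less fragmentation:

env OMP_NUM_THREADS=12 PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:1024 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=./configs/palm_8b_zero_gemini_badcase.py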

@ver217
Member

ver217 commented Apr 21, 2022

Adding PYTORCH_NO_CUDA_MEMORY_CACHING=1 can also work.
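
Same launch command with the other workaround; note that PYTORCH_NO_CUDA_MEMORY_CACHING=1 disables PyTorch's caching allocator entirely, so it avoids the fragmentation at a noticeable speed cost and is mainly useful as a diagnostic:

env OMP_NUM_THREADS=12 PYTORCH_NO_CUDA_MEMORY_CACHING=1 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=./configs/palm_8b_zero_gemini_badcase.py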

@ver217
Member

ver217 commented Apr 21, 2022

I think memory fragmentation causes this issue, rather than Gemini itself.
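
One way to check this is to dump the allocator statistics right before the failing allocation: a large gap between reserved and allocated bytes means the memory is there but split into blocks too small for the request. A minimal sketch using standard PyTorch APIs (nothing ColossalAI-specific):

import torch

def report_fragmentation(device=0):
    """Print caching-allocator stats that hint at fragmentation."""
    allocated = torch.cuda.memory_allocated(device)   # bytes held by live tensors
    reserved = torch.cuda.memory_reserved(device)     # bytes held by the caching allocator
    print(f"allocated: {allocated / 2**30:.3f} GB")
    print(f"reserved:  {reserved / 2**30:.3f} GB")
    print(f"cached but unused: {(reserved - allocated) / 2**30:.3f} GB")
    # memory_summary() further breaks the cached space down by block size.
    print(torch.cuda.memory_summary(device, abbreviated=True))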

ver217 closed this as completed Apr 21, 2022