
Gemini badcase #40

Closed
feifeibear opened this issue Apr 20, 2022 · 5 comments

Comments

@feifeibear
Contributor

feifeibear commented Apr 20, 2022

See PR #41.
The launch script:

env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=./configs/palm_8b_zero_gemini_badcase.py
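
For reference, the only knob that differs between the failing and passing runs is the tensor placement policy in the ZeRO/Gemini part of the config. A minimal sketch of what that section of palm_8b_zero_gemini_badcase.py looks like is below; the exact field names and values here are assumptions for illustration, only the 'auto' vs 'cpu' switch is the point:

# Sketch of the ZeRO/Gemini section of the config. Field names and values are
# assumptions, not copied from the real palm_8b_zero_gemini_badcase.py.
zero = dict(
    model_config=dict(
        tensor_placement_policy='auto',   # failing case; 'cpu' makes the run pass
    ),
    optimizer_config=dict(
        gpu_margin_mem_ratio=0.8,         # fraction of the CUDA margin the optimizer may use
    ),
)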

When using 'auto', the run fails in the first iteration's backward pass with the following error log:

[screenshot of the error log]

However, the run passes when using 'cpu'.
This indicates that our Gemini is not robust enough.

feifeibear pinned this issue Apr 20, 2022
@feifeibear
Contributor Author

I suspect the program failed in HybridAdam. ColossalAI wrongly estimated the margin space for the optimizer states (OS) when using Gemini.
@ver217

@feifeibear
Contributor Author

At the first Adam step, we found 3.411 GB of margin memory space on CUDA.
Therefore we moved 3.411 GB of data from CPU to CUDA.

[screenshot of the memory stats]
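
To spell out the suspected failure mode: the estimate above treats the 3.411 GB margin as freely usable, but after fragmentation those free bytes are not one contiguous region, so moving that much OS data from CPU to CUDA can still hit an out-of-memory error. An illustrative sketch of that kind of margin-based move is below (not the actual ColossalAI/Gemini code; the function names are mine):

import torch

# Illustrative only: a margin-based optimizer-state (OS) move, to show why a
# byte-count margin can overshoot. This is not the real Gemini implementation.
def cuda_margin_bytes(device=0):
    free, _total = torch.cuda.mem_get_info(device)           # free bytes seen by the driver
    cached = torch.cuda.memory_reserved(device) - torch.cuda.memory_allocated(device)
    return free + cached                                      # "margin" = bytes not used by model data

def move_os_to_cuda(os_chunks, margin):
    """Greedily move CPU optimizer-state chunks until the margin is filled."""
    moved = 0
    for chunk in os_chunks:                                   # chunk: a CPU tensor holding part of the OS
        nbytes = chunk.numel() * chunk.element_size()
        if moved + nbytes > margin:
            break
        chunk.data = chunk.cuda()                             # can still OOM: the margin is a byte count,
        moved += nbytes                                       # not a guarantee of contiguous free blocks
    return moved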

@ver217
Member

ver217 commented Apr 20, 2022

I solved this issue by adding an environment variable: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:1024.
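
For reference, this is the same launch command as above with the allocator option prepended; max_split_size_mb stops the caching allocator from splitting blocks larger than the given size, which trades some cache reuse for less fragmentation:

env OMP_NUM_THREADS=12 PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:1024 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=./configs/palm_8b_zero_gemini_badcase.py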

@ver217
Member

ver217 commented Apr 21, 2022

Adding PYTORCH_NO_CUDA_MEMORY_CACHING=1 can also work.
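
Same launch command with the other workaround; note that PYTORCH_NO_CUDA_MEMORY_CACHING=1 disables PyTorch's caching allocator entirely, so it avoids the fragmentation at a noticeable speed cost and is mainly useful as a diagnostic:

env OMP_NUM_THREADS=12 PYTORCH_NO_CUDA_MEMORY_CACHING=1 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=./configs/palm_8b_zero_gemini_badcase.py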

@ver217
Member

ver217 commented Apr 21, 2022

I think memory fragmentation causes this issue, rather than Gemini itself.
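
One way to check this is to dump the allocator statistics right before the failing allocation: a large gap between reserved and allocated bytes means the memory is there but split into blocks too small for the request. A minimal sketch using standard PyTorch APIs (nothing ColossalAI-specific):

import torch

def report_fragmentation(device=0):
    """Print caching-allocator stats that hint at fragmentation."""
    allocated = torch.cuda.memory_allocated(device)   # bytes held by live tensors
    reserved = torch.cuda.memory_reserved(device)     # bytes held by the caching allocator
    print(f"allocated: {allocated / 2**30:.3f} GB")
    print(f"reserved:  {reserved / 2**30:.3f} GB")
    print(f"cached but unused: {(reserved - allocated) / 2**30:.3f} GB")
    # memory_summary() further breaks the cached space down by block size.
    print(torch.cuda.memory_summary(device, abbreviated=True))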

ver217 closed this as completed Apr 21, 2022