Hi, thank you for this great project. I am fine-tuning Yi-34B, and loading the model raises a CUDA OOM error, so I set `zero3_init_flag` to true to avoid the OOM at load time (a sketch of what I mean is below). Training then fails with other errors, which I have pasted after the sketch. Could you please help me? Thank you!
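For clarity, `zero3_init_flag: true` in the accelerate config tells DeepSpeed to shard the weights across ranks while the checkpoint is being loaded (ZeRO-3 init), instead of materializing the full 34B model on every GPU, which is why it avoids the load-time OOM. A minimal sketch of the equivalent setup without the Trainer, assuming `transformers` and `deepspeed` are installed; the config values and checkpoint id are illustrative, not my exact files:

```python
# Sketch only: make from_pretrained() ZeRO-3-aware so weights are sharded
# across ranks during loading. Assumes transformers + deepspeed are
# installed; the config dict and checkpoint id are illustrative.
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "zero_optimization": {"stage": 3},      # ZeRO-3: partition parameters
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
}
# Must be constructed (and kept alive) BEFORE from_pretrained() is called,
# so the integration can wrap loading in deepspeed.zero.Init().
dschf = HfDeepSpeedConfig(ds_config)
model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-34B")
```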
```
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [257,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank6]:[E410 09:24:27.054138428 ProcessGroupNCCL.cpp:1430] [PG 0 Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xae (0x7fd1a42fb67e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fd1a42a5375 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fd1a43b0612 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fd182ac63de in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fd182aca678 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x8ad (0x7fd182ad2fbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x128 (0x7fd182ad3c08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fd1a3eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd1a4e6bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fd1a4efca04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xae (0x7fd1a42fb67e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fd1a42a5375 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fd1a43b0612 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fd182ac63de in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fd182aca678 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x8ad (0x7fd182ad2fbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x128 (0x7fd182ad3c08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fd1a3eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd1a4e6bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fd1a4efca04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1434 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xae (0x7fd1a42fb67e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xfded22 (0x7fd182afbd22 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd342da (0x7fd1828512da in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fd1a3eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fd1a4e6bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7fd1a4efca04 in /lib/x86_64-linux-gnu/libc.so.6)
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same assertion repeats for block [68,0,0], threads [65,0,0] through [84,0,0] ...]
```
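For what it's worth, this assertion usually means an index fed to an `index_select`/embedding lookup is out of range; in fine-tuning it is most often a token id at or beyond the model's embedding table (for example after adding special tokens without resizing the embeddings). A minimal sanity check I could run, assuming a Hugging Face tokenizer and config (the checkpoint id is an assumption based on "Yi-34B" in the report):

```python
# Hypothetical sanity check, not the project's code: compare the ids the
# tokenizer can emit against the model's embedding table size, then check
# an actual batch. The checkpoint id below is assumed for illustration.
from transformers import AutoConfig, AutoTokenizer

name = "01-ai/Yi-34B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(name)
config = AutoConfig.from_pretrained(name)

print(f"tokenizer vocab: {len(tokenizer)}, model vocab_size: {config.vocab_size}")

# The same check on a real training batch also catches out-of-range labels.
batch = tokenizer("a quick smoke test", return_tensors="pt")
assert int(batch["input_ids"].max()) < config.vocab_size, "token id out of range"
```

If the tokenizer turns out to be larger, calling `model.resize_token_embeddings(len(tokenizer))` before training is the usual remedy; and running with `CUDA_LAUNCH_BLOCKING=1`, as the log itself suggests, would pin the assert to the exact failing op.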