Training at epoch 1 ends with a CUDA error: Assertion `t >= 0 && t < n_classes` failed
I ran the following command:
```
python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_divST_8x32_224_4gpus.yaml \
  DATA.PATH_TO_DATA_DIR /home/ubuntu/vit/kinetics-dataset/k400/videos_resized \
  NUM_GPUS 4 \
  TRAIN.BATCH_SIZE 16
```
During training in epoch 1, I observed the following error:
```
[06/30 01:54:32][INFO] train_net.py: 446: Start epoch: 1
[06/30 01:54:46][INFO] distributed.py: 995: Reducer buckets have been rebuilt in this iteration.
[06/30 01:54:59][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.46034, "dt_data": 0.00346, "dt_net": 1.45688, "epoch": "1/15", "eta": "7:33:55", "gpu_mem": "7.68G", "iter": "10/1244", "loss": 6.05343, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
[06/30 01:55:14][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.50371, "dt_data": 0.00334, "dt_net": 1.50036, "epoch": "1/15", "eta": "7:47:09", "gpu_mem": "7.68G", "iter": "20/1244", "loss": 6.16927, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
../aten/src/ATen/native/cuda/Loss.cu:271: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: device-side assert triggered
```
This is followed by a lengthy exception trace:
```
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f10084d2612 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xea8e4a (0x7f1009892e4a in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x33a968 (0x7f1051d51968 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
.....
```
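Note that the `createEvent` frame above is probably not the real culprit: device-side asserts surface asynchronously, so the host stack trace points at whichever later CUDA call happened to notice the failure. A minimal sketch for getting a synchronous, accurate trace; placing it at the top of the entry script before any CUDA work is an assumption about where initialization happens in this codebase:

```python
# Sketch: force synchronous kernel launches so the Python stack trace points
# at the op that actually failed. The environment variable must be set before
# CUDA is initialized, so this belongs at the very top of the entry script
# (before importing torch or running any CUDA code).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```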
My instance has four Tesla T4 GPUs, with driver version 510.47.03 and CUDA version 11.6.
What does this error mean, and how do I fix it?
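For context, the `t >= 0 && t < n_classes` assertion in the NLL-loss kernel fires when a target label falls outside `[0, n_classes)`, which typically means the dataset's labels do not match the configured number of classes. Below is a minimal sketch of a pre-training label check; the `(inputs, labels, ...)` batch layout and the class count of 400 (what Kinetics-400 configs usually set via `MODEL.NUM_CLASSES`) are assumptions, not taken from the original post:

```python
# Hypothetical sanity check: scan all labels before training and report any
# batch containing a label outside [0, num_classes).
import torch

def find_bad_labels(loader, num_classes=400):
    bad_batches = []
    for i, batch in enumerate(loader):
        labels = batch[1]  # assumption: loader yields (inputs, labels, ...)
        lo, hi = int(labels.min()), int(labels.max())
        if lo < 0 or hi >= num_classes:
            bad_batches.append((i, lo, hi))
    return bad_batches
```

If this reports any out-of-range labels, the fix is usually either correcting the label file or raising the configured class count to match it.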