
Training at epoch 1 ends up with CUDA error and Assertion t >= 0 && t < n_classes failed #115

kct22aws opened this issue Jun 30, 2022 · 2 comments



kct22aws commented Jun 30, 2022

I ran the following command:

python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_divST_8x32_224_4gpus.yaml \
  DATA.PATH_TO_DATA_DIR /home/ubuntu/vit/kinetics-dataset/k400/videos_resized \
  NUM_GPUS 4 \
  TRAIN.BATCH_SIZE 16

During training in epoch 1, I observed the following error:

[06/30 01:54:32][INFO] train_net.py: 446: Start epoch: 1
[06/30 01:54:46][INFO] distributed.py: 995: Reducer buckets have been rebuilt in this iteration.
[06/30 01:54:59][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.46034, "dt_data": 0.00346, "dt_net": 1.45688, "epoch": "1/15", "eta": "7:33:55", "gpu_mem": "7.68G", "iter": "10/1244", "loss": 6.05343, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
[06/30 01:55:14][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.50371, "dt_data": 0.00334, "dt_net": 1.50036, "epoch": "1/15", "eta": "7:47:09", "gpu_mem": "7.68G", "iter": "20/1244", "loss": 6.16927, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
../aten/src/ATen/native/cuda/Loss.cu:271: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered

This is followed by a lengthy exception traceback:
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f10084d2612 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xea8e4a (0x7f1009892e4a in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x33a968 (0x7f1051d51968 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

.....

My instance has four Tesla T4 GPUs, with driver version 510.47.03 and CUDA version 11.6.

What does the error above mean, and how do I fix it?

@xksteven

Does it work with a smaller batch size?

@SJTUPanda

Data labels in train.csv should start from 0.
:)
