
Training at epoch 1 ends up with CUDA error and Assertion t >= 0 && t < n_classes failed #115

kct22aws opened this issue Jun 30, 2022 · 2 comments



kct22aws commented Jun 30, 2022

I ran the following command:

python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_divST_8x32_224_4gpus.yaml \
  DATA.PATH_TO_DATA_DIR /home/ubuntu/vit/kinetics-dataset/k400/videos_resized \
  NUM_GPUS 4 \
  TRAIN.BATCH_SIZE 16

During training in epoch 1, I observed the following error:

[06/30 01:54:32][INFO] train_net.py: 446: Start epoch: 1
[06/30 01:54:46][INFO] distributed.py: 995: Reducer buckets have been rebuilt in this iteration.
[06/30 01:54:59][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.46034, "dt_data": 0.00346, "dt_net": 1.45688, "epoch": "1/15", "eta": "7:33:55", "gpu_mem": "7.68G", "iter": "10/1244", "loss": 6.05343, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
[06/30 01:55:14][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.50371, "dt_data": 0.00334, "dt_net": 1.50036, "epoch": "1/15", "eta": "7:47:09", "gpu_mem": "7.68G", "iter": "20/1244", "loss": 6.16927, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
../aten/src/ATen/native/cuda/Loss.cu:271: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered

This is followed by a lengthy exception traceback:
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f10084d2612 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xea8e4a (0x7f1009892e4a in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x33a968 (0x7f1051d51968 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

.....

My instance has four Tesla T4 GPUs, with driver version 510.47.03 and CUDA version 11.6.

What does the error above mean, and how do I fix it?

@xksteven

Does it work with a smaller batch size?

@SJTUPanda

Data labels in train.csv should start from 0.
:)
