can not train #2

PengboLi1998 · 2022-05-27T06:30:39Z

Hi, excuse me, could you please help me?
Epoch: 0 - Iteration: 8300 - Total loss: 2.49283218 - Segmentation loss: 1.3356148 - Cross-entropy loss: 0.334046245 - Soft IoU loss: -0.371718287 - L2 loss: 0.28116408 - Score loss: 0.435414135 - Mask loss: 0.478311181
Traceback (most recent call last):
File "bin/bonet2-train.py", line 107, in <module>
sys.exit(main())
File "bin/bonet2-train.py", line 103, in main
trainer.train(args.epochs, lr=args.lr)
File "/home/vipuser/Downloads/3DBoNet2-maindata/3DBoNet2-main/bin/bonet2/trainer.py", line 222, in train
self.eval((epoch + 1) * n_batches)
File "/home/vipuser/Downloads/3DBoNet2-maindata/3DBoNet2-main/bin/bonet2/trainer.py", line 340, in eval
results = self.instance_pr.result()
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/keras/utils/metrics_utils.py", line 122, in decorated
result_t = array_ops.identity(result_fn(*args))
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 286, in identity
input = ops.convert_to_tensor(input)
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
return func(*args, **kwargs)
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1540, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 339, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 265, in constant
allow_broadcast=True)
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 276, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 301, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/home/vipuser/miniconda3/envs/bonet/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 98, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Attempt to convert a value ({'precisions': <tf.Tensor: shape=(13,), dtype=float32, numpy=
array([0.98245615, 0.8852459 , 0.45416668, 0. , 0. ,
0.5555556 , 0.15789473, 0.07692308, 0.21830986, 0. ,
0.20454545, 0. , 0.17624521], dtype=float32)>, 'recalls': <tf.Tensor: shape=(13,), dtype=float32, numpy=
array([0.7368421 , 0.7941176 , 0.31778425, 0. , 0. ,
0.3846154 , 0.11811024, 0.04545455, 0.12015504, 0. ,
0.04147466, 0. , 0.04994571], dtype=float32)>, 'average_precision': <tf.Tensor: shape=(), dtype=float32, numpy=0.2854879>, 'average_recall': <tf.Tensor: shape=(), dtype=float32, numpy=0.2006538>}) with an unsupported type (<class 'dict'>) to a Tensor.

PengboLi1998 · 2022-05-27T06:31:39Z

I don't know why the training epoch does not increase and stays at epoch 0.

lucagrementieri · 2022-05-27T11:20:13Z

Hi,

the error occurs in the evaluation phase, which runs at the end of every epoch, so that is why you cannot get past the first epoch.

I cannot reproduce the error, so maybe it's a problem with the version of TensorFlow.
I have now updated the code to work with the latest version of TensorFlow, 2.9, so you can retry and maybe it will work.
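Since the failure may depend on the installed TensorFlow version, it can help to check that version before retrying. The snippet below is only a minimal sketch: `parse_version` and `is_at_least` are hypothetical helper names, not part of this repository, and the threshold `"2.9"` is simply the version mentioned above.

```python
def parse_version(version):
    """Turn a string such as '2.9.1' or '2.9.0rc0' into a tuple like (2, 9, 1)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc0'
            digits += ch
        parts.append(int(digits or 0))
    # Pad to three components so '2.9' compares equal to '2.9.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(installed, minimum):
    """Return True if the installed version is >= the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# Typical use (assumes TensorFlow is installed in the active environment):
#   import tensorflow as tf
#   print(tf.__version__, is_at_least(tf.__version__, "2.9"))
```

If the check returns `False`, the environment is older than the version the updated code was written for, which may be worth ruling out first.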

PengboLi1998 · 2022-05-28T01:36:35Z

And why are there so many iterations? Is it the same for you? By the way, could you please tell me the CUDA versions used for your new code and for the original code?

lucagrementieri · 2022-05-30T06:00:32Z

The number of iterations is determined by the number of samples in the dataset and the batch size. You are probably using ScanNet, which is a large dataset, so the number of iterations is large.
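As a rough illustration of that relationship, the number of iterations per epoch is the number of samples divided by the batch size, rounded up. This is a sketch that assumes the last partial batch is kept (a loader that drops it would round down instead). The function name and the sample counts are made up for the example and are not taken from the repository or from any dataset.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (last partial batch kept)."""
    return math.ceil(n_samples / batch_size)


# Illustrative numbers only: a large dataset with a small batch size
# yields many iterations before the first epoch ends.
print(iterations_per_epoch(20000, 2))   # 10000
print(iterations_per_epoch(20000, 16))  # 1250
```

Under this assumption, a larger batch size, where memory allows, lowers the iteration count per epoch proportionally.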

The new code uses the latest version of CUDA, 11.7, while the old code used 11.4.
