Multi-GPU training: accuracy drops significantly #47

Closed
coldlarry opened this issue Oct 21, 2021 · 5 comments
Labels
good first issue Good for newcomers

Comments

@coldlarry

Hello author, I trained with multiple GPUs (4 cards) and also increased the default single-GPU batch size you set.

During training, the first mAP evaluation is close to 0 (with single-GPU training, the first evaluation is usually around 0.6).

Could you tell me what causes this? Is it a learning rate issue?

@coldlarry
Author

With single-GPU training, increasing the batch size also makes the accuracy drop significantly. Why is that?

@yuantn
Owner

yuantn commented Oct 22, 2021

Did you adjust the learning rate while increasing the batch size?
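
For context, a common heuristic when scaling the batch size is the linear scaling rule: if the effective batch size (samples per GPU × number of GPUs) grows by a factor k, scale the learning rate by roughly the same factor, ideally with a short warmup. A minimal sketch with placeholder numbers (not taken from this repository's config):

```python
# Linear scaling rule sketch -- the base values below are illustrative only.
base_lr = 0.001               # learning rate tuned for the default setting
base_batch_size = 2 * 1       # samples_per_gpu * number of GPUs in that setting

samples_per_gpu = 2
num_gpus = 4
new_batch_size = samples_per_gpu * num_gpus            # new effective batch size
new_lr = base_lr * new_batch_size / base_batch_size    # scale lr linearly

print(f"effective batch size: {new_batch_size}, scaled lr: {new_lr}")
# -> effective batch size: 8, scaled lr: 0.004
```

Without this adjustment, a larger batch means fewer optimizer steps per epoch at the same step size, which can explain the much lower mAP at the first evaluation.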

@coldlarry
Author

Hello, I indeed did not adjust the learning rate (I did not expect it to have such a large impact).
Also, I would like to ask: does the current code support multi-GPU training?

@coldlarry
Author

When I trained with multiple GPUs, it stopped printing anything partway through training and reported no errors, so I eventually killed the process.
I saw an earlier issue asking about multi-GPU training, so I would like to ask whether multi-GPU training is supported at the moment.

@yuantn
Owner

yuantn commented Oct 25, 2021

Can you provide the output log? One possible cause is that the processes on the different GPUs are not synchronized, so GPU utilization stays at 100% and training hangs.

The current code is best run on a single GPU. If you want to use multiple GPUs, you can refer to the third question of the Training and Test section in the FAQ and to Issue #11.
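
To illustrate the kind of desynchronization meant here: a collective operation such as all_reduce blocks every rank until all ranks have entered it, so if one process takes a different code path, the remaining processes busy-wait at full GPU utilization with no output and no error. A hypothetical sketch (not this repository's code) that hangs by design when launched with more than one process, e.g. `torchrun --nproc_per_node=2 hang_demo.py`:

```python
# hang_demo.py -- illustrates how mismatched collectives stall a multi-GPU job.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    x = torch.ones(1, device="cuda")

    # Hypothetical divergence: rank 0 skips the all_reduce that every other
    # rank calls, so those ranks block inside the collective indefinitely,
    # showing 100% GPU utilization while printing nothing.
    if rank != 0:
        dist.all_reduce(x)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The symptom described above (no new log lines, no error, GPUs pinned at 100%) matches this pattern, which is why the output log helps pinpoint where the ranks diverge.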

@yuantn yuantn added the good first issue Good for newcomers label Oct 27, 2021
@yuantn yuantn closed this as completed Nov 22, 2021
yuantn added a commit that referenced this issue Apr 21, 2023