How to use it with Multi GPU #1

Open
Hesene opened this issue Aug 10, 2019 · 12 comments

@Hesene commented Aug 10, 2019

Thank you for sharing this! When I run with a single GPU it runs well, but when I run with multiple GPUs I get this error:

RuntimeError: Function CatBackward returned an invalid gradient at index 1 - expected device cuda:1 but got cuda:0

Could you give some advice on this error?

zhoudaxia233 added the help wanted label on Aug 13, 2019
@zhoudaxia233 (Owner)

@Hesene Hello Hesene, in my lab I only have a single 2080Ti, so I cannot replicate this issue. I'm sorry about that!

@Hesene (Author) commented Aug 13, 2019

> @Hesene Hello Hesene, in my lab I only have a single 2080Ti, so I cannot replicate this issue. I'm sorry about that!

OK, thank you for your code, it helped me a lot.

@AtsunoriFujita

I'm facing the same problem. Which part is the cause?

@goodgoodstudy92

Did you use torch.nn.DataParallel()?

@zhoudaxia233 (Owner)

> Did you use torch.nn.DataParallel()?

No, I didn't, but I think it may work.

@zhoudaxia233 (Owner)

> I'm facing the same problem. Which part is the cause?

I'm not sure, but I think you can try integrating nn.DataParallel() into the source code.
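
For reference, the usual wrapping pattern is sketched below. The `get_efficientunet_b0` constructor and its arguments are my assumption based on this repo's README, and, as tracebacks later in this thread show, wrapping alone may still hit the device-mismatch error here:

```python
# Minimal nn.DataParallel sketch; model constructor assumed from the README.
import torch
from torch import nn
from efficientunet import get_efficientunet_b0

model = get_efficientunet_b0(out_channels=2, concat_input=True, pretrained=False)
model = nn.DataParallel(model).cuda()   # replicas are created on every forward pass

x = torch.randn(8, 3, 224, 224).cuda()  # the batch is split along dim 0 across GPUs
out = model(x)                          # per-GPU outputs are gathered back on cuda:0
```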

@goodgoodstudy92

> > I'm facing the same problem. Which part is the cause?
>
> I'm not sure, but I think you can try integrating nn.DataParallel() into the source code.

I used EfficientNet as the backbone to train an object detection model, and nn.DataParallel() works fine; the only issue is that multi-GPU training is quite slow.
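
That slowness is expected: nn.DataParallel is single-process and re-replicates the model on every forward pass. torch.nn.parallel.DistributedDataParallel (one process per GPU) is the generally recommended, faster alternative. A minimal sketch, again assuming the constructor from this repo's README:

```python
# DistributedDataParallel sketch; launch with: torchrun --nproc_per_node=2 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from efficientunet import get_efficientunet_b0  # assumed from the README

def main():
    dist.init_process_group(backend="nccl")     # rank/world size come from torchrun
    local_rank = int(os.environ["LOCAL_RANK"])  # environment variable set by torchrun
    torch.cuda.set_device(local_rank)

    model = get_efficientunet_b0(out_channels=2, concat_input=True, pretrained=False)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    x = torch.randn(4, 3, 224, 224, device=f"cuda:{local_rank}")
    out = model(x)  # each process runs its own replica; grads sync during backward()

if __name__ == "__main__":
    main()
```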

@ryanstout

I'm seeing a similar issue when running with nn.DataParallel:

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/efficientunet/efficientunet.py", line 106, in forward
    x = torch.cat([x, blocks.popitem()[1]], dim=1)
RuntimeError: All input tensors must be on the same device. Received cuda:0 and cuda:1

Any ideas?

Thanks!

@Vipermdl

> I'm seeing a similar issue when running with nn.DataParallel: […] Any ideas?

Hi bro, have you solved the problem?

@If-only1 commented Nov 8, 2020

I suspect this problem is caused by a module being shared inside EfficientUnet, which leaves that module on only one GPU; perhaps the encoder…

@TianyiFranklinWang

> I suspect this problem is caused by a module being shared inside EfficientUnet, which leaves that module on only one GPU; perhaps the encoder…

I agree, I'm now facing the same problem.
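
If that hypothesis is right, one untested stopgap is to move each skip tensor onto the current replica's device just before the torch.cat call shown in the traceback above; the line in efficientunet.py's forward() might become something like:

```python
# Possible workaround sketch for the torch.cat line from the traceback.
# This only patches the device mismatch; if `blocks` really is state shared
# across DataParallel replicas, features from different replicas could still
# be mixed together, so treat this as a diagnostic aid, not a verified fix.
skip = blocks.popitem()[1]
x = torch.cat([x, skip.to(x.device)], dim=1)  # move the skip onto x's device
```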

@zhoudaxia233 (Owner)

@NPU-Franklin Franklin created a PR (#11) to support multiple GPUs. I do not have multiple cards, so I cannot test it, but maybe you can give it a try.
