Pytorch not working with CUDA 11.2 and CUDA 11.7 #32
Encountering the same issue. Using CUDA 11.7 and cuDNN 8.7.0, running on an AWS EC2 instance. It would be really nice to have a GitHub workflow that builds and runs the RPC server and Docker container together to ensure that it works as described in the docs. Although this would require a GPU-enabled runner... probably not as easy as I imagined 🤔
There is a CI testing Cricket with a GPU-enabled runner. There is no test for PyTorch yet, and yes, we should add one. However, I'm not surprised there are issues with PyTorch support. PyTorch is really complex and uses a lot of CUDA features in unusual ways that make testing pretty difficult.
@n-eiling Thanks for the reply! I can see that you use GitLab CI. I can have a look and see if I can write a workflow that tests PyTorch with Cricket. Outside of that, how would you recommend I go about mapping the unusual ways that PyTorch uses CUDA? That might be a good place to start, I guess.
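As a starting point for such a workflow, a minimal runner could execute the existing `tests/test_apps/pytorch_minimal.py` under the Cricket client and fail the CI job on a non-zero exit or a hang, the two failure modes reported in this thread. A hedged sketch only; the command in `__main__` is a placeholder, not Cricket's actual invocation:

```python
# Sketch of a CI smoke-test runner (hypothetical; Cricket's real CI lives in
# GitLab). Runs a test app under a timeout: a non-zero exit code catches
# crashes such as DeferredCudaCallError, and the timeout catches RPC hangs.
import subprocess
import sys


def run_smoke_test(cmd: list, timeout_s: int = 120) -> bool:
    """Return True iff the command exits 0 within the timeout."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A hung RPC call (e.g. "call failed: RPC: Timed out") ends up here.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # In CI this would be the Cricket client wrapping pytorch_minimal.py.
    ok = run_smoke_test([sys.executable, "-c", "print('ok')"])
    sys.exit(0 if ok else 1)
```

The timeout matters as much as the exit code: the client log below shows the process stalling on repeated RPC timeouts before it ever fails.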
Hello, I encountered the same problem.
My CUDA version is 11.2 and cuDNN is 8.9.2 on a Tesla P4, but I get this problem:
server:

```
+08:01:00.423212 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.445168 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.445403 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.447247 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.448076 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:07.164339 ERROR: cuda_device_prop_result size mismatch in cpu-server-runtime.c:367
+08:02:22.370950 INFO: RPC deinit requested.
+08:08:54.324012 INFO: have a nice day!
```
client:

```
+08:00:36.417392 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
+08:00:36.418684 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
+08:00:36.420058 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
call failed: RPC: Timed out
call failed: RPC: Timed out
call failed: RPC: Timed out
+08:02:01.851255 ERROR: something went wrong in cpu-client-runtime.c:444
Traceback (most recent call last):
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 242, in _lazy_init
    queued_call()
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 125, in _check_capability
    capability = get_device_capability(d)
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lwh/cricket/tests/test_apps/pytorch_minimal.py", line 39, in <module>
    x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 246, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error:

CUDA call was originally invoked at:

[' File "/home/lwh/cricket/tests/test_apps/pytorch_minimal.py", line 31, in \n import torch\n', ' File "", line 991, in _find_and_load\n', ' File "", line 975, in _find_and_load_unlocked\n', ' File "", line 671, in _load_unlocked\n', ' File "", line 843, in exec_module\n', ' File "", line 219, in _call_with_frames_removed\n', ' File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/__init__.py", line 798, in \n _C._initExtension(manager_path())\n', ' File "", line 991, in _find_and_load\n', ' File "", line 975, in _find_and_load_unlocked\n', ' File "", line 671, in _load_unlocked\n', ' File "", line 843, in exec_module\n', ' File "", line 219, in _call_with_frames_removed\n', ' File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 179, in \n _lazy_call(_check_capability)\n', ' File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 177, in _lazy_call\n _queued_calls.append((callable, traceback.format_stack()))\n']

+08:02:27.007890 ERROR: call failed. in cpu-client.c:213
+08:02:27.012036 INFO: api-call-cnt: 14
+08:02:27.012051 INFO: memcpy-cnt: 0
```
Is my CUDA version wrong, or is there another reason?
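The `cuda_device_prop_result size mismatch` in the server log is consistent with the client and server being built against different CUDA toolkits: the layout of `struct cudaDeviceProp` tends to change between CUDA releases, so the two ends can disagree on the serialized payload size. A toy Python sketch of such a size-checked RPC result follows; the field layout and function names here are invented for illustration and are not Cricket's actual wire format:

```python
# Toy illustration of a size-checked RPC payload, in the spirit of the
# check that fails in cpu-server-runtime.c:367. If client and server were
# compiled against CUDA versions with different struct layouts, the
# receiver's expected size no longer matches what arrives on the wire.
import struct

# Hypothetical fixed layout: 256-byte name, 64-bit total memory,
# two 32-bit ints for compute capability (major, minor).
_PROP_FMT = "<256sQii"
_PROP_SIZE = struct.calcsize(_PROP_FMT)  # 272 bytes for this toy layout


def pack_device_prop(name: bytes, total_mem: int, major: int, minor: int) -> bytes:
    """Serialize a device-properties record (sender side)."""
    return struct.pack(_PROP_FMT, name, total_mem, major, minor)


def unpack_device_prop(payload: bytes, expected_size: int):
    """Deserialize on the receiver side, rejecting size mismatches."""
    if len(payload) != expected_size:
        # Analogous to "cuda_device_prop_result size mismatch"
        raise ValueError(
            f"device_prop size mismatch: got {len(payload)}, expected {expected_size}"
        )
    return struct.unpack(_PROP_FMT, payload)
```

Under this reading, rebuilding both the Cricket server and the client against the same CUDA version (and running PyTorch built for that version) would be the first thing to try.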
Originally posted by @Tlhaoge in #6 (comment)