
[Ray Compiled Graph] NCCL Internal Error #49827

Open
anonymousmaharaj opened this issue Jan 14, 2025 · 2 comments

Comments

@anonymousmaharaj

anonymousmaharaj commented Jan 14, 2025

What happened + What you expected to happen

I have installed the latest Ray, CUDA 12.6, and the latest NCCL, and I am trying to run a Compiled Graph following the Anyscale instructions, but I get the error below.
My Ray cluster consists of 2 servers, one with an RTX 3090 and one with an RTX 3080 Ti.
I have reinstalled the environment, the CUDA Toolkit, CUDA, and NCCL several times, and it doesn't help.

python app-gpu-dag.py 
2025-01-14 16:03:22,335	INFO worker.py:1636 -- Connecting to existing Ray cluster at address: 192.168.1.166:6379...
2025-01-14 16:03:22,348	INFO worker.py:1812 -- Connected to Ray cluster. View the dashboard at 192.168.1.166:8265 
2025-01-14 16:03:29,761	INFO torch_tensor_nccl_channel.py:672 -- Creating NCCL group 961c6ef4-9cbc-403d-8b0b-f86455a19f34 on actors: [Actor(GPUReceiver, fc227afd5b6a56339c72e06201000000), Actor(GPUSender, a160d92fea529c468f5d2ee501000000)]
2025-01-14 16:03:33,551	INFO torch_tensor_nccl_channel.py:697 -- NCCL group initialized.
(GPUSender pid=9988) Sender using GPU: 0
Traceback (most recent call last):
  File "/root/ray-complied-graph/app-gpu-dag.py", line 54, in <module>
    assert ray.get(compiled_graph.execute((10, ))) == (10, )
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/_private/worker.py", line 2727, in get
    return object_refs.get(timeout=timeout)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/compiled_dag_ref.py", line 103, in get
    return_vals = self._dag._execute_until(
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 2115, in _execute_until
    self._dag_output_fetcher.read(timeout),
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 296, in read
    outputs = self._read_list(timeout)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 321, in _read_list
    results.append(c.read(timeout))
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/shared_memory_channel.py", line 751, in read
    return self._channel_dict[actor_id].read(timeout)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/shared_memory_channel.py", line 602, in read
    output = self._buffers[self._next_read_index].read(timeout)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/shared_memory_channel.py", line 485, in read
    ret = self._worker.get_objects(
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/_private/worker.py", line 882, in get_objects
    ] = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 3501, in ray._raylet.CoreWorker.get_objects
  File "python/ray/includes/common.pxi", line 102, in ray._raylet.check_status
ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: 0078c9b589a9c677a810922346adee9a94c059e90100000002e1f505
2025-01-14 16:03:43,803	INFO compiled_dag_node.py:1935 -- Tearing down compiled DAG
2025-01-14 16:03:43,803	INFO compiled_dag_node.py:1940 -- Cancelling compiled worker on actor: Actor(GPUReceiver, fc227afd5b6a56339c72e06201000000)
2025-01-14 16:03:43,803	INFO compiled_dag_node.py:1940 -- Cancelling compiled worker on actor: Actor(GPUSender, a160d92fea529c468f5d2ee501000000)
(GPUReceiver pid=6305, ip=192.168.1.159) Destructing NCCL group on actor: Actor(GPUReceiver, fc227afd5b6a56339c72e06201000000)
(GPUReceiver pid=6305, ip=192.168.1.159) ERROR:root:Compiled DAG task exited with exception
(GPUReceiver pid=6305, ip=192.168.1.159) Traceback (most recent call last):
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 161, in do_exec_tasks
(GPUReceiver pid=6305, ip=192.168.1.159)     done = tasks[operation.exec_task_idx].exec_operation(
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 644, in exec_operation
(GPUReceiver pid=6305, ip=192.168.1.159)     return self._read(overlap_gpu_communication)
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 537, in _read
(GPUReceiver pid=6305, ip=192.168.1.159)     input_data = self.input_reader.read()
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 296, in read
(GPUReceiver pid=6305, ip=192.168.1.159)     outputs = self._read_list(timeout)
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 321, in _read_list
(GPUReceiver pid=6305, ip=192.168.1.159)     results.append(c.read(timeout))
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 243, in read
(GPUReceiver pid=6305, ip=192.168.1.159)     tensors = self._gpu_data_channel.read(timeout)
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 527, in read
(GPUReceiver pid=6305, ip=192.168.1.159)     buf = self._nccl_group.recv(
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/nccl_group.py", line 238, in recv
(GPUReceiver pid=6305, ip=192.168.1.159)     self._comm.recv(
(GPUReceiver pid=6305, ip=192.168.1.159)   File "cupy_backends/cuda/libs/nccl.pyx", line 480, in cupy_backends.cuda.libs.nccl.NcclCommunicator.recv
(GPUReceiver pid=6305, ip=192.168.1.159)   File "cupy_backends/cuda/libs/nccl.pyx", line 128, in cupy_backends.cuda.libs.nccl.check_status
(GPUReceiver pid=6305, ip=192.168.1.159) cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_INTERNAL_ERROR: internal error - please report this issue to the NCCL developers
(GPUSender pid=9988) Destructing NCCL group on actor: Actor(GPUSender, a160d92fea529c468f5d2ee501000000)
(GPUSender pid=9988)     return self._write()
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 616, in _write
(GPUSender pid=9988)     self.output_writer.write(output_val)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 500, in write
(GPUSender pid=9988)     channel.write(val_i, timeout)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 198, in write
(GPUSender pid=9988)     self._send_cpu_and_gpu_data(value, timeout)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 142, in _send_cpu_and_gpu_data
(GPUSender pid=9988)     self._gpu_data_channel.write(gpu_tensors)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 484, in write
(GPUSender pid=9988)     self._nccl_group.send(tensor, rank)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/nccl_group.py", line 191, in send
(GPUSender pid=9988)     self._comm.send(
(GPUSender pid=9988)   File "cupy_backends/cuda/libs/nccl.pyx", line 471, in cupy_backends.cuda.libs.nccl.NcclCommunicator.send
2025-01-14 16:03:44,338	INFO compiled_dag_node.py:1960 -- Waiting for worker tasks to exit
2025-01-14 16:03:44,339	INFO compiled_dag_node.py:1962 -- Teardown complete
(GPUSender pid=9988) ERROR:root:Compiled DAG task exited with exception
(GPUSender pid=9988) Traceback (most recent call last):
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 161, in do_exec_tasks
(GPUSender pid=9988)     done = tasks[operation.exec_task_idx].exec_operation(
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 649, in exec_operation
(GPUSender pid=9988)   File "cupy_backends/cuda/libs/nccl.pyx", line 128, in cupy_backends.cuda.libs.nccl.check_status
(GPUSender pid=9988) cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_INTERNAL_ERROR: internal error - please report this issue to the NCCL developers
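Since the failure surfaces inside ncclSend/ncclRecv (via cupy), it may help to first confirm that plain NCCL traffic works between the two machines outside of Ray. Below is a minimal diagnostic sketch using torch.distributed with the NCCL backend; the head-node IP is taken from the logs above, while the port and the RANK variable are assumptions for this test only.

import os
import torch
import torch.distributed as dist

# Minimal two-node NCCL check outside Ray (diagnostic sketch).
# Launch once per node, e.g. RANK=0 on 192.168.1.166 and RANK=1 on 192.168.1.159.
os.environ.setdefault("MASTER_ADDR", "192.168.1.166")  # head-node IP from the logs above
os.environ.setdefault("MASTER_PORT", "29500")          # assumed free port

rank = int(os.environ["RANK"])  # set manually when launching each process
dist.init_process_group(backend="nccl", rank=rank, world_size=2)

t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # expected: tensor([2.]) on both nodes if NCCL works
print(f"rank {rank}: {t}")
dist.destroy_process_group()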

Versions / Dependencies

ray 2.40.0
cupy-cuda12x 13.3.0

python -c "import torch; print(f'PyTorch version: {torch.__version__}. Cuda Version {torch.version.cuda}')"
PyTorch version: 2.5.1+cu124. Cuda Version 12.4
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

dpkg -l | grep nccl
hi  libnccl-dev                                2.24.3-1+cuda12.6                       amd64        NVIDIA Collective Communication Library (NCCL) Development Files
hi  libnccl2                                   2.24.3-1+cuda12.6                       amd64        NVIDIA Collective Communication Library (NCCL) Runtime
ii  nccl-local-repo-ubuntu2204-2.24.3-cuda12.4 1.0-1                                   amd64        nccl-local repository configuration files
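Because Ray's NCCL channel goes through cupy (see the cupy_backends frames in the traceback) rather than the system libnccl alone, it may also be worth checking which NCCL versions torch and cupy actually report. A small sketch, assuming the usual torch.cuda.nccl.version() and cupy.cuda.nccl.get_version() helpers are available:

import torch
from cupy.cuda import nccl

print("torch NCCL:", torch.cuda.nccl.version())  # NCCL bundled with the PyTorch wheel
print("cupy NCCL:", nccl.get_version())          # NCCL that cupy loads at runtime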

Reproduction script

Start the Ray head node with:
ray start --head --node-ip-address=192.168.1.166 --port=6379 --dashboard-port=8265 --dashboard-host=0.0.0.0 --num-gpus=1

Connect the second node to the cluster with:
ray start --address='192.168.1.166:6379' --num-gpus=1
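
Before running the reproduction script below, it can be worth confirming that both nodes registered their GPU with the cluster; a quick check using the standard Ray API:

import ray

ray.init(address="auto")
# Expect {'GPU': 2.0, ...} if both nodes joined with one GPU each.
print(ray.cluster_resources())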

import ray
import ray.dag
import torch
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_P2P_DISABLE"] = "1"

ray.init(address="auto")

@ray.remote(num_gpus=1)
class GPUSender:
    def send(self, shape):
        current_device = torch.cuda.current_device()
        print(f"Sender using GPU: {current_device}")
        try:
            tensor = torch.zeros(shape, device=f"cuda:{current_device}")
            torch.cuda.synchronize()
            return tensor
        except Exception as e:
            print(f"Error in sender: {e}")
            raise

@ray.remote(num_gpus=1)
class GPUReceiver:
    def recv(self, tensor: torch.Tensor):
        current_device = torch.cuda.current_device()
        print(f"Receiver using GPU: {current_device}")
        try:
            assert tensor.device.type == "cuda"
            torch.cuda.synchronize()  # Synchronize before returning
            return tensor.shape
        except Exception as e:
            print(f"Error in receiver: {e}")
            raise

sender = GPUSender.remote()
receiver = GPUReceiver.remote()

from ray.experimental.channel.torch_tensor_type import TorchTensorType

with ray.dag.InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_graph = dag.experimental_compile()

assert ray.get(compiled_graph.execute((10, ))) == (10, )
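
One note on the script above: the os.environ assignments only affect the driver process, so the actors scheduled on the remote node may not see these NCCL settings at all. A hedged sketch of passing them through the job's runtime_env instead (standard Ray API; the values are copied from the script, and eth0 remains an assumption about the interface name on both nodes):

import ray

ray.init(
    address="auto",
    runtime_env={
        "env_vars": {
            "NCCL_DEBUG": "INFO",
            "NCCL_SOCKET_IFNAME": "eth0",  # assumption: eth0 exists on both nodes
            "NCCL_IB_DISABLE": "1",
            "NCCL_P2P_DISABLE": "1",
        }
    },
)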

Issue Severity

None

@anonymousmaharaj added the bug and triage labels Jan 14, 2025
@jcotant1 added the core and compiled-graphs labels Jan 14, 2025
@ruisearch42 self-assigned this Jan 16, 2025
@ruisearch42
Contributor

@anonymousmaharaj What happens if you remove the following:

os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_P2P_DISABLE"] = "1"

After removing them I was able to run locally without an issue on a single node.

@anonymousmaharaj
Author


@ruisearch42 Same error, unfortunately.
I can provide any information about my nodes that you need.

@jjyao added the P1 label and removed the triage label Jan 27, 2025