
[Ray Compiled Graph] NCCL Internal Error #49827

Open
anonymousmaharaj opened this issue Jan 14, 2025 · 2 comments

Comments

@anonymousmaharaj

anonymousmaharaj commented Jan 14, 2025

What happened + What you expected to happen

I have installed the latest Ray, CUDA 12.6, and the latest NCCL, and I am trying to run a Compiled Graph following the Anyscale instructions, but I get the error below.
My Ray cluster consists of 2 servers, one with an RTX 3090 and one with an RTX 3080 Ti.
I have reinstalled the environment, the CUDA Toolkit, CUDA, and NCCL several times, and it doesn't help.

python app-gpu-dag.py 
2025-01-14 16:03:22,335	INFO worker.py:1636 -- Connecting to existing Ray cluster at address: 192.168.1.166:6379...
2025-01-14 16:03:22,348	INFO worker.py:1812 -- Connected to Ray cluster. View the dashboard at 192.168.1.166:8265 
2025-01-14 16:03:29,761	INFO torch_tensor_nccl_channel.py:672 -- Creating NCCL group 961c6ef4-9cbc-403d-8b0b-f86455a19f34 on actors: [Actor(GPUReceiver, fc227afd5b6a56339c72e06201000000), Actor(GPUSender, a160d92fea529c468f5d2ee501000000)]
2025-01-14 16:03:33,551	INFO torch_tensor_nccl_channel.py:697 -- NCCL group initialized.
(GPUSender pid=9988) Sender using GPU: 0
Traceback (most recent call last):
  File "/root/ray-complied-graph/app-gpu-dag.py", line 54, in <module>
    assert ray.get(compiled_graph.execute((10, ))) == (10, )
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/_private/worker.py", line 2727, in get
    return object_refs.get(timeout=timeout)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/compiled_dag_ref.py", line 103, in get
    return_vals = self._dag._execute_until(
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 2115, in _execute_until
    self._dag_output_fetcher.read(timeout),
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 296, in read
    outputs = self._read_list(timeout)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 321, in _read_list
    results.append(c.read(timeout))
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/shared_memory_channel.py", line 751, in read
    return self._channel_dict[actor_id].read(timeout)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/shared_memory_channel.py", line 602, in read
    output = self._buffers[self._next_read_index].read(timeout)
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/experimental/channel/shared_memory_channel.py", line 485, in read
    ret = self._worker.get_objects(
  File "/root/.pyenv/versions/ray/lib/python3.10/site-packages/ray/_private/worker.py", line 882, in get_objects
    ] = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 3501, in ray._raylet.CoreWorker.get_objects
  File "python/ray/includes/common.pxi", line 102, in ray._raylet.check_status
ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: 0078c9b589a9c677a810922346adee9a94c059e90100000002e1f505
2025-01-14 16:03:43,803	INFO compiled_dag_node.py:1935 -- Tearing down compiled DAG
2025-01-14 16:03:43,803	INFO compiled_dag_node.py:1940 -- Cancelling compiled worker on actor: Actor(GPUReceiver, fc227afd5b6a56339c72e06201000000)
2025-01-14 16:03:43,803	INFO compiled_dag_node.py:1940 -- Cancelling compiled worker on actor: Actor(GPUSender, a160d92fea529c468f5d2ee501000000)
(GPUReceiver pid=6305, ip=192.168.1.159) Destructing NCCL group on actor: Actor(GPUReceiver, fc227afd5b6a56339c72e06201000000)
(GPUReceiver pid=6305, ip=192.168.1.159) ERROR:root:Compiled DAG task exited with exception
(GPUReceiver pid=6305, ip=192.168.1.159) Traceback (most recent call last):
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 161, in do_exec_tasks
(GPUReceiver pid=6305, ip=192.168.1.159)     done = tasks[operation.exec_task_idx].exec_operation(
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 644, in exec_operation
(GPUReceiver pid=6305, ip=192.168.1.159)     return self._read(overlap_gpu_communication)
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 537, in _read
(GPUReceiver pid=6305, ip=192.168.1.159)     input_data = self.input_reader.read()
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 296, in read
(GPUReceiver pid=6305, ip=192.168.1.159)     outputs = self._read_list(timeout)
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 321, in _read_list
(GPUReceiver pid=6305, ip=192.168.1.159)     results.append(c.read(timeout))
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 243, in read
(GPUReceiver pid=6305, ip=192.168.1.159)     tensors = self._gpu_data_channel.read(timeout)
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 527, in read
(GPUReceiver pid=6305, ip=192.168.1.159)     buf = self._nccl_group.recv(
(GPUReceiver pid=6305, ip=192.168.1.159)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/nccl_group.py", line 238, in recv
(GPUReceiver pid=6305, ip=192.168.1.159)     self._comm.recv(
(GPUReceiver pid=6305, ip=192.168.1.159)   File "cupy_backends/cuda/libs/nccl.pyx", line 480, in cupy_backends.cuda.libs.nccl.NcclCommunicator.recv
(GPUReceiver pid=6305, ip=192.168.1.159)   File "cupy_backends/cuda/libs/nccl.pyx", line 128, in cupy_backends.cuda.libs.nccl.check_status
(GPUReceiver pid=6305, ip=192.168.1.159) cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_INTERNAL_ERROR: internal error - please report this issue to the NCCL developers
(GPUSender pid=9988) Destructing NCCL group on actor: Actor(GPUSender, a160d92fea529c468f5d2ee501000000)
(GPUSender pid=9988)     return self._write()
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 616, in _write
(GPUSender pid=9988)     self.output_writer.write(output_val)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/common.py", line 500, in write
(GPUSender pid=9988)     channel.write(val_i, timeout)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 198, in write
(GPUSender pid=9988)     self._send_cpu_and_gpu_data(value, timeout)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 142, in _send_cpu_and_gpu_data
(GPUSender pid=9988)     self._gpu_data_channel.write(gpu_tensors)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 484, in write
(GPUSender pid=9988)     self._nccl_group.send(tensor, rank)
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/experimental/channel/nccl_group.py", line 191, in send
(GPUSender pid=9988)     self._comm.send(
(GPUSender pid=9988)   File "cupy_backends/cuda/libs/nccl.pyx", line 471, in cupy_backends.cuda.libs.nccl.NcclCommunicator.send
2025-01-14 16:03:44,338	INFO compiled_dag_node.py:1960 -- Waiting for worker tasks to exit
2025-01-14 16:03:44,339	INFO compiled_dag_node.py:1962 -- Teardown complete
(GPUSender pid=9988) ERROR:root:Compiled DAG task exited with exception
(GPUSender pid=9988) Traceback (most recent call last):
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 161, in do_exec_tasks
(GPUSender pid=9988)     done = tasks[operation.exec_task_idx].exec_operation(
(GPUSender pid=9988)   File "/root/.pyenv/versions/3.10.16/envs/ray/lib/python3.10/site-packages/ray/dag/compiled_dag_node.py", line 649, in exec_operation
(GPUSender pid=9988)   File "cupy_backends/cuda/libs/nccl.pyx", line 128, in cupy_backends.cuda.libs.nccl.check_status
(GPUSender pid=9988) cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_INTERNAL_ERROR: internal error - please report this issue to the NCCL developers
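Since the failure surfaces inside ncclSend/ncclRecv (via cupy), it may help to first confirm that plain NCCL traffic works between the two machines outside of Ray. Below is a minimal diagnostic sketch using torch.distributed with the NCCL backend; the head-node IP is taken from the logs above, while the port and the RANK variable are assumptions for this test only.

import os
import torch
import torch.distributed as dist

# Minimal two-node NCCL check outside Ray (diagnostic sketch).
# Launch once per node, e.g. RANK=0 on 192.168.1.166 and RANK=1 on 192.168.1.159.
os.environ.setdefault("MASTER_ADDR", "192.168.1.166")  # head-node IP from the logs above
os.environ.setdefault("MASTER_PORT", "29500")          # assumed free port

rank = int(os.environ["RANK"])  # set manually when launching each process
dist.init_process_group(backend="nccl", rank=rank, world_size=2)

t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # expected: tensor([2.]) on both nodes if NCCL works
print(f"rank {rank}: {t}")
dist.destroy_process_group()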

Versions / Dependencies

ray 2.40.0
cupy-cuda12x 13.3.0

python -c "import torch; print(f'PyTorch version: {torch.__version__}. Cuda Version {torch.version.cuda}')"
PyTorch version: 2.5.1+cu124. Cuda Version 12.4
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

dpkg -l | grep nccl
hi  libnccl-dev                                2.24.3-1+cuda12.6                       amd64        NVIDIA Collective Communication Library (NCCL) Development Files
hi  libnccl2                                   2.24.3-1+cuda12.6                       amd64        NVIDIA Collective Communication Library (NCCL) Runtime
ii  nccl-local-repo-ubuntu2204-2.24.3-cuda12.4 1.0-1                                   amd64        nccl-local repository configuration files
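Because Ray's NCCL channel goes through cupy (see the cupy_backends frames in the traceback) rather than the system libnccl alone, it may also be worth checking which NCCL versions torch and cupy actually report. A small sketch, assuming the usual torch.cuda.nccl.version() and cupy.cuda.nccl.get_version() helpers are available:

import torch
from cupy.cuda import nccl

print("torch NCCL:", torch.cuda.nccl.version())  # NCCL bundled with the PyTorch wheel
print("cupy NCCL:", nccl.get_version())          # NCCL that cupy loads at runtime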

Reproduction script

Start the Ray head node with:
ray start --head --node-ip-address=192.168.1.166 --port=6379 --dashboard-port=8265 --dashboard-host=0.0.0.0 --num-gpus=1

Connect the second node to the cluster with:
ray start --address='192.168.1.166:6379' --num-gpus=1
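
Before running the reproduction script below, it can be worth confirming that both nodes registered their GPU with the cluster; a quick check using the standard Ray API:

import ray

ray.init(address="auto")
# Expect {'GPU': 2.0, ...} if both nodes joined with one GPU each.
print(ray.cluster_resources())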

import ray
import ray.dag
import torch
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_P2P_DISABLE"] = "1"

ray.init(address="auto")

@ray.remote(num_gpus=1)
class GPUSender:
    def send(self, shape):
        current_device = torch.cuda.current_device()
        print(f"Sender using GPU: {current_device}")
        try:
            tensor = torch.zeros(shape, device=f"cuda:{current_device}")
            torch.cuda.synchronize()
            return tensor
        except Exception as e:
            print(f"Error in sender: {e}")
            raise

@ray.remote(num_gpus=1)
class GPUReceiver:
    def recv(self, tensor: torch.Tensor):
        current_device = torch.cuda.current_device()
        print(f"Receiver using GPU: {current_device}")
        try:
            assert tensor.device.type == "cuda"
            torch.cuda.synchronize()  # Synchronize before returning
            return tensor.shape
        except Exception as e:
            print(f"Error in receiver: {e}")
            raise

sender = GPUSender.remote()
receiver = GPUReceiver.remote()

from ray.experimental.channel.torch_tensor_type import TorchTensorType

with ray.dag.InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_graph = dag.experimental_compile()

assert ray.get(compiled_graph.execute((10, ))) == (10, )
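
One note on the script above: the os.environ assignments only affect the driver process, so the actors scheduled on the remote node may not see these NCCL settings at all. A hedged sketch of passing them through the job's runtime_env instead (standard Ray API; the values are copied from the script, and eth0 remains an assumption about the interface name on both nodes):

import ray

ray.init(
    address="auto",
    runtime_env={
        "env_vars": {
            "NCCL_DEBUG": "INFO",
            "NCCL_SOCKET_IFNAME": "eth0",  # assumption: eth0 exists on both nodes
            "NCCL_IB_DISABLE": "1",
            "NCCL_P2P_DISABLE": "1",
        }
    },
)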

Issue Severity

None

@anonymousmaharaj added the bug and triage labels Jan 14, 2025
@jcotant1 added the core and compiled-graphs labels Jan 14, 2025
@ruisearch42 self-assigned this Jan 16, 2025
@ruisearch42
Contributor

@anonymousmaharaj What happens if you remove the following:

os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_P2P_DISABLE"] = "1"

After removing them I was able to run locally without an issue on a single node.

@anonymousmaharaj
Author


@ruisearch42 Same error, unfortunately.
I can provide any information about my nodes that you need.

@jjyao added the P1 label and removed the triage label Jan 27, 2025