[BUG]: NCCL failure when starting a dask client #3889

Closed
2 tasks done
jnke2016 opened this issue Sep 27, 2023 · 2 comments
Labels: bug (Something isn't working)

@jnke2016
Contributor

Version

23.08

Which installation method(s) does this occur on?

Docker, Conda, Pip, Source

Describe the bug.

NCCL reports a failure when starting a Dask client if the worker device IDs are not consecutive starting from 0, for example dask_worker_devices=[1].

Minimum reproducible example

In [1]: from cugraph.testing.mg_utils import start_dask_client
In [2]: client, cluster = start_dask_client(dask_worker_devices=[1], jit_unspill=False)
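
The error text in the log suggests rerunning with NCCL_DEBUG=WARN for details. A minimal sketch of the same reproducer as a script with that variable set, assuming the worker processes inherit the environment of the process that starts the cluster. The helper name `reproduce` is hypothetical and not part of cuGraph, and actually calling it needs a machine with at least two GPUs and cuGraph installed.

```python
import os


def reproduce(devices=(1,)):
    """Run the reproducer from this report with NCCL warnings enabled.

    `reproduce` is a hypothetical helper name, not part of cuGraph.
    """
    # Set before the cluster starts so that worker processes can inherit it
    # (assumption: workers are spawned as children of this process).
    os.environ["NCCL_DEBUG"] = "WARN"

    # Imported lazily so that defining this function does not require a GPU.
    from cugraph.testing.mg_utils import start_dask_client

    # Same call as in the report; on an affected setup this is where the
    # NCCL error would be expected to surface.
    return start_dask_client(dask_worker_devices=list(devices), jit_unspill=False)
```

Calling `client, cluster = reproduce()` on an affected setup should hit the same failure, ideally with NCCL's own warning lines added to the worker output.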


### Relevant log output

```shell
Dask client/cluster created using LocalCUDACluster
2023-09-27 04:28:47,439 - distributed.worker - WARNING - Run Failed
Function: _func_init_all
args:     (b'\x1c9P\xba\x10\xdcGG\xaf\xe8\xe6\x91\xce\x93\xe2\xd2', b'\xe4/`\xf6\x94\x08\xbe\xb7\x02\x00\xb4-\n!\xe3\xa8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00p\xd4_\xccv\x7f\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x80\x0b\x9dU\x00V\x00\x00\x00\x00\x00\x00\x8e\xfa\x8b\xd9`\x8e T\x00V\x00\x00\x80i^\xccv\x7f\x00\x00\x00Ea>\xe9\xa8\x856\xd0\xcdj\xccv\x7f\x00\x00p\xd4_\xccv\x7f\x00\x00 \xd6\x9d\xbbv\x7f\x00\x00\xe0\xb5 T\x00V\x00', True, {'tcp://127.0.0.1:43613': {'rank': 3, 'port': 33031}, 'tcp://127.0.0.1:43837': {'rank': 0, 'port': 60137}}, False, 0)
kwargs:   {'dask_worker': <Worker 'tcp://127.0.0.1:43613', name: 3, status: running, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>}
Traceback (most recent call last):
  File "/home/nfs/jnke/miniconda3/envs/ppr/lib/python3.10/site-packages/distributed/worker.py", line 3174, in run
    result = await function(*args, **kwargs)
  File "/home/nfs/jnke/miniconda3/envs/ppr/lib/python3.10/site-packages/raft_dask/common/comms.py", line 446, in _func_init_all
    _func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
  File "/home/nfs/jnke/miniconda3/envs/ppr/lib/python3.10/site-packages/raft_dask/common/comms.py", line 511, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'invalid argument (run with NCCL_DEBUG=WARN for details)'
```
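One detail worth checking in the log above is the worker table passed to _func_init_all: it lists two workers, with ranks 3 and 0. NCCL expects each rank to lie between 0 and the number of ranks minus one, so a rank of 3 among two workers would be consistent with the "invalid argument" error. This is a reading of the log, not a root cause confirmed in this thread. The sketch below flags such ranks; `out_of_range_ranks` is a hypothetical helper, not part of raft_dask.

```python
def out_of_range_ranks(worker_info):
    """Return {address: rank} for ranks outside 0..len(worker_info)-1.

    Hypothetical helper for illustration; not part of raft_dask.
    """
    n_workers = len(worker_info)
    return {
        addr: info["rank"]
        for addr, info in worker_info.items()
        if not 0 <= info["rank"] < n_workers
    }


# Worker table copied from the `args` line of the log above.
worker_info_from_log = {
    "tcp://127.0.0.1:43613": {"rank": 3, "port": 33031},
    "tcp://127.0.0.1:43837": {"rank": 0, "port": 60137},
}

bad = out_of_range_ranks(worker_info_from_log)
print(bad)  # {'tcp://127.0.0.1:43613': 3}
```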

Environment details

No response

Other/Misc.

No response

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@jnke2016 added the labels bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) on Sep 27, 2023
@jnke2016 changed the title from "[BUG]: NCCL failure when starting a dask cli" to "[BUG]: NCCL failure when starting a dask client" on Sep 27, 2023
@BradReesWork removed the ? - Needs Triage (Need team to review and classify) label on Sep 29, 2023
@tingyu66
Member

Do we have any update on the bug? I just ran into this issue earlier this week with 23.12.

@VibhuJawa
Member

Can confirm that with rapidsai/raft#1926 this seems to be working.

rapids-bot bot pushed a commit to rapidsai/raft that referenced this issue Oct 25, 2023
5 participants