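### Version
23.08

### Which installation method(s) does this occur on?
Docker, Conda, Pip, Source

### Describe the bug.
NCCL reports a failure when a Dask client is started with worker device IDs that are not a consecutive range starting from 0 (for example, `dask_worker_devices=[1]`).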
### Minimum reproducible example

    In [1]: from cugraph.testing.mg_utils import start_dask_client
    In [2]: client, cluster = start_dask_client(dask_worker_devices=[1], jit_unspill=False)
### Relevant log output
```shell
Dask client/cluster created using LocalCUDACluster
2023-09-27 04:28:47,439 - distributed.worker - WARNING - Run Failed
Function:  _func_init_all
args:      (b'\x1c9P\xba\x10\xdcGG\xaf\xe8\xe6\x91\xce\x93\xe2\xd2', b'\xe4/`\xf6\x94\x08\xbe\xb7\x02\x00\xb4-\n!\xe3\xa8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00p\xd4_\xccv\x7f\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x80\x0b\x9dU\x00V\x00\x00\x00\x00\x00\x00\x8e\xfa\x8b\xd9`\x8e T\x00V\x00\x00\x80i^\xccv\x7f\x00\x00\x00Ea>\xe9\xa8\x856\xd0\xcdj\xccv\x7f\x00\x00p\xd4_\xccv\x7f\x00\x00 \xd6\x9d\xbbv\x7f\x00\x00\xe0\xb5 T\x00V\x00', True, {'tcp://127.0.0.1:43613': {'rank': 3, 'port': 33031}, 'tcp://127.0.0.1:43837': {'rank': 0, 'port': 60137}}, False, 0)
kwargs:    {'dask_worker': <Worker 'tcp://127.0.0.1:43613', name: 3, status: running, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>}
Traceback (most recent call last):
  File "/home/nfs/jnke/miniconda3/envs/ppr/lib/python3.10/site-packages/distributed/worker.py", line 3174, in run
    result = await function(*args, **kwargs)
  File "/home/nfs/jnke/miniconda3/envs/ppr/lib/python3.10/site-packages/raft_dask/common/comms.py", line 446, in _func_init_all
    _func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
  File "/home/nfs/jnke/miniconda3/envs/ppr/lib/python3.10/site-packages/raft_dask/common/comms.py", line 511, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'invalid argument (run with NCCL_DEBUG=WARN for details)'
```
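One possible reading of this log, offered as an inference rather than a confirmed diagnosis: the worker mapping passed to `_func_init_all` lists two workers with ranks 3 and 0, and NCCL expects each rank to be smaller than the number of participating ranks. The sketch below checks that condition on the mapping copied from the log. The helper name `out_of_range_ranks` is hypothetical, and the check assumes the mapping lists every participating worker.

```python
# Worker mapping copied from the `args` entry in the log output.
worker_info = {
    "tcp://127.0.0.1:43613": {"rank": 3, "port": 33031},
    "tcp://127.0.0.1:43837": {"rank": 0, "port": 60137},
}


def out_of_range_ranks(worker_info):
    """Return {address: rank} for ranks outside [0, number of workers).

    Hypothetical helper for illustration; not part of raft_dask or cugraph.
    """
    n_workers = len(worker_info)
    return {
        addr: info["rank"]
        for addr, info in worker_info.items()
        if not 0 <= info["rank"] < n_workers
    }


bad = out_of_range_ranks(worker_info)
print(bad)
```

Here the worker at port 43613 is flagged, because rank 3 is not below the worker count of 2. That matches the worker named in the `kwargs` line of the failed run.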
Do we have any update on the bug? I just ran into this issue earlier this week with 23.12.
Can confirm that with rapidsai/raft#1926 this seems to be working.
Try using contiguous rank to fix cuda_visible_devices (#1926)
3b87796
This PR attempts to solve rapidsai/cugraph#3889

Authors:
- Vibhu Jawa (https://github.com/VibhuJawa)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: #1926
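To illustrate what "contiguous rank" could mean here, the following is a minimal sketch and not the code from the PR. It renumbers whatever ranks the workers currently hold to 0..n-1 while keeping their relative order. The function name `assign_contiguous_ranks` and the dictionary layout are assumptions chosen for illustration, modeled on the worker mapping shown in the issue's log.

```python
def assign_contiguous_ranks(worker_info):
    """Return a copy of worker_info with ranks renumbered 0..n-1.

    Hypothetical helper for illustration; the original rank order is
    preserved, only the gaps between ranks are removed.
    """
    ordered = sorted(worker_info, key=lambda addr: worker_info[addr]["rank"])
    return {
        addr: {**worker_info[addr], "rank": new_rank}
        for new_rank, addr in enumerate(ordered)
    }


# Example input modeled on the mapping in the issue's log (ranks 3 and 0).
workers = {
    "tcp://127.0.0.1:43613": {"rank": 3, "port": 33031},
    "tcp://127.0.0.1:43837": {"rank": 0, "port": 60137},
}
remapped = assign_contiguous_ranks(workers)
print(remapped)
```

After remapping, every rank is below the number of workers regardless of which device IDs were selected. The input mapping is left unmodified.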
VibhuJawa
jnke2016