Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try using contiguous rank to fix cuda_visible_devices #1926

Merged
merged 3 commits into from
Oct 25, 2023

Conversation

VibhuJawa
Copy link
Member

This PR attempts to solve rapidsai/cugraph#3889

@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 24, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet cjnolet added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 24, 2023
@cjnolet
Copy link
Member

cjnolet commented Oct 24, 2023

/ok to test

@VibhuJawa
Copy link
Member Author

VibhuJawa commented Oct 25, 2023

Can confirm that with this PR below issue (rapidsai/cugraph#3889) is resolved .

In [1]: from cugraph.testing.mg_utils import start_dask_client
In [2]: client, cluster = start_dask_client(dask_worker_devices=[1], jit_unspill=False)

@jnke2016
Copy link

Thanks @VibhuJawa.
This PR looks good to me. I will need to test this PR on Draco to make sure it doesn't bring back the nccl transient issue that was resolved few months ago

@VibhuJawa
Copy link
Member Author

In [1]: from raft_dask.common import Comms
   ...: from dask_cuda import LocalCUDACluster
   ...: from dask.distributed import Client
   ...: 
   ...: cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="1")
   ...: client = Client(cluster)
   ...: cb = Comms(verbose=True, comms_p2p=True)
   ...: cb.init()

With brach-23.12:

Initializing comms!
2023-10-24 18:44:17,633 - distributed.worker - WARNING - Run Failed
Function: _func_init_all
args:     (b'b\xbbF\xa8\x1d\xa6O0\xba\xd9\xcb\x16\xbd\xfe\xbd\xd4', b'\x9fW\x1bd\xfb\x81Rs\x02\x00\xcd\xab\n!\xe4F\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xe0\x11\x06VI\x7f\x00\x00 \xb6E\xa0E\x7f\x00\x00p\x86\xb4VI\x7f\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x02\x93\x86\xb1U\x00\x00\x00\x00\x00\x00\xce\x90\xd6*`^\xfa\x84\xb1U\x00\x00\xe0\x11\x06VI\x7f\x00\x00\x00\x03\xf6\x9a\x9c\xaa\xa3w \xb9U\xacE\x7f\x00\x00p\x86\xb4VI\x7f\x00', True, {'tcp://127.0.0.1:38629': {'rank': 1, 'port': 47810}}, True, 0)
kwargs:   {'dask_worker': <Worker 'tcp://127.0.0.1:38629', name: 1, status: running, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>}
Traceback (most recent call last):
  File "/datasets/vjawa/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/distributed/worker.py", line 3174, in run
    result = await function(*args, **kwargs)
  File "/datasets/vjawa/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/raft_dask/common/comms.py", line 446, in _func_init_all
    _func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
  File "/datasets/vjawa/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/raft_dask/common/comms.py", line 511, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'invalid argument (run with NCCL_DEBUG=WARN for details)'

With PR:

Initializing comms!
Initialization complete.

@cjnolet
Copy link
Member

cjnolet commented Oct 25, 2023

/merge

@cjnolet cjnolet marked this pull request as ready for review October 25, 2023 01:51
@cjnolet cjnolet requested a review from a team as a code owner October 25, 2023 01:51
@cjnolet
Copy link
Member

cjnolet commented Oct 25, 2023

/merge

@rapids-bot rapids-bot bot merged commit 3b87796 into rapidsai:branch-23.12 Oct 25, 2023
60 checks passed
seberg added a commit to seberg/raft that referenced this pull request Oct 25, 2023
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original [PR]() mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
seberg added a commit to seberg/raft that referenced this pull request Oct 25, 2023
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
seberg added a commit to seberg/raft that referenced this pull request Oct 25, 2023
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
seberg added a commit to seberg/raft that referenced this pull request Oct 25, 2023
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
seberg added a commit to seberg/raft that referenced this pull request Nov 3, 2023
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
seberg added a commit to seberg/raft that referenced this pull request Nov 14, 2023
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
seberg added a commit to seberg/raft that referenced this pull request Feb 13, 2024
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
seberg added a commit to seberg/raft that referenced this pull request Feb 13, 2024
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
seberg added a commit to seberg/raft that referenced this pull request Feb 13, 2024
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
VibhuJawa pushed a commit to VibhuJawa/raft that referenced this pull request Mar 14, 2024
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
rapids-bot bot pushed a commit that referenced this pull request Mar 15, 2024
This PR is based on @seberg work in  #1928 .  

From the PR:


This is a follow up on #1926, since the rank sorting seemed a bit hard to understand.

It does modify the logic in the sense that the host is now sorted by IP as a way to group based on it. But I don't really think that host sorting was ever a goal? 

If the goal is really about being deterministic, then this should be more (or at least clearer) deterministic about order of worker IPs.

OTOH, if the NVML device order doesn't matter, we could just sort the workers directly. 

The original #1587 mentions:

NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node
which is something that neither approach seems to quite assure, if such a requirement exists, I would want to do one of:

Ensure we can guarantee this, but this requires initializing workers that are not involved in the operation.
At least raise an error, because if NCCL will end up raising the error it will be very confusing.

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Sebastian Berg (https://github.com/seberg)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2228
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python
Projects
Development

Successfully merging this pull request may close these issues.

3 participants