Version
23.12

Which installation method(s) does this occur on?
Docker, Conda, Pip, Source

Describe the bug.
Graph creation fails past scale 27, which corresponds to 2.1+ billion edges, irrespective of cluster size: clusters ranging from 16 up to 256 GPUs all attempted to create the RMAT-generated graph, and none succeeded.
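Minimum reproducible example
The exact reproducer is not captured in this report; the following is a minimal sketch of the MNMG RMAT graph-creation path described above. The dask setup (scheduler file path), the RMAT skew parameters, and the edgefactor of 16 are illustrative assumptions rather than values taken from the failing runs.

```python
# Minimal sketch (not the original script), assuming an already-running
# dask-CUDA cluster reachable through a scheduler file.
import cugraph
from cugraph.dask.comms import comms as Comms
from cugraph.generators import rmat
from dask.distributed import Client

client = Client(scheduler_file="scheduler.json")  # hypothetical path
Comms.initialize(p2p=True)

scale = 27
edgefactor = 16                      # Graph500-style default (assumption)
num_edges = edgefactor * (2**scale)  # 2,147,483,648 -> the "2.1+ billion" edges above

# Generate a distributed (dask_cudf) edge list with 'src'/'dst' columns.
edgelist = rmat(
    scale,
    num_edges,
    0.57, 0.19, 0.19,        # a, b, c skew parameters (typical values; assumption)
    seed=42,
    clip_and_flip=False,
    scramble_vertex_ids=True,
    create_using=None,       # return the edge list rather than a Graph
    mg=True,
)

# The reported failure surfaces here, inside cugraph_mg_graph_create().
G = cugraph.Graph(directed=False)
G.from_dask_cudf_edgelist(edgelist, source="src", destination="dst")

Comms.destroy()
client.close()
```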
Relevant log output
terminate called after throwing an instance of 'raft::logic_error'
  what():  NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=303:
NOTE: One worker failed with the error below, and the failure appears to occur at graph creation:
[16500106 rows x 2 columns]], <pylibcugraph.graph_properties.GraphProperties object at 0x14d82c5da1d0>, 'src', 'dst', False, dtype('int32'), None, None, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from cugraph_mg_graph_create(): CUGRAPH_UNKNOWN_ERROR NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=500: ')"
NOTE: Unusually high host memory usage:
2023-12-15 05:53:33,945 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 44.19 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:53:44,012 - distributed.utils_perf - INFO - full garbage collection released 2.13 GiB from 371 reference cycles (threshold: 9.54 MiB)
2023-12-15 05:53:50,484 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 44.15 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:00,618 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 48.13 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:10,773 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 48.32 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:13,125 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 50.65 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:19,048 - distributed.worker.memory - WARNING - Worker is at 78% memory usage. Resuming worker. Process memory: 49.69 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:20,843 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 49.44 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:23,891 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 50.67 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:27,048 - distributed.worker.memory - WARNING - Worker is at 79% memory usage. Resuming worker. Process memory: 50.32 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:28,736 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 50.53 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:29,005 - distributed.worker.memory - WARNING - Worker is at 79% memory usage. Resuming worker. Process memory: 50.32 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:29,208 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 50.56 GiB -- Worker memory limit: 62.97 GiB
Environment details
23.12 MNMG Nightly squash file
Other/Misc.
Several attempts were made to identify or isolate the issue, without success, such as:
Running the client in a separate process
Moving the edgelist created by the RMAT generator to host memory, with the goal of freeing device memory (see the sketch at the end of this section)
Testing raft PR 1928, which ensures that NCCL identifies the correct rank
NOTE: Only one node fails, causing the other nodes to wait for its completion and thereby creating a hang. The node with rank 0 is always the one that fails.
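For reference, the host-memory offload attempt mentioned above was along these lines. This is a sketch only, assuming a dask_cudf edge list named edgelist (for example, the one produced by the RMAT generator); the exact mechanism used in the experiments is not captured here.

```python
import cudf

# Convert each cudf partition to pandas so the edge list resides in host
# memory, with the intent of freeing device memory before graph creation.
host_edgelist = edgelist.map_partitions(lambda part: part.to_pandas())
host_edgelist = host_edgelist.persist()

# Move the partitions back to the GPU when the edge list is needed again.
device_edgelist = host_edgelist.map_partitions(cudf.DataFrame.from_pandas)
```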
Code of Conduct
I agree to follow cuGraph's Code of Conduct
I have searched the open bugs and have found no duplicates for this bug report