Version
23.12

Which installation method(s) does this occur on?
Docker, Conda, Pip, Source

Describe the bug.
Graph creation fails past scale 27, which corresponds to 2.1+ billion edges, irrespective of cluster size: clusters ranging from 16 up to 256 GPUs all attempted to create the RMAT-generated graph, and none succeeded.
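Minimum reproducible example
The exact reproducer is not captured in this report; the following is a minimal sketch of the MNMG RMAT graph-creation path described above. The dask setup (scheduler file path), the RMAT skew parameters, and the edgefactor of 16 are illustrative assumptions rather than values taken from the failing runs.

```python
# Minimal sketch (not the original script), assuming an already-running
# dask-CUDA cluster reachable through a scheduler file.
import cugraph
from cugraph.dask.comms import comms as Comms
from cugraph.generators import rmat
from dask.distributed import Client

client = Client(scheduler_file="scheduler.json")  # hypothetical path
Comms.initialize(p2p=True)

scale = 27
edgefactor = 16                      # Graph500-style default (assumption)
num_edges = edgefactor * (2**scale)  # 2,147,483,648 -> the "2.1+ billion" edges above

# Generate a distributed (dask_cudf) edge list with 'src'/'dst' columns.
edgelist = rmat(
    scale,
    num_edges,
    0.57, 0.19, 0.19,        # a, b, c skew parameters (typical values; assumption)
    seed=42,
    clip_and_flip=False,
    scramble_vertex_ids=True,
    create_using=None,       # return the edge list rather than a Graph
    mg=True,
)

# The reported failure surfaces here, inside cugraph_mg_graph_create().
G = cugraph.Graph(directed=False)
G.from_dask_cudf_edgelist(edgelist, source="src", destination="dst")

Comms.destroy()
client.close()
```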
Relevant log output
terminate called after throwing an instance of 'raft::logic_error'
  what():  NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=303:
NOTE: One worker failed with the error below, and the failure appears to occur at graph creation:
[16500106 rows x 2 columns]], <pylibcugraph.graph_properties.GraphProperties object at 0x14d82c5da1d0>, 'src', 'dst', False, dtype('int32'), None, None, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from cugraph_mg_graph_create(): CUGRAPH_UNKNOWN_ERROR NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=500: ')"
NOTE: Unusually high host memory usage:
2023-12-15 05:53:33,945 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 44.19 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:53:44,012 - distributed.utils_perf - INFO - full garbage collection released 2.13 GiB from 371 reference cycles (threshold: 9.54 MiB)
2023-12-15 05:53:50,484 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 44.15 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:00,618 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 48.13 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:10,773 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 48.32 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:13,125 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 50.65 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:19,048 - distributed.worker.memory - WARNING - Worker is at 78% memory usage. Resuming worker. Process memory: 49.69 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:20,843 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 49.44 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:23,891 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 50.67 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:27,048 - distributed.worker.memory - WARNING - Worker is at 79% memory usage. Resuming worker. Process memory: 50.32 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:28,736 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 50.53 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:29,005 - distributed.worker.memory - WARNING - Worker is at 79% memory usage. Resuming worker. Process memory: 50.32 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:29,208 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 50.56 GiB -- Worker memory limit: 62.97 GiB
Environment details
23.12 MNMG Nightly squash file
Other/Misc.
Several attempts were made to identify or isolate the issue, without success, such as:
Running the client in a separate process
Moving the edgelist created by the RMAT generator to host memory, with the goal of freeing device memory (see the sketch at the end of this section)
Testing raft PR 1928, which ensures that NCCL identifies the correct rank
NOTE: Only one node fails, causing the other nodes to wait for its completion and thereby creating a hang. The node with rank 0 is always the one that fails.
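For reference, the host-memory offload attempt mentioned above was along these lines. This is a sketch only, assuming a dask_cudf edge list named edgelist (for example, the one produced by the RMAT generator); the exact mechanism used in the experiments is not captured here.

```python
import cudf

# Convert each cudf partition to pandas so the edge list resides in host
# memory, with the intent of freeing device memory before graph creation.
host_edgelist = edgelist.map_partitions(lambda part: part.to_pandas())
host_edgelist = host_edgelist.persist()

# Move the partitions back to the GPU when the edge list is needed again.
device_edgelist = host_edgelist.map_partitions(cudf.DataFrame.from_pandas)
```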
Code of Conduct
I agree to follow cuGraph's Code of Conduct
I have searched the open bugs and have found no duplicates for this bug report