[ENH] `from_cudf_edgelist` should have a `num_nodes` argument #1206

flandolfi · 2020-10-12T16:23:29Z

Hi,

I noticed that from_cudf_edgelist does not allow the user to specify the number of nodes in the input graph. This could lead to problems while converting empty graphs or graphs with isolated nodes (with identifier greater than the maximum value in the DataFrame). Renumbering is of no help.

How can we specify such value?

I tried changing the value of G.node_count or passing a list of identifiers to G.add_nodes_from() (both before and after from_cudf_edgelist), but executing louvain/leiden/ecg produces an error. For example:

>>> import cudf
>>> import cugraph as cx
>>> df = cudf.DataFrame([[0, 1, 1], [1, 0, 1]], columns=['source', 'destination', 'weight'])
>>> df
   source  destination  weight
0       0            1       1
1       1            0       1
>>> G = cx.Graph()  # Here G is a graph of 10 nodes, 8 of which are isolated
>>> G.add_nodes_from(list(range(10)))
>>> G.from_cudf_edgelist(df, edge_attr='weight')
>>> G.number_of_nodes()
2
>>> G.node_count = 10
>>> G.number_of_nodes()
10
>>> cx.community.louvain(G)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "[...]/python3.7/site-packages/cugraph/community/louvain.py", line 78, in louvain
    input_graph, max_iter, resolution
  File "cugraph/community/louvain_wrapper.pyx", line 72, in cugraph.community.louvain_wrapper.louvain
RuntimeError: __copy::trivial_device_copy H->D: failed: cudaErrorInvalidValue: invalid argument

The only workaround at the moment is to add self-loops to every (missing) node.

Kind regards,
Francesco

The text was updated successfully, but these errors were encountered:

afender · 2020-10-12T20:30:52Z

Hi Francesco,

Thanks for your message. from_cudf_edgelist has renumbering turned set to True by default. Renumbering will discard isolated vertices by nature. At this time adding a singleton (or resizing the graph) after loading the data this way is not supported/tested. There is no major technical blocker for that other than the time for enabling and testing this path.

The good news is that you can bypass this limitation and keep singletons if you set renumbering to False. Just make sure that the smallest vertex label is 0 and the largest vertex label is the total number of vertices. In your example, if you want 10 vertices it would look like this:

>>> df = cudf.DataFrame([[0, 9, 1], [9, 0, 1]], columns=['source', 'destination', 'weight'])
>>> G.from_cudf_edgelist(df, edge_attr='weight', renumber=False)
>>> G.number_of_nodes()
10

In this case vertices 1 to 8 would be isolated vertices
There may be other ways to hack this but I'd need to better understand what you are trying to do by adding these isolated vertices.

If you need both renumbering and resizing/growing the graph, the fastest way is probably to try adding it, the cugraph team will be happy to review your PR and assist. We'll also keep the issue on the radar and prioritize for future releases.

Hope that helps, and thanks again.

afender · 2020-10-12T20:35:53Z

Ps. disconnected graphs support is not well tested in cuGraph at the moment. It is quite possible that some analytics still fail when such inputs are passed. This is something we have been tracking in #305 and plan on continuing to improve in the next release.

flandolfi · 2020-10-13T09:42:40Z

Thank you for your quick reply!

Renumbering a vertex with the largest index is indeed a hack, but still needs some user action to manage it.
For example, I'm converting graphs from PyTorch-Geometric to cuGraph to make clustering on device, and then bring the clustering index back to PyTorch. Since I'm dealing with entire datasets, changing the indices on the fly can be cumbersome.
This is just an example, there can be other scenarios in which the size of the graph has to be specified (e.g., importing a fixed sized graph from CSV).

Anyway, thanks for your suggestion!

BradReesWork · 2020-10-13T20:26:51Z

@flandolfi thanks for the great question. We are just getting ready to start improving a lot of basic function around graph creation. You are correct that an isolated node will not appear in an edge list and therefore not be created.

BradReesWork · 2020-10-20T14:26:35Z

@flandolfi The issue that I'm running into with specifying the number of nodes that should be in a graph when using from_cudf_edgelist is that there is no consistent way to determine what the missing node names should be. If your data has contiguous integer values for nodes, then it is an easy process to assign one-up numbers to the isolated nodes. But if your data is not integers, say IP addresses, then what names are given to nodes?

Now if your data is integers, just not contiguous, then just setting Renumbering to False will fill in all missed values with isolated nodes. Those nodes just need to be within the range and not at the end.

The number of nodes created in the graph is equal to the max node ID + 1. Renumbering simply packs values to be contiguous and to be integers.

flandolfi · 2020-10-20T15:03:00Z

Many Graph/Network file formats (e.g., Graph Markup Language) first define all the nodes by their ID, attributes, labels, etc., then specify all the edges of the graph only by the IDs of their end-nodes. This avoids the repetition of known information (similarly to the Bakus-Naur form in databases).

So, in your example, I would find all the IPs in the "header", along with their unique node-IDs, then the list of edges (that can be empty).

Again, renumbering is a trick and it works in most cases. I just believe there should be a way to specify the nodes in advance. Notice that this does not mean that you have to change the signature of from_cudf_edgelist. For example, a totally fine (and maybe better) solution could be preventing the re-initialization of the graph when calling from_cudf_edgelist, e.g.,

import cudf
import cugraph as cx

ips = [
    '192.168.1.1',
    '192.168.1.2',
    ...
    '192.168.1.254'
]

...  # Load the edge list to `df`

G = cx.Graph() 
G.add_nodes_from(ips)
G.from_cudf_edgelist(df, renumber=False)
print(G.number_of_nodes())  # Prints `len(ips)`

This approach will separate the two tasks (adding nodes and adding edges), thus allowing an incremental definition of the graphs (also, one could "update" the current graph by multiple calls of from_cudf_edgelist, maybe?)

BradReesWork · 2021-02-01T20:51:00Z

issue is being addressed with enhancement issue #1372

flandolfi added the ? - Needs Triage Need team to review and classify label Oct 12, 2020

flandolfi mentioned this issue Oct 12, 2020

[FEA] pytorch geometric interface #1200

Closed

afender mentioned this issue Oct 13, 2020

Test with isolated vertices and self loops #1214

Closed

BradReesWork added this to the 0.18 milestone Nov 30, 2020

BradReesWork added improvement Improvement / enhancement to an existing function and removed ? - Needs Triage Need team to review and classify labels Nov 30, 2020

BradReesWork mentioned this issue Feb 1, 2021

[ENH] from_cudf_edgelist should support NULL mask for isolated vertices #1372

Closed

BradReesWork closed this as completed Feb 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] `from_cudf_edgelist` should have a `num_nodes` argument #1206

[ENH] `from_cudf_edgelist` should have a `num_nodes` argument #1206

flandolfi commented Oct 12, 2020 •

edited

Loading

afender commented Oct 12, 2020

afender commented Oct 12, 2020

flandolfi commented Oct 13, 2020

BradReesWork commented Oct 13, 2020

BradReesWork commented Oct 20, 2020

flandolfi commented Oct 20, 2020

BradReesWork commented Feb 1, 2021

[ENH] from_cudf_edgelist should have a num_nodes argument #1206

[ENH] from_cudf_edgelist should have a num_nodes argument #1206

Comments

flandolfi commented Oct 12, 2020 • edited Loading

afender commented Oct 12, 2020

afender commented Oct 12, 2020

flandolfi commented Oct 13, 2020

BradReesWork commented Oct 13, 2020

BradReesWork commented Oct 20, 2020

flandolfi commented Oct 20, 2020

BradReesWork commented Feb 1, 2021

[ENH] `from_cudf_edgelist` should have a `num_nodes` argument #1206

[ENH] `from_cudf_edgelist` should have a `num_nodes` argument #1206

flandolfi commented Oct 12, 2020 •

edited

Loading