Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] from_cudf_edgelist should have a num_nodes argument #1206

Closed
flandolfi opened this issue Oct 12, 2020 · 7 comments
Closed

[ENH] from_cudf_edgelist should have a num_nodes argument #1206

flandolfi opened this issue Oct 12, 2020 · 7 comments
Labels
improvement Improvement / enhancement to an existing function
Milestone

Comments

@flandolfi
Copy link

flandolfi commented Oct 12, 2020

Hi,

I noticed that from_cudf_edgelist does not allow the user to specify the number of nodes in the input graph. This could lead to problems while converting empty graphs or graphs with isolated nodes (with identifier greater than the maximum value in the DataFrame). Renumbering is of no help.

How can we specify such value?

I tried changing the value of G.node_count or passing a list of identifiers to G.add_nodes_from() (both before and after from_cudf_edgelist), but executing louvain/leiden/ecg produces an error. For example:

>>> import cudf
>>> import cugraph as cx
>>> df = cudf.DataFrame([[0, 1, 1], [1, 0, 1]], columns=['source', 'destination', 'weight'])
>>> df
   source  destination  weight
0       0            1       1
1       1            0       1
>>> G = cx.Graph()  # Here G is a graph of 10 nodes, 8 of which are isolated
>>> G.add_nodes_from(list(range(10)))
>>> G.from_cudf_edgelist(df, edge_attr='weight')
>>> G.number_of_nodes()
2
>>> G.node_count = 10
>>> G.number_of_nodes()
10
>>> cx.community.louvain(G)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "[...]/python3.7/site-packages/cugraph/community/louvain.py", line 78, in louvain
    input_graph, max_iter, resolution
  File "cugraph/community/louvain_wrapper.pyx", line 72, in cugraph.community.louvain_wrapper.louvain
RuntimeError: __copy::trivial_device_copy H->D: failed: cudaErrorInvalidValue: invalid argument

The only workaround at the moment is to add self-loops to every (missing) node.

Kind regards,
Francesco

@flandolfi flandolfi added the ? - Needs Triage Need team to review and classify label Oct 12, 2020
@afender
Copy link
Member

afender commented Oct 12, 2020

Hi Francesco,

Thanks for your message. from_cudf_edgelist has renumbering turned set to True by default. Renumbering will discard isolated vertices by nature. At this time adding a singleton (or resizing the graph) after loading the data this way is not supported/tested. There is no major technical blocker for that other than the time for enabling and testing this path.

The good news is that you can bypass this limitation and keep singletons if you set renumbering to False. Just make sure that the smallest vertex label is 0 and the largest vertex label is the total number of vertices. In your example, if you want 10 vertices it would look like this:

>>> df = cudf.DataFrame([[0, 9, 1], [9, 0, 1]], columns=['source', 'destination', 'weight'])
>>> G.from_cudf_edgelist(df, edge_attr='weight', renumber=False)
>>> G.number_of_nodes()
10

In this case vertices 1 to 8 would be isolated vertices
There may be other ways to hack this but I'd need to better understand what you are trying to do by adding these isolated vertices.

If you need both renumbering and resizing/growing the graph, the fastest way is probably to try adding it, the cugraph team will be happy to review your PR and assist. We'll also keep the issue on the radar and prioritize for future releases.

Hope that helps, and thanks again.

@afender
Copy link
Member

afender commented Oct 12, 2020

Ps. disconnected graphs support is not well tested in cuGraph at the moment. It is quite possible that some analytics still fail when such inputs are passed. This is something we have been tracking in #305 and plan on continuing to improve in the next release.

@flandolfi
Copy link
Author

Thank you for your quick reply!

Renumbering a vertex with the largest index is indeed a hack, but still needs some user action to manage it.
For example, I'm converting graphs from PyTorch-Geometric to cuGraph to make clustering on device, and then bring the clustering index back to PyTorch. Since I'm dealing with entire datasets, changing the indices on the fly can be cumbersome.
This is just an example, there can be other scenarios in which the size of the graph has to be specified (e.g., importing a fixed sized graph from CSV).

Anyway, thanks for your suggestion!

@BradReesWork
Copy link
Member

@flandolfi thanks for the great question. We are just getting ready to start improving a lot of basic function around graph creation. You are correct that an isolated node will not appear in an edge list and therefore not be created.

@BradReesWork
Copy link
Member

@flandolfi The issue that I'm running into with specifying the number of nodes that should be in a graph when using from_cudf_edgelist is that there is no consistent way to determine what the missing node names should be. If your data has contiguous integer values for nodes, then it is an easy process to assign one-up numbers to the isolated nodes. But if your data is not integers, say IP addresses, then what names are given to nodes?

Now if your data is integers, just not contiguous, then just setting Renumbering to False will fill in all missed values with isolated nodes. Those nodes just need to be within the range and not at the end.

The number of nodes created in the graph is equal to the max node ID + 1. Renumbering simply packs values to be contiguous and to be integers.

@flandolfi
Copy link
Author

Many Graph/Network file formats (e.g., Graph Markup Language) first define all the nodes by their ID, attributes, labels, etc., then specify all the edges of the graph only by the IDs of their end-nodes. This avoids the repetition of known information (similarly to the Bakus-Naur form in databases).

So, in your example, I would find all the IPs in the "header", along with their unique node-IDs, then the list of edges (that can be empty).

Again, renumbering is a trick and it works in most cases. I just believe there should be a way to specify the nodes in advance. Notice that this does not mean that you have to change the signature of from_cudf_edgelist. For example, a totally fine (and maybe better) solution could be preventing the re-initialization of the graph when calling from_cudf_edgelist, e.g.,

import cudf
import cugraph as cx

ips = [
    '192.168.1.1',
    '192.168.1.2',
    ...
    '192.168.1.254'
]

...  # Load the edge list to `df`

G = cx.Graph() 
G.add_nodes_from(ips)
G.from_cudf_edgelist(df, renumber=False)
print(G.number_of_nodes())  # Prints `len(ips)`

This approach will separate the two tasks (adding nodes and adding edges), thus allowing an incremental definition of the graphs (also, one could "update" the current graph by multiple calls of from_cudf_edgelist, maybe?)

@BradReesWork BradReesWork added this to the 0.18 milestone Nov 30, 2020
@BradReesWork BradReesWork added improvement Improvement / enhancement to an existing function and removed ? - Needs Triage Need team to review and classify labels Nov 30, 2020
@BradReesWork
Copy link
Member

issue is being addressed with enhancement issue #1372

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function
Projects
None yet
Development

No branches or pull requests

3 participants