Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nx-cugraph: Updates nxcg.Graph classes for API-compatibility with NetworkX Graph classes, needed for zero code change graph generators #4629

Merged
merged 12 commits into from
Sep 24, 2024

Conversation

eriknw
Copy link
Contributor

@eriknw eriknw commented Aug 22, 2024

This is an alternative approach to #4558 for enabling GPU-accelerated NetworkX to "just work". It has similarities to #4558. I opted to make separate classes such as ZeroGraph, which I think makes for cleaner separation and gives us and users more control.

There are a few lingering TODOs and code comments to tidy up, but I don't think there are any show-stoppers. I have not updated methods (such as number_of_nodes) to optimistically try to use GPU if possible, b/c this is not strictly necessary, but we should update these soon.

I have run NetworkX tests with these classes using networkx/networkx#7585 and networkx/networkx#7600. We need the behavior in 7585, and 7600 is only useful for testing.

There are only 3 new failing tests, and 3 tests that hang (I'll run them overnight to see if they finish). Here's a test summary:

5548 passed, 24 skipped, 16 xfailed, 25 xpassed

Note that 25 tests that were failing now pass. I have not investigated test failures, xfails, or xpasses yet. I would like to add tests too.

We rely heavily on the networkx cache. I think this is preferred.

It is late for me. I will describe and show how and why this works later.

I opted for zero= and ZeroGraph, because I find them delightful! Renaming is trivial if other terms are preferred.

CC @quasiben

@eriknw eriknw added improvement Improvement / enhancement to an existing function python nx-cugraph labels Aug 22, 2024
@eriknw eriknw requested a review from a team as a code owner August 22, 2024 21:37
@eriknw eriknw requested a review from rlratzel August 22, 2024 21:39
Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Erik, thanks for the PR, I do have a few feedback items:

  • I don't think we should be introducing a new class for this, I think this should just be the expected behavior for nx-cugraph. If we don't feel confident in this being a stable solution and want to give users an escape hatch, then I'd rather rename the original graph classes (perhaps call those GPUGraph?)
    • edit: I should clarify that I think the separation the new class provides to the implementation is good. I was referring to exposing a new class to end users, and how I think sticking with one family of Graph classes for users to interact with would be better IMO.
  • If there's a technical reason we need to keep the new class, I'd prefer to change the name since "ZeroGraph" feels too close to something like a Null graph IMO. I don't think users will make the connection between "zero" and "zero-code-change" here. I've been using the acronym "ZCC" in place of the marketing term "zero-code-change", so maybe "ZCCGraph"? But again, I'd rather just make this the "Graph" class and rename the legacy one if necessary.
    • edit: Similar to the above edit, I was referring to if we needed to expose the new class to end users for some reason that we should change the name.
  • Am I correct to assume that if the user disables caching in NetworkX that this will break? If so, this worries me since NetworkX outputs a warning that one could reasonably interpret as "use caching at your own risk".
  • When I use this PR with networkx/networkx#7585, I'm still not able to run NX functions that mutate graph inputs. Am I missing a config var setting for this? I've updated the description in #4558 to include the test script I'm using. The test script passes when used with networkx/networkx#7578 and #4558
    NotImplementedError: Backend 'cugraph' does not implement `pad_graph'. This call will mutate an input, so automatic conversions between backends (if configured) will not be attempted, because this may change behavior. You may specify a backend to use by passing e.g. `backend='networkx'` keyword, but this may also change behavior by not mutating inputs.
    

@eriknw eriknw requested a review from a team as a code owner August 23, 2024 22:09
@eriknw eriknw requested a review from KyleFromNVIDIA August 23, 2024 22:09
@github-actions github-actions bot added the conda label Aug 23, 2024
@eriknw eriknw requested a review from a team as a code owner August 25, 2024 20:43
@github-actions github-actions bot added the ci label Aug 25, 2024
@eriknw
Copy link
Contributor Author

eriknw commented Aug 25, 2024

Some examples when using networkx/networkx#7585

In [1]: import networkx as nx
In [2]: import nx_cugraph as nxcg
In [3]: import pandas as pd
In [4]: import logging

In [5]: nxl = logging.getLogger("networkx")
In [6]: nxl.addHandler(logging.StreamHandler())
In [7]: nxl.setLevel(logging.DEBUG)

In [8]: # Configure graph generators to use cugraph.
In [9]: # Can also be set via NETWORKX_BACKEND_PRIORITY_GENERATORS environment variable
In [10]: nx.config.backend_priority.generators = ["cugraph"]

In [11]: # DataFrame with int and string dtypes for edge data
In [12]: df = pd.DataFrame({"source": ['a', 'b'], "target": ['b', 'c'], "x": [1, 2], "y": ["a", "b"]})

In [13]: # Only use int edge data; is on GPU
In [14]: G = nx.from_pandas_edgelist(df, edge_attr=["x"])
Call to `from_pandas_edgelist' has inputs from no backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `from_pandas_edgelist' with arguments: (df=  source target  x  y
0      a      b  1  a
1      b      c  2  b, source='source', target='target', edge_attr=['x'], create_using=None, edge_key=None)

In [15]: G._is_on_gpu
Out[15]: True

In [16]: G._is_on_cpu
Out[16]: False

In [17]: # Use both int and str edge data; is on CPU
In [18]: G = nx.from_pandas_edgelist(df, edge_attr=["x", "y"])
Call to `from_pandas_edgelist' has inputs from no backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `from_pandas_edgelist' with arguments: (df=  source target  x  y
0      a      b  1  a
1      b      c  2  b, source='source', target='target', edge_attr=['x', 'y'], create_using=None, edge_key=None)
Backend 'cugraph' raised NotImplementedError when calling `from_pandas_edgelist'
Trying next backend: 'networkx'
Call to `empty_graph' has inputs from no backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `empty_graph' with arguments: (n=0, create_using=None, default=<class 'networkx.classes.graph.Graph'>)

In [19]: G._is_on_gpu
Out[19]: False

In [20]: G._is_on_cpu
Out[20]: True

In [21]: # This algorithm only uses int edge data; use GPU
In [22]: %time nx.pagerank(G, weight="x")
Call to `pagerank' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `pagerank' with arguments: (G=<nx_cugraph.classes.zero.ZeroGraph object at 0x7f471355b650>, alpha=0.85, personalization=None, max_iter=100, tol=1e-06, nstart=None, weight='x', dangling=None)
CPU times: user 55.3 ms, sys: 36.2 ms, total: 91.5 ms
Wall time: 121 ms
Out[22]: {'a': 0.18783806264400482, 'b': 0.48648586869239807, 'c': 0.3256761133670807}

In [23]: # The previous call cached to the GPU, so subsequent calls using this data are faster
In [24]: %time nx.pagerank(G, weight="x")
Call to `pagerank' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `pagerank' with arguments: (G=<nx_cugraph.classes.zero.ZeroGraph object at 0x7f471355b650>, alpha=0.85, personalization=None, max_iter=100, tol=1e-06, nstart=None, weight='x', dangling=None)
CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms
Wall time: 10.3 ms
Out[24]: {'a': 0.18783806264400482, 'b': 0.48648586869239807, 'c': 0.3256761133670807}

In [25]: # Creating an empty graph also returns an nx-cugraph graph
In [26]: G = nx.empty_graph()
Call to `empty_graph' has inputs from no backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `empty_graph' with arguments: (n=0, create_using=None, default=<class 'networkx.classes.graph.Graph'>)

In [27]: # Mutation always forces us to the CPU (for now)
In [28]: G.add_edge(0, 1, x=1)

In [29]: G._is_on_gpu
Out[29]: False

In [30]: G._is_on_cpu
Out[30]: True

In [31]: # But we can get the GPU graph
In [32]: G._cugraph  # Suggestions for a better name?
Out[32]: <nx_cugraph.classes.graph.Graph at 0x7f471462c790>

In [33]: G._is_on_gpu
Out[33]: True

In [34]: G._is_on_cpu
Out[34]: True

# This graph can be used (with no conversion) by normal networkx algorithms that nx-cugraph does not implement
In [35]: nx.harmonic_centrality(G)
Call to `harmonic_centrality' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Backend 'cugraph' does not implement `harmonic_centrality'
Trying next backend: 'networkx'
Converting input graphs from 'cugraph' backend to 'networkx' backend for call to `harmonic_centrality'
Caching converted graph (from 'cugraph' to 'networkx' backend) in call to `harmonic_centrality' for 'G' argument
Call to `shortest_path_length' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `shortest_path_length' with arguments: (G=<nx_cugraph.classes.zero.ZeroGraph object at 0x7f4713ebc890>, source=0, target=None, weight=None, method='dijkstra')
Call to `shortest_path_length' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `shortest_path_length' with arguments: (G=<nx_cugraph.classes.zero.ZeroGraph object at 0x7f4713ebc890>, source=1, target=None, weight=None, method='dijkstra')
Out[35]: {0: 1.0, 1: 1.0}

In [36]: # We can also use string or any Python object as edge data
In [37]: G.add_edge(0, 2, y="hi")
In [38]: G.add_edge(1, 2, z=["x", object()])

# Let's compare performance of using `G.add_edge` in a tight loop
In [39]: %%timeit
    ...: G = nx.empty_graph(backend="networkx")
    ...: for i in range(1_000_000):
    ...:     G.add_edge(i, i+1)
    ...:
1.13 s ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [40]: %%timeit
    ...: G = nx.empty_graph(backend="cugraph")
    ...: for i in range(1_000_000):
    ...:     G.add_edge(i, i+1)
    ...:
1.67 s ± 6.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Some replies:

  • We can change names and choose whatever UX we think is best; the approach in this PR is flexible. zero was helpful to implement and will be easy to do find and replace
  • This will not break if the user disables cache, but we will still use the cache
  • Functions that mutate the graph are not yet generally supported. Expect a solution for this release, but not by Tuesday

In addition to refining the final UX, this has a couple rough edges to smooth out and we should do more thorough testing (and optimizing), but as the examples above show it is extremely capable today.

I have gone through a couple iterations to try to get CI to pass.

@rlratzel rlratzel added the non-breaking Non-breaking change label Aug 29, 2024
In addition to renaming and moving graph classes around, this smooths out
*many* rough edges for zero-code change use cases. There is more to be
done, but this is excellent progress and should be good enough to use.
Now to see about getting CI to pass!
Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'm very excited to get this functionality! Comments/suggestions/questions below.

Comment on lines +9 to +10
NX_CUGRAPH_USE_COMPAT_GRAPHS=False pytest --capture=no --cache-clear --benchmark-disable "$@" tests
NX_CUGRAPH_USE_COMPAT_GRAPHS=True pytest --capture=no --cache-clear --benchmark-disable "$@" tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should minimize the public API surface area and expose only the "compat" graph. We're not getting backwards compatibility by exposing both classes anyway, so we can expose GPUGraph (which I'm assuming is the non-compat graph) to users later if needed.
Or...do we get additional coverage by testing both compat and non-compat separately? I'd still prefer to start small and expand the API/our support burden as needed, but I can appreciate if it's used to get additional test coverage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is used to get addition test coverage. We can change the environment variable to something like _NX_CUGRAPH_USE_COMPAT_GRAPHS and use it as an internal implementation detail.

@@ -108,7 +108,7 @@ echo "nx-cugraph coverage from networkx tests: $_coverage"
echo $_coverage | awk '{ if ($NF == "0.0%") exit 1 }'
# Ensure all algorithms were called by comparing covered lines to function lines.
# Run our tests again (they're fast enough) to add their coverage, then create coverage.json
pytest \
NX_CUGRAPH_USE_COMPAT_GRAPHS=False pytest \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why this is False? It reads as if we're not getting test coverage from the NX suite for the "compat" Graph, which seems bad, but maybe it's necessary for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work for both True and False, and it would be great to be able to test both options. False (not "compat" graph) matches how we were testing before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great to be able to test both options

Why can't we test both? If we have to pick one, shouldn't we pick compat since (I'm assuming) it will be used much more?

python/nx-cugraph/nx_cugraph/classes/graph.py Outdated Show resolved Hide resolved
python/nx-cugraph/nx_cugraph/classes/graph.py Show resolved Hide resolved
def _reify_networkx(self) -> None:
"""Copy graph to host (CPU) if necessary."""
if self.__dict__["_node"] is None:
# After we make this into an nx graph, we rely on the cache being correct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think explaining this a bit more would help me understand the implementation better. Assuming you're referring to __networkx_cache__, I'm not seeing how that comes into play. Is it because self._cudagraph is actually a graph instance in the cache, or is there some other dependency on the cache here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have graphs in the cache and reified networkx data structured (e.g., _adj and _node), then we want all the graphs to be consistent. Hence, the graph can be on host and device at the same time--it doesn't move back and forth keeping only one. Consistency requires the cache being "correct" such as being cleared when the networkx data structures are mutated. Notably, the inner dicts of the data structures are just dicts. The cache is cleared when using methods such as G.add_edge, but not when doing e.g. G[u][v][key] = val. If we wanted to be super-careful and strict, we could make our own mutable mapping class that clears the cache upon mutation, but this would come at a (possibly significant) performance penalty. This is what the cache warning message discusses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right I understand that, but I was asking more because the comment about the cache just seemed out of place. After looking at the code again, I remembered self._cudagraph is actually a property that uses the cache, which I'm assuming is what the comment is referring to. Assuming this is correct, I think I just needed a reminder that self._cudagraph could end up using a cached graph.

python/nx-cugraph/nx_cugraph/classes/graph.py Show resolved Hide resolved
python/nx-cugraph/nx_cugraph/classes/graph.py Show resolved Hide resolved
python/nx-cugraph/nx_cugraph/convert.py Show resolved Hide resolved
python/nx-cugraph/nx_cugraph/convert.py Show resolved Hide resolved
python/nx-cugraph/nx_cugraph/interface.py Show resolved Hide resolved
@rlratzel
Copy link
Contributor

Can you also update the description and title of this PR since the implementation is quite different now?

@rlratzel rlratzel changed the title nx-cugraph: add ZeroGraph for nx-compatibility (zero-code change) nx-cugraph: Updates nxcg.Graph classes for API-compatibility with NetworkX Graph classes, needed for zero code change graph generators Sep 21, 2024
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Packaging changes look fine, once @rlratzel’s review questions are resolved.

@rlratzel
Copy link
Contributor

/merge

@jakirkham jakirkham removed the request for review from KyleFromNVIDIA September 24, 2024 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci conda improvement Improvement / enhancement to an existing function non-breaking Non-breaking change nx-cugraph python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants