nx-cugraph: Updates nxcg.Graph classes for API-compatibility with NetworkX Graph classes, needed for zero code change graph generators #4629
Conversation
Hey Erik, thanks for the PR. I have a few feedback items:
- I don't think we should be introducing a new class for this; I think this should just be the expected behavior for nx-cugraph. If we don't feel confident in this being a stable solution and want to give users an escape hatch, then I'd rather rename the original graph classes (perhaps call those `GPUGraph`?). Edit: I should clarify that I think the separation the new class provides in the implementation is good. I was referring to exposing a new class to end users; I think sticking with one family of Graph classes for users to interact with would be better.
- If there's a technical reason we need to keep the new class, I'd prefer to change the name since "ZeroGraph" feels too close to something like a Null graph IMO. I don't think users will make the connection between "zero" and "zero-code-change" here. I've been using the acronym "ZCC" in place of the marketing term "zero-code-change", so maybe "ZCCGraph"? But again, I'd rather just make this the "Graph" class and rename the legacy one if necessary.
- edit: Similar to the above edit, I was referring to the case where we need to expose the new class to end users for some reason; in that case, we should change the name.
- Am I correct to assume that this will break if the user disables caching in NetworkX? If so, this worries me, since NetworkX outputs a warning that one could reasonably interpret as "use caching at your own risk".
- When I use this PR with networkx/networkx#7585, I'm still not able to run NX functions that mutate graph inputs. Am I missing a config var setting for this? I've updated the description in #4558 to include the test script I'm using. The test script passes when used with networkx/networkx#7578 and #4558
NotImplementedError: Backend 'cugraph' does not implement `pad_graph'. This call will mutate an input, so automatic conversions between backends (if configured) will not be attempted, because this may change behavior. You may specify a backend to use by passing e.g. `backend='networkx'` keyword, but this may also change behavior by not mutating inputs.
Some examples when using networkx/networkx#7585:

In [1]: import networkx as nx
In [2]: import nx_cugraph as nxcg
In [3]: import pandas as pd
In [4]: import logging
In [5]: nxl = logging.getLogger("networkx")
In [6]: nxl.addHandler(logging.StreamHandler())
In [7]: nxl.setLevel(logging.DEBUG)
In [8]: # Configure graph generators to use cugraph.
In [9]: # Can also be set via NETWORKX_BACKEND_PRIORITY_GENERATORS environment variable
In [10]: nx.config.backend_priority.generators = ["cugraph"]
In [11]: # DataFrame with int and string dtypes for edge data
In [12]: df = pd.DataFrame({"source": ['a', 'b'], "target": ['b', 'c'], "x": [1, 2], "y": ["a", "b"]})
In [13]: # Only use int edge data; is on GPU
In [14]: G = nx.from_pandas_edgelist(df, edge_attr=["x"])
Call to `from_pandas_edgelist' has inputs from no backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `from_pandas_edgelist' with arguments: (df= source target x y
0 a b 1 a
1 b c 2 b, source='source', target='target', edge_attr=['x'], create_using=None, edge_key=None)
In [15]: G._is_on_gpu
Out[15]: True
In [16]: G._is_on_cpu
Out[16]: False
In [17]: # Use both int and str edge data; is on CPU
In [18]: G = nx.from_pandas_edgelist(df, edge_attr=["x", "y"])
Call to `from_pandas_edgelist' has inputs from no backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `from_pandas_edgelist' with arguments: (df= source target x y
0 a b 1 a
1 b c 2 b, source='source', target='target', edge_attr=['x', 'y'], create_using=None, edge_key=None)
Backend 'cugraph' raised NotImplementedError when calling `from_pandas_edgelist'
Trying next backend: 'networkx'
Call to `empty_graph' has inputs from no backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `empty_graph' with arguments: (n=0, create_using=None, default=<class 'networkx.classes.graph.Graph'>)
In [19]: G._is_on_gpu
Out[19]: False
In [20]: G._is_on_cpu
Out[20]: True
In [21]: # This algorithm only uses int edge data; use GPU
In [22]: %time nx.pagerank(G, weight="x")
Call to `pagerank' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `pagerank' with arguments: (G=<nx_cugraph.classes.zero.ZeroGraph object at 0x7f471355b650>, alpha=0.85, personalization=None, max_iter=100, tol=1e-06, nstart=None, weight='x', dangling=None)
CPU times: user 55.3 ms, sys: 36.2 ms, total: 91.5 ms
Wall time: 121 ms
Out[22]: {'a': 0.18783806264400482, 'b': 0.48648586869239807, 'c': 0.3256761133670807}
In [23]: # The previous call cached to the GPU, so subsequent calls using this data are faster
In [24]: %time nx.pagerank(G, weight="x")
Call to `pagerank' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `pagerank' with arguments: (G=<nx_cugraph.classes.zero.ZeroGraph object at 0x7f471355b650>, alpha=0.85, personalization=None, max_iter=100, tol=1e-06, nstart=None, weight='x', dangling=None)
CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms
Wall time: 10.3 ms
Out[24]: {'a': 0.18783806264400482, 'b': 0.48648586869239807, 'c': 0.3256761133670807}
In [25]: # Creating an empty graph also returns an nx-cugraph graph
In [26]: G = nx.empty_graph()
Call to `empty_graph' has inputs from no backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `empty_graph' with arguments: (n=0, create_using=None, default=<class 'networkx.classes.graph.Graph'>)
In [27]: # Mutation always forces us to the CPU (for now)
In [28]: G.add_edge(0, 1, x=1)
In [29]: G._is_on_gpu
Out[29]: False
In [30]: G._is_on_cpu
Out[30]: True
In [31]: # But we can get the GPU graph
In [32]: G._cugraph # Suggestions for a better name?
Out[32]: <nx_cugraph.classes.graph.Graph at 0x7f471462c790>
In [33]: G._is_on_gpu
Out[33]: True
In [34]: G._is_on_cpu
Out[34]: True
# This graph can be used (with no conversion) by normal networkx algorithms that nx-cugraph does not implement
In [35]: nx.harmonic_centrality(G)
Call to `harmonic_centrality' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Backend 'cugraph' does not implement `harmonic_centrality'
Trying next backend: 'networkx'
Converting input graphs from 'cugraph' backend to 'networkx' backend for call to `harmonic_centrality'
Caching converted graph (from 'cugraph' to 'networkx' backend) in call to `harmonic_centrality' for 'G' argument
Call to `shortest_path_length' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `shortest_path_length' with arguments: (G=<nx_cugraph.classes.zero.ZeroGraph object at 0x7f4713ebc890>, source=0, target=None, weight=None, method='dijkstra')
Call to `shortest_path_length' has inputs from {'cugraph'} backends, and will try to use backends in the following order: ['cugraph', 'networkx']
Using backend 'cugraph' for call to `shortest_path_length' with arguments: (G=<nx_cugraph.classes.zero.ZeroGraph object at 0x7f4713ebc890>, source=1, target=None, weight=None, method='dijkstra')
Out[35]: {0: 1.0, 1: 1.0}
In [36]: # We can also use string or any Python object as edge data
In [37]: G.add_edge(0, 2, y="hi")
In [38]: G.add_edge(1, 2, z=["x", object()])
# Let's compare performance of using `G.add_edge` in a tight loop
In [39]: %%timeit
...: G = nx.empty_graph(backend="networkx")
...: for i in range(1_000_000):
...: G.add_edge(i, i+1)
...:
1.13 s ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [40]: %%timeit
...: G = nx.empty_graph(backend="cugraph")
...: for i in range(1_000_000):
...: G.add_edge(i, i+1)
...:
1.67 s ± 6.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Some replies:
In addition to refining the final UX, this has a couple of rough edges to smooth out and we should do more thorough testing (and optimizing), but as the examples above show, it is extremely capable today. I have gone through a couple of iterations to try to get CI to pass.
In addition to renaming and moving graph classes around, this smooths out *many* rough edges for zero-code change use cases. There is more to be done, but this is excellent progress and should be good enough to use. Now to see about getting CI to pass!
Thanks, I'm very excited to get this functionality! Comments/suggestions/questions below.
NX_CUGRAPH_USE_COMPAT_GRAPHS=False pytest --capture=no --cache-clear --benchmark-disable "$@" tests
NX_CUGRAPH_USE_COMPAT_GRAPHS=True pytest --capture=no --cache-clear --benchmark-disable "$@" tests
I think we should minimize the public API surface area and expose only the "compat" graph. We're not getting backwards compatibility by exposing both classes anyway, so we can expose `GPUGraph` (which I'm assuming is the non-compat graph) to users later if needed.
Or...do we get additional coverage by testing both compat and non-compat separately? I'd still prefer to start small and expand the API/our support burden as needed, but I can appreciate if it's used to get additional test coverage.
Yes, this is used to get additional test coverage. We can change the environment variable to something like `_NX_CUGRAPH_USE_COMPAT_GRAPHS` and use it as an internal implementation detail.
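A small sketch of how such an internal toggle might be read at import time; the parsing and the default value shown here are assumptions, not nx-cugraph's actual behavior:

```python
import os

# Hypothetical reading of the internal toggle discussed above; the variable
# name matches the suggestion, but the default and accepted values are assumptions.
_USE_COMPAT_GRAPHS = os.environ.get(
    "_NX_CUGRAPH_USE_COMPAT_GRAPHS", "true"
).strip().lower() in {"true", "1", "yes"}
```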
@@ -108,7 +108,7 @@ echo "nx-cugraph coverage from networkx tests: $_coverage"
 echo $_coverage | awk '{ if ($NF == "0.0%") exit 1 }'
 # Ensure all algorithms were called by comparing covered lines to function lines.
 # Run our tests again (they're fast enough) to add their coverage, then create coverage.json
-pytest \
+NX_CUGRAPH_USE_COMPAT_GRAPHS=False pytest \
Can you explain why this is False? It reads as if we're not getting test coverage from the NX suite for the "compat" Graph, which seems bad, but maybe it's necessary for now?
This should work for both `True` and `False`, and it would be great to be able to test both options. `False` (not the "compat" graph) matches how we were testing before.
> it would be great to be able to test both options
Why can't we test both? If we have to pick one, shouldn't we pick compat since (I'm assuming) it will be used much more?
def _reify_networkx(self) -> None:
    """Copy graph to host (CPU) if necessary."""
    if self.__dict__["_node"] is None:
        # After we make this into an nx graph, we rely on the cache being correct
I think explaining this a bit more would help me understand the implementation better. Assuming you're referring to `__networkx_cache__`, I'm not seeing how that comes into play. Is it because `self._cudagraph` is actually a graph instance in the cache, or is there some other dependency on the cache here?
If we have graphs in the cache and reified networkx data structures (e.g., `_adj` and `_node`), then we want all the graphs to be consistent. Hence, the graph can be on host and device at the same time; it doesn't move back and forth keeping only one. Consistency requires the cache being "correct", such as being cleared when the networkx data structures are mutated. Notably, the inner dicts of the data structures are just dicts. The cache is cleared when using methods such as `G.add_edge`, but not when doing e.g. `G[u][v][key] = val`. If we wanted to be super careful and strict, we could make our own mutable mapping class that clears the cache upon mutation (see the sketch below), but this would come at a (possibly significant) performance penalty. This is what the cache warning message discusses.
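A minimal sketch of that strict option, assuming a dict-backed `__networkx_cache__`; the class and parameter names here are hypothetical, not nx-cugraph internals:

```python
from collections.abc import MutableMapping


class CacheClearingDict(MutableMapping):
    """Dict-like wrapper that invalidates a graph's cache on any mutation."""

    def __init__(self, data, cache):
        self._data = data    # the wrapped inner dict (e.g., edge data)
        self._cache = cache  # e.g., G.__networkx_cache__

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._cache.clear()  # any mutation invalidates cached conversions
        self._data[key] = value

    def __delitem__(self, key):
        self._cache.clear()
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```

Wrapping every inner dict this way would keep the cache honest even for `G[u][v][key] = val`, at the cost of a Python-level indirection on every access, which is the performance penalty mentioned above.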
Right, I understand that, but I was asking more because the comment about the cache just seemed out of place. After looking at the code again, I remembered that `self._cudagraph` is actually a property that uses the cache, which I'm assuming is what the comment is referring to. Assuming this is correct, I think I just needed a reminder that `self._cudagraph` could end up using a cached graph.
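To make that concrete, here is a toy sketch of the pattern, under the assumption of a lazily built, cache-backed device graph; the class name, helper, and cache key are illustrative only, not the real nx-cugraph implementation:

```python
class _CompatGraphSketch:
    """Toy stand-in for a compat graph; not the real nx-cugraph class."""

    def __init__(self):
        self.__networkx_cache__ = {}  # cleared whenever the graph is mutated

    def _convert_to_device(self):
        # Placeholder for the real CPU -> GPU conversion.
        return object()

    @property
    def _cudagraph(self):
        # Reuse a previously converted device graph if the cache still holds it;
        # otherwise convert and cache it. A cleared cache forces reconversion.
        cache = self.__networkx_cache__
        if "device_graph" not in cache:
            cache["device_graph"] = self._convert_to_device()
        return cache["device_graph"]
```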
Can you also update the description and title of this PR since the implementation is quite different now?
Packaging changes look fine, once @rlratzel’s review questions are resolved.
/merge
This is an alternative approach to #4558 for enabling GPU-accelerated NetworkX to "just work". It has similarities to #4558, but I opted to make separate classes such as `ZeroGraph`, which I think makes for cleaner separation and gives us and users more control.

There are a few lingering TODOs and code comments to tidy up, but I don't think there are any show-stoppers. I have not updated methods (such as `number_of_nodes`) to optimistically try to use the GPU where possible, because this is not strictly necessary, but we should update these soon.

I have run NetworkX tests with these classes using networkx/networkx#7585 and networkx/networkx#7600. We need the behavior in 7585; 7600 is only useful for testing.
There are only 3 new failing tests, and 3 tests that hang (I'll run them overnight to see if they finish). Here's a test summary:
Note that 25 tests that were failing now pass. I have not investigated test failures, xfails, or xpasses yet. I would like to add tests too.
We rely heavily on the networkx cache. I think this is preferred.
It is late for me. I will describe and show how and why this works later.
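A rough conceptual sketch of the dual-residency behavior shown in the examples above; the `_is_on_gpu` / `_is_on_cpu` attributes and the `_node` check come from the examples and the `_reify_networkx` snippet, while everything else is an illustrative assumption:

```python
class _DualResidencySketch:
    """Toy illustration of a graph that may live on CPU, GPU, or both."""

    def __init__(self):
        self._node = None          # host (CPU) node data; None means not materialized
        self._device_graph = None  # device (GPU) representation; None means not materialized

    @property
    def _is_on_cpu(self):
        return self._node is not None

    @property
    def _is_on_gpu(self):
        return self._device_graph is not None
```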
I opted for `zero=` and `ZeroGraph` because I find them delightful! Renaming is trivial if other terms are preferred.

CC @quasiben