Enable expression-based Dask Dataframe support #4325

rjzamora · 2024-04-09T14:33:22Z

[WIP] I'm using this PR to debug/add support for DASK_DATAFRAME__QUERY_PLANNING=True.

NOTES:

Depends on Fix bug in Series reductions dask/dask-expr#1041 [Merged]
Depends on Fix default name conversion in ToFrame dask/dask-expr#1044
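For readers unfamiliar with the variable name above, here is a minimal sketch of the naming convention Dask documents for environment-variable configuration: the `DASK_` prefix is dropped, the rest is lowercased, and double underscores become nesting dots. The helper `env_to_config_key` is hypothetical and written only for illustration; it is not Dask's own code.

```python
import ast


def env_to_config_key(name, value, prefix="DASK_"):
    """Hypothetical helper illustrating Dask's documented env-var convention."""
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    # Drop the prefix, lowercase, and turn double underscores into nesting dots.
    key = name[len(prefix):].lower().replace("__", ".")
    # Try to read the value as a Python literal; otherwise keep the raw string.
    try:
        parsed = ast.literal_eval(value)
    except (SyntaxError, ValueError):
        parsed = value
    return key, parsed


key, parsed = env_to_config_key("DASK_DATAFRAME__QUERY_PLANNING", "True")
print(key, parsed)
```

Dask treats hyphens and underscores in config keys as interchangeable, so the resulting `dataframe.query_planning` refers to the same setting as the "dataframe.query-planning" config named in rapidsai/cudf#15027.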

python/cugraph/cugraph/structure/graph_implementation/simpleDistributedGraph.py

rjzamora · 2024-04-26T19:10:52Z

python/cugraph/cugraph/structure/symmetrize.py

-                output_df[value_col.columns],
+                output_df[list(value_col.columns)],


TODO: This may be a dask-expr bug? Column projection using anything other than a list seems fragile.

I'm not observing this bug locally anymore, but I'd still like to keep this precaution in place.

Should we leave a code comment? Is it worth raising a tracking issue on cuGraph for follow up?

I'll be honest: I don't actually think this "fix" is required, because removing it doesn't seem to cause test failures for me locally (was probably specific to an earlier combination of dask/dask-expr/dask-cudf). However, I left it for now because it will take a long time for "real CI" to tell me that it actually is a problem.

With that said, I'll be happy to give it a try now that I'm realizing it's only cudf/dask-cudf that is about to freeze (and not cugraph).

Should we try dropping the list then?
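The precaution discussed in this thread can be sketched with plain pandas standing in for dask-cudf (an assumption made so the snippet runs without a GPU). It does not reproduce the suspected dask-expr bug; it only shows the difference between projecting with an `Index` object and with an explicit list. The frame contents and the `weight` column are made up for illustration.

```python
import pandas as pd

# Hypothetical edge list; only the variable names mirror the diff above.
output_df = pd.DataFrame(
    {"src": [0, 1], "dst": [1, 0], "weight": [0.5, 0.5]}
)
value_col = output_df[["weight"]]

# `value_col.columns` is an Index object, not a list.
columns_obj = value_col.columns

# Converting to a plain list makes the projection key unambiguous.
projected = output_df[list(columns_obj)]
print(list(projected.columns))
```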


rjzamora · 2024-05-20T13:45:59Z

benchmarks/cugraph/standalone/bulk_sampling/cugraph_bulk_sampling.py

-    dask_label_df = dask_cudf.from_dask_dataframe(dask_label_df)
+    dask_label_df = dask_label_df.to_backend("cudf")


from_dask_dataframe is now deprecated.

Does this mean that dask-expr has some dispatching/plugin support for different DataFrame implementations?

Correct - Dask documentation is here: https://docs.dask.org/en/latest/how-to/selecting-the-collection-backend.html, and Dask-cudf: https://docs.rapids.ai/api/dask-cudf/stable/#dataframe-creation-from-in-memory-formats

NOTE: Given the complexity of dask's various dispatching mechanisms, I'm not expecting anything other than "pandas" and "cudf" to ever be implemented, though it's technically possible.

rjzamora · 2024-05-20T13:47:50Z

python/cugraph/cugraph/dask/__init__.py

+# Avoid "p2p" shuffling in dask for now
+config.set({"dataframe.shuffle.method": "tasks"})


"p2p" should work fine, but it will rarely provide a performance benefit. It seems best to minimize "optional" changes until the query-planning migration is finished.
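As a side note, the same setting can also be applied temporarily. The sketch below uses `dask.config.set` as a context manager, which is an alternative to the module-level call in the diff and not what this PR does; it assumes only that `dask` is installed.

```python
import dask

# Whatever the ambient setting is (None when nothing has been configured).
before = dask.config.get("dataframe.shuffle.method", None)

# Inside the block, shuffles are configured to use the task-based method.
with dask.config.set({"dataframe.shuffle.method": "tasks"}):
    inside = dask.config.get("dataframe.shuffle.method")

# On exit, the previous value is restored.
after = dask.config.get("dataframe.shuffle.method", None)
print(inside)
```

Setting the value at import time, as the diff does, applies it to every shuffle issued after `cugraph.dask` is imported, whereas the scoped form only affects work defined inside the block.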

rjzamora · 2024-05-20T13:51:25Z

python/cugraph/cugraph/dask/common/input_utils.py

-from dask_cudf.core import DataFrame as dcDataFrame
-from dask_cudf.core import Series as daskSeries
+from dask_cudf import DataFrame as dcDataFrame
+from dask_cudf import Series as daskSeries


NOTE: All imports from dask_cudf.core should be avoided, because these imports always use "legacy" dask-cudf. Imports from the top-level dask_cudf module are automatically routed to the proper API. There is no way to protect against dask_cudf.core imports yet, because some query-planning logic still needs to find/use specific legacy code.

rjzamora · 2024-05-20T14:22:52Z

python/cugraph/cugraph/tests/structure/test_graph_mg.py

-        .to_frame()
-        .sort_values(0)
+        .to_frame(name="0")
+        .sort_values("0")


Another "precaution" (using numerical column names still seems "fragile" in dask)
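To make the change above concrete, here is a small sketch using plain pandas as a stand-in for dask-cudf (an assumption, so that it runs anywhere). It shows that an unnamed Series converted with `to_frame()` gets the integer column label `0`, while passing `name="0"` yields a string label that can be used for sorting. The data is made up.

```python
import pandas as pd

values = pd.Series([3, 1, 2])

# Without a name, the resulting column label is the integer 0.
default_frame = values.to_frame()

# With an explicit name, the column label is the string "0".
named_frame = values.to_frame(name="0")
sorted_values = named_frame.sort_values("0")["0"].tolist()
print(list(default_frame.columns), list(named_frame.columns))
```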

rlratzel

LGTM

jakirkham

Highlighting the OPS-relevant changes, namely dropping old Dask workarounds (environment variables that are no longer needed) now that the underlying issue has been resolved.

jakirkham · 2024-05-22T23:17:23Z

ci/test_python.sh

-# TODO: Enable dask query planning (by default) once some bugs are fixed.
-# xref: https://github.com/rapidsai/cudf/issues/15027
-export DASK_DATAFRAME__QUERY_PLANNING=False
-


AIUI this is one of the OPS-relevant changes: it removes a workaround that is no longer needed.

jakirkham · 2024-05-22T23:17:40Z

ci/test_wheel.sh

-# TODO: Enable dask query planning (by default) once some bugs are fixed.
-# xref: https://github.com/rapidsai/cudf/issues/15027
-export DASK_DATAFRAME__QUERY_PLANNING=False
-


This is the other one: the same change as before, just in another place.

jakirkham

Thanks Rick! 🙏

Based on your comment above, do we want to drop this workaround?

jakirkham · 2024-05-23T18:57:59Z

python/cugraph/cugraph/structure/symmetrize.py

-                output_df[value_col.columns],
+                output_df[list(value_col.columns)],


Should we try dropping the list then?

python/cugraph/cugraph/structure/symmetrize.py

Co-authored-by: jakirkham <[email protected]>

BradReesWork · 2024-05-28T13:58:46Z

/merge

allow dask-expr for debugging

b3e0fba

github-actions bot added the ci label Apr 9, 2024

rjzamora mentioned this pull request Apr 9, 2024

[FEA] Support "dataframe.query-planning" config in dask.dataframe rapidsai/cudf#15027

Open

28 tasks

Merge branch 'branch-24.06' into debug-dask-expr

39f5101

alexbarghi-nv added the improvement and non-breaking labels Apr 24, 2024

avoid importing from core and p2p shuffling

4ac1828

github-actions bot added the python label Apr 25, 2024

rjzamora added 3 commits April 25, 2024 09:59

avoid deprecated API

41d42f6

adjust test_nodes_functionality to work after dask_expr#1041

9074cab

add a few workarounds for now

0f8e221

github-actions bot added the benchmarks label Apr 26, 2024

rjzamora commented Apr 26, 2024


python/cugraph/cugraph/structure/graph_implementation/simpleDistributedGraph.py (outdated, resolved)

rjzamora commented Apr 26, 2024


rjzamora added 9 commits May 2, 2024 08:31

Merge branch 'branch-24.06' into debug-dask-expr

436a080

Merge branch 'branch-24.06' into debug-dask-expr

da82bcd

Merge branch 'branch-24.06' into debug-dask-expr

a9ae6f2

Merge remote-tracking branch 'upstream/branch-24.06' into debug-dask-expr

f2d6a25

test hacky workaround

43963ad

Merge branch 'branch-24.06' into debug-dask-expr

f9bcbe6

Merge branch 'branch-24.06' into debug-dask-expr

bf9dd10

Merge remote-tracking branch 'upstream/branch-24.06' into debug-dask-expr

c104292

Merge branch 'branch-24.06' into debug-dask-expr

03bbee5

rjzamora changed the title ~~[DNM][WIP] Debug expression-based Dask Dataframe support~~ [WIP] Debug expression-based Dask Dataframe support May 15, 2024

rjzamora added 6 commits May 16, 2024 10:21

Merge branch 'branch-24.06' into debug-dask-expr

952b224

clean up test_mg_symmetrize

9496171

Merge remote-tracking branch 'upstream/branch-24.06' into debug-dask-expr

d807de0

clean up test_mg_symmetrize

94c8d79

Merge branch 'debug-dask-expr' of github.com:rjzamora/cugraph into debug-dask-expr

942a396

Merge branch 'branch-24.06' into debug-dask-expr

36df87d

use 'canonical' dask.dataframe approach for concatnating dataframes along axis=1

27ecbaf

rjzamora commented May 20, 2024


rjzamora changed the title ~~[WIP] Debug expression-based Dask Dataframe support~~ Enable expression-based Dask Dataframe support May 20, 2024

rjzamora marked this pull request as ready for review May 20, 2024 14:23

rjzamora requested review from a team as code owners May 20, 2024 14:23

rjzamora added 2 commits May 20, 2024 18:47

Merge branch 'branch-24.06' into debug-dask-expr

68b798b

Merge branch 'branch-24.06' into debug-dask-expr

ec0fcd4

rlratzel approved these changes May 21, 2024


rjzamora added 2 commits May 21, 2024 12:29

Merge branch 'branch-24.06' into debug-dask-expr

acbc219

Merge branch 'branch-24.06' into debug-dask-expr

541c8f2

jakirkham reviewed May 22, 2024


raydouglass approved these changes May 23, 2024


jakirkham reviewed May 23, 2024


Update python/cugraph/cugraph/structure/symmetrize.py

8dc5804

Co-authored-by: jakirkham <[email protected]>

BradReesWork added this to the 24.06 milestone May 28, 2024

rapids-bot bot merged commit 3156569 into rapidsai:branch-24.06 May 28, 2024
136 checks passed

rjzamora deleted the debug-dask-expr branch May 28, 2024 14:04


