[FEA] Offer more control over CPU fallback in cudf.pandas #14975

bdice · 2024-02-06T17:03:52Z

Is your feature request related to a problem? Please describe.
The default execution model for cudf.pandas is to try to execute an operation on the GPU, then fall back to the CPU if it fails for any reason. This approach is desirable for end-users to maximize the number of cases where cudf.pandas "just works", but it makes it difficult to analyze when failures are occurring and why. The former can be addressed by running under the profiler, but that is more cumbersome than we would like in many cases where we would rather get a quick signal in the form of failure (e.g. when running a workflow or a test suite to analyze unsupported cases). Furthermore, there is no easy way to determine whether cudf and pandas return the same results for a given operation, which is a different failure mode that is currently not possible to capture.

Describe the solution you'd like
We should generalize _fast_slow_function_call to support a wider range of fallback options. These options could be configurable by an environment variable, or by some global configuration option (the former is probably fine to start with). The different behaviors we would want to support are:

1. Error on fallback. We could then run the pandas test suite with this turned on and get a sense of how many tests cudf passes on its own.
2. Error on specific types of fallback. This would allow us to analyze the types of fallback that are occurring. Some of the most obvious error modes I can foresee (there are certainly others) are:
   - Out of memory errors, for the sake of planning No OOM related work
   - AttributeErrors for missing functionality
   - TypeErrors for differing function signatures
3. Error when cudf and pandas produce different outputs. This would be an extra branch within the fast path where the slow path is run even if the fast path succeeds, and then the fast and slow paths are compared for equivalence.

We may want to support warning instead of raising errors in some cases, but I don't think that's critical to start.
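
To make the proposal concrete, here is a minimal sketch of an environment-variable-driven policy for erroring on fallback, either for every failure or only for specific exception types. It is not the cudf.pandas implementation: `fast_slow_call`, `FallbackError`, the `SKETCH_FAIL_ON_FALLBACK` variable, and the policy keywords are all hypothetical names invented for illustration, and the real `_fast_slow_function_call` has a different signature.

```python
import os


class FallbackError(RuntimeError):
    """Raised instead of falling back when the policy forbids it (hypothetical)."""


# Hypothetical mapping from policy keyword to the exception types it covers.
_ERROR_KINDS = {
    "all": (Exception,),
    "oom": (MemoryError,),
    "attribute": (AttributeError,),
    "type": (TypeError,),
}


def _forbidden_errors():
    """Parse a comma-separated policy such as "oom,type" from the environment."""
    raw = os.environ.get("SKETCH_FAIL_ON_FALLBACK", "")
    forbidden = ()
    for kind in raw.split(","):
        # Unknown or empty keywords are ignored in this sketch.
        forbidden += _ERROR_KINDS.get(kind.strip().lower(), ())
    return forbidden


def fast_slow_call(fast, slow, *args, **kwargs):
    """Try `fast`; on failure either raise FallbackError or call `slow`."""
    try:
        return fast(*args, **kwargs)
    except Exception as err:
        forbidden = _forbidden_errors()
        if forbidden and isinstance(err, forbidden):
            raise FallbackError(
                f"fast path failed with {type(err).__name__}: {err}"
            ) from err
        return slow(*args, **kwargs)
```

With the variable unset, every failure falls back silently; setting it to `all` turns every fallback into an error, while a value such as `oom,type` would only surface those categories.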

Describe alternatives you've considered
This could be configured by the cudf.pandas profiler, or a similar context manager?
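
As a rough illustration of the context-manager alternative, the sketch below scopes a fallback policy to a `with` block using `contextvars`. The names `fallback_mode` and `current_fallback_mode` and the three mode strings are assumptions made up for this example; nothing here is an existing cudf.pandas API.

```python
import contextlib
import contextvars

# Hypothetical per-context setting; "fallback" means silently use the slow path.
_fallback_mode = contextvars.ContextVar("fallback_mode", default="fallback")


@contextlib.contextmanager
def fallback_mode(mode):
    """Temporarily set the fallback policy to "fallback", "warn", or "raise"."""
    if mode not in {"fallback", "warn", "raise"}:
        raise ValueError(f"unknown fallback mode: {mode!r}")
    token = _fallback_mode.set(mode)
    try:
        yield
    finally:
        # Restore whatever policy was active before entering the block.
        _fallback_mode.reset(token)


def current_fallback_mode():
    """Return the policy that a dispatch function would consult."""
    return _fallback_mode.get()
```

Compared with an environment variable, a context manager like this can be nested and limited to one part of a workflow, at the cost of requiring code changes at the call site.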

Additional context
Feedback from @ianozsvald and @lmeyerov would be welcome!

lmeyerov · 2024-02-07T05:55:55Z

A python Warning object so we can do managed handling would make sense
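
A sketch of what "managed handling" could look like with a dedicated warning category and the standard `warnings` module. `FallbackWarning`, `run_with_fallback`, and `unsupported` are hypothetical names for illustration; this is not something cudf or cudf.pandas is known to expose.

```python
import warnings


class FallbackWarning(UserWarning):
    """Hypothetical warning category emitted when the CPU path is used."""


def run_with_fallback(fast, slow, *args):
    """Call `fast`; if it raises, warn and return the result of `slow`."""
    try:
        return fast(*args)
    except Exception as err:
        warnings.warn(
            f"falling back to CPU: {type(err).__name__}: {err}",
            FallbackWarning,
            stacklevel=2,
        )
        return slow(*args)


def unsupported(x):
    raise NotImplementedError("no GPU implementation")


# Callers can record the warnings for later analysis ...
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", FallbackWarning)
    result = run_with_fallback(unsupported, abs, -3)

# ... or promote them to errors, for example in a test suite.
with warnings.catch_warnings():
    warnings.simplefilter("error", FallbackWarning)
    try:
        run_with_fallback(unsupported, abs, -3)
        promoted = False
    except FallbackWarning:
        promoted = True
```

Because the category is a normal `Warning` subclass, the same filters work from the command line or from a test runner's warning configuration without any library-specific switch.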

Note we are not cudf.pandas users but cudf, so our interest would be seeing the same thing there

bdice · 2024-02-07T18:02:15Z

@lmeyerov cudf doesn't fall back to CPU so you'd never see this with normal cudf usage. Only cudf.pandas has CPU fallback behavior. Can you clarify what you mean?

lmeyerov · 2024-02-07T21:16:19Z

Re: cudf, for some reason I thought a few cudf methods fall back to CPU, such as in parsing, rather than raising NotImplementedError or emitting a warning.

Separately, and more broadly, there are some perf gotchas in cudf, such as places where it makes copies or sorts that good code would avoid. A perf tips flag/mode that warns in these cases would be helpful for us, not just for the CPU fallback case. But that is a bigger story.

bdice · 2024-02-07T21:48:20Z

Good feedback! There are a few cases in I/O where cudf does not offer a GPU-accelerated reader/writer for every format. That's the only exception I can think of right now where cudf executes CPU-only code (it copies to device and returns a GPU dataframe at the end). Those are documented in the notes on this page: https://docs.rapids.ai/api/cudf/stable/user_guide/io/io/

I can think of a few algorithms where cudf has cut down on extraneous copies/sorting over the last few releases (like drop_duplicates). If any specific cases come to mind, please file issues for those! We're aiming to reduce intermediate memory usage in cudf and these would likely align with that goal (in addition to improving performance).

lmeyerov · 2024-02-07T22:42:36Z

Yes, my meta-point is a perf warnings mode, for example when defaults are slow for conformance reasons and a special calling pattern would be faster. That would be very helpful :)

Matt711 · 2024-05-22T18:04:47Z

> Error when cudf and pandas produce different outputs. This would be an extra branch within the fast path where the slow path is run even if the fast path succeeds, and then the fast and slow paths are compared for equivalence.

If it's okay with you @mroeschke, can I still work on this component since it covers the issue I opened?

mroeschke · 2024-05-22T18:07:27Z

> If it's okay with you @mroeschke, can I still work on this component since it covers the issue (#15817) I opened?

Yes go for it @Matt711!

Matt711 · 2024-05-29T17:45:56Z

We could have two debugging mode options (note: we can use different names):

1. mode.pandas_debugging
2. mode.fallback_debugging

(1.) is for when fallback does not occur. It checks that the results from cudf and pandas agree and returns a warning if they do not. I'm working on that option in this PR #15837 .

(2.) is for when fallback does occur. It could return errors on the specific types of fallback mentioned:

- Out of memory errors, for the sake of planning No OOM related work
- AttributeErrors for missing functionality
- TypeErrors for differing function signatures

What do we think about these two options?
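
For discussion, a minimal sketch of what option (1.) amounts to: run the fast path, then the slow path, and warn if the results differ. `call_and_compare` and `ResultMismatchWarning` are hypothetical names, and the default equality check is a plain `==`; this is an illustration of the idea, not the code in #15837.

```python
import operator
import warnings


class ResultMismatchWarning(UserWarning):
    """Hypothetical category: the fast and slow paths disagreed."""


def call_and_compare(fast, slow, *args, equal=operator.eq):
    """Run `fast`, then `slow`, and warn if `equal` says the results differ.

    Only the no-fallback case is covered: an exception from `fast`
    propagates unchanged so that fallback handling stays separate.
    """
    fast_result = fast(*args)
    slow_result = slow(*args)
    if not equal(fast_result, slow_result):
        warnings.warn(
            f"fast result {fast_result!r} != slow result {slow_result!r}",
            ResultMismatchWarning,
            stacklevel=2,
        )
    return fast_result
```

For dataframe outputs, `equal` would need to be a domain-aware comparison (for instance one built on `pandas.testing.assert_frame_equal`), since `==` on dataframes is elementwise and does not yield a single boolean.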

cc. @bdice @vyasr @wence-

vyasr · 2024-05-30T01:25:30Z

Making these modes independently configurable is definitely what we want, yes. As I commented on this in #15837, though, I don't think options are the right way to expose this. options are user-facing, whereas what we're trying to accomplish here is something for developers. Some environment variables documented in the developer guide are probably closer to what I would envision, especially for the first one (pandas_debugging). I don't see a reason for a user to ever need that one. I could envision exposing some internal APIs to control the second case (fallback_debugging) because in that scenario it could be useful to have the profiler hook into these so that users could collect information on why fallback occurred.

Matt711 · 2024-05-30T13:41:59Z

Using an environment variable instead of an option is fine with me. I am curious if you have a more specific place in mind in the Developer Guide for documenting the environment variable?

wence- · 2024-05-30T14:30:39Z

Maybe we can add a new section on the fast-slow-proxy wrapping scheme. It can be mostly stubbed out and we can add info.

Matt711 · 2024-05-30T14:31:41Z

> Maybe we can add a new section on the fast-slow-proxy wrapping scheme. It can be mostly stubbed out and we can add info.

Yes, and I could add that in a new cudf.pandas section in the Developer Guide?

This PR provides documentation for cudf.pandas in the Developer Guide. It will describe the fast-slow proxy wrapping scheme as well as document the `CUDF_PANDAS_DEBUGGING` environment variable created in PR #15837 for issue #14975. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Lawrence Mitchell (https://github.com/wence-) URL: #15889

#15837) Part of #14975 This PR adds a pandas debugging option to `_fast_slow_function_call` that runs the slow path after the fast and returns a warning if the results differ. Authors: - Matthew Murray (https://github.com/Matt711) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #15837

#16562) This PR makes more progress on #14975 by adding an environment variable that fails when fallback occurs in cudf.pandas. It also adds some tests that do __not__ fall back. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #16562

vyasr · 2024-11-05T20:53:29Z

@Matt711 what's the status of this issue after #16562? Next steps would be to work on enabling the various different fallback modes suggested in the issue I think (which in turn would help us do more systematic analysis of fallback).

Matt711 · 2024-11-05T21:19:47Z

> @Matt711 what's the status of this issue after #16562? Next steps would be to work on enabling the various different fallback modes suggested in the issue I think (which in turn would help us do more systematic analysis of fallback).

Thanks for the reminder! I'll create a PR that raises on specific kinds of fallback, which I think should close this issue.

bdice added the feature request New feature or request label Feb 6, 2024

bdice assigned galipremsagar Feb 6, 2024

bdice changed the title ~~[FEA] cudf.pandas should be able to warn on fallback~~ [FEA] cudf.pandas should be able to warn on CPU fallback Feb 6, 2024

vyasr changed the title ~~[FEA] cudf.pandas should be able to warn on CPU fallback~~ [FEA] Offer more control over CPU fallback in cudf.pandas May 15, 2024

vyasr mentioned this issue May 22, 2024

[FEA] Disable fallback in cudf.pandas on request #15724

Closed

mroeschke self-assigned this May 22, 2024

vyasr assigned mroeschke and unassigned mroeschke and galipremsagar May 22, 2024

Matt711 self-assigned this May 22, 2024

Matt711 mentioned this issue May 22, 2024

[FEA] Add an option to enable pandas debugging mode in cudf.pandas fast path #15817

Closed

Matt711 mentioned this issue May 23, 2024

Add an Environment Variable for debugging the fast path in cudf.pandas #15837

Merged

3 tasks

This was referenced May 30, 2024

DOC: Add documentation for cudf.pandas in the Developer Guide #15889

Merged

Add an environment variable for handling fallback in cudf.pandas #15910

Closed

Matt711 mentioned this issue Aug 14, 2024

[FEA] Add an environment variable to fail on fallback in cudf.pandas #16562

Merged

3 tasks

vyasr added the Python Affects Python cuDF API. label Nov 5, 2024

github-project-automation bot added this to cuDF Python Nov 5, 2024

github-project-automation bot moved this to Todo in cuDF Python Nov 5, 2024

Matt711 mentioned this issue Nov 7, 2024

Raise errors on specific types of fallback in cudf.pandas #17268

Merged

3 tasks

GPUtester moved this from Todo to In Progress in cuDF Python Nov 7, 2024

rapids-bot bot closed this as completed in #17268 Nov 13, 2024

rapids-bot bot closed this as completed in 76a5e32 Nov 13, 2024

github-project-automation bot moved this from In Progress to Done in cuDF Python Nov 13, 2024
