
[FEA] Offer more control over CPU fallback in cudf.pandas #14975

Closed
bdice opened this issue Feb 6, 2024 · 14 comments · Fixed by #17268
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@bdice
Contributor

bdice commented Feb 6, 2024

Is your feature request related to a problem? Please describe.
The default execution model for cudf.pandas is to try to execute an operation on the GPU, then fall back to the CPU if it fails for any reason. This approach is desirable for end-users to maximize the number of cases where cudf.pandas "just works", but it makes it difficult to analyze when failures are occurring and why. The former can be addressed by running under the profiler, but that is more cumbersome than we would like in many cases where we would rather get a quick signal in the form of failure (e.g. when running a workflow or a test suite to analyze unsupported cases). Furthermore, there is no easy way to determine whether cudf and pandas return the same results for a given operation, which is a different failure mode that is currently not possible to capture.

Describe the solution you'd like
We should generalize _fast_slow_function_call to support a wider range of fallback options. These options could be configurable by an environment variable, or by some global configuration option (the former is probably fine to start with). The different behaviors we would want to support are:

  • Error on fallback. We could then run the pandas test suite with this turned on and get a sense of how many tests cudf passes on its own.
  • Error on specific types of fallback. This would allow us to analyze the types of fallback that are occurring. Some of the most obvious error modes I can foresee (there are certainly others) are:
    • Out of memory errors, for the sake of planning No OOM related work
    • AttributeErrors for missing functionality
    • TypeErrors for differing function signatures
  • Error when cudf and pandas produce different outputs. This would be an extra branch within the fast path where the slow path is run even if the fast path succeeds, and then the fast and slow paths are compared for equivalence.

We may want to support warning instead of raising errors in some cases, but I don't think that's critical to start.
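The fallback behaviors listed above can be sketched as a single policy switch. This is only an illustration under stated assumptions: `FallbackPolicy`, `FallbackError`, and `fast_slow_call` are hypothetical names, not the actual `_fast_slow_function_call` implementation or any cudf.pandas API.

```python
import enum


class FallbackPolicy(enum.Enum):
    """Hypothetical policies; names are illustrative, not cudf.pandas API."""

    ALLOW = "allow"            # default behavior: fall back silently
    ERROR = "error"            # raise on any fallback
    ERROR_ON_TYPES = "types"   # raise only for selected exception types


class FallbackError(RuntimeError):
    """Raised when the fast path fails and the policy forbids fallback."""


def fast_slow_call(fast, slow, *args, policy=FallbackPolicy.ALLOW,
                   error_types=(), **kwargs):
    """Try the fast callable; on failure either fall back or raise."""
    try:
        return fast(*args, **kwargs)
    except Exception as exc:
        # Decide whether this particular failure is allowed to fall back.
        forbid = policy is FallbackPolicy.ERROR or (
            policy is FallbackPolicy.ERROR_ON_TYPES
            and isinstance(exc, tuple(error_types))
        )
        if forbid:
            raise FallbackError(
                f"fast path failed with {type(exc).__name__}: {exc}"
            ) from exc
        return slow(*args, **kwargs)
```

With `policy=FallbackPolicy.ERROR_ON_TYPES` and `error_types=(MemoryError,)`, for example, only out-of-memory failures would surface as errors, while `AttributeError` or `TypeError` failures would still fall back.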

Describe alternatives you've considered
This could be configured by the cudf.pandas profiler, or a similar context manager?

Additional context
Feedback from @ianozsvald and @lmeyerov would be welcome!

@bdice bdice added the feature request New feature or request label Feb 6, 2024
@bdice bdice changed the title from "[FEA] cudf.pandas should be able to warn on fallback" to "[FEA] cudf.pandas should be able to warn on CPU fallback" Feb 6, 2024
@lmeyerov

lmeyerov commented Feb 7, 2024

A Python Warning object, so we can do managed handling, would make sense.

Note that we are cudf users rather than cudf.pandas users, so our interest would be in seeing the same thing there.
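As a rough illustration of what "managed handling" could look like with the standard `warnings` module: the category `CPUFallbackWarning` and the helper `run_with_fallback` below are hypothetical names invented for this sketch, not existing cudf classes or functions.

```python
import warnings


class CPUFallbackWarning(UserWarning):
    """Hypothetical warning category; not an existing cudf class."""


def run_with_fallback(fast, slow, *args):
    """Call fast; if it raises, warn and return slow's result instead."""
    try:
        return fast(*args)
    except Exception as exc:
        warnings.warn(
            f"falling back to CPU: {type(exc).__name__}: {exc}",
            CPUFallbackWarning,
            stacklevel=2,
        )
        return slow(*args)


def _unsupported(x):
    # Stand-in for an operation with no GPU implementation.
    raise NotImplementedError("no GPU implementation")


# Record the warnings so they can be inspected programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", CPUFallbackWarning)
    result = run_with_fallback(_unsupported, abs, -3)

# Or escalate just this category to an exception.
escalated = False
with warnings.catch_warnings():
    warnings.simplefilter("error", CPUFallbackWarning)
    try:
        run_with_fallback(_unsupported, abs, -3)
    except CPUFallbackWarning:
        escalated = True
```

A dedicated category lets callers record, silence, or escalate fallback warnings without touching unrelated warnings.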

@bdice
Contributor Author

bdice commented Feb 7, 2024

@lmeyerov cudf doesn't fall back to CPU so you'd never see this with normal cudf usage. Only cudf.pandas has CPU fallback behavior. Can you clarify what you mean?

@lmeyerov

lmeyerov commented Feb 7, 2024

Re: cudf, for some reason I thought a few cudf methods would fall back to CPU, like in parsing or others, rather than throwing NotImpl or a warning.

Separately, and more broadly, there are some perf gotchas in cudf, like where it makes copies or sorts that good code would avoid. A perf tips flag/mode that warns in these cases would be helpful for us, not just for the CPU fallback case. But that is a bigger story.

@bdice
Contributor Author

bdice commented Feb 7, 2024

Good feedback! There are a few cases in I/O where cudf does not offer a GPU-accelerated reader/writer for every format. That's the only exception I can think of right now where cudf executes CPU-only code (it copies to device and returns a GPU dataframe at the end). Those are documented in the notes on this page: https://docs.rapids.ai/api/cudf/stable/user_guide/io/io/

I can think of a few algorithms where cudf has cut down on extraneous copies/sorting over the last few releases (like drop_duplicates). If any specific cases come to mind, please file issues for those! We're aiming to reduce intermediate memory usage in cudf and these would likely align with that goal (in addition to improving performance).

@lmeyerov

lmeyerov commented Feb 7, 2024

Yes, my meta point is that a perf warnings mode would be very helpful, e.g. for when defaults are slow for conformance reasons and a special calling pattern would make things faster :)

@vyasr vyasr changed the title from "[FEA] cudf.pandas should be able to warn on CPU fallback" to "[FEA] Offer more control over CPU fallback in cudf.pandas" May 15, 2024
@mroeschke mroeschke self-assigned this May 22, 2024
@vyasr vyasr assigned mroeschke and unassigned mroeschke and galipremsagar May 22, 2024
@Matt711
Contributor

Matt711 commented May 22, 2024

> • Error when cudf and pandas produce different outputs. This would be an extra branch within the fast path where the slow path is run even if the fast path succeeds, and then the fast and slow paths are compared for equivalence.

If it's okay with you @mroeschke, can I still work on this component since it covers the issue I opened?

@mroeschke
Contributor

> If it's okay with you @mroeschke, can I still work on this component since it covers the #15817 I opened?

Yes go for it @Matt711!

@Matt711
Contributor

Matt711 commented May 29, 2024

We could have two debugging mode options (note: we can use different names):

  1. mode.pandas_debugging
  2. mode.fallback_debugging

(1.) is for when fallback does not occur. It checks that the results from cudf and pandas agree and returns a warning if they do not. I'm working on that option in this PR #15837 .

(2.) is for when fallback does occur. It could return errors on the specific types of fallback mentioned:

  • Out of memory errors, for the sake of planning No OOM related work
  • AttributeErrors for missing functionality
  • TypeErrors for differing function signatures

What do we think about these two options?

cc. @bdice @vyasr @wence-
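To make option (1.) concrete, here is a minimal sketch of running both paths and warning on disagreement. `ResultMismatchWarning` and `call_and_compare` are hypothetical names, and plain `==` stands in for whatever equivalence check the real implementation uses.

```python
import warnings


class ResultMismatchWarning(UserWarning):
    """Hypothetical category for fast/slow disagreements."""


def call_and_compare(fast, slow, *args, equal=lambda a, b: a == b):
    """Run fast, then slow, and warn if the two results disagree.

    Only covers the no-fallback case: if fast raises, the exception
    propagates and no comparison is attempted.
    """
    fast_result = fast(*args)
    slow_result = slow(*args)
    if not equal(fast_result, slow_result):
        warnings.warn(
            f"fast and slow results differ: {fast_result!r} != {slow_result!r}",
            ResultMismatchWarning,
            stacklevel=2,
        )
    # The fast result is still what the caller receives.
    return fast_result
```

For DataFrame or Series results, `equal` would need to be a comparison that understands those types, for example one built on `pandas.testing.assert_frame_equal`.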

@vyasr
Contributor

vyasr commented May 30, 2024

Making these modes independently configurable is definitely what we want, yes. As I commented in #15837, though, I don't think options are the right way to expose this. Options are user-facing, whereas what we're trying to accomplish here is something for developers. Some environment variables documented in the developer guide are probably closer to what I would envision, especially for the first one (pandas_debugging). I don't see a reason for a user to ever need that one. I could envision exposing some internal APIs to control the second case (fallback_debugging) because in that scenario it could be useful to have the profiler hook into these so that users could collect information on why fallback occurred.
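A small sketch of how two independently configurable, developer-facing toggles might be read from the environment. The variable names `EXAMPLE_PANDAS_DEBUGGING` and `EXAMPLE_FALLBACK_DEBUGGING` and the set of accepted truthy strings are assumptions made for illustration, not what cudf.pandas actually parses.

```python
import os

# Assumed set of strings treated as "enabled"; purely illustrative.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, environ=None):
    """Return True if the variable `name` is set to a truthy string."""
    environ = os.environ if environ is None else environ
    return environ.get(name, "").strip().lower() in _TRUTHY


def debugging_modes(environ=None):
    """Read both hypothetical developer toggles independently."""
    return {
        "pandas_debugging": env_flag("EXAMPLE_PANDAS_DEBUGGING", environ),
        "fallback_debugging": env_flag("EXAMPLE_FALLBACK_DEBUGGING", environ),
    }
```

Passing `environ` explicitly keeps the helpers easy to test without mutating the real process environment.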

@Matt711
Contributor

Matt711 commented May 30, 2024

Using an environment variable instead of an option is fine with me. I am curious if you have a more specific place in mind in the Developer Guide for documenting the environment variable?

@wence-
Contributor

wence- commented May 30, 2024

Maybe we can add a new section on the fast-slow-proxy wrapping scheme. It can be mostly stubbed out and we can add info.

@Matt711
Contributor

Matt711 commented May 30, 2024

> Maybe we can add a new section on the fast-slow-proxy wrapping scheme. It can be mostly stubbed out and we can add info.

Yes, and I could add that in a new cudf.pandas section in the Developer Guide?

rapids-bot bot pushed a commit that referenced this issue Jun 5, 2024
This PR provides documentation for cudf.pandas in the Developer Guide. It will describe the fast-slow proxy wrapping scheme as well as document the `CUDF_PANDAS_DEBUGGING` environment variable created in PR #15837 for issue #14975.

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #15889
rapids-bot bot pushed a commit that referenced this issue Jun 9, 2024
#15837)

Part of #14975. This PR adds a pandas debugging option to `_fast_slow_function_call` that runs the slow path after the fast path and emits a warning if the results differ.

Authors:
  - Matthew Murray (https://github.com/Matt711)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15837
rapids-bot bot pushed a commit that referenced this issue Sep 25, 2024
#16562)

This PR makes more progress on #14975 by adding an environment variable that fails when fallback occurs in cudf.pandas. It also adds some tests that do not fall back.

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16562
@vyasr
Contributor

vyasr commented Nov 5, 2024

@Matt711 what's the status of this issue after #16562? Next steps would be to work on enabling the various different fallback modes suggested in the issue I think (which in turn would help us do more systematic analysis of fallback).

@vyasr vyasr added the Python Affects Python cuDF API. label Nov 5, 2024
@Matt711
Contributor

Matt711 commented Nov 5, 2024

> @Matt711 what's the status of this issue after #16562? Next steps would be to work on enabling the various different fallback modes suggested in the issue I think (which in turn would help us do more systematic analysis of fallback).

Thanks for the reminder! I'll create a PR that raises on specific kinds of fallback, which I think should close this issue.

@GPUtester GPUtester moved this from Todo to In Progress in cuDF Python Nov 7, 2024
@rapids-bot rapids-bot bot closed this as completed in 76a5e32 Nov 13, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Nov 13, 2024