From 3a388dddff7bd865c257611c486e71db7464827e Mon Sep 17 00:00:00 2001 From: Matthew Murray Date: Thu, 30 May 2024 08:39:16 -0700 Subject: [PATCH 1/4] DOC: Add documentation for cudf.pandas in the Developer Guide --- docs/cudf/source/developer_guide/cudf_pandas.md | 1 + docs/cudf/source/developer_guide/index.md | 1 + 2 files changed, 2 insertions(+) create mode 100644 docs/cudf/source/developer_guide/cudf_pandas.md diff --git a/docs/cudf/source/developer_guide/cudf_pandas.md b/docs/cudf/source/developer_guide/cudf_pandas.md new file mode 100644 index 00000000000..e2182c1a9dd --- /dev/null +++ b/docs/cudf/source/developer_guide/cudf_pandas.md @@ -0,0 +1 @@ +# cudf.pandas diff --git a/docs/cudf/source/developer_guide/index.md b/docs/cudf/source/developer_guide/index.md index 5cafa8f784c..5e099631fc5 100644 --- a/docs/cudf/source/developer_guide/index.md +++ b/docs/cudf/source/developer_guide/index.md @@ -27,4 +27,5 @@ testing benchmarking options pylibcudf +cudf_pandas ``` From 3e44e328d96c9d9834839974c9339bd4f9adb21c Mon Sep 17 00:00:00 2001 From: Matthew Murray Date: Mon, 3 Jun 2024 12:38:25 -0700 Subject: [PATCH 2/4] Add cudf.pandas developer documentation --- .../source/developer_guide/cudf_pandas.md | 96 +++++++++++++++++++ 1 file changed, 96 insertions(+) diff --git a/docs/cudf/source/developer_guide/cudf_pandas.md b/docs/cudf/source/developer_guide/cudf_pandas.md index e2182c1a9dd..31aab569b40 100644 --- a/docs/cudf/source/developer_guide/cudf_pandas.md +++ b/docs/cudf/source/developer_guide/cudf_pandas.md @@ -1 +1,97 @@ # cudf.pandas +The use of the cuDF pandas accelerator mode (`cudf.pandas`) is explained [here](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/). The purpose of this document is to explain how the fast-slow proxy mechanism works and document the internal environment variables used to debug `cudf.pandas`. 
+ +## fast-slow proxy mechanism +`cudf.pandas` works by wrapping each Pandas type (and its corresponding cuDF type) in new proxy types (aka fast-slow proxy types). Because the proxy types wrap both the fast and slow implementations of the original type, we can ensure that computations are first done on the fast version of the proxy type, and if that fails, the slow version of the proxy type. + +### Types +#### Wrapped Types and Proxy Types +The "wrapped" types/classes are the Pandas and cuDF specific types that have been wrapped into proxy types. Wrapped objects and proxy objects are instances of wrapped types and proxy types, respectively. In the snippet below `s1` and `s2` are wrapped objects and `s3` is a fast-slow proxy object. Also note that the module `xpd` is a wrapped module and contains cuDF and Pandas modules as attributes. + ```python + import cudf.pandas + cudf.pandas.install() + import pandas as xpd + + cudf = xpd._fsproxy_fast + pd = xpd._fsproxy_slow + + s1 = cudf.Series([1,2]) + s2 = pd.Series([1,2]) + s3 = xpd.Series([1,2]) + ``` + +#### The Different Kinds of Proxy Types +In `cudf.pandas`, proxy types come in multiple kinds: fast-slow proxy types, callable types, and fast-slow attribute types. + +Fast-slow proxy types come in two flavors: final types and intermediate types. Final types are types for which known operations exist for converting an object of "fast" type to "slow" and vice-versa. For example, `cudf.DataFrame` can be converted to Pandas using the method `to_pandas` and `pd.DataFrame` can be converted to cuDF using the function `cudf.from_pandas`. + +Intermediate types are the types of the results of operations invoked on final types. For example, `DataFrameGroupBy` is a type that will be created during a groupby operation. + +Callable types are ... + +Fast-Slow attribute types ... 
+ +#### Creating New Proxy Types +`_FinalProxy` and `_IntermediateProxy` types are created using the functions `make_final_proxy_type` and `make_intermediate_proxy` type, respectively. Creating a new final type looks like this. + +```python +DataFrame = make_final_proxy_type( + "DataFrame", + cudf.DataFrame, + pd.DataFrame, + fast_to_slow=lambda fast: fast.to_pandas(), + slow_to_fast=cudf.from_pandas, +) +``` + +### The Fallback Mechanism +The `_fast_slow_function_call` is the all-important function that we use to call operations the fast way (using cuDF) and if that fails, the slow way (using Pandas). The is also known as the fallback mechanism. The function looks like this: +```python +def _fast_slow_function_call(func: Callable, *args, **kwargs): + try: + ... + fast_args, fast_kwargs = _fast_arg(args), _fast_arg(kwargs) + result = func(*fast_args, **fast_kwargs) + ... + except Exception: + ... + slow_args, slow_kwargs = _slow_arg(args), _slow_arg(kwargs) + result = func(*slow_args, **slow_kwargs) + ... +``` +As you can see the function attempts to call `func` the fast way using cuDF and if an any `Exception` occurs, it calls the function using Pandas. In essence, this `try-except` is what allows `cudf.pandas` to support 100% of the Pandas API. + +### Using Metaclasses +`cudf.pandas` uses a [metalass](https://docs.python.org/3/glossary.html#term-metaclass) called (`_FastSlowProxyMeta`) to dynamically find class attributes and classmethods of fast-slow proxy types. For example, in the snippet below, the `xpd.Series` type is an instance `_FastSlowProxyMeta`. Therefore we can access the property `_fsproxy_fast` defined in the metaclass. +```python +import cudf.pandas +cudf.pandas.install() +import pandas as xpd + +print(xpd.Series._fsproxy_fast) # output is cudf.core.series.Series +``` + +### Caching + +### Pickling and Unpickling + +## debugging `cudf.pandas` +Several environment variables are available for debugging purposes. 
+ +Setting the environment variable `CUDF_PANDAS_DEBUGGING` produces a warning when the results from cuDF and Pandas differ from one another. For example, the snippet below produces the warning below. +```python +import cudf.pandas +cudf.pandas.install() +import pandas as pd +import numpy as np + +setattr(pd.Series.mean, "_fsproxy_slow", lambda self, *args, **kwargs: np.float64(1)) +s = pd.Series([1,2,3]) +s.mean() +``` +``` +UserWarning: The results from cudf and pandas were different. The exception was +Arrays are not almost equal to 7 decimals + ACTUAL: 1.0 + DESIRED: 2.0. +``` From ec0853ca1e966d0475aab61c5af1c794c6c66284 Mon Sep 17 00:00:00 2001 From: Matthew Murray Date: Tue, 4 Jun 2024 06:31:35 -0700 Subject: [PATCH 3/4] Address comments --- .../source/developer_guide/cudf_pandas.md | 70 ++++++++++++------- 1 file changed, 46 insertions(+), 24 deletions(-) diff --git a/docs/cudf/source/developer_guide/cudf_pandas.md b/docs/cudf/source/developer_guide/cudf_pandas.md index 31aab569b40..6ef84cd7030 100644 --- a/docs/cudf/source/developer_guide/cudf_pandas.md +++ b/docs/cudf/source/developer_guide/cudf_pandas.md @@ -1,12 +1,17 @@ # cudf.pandas -The use of the cuDF pandas accelerator mode (`cudf.pandas`) is explained [here](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/). The purpose of this document is to explain how the fast-slow proxy mechanism works and document the internal environment variables used to debug `cudf.pandas`. - -## fast-slow proxy mechanism -`cudf.pandas` works by wrapping each Pandas type (and its corresponding cuDF type) in new proxy types (aka fast-slow proxy types). Because the proxy types wrap both the fast and slow implementations of the original type, we can ensure that computations are first done on the fast version of the proxy type, and if that fails, the slow version of the proxy type. 
- -### Types -#### Wrapped Types and Proxy Types -The "wrapped" types/classes are the Pandas and cuDF specific types that have been wrapped into proxy types. Wrapped objects and proxy objects are instances of wrapped types and proxy types, respectively. In the snippet below `s1` and `s2` are wrapped objects and `s3` is a fast-slow proxy object. Also note that the module `xpd` is a wrapped module and contains cuDF and Pandas modules as attributes. +The use of the cuDF pandas accelerator mode (`cudf.pandas`) is explained [here](../cudf_pandas/index.rst). +The purpose of this document is to explain how the fast-slow proxy mechanism works and document the internal environment variables used to debug `cudf.pandas`. + +## fast-slow proxy mechanism: +`cudf.pandas` works by wrapping each Pandas type and its corresponding cuDF type in new proxy types (aka fast-slow proxy types). +The purpose of the proxy types is to attempt computations on the fast (cuDF) object first, and then fall back to running on the slow (Pandas) object if the fast version fails. + +### Types: +#### Wrapped Types and Proxy Types: +The "wrapped" types/classes are the Pandas and cuDF specific types that have been wrapped into proxy types. +Wrapped objects and proxy objects are instances of wrapped types and proxy types, respectively. +In the snippet below `s1` and `s2` are wrapped objects and `s3` is a fast-slow proxy object. +Also note that the module `xpd` is a wrapped module and contains cuDF and Pandas modules as attributes. ```python import cudf.pandas cudf.pandas.install() @@ -20,18 +25,24 @@ The "wrapped" types/classes are the Pandas and cuDF specific types that have bee s3 = xpd.Series([1,2]) ``` -#### The Different Kinds of Proxy Types -In `cudf.pandas`, proxy types come in multiple kinds: fast-slow proxy types, callable types, and fast-slow attribute types. +Note that users should never have to interact with the wrapped objects directly in this way. 
+This code is purely for demonstrative purposes. -Fast-slow proxy types come in two flavors: final types and intermediate types. Final types are types for which known operations exist for converting an object of "fast" type to "slow" and vice-versa. For example, `cudf.DataFrame` can be converted to Pandas using the method `to_pandas` and `pd.DataFrame` can be converted to cuDF using the function `cudf.from_pandas`. +#### The Different Kinds of Proxy Types: +In `cudf.pandas`, there are two main kinds of proxy types: final types and intermediate types. -Intermediate types are the types of the results of operations invoked on final types. For example, `DataFrameGroupBy` is a type that will be created during a groupby operation. +##### Final and Intermediate Proxy Types: +Final types are types for which known operations exist for converting an object of a "fast" type to a "slow" type and vice versa. +For example, `cudf.DataFrame` can be converted to Pandas using the method `to_pandas`, and `pd.DataFrame` can be converted to cuDF using the function `cudf.from_pandas`. +Intermediate types are the types of the results of operations invoked on final types. +For example, `xpd.DataFrameGroupBy` is an intermediate type that will be created during a groupby operation on the final type `xpd.DataFrame`. -Callable types are ... +##### Attributes and Callable Proxy Types: +Final proxy types are typically classes or modules, both of which have attributes. +Classes also have methods. +These attributes and methods must be wrapped as well to support the fast-slow proxy scheme. -Fast-Slow attribute types ... - -#### Creating New Proxy Types +#### Creating New Proxy Types: `_FinalProxy` and `_IntermediateProxy` types are created using the functions `make_final_proxy_type` and `make_intermediate_proxy` type, respectively. Creating a new final type looks like this. 
```python @@ -45,7 +56,9 @@ DataFrame = make_final_proxy_type( ``` ### The Fallback Mechanism -The `_fast_slow_function_call` is the all-important function that we use to call operations the fast way (using cuDF) and if that fails, the slow way (using Pandas). The is also known as the fallback mechanism. The function looks like this: +The `_fast_slow_function_call` is the all-important function that we use to call operations the fast way (using cuDF) and if that fails, the slow way (using Pandas). +The is also known as the fallback mechanism. +The function looks like this: ```python def _fast_slow_function_call(func: Callable, *args, **kwargs): try: @@ -58,11 +71,23 @@ def _fast_slow_function_call(func: Callable, *args, **kwargs): slow_args, slow_kwargs = _slow_arg(args), _slow_arg(kwargs) result = func(*slow_args, **slow_kwargs) ... + return _maybe_wrap_result(result, func, *args, **kwargs), fast ``` -As you can see the function attempts to call `func` the fast way using cuDF and if an any `Exception` occurs, it calls the function using Pandas. In essence, this `try-except` is what allows `cudf.pandas` to support 100% of the Pandas API. +As you can see the function attempts to call `func` the fast way using cuDF and if any `Exception` occurs, it calls the function using Pandas. +In essence, this `try-except` is what allows `cudf.pandas` to support 100% of the Pandas API. + +At the end, the function wraps the result from either path in a fast-slow proxy object, if necessary. + +#### Converting Proxy Objects +Note that before the `func` is called, the proxy object and its attributes need to be converted to either their cuDF or Pandas implementations. +This conversion is handled in the function `_transform_arg` which both `_fast_arg` and `_slow_arg` call. + +`_transform_arg` is a recuirsive function that will call itself depending on the type or argument passed to it (eg. `_transform_arg` is called for each element in a list of arguments). 
### Using Metaclasses -`cudf.pandas` uses a [metalass](https://docs.python.org/3/glossary.html#term-metaclass) called (`_FastSlowProxyMeta`) to dynamically find class attributes and classmethods of fast-slow proxy types. For example, in the snippet below, the `xpd.Series` type is an instance `_FastSlowProxyMeta`. Therefore we can access the property `_fsproxy_fast` defined in the metaclass. +`cudf.pandas` uses a [metaclass](https://docs.python.org/3/glossary.html#term-metaclass) called (`_FastSlowProxyMeta`) to dynamically find class attributes and classmethods of fast-slow proxy types. +For example, in the snippet below, the `xpd.Series` type is an instance of `_FastSlowProxyMeta`. +Therefore we can access the property `_fsproxy_fast` defined in the metaclass. ```python import cudf.pandas cudf.pandas.install() @@ -71,14 +96,11 @@ import pandas as xpd print(xpd.Series._fsproxy_fast) # output is cudf.core.series.Series ``` -### Caching - -### Pickling and Unpickling - ## debugging `cudf.pandas` Several environment variables are available for debugging purposes. -Setting the environment variable `CUDF_PANDAS_DEBUGGING` produces a warning when the results from cuDF and Pandas differ from one another. For example, the snippet below produces the warning below. +Setting the environment variable `CUDF_PANDAS_DEBUGGING` produces a warning when the results from cuDF and Pandas differ from one another. +For example, the snippet below produces the warning below. 
```python import cudf.pandas cudf.pandas.install() From 2aa159ceb622517e9e44e75ebdeef5a89616d558 Mon Sep 17 00:00:00 2001 From: Matthew Murray Date: Wed, 5 Jun 2024 10:04:28 -0700 Subject: [PATCH 4/4] Address comments --- .../source/developer_guide/cudf_pandas.md | 36 ++++++++++--------- 1 file changed, 19 insertions(+), 17 deletions(-) diff --git a/docs/cudf/source/developer_guide/cudf_pandas.md b/docs/cudf/source/developer_guide/cudf_pandas.md index 6ef84cd7030..aeb43f66b2d 100644 --- a/docs/cudf/source/developer_guide/cudf_pandas.md +++ b/docs/cudf/source/developer_guide/cudf_pandas.md @@ -1,13 +1,13 @@ # cudf.pandas -The use of the cuDF pandas accelerator mode (`cudf.pandas`) is explained [here](../cudf_pandas/index.rst). -The purpose of this document is to explain how the fast-slow proxy mechanism works and document the internal environment variables used to debug `cudf.pandas`. +The use of the cuDF pandas accelerator mode (`cudf.pandas`) is explained [in the user guide](../cudf_pandas/index.rst). +The purpose of this document is to explain how the fast-slow proxy mechanism works and document internal environment variables that can be used to debug `cudf.pandas` itself. -## fast-slow proxy mechanism: -`cudf.pandas` works by wrapping each Pandas type and its corresponding cuDF type in new proxy types (aka fast-slow proxy types). -The purpose of the proxy types is to attempt computations on the fast (cuDF) object first, and then fall back to running on the slow (Pandas) object if the fast version fails. +## fast-slow proxy mechanism +`cudf.pandas` works by wrapping each Pandas type and its corresponding cuDF type in a new proxy type, also known as a fast-slow proxy type. +The purpose of proxy types is to attempt computations on the fast (cuDF) object first, and then fall back to running on the slow (Pandas) object if the fast version fails.
### Types: -#### Wrapped Types and Proxy Types: +#### Wrapped Types and Proxy Types The "wrapped" types/classes are the Pandas and cuDF specific types that have been wrapped into proxy types. Wrapped objects and proxy objects are instances of wrapped types and proxy types, respectively. In the snippet below `s1` and `s2` are wrapped objects and `s3` is a fast-slow proxy object. @@ -25,25 +25,28 @@ Also note that the module `xpd` is a wrapped module and contains cuDF and Pandas s3 = xpd.Series([1,2]) ``` +```{note} Note that users should never have to interact with the wrapped objects directly in this way. This code is purely for demonstrative purposes. +``` -#### The Different Kinds of Proxy Types: +#### The Different Kinds of Proxy Types In `cudf.pandas`, there are two main kinds of proxy types: final types and intermediate types. -##### Final and Intermediate Proxy Types: +##### Final and Intermediate Proxy Types Final types are types for which known operations exist for converting an object of a "fast" type to a "slow" type and vice versa. For example, `cudf.DataFrame` can be converted to Pandas using the method `to_pandas`, and `pd.DataFrame` can be converted to cuDF using the function `cudf.from_pandas`. Intermediate types are the types of the results of operations invoked on final types. For example, `xpd.DataFrameGroupBy` is an intermediate type that will be created during a groupby operation on the final type `xpd.DataFrame`. -##### Attributes and Callable Proxy Types: +##### Attributes and Callable Proxy Types Final proxy types are typically classes or modules, both of which have attributes. Classes also have methods. These attributes and methods must be wrapped as well to support the fast-slow proxy scheme. -#### Creating New Proxy Types: -`_FinalProxy` and `_IntermediateProxy` types are created using the functions `make_final_proxy_type` and `make_intermediate_proxy` type, respectively. Creating a new final type looks like this. 
+#### Creating New Proxy Types +`_FinalProxy` and `_IntermediateProxy` types are created using the functions `make_final_proxy_type` and `make_intermediate_proxy_type`, respectively. +Creating a new final type looks like this: ```python DataFrame = make_final_proxy_type( @@ -56,8 +59,7 @@ DataFrame = make_final_proxy_type( ``` ### The Fallback Mechanism -The `_fast_slow_function_call` is the all-important function that we use to call operations the fast way (using cuDF) and if that fails, the slow way (using Pandas). -The is also known as the fallback mechanism. +Proxied calls are implemented with fallback via [`_fast_slow_function_call`](https://github.com/rapidsai/cudf/blob/57aeeb78d85e169ac18b82f51d2b1cbd01b0608d/python/cudf/cudf/pandas/fast_slow_proxy.py#L869). This implements the mechanism by which we attempt operations the fast way (using cuDF) and then fall back to the slow way (using Pandas) on failure. The function looks like this: ```python def _fast_slow_function_call(func: Callable, *args, **kwargs): @@ -73,8 +75,8 @@ def _fast_slow_function_call(func: Callable, *args, **kwargs): ... return _maybe_wrap_result(result, func, *args, **kwargs), fast ``` -As you can see the function attempts to call `func` the fast way using cuDF and if any `Exception` occurs, it calls the function using Pandas. -In essence, this `try-except` is what allows `cudf.pandas` to support 100% of the Pandas API. +As we can see, the function attempts to call `func` the fast way using cuDF and, if any `Exception` occurs, calls the function using Pandas instead. +In essence, this `try-except` is what allows `cudf.pandas` to support the bulk of the Pandas API. At the end, the function wraps the result from either path in a fast-slow proxy object, if necessary.
@@ -82,10 +84,10 @@ At the end, the function wraps the result from either path in a fast-slow proxy Note that before the `func` is called, the proxy object and its attributes need to be converted to either their cuDF or Pandas implementations. This conversion is handled in the function `_transform_arg` which both `_fast_arg` and `_slow_arg` call. -`_transform_arg` is a recuirsive function that will call itself depending on the type or argument passed to it (eg. `_transform_arg` is called for each element in a list of arguments). +`_transform_arg` is a recursive function that will call itself depending on the type of argument passed to it (e.g., `_transform_arg` is called for each element in a list of arguments). ### Using Metaclasses -`cudf.pandas` uses a [metaclass](https://docs.python.org/3/glossary.html#term-metaclass) called (`_FastSlowProxyMeta`) to dynamically find class attributes and classmethods of fast-slow proxy types. +`cudf.pandas` uses a [metaclass](https://docs.python.org/3/glossary.html#term-metaclass) called `_FastSlowProxyMeta` to find class attributes and classmethods of fast-slow proxy types. For example, in the snippet below, the `xpd.Series` type is an instance of `_FastSlowProxyMeta`. Therefore we can access the property `_fsproxy_fast` defined in the metaclass. ```python