diff --git a/docs/cudf/source/user_guide/index.md b/docs/cudf/source/user_guide/index.md
index 127097631e4..486368c3b8b 100644
--- a/docs/cudf/source/user_guide/index.md
+++ b/docs/cudf/source/user_guide/index.md
@@ -16,4 +16,5 @@ options
 performance-comparisons/index
 PandasCompat
 copy-on-write
+pandas-2.0-breaking-changes
 ```
diff --git a/docs/cudf/source/user_guide/pandas-2.0-breaking-changes.md b/docs/cudf/source/user_guide/pandas-2.0-breaking-changes.md
new file mode 100644
index 00000000000..e322fe4aebc
--- /dev/null
+++ b/docs/cudf/source/user_guide/pandas-2.0-breaking-changes.md
@@ -0,0 +1,562 @@
# Breaking changes for pandas 2 in cuDF 24.04+

In release 24.04 and later, cuDF requires pandas 2, following the announcement in [RAPIDS Support Notice 36](https://docs.rapids.ai/notices/rsn0036/).
Migrating to pandas 2 comes with a number of API and behavior changes, documented below.
The changes to support pandas 2 affect both `cudf` and `cudf.pandas` (cuDF pandas accelerator mode).
For more details, refer to the [pandas 2.0 changelog](https://pandas.pydata.org/docs/whatsnew/index.html#version-2-0).

## Removed `DataFrame.append` & `Series.append`, use `cudf.concat` instead

The deprecations of `DataFrame.append` and `Series.append` are now enforced and both APIs have been removed. Use `cudf.concat` instead.

Old behavior:
```python
In [37]: s = cudf.Series([1, 2, 3])

In [38]: p = cudf.Series([10, 20, 30])

In [39]: s.append(p)
Out[39]:
0 1
1 2
2 3
0 10
1 20
2 30
dtype: int64
```

New behavior:
```python
In [40]: cudf.concat([s, p])
Out[40]:
0 1
1 2
2 3
0 10
1 20
2 30
dtype: int64
```

## Removed various numeric `Index` sub-classes, use `cudf.Index`

`Float32Index`, `Float64Index`, `GenericIndex`, `Int8Index`, `Int16Index`, `Int32Index`, `Int64Index`, `StringIndex`, `UInt8Index`, `UInt16Index`, `UInt32Index`, and `UInt64Index` have all been removed; use `cudf.Index` directly with a `dtype` to construct an index instead.

Old behavior:
```python
In [35]: cudf.Int8Index([0, 1, 2])
Out[35]: Int8Index([0, 1, 2], dtype='int8')
```

New behavior:
```python
In [36]: cudf.Index([0, 1, 2], dtype='int8')
Out[36]: Index([0, 1, 2], dtype='int8')
```

## Change in bitwise operation results

Bitwise operations between two objects with different indexes no longer return boolean results. The operands are now aligned on their indexes, and the result keeps the integer dtype, with missing values where the indexes do not overlap.

Old behavior:

```python
In [1]: import cudf

In [2]: import numpy as np

In [3]: s = cudf.Series([1, 2, 3])

In [4]: p = cudf.Series([10, 11, 12], index=[2, 1, 10])

In [5]: np.bitwise_or(s, p)
Out[5]:
0 True
1 True
2 True
10 False
dtype: bool
```

New behavior:
```python
In [5]: np.bitwise_or(s, p)
Out[5]:
0
1 11
2 11
10
dtype: int64
```

## ufuncs will perform re-indexing

Performing a NumPy ufunc operation on two objects with mismatched indexes now re-indexes (aligns) the operands:

Old behavior:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({"a": [1, 2, 3]}, index=[0, 2, 3])

In [3]: df1 = cudf.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 3])

In [4]: import numpy as np

In [6]: np.add(df, df1)
Out[6]:
 a
0 2
2 4
3 6
```

New behavior:

```python
In [6]: np.add(df, df1)
Out[6]:
 a
0
2
3 6
10
20
```

## `DataFrame` vs `Series` comparisons need to have matching index

Going forward, any comparison between a `DataFrame` and a `Series` requires matching axes, i.e., the column names of the `DataFrame` must match the index of the `Series`:

Old behavior:
```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':range(0, 5), 'b':range(10, 15)})

In [3]: df
Out[3]:
 a b
0 0 10
1 1 11
2 2 12
3 3 13
4 4 14

In [4]: s = cudf.Series([1, 2, 3])

In [6]: df == s
Out[6]:
 a b 0 1 2
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
```

New behavior:
```python
In [5]: df == s

ValueError: Can only compare DataFrame & Series objects whose columns & index are same respectively, please reindex.

# Create a Series whose index matches `df.columns` and then compare.
In [8]: s = cudf.Series([1, 2], index=['a', 'b'])

In [9]: df == s
Out[9]:
 a b
0 False False
1 True False
2 False False
3 False False
4 False False
```

## `Series.rank`

`Series.rank` now raises an error for non-numeric data when `numeric_only=True` is passed:

Old behavior:

```python
In [4]: s = cudf.Series(["a", "b", "c"])
   ...: s.rank(numeric_only=True)
Out[4]: Series([], dtype: float64)
```

New behavior:

```python
In [4]: s = cudf.Series(["a", "b", "c"])
   ...: s.rank(numeric_only=True)
TypeError: Series.rank does not allow numeric_only=True with non-numeric dtype.
```

## `value_counts` sets the result name to `count`/`proportion`

In past versions, when running `Series.value_counts()`, the result would inherit the original object's name, and the result index would be nameless. This caused confusion when resetting the index, because the column names did not correspond to the column values. Now, the result name is `'count'` (or `'proportion'` if `normalize=True` is passed), and the result index is named after the original object.

Old behavior:
```python
In [3]: cudf.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()

Out[3]:
quetzal 2
elk 1
Name: animal, dtype: int64
```

New behavior:
```python
In [3]: cudf.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()

Out[3]:
animal
quetzal 2
elk 1
Name: count, dtype: int64
```
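If your code calls `reset_index()` on the result of `value_counts()`, the column names it sees change accordingly. A minimal sketch of the interaction, assuming cuDF follows the pandas 2 naming described above:

```python
import cudf

counts = cudf.Series(
    ['quetzal', 'quetzal', 'elk'], name='animal'
).value_counts()

# The index is named 'animal' and the values are named 'count', so the
# resulting columns are 'animal' and 'count' (previously 'index' and 'animal').
counts.reset_index()
```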
## `DataFrame.describe` will include datetime data by default

Previously, `describe` excluded datetime data by default (i.e., with `datetime_is_numeric=False`). This parameter no longer has any effect, and datetime columns are always included.

Old behavior:
```python
In [4]: df = cudf.DataFrame(
   ...:     {
   ...:         "int_data": [1, 2, 3],
   ...:         "str_data": ["hello", "world", "hello"],
   ...:         "float_data": [0.3234, 0.23432, 0.0],
   ...:         "timedelta_data": cudf.Series(
   ...:             [1, 2, 1], dtype="timedelta64[ns]"
   ...:         ),
   ...:         "datetime_data": cudf.Series(
   ...:             [1, 2, 1], dtype="datetime64[ns]"
   ...:         ),
   ...:     }
   ...: )
   ...:

In [5]: df
Out[5]:
 int_data str_data float_data timedelta_data datetime_data
0 1 hello 0.32340 0 days 00:00:00.000000001 1970-01-01 00:00:00.000000001
1 2 world 0.23432 0 days 00:00:00.000000002 1970-01-01 00:00:00.000000002
2 3 hello 0.00000 0 days 00:00:00.000000001 1970-01-01 00:00:00.000000001

In [6]: df.describe()
Out[6]:
 int_data float_data timedelta_data
count 3.0 3.000000 3
mean 2.0 0.185907 0 days 00:00:00.000000001
std 1.0 0.167047 0 days 00:00:00
min 1.0 0.000000 0 days 00:00:00.000000001
25% 1.5 0.117160 0 days 00:00:00.000000001
50% 2.0 0.234320 0 days 00:00:00.000000001
75% 2.5 0.278860 0 days 00:00:00.000000001
max 3.0 0.323400 0 days 00:00:00.000000002
```

New behavior:
```python
In [6]: df.describe()
Out[6]:
 int_data float_data timedelta_data datetime_data
count 3.0 3.000000 3 3
mean 2.0 0.185907 0 days 00:00:00.000000001 1970-01-01 00:00:00.000000001
min 1.0 0.000000 0 days 00:00:00.000000001 1970-01-01 00:00:00.000000001
25% 1.5 0.117160 0 days 00:00:00.000000001 1970-01-01 00:00:00.000000001
50% 2.0 0.234320 0 days 00:00:00.000000001 1970-01-01 00:00:00.000000001
75% 2.5 0.278860 0 days 00:00:00.000000001 1970-01-01 00:00:00.000000001
max 3.0 0.323400 0 days 00:00:00.000000002 1970-01-01 00:00:00.000000002
std 1.0 0.167047 0 days 00:00:00
```
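If you relied on the old output, datetime columns can still be filtered out explicitly. A minimal sketch, assuming `describe` supports pandas-style `include`/`exclude` filtering:

```python
# Using the `df` from the example above: exclude datetime columns to
# approximate the pre-pandas-2 output (include/exclude follow pandas-style
# select_dtypes semantics).
df.describe(exclude=["datetime"])
```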
## Converting a datetime string with `Z` to timezone-naive dtype is not allowed

Previously, a datetime string with a trailing `Z` (the UTC designator) could be cast to a timezone-naive `datetime64` type. This now raises an error.

Old behavior:
```python
In [11]: s = cudf.Series(np.datetime_as_string(np.arange("2002-10-27T04:30", 10 * 60, 1, dtype="M8[m]"), timezone="UTC"))

In [12]: s
Out[12]:
0 2002-10-27T04:30Z
1 2002-10-27T04:31Z
2 2002-10-27T04:32Z
3 2002-10-27T04:33Z
4 2002-10-27T04:34Z
 ...
595 2002-10-27T14:25Z
596 2002-10-27T14:26Z
597 2002-10-27T14:27Z
598 2002-10-27T14:28Z
599 2002-10-27T14:29Z
Length: 600, dtype: object

In [13]: s.astype('datetime64[ns]')
Out[13]:
0 2002-10-27 04:30:00
1 2002-10-27 04:31:00
2 2002-10-27 04:32:00
3 2002-10-27 04:33:00
4 2002-10-27 04:34:00
 ...
595 2002-10-27 14:25:00
596 2002-10-27 14:26:00
597 2002-10-27 14:27:00
598 2002-10-27 14:28:00
599 2002-10-27 14:29:00
Length: 600, dtype: datetime64[ns]
```

New behavior:
```python
In [13]: s.astype('datetime64[ns]')

NotImplementedError: cuDF does not yet support timezone-aware datetimes casting
```
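If you previously relied on this cast and a timezone-naive result is acceptable, one possible workaround (a sketch, assuming the strings are all UTC and differ only by the trailing `Z`) is to strip the suffix before casting:

```python
# Using the `s` from the example above: drop the trailing "Z", then cast
# to a timezone-naive datetime64 type.
naive = s.str.rstrip("Z").astype("datetime64[ns]")
```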
## `Datetime` & `Timedelta` reduction operations will preserve their time resolutions

Previously, reduction operations on `datetime64` and `timedelta64` columns did not preserve the input's time resolution. Now the original resolution is preserved:

Old behavior:
```python
In [14]: sr = cudf.Series([10, None, 100, None, None], dtype='datetime64[us]')

In [15]: sr
Out[15]:
0 1970-01-01 00:00:00.000010
1
2 1970-01-01 00:00:00.000100
3
4
dtype: datetime64[us]

In [16]: sr.std()
Out[16]: Timedelta('0 days 00:00:00.000063639')
```

New behavior:
```python
In [16]: sr.std()
Out[16]: Timedelta('0 days 00:00:00.000063')
```

## `get_dummies` default return type is changed from `int8` to `bool`

The columns returned by `get_dummies` now default to `bool` dtype instead of `int8`.

Old behavior:
```python
In [2]: s = cudf.Series([1, 2, 10, 11, None])

In [6]: cudf.get_dummies(s)
Out[6]:
 1 2 10 11
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 0 0 0 0
```

New behavior:
```python
In [3]: cudf.get_dummies(s)
Out[3]:
 1 2 10 11
0 True False False False
1 False True False False
2 False False True False
3 False False False True
4 False False False False
```
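If downstream code expects numeric indicator columns, the previous dtype can be requested explicitly. A minimal sketch, assuming `get_dummies` accepts a pandas-style `dtype` argument:

```python
# Using the `s` from the example above: request int8 indicator columns
# instead of the new bool default.
cudf.get_dummies(s, dtype="int8")
```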
## `reset_index` will name columns as `None` when `name=None`

`Series.reset_index` used to name the resulting column `0` (or `self.name`) when `name=None` was passed. Now, passing `name=None` names the column `None` exactly.

Old behavior:
```python
In [2]: s = cudf.Series([1, 2, 3])

In [4]: s.reset_index(name=None)
Out[4]:
 index 0
0 0 1
1 1 2
2 2 3
```

New behavior:
```python
In [7]: s.reset_index(name=None)
Out[7]:
 index None
0 0 1
1 1 2
2 2 3
```

## Fixed an issue where duration components were being incorrectly calculated

`Series.dt.components` now returns the correct breakdown of each duration:

Old behavior:

```python
In [18]: sr = cudf.Series([136457654736252, 134736784364431, 245345345545332, 223432411, 2343241, 3634548734, 23234], dtype='timedelta64[ms]')

In [19]: sr
Out[19]:
0 1579371 days 00:05:36.252
1 1559453 days 12:32:44.431
2 2839645 days 04:52:25.332
3 2 days 14:03:52.411
4 0 days 00:39:03.241
5 42 days 01:35:48.734
6 0 days 00:00:23.234
dtype: timedelta64[ms]

In [21]: sr.dt.components
Out[21]:
 days hours minutes seconds milliseconds microseconds nanoseconds
0 84843 3 3 40 285 138 688
1 64925 15 30 48 464 138 688
2 64093 10 23 7 107 828 992
3 2 14 3 52 411 0 0
4 0 0 39 3 241 0 0
5 42 1 35 48 734 0 0
6 0 0 0 23 234 0 0
```

New behavior:

```python
In [21]: sr.dt.components
Out[21]:
 days hours minutes seconds milliseconds microseconds nanoseconds
0 1579371 0 5 36 252 0 0
1 1559453 12 32 44 431 0 0
2 2839645 4 52 25 332 0 0
3 2 14 3 52 411 0 0
4 0 0 39 3 241 0 0
5 42 1 35 48 734 0 0
6 0 0 0 23 234 0 0
```

## `fillna` on `datetime`/`timedelta` no longer type-casts the series to the fill value's resolution

Previously, when `fillna` was called with a scalar of higher resolution than the series, the result was cast to that higher resolution. Now the series' original resolution is preserved and the fill value is converted (truncating if necessary):

Old behavior:
```python
In [22]: sr = cudf.Series([1000000, 200000, None], dtype='timedelta64[s]')

In [23]: sr
Out[23]:
0 11 days 13:46:40
1 2 days 07:33:20
2
dtype: timedelta64[s]

In [24]: sr.fillna(np.timedelta64(1, 'ms'))
Out[24]:
0 11 days 13:46:40
1 2 days 07:33:20
2 0 days 00:00:00.001
dtype: timedelta64[ms]
```

New behavior:
```python
In [24]: sr.fillna(np.timedelta64(1, 'ms'))
Out[24]:
0 11 days 13:46:40
1 2 days 07:33:20
2 0 days 00:00:00
dtype: timedelta64[s]
```

## `GroupBy.nth` & `GroupBy.dtypes` will have the grouping columns in the result

Previously, `GroupBy.nth` and `GroupBy.dtypes` excluded the grouping columns from the result, returning them only as the index. Now the grouping columns are included in the result, and `GroupBy.nth` preserves the original object's index.

Old behavior:
```python
In [31]: df = cudf.DataFrame(
    ...:     {
    ...:         "a": [1, 1, 1, 2, 3],
    ...:         "b": [1, 2, 2, 2, 1],
    ...:         "c": [1, 2, None, 4, 5],
    ...:         "d": ["a", "b", "c", "d", "e"],
    ...:     }
    ...: )
    ...:

In [32]: df
Out[32]:
 a b c d
0 1 1 1 a
1 1 2 2 b
2 1 2 c
3 2 2 4 d
4 3 1 5 e

In [33]: df.groupby('a').nth(1)
Out[33]:
 b c d
a
1 2 2 b

In [34]: df.groupby('a').dtypes
Out[34]:
 b c d
a
1 int64 int64 object
2 int64 int64 object
3 int64 int64 object
```

New behavior:
```python
In [33]: df.groupby('a').nth(1)
Out[33]:
 a b c d
1 1 2 2.0 b

In [34]: df.groupby('a').dtypes
Out[34]:
 a b c d
a
1 int64 int64 int64 object
2 int64 int64 int64 object
3 int64 int64 int64 object
```
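If existing code expects the old shape, with the grouping column as the index, it can be restored after the fact. A minimal sketch using the `df` from the example above:

```python
# Move the grouping column back into the index. Note that this drops the
# original row index that the new behavior preserves.
df.groupby('a').nth(1).set_index('a')
```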