Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add is_nan expression & series method #1625

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

camriddell
Copy link
Contributor

  • add support for pandas, arrow, dask
  • add to docs
  • add tests

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

The is_nan implementation for pandas and dask currently return True for the following conditions:

  1. The backend datatype is float64
  2. The NaN value is not equal to itself

This leaves space for false rejections when there is a NaN value inside of a Series with 'object' dtype, though I am uncertain if we care about this edgecase.

Additionally, I was uncertain on how to structure the tests with regards to having NaN values inside of nullable datatypes. I currently dynamically change the expected result but was considering creating separate tests for data that are backed by a nullable datatype vs a non-nullable datatype so some guidance here would be appreciated.

- add support for pandas, arrow, dask
- add to docs
- add tests
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks @camriddell !

in answer to your question on the issue, i think np.isnan might be fine too, I'd just check that it doesn't inadvertently end up copying data just to do the operation (whereas s != s shouldn't be too expensive)

tests look good to me

docs/api-reference/expr.md Outdated Show resolved Hide resolved
narwhals/expr.py Outdated

Let's define a dataframe-agnostic function:

>>> def my_library_agnostic_function(df_native: IntoFrameT) -> IntoFrameT:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1500 is still in progress, but as we're adding a new function, shall we follow the conventions there?

narwhals/expr.py Show resolved Hide resolved

We define a dataframe-agnostic function:

>>> def my_library_agnostic_function(s_native: IntoSeriesT) -> IntoSeriesT:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

narwhals/series.py Outdated Show resolved Hide resolved
@camriddell
Copy link
Contributor Author

camriddell commented Dec 20, 2024

@MarcoGorelli I realized Polars raises an InvalidOperationError for checking is_nan for non supported dtypes. Is this a behavior we should enforce on all other backends (pandas/dask) as well?

>>> import polars as pl
>>> pl.Series(["a", "b"]).is_nan()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/polars/series/utils.py", line 106, in wrapper
    return s.to_frame().select_seq(f(*args, **kwargs)).to_series()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 9138, in select_seq
    return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2029, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: `is_nan` operation not supported for dtype `str`

Should we also raise an exception in these cases? Right now we are returning a Series of False values (though I do need to update this to respect existing Nulls).


Some additional testing code.

import polars as pl
from datetime import date

data = [[0], [1.1], ["a"], [date.today()], [[0]], [["a"]], [False]]
cant_check = []
for d in data:
    s = pl.Series(d)
    try:
        s.is_nan()
    except pl.exceptions.InvalidOperationError:
        print(f"\N{cross mark} {s.dtype = }")
    else:
        print(f"\N{white heavy check mark} {s.dtype = }")

# βœ… s.dtype = Int64
# βœ… s.dtype = Float64
# ❌ s.dtype = String
# ❌ s.dtype = Date
# ❌ s.dtype = List(Int64)
# ❌ s.dtype = List(String)
# ❌ s.dtype = Boolean

PyArrow exhibits the same behavior

>>> import pyarrow as pa
>>> s = pa.array(["1","2"])
>>> pc.is_nan(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pyarrow/compute.py", line 247, in wrapper
    return func.call(args, None, memory_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_compute.pyx", line 393, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'is_nan' has no kernel matching input types (string)

@MarcoGorelli
Copy link
Member

interesting, thanks - yeah that's probably a good idea. then we avoid the risk that people who are too used to pandas accidentally use is_nan in non-float columns in places where they really wanted is_null

@camriddell
Copy link
Contributor Author

camriddell commented Dec 27, 2024

@MarcoGorelli thoughts on the apparent inconsistency in Polars is_nan handling? In Polars, if the underlying dtype is Int64, then a is_nan converts the Null value β†’ False. However if the dtypes is Float64 then the Null is preserved.

import pandas as pd
import polars as pl
import pyarrow as pa
import narwhals as nw
from narwhals.typing import IntoFrameT
data = {"a": [2, None, 4, 5, 6], "b": [2.0, None, float("nan"), 5.0, 6.0]}
df_pd = pd.DataFrame(data).astype({"a": "Int64"})
df_pl = pl.DataFrame(data)
df_pa = pa.table(data)

def agnostic_is_nan_columns(df_native: IntoFrameT) -> IntoFrameT:
    df = nw.from_native(df_native)
    return df.with_columns(
        a_is_nan=nw.col("a").is_nan(), b_is_nan=nw.col("b").is_nan()
    ).to_native()


print(agnostic_is_nan_columns(df_pd))
#       a    b  a_is_nan  b_is_nan
# 0     2  2.0     False     False
# 1  <NA>  NaN      <NA>      True
# 2     4  NaN     False      True
# 3     5  5.0     False     False
# 4     6  6.0     False     False

print(agnostic_is_nan_columns(df_pl))
# shape: (5, 4)
# β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
# β”‚ a    ┆ b    ┆ a_is_nan ┆ b_is_nan β”‚
# β”‚ ---  ┆ ---  ┆ ---      ┆ ---      β”‚
# β”‚ i64  ┆ f64  ┆ bool     ┆ bool     β”‚
# β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════════β•ͺ══════════║
# β”‚ 2    ┆ 2.0  ┆ false    ┆ false    β”‚
# β”‚ null ┆ null ┆ false    ┆ null     β”‚
# β”‚ 4    ┆ NaN  ┆ false    ┆ true     β”‚
# β”‚ 5    ┆ 5.0  ┆ false    ┆ false    β”‚
# β”‚ 6    ┆ 6.0  ┆ false    ┆ false    β”‚
# β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

print(agnostic_is_nan_columns(df_pa))
# pyarrow.Table
# a: int64
# b: double
# a_is_nan: bool
# b_is_nan: bool
# ----
# a: [[2,null,4,5,6]]
# b: [[2,null,nan,5,6]]
# a_is_nan: [[false,null,false,false,false]]
# b_is_nan: [[false,null,true,false,false]]

MRE for the Polars example

>>> import polars as pl
>>> pl.__version__
'1.16.0'
>>> pl.DataFrame({"int": [None, 1], "float": [None, 1.0]}).with_columns(pl.all().is_nan().name.suffix("_is_nan"))
shape: (2, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ int  ┆ float ┆ int_is_nan ┆ float_is_nan β”‚
β”‚ ---  ┆ ---   ┆ ---        ┆ ---          β”‚
β”‚ i64  ┆ f64   ┆ bool       ┆ bool         β”‚
β•žβ•β•β•β•β•β•β•ͺ═══════β•ͺ════════════β•ͺ══════════════║
β”‚ null ┆ null  ┆ false      ┆ null         β”‚
β”‚ 1    ┆ 1.0   ┆ false      ┆ false        β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

ser = self._native_series
if self.dtype.is_numeric():
return self._from_native_series(ser != ser) # noqa: PLR0124
msg = f"`is_nan` is not supported for dtype {self.dtype}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this error message be extended to include a suggestion?

"is_nan is not supported for dtype {self.dtype}, did you mean to use is_null?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of love for kind error messages πŸ’š

@MarcoGorelli
Copy link
Member

thanks for spotting this! I think this looks like a bug in Polars, shall we report it there? I think either is_nan should only be supported on float columns (my preference), or the null preservation should be consistent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: is_nan
2 participants