docs: Add page about pandas booleans (#1392)

---------

Co-authored-by: Francesco Bruzzesi <[email protected]>
MarcoGorelli and FBruzzesi authored Nov 17, 2024
1 parent 950661f commit f8f3683
Showing 10 changed files with 75 additions and 9 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/extremes.yml
@@ -78,7 +78,7 @@ jobs:
           enable-cache: "true"
           cache-suffix: ${{ matrix.python-version }}
           cache-dependency-glob: "**requirements*.txt"
-      - name: install-minimum-versions
+      - name: install-not-so-old-versions
         run: uv pip install tox virtualenv setuptools pandas==2.0.3 polars==0.20.8 numpy==1.24.4 pyarrow==14.0.0 scipy==1.8.0 scikit-learn==1.3.0 dask[dataframe]==2024.7 tzdata --system
       - name: install-reqs
         run: uv pip install -r requirements-dev.txt --system
61 changes: 61 additions & 0 deletions docs/pandas_like_concepts/boolean.md
@@ -0,0 +1,61 @@
# Boolean columns

Generally speaking, Narwhals operations preserve null values.
For example, if you do `nw.col('a')*2`, then:

- Values which were non-null get multiplied by 2.
- Null values stay null.

```python exec="1" source="above" session="boolean" result="python"
import narwhals as nw

import pandas as pd
import polars as pl
import pyarrow as pa

data = {"a": [1.4, None, 4.2]}
print("pandas output")
print(nw.from_native(pd.DataFrame(data)).with_columns(b=nw.col("a") * 2).to_native())
print("\nPolars output")
print(nw.from_native(pl.DataFrame(data)).with_columns(b=nw.col("a") * 2).to_native())
print("\nPyArrow output")
print(nw.from_native(pa.table(data)).with_columns(b=nw.col("a") * 2).to_native())
```

What do we do, however, when the result column is boolean? For
example, `nw.col('a') > 0`?
Unfortunately, this is backend-dependent:

- for all backends except pandas, null values are preserved
- for pandas, this depends on the dtype backend:
    - for PyArrow dtypes and pandas nullable dtypes, null
      values are preserved
    - for the classic NumPy dtypes, null values are typically
      filled in with `False`.
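
One way to see where this difference comes from, as a plain-Python sketch rather than the actual Narwhals or pandas implementation: classic NumPy float columns store missing values as NaN, and an ordered comparison against NaN evaluates to `False`. The helper names `gt_null_preserving` and `gt_nan_backed` are hypothetical and exist only for this illustration.

```python
import math

values = [1.4, None, 4.2]


def gt_null_preserving(xs, threshold):
    """Comparison that propagates nulls: None in, None out."""
    return [None if x is None else x > threshold for x in xs]


def gt_nan_backed(xs, threshold):
    """Comparison on a NaN-backed column (assumed model of classic NumPy dtypes).

    Missing values are first stored as NaN; `nan > threshold` is then False.
    """
    as_floats = [math.nan if x is None else x for x in xs]
    return [x > threshold for x in as_floats]


preserved = gt_null_preserving(values, 2)
filled = gt_nan_backed(values, 2)
print(preserved)  # [False, None, True]
print(filled)  # [False, False, True]
```

In the second result the missing entry can no longer be told apart from a genuine `False`, which is the information loss the list above describes.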

pandas is generally moving towards nullable dtypes, and they
[may become the default in the future](https://github.com/pandas-dev/pandas/pull/58988),
so we hope that the lack of null support in the classic NumPy dtypes is
a temporary legacy pandas issue that will eventually go away.

```python exec="1" source="above" session="boolean" result="python"
print("pandas output")
print(nw.from_native(pd.DataFrame(data)).with_columns(b=nw.col("a") > 2).to_native())
print("\npandas (nullable dtypes) output")
print(
    nw.from_native(pd.DataFrame(data, dtype="Float64"))
    .with_columns(b=nw.col("a") > 2)
    .to_native()
)
print("\npandas (pyarrow dtypes) output")
print(
    nw.from_native(pd.DataFrame(data, dtype="Float64[pyarrow]"))
    .with_columns(b=nw.col("a") > 2)
    .to_native()
)
print("\nPolars output")
print(nw.from_native(pl.DataFrame(data)).with_columns(b=nw.col("a") > 2).to_native())
print("\nPyArrow output")
print(nw.from_native(pa.table(data)).with_columns(b=nw.col("a") > 2).to_native())
```
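
If null-preserving comparisons matter for a pandas input, one possible approach (a sketch, not something this page prescribes) is to opt in to nullable dtypes before handing the frame to Narwhals. The example below uses only pandas, assuming its `DataFrame.convert_dtypes` method with default arguments; the variable names are illustrative.

```python
import pandas as pd

data = {"a": [1.4, None, 4.2]}

# Classic NumPy-backed float64 column: the missing value is stored as NaN.
classic = pd.DataFrame(data)
classic_mask = classic["a"] > 2

# convert_dtypes() switches the column to a pandas nullable dtype (Float64 here).
nullable = classic.convert_dtypes()
nullable_mask = nullable["a"] > 2

print(classic_mask.tolist())  # [False, False, True]
print(nullable_mask.isna().tolist())  # [False, True, False]
```

The first mask has silently turned the missing value into `False`, while the second keeps it as a null in a `boolean` column.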
File renamed without changes.
File renamed without changes.
File renamed without changes.
7 changes: 4 additions & 3 deletions mkdocs.yml
@@ -12,9 +12,10 @@ nav:
     - basics/complete_example.md
     - basics/dataframe_conversion.md
   - Pandas-like concepts:
-    - other/pandas_index.md
-    - other/user_warning.md
-    - other/column_names.md
+    - pandas_like_concepts/pandas_index.md
+    - pandas_like_concepts/user_warning.md
+    - pandas_like_concepts/column_names.md
+    - pandas_like_concepts/boolean.md
   - overhead.md
   - backcompat.md
   - extending.md
2 changes: 1 addition & 1 deletion narwhals/_arrow/dataframe.py
@@ -68,7 +68,7 @@ def __narwhals_dataframe__(self) -> Self:
     def __narwhals_lazyframe__(self) -> Self:
         return self

-    def _from_native_frame(self, df: Any) -> Self:
+    def _from_native_frame(self, df: pa.Table) -> Self:
         return self.__class__(
             df, backend_version=self._backend_version, dtypes=self._dtypes
         )
4 changes: 3 additions & 1 deletion narwhals/_arrow/group_by.py
@@ -12,6 +12,8 @@
 from narwhals.utils import remove_prefix

 if TYPE_CHECKING:
+    import pyarrow as pa
+
     from narwhals._arrow.dataframe import ArrowDataFrame
     from narwhals._arrow.expr import ArrowExpr
     from narwhals._arrow.typing import IntoArrowExpr
@@ -115,7 +117,7 @@ def __iter__(self) -> Iterator[tuple[Any, ArrowDataFrame]]:


 def agg_arrow(
-    grouped: Any,
+    grouped: pa.TableGroupBy,
     exprs: list[ArrowExpr],
     keys: list[str],
     output_names: list[str],
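
The pattern in this file's hunks (importing a module under `if TYPE_CHECKING:` and naming it only in annotations) can be sketched with a standard-library module. In the sketch below, `fractions` stands in for `pyarrow` and `halve` is a hypothetical function. The annotations are quoted so they are never evaluated at runtime; the unquoted annotations in the diff presumably rely on `from __future__ import annotations` for the same effect, which is an assumption since that import is not shown here.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; this import never runs at runtime.
    import fractions  # stand-in for `import pyarrow as pa`


def halve(value: "fractions.Fraction | int") -> "fractions.Fraction":
    # Deferred runtime import, similar in spirit to the in-function
    # `import pyarrow as pa` lines elsewhere in this commit.
    import fractions

    return fractions.Fraction(value) / 2


print(halve(3))  # 3/2
```

This keeps the heavy dependency out of module import time while still giving type checkers precise types instead of `Any`.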
2 changes: 1 addition & 1 deletion narwhals/_arrow/series.py
@@ -44,7 +44,7 @@ def __init__(
         self._backend_version = backend_version
         self._dtypes = dtypes

-    def _from_native_series(self, series: Any) -> Self:
+    def _from_native_series(self, series: pa.ChunkedArray | pa.Array) -> Self:
         import pyarrow as pa  # ignore-banned-import()

         if isinstance(series, pa.Array):
6 changes: 4 additions & 2 deletions narwhals/_arrow/utils.py
@@ -15,7 +15,7 @@
     from narwhals.typing import DTypes


-def native_to_narwhals_dtype(dtype: Any, dtypes: DTypes) -> DType:
+def native_to_narwhals_dtype(dtype: pa.DataType, dtypes: DTypes) -> DType:
     import pyarrow as pa  # ignore-banned-import

     if pa.types.is_int64(dtype):
@@ -284,7 +284,9 @@ def floordiv_compat(left: Any, right: Any) -> Any:
     return result


-def cast_for_truediv(arrow_array: Any, pa_object: Any) -> tuple[Any, Any]:
+def cast_for_truediv(
+    arrow_array: pa.ChunkedArray | pa.Scalar, pa_object: pa.ChunkedArray | pa.Scalar
+) -> tuple[pa.ChunkedArray | pa.Scalar, pa.ChunkedArray | pa.Scalar]:
     # Lifted from:
     # https://github.com/pandas-dev/pandas/blob/262fcfbffcee5c3116e86a951d8b693f90411e68/pandas/core/arrays/arrow/array.py#L108-L122
     import pyarrow as pa  # ignore-banned-import
