docs: use intoDataFrame in docs
FBruzzesi committed Dec 27, 2024
1 parent a4fbcb4 commit 30cc15e
Showing 4 changed files with 21 additions and 12 deletions.
4 changes: 2 additions & 2 deletions docs/pandas_like_concepts/boolean.md
@@ -8,12 +8,12 @@ For example, if you do `nw.col('a')*2`, then:

```python exec="1" source="above" session="boolean"
import narwhals as nw
-from narwhals.typing import FrameT
+from narwhals.typing import IntoFrameT

data = {"a": [1.4, None, 4.2]}


-def multiplication(df: FrameT) -> FrameT:
+def multiplication(df: IntoFrameT) -> IntoFrameT:
return nw.from_native(df).with_columns((nw.col("a") * 2).alias("a*2")).to_native()
```

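The `multiplication` helper above relies on missing values propagating through arithmetic in the native backend. As a rough pandas-only sketch of what happens to the `None` entry (assuming pandas' default NaN-based float representation; the `df` and `a*2` names mirror the snippet):

```python
import pandas as pd

# Same data as in the snippet; pandas stores the None as NaN.
df = pd.DataFrame({"a": [1.4, None, 4.2]})

# Multiplication propagates the missing value: NaN * 2 is still NaN.
df["a*2"] = df["a"] * 2
print(df)
```

The non-missing rows are doubled, while the `None` row stays missing.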
11 changes: 5 additions & 6 deletions docs/pandas_like_concepts/column_names.md
@@ -2,12 +2,11 @@

Polars and PyArrow only allow for string column names. What about pandas?

-```python
->>> import pandas as pd
->>> pd.concat([pd.Series([1, 2], name=0), pd.Series([1, 3], name=0)], axis=1)
-   0  0
-0  1  1
-1  2  3
+```python exec="true" source="above" result="python" session="col_names"
+import pandas as pd
+
+df = pd.concat([pd.Series([1, 2], name=0), pd.Series([1, 3], name=0)], axis=1)
+print(df)
```

Oh...not only does it let us create a dataframe with a column named `0` - it lets us
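A pandas-only sketch of why duplicated, non-string column names are a problem: with two columns both named `0`, even plain column selection becomes ambiguous (the `selected` name is ours, for illustration):

```python
import pandas as pd

# Two columns that share the integer name 0, as in the snippet above.
df = pd.concat([pd.Series([1, 2], name=0), pd.Series([1, 3], name=0)], axis=1)

# Selecting "column 0" now returns a two-column DataFrame, not a Series.
selected = df[0]
print(type(selected).__name__, selected.shape)
```

This ambiguity is one reason backends like Polars and PyArrow restrict column names to unique strings.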
6 changes: 5 additions & 1 deletion docs/pandas_like_concepts/pandas_index.md
@@ -20,9 +20,10 @@ Let's learn about what Narwhals promises.

```python exec="1" source="above" session="ex1"
import narwhals as nw
+from narwhals.typing import IntoFrameT


-def my_func(df):
+def my_func(df: IntoFrameT) -> IntoFrameT:
df = nw.from_native(df)
df = df.with_columns(a_plus_one=nw.col("a") + 1)
return nw.to_native(df)
@@ -51,13 +52,16 @@ df_pd = pd.DataFrame({"a": [2, 1, 3], "b": [4, 5, 6]})
s_pd = df_pd["a"].sort_values()
df_pd["a_sorted"] = s_pd
```

Reading the code, you might expect that `'a_sorted'` will contain the
values `[1, 2, 3]`.

**However**, here's what actually happens:

```python exec="1" source="material-block" session="ex2" result="python"
print(df_pd)
```

In other words, pandas' index alignment undid the `sort_values` operation!

Narwhals, on the other hand, preserves the index of the left-hand-side argument.
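A pandas-only illustration of the alignment behaviour described above (not something the doc's snippet does itself; the `aligned` and `positional` column names are ours): assigning the sorted Series aligns on index labels and undoes the sort, while stripping the index first keeps the sorted order.

```python
import pandas as pd

df_pd = pd.DataFrame({"a": [2, 1, 3], "b": [4, 5, 6]})
s_pd = df_pd["a"].sort_values()

# Index alignment matches on labels, putting each value back in its row.
df_pd["aligned"] = s_pd
# Dropping the index assigns positionally, preserving the sorted order.
df_pd["positional"] = s_pd.to_numpy()
print(df_pd)
```

Narwhals' promise to preserve the left-hand side's index gives the second, positional behaviour.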
12 changes: 9 additions & 3 deletions docs/pandas_like_concepts/user_warning.md
@@ -14,14 +14,15 @@ The pandas API most likely cannot efficiently handle the complexity of the aggre
```python exec="true" source="above" result="python" session="df_ex1"
import narwhals as nw
import pandas as pd
+from narwhals.typing import IntoFrameT

data = {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [10, 20, 30, 40, 50]}

df_pd = pd.DataFrame(data)


@nw.narwhalify
-def approach_1(df):
+def approach_1(df: IntoFrameT) -> IntoFrameT:

# Pay attention to this next line
df = df.group_by("a").agg(d=(nw.col("b") + nw.col("c")).sum())
@@ -43,7 +44,7 @@ The pandas API most likely cannot efficiently handle the complexity of the aggre


@nw.narwhalify
-def approach_2(df):
+def approach_2(df: IntoFrameT) -> IntoFrameT:

# Pay attention to this next line
df = df.with_columns(d=nw.col("b") + nw.col("c")).group_by("a").agg(nw.sum("d"))
@@ -54,40 +55,45 @@ The pandas API most likely cannot efficiently handle the complexity of the aggre
print(approach_2(df_pd))
```


Both approaches shown above return exactly the same result, but Approach 1 is inefficient and triggers the warning message
we showed at the top.

What makes the first approach inefficient and the second approach efficient? It comes down to what the
pandas API lets us express.

## Approach 1

```python
# From line 11

return df.group_by("a").agg((nw.col("b") + nw.col("c")).sum().alias("d"))
```

To translate this to pandas, we would do:

```python
df.groupby("a").apply(
lambda df: pd.Series([(df["b"] + df["c"]).sum()], index=["d"]), include_groups=False
)
```

Any time you use `apply` in pandas, that's a performance footgun - best to avoid it and use vectorised operations instead.
Let's take a look at how "approach 2" gets translated to pandas to see the difference.

## Approach 2

```python
# Line 11 in Approach 2

return df.with_columns(d=nw.col("b") + nw.col("c")).group_by("a").agg({"d": "sum"})
```

This gets roughly translated to:

```python
df.assign(d=lambda df: df["b"] + df["c"]).groupby("a").agg({"d": "sum"})
```

Because we're using pandas' own API, as opposed to `apply` with a custom `lambda` function, this is going to be much more efficient.
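To see the two pandas translations side by side, here is a small sketch with the same data as above (the `slow` and `fast` names are ours; both compute the same per-group sums):

```python
import pandas as pd

data = {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Approach 1: a Python lambda runs once per group.
slow = df.groupby("a")[["b", "c"]].apply(lambda g: (g["b"] + g["c"]).sum())

# Approach 2: one vectorised addition, then a built-in aggregation.
fast = df.assign(d=df["b"] + df["c"]).groupby("a")["d"].sum()

print(list(slow) == list(fast))
```

Both yield identical values; only the second stays entirely inside pandas' optimised aggregation path.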

## Tips for Avoiding the `UserWarning`
