Division can cause np.nan in nullable dtypes #1334

Open

larsyencken opened this issue Jul 12, 2023 · 3 comments

@larsyencken (Collaborator) commented Jul 12, 2023

Problem

For floating point numbers, Pandas has two ways of representing missing values: pd.NA (new-style) and np.nan (old-style).

np.nan can still occur in new-style dtypes after division (by zero, or by np.nan). When it does, it is not detected by df.isnull() or the other null-handling methods.
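
For example, something like the following reproduces the problem (a minimal sketch; exact output depends on the Pandas version, but this is the behaviour we see with the masked nullable dtypes):

import numpy as np
import pandas as pd

# Dividing nullable integers by zero yields a nullable Float64 result whose
# underlying values are np.nan / np.inf, but whose NA mask is not set.
s = pd.Series([0, 1], dtype="Int64") / pd.Series([0, 0], dtype="Int64")

print(s.dtype)     # Float64 (nullable)
print(s.isnull())  # all False, even though the first value displays as NaN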

Background

Data types in Pandas

Compared to other data frame libraries, Pandas uses an unusual strategy for handling missing data in numeric values: it uses the `NaN` floating point value to represent missingness, and converts integer types to floating point when missingness is required. This throws information away and can lose precision.
In [12]: pd.Series([1, 2, 3])
Out[12]:
0    1
1    2
2    3
dtype: int64

In [13]: pd.Series([1, 2, 3, None])
Out[13]:
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

To get around this, Pandas introduced nullable types such as Int64 and Float32; the uppercase first letter indicates new-style NA support.

In [15]: pd.Series([1, 2, 3, None], dtype='Int64')
Out[15]:
0       1
1       2
2       3
3    <NA>
dtype: Int64
Frame packing for efficiency

You might avoid all of the above by only using "standard" data types with Pandas. But the standard data types are quite wasteful: for example, a large array of small integers can be 8x smaller in memory and on disk if represented with one byte per integer (`int8`) instead of eight bytes (`int64`).
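
A quick illustration of that size difference (not from the original issue, just a sanity check):

import numpy as np
import pandas as pd

big = pd.Series(np.ones(1_000_000), dtype="int64")
small = big.astype("int8")

print(big.memory_usage(deep=True))    # ~8 MB
print(small.memory_usage(deep=True))  # ~1 MB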

For this reason, we wrote the library owid-repack to shrink our data frames to much smaller types, which helps them fit into memory and substantially reduces the size of the catalog. However, repacking introduces these more exotic types, which means that in practice we can run into the correctness issues above.

Proposed solution

We would like to modify the Table and Variable types so that, for nullable dtypes, they check for np.nan after division and replace any occurrences with pd.NA.
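
A sketch of what that cleanup could look like (illustrative only; the helper name is made up, and the real change would live in the Table/Variable arithmetic methods):

import numpy as np
import pandas as pd

def nan_to_na(s: pd.Series) -> pd.Series:
    """Replace stray np.nan values with pd.NA in a nullable-float Series."""
    if isinstance(s.dtype, (pd.Float32Dtype, pd.Float64Dtype)):
        # np.isnan is True both for genuine NaN values and for positions that
        # are already pd.NA (exported as nan), so masking them all with pd.NA
        # is safe.
        raw = s.to_numpy(dtype="float64", na_value=np.nan)
        return s.mask(np.isnan(raw), pd.NA)
    return s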

Upstream

@larsyencken (Collaborator, Author) commented:

As of Pandas 2.0, you can choose to use pyarrow-backed dtypes. Apache Arrow uses a more consistent missing-value scheme for floating point numbers, so these new dtypes do not suffer from the same issues:

In [7]: s = pd.Series([1], dtype='uint32[pyarrow]') / pd.Series([np.nan], dtype='float64')

In [8]: s
Out[8]:
0    <NA>
dtype: double[pyarrow]

In [9]: s.isnull()
Out[9]:
0    True
dtype: bool

In [10]: s.fillna(0)
Out[10]:
0   0.0
dtype: double[pyarrow]

In [11]: s.hasnans
Out[11]: True

Since Pandas is not likely to fix the bug soon, we could consider repacking to these data types instead. That would require a Pandas 2.0 upgrade, and no doubt some compatibility updates to the Table class.
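
For example, Pandas 2.0 can convert an existing frame to pyarrow-backed dtypes directly (a sketch; we would still need to wire something like this into owid-repack):

import pandas as pd

df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})

# Requires pandas >= 2.0 with pyarrow installed.
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")
print(df_arrow.dtypes)  # expected: a    int64[pyarrow]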

@Marigold (Collaborator) commented:

I ran into this issue some time ago and found that some of our steps had exactly this problem. I remember fixing those bugs and adding an assert somewhere to guard against this.

@larsyencken changed the title from "Nullable dtypes from Numpy have inconsistent behaviour" to "Division can cause np.nan in nullable dtypes" on Nov 11, 2024
@larsyencken (Collaborator, Author) commented:

@Marigold I've rewritten this issue to be about division specifically, which is where we seem to run into the problem most.
