Division can cause np.nan in nullable dtypes #1334

Open

larsyencken opened this issue Jul 12, 2023 · 3 comments

@larsyencken (Collaborator) commented Jul 12, 2023

Problem

For floating point numbers, Pandas has two ways of representing missing values: pd.NA (new-style) and np.nan (old-style).

np.nan can still occur in new-style dtypes after division (by zero, or by np.nan). When it does, it is not detected by df.isnull() or the other null-handling methods.
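
For example, something like the following reproduces the problem (a minimal sketch; exact output depends on the Pandas version, but this is the behaviour we see with the masked nullable dtypes):

import numpy as np
import pandas as pd

# Dividing nullable integers by zero yields a nullable Float64 result whose
# underlying values are np.nan / np.inf, but whose NA mask is not set.
s = pd.Series([0, 1], dtype="Int64") / pd.Series([0, 0], dtype="Int64")

print(s.dtype)     # Float64 (nullable)
print(s.isnull())  # all False, even though the first value displays as NaN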

Background

Data types in Pandas

Compared to other data frame libraries, Pandas uses an unusual strategy for handling missing data in numeric values: it uses the `NaN` floating point value to represent missingness, and converts integer types to floating point when missingness is required. This throws information away and can lose precision.
In [12]: pd.Series([1, 2, 3])
Out[12]:
0    1
1    2
2    3
dtype: int64

In [13]: pd.Series([1, 2, 3, None])
Out[13]:
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

To get around this, Pandas introduced nullable types such as Int64 and Float32; the uppercase first letter indicates new-style NA support.

In [15]: pd.Series([1, 2, 3, None], dtype='Int64')
Out[15]:
0       1
1       2
2       3
3    <NA>
dtype: Int64
Frame packing for efficiency

You might avoid all of the above by only using "standard" data types with Pandas. But the standard data types are quite wasteful: for example, a large array of small integers can be 8x smaller in memory and on disk if represented with one byte per integer (`int8`) instead of eight bytes (`int64`).
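
A quick illustration of that size difference (not from the original issue, just a sanity check):

import numpy as np
import pandas as pd

big = pd.Series(np.ones(1_000_000), dtype="int64")
small = big.astype("int8")

print(big.memory_usage(deep=True))    # ~8 MB
print(small.memory_usage(deep=True))  # ~1 MB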

For this reason, we wrote the library owid-repack to shrink our data frames to much smaller types, which helps them fit into memory and substantially reduces the size of the catalog. However, repacking introduces these more exotic types, which means that in practice we can run into the correctness issues above.

Proposed solution

We would like to modify the Table and Variable types so that, for nullable dtypes, they check for np.nan after division and replace any occurrences with pd.NA.
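
A sketch of what that cleanup could look like (illustrative only; the helper name is made up, and the real change would live in the Table/Variable arithmetic methods):

import numpy as np
import pandas as pd

def nan_to_na(s: pd.Series) -> pd.Series:
    """Replace stray np.nan values with pd.NA in a nullable-float Series."""
    if isinstance(s.dtype, (pd.Float32Dtype, pd.Float64Dtype)):
        # np.isnan is True both for genuine NaN values and for positions that
        # are already pd.NA (exported as nan), so masking them all with pd.NA
        # is safe.
        raw = s.to_numpy(dtype="float64", na_value=np.nan)
        return s.mask(np.isnan(raw), pd.NA)
    return s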

Upstream

@larsyencken (Collaborator, Author) commented:

As of Pandas 2.0, you can choose to use pyarrow-backed dtypes. Apache Arrow uses a more consistent missing-value scheme for floating point numbers, so these new dtypes do not suffer from the same issues:

In [7]: s = pd.Series([1], dtype='uint32[pyarrow]') / pd.Series([np.nan], dtype='float64')

In [8]: s
Out[8]:
0    <NA>
dtype: double[pyarrow]

In [9]: s.isnull()
Out[9]:
0    True
dtype: bool

In [10]: s.fillna(0)
Out[10]:
0   0.0
dtype: double[pyarrow]

In [11]: s.hasnans
Out[11]: True

Since Pandas is not likely to fix the bug soon, we could consider repacking to these data types instead. That would require a Pandas 2.0 upgrade, and no doubt some compatibility updates to the Table class.
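
For example, Pandas 2.0 can convert an existing frame to pyarrow-backed dtypes directly (a sketch; we would still need to wire something like this into owid-repack):

import pandas as pd

df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})

# Requires pandas >= 2.0 with pyarrow installed.
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")
print(df_arrow.dtypes)  # expected: a    int64[pyarrow]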

@Marigold (Collaborator) commented:

I ran into this issue some time ago and found that some of our steps had exactly this problem. I remember fixing those bugs and adding an assert somewhere to guard against this.

@larsyencken changed the title from "Nullable dtypes from Numpy have inconsistent behaviour" to "Division can cause np.nan in nullable dtypes" on Nov 11, 2024
@larsyencken (Collaborator, Author) commented:

@Marigold I've rewritten this issue to be about division specifically, which is where we seem to run into the problem most.
