Problem
For floating point numbers, Pandas has two ways of representing missing values: `pd.NA` (new-style) and `np.nan` (old-style).
`np.nan` can still occur in new-style types after division (by zero or by `np.nan`). However, `df.isnull()` and other null-handling methods do not detect it when it occurs in new-style types.
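A minimal sketch of the behaviour described above, assuming a nullable `Float64` column. The variable names are illustrative only. It compares what `isna()` reports with what the values look like once converted to plain `float64`.

```python
import numpy as np
import pandas as pd

# A nullable (new-style) float column with one genuinely missing value.
s = pd.Series([0.0, 1.0, None], dtype="Float64")

# 0.0 / 0.0 is NaN under IEEE 754, so division can put np.nan into the result.
result = s / 0.0

# What the null-handling methods report.
flagged = result.isna()

# Positions that are NaN or NA once converted to plain float64.
nan_or_na = np.isnan(result.to_numpy(dtype="float64", na_value=np.nan))

# NaNs that the nullable column carries without flagging them as missing.
hidden = nan_or_na & ~flagged.to_numpy()

print(result)
print("flagged by isna():", flagged.tolist())
print("hidden NaNs:", hidden.tolist())
```

On the Pandas versions this issue was written against, the first row is expected to hold a `NaN` that `isna()` does not flag. Newer releases may behave differently, which is why the sketch computes the difference instead of hard-coding it.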
Background
Data types in Pandas
Compared to other data frame libraries, Pandas uses an unusual strategy for handling missing data in numeric values: it uses the `NaN` floating point value to represent missingness, and converts integer types to floating point types when missingness is required. But this throws information away and can lose precision.
To get around this, Pandas introduced new nullable types such as `Int64` and `Float32`. The uppercase first letter means the type has new-style NA support.
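The conversion and the precision loss can be seen in a small example. This is a sketch: the value `2**53 + 1` is chosen only because `float64` cannot represent it exactly, and pandas' nullable `Int64` dtype is shown alongside for comparison.

```python
import pandas as pd

big = 2**53 + 1  # smallest positive integer that float64 cannot represent exactly

# Old-style: a missing value forces the integer column to float64.
old = pd.Series([big, None])

# New-style: the nullable Int64 dtype keeps integers and tracks missingness separately.
new = pd.Series([big, None], dtype="Int64")

print(old.dtype, int(old.iloc[0]) == big)
print(new.dtype, int(new.iloc[0]) == big)
```

The first line should report `float64` with a changed value, the second `Int64` with the value intact.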
Frame packing for efficiency
You might avoid all the above by only using "standard" data types with Pandas. But the standard data types are quite wasteful. For example, a large array of small integers can be 8x smaller in memory and on disk if represented with one byte per integer (`int8`) instead of eight bytes (`int64`).
For this reason, we wrote the library owid-repack to shrink our data frames to much smaller types, which helps them fit into memory and reduces the size of the catalog a lot. However, the repacking introduces these more exotic types, which means that we can in practice run into the correctness issues above.
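The size difference is easy to check. The sketch below uses plain Pandas calls (`astype` and `pd.to_numeric(..., downcast=...)`) as a stand-in. It does not use or reproduce the owid-repack API, and the data is made up for illustration.

```python
import numpy as np
import pandas as pd

# One million small integers (0..99), stored with the default 8 bytes each.
wide = pd.Series(np.arange(1_000_000) % 100, dtype="int64")

# The same values stored with one byte each.
narrow = wide.astype("int8")

# Bytes used by the underlying data of each column.
ratio = wide.nbytes / narrow.nbytes

# Pandas can also pick the smallest signed integer type that fits the data.
packed = pd.to_numeric(wide, downcast="integer")

print(wide.nbytes, narrow.nbytes, ratio)
print(packed.dtype)
```

Once a column contains missing values, shrinking it this way without falling back to floats means using a nullable type such as `Int8`, which is how the more exotic types end up in the data.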
Proposed solution
We would like to modify the `Table` and `Variable` types to check for `np.nan` after division, for nullable dtypes, and replace any that we find with `pd.NA`.
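As a rough sketch of what such a check could look like on a single column: `nan_to_na` is a hypothetical helper name, not part of Pandas or of the `Table` and `Variable` classes, and it only handles nullable float dtypes, since those are the dtypes that division of nullable columns produces.

```python
import numpy as np
import pandas as pd


def nan_to_na(series: pd.Series) -> pd.Series:
    """Hypothetical helper: turn np.nan inside a nullable float column into pd.NA."""
    # Only nullable float dtypes can carry a NaN next to the NA mask.
    if not isinstance(series.dtype, (pd.Float32Dtype, pd.Float64Dtype)):
        return series
    # After conversion to float64, both NaN and NA show up as NaN.
    raw = series.to_numpy(dtype="float64", na_value=np.nan)
    # Mark every such position as missing; real NAs stay NA.
    return series.mask(np.isnan(raw), pd.NA)


numerator = pd.Series([0.0, 1.0, None], dtype="Float64")
denominator = pd.Series([0.0, 2.0, 1.0], dtype="Float64")

# 0.0 / 0.0 may leave a NaN in the first row; the third row is a real NA.
ratio = numerator / denominator
cleaned = nan_to_na(ratio)

print(cleaned)
print(cleaned.isna().tolist())
```

Calling a helper like this at the end of each division operator would keep `isnull()`, `fillna()` and similar methods consistent, at the cost of one extra pass over the result.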
As of Pandas 2.0, you can now choose to use pyarrow dtypes. Apache Arrow uses a more consistent missing value scheme for floating point numbers, so these new dtypes do not suffer from the same issues as before:
```
In [7]: s = pd.Series([1], dtype='uint32[pyarrow]') / pd.Series([np.nan], dtype='float64')

In [8]: s
Out[8]:
0    <NA>
dtype: double[pyarrow]

In [9]: s.isnull()
Out[9]:
0    True
dtype: bool

In [10]: s.fillna(0)
Out[10]:
0    0.0
dtype: double[pyarrow]

In [11]: s.hasnans
Out[11]: True
```
Since Pandas is not likely to fix the bug soon, we could consider repacking to these data types instead. That would require a Pandas 2.0 upgrade, and no doubt some compatibility updates to the Table class.
I ran into this issue some time ago and found that some of our steps had exactly this problem. I remember fixing those bugs and adding an assert somewhere to guard against this.
Upstream
`hasnans` not accounting for `np.nan` in `FloatingArray`: pandas-dev/pandas#49818