✨ Convert dtypes to safe types Float64, Int64 and string #3474

Marigold · 2024-10-31T08:20:57Z

Implements #3277

Adds a new parameter safe_types to ds.read_table, e.g. ds.read_table("my_table", safe_types=True), which converts all columns to the "safe" types Float64, Int64 or string[python]. It is True by default.

(We've also considered the name unpack in the past instead of safe_types. I don't have a preference.)
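For readers who want a concrete picture of the conversion, here is a minimal sketch in plain pandas. The helper name to_safe_types_sketch and its rules are assumptions made for illustration, not the code from this PR: it maps integer columns to Int64, float columns to Float64, and string-like or categorical columns to the string dtype, and leaves everything else untouched.

```python
import numpy as np
import pandas as pd


def to_safe_types_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cast columns to Int64, Float64 or string."""
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if pd.api.types.is_bool_dtype(dtype):
            # Booleans are left alone in this sketch.
            continue
        if pd.api.types.is_integer_dtype(dtype):
            out[col] = out[col].astype("Int64")
        elif pd.api.types.is_float_dtype(dtype):
            # np.nan in a numpy float column becomes pd.NA here.
            out[col] = out[col].astype("Float64")
        elif pd.api.types.is_string_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype):
            out[col] = out[col].astype("string")
    return out


df = pd.DataFrame(
    {
        "year": [2000, 2001],
        "value": [1.5, np.nan],
        "country": ["France", None],
    }
)
safe = to_safe_types_sketch(df)
print(safe.dtypes)
```

After the conversion, every missing value in the three columns is pd.NA, regardless of whether it started as np.nan or None.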

int64 vs Int64

Should we use numpy's int64 type or the nullable pandas type Int64 (or even pyarrow types)? This PR goes with nullable types, but they carry some hidden risks:

Mixing np.nan and pd.NA creates a nightmare of compatibility issues. We should always use pd.NA.
Sometimes np.nan sneaks in anyway. For instance, dividing 0/0 returns np.nan even when we work with nullable types.

import numpy as np
import pandas as pd

a = pd.Series([0]).astype("Int64")
b = pd.Series([0]).astype("Int64")
# 0/0: the result has dtype Float64 and its first item is np.nan, not pd.NA!
c = a / b
# this is actually False: `pd.isnull` doesn't detect np.nan inside a nullable dtype
pd.isnull(c)
# this doesn't work either, since fillna finds no missing value to fill
c.fillna(pd.NA)
# only this fixes it
c.mask(np.isnan(c), pd.NA)
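For context on the other side of the trade-off, here is a short illustration of why numpy's int64 is awkward once missing values show up. It uses only standard pandas behaviour; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# numpy-backed integers cannot hold a missing value:
# pandas silently upcasts the whole column to float64.
numpy_backed = pd.Series([1, 2, None])
print(numpy_backed.dtype)  # float64

# The nullable extension type keeps the integers and marks the gap with pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)  # Int64
print(nullable[2] is pd.NA)  # True
```

So the choice is between a dtype that changes under your feet (int64 becoming float64) and a stable dtype where stray np.nan values need watching.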

TODO before merging:

Increment ETL_EPOCH and make sure old datasets don't change
Undo ETL_EPOCH increment

owidbot · 2024-10-31T08:25:08Z

Quick links (staging server):

Site Dev	Site Preview	Admin	Wizard	Docs

Login: ssh owid@staging-site-safe-types

chart-diff: ✅

No charts for review.

data-diff: ❌ Found differences

= Dataset garden/antibiotics/2024-11-15/testing_coverage
  = Table testing_coverage
⚠ Error: Index must be unique.
= Dataset garden/artificial_intelligence/2023-06-14/ai_deepfakes
  = Table ai_deepfakes
⚠ Error: Index must be unique.
⚠ Error: Index must be unique.
= Dataset garden/artificial_intelligence/2024-11-03/epoch_aggregates_domain
  = Table epoch_aggregates_domain
    ~ Column cumulative_count (changed metadata)
-       -   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                                                                                                                            ^
+       +   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                                                                                                                            ^^
    ~ Column yearly_count (changed metadata)
-       -   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                                                                                                                            ^
+       +   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                                                                                                                            ^^
= Dataset garden/artificial_intelligence/2024-11-03/epoch_compute_intensive_countries
  = Table epoch_compute_intensive_countries
    ~ Column cumulative_count (changed metadata)
-       -   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                                          ^
+       +   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                                          ^^
    ~ Column yearly_count (changed metadata)
-       -   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                                          ^
+       +   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                                          ^^
= Dataset garden/artificial_intelligence/2024-11-03/epoch_compute_intensive_domain
  = Table epoch_compute_intensive_domain
    ~ Column cumulative_count (changed metadata)
-       -   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                               ^
+       +   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                               ^^
    ~ Column yearly_count (changed metadata)
-       -   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                               ^
+       +   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                               ^^
= Dataset garden/faostat/2024-03-14/faostat_fa
  = Table faostat_fa
  = Table faostat_fa_flat
2024-11-19 18:32:21 [warning  ] DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
2024-11-19 18:33:23 [warning  ] DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
2024-11-19 18:33:41 [warning  ] DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
2024-11-19 18:37:53 [warning  ] DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
= Dataset garden/lis/2024-06-13/luxembourg_income_study
  = Table lis_percentiles
  = Table luxembourg_income_study_adults
  = Table lis_percentiles_adults
  = Table luxembourg_income_study
⚠ Error: Index must be unique.
= Dataset garden/ophi/2024-10-28/multidimensional_poverty_index
  = Table multidimensional_poverty_index
    ~ Column censored_headcount_ratio (changed metadata)
-       - title: Share of the population in multidimensional poverty deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population in multidimensional poverty deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
-       -   name: Share of the population in multidimensional poverty deprived in the indicator <<indicator>>
        ?                 ----
+       +   name: Share of population in multidimensional poverty deprived in the indicator <<indicator>>
-       -   title_public: Share of the population in multidimensional poverty deprived in the indicator <<indicator>>
        ?                         ----
+       +   title_public: Share of population in multidimensional poverty deprived in the indicator <<indicator>>
    ~ Column headcount_ratio (changed metadata)
-       - title: Share of the population in multidimensional poverty (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population living in multidimensional poverty (<<area>>) - <<flavor>>
        ?                            +++  ++++
-       -   name: Share of the population in multidimensional poverty
        ?                 ----
+       +   name: Share of population living in multidimensional poverty
        ?                             +++  ++++
-       -   title_public: Share of the population in multidimensional poverty
        ?                         ----
+       +   title_public: Share of population living in multidimensional poverty
        ?                                     +++  ++++
    ~ Column severe (changed metadata)
-       - title: Share of the population in severe multidimensional poverty (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population living in severe multidimensional poverty (<<area>>) - <<flavor>>
        ?                            +++  ++++
-       -   name: Share of the population in severe multidimensional poverty
        ?                 ----
+       +   name: Share of population living in severe multidimensional poverty
        ?                             +++  ++++
-       -   title_public: Share of the population in severe multidimensional poverty
        ?                         ----
+       +   title_public: Share of population living in severe multidimensional poverty
        ?                                     +++  ++++
    ~ Column uncensored_headcount_ratio (changed metadata)
-       - title: Share of the population deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
-       -   name: Share of the population deprived in the indicator <<indicator>>
        ?                 ----
+       +   name: Share of population deprived in the indicator <<indicator>>
-       -   title_public: Share of the population deprived in the indicator <<indicator>>
        ?                         ----
+       +   title_public: Share of population deprived in the indicator <<indicator>>
    ~ Column vulnerable (changed metadata)
-       - title: Share of the population vulnerable to multidimensional poverty (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population vulnerable to multidimensional poverty (<<area>>) - <<flavor>>
-       -   name: Share of the population vulnerable to multidimensional poverty
        ?                 ----
+       +   name: Share of population vulnerable to multidimensional poverty
-       -   title_public: Share of the population vulnerable to multidimensional poverty
        ?                         ----
+       +   title_public: Share of population vulnerable to multidimensional poverty
= Dataset garden/regions/2023-01-01/regions
  = Table regions
    ~ Column aliases (changed data)
        ~ Changed values: 1 / 334 (0.30%)
          code                                                                                    aliases -                                                                                                                   aliases +
           CIV ["C\u00c3\u00b4te D'Ivoire", "C\u00f4te d'Ivoire", "C\u00f4te d\u2019Ivoire", "Ivory Coast"] ["C\u00c3\u00b4te D'Ivoire", "C\u00f4te d'Ivoire", "C\u00f4te d\u2019Ivoire", "Ivory Coast", "C<U+00F4>te d<U+2019>Ivoire"]
= Dataset garden/un/2022-07-11/un_wpp
  = Table fertility
  = Table un_wpp
  = Table migration
  = Table population
  = Table mortality
  = Table population_granular
    ~ Column value (changed data)
        ~ Changed values: 1391 / 39815008 (0.00%)
                  location  year    metric  sex age variant  value -  value +
                   Tokelau  2047 sex_ratio none  91    high      inf     <NA>
                   Tokelau  2046 sex_ratio none  91     low      inf     <NA>
            Western Sahara  1967 sex_ratio none  95  medium      inf     <NA>
          Falkland Islands  1957 sex_ratio none  89  medium      inf     <NA>
                   Tokelau  2033 sex_ratio none  78  medium      inf     <NA>
  = Table demographic
= Dataset garden/worldbank_wdi/2022-05-26/wdi
  = Table wdi
    ~ Column omm_goods_exp_share_gdp (changed data)
        ~ Changed values: 1 / 14400 (0.01%)
          country  year  omm_goods_exp_share_gdp -  omm_goods_exp_share_gdp +
           Guyana  1977                  57.639999                  57.650002
    ~ Column omm_merch_exp_share_gdp (changed data)
        ~ Changed values: 1 / 14400 (0.01%)
           country  year  omm_merch_exp_share_gdp -  omm_merch_exp_share_gdp +
          Kiribati  1995                      12.42                      12.43
    ~ Column omm_net_savings_percap (changed data)
        ~ Changed values: 5 / 14400 (0.03%)
            country  year  omm_net_savings_percap -  omm_net_savings_percap +
            Albania  2008                311.750000                 311.76001
              Congo  2007               -328.369995               -328.380005
          Indonesia  2019                592.250000                 592.23999
             Israel  1970                243.529999                243.539993
             Panama  1995                694.270020                694.280029
= Dataset garden/worldbank_wdi/2024-05-20/wdi
  = Table wdi
    ~ Column omm_goods_exp_share_gdp (changed data)
        ~ Changed values: 4 / 14570 (0.03%)
              country  year  omm_goods_exp_share_gdp -  omm_goods_exp_share_gdp +
             Eswatini  2019                  44.360001                  44.349998
               Guyana  1977                  57.639999                  57.650002
            Singapore  2006                 188.789993                 188.800003
          Switzerland  2007                  42.720001                      42.73


Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Edited: 2024-11-19 19:00:06 UTC
Execution time: 1932.62 seconds

pabloarosado

Thanks for doing this! Changing types after repacking is a common headache.
I'm a bit unsure about the preferred dtypes. Maybe we should discuss them on the data architecture call (I agree with your choices, but I'm not sure whether they have negative side effects).

lib/repack/tests/test_repack.py

Marigold · 2024-11-14T09:16:04Z

I've updated it to use nullable types & string[pyarrow]. The first commit contains changes to steps that are not worth checking (datadiff confirmed they match), so please only look at changes from the second commit.

Changes

The repack library now converts everything to nullable types and categoricals, rather than a mix of nullable types, numpy types and categoricals.
New function to_safe_types that converts columns to the Int64, Float64 and string[pyarrow] types.
New method ds.read(...) (it used to be called read_table) that reads a table and, by default, resets the index and converts to safe types.
Reading data from a snapshot with snap.read(...) also uses safe types by default.
Updated pandas to 2.2.3 (fixes problems with pyarrow).
Division 0/0 returns pd.NA when working with Tables. This doesn't cover every edge case, but if one slips through, you'd get an error when saving the dataset.
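As a rough illustration of the last point, here is one way a 0/0 division could be made to return pd.NA on plain pandas Series. The function divide_with_na is a hypothetical name for this sketch, not how Table implements it in this PR.

```python
import numpy as np
import pandas as pd


def divide_with_na(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Hypothetical helper: divide two nullable Series, turning NaN results into pd.NA."""
    result = numerator / denominator
    # Convert to a plain float array (pd.NA becomes NaN), then flag every NaN position.
    is_nan = np.isnan(result.to_numpy(dtype="float64", na_value=np.nan))
    return result.mask(is_nan, pd.NA)


a = pd.Series([0, 1, 4]).astype("Int64")
b = pd.Series([0, 2, 2]).astype("Int64")
c = divide_with_na(a, b)
print(c.isna().tolist())  # [True, False, False]
```

The mask is computed on a numpy array so that it is an ordinary boolean array with no missing entries of its own.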

Marigold · 2024-11-18T08:17:00Z

@pabloarosado This is finally ready for review. Check the comment above for a summary of changes.

I wanted to go one step further and enable pd.options.future.infer_string = True, which makes the new string[pyarrow] the default. I ran into some issues, though, so I'm leaving it for either a future PR or for the actual upgrade to pandas 3.

pabloarosado

Huge, thanks for this massive refactor! I have quickly scanned through all files and it all looks good. And you already checked for data changes. So I think this is ready to go!

github-actions bot assigned Marigold Oct 31, 2024

Marigold force-pushed the safe-types branch from dc9da47 to e557aec Compare October 31, 2024 08:48

Marigold marked this pull request as ready for review October 31, 2024 09:01

Marigold requested a review from pabloarosado October 31, 2024 12:20

Marigold force-pushed the safe-types branch from 56fc426 to ae0aa63 Compare November 1, 2024 09:25

pabloarosado approved these changes Nov 4, 2024

Marigold force-pushed the safe-types branch 3 times, most recently from e57a80b to 234c151 Compare November 12, 2024 06:03

Marigold mentioned this pull request Nov 12, 2024

🎉 Switch to pyarrow dtypes #3495

Closed

Marigold force-pushed the safe-types branch 5 times, most recently from 3ca1ede to c5ece93 Compare November 14, 2024 09:07

Marigold force-pushed the safe-types branch from 7eb73fc to 4118706 Compare November 14, 2024 12:51

Marigold closed this Nov 18, 2024

Marigold force-pushed the safe-types branch from 700cb1e to b6a4c4a Compare November 18, 2024 08:01

Marigold reopened this Nov 18, 2024

Marigold requested a review from pabloarosado November 18, 2024 08:17

pabloarosado approved these changes Nov 19, 2024

🔨 Add safe_types to steps

dcc685c

Marigold force-pushed the safe-types branch from 12b9b82 to c1bfa71 Compare November 19, 2024 18:24

✨ Convert dtypes to safe types Float64, Int64 and string

c4b9b7b

Marigold force-pushed the safe-types branch from c1bfa71 to c4b9b7b Compare November 19, 2024 21:26

Marigold merged commit aa0ffcb into master Nov 19, 2024
5 of 7 checks passed

Marigold deleted the safe-types branch November 19, 2024 21:27

lucasrodes mentioned this pull request Nov 20, 2024

📜 types #3572

Closed
