✨ Convert dtypes to safe types Float64, Int64 and string #3474

Merged
merged 2 commits into from
Nov 19, 2024

@Marigold Marigold commented Oct 31, 2024

Implements #3277

Adds a new parameter, ds.read_table("my_table", safe_types=True), that converts all columns to the "safe" types Float64, Int64 or string[python]. It defaults to True.

(We've also considered the name unpack in the past instead of safe_types. I don't have a preference.)

int64 vs Int64

Should we use numpy's int64 type or nullable pandas type Int64 (or even pyarrow types)? This PR goes with nullable types, but there are some hidden risks to it:

  • Mixing np.nan and pd.NA creates a nightmare of compatibility issues, so we should always use pd.NA.
  • Sometimes np.nan sneaks in anyway. For instance, the division 0/0 returns np.nan even when we work with nullable types:
import numpy as np
import pandas as pd

a = pd.Series([0]).astype("Int64")
b = pd.Series([0]).astype("Int64")
# 0/0: the result has dtype Float64 and its first item is np.nan, not pd.NA!
c = a / b
# this is False for the first item: `pd.isnull` doesn't detect np.nan inside a nullable dtype
pd.isnull(c)
# this doesn't work either (the np.nan stays in place)
c.fillna(pd.NA)
# only this fixes it
c.mask(np.isnan(c), pd.NA)
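
A minimal sketch of how the workaround above could be wrapped in a reusable helper. The name `nan_to_na` is hypothetical and not part of this PR. Instead of `mask`, it round-trips through a NumPy float array and relies on `pd.array(..., dtype="Float64")` treating NaN as missing, so it also handles Series that already contain some pd.NA values.

```python
import numpy as np
import pandas as pd


def nan_to_na(s: pd.Series) -> pd.Series:
    """Return a Float64 Series in which every np.nan has become pd.NA.

    Hypothetical helper for illustration; non-float Series are returned unchanged.
    """
    if not pd.api.types.is_float_dtype(s.dtype):
        return s
    # Export to a plain float64 array: masked entries (pd.NA) and hidden
    # np.nan values both end up as NaN here.
    values = s.to_numpy(dtype="float64", na_value=np.nan)
    # Rebuilding the nullable array marks every NaN as missing (pd.NA).
    return pd.Series(pd.array(values, dtype="Float64"), index=s.index, name=s.name)


a = pd.Series([0, 1]).astype("Int64")
b = pd.Series([0, 2]).astype("Int64")
cleaned = nan_to_na(a / b)
print(cleaned)
```

Whether a round trip like this or the `mask` call is preferable depends on the column size and on how often the cleanup runs; the sketch only illustrates the idea.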

TODO before merging:

  • Increment ETL_EPOCH and make sure old datasets don't change
  • Undo ETL_EPOCH increment

owidbot commented Oct 31, 2024

Quick links (staging server):

Site Dev Site Preview Admin Wizard Docs

Login: ssh owid@staging-site-safe-types

chart-diff: ✅ No charts for review.
data-diff: ❌ Found differences
= Dataset garden/antibiotics/2024-11-15/testing_coverage
  = Table testing_coverage
⚠ Error: Index must be unique.
= Dataset garden/artificial_intelligence/2023-06-14/ai_deepfakes
  = Table ai_deepfakes
⚠ Error: Index must be unique.
⚠ Error: Index must be unique.
= Dataset garden/artificial_intelligence/2024-11-03/epoch_aggregates_domain
  = Table epoch_aggregates_domain
    ~ Column cumulative_count (changed metadata)
-       -   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                                                                                                                            ^
+       +   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                                                                                                                            ^^
    ~ Column yearly_count (changed metadata)
-       -   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                                                                                                                            ^
+       +   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                                                                                                                            ^^
= Dataset garden/artificial_intelligence/2024-11-03/epoch_compute_intensive_countries
  = Table epoch_compute_intensive_countries
    ~ Column cumulative_count (changed metadata)
-       -   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                                          ^
+       +   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                                          ^^
    ~ Column yearly_count (changed metadata)
-       -   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                                          ^
+       +   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                                          ^^
= Dataset garden/artificial_intelligence/2024-11-03/epoch_compute_intensive_domain
  = Table epoch_compute_intensive_domain
    ~ Column cumulative_count (changed metadata)
-       -   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                               ^
+       +   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                               ^^
    ~ Column yearly_count (changed metadata)
-       -   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 6 November 2024.
        ?                                                                                                                                                               ^
+       +   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 03 November 2024.
        ?                                                                                                                                                               ^^
= Dataset garden/faostat/2024-03-14/faostat_fa
  = Table faostat_fa
  = Table faostat_fa_flat
2024-11-19 18:32:21 [warning  ] DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
2024-11-19 18:33:23 [warning  ] DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
2024-11-19 18:33:41 [warning  ] DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
2024-11-19 18:37:53 [warning  ] DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
= Dataset garden/lis/2024-06-13/luxembourg_income_study
  = Table lis_percentiles
  = Table luxembourg_income_study_adults
  = Table lis_percentiles_adults
  = Table luxembourg_income_study
⚠ Error: Index must be unique.
= Dataset garden/ophi/2024-10-28/multidimensional_poverty_index
  = Table multidimensional_poverty_index
    ~ Column censored_headcount_ratio (changed metadata)
-       - title: Share of the population in multidimensional poverty deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population in multidimensional poverty deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
-       -   name: Share of the population in multidimensional poverty deprived in the indicator <<indicator>>
        ?                 ----
+       +   name: Share of population in multidimensional poverty deprived in the indicator <<indicator>>
-       -   title_public: Share of the population in multidimensional poverty deprived in the indicator <<indicator>>
        ?                         ----
+       +   title_public: Share of population in multidimensional poverty deprived in the indicator <<indicator>>
    ~ Column headcount_ratio (changed metadata)
-       - title: Share of the population in multidimensional poverty (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population living in multidimensional poverty (<<area>>) - <<flavor>>
        ?                            +++  ++++
-       -   name: Share of the population in multidimensional poverty
        ?                 ----
+       +   name: Share of population living in multidimensional poverty
        ?                             +++  ++++
-       -   title_public: Share of the population in multidimensional poverty
        ?                         ----
+       +   title_public: Share of population living in multidimensional poverty
        ?                                     +++  ++++
    ~ Column severe (changed metadata)
-       - title: Share of the population in severe multidimensional poverty (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population living in severe multidimensional poverty (<<area>>) - <<flavor>>
        ?                            +++  ++++
-       -   name: Share of the population in severe multidimensional poverty
        ?                 ----
+       +   name: Share of population living in severe multidimensional poverty
        ?                             +++  ++++
-       -   title_public: Share of the population in severe multidimensional poverty
        ?                         ----
+       +   title_public: Share of population living in severe multidimensional poverty
        ?                                     +++  ++++
    ~ Column uncensored_headcount_ratio (changed metadata)
-       - title: Share of the population deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
-       -   name: Share of the population deprived in the indicator <<indicator>>
        ?                 ----
+       +   name: Share of population deprived in the indicator <<indicator>>
-       -   title_public: Share of the population deprived in the indicator <<indicator>>
        ?                         ----
+       +   title_public: Share of population deprived in the indicator <<indicator>>
    ~ Column vulnerable (changed metadata)
-       - title: Share of the population vulnerable to multidimensional poverty (<<area>>) - <<flavor>>
        ?                ----
+       + title: Share of population vulnerable to multidimensional poverty (<<area>>) - <<flavor>>
-       -   name: Share of the population vulnerable to multidimensional poverty
        ?                 ----
+       +   name: Share of population vulnerable to multidimensional poverty
-       -   title_public: Share of the population vulnerable to multidimensional poverty
        ?                         ----
+       +   title_public: Share of population vulnerable to multidimensional poverty
= Dataset garden/regions/2023-01-01/regions
  = Table regions
    ~ Column aliases (changed data)
        ~ Changed values: 1 / 334 (0.30%)
          code                                                                                    aliases -                                                                                                                   aliases +
           CIV ["C\u00c3\u00b4te D'Ivoire", "C\u00f4te d'Ivoire", "C\u00f4te d\u2019Ivoire", "Ivory Coast"] ["C\u00c3\u00b4te D'Ivoire", "C\u00f4te d'Ivoire", "C\u00f4te d\u2019Ivoire", "Ivory Coast", "C<U+00F4>te d<U+2019>Ivoire"]
= Dataset garden/un/2022-07-11/un_wpp
  = Table fertility
  = Table un_wpp
  = Table migration
  = Table population
  = Table mortality
  = Table population_granular
    ~ Column value (changed data)
        ~ Changed values: 1391 / 39815008 (0.00%)
                  location  year    metric  sex age variant  value -  value +
                   Tokelau  2047 sex_ratio none  91    high      inf     <NA>
                   Tokelau  2046 sex_ratio none  91     low      inf     <NA>
            Western Sahara  1967 sex_ratio none  95  medium      inf     <NA>
          Falkland Islands  1957 sex_ratio none  89  medium      inf     <NA>
                   Tokelau  2033 sex_ratio none  78  medium      inf     <NA>
  = Table demographic
= Dataset garden/worldbank_wdi/2022-05-26/wdi
  = Table wdi
    ~ Column omm_goods_exp_share_gdp (changed data)
        ~ Changed values: 1 / 14400 (0.01%)
          country  year  omm_goods_exp_share_gdp -  omm_goods_exp_share_gdp +
           Guyana  1977                  57.639999                  57.650002
    ~ Column omm_merch_exp_share_gdp (changed data)
        ~ Changed values: 1 / 14400 (0.01%)
           country  year  omm_merch_exp_share_gdp -  omm_merch_exp_share_gdp +
          Kiribati  1995                      12.42                      12.43
    ~ Column omm_net_savings_percap (changed data)
        ~ Changed values: 5 / 14400 (0.03%)
            country  year  omm_net_savings_percap -  omm_net_savings_percap +
            Albania  2008                311.750000                 311.76001
              Congo  2007               -328.369995               -328.380005
          Indonesia  2019                592.250000                 592.23999
             Israel  1970                243.529999                243.539993
             Panama  1995                694.270020                694.280029
= Dataset garden/worldbank_wdi/2024-05-20/wdi
  = Table wdi
    ~ Column omm_goods_exp_share_gdp (changed data)
        ~ Changed values: 4 / 14570 (0.03%)
              country  year  omm_goods_exp_share_gdp -  omm_goods_exp_share_gdp +
             Eswatini  2019                  44.360001                  44.349998
               Guyana  1977                  57.639999                  57.650002
            Singapore  2006                 188.789993                 188.800003
          Switzerland  2007                  42.720001                      42.73


Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included

Edited: 2024-11-19 19:00:06 UTC
Execution time: 1932.62 seconds

@Marigold Marigold marked this pull request as ready for review October 31, 2024 09:01
@Marigold Marigold requested a review from pabloarosado October 31, 2024 12:20
@pabloarosado pabloarosado left a comment

Thanks for doing this! Changing types after repacking is a common headache.
I'm a bit unsure of the preferred dtypes. Maybe we should discuss them on the data architecture call (I agree with your choices, but I'm not sure if there may be some negative side effects).

lib/repack/tests/test_repack.py (resolved)
@Marigold Marigold force-pushed the safe-types branch 3 times, most recently from e57a80b to 234c151 Compare November 12, 2024 06:03
@Marigold Marigold force-pushed the safe-types branch 5 times, most recently from 3ca1ede to c5ece93 Compare November 14, 2024 09:07
Marigold commented Nov 14, 2024

I've updated it to use nullable types & string[pyarrow]. The first commit contains changes to steps that are not worth checking (datadiff confirmed they match), so please only look at changes from the second commit.

Changes

  • Repack library converts everything to nullable types and categoricals rather than a mix of nullable types, numpy types and categoricals
  • New function to_safe_types that converts to Int64, Float64 and string[pyarrow] types
  • New method ds.read(...) (it used to be called read_table) that reads a table and by default resets index and converts to safe types
  • Reading data from snapshot with snap.read(...) also uses safe types by default
  • Update pandas to 2.2.3 (fixes problems with pyarrow)
  • Division 0/0 returns pd.NA when working with Tables (it doesn't cover edge cases, but if they happen, you'd get an error when saving a dataset)
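
To make the conversion in the list above concrete, here is a rough sketch of what a function like `to_safe_types` might do. It is not the implementation from this PR: the name `to_safe_types_sketch` and the dtype rules are assumptions, and it uses the plain `string` dtype rather than `string[pyarrow]` so that it runs without pyarrow installed.

```python
import pandas as pd


def to_safe_types_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Convert columns to nullable Int64 / Float64 / string dtypes.

    Hypothetical illustration only; other dtypes (bool, datetime, ...) are left as-is.
    """
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if pd.api.types.is_integer_dtype(dtype):
            # int8/int32/uint16/... and their nullable variants -> Int64
            out[col] = out[col].astype("Int64")
        elif pd.api.types.is_float_dtype(dtype):
            # float32/float64 -> Float64; NaN becomes pd.NA during the cast
            out[col] = out[col].astype("Float64")
        elif (
            isinstance(dtype, pd.CategoricalDtype)
            or pd.api.types.is_object_dtype(dtype)
            or pd.api.types.is_string_dtype(dtype)
        ):
            # Assumption: categoricals and object columns hold text
            out[col] = out[col].astype("string")
    return out


repacked = pd.DataFrame(
    {
        "year": pd.Series([2000, 2001], dtype="int32"),
        "value": pd.Series([1.5, float("nan")], dtype="float32"),
        "country": pd.Categorical(["France", "Spain"]),
    }
)
print(to_safe_types_sketch(repacked).dtypes)
```

The point of such a step is that downstream code sees the same dtypes no matter how compactly the table was stored on disk.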

Marigold commented

@pabloarosado This is finally ready for review. Check the comment above for a summary of changes.

I wanted to go one step further and enable pd.options.future.infer_string = True, which makes the new string[pyarrow] the default. I ran into some issues, though, so I'm leaving it for either a future PR or for the actual upgrade to pandas 3.

@pabloarosado pabloarosado left a comment

Huge, thanks for this massive refactor! I have quickly scanned through all files and it all looks good. And you already checked for data changes. So I think this is ready to go!

@Marigold Marigold merged commit aa0ffcb into master Nov 19, 2024
5 of 7 checks passed
@Marigold Marigold deleted the safe-types branch November 19, 2024 21:27
@lucasrodes lucasrodes mentioned this pull request Nov 20, 2024
3 participants