✨ Convert dtypes to safe types Float64, Int64 and string #3474
Quick links (staging server):
chart-diff: ✅ No charts for review.
data-diff: ❌ Found differences
= Dataset garden/antibiotics/2024-11-15/testing_coverage
= Table testing_coverage
⚠ Error: Index must be unique.
= Dataset garden/artificial_intelligence/2023-06-14/ai_deepfakes
= Table ai_deepfakes
⚠ Error: Index must be unique.
⚠ Error: Index must be unique.
= Dataset garden/artificial_intelligence/2024-11-03/epoch_aggregates_domain
= Table epoch_aggregates_domain
~ Column cumulative_count (changed metadata)
- - Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 6 November 2024.
? ^
+ + Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 03 November 2024.
? ^^
~ Column yearly_count (changed metadata)
- - Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 6 November 2024.
? ^
+ + Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 03 November 2024.
? ^^
= Dataset garden/artificial_intelligence/2024-11-03/epoch_compute_intensive_countries
= Table epoch_compute_intensive_countries
~ Column cumulative_count (changed metadata)
- - Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 6 November 2024.
? ^
+ + Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 03 November 2024.
? ^^
~ Column yearly_count (changed metadata)
- - Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 6 November 2024.
? ^
+ + Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 03 November 2024.
? ^^
= Dataset garden/artificial_intelligence/2024-11-03/epoch_compute_intensive_domain
= Table epoch_compute_intensive_domain
~ Column cumulative_count (changed metadata)
- - Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 6 November 2024.
? ^
+ + Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 03 November 2024.
? ^^
~ Column yearly_count (changed metadata)
- - Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 6 November 2024.
? ^
+ + Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 03 November 2024.
? ^^
= Dataset garden/faostat/2024-03-14/faostat_fa
= Table faostat_fa
= Table faostat_fa_flat
2024-11-19 18:32:21 [warning ] DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
2024-11-19 18:33:23 [warning ] DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
2024-11-19 18:33:41 [warning ] DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
2024-11-19 18:37:53 [warning ] DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` category=PerformanceWarning filename=/home/owid/etl/lib/catalog/owid/catalog/tables.py lineno=405
= Dataset garden/lis/2024-06-13/luxembourg_income_study
= Table lis_percentiles
= Table luxembourg_income_study_adults
= Table lis_percentiles_adults
= Table luxembourg_income_study
⚠ Error: Index must be unique.
= Dataset garden/ophi/2024-10-28/multidimensional_poverty_index
= Table multidimensional_poverty_index
~ Column censored_headcount_ratio (changed metadata)
- - title: Share of the population in multidimensional poverty deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
? ----
+ + title: Share of population in multidimensional poverty deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
- - name: Share of the population in multidimensional poverty deprived in the indicator <<indicator>>
? ----
+ + name: Share of population in multidimensional poverty deprived in the indicator <<indicator>>
- - title_public: Share of the population in multidimensional poverty deprived in the indicator <<indicator>>
? ----
+ + title_public: Share of population in multidimensional poverty deprived in the indicator <<indicator>>
~ Column headcount_ratio (changed metadata)
- - title: Share of the population in multidimensional poverty (<<area>>) - <<flavor>>
? ----
+ + title: Share of population living in multidimensional poverty (<<area>>) - <<flavor>>
? +++ ++++
- - name: Share of the population in multidimensional poverty
? ----
+ + name: Share of population living in multidimensional poverty
? +++ ++++
- - title_public: Share of the population in multidimensional poverty
? ----
+ + title_public: Share of population living in multidimensional poverty
? +++ ++++
~ Column severe (changed metadata)
- - title: Share of the population in severe multidimensional poverty (<<area>>) - <<flavor>>
? ----
+ + title: Share of population living in severe multidimensional poverty (<<area>>) - <<flavor>>
? +++ ++++
- - name: Share of the population in severe multidimensional poverty
? ----
+ + name: Share of population living in severe multidimensional poverty
? +++ ++++
- - title_public: Share of the population in severe multidimensional poverty
? ----
+ + title_public: Share of population living in severe multidimensional poverty
? +++ ++++
~ Column uncensored_headcount_ratio (changed metadata)
- - title: Share of the population deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
? ----
+ + title: Share of population deprived in the indicator <<indicator>> (<<area>>) - <<flavor>>
- - name: Share of the population deprived in the indicator <<indicator>>
? ----
+ + name: Share of population deprived in the indicator <<indicator>>
- - title_public: Share of the population deprived in the indicator <<indicator>>
? ----
+ + title_public: Share of population deprived in the indicator <<indicator>>
~ Column vulnerable (changed metadata)
- - title: Share of the population vulnerable to multidimensional poverty (<<area>>) - <<flavor>>
? ----
+ + title: Share of population vulnerable to multidimensional poverty (<<area>>) - <<flavor>>
- - name: Share of the population vulnerable to multidimensional poverty
? ----
+ + name: Share of population vulnerable to multidimensional poverty
- - title_public: Share of the population vulnerable to multidimensional poverty
? ----
+ + title_public: Share of population vulnerable to multidimensional poverty
= Dataset garden/regions/2023-01-01/regions
= Table regions
~ Column aliases (changed data)
~ Changed values: 1 / 334 (0.30%)
code aliases - aliases +
CIV ["C\u00c3\u00b4te D'Ivoire", "C\u00f4te d'Ivoire", "C\u00f4te d\u2019Ivoire", "Ivory Coast"] ["C\u00c3\u00b4te D'Ivoire", "C\u00f4te d'Ivoire", "C\u00f4te d\u2019Ivoire", "Ivory Coast", "C<U+00F4>te d<U+2019>Ivoire"]
= Dataset garden/un/2022-07-11/un_wpp
= Table fertility
= Table un_wpp
= Table migration
= Table population
= Table mortality
= Table population_granular
~ Column value (changed data)
~ Changed values: 1391 / 39815008 (0.00%)
location year metric sex age variant value - value +
Tokelau 2047 sex_ratio none 91 high inf <NA>
Tokelau 2046 sex_ratio none 91 low inf <NA>
Western Sahara 1967 sex_ratio none 95 medium inf <NA>
Falkland Islands 1957 sex_ratio none 89 medium inf <NA>
Tokelau 2033 sex_ratio none 78 medium inf <NA>
= Table demographic
= Dataset garden/worldbank_wdi/2022-05-26/wdi
= Table wdi
~ Column omm_goods_exp_share_gdp (changed data)
~ Changed values: 1 / 14400 (0.01%)
country year omm_goods_exp_share_gdp - omm_goods_exp_share_gdp +
Guyana 1977 57.639999 57.650002
~ Column omm_merch_exp_share_gdp (changed data)
~ Changed values: 1 / 14400 (0.01%)
country year omm_merch_exp_share_gdp - omm_merch_exp_share_gdp +
Kiribati 1995 12.42 12.43
~ Column omm_net_savings_percap (changed data)
~ Changed values: 5 / 14400 (0.03%)
country year omm_net_savings_percap - omm_net_savings_percap +
Albania 2008 311.750000 311.76001
Congo 2007 -328.369995 -328.380005
Indonesia 2019 592.250000 592.23999
Israel 1970 243.529999 243.539993
Panama 1995 694.270020 694.280029
= Dataset garden/worldbank_wdi/2024-05-20/wdi
= Table wdi
~ Column omm_goods_exp_share_gdp (changed data)
~ Changed values: 4 / 14570 (0.03%)
country year omm_goods_exp_share_gdp - omm_goods_exp_share_gdp +
Eswatini 2019 44.360001 44.349998
Guyana 1977 57.639999 57.650002
Singapore 2006 188.789993 188.800003
Switzerland 2007 42.720001 42.73
Legend: +New ~Modified -Removed =Identical
Hint: Run this locally with `etl diff REMOTE data/ --include yourdataset --verbose --snippet`
Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included
Edited: 2024-11-19 19:00:06 UTC
Thanks for doing this! Changing types after repacking is a common headache.
I'm a bit unsure of the preferred dtypes. Maybe we should discuss them on the data architecture call (I agree with your choices, but I'm not sure if there may be some negative side effects).
e57a80b
to
234c151
Compare
3ca1ede
to
c5ece93
Compare
I've updated it to use nullable types & string[pyarrow]. The first commit contains changes to steps that are not worth checking (datadiff confirmed they match), so please only look at changes from the second commit. Changes
@pabloarosado This is finally ready for review. Check the comment above for a summary of changes. I wanted to go one step further and enable |
Huge, thanks for this massive refactor! I have quickly scanned through all files and it all looks good. And you already checked for data changes. So I think this is ready to go!
Implements #3277
Adds new parameter `ds.read_table("my_table", safe_types=True)` that converts all types to "safe" types `Float64`, `Int64` or `string[python]`. It's true by default.
(We've also considered the name `unpack` in the past instead of `safe_types`. I don't have a preference.)
int64 vs Int64
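To make the trade-off concrete, here is a rough sketch in plain pandas of what converting a frame to nullable "safe" types might look like. It is not the implementation in this PR: `to_safe_types` is a hypothetical helper name, and the mapping (integers to `Int64`, floats to `Float64`, object/string/category to `string[python]`) is assumed from the description above.

```python
import pandas as pd
from pandas.api import types as ptypes


def to_safe_types(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not the PR's code): cast columns to nullable dtypes."""
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if ptypes.is_integer_dtype(dtype):
            # int8/int32/uint16/... all become the nullable Int64
            out[col] = out[col].astype("Int64")
        elif ptypes.is_float_dtype(dtype):
            # float16/float32/float64 all become the nullable Float64
            out[col] = out[col].astype("Float64")
        elif (
            isinstance(dtype, pd.CategoricalDtype)
            or ptypes.is_object_dtype(dtype)
            or ptypes.is_string_dtype(dtype)
        ):
            out[col] = out[col].astype("string[python]")
        # other dtypes (bool, datetime, ...) are left untouched in this sketch
    return out


# A "repacked" frame with small dtypes, as an assumed example input.
df = pd.DataFrame(
    {
        "year": pd.Series([2020, 2021], dtype="int32"),
        "value": pd.Series([1.5, float("nan")], dtype="float32"),
        "country": pd.Series(["France", "Spain"], dtype="category"),
    }
)
safe = to_safe_types(df)
print(safe.dtypes)
```

With nullable dtypes the missing entry in `value` is represented as `pd.NA` rather than `np.nan`, which is the design question discussed next.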
Should we use numpy's `int64` type or the nullable pandas type `Int64` (or even pyarrow types)? This PR goes with nullable types, but there are some hidden risks to it:
- Mixing `np.nan` and `pd.NA` creates a nightmare of compatibility issues. We should always use `pd.NA`.
- `np.nan` sneaks in; for instance, division 0/0 returns `np.nan` even if we work with nullable types.
TODO before merging:
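As an illustration of the `np.nan` risk listed above (not one of the TODO items), the sketch below divides two `Float64` series and then normalizes the result so that any `NaN` becomes `pd.NA`. `nan_to_na` is a hypothetical helper, not part of this PR, and how pandas reports the raw 0/0 result may differ between pandas versions, so the sketch only relies on the normalized output.

```python
import numpy as np
import pandas as pd


def nan_to_na(s: pd.Series) -> pd.Series:
    """Hypothetical helper: rebuild a Float64 series so NaN and NA are both pd.NA."""
    # Export to a plain float64 array, writing existing NA values as NaN ...
    values = s.to_numpy(dtype="float64", na_value=np.nan)
    # ... then rebuild the nullable array, which treats NaN as missing.
    return pd.Series(pd.array(values, dtype="Float64"), index=s.index, name=s.name)


num = pd.Series([0.0, 1.0], dtype="Float64")
den = pd.Series([0.0, 2.0], dtype="Float64")

ratio = num / den         # first element is 0/0, where NaN can sneak in
clean = nan_to_na(ratio)  # after normalization it is consistently missing

print(clean.isna().tolist())
```

Running a normalization like this after arithmetic is one possible way to keep `pd.NA` as the only missing-value marker; whether that cost is acceptable is part of the open question.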