Upgrade pandas to 2.2.x #1094

Closed
Tracked by #2390
Marigold opened this issue May 10, 2023 · 4 comments

@Marigold
Collaborator

Marigold commented May 10, 2023

Upgrade to a newer version of pandas to pick up bug fixes and continue the migration towards Arrow data types.

Pandas 2.2

See: release notes

  • Supports copy-on-write, which will soon be the default (enabled in advance with pd.options.mode.copy_on_write = True)
  • Supports faster pyarrow strings, which will soon be the default (enabled in advance with pd.options.future.infer_string = True)
  • We will start to get warnings for things that will be removed or deprecated in Pandas 3.0
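As a rough illustration of what the copy-on-write option changes, here is a minimal sketch assuming pandas 2.x. It covers only `mode.copy_on_write` (not `future.infer_string`, which also needs pyarrow installed), and the frame and column names are made up for the example.

```python
import pandas as pd

# Scope the option to this block rather than setting it globally.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3]})
    col = df["a"]        # shares data with df until one of them is written to
    col.iloc[0] = 100    # under copy-on-write this write triggers a copy of col

    parent_values = df["a"].tolist()  # the parent frame is left untouched
    child_values = col.tolist()       # only the selected Series changed

print(parent_values, child_values)
```

Code that relies on a write to a selection propagating back to the parent frame is the kind of thing that would need changing before copy-on-write becomes the default.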

Pandas 2.1

Pandas 2.0

Pandas 2.0 can use Arrow as a backend format and promises some performance improvements, though it might be slower for some operations. One day we'll have to migrate anyway, but it's probably a good idea to wait until 2.0 becomes mature enough and is adopted by the majority of users.

(We had a request for pandas 2.0 in owid-catalog-py)

@Marigold
Collaborator Author

Tried pandas 2.0 on etl data://meadow/ihme_gbd/2019/gbd_child_mortality and its performance is a bit disappointing. The current ETL step takes 55s with pandas 1.x.x and 68s with pandas 2.0.1.
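To put those two timings in relative terms, here is a small worked calculation. The helper name `relative_slowdown` is hypothetical, and the figures are single runs from the comment above, so the result is indicative only.

```python
def relative_slowdown(baseline_s: float, candidate_s: float) -> float:
    """Fractional increase in runtime of the candidate over the baseline."""
    return (candidate_s - baseline_s) / baseline_s

# 55s with pandas 1.x.x versus 68s with pandas 2.0.1, as reported above.
slowdown = relative_slowdown(55.0, 68.0)
print(f"{slowdown:.1%}")  # prints 23.6%
```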

@larsyencken
Collaborator

larsyencken commented Jul 12, 2023

We would need Pandas 2.0.x if we wanted to address

by using Apache Arrow types in repacking

@stale

stale bot commented Sep 10, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale added the wontfix label Sep 10, 2023
@Marigold removed the wontfix label Sep 11, 2023
@stale added the wontfix label Nov 17, 2023
@larsyencken changed the title from "Upgrade pandas to 2.0" to "Upgrade pandas to 2.1.x" Nov 18, 2023
@stale removed the wontfix label Nov 18, 2023
@owid deleted a comment from stale bot Nov 18, 2023
@larsyencken changed the title from "Upgrade pandas to 2.1.x" to "Upgrade pandas to 2.2.x" Mar 15, 2024

@Marigold
Collaborator Author

I wanted to at least give it a try to see how much we would have to change. Work in progress is in #2468. So far there are no major problems, though some things are annoying (e.g. read_sql with connections).

We'd need to run Datadiff on all datasets to verify that there are no side effects.
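The kind of check involved can be sketched with pandas' own testing helper. This is not the Datadiff tool itself, only an illustration of comparing the output of one step before and after the upgrade; the function name `frames_match` and the sample frames are hypothetical.

```python
import pandas as pd

def frames_match(old: pd.DataFrame, new: pd.DataFrame, check_dtype: bool = False) -> bool:
    """Return True if both frames hold the same values.

    check_dtype defaults to False here (an assumption for this sketch) so that
    dtype changes between pandas versions do not count as differences.
    """
    try:
        pd.testing.assert_frame_equal(old, new, check_dtype=check_dtype)
    except AssertionError:
        return False
    return True

# Made-up outputs of the same step under the old and new pandas versions.
before = pd.DataFrame({"country": ["A", "B"], "deaths": [10, 20]})
after_same = before.copy()
after_changed = before.assign(deaths=[10, 21])

same_result = frames_match(before, after_same)
changed_result = frames_match(before, after_changed)
print(same_result, changed_result)
```

Whether dtype changes should count as differences is a design choice; passing `check_dtype=True` would flag them as well.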
