Refactor pandas save load and convert dtypes #3412

Merged
7 commits merged into SpikeInterface:main on Sep 16, 2024

Conversation

Member

@alejoe91 alejoe91 commented Sep 15, 2024

We found out that zarr consolidation doesn't seem to play well with our way of saving/loading dataframes to zarr using xarray.

xarray was only used to save/load pandas dataframes to zarr, and this PR modifies that by saving each column and the index directly. This is similar to how xarray saves to zarr, so it should be backward compatible (testing now).
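
To illustrate the column-wise approach, here is a minimal sketch (not the PR's actual code) of writing a pandas dataframe to a zarr group column by column and reading it back; the helper names and key layout are illustrative assumptions:

import pandas as pd
import zarr

def dataframe_to_zarr(df: pd.DataFrame, parent_group: zarr.Group, name: str):
    # store the index and each column as separate zarr arrays in a sub-group
    df_group = parent_group.create_group(name)
    df_group.create_dataset("index", data=df.index.to_numpy())
    for col in df.columns:
        # cast (possibly nullable) pandas dtypes to plain numpy before writing
        df_group.create_dataset(str(col), data=df[col].to_numpy())

def dataframe_from_zarr(parent_group: zarr.Group, name: str) -> pd.DataFrame:
    df_group = parent_group[name]
    index = df_group["index"][:]
    data = {key: df_group[key][:] for key in df_group.array_keys() if key != "index"}
    return pd.DataFrame(data, index=index)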

To make sure we don't run into such problems in the future, I added a roundtrip test to the common extension tests that asserts that the reloaded data are the same as the original.
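
A roundtrip assertion along these lines is enough to catch this kind of regression (just a sketch; the real test lives in the common extension test suite, and the helper name is made up):

import pandas as pd

def assert_dataframe_roundtrip(original: pd.DataFrame, reloaded: pd.DataFrame):
    # the dataframe read back from zarr should match what was computed,
    # allowing dtype differences introduced by the save/load conversion
    pd.testing.assert_frame_equal(original, reloaded, check_dtype=False)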

As suggested by @h-mayorquin in #3365, the generated and reloaded dataframes are also run through the convert_dtypes function. We just have to make sure to call Series.to_numpy to cast pandas dtypes to numpy ones.
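
A small illustration of that interplay (made-up values): convert_dtypes infers an appropriate dtype for a column with missing values, and Series.to_numpy casts it back to a plain numpy dtype before writing to zarr:

import numpy as np
import pandas as pd

df = pd.DataFrame({"metric_a": [3.2, np.nan, 5.1]}, index=["unit0", "unit1", "unit2"])

# convert_dtypes infers the most appropriate (nullable) dtype for each column (Float64 here)
df = df.convert_dtypes()

# Series.to_numpy casts the nullable pandas dtype back to a numpy float array for storage
values = df["metric_a"].to_numpy(dtype="float64", na_value=np.nan)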

@alejoe91 alejoe91 added the core (Changes to core module) label on Sep 15, 2024
Collaborator

@zm711 zm711 left a comment


Since I haven't fully tested zarr yet, I want to make sure: do we have an appropriate pandas warning somewhere so users know they need pandas for these features? I know we have a warning for qualitymetrics; do we have one for templatemetrics?

@@ -287,7 +287,7 @@ def _compute_metrics(self, sorting_analyzer, unit_ids=None, verbose=False, **job
  warnings.warn(f"Error computing metric {metric_name} for unit {unit_id}: {e}")
  value = np.nan
  template_metrics.at[index, metric_name] = value
- return template_metrics
+ return template_metrics.convert_dtypes()
Collaborator


This is a weak recommendation, but maybe we put this on its own line with a comment. Just from reading this I have no clue why we need to do this, and doing it in the return line is even more confusing. So something like:

# see xx
template_metrics = template_metrics.convert_dtypes()
return template_metrics

Member Author


better?

@@ -185,7 +185,7 @@ def _compute_metrics(self, sorting_analyzer, unit_ids=None, verbose=False, **job
  if len(empty_unit_ids) > 0:
      metrics.loc[empty_unit_ids] = np.nan

- return metrics
+ return metrics.convert_dtypes()
Collaborator


Same here. From the code it is not clear why we need to convert dtypes, so I would prefer to divide this into a convert step and then return only the converted result. That way we can have a comment explaining why we need the conversion.

Member Author


Added comment and convert step.

@alejoe91
Member Author

> Since I haven't fully tested zarr yet, I want to make sure: do we have an appropriate pandas warning somewhere so users know they need pandas for these features? I know we have a warning for qualitymetrics; do we have one for templatemetrics?

We don't have warnings anywhere. If a user tries to compute template or quality metrics without pandas, it will throw an interpretable ModuleNotFoundError :)

@alejoe91
Member Author

One last comment: for analyzers saved to zarr in version 0.101.0, the consolidation step was missing after the computation of each extension. I added a check and a consolidation step that raises a warning if it fails.
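
A rough sketch of what such a check-and-consolidate step could look like, assuming the analyzer's zarr root group is already open; the function name and error handling are illustrative, not the PR's exact code:

import warnings
import zarr

def ensure_consolidated(zarr_root: zarr.Group):
    # re-consolidate the store metadata; warn instead of failing if we cannot write
    try:
        zarr.consolidate_metadata(zarr_root.store)
    except Exception as e:
        warnings.warn(f"Could not consolidate the zarr store metadata: {e}")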

@alejoe91 alejoe91 added this to the 0.101.1 milestone Sep 15, 2024
Collaborator

@zm711 zm711 left a comment


This looks pretty good to me!

warnings.warn(
    "The zarr store was not properly consolidated prior to v0.101.1. "
    "This may lead to unexpected behavior in loading extensions. "
    "Please consider re-saving the SortingAnalyzer object."
)
Collaborator


With a save_as(format='zarr')? I just want to make sure the warning is as clear as possible.

Member Author


Not really... since the problem is consolidation, save_as may fail to discover all the pieces of the folder. I changed the message to suggest re-generating.

Honestly, I don't think this will be an issue since it will only happen if:

  • you saved to zarr between 0.101.0 and 0.101.1
  • you don't have write access to the data


# we use the convert_dtypes to convert the columns to the most appropriate dtype and avoid object columns
# (in case of NaN values)
template_metrics = template_metrics.convert_dtypes()
Collaborator


Thanks, I think that's great!

@samuelgarcia samuelgarcia merged commit b4dceac into SpikeInterface:main Sep 16, 2024
15 checks passed
@zm711
Collaborator

zm711 commented Sep 16, 2024

Thanks Alessio!
