Refactor pandas save load and convert dtypes #3412

Merged
7 commits merged into SpikeInterface:main on Sep 16, 2024

Conversation

Member

@alejoe91 alejoe91 commented Sep 15, 2024

We found out that zarr consolidation doesn't seem to play well with our way of saving/loading dataframes to zarr using xarray.

xarray was only used to save/load pandas dataframes to zarr, and this PR modifies that by saving each column and the index directly. This is similar to how xarray saves to zarr, so it should be backward compatible (testing now).
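
To illustrate the column-wise approach, here is a minimal sketch (not the PR's actual code) of writing a pandas dataframe to a zarr group column by column and reading it back; the helper names and key layout are illustrative assumptions:

import pandas as pd
import zarr

def dataframe_to_zarr(df: pd.DataFrame, parent_group: zarr.Group, name: str):
    # store the index and each column as separate zarr arrays in a sub-group
    df_group = parent_group.create_group(name)
    df_group.create_dataset("index", data=df.index.to_numpy())
    for col in df.columns:
        # cast (possibly nullable) pandas dtypes to plain numpy before writing
        df_group.create_dataset(str(col), data=df[col].to_numpy())

def dataframe_from_zarr(parent_group: zarr.Group, name: str) -> pd.DataFrame:
    df_group = parent_group[name]
    index = df_group["index"][:]
    data = {key: df_group[key][:] for key in df_group.array_keys() if key != "index"}
    return pd.DataFrame(data, index=index)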

To make sure we don't run into such problems in the future, I added a roundtrip test to the common extension tests that asserts that the reloaded data are the same as the original.
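
A roundtrip assertion along these lines is enough to catch this kind of regression (just a sketch; the real test lives in the common extension test suite, and the helper name is made up):

import pandas as pd

def assert_dataframe_roundtrip(original: pd.DataFrame, reloaded: pd.DataFrame):
    # the dataframe read back from zarr should match what was computed,
    # allowing dtype differences introduced by the save/load conversion
    pd.testing.assert_frame_equal(original, reloaded, check_dtype=False)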

As suggested by @h-mayorquin in #3365, the generated and reloaded dataframes are also run through the convert_dtypes function. We just have to make sure to call Series.to_numpy to cast pandas dtypes to numpy ones.
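
A small illustration of that interplay (made-up values): convert_dtypes infers an appropriate dtype for a column with missing values, and Series.to_numpy casts it back to a plain numpy dtype before writing to zarr:

import numpy as np
import pandas as pd

df = pd.DataFrame({"metric_a": [3.2, np.nan, 5.1]}, index=["unit0", "unit1", "unit2"])

# convert_dtypes infers the most appropriate (nullable) dtype for each column (Float64 here)
df = df.convert_dtypes()

# Series.to_numpy casts the nullable pandas dtype back to a numpy float array for storage
values = df["metric_a"].to_numpy(dtype="float64", na_value=np.nan)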

@alejoe91 alejoe91 added the core (Changes to core module) label on Sep 15, 2024
Collaborator

@zm711 zm711 left a comment


Since I haven't fully tested zarr yet, I want to make sure: do we have an appropriate pandas warning somewhere so users know they need pandas for these features? I know we have a warning for qualitymetrics; do we have one for templatemetrics?

@@ -287,7 +287,7 @@ def _compute_metrics(self, sorting_analyzer, unit_ids=None, verbose=False, **job
  warnings.warn(f"Error computing metric {metric_name} for unit {unit_id}: {e}")
  value = np.nan
  template_metrics.at[index, metric_name] = value
- return template_metrics
+ return template_metrics.convert_dtypes()
Collaborator


This is a weak recommendation, but maybe we put this on its own line with a comment. Just from reading this I have no clue why we need to do this, and doing it in the return line is even more confusing. So something like:

# see xx
template_metrics = template_metrics.convert_dtypes()
return template_metrics

Member Author


better?

@@ -185,7 +185,7 @@ def _compute_metrics(self, sorting_analyzer, unit_ids=None, verbose=False, **job
  if len(empty_unit_ids) > 0:
      metrics.loc[empty_unit_ids] = np.nan

- return metrics
+ return metrics.convert_dtypes()
Collaborator


Same here. From the code it is not clear why we need to convert dtypes, so I would prefer to divide this into a convert step and then return only the converted result. That way we can have a comment explaining why we need the conversion.

Member Author


Added comment and convert step.

@alejoe91
Member Author

> Since I haven't fully tested zarr yet, I want to make sure: do we have an appropriate pandas warning somewhere so users know they need pandas for these features? I know we have a warning for qualitymetrics; do we have one for templatemetrics?

We don't have warnings anywhere. If a user tries to compute template or quality metrics without pandas, it will throw an interpretable ModuleNotFoundError :)

@alejoe91
Member Author

One last comment: for analyzers saved to zarr in version 0.101.0, the consolidation step was missing after the computation of each extension. I added a check and a consolidation step that raises a warning if it fails.
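
A rough sketch of what such a check-and-consolidate step could look like, assuming the analyzer's zarr root group is already open; the function name and error handling are illustrative, not the PR's exact code:

import warnings
import zarr

def ensure_consolidated(zarr_root: zarr.Group):
    # re-consolidate the store metadata; warn instead of failing if we cannot write
    try:
        zarr.consolidate_metadata(zarr_root.store)
    except Exception as e:
        warnings.warn(f"Could not consolidate the zarr store metadata: {e}")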

@alejoe91 alejoe91 added this to the 0.101.1 milestone Sep 15, 2024
Collaborator

@zm711 zm711 left a comment


This looks pretty good to me!

warnings.warn(
    "The zarr store was not properly consolidated prior to v0.101.1. "
    "This may lead to unexpected behavior in loading extensions. "
    "Please consider re-saving the SortingAnalyzer object."
)
Collaborator


With a save_as(format='zarr')? I just want to make sure the warning is as clear as possible.

Member Author


Not really... since the problem is consolidation, save_as may fail to discover all the pieces of the folder. I changed the message to suggest re-generating.

Honestly, I don't think this will be an issue since it will only happen if:

  • you saved to zarr between 0.101.0 and 0.101.1
  • you don't have write access to the data


# we use the convert_dtypes to convert the columns to the most appropriate dtype and avoid object columns
# (in case of NaN values)
template_metrics = template_metrics.convert_dtypes()
Collaborator


Thanks, I think that's great!

@samuelgarcia samuelgarcia merged commit b4dceac into SpikeInterface:main Sep 16, 2024
15 checks passed
@zm711
Collaborator

zm711 commented Sep 16, 2024

Thanks Alessio!
