Cleanup dataframes (#1360)
### Summary

This PR updates the implementation of `ScatterTable` and
`AnalysisResultTable` based on the
[comment](#1319 (comment))
from @itoko.

### Details and comments

The current pattern heavily uses inheritance, i.e. `Table(DataFrame, MixIn)`, but
this causes several problems. The Qiskit Experiments classes directly depend
on the third-party library, resulting in Sphinx directive mismatches and
poor robustness of the API. Instead of using inheritance, these classes
are refactored with composition and delegation, namely
```python
class Table:
    def __init__(self):
        self._data = DataFrame(...)
```
This pattern is also common in other software libraries that use dataframes.
Since this PR removes unreleased public classes, it should be merged
before the release.
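
For illustration, here is a minimal sketch of the delegation pattern building on the stub above (the attribute names are illustrative, not the exact `ScatterTable` API):

```python
import pandas as pd


class Table:
    """Minimal sketch: own a DataFrame rather than subclass it."""

    def __init__(self, data=None):
        # The wrapped dataframe is private; the public API is curated.
        self._data = pd.DataFrame(data, columns=["xval", "yval", "yerr"])

    @property
    def x(self):
        # Delegate column access to the underlying dataframe.
        return self._data.xval.to_numpy()

    @property
    def dataframe(self):
        # Escape hatch exposing the raw pandas object.
        return self._data.copy()
```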

Although this PR updates many files, the changes just delegate the data
handling logic to the container class itself, which simplifies the implementation of
classes that operate on the container objects. The new pattern also allows
stricter dtype management within the dataframe.
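
As an illustration of the dtype point (a hedged sketch; the dtype choices below are assumptions, not the shipped schema), the wrapper can pin column dtypes once at construction time:

```python
import pandas as pd

# Hypothetical schema; the real ScatterTable may use different dtypes.
COLUMN_DTYPES = {
    "xval": "float64",
    "yval": "float64",
    "yerr": "float64",
    "series_name": "object",
    "series_id": "Int64",  # nullable integer
    "category": "object",
    "shots": "Int64",
    "analysis": "object",
}


def empty_dataframe() -> pd.DataFrame:
    """Create an empty dataframe with enforced column dtypes."""
    return pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in COLUMN_DTYPES.items()}
    )
```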

---------

Co-authored-by: Will Shanks <[email protected]>
nkanazawa1989 and wshanks authored Feb 6, 2024
1 parent 336fe18 commit 36c06ea
Showing 24 changed files with 1,425 additions and 980 deletions.
4 changes: 2 additions & 2 deletions docs/howtos/rerun_analysis.rst
@@ -17,7 +17,7 @@ Solution
consult the `migration guide <https://docs.quantum.ibm.com/api/migration-guides/qiskit-runtime-from-provider>`_.

Once you recreate the exact experiment you ran and all of its parameters and options,
you can call the :meth:`.ExperimentData.add_jobs` method with a list of :class:`Job
<qiskit.providers.JobV1>` objects to generate the new :class:`.ExperimentData` object.
The following example retrieves jobs from a provider that has access to them via their
job IDs:
@@ -47,7 +47,7 @@ job IDs:
instead of overwriting the existing one.

If you have the job data in the form of a :class:`~qiskit.result.Result` object, you can
invoke the :meth:`.ExperimentData.add_data` method instead of :meth:`.ExperimentData.add_jobs`:

.. jupyter-input::

157 changes: 120 additions & 37 deletions docs/tutorials/curve_analysis.rst
@@ -240,6 +240,85 @@ generate initial guesses for parameters, from the ``AnalysisA`` class in the first
On the other hand, in the latter case, you need to manually copy and paste
every piece of logic defined in ``AnalysisA``.

.. _data_management_with_scatter_table:

Managing intermediate data
--------------------------

:class:`.ScatterTable` is the single source of truth for the data used in the curve fit analysis.
Each data point in a 1-D curve fit may consist of the x value, y value, and
standard error of the y value.
In addition, such an analysis may internally create several data subsets.
Each data point is given a metadata triplet (`series_id`, `category`, `analysis`)
to distinguish the subset it belongs to.

* The `series_id` is an integer key representing a label of the data, which may be classified by fit models.
  When an analysis consists of multiple fit models and performs a multi-objective fit,
  the created table may contain multiple datasets, one for each fit model.
  Usually the series index matches the index of the fit model in the analysis.
  The table also provides a `series_name` column, which is a human-friendly text notation of the `series_id`.
  The `series_name` and corresponding `series_id` must refer to the same data subset,
  and the `series_name` typically matches the name of the fit model.
  You can find a particular data subset by either `series_id` or `series_name`.

* The `category` is a string tag categorizing a group of data points.
  The measured outcomes input as-is to the curve analysis are categorized as "raw".
  In a standard :class:`.CurveAnalysis` subclass, the input data is formatted for
  the fitting, and the formatted data is also stored in the table with the "formatted" category.
  You can filter the formatted data to run curve fitting with your own custom program.
  After the fit is successfully conducted and the model parameters are identified,
  data points on the interpolated fit curves are stored with the "fitted" category
  for visualization. The management of these data groups depends on the design of
  the curve analysis protocol, and the convention for category naming may
  differ in a particular analysis.

* The `analysis` is a string key representing the name of
  the analysis instance that generated the data point.
  This allows a user to combine multiple tables from different analyses without collapsing the data points.
  For a simple analysis class, all rows will have the same value,
  but a :class:`.CompositeCurveAnalysis` instance consists of
  nested component analysis instances containing statistically independent fit models.
  Each component is given a unique analysis name, and the datasets generated by each instance
  are merged into a single table stored in the outermost composite analysis, as sketched below.
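
A sketch of such a merge, using the dataframe round-trip this PR adds
(``table_a`` and ``table_b`` are hypothetical :class:`.ScatterTable` instances
produced by two different analyses):

.. code-block:: python

    import pandas as pd

    combined = ScatterTable.from_dataframe(
        pd.concat([table_a.dataframe, table_b.dataframe])
    )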

A user must be aware of this triplet to extract data points that belong to a
particular data subset. For example:

.. code-block:: python

    mini_table = table.filter(series="my_experiment1", category="raw", analysis="AnalysisA")
    mini_x = mini_table.x
    mini_y = mini_table.y

This operation is equivalent to:

.. code-block:: python

    mini_x = table.xvals(series="my_experiment1", category="raw", analysis="AnalysisA")
    mini_y = table.yvals(series="my_experiment1", category="raw", analysis="AnalysisA")

When an analysis only has a single model and the table is created from a single
analysis instance, the `series_id` and `analysis` are trivial, and you only need to
specify the `category` to get the data subset of interest.
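
For instance, assuming ``table`` holds the output of a standard single-model
analysis run, the formatted subset can be retrieved with a single keyword:

.. code-block:: python

    formatted = table.filter(category="formatted")
    x = formatted.x
    y = formatted.y
    y_err = formatted.y_err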

The full description of the :class:`.ScatterTable` columns is as follows:

- `xval`: Parameter scanned in the experiment. This value must be defined in the circuit metadata.
- `yval`: Nominal part of the outcome. The outcome is something like an expectation value,
  which is computed from the experiment result with the data processor.
- `yerr`: Standard error of the outcome, which is mainly due to sampling error.
- `series_name`: Human-readable name of the data series. This is defined by the ``data_subfit_map`` option in the :class:`.CurveAnalysis`.
- `series_id`: Integer corresponding to the name of the data series. This number is automatically assigned.
- `category`: A tag for the data group. This is defined by a developer of the curve analysis.
- `shots`: Number of measurement shots used to acquire a data point. This value can be defined in the circuit metadata.
- `analysis`: The name of the curve analysis instance that generated a data point.

This object helps an analysis developer write a custom analysis class
without the overhead of complex data management, and it helps end users
retrieve and reuse the intermediate data in their own fitting workflows
outside our curve fitting framework.
Note that a :class:`.ScatterTable` instance may be saved in the :class:`.ExperimentData` object as an artifact.
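
As a rough sketch of how rows enter the table (:meth:`add_row` appears in this PR;
the no-argument constructor and the ``shots`` keyword are assumptions here):

.. code-block:: python

    from qiskit_experiments.curve_analysis import ScatterTable

    table = ScatterTable()
    # Keyword names mirror the columns described above.
    table.add_row(
        xval=0.1,
        yval=0.153659,
        yerr=0.011258,
        series_name="A",
        series_id=0,
        category="raw",
        shots=1024,
        analysis="MyAnalysis",
    )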

.. _curve_analysis_workflow:

Curve Analysis workflow
@@ -271,67 +350,71 @@ the data processor in the analysis option is internally called.
This step consumes the input experiment results and creates a :class:`.ScatterTable` instance.
This table may look like:

.. jupyter-input::

    table = analysis._run_data_processing(experiment_data.data())
    print(table)

.. jupyter-output::

         xval      yval      yerr series_name  series_id category  shots    analysis
    0     0.1  0.153659  0.011258           A          0      raw   1024  MyAnalysis
    1     0.1  0.590732  0.015351           B          1      raw   1024  MyAnalysis
    2     0.1  0.315610  0.014510           A          0      raw   1024  MyAnalysis
    3     0.1  0.376098  0.015123           B          1      raw   1024  MyAnalysis
    4     0.2  0.937073  0.007581           A          0      raw   1024  MyAnalysis
    5     0.2  0.323415  0.014604           B          1      raw   1024  MyAnalysis
    6     0.2  0.538049  0.015565           A          0      raw   1024  MyAnalysis
    7     0.2  0.530244  0.015581           B          1      raw   1024  MyAnalysis
    8     0.3  0.143902  0.010958           A          0      raw   1024  MyAnalysis
    9     0.3  0.261951  0.013727           B          1      raw   1024  MyAnalysis
    10    0.3  0.830732  0.011707           A          0      raw   1024  MyAnalysis
    11    0.3  0.874634  0.010338           B          1      raw   1024  MyAnalysis

where the experiment consists of two subset series A and B, and the experiment parameter (xval)
is scanned from 0.1 to 0.3 in each subset. In this example, the experiment is run twice
for each condition.
See :ref:`data_management_with_scatter_table` for the details of the columns.

3. Formatting
^^^^^^^^^^^^^

Next, the processed dataset is converted into another format suited for the fitting.
By default, the formatter takes the average of the outcomes in the processed dataset
over the same x values, followed by sorting in ascending order of the x values.
This allows the analysis to easily estimate the slope of the curves to
create an algorithmic initial guess of the fit parameters.
A developer can inject extra data processing, for example, filtering, smoothing,
or elimination of outliers for better fitting.
The new `series_id` is given here so that its value corresponds to the fit model index
defined in this analysis class. This index mapping is done based upon the correspondence of
the `series_name` and the fit model name.

This is done by calling the :meth:`_format_data` method, which may return a new
scatter table object with additional rows like the following:

.. jupyter-input::

    table = analysis._format_data(table)
    print(table)

.. jupyter-output::

         xval      yval      yerr series_name  series_id   category  shots    analysis
    ...
    12    0.1  0.234634  0.009183           A          0  formatted   2048  MyAnalysis
    13    0.2  0.737561  0.008656           A          0  formatted   2048  MyAnalysis
    14    0.3  0.487317  0.008018           A          0  formatted   2048  MyAnalysis
    15    0.1  0.483415  0.010774           B          1  formatted   2048  MyAnalysis
    16    0.2  0.426829  0.010678           B          1  formatted   2048  MyAnalysis
    17    0.3  0.568293  0.008592           B          1  formatted   2048  MyAnalysis
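
As a cross-check, the averaged ``yval`` entries above can be reproduced with plain
pandas (a minimal sketch of the default behavior; it ignores the error propagation
that produces ``yerr``):

.. code-block:: python

    raw = table.filter(category="raw").dataframe

    # Average outcomes taken at the same x value within each series,
    # then sort by x in ascending order.
    averaged = (
        raw.groupby(["series_id", "xval"], as_index=False)
        .agg({"yval": "mean", "shots": "sum"})
        .sort_values(["series_id", "xval"])
    )

For example, series A at ``xval = 0.1`` gives ``(0.153659 + 0.315610) / 2 = 0.234634``,
matching row 12 above.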

The default :meth:`_format_data` method adds its output data with the category "formatted".
This category name must also be specified in the analysis option ``fit_category``.
If you override this method to do additional processing after the default formatting,
the ``fit_category`` analysis option can be set to a different category name, which is
then used to select the data passed to the fitting routine.
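
For example, a subclass whose override writes its final rows with a custom category
would be configured as follows (the category name here is arbitrary):

.. code-block:: python

    analysis.set_options(fit_category="custom")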
The (xval, yval) value in each row is passed to the corresponding fit model object
to compute residual values for the least-squares optimization.
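
Concretely, the optimizer minimizes the sum of squared weighted residuals
(standard weighted least squares, stated here for reference):

.. math::

    \chi^2 = \sum_i \left( \frac{y_i - f(x_i, \Theta)}{\sigma_i} \right)^2,

where :math:`f` is the fit model, :math:`\Theta` are the fit parameters, and
:math:`\sigma_i` is the ``yerr`` of each formatted data point.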

4. Fitting
1 change: 1 addition & 0 deletions qiskit_experiments/curve_analysis/__init__.py
@@ -39,6 +39,7 @@
.. autosummary::
    :toctree: ../stubs/

    ScatterTable
    SeriesDef
    CurveData
    CurveFitResult
74 changes: 39 additions & 35 deletions qiskit_experiments/curve_analysis/composite_curve_analysis.py
@@ -230,34 +230,35 @@ def _create_figures(
A list of figures.
"""
for analysis in self.analyses():
sub_data = curve_data[curve_data.group == analysis.name]
for name, data in list(sub_data.groupby("name")):
full_name = f"{name}_{analysis.name}"
group_data = curve_data.filter(analysis=analysis.name)
model_names = analysis.model_names()
for series_id, sub_data in group_data.iter_by_series_id():
full_name = f"{model_names[series_id]}_{analysis.name}"
# Plot raw data scatters
if analysis.options.plot_raw_data:
raw_data = data[data.category == "raw"]
raw_data = sub_data.filter(category="raw")
self.plotter.set_series_data(
series_name=full_name,
x=raw_data.xval.to_numpy(),
y=raw_data.yval.to_numpy(),
x=raw_data.x,
y=raw_data.y,
)
# Plot formatted data scatters
formatted_data = data[data.category == analysis.options.fit_category]
formatted_data = sub_data.filter(category=analysis.options.fit_category)
self.plotter.set_series_data(
series_name=full_name,
x_formatted=formatted_data.xval.to_numpy(),
y_formatted=formatted_data.yval.to_numpy(),
y_formatted_err=formatted_data.yerr.to_numpy(),
x_formatted=formatted_data.x,
y_formatted=formatted_data.y,
y_formatted_err=formatted_data.y_err,
)
# Plot fit lines
line_data = data[data.category == "fitted"]
line_data = sub_data.filter(category="fitted")
if len(line_data) == 0:
continue
fit_stdev = line_data.yerr.to_numpy()
fit_stdev = line_data.y_err
self.plotter.set_series_data(
series_name=full_name,
x_interp=line_data.xval.to_numpy(),
y_interp=line_data.yval.to_numpy(),
x_interp=line_data.x,
y_interp=line_data.y,
y_interp_err=fit_stdev if np.isfinite(fit_stdev).all() else None,
)

@@ -354,7 +355,7 @@ def _run_analysis(
metadata["group"] = analysis.name

table = analysis._format_data(analysis._run_data_processing(experiment_data.data()))
formatted_subset = table[table.category == analysis.options.fit_category]
formatted_subset = table.filter(category=analysis.options.fit_category)
fit_data = analysis._run_curve_fit(formatted_subset)
fit_dataset[analysis.name] = fit_data

@@ -376,32 +377,35 @@

if fit_data.success:
# Add fit data to curve data table
fit_curves = []
columns = list(table.columns)
model_names = analysis.model_names()
for i, sub_data in list(formatted_subset.groupby("class_id")):
xval = sub_data.xval.to_numpy()
for series_id, sub_data in formatted_subset.iter_by_series_id():
xval = sub_data.x
if len(xval) == 0:
# If data is empty, skip drawing this model.
# This is the case when a fit model exists but no data to fit is provided.
continue
# Compute X, Y values with fit parameters.
xval_fit = np.linspace(np.min(xval), np.max(xval), num=100)
yval_fit = eval_with_uncertainties(
x=xval_fit,
model=analysis.models[i],
xval_arr_fit = np.linspace(np.min(xval), np.max(xval), num=100, dtype=float)
uval_arr_fit = eval_with_uncertainties(
x=xval_arr_fit,
model=analysis.models[series_id],
params=fit_data.ufloat_params,
)
model_fit = np.full((100, len(columns)), np.nan, dtype=object)
fit_curves.append(model_fit)
model_fit[:, columns.index("xval")] = xval_fit
model_fit[:, columns.index("yval")] = unp.nominal_values(yval_fit)
yval_arr_fit = unp.nominal_values(uval_arr_fit)
if fit_data.covar is not None:
model_fit[:, columns.index("yerr")] = unp.std_devs(yval_fit)
model_fit[:, columns.index("name")] = model_names[i]
model_fit[:, columns.index("class_id")] = i
model_fit[:, columns.index("category")] = "fitted"
table = table.append_list_values(other=np.vstack(fit_curves))
yerr_arr_fit = unp.std_devs(uval_arr_fit)
else:
yerr_arr_fit = np.zeros_like(xval_arr_fit)
for xval, yval, yerr in zip(xval_arr_fit, yval_arr_fit, yerr_arr_fit):
table.add_row(
xval=xval,
yval=yval,
yerr=yerr,
series_name=model_names[series_id],
series_id=series_id,
category="fitted",
analysis=analysis.name,
)
analysis_results.extend(
analysis._create_analysis_results(
fit_data=fit_data,
@@ -416,11 +420,11 @@
analysis._create_curve_data(curve_data=formatted_subset, **metadata)
)

# Add extra column to identify the fit model
table["group"] = analysis.name
curve_data_set.append(table)

combined_curve_data = pd.concat(curve_data_set)
combined_curve_data = ScatterTable.from_dataframe(
pd.concat([d.dataframe for d in curve_data_set])
)
total_quality = self._evaluate_quality(fit_dataset)

# After the quality is determined, plot can become a boolean flag for whether