Save & load responses as parquet #8684

yngve-sk · 2024-09-12T06:37:23Z

Issue
A step towards combining datasets without xarray NaN artifacts and similar problems.

Approach
Read and write parquet files with polars.

Closes: #6525

Some benchmarking:

Drogon ahm main vs normal
                                           main      parquet
Open Manage Experiments
Open plotter                               3.9s      4.3s
Select FGORH (w all ensembles active)      1.8s      1.2s
Select w1 (w all ensembles active)         4s        3.2s
Select w2 (w all ensembles active)         4.1s      3.3s


SLOWPLOT case                 main      parquet
open GUI with migration       3m15s     5min
open GUI w/o migration        16s       15s
migrate to7                   2m59s     4m45s
open plotter                  51s       11s
Select summary vector         21.9s     7s
Open manage experiments       1s        1s
Select experiment             1s        1s
Select Ensemble-              >1s       >1s
Ensemble->Observations        12s       19s (slower)
Select realization            1s        6s

codecov-commenter · 2024-09-30T08:17:43Z

Codecov Report

Attention: Patch coverage is 98.24561% with 6 lines in your changes missing coverage. Please review.

Project coverage is 91.47%. Comparing base (d1c3a88) to head (c0fb05c).
Report is 11 commits behind head on main.

Files with missing lines	Patch %	Lines
src/ert/config/ert_config.py	73.33%	4 Missing ⚠️
src/ert/config/observations.py	95.83%	1 Missing ⚠️
src/ert/config/summary_config.py	93.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8684      +/-   ##
==========================================
+ Coverage   91.42%   91.47%   +0.05%     
==========================================
  Files         344      344              
  Lines       21120    21243     +123     
==========================================
+ Hits        19308    19433     +125     
+ Misses       1812     1810       -2

Flag	Coverage Δ
cli-tests	`39.58% <35.67%> (-0.05%)`	⬇️
gui-tests	`73.30% <54.67%> (-0.25%)`	⬇️
performance-tests	`50.15% <49.70%> (+<0.01%)`	⬆️
unit-tests	`80.24% <83.33%> (+0.14%)`	⬆️


oyvindeide

I think this improves readability and makes the responses more generic, which is good 👍

oyvindeide · 2024-10-02T07:02:03Z

tests/ert/unit_tests/data/test_integration_data.py

-    assert all(
-        fopr.data.columns.get_level_values("data_index").values == list(range(200))
-    )
+    # Why 210, not 200?


Outdated comment?

tests/ert/unit_tests/analysis/test_es_update.py

oyvindeide · 2024-10-02T07:05:39Z

...t_tests/analysis/snapshots/test_es_update/test_update_report/0-misfit_preprocess3/update_log

@@ -1,212 +1,212 @@
------------  -------------------  -----  -----  -----  -----  ------  -----  ------
-FOPR          2010-01-10T00:00:00  0.002  0.100  5.657  0.566   0.076  0.105  Active


To decrease your diff you could probably just fix the formatting here 😅 Not a big deal though, I see that it is only the formatting that changed.

oyvindeide · 2024-10-02T07:55:40Z

src/ert/analysis/_es_update.py

+        pivoted = responses_for_type.pivot(
+            on="realization",
+            index=["response_key", *response_cls.primary_key],
+            aggregate_function="mean",


What is the implication of mean?

It said so in the comment 😅

Will that be output somewhere? Is it possible to for example log it?

It is for the edge case where we end up with duplicate values for one response at one index, for example at a given time. In that case we need to aggregate them for the pivoted table to make sense; otherwise the index used to pivot contains duplicates. Taking the average of the duplicate response values at that timestep seems "close enough" to what we want. We could instead use min, max, median, first, etc., or make it configurable, but I'm not sure it would be interesting to users to do this?

Example from running test_that_duplicate_summary_time_steps_does_not_fail:

    responses_for_type.pivot(
        on="realization",
        index=["response_key", *response_cls.primary_key],
        aggregate_function="mean",
    )
    Out[9]:
    shape: (1, 5)
    ┌──────────────┬─────────────────────┬───────────┬────────┬──────────┐
    │ response_key ┆ time                ┆ 0         ┆ 1      ┆ 2        │
    │ ---          ┆ ---                 ┆ ---       ┆ ---    ┆ ---      │
    │ str          ┆ datetime[ms]        ┆ f32       ┆ f32    ┆ f32      │
    ╞══════════════╪═════════════════════╪═══════════╪════════╪══════════╡
    │ FOPR         ┆ 2014-09-10 00:00:00 ┆ -1.603837 ┆ 0.0641 ┆ 0.740891 │
    └──────────────┴─────────────────────┴───────────┴────────┴──────────┘

    responses_for_type
    Out[10]:
    shape: (4, 4)
    ┌─────────────┬──────────────┬─────────────────────┬───────────┐
    │ realization ┆ response_key ┆ time                ┆ values    │
    │ ---         ┆ ---          ┆ ---                 ┆ ---       │
    │ u16         ┆ str          ┆ datetime[ms]        ┆ f32       │
    ╞═════════════╪══════════════╪═════════════════════╪═══════════╡
    │ 0           ┆ FOPR         ┆ 2014-09-10 00:00:00 ┆ -1.603837 │
    │ 1           ┆ FOPR         ┆ 2014-09-10 00:00:00 ┆ 0.0641    │
    │ 2           ┆ FOPR         ┆ 2014-09-10 00:00:00 ┆ 0.740891  │
    │ 2           ┆ FOPR         ┆ 2014-09-10 00:00:00 ┆ 0.740891  │
    └─────────────┴──────────────┴─────────────────────┴───────────┘

Alternatively we could strive to achieve something like this:

    ┌──────────────┬─────────────────────┬───────────┬────────┬──────────┐
    │ response_key ┆ time                ┆ 0         ┆ 1      ┆ 2        │
    │ ---          ┆ ---                 ┆ ---       ┆ ---    ┆ ---      │
    │ str          ┆ datetime[ms]        ┆ f32       ┆ f32    ┆ f32      │
    ╞══════════════╪═════════════════════╪═══════════╪════════╪══════════╡
    │ FOPR         ┆ 2014-09-10 00:00:00 ┆ -1.603837 ┆ 0.0641 ┆ 0.740891 │
    │ FOPR         ┆ 2014-09-10 00:00:00 ┆ NaN       ┆ NaN    ┆ 0.740891 │
    └──────────────┴─────────────────────┴───────────┴────────┴──────────┘

Could be logged / given as a warning somehow, I'm not so familiar with when/why it happens, which may be relevant to what the warning/logging message should be.

(Performance-wise, it might be slow to always check whether some values were aggregated. An alternative is a naive try-catch around the pivot, since a pivot without aggregation will pass if there are no duplicate values.)

If there is a good, somewhat performant way of warning the user this has happened, that would be good. My hunch is that this would typically happen in pressure tests where the time resolution is quite high, and the simulator does not have the same resolution.

Would it be OK to do this in a separate PR? I think the try-catch, first trying without an aggregation, then trying with one, should be easy to add / easy to remove if it turns out to have bad side effects. Should maybe be tested as its own thing just to be sure.

oyvindeide · 2024-10-03T08:16:11Z

src/ert/analysis/_es_update.py

+        # We need to either assume that if there is a time column
+        # we will approx-join that, or we could specify in response configs
+        # that there is a column that requires an approx "asof" join.
+        # Suggest we simplify and assume that there is always only


Agree, if and when we add new response types where this might be relevant we can add it then.

oyvindeide · 2024-10-03T08:17:39Z

src/ert/config/ert_config.py

-        self.observations: Dict[str, xr.Dataset] = self.enkf_obs.datasets
+        self.observations: Dict[str, polars.DataFrame] = self.enkf_obs.datasets
+
+    def write_observations_to_folder(self, dest: Path) -> None:


This is a nitpick, but should this function be here? Maybe it belongs with the observations?

Moved it to enkf_obs

oyvindeide · 2024-10-03T08:17:59Z

src/ert/config/gen_data_config.py

-                },
+            return polars.DataFrame(
+                {
+                    "report_step": polars.Series(


This made it much easier to read!

oyvindeide · 2024-10-03T08:19:24Z

src/ert/config/observation_vector.py

        if self.observation_type == EnkfObservationImplementationType.GEN_OBS:
-            datasets = []
+            actual_response_key = self.data_key


Just use self.data_key directly? Same on the next line, seems it is only used once.

oyvindeide · 2024-10-03T08:20:48Z

src/ert/config/observations.py

@@ -61,8 +80,12 @@ def __getitem__(self, key: str) -> ObsVector:
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, EnkfObs):
            return False
+
+        if self.datasets.keys() != other.datasets.keys():


Isn't this duplicated in ErtConfig?

Appears so, but this is for the EnkfObs, and in ErtConfig it is for the dict mapping response type to obs ds. Long-term we should maybe cut out EnkfObs and only keep the dict, but right now it is a bit duplicated and necessary.

oyvindeide

LGTM! Nice job, just some minor comments.

oyvindeide · 2024-10-05T17:23:00Z

tests/ert/unit_tests/config/observations_generator.py

@@ -183,6 +183,9 @@ def summary_observations(
        "error_mode": draw(st.sampled_from(ErrorMode)),
        "value": draw(positive_floats),
    }
+
+    assume(kws["error_mode"] == ErrorMode.ABS or kws["error"] < 2)


This is in a separate commit, but I think it affects logic from the first commit? If so they should be squashed so the tests pass on all commits.

oyvindeide · 2024-10-05T17:23:58Z

tests/ert/unit_tests/scenarios/test_summary_response.py

@@ -236,3 +236,36 @@ def test_that_mismatched_responses_gives_nan_measured_data(ert_config, prior_ens
    assert pd.isna(fopr_2.loc[0].iloc[0])
    assert pd.isna(fopr_2.loc[1].iloc[0])
    assert pd.isna(fopr_1.loc[2].iloc[0])
+
+
+def test_reading_past_2263_is_ok(ert_config, storage, prior_ensemble):


This should be squashed into the previous commit as the bug is fixed there, and so the test belongs alongside that. Feel free to write a longer commit body for the first commit explaining the reason behind this change and the implications.

* Datetime reading past 2263 should now work, added test asserting that it does work
* Enforced f32 precision for observations & responses

yngve-sk marked this pull request as draft September 12, 2024 06:37

yngve-sk force-pushed the responses-as-parquet branch 29 times, most recently from 3e063f7 to 378376d Compare September 17, 2024 08:21

yngve-sk force-pushed the responses-as-parquet branch 6 times, most recently from 31a824e to 0d8ddd4 Compare September 30, 2024 08:02

yngve-sk self-assigned this Sep 30, 2024

yngve-sk force-pushed the responses-as-parquet branch from c2309fa to 7b8976a Compare October 1, 2024 07:10

oyvindeide reviewed Oct 2, 2024


yngve-sk force-pushed the responses-as-parquet branch from 18d2de1 to 2bc6bf8 Compare October 2, 2024 12:23

oyvindeide reviewed Oct 3, 2024


yngve-sk force-pushed the responses-as-parquet branch 4 times, most recently from 2769d67 to 41aae5d Compare October 4, 2024 10:35

yngve-sk mentioned this pull request Oct 4, 2024

Update observation keys getter equinor/semeio#648

Merged

oyvindeide approved these changes Oct 5, 2024


Polars&parquet for responses and scaling factors

c0fb05c

* Datetime reading past 2263 should now work, added test asserting that it does work
* Enforced f32 precision for observations & responses

yngve-sk force-pushed the responses-as-parquet branch from 6086201 to c0fb05c Compare October 7, 2024 08:10

yngve-sk merged commit a52cebf into equinor:main Oct 7, 2024
55 of 56 checks passed

yngve-sk removed the release-notes:unreleased-feature-changes label Oct 7, 2024

oyvindeide mentioned this pull request Nov 14, 2024

ValueError: unable to infer dtype on variable 'time'; xarray cannot serialize arbitrary Python objects #9194

Closed

12 tasks
