
[nexus] deflake test_instance_watcher_metrics #5768

Merged 1 commit into main on May 15, 2024
Conversation

hawkw (Member) commented May 14, 2024

Presently, `test_instance_watcher_metrics` will wait for the
`instance_watcher` background task to have run before making assertions
about metrics, but it does *not* ensure that oximeter has actually
collected those metrics. This can result in flaky failures (see #5752).

This commit adds explicit calls to `oximeter.force_collect()` prior to
making assertions, to ensure that the latest metrics have been
collected.

Fixes #5752
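
For context, a minimal sketch of the pattern this change adds inside the test; `cptestctx`, the `activate_instance_watcher` closure, and the `timeseries_query` helper come from the existing test code, while the query string below is only a placeholder, not the actual timeseries name the test queries:

```rust
// Sketch only: activate the background task, then force an oximeter
// collection so that the assertions below see the latest samples.
activate_instance_watcher().await;
// Make sure that the latest metrics have been collected.
oximeter.force_collect().await;
// Placeholder query; the real test queries its actual timeseries here.
let tables = timeseries_query(cptestctx, "get some_target:some_metric").await;
assert!(!tables.is_empty());
```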

iliana (Contributor) commented May 14, 2024

(should be good to rebase now)

hawkw enabled auto-merge (squash) May 14, 2024 23:42
hawkw requested review from bnaecker, iliana, and sunshowers May 15, 2024 00:27
hawkw merged commit 7566128 into main May 15, 2024
20 checks passed
hawkw deleted the eliza/deflake branch May 15, 2024 01:05
Comment on lines +403 to 405:

```rust
    // Make sure that the latest metrics have been collected.
    oximeter.force_collect().await;
};
```
A Contributor commented:
I'm looking at this change since it didn't seem to help with the flakiness -- I don't know that this would do anything? The line immediately following each use of the `activate_instance_watcher` closure is an awaited call to `timeseries_query`, which performs a `force_collect` first thing:

```rust
pub async fn timeseries_query(
    cptestctx: &ControlPlaneTestContext<omicron_nexus::Server>,
    query: impl ToString,
) -> Vec<oximeter_db::oxql::Table> {
    // first, make sure the latest timeseries have been collected.
    cptestctx.oximeter.force_collect().await;
    // ... (rest of the helper elided in this excerpt)
```
hawkw (Member, Author) replied:
Huh, okay, I'm not sure if I understand why the test was flaky in the first place, then! I'll have to keep digging.

bnaecker (Collaborator) commented:

Oh, yeah I definitely missed that. I've honestly found it frustrating to write metrics tests that assert exact equalities like this. There are just too many programs that the data has to move through, all of which are running a mess of Tokio tasks and the like (the producer, oximeter, ClickHouse, the test program, etc.). I've generally used inequalities if one wants to check things like the amount of data; or checks that there are samples with timestamps after a certain point; or some value exists in the data stream; etc.
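
For illustration, a rough sketch of the lower-bound style of check described above; `expected_min` and the query string are placeholders rather than the actual test code:

```rust
// Sketch only: assert a lower bound on the number of timeseries returned,
// rather than an exact count, so an extra collection cannot fail the test.
let tables = timeseries_query(cptestctx, "get some_target:some_metric").await;
assert!(
    tables.len() >= expected_min,
    "expected at least {expected_min} timeseries, found {}",
    tables.len(),
);
```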

hawkw added a commit that referenced this pull request May 16, 2024
The test `integration_tests::metrics::test_instance_watcher_metrics`
remains flaky even after adding an explicit call to
`Oximeter::force_collect` to ensure that metrics have been collected. I
believe this is because, if the test runs long enough, the
`instance_watcher` background task may be activated by its timer, causing
metrics to be collected an additional time on top of the test's explicit
activations. This can cause flaky failures when we then assert that
exactly a certain number of timeseries were counted.

This branch changes the test to make assertions based on inequality,
instead. Now, we assert that the timeseries has *at least* the expected
count, so if the `instance_watcher` task has collected instance metrics
an additional time, we can tolerate that. We're still able to assert
that at least the expected counts are present. This is based on the
approach suggested by @bnaecker in [this comment][1].

I've re-run the test five times on my machine, and it appears to always
pass. Hopefully, this should actually fix #5752, but we probably
shouldn't close the issue until this has made it through a few CI
runs...

[1] #5768 (comment)
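
As a rough illustration of the direction described above (not the actual diff; `count` and `expected` are placeholder names):

```rust
// Before: exact equality, which fails if the `instance_watcher` timer
// fires and triggers an extra collection during the test.
// assert_eq!(count, expected);

// After: a lower bound, which tolerates an additional activation while
// still confirming that at least the expected samples are present.
assert!(count >= expected, "expected at least {expected}, got {count}");
```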
Development

Successfully merging this pull request may close these issues.

test flake: omicron-nexus::test_all integration_tests::metrics::test_instance_watcher_metrics