
[nexus] deflake test_instance_watcher_metrics #5768

Merged 1 commit into main on May 15, 2024
Conversation

hawkw (Member) commented May 14, 2024

Presently, `test_instance_watcher_metrics` will wait for the
`instance_watcher` background task to have run before making assertions
about metrics, but it does *not* ensure that oximeter has actually
collected those metrics. This can result in flaky failures (see #5752).

This commit adds explicit calls to `oximeter.force_collect()` prior to
making assertions, to ensure that the latest metrics have been
collected.

Fixes #5752
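
For context, a minimal sketch of the pattern this change adds inside the test; `cptestctx`, the `activate_instance_watcher` closure, and the `timeseries_query` helper come from the existing test code, while the query string below is only a placeholder, not the actual timeseries name the test queries:

```rust
// Sketch only: activate the background task, then force an oximeter
// collection so that the assertions below see the latest samples.
activate_instance_watcher().await;
// Make sure that the latest metrics have been collected.
oximeter.force_collect().await;
// Placeholder query; the real test queries its actual timeseries here.
let tables = timeseries_query(cptestctx, "get some_target:some_metric").await;
assert!(!tables.is_empty());
```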

iliana (Contributor) commented May 14, 2024

(should be good to rebase now)

hawkw enabled auto-merge (squash) May 14, 2024 23:42
hawkw requested review from bnaecker, iliana, and sunshowers May 15, 2024 00:27
hawkw merged commit 7566128 into main May 15, 2024
20 checks passed
hawkw deleted the eliza/deflake branch May 15, 2024 01:05
Comment on lines +403 to 405:

```rust
    // Make sure that the latest metrics have been collected.
    oximeter.force_collect().await;
};
```
A Contributor commented:
I'm looking at this change since it didn't seem to help with the flakiness -- I don't know that this would do anything? The line immediately following each use of the `activate_instance_watcher` closure is an awaited call to `timeseries_query`, which performs a `force_collect` first thing:

```rust
pub async fn timeseries_query(
    cptestctx: &ControlPlaneTestContext<omicron_nexus::Server>,
    query: impl ToString,
) -> Vec<oximeter_db::oxql::Table> {
    // first, make sure the latest timeseries have been collected.
    cptestctx.oximeter.force_collect().await;
    // ... (rest of the helper elided in this excerpt)
```
hawkw (Member, Author) replied:
Huh, okay, I'm not sure if I understand why the test was flaky in the first place, then! I'll have to keep digging.

bnaecker (Collaborator) commented:

Oh, yeah I definitely missed that. I've honestly found it frustrating to write metrics tests that assert exact equalities like this. There are just too many programs that the data has to move through, all of which are running a mess of Tokio tasks and the like (the producer, oximeter, ClickHouse, the test program, etc.). I've generally used inequalities if one wants to check things like the amount of data; or checks that there are samples with timestamps after a certain point; or some value exists in the data stream; etc.
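
For illustration, a rough sketch of the lower-bound style of check described above; `expected_min` and the query string are placeholders rather than the actual test code:

```rust
// Sketch only: assert a lower bound on the number of timeseries returned,
// rather than an exact count, so an extra collection cannot fail the test.
let tables = timeseries_query(cptestctx, "get some_target:some_metric").await;
assert!(
    tables.len() >= expected_min,
    "expected at least {expected_min} timeseries, found {}",
    tables.len(),
);
```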

hawkw added a commit that referenced this pull request May 16, 2024
The test `integration_tests::metrics::test_instance_watcher_metrics`
remains flaky even after adding an explicit call to
`Oximeter::force_collect` to ensure that metrics have been collected. I
believe this is because, if the test runs long enough, the
`instance_watcher` background task may be activated by its timer, causing
metrics to be collected an additional time on top of the test's explicit
activations. This can cause flaky failures when we then assert that
exactly a certain number of timeseries were counted.

This branch changes the test to make assertions based on inequality,
instead. Now, we assert that the timeseries has *at least* the expected
count, so if the `instance_watcher` task has collected instance metrics
an additional time, we can tolerate that. We're still able to assert
that at least the expected counts are present. This is based on the
approach suggested by @bnaecker in [this comment][1].

I've re-run the test five times on my machine, and it appears to always
pass. Hopefully, this should actually fix #5752, but we probably
shouldn't close the issue until this has made it through a few CI
runs...

[1] #5768 (comment)
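
As a rough illustration of the direction described above (not the actual diff; `count` and `expected` are placeholder names):

```rust
// Before: exact equality, which fails if the `instance_watcher` timer
// fires and triggers an extra collection during the test.
// assert_eq!(count, expected);

// After: a lower bound, which tolerates an additional activation while
// still confirming that at least the expected samples are present.
assert!(count >= expected, "expected at least {expected}, got {count}");
```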
Development

Successfully merging this pull request may close these issues.

test flake: omicron-nexus::test_all integration_tests::metrics::test_instance_watcher_metrics