-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[nexus] deflake test_instance_watcher_metrics
#5768
Conversation
(should be good to rebase now) |
Presently, `test_instance_watcher_metrics` will wait for the `instance_watcher` background task to have run before making assertions about metrics, but it does *not* ensure that oximeter has actually collected those metrics. This can result in flaky failures --- see #5752. This commit adds explicit calls to `oximeter.force_collect()` prior to making assertions, to ensure that the latest metrics have been collected. Fixes #5752
// Make sure that the latest metrics have been collected. | ||
oximeter.force_collect().await; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking at this change since it didn't seem to help with the flakiness -- I don't know that this would do anything? The line immediately following each use of the activate_instance_watcher
closure is an awaited call to timeseries_query
, which performs a force_collect
first thing:
omicron/nexus/tests/integration_tests/metrics.rs
Lines 283 to 288 in e0a1184
pub async fn timeseries_query( | |
cptestctx: &ControlPlaneTestContext<omicron_nexus::Server>, | |
query: impl ToString, | |
) -> Vec<oximeter_db::oxql::Table> { | |
// first, make sure the latest timeseries have been collected. | |
cptestctx.oximeter.force_collect().await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, okay, I'm not sure if I understand why the test was flaky in the first place, then! I'll have to keep digging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah I definitely missed that. I've honestly found it frustrating to write metrics tests that assert exact equalities like this. There are just too many programs that the data has to move through, all of which are running a mess of Tokio tasks and the like (the producer, oximeter
, ClickHouse, the test program, etc.). I've generally used inequalities if one wants to check things like the amount of data; or checks that there are samples with timestamps after a certain point; or some value exists in the data stream; etc.
The test `integration_tests::metrics::test_instance_watcher_metrics` remains flaky even after adding an explicit call to `Oximeter::force_collect` to ensure that metrics have been collected. I believe this is due to the fact that, if the test runs long enough, the `instance_watcher` background may be activated by its timer, causing metrics to be collected another time, in addition to the test's explicit activations. This can cause flaky failures when we then assert that there is exactly a certain number of timeseries counted. This branch changes the test to make assertions based on inequality, instead. Now, we assert that the timeseries has *at least* the expected count, so if the `instance_watcher` task has collected instance metrics an additional time, we can tolerate that. We're still able to assert that at least the expected counts are present. This is based on the approach suggested by @bnaecker in [this comment][1]. I've re-run the test five times on my machine, and it appears to always pass. Hopefully, this should actually fix #5752, but we probably shouldn't close the issue until this has made it through a few CI runs... [1] #5768 (comment)
Presently,
test_instance_watcher_metrics
will wait for theinstance_watcher
background task to have run before making assertionsabout metrics, but it does not ensure that oximeter has actually
collected those metrics. This can result in flaky failures --- see
#5752.
This commit adds explicit calls to
oximeter.force_collect()
prior tomaking assertions, to ensure that the latest metrics have been
collected.
Fixes #5752