-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Measure metrics latency #2323
base: main
Are you sure you want to change the base?
Measure metrics latency #2323
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2323 +/- ##
=====================================
Coverage 79.5% 79.5%
=====================================
Files 123 123
Lines 21492 21492
=====================================
Hits 17094 17094
Misses 4398 4398 ☔ View full report in Codecov by Sentry. |
stress/src/metrics_latency.rs
Outdated
// collect threads in short intervals | ||
let mut total_count = 0; | ||
while threads.iter().any(|t| !t.is_finished()) { | ||
// collect agressively so we could measure inserts properly, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to understand what this is simulating? Metric collection interval is by default 30 sec or 60 sec in the spec. What is the scenario for collecting every 500 microsec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically idea, is to choose collection interval so it wouldn't affect numbers (p99) but still allow to test updates and inserts (for insert I need to clean it regularly, but not too frequent that updates would become inserts).
I updated code with explanation why this number... it could be different, but it should be good enough so it would work in all cases without affecting p99.
( Sorry for rebase,-force push, first commit was a bit rushed... I wasn't happy with, now it's better).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I am not following the scenario being tested - are we expecting that users will run collect() every few microseconds? I don't think that is something we need to consider for performance optimizations.
Also, I opened #2328 to track revisiting the aggressive cleaning up of HashMap.
e0df038
to
aa55bb8
Compare
})); | ||
} | ||
let mut total_count = 0; | ||
while start.elapsed() < Duration::from_secs(1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the goal is to measure the impact of Collect to overall throughput, lets add the existing PeriodicReader with a dummy exporter to the existing stress test. The PeriodicReader interval should be the defaults (60 secs) - or we can shorten to a low number like 5 sec, but definitely not micro-secs!
(I don't expect adding Reader to have any measurable impact to stress test.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is opposite,- collect should be invisible in the statistics, but keep in mind, that this number is almost arbitrary... and I should probably have better implementation, but for the starters it already shows meaningful numbers :)
So with 300micro-sec... it will clear around 1800-2500times per second (time to collect also adds up).
For p99, I care about 99% of measurements, and can ignore the rest. Which means, that for p99 it will have no effect on updates, if I measure more than 2500*99 =~250K/s, for my case I'm always over several millions, so collecting doesn't have any affect on statistics.
For inserts situation is different... I need to avoid hitting overflow (2k), which means that at most I can reach around 2M (which might be the case with better hardware...), so I basically hope that I clear frequently enough, to not hit 2k (overflow), but not too frequent, so that it will greatly affect p99 (but it probably affects in some scenarios... so this is not ideal).
Probably I should implement some sort of dynamic/adaptive interval, so that it wouldn't clean to frequently (so it wouldn't affect more than 1% of measurements for p99, but not too rarely to hit overflow (2k).
In other words, if I collect after less than 100, then it will affect p99, but if I collect after 2k, then for inserts it might reached overflow.
I could also implement different strategies for each case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate on the scenario being covered? Without understanding the actual scenario, its hard to provide useful feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I update revision a bit. Generally, the purpose of this PR is to simply look how much time it takes for individual calls to measure things under various conditions, that's it :)
There's something to learn too :) For example, I have learned, that even simple update (no-attributes) under high load (lots of threads) can take as little as 60ns, or as much as 80ms, it's more than 1000
000 time difference!
I still cannot explain why some times it takes so long... I tried to do "warmup" but, at least on my machine, this is consistently reproducible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fraillt for the great work, as always!
It is very important to ensure that the update time is predictable and never causes thread to get blocked etc. which will defeat the whole point of relying on OpenTelemetry to report metrics (if OTel itself can cause a thread to lose its CPU slice and mess up the duration!!). I don't think we'll fix this for 1.0 Metrics release, so I plan to call this out in docs. My current guess is.- some form of Bound instruments will solve this problem.
Let me re-check this PR to see if we want to keep this for measuring this scenario. The current behavior of aggressively draining the hash map is working against us, as that will force some updates to need write lock, effectively causing other threads to potentially get blocked...Something #2328 can address..
3b7fbe8
to
7ae7647
Compare
7ae7647
to
8898ac5
Compare
KeyValue::new("url.scheme", "not_found"), | ||
KeyValue::new("error.type", 404), | ||
KeyValue::new("http.response.status_code", 404), | ||
KeyValue::new("http.route", "testing/metrics/latency"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one interesting thing is - the route may not be readily available as a `static str always, forcing users to String allocate to report this!
#2089 (comment) is trying to fix that scenario, hopefully we can make it happen for 1.0 stable of Metrics!
Tests how much time it takes to call
.add(value, attribs)
under various conditions.It measures several things: average time, median time (p50), p95 and p99.
The way it works is pretty simple:
.add(value, attribs)
on several threads in the loop and measure each callHere's the results on my machine on main branch.
See results.
Merge requirement checklist
CHANGELOG.md
files updated for non-trivial, user-facing changes