Add support for publishing kernel statistics to oximeter #4355
Conversation
This PR is still a bit of a WIP, but it's large and complicated enough that I'd like to get some eyes on it before going any further. It does the minimum needed to get data from illumos's kstat framework into oximeter.
@rmustacc and others and I have tried a bunch to put together some macro-trickery which would make this much easier to do. For example, it'd be nice for the sled agent to consume a file in some DSL at startup, and use that to determine which kstats to collect. That may still be useful, but I am not sure how to do it! I can dig into why it's hard, if folks are curious, but it's basically a mismatch between doing work at runtime and at compile time.
I've included a few tests of the new code. Here's an example of querying the collected data with oxdb, and comparing against kstat itself:
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].timeseries_name'
"physical_data_link:bytes_received"
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].target'
{
"name": "physical_data_link",
"fields": [
{
"name": "hostname",
"value": {
"type": "string",
"value": "shale"
}
},
{
"name": "link_name",
"value": {
"type": "string",
"value": "net0"
}
},
{
"name": "rack_id",
"value": {
"type": "uuid",
"value": "399e3fa2-a73e-4b04-8595-b1f890b640d5"
}
},
{
"name": "serial",
"value": {
"type": "string",
"value": "935aaf0b412a4e248513829916ac784d"
}
},
{
"name": "sled_id",
"value": {
"type": "uuid",
"value": "7bfb8206-e907-4bd1-a420-50e187f9975b"
}
}
]
}
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].metric'
{
"name": "bytes_received",
"fields": [],
"datum_type": "cumulative_u64"
}
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].measurements[-1]'
{
"timestamp": "2023-10-25T23:13:18.396810853Z",
"datum": {
"type": "cumulative_u64",
"datum": {
"start_time": "2023-10-25T22:04:15.851667324Z",
"value": 476145
}
}
}
bnaecker@shale : ~/omicron $ kstat link:0:net0:rbytes64
module: link instance: 0
name: net0 class: net
rbytes64 479609
There are really two big pieces I'd love input on. First, I want to make sure that we have the right set of fields on these timeseries. For physical links and virtual links not tied to a guest, we are recording the fields shown in the target above: the hostname, link name, rack ID, sled serial, and sled ID.
For guest links, we record those, plus also the project UUID and instance UUID. Those are not tracked yet, so we can easily change them. That just requires many more changes, and I was hoping to defer it a bit.
The second big item is around the actual mechanics of tracking these kstats, especially handling errors. I'm really not sure how we'll want to do that, especially for things like datalinks that may be torn down asynchronously from the kstat sampler itself. I added a way for consumers to ask about the status of their tracked target, or remove it. In addition, the code currently requires that one specifies a policy around errors at the time the target is registered. This lets folks decide if they want to (1) keep tracking forever; (2) stop after a period of failures; or (3) stop after a number of attempts. The fact that the target "expired" is available through the status method, until someone removes it or registers a new one.
Anyway, that's all a bit in the weeds. But I was pretty unsure how we're going to end up using this, and didn't want to have things continue to be sampled indefinitely without explicitly opting in to that behavior. I'd love input here though, since it's certainly a bit convoluted.
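For illustration, the three policies could be captured by an enum along these lines; this is a hedged sketch, and the type and variant names are hypothetical rather than the PR's actual API.

```rust
use std::time::Duration;

/// Hypothetical sketch of a per-target error policy with the three options
/// described above; names are illustrative, not the PR's exact types.
pub enum ExpirationBehavior {
    /// Keep sampling the target forever, regardless of errors.
    Never,
    /// Stop sampling once errors have persisted for this long.
    AfterDuration(Duration),
    /// Stop sampling after this many consecutive failed attempts.
    AfterAttempts(usize),
}
```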
This is great! It'll be really nice to have these stats recorded into ClickHouse 🎉
Looks good to me! Although, I think it'd be a good idea to wait for a review from someone who knows more about kstats than I do before merging.
Friendly ping on this one @jgallagher @rmustacc @gjcolombo! I also wanted to make clear that nearly 3k of the 4k lines in this PR are just the OpenAPI spec; I swear it's not as massive as it looks.
Thanks for putting this together! We're going to need a small truckload of memory kstats soon, so it's great to have all this support available.
warn!(
    self.log,
    "sample queue is too full, dropping oldest samples";
    "n_new_samples" => n_new_samples,
    "n_current_samples" => n_current_samples,
    "n_overflow_samples" => n_overflow_samples,
);
Will it be obvious from looking at metrics later that this has happened? Would it make any sense to try to produce a separate "number of dropped samples" metric that indicates whether any providers have been particularly noisy in this way?
Great question, and no, I don't think it will be easily visible anywhere outside this log right now. One would need to infer it from a gap in the samples for a particular timeseries, or be looking at this log file. Those are both terrible, and generating our own timeseries to track these drops seems like a very good idea to surface it. I'll experiment with publishing that as well, since we'll mostly want to think about what fields to attach to such a timeseries.
Another point this raises for me is that noisy producers can negatively impact well-behaved ones, since they share the same sample queue. That interaction is not good; giving each tracked target its own queue would avoid it. I think I have a reasonable idea for how to implement that in a way that doesn't complicate the locking story too much.
Thanks for bringing this up!
Ok, I've done the per-target queue work in b6eb2dc. I am going to go back and experiment with publishing our own self-stats here next.
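A per-target queue with drop accounting might look roughly like the sketch below; the names and structure here are illustrative, not the actual change in b6eb2dc.

```rust
use std::collections::VecDeque;

/// Hypothetical per-target sample queue: bounded, dropping the oldest
/// samples on overflow and counting the drops so they could be published
/// as a "dropped samples" self-stat timeseries.
struct PerTargetQueue<T> {
    samples: VecDeque<T>,
    capacity: usize,
    /// Running count of samples dropped from this target's queue.
    dropped: u64,
}

impl<T> PerTargetQueue<T> {
    fn new(capacity: usize) -> Self {
        Self { samples: VecDeque::with_capacity(capacity), capacity, dropped: 0 }
    }

    fn push(&mut self, sample: T) {
        // When full, drop the oldest sample and record the drop, so a noisy
        // target only ever affects its own queue.
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
            self.dropped += 1;
        }
        self.samples.push_back(sample);
    }
}
```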
let kstats = ctl
    .iter()
    .filter(|kstat| sampled_kstat.target.interested(kstat))
    .zip(sampled_kstat.creation_times.iter())
I'm worried about the math here. IIUC, the number of elements returned from `ctl.iter().filter()` depends on the number of kstats that meet the chosen registrant's criteria right now, i.e., at sampling time. But `creation_times` is initialized once in `add_target`, when there might have been a different number of stats. If the former length exceeds the latter then this won't call `read` on the "extra" elements. Is there something else guaranteeing these are equal that I'm overlooking?
(The new tests in `link.rs` only create one test etherstub apiece and always do it before adding any targets, so I'm not sure they cover this case.)
That's a good catch, it's definitely possible to be zipping different numbers of elements here.
I think this hits on one of my difficulties on this PR. Kernel statistics publish their creation time as an `hrtime_t`, which is really only useful as a relative timestamp. I do some work to convert it into UTC here, which I am frankly not a huge fan of. That is intentionally done once at registration time, because there's enough slop in the conversions that it starts to matter. Cumulative counters always include a start time, and that'll move around by a lot, up to a few milliseconds, in my experimentation. There was actually an illumos issue to report the boot time in UTC, but I'm struggling to find it now. Maybe @jclulow remembers where that is, since I believe he filed it?
Registering them once was my tradeoff, but you're correct that it's wrong! I'll add a regression test for this first off. Now, what to do about it? One option is to map directly from the interested kstats to the creation times, which might be a bit tricky. A second would be to compute the creation time on the fly each time. I could just be doing the math wrong, but I worry about that wandering around a bunch and confusing anyone looking at it downstream.
One other idea that occurs to me: we could pretend to start recording the value of the kstat when it's registered (or when we see interest in it), and compute the cumulative values over that time interval. That would basically mean we ignore history prior to the registration / interest in the kstat, but it means we can just compute a `Utc::now()` all the time.
I think another possibility is to call `KstatTarget::interested` only once, at registration time, not at sampling time. I'm sure there are others, but I'll need to stew on the tradeoffs.
Thanks for flagging this, definitely an issue.
Meant to leave an additional note: the changes in service of my first round of comments all look great--thanks for picking those up! There are a couple more things I had questions about that aren't related to those updates but that I only noticed on my second lap.
Thanks for the kind words, and the feedback, all really helpful!
I had an idea for solving the timestamp issue. What I'm really trying to do is consistently map between the system high-res timer and UTC. I don't know if it's possible to always do that, but we can make it easier.
We could take a UTC timestamp and a high-res time just once, when we create the sampler object. Then we have a consistent reference for each domain. When we create samples, we can take the snap time from the stat and compute the difference to the reference high-res time in nanoseconds, then offset the UTC timestamp by the same amount.
That's the same logic we already use, but we can make a reference point in each domain once, to avoid precision issues and make sure we have one start time for every stat.
I think that should work anyway, I'll give it a go tomorrow!
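In code, the single-reference mapping being described might look roughly like this; a minimal sketch only, assuming the caller can read the high-resolution counter (e.g. via gethrtime(3C)) right next to `Utc::now()`, and the names are illustrative rather than the PR's actual types.

```rust
use chrono::{DateTime, Duration, Utc};

/// One (UTC, hrtime) pair captured together when the sampler starts,
/// giving a single reference point shared by every kstat.
struct TimeReference {
    utc: DateTime<Utc>,
    hrtime: i64, // nanoseconds from the system high-resolution timer
}

impl TimeReference {
    /// Capture both clocks as close together as possible. Obtaining the
    /// hrtime value is platform-specific and left to the caller here.
    fn new(hrtime_now: i64) -> Self {
        Self { utc: Utc::now(), hrtime: hrtime_now }
    }

    /// Map another hrtime value (a kstat's creation or snap time) into UTC
    /// by offsetting from the shared reference point.
    fn hrtime_to_utc(&self, hrtime: i64) -> DateTime<Utc> {
        self.utc + Duration::nanoseconds(hrtime - self.hrtime)
    }
}
```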
I'm going to work on getting the tests to support manual advance / pause of time, and tackle the hrtime -> UTC mapping after that. It's probably more subtle than I thought, and may take some time to think about and experiment with approaches.
The dropped stat reporting and new test code LGTM, though I'm a little bummed out that my suggestion about the test produced the `Instant`/`DateTime` split in the actual code. I'll think about that a bit more, though I suspect the best answer is ultimately going to involve using `Instant` in most places and converting to `DateTime` using some reference time stored when the sampler is started--i.e., what I think you're trying to do with timestamps anyway.
Thanks @gjcolombo. I've been thinking more about the timestamp issue, and I don't think there's an obviously correct way to handle it. The issue is that there can be drift between the high-res clock and UTC, due to things like NTP or slightly different time resolution. That means that mapping an `hrtime_t` to UTC can give slightly different results depending on when the conversion is done.
As an example, consider creating a "timestamp reference", with a UTC timestamp and an hrtime taken at (nearly) the same instant. If we make a new reference each time we convert, the derived start times will wander as the two clocks drift. On the other hand, if we create one reference, then we hit the issue you pointed out: that may not be the right reference for two kstats that are created at different times.
I think the only way around this is to require that the implementer of `KstatTarget::to_samples()` use the reference we hand them to derive start times consistently. I hope that makes sense. Without the OS itself storing a notion of its boot time in UTC, it's surprisingly difficult to map between the hrtime domain and UTC.
FWIW, I think we do store the UTC boottime in a variable in the kernel. It doesn't get adjusted by the kernel to account for NTP today, but to be honest I'm not sure there's any reason that we couldn't do that when we get adjustments fed to us to adjust the actual clock.
@jclulow I wasn't aware of that! How does one access that? If it's not currently adjusted by NTP, then I'm not sure it'll be useful for this PR at this point, but I could be wrong about that.
I'm taking another pass on time synchronisation too, and am looking at possibly updating that kernel variable as part of that work, although I'm still working through it. https://github.com/jclulow/junk-gimlet-fixtime/blob/7e0754dc50b9b256fb2624c4e3520a443ae4b8eb/src/main.rs has some code to do this, using bindings to libkvm.
@gjcolombo Ok, whew this is ready again. After some help from Dave, I think there's a better way to track creation times: a mapping that stores the creation time for each kstat. This now inserts the creation times for a target's kstats into the mapping when it is registered, or at sampling time, if needed. The creation times are removed periodically, if we discover that a kstat no longer exists at all on the kstat chain.
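Such a creation-time map might look roughly like the following sketch; the (module, instance, name) key and the method names are assumptions for illustration, not the PR's actual implementation.

```rust
use std::collections::BTreeMap;
use chrono::{DateTime, Utc};

/// Assumed identity for a kstat: (module, instance, name).
type KstatPath = (String, i32, String);

/// Hypothetical per-kstat creation-time map, independent of any one target.
struct CreationTimes {
    times: BTreeMap<KstatPath, DateTime<Utc>>,
}

impl CreationTimes {
    /// Record a creation time the first time a kstat is seen, whether at
    /// target registration or at sampling time, and return the stored value.
    fn ensure(&mut self, path: KstatPath, created: DateTime<Utc>) -> DateTime<Utc> {
        *self.times.entry(path).or_insert(created)
    }

    /// Periodically prune entries for kstats that no longer exist on the
    /// kstat chain, so the map doesn't grow forever.
    fn prune(&mut self, still_exists: impl Fn(&KstatPath) -> bool) {
        self.times.retain(|path, _| still_exists(path));
    }
}
```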
Hmm, this is a weird failure: https://github.com/oxidecomputer/omicron/actions/runs/6790935431/job/18461441508?pr=4355#step:11:706 It looks like
Ok, the failure is due to GitHub's disposition of creating a merge commit with the latest
Force-pushed from 23b0aab to cf7b3cd
The latest changes LGTM with one nitpicky comment. Thanks for working on all this!
Force-pushed from 6740d1b to 77dc30f
Manually rebased / squashed to resolve merge conflicts. Will auto-merge, thanks @gjcolombo for all the reviews, extremely helpful!
Force-pushed from 77dc30f to 8b79a19
- Add a `kstat` module in `oximeter_instruments`. This includes a trait for describing how to map from one or more kstats into oximeter targets and metrics. It can be used to convert name-value kernel statistics in a pretty straightforward way into oximeter samples.
- Add a `KstatSampler`, which is used to register kstat targets, and will periodically poll them to generate samples. It is an `oximeter::Producer`, so that it can be easily hooked up to produce data for an `oximeter` collector.
- Add targets for tracking physical, virtual, and guest datalinks.
- Add metrics for bytes in/out, packets in/out, and errors in/out for the above.
- Use the `KstatSampler` in a new `MetricsManager` type in the sled agent, and track the physical (underlay) data links on a system. Does not yet track any virtual or guest links. The manager can be used to also collect other statistics, such as HTTP request latencies similar to nexus, or any kstats through the sampler.

Remove local config, some cross-OS cfg directives

Update sled-agent OpenAPI for metrics reporting

Review feedback
- Make a queue per tracked kstat target, so that noisy producers don't impact samples from the other targets.
- Add overview doc to `oximeter_instruments::kstat`, and make most docstrings public for visibility.
- Delegate future impl to `Sleep` in `YieldIdAfter`, rather than spawning a new task.
- Don't fork to get the hostname on every call, only when needed.
- Fixup to naming for test Etherstubs, to avoid invalid names.

What is time anyway?
- Improve tests by manually manipulating time to expected events.
- Switch out `tokio::time::Instant` and `chrono::DateTime<Utc>` depending on test configuration, so that we can move time around during tests.
- Add a queue to which test code can subscribe, onto which the sampler task places the actual number of samples it collects. Use this to test the actual number of samples / error samples we get.
- Test expiration behavior by manually stepping time.

Add a time reference for converting hrtime to UTC
- Adds a `TimeReference` type for consistently converting from hrtime to UTC, using the mapping from a single point in time. This is stored in each `SampledKstat` now, and passed to `KstatTarget::to_samples()`. The implementer can use this to consistently derive a start time for their kstats now, though this isn't (can't be?) consistent if you remove and re-add a target.
- Add some quick sanity tests for `TimeReference`, plus more tests verifying start time consistency.
- Cleanup unused code.

Further improvements around kstat creation times
- Update the kstat chain more frequently, including when targets are added and sampled.
- Handle the situation when a target signals interest in zero kstats, and clarify documentation. This is actually what happens when a kstat itself disappears and we then update the chain (previously, it was an error because the sampling method did _not_ update the chain). Treat this like an error, and increment the expiration counters. Make clear in the documentation that this situation is included in those counters.
- Add a per-kstat (not target) mapping that stores the creation times. These are included whenever we add a target, and also at sampling time, if needed. This lets us track the creation time in UTC reliably, while also providing it accurately to the `KstatTarget::to_samples()` method. These are removed only when the kstat itself goes away, which we check for periodically in the main `run()` loop, to avoid keeping them around forever.
Rebase fixup

Reword expiration after first failure
Force-pushed from 8b79a19 to 5c38069
- Add a `kstat` module in `oximeter_instruments`. This includes a trait for describing how to map from one or more kstats into oximeter targets and metrics. It can be used to convert name-value kernel statistics in a pretty straightforward way into oximeter samples.
- Add a `KstatSampler`, which is used to register kstat targets, and will periodically poll them to generate samples. It is an `oximeter::Producer`, so that it can be easily hooked up to produce data for an `oximeter` collector.
- Use the `KstatSampler` in a new `MetricsManager` type in the sled agent, and track the physical (underlay) data links on a system. Does not yet track any virtual or guest links. The manager can be used to also collect other statistics, such as HTTP request latencies similar to nexus, or any kstats through the sampler.
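Based on the method names mentioned in the review thread (`interested`, `to_samples`), the trait the module provides presumably has roughly the following shape; the placeholder types and exact signatures below are assumptions, not the crate's real API.

```rust
// Placeholder types so the sketch is self-contained; the real crate defines
// its own kstat data, sample, and error types.
struct KstatData; // module, instance, name, named values, snaptime, ...
struct Sample;
struct Error;

/// Illustrative shape of a kstat-target trait like the one described above.
trait KstatTargetSketch {
    /// Whether this target wants data from the given kstat.
    fn interested(&self, kstat: &KstatData) -> bool;

    /// Convert the kstats this target is interested in into oximeter samples.
    fn to_samples(&self, kstats: &[KstatData]) -> Result<Vec<Sample>, Error>;
}
```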