
Add support for publishing kernel statistics to oximeter #4355

Merged (1 commit into main, Nov 8, 2023)

Conversation

@bnaecker (Collaborator)

  • Add a kstat module in oximeter_instruments. This includes a trait for describing how to map from one or more kstats into oximeter targets and metrics. It can be used to convert name-value kernel statistics in a pretty straightforward way into oximeter samples.
  • Add a KstatSampler, which is used to register kstat targets, and will periodically poll them to generate samples. It is an oximeter::Producer, so that it can be easily hooked up to produce data for an oximeter collector.
  • Add targets for tracking physical, virtual, and guest datalinks.
  • Add metrics for bytes in/out, packets in/out, and errors in/out for the above.
  • Use the KstatSampler in a new MetricsManager type in the sled agent, and track the physical (underlay) data links on a system. It does not yet track any virtual or guest links. The manager can also be used to collect other statistics, such as HTTP request latencies (similar to Nexus), or any kstats through the sampler.

@bnaecker (Collaborator, Author)

This PR is still a bit of a WIP, but it's large and complicated enough that I'd like to get some eyes on it before going any further. It does the minimum needed to get data from illumos's kstat framework into ClickHouse via oximeter. The main pieces are noted in the commit message, but here are some more details to help orient around the PR.

The KstatTarget trait is the main interface for folks adding new stats. The idea is to describe the set of kstats one is interested in, along with a method for generating oximeter::Samples from them. As an example, I implemented a target that collects data about Ethernet datalinks on the system; a rough sketch of the trait's shape is shown below.
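A sketch of the trait's shape, based on the two methods referenced later in this thread (interested and to_samples); the signatures, imports, and error type here are approximations rather than the exact code from the PR:

```rust
use chrono::{DateTime, Utc};
use kstat_rs::{Data, Kstat};
use oximeter::{Sample, Target};
use uuid::Uuid;

/// Approximate shape of the `KstatTarget` trait (not the PR's exact code).
pub trait KstatTarget: Target {
    /// Return true if this target is interested in the given kstat.
    fn interested(&self, kstat: &Kstat<'_>) -> bool;

    /// Convert the kstats this target expressed interest in (along
    /// with their creation times and data) into oximeter samples.
    fn to_samples(
        &self,
        kstats: &[(DateTime<Utc>, Kstat<'_>, Data<'_>)],
    ) -> Result<Vec<Sample>, anyhow::Error>;
}

/// A hypothetical target for a physical link, mirroring the fields
/// shown in the `oxdb` output below.
#[derive(Clone, Debug, Target)]
pub struct PhysicalDataLink {
    pub rack_id: Uuid,
    pub sled_id: Uuid,
    pub serial: String,
    pub hostname: String,
    pub link_name: String,
}
```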

@rmustacc, others, and I have tried quite a bit to put together some macro-trickery that would make this much easier to do. For example, it'd be nice for the sled agent to consume a file in some DSL at startup, and use that to determine which kstats to collect. That may still be useful, but I am not sure how to do it! I can dig into why it's hard, if folks are curious, but it's basically a mismatch between doing work at runtime and oximeter's compile-time data model.

I've included a few tests of the KstatSampler, used to track and publish samples. I've also run a number of tests on my own non-Gimlet machine. Below, we can see the data coming out of the ClickHouse database that's derived from these kstats:

bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].timeseries_name'
"physical_data_link:bytes_received"
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].target'
{
  "name": "physical_data_link",
  "fields": [
    {
      "name": "hostname",
      "value": {
        "type": "string",
        "value": "shale"
      }
    },
    {
      "name": "link_name",
      "value": {
        "type": "string",
        "value": "net0"
      }
    },
    {
      "name": "rack_id",
      "value": {
        "type": "uuid",
        "value": "399e3fa2-a73e-4b04-8595-b1f890b640d5"
      }
    },
    {
      "name": "serial",
      "value": {
        "type": "string",
        "value": "935aaf0b412a4e248513829916ac784d"
      }
    },
    {
      "name": "sled_id",
      "value": {
        "type": "uuid",
        "value": "7bfb8206-e907-4bd1-a420-50e187f9975b"
      }
    }
  ]
}
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].metric'
{
  "name": "bytes_received",
  "fields": [],
  "datum_type": "cumulative_u64"
}
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].measurements[-1]'
{
  "timestamp": "2023-10-25T23:13:18.396810853Z",
  "datum": {
    "type": "cumulative_u64",
    "datum": {
      "start_time": "2023-10-25T22:04:15.851667324Z",
      "value": 476145
    }
  }
}
bnaecker@shale : ~/omicron $ kstat link:0:net0:rbytes64
module: link                            instance: 0
name:   net0                            class:    net
        rbytes64                        479609

There are really two big pieces I'd love input on. First, I want to make sure that we have the right set of fields on these timeseries. For physical links and virtual links not tied to a guest, we are recording:

  • hostname of the sled, or the zone name for non-GZ links
  • serial number of the sled
  • link name
  • sled UUID
  • rack UUID

For guest links, we record all of those, plus the project UUID and instance UUID. Guest links are not tracked yet, so we can easily change their fields; that just requires many more changes, and I was hoping to defer it a bit.

The second big item is the actual mechanics of tracking these kstats, especially handling errors. I'm really not sure how we'll want to do that, especially for things like datalinks that may be torn down asynchronously from the kstat sampler itself. I added a way for consumers to ask about the status of their tracked target, or to remove it. In addition, the code currently requires that one specify a policy around errors at the time the target is registered. This lets folks decide whether they want to (1) keep tracking forever; (2) stop after a period of failures; or (3) stop after a number of attempts. The fact that the target "expired" is available through the status method, until someone removes it or registers a new one.
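For concreteness, here is a sketch of what that registration-time policy could look like; the names are illustrative rather than the exact API:

```rust
use std::num::NonZeroUsize;
use std::time::Duration;

/// Illustrative per-target error policy, covering the three options
/// above; names are hypothetical.
#[derive(Clone, Copy, Debug)]
pub enum ExpirationBehavior {
    /// (1) Keep tracking the target forever, regardless of errors.
    Never,
    /// (2) Expire the target once failures have persisted this long.
    Duration(Duration),
    /// (3) Expire the target after this many consecutive failed attempts.
    Attempts(NonZeroUsize),
}
```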

Anyway, that's all a bit in the weeds. But I was pretty unsure how we're going to end up using this, and didn't want things to continue being sampled indefinitely without someone explicitly opting in to that behavior. I'd love input here though, since it's certainly a bit convoluted.

@karencfv (Contributor) left a comment

This is great! It'll be really nice to have these stats recorded into ClickHouse 🎉

Resolved review threads: smf/sled-agent/non-gimlet/config-rss.toml, smf/sled-agent/non-gimlet/config.toml, sled-agent/src/metrics.rs (×2)
@karencfv (Contributor) left a comment

Looks good to me! Although, I think it'd be a good idea to wait for a review from someone who knows about kstats more than me before merging.

@bnaecker (Collaborator, Author)

Friendly ping on this one @jgallagher @rmustacc @gjcolombo! I also wanted to make clear that nearly 3k of the 4k lines in this PR are just the OpenAPI spec; I swear it's not as massive as it looks.

@gjcolombo (Contributor) left a comment

Thanks for putting this together! We're going to need a small truckload of memory kstats soon, so it's great to have all this support available.

Resolved review threads: oximeter/instruments/src/kstat/sampler.rs (×3)
Comment on lines 296 to 302
warn!(
self.log,
"sample queue is too full, dropping oldest samples";
"n_new_samples" => n_new_samples,
"n_current_samples" => n_current_samples,
"n_overflow_samples" => n_overflow_samples,
);
@gjcolombo (Contributor):

Will it be obvious from looking at metrics later that this has happened? Would it make any sense to try to produce a separate "number of dropped samples" metric that indicates whether any providers have been particularly noisy in this way?

@bnaecker (Collaborator, Author):

Great question, and no, I don't think it will be easily visible anywhere outside this log right now. One would need to infer it from a gap in the samples for a particular timeseries, or be looking at this log file. Those are both terrible, and generating our own timeseries to track these drops seems like a very good idea for surfacing it. I'll experiment with publishing that as well; we'll mostly want to think about what fields to attach to such a timeseries.
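As a sketch, such a self-tracking timeseries might look like the following with oximeter's derive macros; the target and its fields here are hypothetical, and exactly the open question mentioned above:

```rust
use oximeter::types::Cumulative;
use oximeter::{Metric, Target};

/// Hypothetical target identifying the sampler itself.
#[derive(Clone, Debug, Target)]
pub struct KstatSampler {
    pub hostname: String,
}

/// Hypothetical metric counting samples dropped because a queue
/// overflowed, with a field identifying the noisy target.
#[derive(Clone, Debug, Metric)]
pub struct SamplesDropped {
    pub target_name: String,
    pub datum: Cumulative<u64>,
}
```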

Another point this raises for me is that noisy producers can negatively impact well-behaved ones, since they share the same sample queue. That interaction is not good. I think I have a reasonable idea for addressing it (a queue per target) in a way that doesn't complicate the locking story too much.

Thanks for bringing this up!

@bnaecker (Collaborator, Author):

Ok, I've done the per-target queue work in b6eb2dc. I am going to go back and experiment with publishing our own self-stats here next.

Resolved review threads: oximeter/instruments/src/kstat/sampler.rs (×2), oximeter/instruments/src/kstat/mod.rs (×2), sled-agent/src/metrics.rs
let kstats = ctl
.iter()
.filter(|kstat| sampled_kstat.target.interested(kstat))
.zip(sampled_kstat.creation_times.iter())
@gjcolombo (Contributor):

I'm worried about the math here. IIUC, the number of elements returned from ctl.iter().filter(..) depends on the number of kstats that meet the chosen registrant's criteria right now, i.e., at sampling time. But creation_times is initialized once in add_target, when there might have been a different number of stats. If the former length exceeds the latter, then this won't call read on the "extra" elements. Is there something else guaranteeing these are equal that I'm overlooking?

(The new tests in link.rs only create one test etherstub apiece and always do it before adding any targets, so I'm not sure they cover this case.)
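(To illustrate the hazard: Rust's zip silently truncates to the shorter iterator, so a length mismatch drops elements rather than failing. A minimal, self-contained example:)

```rust
fn main() {
    // Three kstats match the `interested` filter at sampling time...
    let kstats = ["net0", "net1", "net2"];
    // ...but only two creation times were recorded at registration.
    let creation_times = [100_i64, 200];

    // `zip` stops at the end of the shorter iterator: "net2" is
    // silently never paired, and no error signals the mismatch.
    let pairs: Vec<_> = kstats.iter().zip(creation_times.iter()).collect();
    assert_eq!(pairs.len(), 2);
}
```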

@bnaecker (Collaborator, Author):

That's a good catch, it's definitely possible to be zipping different numbers of elements here.

I think this hits on one of my difficulties on this PR. Kernel statistics publish their creation time as an hrtime_t, which is really only useful as a relative timestamp. I do some work to convert it into UTC here, which I am frankly not a huge fan of. That is intentionally done once at registration time, because there's enough slop in the conversions that it starts to matter. Cumulative counters always include a start time, and that'll move around by a lot, up to a few milliseconds, in my experimentation. There was actually an illumos issue to report the boot time in UTC, but I'm struggling to find it now. Maybe @jclulow remembers where that is, since I believe he filed it?
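(For reference, the conversion being described is essentially the following; gethrtime() is the illumos high-resolution timer, and the function shown is an assumed sketch rather than the PR's code.)

```rust
use chrono::{DateTime, Duration, Utc};

/// Convert a kstat creation time (an hrtime_t, nanoseconds since an
/// arbitrary origin) to UTC, using an (hrtime, UTC) snapshot taken now.
fn creation_time_utc(
    creation_hrtime: i64,
    now_hrtime: i64,
    now_utc: DateTime<Utc>,
) -> DateTime<Utc> {
    // The kstat was created this many nanoseconds before the snapshot.
    let age = Duration::nanoseconds(now_hrtime - creation_hrtime);
    // Each call uses a slightly different snapshot, which is where the
    // millisecond-scale slop in the derived start times comes from.
    now_utc - age
}
```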

Computing them once at registration was my tradeoff, but you're correct that it's wrong! I'll add a regression test for this first off. Now, what to do about it? One option is to map directly from the interested kstats to their creation times, which might be a bit tricky. A second is to compute the creation time on the fly at each sampling. I could just be doing the math wrong, but I worry about that value wandering around a bunch and confusing anyone looking at it downstream.

One other idea that occurs to me: we could pretend to start recording the value of the kstat when it's registered (or when we first see interest in it), and compute the cumulative values over that time interval. That would basically mean ignoring history prior to the registration / interest in the kstat, but it would let us just compute Utc::now() each time.

I think another possibility is to call KstatTarget::interested only once, at registration time, not sampling time. I'm sure there are others, but I'll need to stew on the tradeoffs.

Thanks for flagging this, definitely an issue.

oximeter/instruments/src/kstat/link.rs Outdated Show resolved Hide resolved
@gjcolombo (Contributor)

Meant to leave an additional note: the changes in service of my first round of comments all look great; thanks for picking those up! There are a couple more things I had questions about that aren't related to those updates, but that I only noticed on my second lap.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 2, 2023 via email

@bnaecker (Collaborator, Author)

bnaecker commented Nov 2, 2023

I'm going to work on getting the tests to manually advance / pause time, and tackle the hrtime -> UTC mapping after that. It's probably more subtle than I thought, and may take some time to think about and experiment with approaches.

@bnaecker requested a review from @gjcolombo, November 2, 2023 20:15
@gjcolombo (Contributor) left a comment

The dropped stat reporting and new test code LGTM, though I'm a little bummed that my suggestion about the test produced the Instant/DateTime split in the actual code. I'll think about that a bit more, though I suspect the best answer is ultimately going to involve using Instant in most places and converting to DateTime using some reference time stored when the sampler is started; i.e., what I think you're trying to do with timestamps anyway.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 6, 2023

Thanks @gjcolombo. I've been thinking more about the timestamp issue, and I don't think there's an obviously correct way to handle it. The issue is that there can be drift between a high-resolution time and UTC, due to things like NTP adjustments or slightly different time resolutions. That means mapping an hrtime_t to UTC at one point in time can give a slightly different answer than doing that mapping later.

As an example, consider creating a "timestamp reference", with a UTC and hrtime_t. That lets us map between them. Suppose we make one of those each time someone calls KstatTarget::to_samples(). That will provide the best approximation for the current timestamp in both domains. But if one uses that new reference to compute the start time, then we have different start timestamps for each sample of a cumulative metric. I think that'll be quite confusing during querying, and require a lot of massaging.

On the other hand, if we create one reference, then we hit the issue you pointed out: that may not be the right reference for two kstats that are created at different times.

I think the only way around this is to require that the implementer of KstatTarget store the creation time if they care about it. We can provide them with a new timestamp reference at each call to KstatTarget::to_samples(). The type implementing that trait can decide to store the creation time in UTC if they want, to provide a consistent base for cumulative metrics. If they're producing a gauge, they need not store it. If they're producing more than one cumulative metric, they can store multiple creation times.
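In code, that timestamp reference might be as simple as the sketch below (names hypothetical, though a TimeReference type along these lines does land in a later revision of this PR). An implementer producing cumulative metrics would convert the creation time once, store the result, and reuse it:

```rust
use chrono::{DateTime, Duration, Utc};

/// A single (hrtime, UTC) pairing captured at one instant, used to map
/// hrtime_t values into UTC consistently.
#[derive(Clone, Copy, Debug)]
pub struct TimeReference {
    pub utc: DateTime<Utc>,
    pub hrtime: i64,
}

impl TimeReference {
    /// Convert an hrtime_t into UTC using this fixed reference, so
    /// repeated conversions of the same hrtime agree exactly.
    pub fn utc_at(&self, hrtime: i64) -> DateTime<Utc> {
        self.utc - Duration::nanoseconds(self.hrtime - hrtime)
    }
}
```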

I hope that makes sense. Without the OS itself storing a notion of its boot time in UTC, it's surprisingly difficult to map between hrtime_ts and UTC timestamps!

@jclulow (Collaborator)

jclulow commented Nov 6, 2023

FWIW, I think we do store the UTC boot time in a variable in the kernel. It doesn't get adjusted by the kernel to account for NTP today, but to be honest I'm not sure there's any reason we couldn't do so when adjustments are fed to us to update the actual clock.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 6, 2023

@jclulow I wasn't aware of that! How does one access that? If it's not currently adjusted by NTP, then I'm not sure it'll be useful for this PR at this point, but I could be wrong about that.

@citrus-it (Contributor)

I'm taking another pass on time synchronisation too, and am looking at possibly updating that kernel variable as part of that work, although I'm still working through it.

https://github.com/jclulow/junk-gimlet-fixtime/blob/7e0754dc50b9b256fb2624c4e3520a443ae4b8eb/src/main.rs has some code to do this, using bindings to libkvm.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 7, 2023

@gjcolombo Ok, whew, this is ready again. After some help from Dave, I think there's a better way to track creation times: a mapping that stores the creation time for each kstat. The sampler now inserts the creation times for a target's kstats into the mapping when the target is registered, or at sampling time if needed. Creation times are removed periodically, if we discover that a kstat no longer exists on the kstat chain.
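Roughly, the new state is a per-kstat map along these lines (types and names hypothetical, not the PR's actual definitions):

```rust
use chrono::{DateTime, Utc};
use std::collections::BTreeMap;

/// Hypothetical key uniquely identifying one kstat on the chain.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct KstatPath {
    module: String,
    instance: i32,
    name: String,
}

/// Creation times keyed by kstat, not by target: inserted when a
/// target is registered (or at sampling time, if needed) and pruned
/// when a kstat disappears from the chain.
#[derive(Debug, Default)]
struct CreationTimes(BTreeMap<KstatPath, DateTime<Utc>>);

impl CreationTimes {
    /// Record a creation time the first time we see a kstat.
    fn ensure(&mut self, path: KstatPath, created: DateTime<Utc>) {
        self.0.entry(path).or_insert(created);
    }

    /// Drop entries for kstats no longer present on the chain.
    fn prune(&mut self, live: &[KstatPath]) {
        self.0.retain(|path, _| live.contains(path));
    }
}
```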

@bnaecker (Collaborator, Author)

bnaecker commented Nov 7, 2023

Hmm, this is a weird failure:

https://github.com/oxidecomputer/omicron/actions/runs/6790935431/job/18461441508?pr=4355#step:11:706

It looks like cargo doc is failing, claiming that some fields of a struct don't exist, such as the field id of struct StartSledAgentRequest. That field certainly does exist on my branch, so I'm not sure what's going on.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 7, 2023

Ok, the failure is due to GitHub's practice of creating a merge commit with the latest main and running the pull_request actions against that. Very confusing behavior IMO, though there's some value in it as long as it's clear that it's happening. I'll rebase on main to fix that up.

@gjcolombo (Contributor) left a comment

The latest changes LGTM with one nitpicky comment. Thanks for working on all this!

Resolved review thread: oximeter/instruments/src/kstat/sampler.rs
@bnaecker (Collaborator, Author)

bnaecker commented Nov 8, 2023

Manually rebased / squashed to resolve merge conflicts. Will auto-merge. Thanks @gjcolombo for all the reviews; extremely helpful!

@bnaecker enabled auto-merge (squash), November 8, 2023 19:45
- Add a `kstat` module in `oximeter_instruments`. This includes a trait
  for describing how to map from one or more kstats into oximeter
  targets and metrics. It can be used to convert name-value kernel
  statistics in a pretty straightforward way into oximeter samples.
- Add a `KstatSampler`, which is used to register kstat targets, and
  will periodically poll them to generate samples. It is an
  `oximeter::Producer`, so that it can be easily hooked up to produce
  data for an `oximeter` collector.
- Add targets for tracking physical, virtual, and guest datalinks.
- Add metrics for bytes in/out, packets in/out, and errors in/out for
  the above.
- Use the `KstatSampler` in a new `MetricsManager` type in the sled
  agent, and track the physical (underlay) data links on a system. Does
  not yet track any virtual or guest links. The manager can also be
  used to collect other statistics, such as HTTP request latencies
  (similar to nexus), or any kstats through the sampler.

Remove local config, some cross-OS cfg directives

Update sled-agent OpenAPI for metrics reporting

Review feedback

- Make a queue per tracked kstat target, so that noisy producers don't
  impact samples from the other targets.
- Add overview doc to `oximeter_instruments::kstat`, and make most
  docstrings public for visibility.
- Delegate future impl to `Sleep` in `YieldIdAfter`, rather than
  spawning a new task.
- Don't fork to get hostname on every call, only when needed.
- Fixup to naming for test Etherstubs, to avoid invalid names.

What is time anyway?

- Improve tests by manually manipulating time to expected events.
- Switch out `tokio::time::Instant` and `chrono::DateTime<Utc>` depending
  on test configuration, so that we can move time around during tests.
- Add a queue to which test code can subscribe, onto which the sampler
  task places the actual number of samples it collects. Use this to test
  the actual number of samples / error samples we get.
- Test expiration behavior by manually stepping time

Add a time reference for converting hrtime to UTC

- Adds a `TimeReference` type for consistently converting from hrtime to
  UTC, using the mapping from a single point in time. This is stored in
  each `SampledKstat` now, and passed to `KstatTarget::to_samples()`.
  The implementer can use this to consistently derive a start time for
  their kstats now, though this isn't (can't be?) consistent if you
  remove and re-add a target.
- Add some quick sanity tests for `TimeReference`, more tests verifying
  start time consistency
- Cleanup unused code

Further improvements around kstat creation times

- Update the kstat chain more frequently, including when targets are
  added and sampled.
- Handle the situation where a target signals interest in zero kstats, and
  clarify documentation. This is actually what happens when a kstat
  itself disappears and we then update the chain (previously, it was an
  error because the sampling method did _not_ update the chain). Treat
  this like an error, and increment the expiration counters. Make clear
  in the documentation that this situation is included in those
  counters.
- Add a per-kstat (not target) mapping that stores the creation times.
  These are included whenever we add a target, and also at sampling
  time, if needed. This lets us track the creation time in UTC reliably,
  while also providing it accurately to the `KstatTarget::to_samples()`
  method. These are removed only when the kstat itself goes away, which
  we check for periodically in the main `run()` loop, to avoid keeping
  them around forever.

Rebase fixup

Reword expiration after first failure
@bnaecker merged commit 6a0995a into main, Nov 8, 2023 (22 of 23 checks passed).
@bnaecker deleted the oximeter-kstats branch, November 8, 2023 22:41.