
Add support for publishing kernel statistics to oximeter #4355

Merged (1 commit into main, Nov 8, 2023)

Conversation

@bnaecker (Collaborator)

  • Add a kstat module in oximeter_instruments. This includes a trait for describing how to map from one or more kstats into oximeter targets and metrics. It can be used to convert name-value kernel statistics in a pretty straightforward way into oximeter samples.
  • Add a KstatSampler, which is used to register kstat targets, and will periodically poll them to generate samples. It is an oximeter::Producer, so that it can be easily hooked up to produce data for an oximeter collector.
  • Add targets for tracking physical, virtual, and guest datalinks.
  • Add metrics for bytes in/out, packets in/out, and errors in/out for the above.
  • Use the KstatSampler in a new MetricsManager type in the sled agent, and track the physical (underlay) data links on a system. It does not yet track any virtual or guest links. The manager can also be used to collect other statistics, such as HTTP request latencies (similar to Nexus), or any kstats through the sampler.

@bnaecker (Collaborator, Author)

This PR is still a bit of a WIP, but it's large and complicated enough that I'd like to get some eyes on it before going any further. It does the minimum needed to get data from illumos's kstat framework into ClickHouse via oximeter. The main pieces are noted in the commit message, but here are some more details to help orient around the PR.

The KstatTarget trait is the main interface for folks adding new stats. The idea is to describe the set of kstats one is interested in, along with a method for generating oximeter::Samples from them. As an example, I implemented a target that collects data about Ethernet datalinks on the system; a rough sketch of the trait's shape is shown below.
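A sketch of the trait's shape, based on the two methods referenced later in this thread (interested and to_samples); the signatures, imports, and error type here are approximations rather than the exact code from the PR:

```rust
use chrono::{DateTime, Utc};
use kstat_rs::{Data, Kstat};
use oximeter::{Sample, Target};
use uuid::Uuid;

/// Approximate shape of the `KstatTarget` trait (not the PR's exact code).
pub trait KstatTarget: Target {
    /// Return true if this target is interested in the given kstat.
    fn interested(&self, kstat: &Kstat<'_>) -> bool;

    /// Convert the kstats this target expressed interest in (along
    /// with their creation times and data) into oximeter samples.
    fn to_samples(
        &self,
        kstats: &[(DateTime<Utc>, Kstat<'_>, Data<'_>)],
    ) -> Result<Vec<Sample>, anyhow::Error>;
}

/// A hypothetical target for a physical link, mirroring the fields
/// shown in the `oxdb` output below.
#[derive(Clone, Debug, Target)]
pub struct PhysicalDataLink {
    pub rack_id: Uuid,
    pub sled_id: Uuid,
    pub serial: String,
    pub hostname: String,
    pub link_name: String,
}
```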

@rmustacc, others, and I have tried quite a bit to put together some macro-trickery that would make this much easier to do. For example, it'd be nice for the sled agent to consume a file in some DSL at startup, and use that to determine which kstats to collect. That may still be useful, but I am not sure how to do it! I can dig into why it's hard, if folks are curious, but it's basically a mismatch between doing work at runtime and oximeter's compile-time data model.

I've included a few tests of the KstatSampler, used to track and publish samples. I've also run a number of tests on my own non-Gimlet machine. Below, we can see the data coming out of the ClickHouse database that's derived from these kstats:

bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].timeseries_name'
"physical_data_link:bytes_received"
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].target'
{
  "name": "physical_data_link",
  "fields": [
    {
      "name": "hostname",
      "value": {
        "type": "string",
        "value": "shale"
      }
    },
    {
      "name": "link_name",
      "value": {
        "type": "string",
        "value": "net0"
      }
    },
    {
      "name": "rack_id",
      "value": {
        "type": "uuid",
        "value": "399e3fa2-a73e-4b04-8595-b1f890b640d5"
      }
    },
    {
      "name": "serial",
      "value": {
        "type": "string",
        "value": "935aaf0b412a4e248513829916ac784d"
      }
    },
    {
      "name": "sled_id",
      "value": {
        "type": "uuid",
        "value": "7bfb8206-e907-4bd1-a420-50e187f9975b"
      }
    }
  ]
}
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].metric'
{
  "name": "bytes_received",
  "fields": [],
  "datum_type": "cumulative_u64"
}
bnaecker@shale : ~/omicron $ ./target/release/oxdb --address fd00:1122:3344:101::e query physical_data_link:bytes_received 'link_name==net0' | jq '.[0].measurements[-1]'
{
  "timestamp": "2023-10-25T23:13:18.396810853Z",
  "datum": {
    "type": "cumulative_u64",
    "datum": {
      "start_time": "2023-10-25T22:04:15.851667324Z",
      "value": 476145
    }
  }
}
bnaecker@shale : ~/omicron $ kstat link:0:net0:rbytes64
module: link                            instance: 0
name:   net0                            class:    net
        rbytes64                        479609

There are really two big pieces I'd love input on. First, I want to make sure that we have the right set of fields on these timeseries. For physical links and virtual links not tied to a guest, we are recording:

  • hostname of the sled, or the zone name for non-GZ links
  • serial number of the sled
  • link name
  • sled UUID
  • rack UUID

For guest links, we record all of those, plus the project UUID and instance UUID. Guest links are not tracked yet, so we can easily change their fields; that just requires many more changes, and I was hoping to defer it a bit.

The second big item is the actual mechanics of tracking these kstats, especially handling errors. I'm really not sure how we'll want to do that, especially for things like datalinks that may be torn down asynchronously from the kstat sampler itself. I added a way for consumers to ask about the status of their tracked target, or to remove it. In addition, the code currently requires that one specify a policy around errors at the time the target is registered. This lets folks decide whether they want to (1) keep tracking forever; (2) stop after a period of failures; or (3) stop after a number of attempts. The fact that the target "expired" is available through the status method, until someone removes it or registers a new one.
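For concreteness, here is a sketch of what that registration-time policy could look like; the names are illustrative rather than the exact API:

```rust
use std::num::NonZeroUsize;
use std::time::Duration;

/// Illustrative per-target error policy, covering the three options
/// above; names are hypothetical.
#[derive(Clone, Copy, Debug)]
pub enum ExpirationBehavior {
    /// (1) Keep tracking the target forever, regardless of errors.
    Never,
    /// (2) Expire the target once failures have persisted this long.
    Duration(Duration),
    /// (3) Expire the target after this many consecutive failed attempts.
    Attempts(NonZeroUsize),
}
```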

Anyway, that's all a bit in the weeds. But I was pretty unsure how we're going to end up using this, and didn't want things to continue being sampled indefinitely without someone explicitly opting in to that behavior. I'd love input here though, since it's certainly a bit convoluted.

@karencfv (Contributor) left a comment

This is great! It'll be really nice to have these stats recorded into ClickHouse 🎉

Resolved review threads: smf/sled-agent/non-gimlet/config-rss.toml, smf/sled-agent/non-gimlet/config.toml, sled-agent/src/metrics.rs (×2)
@karencfv (Contributor) left a comment

Looks good to me! Although, I think it'd be a good idea to wait for a review from someone who knows about kstats more than me before merging.

@bnaecker (Collaborator, Author)

Friendly ping on this one @jgallagher @rmustacc @gjcolombo! I also wanted to make clear that nearly 3k of the 4k lines in this PR are just the OpenAPI spec; I swear it's not as massive as it looks.

@gjcolombo (Contributor) left a comment

Thanks for putting this together! We're going to need a small truckload of memory kstats soon, so it's great to have all this support available.

Resolved review threads: oximeter/instruments/src/kstat/sampler.rs (×3)
Comment on lines 296 to 302
warn!(
self.log,
"sample queue is too full, dropping oldest samples";
"n_new_samples" => n_new_samples,
"n_current_samples" => n_current_samples,
"n_overflow_samples" => n_overflow_samples,
);
@gjcolombo (Contributor):

Will it be obvious from looking at metrics later that this has happened? Would it make any sense to try to produce a separate "number of dropped samples" metric that indicates whether any providers have been particularly noisy in this way?

@bnaecker (Collaborator, Author):

Great question, and no, I don't think it will be easily visible anywhere outside this log right now. One would need to infer it from a gap in the samples for a particular timeseries, or be looking at this log file. Those are both terrible, and generating our own timeseries to track these drops seems like a very good idea for surfacing it. I'll experiment with publishing that as well; we'll mostly want to think about what fields to attach to such a timeseries.
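As a sketch, such a self-tracking timeseries might look like the following with oximeter's derive macros; the target and its fields here are hypothetical, and exactly the open question mentioned above:

```rust
use oximeter::types::Cumulative;
use oximeter::{Metric, Target};

/// Hypothetical target identifying the sampler itself.
#[derive(Clone, Debug, Target)]
pub struct KstatSampler {
    pub hostname: String,
}

/// Hypothetical metric counting samples dropped because a queue
/// overflowed, with a field identifying the noisy target.
#[derive(Clone, Debug, Metric)]
pub struct SamplesDropped {
    pub target_name: String,
    pub datum: Cumulative<u64>,
}
```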

Another point this raises for me is that noisy producers can negatively impact well-behaved ones, since they share the same sample queue. That interaction is not good. I think I have a reasonable idea for addressing it (a queue per target) in a way that doesn't complicate the locking story too much.

Thanks for bringing this up!

@bnaecker (Collaborator, Author):

Ok, I've done the per-target queue work in b6eb2dc. I am going to go back and experiment with publishing our own self-stats here next.

Resolved review threads: oximeter/instruments/src/kstat/sampler.rs (×2), oximeter/instruments/src/kstat/mod.rs (×2), sled-agent/src/metrics.rs
let kstats = ctl
.iter()
.filter(|kstat| sampled_kstat.target.interested(kstat))
.zip(sampled_kstat.creation_times.iter())
@gjcolombo (Contributor):

I'm worried about the math here. IIUC, the number of elements returned from ctl.iter().filter(..) depends on the number of kstats that meet the chosen registrant's criteria right now, i.e., at sampling time. But creation_times is initialized once in add_target, when there might have been a different number of stats. If the former length exceeds the latter, then this won't call read on the "extra" elements. Is there something else guaranteeing these are equal that I'm overlooking?

(The new tests in link.rs only create one test etherstub apiece and always do it before adding any targets, so I'm not sure they cover this case.)
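(To illustrate the hazard: Rust's zip silently truncates to the shorter iterator, so a length mismatch drops elements rather than failing. A minimal, self-contained example:)

```rust
fn main() {
    // Three kstats match the `interested` filter at sampling time...
    let kstats = ["net0", "net1", "net2"];
    // ...but only two creation times were recorded at registration.
    let creation_times = [100_i64, 200];

    // `zip` stops at the end of the shorter iterator: "net2" is
    // silently never paired, and no error signals the mismatch.
    let pairs: Vec<_> = kstats.iter().zip(creation_times.iter()).collect();
    assert_eq!(pairs.len(), 2);
}
```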

@bnaecker (Collaborator, Author):

That's a good catch, it's definitely possible to be zipping different numbers of elements here.

I think this hits on one of my difficulties on this PR. Kernel statistics publish their creation time as an hrtime_t, which is really only useful as a relative timestamp. I do some work to convert it into UTC here, which I am frankly not a huge fan of. That is intentionally done once at registration time, because there's enough slop in the conversions that it starts to matter. Cumulative counters always include a start time, and that'll move around by a lot, up to a few milliseconds, in my experimentation. There was actually an illumos issue to report the boot time in UTC, but I'm struggling to find it now. Maybe @jclulow remembers where that is, since I believe he filed it?
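(For reference, the conversion being described is essentially the following; gethrtime() is the illumos high-resolution timer, and the function shown is an assumed sketch rather than the PR's code.)

```rust
use chrono::{DateTime, Duration, Utc};

/// Convert a kstat creation time (an hrtime_t, nanoseconds since an
/// arbitrary origin) to UTC, using an (hrtime, UTC) snapshot taken now.
fn creation_time_utc(
    creation_hrtime: i64,
    now_hrtime: i64,
    now_utc: DateTime<Utc>,
) -> DateTime<Utc> {
    // The kstat was created this many nanoseconds before the snapshot.
    let age = Duration::nanoseconds(now_hrtime - creation_hrtime);
    // Each call uses a slightly different snapshot, which is where the
    // millisecond-scale slop in the derived start times comes from.
    now_utc - age
}
```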

Computing them once at registration was my tradeoff, but you're correct that it's wrong! I'll add a regression test for this first off. Now, what to do about it? One option is to map directly from the interested kstats to their creation times, which might be a bit tricky. A second is to compute the creation time on the fly at each sampling. I could just be doing the math wrong, but I worry about that value wandering around a bunch and confusing anyone looking at it downstream.

One other idea that occurs to me: we could pretend to start recording the value of the kstat when it's registered (or when we first see interest in it), and compute the cumulative values over that time interval. That would basically mean ignoring history prior to the registration / interest in the kstat, but it would let us just compute Utc::now() each time.

I think another possibility is to call KstatTarget::interested only once, at registration time, not sampling time. I'm sure there are others, but I'll need to stew on the tradeoffs.

Thanks for flagging this, definitely an issue.

oximeter/instruments/src/kstat/link.rs Outdated Show resolved Hide resolved
@gjcolombo (Contributor)

Meant to leave an additional note: the changes in service of my first round of comments all look great; thanks for picking those up! There are a couple more things I had questions about that aren't related to those updates, but that I only noticed on my second lap.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 2, 2023 via email

@bnaecker (Collaborator, Author)

bnaecker commented Nov 2, 2023

I'm going to work on getting the tests to manually advance / pause time, and tackle the hrtime -> UTC mapping after that. It's probably more subtle than I thought, and may take some time to think about and experiment with approaches.

@bnaecker requested a review from @gjcolombo, November 2, 2023 20:15
@gjcolombo (Contributor) left a comment

The dropped stat reporting and new test code LGTM, though I'm a little bummed that my suggestion about the test produced the Instant/DateTime split in the actual code. I'll think about that a bit more, though I suspect the best answer is ultimately going to involve using Instant in most places and converting to DateTime using some reference time stored when the sampler is started; i.e., what I think you're trying to do with timestamps anyway.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 6, 2023

Thanks @gjcolombo. I've been thinking more about the timestamp issue, and I don't think there's an obviously correct way to handle it. The issue is that there can be drift between a high-resolution time and UTC, due to things like NTP adjustments or slightly different time resolutions. That means mapping an hrtime_t to UTC at one point in time can give a slightly different answer than doing that mapping later.

As an example, consider creating a "timestamp reference", with a UTC and hrtime_t. That lets us map between them. Suppose we make one of those each time someone calls KstatTarget::to_samples(). That will provide the best approximation for the current timestamp in both domains. But if one uses that new reference to compute the start time, then we have different start timestamps for each sample of a cumulative metric. I think that'll be quite confusing during querying, and require a lot of massaging.

On the other hand, if we create one reference, then we hit the issue you pointed out: that may not be the right reference for two kstats that are created at different times.

I think the only way around this is to require that the implementer of KstatTarget store the creation time if they care about it. We can provide them with a new timestamp reference at each call to KstatTarget::to_samples(). The type implementing that trait can decide to store the creation time in UTC if they want, to provide a consistent base for cumulative metrics. If they're producing a gauge, they need not store it. If they're producing more than one cumulative metric, they can store multiple creation times.
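In code, that timestamp reference might be as simple as the sketch below (names hypothetical, though a TimeReference type along these lines does land in a later revision of this PR). An implementer producing cumulative metrics would convert the creation time once, store the result, and reuse it:

```rust
use chrono::{DateTime, Duration, Utc};

/// A single (hrtime, UTC) pairing captured at one instant, used to map
/// hrtime_t values into UTC consistently.
#[derive(Clone, Copy, Debug)]
pub struct TimeReference {
    pub utc: DateTime<Utc>,
    pub hrtime: i64,
}

impl TimeReference {
    /// Convert an hrtime_t into UTC using this fixed reference, so
    /// repeated conversions of the same hrtime agree exactly.
    pub fn utc_at(&self, hrtime: i64) -> DateTime<Utc> {
        self.utc - Duration::nanoseconds(self.hrtime - hrtime)
    }
}
```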

I hope that makes sense. Without the OS itself storing a notion of its boot time in UTC, it's surprisingly difficult to map between hrtime_ts and UTC timestamps!

@jclulow (Collaborator)

jclulow commented Nov 6, 2023

FWIW, I think we do store the UTC boot time in a variable in the kernel. It doesn't get adjusted by the kernel to account for NTP today, but to be honest I'm not sure there's any reason we couldn't do so when adjustments are fed to us to update the actual clock.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 6, 2023

@jclulow I wasn't aware of that! How does one access that? If it's not currently adjusted by NTP, then I'm not sure it'll be useful for this PR at this point, but I could be wrong about that.

@citrus-it (Contributor)

I'm taking another pass on time synchronisation too, and am looking at possibly updating that kernel variable as part of that work, although I'm still working through it.

https://github.com/jclulow/junk-gimlet-fixtime/blob/7e0754dc50b9b256fb2624c4e3520a443ae4b8eb/src/main.rs has some code to do this, using bindings to libkvm.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 7, 2023

@gjcolombo Ok, whew, this is ready again. After some help from Dave, I think there's a better way to track creation times: a mapping that stores the creation time for each kstat. The sampler now inserts the creation times for a target's kstats into the mapping when the target is registered, or at sampling time if needed. Creation times are removed periodically, if we discover that a kstat no longer exists on the kstat chain.
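Roughly, the new state is a per-kstat map along these lines (types and names hypothetical, not the PR's actual definitions):

```rust
use chrono::{DateTime, Utc};
use std::collections::BTreeMap;

/// Hypothetical key uniquely identifying one kstat on the chain.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct KstatPath {
    module: String,
    instance: i32,
    name: String,
}

/// Creation times keyed by kstat, not by target: inserted when a
/// target is registered (or at sampling time, if needed) and pruned
/// when a kstat disappears from the chain.
#[derive(Debug, Default)]
struct CreationTimes(BTreeMap<KstatPath, DateTime<Utc>>);

impl CreationTimes {
    /// Record a creation time the first time we see a kstat.
    fn ensure(&mut self, path: KstatPath, created: DateTime<Utc>) {
        self.0.entry(path).or_insert(created);
    }

    /// Drop entries for kstats no longer present on the chain.
    fn prune(&mut self, live: &[KstatPath]) {
        self.0.retain(|path, _| live.contains(path));
    }
}
```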

@bnaecker (Collaborator, Author)

bnaecker commented Nov 7, 2023

Hmm, this is a weird failure:

https://github.com/oxidecomputer/omicron/actions/runs/6790935431/job/18461441508?pr=4355#step:11:706

It looks like cargo doc is failing, claiming that some fields of a struct don't exist, such as the field id of struct StartSledAgentRequest. That field certainly does exist on my branch, so I'm not sure what's going on.

@bnaecker (Collaborator, Author)

bnaecker commented Nov 7, 2023

Ok, the failure is due to GitHub's practice of creating a merge commit with the latest main and running the pull_request actions against that. Very confusing behavior IMO, though there's some value in it as long as it's clear that it's happening. I'll rebase on main to fix that up.

@gjcolombo (Contributor) left a comment

The latest changes LGTM with one nitpicky comment. Thanks for working on all this!

Resolved review thread: oximeter/instruments/src/kstat/sampler.rs
@bnaecker (Collaborator, Author)

bnaecker commented Nov 8, 2023

Manually rebased / squashed to resolve merge conflicts. Will auto-merge. Thanks @gjcolombo for all the reviews; extremely helpful!

@bnaecker enabled auto-merge (squash), November 8, 2023 19:45
- Add a `kstat` module in `oximeter_instruments`. This includes a trait
  for describing how to map from one or more kstats into oximeter
  targets and metrics. It can be used to convert name-value kernel
  statistics in a pretty straightforward way into oximeter samples.
- Add a `KstatSampler`, which is used to register kstat targets, and
  will periodically poll them to generate samples. It is an
  `oximeter::Producer`, so that it can be easily hooked up to produce
  data for an `oximeter` collector.
- Add targets for tracking physical, virtual, and guest datalinks.
- Add metrics for bytes in/out, packets in/out, and errors in/out for
  the above.
- Use the `KstatSampler` in a new `MetricsManager` type in the sled
  agent, and track the physical (underlay) data links on a system. Does
  not yet track any virtual or guest links. The manager can also be
  used to collect other statistics, such as HTTP request latencies
  (similar to nexus), or any kstats through the sampler.

Remove local config, some cross-OS cfg directives

Update sled-agent OpenAPI for metrics reporting

Review feedback

- Make a queue per tracked kstat target, so that noisy producers don't
  impact samples from the other targets.
- Add overview doc to `oximeter_instruments::kstat`, and make most
  docstrings public for visibility.
- Delegate future impl to `Sleep` in `YieldIdAfter`, rather than
  spawning a new task.
- Don't fork to get hostname on every call, only when needed.
- Fixup to naming for test Etherstubs, to avoid invalid names.

What is time anyway?

- Improve tests by manually manipulating time to expected events.
- Switch out `tokio::time::Instant` and `chrono::DateTime<Utc>` depending
  on test configuration, so that we can move time around during tests.
- Add a queue to which test code can subscribe, onto which the sampler
  task places the actual number of samples it collects. Use this to test
  the actual number of samples / error samples we get.
- Test expiration behavior by manually stepping time

Add a time reference for converting hrtime to UTC

- Adds a `TimeReference` type for consistently converting from hrtime to
  UTC, using the mapping from a single point in time. This is stored in
  each `SampledKstat` now, and passed to `KstatTarget::to_samples()`.
  The implementer can use this to consistently derive a start time for
  their kstats now, though this isn't (can't be?) consistent if you
  remove and re-add a target.
- Add some quick sanity tests for `TimeReference`, more tests verifying
  start time consistency
- Cleanup unused code

Further improvements around kstat creation times

- Update the kstat chain more frequently, including when targets are
  added and sampled.
- Handle the situation where a target signals interest in zero kstats, and
  clarify documentation. This is actually what happens when a kstat
  itself disappears and we then update the chain (previously, it was an
  error because the sampling method did _not_ update the chain). Treat
  this like an error, and increment the expiration counters. Make clear
  in the documentation that this situation is included in those
  counters.
- Add a per-kstat (not target) mapping that stores the creation times.
  These are included whenever we add a target, and also at sampling
  time, if needed. This lets us track the creation time in UTC reliably,
  while also providing it accurately to the `KstatTarget::to_samples()`
  method. These are removed only when the kstat itself goes away, which
  we check for periodically in the main `run()` loop, to avoid keeping
  them around forever.

Rebase fixup

Reword expiration after first failure
@bnaecker merged commit 6a0995a into main, Nov 8, 2023 (22 of 23 checks passed).
@bnaecker deleted the oximeter-kstats branch, November 8, 2023 22:41.