Publish instance vCPU usage statistics to oximeter #4855

Closed
bnaecker wants to merge 5 commits from the publish-instance-vcpu-stats branch

Conversation

bnaecker
Collaborator

- Adds the silo and project IDs to the instance-ensure request from Nexus to the sled-agent. These are used as fields on the instance-related statistics.
- Defines a `VirtualMachine` oximeter target and `VcpuUsage` metric. The latter has a `state` field which corresponds to the named kstats published by the hypervisor that accumulate the time spent in a number of vCPU microstates. The combination of these should allow us to aggregate or break down vCPU usage by silo, project, instance, vCPU ID, and CPU state (see the sketch after this list).
- Adds APIs to the `MetricsManager` for starting / stopping tracking of instance-related metrics, and plumbs the type through the `InstanceManager` and `Instance` (and their internal friends), so that new instances can control when data is produced from them. Currently, we'll start producing as soon as we get a non-terminate response from Propolis in the `instance_state_monitor()` task, and stop when the instance is terminated.
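
Roughly, those definitions look like the following sketch. It is illustrative only: the derive attributes and exact field types are assumptions based on how oximeter targets and metrics are typically declared, not a copy of the code in this PR.

```rust
use oximeter::{types::Cumulative, Metric, Target};
use uuid::Uuid;

/// Sketch of the target identifying a single guest instance.
#[derive(Clone, Debug, Target)]
pub struct VirtualMachine {
    pub silo_id: Uuid,
    pub project_id: Uuid,
    pub instance_id: Uuid,
}

/// Sketch of the metric: cumulative time a vCPU has spent in one microstate,
/// mirroring the hypervisor's `vcpuN:time_<state>` kstats.
#[derive(Clone, Debug, Metric)]
pub struct VcpuUsage {
    /// Which vCPU within the instance this sample describes.
    pub vcpu_id: u32,
    /// The microstate name, e.g. "init" or "run".
    pub state: String,
    /// The accumulated time reported by the kstat.
    pub datum: Cumulative<u64>,
}
```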

@bnaecker
Collaborator Author

This still needs more testing around the instance lifecycle itself, but here are some manual tests I ran to verify the new code for pulling kstats themselves from the KstatSampler.

Testing this is hard, since it requires an actual VM. Instead, I wrote a tiny
CLI tool to fake up the important parts. That is here:

```rust
use oximeter_instruments::kstat::KstatTarget;

fn main() {
    let ctl = kstat_rs::Ctl::new().unwrap();

    // Find a VMM kstat.
    //
    // We'll use this to cons up a `VirtualMachine` that matches, to test our
    // shit.
    let Some(id) = ctl.iter().find_map(|mut kstat| {
        if kstat.ks_module == "vmm" && kstat.ks_name == "vm" {
            let data = ctl.read(&mut kstat).unwrap();
            let kstat_rs::Data::Named(named) = data else {
                return None;
            };
            named.iter().find_map(|nv| {
                if nv.name == "vm_name" {
                    let kstat_rs::NamedData::String(vm_name) = nv.value else {
                        return None;
                    };
                    Some(vm_name.to_string())
                } else {
                    None
                }
            })
        } else {
            None
        }
    }) else {
        eprintln!("No VMM kstats found");
        return;
    };

    println!("Found VMM kstats for VM with name: {id}");
    let vm = oximeter_instruments::kstat::virtual_machine::VirtualMachine {
        silo_id: uuid::Uuid::new_v4(),
        project_id: uuid::Uuid::new_v4(),
        instance_id: id.parse().unwrap(),
    };

    // Filter the kstats again to those in the `vmm` module, so that we can call
    // `to_samples()` as it normally would be called.
    let now = chrono::Utc::now();
    let kstats: Vec<_> = ctl.iter().filter(|kstat| kstat.ks_module == "vmm").map(|mut kstat| {
        (now, kstat, ctl.read(&mut kstat).unwrap())
    })
    .collect();

    let samples = vm.to_samples(&kstats).unwrap();
    println!("produced samples:\n{samples:#?}");
}
```

The tool first finds a VM name, so that we can construct a `VirtualMachine`
target. It then filters the list of kstats, which is normally the job of the
`KstatSampler`. This is all so we can test the `to_samples()` method, which
converts the kstats into the `VcpuUsage` metric samples we care about; there is
one sample per (vCPU, state) pair. Here is some of the output of that CLI run
on the Gimlet in cubby 10 of the dogfood rack:

```
BRM42220009 # /tmp/vcpu | head -60
Found VMM kstats for VM with name: 0fe5243a-ae48-44f3-8a42-3e55c8ea1d40
produced samples:
[
    Sample {
        measurement: Measurement {
            timestamp: 2024-01-19T22:53:30.652303313Z,
            datum: CumulativeU64(
                Cumulative {
                    start_time: 2024-01-19T22:53:34.520006783Z,
                    value: 595956991,
                },
            ),
        },
        timeseries_name: "virtual_machine:vcpu_usage",
        target: FieldSet {
            name: "virtual_machine",
            fields: {
                "instance_id": Field {
                    name: "instance_id",
                    value: Uuid(
                        0fe5243a-ae48-44f3-8a42-3e55c8ea1d40,
                    ),
                },
                "project_id": Field {
                    name: "project_id",
                    value: Uuid(
                        e23d5dbd-460a-40a8-b01d-3a914ce322d5,
                    ),
                },
                "silo_id": Field {
                    name: "silo_id",
                    value: Uuid(
                        ddc3dcc7-39c2-4200-8d4c-be3803f9994c,
                    ),
                },
            },
        },
        metric: FieldSet {
            name: "vcpu_usage",
            fields: {
                "state": Field {
                    name: "state",
                    value: String(
                        "init",
                    ),
                },
                "vcpu_id": Field {
                    name: "vcpu_id",
                    value: U32(
                        0,
                    ),
                },
            },
        },
    },
    Sample {
        measurement: Measurement {
            timestamp: 2024-01-19T22:53:30.652303313Z,
            datum: CumulativeU64(
                Cumulative {
                    start_time: 2024-01-19T22:53:34.520006783Z,
$
```

So the sample measured the cumulative time in the `time_init` microstate as
595956991, which matches what kstat itself reports:

```
BRM42220009 # kstat vmm:1:vm:vm_name
module: vmm                             instance: 1
name:   vm                              class:    misc
        vm_name                         0fe5243a-ae48-44f3-8a42-3e55c8ea1d40

BRM42220009 # kstat vmm:1:vcpu0:time_init
module: vmm                             instance: 1
name:   vcpu0                           class:    misc
        time_init                       595956991
```

So this all seems to work OK. I'm going to do more testing locally, starting and stopping a bunch of VMs, to check that the statistics continue to work and that they show up in ClickHouse as I expect.
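
For reference, the mapping from kstat names to metric fields is mechanical: the `vcpuN` kstat name supplies the `vcpu_id`, and each `time_<state>` named value supplies the `state`. A rough sketch of that mapping (an illustration of the idea, not the exact `to_samples()` implementation):

```rust
/// Sketch: derive the `vcpu_id` field from a per-vCPU kstat name.
fn vcpu_id_from_kstat_name(ks_name: &str) -> Option<u32> {
    // "vcpu0" -> Some(0); names like "vm" are skipped.
    ks_name.strip_prefix("vcpu")?.parse().ok()
}

/// Sketch: derive the `state` field from a named kstat value.
fn state_from_named_value(name: &str) -> Option<&str> {
    // "time_init" -> Some("init"); values like "vm_name" are skipped.
    name.strip_prefix("time_")
}

fn main() {
    assert_eq!(vcpu_id_from_kstat_name("vcpu0"), Some(0));
    assert_eq!(state_from_named_value("time_init"), Some("init"));
}
```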

@bnaecker bnaecker marked this pull request as draft January 20, 2024 01:32
@bnaecker bnaecker force-pushed the publish-instance-vcpu-stats branch 2 times, most recently from a900cdd to 59364a2 Compare January 21, 2024 20:08
@bnaecker
Collaborator Author

Fixes #4851

@bnaecker bnaecker force-pushed the publish-instance-vcpu-stats branch 4 times, most recently from 0c8226f to b45edeb Compare January 23, 2024 20:11
Commits:
- This is a WIP to test publishing this data from Propolis itself, rather than the sled-agent. Should we take that path, we'll end up deleting most of this diff in the sled-agent itself.
- Store number of vcpus in virtual machine
- impl target manually
- only produce samples from expected vcpu kstats
- track instance metrics once
@bnaecker bnaecker force-pushed the publish-instance-vcpu-stats branch from b45edeb to 225b07e Compare January 23, 2024 20:59
@bnaecker
Collaborator Author

After some discussion with Greg and Patrick, I'm moving towards getting the actual tracking of vCPU metrics into Propolis proper. The WIP doing that is https://github.com/oxidecomputer/propolis/tree/vcpu-usage-stats. I am inclined to keep the definitions of the oximeter::Target here, as a centralized definition of those statistics, behind a feature flag to avoid pulling more unused code into Propolis.
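
A minimal sketch of what that gating might look like; the feature name and module layout here are assumptions for illustration, not the actual code:

```rust
// Hypothetical layout for the crate holding the timeseries definitions.
// A consumer like Propolis would enable only this feature (name assumed for
// illustration), pulling in the `VirtualMachine` / `VcpuUsage` types without
// the kstat-sampling machinery the sled-agent uses.
#[cfg(feature = "virtual-machine")]
pub mod virtual_machine {
    // `VirtualMachine` target and `VcpuUsage` metric definitions live here.
}
```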

@bnaecker
Collaborator Author

I'm going to close this, and instead open a slightly smaller PR just for the first chunk of work listed in #4851. That will add the oximeter metric definitions and include the instance metadata in the sled-agent HTTP API.

@bnaecker bnaecker closed this Jan 25, 2024
@bnaecker bnaecker deleted the publish-instance-vcpu-stats branch January 25, 2024 22:07