Publish instance vCPU usage statistics to oximeter
#4855
This still needs more testing around the instance lifecycle itself, but here are some manual tests I ran to verify the new code for pulling kstats themselves from the […]

Testing this is hard, since it requires an actual VM. Instead, I wrote a tiny […]

```rust
use oximeter_instruments::kstat::KstatTarget;

fn main() {
    let ctl = kstat_rs::Ctl::new().unwrap();

    // Find a VMM kstat.
    //
    // We'll use this to cons up a `VirtualMachine` that matches, to test our
    // shit.
    let Some(id) = ctl.iter().find_map(|mut kstat| {
        if kstat.ks_module == "vmm" && kstat.ks_name == "vm" {
            let data = ctl.read(&mut kstat).unwrap();
            let kstat_rs::Data::Named(named) = data else {
                return None;
            };
            named.iter().find_map(|nv| {
                if nv.name == "vm_name" {
                    let kstat_rs::NamedData::String(vm_name) = nv.value else {
                        return None;
                    };
                    Some(vm_name.to_string())
                } else {
                    None
                }
            })
        } else {
            None
        }
    }) else {
        eprintln!("No VMM kstats found");
        return;
    };
    println!("Found VMM kstats for VM with name: {id}");

    let vm = oximeter_instruments::kstat::virtual_machine::VirtualMachine {
        silo_id: uuid::Uuid::new_v4(),
        project_id: uuid::Uuid::new_v4(),
        instance_id: id.parse().unwrap(),
    };

    // Filter the kstats again to those in the `vmm` module, so that we can call
    // `to_samples()` as it normally will be
    let now = chrono::Utc::now();
    let kstats: Vec<_> = ctl
        .iter()
        .filter(|kstat| kstat.ks_module == "vmm")
        .map(|mut kstat| (now, kstat, ctl.read(&mut kstat).unwrap()))
        .collect();
    let samples = vm.to_samples(&kstats).unwrap();
    println!("produced samples:\n{samples:#?}");
}
```

So it finds a VM name first, so that we can construct a […]
So that measured the cumulative time in the […]

So this all seems to work OK. I'm going to do more testing locally, starting and stopping a bunch of VMs to check that the statistics continue to work and that they show up in ClickHouse as I expect.
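
Since these kstats are cumulative, checking them by hand usually means comparing two readings. Below is a minimal sketch of that arithmetic, using only the standard library. The `Reading` type and `utilization` function are hypothetical names for illustration and are not part of oximeter or kstat-rs. Treating a counter that goes backwards as "no answer" is also an assumption, chosen because a restarted VM would presumably start counting from zero again.

```rust
/// Hypothetical reading of one cumulative counter: when it was taken and the
/// total time accumulated so far, both in nanoseconds.
#[derive(Clone, Copy, Debug)]
struct Reading {
    timestamp_ns: u64,
    cumulative_ns: u64,
}

/// Fraction of the interval between two readings that was spent in the
/// state the counter tracks.
///
/// Returns `None` if the interval is empty, or if either value went
/// backwards (assumed here to mean the counter was reset).
fn utilization(prev: Reading, next: Reading) -> Option<f64> {
    let elapsed = next.timestamp_ns.checked_sub(prev.timestamp_ns)?;
    let spent = next.cumulative_ns.checked_sub(prev.cumulative_ns)?;
    if elapsed == 0 {
        return None;
    }
    Some(spent as f64 / elapsed as f64)
}

fn main() {
    let prev = Reading { timestamp_ns: 0, cumulative_ns: 1_000 };
    let next = Reading { timestamp_ns: 10_000, cumulative_ns: 3_500 };
    println!("utilization: {:?}", utilization(prev, next));

    // A counter that restarted from zero yields no answer instead of a
    // bogus negative or wrapped value.
    let reset = Reading { timestamp_ns: 20_000, cumulative_ns: 10 };
    println!("after reset: {:?}", utilization(next, reset));
}
```

Returning `Option` keeps the reset case explicit, which seems useful when starting and stopping VMs in a loop.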
a900cdd to 59364a2
Fixes #4851
0c8226f to b45edeb
- Adds the silo and project IDs to the instance-ensure request from Nexus to the sled-agent. These are used as fields on the instance-related statistics.
- Defines a `VirtualMachine` oximeter target and `VcpuUsage` metric. The latter has a `state` field which corresponds to the named kstats published by the hypervisor that accumulate the time spent in a number of vCPU microstates. The combination of these should allow us to aggregate or break down vCPU usage by silo, project, instance, vCPU ID, and CPU state.
- Adds APIs to the `MetricsManager` for starting / stopping tracking instance-related metrics, and plumbs the type through the `InstanceManager` and `Instance` (and their internal friends), so that new instances can control when data is produced from them. Currently, we'll start producing as soon as we get a non-terminate response from Propolis in the `instance_state_monitor()` task, and stop when the instance is terminated.
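
To illustrate the "aggregate or break down" point, here is a rough sketch of how the target and metric fields could be used as grouping keys. It uses only the standard library. The struct layouts, the integer IDs standing in for UUIDs, the state names `"run"` and `"idle"`, and the `total_by` helper are all assumptions made for the example, not the definitions added in this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the `VirtualMachine` target. Small integers
/// replace UUIDs to keep the example dependency-free.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct VirtualMachine {
    silo_id: u32,
    project_id: u32,
    instance_id: u32,
}

/// Hypothetical stand-in for one `VcpuUsage` data point: the cumulative
/// nanoseconds one vCPU has spent in one microstate.
#[derive(Clone, Debug)]
struct VcpuUsage {
    target: VirtualMachine,
    vcpu_id: u32,
    state: &'static str,
    cumulative_ns: u64,
}

/// Sum the cumulative values, grouped by whatever key `key_fn` extracts.
fn total_by<K: Ord>(
    samples: &[VcpuUsage],
    key_fn: impl Fn(&VcpuUsage) -> K,
) -> BTreeMap<K, u64> {
    let mut totals = BTreeMap::new();
    for sample in samples {
        *totals.entry(key_fn(sample)).or_insert(0) += sample.cumulative_ns;
    }
    totals
}

/// Made-up data: two instances in one silo, in different projects.
fn example_samples() -> Vec<VcpuUsage> {
    let a = VirtualMachine { silo_id: 1, project_id: 10, instance_id: 1 };
    let b = VirtualMachine { silo_id: 1, project_id: 20, instance_id: 2 };
    let point = |target, vcpu_id, state, cumulative_ns| VcpuUsage {
        target,
        vcpu_id,
        state,
        cumulative_ns,
    };
    vec![
        point(a, 0, "run", 700),
        point(a, 0, "idle", 300),
        point(a, 1, "run", 200),
        point(a, 1, "idle", 800),
        point(b, 0, "run", 500),
        point(b, 0, "idle", 500),
    ]
}

fn main() {
    let samples = example_samples();
    // Break down by CPU state across everything.
    println!("{:?}", total_by(&samples, |s| s.state));
    // Roll up to silo or project.
    println!("{:?}", total_by(&samples, |s| s.target.silo_id));
    println!("{:?}", total_by(&samples, |s| s.target.project_id));
    // Drill down to a single vCPU of a single instance.
    println!("{:?}", total_by(&samples, |s| (s.target.instance_id, s.vcpu_id)));
}
```

Because every field is part of each data point, any combination of silo, project, instance, vCPU ID, and state can serve as the key without changing how the data is collected.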
b45edeb to 225b07e
After some discussion with Greg and Patrick, I'm moving towards getting the actual tracking of vCPU metrics into Propolis proper. The WIP doing that is https://github.com/oxidecomputer/propolis/tree/vcpu-usage-stats. I am inclined to keep the definitions of the […]

I'm going to close this, and instead re-open a slightly smaller PR just for the first chunk of work listed in #4851. That will add the […]