You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The CPU temperature value we get from AMD host CPUs over SB-TSI represents the CPU's Tctl, which is not a measurement of physical temperature in Celsius like other temperature sensors. Instead, it's a dimensionless value from 0-100 calculated by the CPU's thermal control loop that represents a sort of "thermal throttling danger level". See oxidecomputer/omicron#6634,oxidecomputer/stlouis#5, or this block comment in stlouis for details on what this value actually means.
Right now, the control-plane-agent task will report Tctl values as a sensor measurement with MeasurementKind::Temperature, which suggests to other software that this is a physical temperature measurement in degrees Celsius. We should change this so that it's clearer that this is a dimensionless measurement. As of oxidecomputer/omicron#6656, the MGS subsystem that collects Oximeter metrics from SP sensor measurements special-cases Tctl based on the sensor name and device ID, so the Oximeter metrics are not recorded with incorrect units. This is probably the most important place to get it right, since Oximeter metrics are the primary way we expect customers to consume SP sensor data.
However, because we are currently special-casing Tctl only in the MGS sensor metrics task, these values will still be treated as physical temperature in Celsius elsewhere --- for instance, in faux-mgs, humility sensors, and to other consumers of the MGS HTTP API for reading SP component details. This is a bit of a bummer, so we should probably fix this in a more principled way.
Doing this in Hubris is probably a bit annoying, as we still want to represent the SB-TSI Tctl value as a
"temperature sensor" as far as task-thermal is concerned, as it's an important control parameter for the SP thermal control code...but, we would need to indicate that task-sensors and control-plane-agent should report the value from that particular temperature sensor with different units. Entirely doable, but just requires some plumbing.
While we're here, we probably want to consider changing the Hubris-reported names for this measurement to make it more clear that it's specifically the CPU's Tctl value. Right now, SB-TSI measurements are reported as a component with device: "sbtsi" and description: "CPU temperature sensor", which has one measurement channel with sensor: "CPU" (the Tctl value. We should probably consider changing this to sensor: "Tctl" or sensor: "CPU Tctl" to make this clearer --- especially if we were to support reading other CPU thermal measurements, such as Tdie or Tccd_n_ from Hubris in the future.1
Footnotes
Note that, as I understand it, those values can't be read through SB-TSI over SMBus, and would instead need to come over IPCC from the host or something. So, I'm not sure if/when we'll actually expose them to Hubris... ↩
The text was updated successfully, but these errors were encountered:
This commit adds a `Dimensionless` variant to the `MeasurementKind` enum
to represent measurements without units. This is intended to be used to
represent AMD CPU T<sub>ctl</sub> thermal values, which are a unitless
value from 0-100 representing a kind of "abstract thermal throttling
danger level" --- see oxidecomputer/hubris#1881 for details on why we
need to differentiate this from other temperature measurements.
I wasn't totally sure as to the best approach for naming this variant. I
note that the name of the enum is "measurement kind", not "measurement
unit", so it doesn't directly represent the unit of the measurement but
rather, the kind of quantity being measured. So, perhaps this ought to
be a `CpuTctl` variant or similar, to *specifically* represent
T<sub>ctl</sub> measurements. If other dimensionless values show up in
future, they would then have their own variant. I'm open to being
convicned either way --- what do others think?
This commit adds a `CpuTctl` variant to the `MeasurementKind` enum. This
is intended to be used to represent AMD CPU T<sub>ctl</sub> thermal
values, which are a unitless value from 0-100 representing a kind of
"abstract thermal throttling danger level" --- see
oxidecomputer/hubris#1881 for details on why we need to differentiate
this from other temperature measurements.
The CPU temperature value we get from AMD host CPUs over SB-TSI represents the CPU's Tctl, which is not a measurement of physical temperature in Celsius like other temperature sensors. Instead, it's a dimensionless value from 0-100 calculated by the CPU's thermal control loop that represents a sort of "thermal throttling danger level". See oxidecomputer/omicron#6634,oxidecomputer/stlouis#5, or this block comment in stlouis for details on what this value actually means.
Right now, the
control-plane-agent
task will report Tctl values as a sensor measurement withMeasurementKind::Temperature
, which suggests to other software that this is a physical temperature measurement in degrees Celsius. We should change this so that it's clearer that this is a dimensionless measurement. As of oxidecomputer/omicron#6656, the MGS subsystem that collects Oximeter metrics from SP sensor measurements special-cases Tctl based on the sensor name and device ID, so the Oximeter metrics are not recorded with incorrect units. This is probably the most important place to get it right, since Oximeter metrics are the primary way we expect customers to consume SP sensor data.However, because we are currently special-casing Tctl only in the MGS sensor metrics task, these values will still be treated as physical temperature in Celsius elsewhere --- for instance, in
faux-mgs
,humility sensors
, and to other consumers of the MGS HTTP API for reading SP component details. This is a bit of a bummer, so we should probably fix this in a more principled way.Doing this in Hubris is probably a bit annoying, as we still want to represent the SB-TSI Tctl value as a
"temperature sensor" as far as
task-thermal
is concerned, as it's an important control parameter for the SP thermal control code...but, we would need to indicate thattask-sensors
andcontrol-plane-agent
should report the value from that particular temperature sensor with different units. Entirely doable, but just requires some plumbing.While we're here, we probably want to consider changing the Hubris-reported names for this measurement to make it more clear that it's specifically the CPU's Tctl value. Right now, SB-TSI measurements are reported as a component with
device: "sbtsi"
anddescription: "CPU temperature sensor"
, which has one measurement channel withsensor: "CPU"
(the Tctl value. We should probably consider changing this tosensor: "Tctl"
orsensor: "CPU Tctl"
to make this clearer --- especially if we were to support reading other CPU thermal measurements, such as Tdie or Tccd_n_ from Hubris in the future.1Footnotes
Note that, as I understand it, those values can't be read through SB-TSI over SMBus, and would instead need to come over IPCC from the host or something. So, I'm not sure if/when we'll actually expose them to Hubris... ↩
The text was updated successfully, but these errors were encountered: