Convergence metrics
WORK IN PROGRESS
Right now we focus on highly visible and visual metrics. This is great, because we can quickly see if, when and how things go wrong. However, we also want to:
- quantify how well the entire system is behaving,
- quantify how well individual components are behaving,
- use these metrics to judge current development,
- use these metrics to guide future development.
Suppose we land a patch in otter that we believe will increase total system reliability. We implement the patch, and sure enough, reliability metrics improve. However, it turns out the patch actually made problems worse; we only saw things improve because of some unrelated change (e.g. capacity upgrades, Repose being upgraded, software upgrades in systems we depend on...).
The easiest way to resolve this problem is to do A/B testing.
This is generally useful, and we've already discussed adding a "user error" state to groups, so we should simply:
- do that, and
- make it available for metrics calculation.
It is important that metrics that measure bad things happening measure "regret", i.e. positive values mean something bad happened, zero is ideal. This prevents aggregate errors from canceling out. For example, if we were to measure the difference between desired capacity and actual capacity, and we overprovision by 10 on one group but underprovision by 5 on two other groups, the "total" is zero, even though we actually performed poorly.
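The cancellation in that example can be checked directly. This is a minimal sketch: the list of per-group deltas (actual minus desired capacity) is hypothetical, and absolute value is used only as the simplest possible stand-in for a regret function.

```python
# Hypothetical per-group deltas (actual - desired capacity):
# one group over-provisioned by 10, two groups under-provisioned by 5 each.
deltas = [10, -5, -5]

# Summing signed deltas lets the errors cancel out.
signed_total = sum(deltas)

# Summing a non-negative regret (here simply the absolute value) does not.
regret_total = sum(abs(d) for d in deltas)

print(signed_total)  # 0
print(regret_total)  # 20
```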
Under-provisioning and over-provisioning by the same amount are not equally bad. Being over capacity may cost marginally more money, but under-provisioning usually comes with service degradation. Mathematically, the derivative of the delta -> regret function isn't symmetrical: its magnitude is greater below zero than above zero.
Under- or over-provisioning by the same amount near the desired capacity is not as bad as under- or over-provisioning far away from the desired capacity; large discrepancies need to be punished much more severely than small ones. Mathematically, the derivative of the delta -> regret function isn't constant: its magnitude is low near zero and high away from zero.
A simple function that does this (and measures regret, not delta) is x ** 2 if x < 0 else x ** 1.6. It looks like this:
[Figure: plot of the regret function]
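As a sketch, the expression above can be wrapped in a small function. The name regret and the convention that delta means actual minus desired capacity are assumptions made for illustration; the exponents are the ones given in the text.

```python
def regret(delta):
    """Regret for a capacity delta (assumed: actual - desired).

    Under-provisioning (delta < 0) is squared; over-provisioning is
    raised to the power 1.6, so it grows more slowly. Zero is ideal.
    """
    if delta < 0:
        return delta ** 2
    return delta ** 1.6


# Missing by 5 in either direction is not equally bad.
print(regret(-5))              # 25
print(regret(5) < regret(-5))  # True

# The earlier example (over by 10, under by 5 twice) no longer sums to zero.
print(sum(regret(d) for d in [10, -5, -5]) > 0)  # True
```

Note that the asymmetry only favours over-provisioning for deltas larger than 1 in magnitude, which holds for whole-server discrepancies.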
TODO: this thing needs a good name.
This is the "area between the curves" metric, where "the curves" are the desired and actual capacity:
Both desired and actual are piecewise step functions, so integrating them is normally trivial. However, keep in mind the above points about a) regret (regular integration of the difference will make the errors cancel out) and b) asymmetry.
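One way this integration could look is sketched below. It assumes each curve is represented as a sorted list of (time, capacity) breakpoints, with each value holding until the next breakpoint; that representation and the names step_value and integrated_regret are hypothetical. The regret function is the one from the previous section, repeated so the block is self-contained.

```python
def regret(delta):
    # Asymmetric regret from the text: x ** 2 if x < 0 else x ** 1.6.
    return delta ** 2 if delta < 0 else delta ** 1.6


def step_value(steps, t):
    """Value at time t of a step function given as sorted (time, value) pairs."""
    value = steps[0][1]
    for start, v in steps:
        if start > t:
            break
        value = v
    return value


def integrated_regret(desired, actual, end):
    """Integrate regret(actual - desired) from the first breakpoint to end."""
    # Between consecutive breakpoints of either curve, the delta is constant.
    times = {t for t, _ in desired} | {t for t, _ in actual} | {end}
    times = sorted(t for t in times if t <= end)
    total = 0.0
    for t0, t1 in zip(times, times[1:]):
        delta = step_value(actual, t0) - step_value(desired, t0)
        total += regret(delta) * (t1 - t0)
    return total


# Desired capacity jumps from 5 to 10 at t=10; actual catches up at t=15.
desired = [(0, 5), (10, 10)]
actual = [(0, 5), (15, 10)]
print(integrated_regret(desired, actual, end=20))  # 125.0
```

In this example the group is 5 servers short for 5 time units, so the result is 25 * 5 = 125, whereas the plain integral of the difference would be -25.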
This metric has upsides and downsides. It is a high-level aggregate metric, which is good because it measures how much the entire system failed to do what was asked, but bad because it doesn't really measure any particular property.
Several useful things we can measure here:
- Time to complete the components of a single convergence cycle (gathering, planning, executing)
- Time to complete a single convergence cycle (gather -> plan -> execute)
- Time for convergence to quiesce
These things aren't scale-invariant: we will most likely need to normalize for number of servers.
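A rough sketch of how the per-phase timings could be collected and normalized follows. The helper names timed and per_server are hypothetical, and dividing by the number of servers is only the simplest possible normalization; whether cycle time actually scales linearly with group size is an assumption that would need to be checked against real data.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(timings, phase):
    """Record how long the enclosed block took under timings[phase]."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[phase] = time.monotonic() - start


def per_server(timings, num_servers):
    """Divide each phase duration by the group size (assumed linear scaling)."""
    n = max(num_servers, 1)  # avoid dividing by zero for empty groups
    return {phase: seconds / n for phase, seconds in timings.items()}


timings = {}
with timed(timings, "gather"):
    pass  # the gathering step would run here
with timed(timings, "plan"):
    pass  # the planning step would run here
with timed(timings, "execute"):
    pass  # the executing step would run here

# Whole-cycle time is the sum of the phases.
timings["cycle"] = sum(timings.values())
normalized = per_server(timings, num_servers=20)
```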
Right now, our timeout is a very heavy-handed fixed value. In the future, it may be useful to make that a bit smarter: a timeout that's reasonable for a huge image is probably not reasonable for a very common base image that should be hot in all the caches. When we do, it'd be great to have data to support that.