Convergence metrics


WORK IN PROGRESS

Rationale

Right now we focus on highly visible and visual metrics. This is great, because we can quickly see if, when, and how things go wrong. However, we want to:

  • quantify how well the entire system is behaving,
  • quantify how well individual components are behaving,
  • use these metrics to judge current development,
  • use these metrics to guide future development.

Generic properties of metrics

Independently measuring change impact

Suppose we land a patch in otter that we believe will increase total system reliability. We implement the patch, and sure enough, reliability metrics improve. However, it turns out the patch actually made problems worse; we saw things improve because of some unrelated thing happening (e.g. capacity upgrades, Repose being upgraded, software upgrades in systems we depend on...)

The easiest way to resolve this problem is to do A/B testing.
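A minimal sketch of what that could look like, assuming scaling groups can be randomly assigned to a control or experiment cohort and that we already have some per-group metric to compare (all names here are hypothetical):

```python
import random
from statistics import mean

def split_cohorts(group_ids, seed=0):
    """Randomly split scaling groups into control and experiment cohorts."""
    rng = random.Random(seed)
    shuffled = list(group_ids)
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

def cohort_difference(metric_by_group, control, experiment):
    """Difference in the mean per-group metric between the two cohorts.

    Only the experiment cohort gets the patch; an unrelated improvement
    (capacity upgrades, dependency upgrades, ...) should move both cohorts
    roughly equally and mostly cancel out of this difference.
    """
    return (mean(metric_by_group[g] for g in experiment)
            - mean(metric_by_group[g] for g in control))
```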

User error vs system error

This is generally useful, and we've already discussed adding a "user error" state to groups, so we should simply:

  • do that, and
  • make it available for metrics calculation.
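As a rough sketch of how that split could feed into metrics, assuming each group record carries a state field and that "ERROR" and "USER_ERROR" end up as distinct states (names are placeholders, not the actual schema):

```python
def partition_errors(groups):
    """Split errored groups into system errors (ours) and user errors (theirs).

    Only the system errors should count against convergence's own metrics.
    """
    system_errors = [g for g in groups if g["state"] == "ERROR"]
    user_errors = [g for g in groups if g["state"] == "USER_ERROR"]
    return system_errors, user_errors
```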

Per group vs aggregate, measuring regret

It is important that metrics measuring bad things happening measure "regret": positive values mean something bad happened, and zero is ideal. This prevents aggregate errors from canceling out. For example, if we were to measure the difference between desired capacity and actual capacity, and we overprovision by 10 on one group but underprovision by 5 on two other groups, the "total" is zero, even though we actually performed poorly.
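In code, with plain absolute error standing in as the simplest possible regret, the example above looks like this:

```python
# Per-group capacity deltas: over-provisioned by 10 on one group,
# under-provisioned by 5 on two others.
deltas = [+10, -5, -5]

# Naive aggregate: the errors cancel out and the system looks perfect.
naive_total = sum(deltas)                    # 0

# Regret: every deviation counts as a positive amount of badness.
regret_total = sum(abs(d) for d in deltas)   # 20
```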

Delta asymmetry

Under-provisioning and over-provisioning by the same amount are not equally bad. Being over capacity may cost marginally more money, but under-provisioning usually comes with service degradation. Mathematically, the derivative of the delta -> regret function isn't symmetrical: its magnitude is greater below zero than above zero.

Under- or over-provisioning by a small amount near the desired capacity is also not as bad as missing the desired capacity by a lot; large discrepancies need to be punished much more severely than small ones. Mathematically, the derivative of the delta -> regret function isn't constant: its magnitude is low near zero and high away from zero.

A simple function that does this (and measures regret, not delta) is x ** 2 if x < 0 else x ** 1.6. It looks like this:

(plot of the regret function)
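Written out as Python, assuming the delta is actual minus desired capacity (so under-provisioning is negative):

```python
def regret(delta):
    """Regret for a capacity delta (assumed to be actual - desired).

    Under-provisioning (delta < 0) is punished more heavily than
    over-provisioning, and large deltas much more heavily than small ones.
    """
    return delta ** 2 if delta < 0 else delta ** 1.6
```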

Specific metrics

Desired/actual delta, integrated over time

TODO: this thing needs a good name.

This is the "area between the curves" metric, where "the curves" are the desired and actual capacity:

(plot: the area between the desired and actual capacity curves)

Both desired and actual capacity are step functions, so integrating them is normally trivial. However, keep in mind the points above about a) regret (naively integrating the difference lets errors cancel out) and b) asymmetry.
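A sketch of the integration, reusing the regret function above and assuming a hypothetical, time-sorted list of (timestamp, desired, actual) samples where both series hold their value until the next sample:

```python
def capacity_regret(samples, end_time):
    """Integrate regret(actual - desired) over time for one group."""
    total = 0.0
    for i, (t, desired, actual) in enumerate(samples):
        # Each sample's value holds until the next sample (or end_time).
        next_t = samples[i + 1][0] if i + 1 < len(samples) else end_time
        total += regret(actual - desired) * (next_t - t)
    return total
```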

This metric has upsides and downsides. It is a high-level aggregate metric, which is good because it measures how much the entire system failed to do what was asked, but bad because it doesn't really measure any particular property.

Time to convergence

Several useful things we can measure here:

  • Time to complete the components of a single convergence cycle (gathering, planning, executing)
  • Time to complete a single convergence cycle (gather -> plan -> execute)
  • Time for convergence to quiesce (settle into a stable state)

These things aren't scale-invariant: we will most likely need to normalize for the number of servers.
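A hedged sketch of collecting and normalizing per-phase timings; the phase names follow the list above, and the linear per-server normalization is an assumption rather than a settled model:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, phase):
    """Record the wall-clock duration of one convergence phase into `timings`."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[phase] = time.monotonic() - start

def normalize(timings, server_count):
    """Divide each phase's duration by the server count so groups of different
    sizes become roughly comparable."""
    return {phase: seconds / max(server_count, 1)
            for phase, seconds in timings.items()}

# Usage: wrap each phase of a cycle, then normalize.
# timings = {}
# with timed(timings, "gather"): gather()
# with timed(timings, "plan"): plan()
# with timed(timings, "execute"): execute()
# per_server = normalize(timings, server_count)
```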

Server build time, normalized per image size

Right now, we use a single very heavy-handed fixed timeout. In the future, it may be useful to make that a bit smarter: a timeout that's reasonable for a huge image is probably not reasonable for a very common base image that should be hot in all the caches. When we do, it'd be great to have data to support that.
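One possible first cut, assuming we record image sizes and historical per-image build times (the normalization and the timeout policy here are placeholders, not tuned decisions):

```python
from statistics import mean, stdev

def normalized_build_time(build_seconds, image_bytes):
    """Build time per gigabyte of image, so large and small images compare."""
    return build_seconds / max(image_bytes / 1e9, 1e-9)

def timeout_for_image(build_times, fallback=3600.0, padding=3.0):
    """Derive a per-image build timeout from that image's historical build times.

    Mean plus `padding` standard deviations, with a fixed fallback when we
    have too little data to say anything useful.
    """
    if len(build_times) < 2:
        return fallback
    return mean(build_times) + padding * stdev(build_times)
```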