
Converger meeting notes (2014 09 12)


Basic components

High level components:

  • REST API: takes REST API calls; tells the converger
  • Observer: takes events (e.g. AtomHopper); tells the converger
  • Converger: takes a desired capacity, then makes it so

In the future, there will also be:

  • Decisioner: figures out what the desired capacity is
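
A rough sketch of how these pieces line up (the class and method names are illustrative only, not the actual otter interfaces; the decisioner is left out since it is future work): the REST API and the observer both end up telling the same converger that a group needs attention.

```python
# Illustrative wiring only; the real interfaces may differ.


class Converger(object):
    """Takes a desired capacity for a group, then makes it so."""

    def set_desired_capacity(self, group_id, capacity):
        raise NotImplementedError()

    def converge(self, group_id):
        raise NotImplementedError()


class RestApi(object):
    """Takes REST API calls; tells the converger."""

    def __init__(self, converger):
        self.converger = converger

    def execute_policy(self, group_id, new_capacity):
        self.converger.set_desired_capacity(group_id, new_capacity)
        self.converger.converge(group_id)


class Observer(object):
    """Takes events (e.g. from AtomHopper); tells the converger."""

    def __init__(self, converger):
        self.converger = converger

    def event_received(self, group_id):
        # Edge triggering: the event itself only says "look at this group again".
        self.converger.converge(group_id)
```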

Detailed components:

Points:

  • The converger itself doesn’t contain a loop, but there’s a semantic loop in the behavior of the converger as part of the entire system. Converger does an iteration, that triggers events, which trigger iterations, et cetera.
  • The observer just gives us edge triggering. Its only job is to figure out which group an event pertains to and then fire off an event. In that sense, it is really a map + filter: it consumes the fire hose of AtomHopper events, finds the ones that are interesting, and re-formats them (see the sketch after this list).
  • The observer might at some point also use all the information it receives to keep our Nova caches warm, but that is not something we have to do right now.
  • I still like “planner” better than “decisioner” (@lvh)
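
As a sketch of the map + filter reading of the observer (the event shape and the server-to-group lookup below are assumptions for illustration, not the real AtomHopper feed format):

```python
# Hypothetical observer pipeline: consume the AtomHopper fire hose, keep only
# events that pertain to a group we manage, and poke the converger.


def interesting(event, known_server_ids):
    # Filter: only events about servers we manage matter.
    return event.get("server_id") in known_server_ids


def group_for(event, server_to_group):
    # Map: re-format a raw feed event into "this group needs attention".
    return server_to_group[event["server_id"]]


def observe(firehose, known_server_ids, server_to_group, converger):
    for event in firehose:
        if interesting(event, known_server_ids):
            converger.converge(group_for(event, server_to_group))
```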

Battle plan

  1. Basic implementation (soon)
    1. Implement converge() (sketched below)
    2. In-memory IConverger
    3. Run inside API node
  2. Connect with queue (future)
  3. Observer (future)
    1. Implement
    2. Hook up to queue
  4. Decisioner (distant future)
    1. Plan
    2. Implement
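
One possible shape for step 1 (a sketch under assumed data structures, not a settled design): make converge() a pure function from desired capacity plus observed servers to a list of steps, and wrap it in an in-memory converger that runs inside the API node.

```python
# Sketch of battle-plan step 1. The server dicts ("id", "created") and the
# step types are assumptions for illustration.

from collections import namedtuple

CreateServer = namedtuple("CreateServer", [])
DeleteServer = namedtuple("DeleteServer", ["server_id"])


def converge(desired_capacity, servers):
    """Return the steps needed to move `servers` towards `desired_capacity`."""
    delta = desired_capacity - len(servers)
    if delta > 0:
        return [CreateServer() for _ in range(delta)]
    if delta < 0:
        # One arbitrary total ordering: delete the newest servers first.
        doomed = sorted(servers, key=lambda s: s["created"], reverse=True)[:-delta]
        return [DeleteServer(s["id"]) for s in doomed]
    return []


class InMemoryConverger(object):
    """Steps 1.2/1.3: lives inside the API node; no queue, no shared state."""

    def __init__(self, get_desired, get_servers, execute):
        self.get_desired = get_desired  # group_id -> desired capacity
        self.get_servers = get_servers  # group_id -> list of server dicts
        self.execute = execute          # (group_id, steps) -> None

    def converge(self, group_id):
        # Plan with the pure converge() above, then hand the steps off.
        steps = converge(self.get_desired(group_id), self.get_servers(group_id))
        self.execute(group_id, steps)
```

Presumably step 2 then only has to move the "converge this group" calls onto the queue instead of making them in-process.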

Different approaches to scaling

| Idea | Partition groups over nodes | Distribute work across nodes |
| --- | --- | --- |
| Picks… | Consistency | Availability |
| Over… | Availability | Consistency |
| Hence… | Always 1 node responsible for group | Highly available |
| But also… | Sometimes downtime | Sometimes overreact |
| …and needs tricks for… | Avoiding downtime (TODO: how?) | Fix mistakes with tight loop, avoid overreacting when possible |
| Or, basically… | Look Before You Leap | Easier to Ask Forgiveness than Permission |
| Local state | Allowed | Disallowed |

Some tricks were discussed for making the behavior of these systems a little better:

  • Limit convergence in time, e.g. at most once every 30s (see the sketch after this list)
    • Difference in implementation between two scaling approaches:
      • When partitioning: you know you have a lock on the group, so you can have some local state.
      • When distributing: no local state because you have no exclusive lock on the group. Needs a shared timepiece.
        • Added ops difficulty.
        • Timepiece down? React anyway; if we overreact, the next convergence iteration resolves it.
        • Suggested Redis as a shared timepiece. (Distributed wallclocks are not a good idea.)
  • Limit convergence in size e.g. max 7 machines
    • Can mitigate issues with over-reacting
      • Suddenly overreact by at most 7 machines instead of, e.g., 50
    • Ties in with a point @cyli made:
      • Under-provisioning is way worse than over-provisioning!
        • Over-provisioning results in there being too many machines for a very brief time (on the order of one convergence cycle)
          • Worst outcome: several dollar-cents wasted in capacity
          • Doesn’t compare to overall cloud/otter savings
        • Under-provisioning results in downtime
          • Worst outcome: massive revenue loss
      • So: have a lower bound as well as an upper bound, and make those bounds asymmetric, e.g. -1 <= amount <= 7
      • @glyph noted that if all workers decide to delete the same nodes, we never under-provision. This just needs a total ordering over nodes, such as: building servers first, then ordered by age.
    • @jimbaker noted that we probably want bounds to be flexible:
      • Provisioning 2000 machines 7 at a time is not sensible
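
To make these tricks concrete, a rough sketch of three of them: clamping the per-iteration change to asymmetric bounds, picking deletion victims with a fixed total ordering (building first, then by age), and gating convergence through Redis as the shared timepiece. The bounds, key name, and data shapes are made up for illustration.

```python
# Illustrative sketches of the tricks above; none of this is decided.


def clamp_delta(delta, lower=-1, upper=7):
    # Asymmetric bounds: scale down by at most 1 and up by at most 7 per iteration.
    return max(lower, min(upper, delta))


def servers_to_delete(servers, count):
    # Total ordering so every worker picks the same victims:
    # servers that are still building first, then oldest first.
    ordered = sorted(servers, key=lambda s: (s["status"] != "BUILD", s["created"]))
    return ordered[:count]


def may_converge(redis_client, group_id, min_interval=30):
    # Shared timepiece: allow at most one convergence per group per interval.
    # SET NX EX means the gate expires by itself even if nothing cleans it up.
    return bool(redis_client.set("converge-gate:%s" % group_id, "1",
                                 nx=True, ex=min_interval))
```

Per the point above, if the Redis gate is unreachable we converge anyway and let the next iteration correct any overreaction.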

Some generic points that were made:

  • In distributed systems failures are common, and many of them are partial failures. We’re in the unusual situation that we have not just an idempotent operation, but a self-correcting loop that’s always going to converge towards the ideal like a damped sine wave. So, in general, I have a preference for continuing in the face of (partial) failure and simply correcting it automatically. (@lvh)
  • Tricks are cool, but we don’t actually know which ones we’ll need (for example: there was some doubt about whether we need to debounce convergence scheduling). Conclusion: build the converger, throw it into staging, see how it behaves. Fix problems only once we have proof they exist. (@lvh)
  • We need to make sure that the demands we’re making of our queue (currently probably Kafka) are actually things it provides and is good at.
  • No matter what we pick, Nova/CLB is never going to start acting like a CP system. The choice is purely an internal implementation detail, and does not affect how the converger behaves as a world-correcting loop. (@lvh)