Converger meeting notes (2014 09 12)
High-level components:
- REST API
  - Takes REST API calls; tells the converger
- Observer
  - Takes events (e.g. from AtomHopper); tells the converger
- Converger
  - Takes a desired capacity, then makes it so

In the future, there will also be:
- Decisioner
  - Figures out what the desired capacity is
Detailed components:
Points:
- The converger itself doesn’t contain a loop, but there’s a semantic loop in the behavior of the converger as part of the entire system: the converger does an iteration, that triggers events, and those events trigger further iterations, et cetera.
- The observer just gives us edge triggering. Its only job is to figure out which group some event pertains to and then fire off an event. In that sense, it is really a map + filter: it consumes the fire hose of AtomHopper events, finds the ones that are interesting, and re-formats them (see the sketch after these points).
- The observer might at some point also use all the information it receives to keep our Nova caches warm, but that is not something we have to do right now.
- I still like “planner” better than “decisioner” (@lvh)
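To make the map + filter framing concrete, here is a minimal sketch of the observer’s core loop. The event shapes, the set of interesting event types, and the helper names (`group_for_server`, `notify_converger`) are all assumptions for illustration, not otter’s actual code:

```python
# Hypothetical sketch: event fields, interesting event types, and helper
# names are assumptions, not otter APIs.
INTERESTING = {"server.create", "server.delete", "server.error"}

def observe(event_stream, group_for_server, notify_converger):
    """Consume the AtomHopper fire hose; filter, map, and notify."""
    for event in event_stream:
        # filter: drop events that can't affect a scaling group's capacity
        if event.get("type") not in INTERESTING:
            continue
        # map: figure out which group (if any) the event pertains to
        group_id = group_for_server(event.get("server_id"))
        if group_id is None:
            continue
        # re-format into the minimal "this group may need convergence" message
        notify_converger({"group_id": group_id, "reason": event["type"]})
```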
- Basic implementation (soon)
  - Implement `converge()`
  - In-memory `IConverger` (see the sketch after this list)
  - Run inside API node
  - Implement
  - Connect with queue (future)
- Observer (future)
  - Implement
  - Hook up to queue
- Decisioner (distant future)
  - Plan
  - Implement
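A minimal sketch of what the first in-memory `IConverger` might look like, assuming a zope.interface style; everything except the `IConverger`/`converge()` names (the stand-in callables for Nova/CLB, in particular) is made up for illustration:

```python
from zope.interface import Interface, implementer


class IConverger(Interface):
    def converge(group_id, desired_capacity):
        """Drive the group's actual capacity towards desired_capacity."""


@implementer(IConverger)
class InMemoryConverger(object):
    """Toy in-memory converger, meant to run inside the API node."""

    def __init__(self, get_actual_capacity, create_server, delete_server):
        # These callables are hypothetical stand-ins for Nova/CLB calls.
        self._actual = get_actual_capacity
        self._create = create_server
        self._delete = delete_server

    def converge(self, group_id, desired_capacity):
        delta = desired_capacity - self._actual(group_id)
        for _ in range(delta):        # empty when delta <= 0
            self._create(group_id)
        for _ in range(-delta):       # empty when delta >= 0
            self._delete(group_id)
```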
Two approaches to scaling the converger across nodes were compared:

| Idea | Partition groups over nodes | Distribute work across nodes |
| --- | --- | --- |
| Picks… | Consistency | Availability |
| Over… | Availability | Consistency |
| Hence… | Always 1 node responsible for a group | Highly available |
| But also… | Sometimes downtime | Sometimes overreacts |
| …and needs tricks for… | Avoiding downtime (TODO: how?) | Fixing mistakes with a tight loop, avoiding overreacting when possible |
| Or, basically… | Look Before You Leap | Easier to Ask Forgiveness than Permission |
| Local state | Allowed | Disallowed |
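To make the left-hand column concrete, here is a hypothetical sketch of partitioning groups over converger nodes with a deterministic hash. This is not otter code; a real deployment would get the agreed-upon node list from whatever coordinates cluster membership:

```python
import hashlib

def node_for_group(group_id, nodes):
    """Deterministically map a group onto exactly one converger node.

    As long as every node agrees on the `nodes` list, exactly one node
    considers itself responsible for a group (the consistency column).
    If that node is down, the group is not converged until membership
    changes (the "sometimes downtime" row).
    """
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# A node only acts on groups it owns:
# if node_for_group(group_id, current_nodes) == my_node_id:
#     converger.converge(group_id, desired_capacity)
```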
Some tricks were discussed for making the behavior of these systems a little better:

- Limit convergence in time, e.g. max every 30s
  - Difference in implementation between the two scaling approaches:
    - When partitioning: you know you have a lock on the group, so you can have some local state.
    - When distributing: no local state, because you have no exclusive lock on the group. Needs a shared timepiece.
      - Added ops difficulty.
      - Timepiece down? React anyway; if overreacting, resolve it in the next convergence iteration.
      - Suggested Redis as a shared timepiece. (Distributed wallclocks are not a good idea.)
- Limit convergence in size, e.g. max 7 machines
  - Can mitigate issues with over-reacting:
    - Suddenly overreact by max 7 instead of max e.g. 50
  - Ties in with a point @cyli made:
    - Under-provisioning is way worse than over-provisioning!
      - Over-provisioning results in there being too many machines for a very brief time (on the order of one convergence cycle)
        - Worst outcome: several dollar-cents wasted in capacity
        - Doesn’t compare to overall cloud/otter savings
      - Under-provisioning results in downtime
        - Worst outcome: massive revenue loss
  - So: have a lower bound as well as an upper bound, and make those bounds asymmetric, e.g. `-1 <= amount <= 7` (see the sketch after this list)
  - @glyph noted that if all workers decide to delete the same nodes, we never under-provision. This just needs a total ordering over nodes, such as: building first, then order by age (also sketched below).
  - @jimbaker noted that we probably want the bounds to be flexible:
    - Provisioning 2000 machines 7 at a time is not sensible
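Two of the tricks above, sketched in hypothetical code (the bounds, field names, and function names are illustrative assumptions, not otter APIs): clamping the convergence step to asymmetric bounds, and deleting under a total ordering so concurrent workers pick the same victims:

```python
MIN_STEP = -1  # scale down by at most 1 per iteration
MAX_STEP = 7   # scale up by at most 7 per iteration

def clamp_step(desired_capacity, actual_capacity):
    """Limit convergence in size: -1 <= amount <= 7."""
    return max(MIN_STEP, min(MAX_STEP, desired_capacity - actual_capacity))

def servers_to_delete(servers, amount):
    """Pick `amount` servers to delete under a total ordering: servers
    still building first, then by creation time. Because the ordering is
    total, every worker that decides to scale down picks the same servers,
    so concurrent workers cannot under-provision the group between them."""
    ordered = sorted(servers,
                     key=lambda s: (s["status"] != "BUILD", s["created"]))
    return ordered[:amount]
```

Under these (made-up) bounds, converging a 2000-machine change 7 servers at a time is indeed slow, which is exactly @jimbaker’s point about the bounds needing to be flexible.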
Some generic points that were made:
- In distributed systems failures are common, and many failures are partial failures. We’re in the unusual situation that we have not just an idempotent operation, but a self-correcting loop that’s always going to converge towards the ideal like a damped sine. So, in general, I have a preference for continuing in the face of (partial) failure and simply correcting it automatically. (@lvh)
- Tricks are cool, but we don’t actually know which ones we’ll need (for example: there was some doubt about whether we need to debounce convergence scheduling). Conclusion: build the converger, throw it into staging, and see how it behaves. Fix problems only once we have proof they exist. (@lvh)
- We need to make sure that the demands we’re making of our queue (currently probably Kafka) are actually things that it provides and is good at.
- No matter what we pick, Nova/CLB is never going to start acting like a CP system. This is purely an internal implementation detail, and does not affect how the converger behaves as a world-correcting loop. (@lvh)