Note: These lecture notes were slightly modified from the ones posted on the 6.824 course website from Spring 2015.
- larger topic is fault tolerance via replicated state machines
- Raft -- a much more complete design than straight Paxos
Raft overview:
clients -> leader -> followers -> logs -> execution
- Our use of Paxos:
- agrees separately on each client operation
- Raft:
- agrees on each new leader (and on tail of log)
- agreement not required for most client operations
- Raft is Paxos optimized for log appends (more or less)
- why Raft-style leader?
- no dueling proposers (unless leader fails)
- leader just tells other people what to do
- fewer messages, less complexity (unless leader fails)
- well-defined notion of one log being more complete than another
- simplifies switching leaders (and maybe crash recovery)
- very hard to find a solution for this in Paxos because logs have "holes"
- you must decide for yourself
- straight Paxos is simpler than Raft
- but straight Paxos is too simple for practical replication
- everyone extends it in their own way
- and ends up with something more or less like Raft
- Paxos+log+leader probably not simpler than Raft
- though presumably depends on which Paxos variant you choose
Is more direct use of Paxos (like Lab 3) ever a win?
- i.e. is a Raft-style leader ever a bad idea?
- geographically spread peers
- a single leader would be far from some clients
- some peers would have high latency to other peers (Paxos tolerates lagging peers)
Let's start w/ Raft w/ no leader change
- for now, reliable leader
- followers may be slow or unreachable (but they do not lose state)
- what do we want?
- tolerate a minority of failed followers
- live followers and followers that recover must converge on the same log, since replication requires executing operations in the same order
- execute an entry only when it cannot be lost (is committed), since we can't easily undo an execution or take back a reply to a client
- idea for ensuring identical log:
- leader sends log entry, index, and info about previous entry
- follower can reject (e.g. "I don't have the previous entry!")
- leader backs up for that follower, sends earlier entries
- leader forces followers' logs to be identical to leader's
- idea for execution:
- idea #1 means leader knows follower is identical up to some point
- once a majority are identical up to a point,
- leader sends that out as commit point,
- everyone can execute through that point,
- leader can reply to clients
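The two ideas above fit together roughly as follows. This is a minimal sketch in Go, not the paper's or the labs' exact code; the type and field names (Entry, AppendEntriesArgs, Follower) are illustrative, loosely following Figure 2.

    package raft

    // Illustrative types, loosely modeled on Figure 2.
    type Entry struct {
        Term    int
        Command interface{}
    }

    type AppendEntriesArgs struct {
        Term         int
        PrevLogIndex int
        PrevLogTerm  int
        Entries      []Entry
        LeaderCommit int
    }

    type Follower struct {
        currentTerm int
        log         []Entry // log[0] is a dummy entry, so real entries start at index 1
        commitIndex int     // everything up through commitIndex is safe to execute
    }

    // handleAppendEntries implements idea #1 (reject if the previous entry
    // doesn't match, so the leader backs up) and idea #2 (only execute entries
    // up through the leader's commit point).
    func (f *Follower) handleAppendEntries(args AppendEntriesArgs) bool {
        if args.Term < f.currentTerm {
            return false // message from a stale leader
        }
        if args.PrevLogIndex >= len(f.log) ||
            f.log[args.PrevLogIndex].Term != args.PrevLogTerm {
            return false // "I don't have the previous entry!" -- leader will back up
        }
        // overwrite any conflicting entries and append the new ones, so this
        // follower's log becomes identical to the leader's over this range
        for i, e := range args.Entries {
            idx := args.PrevLogIndex + 1 + i
            if idx < len(f.log) && f.log[idx].Term == e.Term {
                continue // already have this entry
            }
            f.log = append(f.log[:idx], args.Entries[i:]...)
            break
        }
        // execute (apply) only up through the leader's declared commit point
        if args.LeaderCommit > f.commitIndex {
            last := args.PrevLogIndex + len(args.Entries)
            if args.LeaderCommit < last {
                f.commitIndex = args.LeaderCommit
            } else {
                f.commitIndex = last
            }
        }
        return true
    }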
- other servers time out (no AppendEntries "heart-beats" for a while)
- if other servers are missing heartbeats, they start to suspect the leader is down
- can't really know for sure leader is down/up on a network
- choose a new leader!
- Raft divides time into terms
- most terms have a leader
- what can go wrong while switching to a new leader?
- two leaders
- no leader
- might forget an executed log entry
- logs might end up different (diverge)
Talk about leader election first, then log consistency at term boundary
- (look at Figure 2, RequestVote RPC, and Rules for Servers)
- leader must get votes from a majority of servers
- Rule: server can cast only one vote per term
- thus at most one server can think it has won
- why a majority?
- the answer is always the same!
- "requiring a majority means not requiring a minority essentially"
- allows fault tolerance (failure of minority doesn't impede progress)
- prevents split brain (at most one candidate can get a majority)
- ensures overlap (at least one in majority has every previously committed log entry)
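A minimal sketch of the one-vote-per-term rule, with illustrative Go types (Server, RequestVoteArgs) rather than the labs' exact API; the real rule also checks that the candidate's log is "at least as up to date", which comes later in these notes.

    package raft

    type RequestVoteArgs struct {
        Term        int
        CandidateId int
    }

    type Server struct {
        currentTerm int
        votedFor    int // -1 means "haven't voted yet in currentTerm"
    }

    // grantVote grants at most one vote per term, so at most one candidate
    // can collect a majority of votes for any given term.
    // (Real Raft additionally refuses if the candidate's log is less
    // up to date than this server's -- see the voting rule below.)
    func (s *Server) grantVote(args RequestVoteArgs) bool {
        if args.Term < s.currentTerm {
            return false // candidate's term is stale
        }
        if args.Term > s.currentTerm {
            s.currentTerm = args.Term // newer term: any old vote no longer counts
            s.votedFor = -1
        }
        if s.votedFor == -1 || s.votedFor == args.CandidateId {
            s.votedFor = args.CandidateId
            return true
        }
        return false
    }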
Could election fail to choose any leader?
- Yes!
- e.g. >= 3 candidates split the vote evenly, or an even # of live servers and two candidates each get half
What happens after an election in which no one gets a majority?
- timeout, increment term, new election
- when a server decides it might want to be a candidate, it first waits a random delay; only if it hears from no leader during that delay does it become a candidate
- higher term takes precedence, candidates for older terms quit
- Note: timeout must be longer than it takes to complete election!
- Note: this means some terms may have no leader, no log entries
- each server delays a random amount of time before starting candidacy
- why is the random delay useful?
- [see diagram of times at which servers' delays expire]
- one will choose lowest random delay
- hopefully enough time to elect before next delay expires
- this idea comes up often in distributed systems
Diagram:
     20 ms                  50 ms               80 ms
  |----*----------------------*-------------------*------------|
       S1                     S2                  S3
- too short: 2nd candidate starts before first finishes
- too long: system sits idle for too long after leader fails
- a rough guide:
- suppose it takes 10ms to complete an unopposed election
- and there are five servers
- we want delays to be separated by (say) 20ms
- so random delay from 0 to 100 ms
- plus a few multiples of leader heartbeat interval
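A sketch of how a server might pick its randomized timeout using the rough numbers above; the constants and the function name are assumptions for illustration, not anything Raft mandates.

    package raft

    import (
        "math/rand"
        "time"
    )

    const (
        heartbeatInterval = 50 * time.Millisecond  // assumed leader heartbeat period
        randomSpread      = 100 * time.Millisecond // ~5 servers * 20 ms separation
    )

    // electionTimeout returns how long to wait without hearing from a leader
    // before starting a candidacy: a few heartbeat intervals as a base, plus a
    // random extra delay so one server usually times out well before the rest.
    func electionTimeout() time.Duration {
        base := 3 * heartbeatInterval
        extra := time.Duration(rand.Int63n(int64(randomSpread)))
        return base + extra
    }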
Remember this random delay idea!
- it's a classic scheme for decentralized soft election; e.g. ethernet
Raft's elections follow a common pattern: separation of safety from progress
- Hard mechanisms ensure < 2 leaders in one term
- Problem: elections can fail (e.g. 3-way split)
- Solution: always safe to start a new election in a new term
- Problem: repeated elections can prevent any work getting done
- Solution: soft mechanisms reduce probability of wasted elections
- heartbeat from leader (remind servers not to start election)
- timeout period (don't start election too soon)
- random delays (give one leader time to be elected)
Remember: there's a way to split the problem into "safety/correctness" concerns and "liveness/performance" concerns
What if the old leader isn't aware a new leader has been elected?
- perhaps b/c old leader didn't see election messages
- new leader means a majority of servers have incremented currentTerm
- so old leader (w/ old term) can't get majority for AppendEntries
- though a minority may accept old server's log entries...
- so logs may diverge at end of old term...
Now let's switch topics to data handling at term boundaries
What do we want to ensure?
- each server executes the same client cmds, in the same order
- i.e. if any server executes, then no server executes something else for that log entry
- as long as there's a single leader, we've already seen that it makes logs identical; what about when the leader changes?
What's the danger?
Leader of term 3 crashed while sending AppendEntries
S1: 3
S2: 3 3
S3: 3 3
S2 and S3 might have executed; does Raft preserve it?
May be a series of crashes, e.g.
S1: 3
S2: 3 3 (new leader) 4
S3: 3 3 (new leader) 5
Thus different entries for the same index!
Roll-back is a big hammer -- forces leader's log on everyone
- in above examples, whoever is elected imposes log on everyone
- Example:
- S3 is chosen as new leader for term 6
- S3 wants to send out a new entry (in term 6)
- AppendEntries says previous entry must have term 5
- S2 replies false (AppendEntries step 2)
- S3 decrements nextIndex[S2]
- S3 sends AppendEntries for the op w/ term=5, saying prev has term=3
- S2 deletes its op from term 4 (AppendEntries step 3) and replaces it with the op for term 5 from S3 (S1 rejects, b/c it doesn't have anything at that entry)
- S2 appends the op for term 6 as well
- (the leader's nextIndex back-up loop is sketched in code below)
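A sketch of that leader-side back-up loop; the Leader struct and the send callback (standing in for the AppendEntries RPC) are illustrative, while nextIndex is the per-follower state from Figure 2.

    package raft

    type Entry struct {
        Term    int
        Command interface{}
    }

    type Leader struct {
        log       []Entry     // log[0] is a dummy entry, real entries start at index 1
        nextIndex map[int]int // per follower: index of the next entry to send it
    }

    // catchUp keeps backing nextIndex[follower] up until the follower accepts an
    // AppendEntries (send returns true when the follower's log matched at
    // prevIndex/prevTerm); after that the follower's log matches the leader's.
    func (ld *Leader) catchUp(follower int,
        send func(prevIndex, prevTerm int, entries []Entry) bool) {
        for {
            next := ld.nextIndex[follower]
            prevIndex := next - 1
            prevTerm := ld.log[prevIndex].Term
            if send(prevIndex, prevTerm, ld.log[next:]) {
                ld.nextIndex[follower] = len(ld.log) // follower is caught up
                return
            }
            if next == 1 {
                return // can't back up past the start of the log
            }
            ld.nextIndex[follower] = next - 1 // follower said false: back up, retry
        }
    }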
Ok, leader will force its own log on followers
- but that's not enough!
- can roll-back delete an executed entry?
When is a log entry executed?
- when the leader advances commitIndex/leaderCommit
- i.e. when a majority match the leader's log up through that point
Could new leader roll back executed entries from end of previous term?
- i.e. could an executed entry be missing from the new leader's log?
- Raft needs to ensure new leader's log contains every potentially executed entry
- i.e. must forbid election of server who might be missing an executed entry
What are the election rules?
- Figure 2 says only vote if candidate's log "at least as up to date"
- So leader will be at least as up to date as a majority
What does "at least as up to date" mean?
Could it just mean a log of >= length? No; example:
S1: 5, 6, 7    (leader in term 6; crash; leader in term 7)
S2: 5, 8       (leader in term 8)
S3: 5, 8
- first, could this scenario happen? how?
- S1 leader in epoch 6; crash+reboot; leader in epoch 7; crash and stay down
- both times it crashed after only appending to its own log
- S2 leader in epoch 8, only S2+S3 alive, then crash
- who should be next leader?
- S1 has longest log, but entry 8 is committed !!!
- Raft adopts leader's log, so S1 as leader -> un-commit entry 8
- this would be incorrect since S2 may have replied to client
- so new leader can only be one of S2 or S3
- i.e. the rule cannot be simply "longest log"
End of 5.4.1 explains "at least as up to date" voting rule
- compare last entry
- higher term wins
- if equal terms, longer log wins
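As a sketch, the comparison reduces to a few lines over each log's last entry (the helper name is made up):

    package raft

    // candidateUpToDate reports whether a candidate whose last log entry has
    // term candLastTerm and index candLastIndex is "at least as up to date" as
    // a voter whose last entry has term myLastTerm and index myLastIndex:
    // higher last term wins; if the last terms are equal, the longer log wins.
    func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
        if candLastTerm != myLastTerm {
            return candLastTerm > myLastTerm
        }
        return candLastIndex >= myLastIndex
    }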
So:
- S1 can't get any vote from S2 or S3, since 7 < 8
- S1 will vote for either S2 or S3, since 8 > 7
- S1's operations from terms 6 and 7 will be discarded!
- ok since no majority -> not executed -> no client reply
The point:
- "at least as up to date" rule causes new leader to have all executed entries in its log
- so new leader won't roll back any executed operation
- similar to Paxos: new round ends up using chosen value (if any) of prev round
The question: Figure 7, which of a/d/f could be elected?
- i.e. majority of votes from "less up to date" servers?
The most subtle thing about Raft (figure 8)
Figure 8 (each row is a server's log, one slot per column; "L n" = becomes leader in term n):
S1: 1  2  4     (L 2; later L 4)
S2: 1  2
S3: 1  2        (this copy of 2 sent by S1 while leader in term 4)
S4: 1
S5: 1  3        (L 3; if elected leader again, will erase all the 2's)
- not 100% true that a log entry on a majority is committed
- i.e. will never be forgotten
- Figure 8:
- S1 was leader in term 2, sends out two copies of 2
- S5 leader in term 3
- S1 leader in term 4, sends one more copy of 2 (b/c S3 rejected op 4)
- what if S5 now becomes leader?
- S5 can get a majority (w/o S1)
- S5 will roll back 2 and replace it with 3
- could 2 have executed?
- it is on a majority...
- so could S1 have mentioned it in leaderCommit after majority?
- no! very end of Figure 2 says "log[N].term == currentTerm"
- and S1 was in term 4 when sending 3rd copy of 2
- what's Raft's actual commit point?
- bottom-right of page 310
- "committed once the leader that created the entry has replicated on majority"
- and commit point of one entry commits all before it
- which is how 2 could have committed if S1 hadn't lost leadership
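A sketch of a leader advancing its commit point under this rule; matchIndex and commitIndex follow Figure 2, but the struct and the remaining names are illustrative, not the labs' code.

    package raft

    type Entry struct {
        Term    int
        Command interface{}
    }

    type Leader struct {
        currentTerm int
        log         []Entry     // log[0] is a dummy entry
        commitIndex int
        matchIndex  map[int]int // highest index known replicated on each other server
        numServers  int         // total servers, including the leader itself
    }

    // advanceCommitIndex moves commitIndex to the highest N such that a majority
    // have log[N] AND log[N] was created in the leader's own term. Entries from
    // earlier terms (like the 2's in Figure 8) get committed only indirectly,
    // when a later current-term entry commits everything before it.
    func (ld *Leader) advanceCommitIndex() {
        for n := len(ld.log) - 1; n > ld.commitIndex; n-- {
            if ld.log[n].Term != ld.currentTerm {
                break // never commit an old-term entry by counting replicas
            }
            count := 1 // the leader has the entry itself
            for _, m := range ld.matchIndex {
                if m >= n {
                    count++
                }
            }
            if count > ld.numServers/2 {
                ld.commitIndex = n // commits log[1..n]; now safe to execute and reply
                return
            }
        }
    }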
Another topic: configuration change (Section 6)
- configuration = set of servers
- how does Raft change the set of servers?
- e.g. every few years might want to retire some, add some
- or move all at once to an entirely new set of servers
- or increase/decrease the number of servers
How might a broken configuration change work?
- each server has the list of servers in the current config
- change configuration by changing the lists, one by one
- example: want to replace S3 with S4
- S1: 1,2,3 1,2,4
- S2: 1,2,3 1,2,3
- S3: 1,2,3 1,2,3
- S4: 1,2,4 1,2,4
- OOPS!
- now two disjoint groups/leaders can form:
- S2 and S3 (who know nothing of new config)
- S1 and S4
- both can process client requests, so split brain
- Idea: "join consensus" stage that includes both old and new configuration
- leader of old group logs entry that switches to joint consensus
- during joint consensus, leader separately logs in old and new
- i.e. two logs and two separate agreements on each log entry
- this will force new servers to catch up and force new and old logs to be the same
- after majority of old and new have switched to joint consensus,
- leader logs entry that switches to final configuration
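A sketch of the agreement rule during joint consensus, with made-up types: an entry counts only if it has a majority from the old configuration and, separately, a majority from the new one. Note how this blocks the broken scheme above: acks from just S1 and S4 are a majority of the new {1,2,4} but not of the old {1,2,3}, so nothing commits without the old group.

    package raft

    type Config struct {
        Servers []int // server ids in this configuration
    }

    // majorityOf reports whether the servers in acked include a majority of cfg.
    func majorityOf(cfg Config, acked map[int]bool) bool {
        count := 0
        for _, id := range cfg.Servers {
            if acked[id] {
                count++
            }
        }
        return count > len(cfg.Servers)/2
    }

    // committedJoint is the rule while in joint consensus: the entry needs a
    // majority of the old configuration AND a majority of the new one.
    func committedJoint(oldCfg, newCfg Config, acked map[int]bool) bool {
        return majorityOf(oldCfg, acked) && majorityOf(newCfg, acked)
    }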
Example (the old leader has logged the entry switching to joint consensus, but so far only S4 has received it):
S1: 1,2,3 1,2,3+1,2,4
S2: 1,2,3
S3: 1,2,3
S4: 1,2,3+1,2,4
- if crash but new leader didn't see the switch to joint consensus,
- then old group will continue, no switch, but that's OK
- if crash and new leader did see the switch to joint consensus,
- it will complete the configuration change
What about performance?
- the paper gives no numbers on how fast Raft can process requests
- what are the bottlenecks likely to be?
- disk:
- need to write disk for client data durability, and for protocol promises
- write per client request? so 100 per second?
- could probably batch and get 10,000 to 100,000
- net: a few message exchanges per client request
- 10s of microseconds for local LAN message exchange?
- so 100,000 per second?
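A back-of-the-envelope version of these guesses; every constant below is an assumption (disk write time, batch size, message latency), not a measurement.

    package raft

    const (
        diskWriteSec = 0.010    // assume ~10 ms per synchronous disk write
        opsPerWrite  = 1000     // assume ~1000 client ops batched into one write
        msgExchSec   = 0.000010 // assume ~10 us per LAN message exchange
    )

    var (
        unbatchedDiskRate = 1 / diskWriteSec           // ~100 requests/sec
        batchedDiskRate   = opsPerWrite / diskWriteSec // ~100,000 requests/sec
        serialNetRate     = 1 / msgExchSec             // ~100,000 exchanges/sec
    )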
Next week: use of a Raft-like protocol in a complex application