<h1>6.824 2015 Lecture 5: Paxos</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the 6.824
<a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from Spring 2015.</p>
<h2>Intro</h2>
<p>Starting a new group of lectures on stronger fault tolerance</p>
<ul>
<li>Today:
<ul>
<li>cleaner approach to replication: RSM via Paxos</li>
<li>Lab 3</li>
</ul></li>
<li>Subsequent lectures:
<ul>
<li>How to use Paxos to build systems (Harp, EPaxos, Spanner)</li>
</ul></li>
</ul>
<h2>Paxos</h2>
<p>Links:</p>
<ul>
<li><a href="papers/paxos-simple.pdf">Paxos Made Simple</a>, by Leslie Lamport, 2001</li>
<li><a href="https://www.quora.com/Distributed-Systems/What-is-a-simple-explanation-of-the-Paxos-algorithm">Simple explanations from Quora</a></li>
<li><a href="http://harry.me/blog/2014/12/27/neat-algorithms-paxos/">Neat Algorithms - Paxos</a></li>
<li><a href="http://static.usenix.org/event/nsdi11/tech/full_papers/Bolosky.pdf">Paxos Replicated State Machines as the Basis of a High-Performance Data Store</a></li>
<li><a href="http://wellquite.org/blog/paxos_notes.html">Paxos notes</a></li>
<li><a href="http://blog.acolyer.org/2015/03/04/paxos-made-simple/">Paxos made simple paper review</a></li>
<li><a href="http://the-paper-trail.org/blog/on-some-subtleties-of-paxos/">On some subtleties of Paxos</a></li>
</ul>
<p>Recall: RSM</p>
<ul>
<li>maintain replicas by executing operations in the same order</li>
<li>requires all replicas to agree on the (set and) order of operations</li>
</ul>
<p>Lab 2 critique</p>
<ul>
<li>primary/backup with viewserver</li>
<li><strong>pro:</strong>
<ul>
<li>conceptually simple</li>
<li>just two msgs per op (request, reply)</li>
<li>primary can do computation, send result to backup</li>
<li>only two k/v servers needed to tolerate one failure</li>
<li>works with network partition</li>
</ul></li>
<li><strong>con:</strong>
<ul>
<li>ViewServer is a <em>single point of failure</em></li>
<li>order can be messy: e.g., new view, data to backup, ack, &c</li>
<li>tension if backup is slow / temporarily unavail
<ol>
<li>primary can wait for backup -- slow</li>
<li>viewserver can declare backup dead -- expensive, hurts fault tolerance</li>
</ol></li>
</ul></li>
</ul>
<p>We would like a general-purpose ordering scheme with:</p>
<ul>
<li>no single point of failure</li>
<li>graceful handling of slow / intermittent replicas</li>
<li>handling of network partitions</li>
</ul>
<p><strong>Paxos</strong> will be a key building block for this.</p>
<ul>
<li>some number of nodes participate in <em>an instance of Paxos</em>
<ul>
<li><strong>Q:</strong> What is this <em>instance</em>?</li>
<li><strong>A:</strong> <em>"Each new command requires a separate Paxos agreement, which the
paper calls an instance. So the database replicas might agree that the
first command to execute is 'command one', and they use Paxos to agree
on that. Then that instance of Paxos is done. A while later another
client sends 'command two'; the replicas start up an entirely separate
instance of Paxos to agree on this second client command."</em> --RTM</li>
</ul></li>
<li>each node knows the address of every other node for that instance</li>
<li>each instance of Paxos can reach consensus on at most one value
<ul>
<li>typically, a system uses many instances of Paxos</li>
<li>each instance usually decides one operation</li>
<li>assumptions: asynchronous, non-Byzantine</li>
</ul></li>
</ul>
<h3>What does Paxos provide? How does it work?</h3>
<ul>
<li><strong>"black-box"</strong> interface to a Paxos instance, on each node:
<ul>
<li>Propose a value (e.g., operation)</li>
<li>Check what value has been decided, if any</li>
<li>[ Lab 3A: <code>src/paxos/paxos.go</code>: Start, Status; a rough sketch of this interface follows this list ]</li>
</ul></li>
<li><strong>Correctness:</strong>
<ul>
<li>if agreement reached, all agreeing nodes see same value</li>
</ul></li>
<li><strong>Fault-tolerance:</strong>
<ul>
<li>can tolerate non-reachability of a minority of nodes
(correctness implies the cut-off minority can't reach agreement on its own)</li>
</ul></li>
<li><strong>Liveness:</strong>
<ul>
<li>a majority must be alive and able to communicate reliably
(minorities are not live)</li>
</ul></li>
</ul>
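<p>As a rough sketch (names and signatures here are illustrative, not the official Lab 3A ones), the per-node interface looks something like this: <code>Start</code> proposes a value for an instance, and <code>Status</code> reports whether that instance has been decided and, if so, on what value.</p>
<pre><code>package paxos

// Rough sketch of the per-node "black-box" interface; an illustration,
// not the handout code.

// Paxos is one peer; it participates in many numbered instances.
type Paxos struct {
	me    int      // this peer's index into peers
	peers []string // addresses of all peers in this instance group
	// ... per-instance acceptor/learner state, a mutex, etc. (omitted)
}

// Start asks this peer to begin agreement on instance seq with proposed
// value v. It returns immediately; agreement proceeds in the background.
func (px *Paxos) Start(seq int, v interface{}) {
	// launch a proposer for instance seq (omitted)
}

// Status reports whether instance seq has been decided and, if so, the
// decided value. Applications poll this rather than blocking inside Paxos.
func (px *Paxos) Status(seq int) (decided bool, v interface{}) {
	// consult local learner state for instance seq (omitted)
	return false, nil
}
</code></pre>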
<h3>How to build a system using Paxos?</h3>
<ol>
<li>Primary/Backup like Lab 2, but use Paxos to replicate the ViewServer
<ul>
<li>[ next Tuesday's lecture will be about such a system ]</li>
</ul></li>
<li><em>Lab 3</em>: no ViewServer, all replicas use Paxos instead of primary/backup</li>
</ol>
<p>Replicating either the ViewServer or K/V server with Paxos is similar.</p>
<p>Will look at a sketch of how to do a Paxos-based K/V server.</p>
<p>The <strong>basic idea</strong>:</p>
<ul>
<li>[ Diagram: clients, replicas, log in each replica, K/V layer, Paxos layer]</li>
<li>no viewserver</li>
<li>three replicas</li>
<li>clients can send RPCs to any replica (not just primary)</li>
<li>server appends each client op to a replicated <em>log</em> of operations
<ul>
<li><code>Put</code>, <code>Get</code> (and more later)</li>
</ul></li>
<li>log entries (instances) are numbered sequentially</li>
<li>Paxos ensures agreement on content of each log entry</li>
<li>separate Paxos agreement for each of these log entries
<ul>
<li>separate <em>instance</em> of Paxos algorithm is run for log entry #<code>i</code></li>
<li><strong>Q:</strong> Can one log entry be agreed on at the same time with another? What if they depend on one another like <code>Put(k1, a)</code> and <code>Append(k1, b)</code>?</li>
<li><strong>A:</strong> Yes! They can be agreed upon at the same time.</li>
<li><strong>A:</strong> agreement can also finish out of order: log entry #<code>i+1</code> may be decided before log entry #<code>i</code>
<ul>
<li>This means the reply associated with the <code>Get</code> or <code>Put</code> request in log entry <code>i+1</code> will
have to wait until the earlier log entries are decided and applied (interesting)</li>
</ul></li>
</ul></li>
<li>servers can throw away log entries that all other servers have agreed on (and responded to?)
<ul>
<li>but if a server crashes, the other servers will know to keep their log entries around for when it comes back</li>
</ul></li>
<li>protocol does <strong>not</strong> require designated proposers or leaders for correctness
<ul>
<li>these only help w/ performance</li>
<li>small chance of proposer "livelock", which can be overcome by having proposers wait a random amount of time before retrying</li>
</ul></li>
<li>once a Paxos node agrees on a value it never changes its mind</li>
</ul>
<p>Example:</p>
<ul>
<li>client sends <code>Put(a, b)</code> to <code>S1</code></li>
<li><code>S1</code> picks a log entry 3</li>
<li><code>S1</code> uses Paxos to get all servers to agree that entry 3 holds <code>Put(a,b)</code></li>
</ul>
<p>Example:</p>
<ul>
<li>client sends <code>Get(a)</code> to <code>S2</code></li>
<li><code>S2</code> picks log entry 4</li>
<li><code>S2</code> uses Paxos to get all servers to agree that entry 4 holds <code>Get(a)</code></li>
<li><code>S2</code> scans log up to entry 4 to find latest <code>Put(a, ...)</code>
<ul>
<li><strong>TODO:</strong> <code>O(n)</code> worst case for doing a <code>Get</code> because you can have <code>Put</code> followed by a gazillion <code>PutAppend</code>'s (or you can have just one <code>Put</code> stored way back?).
<ul>
<li>Can the replicas index their log? I suppose. If they all store it fully.</li>
</ul></li>
</ul></li>
<li><code>S2</code> replies with that value
<ul>
<li><code>S2</code> can cache content of DB up through last log scan (a sketch of this follows the example)</li>
</ul></li>
</ul>
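<p>Below is a hedged sketch of how such a replica might maintain that cached view of the database: apply decided log entries in order into a map, so a <code>Get</code> assigned to entry <code>i</code> only waits until entries up to <code>i</code> are applied instead of rescanning the whole log. <code>Op</code>, <code>paxosPeer</code> and <code>KVServer</code> are illustrative names, not the Lab 3 code.</p>
<pre><code>package kvpaxos

import "time"

// Op is a log entry: one client operation. (Illustrative, not the lab's Op.)
type Op struct {
	Kind  string // "Put" or "Get"
	Key   string
	Value string
}

// paxosPeer is the slice of the Paxos interface this sketch needs.
type paxosPeer interface {
	Status(seq int) (decided bool, v interface{})
}

// KVServer is one replica: a Paxos peer plus a cached view of the database.
type KVServer struct {
	px   paxosPeer
	db   map[string]string // result of applying log entries 1..next-1
	next int               // first log entry not yet applied
}

func MakeKVServer(px paxosPeer) *KVServer {
	// assuming log entries are numbered starting at 1
	return &KVServer{px: px, db: map[string]string{}, next: 1}
}

// waitAndApply applies decided log entries in order, up to and including seq,
// so that a Get assigned to entry seq sees every earlier Put.
func (kv *KVServer) waitAndApply(seq int) {
	for kv.next <= seq {
		decided, v := kv.px.Status(kv.next)
		if !decided {
			// entry kv.next isn't agreed yet; a real server would propose a
			// no-op for it so a missing entry can't block us forever (omitted)
			time.Sleep(10 * time.Millisecond)
			continue
		}
		if op, ok := v.(Op); ok && op.Kind == "Put" {
			kv.db[op.Key] = op.Value
		}
		kv.next++
	}
}
</code></pre>
<p>With the cached <code>db</code>, the <code>Get(a)</code> in the example above becomes a map lookup once entries 1 through 4 are applied, instead of an O(n) scan of the log.</p>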
<h4>Q: Why a log?</h4>
<ul>
<li>Why not require all replicas to agree on each op in lock-step?</li>
<li>Allows one replica to fall behind, then catch up
<ul>
<li>e.g., if it is slow</li>
<li>other replicas do not have to wait</li>
</ul></li>
<li>Allows one replica to crash and catch up
<ul>
<li>if it keeps state on disk</li>
<li>can replay missed operations</li>
</ul></li>
<li>Allows pipelining/overlap of agreement
<ul>
<li>agreement turns out to require multiple message rounds</li>
</ul></li>
</ul>
<h4>Q: What about agreement -- we need all replicas to have same op in each log slot</h4>
<ul>
<li>Provided by Paxos, as we will see next</li>
</ul>
<p><em>Agreement is hard (1):</em></p>
<ul>
<li>May be multiple proposals for the op in a particular log slot</li>
<li><code>Sx</code> (server <code>x</code>) may initially hear of one, <code>Sy</code> may hear of another</li>
<li>Clearly one must later change its mind</li>
<li>Thus: multiple rounds, tentative initially</li>
<li>How do we know when agreement is permanent -- no longer tentative?</li>
</ul>
<p><em>Agreement is hard (2):</em></p>
<ul>
<li><strong>TODO:</strong> If <code>S1</code> and <code>S2</code> agree, and <code>S3</code> and <code>S4</code> don't respond, are we done?</li>
<li>Agreement has to be able to complete even w/ failed servers</li>
<li>We can't distinguish failed server from network partition</li>
<li>So maybe <code>S3</code>/<code>S4</code> are partitioned and have "agreed" on a different operation!</li>
</ul>
<p>Two <strong>main ideas</strong> in Paxos:</p>
<ol>
<li>Many rounds may be required but they will converge on one value</li>
<li>A majority is required for agreement -- prevent "split brain"
<ul>
<li><em>Key point</em>: any two majorities overlap</li>
<li>so any later majority will share at least one server w/ any earlier majority</li>
<li>so any later majority can find out what earlier majority decided (discussed below)</li>
</ul></li>
</ol>
<p>Lab 3B K/V server creates a separate Paxos instance for each client <code>Put</code>, <code>Get</code> (much harder)</p>
<ul>
<li>rest of lecture focuses on agreement for a specific instance</li>
</ul>
<h2>Paxos sketch</h2>
<ul>
<li>each node consists of three logical entities:
<ul>
<li><strong>proposer</strong></li>
<li><strong>acceptor</strong></li>
<li><strong>learner</strong></li>
</ul></li>
<li>each proposer wants to get agreement on its value
<ul>
<li>could try to use a "designated leader" to avoid dueling proposers</li>
<li>OK to have multiple proposers, so leader election can be approximate</li>
</ul></li>
<li>proposer contacts acceptors, tries to assemble a majority
<ul>
<li>if a majority respond, we're done</li>
</ul></li>
<li>in our K/V server example, roughly:
<ul>
<li>proposer gets RPC from client, proposes operation</li>
<li>acceptors are internal to Paxos, help decide consensus</li>
<li>learner figures out what operation was decided to run, responds to client</li>
</ul></li>
</ul>
<p><em>Broken strawman:</em> can we do Paxos in a single round?</p>
<ul>
<li>acceptor "accepts" the first value that it hears from proposer</li>
<li>when is consensus reached?
<ul>
<li>can we take the value with the most votes?</li>
<li>no, need a majority of accepts for the same value: <code>floor(n/2)+1</code></li>
<li>otherwise, consensus on 2 different values (lossy/partitioned network)</li>
</ul></li>
<li><em>Problem:</em>
<ul>
<li>suppose we have 3 servers: <code>S1</code>, <code>S2</code>, <code>S3</code></li>
<li>what if each server proposes + accepts its own value?
<ul>
<li>no majority, stuck</li>
<li>but maybe we can detect this situation and recover?</li>
</ul></li>
<li><em>Worse:</em> <code>S3</code> crashes <code>-></code> we may have reached majority, but we'll never know</li>
</ul></li>
<li>need a way for acceptors to change their mind, if no consensus reached yet</li>
</ul>
<h3>Basic Paxos exchange</h3>
<pre><code> proposer acceptors
prepare(n) ->
<- prepare_ok(n, n_a, v_a)
accept(n, v') ->
<- accept_ok(n)
decided(v') ->
</code></pre>
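<p>The prepare/accept/decided exchange above maps naturally onto RPC argument and reply structs. Here is one hypothetical way to spell them in Go (field names and types are assumptions for illustration, not the lab skeleton):</p>
<pre><code>package paxos

// Hypothetical Go RPC messages for the exchange above; field names are
// assumptions, not taken from the lab skeleton.

type PrepareArgs struct {
	N int64 // proposal number n
}

type PrepareReply struct {
	OK bool        // prepare_ok vs. prepare_reject
	N  int64       // echo of n
	Na int64       // n_a: highest proposal number this acceptor has accepted
	Va interface{} // v_a: the value accepted with Na (nil if none)
}

type AcceptArgs struct {
	N int64       // proposal number n
	V interface{} // v': the value the proposer settled on after prepare
}

type AcceptReply struct {
	OK bool // accept_ok vs. accept_reject
}

type DecidedArgs struct {
	V interface{} // the decided value v'
}
</code></pre>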
<h3>Why <code>n</code>?</h3>
<ul>
<li>to distinguish among multiple rounds, e.g., proposer crashes, simul props</li>
<li>want later rounds to supersede earlier ones</li>
<li>numbers allow us to compare early/late</li>
<li><code>n</code> values must be unique and roughly follow time</li>
<li><code>n = &lt;time, server ID&gt;</code>
<ul>
<li>e.g., ID can be server's IP address</li>
</ul></li>
<li><em>"round"</em> is the same as <em>"proposal"</em> but completely different from "instance"
<ul>
<li>round/proposal numbers are <em>WITHIN</em> a particular instance</li>
</ul></li>
</ul>
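<p>One simple way to get <code>n</code> values that are unique and roughly follow time is to pack a timestamp and the server's ID into a single integer. The sketch below only illustrates the <code>n = &lt;time, server ID&gt;</code> idea; the function name and parameters are assumptions, and a real proposer must also make sure the result is higher than any <code>n</code> it has already seen.</p>
<pre><code>package paxos

import "time"

// proposalNumber packs (time, server ID) into one comparable integer, so
// numbers from different servers never collide and later proposals tend to
// be larger. "me" is this server's index among npeers peers (an assumption);
// a real proposer would also bump the result above any n it has seen.
func proposalNumber(me, npeers int) int64 {
	ms := time.Now().UnixNano() / int64(time.Millisecond)
	return ms*int64(npeers) + int64(me)
}
</code></pre>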
<p><strong>Definition:</strong> server S <em>accepts</em> <code>n/v</code></p>
<ul>
<li>it responded <code>accept_ok</code> to <code>accept(n, v)</code></li>
</ul>
<p><strong>Definition:</strong> <code>n/v</code> is <em>chosen</em></p>
<ul>
<li>a majority of servers accepted <code>n/v</code></li>
</ul>
<p>The <strong>crucial property:</strong></p>
<ul>
<li>if a value was chosen, any subsequent choice must be the same value
<ul>
<li>i.e., protocol must not change its mind</li>
<li>maybe a different proposer &c, but same value!</li>
<li>this allows us to freely start new rounds after crashes &c</li>
</ul></li>
<li>tricky b/c <em>"chosen"</em> is system-wide property
<ul>
<li>e.g., majority accepts, then proposer crashes</li>
<li><code>=></code> <em>no node can tell locally that agreement was reached</em></li>
</ul></li>
</ul>
<p>So:</p>
<ul>
<li>proposer doesn't send out the value with <code>prepare</code> (it sends a value later, with <code>accept</code>)</li>
<li>acceptors send back any value they have already accepted</li>
<li>if there is one, proposer proposes that value
<ul>
<li>to avoid changing an existing choice</li>
</ul></li>
<li>if no value already accepted,
<ul>
<li>proposer can propose any value (e.g., a client request)</li>
</ul></li>
<li>proposer must get <code>prepare_ok</code> from majority
<ul>
<li>to guarantee intersection with any previous majority,</li>
<li>to guarantee proposer hears of any previously chosen value</li>
</ul></li>
</ul>
<h3>Now the protocol -- see the handout</h3>
<pre><code>  proposer(v):
    choose n, unique and higher than any n seen so far
    send prepare(n) to all servers including self
    if prepare_ok(n, n_a, v_a) from majority:
      v' = v_a with highest n_a; choose own v otherwise
      send accept(n, v') to all
      if accept_ok(n) from majority:
        send decided(n, v') to all

  acceptor state:
    must persist across reboots
    n_p (highest prepare seen)
    n_a, v_a (highest accept seen)

  acceptor's prepare(n) handler:
    if n > n_p
      n_p = n
      reply prepare_ok(n, n_a, v_a)
    else
      reply prepare_reject

  acceptor's accept(n, v) handler:
    if n >= n_p
      n_p = n
      n_a = n
      v_a = v
      reply accept_ok(n)
    else
      reply accept_reject
</code></pre>
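<p>The acceptor half of this pseudocode translates almost line-for-line into Go. The sketch below reuses the hypothetical RPC structs from earlier; the struct and method names are mine, and persisting <code>n_p</code>/<code>n_a</code>/<code>v_a</code> is only hinted at here (it is discussed near the end of these notes).</p>
<pre><code>package paxos

import "sync"

// Acceptor state for ONE Paxos instance -- a direct translation of the
// handout pseudocode, using the hypothetical RPC structs sketched earlier.
// Real code must also persist this state (see the reboot discussion below).
type acceptor struct {
	mu sync.Mutex
	np int64       // n_p: highest prepare seen
	na int64       // n_a: highest accept seen
	va interface{} // v_a: value accepted with n_a (nil if none yet)
}

// Prepare implements the acceptor's prepare(n) handler.
func (a *acceptor) Prepare(args *PrepareArgs, reply *PrepareReply) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if args.N > a.np {
		a.np = args.N
		reply.OK, reply.N, reply.Na, reply.Va = true, args.N, a.na, a.va // prepare_ok(n, n_a, v_a)
	} else {
		reply.OK = false // prepare_reject
	}
	return nil
}

// Accept implements the acceptor's accept(n, v) handler.
func (a *acceptor) Accept(args *AcceptArgs, reply *AcceptReply) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if args.N >= a.np {
		a.np, a.na, a.va = args.N, args.N, args.V
		reply.OK = true // accept_ok(n)
	} else {
		reply.OK = false // accept_reject
	}
	return nil
}
</code></pre>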
<p><strong>Example 1</strong> (normal operation):</p>
<pre><code>  S1, S2, S3
  [ but S3 is dead or slow ]

  S1: -> starts proposal w/ n=1 v=A
  S1: <- p1 <- a1vA <- dA
  S2: <- p1 <- a1vA <- dA
  S3: dead...

  "p1" means Sx receives prepare(n=1)
  "a1vA" means Sx receives accept(n=1, v=A)
  "dA" means Sx receives decided(v=A)
</code></pre>
<ul>
<li>S1 and S2 will reply with <code>prepare_ok(1, 0, null)</code> to the <code>p1</code> message.</li>
<li>If <code>dA</code> is lost, a node still waiting can run Paxos again with a new <code>n</code> higher than the previous one.
<ul>
<li><code>prepare_ok(2, 1, 'A')</code> replies will come back,</li>
<li>so the node is forced to propose <code>A</code> again (send <code>a2vA</code>); once it gets <code>accept_ok</code>s from a majority, it
sends out <code>dA</code> messages, which hopefully aren't lost this time</li>
</ul></li>
<li>a value is chosen once a majority of acceptors take the accept branch of their <code>accept</code> handler (i.e., reply <code>accept_ok</code>) for the same <code>n/v</code>
<ul>
<li>however, not everyone will <em>know</em> this, so that's why the <code>decide</code> message is sent out</li>
</ul></li>
</ul>
<p>These diagrams are not specific about who the proposer is</p>
<ul>
<li>it doesn't really matter</li>
<li>the proposers are logically separate from the acceptors</li>
<li>we only care about what acceptors saw and replied</li>
</ul>
<p>Note Paxos only requires a majority of the servers</p>
<ul>
<li>so we can continue even though <code>S3</code> was down</li>
<li>proposer must not wait forever for any acceptor's response</li>
</ul>
<h4>What would happen if network partition?</h4>
<ul>
<li>i.e., <code>S3</code> was alive but partitioned off, with its own proposed value B</li>
<li><code>S3</code>'s prepare would not assemble a majority</li>
</ul>
<h4>The homework question</h4>
<p>How does Paxos ensure that the following sequence of events can't happen?
What actually happens, and which value is ultimately chosen?</p>
<pre><code> proposer 1 crashes after sending two accept() requests
proposer 2 has a different value in mind
A: p1 a1foo
B: p1 p2 a2bar
C: p1 a1foo p2 a2bar
C's prepare_ok to B really included "foo"
thus a2foo, and so no problem
</code></pre>
<p><strong>The point:</strong></p>
<ul>
<li>if the system has already reached agreement, majority will know value
<ul>
<li>in this example, A and C agreed on 'foo'</li>
</ul></li>
<li>any new majority of prepares will intersect that majority
<ul>
<li>in this example, AC intersects BC in C</li>
</ul></li>
<li>so subsequent proposer will learn of already-agreed-on value
<ul>
<li>in this example, C will reply with a <code>prepare_ok(n=2, n_a=1, v_a=foo)</code> to proposer</li>
</ul></li>
<li>and subsequent proposer will send already-agreed-on value in <code>accept</code> msg
<ul>
<li>in this example, proposer will send <code>accept(n=2, v=foo)</code> </li>
</ul></li>
</ul>
<p><strong>Example 2</strong> (concurrent proposers):</p>
<pre><code> A1 starts proposing n=10 by sending prepare(n=10)
A1 sends out just one accept v=10
[ => A1 must have received prepare_ok from majority ]
A3 starts proposing n=11
but A1 does not receive its proposal
[ => A1 never replies to A3 with prepare_ok(n=11, n_a=10, v=10) because it never got the prepare ]
A3 only has to wait for a majority of proposal responses
A1: p10 a10v10
A2: p10 p11
A3: p10 p11 a11v11
A1 and A3 have accepted different values!
</code></pre>
<p>What will happen?</p>
<ul>
<li><strong>Q:</strong> What will <code>A2</code> do if it gets <code>a10v10</code> accept msg from <code>A1</code>?
<ul>
<li><em>Recall:</em> <code>a10v10</code> means <code>accept(n=10,v=10)</code>, which the proposer sends after it receives <code>prepare_ok</code>'s on <code>n=10</code> from a majority</li>
<li><strong>A:</strong> A2 will reject because it has a higher <code>n_p</code> from <code>p11</code></li>
</ul></li>
<li><strong>Q:</strong> What will <code>A1</code> do if it gets <code>a11v11</code> accept msg from <code>A3</code>?
<ul>
<li><strong>A:</strong> <code>A1</code> will reply <code>accept_ok</code> and set its <code>v_a</code> to <code>v11</code> because <code>n = 11 > n_p = 10</code></li>
</ul></li>
<li>In other words, no value has been chosen yet, because no majority has accepted the same value with the same proposal number.</li>
</ul>
<p>What if A3 (2nd proposer) were to crash at this point (and not restart)?</p>
<ul>
<li><strong>TODO:</strong> Not sure which "point" they are referring to.</li>
<li><em>Case 1:</em> If <code>A3</code> crashes after proposing <code>n=11</code> and after receiving a majority of <code>prepare_ok</code>'s but before sending out the other two <code>accept</code>'s to <code>A1</code> and <code>A2</code>, then:
<ul>
<li><code>A1</code> could come back online and send an <code>accept(n=10,v=10)</code> to <code>A2</code></li>
<li><code>A2</code> will reject because it has a higher <code>np = 11</code> (see above and see algorithm)</li>
<li><code>A1</code> will repropose with higher <code>n</code> and eventually convince both itself and <code>A2</code> to accept <code>v10</code></li>
</ul></li>
<li><em>Case 2:</em> <code>A3</code> crashes after proposing <code>n=11</code> and receiving a majority of <code>prepare_ok</code>'s and <em>after</em> sending out an <code>accept</code> to <code>A2</code>
<ul>
<li><code>A2</code> will <code>accept_ok</code> on <code>v11</code></li>
<li>Now <code>v11</code> is chosen: any future majority of <code>prepare_ok</code>'s will return <code>v11</code></li>
</ul></li>
</ul>
<p>How about this:</p>
<pre><code>A1: p10 a10v10 p12
A2: p10 p11 a11v11
A3: p10 p11 p12 a12v10
</code></pre>
<p>Has the system agreed on a value at this point?</p>
<h4>What's the commit point?</h4>
<ul>
<li>i.e., exactly when has agreement been reached?</li>
<li>i.e., at what point would changing the value be a disaster?</li>
<li>after a majority has the same <code>v_a</code>? no -- why not? above counterexample</li>
<li>after a majority has the same <code>v_a/n_a</code>? yes -- why sufficient? sketch:
<ul>
<li>suppose majority has same <code>v_a/n_a</code></li>
<li>acceptors will reject <code>accept()</code> with lower <code>n</code></li>
<li>for any higher <code>n</code>: prepare's must have seen our majority <code>v_a/n_a</code> (overlap)</li>
<li>what if overlap servers saw <code>prepare(n)</code> before <code>accept(v_a, n_a)</code>?
<ul>
<li>would reject <code>v_a/n_a</code></li>
<li>thus wouldn't have a majority yet</li>
<li>proposer might be free to choose <code>v != v_a</code></li>
</ul></li>
</ul></li>
</ul>
<h4>Why does the proposer need to pick <code>v_a</code> with highest <code>n_a</code>?</h4>
<pre><code> A1: p10 a10vA p12
A2: p10 p11 a11vB
A3: p10 p11 a11vB p12 a12v??
n=11 already agreed on vB
n=12 sees both vA and vB, but must choose vB
</code></pre>
<p>Two cases:</p>
<ol>
<li>There was a majority before <code>n=11</code>
<ul>
<li><code>n=11</code>'s prepares would have seen value and re-used it</li>
<li>so it's safe for <code>n=12</code> to re-use <code>n=11</code>'s value</li>
</ul></li>
<li>There was not a majority before <code>n=11</code>
<ul>
<li><code>n=11</code> might have obtained a majority</li>
<li>so it's required for <code>n=12</code> to re-use <code>n=11</code>'s value</li>
</ul></li>
</ol>
<h4>Why does prepare handler check that <code>n > n_p</code>?</h4>
<ul>
<li>it's taking <code>max(concurrent n's)</code>, for accept handler</li>
<li>responding to all <code>prepare()</code> with <code>prepare_ok()</code> would also be fine,
<ul>
<li>but proposers with <code>n < n_p</code> would be ignored by <code>accept()</code> anyway</li>
</ul></li>
</ul>
<h4>Why does accept handler check <code>n >= n_p</code>?</h4>
<ul>
<li>required to ensure agreement</li>
<li>there's a unique highest <code>n</code> active</li>
<li>everyone favors the highest <code>n</code></li>
<li>without <code>n >= n_p</code> check, you could get this bad scenario:</li>
</ul>
<p>Scenario:</p>
<pre><code> A1: p1 p2 a1vA
A2: p1 p2 a1vA a2vB
A3: p1 p2 a2vB
</code></pre>
<h4>Why does accept handler update <code>n_p = n</code>?</h4>
<ul>
<li>required to prevent earlier <code>n</code>'s from being accepted</li>
<li>node can get <code>accept(n,v)</code> even though it never saw <code>prepare(n)</code></li>
<li>without <code>n_p = n</code>, can get this bad scenario:</li>
</ul>
<p>Scenario:</p>
<pre><code> A1: p1 a2vB a1vA p3 a3vA
A2: p1 p2 p3 a3vA
A3: p2 a2vB
</code></pre>
<h4>What if new proposer chooses <code>n < old proposer</code>?</h4>
<ul>
<li>i.e., if clocks are not synced</li>
<li>cannot make progress, though no correctness problem</li>
</ul>
<h4>What if an acceptor crashes after receiving accept?</h4>
<pre><code>A1: p1 a1v1
A2: p1 a1v1 reboot p2 a2v?
A3: p1 p2 a2v?
A2 must remember v_a/n_a across reboot! on disk
might be only intersection with new proposer's majority
and thus only evidence that already agreed on v1
</code></pre>
<h4>What if an acceptor reboots after sending <code>prepare_ok</code>?</h4>
<ul>
<li>does it have to remember <code>n_p</code> on disk?</li>
<li>if <code>n_p</code> not remembered, this could happen:</li>
</ul>
<p>Example:</p>
<pre><code>  S1: p10 a10v10
  S2: p10 p11 reboot a10v10 a11v11
  S3: p11 a11v11
</code></pre>
<ul>
<li>11's proposer did not see value 10, so 11 proposed its own value</li>
<li>but just before that, 10 had been chosen!</li>
<li>b/c <code>S2</code> forgot <code>n_p = 11</code> across the reboot and so accepted <code>a10v10</code> instead of rejecting it</li>
</ul>
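<p>The two reboot scenarios above are exactly why <code>n_p</code>, <code>n_a</code> and <code>v_a</code> must reach disk before the acceptor replies. A minimal sketch, building on the hypothetical <code>acceptor</code> struct from earlier (the file name, record layout, and use of JSON are assumptions):</p>
<pre><code>package paxos

import (
	"encoding/json"
	"os"
)

// diskState is what must survive a reboot, per Paxos instance.
// Field names and the on-disk format are assumptions for illustration.
type diskState struct {
	Np int64
	Na int64
	Va interface{} // assumed to be a JSON-encodable value, e.g. a K/V Op
}

// persist writes n_p, n_a, v_a to stable storage. The acceptor handlers
// would call this (with a.mu held) before replying prepare_ok or accept_ok;
// otherwise the reboot scenarios above can violate agreement.
func (a *acceptor) persist(path string) error {
	buf, err := json.Marshal(diskState{Np: a.np, Na: a.na, Va: a.va})
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(buf); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // force the data to disk before replying
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, path) // atomically replace the previous state file
}
</code></pre>
<p>Writing to a temporary file and renaming it means a crash in the middle of <code>persist</code> leaves the previous state file intact.</p>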
<h4>Can Paxos get stuck?</h4>
<ul>
<li>Yes, if there is not a majority that can communicate</li>
<li>How about if a majority is available?
<ul>
<li>Possible to livelock: dueling proposers, keep <code>prepare</code>'ing higher <code>n</code>'s
<ul>
<li>One reason to try electing a leader: reduce chance of dueling proposers</li>
</ul></li>
<li>With single proposer and reachable majority, should reach consensus</li>
</ul></li>
</ul>
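<p>A common way to reduce (not eliminate) the chance of dueling proposers is the random wait mentioned earlier: back off for a random interval before retrying with a higher <code>n</code>. A hypothetical sketch, reusing <code>proposalNumber</code> from above; <code>runRound</code> and <code>isDecided</code> stand in for the prepare/accept logic and are supplied by the caller:</p>
<pre><code>package paxos

import (
	"math/rand"
	"time"
)

// proposeLoop keeps running proposal rounds for one instance until some value
// is decided. isDecided and runRound are hypothetical helpers supplied by the
// caller, standing in for the prepare/accept logic sketched above. The random
// sleep reduces the chance that two proposers keep invalidating each other's
// rounds forever.
func proposeLoop(me, npeers int, v interface{},
	isDecided func() bool,
	runRound func(n int64, v interface{}) bool) {
	for !isDecided() {
		n := proposalNumber(me, npeers) // from the earlier sketch
		if runRound(n, v) {
			return // a value was decided (possibly not v)
		}
		// back off for a random 10-100 ms before dueling again
		time.Sleep(time.Duration(10+rand.Intn(90)) * time.Millisecond)
	}
}
</code></pre>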