<h1>6.824 2015 Lecture 17: PNUTS</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the
6.824 <a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from
Spring 2015.</p>
<h1>PNUTS</h1>
<ul>
<li>a solution to the same problem Spanner and memcached solved</li>
<li>PNUTS is a more principled design than the Facebook memcache design
<ul>
<li>"it was actually designed"</li>
</ul></li>
<li>make reads fast</li>
<li>upside: web applications are able to do fast local reads due to replication</li>
<li>downside: writes will be slow, because they need to be replicated</li>
<li>because writes have to be distributed to all the regions, there will
be a fundamental delay between when writes happen and when the updates
actually propagate
<ul>
<li><code>=></code> potential for stale reads</li>
</ul></li>
<li>if there's data that could be updated by concurrent clients, there will
be a problem with multiple writes
<ul>
<li>need all regions to see our writes in the same order</li>
</ul></li>
</ul>
<p>Diagram:</p>
<pre><code>Region R1                        Region R2
---------                        ---------
W1    Message broker             W1    Message broker
W2    (replicated)               W2    (replicated)
W3                               W3
..    Tablet controller          ..    Tablet controller
      (replicated)                     (replicated)
Router1 Router2 ...              Router1 Router2 ...
SU1 SU2 SU3 ...                  SU1 SU2 SU3 ...
</code></pre>
<ul>
<li>each region has its own set of webservers</li>
<li>each region stores all data</li>
<li>each table in a region is partitioned among storage units (SUs)</li>
<li>routers know the partitioning (see the lookup sketch below)</li>
<li>each SU has a disk</li>
</ul>
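<p>The paper describes the routers as holding a cached copy of the tablet
controller's interval mapping and binary-searching it to map a key to an
SU. A minimal sketch of that lookup in Go; all names here are invented
for illustration:</p>
<pre><code>package main

import (
    "fmt"
    "sort"
)

// PartitionMap is a hypothetical stand-in for the interval mapping a
// router copies from the tablet controller.
type PartitionMap struct {
    splitKeys []string // sorted upper bounds of the tablets
    sus       []string // sus[i] serves keys below splitKeys[i]
}

// Lookup binary-searches the tablet boundaries for the SU that owns
// the interval containing key.
func (pm PartitionMap) Lookup(key string) string {
    i := sort.SearchStrings(pm.splitKeys, key)
    if i == len(pm.sus) {
        i = len(pm.sus) - 1 // topmost tablet is unbounded above
    }
    return pm.sus[i]
}

func main() {
    pm := PartitionMap{
        splitKeys: []string{"g", "p", "~"},
        sus:       []string{"SU1", "SU2", "SU3"},
    }
    fmt.Println(pm.Lookup("alice"))  // SU1
    fmt.Println(pm.Lookup("walrus")) // SU3
}
</code></pre>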
<h2>Updates</h2>
<ul>
<li>each record in PNUTS has its own master region through which all writes have
to go
<ul>
<li>different than memcache at facebook, they had a master region for <em>all</em>
records</li>
<li>in PNUTS every record has a different master</li>
<li>Note: a record is just a row in a table (and has an extra field that
stores its master)</li>
</ul></li>
<li>updating records that are in regions far away from the user will take longer
of course</li>
<li>how does the webserver know where to send the update?
<ul>
<li>contact one of the routers</li>
<li>router looks at the key, knows it's stored in say SU3</li>
<li>find out from SU3 that a different region <code>r2</code> has the master copy
<ul>
<li>doesn't know which SU at <code>r2</code> the record is at</li>
</ul></li>
<li>contact one of the routers in <code>r2</code></li>
<li>router tells you the SU to store it at</li>
<li>the SU then needs to send out the update to all the other regions</li>
<li>the SU sends the update to the message brokers
<ul>
<li>not clear if SU applies the update to its own disk before</li>
</ul></li>
<li>the message broker writes a copy of the update to the disk
because it is <em>committing</em> to actually sending the update everywhere
<ul>
<li>important because we don't want a failed server to result in partially
propagating the update</li>
</ul></li>
<li>the MB will send it out to other MBs at other sites </li>
<li>somehow the web app needs to find out that the write completes
<ul>
<li>not clear who sends the ACK back</li>
<li>seems that the MB replies back to the webserver app as soon as it
commits the update to the disk</li>
</ul></li>
<li>asynchronous writes because, from POV of webapp, write is done when MB has
written it to its disk</li>
<li>why isn't the MB a bottleneck? It has to write a lot of stuff:
<ul>
<li>different applications have a different message broker</li>
<li>MB may be able to do much more batching of the writes</li>
<li>maybe also MB writes are also less complex than normal database writes
where you have to modify Btrees, maybe go through a file system, etc.</li>
</ul></li>
<li>because they funnel all the writes through some MB they get some
semantics for writes (see the sketch after this list)</li>
</ul></li>
</ul>
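<p>Putting the list above together: a toy end-to-end simulation of one
update, in Go. Every type and function name is invented, and the ack
timing is guesswork (as noted above), so treat this as a sketch of the
flow, not PNUTS' actual interface:</p>
<pre><code>package main

import "fmt"

type Update struct {
    Key, Column, Value string
    Version            int
}

type Region struct {
    Name    string
    records map[string]map[string]string // key -> column -> value
}

// deliver applies one update at one region's SU; the MB keeps
// redelivering until this succeeds, so every replica eventually
// applies every update, in the same order.
func (r *Region) deliver(u Update) {
    if r.records[u.Key] == nil {
        r.records[u.Key] = map[string]string{}
    }
    r.records[u.Key][u.Column] = u.Value
}

// writeUpdate follows the path described above: route to the record's
// master region, sequence the write there, commit at the MB, fan out.
func writeUpdate(master *Region, all []*Region, u Update) {
    // 1. a local router maps the key to an SU, which reveals the
    //    record's master region (stored in a hidden column)
    // 2. the master region's SU applies updates to this record one at
    //    a time, assigning the next version number
    // 3. the SU hands the update to the MB, whose disk log is the
    //    commit point; we believe the app's ack is sent here, before
    //    propagation -- this is the "asynchronous" part
    fmt.Printf("committed %s=%s (v%d) at MB in %s\n",
        u.Column, u.Value, u.Version, master.Name)
    // 4. the MB fans the update out to every region
    for _, r := range all {
        r.deliver(u)
    }
}

func main() {
    west := &Region{Name: "west", records: map[string]map[string]string{}}
    east := &Region{Name: "east", records: map[string]map[string]string{}}
    u := Update{Key: "alice", Column: "Where", Value: "work", Version: 2}
    writeUpdate(east, []*Region{west, east}, u)
    fmt.Println(west.records["alice"]["Where"]) // "work", after propagation
}
</code></pre>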
<h2>Write semantics</h2>
<h3>Write order to single records</h3>
<pre><code>Name    Where   What
----    -----   ----
Alice   home    asleep
Bob
</code></pre>
<ul>
<li>Alice updates her record, which has 3 columns (Name, Where, What)</li>
<li>Alice's application says <code>write(what=awake)</code>
<ul>
<li>write goes through PNUTS</li>
</ul></li>
<li>Alice's application says <code>write(where=work)</code>
<ul>
<li>write goes through PNUTS</li>
</ul></li>
<li>useful semantics given by PNUTS
<ul>
<li>other people in different regions might see
<ul>
<li>alice at home asleep</li>
<li>alice at home awake</li>
<li>alice at work awake</li>
</ul></li>
<li>other people won't see a view of the record inconsistent with the order of
the writes
<ul>
<li>alice at work asleep</li>
</ul></li>
<li>kind of the main consistency semantics provided by PNUTS</li>
<li>a result of sequencing the writes through the MBs</li>
<li>paper calls this <em>per-record timeline consistency</em> (illustrated in the sketch below)</li>
<li>note that their model restricts them to only have transactions on a
single record basis</li>
</ul></li>
</ul>
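<p>A toy Go illustration of the timeline above (invented code, not from
the paper): the master sequences the writes and the MB delivers them to
every replica in that order, so any state a reader can observe is a
prefix of the timeline:</p>
<pre><code>package main

import "fmt"

type State struct{ Where, What string }

func main() {
    // the master sequences Alice's two writes; every replica applies
    // them in this order
    timeline := []func(s *State){
        func(s *State) { s.What = "awake" }, // write 1
        func(s *State) { s.Where = "work" }, // write 2
    }
    s := State{Where: "home", What: "asleep"}
    fmt.Println(s) // {home asleep}
    for _, w := range timeline {
        w(&s) // a lagging replica has applied only a prefix
        fmt.Println(s)
    }
    // prints {home asleep}, {home awake}, {work awake};
    // {work asleep} is not a prefix state, so no reader sees it
}
</code></pre>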
<h3>When would you care about stale data?</h3>
<ul>
<li>after you added something to your shopping cart, you would expect to see it
there</li>
</ul>
<p>Reads vs. staleness</p>
<pre><code> read-any(key) -> fast read, just executes the read on the local SU
    and does not wait for any writes to propagate
 read-critical(key, ver) -> returns a version of the record with
    ver(record) >= ver
    - useful for reading your own writes
    - works when everything happens within a single page/tab, which
      can remember the version numbers returned by its own writes
    - if you update your shopping cart in one tab, another tab will
      not know the version number from the first tab
 read-latest(key) -> always goes to the master copy and reads the
    latest data there
</code></pre>
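<p>A sketch of how an application might use these calls to read its own
writes. The <code>Client</code> type and method names are invented; the
paper specifies the calls but not a concrete client API:</p>
<pre><code>package main

import "fmt"

type Record struct {
    Version int
    Columns map[string]string
}

// Client is a hypothetical PNUTS client library at one region.
type Client struct {
    local map[string]Record // this region's (possibly stale) replica
}

// ReadAny returns whatever the local SU has: fast, maybe stale.
func (c *Client) ReadAny(key string) Record { return c.local[key] }

// ReadCritical returns a copy at least as new as minVer; if the local
// replica is too stale it would have to go to the master region
// (simulated here by just reporting a miss).
func (c *Client) ReadCritical(key string, minVer int) (Record, bool) {
    r := c.local[key]
    if r.Version >= minVer {
        return r, true
    }
    return Record{}, false // must ask the record's master instead
}

func main() {
    c := Client{local: map[string]Record{
        "cart:alice": {Version: 3, Columns: map[string]string{"items": "book"}},
    }}
    // suppose our own write just returned version 4: read-any may
    // still show version 3, but read-critical never goes backwards
    fmt.Println(c.ReadAny("cart:alice").Version) // 3 (stale but fast)
    _, ok := c.ReadCritical("cart:alice", 4)
    fmt.Println(ok) // false: local copy too old, go to the master
}
</code></pre>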
<h3>Writes, atomic updates</h3>
<p>Example: increment a counter in a record</p>
<pre><code> test-and-set-write(key, ver, newvalue) -> always gets sent to the
    master region for the key; if the record's current version matches
    the provided ver, update the record with the new value, else fail
 // implementing v[k]++: retry until no concurrent writer intervenes
 while true:
    (x, v) = read-latest(k)
    if test-and-set-write(k, v, x+1)
       break
</code></pre>
<h2>Question of the day</h2>
<p>Alice comes back from spring break and she:</p>
<ul>
<li>removes her mom from her ACL</li>
<li>posts spring break photos</li>
</ul>
<p>Can her mom see her photos due to out-of-order writes?</p>
<p>If Alice keeps the ACL and the photo list together in a single record, then no.</p>
<pre><code>Alice    | ACL         | List of photos
-------- | ----------- | ----------------
         | mom         | p7, p99
</code></pre>
<p>This assumes the code her mom executes reads the full record (ACL + photos)
in a single read when doing the check, rather than first reading the ACL,
waiting a while, and then reading the photos. The sketch below illustrates
the single-read check.</p>
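<pre><code>package main

import "fmt"

// Invented types for illustration: both the ACL test and the photo
// listing come from one snapshot of the record, so per-record timeline
// consistency rules out pairing an old ACL with a new photo list.
type ProfileRecord struct {
    ACL    map[string]bool // who may view the photos
    Photos []string
}

// photosVisibleTo does the ACL check and the listing against the same
// snapshot of the record.
func photosVisibleTo(rec ProfileRecord, viewer string) []string {
    if !rec.ACL[viewer] {
        return nil
    }
    return rec.Photos
}

func main() {
    // a replica that hasn't applied update #1 (ACL change) can't have
    // applied update #2 (new photos) either, so mom either sees the
    // old record or nothing
    old := ProfileRecord{
        ACL:    map[string]bool{"mom": true},
        Photos: []string{"p7", "p99"},
    }
    fmt.Println(photosVisibleTo(old, "mom")) // [p7 p99]
}
</code></pre>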
<h2>Failures</h2>
<p>If webapp server fails in the middle of doing a bunch of writes, then only
partial info would have been written to PNUTS, possibly leading to corruption.</p>
<ul>
<li>no transactions for multiple writes</li>
</ul>
<p>If an SU crashes and reboots, it can recover from its disk, and the MB keeps
retrying deliveries in the meantime.</p>
<p>What happens when SU loses its disk? It needs to recover the data.</p>
<ul>
<li>the paper says the SU will clone its data from a SU from another region
<ul>
<li>main challenge is that updates are being sent by MBs to records that
are being copied</li>
<li>either updates go to source of copy, or destination of copy remembers the
update</li>
<li>ultimately they both need to have the update in the end</li>
</ul></li>
</ul>
<h2>Performance</h2>
<p>Evaluation mostly focuses on latency and not on throughput. Maybe this is
specific to their needs.</p>
<p>Not clear how they can support millions of users with MBs that can only do
hundreds of writes per second.</p>
<p>Why does it take them 75 ms to do a local update, when everyone is in the same
region?</p>
<ul>
<li>computation, disk, network?</li>
<li>75ms is enormous for a write in a DB</li>
</ul>
<h1>6.824 notes</h1>
<p>Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam
Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel
Weaver and Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving
Platform. Proceedings of VLDB, 2008.</p>
<p>Why this paper?</p>
<ul>
<li>same basic goals as Facebook/memcache paper, more principled design</li>
<li>multi-region is very challenging -- 100ms network delays</li>
<li>conscious trade-off between consistency and performance</li>
</ul>
<p>What is PNUTS' overall goal?</p>
<p>Diagram:</p>
<pre><code>[world, browsers, data centers]
</code></pre>
<ul>
<li>overall story similar to that of Spanner and Facebook/memcache</li>
<li>data centers ("regions") all over the world</li>
<li>web applications, e.g. mail, shopping, social net
<ul>
<li>each app probably runs at all regions</li>
</ul></li>
<li>PNUTS keeps state for apps
<ul>
<li>per-user: profile, shopping cart, friend list</li>
<li>per-item: book popularity, user comments</li>
</ul></li>
<li>app might need any piece of data at any data center</li>
<li>need to handle lots of concurrent updates to different data
<ul>
<li>e.g. lots of users must be able to add items to shopping cart at the
same time; thus 1000s of PNUTS servers</li>
</ul></li>
<li>1000s of servers => crashes must be frequent</li>
</ul>
<h2>Overview</h2>
<p>Diagram:</p>
<pre><code>[3 regions, browsers, web apps, tablet ctlrs, routers, storage units, MBs]
</code></pre>
<ul>
<li>each region has all data</li>
<li>each table partitioned by key over storage units
<ul>
<li>tablet servers + routers know the partition plan</li>
</ul></li>
</ul>
<p>Why replicas of all data at multiple regions?</p>
<ul>
<li>multiple regions -> each user's data geographically close to user</li>
<li>multiple complete replicas -> maybe survive entire region failure</li>
<li>complete replicas -> read anything quickly
<ul>
<li>since some data used by many users / many regions</li>
<li>once you have multiple regions, fast reads are very important</li>
</ul></li>
</ul>
<p>What are the drawbacks of a copy at each region?</p>
<ul>
<li>updates will be slow, need to contact every region</li>
<li>local reads will probably be stale</li>
<li>updates from multiple regions need to be sorted out
<ul>
<li>keep replicas identical</li>
<li>avoid order anomalies</li>
<li>don't lose updates (e.g. read-modify-write for counter)</li>
</ul></li>
<li>disk space probably not an issue for their uses</li>
</ul>
<p>What is the data and query model?</p>
<ul>
<li>basically key/value</li>
<li>reads/writes probably by column
<ul>
<li>so a write might replace just one column, not whole record</li>
</ul></li>
<li>range scan for ordered tables</li>
</ul>
<p>How do updates work?</p>
<ul>
<li>app server gets web request, needs to write data in PNUTS</li>
<li>need to update every region!</li>
<li>why not just have app logic send update to every region?
<ul>
<li>what if app crashes after updating only some regions?</li>
<li>what if concurrent updates to same record?</li>
</ul></li>
</ul>
<p>PNUTS has a "record master" for each record</p>
<ul>
<li>all updates must go through that region
<ul>
<li>each record has a hidden column indicating region of record master</li>
</ul></li>
<li>responsible storage unit executes updates one at a time per record</li>
<li>tells MB to broadcast update to all regions</li>
<li>per-record master probably better than Facebook/memcache master region</li>
</ul>
<p>So the complete update story (some guesswork):</p>
<p>App wants to update some columns of a record, knows key</p>
<ol>
<li>app sends key and update to local SU1</li>
<li>SU1 looks up record master for key: site SI2</li>
<li>SU1 sends update request to router at SI2</li>
<li>router at SI2 forwards update to local SU2 for key</li>
<li>SU2 sends update to local Message Broker (MB)</li>
<li>MB stores on disk + backup MB, sends vers # to original app
<ul>
<li>how does MB know the vers #? maybe SU2 told it</li>
<li>or perhaps SU2 (not MB) replies to original app</li>
</ul></li>
<li>MB sends update to router at every region</li>
<li>every region updates local copy</li>
</ol>
<p>Puzzles:</p>
<ul>
<li>3.2.1 says MB is commit point
<ul>
<li>i.e. MB writes to log on two disks, keeps trying to deliver</li>
<li>why isn't MB disk a terrible bottleneck?</li>
</ul></li>
<li>does update go to MB then SU2? or SU2 then MB? or SU2, MB, SU2?
<ul>
<li>maybe MB then SU2, since MB is commit point</li>
<li>maybe SU2 then MB, since SU2 has to check it's the record's master
and perhaps pick the new version number, tho maybe not needed</li>
</ul></li>
<li>who replies to client w/ new version #?</li>
</ul>
<p>All writes are multi-region and thus slow -- why does it make sense?</p>
<ul>
<li>application waits for MB commit but not propagation ("asynchronous")</li>
<li>master likely to be local (they claim 80% of the time)
<ul>
<li>so MB commit will often be quick</li>
<li>and app/user will often see its own writes soon</li>
</ul></li>
<li>still, eval says 300ms if master is remote!</li>
<li>down side: readers at non-master regions may see stale data</li>
</ul>
<p>How does a read-only query execute?</p>
<ul>
<li>multiple kinds of reads (section 2.2) -- why?</li>
<li>application gets to choose how consistent</li>
<li><code>read-any(k)</code>
<ul>
<li>read from local SU</li>
<li>might return stale data (even if you just wrote!)</li>
<li>why: app wants speed but doesn't care about freshness</li>
</ul></li>
<li><code>read-critical(k, required_version)</code>
<ul>
<li>maybe read from local SU if it has vers >= required_version</li>
<li>otherwise read from master SU?</li>
<li>why: app wants to see its own write</li>
</ul></li>
<li><code>read-latest(k)</code>
<ul>
<li>always read from master SU (? "if local copy too stale")</li>
<li>slow if master is remote!</li>
<li>why: app needs fresh data</li>
</ul></li>
</ul>
<p>What if app needs to increment a counter stored in a record?</p>
<ul>
<li>app reads old value, increments locally, writes new value</li>
<li>what if the local read produced stale data?</li>
<li>what if read was OK, but concurrent updates?</li>
</ul>
<p><code>test-and-set-write(version#, new value)</code> gives you an atomic update to one record</p>
<ul>
<li>master rejects the write if current version # != version#</li>
<li>so if concurrent updates, one will lose and retry</li>
</ul>
<p><code>TestAndSet</code> example:</p>
<pre><code> // retry until our t-a-s-w wins (i.e. no concurrent writer bumped ver)
 while(1):
    (x, ver) = read-latest(k)
    if(t-a-s-w(k, ver, x+1))
      break
</code></pre>
<h2>The Question</h2>
<ul>
<li>how does PNUTS cope with Example 1 (page 2)</li>
<li>Initially Alice's mother is in Alice's ACL, so mother can see photos
<ol>
<li>Alice removes her mother from ACL</li>
<li>Alice posts spring-break photos</li>
</ol></li>
<li>could her mother see update #2 but not update #1?
<ul>
<li>esp if mother uses different region than Alice
or if Alice does the updates from different regions</li>
</ul></li>
<li>ACL and photo list must be in the same record
<ul>
<li>since PNUTS guarantees order only for updates to same record</li>
</ul></li>
<li>Alice sends updates to her record's master region in order
<ul>
<li>master region broadcasts via MB in order</li>
<li>MB tells other regions to apply updates in order</li>
</ul></li>
<li>What if Alice's mother:
<ul>
<li>reads the old ACL, that includes mother</li>
<li>reads the new photo list</li>
<li>answer: just one read of Alice's record, has both ACL and photo list
<ul>
<li>if record doesn't have new ACL, order says it can't have new photos either</li>
</ul></li>
</ul></li>
<li>How could a storage system get this wrong?
<ul>
<li>No ordering through single master (e.g. Dynamo)</li>
</ul></li>
</ul>
<p>How to change a record's master if there are no failures? (see the sketch
after this list)</p>
<ul>
<li>e.g. I move from Boston to LA</li>
<li>perhaps just update the record, via old master?
<ul>
<li>since ID of master region is stored in the record</li>
</ul></li>
<li>old master announces change over MB</li>
<li>a few subsequent updates might go to the old master
<ul>
<li>it will reject them, app retries and finds new master?</li>
</ul></li>
</ul>
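<p>A sketch in Go of the handoff speculated above: since the master region
is just a hidden column, moving it is itself a record update, and
stragglers sent to the old master are rejected and retried. All of this
is guesswork dressed up as code, with invented names:</p>
<pre><code>package main

import (
    "errors"
    "fmt"
)

type MasteredRecord struct {
    MasterRegion string // the hidden column
    Value        string
}

// applyAt simulates an SU: it rejects updates that arrive at a region
// that is not (or no longer) the record's master.
func applyAt(region string, rec *MasteredRecord, newValue string) error {
    if rec.MasterRegion != region {
        return errors.New("not master; retry at " + rec.MasterRegion)
    }
    rec.Value = newValue
    return nil
}

func main() {
    rec := MasteredRecord{MasterRegion: "boston", Value: "v1"}
    // handoff: an ordinary update, sequenced through the old master,
    // that rewrites the hidden master column
    rec.MasterRegion = "LA"
    // a straggler sent to the old master is rejected; the app
    // re-reads the record to learn the new master and retries there
    if err := applyAt("boston", &rec, "v2"); err != nil {
        fmt.Println(err) // not master; retry at LA
        applyAt("LA", &rec, "v2")
    }
    fmt.Println(rec.Value) // v2
}
</code></pre>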
<p>What if we wanted to do bank transfers?</p>
<ul>
<li>from one account (record) to another</li>
<li>can <code>t-a-s-w</code> be used for this?</li>
<li>multi-record updates are not atomic
<ul>
<li>other readers can see intermediate state</li>
<li>other writers are not locked out</li>
</ul></li>
<li>multi-record reads are not atomic
<ul>
<li>might read one account before xfer, other account after xfer</li>
</ul></li>
</ul>
<p>Is lack of general transactions a problem for web applications?</p>
<ul>
<li>maybe not, if programmers know to expect it</li>
</ul>
<p>What about tolerating failures?</p>
<p>App server crashes midway through a set of updates</p>
<ul>
<li>not a transaction, so only some of writes will happen</li>
<li>but master SU/MB either did or didn't get each write
<ul>
<li>so each write happens at all regions, or none</li>
</ul></li>
</ul>
<p>SU down briefly, or network temporarily broken/lossy</p>
<ul>
<li>(I'm guessing here, could be wrong)</li>
<li>MB keeps trying until SU acks
<ul>
<li>SU shouldn't ACK until safely on disk</li>
</ul></li>
</ul>
<p>SU loses disk contents, or doesn't automatically reboot </p>
<ul>
<li>can apps read from remote regions?
<ul>
<li>paper doesn't say</li>
</ul></li>
<li>need to restore disk content from SUs at other regions (sketched after this list)
<ol>
<li>subscribe to MB feed, and save them for now</li>
<li>copy content from SU at another region</li>
<li>replay saved MB updates</li>
</ol></li>
<li>Puzzle:
<ul>
<li>how to ensure we didn't miss any MB updates for this SU?
<ul>
<li>e.g. subscribe to MB at time=100, but source SU only saw through 90?</li>
</ul></li>
<li>will replay apply updates twice? is that harmful?</li>
<li>paper mentions sending checkpoint message through MB
<ul>
<li>maybe fetch copy as of when the checkpoint arrived</li>
<li>and only replay after the checkpoint</li>
<li>BUT no ordering among MB streams from multiple regions</li>
</ul></li>
</ul></li>
</ul>
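<p>A sketch of the recovery order guessed at above, in Go: subscribe to
the MB first, then copy a snapshot from a remote SU, then replay the
buffered updates. The version filter makes replay idempotent, which
answers the "applied twice" puzzle; names and details are assumptions,
not from the paper:</p>
<pre><code>package main

import "fmt"

type Upd struct {
    Key     string
    Version int
    Value   string
}

type Rec struct {
    Version int
    Value   string
}

// recoverSU rebuilds an SU's state. Because we subscribed to the MB
// before copying the snapshot, `buffered` holds every update the
// snapshot could possibly be missing (and maybe some it already has).
func recoverSU(snapshot map[string]Rec, buffered []Upd) map[string]Rec {
    db := map[string]Rec{}
    for k, r := range snapshot {
        db[k] = r
    }
    // replay: skipping anything at or below the stored version makes
    // duplicate delivery harmless
    for _, u := range buffered {
        if cur, ok := db[u.Key]; !ok || u.Version > cur.Version {
            db[u.Key] = Rec{Version: u.Version, Value: u.Value}
        }
    }
    return db
}

func main() {
    snap := map[string]Rec{"alice": {Version: 90, Value: "home"}}
    // update 90 is a duplicate (snapshot already has it); 91 is new
    buf := []Upd{{"alice", 90, "home"}, {"alice", 91, "work"}}
    fmt.Println(recoverSU(snap, buf)["alice"]) // {91 work}
}
</code></pre>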
<p>MB crashes after accepting update</p>
<ul>
<li>logs to disks on two MB servers before ACKing</li>
<li>recovery looks at log, (re)sends logged msgs</li>
<li>record master SU maybe re-sends an update if MB crashes before ACK
<ul>
<li>maybe record version #s will allow SUs to ignore duplicate</li>
</ul></li>
</ul>
<p>MB is a neat idea</p>
<ul>
<li>atomic: updates all replicas, or none
<ul>
<li>rather than app server updating replicas (crash...)</li>
</ul></li>
<li>reliable: keeps trying, to cope with temporary SU/region failure</li>
<li>async: apps don't have to wait for write to complete, good for WAN</li>
<li>ordered: keeps replicas identical even w/ multiple writers</li>
</ul>
<p>Record's master region loses network connection</p>
<ul>
<li>can other regions designate a replacement RM?
<ul>
<li>no: original RM's MB may have logged updates, only some sent out</li>
</ul></li>
<li>do other regions have to wait indefinitely? yes
<ul>
<li>this is one price of ordered updates / strict-ish consistency</li>
</ul></li>
</ul>
<h2>Evaluation</h2>
<p>Evaluation focuses on latency and scaling, not throughput</p>
<p>5.2: time for an insert while busy</p>
<ul>
<li>depends on how far away Record Master is</li>
<li>RM local: 75.6 ms</li>
<li>RM nearby: 131.5 ms</li>
<li>RM other coast: 315.5 ms</li>
</ul>
<p>What is 5.2 measuring? from what to what?</p>
<ul>
<li>maybe from the web server starting the insert to the RM replying w/ new version?</li>
<li>not time for MB to propagate to all regions
<ul>
<li>since then local RM wouldn't be <code>< remote</code></li>
</ul></li>
</ul>
<p>Why 75 ms?</p>
<p>Is it 75 ms of network speed-of-light delay?</p>
<ul>
<li>no: local</li>
</ul>
<p>Is the 75 ms mostly queuing, waiting for other client's operations?</p>
<ul>
<li>no: they imply 100 clients was max that didn't cause delay to rise</li>
</ul>
<p>End of 5.2 suggests 40 ms of the 75 ms is in the SU</p>
<ul>
<li>how could it take 40 ms?
<ul>
<li>each key/value is one file?</li>
<li>creating a file takes 3 disk writes (directory, inode, content)?</li>
</ul></li>
<li>what's the other 35 ms?
<ul>
<li>MB disk write?</li>
</ul></li>
</ul>
<p>But only 33 ms (not 75) for "ordered table" (MySQL/InnoDB)</p>
<ul>
<li>closer to the one or two disk writes we'd expect</li>
</ul>
<p>5.3 / Figure 3: effect of increasing request rate</p>
<ul>
<li>what do we expect for graph w/ x-axis req rate, y-axis latency?
<ul>
<li>system has some inherent capacity, e.g. total disk seeks/second</li>
<li>for lower rates, constant latency</li>
<li>for higher rates, queue grows rapidly, avg latency blows up</li>
</ul></li>
<li>blow-up should be near max capacity of h/w
<ul>
<li>e.g. # disk arms / seek time</li>
</ul></li>
<li>we don't see that in Figure 3
<ul>
<li>apparently their clients were not able to generate too much load</li>
<li>end of 5.3 says clients too slow</li>
<li>at >= 75 ms/op, 300 clients -> about 4000/sec</li>
</ul></li>
<li>text says max possible rate was about 3000/second
<ul>
<li>10% writes, so 300 writes/second</li>
<li>5 SU per region, so 60 writes/SU/second</li>
<li>about right if each write does a random disk I/O</li>
<li>but you'll need lots of SUs for millions of active users</li>
</ul></li>
</ul>
<p>Stepping back, what were PNUTS key design decisions?</p>
<ol>
<li>replication of all data at multiple regions
<ul>
<li>fast reads, slow writes</li>
</ul></li>
<li>relaxed consistency -- stale reads
<ul>
<li>b/c writes are slow</li>
</ul></li>
<li>only single-row transactions w/ test-and-set-write</li>
<li>sequence all writes thru master region
<ul>
<li>pro: keeps replicas identical,
enforces serial order on updates,
easy to reason about</li>
<li>con: slow, no progress if master region disconnected</li>
</ul></li>
</ol>
<p>Next: Dynamo, a very different design</p>
<ul>
<li>async replication, but no master</li>
<li>eventual consistency</li>
<li>always allow updates</li>
<li>tree of versions if network partitions</li>
<li>readers must reconcile versions</li>
</ul>