<h1>6.824 2015 Lecture 4: "Flat Datacenter Storage" Case Study</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the 6.824
<a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from Spring 2015.</p>
<h2>Flat datacenter storage</h2>
<p><a href="papers/fds.pdf">Flat Datacenter Storage</a>, <em>Nightingale, Elson, Fan, Hofmann, Howell, Suzue</em>, OSDI 2012</p>
<h3>Why are we looking at this paper?</h3>
<ul>
<li>Lab 2 wants to be like this when it grows up
<ul>
<li>though details are all different</li>
</ul></li>
<li>fantastic performance -- world record cluster sort</li>
<li>good systems paper -- details from apps all the way to network</li>
</ul>
<h3>What is FDS?</h3>
<ul>
<li>a cluster storage system</li>
<li>stores giant blobs -- 128-bit ID, multi-megabyte content</li>
<li>clients and servers connected by network with high bisection bandwidth</li>
<li>for big-data processing (like MapReduce)
<ul>
<li>cluster of 1000s of computers processing data in parallel</li>
</ul></li>
</ul>
<h3>High-level design -- a common pattern</h3>
<ul>
<li>lots of clients</li>
<li>lots of storage servers ("tractservers")</li>
<li>lots of bandwidth between any two servers</li>
<li>data is stored in blobs
<ul>
<li>addressed by 128-bit IDs</li>
<li>further split into tracts
<ul>
<li>numbered from 0</li>
<li>8MB sized</li>
</ul></li>
</ul></li>
<li>partition the data</li>
<li>master ("metadata server") controls partitioning</li>
<li>replica groups for reliability</li>
<li>tract locator table (TLT) stores a bunch of entries
<ul>
<li>in a <code>k</code>-replicated system, each entry has <code>k</code> tractservers</li>
</ul></li>
<li>how to find where tract <code>t</code> of blob <code>b</code> is stored? (see the sketch after this list)
<ul>
<li>compute TLT entry as <code>(h(b) + t) mod len(tlt)</code>
<ul>
<li>and you'll get a list of servers in that entry</li>
</ul></li>
<li>blob metadata is <em>distributed</em> and NOT stored in the TLT</li>
</ul></li>
<li>how to write a tract from a blob?
<ul>
<li>look it up as described above</li>
<li>send write to all servers in TLT entry</li>
<li>only acknowledge write to client if <em>all</em> servers replied</li>
</ul></li>
<li>how to read a tract from a blob?
<ul>
<li>look it up as described above</li>
<li>send read to <em>a random</em> server in the TLT entry</li>
</ul></li>
</ul>
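<p>To make the lookup and read/write paths above concrete, here is a minimal Go sketch of the client side, assuming an in-memory TLT and stub RPCs; the type names, the choice of hash function, and the helper functions are illustrative, not FDS's actual API:</p>
<pre><code>// Client-side sketch: locate a tract's TLT entry, write to all replicas,
// read from one random replica. Types, hash choice, and RPC stubs are
// illustrative only.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"math/rand"
)

// Tractserver identifies one storage server.
type Tractserver struct{ Addr string }

// TLT: each entry lists the k tractservers replicating the tracts mapped to it.
type TLT [][]Tractserver

// tltIndex computes (hash(b) + t) mod len(tlt).
func tltIndex(tlt TLT, blob [16]byte, tract uint64) int {
	h := sha256.Sum256(blob[:]) // some fixed hash of the 128-bit blob ID
	return int((binary.BigEndian.Uint64(h[:8]) + tract) % uint64(len(tlt)))
}

// WriteTract sends the tract to every server in its TLT entry and succeeds
// only if all of them acknowledge.
func WriteTract(tlt TLT, blob [16]byte, tract uint64, data []byte) error {
	for _, ts := range tlt[tltIndex(tlt, blob, tract)] {
		if err := writeToServer(ts, blob, tract, data); err != nil {
			return err // don't ack to the application until every replica replied
		}
	}
	return nil
}

// ReadTract asks a single, randomly chosen replica.
func ReadTract(tlt TLT, blob [16]byte, tract uint64) ([]byte, error) {
	entry := tlt[tltIndex(tlt, blob, tract)]
	return readFromServer(entry[rand.Intn(len(entry))], blob, tract)
}

// Stubs standing in for the RPCs to tractservers.
func writeToServer(ts Tractserver, blob [16]byte, tract uint64, data []byte) error {
	return errors.New("not implemented")
}
func readFromServer(ts Tractserver, blob [16]byte, tract uint64) ([]byte, error) {
	return nil, errors.New("not implemented")
}

func main() {}
</code></pre>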
<h3>Why is this high-level design useful?</h3>
<ul>
<li>1000s of disks of space
<ul>
<li>store giant blobs, or many big blobs</li>
</ul></li>
<li>1000s of servers/disks/arms of parallel throughput</li>
<li>can expand over time -- reconfiguration</li>
<li>large pool of storage servers for instant replacement after failure</li>
</ul>
<h3>Motivating app: MapReduce-style sort</h3>
<ul>
<li>a mapper reads its split <code>1/M'th</code> of the input file (e.g., a tract)
<ul>
<li>map emits a <code><key, record></code> for each record in split</li>
<li>map partitions keys among <code>R</code> intermediate files (<code>M*R</code> intermediate files in total; see the sketch after this list)</li>
</ul></li>
<li>a reducer reads 1 of <code>R</code> intermediate files produced by each mapper
<ul>
<li>reads <code>M</code> intermediate files (of <code>1/R</code> size)</li>
<li>sorts its input</li>
<li>produces <code>1/R'th</code> of the final sorted output file (<code>R</code> blobs)</li>
</ul></li>
<li>FDS sort
<ul>
<li>FDS sort does not store the intermediate files in FDS</li>
<li>a client is both a mapper and reducer</li>
<li>FDS sort is not locality-aware
<ul>
<li>in MapReduce, the master schedules workers on machines that are close to the data (e.g., in the same cluster)</li>
<li>later versions of FDS sort use more fine-grained work assignment
<ul>
<li>e.g., a mapper doesn't get 1/N of the input file but something smaller</li>
<li>this deals better with stragglers</li>
</ul></li>
</ul></li>
</ul></li>
</ul>
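<p>On the partition step above: in a distributed sort the mapper typically range-partitions records by key so that reducer <code>r</code> receives the <code>r</code>'th key range and the concatenated reducer outputs are globally sorted. A tiny sketch, partitioning on just the first key byte purely for illustration:</p>
<pre><code>// Partition step of the sort: range-partition each record by key.
// Using only the first key byte is a simplification for illustration.
package main

import "fmt"

func bucket(firstKeyByte byte, R int) int {
	return int(firstKeyByte) * R / 256 // which of the R intermediate files
}

func main() {
	R := 4
	for _, b := range []byte{0x05, 0x7f, 0xf0} {
		fmt.Printf("key starting with 0x%02x -> intermediate file %d\n", b, bucket(b, R))
	}
}
</code></pre>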
<h3>The abstract's main claims are about performance.</h3>
<ul>
<li>They set the 2012 world record for disk-to-disk sorting (MinuteSort)
<ul>
<li>1,033 disks and 256 computers (136 tract servers, 120 clients)</li>
<li>1,401 Gbyte in 59.4s</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Does the abstract's 2 GByte/sec per client seem impressive?</p>
<ul>
<li>how fast can you read a file from Athena AFS? (about 10 MB/sec)</li>
<li>how fast can you read a typical hard drive?</li>
<li>how fast can typical networks move data?</li>
</ul>
<p><strong>Q:</strong> Abstract claims recover from lost disk (92 GB) in 6.2 seconds</p>
<ul>
<li>that's 15 GByte / sec</li>
<li>impressive?</li>
<li>how is that even possible? that's roughly 100x the bandwidth of a single disk!</li>
<li>who might care about this metric?</li>
</ul>
<h3>What should we want to know from the paper?</h3>
<ul>
<li>API?</li>
<li>layout?</li>
<li>finding data?</li>
<li>add a server?</li>
<li>replication?</li>
<li>failure handling?</li>
<li>failure model?</li>
<li>consistent reads/writes? (i.e. does a read see latest write?)
<ul>
<li>Not in FDS: "The current protocol for replication depends upon the
client to issue all writes to all replicas. This decision
means that FDS provides weak consistency guarantees
to clients. For example, if a client writes a tract to 1
of 3 replicas and then crashes, other clients reading
different replicas of that tract will observe differing state."</li>
<li>"Writes are not guaranteed to be committed
in order of issue. Applications with ordering requirements
are responsible for issuing operations after previous
acknowledgments have been received, rather than
concurrently. FDS guarantees atomicity: a write is either
committed or failed completely"</li>
</ul></li>
<li>config mgr failure handling?</li>
<li>good performance?</li>
<li>useful for apps?</li>
</ul>
<h4>API</h4>
<ul>
<li>Figure 1</li>
<li>128-bit blob IDs</li>
<li>blobs have a length</li>
<li>only whole-tract read and write -- 8 MB</li>
</ul>
<p><strong>Q:</strong> Why are 128-bit blob IDs a nice interface?
- Why not file names?</p>
<p><strong>Q:</strong> Why do 8 MB tracts make sense?
- (Figure 3...)</p>
<p><strong>Q:</strong> What kinds of client applications is the API aimed at?
- and not aimed at?</p>
<h4>Layout: how do they spread data over the servers?</h4>
<ul>
<li>Section 2.2</li>
<li>break each blob into 8 MB tracts</li>
<li>TLT maintained by metadata server
<ul>
<li>has <code>n</code> entries</li>
<li>for blob <code>b</code> and tract <code>t</code>, <code>i = (hash(b) + t) mod n</code></li>
<li><code>TLT[i]</code> contains list of tractservers w/ copy of the tract</li>
</ul></li>
<li>clients and servers all have copies of the latest TLT table</li>
</ul>
<h4>Example four-entry TLT with no replication:</h4>
<pre><code> 0: S1
1: S2
2: S3
3: S4
suppose hash(27) = 2
then the tracts of blob 27 are laid out:
S1: 2 6
S2: 3 7
S3: 0 4 8
S4: 1 5 ...
FDS is "striping" blobs over servers at tract granularity
</code></pre>
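<p>A tiny Go sketch that reproduces the striping table in the example above (the server names and <code>hash(27) = 2</code> are taken from the example, not computed):</p>
<pre><code>// Reproduces the striping above: 4-entry TLT, hash(27) = 2 as assumed in the example.
package main

import "fmt"

func main() {
	tlt := []string{"S1", "S2", "S3", "S4"}
	hashB := 2 // hash(27) = 2, per the example
	for t := 0; t <= 8; t++ {
		i := (hashB + t) % len(tlt)
		fmt.Printf("tract %d -> TLT entry %d -> %s\n", t, i, tlt[i])
	}
	// prints: tract 0 -> S3, tract 1 -> S4, tract 2 -> S1, tract 3 -> S2, ...
}
</code></pre>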
<p><strong>Q:</strong> Why have tracts at all? Why not store each blob on just one server?
- What kinds of apps will benefit from striping?
- What kinds of apps won't?</p>
<p><strong>Q:</strong> How fast will a client be able to read a single tract?</p>
<p><strong>Q:</strong> Where does the abstract's single-client 2 GB number come from?</p>
<p><strong>Q:</strong> Why not the UNIX i-node approach?</p>
<ul>
<li>store an array per blob, indexed by tract #, yielding tractserver</li>
<li>so you could make per-tract placement decisions
<ul>
<li>e.g. write new tract to most lightly loaded server</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Why not <code>hash(b + t)</code>?</p>
<p><strong>Q:</strong> How many TLT entries should there be?</p>
<ul>
<li>how about <code>n = number of tractservers</code>?</li>
<li>why do they claim this works badly? Section 2.2</li>
</ul>
<p>The system needs to choose server pairs (or triplets &c) to put in TLT entries</p>
<ul>
<li>For replication</li>
<li>Section 3.3</li>
</ul>
<p><strong>Q:</strong> How about:</p>
<pre><code> 0: S1 S2
1: S2 S1
2: S3 S4
3: S4 S3
...
</code></pre>
<ul>
<li>Why is this a bad idea?</li>
<li>How long will repair take?</li>
<li>What are the risks if two servers fail?</li>
</ul>
<p><strong>Q:</strong> why is the paper's <code>n^2</code> scheme better?</p>
<p>Example:</p>
<pre><code> 0: S1 S2
1: S1 S3
2: S1 S4
3: S2 S1
4: S2 S3
5: S2 S4
...
</code></pre>
<ul>
<li>TLT with <code>n^2</code> entries, with every server pair occurring once</li>
<li>How long will repair take?</li>
<li>What are the risks if two servers fail?</li>
</ul>
<p><strong>Q:</strong> Why do they actually use a minimum replication level of 3?</p>
<ul>
<li>Same <code>n^2</code> table as before, with the third server chosen at random (see the sketch below)</li>
<li>What effect on repair time?</li>
<li>What effect on two servers failing?</li>
<li>What if three disks fail?</li>
</ul>
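<p>For concreteness, a sketch of how such a table might be built: roughly <code>n^2</code> entries, one per ordered pair of distinct servers, plus a randomly chosen third replica. This is a guess at the construction based on the description above, not code from the paper:</p>
<pre><code>// Build a TLT with one entry per ordered pair of distinct tractservers
// (roughly n^2 entries) plus a randomly chosen third replica.
package main

import (
	"fmt"
	"math/rand"
)

func buildTLT(servers []string) [][]string {
	var tlt [][]string
	for _, a := range servers {
		for _, b := range servers {
			if a == b {
				continue
			}
			// pick a random third server different from the first two
			// (assumes at least 3 servers)
			c := a
			for c == a || c == b {
				c = servers[rand.Intn(len(servers))]
			}
			tlt = append(tlt, []string{a, b, c})
		}
	}
	return tlt
}

func main() {
	tlt := buildTLT([]string{"S1", "S2", "S3", "S4"})
	fmt.Println(len(tlt), "entries; first entry:", tlt[0])
}
</code></pre>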
<h4>Adding a tractserver</h4>
<ul>
<li>To increase the amount of disk space / parallel throughput</li>
<li>Metadata server picks some random TLT entries</li>
<li>Substitutes new server for an existing server in those TLT entries</li>
</ul>
<h4>Extending a blob's size</h4>
<ul>
<li>Newly created blobs have a length of 0 tracts</li>
<li>Applications must extend a blob before writing past the end of it. </li>
<li>The extend operation is atomic, is safe to execute concurrently with other
clients, and returns the new size of the blob as a result of the client’s call. </li>
<li>A separate API tells the client the blob's current size. </li>
<li>Extend operations for a blob are sent to the tractserver that owns that blob’s
metadata tract. </li>
<li>The tractserver serializes it, atomically updates the metadata, and returns
the new size to each caller. </li>
<li>If all writers follow this pattern, the extend operation provides a range of
tracts the caller may write without risk of conflict (see the sketch after this list).
Therefore, the extend API is functionally equivalent to the Google File System's
"atomic append."</li>
<li>Space is allocated lazily on tractservers, so tracts claimed but not used do
not waste storage.</li>
</ul>
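<p>Based on the description above, a sketch of how a client might append tracts using the extend call; <code>ExtendBlob</code> and <code>WriteTract</code> here are hypothetical stand-ins for the client-library calls, not the paper's exact API:</p>
<pre><code>// Append tracts by first extending the blob to reserve a conflict-free range.
// ExtendBlob and WriteTract are hypothetical stubs, not the paper's API.
package main

import "errors"

// ExtendBlob atomically grows the blob by n tracts and, per the description
// above, returns the blob's new size in tracts.
func ExtendBlob(blob [16]byte, n int) (newSize uint64, err error) {
	return 0, errors.New("stub: would go to the tractserver owning the metadata tract")
}

// WriteTract writes one 8 MB tract (stub).
func WriteTract(blob [16]byte, tract uint64, data []byte) error {
	return errors.New("stub: would write to all replicas in the TLT entry")
}

// AppendTracts reserves a range with extend, then writes into it; other
// clients extending concurrently get disjoint ranges, so there is no conflict.
func AppendTracts(blob [16]byte, tracts [][]byte) error {
	newSize, err := ExtendBlob(blob, len(tracts))
	if err != nil {
		return err
	}
	start := newSize - uint64(len(tracts)) // first tract reserved for this caller
	for i, data := range tracts {
		if err := WriteTract(blob, start+uint64(i), data); err != nil {
			return err
		}
	}
	return nil
}

func main() {}
</code></pre>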
<h4>How do they maintain the <code>n^2</code>-plus-one arrangement as servers leave/join?</h4>
<p>Unclear.</p>
<p><strong>Q:</strong> How long will adding a tractserver take?</p>
<p><strong>Q:</strong> What about client <code>write</code>'s while tracts are being transferred?</p>
<ul>
<li>receiving tractserver may have copies from client(s) and from old srvr</li>
<li>how does it know which is newest?</li>
</ul>
<p><strong>Q:</strong> What if a client reads/writes but has an old tract table?</p>
<ul>
<li>the tractservers tell the client that its TLT is stale (see the sketch below)</li>
</ul>
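<p>A hedged sketch of how stale clients could be detected: each request carries the TLT version the client used, and a tractserver rejects requests whose version doesn't match its own, forcing the client to refetch the TLT. This is an assumption about the mechanism (TLT version numbers are mentioned in the failure-handling section below); names are illustrative:</p>
<pre><code>// Stale-TLT detection sketch: requests carry the client's TLT version and a
// tractserver rejects mismatches, so the client refetches the TLT.
package main

import "fmt"

type TractReq struct {
	TLTVersion uint64
	Blob       [16]byte
	Tract      uint64
}

type Tractserver struct {
	tltVersion uint64 // version of the latest TLT this server has seen
}

// handle returns false (request rejected) when the client's TLT is stale.
func (s *Tractserver) handle(req TractReq) bool {
	if req.TLTVersion != s.tltVersion {
		return false // client must fetch a fresh TLT from the metadata server and retry
	}
	// ... serve the read/write ...
	return true
}

func main() {
	s := &Tractserver{tltVersion: 7}
	fmt.Println(s.handle(TractReq{TLTVersion: 6})) // false: stale client
	fmt.Println(s.handle(TractReq{TLTVersion: 7})) // true
}
</code></pre>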
<h4>Replication</h4>
<ul>
<li>A writing client sends a copy to each tractserver in the TLT.</li>
<li>A reading client asks one tractserver.</li>
</ul>
<p><strong>Q:</strong> Why don't they send writes through a primary?</p>
<ul>
<li>it puts a lot of work on the primary? it has to look up and know the TLT</li>
<li>the goal is not to have just one backup for a primary; it's to replicate and
stripe data effectively across many disks</li>
</ul>
<p><strong>Q:</strong> What problems are they likely to have because of lack of primary?</p>
<ul>
<li>Why weren't these problems show-stoppers?</li>
</ul>
<h4>What happens after a tractserver fails?</h4>
<ul>
<li>Metadata server stops getting heartbeat RPCs</li>
<li>Picks a random replacement for each TLT entry the failed server was in (see the sketch below)</li>
<li>New TLT gets a new version number</li>
<li>Replacement servers fetch copies</li>
</ul>
<p>Example of the tracts each server holds:</p>
<pre><code> S1: 0 4 8 ...
S2: 0 1 ...
S3: 4 3 ...
S4: 8 2 ...
</code></pre>
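<p>A sketch of the metadata server's failure handling described above: for every TLT entry that contains the failed server, substitute a randomly chosen live server and bump the TLT version (data structures are illustrative, not the paper's):</p>
<pre><code>// Metadata server reaction to a failed tractserver: replace it in every TLT
// entry with a random live server and bump the TLT version.
package main

import (
	"fmt"
	"math/rand"
)

type MetadataServer struct {
	tlt        [][]string // each entry: the tractservers replicating its tracts
	tltVersion uint64
	live       []string // servers still sending heartbeats
}

func (m *MetadataServer) handleFailure(failed string) {
	for _, entry := range m.tlt {
		for i, s := range entry {
			if s != failed {
				continue
			}
			// pick a random live replacement not already in this entry
			// (assumes some live server outside the entry exists)
			r := failed
			for r == failed || contains(entry, r) {
				r = m.live[rand.Intn(len(m.live))]
			}
			entry[i] = r // the replacement then fetches copies from the other replicas
		}
	}
	m.tltVersion++ // the new TLT gets a new version number
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

func main() {
	m := &MetadataServer{
		tlt:  [][]string{{"S1", "S2"}, {"S1", "S3"}, {"S2", "S3"}},
		live: []string{"S2", "S3", "S4"},
	}
	m.handleFailure("S1")
	fmt.Println(m.tlt, "version", m.tltVersion)
}
</code></pre>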
<p><strong>Q:</strong> Why not just pick one replacement server?</p>
<ul>
<li>it will have to take in a lot of writes for the lost data <code>=></code> bad perf.</li>
</ul>
<p><strong>Q:</strong> How long will it take to copy all the tracts?</p>
<p><strong>Q:</strong> If a tractserver's net breaks and is then repaired, might srvr serve old data?</p>
<p><strong>Q:</strong> If a server crashes and reboots with disk intact, can contents be used?</p>
<ul>
<li>e.g. if it only missed a few writes?</li>
<li>3.2.1's "partial failure recovery"</li>
<li>but won't it have already been replaced?</li>
<li>how to know what writes it missed?</li>
</ul>
<p><strong>Q:</strong> When is it better to use 3.2.1's partial failure recovery?</p>
<h4>What happens when the metadata server crashes?</h4>
<p><strong>Q:</strong> While metadata server is down, can the system proceed?</p>
<ul>
<li>yes, clients who have the TLT can go on</li>
</ul>
<p><strong>Q:</strong> Is there a backup metadata server?</p>
<ul>
<li>not in the paper, they said they might use Paxos for replication</li>
<li><strong>TODO:</strong> not clear why replicating the metadata server would lead to consistency
problems</li>
</ul>
<p><strong>Q:</strong> How does rebooted metadata server get a copy of the TLT?</p>
<ul>
<li>Eh, maybe it has it on disk?</li>
<li>Maybe it just simply reconstructs it from all the heartbeats?</li>
</ul>
<p><strong>Q:</strong> Does their scheme seem correct?</p>
<ul>
<li>how does the metadata server know it has heard from all tractservers?
<ul>
<li>it doesn't, it just adds servers as they send heartbeats</li>
</ul></li>
<li>how does it know all tractservers were up to date?
<ul>
<li><strong>TODO:</strong> Up to date with what?</li>
</ul></li>
</ul>
<h4>Random issues</h4>
<p><strong>Q:</strong> Is the metadata server likely to be a bottleneck?</p>
<ul>
<li>hard to tell. what's the use case?</li>
<li>if you have a client that has enough memory to remember the TLT,
then it only contacts the metadata server once and then
starts doing all of its reads/writes</li>
<li>if you have a lot of clients joining the system, or coming back but
having forgotten the TLT (maybe because of a lack of storage), then the metadata
server would be heavily used
<ul>
<li>however, this won't affect the bandwidth the clients get once they
have downloaded the TLT</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Why do they need the scrubber application mentioned in 2.3?</p>
<ul>
<li>why don't they delete the tracts when the blob is deleted?
<ul>
<li>faster to do GC, rather than scheduling & executing deletes?</li>
</ul></li>
<li>can a blob be written after it is deleted?
<ul>
<li><strong>TODO:</strong> not sure, seems like yes, because the metadata for that
blob is in tract -1 and I don't think <code>WriteTract</code> checks the metadata
before every write, so you could maybe have races</li>
</ul></li>
</ul>
<h4>Performance</h4>
<p><strong>Q:</strong> How do we know we're seeing "good" performance? What's the best you can expect?</p>
<ul>
<li>best you can expect is to take each disk's bandwidth and have the system's
bandwidth be <code># of disks * disk bandwidth</code></li>
</ul>
<p><strong>Q:</strong> Limiting resource for 2 GBps single-client?</p>
<ul>
<li>assuming this is end of 5.2</li>
<li>30 tractservers can deliver at most 30 * 130MB/s = 3.9GBps from their disks, more than the 2GBps observed</li>
<li>so the limiting resource is the client's network bandwidth</li>
</ul>
<p><strong>Q:</strong> Figure 4a: Why starts low? Why goes up? Why levels off? Why does it level off at that particular performance?</p>
<ul>
<li>starts low because single client bandwidth is limited</li>
<li>goes up b.c. as # of clients is increased each one adds more bandwidth to
the system</li>
<li>levels off because at some point the clients' aggregate bandwidth exceeds the servers' aggregate bandwidth</li>
<li>why levels off at 32 GBps for <code>x</code> clients w/ 516 disks?
<ul>
<li>Figure 3 suggests a 10,000 RPM disk can read 5MB chunks at around 130MB/s
<ul>
<li>writes are similar</li>
</ul></li>
<li>not clear from logarithmic scale graph what <code>x</code> is
<ul>
<li><code>10 < x < 50</code> (maybe <code>25 < x < 50</code>?)</li>
</ul></li>
<li><code>516 disks * 130MB/s = 67 GBps</code>, so seems like best case performance
should've leveled off at more than 32 GBps?
<ul>
<li>in reality not all disks are 130MB/s maybe? (only the 10,000 RPM SAS ones
were that fast)</li>
<li>in reality multiple disks on a single node might make that number smaller
maybe?</li>
</ul></li>
<li>anyway, something like <code>x=40</code> clients would have <code>40 * 10Gbps = 40 * 1.25GBps
= 50 GBps</code>, which is higher than the actual (claimed) aggregate server bandwidth
of 32 GBps</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Figure 4b shows random r/w as fast as sequential (Figure 4a). Is this what you'd expect?</p>
<ul>
<li>Yes. Random R/W requests for different tracts go to different servers, just
like sequential ones do <code>=></code> no difference</li>
</ul>
<p><strong>Q:</strong> Why are writes slower than reads with replication in Figure 4c?</p>
<ul>
<li>A write is sent to all tractservers and is not done until all of them reply.
<ul>
<li>w/ higher number of clients writing <code>=></code> more work done by each server</li>
<li>Paper says: "As expected, the write bandwidth is about one-third
of the read bandwidth since clients must send three
copies of each write"</li>
</ul></li>
<li>A read is sent to just one?</li>
</ul>
<p><strong>Q:</strong> Where does the 92 GB in 6.2 seconds come from?</p>
<ul>
<li>Table 1, 4th column</li>
<li>that's 15 GB / second, both read and written</li>
<li>1000 disks, triple replicated, 128 servers?</li>
<li>what's the limiting resource? disk? cpu? net?</li>
</ul>
<p>How big is each sort bucket?</p>
<ul>
<li>i.e. is the sort of each bucket in-memory?</li>
<li>1400 GB total</li>
<li>128 compute servers</li>
<li>between 12 and 96 GB of RAM each</li>
<li>hmm, say 50 on average, so total RAM may be 6400 GB</li>
<li>thus the sort of each bucket is done in memory and doesn't write intermediate passes to FDS</li>
<li>thus total time is just four transfers of 1400 GB
<ul>
<li>client limit: <code>128 * 2 GB/s = 256 GB / sec</code></li>
<li>disk limit: <code>1000 * 50 MB/s = 50 GB / sec</code></li>
</ul></li>
<li>thus bottleneck is likely to be disk throughput</li>
</ul>