<h1>6.824 2015 Lecture 9: DSM and Sequential Consistency</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the 6.824
<a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from Spring 2015.</p>
<p><strong>Topic:</strong> Distributed computing</p>
<ul>
<li>parallel computing on distributed machines</li>
<li>4 papers on how to use a collection of machines to solve big computational problems
<ul>
<li>we already read one of them: MapReduce</li>
</ul></li>
<li>3 other papers (IVY, TreadMarks, and Spark)
<ul>
<li>two provide a general-purpose memory model</li>
<li>Spark is in MapReduce style</li>
</ul></li>
</ul>
<h2>Distributed Shared Memory (DSM) programming model</h2>
<ul>
<li>Adopt the same programming model that multiprocessors offer</li>
<li>Programmers can use locks and shared memory
<ul>
<li>Programmers are familiar with this model</li>
<li>e.g., like goroutines sharing memory </li>
</ul></li>
<li>General purpose
<ul>
<li>e.g., no restrictions like with MapReduce</li>
</ul></li>
<li>Applications that run on a multiprocessor can run on IVY/TreadMarks </li>
</ul>
<p><strong>Challenge:</strong> distributed systems don't have physical shared memory</p>
<ul>
<li>On a network of cheap machines
<ul>
<li>[diagram: LAN, machines w/ RAM, MGR]</li>
</ul></li>
</ul>
<p>Diagram:</p>
<pre><code> *----------* *----------* *----------*
| | | | | |
| | | | | |
| | | | | |
*-----*----* *-----*----* *-----*----*
| | |
--------*--------------*--------------*-------- LAN
</code></pre>
<p>Diagram:</p>
<pre><code> M0 M1
*----------* *----------*
| M0 acces | | x x x x |
|----------| |----------|
| x x x x | | M1 acces |
*-----*----* *-----*----*
| |
--------*--------------*------- LAN
    The 'x x x x' pages are not accessible locally;
    they have to be fetched over the network
</code></pre>
<p><strong>Approach:</strong></p>
<ul>
<li>Simulate shared memory using hardware support for virtual memory </li>
<li>General idea illustrated with 2 machines:
<ul>
<li>Part of the address space maps a part of M0's physical memory
<ul>
<li>On M0 it maps to the M0's physical page</li>
<li>On M1 it maps to not present </li>
</ul></li>
<li>Part of the address space maps a part of M1's physical memory
<ul>
<li>On M0 it maps to not present</li>
<li>On M1 it maps to its physical memory</li>
</ul></li>
</ul></li>
<li>A thread of the application on M1 may refer to an address that lives on M0
<ul>
<li>If the thread does a LD/ST to that "shared" address, M1's hardware takes a page fault
<ul>
<li>Because the page is mapped as not present</li>
</ul></li>
<li>OS propagates page fault to DSM runtime</li>
<li>DSM runtime can fetch page from M0</li>
<li>DSM on M0 marks the page not present and sends the page to M1</li>
<li>DSM on M1 receives it from M0, copies it somewhere in memory (say address 4096)</li>
<li>DSM on M1 maps the shared address to physical address 4096</li>
<li>DSM returns from page fault handler</li>
<li>Hardware retries LD/ST</li>
</ul></li>
<li>Runs threaded code w/o modification
<ul>
<li>e.g. matrix multiply, physical simulation, sort</li>
</ul></li>
</ul>
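<p>A toy, single-process sketch of the fetch-on-fault idea (not IVY's or TreadMarks's actual protocol): each "machine" below holds only some pages locally, and reading a page it doesn't have plays the role of the page fault, fetching a copy from the machine that has it. The <code>machine</code> type and its methods are invented for illustration; a real DSM hooks the VM hardware instead of checking a map.</p>
<pre><code>package main

import "fmt"

// machine models one host's local memory: only the pages it has copies of.
type machine struct {
    name  string
    pages map[int][]byte // local page copies, keyed by page number
}

// read returns the contents of page p. If the page is not present locally,
// that stands in for the page fault: fetch a copy from the owner, "map" it
// locally, and then retry the access.
func (m *machine) read(p int, owner *machine) []byte {
    if data, ok := m.pages[p]; ok {
        return data // page present: an ordinary local load
    }
    fmt.Printf("%s: fault on page %d, fetching from %s\n", m.name, p, owner.name)
    data := append([]byte(nil), owner.pages[p]...) // DSM runtime fetches the page
    m.pages[p] = data                              // map it locally
    return data                                    // hardware retries the LD
}

func main() {
    m0 := &machine{name: "M0", pages: map[int][]byte{0: []byte("hello")}}
    m1 := &machine{name: "M1", pages: map[int][]byte{}}
    fmt.Println(string(m1.read(0, m0))) // M1 faults on page 0, fetches it from M0
}
</code></pre>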
<p><strong>Challenges:</strong></p>
<ul>
<li>How to implement it efficiently?
<ul>
<li>IVY and TreadMarks</li>
</ul></li>
<li>How to provide fault tolerance?
<ul>
<li>Many DSMs <a href="https://www.google.com/search?q=punt&oq=punt">punt</a> on this</li>
<li>Some DSM checkpoint the whole memory periodically</li>
<li>We will return to this when talking about Spark</li>
</ul></li>
</ul>
<p><strong>Correctness: coherence</strong></p>
<ul>
<li>We need to articulate what is correctness before optimizing performance
<ul>
<li>Optimizations should preserve correctness</li>
</ul></li>
<li>Less obvious than it may seem!
<ul>
<li>Choice trades off between performance and programmer-friendliness</li>
<li>Huge factor in many designs</li>
<li>More in next lecture </li>
</ul></li>
<li>Today's paper assumes a simple model
<ul>
<li>The distributed memory should behave like a single memory</li>
<li>Loads/stores behave much like the puts/gets in labs 2-4</li>
</ul></li>
</ul>
<p><a name="example-1"></a>
<strong>Example 1:</strong></p>
<pre><code> x and y start out = 0
M0:
x = 1
if y == 0:
print yes
M1:
y = 1
if x == 0:
print yes
Can they both print "yes"?
</code></pre>
<p>Naive distributed shared memory</p>
<p><a name="diagram-1"></a>
<strong>Diagram 1:</strong></p>
<ul>
<li>M0, M1, M2, LAN</li>
<li>each machine has a local copy of all of memory</li>
<li>read: from local memory</li>
<li>write: send update msg to each other host (but don't wait)</li>
<li>fast: never waits for communication</li>
</ul>
<p>Does this naive memory work well?</p>
<ul>
<li>What will it do with <a href="#example-1">Example 1</a>?
<ul>
<li>It can fail: neither M0 nor M1 may have seen the other's
write by the time its <code>if</code> statement runs,
so they can both print <em>yes</em>.</li>
</ul></li>
<li>Naive distributed memory is fast but incorrect</li>
</ul>
<p>Diagram (broken scheme):</p>
<pre><code> M0
*----------* *----------* *----------*
| | | | | |
| ------------------------> wAx |
| ----------> wAx | | |
*-----*----* *-----*----* *-----*----*
| | |
--------*--------------*--------------*-------- LAN
</code></pre>
<ul>
<li>M0 does write locally and tells other machines about the
write after it has done it</li>
<li>imagine what value you would get instead of 9 if each of the three
machines ran a program that incremented the value at address A 3 times:
updates can be lost, as in the toy model below</li>
</ul>
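<p>A toy model of the broken scheme, with the "network" reduced to delayed overwrites (the 3 machines and the 3-increment workload follow the example above; everything else is made up for illustration). With a sequentially consistent memory the answer would be 9; here the delayed update messages clobber each other's increments.</p>
<pre><code>package main

import "fmt"

func main() {
    const nMachines = 3
    local := make([]int, nMachines) // each machine's local copy of A, initially 0

    // Each machine increments its own copy 3 times, never waiting for the
    // update messages it sends to the others.
    for m := 0; m < nMachines; m++ {
        for i := 0; i < 3; i++ {
            local[m]++ // a purely local write under the naive scheme
        }
    }

    // Now the delayed update messages arrive: whichever machine's update is
    // applied last simply overwrites A everywhere, losing the other increments.
    final := local[nMachines-1]
    fmt.Println("A =", final) // prints 3, not 9
}
</code></pre>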
<p>Coherence = <em>sequential consistency</em></p>
<ul>
<li>"Read sees <em>most recent</em> write" is not clear enough when you
have multiple processes</li>
<li>Need to nail down correctness a bit more precisely </li>
<li>Sequential consistency means:
<ul>
<li>The result of any execution is the same as if
<ul>
<li>the operations of all the processors were executed in some sequential order (total order)</li>
<li>and the operations of each individual processor appear in this sequence
in the order specified by its program </li>
<li>(if P issues A before B, B cannot appear before A in that sequential order)</li>
<li>and read sees last write in total order</li>
</ul></li>
</ul></li>
<li>There must be some total order of operations such that
<ol>
<li>Each machine's instructions appear in that total order in program order</li>
<li>All machines see results consistent with that order
<ul>
<li>i.e. reads see most recent write in the order</li>
</ul></li>
</ol></li>
<li>Behavior of a single shared memory</li>
</ul>
<p>Would sequential consistency cause our example to get the intuitive result?</p>
<p>Sequence:</p>
<pre><code> M0: Wx1 Ry?
M1: Wy1 Rx?
</code></pre>
<ul>
<li>The system is required to merge these into one order,
and to maintain the order of each machine's operations.</li>
<li>So there are a few possibilities:
<ul>
<li><code>Wx1 Ry0 Wy1 Rx1</code></li>
<li><code>Wx1 Wy1 Ry1 Rx1</code></li>
<li><code>Wx1 Wy1 Rx1 Ry1</code></li>
<li>others too, but all symmetric?</li>
</ul></li>
<li>What is forbidden?
<ul>
<li><code>Wx1 Ry0 Wy1 Rx0</code> -- Rx0 read didn't see preceding Wx1 write (naive system did this)</li>
<li><code>Ry0 Wy1 Rx0 Wx1</code> -- M0's instructions out of order (some CPUs do this)</li>
</ul></li>
</ul>
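<p>A small brute-force check of this claim (not from the paper): enumerate every interleaving of M0's <code>Wx1 Ry</code> and M1's <code>Wy1 Rx</code> that preserves each machine's program order, run it against a single memory, and collect the possible results. <code>Ry=0 Rx=0</code> never shows up.</p>
<pre><code>package main

import "fmt"

// op is a single memory operation: a write of 1, or a read, to address addr.
type op struct {
    write bool
    addr  string
}

// enumerate tries every legal interleaving of the remaining ops of m0 and m1,
// applying them to mem and recording reads; results collects the final
// (Ry, Rx) pairs that some interleaving can produce.
func enumerate(m0, m1 []op, mem, reads map[string]int, results map[string]bool) {
    if len(m0) == 0 && len(m1) == 0 {
        results[fmt.Sprintf("Ry=%d Rx=%d", reads["y"], reads["x"])] = true
        return
    }
    step := func(rest0, rest1 []op, o op) {
        savedMem, savedRead := mem[o.addr], reads[o.addr]
        if o.write {
            mem[o.addr] = 1
        } else {
            reads[o.addr] = mem[o.addr]
        }
        enumerate(rest0, rest1, mem, reads, results)
        mem[o.addr], reads[o.addr] = savedMem, savedRead // backtrack
    }
    if len(m0) > 0 {
        step(m0[1:], m1, m0[0]) // next op is M0's
    }
    if len(m1) > 0 {
        step(m0, m1[1:], m1[0]) // next op is M1's
    }
}

func main() {
    m0 := []op{{true, "x"}, {false, "y"}} // M0: Wx1 then Ry
    m1 := []op{{true, "y"}, {false, "x"}} // M1: Wy1 then Rx
    results := map[string]bool{}
    enumerate(m0, m1, map[string]int{}, map[string]int{}, results)
    for r := range results {
        fmt.Println(r) // Ry=0 Rx=1, Ry=1 Rx=1, Ry=1 Rx=0 -- never Ry=0 Rx=0
    }
}
</code></pre>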
<p>Go's memory consistency model</p>
<ul>
<li>What is Go's semantics for the example?</li>
<li>Go would allow both goroutines to print "yes"!</li>
<li>Go race detector wouldn't like the example program anyway</li>
<li>Programmer is <em>required</em> to use locks/channels to get sensible semantics</li>
<li>Go doesn't require the hardware/DSM to implement strict consistency</li>
<li>More about weaker consistency on Thursday</li>
</ul>
<p>Example:</p>
<pre><code>x = 1
y = 2
</code></pre>
<ul>
<li>Go's memory model specifies whether a goroutine that has seen the write to y is also guaranteed to see the write to x
<ul>
<li>Without synchronization there is no such guarantee: seeing y == 2 does not imply seeing x == 1</li>
</ul></li>
</ul>
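<p>The goroutine version of Example 1, as a sketch (deliberately racy; as noted above, <code>go run -race</code> will flag it). Without locks or channels, Go's memory model allows each goroutine to miss the other's write, so both may print "yes".</p>
<pre><code>package main

import (
    "fmt"
    "sync"
)

func main() {
    var x, y int // both start out 0, shared by the two goroutines
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { // plays the role of M0
        defer wg.Done()
        x = 1
        if y == 0 {
            fmt.Println("M0: yes")
        }
    }()
    go func() { // plays the role of M1
        defer wg.Done()
        y = 1
        if x == 0 {
            fmt.Println("M1: yes")
        }
    }()
    wg.Wait()
}
</code></pre>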
<p>A simple implementation of sequential consistency</p>
<p>A straightforward way to get sequential consistency: just put
a manager between the machines that interleaves their memory
operations (a code sketch appears after Diagram 2 below)</p>
<pre><code> *----------* *----------*
| | | |
| | | |
| | | |
*-----*----* *-----*----*
| |
--------*--------------*--------
|
|
*----------*
| inter- |
| leaver |
| |
*-----*----*
|
-*-
\ /
.
RAM
</code></pre>
<p><a name="diagram-2"></a>
<strong>Diagram 2:</strong></p>
<pre><code> *----------* *----------*
| | | |
| | | |
| | | |
*-----*----* *-----*----*
| |
--------*--------------*--------
|
|
*----------*
| |
| |
| |
*-----*----*
|
-*-
\ /
.
RAM
</code></pre>
<ul>
<li>single memory server</li>
<li>each machine sends r/w ops to server, in order, waiting for reply</li>
<li>server picks order among waiting ops</li>
<li>server executes one by one, sending replies</li>
<li>big ideas:
<ul>
<li>if people just read some data, we can replicate it on all of them</li>
<li>if someone writes data, we need to prevent other people from writing it
<ul>
<li>so we take the page out of those other people's memory</li>
</ul></li>
</ul></li>
</ul>
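<p>A minimal sketch of this single-server scheme in Go (names like <code>memoryServer</code>, <code>load</code>, and <code>store</code> are made up): one goroutine owns the memory and applies operations one at a time, and every machine blocks until its operation is answered, which is what produces the single total order.</p>
<pre><code>package main

import "fmt"

// op is one load or store sent to the memory server; reply carries the
// server's answer back to the issuing machine.
type op struct {
    write bool
    addr  string
    val   int
    reply chan int
}

// memoryServer owns the only copy of memory and executes ops one by one,
// which defines the total order required by sequential consistency.
func memoryServer(ops chan op) {
    mem := map[string]int{}
    for o := range ops {
        if o.write {
            mem[o.addr] = o.val
            o.reply <- 0
        } else {
            o.reply <- mem[o.addr]
        }
    }
}

func store(ops chan op, addr string, val int) {
    r := make(chan int)
    ops <- op{write: true, addr: addr, val: val, reply: r}
    <-r // wait for the reply before issuing the next operation
}

func load(ops chan op, addr string) int {
    r := make(chan int)
    ops <- op{addr: addr, reply: r}
    return <-r
}

func main() {
    ops := make(chan op)
    go memoryServer(ops)

    done := make(chan bool)
    go func() { // "M0"
        store(ops, "x", 1)
        if load(ops, "y") == 0 {
            fmt.Println("M0: yes")
        }
        done <- true
    }()
    go func() { // "M1"
        store(ops, "y", 1)
        if load(ops, "x") == 0 {
            fmt.Println("M1: yes")
        }
        done <- true
    }()
    <-done
    <-done // at most one "yes": the server's serial order is sequentially consistent
}
</code></pre>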
<p>This simple implementation will be slow</p>
<ul>
<li>single server will get overloaded</li>
<li>no local cache, so all operations wait for server</li>
</ul>
<p>Which brings us to <strong>IVY</strong></p>
<ul>
<li>IVY: Integrated shared Virtual memory at Yale</li>
<li>Memory Coherence in Shared Virtual Memory Systems, Li and Hudak, PODC 1986</li>
</ul>
<p>IVY big picture</p>
<p><a name="diagram-3"></a></p>
<pre><code> [diagram: M0 w/ a few pages of mem, M1 w/ a few pages, LAN]
</code></pre>
<ul>
<li>Operates on pages of memory, stored in machine DRAM (no mem server)</li>
<li>Each page present in each machine's virtual address space</li>
<li>On each machine, a page might be invalid, read-only, or read-write</li>
<li>Uses VM hardware to intercept reads/writes</li>
</ul>
<p>Invariant:</p>
<ul>
<li>A page is either:
<ul>
<li>Read/write on one machine, invalid on all others; or</li>
<li>Read/only on $\geq 1$ machines, read/write on none</li>
</ul></li>
<li>Read fault on an invalid page:
<ul>
<li>Demote R/W (if any) to R/O</li>
<li>Copy page</li>
<li>Mark local copy R/O</li>
</ul></li>
<li>Write fault on an R/O page:
<ul>
<li>Invalidate all copies</li>
<li>Mark local copy R/W</li>
</ul></li>
</ul>
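<p>A minimal sketch of these per-page access states and transitions (illustrative Go, not the paper's pseudo-code; the manager and the message traffic are omitted here and walked through in the scenarios below):</p>
<pre><code>package main

import "fmt"

// Access is one machine's right to one particular page.
type Access int

const (
    Invalid Access = iota
    ReadOnly
    ReadWrite
)

// page tracks every machine's access to one page. Invariant: either exactly
// one machine is ReadWrite and the rest are Invalid, or some machines are
// ReadOnly and none is ReadWrite.
type page struct {
    access []Access
}

// readFault: machine m read-faults on a page it holds Invalid.
func (p *page) readFault(m int) {
    for i := range p.access {
        if p.access[i] == ReadWrite {
            p.access[i] = ReadOnly // demote the writer (if any) to read-only
        }
    }
    p.access[m] = ReadOnly // copy the page, mark the local copy read-only
}

// writeFault: machine m write-faults on a page it holds Invalid or ReadOnly.
func (p *page) writeFault(m int) {
    for i := range p.access {
        p.access[i] = Invalid // invalidate all other copies
    }
    p.access[m] = ReadWrite // mark the local copy read-write
}

func main() {
    p := &page{access: make([]Access, 3)} // M0, M1, M2: all Invalid
    p.writeFault(0)                       // M0 writes: RW on M0 only
    p.readFault(1)                        // M1 reads: M0 demoted to RO, M1 gets an RO copy
    p.writeFault(2)                       // M2 writes: all other copies invalidated
    fmt.Println(p.access)                 // [0 0 2] = Invalid, Invalid, ReadWrite
}
</code></pre>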
<p>IVY allows multiple reader copies between writes</p>
<ul>
<li>For speed -- local reads are fast</li>
<li>No need to force an order for reads that occur between two writes</li>
<li>Let them occur concurrently -- a copy of the page at each reader</li>
</ul>
<p>Why crucial to invalidate all copies before write?</p>
<ul>
<li>Once a write completes, all subsequent reads <em>must</em> see new data</li>
<li>Otherwise we break our example, and don't get sequential consistency</li>
</ul>
<p>How does IVY do on the example?</p>
<ul>
<li>I.e. could both M0 and M1 print "yes"?</li>
<li>If M0 sees y == 0,
<ul>
<li>M1 hasn't done its write to y (no stale data == reads see prior writes),</li>
<li>M1 hasn't read x (each machine in order),</li>
<li>M1 must see x == 1 (no stale data == reads see prior writes).</li>
</ul></li>
</ul>
<p>Message types:</p>
<ul>
<li>[don't list these on board, just for reference]</li>
<li>RQ read query (reader to manager (MGR))</li>
<li>RF read forward (MGR to owner)</li>
<li>RD read data (owner to reader)</li>
<li>RC read confirm (reader to MGR)</li>
<li>&c</li>
</ul>
<p>(see <a href="code/ivy-code.txt">ivy-code.txt</a> on web site)</p>
<p>Scenario 1: M0 has writeable copy, M1 wants to read</p>
<p><strong>Diagram 3:</strong></p>
<pre><code> [time diagram: M 0 1]
</code></pre>
<ol>
<li>M1 tries to read and gets a page fault
<ul>
<li>because the page must have been marked invalid, since
M0 has it R/W (see the invariant described earlier)</li>
</ul></li>
<li>M1 sends RQ to MGR</li>
<li>MGR sends RF to M0, MGR adds M1 to <code>copy_set</code>
<ul>
<li>What is <code>copy_set</code>?</li>
<li>"The <code>copy_set</code> field lists all processors that have copies of the page.
This allows an invalidation operation to be performed without using
broadcast."</li>
</ul></li>
<li>M0 marks page as $access = read$, sends RD to M1</li>
<li>M1 marks $access = read$, sends RC to MGR</li>
</ol>
<p>Scenario 2: now M2 wants to write</p>
<p><strong>Diagram 4:</strong></p>
<pre><code> [time diagram: M 0 1 2]
</code></pre>
<ol>
<li>Page fault on M2</li>
<li>M2 sends WQ to MGR</li>
<li>MGR sends IV to copy_set (i.e. M1)</li>
<li>M1 sends IC msg to MGR</li>
<li>MGR sends WF to M0, sets owner=M2, copy_set={}</li>
<li>M0 sends WD to M2, access=none</li>
<li>M2 marks r/w, sends WC to MGR</li>
</ol>
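<p>A toy model of the manager's bookkeeping for Scenario 2 (the message sends are just prints, and <code>pageInfo</code>/<code>writeQuery</code> are invented names, not the paper's pseudo-code):</p>
<pre><code>package main

import "fmt"

// pageInfo is the manager's record for one page: who owns it read-write,
// and which machines hold read-only copies (the copy_set).
type pageInfo struct {
    owner   string
    copySet map[string]bool
}

// writeQuery handles a WQ from requester, following steps 3-7 of Scenario 2.
func (info *pageInfo) writeQuery(requester string) {
    for m := range info.copySet {
        fmt.Println("IV ->", m, "(wait for its IC)") // invalidate every read-only copy
    }
    fmt.Println("WF ->", info.owner) // owner sends WD to requester, drops its own access
    info.owner = requester           // requester becomes the owner...
    info.copySet = map[string]bool{} // ...and nobody else has a copy
    // requester maps the page read-write and confirms with WC
}

func main() {
    info := &pageInfo{owner: "M0", copySet: map[string]bool{"M1": true}}
    info.writeQuery("M2") // IV -> M1, WF -> M0
    fmt.Println("owner:", info.owner, "copy_set:", info.copySet)
}
</code></pre>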
<p><strong>Q:</strong> What if two machines want to write the same page at the same time?</p>
<p><strong>Q:</strong> What if one machine reads just as ownership is changing hands?</p>
<p>Does IVY provide strict consistency?</p>
<ul>
<li>no: MGR might process two STs in order opposite to issue time</li>
<li>no: ST may take a long time to revoke read access on other machines
<ul>
<li>so LDs may get old data long after the ST issues</li>
</ul></li>
</ul>
<p>What if there were no IC message?</p>
<p>(IC is the invalidation acknowledgment a copy holder sends back to MGR after invalidating its copy; see step 4 of Scenario 2.)</p>
<ul>
<li>(this is the new Question)</li>
<li>i.e. MGR didn't wait for holders of copies to ACK?</li>
</ul>
<p>No WC?</p>
<p>(WC is the write confirmation the new owner sends to MGR after mapping the page read/write; see step 7 of Scenario 2.)</p>
<ul>
<li>(this used to be The Question)</li>
<li>e.g. MGR unlocked after sending WF to M0?</li>
<li>MGR would send subsequent RF, WF to M2 (new owner)</li>
<li>What if such a WF/RF arrived at M2 before WD?
<ul>
<li>No problem! M2 has <code>ptable[p].lock</code> locked until it gets WD</li>
</ul></li>
<li>RC + <code>info[p].lock</code> prevents RF from being overtaken by a WF</li>
<li>so it's not clear why WC is needed!
<ul>
<li>but I am not confident in this conclusion</li>
</ul></li>
</ul>
<p>What if there were no RC message?</p>
<ul>
<li>i.e. MGR unlocked after sending RF?</li>
<li>could RF be overtaken by subsequent WF?</li>
<li>or by a subsequent IV?</li>
</ul>
<p>In what situations will IVY perform well?</p>
<ol>
<li>Page read by many machines, written by none</li>
<li>Page written by just one machine at a time, not used at all by others</li>
</ol>
<p>Cool that IVY moves pages around in response to changing use patterns</p>
<p>Will page size of e.g. 4096 bytes be good or bad?</p>
<ul>
<li>good if spatial locality, i.e. program looks at large blocks of data</li>
<li>bad if program writes just a few bytes in a page
<ul>
<li>subsequent readers copy whole page just to get a few new bytes</li>
</ul></li>
<li>bad if false sharing (illustrated in the sketch after this list)
<ul>
<li>i.e. two unrelated variables on the same page
<ul>
<li>and at least one is frequently written</li>
</ul></li>
<li>page will bounce between different machines
<ul>
<li>even read-only users of a non-changing variable will get invalidations</li>
</ul></li>
<li>even though those computers never use the same location</li>
</ul></li>
</ul>
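<p>A small illustration of the false-sharing case (the 4096-byte array stands in for one IVY page; the layout and names are invented):</p>
<pre><code>package main

import (
    "encoding/binary"
    "fmt"
)

// Think of `page` as one 4096-byte IVY page. a (bytes 0-3) and b (bytes 64-67)
// are unrelated variables that merely happen to live on the same page.
const pageSize = 4096

var page [pageSize]byte

// writeA is M0's hot counter: under IVY, every call invalidates all other
// copies of the whole page, including the copy on a machine that only reads b.
func writeA(v uint32) { binary.LittleEndian.PutUint32(page[0:4], v) }

// readB is M1's read-only variable: after each writeA, M1 must re-fetch the
// entire 4 KB page just to see a value that never changed.
func readB() uint32 { return binary.LittleEndian.Uint32(page[64:68]) }

func main() {
    for i := uint32(1); i <= 3; i++ {
        writeA(i)
        fmt.Println("b is still", readB()) // 0 every time; the page bounced anyway
    }
}
</code></pre>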
<p>What about IVY's performance?</p>
<ul>
<li>after all, the point was speedup via parallelism</li>
</ul>
<p>What's the best we could hope for in terms of performance?</p>
<ul>
<li>$N \times$ faster on N machines</li>
</ul>
<p>What might prevent us from getting $N \times$ speedup?</p>
<ul>
<li>Application is inherently non-scalable
<ul>
<li>Can't be split into parallel activities</li>
</ul></li>
<li>Application communicates too many bytes
<ul>
<li>So network prevents more machines yielding more performance</li>
</ul></li>
<li>Too many small reads/writes to shared pages
<ul>
<li>Even if # bytes is small, IVY makes this expensive</li>
</ul></li>
</ul>
<p>How well do they do?</p>
<ul>
<li>Figure 4: near-linear for PDEs (partial differential equations)</li>
<li>Figure 6: very sub-linear for sort
<ul>
<li>sorting a huge array involves moving a lot of data</li>
<li>almost certain to move all data over the network at least once</li>
</ul></li>
<li>Figure 7: near-linear for matrix multiply</li>
<li>in general, you end up limited by network throughput,
for instance when reading a lot of pages</li>
</ul>
<p>Why did sort do poorly?</p>
<ul>
<li>Here's my guess</li>
<li>N machines, data in 2*N partitions</li>
<li>Phase 1: Local sort of 2*N partitions for N machines</li>
<li>Phase 2: 2N-1 merge-splits; each round sends all data over network</li>
<li>Phase 1 probably gets linear speedup</li>
<li>Phase 2 probably does not -- limited by LAN speed
<ul>
<li>also more machines may mean more rounds</li>
</ul></li>
<li>So for small # machines, local sort dominates, more machines helps</li>
<li>For large # machines, communication dominates, more machines don't help</li>
<li>Also, more machines shifts the work from the $n \log n$ local sort toward the $n^2$ bubble-sort-like merge-split phase</li>
</ul>
<p>How could one speed up IVY?</p>
<ul>
<li>next lecture: relax the consistency model
<ul>
<li>allow multiple writers to same page!</li>
</ul></li>
</ul>
<p>Paper intro says DSM subsumes RPC -- is that true?</p>
<ul>
<li>When would DSM be better than RPC?
<ul>
<li>More transparent. Easier to program.</li>
</ul></li>
<li>When would RPC be better?
<ul>
<li>Isolation. Control over communication. Tolerate latency.</li>
<li>Portability. Define your own semantics.</li>
<li>Abstraction?</li>
</ul></li>
<li>Might you still want RPC in your DSM system? For efficient sleep/wakeup?</li>
</ul>
<p>Known problems in Section 3.1 pseudo-code</p>
<ul>
<li>Fault handlers must wait for owner to send <code>p</code> before confirming to manager</li>
<li>Deadlock if owner has page R/O and takes write fault
<ul>
<li>Worrisome that there is no clear acquisition order for <code>ptable[p].lock</code> vs <code>info[p].lock</code></li>
<li>TODO: Whaaaat?</li>
</ul></li>
<li>Write server / manager must set <code>owner = request_node</code></li>
<li>Manager parts of fault handlers don't ask owner for the page</li>
<li>Does processing of the invalidate request hold <code>ptable[p].lock</code>?
<ul>
<li>probably can't -- deadlock</li>
</ul></li>
</ul>