<h1>6.824 2015 Lecture 16: Memcache at Facebook</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the
6.824 <a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from
Spring 2015.</p>
<h2>Introduction</h2>
<p>Facebook Memcached paper:</p>
<ul>
<li>an experience paper, not a research results paper</li>
<li>you can read it as a triumph paper</li>
<li>you can read it as a caution paper: what happens when you don't think
hard about scalability and consistency up front</li>
<li>you can read it as a trade-offs paper</li>
</ul>
<h2>Scaling a webapp</h2>
<p>The initial design for any webapp is a single machine running both the web
server and the DB server.</p>
<p>Diagram (single machine: webserver and DB server)</p>
<pre><code>Single machine
-------------------
|     Web app     |
|                 |
|                 |
|                 |
|      |DB|  <---------> |Disk|
-------------------
</code></pre>
<ul>
<li>eventually they find out they use 100% of the CPU on this machine</li>
<li><code>top</code> will tell them the CPU time is going to the web-app</li>
</ul>
<p>Diagram (multiple webserver machines, single DB machine):</p>
<pre><code>Web app ----\
Web app ----+-----> |DB| <-> Disk
  ...       |
Web app ----/
</code></pre>
<ul>
<li>the next problem they face is that the database becomes the bottleneck
<ul>
<li>DB CPU is at 100%, or the disk is at 100%</li>
</ul></li>
<li>say they decide to buy a bunch of DB servers</li>
</ul>
<p>Diagram:</p>
<pre><code>Web app        |DB1| <-> Disk (users A-M)
Web app        |DB2| <-> Disk (users N-T)
 ...           |DB3| <-> Disk  ...
Web app         ...
</code></pre>
<ul>
<li>now they have to figure out how to <em>shard</em> the data across the
database servers (a sketch of the routing follows below)</li>
<li>application software now has to know which DB server to talk to</li>
<li>we can no longer have transactions across the whole DB</li>
<li>we can no longer have single queries over the entire dataset
<ul>
<li>need to send separate queries to each server</li>
</ul></li>
<li>you can't push this too far: after a while the shards get very small,
and popular keys turn individual database servers into hotspots</li>
</ul>
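<p>A minimal sketch of what that shard-routing code might look like, in Go. The
server names and the hash-based mapping are assumptions for illustration; the
real mapping could just as well be by key range.</p>
<pre><code>package main

import (
    "fmt"
    "hash/fnv"
)

// Hypothetical list of DB shards; in reality this comes from config.
var shards = []string{"db1.internal", "db2.internal", "db3.internal"}

// shardFor picks the DB server responsible for a user's data, so
// every query about that user consistently hits the same shard.
func shardFor(user string) string {
    h := fnv.New32a()
    h.Write([]byte(user))
    return shards[h.Sum32()%uint32(len(shards))]
}

func main() {
    fmt.Println(shardFor("alice")) // all of alice's rows live on one shard
}
</code></pre>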
<p>Next, you notice that most of the operations on the DB are reads (at least,
that is the case at Facebook).</p>
<ul>
<li>it turns out you can build a very simple memory cache that can serve half
a million requests per second</li>
<li>then you can remove 90% of the load on the database</li>
</ul>
<p>Diagram:</p>
<pre><code>Web app --> |MC|       |DB1| <-> Disk (users A-M)
Web app     |MC| ----> |DB2| <-> Disk (users N-T)
 ...         ..        |DB3| <-> Disk  ...
Web app                 ...
</code></pre>
<ul>
<li>the next bottleneck will be database writes, if you keep growing your service</li>
</ul>
<p>Observation: you could use DB read-only replicas, instead of your own customized
memcache (MC) nodes. Facebook did not do this because they wanted to separate
their caching logic from their DB deployment:</p>
<blockquote>
<p>It was the best choice given limited engineering
resources and time. Additionally, separating our
caching layer from our persistence layer allows us to adjust
each layer independently as our workload changes</p>
</blockquote>
<h2>Facebook's use case</h2>
<p><strong>Very crucial:</strong> they do not care too much about all their users getting a
consistent view of the system.</p>
<p>The only case in which the paper cares about freshness and consistency is when
webapp clients read their own writes.</p>
<p>Their high level picture:</p>
<pre><code>Regions (data centers):
Master region (writable)
-----------------
| Web1 Web2 ... |
| MC1 MC2 ... |
| DB1 DB2 ... <--- complete copy of all data
-----------------
Slave regions (read only)
-----------------
| Web1 Web2 ... |
| MC1 MC2 ... |
| DB1 DB2 ... <--- complete copy of all data
-----------------
</code></pre>
<p>The reason for having multiple data centers: parallelism across the globe</p>
<ul>
<li>maybe also for backup purposes (the paper doesn't go into much detail)</li>
</ul>
<h2>Big lessons</h2>
<h3>Look-aside caching can be tricky</h3>
<p>This style of look-aside caching, where the application checks the cache
first and falls back to the DB on a miss, is extremely easy to add to an
existing system (a sketch follows below)</p>
<ul>
<li>but there are some nasty consistency problems that appear when the caching
layer is oblivious to what happens in the DB</li>
</ul>
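<p>A minimal Go sketch of the look-aside pattern, assuming hypothetical
<code>Cache</code> and <code>DB</code> interfaces standing in for memcached and MySQL:</p>
<pre><code>package lookaside

// Cache and DB are hypothetical stand-ins for memcached and MySQL;
// only the access pattern matters here.
type Cache interface {
    Get(k string) (v string, ok bool)
    Set(k, v string)
    Delete(k string)
}

type DB interface {
    Read(k string) string
    Write(k, v string)
}

// Read: try the cache first; on a miss, fall through to the DB and
// install the result so future readers hit.
func Read(c Cache, db DB, k string) string {
    if v, ok := c.Get(k); ok {
        return v
    }
    v := db.Read(k)
    c.Set(k, v)
    return v
}

// Write: update the DB, then delete (not update) the cached copy so
// the next reader refetches a fresh value. The two steps are not
// atomic -- that gap is where the consistency races come from.
func Write(c Cache, db DB, k, v string) {
    db.Write(k, v)
    c.Delete(k)
}
</code></pre>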
<h3>Caching is about throughput, not latency</h3>
<p>It wasn't about reducing latency for the users. They were using the cache to
increase throughput and take load off the database.</p>
<ul>
<li>there's no way the DB could have handled the full load, which is 10x or
100x more than what the DB can serve</li>
</ul>
<h3>They can tolerate stale data</h3>
<h3>They want to be able to read their own writes</h3>
<p>You'd think this could be fixed easily in the application. It is slightly
surprising that they didn't just have the application remember its own recent
writes; it's not clear why they solved it differently.</p>
<h3>Eventual consistency is good enough</h3>
<h3>They have enormous fan-out</h3>
<p>Each webpage they serve might generate hundreds of reads, which is a little
surprising. So they have to use a bunch of tricks, such as issuing the reads in
parallel. When a single server does this, it gets a burst of responses back all
at once; the amount of buffering in the switches and webservers is limited, so
if they're not careful they drop packets and then lose performance to retries.
A sketch of such a parallel fetch follows.</p>
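<p>A Go sketch of fetching many keys in parallel with a cap on in-flight
requests to limit the response burst; <code>fetchOne</code> and the cap of 32
are assumptions for illustration:</p>
<pre><code>package fanout

import "sync"

// fetchAll issues gets for many keys in parallel, but allows at most
// maxInFlight outstanding requests, so the responses don't all arrive
// at once and overrun switch/NIC buffers.
func fetchAll(keys []string, fetchOne func(string) string) map[string]string {
    const maxInFlight = 32 // illustrative; tune to the network
    sem := make(chan struct{}, maxInFlight)
    var (
        mu  sync.Mutex
        wg  sync.WaitGroup
        out = make(map[string]string, len(keys))
    )
    for _, k := range keys {
        wg.Add(1)
        go func(k string) {
            defer wg.Done()
            sem <- struct{}{} // acquire a slot
            v := fetchOne(k)
            <-sem // release the slot
            mu.Lock()
            out[k] = v
            mu.Unlock()
        }(k)
    }
    wg.Wait()
    return out
}
</code></pre>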
<h2>Performance</h2>
<ul>
<li>a lot of content about consistency in the paper</li>
<li>but really they were desperate to get performance which led to doing tricks,
which led to consistency problems</li>
<li>performance comes from being able to serve a lot of <code>Get</code>'s in parallel</li>
</ul>
<p>Really only two strategies:</p>
<ul>
<li>partition data</li>
<li>replicate data</li>
<li>they use both</li>
</ul>
<p>Partitioning works if all keys are roughly equally popular. Otherwise, certain
partitions would be more popular and become hotspots. Replication helps with
handling demand for popular keys. Replication also helps with requests from
remote places in the world. A sketch of the two routing rules follows.</p>
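<p>The two strategies side by side, as a Go sketch (server list and hash choice
are illustrative): with partitioning the <em>key</em> picks the server, so there
is one copy of each key; with replication the <em>client</em> picks a server, so
popular keys end up cached everywhere:</p>
<pre><code>package main

import (
    "fmt"
    "hash/fnv"
)

var mcServers = []string{"mc0", "mc1", "mc2", "mc3"}

func h(s string) uint32 {
    f := fnv.New32a()
    f.Write([]byte(s))
    return f.Sum32()
}

// partitioned: hash(key) -> one copy per key; memory-efficient,
// but a single hot key hammers one server.
func partitioned(key string) string {
    return mcServers[h(key)%uint32(len(mcServers))]
}

// replicated: hash(clientID) -> each client sticks to one server, so
// every server ends up caching the popular keys; this spreads read
// load at the cost of RAM.
func replicated(clientID string) string {
    return mcServers[h(clientID)%uint32(len(mcServers))]
}

func main() {
    fmt.Println(partitioned("user:42"), replicated("webserver17"))
}
</code></pre>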
<p>You can't simply cache keys in the web app servers, because they would all
fill their memories quickly and you would double-store a lot of data.</p>
<h2>Specific problem they dealt with</h2>
<p>Each cluster has a full set of memcache servers and a full set of web servers.
Each web server talks to memcache servers in its own cluster.</p>
<h3>Adding a new cluster</h3>
<p>Sometimes, they want to add a new cluster, which will obviously have empty memcache
servers.</p>
<ul>
<li>all webservers in the new cluster would miss on every request and fall
through to the DB, which cannot handle the increased load</li>
<li>instead of contacting the DB, the new cluster contacts memcache servers
in other clusters until its own cache is warmed up (see the sketch below)</li>
</ul>
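<p>A sketch of that warm-up read path in Go; <code>getFrom</code> and
<code>setInto</code> are hypothetical memcache RPC wrappers:</p>
<pre><code>package warmup

// coldGet is the read path used while a new ("cold") cluster warms up:
// on a cold-cluster miss, fall back to a warm cluster and lazily copy
// the value over. 2x load on the warm cluster beats a huge spike on the DB.
func coldGet(k string,
    getFrom func(cluster, key string) (string, bool),
    setInto func(cluster, key, val string)) (string, bool) {

    if v, ok := getFrom("cold", k); ok {
        return v, true // already warmed up for this key
    }
    if v, ok := getFrom("warm", k); ok {
        setInto("cold", k, v) // lazy copy into the new cluster
        return v, true
    }
    return "", false // genuine miss: caller falls through to the DB
}
</code></pre>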
<p><strong>Q:</strong> what benefit do they get from adding new clusters, instead of
increasing the size of an existing cluster?</p>
<ul>
<li>one possibility is that some keys are very popular, so partitioning an
existing cluster more finely won't spread their load</li>
<li>another possibility is that it's easier to add memcache servers as a whole
new cluster, because of the data-movement problem within a cluster</li>
</ul>
<h3>Memcache server goes down</h3>
<p>If a memcache server goes down, its requests are redirected to a gutter server.
The gutter machines will miss a lot initially, but at least they cache the
results for future requests (a sketch follows below).</p>
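<p>A client-side failover sketch in Go; the <code>GutterPool</code> type, the
helper functions, and the 10-second TTL are assumptions for illustration:</p>
<pre><code>package gutter

import "time"

// GutterPool is a hypothetical client for the small pool of idle
// "gutter" servers used only after a normal memcache server fails.
type GutterPool interface {
    Get(k string) (v string, ok bool)
    SetTTL(k, v string, ttl time.Duration)
}

func getWithGutter(k string, normalGet func(string) (string, bool, error),
    gutter GutterPool, readDB func(string) string) string {
    v, ok, err := normalGet(k)
    if err == nil {
        if ok {
            return v
        }
        return readDB(k) // plain miss: normal look-aside path (set omitted)
    }
    // the normal server is down: serve from / fill the gutter pool instead
    if v, ok := gutter.Get(k); ok {
        return v
    }
    v = readDB(k)
    gutter.SetTTL(k, v, 10*time.Second) // expires on its own; never invalidated
    return v
}
</code></pre>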
<h3>Homework question:</h3>
<p><strong>Q:</strong> Why aren't gutters invalidated on writes? </p>
<p>On a write, the DB typically sends an invalidate to all the MC servers that might
have that key, so a lot of deletes get sent around to a lot of MC
servers. Maybe they don't want to flood the gutter servers with all those deletes.</p>
<p>Note that gutter keys expire after a certain time, to deal with the fact that
they are never invalidated.</p>
<p>Not clear what happens if gutter servers go down.</p>
<p><strong>Q:</strong> Wouldn't it be better if the DB server sent out the new value instead
of an invalidate?</p>
<ul>
<li>what's cached in MC might not be the raw DB value; it might be some function
of the DB value that the DB layer is not aware of
<ul>
<li>think of how a friend list is stored in the DB versus how it would be stored in MC</li>
</ul></li>
</ul>
<h3>Leases for thundering herds</h3>
<p>One client sends an update to the DB, and that gets a popular key invalidated.
Now lots and lots of clients issue Get's to MC, but the key was deleted, which
would lead to lots of DB queries followed by lots of redundant caching of the result.</p>
<p>If memcache receives a Get for a key that's not present, it sets a lease
on that key and says "you're allowed to go ask the DB for this key, but please
finish doing so within 10 seconds." When subsequent Get's come in, they are told
"no such key, but another client is fetching it, so please wait for him instead of
querying the DB."</p>
<p>The lease is cancelled after 10s or when the owner sets the key.</p>
<p>Each cluster will generate a separate lease.</p>
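<p>A minimal server-side sketch of the lease mechanism in Go; the token scheme
and type names are assumptions, but the 10-second timeout follows the
description above. The same token check is what later defeats the stale-set
race (Race 1 in the notes below).</p>
<pre><code>package mc

import (
    "sync"
    "time"
)

type leaseInfo struct {
    token   int64
    granted time.Time
}

type MCServer struct {
    mu     sync.Mutex
    data   map[string]string
    leases map[string]leaseInfo
    next   int64
}

func NewMCServer() *MCServer {
    return &MCServer{
        data:   make(map[string]string),
        leases: make(map[string]leaseInfo),
    }
}

// Get returns (value, leaseToken, ok). A nonzero leaseToken means
// "key missing, you may fetch it from the DB"; a zero token with
// ok == false means "someone else is fetching it, retry shortly".
func (s *MCServer) Get(k string) (string, int64, bool) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if v, ok := s.data[k]; ok {
        return v, 0, true
    }
    if l, held := s.leases[k]; held && time.Since(l.granted) < 10*time.Second {
        return "", 0, false // lease outstanding: wait and retry
    }
    s.next++
    s.leases[k] = leaseInfo{token: s.next, granted: time.Now()}
    return "", s.next, false // caller may refresh from the DB
}

// SetWithLease installs a value only if the lease is still valid;
// a Delete in the meantime would have revoked it.
func (s *MCServer) SetWithLease(k, v string, token int64) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if l, ok := s.leases[k]; !ok || l.token != token {
        return false // lease revoked (e.g. by a concurrent delete)
    }
    delete(s.leases, k)
    s.data[k] = v
    return true
}

// Delete removes a key and revokes any outstanding lease, so a
// concurrent reader holding an old lease cannot install a stale value.
func (s *MCServer) Delete(k string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    delete(s.data, k)
    delete(s.leases, k)
}
</code></pre>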
<h1>6.824 notes</h1>
<pre><code>Scaling Memcache at Facebook, by Nishtala et al, NSDI 2013
why are we reading this paper?
it's an experience paper, not about new ideas/techniques
three ways to read it:
cautionary tale of problems from not taking consistency seriously
impressive story of super high capacity from mostly-off-the-shelf s/w
fundamental struggle between performance and consistency
we can argue with their design, but not their success
how do web sites scale up with growing load?
a typical story of evolution over time:
1. one machine, web server, application, DB
DB stores on disk, crash recovery, transactions, SQL
application queries DB, formats, HTML, &c
but the load grows, your PHP application takes too much CPU time
2. many web FEs, one shared DB
an easy change, since web server + app already separate from storage
FEs are stateless, all sharing (and concurrency control) via DB
but the load grows; add more FEs; soon single DB server is bottleneck
3. many web FEs, data sharded over cluster of DBs
partition data by key over the DBs
app looks at key (e.g. user), chooses the right DB
good DB parallelism if no data is super-popular
painful -- cross-shard transactions and queries probably don't work
hard to partition too finely
but DBs are slow, even for reads, why not cache read requests?
4. many web FEs, many caches for reads, many DBs for writes
cost-effective b/c read-heavy and memcached 10x faster than a DB
memcached just an in-memory hash table, very simple
complex b/c DB and memcacheds can get out of sync
(next bottleneck will be DB writes -- hard to solve)
the big facebook infrastructure picture
lots of users, friend lists, status, posts, likes, photos
fresh/consistent data apparently not critical
because humans are tolerant?
high load: billions of operations per second
that's 10,000x the throughput of one DB server
multiple data centers (at least west and east coast)
each data center -- "region":
"real" data sharded over MySQL DBs
memcached layer (mc)
web servers (clients of memcached)
each data center's DBs contain full replica
west coast is master, others are slaves via MySQL async log replication
how do FB apps use mc?
read:
v = get(k) (computes hash(k) to choose mc server)
if v is nil {
v = fetch from DB
set(k, v)
}
write:
v = new value
send k,v to DB
delete(k)
application determines relationship of mc to DB
mc doesn't know anything about DB
FB uses mc as a "look-aside" cache
real data is in the DB
cached value (if any) should be same as DB
what does FB store in mc?
paper does not say
maybe userID -> name; userID -> friend list; postID -> text; URL -> likes
basically copies of data from DB
paper lessons:
look-aside is much trickier than it looks -- consistency
paper is trying to integrate mutually-oblivious storage layers
cache is critical:
not really about reducing user-visible delay
mostly about surviving huge load!
cache misses and failures can create intolerable DB load
they can tolerate modest staleness: no freshness guarantee
stale data nevertheless a big headache
want to avoid unbounded staleness (e.g. missing a delete() entirely)
want read-your-own-writes
each performance fix brings a new source of staleness
huge "fan-out" => parallel fetch, in-cast congestion
let's talk about performance first
majority of paper is about avoiding stale data
but staleness only arose from performance design
performance comes from parallel get()s by many mc servers
driven by parallel processing of HTTP requests by many web servers
two basic parallel strategies for storage: partition vs replication
will partition or replication yield most mc throughput?
partition: server i, key k -> mc server hash(k)
replicate: server i, key k -> mc server hash(i)
partition is more memory efficient (one copy of each k/v)
partition works well if no key is very popular
partition forces each web server to talk to many mc servers (overhead)
replication works better if a few keys are very popular
performance and regions (Section 5)
Q: what is the point of regions -- multiple complete replicas?
lower RTT to users (east coast, west coast)
parallel reads of popular data due to replication
(note DB replicas help only read performance, no write performance)
maybe hot replica for main site failure?
Q: why not partition users over regions?
i.e. why not east-coast users' data in east-coast region, &c
social net -> not much locality
very different from e.g. e-mail
Q: why OK performance despite all writes forced to go to the master region?
writes would need to be sent to all regions anyway -- replicas
users probably wait for round-trip to update DB in master region
only 100ms, not so bad
users do not wait for all effects of writes to finish
i.e. for all stale cached values to be deleted
performance within a region (Section 4)
multiple mc clusters *within* each region
cluster == complete set of mc cache servers
i.e. a replica, at least of cached data
why multiple clusters per region?
why not add more and more mc servers to a single cluster?
1. adding mc servers to cluster doesn't help single popular keys
replicating (one copy per cluster) does help
2. more mcs in cluster -> each client req talks to more servers
and more in-cast congestion at requesting web servers
client requests fetch 20 to 500 keys! over many mc servers
MUST request them in parallel (otherwise total latency too large)
so all replies come back at the same time
network switches, NIC run out of buffers
3. hard to build network for single big cluster
uniform client/server access
so cross-section b/w must be large -- expensive
two clusters -> 1/2 the cross-section b/w
but -- replicating is a waste of RAM for less-popular items
"regional pool" shared by all clusters
unpopular objects (no need for many copies)
decided by *type* of object
frees RAM to replicate more popular objects
bringing up new mc cluster was a serious performance problem
new cluster has 0% hit rate
if clients use it, will generate big spike in DB load
if ordinarily 1% miss rate, and (let's say) 2 clusters,
adding "cold" third cluster will causes misses for 33% of ops.
i.e. 30x spike in DB load!
thus the clients of new cluster first get() from existing cluster (4.3)
and set() into new cluster
basically lazy copy of existing cluster to new cluster
better 2x load on existing cluster than 30x load on DB
important practical networking problems:
n^2 TCP connections is too much state
thus UDP for client get()s
UDP is not reliable or ordered
thus TCP for client set()s
and mcrouter to reduce n in n^2
small request per packet is not efficient (for TCP or UDP)
per-packet overhead (interrupt &c) is too high
thus mcrouter batches many requests into each packet
mc server failure?
can't have DB servers handle the misses -- too much load
can't shift load to one other mc server -- too much
can't re-partition all data -- time consuming
Gutter -- pool of idle servers, clients only use after mc server fails
The Question:
why don't clients send invalidates to Gutter servers?
my guess: would double delete() traffic
and send too many delete()s to small gutter pool
since any key might be in the gutter pool
thundering herd
one client updates DB and delete()s a key
lots of clients get() but miss
they all fetch from DB
they all set()
not good: needless DB load
mc gives just the first missing client a "lease"
lease = permission to refresh from DB
mc tells others "try get() again in a few milliseconds"
effect: only one client reads the DB and does set()
others re-try get() later and hopefully hit
let's talk about consistency now
the big truth
hard to get both consistency (== freshness) and performance
performance for reads = many copies
many copies = hard to keep them equal
what is their consistency goal?
*not* read sees latest write
since not guaranteed across clusters
more like "not more than a few seconds stale"
i.e. eventual
*and* writers see their own writes
read-your-own-writes is a big driving force
first, how are DB replicas kept consistent across regions?
one region is master
master DBs distribute log of updates to DBs in slave regions
slave DBs apply
slave DBs are complete replicas (not caches)
DB replication delay can be considerable (many seconds)
how do we feel about the consistency of the DB replication scheme?
good: eventual consistency, b/c single ordered write stream
bad: longish replication delay -> stale reads
how do they keep mc content consistent w/ DB content?
1. DBs send invalidates (delete()s) to all mc servers that might cache
+ Do they wait for ACK? I'm guessing no.
2. writing client also invalidates mc in local cluster
for read-your-writes
why did they have consistency problems in mc?
client code to copy DB to mc wasn't atomic:
1. writes: DB update ... mc delete()
2. read miss: DB read ... mc set()
so *concurrent* clients had races
what were the races and fixes?
Race 1: one client's cached get(k) replaces another client's updated k
k not in cache
C1: MC::get(k), misses
C1: v = read k from DB
C2: updates k in DB
C2: and DB calls MC::delete(k) -- k is not cached, so does nothing
C1: set(k, v)
now mc has stale data, delete(k) has already happened
will stay stale indefinitely, until key is next written
solved with leases -- C1 gets a lease, but C2's delete() invalidates lease,
so mc ignores C1's set
key still missing, so next reader will refresh it from DB
Race 2: updating(k) in cold cluster, but getting stale k from warm cluster
during cold cluster warm-up
remember clients try get() in warm cluster, copy to cold cluster
k starts with value v1
C1: updates k to v2 in DB
C1: delete(k) -- in cold cluster
C2: get(k), miss -- in cold cluster
C2: v1 = get(k) from warm cluster, hits
C2: set(k, v1) into cold cluster
now mc has stale v1, but delete() has already happened
will stay stale indefinitely, until key is next written
solved with two-second hold-off, just used on cold clusters
after C1 delete(), cold ignores set()s for two seconds
by then, delete() will propagate via DB to warm cluster
Race 3: writing to master region, but reading stale from local
k starts with value v1
C1: is in a slave region
C1: updates k=v2 in master DB
C1: delete(k) -- local region
C1: get(k), miss
C1: read local DB -- sees v1, not v2!
later, v2 arrives from master DB
solved by "remote mark"
C1 delete() marks key "remote"
get()/miss yields "remote"
tells C1 to read from *master* region
"remote" cleared when new data arrives from master region
Q: aren't all these problems caused by clients copying DB data to mc?
why not instead have DB send new values to mc, so clients only read mc?
then there would be no racing client updates &c, just ordered writes
A:
1. DB doesn't generally know how to compute values for mc
generally client app code computes them from DB results,
i.e. mc content is often not simply a literal DB record
2. would increase read-your-own writes delay
3. DB doesn't know what's cached, would end up sending lots
of values for keys that aren't cached
PNUTS does take this alternate approach of master-updates-all-copies
FB/mc lessons for storage system designers?
cache is vital to throughput survival, not just a latency tweak
need flexible tools for controlling partition vs replication
need better ideas for integrating storage layers with consistency
</code></pre>
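<p>One more sketch, of the "remote mark" fix for Race 3 above, in Go. The types
and method names are illustrative assumptions, not Facebook's; the point is
only the read-redirection logic.</p>
<pre><code>package region

// kv stands in for a database replica we can read from.
type kv interface {
    Read(k string) string
}

// regionCache is a slave region's memcache plus the set of keys
// recently written via the master ("remote marks").
type regionCache struct {
    data   map[string]string
    remote map[string]bool
}

// deleteAfterWrite runs in the writer's region after it updates the
// master DB: invalidate locally and mark the key remote, since the
// local DB replica may not have the new value yet.
func (c *regionCache) deleteAfterWrite(k string) {
    delete(c.data, k)
    c.remote[k] = true
}

// get redirects misses on marked keys to the master region's DB, so
// the writer reads its own write instead of the stale local copy.
func (c *regionCache) get(k string, localDB, masterDB kv) string {
    if v, ok := c.data[k]; ok {
        return v
    }
    var v string
    if c.remote[k] {
        v = masterDB.Read(k) // local replica may still hold the old value
    } else {
        v = localDB.Read(k)
    }
    c.data[k] = v
    return v
}

// replicationArrived is called when the master's update reaches the
// local DB replica; the mark can then be cleared.
func (c *regionCache) replicationArrived(k string) {
    delete(c.remote, k)
}
</code></pre>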