<h1>6.824 2015 Lecture 4: "Flat Datacenter Storage" Case Study</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the 6.824
<a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from Spring 2015.</p>
<h2>Flat datacenter storage</h2>
<p><a href="papers/fds.pdf">Flat Datacenter Storage</a>, <em>Nightingale, Elson, Fan, Hofmann, Howell, Suzue</em>, OSDI 2012</p>
<h3>Why are we looking at this paper?</h3>
<ul>
<li>Lab 2 wants to be like this when it grows up
<ul>
<li>though details are all different</li>
</ul></li>
<li>fantastic performance -- world record cluster sort</li>
<li>good systems paper -- details from apps all the way to network</li>
</ul>
<h3>What is FDS?</h3>
<ul>
<li>a cluster storage system</li>
<li>stores giant blobs -- 128-bit ID, multi-megabyte content</li>
<li>clients and servers connected by network with high bisection bandwidth</li>
<li>for big-data processing (like MapReduce)
<ul>
<li>cluster of 1000s of computers processing data in parallel</li>
</ul></li>
</ul>
<h3>High-level design -- a common pattern</h3>
<ul>
<li>lots of clients</li>
<li>lots of storage servers ("tractservers")</li>
<li>lots of bandwidth between any two servers</li>
<li>data is stored in blobs
<ul>
<li>addressed by 128-bit IDs</li>
<li>further split into tracts
<ul>
<li>numbered from 0</li>
<li>8MB sized</li>
</ul></li>
</ul></li>
<li>partition the data</li>
<li>master ("metadata server") controls partitioning</li>
<li>replica groups for reliability</li>
<li>tract locator table (TLT) stores a bunch of entries
<ul>
<li>in a <code>k</code>-replicated system, each entry has <code>k</code> tractservers</li>
</ul></li>
<li>how to find where tract <code>t</code> of blob <code>b</code> is stored? (see the sketch after this list)
<ul>
<li>compute TLT entry as <code>(h(b) + t) mod len(tlt)</code>
<ul>
<li>and you'll get a list of servers in that entry</li>
</ul></li>
<li>blob metadata is <em>distributed</em> and NOT stored in the TLT</li>
</ul></li>
<li>how to write a tract from a blob?
<ul>
<li>look it up as described above</li>
<li>send write to all servers in TLT entry</li>
<li>only acknowledge write to client if <em>all</em> servers replied</li>
</ul></li>
<li>how to read a tract from a blob?
<ul>
<li>look it up as described above</li>
<li>send read to <em>a random</em> server in the TLT entry</li>
</ul></li>
</ul>
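<p>To make the lookup and read/write paths above concrete, here is a minimal Go sketch of the client side, assuming an in-memory TLT and stub RPCs; the type names, the choice of hash function, and the helper functions are illustrative, not FDS's actual API:</p>
<pre><code>// Client-side sketch: locate a tract's TLT entry, write to all replicas,
// read from one random replica. Types, hash choice, and RPC stubs are
// illustrative only.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"math/rand"
)

// Tractserver identifies one storage server.
type Tractserver struct{ Addr string }

// TLT: each entry lists the k tractservers replicating the tracts mapped to it.
type TLT [][]Tractserver

// tltIndex computes (hash(b) + t) mod len(tlt).
func tltIndex(tlt TLT, blob [16]byte, tract uint64) int {
	h := sha256.Sum256(blob[:]) // some fixed hash of the 128-bit blob ID
	return int((binary.BigEndian.Uint64(h[:8]) + tract) % uint64(len(tlt)))
}

// WriteTract sends the tract to every server in its TLT entry and succeeds
// only if all of them acknowledge.
func WriteTract(tlt TLT, blob [16]byte, tract uint64, data []byte) error {
	for _, ts := range tlt[tltIndex(tlt, blob, tract)] {
		if err := writeToServer(ts, blob, tract, data); err != nil {
			return err // don't ack to the application until every replica replied
		}
	}
	return nil
}

// ReadTract asks a single, randomly chosen replica.
func ReadTract(tlt TLT, blob [16]byte, tract uint64) ([]byte, error) {
	entry := tlt[tltIndex(tlt, blob, tract)]
	return readFromServer(entry[rand.Intn(len(entry))], blob, tract)
}

// Stubs standing in for the RPCs to tractservers.
func writeToServer(ts Tractserver, blob [16]byte, tract uint64, data []byte) error {
	return errors.New("not implemented")
}
func readFromServer(ts Tractserver, blob [16]byte, tract uint64) ([]byte, error) {
	return nil, errors.New("not implemented")
}

func main() {}
</code></pre>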
<h3>Why is this high-level design useful?</h3>
<ul>
<li>1000s of disks of space
<ul>
<li>store giant blobs, or many big blobs</li>
</ul></li>
<li>1000s of servers/disks/arms of parallel throughput</li>
<li>can expand over time -- reconfiguration</li>
<li>large pool of storage servers for instant replacement after failure</li>
</ul>
<h3>Motivating app: MapReduce-style sort</h3>
<ul>
<li>a mapper reads its split <code>1/M'th</code> of the input file (e.g., a tract)
<ul>
<li>map emits a <code><key, record></code> for each record in split</li>
<li>map partitions keys among <code>R</code> intermediate files (<code>M*R</code> intermediate files in total; see the sketch after this list)</li>
</ul></li>
<li>a reducer reads 1 of <code>R</code> intermediate files produced by each mapper
<ul>
<li>reads <code>M</code> intermediate files (of <code>1/R</code> size)</li>
<li>sorts its input</li>
<li>produces <code>1/R'th</code> of the final sorted output file (<code>R</code> blobs)</li>
</ul></li>
<li>FDS sort
<ul>
<li>FDS sort does not store the intermediate files in FDS</li>
<li>a client is both a mapper and reducer</li>
<li>FDS sort is not locality-aware
<ul>
<li>in MapReduce, the master schedules workers on machines that are close to the data (e.g., in the same cluster)</li>
<li>later versions of FDS sort use more fine-grained work assignment
<ul>
<li>e.g., a mapper doesn't get 1/N of the input file but something smaller</li>
<li>this deals better with stragglers</li>
</ul></li>
</ul></li>
</ul></li>
</ul>
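<p>On the partition step above: in a distributed sort the mapper typically range-partitions records by key so that reducer <code>r</code> receives the <code>r</code>'th key range and the concatenated reducer outputs are globally sorted. A tiny sketch, partitioning on just the first key byte purely for illustration:</p>
<pre><code>// Partition step of the sort: range-partition each record by key.
// Using only the first key byte is a simplification for illustration.
package main

import "fmt"

func bucket(firstKeyByte byte, R int) int {
	return int(firstKeyByte) * R / 256 // which of the R intermediate files
}

func main() {
	R := 4
	for _, b := range []byte{0x05, 0x7f, 0xf0} {
		fmt.Printf("key starting with 0x%02x -> intermediate file %d\n", b, bucket(b, R))
	}
}
</code></pre>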
<h3>The abstract's main claims are about performance.</h3>
<ul>
<li>They set the 2012 world record for disk-to-disk sorting (MinuteSort)
<ul>
<li>1,033 disks and 256 computers (136 tract servers, 120 clients)</li>
<li>1,401 Gbyte in 59.4s</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Does the abstract's 2 GByte/sec per client seem impressive?</p>
<ul>
<li>how fast can you read a file from Athena AFS? (about 10 MB/sec)</li>
<li>how fast can you read a typical hard drive?</li>
<li>how fast can typical networks move data?</li>
</ul>
<p><strong>Q:</strong> Abstract claims recover from lost disk (92 GB) in 6.2 seconds</p>
<ul>
<li>that's 15 GByte / sec</li>
<li>impressive?</li>
<li>how is that even possible? that's roughly 100x the bandwidth of a single disk!</li>
<li>who might care about this metric?</li>
</ul>
<h3>What should we want to know from the paper?</h3>
<ul>
<li>API?</li>
<li>layout?</li>
<li>finding data?</li>
<li>add a server?</li>
<li>replication?</li>
<li>failure handling?</li>
<li>failure model?</li>
<li>consistent reads/writes? (i.e. does a read see latest write?)
<ul>
<li>Not in FDS: "The current protocol for replication depends upon the
client to issue all writes to all replicas. This decision
means that FDS provides weak consistency guarantees
to clients. For example, if a client writes a tract to 1
of 3 replicas and then crashes, other clients reading
different replicas of that tract will observe differing state."</li>
<li>"Writes are not guaranteed to be committed
in order of issue. Applications with ordering requirements
are responsible for issuing operations after previous
acknowledgments have been received, rather than
concurrently. FDS guarantees atomicity: a write is either
committed or failed completely"</li>
</ul></li>
<li>config mgr failure handling?</li>
<li>good performance?</li>
<li>useful for apps?</li>
</ul>
<h4>API</h4>
<ul>
<li>Figure 1</li>
<li>128-bit blob IDs</li>
<li>blobs have a length</li>
<li>only whole-tract read and write -- 8 MB</li>
</ul>
<p><strong>Q:</strong> Why are 128-bit blob IDs a nice interface?
- Why not file names?</p>
<p><strong>Q:</strong> Why do 8 MB tracts make sense?
- (Figure 3...)</p>
<p><strong>Q:</strong> What kinds of client applications is the API aimed at?
- and not aimed at?</p>
<h4>Layout: how do they spread data over the servers?</h4>
<ul>
<li>Section 2.2</li>
<li>break each blob into 8 MB tracts</li>
<li>TLT maintained by metadata server
<ul>
<li>has <code>n</code> entries</li>
<li>for blob <code>b</code> and tract <code>t</code>, <code>i = (hash(b) + t) mod n</code></li>
<li><code>TLT[i]</code> contains list of tractservers w/ copy of the tract</li>
</ul></li>
<li>clients and servers all have copies of the latest TLT table</li>
</ul>
<h4>Example four-entry TLT with no replication:</h4>
<pre><code> 0: S1
1: S2
2: S3
3: S4
suppose hash(27) = 2
then the tracts of blob 27 are laid out:
S1: 2 6
S2: 3 7
S3: 0 4 8
S4: 1 5 ...
FDS is "striping" blobs over servers at tract granularity
</code></pre>
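<p>A tiny Go sketch that reproduces the striping table in the example above (the server names and <code>hash(27) = 2</code> are taken from the example, not computed):</p>
<pre><code>// Reproduces the striping above: 4-entry TLT, hash(27) = 2 as assumed in the example.
package main

import "fmt"

func main() {
	tlt := []string{"S1", "S2", "S3", "S4"}
	hashB := 2 // hash(27) = 2, per the example
	for t := 0; t <= 8; t++ {
		i := (hashB + t) % len(tlt)
		fmt.Printf("tract %d -> TLT entry %d -> %s\n", t, i, tlt[i])
	}
	// prints: tract 0 -> S3, tract 1 -> S4, tract 2 -> S1, tract 3 -> S2, ...
}
</code></pre>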
<p><strong>Q:</strong> Why have tracts at all? Why not store each blob on just one server?
- What kinds of apps will benefit from striping?
- What kinds of apps won't?</p>
<p><strong>Q:</strong> How fast will a client be able to read a single tract?</p>
<p><strong>Q:</strong> Where does the abstract's single-client 2 GB number come from?</p>
<p><strong>Q:</strong> Why not the UNIX i-node approach?</p>
<ul>
<li>store an array per blob, indexed by tract #, yielding tractserver</li>
<li>so you could make per-tract placement decisions
<ul>
<li>e.g. write new tract to most lightly loaded server</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Why not <code>hash(b + t)</code>?</p>
<p><strong>Q:</strong> How many TLT entries should there be?</p>
<ul>
<li>how about <code>n = number of tractservers</code>?</li>
<li>why do they claim this works badly? Section 2.2</li>
</ul>
<p>The system needs to choose server pairs (or triplets &c) to put in TLT entries</p>
<ul>
<li>For replication</li>
<li>Section 3.3</li>
</ul>
<p><strong>Q:</strong> How about:</p>
<pre><code> 0: S1 S2
1: S2 S1
2: S3 S4
3: S4 S3
...
</code></pre>
<ul>
<li>Why is this a bad idea?</li>
<li>How long will repair take?</li>
<li>What are the risks if two servers fail?</li>
</ul>
<p><strong>Q:</strong> why is the paper's <code>n^2</code> scheme better?</p>
<p>Example:</p>
<pre><code> 0: S1 S2
1: S1 S3
2: S1 S4
3: S2 S1
4: S2 S3
5: S2 S4
...
</code></pre>
<ul>
<li>TLT with <code>n^2</code> entries, with every server pair occurring once</li>
<li>How long will repair take?</li>
<li>What are the risks if two servers fail?</li>
</ul>
<p><strong>Q:</strong> Why do they actually use a minimum replication level of 3?</p>
<ul>
<li>Same <code>n^2</code> table as before, with the third server chosen at random (see the sketch below)</li>
<li>What effect on repair time?</li>
<li>What effect on two servers failing?</li>
<li>What if three disks fail?</li>
</ul>
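<p>For concreteness, a sketch of how such a table might be built: roughly <code>n^2</code> entries, one per ordered pair of distinct servers, plus a randomly chosen third replica. This is a guess at the construction based on the description above, not code from the paper:</p>
<pre><code>// Build a TLT with one entry per ordered pair of distinct tractservers
// (roughly n^2 entries) plus a randomly chosen third replica.
package main

import (
	"fmt"
	"math/rand"
)

func buildTLT(servers []string) [][]string {
	var tlt [][]string
	for _, a := range servers {
		for _, b := range servers {
			if a == b {
				continue
			}
			// pick a random third server different from the first two
			// (assumes at least 3 servers)
			c := a
			for c == a || c == b {
				c = servers[rand.Intn(len(servers))]
			}
			tlt = append(tlt, []string{a, b, c})
		}
	}
	return tlt
}

func main() {
	tlt := buildTLT([]string{"S1", "S2", "S3", "S4"})
	fmt.Println(len(tlt), "entries; first entry:", tlt[0])
}
</code></pre>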
<h4>Adding a tractserver</h4>
<ul>
<li>To increase the amount of disk space / parallel throughput</li>
<li>Metadata server picks some random TLT entries</li>
<li>Substitutes new server for an existing server in those TLT entries</li>
</ul>
<h4>Extending a blob's size</h4>
<ul>
<li>Newly created blobs have a length of 0 tracts</li>
<li>Applications must extend a blob before writing past the end of it. </li>
<li>The extend operation is atomic, is safe to execute concurrently with other
clients, and returns the new size of the blob as a result of the client’s call. </li>
<li>A separate API tells the client the blob's current size. </li>
<li>Extend operations for a blob are sent to the tractserver that owns that blob’s
metadata tract. </li>
<li>The tractserver serializes it, atomically updates the metadata, and returns
the new size to each caller. </li>
<li>If all writers follow this pattern, the extend operation provides a range of
tracts the caller may write without risk of conflict (see the sketch after this list).
Therefore, the extend API is functionally equivalent to the Google File System's
"atomic append."</li>
<li>Space is allocated lazily on tractservers, so tracts claimed but not used do
not waste storage.</li>
</ul>
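<p>Based on the description above, a sketch of how a client might append tracts using the extend call; <code>ExtendBlob</code> and <code>WriteTract</code> here are hypothetical stand-ins for the client-library calls, not the paper's exact API:</p>
<pre><code>// Append tracts by first extending the blob to reserve a conflict-free range.
// ExtendBlob and WriteTract are hypothetical stubs, not the paper's API.
package main

import "errors"

// ExtendBlob atomically grows the blob by n tracts and, per the description
// above, returns the blob's new size in tracts.
func ExtendBlob(blob [16]byte, n int) (newSize uint64, err error) {
	return 0, errors.New("stub: would go to the tractserver owning the metadata tract")
}

// WriteTract writes one 8 MB tract (stub).
func WriteTract(blob [16]byte, tract uint64, data []byte) error {
	return errors.New("stub: would write to all replicas in the TLT entry")
}

// AppendTracts reserves a range with extend, then writes into it; other
// clients extending concurrently get disjoint ranges, so there is no conflict.
func AppendTracts(blob [16]byte, tracts [][]byte) error {
	newSize, err := ExtendBlob(blob, len(tracts))
	if err != nil {
		return err
	}
	start := newSize - uint64(len(tracts)) // first tract reserved for this caller
	for i, data := range tracts {
		if err := WriteTract(blob, start+uint64(i), data); err != nil {
			return err
		}
	}
	return nil
}

func main() {}
</code></pre>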
<h4>How do they maintain the <code>n^2</code>-plus-one arrangement as servers leave/join?</h4>
<p>Unclear.</p>
<p><strong>Q:</strong> How long will adding a tractserver take?</p>
<p><strong>Q:</strong> What about client <code>write</code>'s while tracts are being transferred?</p>
<ul>
<li>receiving tractserver may have copies from client(s) and from old srvr</li>
<li>how does it know which is newest?</li>
</ul>
<p><strong>Q:</strong> What if a client reads/writes but has an old tract table?</p>
<ul>
<li>the tractservers tell the client that its TLT is stale (see the sketch below)</li>
</ul>
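<p>A hedged sketch of how stale clients could be detected: each request carries the TLT version the client used, and a tractserver rejects requests whose version doesn't match its own, forcing the client to refetch the TLT. This is an assumption about the mechanism (TLT version numbers are mentioned in the failure-handling section below); names are illustrative:</p>
<pre><code>// Stale-TLT detection sketch: requests carry the client's TLT version and a
// tractserver rejects mismatches, so the client refetches the TLT.
package main

import "fmt"

type TractReq struct {
	TLTVersion uint64
	Blob       [16]byte
	Tract      uint64
}

type Tractserver struct {
	tltVersion uint64 // version of the latest TLT this server has seen
}

// handle returns false (request rejected) when the client's TLT is stale.
func (s *Tractserver) handle(req TractReq) bool {
	if req.TLTVersion != s.tltVersion {
		return false // client must fetch a fresh TLT from the metadata server and retry
	}
	// ... serve the read/write ...
	return true
}

func main() {
	s := &Tractserver{tltVersion: 7}
	fmt.Println(s.handle(TractReq{TLTVersion: 6})) // false: stale client
	fmt.Println(s.handle(TractReq{TLTVersion: 7})) // true
}
</code></pre>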
<h4>Replication</h4>
<ul>
<li>A writing client sends a copy to each tractserver in the TLT.</li>
<li>A reading client asks one tractserver.</li>
</ul>
<p><strong>Q:</strong> Why don't they send writes through a primary?</p>
<ul>
<li>it puts a lot of work on the primary? it has to look up and know the TLT</li>
<li>the goal is not to have just one backup for a primary; it's to replicate and
stripe data effectively across many disks</li>
</ul>
<p><strong>Q:</strong> What problems are they likely to have because of lack of primary?</p>
<ul>
<li>Why weren't these problems show-stoppers?</li>
</ul>
<h4>What happens after a tractserver fails?</h4>
<ul>
<li>Metadata server stops getting heartbeat RPCs</li>
<li>Picks a random replacement for each TLT entry the failed server was in (see the sketch below)</li>
<li>New TLT gets a new version number</li>
<li>Replacement servers fetch copies</li>
</ul>
<p>Example of the tracts each server holds:</p>
<pre><code> S1: 0 4 8 ...
S2: 0 1 ...
S3: 4 3 ...
S4: 8 2 ...
</code></pre>
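<p>A sketch of the metadata server's failure handling described above: for every TLT entry that contains the failed server, substitute a randomly chosen live server and bump the TLT version (data structures are illustrative, not the paper's):</p>
<pre><code>// Metadata server reaction to a failed tractserver: replace it in every TLT
// entry with a random live server and bump the TLT version.
package main

import (
	"fmt"
	"math/rand"
)

type MetadataServer struct {
	tlt        [][]string // each entry: the tractservers replicating its tracts
	tltVersion uint64
	live       []string // servers still sending heartbeats
}

func (m *MetadataServer) handleFailure(failed string) {
	for _, entry := range m.tlt {
		for i, s := range entry {
			if s != failed {
				continue
			}
			// pick a random live replacement not already in this entry
			// (assumes some live server outside the entry exists)
			r := failed
			for r == failed || contains(entry, r) {
				r = m.live[rand.Intn(len(m.live))]
			}
			entry[i] = r // the replacement then fetches copies from the other replicas
		}
	}
	m.tltVersion++ // the new TLT gets a new version number
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

func main() {
	m := &MetadataServer{
		tlt:  [][]string{{"S1", "S2"}, {"S1", "S3"}, {"S2", "S3"}},
		live: []string{"S2", "S3", "S4"},
	}
	m.handleFailure("S1")
	fmt.Println(m.tlt, "version", m.tltVersion)
}
</code></pre>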
<p><strong>Q:</strong> Why not just pick one replacement server?</p>
<ul>
<li>it will have to take in a lot of writes for the lost data <code>=></code> bad perf.</li>
</ul>
<p><strong>Q:</strong> How long will it take to copy all the tracts?</p>
<p><strong>Q:</strong> If a tractserver's net breaks and is then repaired, might srvr serve old data?</p>
<p><strong>Q:</strong> If a server crashes and reboots with disk intact, can contents be used?</p>
<ul>
<li>e.g. if it only missed a few writes?</li>
<li>3.2.1's "partial failure recovery"</li>
<li>but won't it have already been replaced?</li>
<li>how to know what writes it missed?</li>
</ul>
<p><strong>Q:</strong> When is it better to use 3.2.1's partial failure recovery?</p>
<h4>What happens when the metadata server crashes?</h4>
<p><strong>Q:</strong> While metadata server is down, can the system proceed?</p>
<ul>
<li>yes, clients who have the TLT can go on</li>
</ul>
<p><strong>Q:</strong> Is there a backup metadata server?</p>
<ul>
<li>not in the paper, they said they might use Paxos for replication</li>
<li><strong>TODO:</strong> not clear why replicating the metadata server would lead to consistency
problems</li>
</ul>
<p><strong>Q:</strong> How does rebooted metadata server get a copy of the TLT?</p>
<ul>
<li>Eh, maybe it has it on disk?</li>
<li>Maybe it just simply reconstructs it from all the heartbeats?</li>
</ul>
<p><strong>Q:</strong> Does their scheme seem correct?</p>
<ul>
<li>how does the metadata server know it has heard from all tractservers?
<ul>
<li>it doesn't, it just adds servers as they send heartbeats</li>
</ul></li>
<li>how does it know all tractservers were up to date?
<ul>
<li><strong>TODO:</strong> Up to date with what?</li>
</ul></li>
</ul>
<h4>Random issues</h4>
<p><strong>Q:</strong> Is the metadata server likely to be a bottleneck?</p>
<ul>
<li>hard to tell. what's the use case?</li>
<li>if you have a client that has enough memory to remember the TLT,
then it only contacts the metadata server once and then
starts doing all of its reads/writes</li>
<li>if you have a lot of clients joining the system, or coming back but
having forgotten the TLT (maybe because of a lack of storage), then the metadata
server would be heavily used
<ul>
<li>however, this won't affect the bandwidth the clients get once they
have downloaded the TLT</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Why do they need the scrubber application mentioned in 2.3?</p>
<ul>
<li>why don't they delete the tracts when the blob is deleted?
<ul>
<li>faster to do GC, rather than scheduling & executing deletes?</li>
</ul></li>
<li>can a blob be written after it is deleted?
<ul>
<li><strong>TODO:</strong> not sure, seems like yes, because the metadata for that
blob is in tract -1 and I don't think <code>WriteTract</code> checks the metadata
before every write, so you could maybe have races</li>
</ul></li>
</ul>
<h4>Performance</h4>
<p><strong>Q:</strong> How do we know we're seeing "good" performance? What's the best you can expect?</p>
<ul>
<li>best you can expect is to take each disk's bandwidth and have the system's
bandwidth be <code># of disks * disk bandwidth</code></li>
</ul>
<p><strong>Q:</strong> Limiting resource for 2 GBps single-client?</p>
<ul>
<li>assuming this is end of 5.2</li>
<li>30 tractservers can deliver at most 30 * 130MB/s = 3.9GBps from their disks, more than the 2GBps observed</li>
<li>so the limiting resource is the client's network bandwidth</li>
</ul>
<p><strong>Q:</strong> Figure 4a: Why starts low? Why goes up? Why levels off? Why does it level off at that particular performance?</p>
<ul>
<li>starts low because single client bandwidth is limited</li>
<li>goes up b.c. as # of clients is increased each one adds more bandwidth to
the system</li>
<li>levels off because at some point the clients' aggregate bandwidth exceeds the servers' aggregate bandwidth</li>
<li>why levels off at 32 GBps for <code>x</code> clients w/ 516 disks?
<ul>
<li>Figure 3 suggests a 10,000 RPM disk can read 5MB chunks at around 130MB/s
<ul>
<li>writes are similar</li>
</ul></li>
<li>not clear from logarithmic scale graph what <code>x</code> is
<ul>
<li><code>10 < x < 50</code> (maybe <code>25 < x < 50</code>?)</li>
</ul></li>
<li><code>516 disks * 130MB/s = 67 GBps</code>, so seems like best case performance
should've leveled off at more than 32 GBps?
<ul>
<li>in reality not all disks are 130MB/s maybe? (only the 10,000 RPM SAS ones
were that fast)</li>
<li>in reality multiple disks on a single node might make that number smaller
maybe?</li>
</ul></li>
<li>anyway, something like <code>x=40</code> clients would have <code>40 * 10Gbps = 40 * 1.25GBps
= 50 GBps</code>, which is higher than the actual (claimed) aggregate server bandwidth
of 32 GBps</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Figure 4b shows random r/w as fast as sequential (Figure 4a). Is this what you'd expect?</p>
<ul>
<li>Yes. Random R/W requests for different tracts go to different servers, just
like sequential ones do <code>=></code> no difference</li>
</ul>
<p><strong>Q:</strong> Why are writes slower than reads with replication in Figure 4c?</p>
<ul>
<li>A write is sent to all tractservers and is not done until all of them reply.
<ul>
<li>w/ higher number of clients writing <code>=></code> more work done by each server</li>
<li>Paper says: "As expected, the write bandwidth is about one-third
of the read bandwidth since clients must send three
copies of each write"</li>
</ul></li>
<li>A read is sent to just one?</li>
</ul>
<p><strong>Q:</strong> Where does the 92 GB in 6.2 seconds come from?</p>
<ul>
<li>Table 1, 4th column</li>
<li>that's 15 GB / second, both read and written</li>
<li>1000 disks, triple replicated, 128 servers?</li>
<li>what's the limiting resource? disk? cpu? net?</li>
</ul>
<p>How big is each sort bucket?</p>
<ul>
<li>i.e. is the sort of each bucket in-memory?</li>
<li>1400 GB total</li>
<li>128 compute servers</li>
<li>between 12 and 96 GB of RAM each</li>
<li>hmm, say 50 on average, so total RAM may be 6400 GB</li>
<li>thus the sort of each bucket is done in memory and doesn't write intermediate passes to FDS</li>
<li>thus total time is just four transfers of 1400 GB
<ul>
<li>client limit: <code>128 * 2 GB/s = 256 GB / sec</code></li>
<li>disk limit: <code>1000 * 50 MB/s = 50 GB / sec</code></li>
</ul></li>
<li>thus bottleneck is likely to be disk throughput</li>
</ul>