Note: These lecture notes were slightly modified from the ones posted on the 6.824 course website from Spring 2015.
- Shattered old assumption: cannot assume that clocks are tightly synchronized
- tightly synchronized clocks are now feasible in a global-scale distributed system: GPS and atomic clocks as independent time sources
- Data model: immutable versioned data
- built and deployed system in multiple data centers
- Paxos helps you determine order of events. Why do we still need time?
- used synchronized time to allow local reads without locks
- transactions on top of replication
- two-phase commit across groups of replicas
- concurrency control
- strict two phase locking with timestamps
- Paxos
- long-lived leader (timed leases)
- pipelined (multiple proposals in flight)
- out-of-order commit, in-order apply
- team is chock-full of PhDs
- we write research papers when we feel the urge and we have something to say
- cutting edge development, unbelievable scale, but we are not researchers
- started development at end of 2003 (6 PhDs)
- first customer launched on Bigtable mid 2005
- distributed key-value store
- single-row transactions
- later added lazy replication
- value proposition
- scale to large numbers
- automatic resharding
- Bigtable was one of the progenitors of "NoSQL" or more precisely "of how do you store a lot of data without building a database"
- basic tenets at the time (design assumptions for Bigtable):
- who needs a database? key-value store suffices
- who needs SQL? unnecessary for most applications
- who needs transactions? two-phase commit is too expensive
- found that Bigtable is too hard to use
- users like the power that SQL databases give them
- engineers shouldn't have to code around
- the lack of transactions
- the bugs that manifest due to weak semantics provided by lazy replication
- programmer productivity matters
Megastore, started ca. 2006, built on top of Bigtable
- optimistic concurrency control
- paxos-based replication
- no long-lived leader (paxos "election" on every write)
- every paxos message was written to bigtable
- broader class of transactions than bigtable
- SQL-like schema and query languages
- had consistent replication
Dremel, data analysis at Google, started ca. 2008
- column-oriented storage and query engine
- http://research.google.com/pubs/pub36632.html
- popular because it allowed SQL
Percolator, general purpose transactions
- snapshot isolation: a transaction reads from a consistent snapshot and has
  one logical commit point (when you commit, all of its effects appear to
  have happened at that instant)
- built on top of Bigtable
- users demanded transactions, but we weren't ready to build that into bigtable
- we knew we needed
- a database
- SQL
- consistent replication across data centers
- general purpose transactions
- the rest was "merely engineering"
TrueTime came along... (story about how they found out about a guy in NY who was working on distributed clocks and they realized it could be useful for their concurrency control)
- spanner behaves like a single-machine database
- consistent replication: replicas all report the same state
- external consistency: the order of transactions matches the real-time order in which they committed
- nice semantics
Yes, and no:
- yes for the long-term: didn't know in 2003 what they knew in 2009, didn't have the people or the technology
- no, because lots of people use bigtable at Google
Imagine you are running a startup. What long-term issues can be postponed?
Startup dilemma:
- too much time spent on scalable storage => wasted effort => not done in time => fail
- too little time spent on scalable storage => when they get popular can't scale => fail
What do you have the skill/ability/will/vision to do?
- we could not have built Spanner 10 years ago, or even 5 years ago
- someone told them they should build transactions in, but they didn't do it because they couldn't at the time
Why has the Bigtable paper had arguably a bigger impact on both the research and technology communities?
- research vs. practice
Why do system-researchers insist on building scalable key-value stores (and not databases)?
Timing is everything. Except luck trumps timing.
You can't plan timing when the world is changing: design the best you can for the problems you have in front of you
TrueTime happened due to fortuitous confluence of events and people (i.e. luck). Same with Bigtable. Spanner's initial design (before 2008) was nowhere near what Google has now: they had anti-luck until the project was restarted in 2008.
Build what you need, and don't overdesign. Don't underdesign either, because you'll pay for it.
Sometimes ignorance really is bliss. Or maybe luck.
If you have blinders on, you can't overreach. If we had known we needed a distributed replicated database with external consistency in 2004, we would have failed.
Your userbase matters.
- bigtable was started when Google had < 2000 employees
  - limited # of products
  - not that many engineers
- spanner was started when Google had 10K employees
  - more products
  - many more engineers, many more junior engineers, many more acquired companies
- productivity of your employees matters
You can't buy luck. You can't plan for luck. But you can't ignore luck.
You can increase your chances to be lucky:
- have strong technical skills
- work on your design sense (find opportunities to learn!)
- build a strong network of colleagues and friends
- learn how to work on a team
- learn what you are good at, and what you are not good at
- be brutally honest with yourself
- be willing to ask for help
- admit when you are wrong
- people don't like working with people that constantly tell them they are wrong
Maybe disconnected access: Can we build apps that use DBs and can operate offline?
See the disconnected-operation work on the Coda file system.
Spanner: Google's Globally-Distributed Database, Corbett et al, OSDI 2012
Why this paper?
- modern, high performance, driven by real-world needs
- sophisticated use of paxos
- tackles consistency + performance (will be a big theme)
- Lab 4 is a (hugely) simplified version of Spanner
What are the big ideas?
- shard management w/ paxos replication
- high performance despite synchronous WAN replication
- fast reads by asking only the nearest replica
- consistency despite sharding (this is the real focus)
- clever use of time for consistency
- distributed transactions
This is a dense paper! I've tried to boil down some of the ideas to simpler form.
Idea: sharding
- we've seen this before in FDS
- the real problem is managing configuration changes
- Spanner has a more convincing design for this than FDS
Simplified sharding outline (lab 4):
- replica groups, paxos-replicated
- paxos log in each replica group
- master, paxos-replicated
- assigns shards to groups
- numbered configurations
- if master moves a shard, groups eventually see new config
"start handoff Num=7"
op in both groups' paxos logs- though perhaps not at the same time
dst
can't finish handoff until it has copies of shard data at majority- and can't wait long for possibly-dead minority
- minority must catch up, so perhaps put shard data in paxos log (!)
"end handoff Num=7"
op in both groups' logs
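Aside: a minimal sketch in Go (the labs' language) of how a replica group might apply the two handoff ops from its paxos log. The op and field names (StartHandoff, EndHandoff, Dst, ...) are invented for illustration; they are not Lab 4's actual interface.

  package main

  import "fmt"

  // Hypothetical paxos-log entries for a shard handoff; names are
  // illustrative, not Lab 4's real ops.
  type StartHandoff struct {
      Num   int // new configuration number, e.g. 7
      Shard int // which shard is moving
      Dst   int // group that will own the shard afterwards
  }

  type EndHandoff struct {
      Num   int
      Shard int
      Dst   int
      // shard data travels in the log so a lagging minority can catch up
      // just by replaying the log
      Data map[string]string
  }

  // One replica group's state; every replica applies decided ops in log order.
  type Group struct {
      gid       int
      configNum int
      serving   map[int]bool              // shards this group currently serves
      kv        map[int]map[string]string // per-shard key/value data
  }

  func (g *Group) apply(op interface{}) {
      switch o := op.(type) {
      case StartHandoff:
          // current owner stops serving the shard (no-op at the destination)
          g.serving[o.Shard] = false
          g.configNum = o.Num
      case EndHandoff:
          if o.Dst == g.gid {
              g.kv[o.Shard] = o.Data // destination installs data, starts serving
              g.serving[o.Shard] = true
          } else {
              delete(g.kv, o.Shard) // source can discard its copy
          }
          g.configNum = o.Num
      }
  }

  func main() {
      dst := &Group{gid: 2, serving: map[int]bool{}, kv: map[int]map[string]string{}}
      dst.apply(StartHandoff{Num: 7, Shard: 3, Dst: 2})
      dst.apply(EndHandoff{Num: 7, Shard: 3, Dst: 2, Data: map[string]string{"a": "1"}})
      fmt.Println(dst.configNum, dst.serving[3], dst.kv[3]["a"]) // 7 true 1
  }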
Q: What if a Put is concurrent w/ handoff?
- client sees new config, sends Put to new group before handoff starts?
- client has stale view and sends it to old group after handoff?
- arrives at either during handoff?
Q: What if a failure during handoff?
- e.g. old group thinks shard is handed off
- but new group fails before it thinks so
Q: Can two groups think they are serving a shard?
Q: Could old group still serve shard if can't hear master?
Idea: wide-area synchronous replication
- Goal: survive single-site disasters
- Goal: replica near customers
- Goal: don't lose any updates
Considered impractical until a few years ago
- paxos too expensive, so maybe primary/backup?
- if primary waits for ACK from backup
- 50ms network will limit throughput and cause palpable delay
- esp if app has to do multiple reads at 50ms each
- if primary does not wait, it will reply to client before durable
- danger of split brain; can't make network reliable
What's changed?
- other site may be only 5 ms away -- San Francisco / Los Angeles
- faster/cheaper WAN
- apps written to tolerate delays
- may make many slow read requests
- but issue them in parallel
- maybe time out quickly and try elsewhere, or redundant gets (see the sketch after this list)
- huge # of concurrent clients lets you get hi thruput despite high delay
- run their requests in parallel
- people appreciate paxos more and have streamlined variants
- fewer msgs
- page 9 of paxos paper: 1 round per op w/ leader + bulk preprepare
- paper's scheme a little more involved b/c they must ensure there's at most one leader
- read at any replica
- fewer msgs
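A sketch of the "issue slow reads in parallel / redundant gets" idea from the list above. get() and the replica names are placeholders for some read RPC, not a real client API; the point is the first-answer-wins pattern.

  package main

  import (
      "fmt"
      "time"
  )

  // get is a stand-in for a read RPC to one replica (hypothetical).
  func get(replica, key string) string {
      time.Sleep(50 * time.Millisecond) // pretend a 50ms WAN round trip
      return "value-of-" + key + "@" + replica
  }

  // redundantGet sends the same read to every replica in parallel and
  // returns whichever answer arrives first.
  func redundantGet(replicas []string, key string) string {
      ch := make(chan string, len(replicas))
      for _, r := range replicas {
          go func(r string) { ch <- get(r, key) }(r)
      }
      return <-ch
  }

  func main() {
      replicas := []string{"sf", "la", "ny"}
      keys := []string{"a", "b", "c"}
      // many slow reads, but issued concurrently: total latency ~ one round trip
      results := make(chan string, len(keys))
      for _, k := range keys {
          go func(k string) { results <- redundantGet(replicas, k) }(k)
      }
      for range keys {
          fmt.Println(<-results)
      }
  }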
Actual performance?
- Table 3
- pretend just measuring paxos for writes, read at any replica for reads
latency
- why doesn't write latency go up w/ more replicas?
- why does std dev of latency go down w/ more replicas?
- r/o a lot faster since not a paxos agreement + use closest replica
throughput
- why does read throughput go up w/ # replicas?
- why doesn't write throughput go up?
- does write thruput seem to be going down?
- what can we conclude from Table 3?
- is the system fast? slow?
- how fast do your paxoses run?
- mine takes 10 ms per agreement
- with purely local communication and no disk
- Spanner paxos might wait for disk write
- Figure 5
  - npaxos=5, all leaders in same zone
  - why does killing a non-leader in each group have no effect?
  - for killing all the leaders ("leader-hard"):
    - why flat for a few seconds?
    - what causes it to start going up?
    - why does it take 5 to 10 seconds to recover?
    - why is slope higher until it rejoins?
Spanner reads from any paxos replica
- read does not involve a paxos agreement
- just reads the data directly from replica's k/v DB
- maybe 100x faster -- same room rather than cross-country
Q: Could we write to just one replica?
Q: Is reading from any replica correct?
Example of problem:
- photo sharing site
- i have photos
- i have an ACL (access control list) saying who can see my photos
- i take my mom out of my ACL, then upload new photo
- really it's web front ends doing these client reads/writes
Order of events:
- W1: I write ACL on group G1 (bare majority), then
- W2: I add image on G2 (bare majority), then
- mom reads image -- may get old data from lagging G2 replica
- mom reads ACL -- may get new data from G1
This system is not acting like a single server!
- there was not really any point at which the image was present but the ACL hadn't been updated
This problem is caused by a combination of
- partitioning -- replica groups operate independently
- cutting corners for performance -- read from any replica
How can we fix this?
- Make reads see latest data
- e.g. full paxos for reads -- expensive!
- Make reads see consistent data
- data as it existed at some previous point in time
- i.e. before W1, between W1 and W2, or after W2
- this turns out to be much cheaper
- spanner does this
Here's a super-simplification of spanner's consistency story for r/o clients
- "snapshot" or "lock-free" reads
- assume for now that all the clocks agree
- server (paxos leader) tags each write with the time at which it occurred
- k/v DB stores multiple values for each key,
- each with a different time
- reading client picks a time t
- for each read
  - ask relevant replica to do the read at time t
- how does a replica read a key at time t?
  - return the stored value with highest time <= t
  - but wait, the replica may be behind
    - that is, there may be a write at time < t, but replica hasn't seen it
    - so replica must somehow be sure it has seen all writes <= t
  - idea: has it seen any operation from time > t?
    - if yes, and paxos group always agrees on ops in time order,
      it's enough to check/wait for an op with time > t
    - that is what spanner does on reads (4.1.3)
- what time should a reading client pick?
- using current time may force lagging replicas to wait
- so perhaps a little in the past
- client may miss latest updates
- but at least it will see consistent snapshot
- in our example, won't see new image w/o also seeing ACL update
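A toy Go version of this snapshot-read rule: a multi-version store returns the value with the highest time <= t, but only answers once it has applied some op with time > t (so it cannot be missing an earlier write). Names and structure are illustrative, not Spanner's.

  package main

  import "fmt"

  type version struct {
      t   int64 // timestamp assigned by the paxos leader
      val string
  }

  // One replica's multi-version k/v store. Writes are applied in timestamp
  // (paxos log) order, so maxApplied only moves forward.
  type Replica struct {
      data       map[string][]version
      maxApplied int64 // highest write timestamp this replica has applied
  }

  func (r *Replica) applyWrite(key, val string, t int64) {
      r.data[key] = append(r.data[key], version{t, val})
      r.maxApplied = t
  }

  // readAt returns key's value as of time t; ok=false means the replica is
  // lagging (it hasn't seen any op with time > t, so a write <= t could
  // still be missing) and a real replica would wait before answering.
  func (r *Replica) readAt(key string, t int64) (val string, ok bool) {
      if r.maxApplied <= t {
          return "", false
      }
      var best version
      for _, v := range r.data[key] {
          if v.t <= t && v.t >= best.t {
              best = v
          }
      }
      return best.val, true
  }

  func main() {
      r := &Replica{data: map[string][]version{}}
      r.applyWrite("acl", "just me", 10)   // W1
      r.applyWrite("photo", "new.jpg", 15) // W2
      fmt.Println(r.readAt("acl", 14))     // "just me" true: the t=15 write proves nothing <= 14 is missing
      fmt.Println(r.readAt("photo", 14))   // "" true: the image (t=15) is not visible at t=14
  }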
How does that fix our ACL/image example?
- W1: I write ACL, G1 assigns it time=10, then
- W2: I add image, G2 assigns it time=15 (> 10 since clocks agree)
- mom picks a time, for example t=14
- mom reads ACL t=14 from lagging G1 replica
- if it hasn't seen paxos agreements up through t=14, it knows to wait, so it will return W1's new ACL
- mom reads image from G2 at t=14
- image may have been written on that replica
- but it will know to not return it since image's time is 15
- other choices of t work too.
Q: Is it reasonable to assume that different computers' clocks agree?
- Why might they not agree?
Q: What may go wrong if servers' clocks don't agree?
A performance problem: reading client may pick time in the future, forcing reading replicas to wait to "catch up"
A correctness problem:
- again, the ACL/image example
- G1 and G2 disagree about what time it is
Sequence of events:
- W1: I write ACL on G1 -- stamped with time=15
- W2: I add image on G2 -- stamped with time=10
Now a client read at t=14 will see image but not ACL update
Q: Why doesn't spanner just ensure that the clocks are all correct?
- after all, it has all those master GPS / atomic clocks
- there is an actual "absolute" time t_abs
- but server clocks are typically off by some unknown amount
- TrueTime can bound the error
- so now() yields an interval: [earliest, latest]
  - earliest and latest are ordinary scalar times
    - perhaps microseconds since Jan 1 1970
  - t_abs is highly likely to be between earliest and latest
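A sketch of what a TrueTime-style now() could look like. The error bound eps is an assumed constant purely for illustration; real TrueTime computes the bound from the GPS/atomic-clock masters and measured clock drift.

  package main

  import (
      "fmt"
      "time"
  )

  // Interval brackets absolute time: with high probability,
  // Earliest <= t_abs <= Latest.
  type Interval struct {
      Earliest, Latest time.Time
  }

  // eps is an assumed uncertainty bound; real TrueTime derives it, it is
  // not a constant.
  const eps = 5 * time.Millisecond

  func now() Interval {
      t := time.Now() // local clock, assumed off from t_abs by at most eps
      return Interval{Earliest: t.Add(-eps), Latest: t.Add(eps)}
  }

  func main() {
      tt := now()
      fmt.Println("earliest:", tt.Earliest, "latest:", tt.Latest)
  }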
Q: How does TrueTime choose the interval?
Q: Why are GPS time receivers able to avoid this problem?
- Do they actually avoid it?
- What about the "atomic clocks"?
Spanner assigns each write a scalar time
- might not be the actual absolute time
- but is chosen to ensure consistency
The danger:
- W1 at G1, G1's interval is [20,30]
- is any time in that interval OK?
- then W2 at G2, G2's interval is [11,21]
- is any time in that interval OK?
- if they are not careful, might get s1=25 s2=15
So what we want is:
- if W2 starts after W1 finishes, then s2 > s1
- simplified "external consistency invariant" from 4.1.2
- causes snapshot reads to see data consistent w/ true order of W1, W2
How does spanner assign times to writes?
- (again, this is much simplified, see 4.1.2)
- a write request arrives at paxos leader
  - s will be the write's time-stamp
  - leader sets s to TrueTime now().latest
    - this is "Start" in 4.1.2
  - then leader delays until s < now().earliest
    - i.e. until s is guaranteed to be in the past (compared to absolute time)
    - this is "commit wait" in 4.1.2
  - then leader runs paxos to cause the write to happen
  - then leader replies to client
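A toy Go version of the Start / commit-wait sequence above, using the same kind of made-up constant error bound as the TrueTime sketch earlier (the paxos step is just a comment; this is not Spanner's real code).

  package main

  import (
      "fmt"
      "time"
  )

  // Made-up TrueTime stand-in: a constant error bound instead of the real,
  // dynamically computed one.
  const eps = 5 * time.Millisecond

  type interval struct{ earliest, latest time.Time }

  func ttNow() interval {
      t := time.Now()
      return interval{t.Add(-eps), t.Add(eps)}
  }

  // write sketches how a paxos leader might timestamp one write (much
  // simplified from section 4.1.2).
  func write(key, val string) time.Time {
      // "Start": pick s at the top of the current uncertainty interval, so s is
      // >= the timestamp of any write that completed before this one started.
      s := ttNow().latest

      // "Commit wait": delay until s < now().earliest, i.e. until s is
      // guaranteed to be in the past compared to absolute time.
      for !ttNow().earliest.After(s) {
          time.Sleep(time.Millisecond)
      }

      // ... then run paxos to replicate the write with timestamp s ...
      // ... then reply to the client ...
      return s
  }

  func main() {
      s := write("photo", "new.jpg")
      fmt.Println("write committed with timestamp", s)
  }

Note the wait lasts roughly the width of the uncertainty interval (about 2*eps here), which previews why the Spanner authors care about keeping TrueTime's uncertainty small.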
Does this work for our example?
- W1 at G1, TrueTime says [20,30]
  - s1 = 30
  - commit wait until TrueTime says [31,41]
  - reply to client
- W2 at G2, TrueTime must now say >= [21,31]
  - (otherwise TrueTime is broken)
  - s2 = 31
  - commit wait until TrueTime says [32,43]
  - reply to client
- it does work for this example:
  - the client observed that W1 finished before W2 started,
    and indeed s2 > s1
  - even though G2's TrueTime clock was slow by the most it could be
  - so if my mom sees W2, she is guaranteed to also see W1
Why the "Start" rule?
- i.e. why choose the time at the end of the TrueTime interval?
- previous writers waited only until their timestamps were barely < t_abs
- new writer must choose s greater than any completed write's timestamp
  - since every completed write's timestamp is < t_abs, it suffices to pick s >= t_abs
  - t_abs might be as high as now().latest
  - so s = now().latest
Why the "Commit Wait" rule?
- ensures that s < t_abs before the write completes
- otherwise write might complete with an s in the future
  - and would let Start rule give too low an s to a subsequent write
Q: Why commit wait; why not immediately write value with chosen time?
- indirectly forces subsequent write to have high enough s
- the system has no other way to communicate minimum acceptable next s for writes in different replica groups
- waiting forces writes that some external agent is serializing to have monotonically increasing timestamps
- w/o wait, our example goes back to s1=30 s2=21
- you could imagine explicit schemes to communicate last write's TS to the next write
Q: How long is the commit wait?
This answers today's question: a large TrueTime uncertainty requires a long commit wait (roughly the width of the uncertainty interval), so the Spanner authors are interested in accurate, low-uncertainty time.
Let's step back
- why did we get into all this timestamp stuff?
- our replicas were 100s or 1000s of miles apart (for locality/fault tol)
- we wanted fast reads from a local replica (no full paxos)
- our data was partitioned over many replica groups w/ separate clocks
- we wanted consistency for reads:
- if W1 then W2, a read must not see W2 without also seeing W1
- it's complex but it makes sense as a high-performance evolution of Lab 3 / Lab 4
Why is this timestamp technique interesting?
- we want to enforce order -- things that happened in some order in real time are ordered the same way by the distributed system -- "external consistency"
- the naive approach requires a central agent, or lots of communication
- Spanner does the synchronization implicitly via time
- time can be a form of communication
- e.g. we agree in advance to meet for dinner at 6:00pm
There's a lot of additional complexity in the paper
- transactions, two phase commit, two phase locking,
- schema change, query language, &c
- some of this we'll see more of later
- in particular, the problem of ordering events in a distributed system will come up a lot, soon