From 1e252f1febf605f001a4f1dbb2b2efa98bd5f9ef Mon Sep 17 00:00:00 2001 From: Anthony Romano Date: Thu, 2 Mar 2017 13:49:52 -0800 Subject: [PATCH] Documentation: suggest ionice for disk tuning Also cleaned up tuning.md newlines to conform with style. --- Documentation/tuning.md | 59 +++++++++++++++++------------------------ 1 file changed, 24 insertions(+), 35 deletions(-) diff --git a/Documentation/tuning.md b/Documentation/tuning.md index f1cdc38cc31..087d39d18f8 100644 --- a/Documentation/tuning.md +++ b/Documentation/tuning.md @@ -6,33 +6,16 @@ The network isn't the only source of latency. Each request and response may be i ## Time parameters -The underlying distributed consensus protocol relies on two separate time parameters to ensure that nodes can handoff leadership if one stalls or goes offline. -The first parameter is called the *Heartbeat Interval*. -This is the frequency with which the leader will notify followers that it is still the leader. -For best practices, the parameter should be set around round-trip time between members. -By default, etcd uses a `100ms` heartbeat interval. - -The second parameter is the *Election Timeout*. -This timeout is how long a follower node will go without hearing a heartbeat before attempting to become leader itself. -By default, etcd uses a `1000ms` election timeout. - -Adjusting these values is a trade off. -The value of heartbeat interval is recommended to be around the maximum of average round-trip time (RTT) between members, normally around 0.5-1.5x the round-trip time. -If heartbeat interval is too low, etcd will send unnecessary messages that increase the usage of CPU and network resources. -On the other side, a too high heartbeat interval leads to high election timeout. Higher election timeout takes longer time to detect a leader failure. -The easiest way to measure round-trip time (RTT) is to use [PING utility][ping]. - -The election timeout should be set based on the heartbeat interval and average round-trip time between members. -Election timeouts must be at least 10 times the round-trip time so it can account for variance in the network. -For example, if the round-trip time between members is 10ms then the election timeout should be at least 100ms. - -The election timeout should be set to at least 5 to 10 times the heartbeat interval to account for variance in leader replication. -For a heartbeat interval of 50ms, set the election timeout to at least 250ms - 500ms. - -The upper limit of election timeout is 50000ms (50s), which should only be used when deploying a globally-distributed etcd cluster. -A reasonable round-trip time for the continental United States is 130ms, and the time between US and Japan is around 350-400ms. -If the network has uneven performance or regular packet delays/loss then it is possible that a couple of retries may be necessary to successfully send a packet. So 5s is a safe upper limit of global round-trip time. -As the election timeout should be an order of magnitude bigger than broadcast time, in the case of ~5s for a globally distributed cluster, then 50 seconds becomes a reasonable maximum. +The underlying distributed consensus protocol relies on two separate time parameters to ensure that nodes can handoff leadership if one stalls or goes offline. The first parameter is called the *Heartbeat Interval*. This is the frequency with which the leader will notify followers that it is still the leader. +For best practices, the parameter should be set around round-trip time between members. By default, etcd uses a `100ms` heartbeat interval. + +The second parameter is the *Election Timeout*. This timeout is how long a follower node will go without hearing a heartbeat before attempting to become leader itself. By default, etcd uses a `1000ms` election timeout. + +Adjusting these values is a trade off. The value of heartbeat interval is recommended to be around the maximum of average round-trip time (RTT) between members, normally around 0.5-1.5x the round-trip time. If heartbeat interval is too low, etcd will send unnecessary messages that increase the usage of CPU and network resources. On the other side, a too high heartbeat interval leads to high election timeout. Higher election timeout takes longer time to detect a leader failure. The easiest way to measure round-trip time (RTT) is to use [PING utility][ping]. + +The election timeout should be set based on the heartbeat interval and average round-trip time between members. Election timeouts must be at least 10 times the round-trip time so it can account for variance in the network. For example, if the round-trip time between members is 10ms then the election timeout should be at least 100ms. + +The upper limit of election timeout is 50000ms (50s), which should only be used when deploying a globally-distributed etcd cluster. A reasonable round-trip time for the continental United States is 130ms, and the time between US and Japan is around 350-400ms. If the network has uneven performance or regular packet delays/loss then it is possible that a couple of retries may be necessary to successfully send a packet. So 5s is a safe upper limit of global round-trip time. As the election timeout should be an order of magnitude bigger than broadcast time, in the case of ~5s for a globally distributed cluster, then 50 seconds becomes a reasonable maximum. The heartbeat interval and election timeout value should be the same for all members in one cluster. Setting different values for etcd members may disrupt cluster stability. @@ -50,18 +33,13 @@ The values are specified in milliseconds. ## Snapshots -etcd appends all key changes to a log file. -This log grows forever and is a complete linear history of every change made to the keys. -A complete history works well for lightly used clusters but clusters that are heavily used would carry around a large log. +etcd appends all key changes to a log file. This log grows forever and is a complete linear history of every change made to the keys. A complete history works well for lightly used clusters but clusters that are heavily used would carry around a large log. -To avoid having a huge log etcd makes periodic snapshots. -These snapshots provide a way for etcd to compact the log by saving the current state of the system and removing old logs. +To avoid having a huge log etcd makes periodic snapshots. These snapshots provide a way for etcd to compact the log by saving the current state of the system and removing old logs. ### Snapshot tuning -Creating snapshots can be expensive so they're only created after a given number of changes to etcd. -By default, snapshots will be made after every 10,000 changes. -If etcd's memory usage and disk usage are too high, try lowering the snapshot threshold by setting the following on the command line: +Creating snapshots with the V2 backend can be expensive, so snapshots are only created after a given number of changes to etcd. By default, snapshots will be made after every 10,000 changes. If etcd's memory usage and disk usage are too high, try lowering the snapshot threshold by setting the following on the command line: ```sh # Command line arguments: @@ -71,6 +49,17 @@ $ etcd --snapshot-count=5000 $ ETCD_SNAPSHOT_COUNT=5000 etcd ``` +## Disk + +An etcd cluster is very sensitive to disk latencies. Since etcd must persist proposals to its log, disk activity from other processes may cause long `fsync` latencies. The upshot is etcd may miss heartbeats, causing request timeouts and temporary leader loss. An etcd server can sometimes stably run alongside these processes when given a high disk priority. + +On Linux, etcd's disk priority can be configured with `ionice`: + +```sh +# best effort, highest priority +$ sudo ionice -c2 -n0 -p `pgrep etcd` +``` + ## Network If the etcd leader serves a large number of concurrent client requests, it may delay processing follower peer requests due to network congestion. This manifests as send buffer error messages on the follower nodes: