Paul Norman edited this page Nov 10, 2022 · 2 revisions

Runbooks for responding to an outage or degradation of service

Standard Tile Layer

Tile CDN node health check failures

Health check failures can be caused by Fastly issues, routing issues, or backend server issues. Routing issues are the most common cause.

Routing issue runbook

Preconditions: the shell variable FASTLY_API_TOKEN is set to a Fastly API key.
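The precondition can be checked before starting. The following is a minimal sketch assuming a POSIX-compatible shell; the function name require_fastly_token is hypothetical and not part of any existing tooling.

```shell
# Hypothetical helper: fail early if FASTLY_API_TOKEN is unset or empty.
require_fastly_token() {
  if [ -z "${FASTLY_API_TOKEN:-}" ]; then
    echo "FASTLY_API_TOKEN is not set" >&2
    return 1
  fi
}
```

Running require_fastly_token at the start of a session returns a non-zero status and prints a message on stderr if the token is missing.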

  1. Identify the server with health check failures
  2. Using the rendering dashboard and filtering it to the host with failures, verify that the server still has some traffic. If not, this is likely a backend failure.
  3. Using a Prometheus query like fastly_healthcheck_status{host="tile.openstreetmap.org", backend=~"nidhogg"}, identify the datacenter code for the Fastly POP with problems.
  4. Identify if the routing problem is production impacting. Sometimes the POP would not be using the render server that is failing the health check anyway, because there are closer ones. We want to fix the problem in either case, but it helps prioritize it.
  5. Find the POP IP by checking the healthcheck response with curl -s -H "Fastly-Key: ${FASTLY_API_TOKEN}" "https://api.fastly.com/content/edge_check?url=tile.openstreetmap.org/fastly/api/hc-status" | jq . and then search the output for the datacenter code (e.g. FRA) and find the x-cacheip header. Copy this IP.
  6. SSH to the render server and run mtr -w -z -c 100 <ip>. This will take a couple of minutes.
  7. Identify if the packet loss is coming from the first hops. If so, contact the NOC of the internet provider for the server. If it is coming later on, check Fastly Status for any errors related to the POP.
  8. If it is not a known issue, open a ticket with Fastly support. Open the ticket as "Contact Support" with a category of "performance", stating that there is packet loss between a Fastly POP and a render server and asking them to forward the information to NetOps. Use priority "Normal" for non-impacting problems and priority "High" for impacting ones. Include
     - the MTR results. If there are multiple nodes, include MTRs for all of them.
     - when the problems started, as established by Prometheus
     - whether it is intermittent
     - whether it is currently impacting production, or if another server is handling the load for that POP
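For step 7, a saved MTR report can be scanned for the first hop that shows partial loss. The following is a rough sketch, assuming the report was produced with mtr -w -z so that the AS number is the second column and Loss% is the fourth; the function name first_lossy_hop is hypothetical. Hops reporting 100% loss are skipped, since a router that answers no probes at all while later hops still respond is usually just not replying to probes rather than dropping traffic.

```shell
# Hypothetical helper: print the first hop with loss strictly between 0% and 100%.
# Reads an `mtr -w -z` report from the named file, or from stdin if no file is given.
first_lossy_hop() {
  awk '
    # Hop lines start with a hop number followed by a dot, e.g. "13."
    $1 ~ /^[0-9]+\.$/ {
      loss = $4
      sub(/%$/, "", loss)      # strip the trailing percent sign if present
      loss += 0                # force numeric comparison
      if (loss > 0 && loss < 100) {
        hop = $1
        sub(/\.$/, "", hop)
        print hop, $2, $3, loss "%"
        exit
      }
    }
  ' "$@"
}
```

Usage: mtr -w -z -c 100 <ip> | tee mtr.txt | first_lossy_hop. The output is the hop number, AS, host name, and loss percentage. Whether that hop counts as one of the "first hops" is still a judgment call based on which network it belongs to, and the full report should be read before contacting anyone.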

Sample message

We are having packet loss between the KUL datacenter and our origin server, nidhogg.openstreetmap.org. This is not immediately impacting our service as traffic from that datacenter is being routed to other origin servers by default, but does indicate a network problem. Can you please forward this information to NetOps?

An MTR from nidhogg is below

pnorman@nidhogg:~$ mtr -w -z -c 100 167.82.235.255
Start: 2022-11-10T07:42:07+0000
HOST: nidhogg.openstreetmap.org                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS15980  ftp2-umu-sunet.ftp.acc.umu.se               0.0%   100    1.5   1.4   0.2  12.7   2.4
  2. AS1653   umea-ume8-r1.sunet.se                       0.0%   100    1.0   0.8   0.3  18.3   2.1
  3. AS1653   sundsvall-sva-r1.sunet.se                   0.0%   100    5.5   3.8   3.4  16.3   1.5
  4. AS1653   gavle-sbo-r1.sunet.se                       0.0%   100    6.9   7.0   6.0  20.8   2.7
  5. AS1653   uppsala-upa-r1.sunet.se                     0.0%   100    8.6   8.1   7.3  36.9   3.5
  6. AS1653   stockholm-tug-r1.sunet.se                   0.0%   100   25.9  11.5   8.5  42.9   6.9
  7. AS2603   se-tug.nordu.net                            0.0%   100   10.3   9.3   8.6  22.1   1.7
  8. AS2603   dk-bal2.nordu.net                           0.0%   100   18.8  20.1  18.7  37.5   3.5
  9. AS2603   dk-esbj.nordu.net                           0.0%   100   23.2  24.4  23.0  60.8   4.8
 10. AS2603   nl-ams.nordu.net                            0.0%   100   47.2  31.2  29.1  47.2   3.9
 11. AS2603   uk-hex.nordu.net                            0.0%   100   34.4  35.1  33.8  65.1   3.7
 12. AS???    ???                                        100.0   100    0.0   0.0   0.0   0.0   0.0
 13. AS3491   TenGE0-2-0-3.br02.klp01.pccwbtn.net        18.0%   100  208.3 208.5 207.7 210.8   0.6
 14. AS3491   samsung.ser0-3-0-0.ar01.klp01.pccwbtn.net  27.0%   100  210.2 208.6 208.0 210.7   0.6
 15. AS54113  167.82.235.255                             21.0%   100  210.7 208.5 207.7 210.7   0.6
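A message like the sample above can be drafted from a few inputs. This is only a sketch: the function draft_fastly_ticket and its arguments are hypothetical, the wording for the impacting case is an assumption that is not taken from a real ticket, and the priority mapping follows step 8 of the runbook.

```shell
# Hypothetical helper: print a ticket draft modelled on the sample message.
# Usage: draft_fastly_ticket <POP code> <origin host> <impacting: yes|no> <mtr report file>
# The body runs in a subshell so its variables do not leak into the caller.
draft_fastly_ticket() (
  pop=$1 origin=$2 impacting=$3 report=$4
  if [ "$impacting" = "yes" ]; then
    priority="High"
    # Assumed wording for the impacting case; adjust to the actual situation.
    impact="This is currently impacting our service."
  else
    priority="Normal"
    impact="This is not immediately impacting our service as traffic from that datacenter is being routed to other origin servers by default, but does indicate a network problem."
  fi
  printf 'Priority: %s\n\n' "$priority"
  printf 'We are having packet loss between the %s datacenter and our origin server, %s. %s Can you please forward this information to NetOps?\n\n' "$pop" "$origin" "$impact"
  # Short host name, e.g. "nidhogg" from "nidhogg.openstreetmap.org"
  printf 'An MTR from %s is below\n\n' "${origin%%.*}"
  cat "$report"
)
```

For example, draft_fastly_ticket KUL nidhogg.openstreetmap.org no mtr.txt prints a draft with priority "Normal". When the problems started and whether they are intermittent still need to be added by hand before submitting.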

If there is any follow-up communication, do so through the Fastly website and add [email protected] as a CC.