Incident Runbooks
Paul Norman edited this page Nov 10, 2022
·
2 revisions
Runbooks for responding to an outage or degradation of service
Health check failures can be caused by Fastly issues, routing issues, or backend server issues. Routing issues are the most common cause.
Preconditions: Have the shell variable `FASTLY_API_TOKEN` set to a Fastly API key.
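The precondition can be checked before running any of the API calls. The following is a minimal sketch; the function name `require_fastly_token` is hypothetical and not part of any existing tooling.

```shell
# Hypothetical helper: fail early if the Fastly API token is missing,
# so later curl calls do not send an empty Fastly-Key header.
require_fastly_token() {
  if [ -z "${FASTLY_API_TOKEN:-}" ]; then
    echo "FASTLY_API_TOKEN is not set; export a Fastly API key first" >&2
    return 1
  fi
}
```

Calling `require_fastly_token` at the start of a session returns a non-zero status and prints a reminder when the variable is unset or empty.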
- Identify the server with health check failures
- Using the rendering dashboard and filtering it to the host with failures, verify that the server still has some traffic. If not, this is likely a backend failure.
- Using a Prometheus query like `fastly_healthcheck_status{host="tile.openstreetmap.org", backend=~"nidhogg"}`, identify the datacenter code for the Fastly POP with problems.
- Identify if the routing problem is production impacting. Sometimes the render server failing a health check would not be used by that POP because there are closer ones. We want to fix the problem in either case, but knowing the impact helps prioritize it.
- Find the POP IP by checking the healthcheck response with `curl -s -H "Fastly-Key: ${FASTLY_API_TOKEN}" "https://api.fastly.com/content/edge_check?url=tile.openstreetmap.org/fastly/api/hc-status" | jq .`. Search for the datacenter code (e.g. `FRA`) and find the `x-cacheip` header. Copy this IP.
- SSH to the render server and run `mtr -w -z -c 100 <ip>`. This will take a couple of minutes.
- Identify if the packet loss is coming from the first hops. If so, contact the NOC of the internet provider for the server. If it is coming later on, check Fastly Status for any errors related to the POP.
- If it is not a known issue, open a ticket with Fastly support. Open the ticket as "Contact Support" with a category of "performance", stating that there is packet loss between a Fastly POP and a render server and asking them to forward the information to NetOps. Include:
  - The MTR results. If there are multiple nodes, include MTRs for all of them.
  - When the problems started, as established by Prometheus.
  - Whether the problem is intermittent.
  - Whether it is currently impacting production, or whether another server is handling the load for that POP.
- Use priority "Normal" for non-impacting problems and priority "High" for impacting ones.
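When several POP nodes are affected, the MTR step has to be repeated for each IP. A minimal sketch of how that could be scripted on the render server is below. The function name `collect_mtrs` and the `###` header line are hypothetical; only the `mtr -w -z -c 100` invocation comes from the steps above.

```shell
# Hypothetical helper: run the runbook's mtr command against each POP IP
# and append every report to one file that can be pasted into the ticket.
# Usage: collect_mtrs <output-file> <ip> [<ip> ...]
collect_mtrs() {
  out_file=$1
  shift
  : > "$out_file"                    # start with an empty report file
  for ip in "$@"; do
    {
      # Header records the target, the origin host, and the UTC start time.
      echo "### mtr to ${ip} from $(uname -n) at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
      mtr -w -z -c 100 "$ip"         # same flags as the manual step
      echo
    } >> "$out_file"
  done
}
```

For example, `collect_mtrs /tmp/mtr-report.txt 167.82.235.255` writes one report. Each run of 100 cycles takes a couple of minutes, and the reports are collected one after another rather than in parallel.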
Sample message
We are having packet loss between the KUL datacenter and our origin server, nidhogg.openstreetmap.org. This is not immediately impacting our service as traffic from that datacenter is being routed to other origin servers by default, but does indicate a network problem. Can you please forward this information to NetOps?
An MTR from nidhogg is below:
pnorman@nidhogg:~$ mtr -w -z -c 100 167.82.235.255
Start: 2022-11-10T07:42:07+0000
HOST: nidhogg.openstreetmap.org Loss% Snt Last Avg Best Wrst StDev
1. AS15980 ftp2-umu-sunet.ftp.acc.umu.se 0.0% 100 1.5 1.4 0.2 12.7 2.4
2. AS1653 umea-ume8-r1.sunet.se 0.0% 100 1.0 0.8 0.3 18.3 2.1
3. AS1653 sundsvall-sva-r1.sunet.se 0.0% 100 5.5 3.8 3.4 16.3 1.5
4. AS1653 gavle-sbo-r1.sunet.se 0.0% 100 6.9 7.0 6.0 20.8 2.7
5. AS1653 uppsala-upa-r1.sunet.se 0.0% 100 8.6 8.1 7.3 36.9 3.5
6. AS1653 stockholm-tug-r1.sunet.se 0.0% 100 25.9 11.5 8.5 42.9 6.9
7. AS2603 se-tug.nordu.net 0.0% 100 10.3 9.3 8.6 22.1 1.7
8. AS2603 dk-bal2.nordu.net 0.0% 100 18.8 20.1 18.7 37.5 3.5
9. AS2603 dk-esbj.nordu.net 0.0% 100 23.2 24.4 23.0 60.8 4.8
10. AS2603 nl-ams.nordu.net 0.0% 100 47.2 31.2 29.1 47.2 3.9
11. AS2603 uk-hex.nordu.net 0.0% 100 34.4 35.1 33.8 65.1 3.7
12. AS??? ??? 100.0 100 0.0 0.0 0.0 0.0 0.0
13. AS3491 TenGE0-2-0-3.br02.klp01.pccwbtn.net 18.0% 100 208.3 208.5 207.7 210.8 0.6
14. AS3491 samsung.ser0-3-0-0.ar01.klp01.pccwbtn.net 27.0% 100 210.2 208.6 208.0 210.7 0.6
15. AS54113 167.82.235.255 21.0% 100 210.7 208.5 207.7 210.7 0.6
If there is any follow-up communication, do so through the Fastly website and add [email protected] as a CC.
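Deciding whether loss begins in the first hops or later on can be partly automated. The sketch below reads an `mtr -w` report on stdin and prints the hop where the run of lossy hops that reaches the destination begins. The function name `mtr_loss_onset` is hypothetical. It assumes the column layout shown in the sample report (Loss% is the seventh column from the end) and treats loss that does not carry through to the final hop as not significant, which is a common reading of MTR output rather than a rule from this runbook.

```shell
# Hypothetical helper: read an `mtr -w` report on stdin and report where
# the run of lossy hops that reaches the final hop begins.
mtr_loss_onset() {
  awk '
    $1 ~ /^[0-9]+\./ && NF >= 9 {
      hop = $1; sub(/\..*/, "", hop)         # "12." or "12.|--" -> "12"
      loss = $(NF - 6); sub(/%/, "", loss)   # Loss% column; "%" may be absent
      loss += 0
      host = $(NF - 7)
      if (loss > 0) {
        if (!in_run) { in_run = 1; first_hop = hop; first_host = host }
      } else {
        in_run = 0                           # loss did not carry forward
      }
      last_loss = loss
      seen = 1
    }
    END {
      if (!seen)       print "no hops parsed"
      else if (in_run) printf "loss starts at hop %d (%s); final hop loss %.1f%%\n", first_hop, first_host, last_loss
      else             print "no loss at final hop"
    }
  '
}
```

Fed the sample report above, this prints `loss starts at hop 12 (???); final hop loss 21.0%`, which places the problem well past the provider's first hops. The output is only a starting point: a hop that shows 100% loss may simply not answer probes, so the full report should still be read and attached to the ticket.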