- Incident Date
- March 30, 2018
- Incident Summary
- On March 27, 2018 at ~2:55pm, bib/item updates from Sierra stopped propagating to our Platform.
- It was discovered on March 30, 2018 at ~11:00am (almost 3 days later) that the primary Sierra API key used by our Platform services was accidently deleted.
- Author(s)
- Kevin Friedman
- Timeline
- On March 30, 2018 at ~11:00am, it was noticed that there were no bib/item updates reported on our dashboard (:+1: for dashboards).
- Upon checking CloudWatch, it was discovered that were virtually no updates posted since March 27, 2018 at ~2:55pm.
- After checking the CloudWatch logs from the bib/item pollers (:+1: for good logging), it was discovered that there was an authentication error retrieving an access token from the Sierra API.
- It was confirmed that the Sierra API key used by our Platform services was deleted.
- A new Sierra API key was issued and installed on appropriate services.
- Normal service resumed by March 30, 2018 at ~12:30pm.
- Root Causes
- The Sierra API key used by our Platform was accidentally deleted.
- Alarms were improperly configured.
- The metric filter was being generated from the wrong log group
web-1.error.log
. - It should have been generated from
web-1.log
.
- The metric filter was being generated from the wrong log group
- Resolution and Recovery
- A new Sierra API Key was issued and saved in Parameter Store.
- Alarms were re-configured based off the correct metric filter.
- Preventative Measures
- When used by services, Sierra API keys should be generated and documented as for services/applications and not people.
- Communicate that metric filters should be configured to watch the correct log group.
- Test Alarms as part of the Production Readiness process.
- Consider tracking usage of Sierra API keys (i.e. what services use a Sierra API key) so they can changed more easily.
- Tracking of Preventative Measures
- This post-mortem will shared with relevant staff and asked to take appropriate actions.
- A PR will be opened to the Monitoring & Alarms standard to discuss improvements.