A curated list of awesome Site Reliability and Production Engineering resources.
Please take a look at the contribution guidelines first. Contributions are always welcome!
- Culture
- Education
- Books
- Hiring
- Reliability
- Alerting
- Monitoring
- On-Call
- Post-Mortem
- Capacity Planning
- Presentations
- Articles
- Blogs
- Conferences
- What is Site Reliability Engineering?
- Keys To SRE
- Google SRE Resources
- Notes from Production Engineering
- PostOps: Recovery from Operations
- Love DevOps? Wait 'till you meet SRE
- How Google Does Planet-Scale Engineering for Planet-Scale Infra
- Site Reliability Engineering at Facebook
- A History of Site Reliability Engineering at Uber
- Case Study: Adopting SRE Principles at StackOverflow
- Site Reliability Engineering at Dropbox
- Site Reliability Engineers — Keeping Google up and running 24/7
- Site Reliability Engineering at Salesforce
- From Sys Admin to Netflix SRE
- SRE@Google: Thousands of DevOps Since 2004
- Panel: Educating SRE
- From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
- Site Reliability Engineering: How Google Runs Production Systems
- The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems
- Fail at Scale by Ben Maurer
- Embracing Failure: Fault-Injection and Service Reliability
- The Verification of a Distributed System by Caitie McCaffrey
- Add your favorite resources
- Performance Checklists for SREs
- Engineering Reliability into Web Sites: Google SRE
- From SysAdmin to Netflix SRE