Operations manages how code is deployed, configured, and monitored, as well as the availability, latency, change management, emergency response, and capacity management of services in production.
Clone this repo and document your specific choice here:
``
Content
See:
- SRE Practices Without SREs
- Do you have an SRE team yet? How to start and assess your journey
- The Site Reliability Workbook: Practical Ways to Implement SRE
- Site Reliability Engineering: How Google Runs Production Systems
- Use SRE as a basis
- Define reasonable SLOs (service level objectives)
- Have a monitoring strategy implemented and measure the SLIs (service level indicators)
- Have a rollback strategy implemented
- Have an incident management procedure in place
- Create a culture of authoring blameless postmortems
- As a start, have regular development teams assign SRE engineers who spend 50% of their time in a specialized horizontal SRE team
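The SLO and SLI items above can be made concrete with a small calculation. This is a minimal sketch, assuming a request-based availability SLI. The function names, the 99.9% target, and the request counts are hypothetical illustrations and are not prescribed by this repo.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI)."""
    if total_requests == 0:
        # Assumption: a window without traffic counts as meeting the objective.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Return the unspent share of the error budget.

    1.0 means the budget is untouched; 0.0 or less means it is exhausted.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failures the SLO tolerates
    actual_failure = 1.0 - sli          # failures actually observed
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests in the window, 500 of them failed.
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    remaining = error_budget_remaining(sli, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.1%}")
```

With these example numbers, about half of the error budget for the window is still available. In practice the request counts would come from whatever monitoring system the team has chosen.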
Principle #1: SRE needs SLOs with consequences.
Principle #2: SREs must have time to make tomorrow better than today.
Principle #3: SRE teams have the ability to regulate their workload.
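One way to read the three principles is as explicit policy checks. The sketch below is an illustration only: `TeamStatus`, both function names, and the 50% cap on operational work are assumptions made for this example, and the thresholds should be replaced with whatever your team agrees on.

```python
from dataclasses import dataclass


@dataclass
class TeamStatus:
    budget_remaining: float  # unspent fraction of the error budget (<= 0 means exhausted)
    ops_hours: float         # hours spent on tickets, pages, and manual work this period
    total_hours: float       # total hours worked this period


def feature_releases_allowed(status: TeamStatus) -> bool:
    """Principle #1: an exhausted error budget has a consequence, here a release pause."""
    return status.budget_remaining > 0.0


def ops_load_too_high(status: TeamStatus, cap: float = 0.5) -> bool:
    """Principles #2 and #3: flag when operational work crowds out engineering time.

    The default cap of 0.5 is an assumed value for illustration.
    """
    if status.total_hours <= 0:
        return False
    return status.ops_hours / status.total_hours > cap


if __name__ == "__main__":
    # Hypothetical period: budget overspent, 30 of 40 hours went to operational work.
    status = TeamStatus(budget_remaining=-0.1, ops_hours=30, total_hours=40)
    if not feature_releases_allowed(status):
        print("Error budget exhausted: pause feature releases, prioritize reliability work.")
    if ops_load_too_high(status):
        print("Operational load above cap: the SRE team may push work back to developers.")
```

Encoding the policy this way keeps the consequences agreed in advance rather than negotiated during an incident.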
Technical application management (TAM) is the more traditional way of doing operations. A TAM engineer traditionally focuses on execution rather than on prevention, whereas prevention is also part of SRE.