Book Review: "Site Reliability Engineering: How Google Runs Production Systems"

Intro
The following points come from a book written by many Googlers and related colleagues, such as Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy, titled "Site Reliability Engineering: How Google Runs Production Systems".

Disclaimer
Disclaimer: As in every book review I have posted, this is just an invitation to read the book, not a replacement!

Rationale
"Traditionally system developers ended their task once we threw their creation into production". This brought only trouble both to the final customers and to the staff in charge of providing the service.

Objective
This book is basically Google's attempt to revamp the role of the system administrator and operator in production: to place it at the same level where system developers were and are.

How?
No magic solution, just smart common sense, i.e. giving system administrators in production the ability to improve the system themselves, to automate and to scale. The authors describe their proposal as a specific implementation of DevOps.

Automation
From manual steps to externally maintained automation (both system-specific and generic), then to internally maintained automation, and finally to systems that need no human intervention at all (autonomy).
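
As a loose illustration of that ladder (the restart scenario and all function names below are mine, not the book's), the same task can live at different maturity levels:

    # Hypothetical sketch: one task, restarting an unhealthy backend, at
    # different automation maturity levels.
    def manual_runbook():
        # Level 1: a human follows a documented step by hand.
        print("operator: ssh to the host and run the restart command")

    def external_restart_script(backend):
        # Intermediate levels: an externally maintained script, generic enough to reuse.
        print(f"restarting {backend}")

    def autonomous_loop(check_health, restart, backends):
        # Final level: the system detects and repairs the problem without a human.
        for backend in backends:
            if not check_health(backend):
                restart(backend)

    autonomous_loop(lambda b: b != "be-2", external_restart_script, ["be-1", "be-2"])
    # only "restarting be-2" is printed
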
Reliability
How do they define reliability? "Probability that a system performs a function with no failure under stated conditions for a period of time". For the SRE, a planned outage is a chance to improve the system, to innovate.
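
As a minimal sketch of that definition (my own illustration, not from the book), reliability over a window can be measured as the fraction of time the system performed its function, here assuming one boolean health sample per minute:

    # Reliability measured over a 30-day window from per-minute health samples
    # (True = the system performed its function, False = it failed).
    def measured_reliability(samples):
        return sum(samples) / len(samples)

    minutes_in_window = 30 * 24 * 60
    samples = [True] * (minutes_in_window - 3) + [False] * 3   # 3 bad minutes
    print(f"{measured_reliability(samples):.5f}")              # ~0.99993, roughly "four nines"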

Service reliability hierarchy
Bottom-up: Monitoring, incident response, post-mortem/root cause analysis, testing and release procedures, capacity planning, development and product.

"Hope is not a valid strategy"
70% of outages come from changes in a live system.

Monitoring
Monitoring software should do the interpretation; humans should only be notified via alerts, tickets or logging, according to criticality. No email alerts: use a dashboard with flashy colours instead. Nowadays monitoring is mostly a collection of time series (more powerful than SNMP alone), i.e. sequences of values and timestamps, which serve as the data source for automatically evaluated rules.
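
As a toy illustration of such a rule (this is not Google's tooling; the threshold and window are my assumptions), an alert can be a simple function evaluated over a time series:

    # A time series as a sequence of (timestamp, value) samples, plus a paging rule.
    from dataclasses import dataclass

    @dataclass
    class Sample:
        timestamp: float   # seconds since epoch
        value: float       # e.g. error ratio observed at that point

    def should_page(series, threshold=0.05, window_seconds=300):
        """Page a human only if every sample in the last window exceeds the threshold."""
        if not series:
            return False
        latest = series[-1].timestamp
        recent = [s for s in series if latest - s.timestamp <= window_seconds]
        return all(s.value > threshold for s in recent)

    # Error ratio above 5% for the whole 5-minute window -> page.
    series = [Sample(t, 0.08) for t in range(0, 301, 60)]
    print(should_page(series))   # True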

Distinguish black-box monitoring (what does the user experience look like?) from white-box monitoring (the system's internals).

This way we increase the MTTF (mean time to failure) and reduce the MTTR (mean time to repair).
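
The usual relationship between those two figures and availability is MTTF / (MTTF + MTTR); a tiny worked example with made-up numbers:

    mttf_hours = 720.0   # mean time to failure: one failure roughly every 30 days
    mttr_hours = 0.5     # mean time to repair: 30 minutes to recover

    availability = mttf_hours / (mttf_hours + mttr_hours)
    print(f"{availability:.5f}")   # ~0.99931; shortening MTTR directly raises this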

Latency vs throughput
System engineers need to understand what is best for their system: the smart mix between latency (how long) and throughput (how many). Think about cost versus the projected increase in revenue. Key point: aim for the right Service Level Objective, and do not overachieve. Over-achieving on availability prevents you from innovating and improving the system.
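
The reasoning behind "do not overachieve" is the error budget: the gap between 100% and the SLO is what you are allowed to spend on risky changes. A sketch with made-up numbers:

    slo = 0.999                         # availability target for a 30-day window
    minutes_in_window = 30 * 24 * 60

    error_budget_minutes = (1 - slo) * minutes_in_window
    downtime_so_far = 10                # minutes of downtime already consumed

    remaining = error_budget_minutes - downtime_so_far
    print(f"budget: {error_budget_minutes:.0f} min, remaining: {remaining:.0f} min")
    # budget: 43 min, remaining: 33 min -> still room to ship risky changes.
    # Consistently beating the SLO by a wide margin means this budget goes unused.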

Avoid toil
Manual, repetitive work needs to be automated. Monitoring data that is not being used is a candidate for removal. Blending too many signals together into one rule adds complexity. In a team of 10 to 12 SREs, 1 or 2 people are devoted to monitoring.


Release engineering
Release engineering also includes configuration management, starting at the beginning of the product lifecycle. Frequent releases result in fewer changes between versions. Distinguish between inherent complexity and accidental complexity, and avoid the latter.
In software, less is more (and more expensive). Versioning APIs is a good idea.
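
As a hedged sketch of what API versioning can look like (the endpoints and handlers are mine, not the book's): old clients keep the /v1 behaviour while new clients opt in to /v2, so a release does not break anyone.

    def handle_v1(request):
        return {"name": request["name"]}                            # original contract

    def handle_v2(request):
        return {"name": request["name"], "created": "2024-01-01"}   # extended contract

    ROUTES = {"/v1/users": handle_v1, "/v2/users": handle_v2}

    def dispatch(path, request):
        return ROUTES[path](request)

    print(dispatch("/v1/users", {"name": "ada"}))   # {'name': 'ada'}
    print(dispatch("/v2/users", {"name": "ada"}))   # {'name': 'ada', 'created': '2024-01-01'}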

Incident management teams
Multi-site teams incur a communication overhead. How do you know the team is in the sweet spot? When handling an incident takes about 6 hours, including root cause analysis and the post-mortem. Prefer the rational, focused, cognitive (procedure-based) process over the intuitive, fast, automatic one. Provide clear escalation paths and follow a blameless post-mortem culture. Use a web-based incident management tool.

Avoid operational overload. If there are too many alerts, give the pager back to the developers. Prepare for outages: run drills and test the "what if...?" scenarios. Team members should be on-call at least once or twice per quarter.

Separation of duties in incident management: operations (with roles rotating among teams and time zones), communication, and planning.
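
A tiny sketch of how that separation of duties could be captured in an incident record (the structure and names are my own illustration):

    from dataclasses import dataclass, field

    @dataclass
    class Incident:
        title: str
        ops_lead: str              # hands on the system; the role rotates across teams and time zones
        communications_lead: str   # keeps stakeholders informed
        planning_lead: str         # tracks follow-ups and files the blameless post-mortem
        timeline: list = field(default_factory=list)

        def log(self, entry):
            self.timeline.append(entry)   # raw material for the post-mortem

    incident = Incident("checkout latency spike", "alice", "bob", "carol")
    incident.log("14:02 paged: p99 latency above the SLO")
    incident.log("14:10 rollback started")
    print(incident.ops_lead, len(incident.timeline))   # alice 2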


Testing
Testing is continuous. Testing reduces uncertainty, since every change risks decreasing reliability. Include configuration tests.
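
A configuration test can be as simple as asserting invariants about a config before it ships; a minimal sketch (the keys and limits are assumptions):

    config = {
        "replicas": 3,
        "timeout_seconds": 2.5,
        "regions": ["eu-west", "us-east"],
    }

    def test_config(cfg):
        assert cfg["replicas"] >= 2, "need redundancy to fail safely"
        assert 0 < cfg["timeout_seconds"] <= 10, "timeout outside sane bounds"
        assert len(cfg["regions"]) >= 2, "a single region is a single point of failure"

    test_config(config)   # raises AssertionError if the change broke an invariant
    print("config ok")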


Team size
It should not scale directly with service growth.


Best practices
Fail safely. Roll out progressively. Define your error/bug budget. Follow the monitoring principles (the hierarchy above), write post-mortems and do capacity planning.
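
Progressive rollouts in particular can be sketched as a staged schedule that only widens the blast radius while the new version stays healthy (the stages and threshold are illustrative, not the book's):

    STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]   # fraction of traffic on the new version

    def progressive_rollout(observe_error_rate, threshold=0.01):
        for fraction in STAGES:
            error_rate = observe_error_rate(fraction)
            if error_rate > threshold:
                return f"rolled back at {fraction:.0%} (error rate {error_rate:.2%})"
        return "rolled out to 100%"

    # Fake observation functions standing in for real monitoring data.
    print(progressive_rollout(lambda fraction: 0.002))   # healthy -> rolled out to 100%
    print(progressive_rollout(lambda fraction: 0.03))    # unhealthy -> rolled back at 1%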


Latency
Look not only at the mean latency but also at the distribution of latencies. Prevent server overload by means of built-in graceful degradation.
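
Looking at the distribution rather than the mean usually means looking at percentiles; a small sketch with made-up latencies and a crude nearest-rank percentile:

    import statistics

    latencies_ms = [12, 13, 11, 14, 12, 13, 12, 950, 12, 13]   # one slow outlier

    def percentile(values, p):
        ordered = sorted(values)
        index = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
        return ordered[index]

    print("mean:", statistics.mean(latencies_ms), "ms")   # 106.2 ms, looks mediocre
    print("p50: ", percentile(latencies_ms, 50), "ms")    # 12 ms, most users are fine
    print("p99: ", percentile(latencies_ms, 99), "ms")    # 950 ms, the tail users suffer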


Availability
Leader election is a reformulation of the distributed asynchronous consensus problem, which cannot be solved correctly with heartbeats; replicated state machines built on a consensus protocol are needed instead. A Byzantine failure is, for example, an incorrect message caused by a bug or by malicious activity.


Production readiness review
Early SRE involvement is desirable. SRE can only scale by working with frameworks. Data integrity is the means; data availability is the goal.

 
Rocky landscape
Happy reliable reading!
Interested in the mind map of it? Here is part 1.


And part 2.