Site Reliability Engineering — How Google Runs Production Systems

Travis Ladner
Apr 17, 2016

“what happens when a software engineer is tasked with what used to be called operations.” — Ben Treynor

The following are excerpts and lessons pulled from Site Reliability Engineering and The Practice of Cloud System Administration.

“…a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.

To avoid this fate, the team tasked with managing a service needs to code or it will drown. Therefore, Google places a 50% cap on the aggregate “ops” work for all SREs — tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.”

“…spend more time making improvements in how work is done rather than doing the work itself.”

Ensuring a Durable Focus on Engineering —

Ensure on-call SREs do not experience an overwhelming number of events; they must have breathing room to produce a postmortem without incurring fatigue.

“Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix those faults…”

Maximum Change Velocity Without Violating SLAs —

“… an error budget stems from the observation that 100% is the wrong reliability target”

The amount of unavailability your users will not notice is your error budget. That budget is spent to fund innovative improvements to the service. Change should be an expected and regular part of the lifecycle; complemented with resilient deployment-pipeline practices, this helps avoid a culture of “change fear”.

“It turns out that past a certain point… increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer. Further, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness — with features, service, and performance — is optimized.”
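To make the idea concrete, here is a minimal Python sketch of error-budget accounting for a request-based SLO. The SLO value and request counts are made up for illustration and are not taken from the book.

```python
# A minimal sketch of error-budget accounting, assuming a request-based SLO.
# The SLO value, window, and request counts here are illustrative only.

def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Return how much of this window's error budget has been spent."""
    allowed_failures = (1 - slo) * total_requests          # budget, in requests
    spent_fraction = (failed_requests / allowed_failures
                      if allowed_failures else float("inf"))
    return {
        "allowed_failures": allowed_failures,
        "failed_requests": failed_requests,
        "budget_remaining": max(0.0, 1.0 - spent_fraction),
    }

# Example: a 99.9% SLO over 10M requests allows roughly 10,000 failed requests.
print(error_budget(slo=0.999, total_requests=10_000_000, failed_requests=2_500))
```

While budget remains, the team can keep shipping changes; once it is exhausted, the remaining window is spent on stability instead.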

Monitoring —

“…monitoring is an absolutely essential component of doing the right thing in production. If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable.”

“A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs. However, this type of email alerting is not an effective solution: a system that requires a human to read an email and decide whether or not some type of action needs to be taken in response is fundamentally flawed. Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.”

“In a large system, the level of visibility must be actively created by designing systems that draw out information and make it visible. No person or team can manually keep tabs on all the parts.”
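As a rough illustration of “software should do the interpreting,” here is a small Python sketch that classifies a measured symptom into page, ticket, or no action, so humans are only interrupted when they actually need to act. The thresholds and the Symptom type are assumptions for the example, not anything prescribed by the book.

```python
# A minimal sketch: software interprets the signal and decides whether a human
# needs to act, instead of emailing raw values for a person to interpret.
# Thresholds and types are illustrative assumptions.

from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    NONE = "no action"
    TICKET = "file a ticket"       # needs a human eventually, not right now
    PAGE = "page the on-call"      # needs a human immediately

@dataclass
class Symptom:
    name: str
    error_ratio: float             # fraction of requests failing right now

def interpret(symptom: Symptom, page_threshold: float = 0.05,
              ticket_threshold: float = 0.01) -> Action:
    """Decide whether a human needs to act on this symptom."""
    if symptom.error_ratio >= page_threshold:
        return Action.PAGE
    if symptom.error_ratio >= ticket_threshold:
        return Action.TICKET
    return Action.NONE

print(interpret(Symptom("checkout-errors", 0.07)))   # Action.PAGE
```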

Service Level Objectives —

“We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.”

Form your objectives first; your indicators derive from them.

“Start by thinking about (or finding out!) what your users care about, not what you can measure.”

When collecting indicators, the end-user (client-side) perspective is important; server-side metrics may not always tell the full story.

“Most metrics are better thought of as distributions rather than averages. For example, for a latency SLI, some requests will be serviced quickly, while others will invariably take longer — sometimes much longer. A simple average can obscure these tail latencies, as well as changes in them.”
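A quick Python sketch of that point, with made-up latency samples: the mean and median can look fine while a percentile exposes the tail that users actually feel.

```python
# A minimal sketch of why a latency SLI should be a distribution, not an
# average. The sample data below is invented for illustration.

import statistics

latencies_ms = [12, 14, 15, 13, 16, 14, 15, 13, 900, 1200]  # two slow outliers

def percentile(values, pct):
    """Nearest-rank style percentile over a small sample."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[index]

print("mean  :", statistics.mean(latencies_ms), "ms")    # ~221 ms, a number nobody experiences
print("median:", statistics.median(latencies_ms), "ms")  # ~14.5 ms, looks great
print("p99   :", percentile(latencies_ms, 99), "ms")     # 1200 ms, the real tail pain
```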

Eliminating Toil —

“We define toil as mundane, repetitive operational work providing no enduring value, which scales linearly with service growth.”

Google SRE’s unofficial motto is: “Hope is not a strategy.”

Organizing Operational Workflow —

Work types should be divided into three main areas:

  1. Emergency — High-priority “fires” that require immediate response and resolution.
  2. Expected — Manual processes with an expected turnaround time.
  3. Projects — Require long stretches of uninterrupted time to develop long-term optimizations that reduce toil or prevent future emergencies.

What is special about these SREs is that they tend to favor spending most of their time on projects, not expected or emergency work. A rotation is put into effect whereby an engineer spends a dedicated but short amount of time (relative to the team size, say one week) solely handling emergency calls. For example, on a team of eight SREs, one engineer handles on-call work, one handles expected work, and the rest of the team works on projects.

Because the rotation frees up most of the team and keeps them from being distracted by fires, most engineers stay focused on improving the existing systems.
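A toy Python sketch of that rotation, using hypothetical names and a weekly cadence, just to show the shape of the schedule.

```python
# A toy sketch of the rotation described above: one engineer on emergency
# (on-call) duty, one on expected (ticket) duty, everyone else on projects.
# Names and the weekly cadence are illustrative assumptions.

from itertools import cycle

team = ["ana", "ben", "chen", "dee", "eli", "fay", "gus", "hana"]  # 8 SREs

def weekly_assignments(team, weeks):
    """Rotate on-call and ticket duty weekly; the rest do project work."""
    rotation = cycle(team)
    for week in range(1, weeks + 1):
        on_call = next(rotation)
        tickets = team[(team.index(on_call) + 1) % len(team)]
        projects = [p for p in team if p not in (on_call, tickets)]
        yield week, on_call, tickets, projects

for week, on_call, tickets, projects in weekly_assignments(team, 3):
    print(f"week {week}: on-call={on_call} tickets={tickets} projects={projects}")
```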

Extra Resources —

My Philosophy on Alerting by Rob Ewaschuk (SRE @ Google)

Book review of Site Reliability Engineering by Mike Doherty

Notes on Google’s Site Reliability Engineering

Hacker News Discussion on SRE
