Summary: Site Reliability Engineering by Google


Since I work with a bunch of `%$&@^&!` who are too lazy to read a 500+ page book (really O'Reilly?), I decided to summarize the quite excellent book:

> **Site Reliability Engineering**, How Google Runs Production Systems
> Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy.
> © 2016 Google, published by O'Reilly
> ISBN: 978-1-491-92912-4

The book is also fully available online via https://g.co/SREBook. Where helpful, I link to that website for details and clarifications. Feel free to link to this summary, copy it, steal it, or put in pull requests for it (https://github.com/aldegoeij/aldegoeij.github.io).

As proper summaries go, it follows the table of contents of the book:

# Part 1: Introduction

> https://landing.google.com/sre/book/chapters/part1.html

# Part 2: Principles

> https://landing.google.com/sre/book/chapters/part2.html

## Chapter 6: Monitoring Distributed Systems

> https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html

Guidelines on which issues should interrupt a human via a page, and how to deal with issues that aren't serious enough to trigger a page.

**Terms:**

* _Monitoring_: collecting, processing, aggregating, and displaying real-time quantitative system data. _White-box_: based on specific metrics exposed by the internals of the system/application. _Black-box_: externally testing / measuring behaviour.
* _Dashboard_: visual summary view of a service's core metrics.
* _Alert_: intended to be read by a human, pushed to a ticketing system.
* _Root cause_: symptom vs cause; you wouldn't be reading this if you didn't know about _root cause_.
* _Node_ or _machine_: physical / virtual server or container, on which one or more _services_ run.
* _Push_: any change to software, configuration, or a system [rightfully suggests all change should happen through code. ALdG].

**Why Monitor?**

* Analyzing long-term trends
* Comparing over time or between experiment groups
* Alerting: either immediate or imminent issues
* Building dashboards: the 'four golden signals'
* Conducting ad-hoc retrospective analysis: e.g. logs
* Analytics: business trends, security breaches, etc.

When the system isn't able to fix itself, we want a human to investigate the alert. You should never trigger an alert simply because something looks strange: paging a human is an expensive use of an employee's time. Effective alerting systems have good signal and very low noise.

Monitoring a complex system is a significant engineering endeavour. An SRE team [should] carefully avoid any situation that requires someone to 'stare at a screen'. Google has trended towards simpler and faster monitoring systems, [...] and avoids magic systems that try to learn thresholds or automatically detect causality. Google [has had] very limited success with complex dependency hierarchies [for alerts]. The elements of your monitoring system that direct to a pager need to be very simple and robust.

Your monitoring system should address two questions: _what_ is broken and _why_. [Google] combines heavy use of white-box monitoring with modest but critical use of black-box monitoring. When debugging, white-box monitoring is essential: for e.g. a web server connecting to a database, detailed application metrics are necessary to know where the slowdown happens, not just outside connection metrics. For paging, black-box monitoring has the key benefit of forcing discipline to only nag a human when a problem is both already ongoing and contributing to real symptoms.
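To make the white-box / black-box split concrete, here is a minimal black-box probe sketch of my own (not from the book): it observes only what an external user would see, which is exactly why it is a good basis for paging on real, ongoing symptoms. The URL, timeout, and field names are illustrative assumptions.

```python
import time
import urllib.request


def blackbox_probe(url: str, timeout: float = 5.0) -> dict:
    """Probe a service from the outside, the way a user would.

    No knowledge of internals: one request, its status code, and how long
    it took. (Illustrative sketch; url and timeout are assumptions.)
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {
                "up": 200 <= resp.status < 300,
                "status": resp.status,
                "latency_s": time.monotonic() - start,
            }
    except Exception as exc:  # HTTP 4xx/5xx, timeouts, DNS failures, resets, ...
        return {"up": False, "error": str(exc), "latency_s": time.monotonic() - start}


# e.g. page only after several consecutive probes report "up": False
```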
On the other hand, for not-yet-occurring but imminent problems, black-box monitoring is fairly useless.

### The Four Golden Signals

If you can only have four metrics, use these:

**Latency:** time it takes to service a request. Differentiate between successful and failed (error) requests, and definitely track both!

**Traffic:** how much demand is on the system, e.g. requests per second.

**Errors:** failure rate of requests, either explicit (`4xx` / `5xx`) or implicit (`200` with wrong content).

**Saturation:** how 'full' the system is: resource constraints / utilization. Measuring the 99th percentile over a small window can act as an early warning indicator. This is a predictive measure, so choose alert targets wisely!

If you measure all four golden signals correctly, and page a human when one or more signals are in alert, you will have decent monitoring coverage [without too many false positives].

**Worry about your tail!** The 99th percentile of one backend [service] can easily become the median response of your frontend! Collect metrics by buckets, suitable for rendering a histogram (a small sketch follows at the end of this chapter summary). This shows the distribution of e.g. request latency far better, and avoids a useless average that looks acceptable while some requests are incredibly slow.

Take care in how you structure the granularity of your measurements: simply collecting everything at a high frequency is not only overly expensive, but can also drown out the real signals.

The rules that catch real incidents most often should be as simple, predictable, and reliable as possible. It can be tempting to combine monitoring with other aspects of inspecting complex systems; don't. Blending together too many different types of inspection (metrics, profiling, debugging, load testing, logs) results in overly complex and fragile systems.

When creating rules for monitoring and alerting, the following questions can help you avoid false positives and pager burnout:

* Is this alert urgent, or could it wait until tomorrow?
* Does this rule detect an otherwise undetected condition?
* Will I ever be able to ignore this alert?
* Does this alert indicate that users are being negatively affected, i.e. is it user-visible?
* Can I take action in response to this alert?
* Can this alert be safely automated?
* Will my action be a short-term or long-term fix?
* Are other people getting paged? (That would be redundant.)

Fundamental philosophy on pages and pagers:

* I should react with a sense of urgency; I can only do so if I'm not fatigued.
* I should be able to take immediate action.
* I should have to use intelligent decision making, otherwise I should automate!
* It should be a novel issue or event, not a regular occurrence -> automate / ticket.

An alert that's currently exceptionally rare and hard to automate might become frequent, warranting automation of some sort (script). At that point someone should find and eliminate the root cause of the problem; if that is not possible, the response should be fully automated.

There is often a case for taking a short-term hit to availability or performance in order to improve the long-term outlook for the system. Managers and technical leaders play a key role in implementing a true, long-term fix by supporting and prioritizing potentially time-consuming long-term fixes. Unwillingness on the part of your team to automate pages implies the team lacks confidence that they can clean up their technical debt. This is a major problem worth escalating.
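Circling back to the four golden signals and the histogram advice above, here is a small sketch of my own (not from the book) that derives latency, traffic, and error metrics from a hypothetical request log, bucketing latencies so the tail stays visible instead of disappearing into an average. Saturation would come from resource metrics (CPU, memory, queue depth) rather than from the request log. All names, bucket bounds, and the nearest-rank p99 are assumptions for illustration.

```python
import math
from bisect import bisect_left

# Hypothetical request log: (latency in seconds, HTTP status code).
REQUESTS = [(0.120, 200), (0.350, 200), (2.900, 500), (0.080, 200), (1.100, 429)]

# Latency bucket upper bounds (seconds), suitable for rendering a histogram.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, float("inf")]


def golden_signals(requests, window_s=60.0):
    """Latency, traffic, and errors over one window; saturation is measured elsewhere."""
    ok_latencies = sorted(lat for lat, status in requests if status < 400)
    # Explicit failures (4xx / 5xx); implicit ones (200 with wrong content) need deeper checks.
    errors = sum(1 for _, status in requests if status >= 400)

    # Bucketed latencies: the histogram shows the distribution (and the tail),
    # which a single average would hide.
    histogram = [0] * len(BUCKETS)
    for lat, _ in requests:
        histogram[bisect_left(BUCKETS, lat)] += 1

    # Nearest-rank 99th percentile of successful requests: worry about your tail.
    p99 = ok_latencies[max(0, math.ceil(0.99 * len(ok_latencies)) - 1)] if ok_latencies else None

    return {
        "traffic_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests) if requests else 0.0,
        "latency_p99_s": p99,
        "latency_histogram": dict(zip(BUCKETS, histogram)),
    }


print(golden_signals(REQUESTS))
```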
Taking a controlled, short-term decrease in availability is often a painful but strategic trade for the long-run stability of the system. Ensure that decision makers are kept up to date on the pager load and overall health of their teams.

Do not use email alerts. Favour a dashboard that monitors all ongoing sub-critical problems.

---

# Part 3: Practices

> https://landing.google.com/sre/book/chapters/part3.html

---

# Part 4: Conclusions

> https://landing.google.com/sre/book/chapters/part4.html

## Chapter 29: Dealing with Interrupts

> https://landing.google.com/sre/book/chapters/dealing-with-interrupts.html

Any complex system is as imperfect as its creators...

**Operational Load:**

* _Pages_: alerts and emergencies, with a response time measured in minutes.
* _Tickets_: 'customer' requests, with a response time measured in hours / days / weeks.
* _Ongoing Operational Responsibilities_: toil (Chapter 5), ad-hoc time-sensitive questions from 'customers', interruptions.

Much of operational load is unplanned, interrupts at non-specific times, and requires someone to determine whether the issue can wait.

**Pages** are most commonly managed by a dedicated primary on-call engineer. To both minimize the interruption a page causes to the team and avoid the bystander effect, on-call shifts are handled by a single engineer. Issues can be escalated if need be. Typically a secondary on-call engineer acts as a backup for the primary.

**Tickets** can be managed in different ways: by the primary while she is on-call, by the secondary, or by a dedicated ticket person who is not on-call. Randomly distributing them sucks.

**Ongoing Operational Responsibilities** can be handled either by the primary engineer, or ad-hoc by other team members (don't).

Metrics to measure how well interrupts are managed:

* Interrupt SLO or expected response time
* Number of interrupts backlogged
* Severity of interrupts
* Frequency of interrupts
* Number of people available to handle types of interrupts

### Cognitive Flow State

Being 'in the zone' can increase productivity as well as artistic and scientific creativity. Being interrupted can kick you right out of this state if the interrupt is disruptive enough ([they usually are]). You want to maximize the amount of time spent in this state.

Everyone's method of arriving at a cognitive flow state is different, but the result is the same: the person works intently on the problem, losing track of time and ignoring interrupts as much as possible, produces creative results, does good work by volume, and is happier at the job. Unfortunately, many people in SRE-type roles spend much of their time either trying and failing to get into this mode and getting frustrated when they cannot, or never even attempting to reach it, instead languishing in the interrupted state.

For most stressed-out on-call engineers, stress is caused either by pager volume, or because they are treating on-call as the interrupt. These engineers exist in a constant state of interruption, and this working environment is extremely stressful. When a person is concentrating full-time on interrupts, _interrupts stop being interrupts_; when you are doing interrupts, projects are a distraction. People are ultimately happier with a balance between these two types of work.
Suggestions from Google SRE teams (for the benefit of management / influencers):

* People are free to manage their own time.
* Direct the structure of how the team itself manages interrupts.
* _Distractibility_: even though someone has the entire day free to work on projects, she remains extremely distractible. Some level of distraction is inevitable and by design. Distractions can be managed by turning off IM, etc.
* Manage distractions by _polarizing time_: minimize context switches and assign a cost to them. A 20-minute interruption entails two context switches and the loss of a couple of hours of productive work. Polarizing time means that when a person comes into work each day, they should know whether they are doing project work or _just_ interrupts.
* If the volume of interrupts is too high, add another person.
* The primary on-call engineer should focus solely on on-call work.
* Tickets or interrupt-based work that can be abandoned quickly should be part of on-call duties.
* When an engineer is on-call for a week, that week should be written off as far as project work is concerned.
* If a project is too important to let slip by a week, that person should not be on-call.
* If the secondary is to back up the primary in case of a fallthrough, the secondary can do project work; otherwise the secondary should be doing e.g. tickets while the primary does on-call.

> You never run out of clean-up work.

* There is always documentation that needs updating and configs that need cleanups, which also helps your future on-call engineers (this is work for the primary during on-call when there are no alerts).
* Tickets should be a full-time role. Don't spread this load across the team; you will just be causing context switches.
* Define a push manager who can juggle pushes for the duration of the on-call or interrupt time.
* Formalize the on-call hand-over process.
* Be on interrupts or don't be: don't work on tickets when you are not assigned to handle tickets, especially not as a way to look busy. This skews the numbers of how manageable the ticket load is.
* Analyze root causes of tickets to stop being annoyed by the same issues over and over again.
* Hand-over maintains shared state and responsibility between ticket handlers.
* Conduct on-call hand-overs, page reviews, and ticket reviews. Regularly review tickets and pages with the whole team.
* If you think a root cause is fixable in a reasonable amount of time, _silence the interrupts_ until the root cause is expected to be fixed.
* Let the team set the level of service provided by your service.
* It's OK to push back some of the effort onto your customers.
* Think about the value of the time you spend on interrupts for a particular service, and whether it is spent wisely. If you can't get the attention you need to fix a root cause, it's probably not that important.
* If there are activities that don't require your privileges to accomplish, consider using policy to push work back to the requestor.
* If a 'customer' wants a certain task to be accomplished, they should be prepared to spend some effort getting what they want.
* Requests should be meaningful and rational, and provide the information and legwork you need in order to fulfill the request. In return, your response should be helpful and timely.

---

# Appendices

---

# Authors involved

Since all chapters have been written by actual engineers and it has been valuable knowledge sharing for me, I'm listing the chapter authors here, thanks guys & gals:

* Rob Ewaschuk
* Dave O’Connor

---

Copyleft.
