Site Reliability Engineering (SRE) Do’s
Ok, so I was going to do a full summary of:
Site Reliability Engineering: How Google Runs Production Systems, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Copyright 2016 Google, published by O’Reilly. ISBN: 978-1-491-92912-4
But of course that was never going to happen :-)
So, here I’m writing my condensed conclusions from the book, from the perspective of running a small team of SREs (5+) for a digital enterprise fintech startup in the Netherlands (30+ dev FTE).
Work in progress: currently covering chapters 5, 6, and 29.
What is Site Reliability Engineering about?
Engineering work intrinsically requires human judgement: continuous improvement of your service, guided by strategy and a design-driven approach, so that the same level of staffing can handle ever larger services.
- Software Engineering: writing or modifying code, plus design and documentation. Examples: automation scripts, tools, frameworks, adding scalability / reliability features to services, making infrastructure more robust.
- Systems Engineering: configuring (production) systems, modifying configurations, and documenting them. Examples: monitoring setup / updates, load balancing, server / OS tuning. Also consulting on architecture, design, and productization for developer teams.
- Toil: repetitive manual work needed to keep a system running.
- Overhead: administrative work tied to running / being part of the SRE team: reviews, assessments, training, HR, hygiene.
Every SRE needs to spend at least 50% of their time on engineering work. If this is not the case, teams should take a step back and figure out what’s wrong.
Toil is bad:
There will always be toil, but we should strive to eliminate a little bit of it every week, to steadily clean up our services and shift our effort to engineering for scale, architecting the next generation of services, and building cross-SRE toolchains. Too much toil results in career stagnation, low morale, low productivity, etc.
Monitoring
When monitoring frequently changing systems / services, focus on getting the four golden signals right before anything else:
- Latency: the time it takes to service a request. Differentiate between successful and failed (error) requests, and definitely track both!
- Traffic: how much demand is placed on the system, e.g. requests per second.
- Errors: the rate of failing requests, either explicit (4xx / 5xx) or implicit (a 200 with the wrong content).
- Saturation: how “full” the system is: resource constraints / utilization. Measure the 99th percentile over a small window as an early warning indicator. Saturation is a predictive measure, so choose alert targets wisely!
If you measure all four golden signals correctly, and page a human when one or more signals are in alert, you will have decent monitoring coverage (without too many false positives).
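To make this a bit more concrete, here is a minimal Python sketch of computing the four signals over one window of request records and deciding whether to page. The field names, metric names, and thresholds are my own illustrative assumptions, not numbers from the book:

```python
import statistics

def percentile(values, pct):
    """Return the pct-th percentile; 0.0 when there is too little data."""
    if len(values) < 2:
        return values[0] if values else 0.0
    # quantiles() with n=100 yields the 1st..99th percentile cut points.
    return statistics.quantiles(values, n=100)[pct - 1]

def evaluate_golden_signals(requests, window_seconds, cpu_utilization):
    """Compute the four golden signals for one window and list firing alerts."""
    ok = [r for r in requests if r["status"] < 400]
    failed = [r for r in requests if r["status"] >= 400]

    signals = {
        # Latency: successful and failed requests tracked separately.
        "latency_ok_p99": percentile([r["latency_s"] for r in ok], 99),
        "latency_err_p99": percentile([r["latency_s"] for r in failed], 99),
        # Traffic: demand on the system, here requests per second.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: explicit failure rate (implicit errors, e.g. a 200 with
        # the wrong content, need content checks and are not covered here).
        "error_rate": len(failed) / max(len(requests), 1),
        # Saturation: 99th percentile utilization as an early warning.
        "saturation_p99": percentile(cpu_utilization, 99),
    }

    # Page a human when one or more signals crosses its target.
    thresholds = {
        "latency_ok_p99": 0.5,   # seconds (assumed target)
        "error_rate": 0.01,      # 1% of requests (assumed target)
        "saturation_p99": 0.9,   # 90% utilization (assumed target)
    }
    firing = [name for name, limit in thresholds.items() if signals[name] > limit]
    return signals, firing
```

In practice these numbers come from your SLOs and a real monitoring system rather than hand-rolled code, but the shape of the decision is the same.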
Collect metrics in buckets suitable for rendering a histogram. This shows the distribution of e.g. request latency far better than a single average, which can look acceptable while some requests are incredibly slow.
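As a small illustration of why buckets matter (the bucket boundaries below are just assumptions), counting observations per bucket keeps the distribution visible, so the slow tail shows up even when the average looks fine:

```python
import bisect

# Illustrative latency bucket boundaries in seconds (assumed values).
BUCKET_BOUNDS_S = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, float("inf")]

def bucket_latencies(latencies_s):
    """Count request latencies per bucket, as a monitoring system would."""
    counts = [0] * len(BUCKET_BOUNDS_S)
    for latency in latencies_s:
        counts[bisect.bisect_left(BUCKET_BOUNDS_S, latency)] += 1
    return {f"<= {bound}s": count for bound, count in zip(BUCKET_BOUNDS_S, counts)}

# 95 fast requests plus 5 very slow ones: the average (~0.33s) looks
# acceptable, while the histogram clearly exposes the 5-second outliers.
samples = [0.08] * 95 + [5.0] * 5
print(sum(samples) / len(samples))  # misleading single average
print(bucket_latencies(samples))    # distribution per bucket
```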
Policies
Create SLAs
The Team
Differentiate between three types of work:
- On-Call: Respond to pages. Perform root-cause analysis into pages. Design automation scenarios to prevent specific pages from happening again.
- Interrupts: Respond to ‘customer’ requests / tickets. Perform root-cause analysis into tickets. Design solutions to prevent specific tickets from being raised again.
- Projects: Build new things. This can be work from a product / service backlog or from the SRE team’s own backlog.
Differentiate between three types of Engineer:
- Primary Engineer: The weekly on-call Engineer, works on pages. When page load is low, works on interrupts.
- Secondary Engineer: Backup of the Primary. When ticket load and/or page load is high, works on interrupts, otherwise works on projects.
- Engineer: Works on projects. Might be pulled into pages or tickets when her expertise is needed; this should be minimized by having a cross-trained team with e.g. playbooks or documentation for common issues.
Try very hard to minimize the number of interrupts that reach anyone but the primary or secondary engineer, and do not randomize / rotate ticket assignments: context switches are among the most costly and disruptive events in an engineer’s day.
When switching on-call engineers from one week to the next, have a formal, documented hand-over moment in place. Not only does this bring the fresh engineer up to speed and allow for reflection on the work done, it also makes the fresh engineer more productive and responsive, and leaves the previous on-call engineer with fewer interrupts after her shift.
Side note: the primary engineer, if she is not busy with pages or interrupts, is perfectly positioned to update documentation and clean up stale tickets, merge requests, etc.
Another side note: in my team, projects entail both work on stories from product teams and work on engineering stories.