Implementing Site Reliability Engineering

3 Day Classroom  •  3 Day Live Online
3 Day Training at your location.
Adjustable to meet your needs.
When training eight or more people, onsite team training offers a more affordable and convenient option.

The site reliability engineer (SRE) role has been around for over 15 years. As distributed systems become more ubiquitous, demand for the role will continue to grow. However, many companies and technologists have not been exposed to the tenets of SRE, and the role is often misunderstood. Unlike traditional operations roles, the site reliability engineer puts additional focus on reducing human intervention by designing and implementing automation.

The position draws on both operations and software engineering to automate, monitor, troubleshoot, and improve systems. More specifically, the site reliability engineer works on the following aspects of your applications and services: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

This three-day course will walk through the book Site Reliability Engineering: How Google Runs Production Systems, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. During the course, you will learn about Google's approach to service management, gain an understanding of the basics of site reliability engineering, and get an introduction to advanced topics.

You'll look at real-world examples and code samples of how companies are using SRE to ensure that their services are exactly as reliable as they need to be. Finally, we'll cover the cultural and human aspects of site reliability that drive successful implementation.

In this Site Reliability Engineering training course, you will:

  • Learn what an SRE is and isn't.
  • Find out how SRE compares to DevOps.
  • Understand the difference between service-level indicators (SLI), service-level objectives (SLO), and service-level agreements (SLA).
  • Learn what technical and professional skills an SRE needs.
  • Determine what makes up a good SRE team.
  • Practice common ceremonies like blameless postmortems and production readiness reviews.
  • Gain an understanding of error budgets and how to calculate reliability cost.
  • Learn how SREs can embed themselves within development teams to increase operational stability.
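
As a rough illustration of how the SLI, SLO, and error budget ideas in the list above fit together, here is a minimal sketch. It assumes an availability SLI defined as good requests divided by total requests over a measurement window. The function names and the sample numbers are hypothetical and are not taken from the course materials.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests in the window that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaching the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    budget = error_budget(slo_target, total_requests)
    failed = total_requests - good_requests
    return (budget - failed) / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 600 failures, 99.9% availability SLO.
    slo = 0.999
    total = 1_000_000
    good = 999_400

    print(f"SLI (measured availability): {availability_sli(good, total):.4f}")
    print(f"Error budget (failed requests allowed): {error_budget(slo, total):.0f}")
    print(f"Budget remaining: {budget_remaining(slo, good, total):.0%}")
```

In this sketch the SLO is the internal target the SLI is compared against. An SLA would typically be a separate, contractual commitment and is not modeled here.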

Upcoming Dates and Locations
Guaranteed To Run
Feb 24, 2020 – Feb 26, 2020    8:30am – 4:30pm Live Online
Feb 24, 2020 – Feb 26, 2020    8:30am – 4:30pm New York, New York

NYC Seminar and Conference Center
71 West 23rd
Suite 515-Lower Level
New York, NY 10010
United States

Mar 30, 2020 – Apr 1, 2020    8:30am – 4:30pm Live Online
Mar 30, 2020 – Apr 1, 2020    8:30am – 4:30pm Boston, Massachusetts

Attune, formerly Microtek Boston
25 Burlington Mall Road
2nd Floor
Burlington, MA 01803
United States

Apr 27, 2020 – Apr 29, 2020    8:30am – 4:30pm Seattle, Washington

Allied Business Systems - Computer Classrooms
10604 NE 38th Place, Suite 118
Yarrow Bay Office Park-1 North
Kirkland, WA 98033
United States

Apr 27, 2020 – Apr 29, 2020    11:30am – 7:30pm Live Online
May 26, 2020 – May 28, 2020    8:30am – 4:30pm Seattle, Washington

Allied Business Systems - Computer Classrooms
10604 NE 38th Place, Suite 118
Yarrow Bay Office Park-1 North
Kirkland, WA 98033
United States

May 26, 2020 – May 28, 2020    11:30am – 7:30pm Live Online
Jun 29, 2020 – Jul 1, 2020    8:30am – 4:30pm San Francisco, California

Learn IT
33 New Montgomery St.
Suite 300
San Francisco, CA 94105
United States

Jun 29, 2020 – Jul 1, 2020    11:30am – 7:30pm Live Online
Jul 27, 2020 – Jul 29, 2020    8:30am – 4:30pm Live Online
Jul 27, 2020 – Jul 29, 2020    8:30am – 4:30pm Philadelphia, Pennsylvania

Hyatt Place
440 American Avenue
King Of Prussia, PA 19406
United States

Aug 24, 2020 – Aug 26, 2020    8:30am – 4:30pm Nashville, Tennessee

One Century Place Conference Center
26 Century Blvd
Nashville, TN 37214
United States

Aug 24, 2020 – Aug 26, 2020    9:30am – 5:30pm Live Online
Sep 28, 2020 – Sep 30, 2020    8:30am – 4:30pm Portland, Oregon

Kinetic Technology Solutions
15495 SW Sequoia Parkway
Suite 100
Portland, OR 97224
United States

Sep 28, 2020 – Sep 30, 2020    11:30am – 7:30pm Live Online
Oct 26, 2020 – Oct 28, 2020    8:30am – 4:30pm Live Online
Oct 26, 2020 – Oct 28, 2020    8:30am – 4:30pm Washington, District of Columbia

Attune, Formerly Microtek-Washington, DC
1110 Vermont Avenue NW
Suite 700
Washington, DC 20005
United States

Nov 30, 2020 – Dec 2, 2020    8:30am – 4:30pm Live Online
Nov 30, 2020 – Dec 2, 2020    8:30am – 4:30pm Raleigh, North Carolina

ASPE, a Cprime Company
2000 Regency Parkway
Suite 335
Cary, NC 27518
United States

Course Outline

Part 1 - Introduction

  1. Introduction
  2. The Production Environment at Google, From the Viewpoint of an SRE
    • Exercise: Mapping Your Production Environment

Part 2 - Principles

  1. Embracing Risk
    • Managing Risk
    • Measuring Service Risk
    • Risk Tolerance of Services
    • Motivation for Error Budgets
  2. Service-Level Objectives
    • Service Level Terminology
    • Indicators in Practice
    • Objectives in Practice
    • Agreements in Practice
    • Exercise: Setting Service-Level Objectives
  3. Eliminating Toil
    • What Is Toil?
    • Why Less Toil is Better
    • What Qualifies as Engineering?
    • Is Toil Always Bad?
  4. Monitoring Distributed Systems
    • Definitions
    • Why Monitor?
    • Setting Reasonable Expectations
    • Symptoms Versus Causes
    • Black Box Versus White Box
    • The Four Golden Signals
    • Worrying About Your Tail
    • Choosing an Appropriate Resolution for Measurements
    • As Simple as Possible, No Simpler
    • Tying These Principles Together
    • Monitoring for the Long Term
  5. The Evolution of Automation at Google
    • The Value of Automation
    • The Value for Google SRE
    • Use Cases for Automation
    • Automate Yourself Out of a Job
    • Soothing the Pain: Applying Automation to Cluster Turnups
    • Borg: Birth of the Warehouse-Scale Computer
    • Reliability is the Fundamental Feature
  6. Release Engineering
    • The Role of a Release Engineer
    • Philosophy
    • Continuous Build and Deployment
    • Configuration Management
  7. Simplicity
    • System Stability Versus Agility
    • The Virtue of Boring
    • I Won't Give Up My Code!
    • The "Negative Lines of Code" Metric
    • Minimal APIs
    • Modularity
    • Release Simplicity

Part 3 - Practices

  1. Practical Alerting
    • Time-Series Monitoring Outside of Google
    • Instrumentation of Applications
    • Exporting Variables
    • Collection of Exported Data
    • Storage in the Time-Series Arena
    • Rule Evaluation
    • Alerting
    • Sharding the Monitoring Topology
    • Black-Box Monitoring
    • Maintaining the Configuration
  2. Being On-Call
    • The Life of an On-Call Engineer
    • Balanced On-Call
    • Feeling Safe
    • Avoiding Inappropriate Operational Load
  3. Effective Troubleshooting
    • Theory
    • In Practice
    • The Magic of Negative Results
    • Making Troubleshooting Easier
    • Exercise: Distributed System Troubleshooting
  4. Emergency Response
    • What to Do When Systems Break
    • Test-Induced Emergency
    • Change-Induced Emergency
    • Process-Induced Emergency
    • Don't Repeat the Past: Learn From It
  5. Managing Incidents
    • Unmanaged Incidents
    • Managed Incidents
    • When to Declare an Incident
    • Elements of Incident Management Process
  6. Postmortem Culture: Learning from Failure
    • Google's Postmortem Philosophy
    • Collaborate and Share Knowledge
    • Introducing a Postmortem Culture
    • Exercise: Blameless Postmortem
  7. Tracking Outages
    • Escalator
    • Outalator
  8. Testing for Reliability
    • Types of Software Testing
    • Creating a Test and Build Environment
    • Testing at Scale
  9. Software Engineering in SRE
    • Why is Software Engineering Within SRE Important?
    • Auxon Case Study
    • Intent-Based Capacity Planning
    • Fostering Software Engineering in SRE
  10. Load Balancing at the Front End
    • Load Balancing Using DNS
    • Load Balancing at the Virtual IP Address
  11. Load Balancing in the Datacenter
    • Identifying Bad Tasks: Flow Control and Lame Ducks
    • Limiting the Connections Pool with Subsetting
    • Load-Balancing Policies
  12. Handling Overload
    • The Pitfalls of "Queries Per Second"
    • Per-Customer Limits
    • Client-Side Throttling
    • Criticality
    • Utilization Signals
    • Handling Overload Errors
    • Load from Connections
  13. Addressing Cascading Failures
    • Causes of Cascading Failures and Designing to Avoid Them
    • Preventing Server Overload
    • Slow Startup and Cold Caching
    • Triggering Conditions for Cascading Failures
    • Testing for Cascading Failures
    • Immediate Steps to Address Cascading Failures
  14. Managing Critical State: Distributed Consensus for Reliability
    • Motivating the Use of Consensus: Distributed Systems Coordination Failure
    • How Distributed Consensus Works
    • System Architecture Patterns for Distributed Consensus
    • Distributed Consensus Performance
    • Deploying Distributed Consensus-Based Systems
  15. Distributed Periodic Scheduling with Cron
    • Cron Jobs and Idempotency
    • Cron at Large Scale
    • Building Cron at Google
  16. Data Processing Pipelines
    • Origin of the Pipeline Design Pattern
    • Initial Effect of Big Data on the Simple Pipeline Pattern
    • Challenges with the Periodic Pipeline Pattern
    • Trouble Caused by Uneven Work Distribution
    • Drawbacks of Periodic Pipelines in Distributed Environments
    • Introduction to Google Workflow
    • Stages of Execution in Workflow
    • Ensuring Business Continuity
  17. Data Integrity: What You Read Is What You Wrote
    • Data Integrity's Strict Requirements
    • Google SRE Objectives in Maintaining Data Integrity and Availability
    • How Google SRE Faces the Challenges of Data Integrity
    • 1T Versus 1E: Not "Just" a Bigger Backup
    • Knowing that Data Recovery Will Work
    • Case Studies
    • General Principles of SRE as Applied to Data Integrity
  18. Reliable Product Launches at Scale
    • Launch Coordination Engineering
    • Setting Up a Launch Process
    • Developing a Launch Checklist
    • Selected Techniques for Reliable Launches
    • Development of LCE
    • Exercise: Develop a Production Readiness Review

Part 4 - Management

  1. Accelerating SREs to On-Call and Beyond
    • You've Hired Your Next SRE, Now What?
    • Initial Learning Experiences: The Case for Structure Over Chaos
    • Creating Stellar Reverse Engineers and Improvisational Thinkers
    • Reverse Engineering a Production Service
    • Five Practices for Aspiring On-Callers
    • On-Call and Beyond: Rites of Passage and Practicing Continuing Education
  2. Dealing with Interrupts
    • Managing Operational Load
    • Factors in Determining How Interrupts Are Handled
    • Imperfect Machines
  3. Embedding an SRE to Recover from Operational Overload
    • Phase 1: Learn the Service and Get Context
    • Phase 2: Sharing Context
    • Phase 3: Driving Change
  4. Communication and Collaboration in SRE
    • Communications: Production Meetings
    • Collaboration Within SRE
    • Case Study: Viceroy
    • Collaboration Outside SRE
    • Case Study: Migrating DFP to F1
  5. The Evolving SRE Engagement Model
    • SRE Engagement: What, How, and Why
    • The PRR Model
    • The SRE Engagement Model
    • Production Readiness Reviews: Simple PRR Model
    • Evolving the Simple PRR Model: Early Engagement
    • Evolving Services Development: Frameworks and SRE Platform

Part 5 - Conclusions

  1. Lessons Learned From Other Industries
  2. Conclusion

Who Should Attend

This site reliability engineering training course is designed for anyone in the IT/SDLC field looking to implement SRE teams and practices in their organization. Professionals with the titles listed below may find this course particularly beneficial.

  • Software Engineers
  • Systems Engineers
  • Network Engineers
  • Technical Program Managers
  • Anyone in an IT Leadership role
  • CIOs / CTOs
  • Anyone involved with IT infrastructure
  • IT Operations Staff

IT Operations experience is preferred.