Site Reliability Engineer (SRE)

Guardian of platform resilience. SREs apply software engineering principles to operations so that global systems stay performant, reliable, and automated.

The Professional Mission

To treat operations as an engineering problem: applying software engineering principles to infrastructure so that global platforms are not just 'online' but fast and resilient under load.

The Daily Reality

You are the software engineer with a 'systems' focus. You spend your day coding automation to handle failures, managing the error budget, and ensuring that the platform can scale to handle the world's traffic. You bridge the gap between 'ship fast' and 'don't crash.'

Hard Challenges

  • Error Budgeting: Making the hard call on when to slow down feature delivery to prioritize system stability.
  • Observability Overload: Cutting through the noise to find the 'Golden Signals' (latency, traffic, errors, saturation) that actually indicate system health.
  • Solving the Unsolvable: Debugging the complex, transient issues that only happen when you hit millions of concurrent users.
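
As a rough illustration of the 'Golden Signals' idea above, the sketch below derives the four signals commonly cited in SRE practice (latency, traffic, errors, saturation) from one window of request records. The `Request` record, the field names, and the sample numbers are hypothetical and not tied to any particular monitoring stack.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; values must be non-empty, pct an integer 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Summarize one observation window into the four signals."""
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": percentile([r.latency_ms for r in requests], 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": cpu_used / cpu_capacity,
    }


# Hypothetical 10-second window: 98 fast successes, 2 slow server errors.
sample = [Request(20.0, 200)] * 98 + [Request(900.0, 503)] * 2
signals = golden_signals(sample, window_s=10, cpu_used=6.0, cpu_capacity=8.0)
print(signals)
```

In this toy window the average latency would look healthy, while the 99th percentile and the error rate expose the two failing requests. That is the usual argument for watching tail latency rather than means.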

What You Do Weekly

  • Monitor systems
  • Automate recovery
  • Manage incidents
  • Optimize performance
  • Conduct post-mortems

What Winning Looks Like

  • Maintaining the 'Service Level Objectives' (SLOs) that define a positive experience for billions of users.
  • Automating away 'toil' (manual, repetitive operational work) so the team can focus on improving core system reliability.
  • Leading effective post-mortems that turn failures into permanent architectural improvements.

Core Deliverables

  • SLO/SLA definitions
  • Post-mortem reports
  • Automation scripts
  • Monitoring dashboards

Ideal Person-Job Fit

The Resilient Architect. You are calm under extreme pressure, deeply curious about how systems fail, and believe that every manual task is a bug that needs to be coded away.

The Concrete Proof Recruiters Trust

  • Incident report
  • Automation tool
  • Monitoring setup

Required Skills & Depth

  • Language: Python
  • Concept: Reliability Engineering, Observability
  • Technical: Performance Engineering, Go
  • Database: Elasticsearch
  • Infrastructure: CI/CD, Helm, Load Balancing
  • Quality: Load Testing, Performance Testing
  • Networking: DNS, gRPC, TCP/IP, Proxies, Service Mesh
  • Ecosystem & Tools: Kubernetes, Docker, Git, Terraform, Prometheus, Grafana, Jenkins, Nginx, OpenTelemetry, AWS, Google Cloud, Azure

Starter Sprints

Service Level Objective (SLO) Design (15 min)

Define SLIs and SLOs for a critical API. Calculate the error budget and define the alerting policy for when burn rate is high.
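
The arithmetic behind this sprint can be sketched as follows. The 99.9% target, the 30-day window, and the observed error rates are illustrative assumptions. The burn-rate thresholds of 14.4 (one-hour window) and 6 (six-hour window) follow a pattern described in Google's SRE Workbook, but any real policy should be tuned to the service.

```python
def error_budget_minutes(slo_target, window_days):
    """Allowed minutes of full downtime within the SLO window."""
    return (1 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_rate, slo_target):
    """How many times faster than 'sustainable' the budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the window.
    """
    return observed_error_rate / (1 - slo_target)


def alert_level(rate_1h, rate_6h):
    """Hypothetical two-tier policy: page on fast burn, ticket on slow burn."""
    if rate_1h >= 14.4:
        return "page"
    if rate_6h >= 6.0:
        return "ticket"
    return "ok"


SLO = 0.999  # assumed availability target
budget = error_budget_minutes(SLO, window_days=30)
print(f"Error budget: {budget:.1f} minutes per 30 days")

# Assumed observations: 2% of requests failing over the last hour,
# 0.7% over the last six hours.
fast = burn_rate(0.02, SLO)
slow = burn_rate(0.007, SLO)
print(f"1h burn rate {fast:.1f}, 6h burn rate {slow:.1f} -> {alert_level(fast, slow)}")
```

Alerting on burn rate rather than on raw error rate ties the page directly to the SLO: a brief blip that barely dents the budget stays quiet, while a sustained failure that would exhaust the budget in days triggers a response.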

Incident Post-Mortem (12 min)

Write a blameless post-mortem for a simulated outage. Identify the root cause, timeline, impact, and action items to prevent recurrence.
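
One way to keep such a write-up consistent is to capture it as structured data, from which detection and mitigation times fall out of the timeline. The sketch below is a minimal example under that assumption. The `PostMortem` class, its field names, and the outage details are all hypothetical, invented for a simulated incident.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PostMortem:
    title: str
    root_cause: str  # describes the system, not a person (blameless)
    impact: str
    timeline: dict   # event name -> timestamp
    action_items: list = field(default_factory=list)

    def minutes_between(self, start, end):
        """Elapsed minutes between two named timeline events."""
        delta = self.timeline[end] - self.timeline[start]
        return delta.total_seconds() / 60


# Simulated outage; every value below is made up for illustration.
pm = PostMortem(
    title="Simulated checkout API outage",
    root_cause="Config change removed a connection pool limit",
    impact="Elevated 5xx responses on checkout for 50 minutes",
    timeline={
        "started": datetime(2024, 1, 10, 14, 0),
        "detected": datetime(2024, 1, 10, 14, 12),
        "mitigated": datetime(2024, 1, 10, 14, 50),
    },
    action_items=["Add config validation to CI", "Alert on pool exhaustion"],
)

time_to_detect = pm.minutes_between("started", "detected")
time_to_mitigate = pm.minutes_between("started", "mitigated")
print(f"Time to detect: {time_to_detect:.0f} min, "
      f"time to mitigate: {time_to_mitigate:.0f} min")
```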

Chaos Engineering Experiment (18 min)

Plan a chaos experiment to test system resilience. Define the steady state, the hypothesis (e.g., 'killing a pod won't cause 500s'), and the rollback plan.
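
The plan's three parts (steady state, hypothesis, rollback) map naturally onto a small harness. The sketch below is a toy version under stated assumptions: `run_experiment` and `FakeService` are hypothetical names, and the "service" is an in-memory stand-in, so nothing here touches a real cluster.

```python
def run_experiment(check_steady_state, inject_fault, rollback, probes=20):
    """Run a chaos experiment and report whether the hypothesis held.

    All three callables are supplied by the caller.
    """
    if not check_steady_state():
        return "aborted: system not in steady state before injection"
    inject_fault()
    try:
        # Hypothesis: the steady-state check keeps passing during the fault.
        held = all(check_steady_state() for _ in range(probes))
    finally:
        rollback()  # always restore, even if a probe raises
    return "hypothesis held" if held else "hypothesis rejected"


class FakeService:
    """Toy stand-in for replicas behind a load balancer."""

    def __init__(self, replicas=3):
        self.healthy = [True] * replicas

    def request_ok(self):
        # The balancer succeeds as long as one healthy replica remains.
        return any(self.healthy)

    def kill_one(self):
        self.healthy[0] = False

    def restore(self):
        self.healthy = [True] * len(self.healthy)


svc = FakeService()
result = run_experiment(svc.request_ok, svc.kill_one, svc.restore)
print(result)
```

Checking the steady state before injecting the fault matters: if the system is already unhealthy, the experiment cannot attribute failures to the fault, so the harness aborts instead.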
