FUTURE OF WORK

The Future of On-Call: From Hero Culture to Collaborative Augmentation

How AI SRE is transforming on-call from burnout-inducing firefighting to sustainable human-AI collaboration

By Shafi Khan · June 20, 2025 · 13 min read

Last week, I talked to an SRE who hadn't slept more than 4 hours in 6 days. Not because of a major outage, but because of "normal" on-call. Ten pages, eight false positives, two "restart the pod" fixes, and one genuine issue that took 3 hours of log diving.

This is the reality of on-call in 2025: we've automated deployments, we've containerized everything, and we've built observability stacks that cost $500k/year, yet we still wake humans up at 2am to restart a pod.

The future of on-call isn't about eliminating humans. It's about reimagining what "on-call" even means when AI can handle 80% of incidents autonomously, leaving humans for the genuinely complex, strategic work. This is the shift from hero culture to collaborative augmentation.

The Problem with "Hero Culture"

Hero culture is when SREs are celebrated for pulling all-nighters to save production. It sounds noble until you realize it's a symptom of broken systems.

The Hidden Costs of Hero Culture

For Individuals:

  • Burnout rate: 60%+ of SREs report exhaustion from on-call
  • Sleep deprivation: Average 2-3 pages per week disrupts sleep cycles
  • Decision fatigue: Tired SREs make mistakes (production mistakes)
  • Career stagnation: Firefighting leaves no time for learning or strategic work

For Organizations:

  • Attrition: 40% of SREs leave within 2 years due to on-call stress
  • Recruiting costs: $150k+ to replace a senior SRE (6-month ramp-up)
  • MTTR creep: Tired SREs = slower incident resolution
  • Innovation tax: Best engineers spend time on toil instead of building

"I joined as an SRE because I wanted to build reliable systems. Instead, I became a human cron job restarting pods, clearing caches, doing the same 5 fixes on repeat. The work wasn't hard. It was just soul-crushing."

— Anonymous SRE, Series B SaaS company

The Augmentation Model: Redefining "On-Call"

AI SRE doesn't replace on-call; it fundamentally changes what on-call is. Instead of humans being the first responders for everything, AI becomes the first line of defense, escalating only when it truly needs human judgment.

Old Model: Human First

  1. Alert fires → PagerDuty wakes human at 2am
  2. Human reads logs, checks dashboards, googles error
  3. Human tries fix (restart pod, clear cache, rollback)
  4. Human monitors for 30 min to confirm fix
  5. Human goes back to sleep (if lucky)

Problem: Human wasted 90 minutes on a 2-minute fix

New Model: AI First

  1. Alert fires → AI SRE immediately starts RCA
  2. AI correlates logs + metrics + topology
  3. AI executes verified fix (if confidence >90%)
  4. AI monitors for regression + posts Slack summary
  5. Human sleeps (only paged if AI can't handle it)

Result: Human only woken for genuinely complex incidents
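To make the flow concrete, here's a minimal sketch of the AI-first triage decision, assuming a hypothetical `Diagnosis` record produced by the RCA step. The names and the 90% threshold mirror step 3 above; none of this is a real AutonomOps API.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """Hypothetical output of the AI's RCA step (logs + metrics + topology)."""
    root_cause: str
    suggested_fix: str
    confidence: float  # 0.0-1.0


def triage(diagnosis: Diagnosis, threshold: float = 0.90) -> str:
    """Decide whether the AI remediates autonomously or pages the human.

    High-confidence, known-good fixes run automatically; anything below
    the threshold escalates to the on-call engineer.
    """
    if diagnosis.confidence >= threshold:
        return f"auto-remediate: {diagnosis.suggested_fix}"
    return "page-human"
```

The single threshold is the whole trust model in miniature: raise it and the AI escalates more, lower it and the AI acts more. The phased rollout described later in this article is essentially a disciplined way of earning the right to lower it.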

The key difference is that AI handles the "toil layer": repeatable, known-good fixes that don't require creativity or judgment. Humans handle the "strategic layer": novel incidents, architectural decisions, and continuous improvement.

Real Teams, Real Transformations

Here's what the shift looks like in practice, based on three teams that adopted AI SRE in 2024-2025:

Case Study 1: E-Commerce (50-person SRE team)

Before AI SRE:

  • 120 pages/week (team-wide)
  • 80% were "restart pod" or "clear cache"
  • Avg MTTR: 45 minutes
  • Attrition: 35% annually

After AI SRE (6 months):

  • 20 pages/week (83% reduction)
  • AI auto-resolved 85% of incidents
  • Avg MTTR: 3 minutes (automated)
  • Attrition: 12% (on par with engineering avg)

What changed: SREs went from "human cron jobs" to strategic roles. The team used saved time to build a chaos engineering program, reducing repeat incidents by 40%.

Case Study 2: Fintech (8-person SRE team)

The Problem:

  • Team too small for 24/7 coverage
  • One-person on-call shifts = no backup
  • Pages during family time → burnout
  • Couldn't hire fast enough (competitive market)

The Solution:

  • AI SRE as "9th team member" (always on)
  • Business-hours on-call only (AI handles nights)
  • Pages dropped from 30/week → 6/week
  • Regained work-life balance → retained team

Key insight: For small teams, AI SRE isn't just a productivity boost; it's existential. It's the difference between sustainable on-call and burning out your entire team.

Case Study 3: SaaS Platform (200+ microservices)

The Complexity Problem:

  • Too many services for humans to understand
  • RCA took 2+ hours (dependency tracing)
  • Junior SREs couldn't handle on-call (too risky)
  • Senior SREs burned out (constant escalations)

The AI Advantage:

  • AI "knows" all 200+ services + dependencies
  • RCA in 30 seconds (topology-aware)
  • Junior SREs can now handle 70% of incidents (with AI assist)
  • Seniors focus on architecture, chaos eng, capacity planning

Unexpected benefit: AI leveled the playing field. Junior SREs became confident on-call participants (with AI safety net), and seniors finally had time to mentor instead of firefight.

What This Means for Your Career as an SRE

Some SREs fear AI will make them obsolete. The reality is the opposite: AI SRE elevates the role from tactical to strategic.

❌ Skills That Matter Less

  • Speed of log grepping
  • Memorizing kubectl commands
  • Manual runbook execution
  • Being awake at 3am

(AI is better at repetitive, speed-based tasks)

✅ Skills That Matter More

  • System design & architecture
  • Root cause thinking (not just fixing symptoms)
  • Building resilient systems (chaos engineering)
  • Teaching AI (runbook creation, approval workflows)

(Humans are better at creativity, judgment, and strategy)

The New SRE Career Path

Junior SRE (0-2 years):

Work with AI on incidents. Learn from AI explanations. Build confidence in on-call with AI safety net.

Mid-Level SRE (2-5 years):

Focus on teaching AI: create runbooks, tune confidence thresholds, review AI decisions. Lead chaos engineering experiments.

Senior SRE (5+ years):

Architect resilient systems. Optimize AI workflows. Mentor juniors. Drive strategic reliability improvements (not just firefighting).

Staff+ SRE:

Define reliability standards across org. Build AI-first SRE practices. Influence product design for operability. No on-call (AI handles it).

The Mental Health Impact: Reclaiming Your Life

Let's talk about the elephant in the room: on-call is terrible for mental health. And it's not just about sleep.

The Invisible Toll of Always-On

Anticipatory Anxiety:

"Even when I'm not on-call, I check my phone obsessively. I can't fully relax because I know the page is coming." This low-grade stress is chronic and exhausting.

Social Impact:

Missing family dinners, leaving dates early, canceling weekend plans. On-call creates a second-class life where you're physically present but mentally absent.

Cognitive Load:

SREs carry a mental map of 50+ services, dependencies, failure modes. Every alert requires reloading this context at 2am. It's like doing a puzzle while half-asleep every week.

How AI SRE Changes This

Immediate Impact:

  • 80% fewer pages = better sleep
  • No more "trivial" 2am wakeups
  • Predictable on-call (only complex incidents)
  • Can actually make weekend plans

Long-Term Impact:

  • Reduced anxiety (AI handles most issues)
  • Career longevity (sustainable pace)
  • Time for skill development, not just firefighting
  • Positive identity shift (strategist, not firefighter)

Practical Roadmap: Transitioning to Augmented On-Call

You can't flip a switch and go from hero culture to AI-first. Here's the realistic path:

1️⃣ Phase 1: Shadow Mode (Month 1)

AI watches incidents but doesn't act. Humans handle everything as usual. AI posts "What I would have done" to Slack.

Goal: Build trust. Let team see AI's RCA quality before automation.

2️⃣ Phase 2: Approval Mode (Month 2-3)

AI suggests fixes ("Restart pod? Approve/Reject"). Humans review + approve. Track approval rate + accuracy.

Goal: Learn which patterns are safe for automation. Tune confidence thresholds.
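Tracking approval rate during Phase 2 can be as simple as counting reviewer decisions per fix pattern. A hedged sketch, with hypothetical sample-size and approval-rate thresholds:

```python
def approval_rate(decisions: list[bool]) -> float:
    """Fraction of AI-suggested fixes that humans approved for one pattern."""
    return sum(decisions) / len(decisions) if decisions else 0.0


def ready_for_autopilot(decisions: list[bool],
                        min_samples: int = 20,
                        min_rate: float = 0.95) -> bool:
    """A fix pattern graduates to auto-remediation only after enough
    reviewed suggestions at a high approval rate. The 20-sample and
    95% figures here are illustrative, not prescriptive."""
    return len(decisions) >= min_samples and approval_rate(decisions) >= min_rate
```

The point of gating on sample size as well as rate: five approvals out of five tells you much less than ninety-five out of a hundred.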

3️⃣ Phase 3: Auto-Pilot (Low Risk) (Month 4-5)

Enable auto-remediation for proven patterns: restart pod, clear cache, scale replicas. High-risk actions still require approval.

Goal: Reduce pages by 50%. Measure regression rate (should be <2%).
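One common way to scope Phase 3 is an explicit allowlist of proven low-risk actions; anything not on the list keeps requiring human approval. A minimal sketch (the action names are illustrative):

```python
# Hypothetical Phase 3 allowlist: only proven, low-risk remediations
# may run autonomously. High-risk actions (rollbacks, schema changes,
# etc.) still page a human for approval.
LOW_RISK_ACTIONS = {"restart_pod", "clear_cache", "scale_replicas"}


def requires_approval(action: str) -> bool:
    """True if this remediation must wait for a human before executing."""
    return action not in LOW_RISK_ACTIONS
```

Starting from a deny-by-default allowlist, rather than a blocklist of known-dangerous actions, is what keeps an early mistake from costing the team's trust.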

4️⃣ Phase 4: Full Augmentation (Month 6+)

AI handles 80%+ of incidents autonomously. Humans only paged for novel/high-risk incidents. SREs shift focus to proactive work (chaos eng, capacity planning).

Goal: Sustainable on-call. Team satisfaction >8/10. Attrition drops.

Pro tip: Don't rush Phase 3. The biggest mistake teams make is enabling auto-remediation too early, before trust is built. If AI makes a mistake in Phase 1-2, no big deal. If it makes a mistake in Phase 3, your team loses confidence and reverts to manual.

🎯 Key Takeaways

  • Hero culture is unsustainable: 60%+ burnout rate, 40% attrition, millions in recruiting costs
  • AI-first on-call means AI handles toil (80% of incidents), humans handle strategy
  • Real teams see 80%+ reduction in pages, MTTR drops from 45 min → 3 min (automated)
  • Your career benefits: Shift from tactical firefighting to strategic reliability engineering
  • Mental health impact: Better sleep, lower anxiety, sustainable career longevity
  • Transition gradually: Shadow mode → Approval mode → Auto-pilot (low risk) → Full augmentation (6 months)

Escape Hero Culture. Reclaim Your Life.

AutonomOps is built for this exact transition: from hero culture to augmented on-call. Start in shadow mode, build trust, then let AI handle the toil while you focus on what matters.


About Shafi Khan

Shafi Khan is the founder of AutonomOps AI. After years of experiencing on-call burnout firsthand, he built AutonomOps to fix the root problem: on-call shouldn't require waking humans for repetitive, known-good fixes.
