PRACTICAL CHEATSHEET

25 High-Signal Prompts for AI SRE (Copy & Paste Ready)

Production-tested prompts for logs, metrics, topology, and deploy context, organized by task

By Shafi Khan · August 1, 2025 · 12 min read

The difference between a useful AI SRE and a frustrating one comes down to one thing: prompt quality.

After analyzing 10,000+ incident conversations with AutonomOps AI, we've distilled the 25 highest-signal prompts that consistently produce actionable results. These aren't generic "explain this error" prompts; they're battle-tested, context-rich queries that SREs actually use in production.

Bookmark this page. Copy these prompts. Customize them for your stack. They'll save you hours during the next 2am incident.

💡 How to Use This Cheatsheet

  1. Fill in the [BRACKETS] with your specific context (service name, time range, etc.)
  2. Add system context if your AI SRE supports it (e.g., attach topology, recent deploys)
  3. Iterate on the response: follow up with "dig deeper into X" or "show me the code"
  4. Save your favorites as runbook snippets for faster access

🔍 Root Cause Analysis (RCA) Prompts

Use these when you need to trace an issue back to its origin fast.

1️⃣ Cross-Signal RCA

Analyze the root cause of the [SERVICE_NAME] incident that started at [TIMESTAMP].
Context:
- Error: [ERROR_MESSAGE_OR_ALERT_NAME]
- Symptoms: [e.g., "p99 latency spiked from 200ms to 5s"]
- Affected users: [e.g., "10% of US-East traffic"]

Cross-reference:
1. Logs from [SERVICE_NAME] and its dependencies
2. Metrics (CPU, memory, network, custom) for the same time window
3. Recent deployments or config changes in the last 30 minutes

Output:
- Most likely root cause (with confidence %)
- Evidence from each signal type (logs, metrics, traces)
- Dependency chain (what broke what?)

Why it works: Forces the AI to correlate across multiple data sources instead of just reading logs.

2️⃣ Change-Triggered RCA

What changed in [SERVICE_NAME] in the 60 minutes before [INCIDENT_TIME]?
Check:
- Git commits to [REPO_NAME]
- Kubernetes deployments (new pods, config changes)
- Feature flag changes
- Database migrations
- Third-party API updates (if tracked)

For each change, assess:
- Blast radius (how many services affected?)
- Rollback feasibility (can we revert quickly?)
- Previous incidents related to this change

Use case: Incidents that start right after a deployment - this finds the smoking gun.

3️⃣ Silent Failure Detection

Analyze [SERVICE_NAME] for silent failures in the last [TIME_RANGE]:
Look for:
- Errors logged but not alerted (severity = ERROR but no PagerDuty)
- Metrics degrading slowly (p95 latency creeping up by 10% over 7 days)
- Increased retry rates (HTTP 429, 503 from dependencies)
- Circuit breakers opening without incidents filed

Prioritize by:
1. User-facing impact (did customers notice?)
2. Trend (getting worse or stable?)
3. Blast radius (how many services affected?)

Why it works: Catches issues before they become incidents. Great for proactive SRE.

4️⃣ Dependency Chain RCA

[SERVICE_NAME] is failing. Trace the dependency chain:
1. What services does [SERVICE_NAME] call?
2. For each dependency, check:
   - Response times (are they slow?)
   - Error rates (5xx, timeouts)
   - Health check status
3. Which dependency is the weakest link?
4. Is the failure cascading (are other services now also failing)?

Output:
- Root service causing the issue
- Propagation path (how it spread)
- Blast radius visualization (services affected)

Use case: Microservices nightmares. Find which service started the domino effect.

5️⃣ Historical Pattern Match

Compare the current [SERVICE_NAME] incident to similar historical incidents:
Current symptoms:
- [ERROR_SIGNATURE]
- [METRIC_ANOMALY]

Search past 90 days for:
- Same error signatures
- Similar metric patterns (e.g., "memory leak over 2 hours")
- Same time-of-day (if relevant)

For matches, show:
- What fixed it last time?
- How long did resolution take?
- Did the fix hold (or did it recur)?

Why it works: Repeatable incidents = repeatable solutions. This finds your "known fix."

Blast Radius & Impact Assessment

6️⃣ Predict Downstream Impact

If [SERVICE_NAME] goes down completely, what's the blast radius?
Use topology graph to identify:
1. Direct consumers (services that call [SERVICE_NAME])
2. Indirect consumers (services 2-3 hops away)
3. Critical paths (does this break checkout? login? core workflow?)

For each impacted service:
- Can it degrade gracefully? (fallback, cached data)
- What's the user-facing symptom? ("checkout fails" vs "slower search")
- Priority (P0 = revenue-critical, P1 = UX degraded, P2 = background jobs)

Use case: Before you restart a service, know what else will break.
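
If you want to see what the topology walk behind this prompt looks like, here's a minimal sketch in Python. The CONSUMERS map and the service names are hypothetical stand-ins for whatever your topology graph or tracing system actually exposes.

```python
from collections import deque

# Hypothetical dependency map: service -> services that call it (its consumers).
# In practice this comes from your topology graph or tracing data.
CONSUMERS = {
    "payment-service": ["checkout-service", "billing-worker"],
    "checkout-service": ["web-frontend"],
    "billing-worker": [],
    "web-frontend": [],
}

def blast_radius(service, max_hops=3):
    """Breadth-first walk over consumers, recording how many hops away each is."""
    seen = {service: 0}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        if seen[current] == max_hops:
            continue
        for consumer in CONSUMERS.get(current, []):
            if consumer not in seen:
                seen[consumer] = seen[current] + 1
                queue.append(consumer)
    return {svc: hops for svc, hops in seen.items() if svc != service}

print(blast_radius("payment-service"))
# {'checkout-service': 1, 'billing-worker': 1, 'web-frontend': 2}
```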

7️⃣ User Impact Quantification

Quantify the user impact of the [SERVICE_NAME] incident at [TIMESTAMP]:
Metrics to calculate:
- % of requests failing (error rate)
- % of users affected (unique user IDs in error logs)
- Geography breakdown (which regions?)
- Revenue impact (if this is checkout/payment: failed_transactions * avg_order_value)

Compare to baseline:
- Normal error rate = [X]%
- Current error rate = [Y]%
- Severity multiplier = [Y/X]x worse than normal

Why it works: Turns vague alerts into business impact - helps prioritize incidents.
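
The arithmetic this prompt asks for is simple enough to sanity-check by hand. A rough sketch with made-up numbers (nothing here comes from real telemetry):

```python
# Back-of-the-envelope impact math from this prompt, with made-up numbers.
failed_requests = 4_200
total_requests = 60_000
failed_transactions = 310
avg_order_value = 42.50        # assumption: pulled from analytics, not real data

error_rate = failed_requests / total_requests            # 0.07 -> 7%
baseline_error_rate = 0.004                              # "normal" is 0.4%
severity_multiplier = error_rate / baseline_error_rate   # 17.5x worse than normal
revenue_at_risk = failed_transactions * avg_order_value  # $13,175.00

print(f"error rate: {error_rate:.1%} ({severity_multiplier:.1f}x baseline)")
print(f"revenue at risk: ${revenue_at_risk:,.2f}")
```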

8️⃣ Rollback Safety Check

Before rolling back [SERVICE_NAME] from version [NEW_VERSION] to [OLD_VERSION]:
Safety checks:
1. Database migrations: Did [NEW_VERSION] add/modify schemas? (rollback may fail)
2. Feature flags: Are any flags tied to [NEW_VERSION] logic?
3. Downstream services: Do they expect new API contracts from [NEW_VERSION]?
4. Data consistency: Will old code handle new data formats?

Risk assessment:
- Low risk: Pure code change, no schema/API changes
- Medium risk: New API fields (but backward-compatible)
- High risk: Breaking changes (DO NOT auto-rollback)

Use case: Before you hit "rollback," make sure you won't make things worse.

Performance Optimization

9️⃣ Slow Query Hunter

Find the slowest database queries in [SERVICE_NAME] for [TIME_RANGE]:
Sort by:
1. P99 latency (worst-case user experience)
2. Frequency (how often it's called)
3. Total time spent (latency * frequency = biggest bottleneck)

For top 5 queries:
- Show EXPLAIN plan (if Postgres/MySQL)
- Missing indexes?
- N+1 query pattern?
- Can we cache the result?
- Can we paginate/batch?

Why it works: Focuses on queries that actually matter (not just "slow," but "slow AND frequent").
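
The "latency * frequency" ranking is the key idea; the query that looks slowest is often not the biggest bottleneck. A small illustration with fabricated stats (in practice you'd pull these from pg_stat_statements or your APM, and mean latency is a better proxy for total time than p99):

```python
# Rank queries by total time spent (latency * frequency), not latency alone.
# Stats below are fabricated; pull real ones from pg_stat_statements or your APM.
queries = [
    {"sql": "SELECT * FROM orders WHERE user_id = %s", "avg_ms": 35,   "calls_per_min": 900},
    {"sql": "SELECT ... FROM reports ...",             "avg_ms": 2200, "calls_per_min": 2},
    {"sql": "SELECT * FROM carts WHERE id = %s",       "avg_ms": 120,  "calls_per_min": 400},
]

for q in queries:
    q["total_ms_per_min"] = q["avg_ms"] * q["calls_per_min"]

# The "slowest" query (the 2.2s report) is NOT the biggest bottleneck here.
for q in sorted(queries, key=lambda q: q["total_ms_per_min"], reverse=True)[:5]:
    print(f'{q["total_ms_per_min"]:>6} ms/min  {q["sql"]}')
```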

🔟 Memory Leak Detective

[SERVICE_NAME] memory is growing unbounded. Diagnose:
1. Check memory metrics over last 24h:
   - Steady climb (classic leak)
   - Sawtooth pattern (GC working hard but not enough)
   - Sudden spike (memory bomb)
2. Correlate with:
   - Request rate (does memory grow with traffic?)
   - Specific endpoints (is one handler leaking?)
   - Cache sizes (unbounded cache?)
3. Suggest fixes:
   - Heap dump analysis (for JVM/Node)
   - Profiling tools (pprof for Go, py-spy for Python)
   - Known leak patterns (event listeners, global arrays)

Use case: Stop restarting pods every 6 hours. Find the actual leak.
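
To tell a steady climb from noise, a plain least-squares slope over recent memory samples is usually enough before reaching for a heap dump. A sketch with fabricated hourly RSS readings; the 5 MB/hour threshold is an arbitrary placeholder:

```python
# Rough leak check: fit a line to recent memory samples and see whether the
# slope is meaningfully positive. Samples are fabricated hourly RSS readings (MB).
samples = [512, 540, 566, 601, 628, 655, 690, 721]

n = len(samples)
xs = range(n)
mean_x = sum(xs) / n
mean_y = sum(samples) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / sum(
    (x - mean_x) ** 2 for x in xs
)  # MB per hour

print(f"memory growth ~{slope:.1f} MB/hour")
if slope > 5:  # threshold is an arbitrary placeholder; tune per service
    print("steady climb -> suspect a leak, grab a heap dump / pprof profile")
```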

1️⃣1️⃣ Caching Opportunity Finder

Identify caching opportunities in [SERVICE_NAME]:
Look for:
1. Repeated DB queries with same params (cache candidates)
2. Expensive computations called frequently (memoization?)
3. External API calls with high latency (cache responses?)

For each candidate:
- Frequency: [X] calls/minute
- Avg latency: [Y]ms
- Cache hit ratio potential: [Z]% (based on param distribution)
- Estimated latency reduction: [Y]ms → [Y/10]ms
- TTL recommendation: [based on data freshness needs]

Why it works: Finds "free" performance wins. Cache what's repeated.
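
For the "repeated call, same params" candidates, the fix is often a few lines. A minimal TTL memoizer sketch, assuming an in-process cache is acceptable (for anything shared across pods you'd reach for Redis or similar); get_exchange_rate is a hypothetical expensive call:

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds):
    """Tiny in-process TTL memoizer for the 'same params, repeated call' case."""
    def decorator(fn):
        store = {}
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and now - hit[0] < ttl_seconds:
                return hit[1]                 # cache hit, skip the expensive call
            value = fn(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=30)
def get_exchange_rate(currency):              # hypothetical slow external API call
    print(f"fetching rate for {currency}...")
    return 1.08

get_exchange_rate("EUR")   # fetches
get_exchange_rate("EUR")   # served from cache for the next 30 seconds
```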

Capacity Planning & Forecasting

1️⃣2️⃣ Predict Resource Exhaustion

When will [SERVICE_NAME] run out of [RESOURCE: CPU/memory/disk]?
Data:
- Current usage: [X]%
- Growth rate: [analyze last 30 days for trend]
- Capacity limit: [MAX_RESOURCE]

Calculate:
- Days until exhaustion (if trend continues)
- Recommended scaling action:
  * Vertical: increase pod limits from [X] to [Y]
  * Horizontal: add [N] more replicas
  * Optimization: [if growth is inefficient, suggest fixes]

Alert me when usage hits [THRESHOLD]% (e.g., 70%)

Use case: Avoid surprise outages. Scale BEFORE you hit limits.
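
The projection behind this prompt is just a linear extrapolation, which is worth knowing so you can sanity-check the AI's answer. A sketch with placeholder numbers:

```python
# Linear extrapolation: at the current growth rate, when do we hit the ceiling?
# All numbers are placeholders.
current_usage_pct = 62.0
growth_pct_per_day = 0.8          # derived from the last 30 days of metrics
alert_threshold_pct = 70.0

days_to_threshold = (alert_threshold_pct - current_usage_pct) / growth_pct_per_day
days_to_exhaustion = (100.0 - current_usage_pct) / growth_pct_per_day

print(f"hits {alert_threshold_pct:.0f}% alert threshold in ~{days_to_threshold:.0f} days")
print(f"hits 100% in ~{days_to_exhaustion:.0f} days -> scale before then")
```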

1️⃣3️⃣ Traffic Spike Preparedness

Prepare [SERVICE_NAME] for [EVENT: Black Friday / product launch / marketing campaign]:
Expected traffic: [X]x baseline (e.g., 10x normal)

Pre-flight checklist:
1. Load test at [X]x traffic—does it hold?
2. Database: Connection pool sized correctly? (current: [N], recommended: [M])
3. Rate limits: Will they trigger at [X]x load? (adjust thresholds)
4. Auto-scaling: Max replicas set high enough? (current: [MAX], needed: [ESTIMATED])
5. Failover: If [SERVICE_NAME] fails, what breaks? (blast radius check)

Output: GO/NO-GO decision + remediation steps for any red flags

Why it works: Turns vague "get ready for launch" into specific action items.

1️⃣4️⃣ Cost Optimization Analysis

Find cost optimization opportunities in [SERVICE_NAME]:
1. Over-provisioned resources:
   - Pods with CPU/memory usage <30% for last 7 days
   - Right-size recommendations (save [$X]/month)
2. Idle resources:
   - Load balancers with near-zero traffic
   - Databases with no connections (last 24h)
3. Storage waste:
   - Old logs/backups not accessed in 90 days
   - Unattached EBS volumes
4. Rate limit opportunities:
   - External API calls (can we reduce frequency?)

Estimated savings: [$TOTAL]/month

Use case: The CFO asked "where can we cut cloud costs?" Here's your answer.

Security & Compliance

1️⃣5️⃣ Credential Leak Detector

Scan [SERVICE_NAME] logs for potential credential leaks:
Search patterns:
- AWS keys (AKIA[0-9A-Z]{16})
- API tokens (Bearer [a-zA-Z0-9_-]+)
- Database passwords in connection strings
- Private keys (-----BEGIN PRIVATE KEY-----)

For each match:
- Log line (redacted)
- Timestamp
- Service/pod that logged it
- Severity (HIGH if prod, MEDIUM if staging)

Action items:
- Rotate compromised credentials immediately
- Add redaction rules to logger
- Scan git history for accidental commits

Why it works: Catches leaks before they hit GitHub/CloudWatch.
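
The search patterns in this prompt translate directly into regexes you can also run yourself in a log pipeline or pre-commit hook. A minimal sketch (patterns are deliberately simple; dedicated scanners like gitleaks or trufflehog cover far more credential formats):

```python
import re

# The same search patterns expressed as regexes. Deliberately simplified.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "bearer_token":   re.compile(r"Bearer\s+[A-Za-z0-9_\-.]+"),
    "private_key":    re.compile(r"-----BEGIN (?:RSA )?PRIVATE KEY-----"),
}

def scan_line(line):
    """Return the names of any credential patterns found in a log line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]

log_line = "2025-06-20T14:02:11Z DEBUG calling s3 with key AKIAABCDEFGHIJKLMNOP"
print(scan_line(log_line))   # ['aws_access_key']
```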

1️⃣6️⃣ Anomalous Access Patterns

Detect unusual access patterns in [SERVICE_NAME] for [TIME_RANGE]:
Baseline (normal behavior):
- [USER_ID] typically accesses [N] records/day
- From [GEO_LOCATION]
- Between [TIME_WINDOW]

Current anomalies:
- 10x more API calls than usual
- New geo (accessing from [UNEXPECTED_COUNTRY])
- Off-hours access (2am when user usually inactive)

Risk score:
- Low: Minor deviation (50% more traffic, but same geo/time)
- Medium: 2+ anomalies (more traffic + new geo)
- High: 3+ anomalies (more traffic + new geo + off-hours)

Action: Alert SOC team for HIGH risk users

Use case: Spot compromised accounts or data exfiltration attempts.
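
The risk tiers here boil down to counting independent anomalies. A sketch of that scoring, with the 2x traffic threshold as an assumption you'd tune against each user's own baseline:

```python
def access_risk(traffic_ratio, new_geo, off_hours):
    """Count independent anomalies; the tiers mirror the prompt above."""
    anomalies = 0
    if traffic_ratio >= 2.0:     # assumption: 2x the user's own baseline counts
        anomalies += 1
    if new_geo:
        anomalies += 1
    if off_hours:
        anomalies += 1
    return {0: "LOW", 1: "LOW", 2: "MEDIUM"}.get(anomalies, "HIGH")

print(access_risk(traffic_ratio=10.0, new_geo=True, off_hours=True))    # HIGH
print(access_risk(traffic_ratio=1.5, new_geo=False, off_hours=False))   # LOW
```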

Deployment Safety

1️⃣7️⃣ Pre-Deploy Risk Assessment

Before deploying [SERVICE_NAME] version [NEW_VERSION]:
Risk factors:
1. Code churn: [X] lines changed (>1000 = HIGH RISK)
2. Critical paths modified: Does this touch checkout/auth/payment?
3. Database changes: Any migrations? (schema changes = MEDIUM RISK)
4. Dependency updates: New library versions? (check CVEs)
5. Test coverage: What % of new code is tested?

Recent incidents:
- Last deploy incident: [X] days ago (if <7 days = caution)
- Failed deployments: [N] in last 30 days

Recommendation:
- Low risk: Deploy to prod (with canary)
- Medium risk: Extended staging soak (24h)
- High risk: Deploy Friday? NO. Wait until Monday.

Why it works: Turns gut feeling into data-driven deploy decision.
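
If you want the same scorecard as a deterministic pre-deploy gate (in CI, for example), a blunt sketch looks like this; the weights and thresholds mirror the prompt but are starting points, not policy:

```python
def deploy_risk(lines_changed, has_migration, touches_critical_path,
                days_since_last_deploy_incident):
    """Blunt scorecard; weights and thresholds mirror the prompt, tune to taste."""
    score = 0
    if lines_changed > 1000:
        score += 2
    if has_migration:
        score += 1
    if touches_critical_path:            # checkout / auth / payment
        score += 2
    if days_since_last_deploy_incident < 7:
        score += 1
    if score >= 4:
        return "HIGH: extended staging soak, no Friday deploys"
    if score >= 2:
        return "MEDIUM: canary + extra monitoring"
    return "LOW: standard canary rollout"

print(deploy_risk(lines_changed=1800, has_migration=True,
                  touches_critical_path=False, days_since_last_deploy_incident=30))
# MEDIUM: canary + extra monitoring
```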

1️⃣8️⃣ Canary Health Check

Monitor [SERVICE_NAME] canary deployment (version [NEW_VERSION]):
Compare canary vs baseline (stable version):
1. Error rate: canary = [X]%, baseline = [Y]%
2. Latency: canary p99 = [A]ms, baseline p99 = [B]ms
3. Resource usage: canary CPU = [C]%, baseline CPU = [D]%

Decision criteria:
- PASS: Error rate within 2% of baseline, latency <10% worse
- WARN: Error rate 2-5% higher → hold rollout, investigate
- FAIL: Error rate >5% higher → auto-rollback

Current verdict: [PASS/WARN/FAIL]
If FAIL: [Auto-rollback triggered] + [Link to error logs]

Use case: Automate canary analysis. Don't babysit Grafana dashboards.
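
The PASS/WARN/FAIL criteria are easy to encode if you'd rather have a hard gate alongside the AI's judgment. A sketch using the thresholds from the prompt:

```python
def canary_verdict(canary_err_pct, baseline_err_pct, canary_p99_ms, baseline_p99_ms):
    """Encode the PASS/WARN/FAIL thresholds from the prompt."""
    err_delta = canary_err_pct - baseline_err_pct            # percentage points
    latency_regression = (canary_p99_ms - baseline_p99_ms) / baseline_p99_ms
    if err_delta > 5:
        return "FAIL"    # trigger auto-rollback
    if err_delta > 2 or latency_regression > 0.10:
        return "WARN"    # hold rollout, investigate
    return "PASS"

print(canary_verdict(canary_err_pct=0.9, baseline_err_pct=0.4,
                     canary_p99_ms=210, baseline_p99_ms=200))   # PASS
```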

🧠 Advanced / Strategic Prompts

1️⃣9️⃣ Generate Runbook from Incident

Convert the resolved incident [INCIDENT_ID] into a runbook:
Incident summary:
- Root cause: [X]
- Resolution: [Y]
- Time to resolution: [Z] minutes

Runbook structure:
1. Symptoms: "How do I know this is happening?"
   - Alert name: [ALERT_NAME]
   - Symptoms: [e.g., "p99 latency >2s, error logs show 'connection refused'"]
2. Diagnosis: "How do I confirm?"
   - Commands: [kubectl logs, curl health check, etc.]
3. Resolution: "How do I fix it?"
   - Step-by-step: [restart pods, clear cache, rollback, etc.]
   - Rollback plan: [if fix doesn't work]
4. Prevention: "How do I stop this recurring?"
   - [Add alert, increase timeout, fix code]

Save to: /runbooks/[SERVICE_NAME]-[ISSUE_TYPE].md

Why it works: Turns tribal knowledge into documented, repeatable fixes.

2️⃣0️⃣ SLO Burn Rate Analysis

Analyze SLO burn rate for [SERVICE_NAME]:
SLO: [99.9% availability = 43 minutes downtime/month allowed]

Current burn rate:
- Downtime so far this month: [X] minutes
- Days into month: [Y]
- Projected downtime: [X * (30/Y)] minutes
- Error budget remaining: [43 - X] minutes

Risk level:
- GREEN: <50% error budget used
- YELLOW: 50-80% used (slow down deploys)
- RED: >80% used (freeze all changes except hotfixes)

Current: [GREEN/YELLOW/RED]
If RED: [Generate incident postmortem + action items to improve reliability]

Use case: Track SLO health in real time. Know when to hit the brakes.
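
The burn-rate math is worth having at your fingertips so you can verify what the AI reports. A sketch for a 99.9% monthly SLO with example numbers:

```python
# Burn-rate arithmetic for a 99.9% monthly availability SLO. Example numbers only.
minutes_in_month = 30 * 24 * 60                # 43,200
error_budget_min = minutes_in_month * 0.001    # ~43.2 minutes of allowed downtime

downtime_so_far = 10.0    # minutes of downtime this month
day_of_month = 12

projected_downtime = downtime_so_far * (30 / day_of_month)
budget_used_pct = downtime_so_far / error_budget_min * 100

status = "GREEN" if budget_used_pct < 50 else "YELLOW" if budget_used_pct <= 80 else "RED"
print(f"used {budget_used_pct:.0f}% of the error budget, "
      f"projected {projected_downtime:.0f} min by month end -> {status}")
```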

2️⃣1️⃣ Chaos Engineering Recommendation

Suggest chaos experiments for [SERVICE_NAME]:
Based on:
- Recent incidents (what failed?)
- Dependency graph (what's critical?)
- SLO targets (where can't we afford failure?)

Recommended experiments:
1. [HIGH PRIORITY] Simulate [DEPENDENCY_NAME] timeout
   - Hypothesis: Service degrades gracefully (falls back to cache)
   - Blast radius: [ESTIMATED]
   - Rollback: [IMMEDIATE via kill switch]
2. [MEDIUM] Kill random pod (test K8s resilience)
3. [LOW] Inject network latency (test timeout handling)

Start with: Experiment #1 in [STAGING] environment
Success criteria: [Service stays within SLO during experiment]

Why it works: Data-driven chaos engineering tests what actually matters.

2️⃣2️⃣ Multi-Cloud Failover Readiness

Test [SERVICE_NAME] multi-cloud failover readiness:
Primary: [AWS us-east-1]
Failover: [GCP us-central1]

Checklist:
1. DNS: Is failover target in Route53/CloudDNS? (TTL = [X]s)
2. Data sync: How stale is failover DB? (replication lag = [Y]s)
3. Config: Are env vars / secrets synced across clouds?
4. Load test: Has failover environment been tested at production load?
5. Runbook: Do we have a tested failover procedure?

Red flags:
- [ ] Replication lag >5 minutes
- [ ] Failover environment never load-tested
- [ ] No documented failover runbook

Action items: [List blockers before failover is viable]

Use case: AWS region outage? Know if your failover plan will actually work.

2️⃣3️⃣ Incident Pattern Analysis

Analyze all incidents for [SERVICE_NAME] in last 90 days:
Group by:
- Root cause category (code bug, infra, third-party, config)
- Time-to-detection (how long until we knew?)
- Time-to-resolution (how long to fix?)
- Recurrence (did same issue happen before?)

Top 3 incident patterns:
1. [e.g., "OOM kills after deploy" - 5 incidents]
   - Fix: [Increase memory limits + add pre-deploy load test]
2. [e.g., "Redis timeout during traffic spikes" - 3 incidents]
   - Fix: [Connection pooling + circuit breaker]
3. [e.g., "Stale cache causing bad data" - 2 incidents]
   - Fix: [Add cache versioning + invalidation hooks]

Estimated MTTR reduction: [X]% if we fix top 3 patterns

Why it works: Stop firefighting the same incidents. Fix the root patterns.
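
The grouping step is easy to reproduce from an incident export, which is a nice way to cross-check the AI's pattern list. A sketch assuming incidents come out of your tracker as dicts (field names are hypothetical):

```python
from collections import Counter

# Assumes incidents are exported from your tracker as dicts; field names are
# hypothetical, data is illustrative.
incidents = [
    {"cause": "OOM kill after deploy",  "resolution_min": 45},
    {"cause": "OOM kill after deploy",  "resolution_min": 38},
    {"cause": "Redis timeout on spike", "resolution_min": 25},
    {"cause": "OOM kill after deploy",  "resolution_min": 51},
]

by_cause = Counter(i["cause"] for i in incidents)
for cause, count in by_cause.most_common(3):
    total_min = sum(i["resolution_min"] for i in incidents if i["cause"] == cause)
    print(f"{count}x  {cause}  ({total_min} min of resolution time to reclaim)")
```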

2️⃣4️⃣ Observability Gap Finder

Find observability blind spots in [SERVICE_NAME]:
Check:
1. Missing metrics:
   - Are all HTTP endpoints instrumented? (any 404s in logs but no metrics?)
   - Database queries tracked? (slow query logs vs metrics)
   - External API calls tracked? (third-party timeouts visible?)
2. Missing logs:
   - Error paths logged? (catch blocks without logging)
   - Security events logged? (login attempts, permission denials)
3. Missing traces:
   - Cross-service calls traced? (can we see full request path?)

For each gap:
- Impact: [e.g., "Can't debug checkout failures" = HIGH]
- Effort: [e.g., "Add 1 line of code" = LOW]
- Priority: [HIGH/MEDIUM/LOW]

Quick wins: [List LOW effort + HIGH impact items]

Use case: "We don't know what we don't know." This finds the blind spots.

2️⃣5️⃣ On-Call Pain Point Analysis

Analyze on-call pain points for [TEAM_NAME]:
Data (last 30 days):
- Total pages: [N]
- Pages requiring action: [M] (the rest = noise)
- Avg time to acknowledge: [X] minutes
- Avg time to resolution: [Y] minutes
- Pages outside business hours: [Z]%

Top noise sources:
1. [Alert name] - [N] false positives
   - Fix: [Adjust threshold / add context / auto-resolve]
2. [Alert name] - [M] repeat pages for same incident
   - Fix: [Deduplicate alerts / correlation rules]

Toil opportunities:
- [X]% of incidents have known runbooks → can we auto-remediate?
- [Y]% of pages = "restart pod" → auto-healing?

Goal: Reduce pages by [Z]% through automation + better alerting

Why it works: Quantifies on-call pain and builds the business case for AI SRE.

Pro Tips for Maximum Prompt Effectiveness

1. Be Specific with Time Ranges

❌ Bad: "Find errors in payment-service"

✅ Good: "Find errors in payment-service between 2024-06-20 14:00 and 14:15 UTC"

2. Include Baseline for Comparison

❌ Bad: "Is payment-service slow?"

✅ Good: "Payment-service p99 latency is 2s. Normal is 200ms. Why?"

3. Ask for Actionable Outputs

❌ Bad: "Analyze this error"

✅ Good: "Analyze this error + give me 3 fix options ranked by risk"

4. Chain Prompts for Deep Dives

Start broad → narrow down

"Find root cause" → "Dig deeper into Redis timeout" → "Show me query patterns"

5. Attach Topology Context

If your AI SRE supports it:

"Use topology graph to trace [service] dependencies during incident"

6. Save Winning Prompts as Templates

Once a prompt works well:

Save it to your runbook with [VARIABLE] placeholders for quick reuse

Key Takeaways

  • Specificity wins: Include service names, time ranges, and baselines in every prompt
  • Cross-signal prompts (logs + metrics + topology) produce higher-quality RCA than single-signal queries
  • Ask for action items, not just analysis: "what should I do?" gets better results than "what's wrong?"
  • Chain prompts for deep investigations: start broad, then narrow based on findings
  • Save your best prompts as runbook templates for fast reuse during incidents

Try These Prompts with AutonomOps AI

AutonomOps comes with topology context, deploy awareness, and memory, making these prompts even more powerful. Start your free trial and see the difference context-aware AI makes.

About Shafi Khan

Shafi Khan is the founder of AutonomOps AI. He's spent the last decade building SRE teams and automating toil. These 25 prompts are battle-tested from thousands of real incidents across fintech, SaaS, and e-commerce.
