🚨 Runbook: Incident Response

Mode: NEXUS-Micro | Duration: Minutes to hours | Agents: 3-8


Scenario

Something is broken in production. Users are affected. Speed of response matters, but so does doing it right. This runbook covers detection through post-mortem.

Severity Classification

| Level | Definition | Examples | Response Time |
|---|---|---|---|
| P0 — Critical | Service completely down, data loss, security breach | Database corruption, DDoS attack, auth system failure | Immediate (all hands) |
| P1 — High | Major feature broken, significant performance degradation | Payment processing down, 50%+ error rate, 10x latency | < 1 hour |
| P2 — Medium | Minor feature broken, workaround available | Search not working, non-critical API errors | < 4 hours |
| P3 — Low | Cosmetic issue, minor inconvenience | Styling bug, typo, minor UI glitch | Next sprint |
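The table above maps cleanly to a simple lookup. The sketch below is illustrative only; the `Impact` fields and their mapping to severity levels are assumptions about how an agent might encode the triage questions from Step 1.

```python
# Hypothetical sketch: map an incident's observed impact to the
# severity levels in the table above. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Impact:
    service_down: bool          # service completely unavailable
    data_at_risk: bool          # possible data loss or security breach
    major_feature_broken: bool  # e.g. payments down, 50%+ error rate
    workaround_available: bool  # users can route around the problem

def classify(impact: Impact) -> str:
    """Return the P0-P3 level per the severity table."""
    if impact.service_down or impact.data_at_risk:
        return "P0"
    if impact.major_feature_broken:
        return "P1"
    if impact.workaround_available:
        return "P2"
    return "P3"
```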

Response Teams by Severity

P0 — Critical Response Team

| Agent | Role | Action |
|---|---|---|
| Infrastructure Maintainer | Incident commander | Assess scope, coordinate response |
| DevOps Automator | Deployment/rollback | Execute rollback if needed |
| Backend Architect | Root cause investigation | Diagnose system issues |
| Frontend Developer | UI-side investigation | Diagnose client-side issues |
| Support Responder | User communication | Status page updates, user notifications |
| Executive Summary Generator | Stakeholder communication | Real-time executive updates |

P1 — High Response Team

| Agent | Role |
|---|---|
| Infrastructure Maintainer | Incident commander |
| DevOps Automator | Deployment support |
| Relevant Developer Agent | Fix implementation |
| Support Responder | User communication |

P2 — Medium Response

| Agent | Role |
|---|---|
| Relevant Developer Agent | Fix implementation |
| Evidence Collector | Verify fix |

P3 — Low Response

| Agent | Role |
|---|---|
| Sprint Prioritizer | Add to backlog |

Incident Response Sequence

Step 1: Detection & Triage (0-5 minutes)

TRIGGER: Alert from monitoring / User report / Agent detection

Infrastructure Maintainer:
1. Acknowledge alert
2. Assess scope and impact
   - How many users affected?
   - Which services are impacted?
   - Is data at risk?
3. Classify severity (P0/P1/P2/P3)
4. Activate appropriate response team
5. Create incident channel/thread

Output: Incident classification + response team activated

Step 2: Investigation (5-30 minutes)

PARALLEL INVESTIGATION:

Infrastructure Maintainer:
├── Check system metrics (CPU, memory, network, disk)
├── Review error logs
├── Check recent deployments
└── Verify external dependencies

Backend Architect (if P0/P1):
├── Check database health
├── Review API error rates
├── Check service communication
└── Identify failing component

DevOps Automator:
├── Review recent deployment history
├── Check CI/CD pipeline status
├── Prepare rollback if needed
└── Verify infrastructure state

Output: Root cause identified (or narrowed to component)
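The investigation branches above run concurrently rather than in sequence. A minimal sketch of that fan-out, assuming each check is a callable that queries monitoring and returns a labeled finding (the check functions and their return values here are placeholders):

```python
# Hypothetical sketch of the parallel investigation fan-out.
# Each check stands in for a real monitoring/log query.
from concurrent.futures import ThreadPoolExecutor

def check_system_metrics():
    return ("system_metrics", "cpu/memory/network nominal")

def check_error_logs():
    return ("error_logs", "elevated 5xx rate")

def check_recent_deployments():
    return ("deployments", "deploy 14 min before first alert")

def investigate() -> dict:
    """Run all checks in parallel; return {check_name: finding}."""
    checks = [check_system_metrics, check_error_logs, check_recent_deployments]
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(check) for check in checks]
        return dict(f.result() for f in futures)
```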

Step 3: Mitigation (15-60 minutes)

DECISION TREE:

IF caused by recent deployment:
  → DevOps Automator: Execute rollback
  → Infrastructure Maintainer: Verify recovery
  → Evidence Collector: Confirm fix

IF caused by infrastructure issue:
  → Infrastructure Maintainer: Scale/restart/failover
  → DevOps Automator: Support infrastructure changes
  → Verify recovery

IF caused by code bug:
  → Relevant Developer Agent: Implement hotfix
  → Evidence Collector: Verify fix
  → DevOps Automator: Deploy hotfix
  → Infrastructure Maintainer: Monitor recovery

IF caused by external dependency:
  → Infrastructure Maintainer: Activate fallback/cache
  → Support Responder: Communicate to users
  → Monitor for external recovery

THROUGHOUT:
  → Support Responder: Update status page every 15 minutes
  → Executive Summary Generator: Brief stakeholders (P0 only)
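The decision tree above is deterministic enough to encode directly. This is a sketch only; the cause labels and action strings are illustrative shorthand for the branches listed, not a real agent API.

```python
# Hypothetical sketch of the mitigation decision tree above.
# Cause labels and action lists mirror the four IF branches.
def mitigation_plan(cause: str) -> list[str]:
    """Return the ordered mitigation actions for a diagnosed cause."""
    if cause == "recent_deployment":
        return ["DevOps Automator: execute rollback",
                "Infrastructure Maintainer: verify recovery",
                "Evidence Collector: confirm fix"]
    if cause == "infrastructure":
        return ["Infrastructure Maintainer: scale/restart/failover",
                "DevOps Automator: support infrastructure changes",
                "verify recovery"]
    if cause == "code_bug":
        return ["Relevant Developer Agent: implement hotfix",
                "Evidence Collector: verify fix",
                "DevOps Automator: deploy hotfix",
                "Infrastructure Maintainer: monitor recovery"]
    if cause == "external_dependency":
        return ["Infrastructure Maintainer: activate fallback/cache",
                "Support Responder: communicate to users",
                "monitor for external recovery"]
    raise ValueError(f"unknown cause: {cause}")
```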

Step 4: Resolution Verification (Post-fix)

Evidence Collector:
1. Verify the fix resolves the issue
2. Screenshot evidence of working state
3. Confirm no new issues introduced

Infrastructure Maintainer:
1. Verify all metrics returning to normal
2. Confirm no cascading failures
3. Monitor for 30 minutes post-fix

API Tester (if API-related):
1. Run regression on affected endpoints
2. Verify response times normalized
3. Confirm error rates at baseline

Output: Incident resolved confirmation

Step 5: Post-Mortem (Within 48 hours)

Workflow Optimizer leads the post-mortem:

1. Timeline reconstruction
   - When was the issue introduced?
   - When was it detected?
   - When was it resolved?
   - Total user impact duration

2. Root cause analysis
   - What failed?
   - Why did it fail?
   - Why wasn't it caught earlier?
   - 5 Whys analysis

3. Impact assessment
   - Users affected
   - Revenue impact
   - Reputation impact
   - Data impact

4. Prevention measures
   - What monitoring would have caught this sooner?
   - What testing would have prevented this?
   - What process changes are needed?
   - What infrastructure changes are needed?

5. Action items
   - [Action] → [Owner] → [Deadline]
   - [Action] → [Owner] → [Deadline]
   - [Action] → [Owner] → [Deadline]

Output: Post-Mortem Report → Sprint Prioritizer adds prevention tasks to backlog

Communication Templates

Status Page Update (Support Responder)

[TIMESTAMP] — [SERVICE NAME] Incident

Status: [Investigating / Identified / Monitoring / Resolved]
Impact: [Description of user impact]
Current action: [What we're doing about it]
Next update: [When to expect the next update]
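The template above can be rendered mechanically so every update keeps the same shape. A minimal sketch using `str.format`; the field names simply mirror the bracketed placeholders and are an assumption about how an agent would fill them:

```python
# Hypothetical sketch: render the status-page template above.
# Placeholder names mirror the bracketed fields in the template.
STATUS_TEMPLATE = (
    "{timestamp} — {service} Incident\n"
    "\n"
    "Status: {status}\n"
    "Impact: {impact}\n"
    "Current action: {action}\n"
    "Next update: {next_update}"
)

def render_status(**fields) -> str:
    """Fill the status-page template; raises KeyError if a field is missing."""
    return STATUS_TEMPLATE.format(**fields)
```

Keeping the template as a single constant means the Support Responder emits identically structured updates every 15 minutes, which makes diffs between consecutive updates easy to scan.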

Executive Update (Executive Summary Generator — P0 only)

INCIDENT BRIEF — [TIMESTAMP]

SITUATION: [Service] is [down/degraded] affecting [N users/% of traffic]
CAUSE: [Known/Under investigation] — [Brief description if known]
ACTION: [What's being done] — ETA [time estimate]
IMPACT: [Business impact — revenue, users, reputation]
NEXT UPDATE: [Timestamp]

Escalation Matrix

| Condition | Escalate To | Action |
|---|---|---|
| P0 not resolved in 30 min | Studio Producer | Additional resources, vendor escalation |
| P1 not resolved in 2 hours | Project Shepherd | Resource reallocation |
| Data breach suspected | Legal Compliance Checker | Regulatory notification assessment |
| User data affected | Legal Compliance Checker + Executive Summary Generator | GDPR/CCPA notification |
| Revenue impact > $X | Finance Tracker + Studio Producer | Business impact assessment |
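The time-based rows of the matrix lend themselves to an automatic check the incident commander can poll. A sketch under the assumption that only the two duration-triggered rules are automated (the condition-triggered rows need human judgment); the structure and names are illustrative:

```python
# Hypothetical sketch of the time-based escalation rules above.
# Thresholds come from the matrix; names are illustrative.
from datetime import timedelta

ESCALATION_DEADLINES = {
    "P0": (timedelta(minutes=30), "Studio Producer"),
    "P1": (timedelta(hours=2), "Project Shepherd"),
}

def escalation_target(severity: str, elapsed: timedelta):
    """Return who to escalate to, or None if no time-based rule has fired."""
    rule = ESCALATION_DEADLINES.get(severity)
    if rule and elapsed >= rule[0]:
        return rule[1]
    return None
```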