Enterprise War Rooms: AI Skills for Live Production Incidents
It's 3am, your phone is screaming, prod is sideways, and you have eight minutes before your CTO joins the bridge with the kind of voice he only uses on calls that get retold at company retreats. You don't have time to think about prompts. You need AI skills for production incidents: war-room playbooks pre-loaded, so the moment you describe what you're seeing, the agent is already drafting a triage plan. That's what this post is about: the four skills I run during live incidents, what each one does well, and the things I will absolutely not let an AI agent do during a fire, no matter how confident it sounds.
The "war room" mental model
An incident war room is the bridge call, the shared doc, the running timeline, and the three people frantically pasting log lines into a Slack channel. Before AI, the bottleneck was always the same: someone had to read 200 lines of log output, correlate it with deploy history, check three dashboards, and form a hypothesis. That part — the synthesis — is what eats your eight-minute window.
AI skills shrink the synthesis to roughly the time it takes to paste your evidence and read the answer. They don't fix the incident for you. They cut the "what is this even" phase from 25 minutes to 3. The remaining work — making the actual call to roll back, scale up, or flip a feature flag — stays with you. That distinction matters. Skills that try to act in production without human approval are how you turn a 30-minute incident into a 6-hour one.
The four skills below are pre-loaded into my war-room context the same way a runbook lives in Confluence. When the pager fires, I switch into a Claude Code session that has all four available, paste whatever I've got, and the right one fires. (For the routing piece, see how skill routing works.)
Scenario 1: Database at 100% CPU
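Symptom: Postgres pegged at 100% CPU, p95 latency on the API up 8x, no recent deploys. Your first instinct is "slow query," but one slow query rarely pegs every core on a healthy box by itself; it more often shows up as I/O wait. And with no deploy, nothing obvious changed the queries. So something else is going on: a changed plan, a changed workload, or a background process.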
Skill that fires: root-cause-debugger. Paste in: pg_stat_activity output, the top 50 entries from pg_stat_statements ordered by total_exec_time, your CPU graph for the last 6 hours, and a one-line description ("CPU pegged, no deploy"). The skill correlates which queries are spiking, when they started, and what their plans look like. It distinguishes "a single bad query" from "background vacuum process gone rogue" from "connection pool exhaustion causing query thrash."
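The pg_stat_statements part of that paste is easy to prepare ahead of time. Here is a minimal sketch, assuming PostgreSQL 13 or later (where the column is named total_exec_time) and assuming you fetch the rows yourself with whatever client you already use. The function name and the sample rows are hypothetical, and this is not how the skill works internally; it only trims the evidence down to something you can paste.

```python
# SQL you would run yourself. Column names assume PostgreSQL 13+,
# where pg_stat_statements exposes total_exec_time.
TOP_STATEMENTS_SQL = """
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 50;
"""


def summarize_statements(rows, limit=5):
    """Rank statements by their share of total execution time.

    `rows` is a list of dicts with the columns selected above.
    """
    total = sum(r["total_exec_time"] for r in rows)
    ranked = sorted(rows, key=lambda r: r["total_exec_time"], reverse=True)
    summary = []
    for r in ranked[:limit]:
        share = r["total_exec_time"] / total if total else 0.0
        summary.append({
            "query": r["query"][:80],  # truncate to keep the paste short
            "calls": r["calls"],
            "share": round(share, 3),
        })
    return summary


# Hypothetical sample rows, for illustration only.
sample = [
    {"query": "SELECT * FROM events WHERE account_id = $1",
     "calls": 120, "total_exec_time": 900.0, "mean_exec_time": 7.5},
    {"query": "SELECT 1",
     "calls": 5000, "total_exec_time": 100.0, "mean_exec_time": 0.02},
]

for line in summarize_statements(sample):
    print(line)
```

A statement that owns most of the total execution time with few calls points at one bad query; an even spread points somewhere else, such as vacuum or pool exhaustion.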
What it produced last time I ran it: identified that a customer with 40x normal data volume had tipped the planner into a sequential scan on a table it used to read through an index, because at that volume the planner's cost estimate favored the scan. Suggested fix: a partial index on the column they were filtering by. Took 4 minutes from paste to diagnosis. I would have eventually figured it out, but probably 20 minutes in, after I'd already started chasing the wrong thread.
What the human still owns: applying the index in prod, deciding whether to do it during the incident or wait until traffic dies down, monitoring whether the fix actually moves the metric. Skills don't push fixes to production.
Scenario 2: Deploy regression
Symptom: error rate jumped from 0.1% to 4% exactly nine minutes ago. There WAS a deploy nine minutes ago. This one is supposed to be easy — roll back. But the rollback failed and now you're staring at a half-rolled-back system trying to figure out which version is actually running on which pod.
Skill that fires: enterprise-codebase-war-room. This is the skill purpose-built for incidents in a multi-service codebase. Paste: the deploy diff (or PR description), the failing error stack, current pod versions, recent feature flag toggles. The skill diffs the changeset against the error pattern, identifies which line of which file most likely caused the regression, and tells you whether the rollback you tried should have worked or whether the broken commit is also still in the rolled-back image.
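The "current pod versions" input is the one people fumble at 3am. A small sketch of how I'd collect it, assuming you pipe in the output of `kubectl get pods -o json` and that each pod has one application container whose name you know. The function names, the `container` parameter, and the sample data are hypothetical.

```python
import json
from collections import defaultdict


def images_by_pod(kubectl_json, container=None):
    """Group pod names by image, from `kubectl get pods -o json` output.

    If `container` is given, only containers with that name are counted,
    which keeps sidecars from looking like a second version.
    """
    groups = defaultdict(list)
    for pod in json.loads(kubectl_json)["items"]:
        name = pod["metadata"]["name"]
        for c in pod["spec"]["containers"]:
            if container is None or c["name"] == container:
                groups[c["image"]].append(name)
    return {image: sorted(names) for image, names in groups.items()}


def is_mixed_rollout(groups):
    """More than one image means the rollback only partly landed."""
    return len(groups) > 1


# Hypothetical sample output, for illustration only.
sample = json.dumps({"items": [
    {"metadata": {"name": "api-abc"},
     "spec": {"containers": [{"name": "api", "image": "registry.example/api:v41"}]}},
    {"metadata": {"name": "api-def"},
     "spec": {"containers": [{"name": "api", "image": "registry.example/api:v42"}]}},
]})

groups = images_by_pod(sample, container="api")
print(groups)
print("mixed rollout:", is_mixed_rollout(groups))
```

Pasting the grouped result instead of raw pod JSON gives the agent the one fact it needs: which version is actually running where.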
What it produced last time: "Rollback didn't work because the migration in the new commit ran successfully and is irreversible without manual SQL. The new code can't run against the new schema, but the old code also can't, because the old code expected a column that the migration renamed. You need a forward-fix, not a rollback." That single sentence saved me 40 minutes of trying to revert a migration that wasn't reversible.
This pattern is also why AI code review skills are worth investing in — most "deploy regressions" are bugs that should have been caught at PR time. Skills at review-time and skills at incident-time work together.
What the human still owns: writing and applying the forward-fix. Approving the SQL. Communicating to stakeholders. The agent can draft the SQL but it doesn't run it.
Scenario 3: S3 bucket misconfigured
Symptom: customer report that "files I uploaded yesterday are now publicly listed." Your stomach drops. Twenty seconds later you confirm it: a bucket policy was changed three hours ago and a directory listing is now world-readable. This isn't a performance incident — it's a security incident, which is a different mode entirely.
Skill that fires: this is the case where I lean on mcp-server-safety-checklist hardest, because the temptation is to immediately ask the AI to "fix the bucket policy." Don't. Security incidents have a specific order of operations: contain, preserve evidence, then remediate. The skill enforces that ordering — it asks you what's been logged, whether you've snapshotted IAM state, whether you have CloudTrail confirmation of the change before you mutate anything. It feels slow during the panic. It saves you from destroying forensic data.
What it produced last time: the skill blocked me from running the obvious "make the bucket private again" command and forced a 90-second checklist first: who made the change (CloudTrail), are there other affected resources, has anything been downloaded in the window (S3 access logs), is there a known-bad IAM key. Three of those four turned up signal. The fourth was the actual fix.
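The gating behavior is simple enough to sketch. This is my own illustration of the idea, not the skill's implementation: the class, the check names, and the findings are hypothetical, and the four checks are just the four questions from that run.

```python
from dataclasses import dataclass, field

# The four questions from the checklist run described above.
EVIDENCE_CHECKS = (
    "who_made_the_change",       # answered from CloudTrail
    "other_affected_resources",
    "downloads_in_window",       # answered from S3 access logs
    "known_bad_iam_key",
)


@dataclass
class SecurityChecklist:
    """Holds findings and refuses remediation until every check has one."""
    answers: dict = field(default_factory=dict)

    def record(self, check, finding):
        if check not in EVIDENCE_CHECKS:
            raise ValueError(f"unknown check: {check}")
        self.answers[check] = finding

    def outstanding(self):
        """Checks that still have no recorded finding, in checklist order."""
        return [c for c in EVIDENCE_CHECKS if c not in self.answers]

    def may_remediate(self):
        return not self.outstanding()


checklist = SecurityChecklist()
checklist.record("who_made_the_change", "finding recorded from CloudTrail")
print("may remediate:", checklist.may_remediate())
print("still open:", checklist.outstanding())
```

The point of the structure is that "may remediate" is computed from recorded findings, not from how urgent the situation feels.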
What the human still owns: the entire decision tree. The skill is a checklist, not a remediator. It will not touch IAM, S3, or any AWS API on its own.
Scenario 4: OAuth token leak
Symptom: someone posted a screenshot of an internal dashboard on Reddit that includes a Bearer token in a network tab. You have maybe ten minutes before someone tries to use it. Possibly less.
Skill that fires: persistent-kb first, then enterprise-codebase-war-room. Persistent-kb retrieves your team's documented procedure for token revocation — which auth provider, which API, who's authorized to invoke the rotation, the post-rotation checklist. War-room then drafts the actual revocation requests, the audit log query to find what the token has been used for in the last hour, and the customer comms email if the token had any user-scoped permissions.
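The audit log query is the piece worth rehearsing. A rough sketch of the filter, assuming your audit entries are already loaded as dicts and that they store a hashed token fingerprint rather than the raw token. The field names (`token_fp`, `timestamp`, `path`) and the fingerprint scheme are hypothetical; your auth provider's log format will differ.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def fingerprint(token):
    """Hash the token so the raw secret never goes into a query or a paste."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:16]


def recent_usage(entries, leaked_token, now, window=timedelta(hours=1)):
    """Return audit entries for the leaked token inside the lookback window.

    Each entry is assumed to be a dict with `token_fp`, a timezone-aware
    `timestamp`, and `path`.
    """
    target = fingerprint(leaked_token)
    cutoff = now - window
    return [
        e for e in entries
        if e["token_fp"] == target and cutoff <= e["timestamp"] <= now
    ]


# Hypothetical entries, for illustration only.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
entries = [
    {"token_fp": fingerprint("leaked-token"),
     "timestamp": now - timedelta(minutes=10), "path": "/api/users"},
    {"token_fp": fingerprint("leaked-token"),
     "timestamp": now - timedelta(hours=3), "path": "/api/reports"},
    {"token_fp": fingerprint("some-other-token"),
     "timestamp": now - timedelta(minutes=5), "path": "/api/users"},
]

for hit in recent_usage(entries, "leaked-token", now):
    print(hit["timestamp"].isoformat(), hit["path"])
```

Working from a fingerprint also means the token itself never has to be pasted into the agent session.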
This is the workflow where having a knowledge base feels essential. The procedure exists somewhere — in Notion, in Confluence, in someone's brain. During a token leak, "I'll search Notion for it" is a 90-second tax you can't afford. Persistent-kb is the agent's local copy of that procedure, ready in two seconds.
What it produced last time (a smaller incident, not a real leak): a five-line revocation playbook, the exact API call to invalidate the token via the auth provider, a Splunk query for downstream usage, and a templated customer email. Total elapsed: 90 seconds from "we have a leak" to "I have everything I need to act." The actual revocation took 45 more seconds.
What the human still owns: deciding to actually rotate. Communicating to the customer. The decision to disclose publicly or not.
The four AI skills, in depth
1. enterprise-codebase-war-room — the multi-service incident skill
What it does: reasons over a multi-service repo plus deploy history plus error patterns to diagnose regressions, identify the broken commit, and propose forward-fixes. Designed for codebases big enough that no single human knows all of it.
Why incident teams need it: the synthesis step in a regression is "which of the 40 PRs that shipped this week broke the thing?" That's an exhausting search problem for a human at 3am. It's a 30-second problem for an agent with the diff and the error.
2. root-cause-debugger — the five-whys skill
What it does: takes a symptom plus evidence and traces it to a root cause, distinguishing trigger from underlying condition. Good for performance incidents, slow degradations, and bugs where the visible symptom is far from the actual cause.
Why incident teams need it: humans pattern-match to recent incidents. "DB CPU spike, must be that one customer again." Often correct, sometimes deeply wrong. The skill challenges your first hypothesis with structured questions and surfaces the cases where the obvious answer is incomplete.
3. persistent-kb — your team's procedure memory
What it does: a project-scoped knowledge base the agent reads at session start. Architectural decisions, runbooks, "the one weird thing you have to remember about our setup," post-mortem learnings. During an incident, it surfaces the specific procedure for the specific failure mode.
Why incident teams need it: 80% of incidents are variations of incidents you've seen before. The procedure exists. The hard part is finding it under pressure. Persistent-kb is your team's institutional memory, addressable in two seconds.
4. mcp-server-safety-checklist — the brake pedal
What it does: when an agent is about to take a privileged action (kubectl, AWS API, database write) via MCP, this skill enforces a safety review. Confirms the action is reversible, logs are preserved, and the authorized human approved.
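Those three conditions map onto a very small review function. This is a sketch of the idea under my own assumptions, not the skill's code: the `PrivilegedAction` type and its fields are hypothetical, and it does not model who counts as an authorized approver.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrivilegedAction:
    """A command the agent wants to run, plus the facts the review needs."""
    command: str
    reversible: bool
    logs_preserved: bool
    approved_by: Optional[str] = None  # name of the approving human, if any


def blocking_reasons(action):
    """Return every reason the action should not run yet (empty means go)."""
    reasons = []
    if not action.reversible:
        reasons.append("action is not reversible")
    if not action.logs_preserved:
        reasons.append("logs have not been preserved")
    if not action.approved_by:
        reasons.append("no authorized human has approved")
    return reasons


# Hypothetical example, for illustration only.
action = PrivilegedAction(
    command="kubectl rollout undo deployment/api",
    reversible=True,
    logs_preserved=True,
)
print(blocking_reasons(action))
```

Returning the full list of reasons, rather than stopping at the first, lets the human clear everything in one pass instead of discovering blockers one at a time.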
Why incident teams need it: the most expensive incidents are caused by remediation steps. An over-eager rollback that drops a database. A "fix" that deletes the wrong S3 prefix. Skills give agents leverage, but leverage in a panic is exactly when you want a brake pedal.
Where AI agents are dangerous in incidents
The marketing copy for AI agents in DevOps usually skips this part. Don't.
- Never let an agent run kubectl, terraform apply, or any destructive AWS API without a human in the loop. An incident is the worst time to discover that the agent misread your kubeconfig context and applied to prod instead of staging. mcp-server-safety-checklist exists specifically because this category of mistake is so easy.
- Never trust an agent's confidence during a security incident. Agents will state with full conviction that "the bucket is now private" when in fact they applied the wrong policy to a similarly-named bucket. Verify every change in the actual cloud console, not in the agent's reply.
- Don't ask agents to write postmortems during the incident. They'll hallucinate causation. Postmortems happen 24+ hours later, with full evidence, written by humans. AI is fine for first-draft formatting after that.
- Don't paste raw customer PII into a public AI session. Use a self-hosted or contractually-private inference endpoint for incidents that involve customer data. Most enterprise teams already have this — use it during incidents, not just for normal dev.
The skill ecosystem assumes you're applying judgment. When an agent says "I recommend running <destructive command>," you read the command, you understand it, you choose to run it (or not). That's not optional. That's the deal.
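For the kubeconfig mistake in particular, a cheap guard is to make the human type the context name before anything runs. A minimal sketch, assuming kubectl is on your PATH; `kubectl config current-context` is a real command, but both function names and the wiring around them are hypothetical.

```python
import subprocess


def current_context():
    """Ask kubectl which context is active (requires kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def confirm_context(active, typed):
    """Proceed only if the human typed the active context name exactly."""
    return bool(active) and active == typed.strip()


if __name__ == "__main__":
    active = current_context()
    typed = input(f"Active context is {active!r}. Type it to continue: ")
    if not confirm_context(active, typed):
        raise SystemExit("Context not confirmed; nothing was run.")
    print("Context confirmed.")
```

Typing the name is deliberately slower than pressing y. Those few seconds are where you notice it says prod.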
FAQ
Is the "AI skills for production incidents" war room a real workflow or marketing?
It's real. The four skills above are in active use across multiple teams I've consulted with in 2026. The marketing version is "AI runs your incidents." That's not real and you shouldn't trust anyone selling it. The real version is "AI compresses synthesis from 25 minutes to 3, humans still own the decisions."
What's the difference between this and traditional incident management tools (PagerDuty, Datadog Watchdog, etc.)?
Different layer. Tools like PagerDuty handle alerting, scheduling, and bridge calls. Skills handle the cognitive work inside the bridge call — diagnosis, procedure retrieval, draft remediation. They compose. You still want PagerDuty for the alert; you want skills for what happens after the alert fires.
Can solo developers benefit from these too?
Yes, especially root-cause-debugger and persistent-kb. The other two are tuned for multi-service enterprise codebases — solo devs will find them overkill until their codebase grows. For the solo-dev-tuned starter set, see the solo developer guide.
How do I pre-load these for an incident? I can't install skills at 3am.
You install them now, while nothing is on fire. They live in ~/.claude/skills/ (or wherever your agent reads from). Once installed, they're just there — the agent picks the right one based on what you paste. There's no "incident mode" toggle. The pre-loading is the install.
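Since the pre-loading is the install, it's worth checking the install before you go on call. A small preflight sketch, assuming each skill lives in its own folder containing a SKILL.md file under ~/.claude/skills/. That layout is my assumption about your setup, so adjust the path and filename if your agent reads from somewhere else.

```python
from pathlib import Path

REQUIRED_SKILLS = (
    "enterprise-codebase-war-room",
    "root-cause-debugger",
    "persistent-kb",
    "mcp-server-safety-checklist",
)


def missing_skills(skills_dir, required=REQUIRED_SKILLS):
    """Return required skills with no SKILL.md under skills_dir/<name>/."""
    return [
        name for name in required
        if not (Path(skills_dir) / name / "SKILL.md").is_file()
    ]


if __name__ == "__main__":
    missing = missing_skills(Path.home() / ".claude" / "skills")
    print("missing:", missing or "none")
```

Run it at the start of an on-call shift; an empty list means the four skills are where the agent expects them.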
Won't agents leak my incident data to the model provider?
Depends on your contract. Anthropic, OpenAI, and Google all offer enterprise or API tiers with no-training terms, and zero-retention is typically something you arrange under a specific agreement rather than a default. If you're handling regulated data, use those tiers and confirm the retention terms in writing. If you're using a personal account on company incidents, stop and read your provider's data policy before pasting anything sensitive.
Are these skills replacing SREs?
No. They're replacing the parts of an SRE's job that nobody enjoyed — log archaeology, runbook hunting, draft-comm-writing — so SREs spend more time on the work that requires actual judgment. The good SREs I know are getting more strategic, not less employed.
The bottom line
If your team handles real production traffic, install enterprise-codebase-war-room and root-cause-debugger first. They're the highest-leverage of the four. Add persistent-kb once you've got a written runbook to put in it. Add mcp-server-safety-checklist before you connect any privileged MCP tools. The total install time is maybe an hour. The first incident where one of them earns its keep, you'll wonder how you ran on-call without it.
The deeper point is that AI skills don't replace incident response — they remove the grunt work that surrounded incident response. That's the bargain. Take it.