Claude Code Production Incidents: Detection, Rollback, RCA, and Prevention
A practical Claude Code incident playbook for leaks, deletes, DB failures, runaway cost, rollback, RCA, and prevention.
Claude Code can move faster than a normal editor because it can read files, edit code, and run shell commands. That speed is useful in production repositories, but it also changes the failure mode. A vague approval can leak a secret, delete local state, overwrite a branch, run an unsafe migration, or call an API thousands of times.
This article is not a claim about a specific company’s private outage. It is a practical incident-response playbook based on ClaudeCodeLab drills, content operations, and repository safety reviews. The amounts and timestamps are examples, but the patterns are realistic enough to rehearse before you need them.
In plain terms, an incident is an event that affects users, data, security, cost, or service reliability. Containment means stopping the damage from spreading. RCA means root cause analysis. Rollback means returning to the last known safe version.
Use the official Claude Code settings documentation and hooks guide for the exact configuration surface. This article turns those tools into a production incident flow: detect, contain, diagnose, rollback, communicate, postmortem, and prevent recurrence.
The Incident Flow
Do not start by asking Claude Code to “fix production.” Agree on the order of the response first.
| Phase | Goal | What to ask Claude Code |
|---|---|---|
| Detect | Identify what changed and who is affected | Summarize alerts, logs, diffs, deploys, and recent commands |
| Contain | Stop more damage | Propose key revocation, job shutdown, feature flags, or endpoint disablement |
| Diagnose | Narrow the direct cause | Compare the last safe deploy with the failing change |
| Rollback | Return to a safe version | List rollback target, data risk, and verification commands |
| Communicate | Keep stakeholders aligned | Draft current status, impact, next update time, and owner |
| Postmortem | Convert the event into learning | Fill RCA, timeline, missed detection, and action items |
| Prevent | Make recurrence harder | Add permissions, hooks, CI checks, alerts, and review gates |
The most important rule is “contain first, investigate second” for secrets, billing, personal data, and database writes.
Seven Concrete Incident Patterns
| Pattern | What happens | First response | Common failure case |
|---|---|---|---|
| Secret leak | .env, logs, or screenshots expose an API key | Revoke the key, rotate secrets, inspect logs | Rewriting git history while CI logs still contain the key |
| Dangerous delete | rm -rf or a broad cleanup removes needed files | Stop work, inspect backups, list untracked files | Assuming git checkout . recovers everything when it restores tracked files only |
| Force push | main is overwritten and teammates’ commits disappear | Stop pushing, inspect reflog, recover branch | Confusing --force-with-lease with --force |
| DB migration | A column drop, full-table update, or lock breaks production | Pause writes, snapshot state, restore safely | Running untested SQL directly against production |
| Runaway API calls | Retry logic loops and cost rises | Kill the process, pause the queue, check limits | “Retry on error” becomes infinite retries |
| Broken dependency deploy | A package update passes locally but fails at startup | Reactivate previous deploy, inspect lockfile | npm update pulls in untested versions, including major upgrades when the version range allows them |
| Missing auth | Admin or user data endpoint is public | Disable endpoint, inspect access logs, notify as needed | “Admin endpoint” is not written as an auth requirement |
Case 1: Secret Leak
Detection often comes from GitHub secret scanning, a cloud usage alert, a billing screen, or CI logs. See GitHub’s official secret scanning documentation for how the platform detects supported token patterns.
Your first action is revocation, not investigation. Disable the leaked key, issue a new one, update production and CI secrets, and record where the key appeared. Check public repositories, pull requests, CI logs, chat, and error monitoring.
git status --short
git diff --cached --name-only
git log --all -- .env .env.local
git grep -n "sk-" -- ':!node_modules' ':!dist'
The failure case is rushing into history rewriting and creating a second incident with force pushes. If you must rewrite history, coordinate branch protection, tags, forks, CI caches, and teammate clones first. Ask Claude Code for leak scope, rotation steps, recovery options, and a stakeholder message before asking it to run commands.
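The same check as the `git grep` above can be applied to text that never reaches the repository, such as a pasted CI log. The following is a minimal sketch, not a replacement for GitHub secret scanning. The function name `findSecretCandidates` and both patterns (the `sk-` prefix borrowed from the grep above, plus a generic `NAME=value` assignment) are assumptions for illustration. Matches are redacted so the report does not leak the key a second time.

```javascript
// Hypothetical patterns: tune these to the key formats your providers issue.
const SECRET_PATTERNS = [
  { name: "sk-prefixed token", regex: /sk-[A-Za-z0-9_-]{16,}/g },
  {
    name: "secret assignment",
    regex: /\b[A-Z0-9_]*(?:SECRET|TOKEN|API_KEY)[A-Z0-9_]*\s*=\s*\S+/g,
  },
];

// Keep the first four characters so responders can tell which key leaked.
function redact(value) {
  return value.length <= 4 ? "****" : `${value.slice(0, 4)}****`;
}

// Returns one finding per match: 1-based line number, pattern name, redacted preview.
function findSecretCandidates(text) {
  const findings = [];
  text.split(/\r?\n/).forEach((line, index) => {
    for (const { name, regex } of SECRET_PATTERNS) {
      for (const match of line.matchAll(regex)) {
        findings.push({ line: index + 1, pattern: name, preview: redact(match[0]) });
      }
    }
  });
  return findings;
}

// Made-up log lines; neither value is a real credential.
const sampleLog = [
  "step 1: install dependencies",
  "debug: Authorization header used sk-abcdefghijklmnop1234",
  "STRIPE_API_KEY=example-value-not-real",
].join("\n");

console.log(findSecretCandidates(sampleLog));
```

Treat every finding as a candidate only. A hit means "revoke and check", and a clean result does not prove the log is safe.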
Case 2: Failed Database Migration
For database incidents, stop writes before you explore. Use maintenance mode, disable the feature flag, pause workers, or move the application role toward read-only access.
psql "$DATABASE_URL" -c "select now();"
psql "$DATABASE_URL" -c "\d users"
pg_dump "$DATABASE_URL" --schema-only > schema_before_repair.sql
Separate code rollback from data recovery. Code may return to the previous deployment in minutes; deleted rows or columns require backups, WAL, audit logs, or external resync. Before Claude Code writes SQL, require table name, estimated affected rows, lock risk, restore source, and verification query.
Concrete failure cases include DELETE FROM users; without WHERE, a DROP COLUMN in the wrong migration direction, and a CREATE INDEX without CONCURRENTLY on a large production table, which blocks writes while the index builds.
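The failure cases above can be turned into a rough preflight check that runs before any SQL reaches production. This is a sketch under stated assumptions: `reviewMigrationSql` and its rule list are hypothetical, the SQL is split naively on `;` rather than parsed, and the CONCURRENTLY rule assumes PostgreSQL (the article's examples use psql). It can only raise questions for a reviewer; it cannot prove a migration is safe.

```javascript
// Naive checks only: splitting on ";" and matching with regular expressions
// will miss or misread statements hidden in strings, comments, or functions.
const RISK_RULES = [
  {
    label: "DELETE without WHERE",
    test: (stmt) => /^\s*delete\s+from\b/i.test(stmt) && !/\bwhere\b/i.test(stmt),
  },
  {
    label: "UPDATE without WHERE",
    test: (stmt) => /^\s*update\b/i.test(stmt) && !/\bwhere\b/i.test(stmt),
  },
  {
    label: "DROP COLUMN or DROP TABLE",
    test: (stmt) => /\bdrop\s+(column|table)\b/i.test(stmt),
  },
  {
    label: "CREATE INDEX without CONCURRENTLY",
    test: (stmt) =>
      /\bcreate\s+(unique\s+)?index\b/i.test(stmt) && !/\bconcurrently\b/i.test(stmt),
  },
];

// Returns one warning per (statement, rule) pair that matches.
function reviewMigrationSql(sql) {
  const warnings = [];
  const statements = sql.split(";").map((s) => s.trim()).filter(Boolean);
  for (const statement of statements) {
    for (const rule of RISK_RULES) {
      if (rule.test(statement)) warnings.push({ risk: rule.label, statement });
    }
  }
  return warnings;
}

// Example migration with made-up table and index names.
const migration = `
  DELETE FROM users;
  ALTER TABLE users DROP COLUMN legacy_flag;
  CREATE INDEX idx_users_email ON users (email);
  CREATE INDEX CONCURRENTLY idx_users_name ON users (name);
  UPDATE users SET active = false WHERE last_login < '2020-01-01';
`;

for (const { risk, statement } of reviewMigrationSql(migration)) {
  console.log(`${risk}: ${statement}`);
}
```

The first three statements are flagged and the last two pass. Each warning still needs the human answers listed above: affected rows, lock risk, restore source, and verification query.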
Case 3: Runaway API Retries
LLM and external API incidents often hide inside “error handling.” When the remote service returns 503, bad retry logic can call the service every second for hours. Containment means killing the process, pausing the queue, setting usage limits, and checking alerts.
Save this as incident-budget-runner.mjs and wrap batch jobs with it.
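#!/usr/bin/env node
// Runs a command with a retry cap, an estimated cost cap, and capped backoff.
import { spawn } from "node:child_process";

const command = process.argv.slice(2);
const maxAttempts = Number(process.env.MAX_ATTEMPTS || 3);
const maxCostCents = Number(process.env.MAX_COST_CENTS || 200);
const costPerAttempt = Number(process.env.COST_PER_ATTEMPT_CENTS || 0);

if (command.length === 0) {
  console.error("usage: node incident-budget-runner.mjs <command> [...args]");
  process.exit(2);
}

// Reject non-numeric limits: NaN would skip the loop or disable the cost cap.
if ([maxAttempts, maxCostCents, costPerAttempt].some(Number.isNaN)) {
  console.error("MAX_ATTEMPTS, MAX_COST_CENTS, and COST_PER_ATTEMPT_CENTS must be numbers");
  process.exit(2);
}

let estimatedCost = 0;

for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
  const child = spawn(command[0], command.slice(1), {
    stdio: "inherit",
    shell: process.platform === "win32"
  });

  // Handle "error" as well as "exit" so a missing binary counts as a failed
  // attempt instead of crashing the runner with an unhandled event.
  const exitCode = await new Promise((resolve) => {
    child.on("error", (err) => {
      console.error(`spawn failed: ${err.message}`);
      resolve(1);
    });
    child.on("exit", (code) => resolve(code ?? 1));
  });

  estimatedCost += costPerAttempt;
  if (exitCode === 0) process.exit(0);

  if (estimatedCost >= maxCostCents) {
    console.error(`stopped: estimated cost ${estimatedCost} cents reached`);
    process.exit(1);
  }

  // Back off 1s, 2s, 4s, ... capped at 10s; no wait after the final attempt.
  if (attempt < maxAttempts) {
    const delayMs = Math.min(1000 * 2 ** (attempt - 1), 10_000);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}

console.error(`failed after ${maxAttempts} attempts`);
process.exit(1);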
Run it like this:
MAX_ATTEMPTS=3 MAX_COST_CENTS=200 COST_PER_ATTEMPT_CENTS=25 \
node incident-budget-runner.mjs node batch-process.js
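Before choosing limits, it helps to work out the worst case in which every attempt fails. The sketch below mirrors the wrapper's arithmetic: cost is added after each attempt, the run stops once the estimate reaches the cap, and the wait between attempts doubles from one second up to a ten-second ceiling. `planRetries` is a hypothetical helper written for this illustration and is not part of the runner.

```javascript
// Worst-case plan: assumes every attempt fails.
function planRetries({ maxAttempts, maxCostCents, costPerAttemptCents }) {
  const delaysMs = [];
  let estimatedCost = 0;
  let attempts = 0;

  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    attempts = attempt;
    estimatedCost += costPerAttemptCents;
    if (estimatedCost >= maxCostCents) break; // the budget cap ends the run
    if (attempt < maxAttempts) {
      // Wait between this attempt and the next: 1s, 2s, 4s, 8s, then 10s.
      delaysMs.push(Math.min(1000 * 2 ** (attempt - 1), 10_000));
    }
  }

  return { attempts, worstCaseCostCents: estimatedCost, delaysMs };
}

// The command line above: 3 attempts, 75 cents, waits of 1000 ms and 2000 ms.
console.log(planRetries({ maxAttempts: 3, maxCostCents: 200, costPerAttemptCents: 25 }));

// A looser attempt cap: the 200-cent budget stops the run at attempt 8.
console.log(planRetries({ maxAttempts: 10, maxCostCents: 200, costPerAttemptCents: 25 }));
```

With the example settings, the attempt cap is the limit that bites first (75 cents at most). Raising `MAX_ATTEMPTS` to 10 moves the limit to the budget, which ends the run after eight attempts.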
Claude Code Guardrails
Start by moving risky actions into ask or deny. Also deny direct reads of secret files.
{
  "$schema": "https://json.schemastore.org/claude-code-settings.json",
  "permissions": {
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Read(./secrets/**)",
      "Bash(git push --force *main*)",
      "Bash(git push -f *main*)",
      "Bash(rm -rf /*)",
      "Bash(rm -rf ~*)"
    ],
    "ask": [
      "Bash(git push*)",
      "Bash(rm*)",
      "Bash(npm install*)",
      "Bash(*migrate*)",
      "Bash(*deploy*)"
    ]
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/protect-danger.sh"
          }
        ]
      }
    ]
  }
}
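Settings files drift as people relax rules to get work done, so a small check in CI can confirm the secret-file deny rules are still present. This is a sketch under assumptions: `missingDenyRules` and the `REQUIRED_DENY_RULES` baseline are hypothetical names, and the comparison is exact string matching. It does not evaluate wildcards or reproduce how Claude Code resolves permissions.

```javascript
// Hypothetical baseline: the secret-file rules from the settings example above.
const REQUIRED_DENY_RULES = ["Read(./.env)", "Read(./.env.*)", "Read(./secrets/**)"];

// Returns the required rules that are absent from permissions.deny.
function missingDenyRules(settings, required = REQUIRED_DENY_RULES) {
  const deny = new Set(settings?.permissions?.deny ?? []);
  return required.filter((rule) => !deny.has(rule));
}

// Example settings text with two of the three baseline rules removed.
const settingsText = `{
  "permissions": {
    "deny": ["Read(./.env)", "Bash(rm -rf /*)"],
    "ask": ["Bash(git push*)"]
  }
}`;

const missing = missingDenyRules(JSON.parse(settingsText));
console.log(
  missing.length === 0
    ? "deny baseline present"
    : `missing deny rules: ${missing.join(", ")}`
);
```

In a real pipeline the text would come from the repository's settings file, and a non-empty result would fail the build.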
The hook receives the tool call as JSON on stdin. Exiting with code 2 blocks the command, and the message written to stderr is passed back to Claude Code.
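#!/usr/bin/env bash
set -euo pipefail
# Match case-insensitively so lowercase SQL such as "drop table" is caught too.
shopt -s nocasematch

# The tool call arrives as JSON on stdin; extract the Bash command from it.
payload="$(cat)"
command="$(node -e 'const fs = require("fs"); const raw = fs.readFileSync(0, "utf8") || "{}"; const json = JSON.parse(raw); console.log(json.tool_input?.command || "");' <<< "$payload")"

# Blocked: rm -rf on / or ~, forced git push, DROP TABLE, TRUNCATE.
blocked='(rm[[:space:]]+-rf[[:space:]]+(/|~)|git[[:space:]]+push[[:space:]].*(-f|--force)([[:space:]]|$)|DROP[[:space:]]+TABLE|TRUNCATE[[:space:]])'

if [[ "$command" =~ $blocked ]]; then
  echo "Blocked dangerous command: $command" >&2
  exit 2  # exit code 2 blocks the tool call
fi

exit 0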
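A deny pattern is only as good as the commands it has been tried against. The sketch below is an approximate JavaScript port of the hook's regular expression, run over a handful of sample commands. The port uses the `i` flag so SQL keywords match in any case; the Bash equivalent is `shopt -s nocasematch`. `isBlocked` and the sample list are assumptions for illustration.

```javascript
// Approximate port of the hook's extended regular expression.
// The "i" flag makes keywords such as DROP TABLE match in any case.
const BLOCKED =
  /(rm\s+-rf\s+(\/|~)|git\s+push\s.*(-f|--force)(\s|$)|DROP\s+TABLE|TRUNCATE\s)/i;

function isBlocked(command) {
  return BLOCKED.test(command);
}

const samples = [
  "rm -rf /",
  "rm -rf ~/projects",
  "rm -rf ./build",
  "rm -fr /", // reordered flags: not matched by this pattern
  "git push origin main",
  "git push --force origin main",
  "git push origin main -f",
  "git push --force-with-lease origin main",
  "psql -c 'drop table users'",
];

for (const command of samples) {
  console.log(`${isBlocked(command) ? "BLOCK" : "allow"}  ${command}`);
}
```

Two results are worth noticing. `rm -fr /` slips past the pattern because the flags are reordered, and `--force-with-lease` is allowed through. Both still hit the broader `Bash(rm*)` and `Bash(git push*)` ask rules, which is why the hook and the permission list work better together than either does alone.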
Communication and Postmortem Templates
## Incident Update
- Status: investigating / contained / validating recovery
- Impact: feature, users, start time
- Current action: stopped job, reverted deploy, checking logs
- Next update: YYYY-MM-DD HH:mm
- Owner:
# Postmortem: [Incident Title]
## Summary
- Started:
- Detected:
- Resolved:
- Impact:
- Severity: P0/P1/P2/P3
## Timeline
| Time | Event |
| --- | --- |
| HH:mm | |
## Cause
- Direct cause:
- Root cause:
- Why detection was late:
## Prevention
| Action | Owner | Due date |
| --- | --- | --- |
| | | |
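The Started, Detected, and Resolved fields in the summary are enough to compute how long detection and recovery took, which gives the "Why detection was late" line a number to explain. A minimal sketch, assuming ISO 8601 timestamps; `incidentDurations` is a hypothetical helper and the times below are made-up examples.

```javascript
const MS_PER_MINUTE = 60_000;

function minutesBetween(from, to) {
  return Math.round((new Date(to) - new Date(from)) / MS_PER_MINUTE);
}

// Derives the three durations a postmortem usually reports.
function incidentDurations({ started, detected, resolved }) {
  const times = [started, detected, resolved].map((t) => new Date(t).getTime());
  if (times.some(Number.isNaN)) {
    throw new Error("invalid timestamp");
  }
  if (!(times[0] <= times[1] && times[1] <= times[2])) {
    throw new Error("timestamps must be ordered: started, detected, resolved");
  }
  return {
    timeToDetectMinutes: minutesBetween(started, detected),
    timeToResolveMinutes: minutesBetween(detected, resolved),
    totalMinutes: minutesBetween(started, resolved),
  };
}

// Example values only: 42 minutes to detect, 88 to resolve, 130 in total.
console.log(
  incidentDurations({
    started: "2025-01-15T09:00:00Z",
    detected: "2025-01-15T09:42:00Z",
    resolved: "2025-01-15T11:10:00Z",
  })
);
```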
Google’s SRE chapter on postmortem culture is a strong external reference for keeping this blameless and useful.
Related Reading and CTA
For adjacent ClaudeCodeLab material, read Claude Code security best practices, Claude Code permissions guide, Claude Code API cost guide, and the verification receipt workflow.
Solo builders can start with the free cheatsheet. If you want reusable CLAUDE.md, hook, and review templates, browse ClaudeCodeLab products. Teams that need permissions, rollout rules, review gates, and incident drills can use Claude Code training and consultation.
After trying these templates in ClaudeCodeLab article and repository drills, the most useful change was putting containment before diagnosis. Syntax-checking the JSON, Bash hook, and Node wrapper before publishing also made the guidance safer. A 20-minute rehearsal quickly reveals missing alerts, missing backups, and overly broad Claude Code permissions.
About the Author
Masa
Engineer focused on practical Claude Code workflows. Runs claudecode-lab.com, a 10-language technical media site.