Claude Code Production Incidents: Detection, Rollback, RCA, and Prevention

Claude Code can move faster than a normal editor because it can read files, edit code, and run shell commands. That speed is useful in production repositories, but it also changes the failure mode. A vague approval can leak a secret, delete local state, overwrite a branch, run an unsafe migration, or call an API thousands of times.

This article is not a claim about a specific company’s private outage. It is a practical incident-response playbook based on ClaudeCodeLab drills, content operations, and repository safety reviews. The amounts and timestamps are examples, but the patterns are realistic enough to rehearse before you need them.

In plain terms, an incident is an event that affects users, data, security, cost, or service reliability. Containment means stopping the damage from spreading. RCA means root cause analysis. Rollback means returning to the last known safe version.

Use the official Claude Code settings documentation and hooks guide for the exact configuration surface. This article turns those tools into a production incident flow: detect, contain, diagnose, rollback, communicate, postmortem, and prevent recurrence.

The Incident Flow

Do not start by asking Claude Code to “fix production.” First, agree on the order of response and hold to it.

Phase	Goal	What to ask Claude Code
Detect	Identify what changed and who is affected	Summarize alerts, logs, diffs, deploys, and recent commands
Contain	Stop more damage	Propose key revocation, job shutdown, feature flags, or endpoint disablement
Diagnose	Narrow the direct cause	Compare the last safe deploy with the failing change
Rollback	Return to a safe version	List rollback target, data risk, and verification commands
Communicate	Keep stakeholders aligned	Draft current status, impact, next update time, and owner
Postmortem	Convert the event into learning	Fill RCA, timeline, missed detection, and action items
Prevent	Make recurrence harder	Add permissions, hooks, CI checks, alerts, and review gates

The most important rule is “contain first, investigate second” for secrets, billing, personal data, and database writes.
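One way to make that ordering hard to skip is to write it down as data. The sketch below is illustrative only: the phase names come from the table above, while the category labels and the `nextPhase` helper are hypothetical names chosen for this example.

```javascript
// Phase order taken from the incident flow table.
const PHASES = ["detect", "contain", "diagnose", "rollback", "communicate", "postmortem", "prevent"];

// Hypothetical category labels for incidents where containment cannot wait.
const CONTAIN_FIRST = new Set(["secrets", "billing", "personal-data", "database-writes"]);

// Returns the next phase to work on, given the phases already completed.
function nextPhase(completed, category) {
  const done = new Set(completed);
  // For high-risk categories, containment jumps ahead of everything else.
  if (CONTAIN_FIRST.has(category) && !done.has("contain")) return "contain";
  return PHASES.find((phase) => !done.has(phase)) ?? null;
}

console.log(nextPhase([], "secrets"));                    // contain
console.log(nextPhase([], "broken-dependency"));          // detect
console.log(nextPhase(["detect", "contain"], "secrets")); // diagnose
```

A checklist like this is mainly useful in drills, where it makes a skipped containment step visible.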

Seven Concrete Incident Patterns

Pattern	What happens	First response	Common failure case
Secret leak	`.env`, logs, or screenshots expose an API key	Revoke the key, rotate secrets, inspect logs	Rewriting git history while CI logs still contain the key
Dangerous delete	`rm -rf` or a broad cleanup removes needed files	Stop work, inspect backups, list untracked files	`git checkout .` restores tracked files only
Force push	`main` is overwritten and teammates’ commits disappear	Stop pushing, inspect reflog, recover branch	Confusing `--force-with-lease` with `--force`
DB migration	A column drop, full-table update, or lock breaks production	Pause writes, snapshot state, restore safely	Running untested SQL directly against production
Runaway API calls	Retry logic loops and cost rises	Kill the process, pause the queue, check limits	“Retry on error” becomes infinite retries
Broken dependency deploy	A package update passes locally but fails at startup	Reactivate previous deploy, inspect lockfile	`npm update` performs unexpected major upgrades
Missing auth	Admin or user data endpoint is public	Disable endpoint, inspect access logs, notify as needed	“Admin endpoint” is not written as an auth requirement

Case 1: Secret Leak

Detection often comes from GitHub secret scanning, a cloud usage alert, a billing screen, or CI logs. See GitHub’s official secret scanning documentation for how the platform detects supported token patterns.

Your first action is revocation, not investigation. Disable the leaked key, issue a new one, update production and CI secrets, and record where the key appeared. Check public repositories, pull requests, CI logs, chat, and error monitoring.

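git status --short                              # modified and untracked files in the working tree
git diff --cached --name-only                   # files currently staged for commit
git log --all -- .env .env.local                # commits on any ref that touched these files
git grep -n "sk-" -- ':!node_modules' ':!dist'  # key-like strings in tracked files (does not search history)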

The failure case is rushing into history rewriting and creating a second incident with force pushes. If you must rewrite history, coordinate branch protection, tags, forks, CI caches, and teammate clones first. Ask Claude Code for leak scope, rotation steps, recovery options, and a stakeholder message before asking it to run commands.
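When you record where the key appeared, the report itself should not repeat the key. The following is a minimal sketch of a scanner for pasted CI logs or chat exports, assuming two illustrative patterns. The pattern list, `redact`, and `findSecretLikeStrings` are hypothetical names for this example, and real providers document their own token formats.

```javascript
// Illustrative patterns only; adjust them to the token formats your providers document.
const SECRET_PATTERNS = [
  { name: "sk-prefixed key", regex: /sk-[A-Za-z0-9_-]{16,}/g },
  { name: "env-style assignment", regex: /\b[A-Z0-9_]*(?:SECRET|TOKEN|API_KEY)[A-Z0-9_]*\s*=\s*\S+/g }
];

// Keep only the first four characters so the report never repeats the full value.
function redact(value) {
  return value.slice(0, 4) + "*".repeat(Math.max(value.length - 4, 0));
}

// Returns one finding per match: line number, pattern name, and a redacted preview.
function findSecretLikeStrings(text) {
  const findings = [];
  text.split("\n").forEach((line, index) => {
    for (const { name, regex } of SECRET_PATTERNS) {
      for (const match of line.matchAll(regex)) {
        findings.push({ line: index + 1, pattern: name, preview: redact(match[0]) });
      }
    }
  });
  return findings;
}

// Made-up log text for demonstration.
const sampleLog = [
  "step 1: install ok",
  "step 2: calling API with sk-test_0123456789abcdefXYZ",
  "step 3: done"
].join("\n");

console.log(findSecretLikeStrings(sampleLog));
```

A match here is a prompt to check and rotate, not proof of a leak, and a clean result does not prove the key is absent elsewhere.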

Case 2: Failed Database Migration

For database incidents, stop writes before you explore. Use maintenance mode, disable the feature flag, pause workers, or switch the application's database role to read-only access.

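psql "$DATABASE_URL" -c "select now();"   # confirm connectivity and server time
psql "$DATABASE_URL" -c "\d users"        # inspect the current shape of the affected table
pg_dump "$DATABASE_URL" --schema-only > schema_before_repair.sql   # save the schema before any repair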

Separate code rollback from data recovery. Code may return to the previous deployment in minutes; deleted rows or columns require backups, WAL, audit logs, or external resync. Before Claude Code writes SQL, require table name, estimated affected rows, lock risk, restore source, and verification query.

Concrete failure cases include `DELETE FROM users;` with no `WHERE` clause, a `DROP COLUMN` in the wrong migration direction, and index creation without `CONCURRENTLY` on a large production table, which blocks writes until the build finishes.
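The pre-flight requirements and failure cases above can be turned into a small gate that runs before any SQL reaches review. This is a sketch under stated assumptions: the field names and `reviewSqlRequest` are hypothetical, and the text checks are deliberately naive, so they complement human review rather than replace it.

```javascript
// Fields required before any SQL is written (field names are hypothetical).
const REQUIRED_FIELDS = ["table", "estimatedRows", "lockRisk", "restoreSource", "verificationQuery"];

// Returns a list of problems; an empty list means the request can go to human review.
function reviewSqlRequest(request) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = request[field];
    if (value === undefined || value === null || value === "") {
      problems.push(`missing ${field}`);
    }
  }
  const sql = (request.sql || "").trim();
  // Naive text check: DELETE or UPDATE statements with no WHERE clause.
  if (/^(delete|update)\b/i.test(sql) && !/\bwhere\b/i.test(sql)) {
    problems.push("DELETE or UPDATE without WHERE");
  }
  // Naive text check: schema-destroying statements need explicit sign-off.
  if (/\b(drop\s+(table|column)|truncate)\b/i.test(sql)) {
    problems.push("destructive DDL needs explicit sign-off");
  }
  return problems;
}

console.log(reviewSqlRequest({ sql: "DELETE FROM users;" }));
console.log(reviewSqlRequest({
  sql: "UPDATE users SET plan = 'free' WHERE id = 42",
  table: "users",
  estimatedRows: 1,
  lockRisk: "row lock only",
  restoreSource: "nightly snapshot",
  verificationQuery: "SELECT plan FROM users WHERE id = 42"
}));
```

The first call reports the missing fields plus the unbounded delete; the second returns an empty list.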

Case 3: Runaway API Retries

LLM and external API incidents often hide inside “error handling.” When the remote service returns 503, bad retry logic can call the service every second for hours. Containment means killing the process, pausing the queue, setting usage limits, and checking alerts.

Save this as incident-budget-runner.mjs and wrap batch jobs with it.

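#!/usr/bin/env node
// Runs a command with a capped number of attempts and a capped estimated cost.
// This bounds retries of the whole command; it cannot see retries inside it.
import { spawn } from "node:child_process";

const command = process.argv.slice(2);
const maxAttempts = Number(process.env.MAX_ATTEMPTS || 3);
const maxCostCents = Number(process.env.MAX_COST_CENTS || 200);
const costPerAttempt = Number(process.env.COST_PER_ATTEMPT_CENTS || 0);

if (command.length === 0) {
  console.error("usage: node incident-budget-runner.mjs <command> [...args]");
  process.exit(2);
}

// A non-numeric limit becomes NaN and would silently skip the loop, so reject it.
if (![maxAttempts, maxCostCents, costPerAttempt].every(Number.isFinite)) {
  console.error("MAX_ATTEMPTS, MAX_COST_CENTS, and COST_PER_ATTEMPT_CENTS must be numbers");
  process.exit(2);
}

let estimatedCost = 0;

for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
  const child = spawn(command[0], command.slice(1), {
    stdio: "inherit",
    shell: process.platform === "win32"
  });

  // Treat a spawn failure (for example, command not found) as a failed attempt
  // instead of letting the unhandled "error" event crash the wrapper.
  const exitCode = await new Promise((resolve) => {
    child.on("error", (error) => {
      console.error(`could not start command: ${error.message}`);
      resolve(1);
    });
    child.on("exit", (code) => resolve(code ?? 1));
  });

  estimatedCost += costPerAttempt;

  if (exitCode === 0) process.exit(0);
  if (estimatedCost >= maxCostCents) {
    console.error(`stopped: estimated cost ${estimatedCost} cents reached`);
    process.exit(1);
  }
  if (attempt === maxAttempts) break; // no point waiting after the final attempt

  // Exponential backoff: 1 s, 2 s, 4 s, ... capped at 10 s.
  const delayMs = Math.min(1000 * 2 ** (attempt - 1), 10_000);
  await new Promise((resolve) => setTimeout(resolve, delayMs));
}

console.error(`failed after ${maxAttempts} attempts`);
process.exit(1);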

Run it like this:

MAX_ATTEMPTS=3 MAX_COST_CENTS=200 COST_PER_ATTEMPT_CENTS=25 \
node incident-budget-runner.mjs node batch-process.js
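Before choosing limits, it helps to work out the worst case they allow. The sketch below mirrors the wrapper's arithmetic under the assumption that every attempt fails; `planRun` and its field names are hypothetical and exist only for this example.

```javascript
// Worst case for the wrapper: how many attempts run, what they are estimated
// to cost, and how long is spent waiting between attempts.
function planRun({ maxAttempts, maxCostCents, costPerAttemptCents }) {
  let attempts = 0;
  let estimatedCost = 0;
  let waitBetweenAttemptsMs = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    attempts = attempt;
    estimatedCost += costPerAttemptCents;
    if (estimatedCost >= maxCostCents) break; // budget reached, the wrapper stops here
    if (attempt < maxAttempts) {
      // Exponential backoff before the next attempt, capped at 10 seconds.
      waitBetweenAttemptsMs += Math.min(1000 * 2 ** (attempt - 1), 10_000);
    }
  }
  return { attempts, estimatedCost, waitBetweenAttemptsMs };
}

console.log(planRun({ maxAttempts: 3, maxCostCents: 200, costPerAttemptCents: 25 }));
// { attempts: 3, estimatedCost: 75, waitBetweenAttemptsMs: 3000 }

console.log(planRun({ maxAttempts: 20, maxCostCents: 200, costPerAttemptCents: 25 }));
// { attempts: 8, estimatedCost: 200, waitBetweenAttemptsMs: 45000 }
```

With the example settings the attempt limit is what stops the job. Raising `MAX_ATTEMPTS` to 20 shifts the stop to the cost ceiling at the eighth attempt.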

Claude Code Guardrails

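Start by moving risky actions into the `ask` or `deny` permission lists in your Claude Code settings. Also deny direct reads of secret files. Deny rules take precedence over ask rules, so a forced push to `main` is refused even though `git push*` also appears under `ask`.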

{
  "$schema": "https://json.schemastore.org/claude-code-settings.json",
  "permissions": {
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Read(./secrets/**)",
      "Bash(git push --force *main*)",
      "Bash(git push -f *main*)",
      "Bash(rm -rf /*)",
      "Bash(rm -rf ~*)"
    ],
    "ask": [
      "Bash(git push*)",
      "Bash(rm*)",
      "Bash(npm install*)",
      "Bash(*migrate*)",
      "Bash(*deploy*)"
    ]
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/protect-danger.sh"
          }
        ]
      }
    ]
  }
}
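Settings like these only protect you while the rules stay in the file. A minimal sketch of a check you could run in CI, assuming the settings JSON has already been parsed into an object; `missingDenyRules` and the required list are hypothetical names for this example.

```javascript
// Returns the required rules that are absent from permissions.deny.
function missingDenyRules(settings, requiredRules) {
  const deny = new Set(settings?.permissions?.deny ?? []);
  return requiredRules.filter((rule) => !deny.has(rule));
}

// Hypothetical baseline that a team might require in every repository.
const requiredDenyRules = ["Read(./.env)", "Read(./.env.*)", "Read(./secrets/**)"];

// Example settings object that has lost one of the rules.
const exampleSettings = {
  permissions: { deny: ["Read(./.env)", "Read(./.env.*)"] }
};

console.log(missingDenyRules(exampleSettings, requiredDenyRules));
// [ 'Read(./secrets/**)' ]
```

The comparison is an exact string match, so a rule rewritten with a different glob counts as missing and should be reviewed by a person.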

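The hook blocks dangerous commands by exiting with code 2, which Claude Code treats as a blocking error for a PreToolUse hook. The message written to stderr is passed back to Claude.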

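#!/usr/bin/env bash
# PreToolUse hook: reads the tool call as JSON on stdin and blocks risky Bash commands.
set -euo pipefail
shopt -s nocasematch  # match SQL keywords such as "drop table" regardless of case

payload="$(cat)"
# Extract tool_input.command. If the payload cannot be parsed, block rather than allow.
if ! command="$(node -e 'const fs = require("fs"); const raw = fs.readFileSync(0, "utf8") || "{}"; const json = JSON.parse(raw); console.log(json.tool_input?.command || "");' <<< "$payload")"; then
  echo "Blocked: could not parse hook input" >&2
  exit 2
fi
blocked='(rm[[:space:]]+-rf[[:space:]]+(/|~)|git[[:space:]]+push[[:space:]].*(-f|--force)([[:space:]]|$)|DROP[[:space:]]+TABLE|TRUNCATE[[:space:]])'

if [[ "$command" =~ $blocked ]]; then
  echo "Blocked dangerous command: $command" >&2
  exit 2  # exit code 2 tells Claude Code to block the tool call
fi

exit 0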
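A pattern like this deserves its own test before you rely on it. The sketch below renders the hook's regular expression in JavaScript and runs it against sample commands. The `i` flag is a choice of this sketch so that lowercase SQL is caught; bash matches case-sensitively unless `nocasematch` is set. `isBlocked` and the sample list are hypothetical names for this example.

```javascript
// JavaScript rendering of the hook's extended regex; [[:space:]] becomes \s.
// The i flag is this sketch's choice so that lowercase SQL is also caught.
const BLOCKED = /(rm\s+-rf\s+(\/|~)|git\s+push\s.*(-f|--force)(\s|$)|DROP\s+TABLE|TRUNCATE\s)/i;

function isBlocked(command) {
  return BLOCKED.test(command);
}

// Sample commands with the verdict expected from reading the pattern.
const hookCases = [
  ["rm -rf /", true],
  ["rm -rf ~/projects", true],
  ["rm -rf ./build", false],
  ["git push --force origin main", true],
  ["git push --force-with-lease origin main", false],
  ["git push origin main", false],
  ['psql -c "drop table users"', true],
  ["git push origin hot-f", true] // false positive: the branch name ends in -f
];

for (const [command, expected] of hookCases) {
  const verdict = isBlocked(command);
  console.log(`${verdict === expected ? "ok  " : "FAIL"} ${verdict ? "block" : "allow"}  ${command}`);
}
```

The last case shows the trade-off: a text pattern errs toward blocking, so expect occasional false positives and keep the permission rules as the primary control.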

Communication and Postmortem Templates

## Incident Update
- Status: investigating / contained / validating recovery
- Impact: feature, users, start time
- Current action: stopped job, reverted deploy, checking logs
- Next update: YYYY-MM-DD HH:mm
- Owner:

# Postmortem: [Incident Title]

## Summary
- Started:
- Detected:
- Resolved:
- Impact:
- Severity: P0/P1/P2/P3

## Timeline
| Time | Event |
| --- | --- |
| HH:mm | |

## Cause
- Direct cause:
- Root cause:
- Why detection was late:

## Prevention
| Action | Owner | Due date |
| --- | --- | --- |
| | | |
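The Started, Detected, and Resolved fields become more useful once they are turned into durations. A minimal sketch, assuming the timestamps are recorded as ISO 8601 strings; the function names and the sample times are invented for illustration.

```javascript
// Whole minutes between two ISO 8601 timestamps.
function minutesBetween(startIso, endIso) {
  return Math.round((Date.parse(endIso) - Date.parse(startIso)) / 60000);
}

// Derives the durations a postmortem summary usually reports.
function summarizeTimeline({ started, detected, resolved }) {
  return {
    minutesToDetect: minutesBetween(started, detected),
    minutesToResolve: minutesBetween(detected, resolved),
    totalMinutes: minutesBetween(started, resolved)
  };
}

// Made-up timestamps for demonstration.
console.log(summarizeTimeline({
  started: "2025-01-15T09:00:00Z",
  detected: "2025-01-15T09:42:00Z",
  resolved: "2025-01-15T11:10:00Z"
}));
// { minutesToDetect: 42, minutesToResolve: 88, totalMinutes: 130 }
```

A long time to detect points at missing alerts, while a long time to resolve points at rollback or recovery gaps.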

Google’s SRE chapter on postmortem culture is a strong external reference for keeping this blameless and useful.

For adjacent ClaudeCodeLab material, read Claude Code security best practices, Claude Code permissions guide, Claude Code API cost guide, and the verification receipt workflow.

Solo builders can start with the free cheatsheet. If you want reusable CLAUDE.md, hook, and review templates, browse ClaudeCodeLab products. Teams that need permissions, rollout rules, review gates, and incident drills can use Claude Code training and consultation.

After trying these templates in ClaudeCodeLab article and repository drills, the most useful change was putting containment before diagnosis. Syntax-checking the JSON, Bash hook, and Node wrapper before publishing also made the guidance safer. A 20-minute rehearsal quickly reveals missing alerts, missing backups, and overly broad Claude Code permissions.
