Tips & Tricks (Updated: 6/3/2026)

Claude Code Production Incidents: Detection, Rollback, RCA, and Prevention

A practical Claude Code incident playbook for leaks, deletes, DB failures, runaway cost, rollback, RCA, and prevention.

Claude Code can move faster than a normal editor because it can read files, edit code, and run shell commands. That speed is useful in production repositories, but it also changes the failure mode. A vague approval can leak a secret, delete local state, overwrite a branch, run an unsafe migration, or call an API thousands of times.

This article is not a claim about a specific company’s private outage. It is a practical incident-response playbook based on ClaudeCodeLab drills, content operations, and repository safety reviews. The amounts and timestamps are examples, but the patterns are realistic enough to rehearse before you need them.

In plain terms, an incident is an event that affects users, data, security, cost, or service reliability. Containment means stopping the damage from spreading. RCA means root cause analysis. Rollback means returning to the last known safe version.

Use the official Claude Code settings documentation and hooks guide for the exact configuration surface. This article turns those tools into a production incident flow: detect, contain, diagnose, rollback, communicate, postmortem, and prevent recurrence.

The Incident Flow

Do not start by asking Claude Code to “fix production.” First, lock in the order of the response phases.

| Phase | Goal | What to ask Claude Code |
| --- | --- | --- |
| Detect | Identify what changed and who is affected | Summarize alerts, logs, diffs, deploys, and recent commands |
| Contain | Stop more damage | Propose key revocation, job shutdown, feature flags, or endpoint disablement |
| Diagnose | Narrow the direct cause | Compare the last safe deploy with the failing change |
| Rollback | Return to a safe version | List rollback target, data risk, and verification commands |
| Communicate | Keep stakeholders aligned | Draft current status, impact, next update time, and owner |
| Postmortem | Convert the event into learning | Fill RCA, timeline, missed detection, and action items |
| Prevent | Make recurrence harder | Add permissions, hooks, CI checks, alerts, and review gates |

The most important rule is “contain first, investigate second” for secrets, billing, personal data, and database writes.

Seven Concrete Incident Patterns

| Pattern | What happens | First response | Common failure case |
| --- | --- | --- | --- |
| Secret leak | .env, logs, or screenshots expose an API key | Revoke the key, rotate secrets, inspect logs | Rewriting git history while CI logs still contain the key |
| Dangerous delete | rm -rf or a broad cleanup removes needed files | Stop work, inspect backups, list untracked files | git checkout . restores tracked files only |
| Force push | main is overwritten and teammates’ commits disappear | Stop pushing, inspect reflog, recover branch | Confusing --force-with-lease with --force |
| DB migration | A column drop, full-table update, or lock breaks production | Pause writes, snapshot state, restore safely | Running untested SQL directly against production |
| Runaway API calls | Retry logic loops and cost rises | Kill the process, pause the queue, check limits | “Retry on error” becomes infinite retries |
| Broken dependency deploy | A package update passes locally but fails at startup | Reactivate previous deploy, inspect lockfile | npm update pulls in a breaking release that the package.json range allows |
| Missing auth | Admin or user data endpoint is public | Disable endpoint, inspect access logs, notify as needed | “Admin endpoint” is not written as an auth requirement |

Case 1: Secret Leak

Detection often comes from GitHub secret scanning, a cloud usage alert, a billing screen, or CI logs. See GitHub’s official secret scanning documentation for how the platform detects supported token patterns.

Your first action is revocation, not investigation. Disable the leaked key, issue a new one, update production and CI secrets, and record where the key appeared. Check public repositories, pull requests, CI logs, chat, and error monitoring.

git status --short
git diff --cached --name-only
git log --all -- .env .env.local
git grep -n "sk-" -- ':!node_modules' ':!dist'

The failure case is rushing into history rewriting and creating a second incident with force pushes. If you must rewrite history, coordinate branch protection, tags, forks, CI caches, and teammate clones first. Ask Claude Code for leak scope, rotation steps, recovery options, and a stakeholder message before asking it to run commands.
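
Part of the leak-scope check can be scripted. The following is a minimal sketch, assuming the leaked key uses the `sk-` prefix from the `git grep` example above. The names `findSecretCandidates` and `redact`, the 16-character minimum, and the preview format are illustrative assumptions, not part of any existing tool.

```javascript
// Hypothetical helper: scan text (a log excerpt, a diff, a chat export) for
// strings that look like the leaked key and return redacted findings.
// Assumption: keys start with "sk-" followed by at least 16 key characters.
const KEY_PATTERN = /\bsk-[A-Za-z0-9_-]{16,}/g;

function redact(secret) {
  // Keep only the first 6 characters so the report itself never leaks the key.
  return `${secret.slice(0, 6)}...[redacted ${secret.length - 6} chars]`;
}

function findSecretCandidates(text) {
  const findings = [];
  text.split("\n").forEach((line, index) => {
    for (const match of line.matchAll(KEY_PATTERN)) {
      findings.push({
        line: index + 1,
        column: match.index + 1,
        preview: redact(match[0]),
      });
    }
  });
  return findings;
}

const sampleLog = [
  "INFO starting worker",
  "DEBUG auth header: Bearer sk-abcDEF1234567890xyz",
  "WARN task-queue is slow", // "task-" is not preceded by a word boundary, so it is ignored
].join("\n");

console.log(findSecretCandidates(sampleLog));
```

The word boundary and minimum length keep ordinary words such as `task-` or `risk-` out of the report. Treat the output as a list of places to inspect, not as proof that nothing else leaked.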

Case 2: Failed Database Migration

For database incidents, stop writes before you explore. Use maintenance mode, disable the feature flag, pause workers, or switch the application role to read-only access.

psql "$DATABASE_URL" -c "select now();"
psql "$DATABASE_URL" -c "\d users"
pg_dump "$DATABASE_URL" --schema-only > schema_before_repair.sql

Separate code rollback from data recovery. Code may return to the previous deployment in minutes; deleted rows or columns require backups, WAL, audit logs, or external resync. Before Claude Code writes SQL, require table name, estimated affected rows, lock risk, restore source, and verification query.

Concrete failure cases include DELETE FROM users; without WHERE, a DROP COLUMN in the wrong migration direction, and a blocking index build (CREATE INDEX without CONCURRENTLY in PostgreSQL) on a large production table.
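
The failure cases above can be caught by a small pre-flight check before any SQL reaches production. This is a heuristic sketch only: `reviewSql` is a hypothetical name, it pattern-matches text rather than parsing SQL, and the index rule assumes PostgreSQL. It can miss statements that use subqueries or unusual spacing.

```javascript
// Hypothetical pre-flight check for a single SQL statement.
// It only inspects the text; it does not parse or execute anything.
function reviewSql(statement) {
  const sql = statement.replace(/\s+/g, " ").trim();
  const warnings = [];

  // DELETE or UPDATE with no WHERE clause touches every row in the table.
  if (/^(delete from|update) /i.test(sql) && !/ where /i.test(sql)) {
    warnings.push("no WHERE clause: affects every row");
  }

  // Dropped columns and tables cannot be restored by a code rollback.
  if (/\bdrop (column|table)\b/i.test(sql)) {
    warnings.push("destructive DDL: needs a verified restore source");
  }

  // Assumption: PostgreSQL, where CREATE INDEX without CONCURRENTLY
  // blocks writes to the table while the index builds.
  const createsIndex = /^create (unique )?index /i.test(sql);
  const concurrent = /^create (unique )?index concurrently /i.test(sql);
  if (createsIndex && !concurrent) {
    warnings.push("index build without CONCURRENTLY: blocks writes on the table");
  }

  return warnings;
}

for (const sql of [
  "DELETE FROM users;",
  "ALTER TABLE users DROP COLUMN email",
  "CREATE INDEX idx_users_email ON users (email)",
  "UPDATE users SET active = false WHERE id = 42",
]) {
  console.log(sql, "=>", reviewSql(sql));
}
```

An empty result does not mean the statement is safe. It only means none of these three patterns matched, so the table name, affected rows, and restore source still need a human answer.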

Case 3: Runaway API Retries

LLM and external API incidents often hide inside “error handling.” When the remote service returns 503, bad retry logic can call the service every second for hours. Containment means killing the process, pausing the queue, setting usage limits, and checking alerts.

Save this as incident-budget-runner.mjs and wrap batch jobs with it.

#!/usr/bin/env node
import { spawn } from "node:child_process";

const command = process.argv.slice(2);
const maxAttempts = Number(process.env.MAX_ATTEMPTS || 3);
const maxCostCents = Number(process.env.MAX_COST_CENTS || 200);
const costPerAttempt = Number(process.env.COST_PER_ATTEMPT_CENTS || 0);

if (command.length === 0) {
  console.error("usage: node incident-budget-runner.mjs <command> [...args]");
  process.exit(2);
}

let estimatedCost = 0;

for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
  const child = spawn(command[0], command.slice(1), {
    stdio: "inherit",
    shell: process.platform === "win32"
  });

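  // Resolve on normal exit, and also when the command cannot be started
  // (for example ENOENT), which would otherwise crash on an unhandled "error" event.
  const exitCode = await new Promise((resolve) => {
    child.on("error", (err) => {
      console.error(`failed to start: ${err.message}`);
      resolve(127);
    });
    child.on("exit", (code) => resolve(code ?? 1));
  });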

  estimatedCost += costPerAttempt;

  if (exitCode === 0) process.exit(0);
  if (estimatedCost >= maxCostCents) {
    console.error(`stopped: estimated cost ${estimatedCost} cents reached`);
    process.exit(1);
  }

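  // Back off exponentially (1s, 2s, 4s, ... capped at 10s).
  // Skip the wait after the final attempt, since nothing follows it.
  if (attempt < maxAttempts) {
    const delayMs = Math.min(1000 * 2 ** (attempt - 1), 10_000);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }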
}

console.error(`failed after ${maxAttempts} attempts`);
process.exit(1);

Run it like this:

MAX_ATTEMPTS=3 MAX_COST_CENTS=200 COST_PER_ATTEMPT_CENTS=25 \
node incident-budget-runner.mjs node batch-process.js
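
The wrapper bounds retries around a whole process. For retries inside application code, the same limits can be expressed as a pure decision function. The sketch below is an assumption-heavy illustration: `nextRetry`, the set of retryable status codes, and the option names are all hypothetical. Unlike the wrapper, it checks the budget before the next attempt rather than after the last one.

```javascript
// Hypothetical decision helper for in-process retries.
// Pure function: no timers and no I/O, so it is easy to unit test.
// Assumption: only these HTTP statuses are treated as transient.
const RETRYABLE_STATUS = new Set([429, 502, 503, 504]);

function nextRetry({ status, attempt, spentCents }, limits) {
  const {
    maxAttempts,
    maxCostCents,
    costPerAttemptCents,
    baseDelayMs = 1000,
    maxDelayMs = 10000,
  } = limits;

  if (!RETRYABLE_STATUS.has(status)) {
    return { retry: false, reason: "non-retryable status" };
  }
  if (attempt >= maxAttempts) {
    return { retry: false, reason: "attempt limit reached" };
  }
  // Refuse the retry if paying for one more attempt would exceed the budget.
  if (spentCents + costPerAttemptCents > maxCostCents) {
    return { retry: false, reason: "cost budget reached" };
  }
  // Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
  const delayMs = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
  return { retry: true, delayMs };
}

const retryLimits = { maxAttempts: 3, maxCostCents: 200, costPerAttemptCents: 25 };
console.log(nextRetry({ status: 503, attempt: 1, spentCents: 25 }, retryLimits));
console.log(nextRetry({ status: 503, attempt: 3, spentCents: 75 }, retryLimits));
console.log(nextRetry({ status: 400, attempt: 1, spentCents: 25 }, retryLimits));
```

Keeping the decision separate from the sleep and the network call means the “retry on error becomes infinite retries” failure can be covered by a plain unit test.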

Claude Code Guardrails

Start by moving risky actions into ask or deny. Also deny direct reads of secret files.

{
  "$schema": "https://json.schemastore.org/claude-code-settings.json",
  "permissions": {
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Read(./secrets/**)",
      "Bash(git push --force *main*)",
      "Bash(git push -f *main*)",
      "Bash(rm -rf /*)",
      "Bash(rm -rf ~*)"
    ],
    "ask": [
      "Bash(git push*)",
      "Bash(rm*)",
      "Bash(npm install*)",
      "Bash(*migrate*)",
      "Bash(*deploy*)"
    ]
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/protect-danger.sh"
          }
        ]
      }
    ]
  }
}
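
Guardrails tend to erode when settings files are edited by hand, so a CI check can confirm they are still present. The sketch below is one possible shape. `lintSettings` and the `REQUIRED_DENY` list are hypothetical and simply mirror the example settings above; they are not an official schema or rule set.

```javascript
// Hypothetical lint for an already-parsed settings object.
// Assumption: the required rules are the secret-file rules from the example above.
const REQUIRED_DENY = ["Read(./.env)", "Read(./.env.*)", "Read(./secrets/**)"];

function lintSettings(settings) {
  const deny = settings?.permissions?.deny ?? [];
  const ask = settings?.permissions?.ask ?? [];
  const problems = [];

  for (const rule of REQUIRED_DENY) {
    if (!deny.includes(rule)) problems.push(`missing deny rule: ${rule}`);
  }

  // A rule listed in both arrays is confusing to review, so flag it for cleanup.
  for (const rule of ask) {
    if (deny.includes(rule)) problems.push(`rule appears in both ask and deny: ${rule}`);
  }

  // Expect at least one PreToolUse hook entry that targets the Bash tool.
  const hasBashHook = settings?.hooks?.PreToolUse?.some((entry) => entry.matcher === "Bash");
  if (!hasBashHook) problems.push("no PreToolUse hook for Bash");

  return problems;
}

// Example: a settings object that kept only one of the secret-file rules.
console.log(lintSettings({ permissions: { deny: ["Read(./.env)"], ask: ["Bash(rm*)"] } }));
```

In CI, the parsed contents of the project settings file would be passed in, and a non-empty result would fail the build.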

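Save the hook as .claude/hooks/protect-danger.sh and make it executable. For a PreToolUse hook, exit code 2 blocks the tool call, and the message written to stderr is passed back to Claude.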

#!/usr/bin/env bash
set -euo pipefail

payload="$(cat)"
command="$(node -e 'const fs = require("fs"); const raw = fs.readFileSync(0, "utf8") || "{}"; const json = JSON.parse(raw); console.log(json.tool_input?.command || "");' <<< "$payload")"
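# Match case-insensitively so lowercase SQL such as "drop table" is also blocked.
shopt -s nocasematch
blocked='(rm[[:space:]]+-rf[[:space:]]+(/|~)|git[[:space:]]+push[[:space:]].*(-f|--force)([[:space:]]|$)|DROP[[:space:]]+TABLE|TRUNCATE[[:space:]])'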

if [[ "$command" =~ $blocked ]]; then
  echo "Blocked dangerous command: $command" >&2
  exit 2
fi

exit 0
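
A block pattern like this is easy to get subtly wrong, so it helps to test it against known commands before relying on it. The sketch below mirrors the hook's pattern in JavaScript for unit tests only. It assumes that `\s` stands in for the POSIX class `[[:space:]]` and that matching is case-insensitive, which in Bash requires `shopt -s nocasematch`. The name `isBlocked` is hypothetical.

```javascript
// JavaScript mirror of the hook's block pattern, for tests only.
// Assumptions: \s replaces [[:space:]], and the "i" flag mirrors
// case-insensitive matching (Bash needs `shopt -s nocasematch` for that).
const BLOCKED = /(rm\s+-rf\s+(\/|~)|git\s+push\s.*(-f|--force)(\s|$)|DROP\s+TABLE|TRUNCATE\s)/i;

function isBlocked(command) {
  return BLOCKED.test(command);
}

const hookCases = [
  ["rm -rf /var/data", true],
  ["rm -rf ./build", false],
  ["git push --force origin main", true],
  ["git push -f origin main", true],
  ["git push --force-with-lease origin main", false],
  ["git push origin main", false],
  ['psql -c "drop table users"', true],
  // Known false positive: a branch name ending in "-f" looks like the -f flag.
  ["git push origin hotfix-f", true],
];

for (const [command, expected] of hookCases) {
  const verdict = isBlocked(command) === expected ? "ok  " : "FAIL";
  console.log(verdict, command);
}
```

The last case shows why the hook should be treated as a backstop rather than a precise filter: it can block harmless commands, and variants such as `rm -fr` are not covered by the pattern.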

Communication and Postmortem Templates

## Incident Update
- Status: investigating / contained / validating recovery
- Impact: feature, users, start time
- Current action: stopped job, reverted deploy, checking logs
- Next update: YYYY-MM-DD HH:mm
- Owner:

# Postmortem: [Incident Title]

## Summary
- Started:
- Detected:
- Resolved:
- Impact:
- Severity: P0/P1/P2/P3

## Timeline
| Time | Event |
| --- | --- |
| HH:mm | |

## Cause
- Direct cause:
- Root cause:
- Why detection was late:

## Prevention
| Action | Owner | Due date |
| --- | --- | --- |
| | | |
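
The Started, Detected, and Resolved fields in the Summary make it possible to put a number on “why detection was late.” A minimal sketch, assuming the timestamps are recorded as ISO 8601 strings. `incidentDurations` and its field names are hypothetical, and the sample times are made up for illustration.

```javascript
// Hypothetical helper: derive time-to-detect and time-to-resolve in minutes
// from the Summary timestamps. Assumption: timestamps are ISO 8601 strings.
function incidentDurations({ started, detected, resolved }) {
  const toMs = (value) => {
    const ms = Date.parse(value);
    if (Number.isNaN(ms)) throw new Error(`invalid timestamp: ${value}`);
    return ms;
  };
  const minutesBetween = (from, to) => Math.round((toMs(to) - toMs(from)) / 60000);

  return {
    timeToDetectMin: minutesBetween(started, detected),
    timeToResolveMin: minutesBetween(started, resolved),
  };
}

// Illustrative timestamps only, not from a real incident.
console.log(
  incidentDurations({
    started: "2026-01-10T09:00:00Z",
    detected: "2026-01-10T09:42:00Z",
    resolved: "2026-01-10T11:15:00Z",
  })
);
```

Tracking these two numbers across drills gives the Prevention table a concrete target, such as shortening time to detect.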

Google’s SRE book chapter on postmortem culture is a strong external reference for keeping the postmortem blameless and useful.

For adjacent ClaudeCodeLab material, read Claude Code security best practices, Claude Code permissions guide, Claude Code API cost guide, and the verification receipt workflow.

After trying these templates in ClaudeCodeLab article and repository drills, the most useful change was putting containment before diagnosis. Syntax-checking the JSON, Bash hook, and Node wrapper before publishing also made the guidance safer. A 20-minute rehearsal quickly reveals missing alerts, missing backups, and overly broad Claude Code permissions.

About the Author

Masa

Engineer focused on practical Claude Code workflows. Runs claudecode-lab.com, a 10-language technical media site.