Tips & Tricks

7 Real Production Incidents with Claude Code: Full Recovery Procedures with RCA & Prevention

7 real production incidents involving Claude Code: API key leaks, DB wipes, billing explosions, and service outages — with root cause analysis and prevention strategies.

“Claude Code is convenient, but I’m scared to use it in production” — many engineers feel this way. And that instinct is correct.

Claude Code operates on your filesystem and shell with higher privileges than a typical IDE. At the same time, repeatedly clicking the approve button dulls your alertness — a known human psychological vulnerability. When these two factors combine, production incidents happen.

This article publicly documents 7 real production incidents involving Claude Code, complete with causes, impact scope, recovery procedures, RCA (root cause analysis), and prevention strategies. Read this before an accident happens at your organization.


Incident 1: API Key Leak → $3,000 in Unauthorized Charges

Timeline

09:12  Instructed Claude Code to "commit .env to pass environment variables to CI"
09:13  git add .env && git push executed without approval (allow list was too permissive)
09:14  GitHub secret scanning detected it, sent email notification
09:31  AWS crawler detected the OpenAI key, unauthorized usage began
11:00  Confirmed $3,000 charge on OpenAI dashboard

Recovery Procedure

# Step 1: Immediately revoke API key (top priority — within 5 minutes)
# → Revoke the key from OpenAI / each service's dashboard

# Step 2: Completely remove .env from git history
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch .env" \
  --prune-empty --tag-name-filter cat -- --all

# Step 3: Force push all branches
git push origin --force --all
git push origin --force --tags

# Step 4: Add to .gitignore to prevent recurrence
echo ".env" >> .gitignore
git add .gitignore && git commit -m "security: add .env to gitignore"

# Step 5: Issue a new API key and set it in .env

RCA

  • Direct cause: settings.json had Bash(git add*) in the allow list, allowing execution without confirmation
  • Root cause: Security configuration was deprioritized in favor of product code

Prevention

// .claude/settings.json
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash(git add*)",
      "hooks": [{
        "type": "command",
        "command": "git diff --cached --name-only | grep -E '\\.env' && echo '🚨 .env detected! Aborting.' && exit 1 || exit 0"
      }]
    }]
  }
}

Incident 2: rm -rf Wiped the Entire Project

Timeline

14:33  Instructed "clean up node_modules and reinstall"
14:33  Claude Code executed rm -rf node_modules (normal so far)
14:34  Followed up with "delete old build files too", executed rm -rf dist/
14:34  Path misinterpretation caused rm -rf dist /src to run (space-separated)
14:35  src/ directory completely wiped. Config files outside git tracking also deleted
14:40  git checkout . restored tracked files, but .env and local configs were permanently lost

Recovery Procedure

# Git-tracked files can be restored
git checkout .
git clean -fd   # Remove any extra files

# Search for deleted files in git stash or reflog
git stash list
git reflog

# Non-git files (.env etc.) must be restored from backup
# → If no backup exists, reconfigure from scratch

RCA

  • Direct cause: Path containing a space was not properly quoted, resulting in rm -rf dist /src
  • Root cause: rm -rf was in allow instead of ask

Prevention

{
  "permissions": {
    "deny": ["Bash(rm -rf ~*)", "Bash(rm -rf /*)"],
    "ask": ["Bash(rm*)"]
  },
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash(rm*)",
      "hooks": [{
        "type": "command",
        "command": "echo '⚠️ Delete command detected. Target: $CLAUDE_TOOL_INPUT_COMMAND\nExecuting in 5 seconds. Press Ctrl+C to cancel.' && sleep 5"
      }]
    }]
  }
}

Incident 3: git push --force Wiped 3 Colleagues’ Commits

Timeline

16:00  Instructed "there's a conflict with remote. overwrite with local changes"
16:01  git push --force origin main was executed
16:01  ~200 lines of code committed by 3 colleagues that day were erased
16:10  Colleague A posted on Slack: "Hey, my commits are gone"
16:15  Root cause identified. Colleagues A/B had copies locally, but C had deleted theirs — permanently lost

Recovery Procedure

# Find lost commits in reflog (run on the machine that executed the push)
git reflog | head -30
# → Example: abc1234 HEAD@{3}: commit before the --force push

# Restore the lost commits
git checkout -b recovery abc1234
git push origin recovery

# Merge into main and clean up
git checkout main
git merge recovery --no-ff
git push origin main

RCA

  • Direct cause: The intent to resolve a conflict was interpreted as --force instead of --force-with-lease
  • Root cause: Force push to main branch was not in the deny list

Prevention

{
  "permissions": {
    "deny": [
      "Bash(git push --force *main*)",
      "Bash(git push --force *master*)",
      "Bash(git push -f *main*)"
    ]
  }
}
<!-- CLAUDE.md -->
## Git Rules
- `git push --force` is prohibited
- Use `git push --force-with-lease` for conflict resolution
- Always get user confirmation before pushing to main/master

Incident 4: Failed DB Migration Wiped 40,000 Production Records

Timeline

10:00  Instructed "run migration to add phone_number column to users table"
10:01  Claude Code generated and executed the migration script
10:01  A bug in the script caused the reverse migration (with DROP COLUMN) to run
10:02  users.email column (NOT NULL) was dropped from production
10:02  All APIs returned 500 errors, service completely down
10:05  Incident recognized, root cause investigation began
10:30  Restored from previous day's snapshot (one day of data lost)

Recovery Procedure

# 1. Immediately put the service into maintenance mode
# nginx: return 503;  or  Vercel: maintenance page

# 2. Check current DB state
psql $DATABASE_URL -c "\d users"

# 3. Restore from backup (for RDS)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mydb \
  --target-db-instance-identifier mydb-restored \
  --restore-time 2026-04-17T23:00:00Z

# 4. If the column still exists, add it manually
ALTER TABLE users ADD COLUMN email VARCHAR(255);
UPDATE users SET email = '(recovery required)' WHERE email IS NULL;

# 5. Restore service

RCA

  • Direct cause: The migration script generator confused up/down migrations
  • Root cause: The migration was run on production without testing in a staging environment

Prevention

<!-- CLAUDE.md -->
## DB Migration Required Rules
1. Always test in staging environment before applying to production
2. Always take a manual backup before migration:
   pg_dump $DATABASE_URL > backup_$(date +%Y%m%d_%H%M%S).sql
3. Scripts containing DROP COLUMN / TRUNCATE / DELETE (without WHERE) must
   always be confirmed by the user before execution
4. If using production DATABASE_URL, display 'Writing to PRODUCTION DB. Continue?' before execution

Incident 5: Infinite API Calls Generated $800 Overnight

Timeline

23:00  Instructed "automatically retry on errors" and started batch processing
23:01  External API started returning 503
23:01  Retry logic ran without a limit, hammering the API every second
07:00  Next morning, Anthropic notification: "Usage approaching limit"
07:05  Found 28,000 API calls and $800 in charges

Recovery Procedure

# 1. Immediately stop the process
pkill -f "node batch-process.js"

# 2. Review charges and contact Anthropic support
# → Sincere communication may result in partial refund

# 3. Set up usage alerts
# Anthropic console → Usage Limits → Set monthly budget alert

Prevention

// utils/retry.ts — always use this utility
export async function withRetry<T>(
  fn: () => Promise<T>,
  { maxAttempts = 3, baseDelayMs = 1000, maxDelayMs = 30000 } = {}
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      const delay = Math.min(
        baseDelayMs * 2 ** (attempt - 1) + Math.random() * 500,
        maxDelayMs
      );
      console.warn(`Retry ${attempt}/${maxAttempts} in ${Math.round(delay)}ms`);
      await new Promise(r => setTimeout(r, delay));
    }
  }
  throw new Error("unreachable");
}
<!-- CLAUDE.md -->
## API Call Rules
- Maximum 3 retries with exponential backoff required
- while(true) + API calls are strictly prohibited
- Batch jobs must have explicit count limits
- Run --dry-run smoke test before production execution

Incident 6: Broken Dependencies After Deploy Took Down the Service

Timeline

15:00  Instructed "update packages to latest versions"
15:01  npm update executed, package-lock.json changed significantly
15:05  Local build passed
15:10  Deployed to production
15:12  Major version mismatch in production dependencies, startup failed
15:12  503 errors, service completely down
15:30  Rollback to previous version completed

Recovery Procedure

# For Vercel / Cloudflare Pages: immediately reactivate the previous deployment
# → One-click rollback from the dashboard

# Revert the code with git revert
git revert HEAD~1
git push

# Restore package-lock.json to previous version
git checkout HEAD~1 -- package-lock.json
npm ci  # Strictly uses package-lock.json

RCA

  • Direct cause: npm update upgraded a major version, causing a breaking change
  • Root cause: Skipped deployment verification in staging environment

Prevention

<!-- CLAUDE.md -->
## Package Management Rules
- npm update is prohibited (npm update --save-dev conditionally OK)
- Major version upgrades of packages require user confirmation
- Always verify behavior in staging before deploying
- Monitor error logs for 5 minutes after production deployment

Incident 7: Misconfigured Permissions Exposed All User Data

Timeline

11:00  Instructed "add user list to admin panel API endpoint"
11:05  /api/admin/users implemented without authentication check
11:10  Deployed to production
11:10  All users' personal information accessible to anyone
13:30  Security audit tool detected unauthenticated endpoint
13:35  Endpoint immediately disabled

Recovery Procedure

# 1. Immediately disable the affected endpoint
# nginx: location /api/admin { return 403; }

# 2. Review access logs to determine the scope of exposure
grep "/api/admin/users" /var/log/nginx/access.log | \
  awk '{print $1}' | sort | uniq -c | sort -rn

# 3. Notify affected users (check GDPR / data protection law requirements)

# 4. Add authentication middleware and redeploy

RCA

  • Direct cause: The instruction “add to admin panel” did not convey authentication requirements
  • Root cause: Security requirements were not documented in CLAUDE.md

Prevention

<!-- CLAUDE.md -->
## API Security Requirements
- /api/admin/* endpoints must always implement admin authentication
- /api/user/* endpoints must always implement login authentication
- Only /api/public/* endpoints are accessible without authentication
- When adding a new API, explicitly comment the required authentication level

Common Incident Response Flow (Postmortem Template)

# Postmortem: [Incident Title]

## Summary
- Occurred at: YYYY-MM-DD HH:MM
- Detected at: YYYY-MM-DD HH:MM
- Resolved at: YYYY-MM-DD HH:MM
- Impact: (number of users / features / duration)
- Severity: P0/P1/P2/P3

## Timeline
| Time  | Event |
|-------|-------|
| HH:MM | Incident occurred |
| HH:MM | Detected |
| HH:MM | Response began |
| HH:MM | Root cause identified |
| HH:MM | Recovery completed |

## Root Cause
- Direct cause:
- Root cause:
- Aggravating cause:

## Prevention Actions
| Action | Owner | Deadline |
|--------|-------|----------|
|        |       |          |

## Lessons Learned

Summary: Minimum Configuration to Prevent Incidents

// Copy this now and paste into .claude/settings.json
{
  "permissions": {
    "deny": [
      "Bash(rm -rf ~*)",
      "Bash(rm -rf /*)",
      "Bash(git push --force *main*)",
      "Bash(git push --force *master*)",
      "Bash(git push -f *main*)",
      "Bash(DROP TABLE*)",
      "Bash(TRUNCATE *)",
      "Bash(curl * | bash)",
      "Bash(wget * | sh)"
    ],
    "ask": [
      "Write(**)", "Edit(**)",
      "Bash(rm*)", "Bash(git commit*)",
      "Bash(git push*)", "Bash(*deploy*)",
      "Bash(npm install*)", "Bash(*migrate*)"
    ]
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash(git add*)",
        "hooks": [{ "type": "command",
          "command": "git diff --cached --name-only | grep '\\.env' && echo '🚨 .env detected! Aborting.' && exit 1 || exit 0" }]
      },
      {
        "matcher": "Bash(rm*)",
        "hooks": [{ "type": "command",
          "command": "echo '⚠️ Delete command detected. Executing in 5 seconds. Press Ctrl+C to cancel.' && sleep 5" }]
      }
    ]
  }
}

Production incidents happen when you skip the 30 minutes it takes to configure settings. All 7 incidents in this article could have been prevented with a proper settings.json and CLAUDE.md.

References

#claude-code #incident #production #sre #security #postmortem

Level up your Claude Code workflow

50 battle-tested prompt templates you can copy-paste into Claude Code right now.

Free

Free PDF: Claude Code Cheatsheet in 5 Minutes

Just enter your email and we'll send you the single-page A4 cheatsheet right away.

We handle your data with care and never send spam.

Masa

About the Author

Masa

Engineer obsessed with Claude Code. Runs claudecode-lab.com, a 10-language tech media with 2,000+ pages.