7 Real Production Incidents with Claude Code: Full Recovery Procedures with RCA & Prevention
7 real production incidents involving Claude Code: API key leaks, DB wipes, billing explosions, and service outages — with root cause analysis and prevention strategies.
“Claude Code is convenient, but I’m scared to use it in production” — many engineers feel this way. And that instinct is correct.
Claude Code operates on your filesystem and shell with higher privileges than a typical IDE. At the same time, repeatedly clicking the approve button dulls your alertness — a known human psychological vulnerability. When these two factors combine, production incidents happen.
This article publicly documents 7 real production incidents involving Claude Code, complete with causes, impact scope, recovery procedures, RCA (root cause analysis), and prevention strategies. Read this before an accident happens at your organization.
Incident 1: API Key Leak → $3,000 in Unauthorized Charges
Timeline
09:12 Instructed Claude Code to "commit .env to pass environment variables to CI"
09:13 git add .env && git push executed without approval (allow list was too permissive)
09:14 GitHub secret scanning detected it, sent email notification
09:31 AWS crawler detected the OpenAI key, unauthorized usage began
11:00 Confirmed $3,000 charge on OpenAI dashboard
Recovery Procedure
# Step 1: Immediately revoke API key (top priority — within 5 minutes)
# → Revoke the key from OpenAI / each service's dashboard
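# Step 2: Completely remove .env from git history
# (Git's documentation now recommends git filter-repo for this; filter-branch works without extra installs)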
git filter-branch --force --index-filter \
"git rm --cached --ignore-unmatch .env" \
--prune-empty --tag-name-filter cat -- --all
# Step 3: Force push all branches
git push origin --force --all
git push origin --force --tags
# Step 4: Add to .gitignore to prevent recurrence
echo ".env" >> .gitignore
git add .gitignore && git commit -m "security: add .env to gitignore"
# Step 5: Issue a new API key and set it in .env
RCA
- Direct cause: settings.json had `Bash(git add*)` in the `allow` list, allowing execution without confirmation
- Root cause: Security configuration was deprioritized in favor of product code
Prevention
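// .claude/settings.json
// The matcher is the tool name; the hook receives the tool call as JSON on stdin, and exit code 2 blocks it (requires jq)
{
"hooks": {
"PreToolUse": [{
"matcher": "Bash",
"hooks": [{
"type": "command",
"command": "cmd=$(jq -r '.tool_input.command'); case \"$cmd\" in *'git add'*|*'git commit'*) if echo \"$cmd\" | grep -Eq '\\.env' || git diff --cached --name-only | grep -Eq '\\.env'; then echo '🚨 .env detected! Aborting.' >&2; exit 2; fi;; esac; exit 0"
}]
}]
}
}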
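As a second layer that does not depend on Claude Code at all, a git pre-commit hook can reject staged dotenv files. The following is a minimal sketch, assuming bash; the helper name `find_env_files` and the decision to let `.env.example` through are illustrative assumptions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch of a check for .git/hooks/pre-commit (hypothetical helper name).
# Reads file paths on stdin, prints the ones that look like dotenv files,
# and returns 0 if at least one was found, 1 otherwise.
find_env_files() {
  # Match ".env" or ".env.<suffix>" as the final path component,
  # but let ".env.example" templates through (assumption: they hold no secrets).
  grep -E '(^|/)\.env(\.[^/]+)?$' | grep -Ev '(^|/)\.env\.example$'
}

# In the real hook, feed it the staged file list and abort on a match:
#   if git diff --cached --name-only | find_env_files; then
#     echo "Refusing to commit the dotenv files listed above" >&2
#     exit 1
#   fi
```

Because the hook runs inside git itself, it also catches `git add .` followed by a commit, which a check on the command text alone cannot see.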
Incident 2: rm -rf Wiped the Entire Project
Timeline
14:33 Instructed "clean up node_modules and reinstall"
14:33 Claude Code executed rm -rf node_modules (normal so far)
14:34 Followed up with "delete old build files too", executed rm -rf dist/
14:34 Path misinterpretation caused rm -rf dist /src to run (space-separated)
14:35 src/ directory completely wiped. Config files outside git tracking also deleted
14:40 git checkout . restored tracked files, but .env and local configs were permanently lost
Recovery Procedure
# Git-tracked files can be restored
git checkout .
git clean -nd # Dry run: list untracked files first; run `git clean -fd` only if nothing listed is still needed
# Search for deleted files in git stash or reflog
git stash list
git reflog
# Non-git files (.env etc.) must be restored from backup
# → If no backup exists, reconfigure from scratch
RCA
- Direct cause: A path containing a space was not properly quoted, resulting in `rm -rf dist /src`
- Root cause: `rm -rf` was in `allow` instead of `ask`
Prevention
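// .claude/settings.json (the hook reads the command from the JSON on stdin; requires jq)
{
"permissions": {
"deny": ["Bash(rm -rf ~*)", "Bash(rm -rf /*)"],
"ask": ["Bash(rm*)"]
},
"hooks": {
"PreToolUse": [{
"matcher": "Bash",
"hooks": [{
"type": "command",
"command": "cmd=$(jq -r '.tool_input.command'); case \"$cmd\" in 'rm '*|*' rm '*) echo \"⚠️ Delete command detected. Target: $cmd\" >&2; echo 'Executing in 5 seconds. Press Ctrl+C to cancel.' >&2; sleep 5;; esac; exit 0"
}]
}]
}
}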
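A delay only helps if someone is watching. A stricter option is to check the targets of the delete command before it runs. The sketch below assumes bash and uses a hypothetical helper, `rm_targets_are_safe`, which accepts only relative paths that stay inside the current directory.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only if every non-option argument of an rm
# command is a relative path that stays inside the current project.
rm_targets_are_safe() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      -*) continue ;;                    # options such as -rf
      /*|'~'*) return 1 ;;               # absolute or home-relative path
      ..|../*|*/..|*/../*) return 1 ;;   # climbs out of the project
      .|./|"") return 1 ;;               # the project root itself, or empty
    esac
  done
  return 0
}
```

With this check, `rm_targets_are_safe -rf dist /src` fails because of the absolute `/src`, which is the exact argument list from the timeline. The sketch does not parse shell quoting, so treat it as a starting point rather than a complete parser.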
Incident 3: git push --force Wiped 3 Colleagues’ Commits
Timeline
16:00 Instructed "there's a conflict with remote. overwrite with local changes"
16:01 git push --force origin main was executed
16:01 ~200 lines of code committed by 3 colleagues that day were erased
16:10 Colleague A posted on Slack: "Hey, my commits are gone"
16:15 Root cause identified. Colleagues A/B had copies locally, but C had deleted theirs — permanently lost
Recovery Procedure
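# Find lost commits in reflog (run on the machine that executed the push)
# If the lost commits were only fetched and never checked out there, also try: git reflog show origin/main
git reflog | head -30
# → Example: abc1234 HEAD@{3}: commit before the --force push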
# Restore the lost commits
git checkout -b recovery abc1234
git push origin recovery
# Merge into main and clean up
git checkout main
git merge recovery --no-ff
git push origin main
RCA
- Direct cause: The intent to resolve a conflict was interpreted as `--force` instead of `--force-with-lease`
- Root cause: Force push to the main branch was not in the `deny` list
Prevention
{
"permissions": {
"deny": [
"Bash(git push --force *main*)",
"Bash(git push --force *master*)",
"Bash(git push -f *main*)"
]
}
}
<!-- CLAUDE.md -->
## Git Rules
- `git push --force` is prohibited
- Use `git push --force-with-lease` for conflict resolution
- Always get user confirmation before pushing to main/master
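The deny patterns above key on `--force` appearing right after `git push`, so a command such as `git push origin main -f` may slip past them. A check that looks at every argument is more robust. This is a minimal sketch assuming bash; `is_bare_force_push` is a hypothetical helper name, and it does not handle combined short flags like `-fu`.

```shell
#!/usr/bin/env bash
# Hypothetical check: return 0 when a command string is a git push that uses
# a bare force flag (--force or -f) anywhere in its arguments.
# --force-with-lease is a different word, so it is not treated as bare force.
is_bare_force_push() {
  local cmd=$1 word
  case "$cmd" in
    *"git push"*) ;;
    *) return 1 ;;
  esac
  # Intentional word splitting: inspect each argument separately
  for word in $cmd; do
    case "$word" in
      --force|-f) return 0 ;;
    esac
  done
  return 1
}
```

A PreToolUse hook or a wrapper script could call this on the proposed command and refuse to continue when it returns 0.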
Incident 4: Failed DB Migration Wiped 40,000 Production Records
Timeline
10:00 Instructed "run migration to add phone_number column to users table"
10:01 Claude Code generated and executed the migration script
10:01 A bug in the script caused the reverse migration (with DROP COLUMN) to run
10:02 users.email column (NOT NULL) was dropped from production
10:02 All APIs returned 500 errors, service completely down
10:05 Incident recognized, root cause investigation began
10:30 Restored from previous day's snapshot (one day of data lost)
Recovery Procedure
# 1. Immediately put the service into maintenance mode
# nginx: return 503; or Vercel: maintenance page
# 2. Check current DB state
psql $DATABASE_URL -c "\d users"
# 3. Restore from backup (for RDS)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier mydb \
--target-db-instance-identifier mydb-restored \
--restore-time 2026-04-17T23:00:00Z
# 4. If the column is still missing, re-add it manually (SQL, run inside psql)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
UPDATE users SET email = '(recovery required)' WHERE email IS NULL;
# 5. Restore service
RCA
- Direct cause: The migration script generator confused up/down migrations
- Root cause: The migration was run on production without testing in a staging environment
Prevention
<!-- CLAUDE.md -->
## DB Migration Required Rules
1. Always test in staging environment before applying to production
2. Always take a manual backup before migration:
pg_dump $DATABASE_URL > backup_$(date +%Y%m%d_%H%M%S).sql
3. Scripts containing DROP COLUMN / TRUNCATE / DELETE (without WHERE) must
always be confirmed by the user before execution
4. If using production DATABASE_URL, display 'Writing to PRODUCTION DB. Continue?' before execution
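Rule 3 can be backed by a mechanical pre-flight scan so that it does not rely on anyone reading the script. The sketch below assumes bash and plain-SQL migration files; `find_destructive_sql` is a hypothetical helper, and it works line by line, so a `DELETE` whose `WHERE` clause sits on a later line is also reported (a false positive that errs on the safe side).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: print every line of a migration file that
# contains a destructive statement, prefixed with its line number.
# Returns 0 if something was found, 1 if the file looks clean.
find_destructive_sql() {
  local file=$1 hits
  hits=$(
    {
      # DROP COLUMN / DROP TABLE / TRUNCATE anywhere on a line
      grep -Ein 'drop[[:space:]]+(column|table)|truncate[[:space:]]+' "$file"
      # DELETE FROM lines that carry no WHERE on the same line
      grep -Ein 'delete[[:space:]]+from' "$file" | grep -Eiv 'where'
    } | sort -t: -k1,1n
  )
  [ -n "$hits" ] && printf '%s\n' "$hits"
}
```

A typical use would be `find_destructive_sql migrations/next.sql && echo "Needs explicit user confirmation"`, where the file path is only an example.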
Incident 5: Infinite API Calls Generated $800 Overnight
Timeline
23:00 Instructed "automatically retry on errors" and started batch processing
23:01 External API started returning 503
23:01 Retry logic ran without a limit, hammering the API every second
07:00 Next morning, Anthropic notification: "Usage approaching limit"
07:05 Found 28,000 API calls and $800 in charges
Recovery Procedure
# 1. Immediately stop the process
pkill -f "node batch-process.js"
# 2. Review charges and contact Anthropic support
# → Sincere communication may result in partial refund
# 3. Set up usage alerts
# Anthropic console → Usage Limits → Set monthly budget alert
Prevention
// utils/retry.ts — always use this utility
export async function withRetry<T>(
fn: () => Promise<T>,
{ maxAttempts = 3, baseDelayMs = 1000, maxDelayMs = 30000 } = {}
): Promise<T> {
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return await fn();
} catch (err) {
if (attempt === maxAttempts) throw err;
const delay = Math.min(
baseDelayMs * 2 ** (attempt - 1) + Math.random() * 500,
maxDelayMs
);
console.warn(`Retry ${attempt}/${maxAttempts} in ${Math.round(delay)}ms`);
await new Promise(r => setTimeout(r, delay));
}
}
throw new Error("unreachable");
}
<!-- CLAUDE.md -->
## API Call Rules
- Maximum 3 retries with exponential backoff required
- while(true) + API calls are strictly prohibited
- Batch jobs must have explicit count limits
- Run --dry-run smoke test before production execution
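The same limits can be enforced one level up, around the batch command itself, so that a job without its own retry cap still stops. This is a minimal sketch assuming bash; `retry_with_backoff` and the `RETRY_BASE_DELAY` variable are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command at most $1 times, doubling the wait
# between attempts, then give up and return the last exit status.
# RETRY_BASE_DELAY is the first wait in seconds (assumed default: 1).
retry_with_backoff() {
  local max_attempts=$1; shift
  local delay=${RETRY_BASE_DELAY:-1}
  local attempt=1 status=0
  while [ "$attempt" -le "$max_attempts" ]; do
    "$@" && return 0
    status=$?
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "Attempt $attempt/$max_attempts failed (exit $status); retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$((delay * 2))
    fi
    attempt=$((attempt + 1))
  done
  echo "Giving up after $max_attempts attempts" >&2
  return "$status"
}
```

For example, `retry_with_backoff 3 node batch-process.js` makes at most three attempts and then exits with the script's own error code instead of looping until morning.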
Incident 6: Broken Dependencies After Deploy Took Down the Service
Timeline
15:00 Instructed "update packages to latest versions"
15:01 npm update executed, package-lock.json changed significantly
15:05 Local build passed
15:10 Deployed to production
15:12 Major version mismatch in production dependencies, startup failed
15:12 503 errors, service completely down
15:30 Rollback to previous version completed
Recovery Procedure
# For Vercel / Cloudflare Pages: immediately reactivate the previous deployment
# → One-click rollback from the dashboard
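# Revert the package update with git revert (assumes it is the most recent commit)
git revert --no-edit HEAD
git push
# The revert already restores the previous package-lock.json.
# If you did not revert, restore it from the commit before the update instead:
#   git checkout HEAD~1 -- package.json package-lock.json
npm ci # Installs exactly the versions recorded in package-lock.json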
RCA
- Direct cause: `npm update` upgraded a major version, causing a breaking change
- Root cause: Skipped deployment verification in staging environment
Prevention
<!-- CLAUDE.md -->
## Package Management Rules
- npm update is prohibited (npm update --save-dev conditionally OK)
- Major version upgrades of packages require user confirmation
- Always verify behavior in staging before deploying
- Monitor error logs for 5 minutes after production deployment
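The "major version upgrades require confirmation" rule is easy to check mechanically once the old and new version strings are known. The sketch below assumes bash and a hypothetical helper, `is_major_bump`, which compares only the part before the first dot.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when moving from version $1 to version $2
# changes the major component (the part before the first dot).
# A single leading "v", "^" or "~" is ignored.
is_major_bump() {
  local old=${1#[v^~]} new=${2#[v^~]}
  [ "${old%%.*}" != "${new%%.*}" ]
}
```

The version pairs could come from the Current and Latest columns of `npm outdated`; for instance `is_major_bump 4.18.2 5.0.0 && echo "Confirm with the user first"`.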
Incident 7: Misconfigured Permissions Exposed All User Data
Timeline
11:00 Instructed "add user list to admin panel API endpoint"
11:05 /api/admin/users implemented without authentication check
11:10 Deployed to production
11:10 All users' personal information accessible to anyone
13:30 Security audit tool detected unauthenticated endpoint
13:35 Endpoint immediately disabled
Recovery Procedure
# 1. Immediately disable the affected endpoint
# nginx: location /api/admin { return 403; }
# 2. Review access logs to determine the scope of exposure
grep "/api/admin/users" /var/log/nginx/access.log | \
awk '{print $1}' | sort | uniq -c | sort -rn
# 3. Notify affected users (check GDPR / data protection law requirements)
# 4. Add authentication middleware and redeploy
RCA
- Direct cause: The instruction “add to admin panel” did not convey authentication requirements
- Root cause: Security requirements were not documented in CLAUDE.md
Prevention
<!-- CLAUDE.md -->
## API Security Requirements
- /api/admin/* endpoints must always implement admin authentication
- /api/user/* endpoints must always implement login authentication
- Only /api/public/* endpoints are accessible without authentication
- When adding a new API, explicitly comment the required authentication level
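Written rules can still be missed, so a post-deploy smoke test that calls protected endpoints without credentials gives a direct signal. This is a minimal sketch assuming bash and curl; `is_rejected_status`, `unauthenticated_status`, and the `BASE_URL` variable are hypothetical names, and treating only 401 and 403 as a pass is an assumption about how the API rejects anonymous callers.

```shell
#!/usr/bin/env bash
# Hypothetical post-deploy check: an unauthenticated request to a protected
# endpoint must be rejected. Only 401 and 403 count as "rejected" here.
is_rejected_status() {
  case "$1" in
    401|403) return 0 ;;
    *) return 1 ;;
  esac
}

# Print just the HTTP status code of a path, sending no credentials.
# BASE_URL is an assumed environment variable, e.g. https://staging.example.com
unauthenticated_status() {
  curl -s -o /dev/null -w '%{http_code}' "${BASE_URL}$1"
}

# Example use in a deploy script:
#   is_rejected_status "$(unauthenticated_status /api/admin/users)" \
#     || { echo "/api/admin/users is reachable without auth" >&2; exit 1; }
```

Run against staging before promoting a build, a check like this would have flagged the endpoint from the timeline minutes after deploy rather than hours later.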
Common Incident Response Flow (Postmortem Template)
# Postmortem: [Incident Title]
## Summary
- Occurred at: YYYY-MM-DD HH:MM
- Detected at: YYYY-MM-DD HH:MM
- Resolved at: YYYY-MM-DD HH:MM
- Impact: (number of users / features / duration)
- Severity: P0/P1/P2/P3
## Timeline
| Time | Event |
|-------|-------|
| HH:MM | Incident occurred |
| HH:MM | Detected |
| HH:MM | Response began |
| HH:MM | Root cause identified |
| HH:MM | Recovery completed |
## Root Cause
- Direct cause:
- Root cause:
- Aggravating cause:
## Prevention Actions
| Action | Owner | Deadline |
|--------|-------|----------|
| | | |
## Lessons Learned
Summary: Minimum Configuration to Prevent Incidents
// Copy this now and paste into .claude/settings.json
{
"permissions": {
"deny": [
"Bash(rm -rf ~*)",
"Bash(rm -rf /*)",
"Bash(git push --force *main*)",
"Bash(git push --force *master*)",
"Bash(git push -f *main*)",
"Bash(DROP TABLE*)",
"Bash(TRUNCATE *)",
"Bash(curl * | bash)",
"Bash(wget * | sh)"
],
"ask": [
"Write(**)", "Edit(**)",
"Bash(rm*)", "Bash(git commit*)",
"Bash(git push*)", "Bash(*deploy*)",
"Bash(npm install*)", "Bash(*migrate*)"
]
},
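"hooks": {
"PreToolUse": [
{
"matcher": "Bash",
"hooks": [{ "type": "command",
"command": "cmd=$(jq -r '.tool_input.command'); case \"$cmd\" in *'git add'*|*'git commit'*) if echo \"$cmd\" | grep -Eq '\\.env' || git diff --cached --name-only | grep -Eq '\\.env'; then echo '🚨 .env detected! Aborting.' >&2; exit 2; fi;; esac; exit 0" }]
},
{
"matcher": "Bash",
"hooks": [{ "type": "command",
"command": "cmd=$(jq -r '.tool_input.command'); case \"$cmd\" in 'rm '*|*' rm '*) echo '⚠️ Delete command detected. Executing in 5 seconds. Press Ctrl+C to cancel.' >&2; sleep 5;; esac; exit 0" }]
}
]
}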
}
Production incidents happen when you skip the 30 minutes it takes to configure settings. All 7 incidents in this article could have been prevented with a proper settings.json and CLAUDE.md.
Related Articles
- Complete Guide to Claude Code Security Best Practices
- 7 Claude Code Security Failure Cases
- Complete Guide to Claude Code Permissions
About the Author
Masa
Engineer obsessed with Claude Code. Runs claudecode-lab.com, a 10-language tech media with 2,000+ pages.