Claude Code × AWS CloudWatch Complete Guide | Log Analysis, Alarm Setup & Dashboard Automation
Boost AWS CloudWatch efficiency with Claude Code. Real-world code for log pattern analysis, automatic alarm configuration, metrics dashboards, and incident investigation.
“An error hit production! But there are too many logs — I don’t know where to look.” That’s a classic panic during incident response.
CloudWatch is AWS’s standard monitoring service, but logs can be so voluminous that critical information gets buried, and alarm configuration tends to be put off. I monitor an ECS + Lambda system at work, and having Claude Code read the logs to pinpoint the cause has cut our average incident-response time by 40%.
This article walks through practical steps for automating CloudWatch log analysis, alarm design, and dashboard creation with Claude Code.
CloudWatch Key Components
- CloudWatch Logs: Store and search logs from apps and AWS services
- CloudWatch Metrics: Numeric data such as CPU usage and request counts
- CloudWatch Alarms: Detect threshold breaches on metrics and notify SNS, etc.
- CloudWatch Dashboards: Custom views that visualize metrics and logs
- Log Insights: SQL-like query engine for analyzing logs
Step 1: Delegate Log Pattern Analysis to Claude Code
During an incident, the first priority is understanding the pattern of error logs.
# Fetch error logs from the past hour (GNU date syntax; on macOS/BSD use: date -v-1H +%s000)
aws logs filter-log-events \
--log-group-name "/ecs/myapp" \
--start-time $(date -d "1 hour ago" +%s000) \
--filter-pattern "ERROR" \
--output json > error-logs.json
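Before the file goes into a prompt, it can help to pre-aggregate it so that a fixed line cut does not drop the most useful events. The following is a minimal sketch, assuming the `events` array of the `filter-log-events` output has been parsed into objects with `timestamp` (epoch milliseconds) and `message`. The category names and regular expressions are hypothetical and should be adapted to your own log format.

```typescript
// One entry of the "events" array returned by `aws logs filter-log-events`.
interface LogEvent {
  timestamp: number; // epoch milliseconds
  message: string;
}

// Hypothetical category rules; the first matching rule wins.
const CATEGORY_RULES: Array<{ name: string; pattern: RegExp }> = [
  { name: "timeout", pattern: /timeout|timed out/i },
  { name: "db_connection", pattern: /ECONNREFUSED|connection (refused|reset)|too many connections/i },
  { name: "http_5xx", pattern: /\b5\d{2}\b/ },
  { name: "http_4xx", pattern: /\b4\d{2}\b/ },
];

function categorize(message: string): string {
  for (const rule of CATEGORY_RULES) {
    if (rule.pattern.test(message)) return rule.name;
  }
  return "other";
}

interface LogSummary {
  total: number;
  byCategory: Record<string, number>;
  peakMinute: string | null; // busiest minute in UTC, e.g. "2024-01-01T12:01"
  peakCount: number;
}

function summarizeErrors(events: LogEvent[]): LogSummary {
  const byCategory: Record<string, number> = {};
  const byMinute: Record<string, number> = {};
  for (const event of events) {
    const category = categorize(event.message);
    byCategory[category] = (byCategory[category] || 0) + 1;
    // Truncate the ISO timestamp to minute precision.
    const minute = new Date(event.timestamp).toISOString().slice(0, 16);
    byMinute[minute] = (byMinute[minute] || 0) + 1;
  }
  let peakMinute: string | null = null;
  let peakCount = 0;
  for (const minute of Object.keys(byMinute)) {
    const count = byMinute[minute] || 0;
    if (count > peakCount) {
      peakMinute = minute;
      peakCount = count;
    }
  }
  return { total: events.length, byCategory, peakMinute, peakCount };
}
```

The summary object is small enough to paste into the prompt alongside a sample of raw lines, which gives Claude Code both the overall shape and concrete examples.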
claude -p "
Analyze the following CloudWatch error logs and:
1. Classify errors by type (5xx, 4xx, DB connection errors, timeouts, etc.)
2. Identify the most frequent error
3. Pinpoint the time when errors spiked
4. Propose hypotheses for the root cause
5. Suggest next investigation actions
$(head -n 500 error-logs.json)
"
Auto-generate Log Insights Queries
claude -p "
Generate CloudWatch Log Insights queries for the following purposes:
1. Error rate by endpoint over the past hour (top 10)
2. Details of requests with latency above 500 ms
3. All operation logs for a specific user (user_id: 12345)
4. First-occurrence errors within 30 minutes of a deployment
Log format: JSON (timestamp, level, message, user_id, endpoint, duration_ms, status_code)
"
Example of generated Log Insights queries:
# Error count by endpoint (top 10)
fields @timestamp, endpoint, status_code
| filter status_code >= 400
| stats count() as error_count by endpoint
| sort error_count desc
| limit 10
# Requests with latency over 500 ms
fields @timestamp, endpoint, duration_ms, user_id
| filter duration_ms > 500
| sort duration_ms desc
| limit 50
# Operation logs for a specific user
fields @timestamp, level, message, endpoint
| filter user_id = "12345"
| sort @timestamp desc
| limit 100
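Note that the first query counts errors per endpoint; a true error rate also needs the total number of requests in the denominator. As a rough illustration of that arithmetic, here is a sketch that computes the rate locally from parsed log records. The `RequestLog` shape follows the log format given in the prompt; the function name is hypothetical.

```typescript
interface RequestLog {
  endpoint: string;
  status_code: number;
}

interface EndpointErrorRate {
  endpoint: string;
  total: number;
  errors: number;
  errorRatePct: number;
}

// Error rate = requests with status >= 400 divided by all requests, per endpoint.
function errorRateByEndpoint(logs: RequestLog[], topN: number = 10): EndpointErrorRate[] {
  const totals: Record<string, number> = {};
  const errors: Record<string, number> = {};
  for (const log of logs) {
    totals[log.endpoint] = (totals[log.endpoint] || 0) + 1;
    if (log.status_code >= 400) {
      errors[log.endpoint] = (errors[log.endpoint] || 0) + 1;
    }
  }
  return Object.keys(totals)
    .map((endpoint) => {
      const total = totals[endpoint] || 0;
      const errorCount = errors[endpoint] || 0;
      return { endpoint, total, errors: errorCount, errorRatePct: (errorCount / total) * 100 };
    })
    .sort((a, b) => b.errorRatePct - a.errorRatePct)
    .slice(0, topN);
}
```

An endpoint with 2 errors out of 4 requests ranks above one with 10 errors out of 1,000, which is the distinction a plain count hides.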
Step 2: Auto-generate Alarm Configuration
claude -p "
Design all required CloudWatch alarms for the system below.
Implement in CDK TypeScript.
[System Architecture]
- ECS Fargate (API server, 2–10 instances)
- RDS PostgreSQL
- ALB (Application Load Balancer)
- Lambda (batch processing)
[Alarm Requirements]
- Production: alarms that fire within 5 minutes
- Notification targets: SNS → Slack and PagerDuty
- Two-tier alarms (Warning / Critical)
- Outside business hours: Critical only
"
// lib/monitoring-stack.ts
import * as cdk from "aws-cdk-lib";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as actions from "aws-cdk-lib/aws-cloudwatch-actions";
import * as sns from "aws-cdk-lib/aws-sns";
export class MonitoringStack extends cdk.Stack {
constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
super(scope, id, props);
const alertTopic = sns.Topic.fromTopicArn(
this, "AlertTopic",
`arn:aws:sns:${this.region}:${this.account}:prod-alerts`
);
const warnTopic = sns.Topic.fromTopicArn(
this, "WarnTopic",
`arn:aws:sns:${this.region}:${this.account}:prod-warnings`
);
// ALB target 5xx count alarm
const alb5xxAlarm = new cloudwatch.Alarm(this, "Alb5xxAlarm", {
alarmName: "prod-alb-5xx-critical",
alarmDescription: "ALB target 5xx count exceeded 10 per 5-minute period",
metric: new cloudwatch.Metric({
namespace: "AWS/ApplicationELB",
metricName: "HTTPCode_Target_5XX_Count",
dimensionsMap: { LoadBalancer: "app/myapp/xxx" },
statistic: "Sum",
period: cdk.Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 2,
comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});
alb5xxAlarm.addAlarmAction(new actions.SnsAction(alertTopic));
// ECS CPU utilization alarm (Warning/Critical)
const ecsCpuWarning = new cloudwatch.Alarm(this, "EcsCpuWarning", {
alarmName: "prod-ecs-cpu-warning",
metric: new cloudwatch.Metric({
namespace: "AWS/ECS",
metricName: "CPUUtilization",
dimensionsMap: { ClusterName: "myapp-cluster", ServiceName: "myapp-service" },
statistic: "Average",
period: cdk.Duration.minutes(5),
}),
threshold: 70,
evaluationPeriods: 3,
});
ecsCpuWarning.addAlarmAction(new actions.SnsAction(warnTopic));
const ecsCpuCritical = new cloudwatch.Alarm(this, "EcsCpuCritical", {
alarmName: "prod-ecs-cpu-critical",
metric: new cloudwatch.Metric({
namespace: "AWS/ECS",
metricName: "CPUUtilization",
dimensionsMap: { ClusterName: "myapp-cluster", ServiceName: "myapp-service" },
statistic: "Average",
period: cdk.Duration.minutes(5),
}),
threshold: 90,
evaluationPeriods: 2,
});
ecsCpuCritical.addAlarmAction(new actions.SnsAction(alertTopic));
// RDS connection count alarm
const rdsConnectionAlarm = new cloudwatch.Alarm(this, "RdsConnectionAlarm", {
alarmName: "prod-rds-connections-critical",
metric: new cloudwatch.Metric({
namespace: "AWS/RDS",
metricName: "DatabaseConnections",
dimensionsMap: { DBInstanceIdentifier: "myapp-db" },
statistic: "Maximum",
period: cdk.Duration.minutes(5),
}),
threshold: 80, // Absolute connection count, not a percentage; tune to about 80% of your instance's max_connections
evaluationPeriods: 2,
});
rdsConnectionAlarm.addAlarmAction(new actions.SnsAction(alertTopic));
// Lambda error count alarm
const lambdaErrorAlarm = new cloudwatch.Alarm(this, "LambdaErrorAlarm", {
alarmName: "prod-lambda-errors-critical",
metric: new cloudwatch.Metric({
namespace: "AWS/Lambda",
metricName: "Errors",
dimensionsMap: { FunctionName: "myapp-batch" },
statistic: "Sum",
period: cdk.Duration.minutes(15),
}),
threshold: 5,
evaluationPeriods: 1,
});
lambdaErrorAlarm.addAlarmAction(new actions.SnsAction(alertTopic));
}
}
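The prompt asks for alarms that fire within 5 minutes, so it is worth checking the generated settings against that requirement. A sustained breach needs roughly period × required datapoints before the alarm changes state. The sketch below applies that rough estimate to the periods and evaluation counts from the stack above; it ignores metric delivery delay, and the helper names are hypothetical.

```typescript
interface AlarmTiming {
  name: string;
  periodMinutes: number;
  evaluationPeriods: number;
  datapointsToAlarm?: number; // defaults to evaluationPeriods when omitted
}

// Rough time a sustained breach needs before the alarm fires.
function approxMinutesToAlarm(alarm: AlarmTiming): number {
  const needed = alarm.datapointsToAlarm === undefined ? alarm.evaluationPeriods : alarm.datapointsToAlarm;
  return alarm.periodMinutes * needed;
}

function alarmsSlowerThan(alarms: AlarmTiming[], targetMinutes: number): AlarmTiming[] {
  return alarms.filter((alarm) => approxMinutesToAlarm(alarm) > targetMinutes);
}

// Values copied from the monitoring stack above.
const STACK_ALARMS: AlarmTiming[] = [
  { name: "prod-alb-5xx-critical", periodMinutes: 5, evaluationPeriods: 2 },
  { name: "prod-ecs-cpu-warning", periodMinutes: 5, evaluationPeriods: 3 },
  { name: "prod-ecs-cpu-critical", periodMinutes: 5, evaluationPeriods: 2 },
  { name: "prod-rds-connections-critical", periodMinutes: 5, evaluationPeriods: 2 },
  { name: "prod-lambda-errors-critical", periodMinutes: 15, evaluationPeriods: 1 },
];

for (const alarm of alarmsSlowerThan(STACK_ALARMS, 5)) {
  console.log(alarm.name + ": about " + approxMinutesToAlarm(alarm) + " min");
}
```

By this estimate every alarm in the stack takes 10 to 15 minutes, so a 1-minute period is worth considering for the Critical tier if the 5-minute target is firm. This is the kind of mismatch to look for when reviewing generated code.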
Step 3: Auto-generate a Custom Dashboard
claude -p "
Generate a CloudWatch dashboard in CDK that displays the following information.
[Dashboard Layout]
Row 1: Overall system health (ALB request count, 5xx rate, latency P50/P95/P99)
Row 2: ECS service (CPU, memory, running task count)
Row 3: RDS (connections, latency, CPU utilization)
Row 4: Lambda (invocations, errors, duration)
Row 5: Business metrics (new registrations, payment success rate) ← custom metrics
"
// Dashboard definition (excerpt)
// ALB metrics are published per load balancer, so every metric needs this dimension
const albDimensions = { LoadBalancer: "app/myapp/xxx" };

// Builds one of the ALB target status-code counters used in the error-rate expression
const albCount = (metricName: string) =>
  new cloudwatch.Metric({
    namespace: "AWS/ApplicationELB",
    metricName,
    dimensionsMap: albDimensions,
    statistic: "Sum",
  });

const dashboard = new cloudwatch.Dashboard(this, "AppDashboard", {
  dashboardName: "myapp-production",
});
dashboard.addWidgets(
  new cloudwatch.Row(
    new cloudwatch.GraphWidget({
      title: "ALB Request Count",
      left: [new cloudwatch.Metric({
        namespace: "AWS/ApplicationELB",
        metricName: "RequestCount",
        dimensionsMap: albDimensions,
        statistic: "Sum",
        period: cdk.Duration.minutes(1),
      })],
      width: 8,
    }),
    new cloudwatch.GraphWidget({
      title: "ALB 5xx Error Rate (%)",
      left: [new cloudwatch.MathExpression({
        // Metric math ids must start with a lowercase letter, hence the "m" prefix
        expression: "m5xx / (m2xx + m3xx + m4xx + m5xx) * 100",
        usingMetrics: {
          m5xx: albCount("HTTPCode_Target_5XX_Count"),
          m2xx: albCount("HTTPCode_Target_2XX_Count"),
          m3xx: albCount("HTTPCode_Target_3XX_Count"),
          m4xx: albCount("HTTPCode_Target_4XX_Count"),
        },
        period: cdk.Duration.minutes(1),
      })],
      width: 8,
    }),
  )
);
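One detail worth checking in generated dashboard code is the naming of metric math identifiers: they must begin with a lowercase letter and contain only letters, digits, and underscores, so an identifier such as `5xx` would be rejected. A small sketch of a validator and a normalizer follows; both function names are hypothetical helpers, not part of CDK.

```typescript
// Metric math ids: first character a lowercase letter, then letters, digits, or underscores.
const METRIC_ID_PATTERN = /^[a-z][a-zA-Z0-9_]*$/;

function isValidMetricId(id: string): boolean {
  return METRIC_ID_PATTERN.test(id);
}

// Turns a human-friendly label into a valid id by replacing unsupported
// characters and prefixing "m" when the first character is not a lowercase letter.
function toMetricId(label: string): string {
  const cleaned = label.replace(/[^a-zA-Z0-9_]/g, "_");
  return /^[a-z]/.test(cleaned) ? cleaned : "m" + cleaned;
}

console.log(isValidMetricId("5xx"));
console.log(toMetricId("5xx"));
```

Running labels through a normalizer like this before building the `usingMetrics` map avoids a synth-time failure.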
Step 4: Delegate Incident Investigation to Claude Code
claude -p "
I want to investigate a production incident. Run the following commands and analyze the results:
1. aws logs filter-log-events --log-group-name '/ecs/myapp' \
--start-time \$(date -d '2 hours ago' +%s000) \
--filter-pattern 'ERROR' --limit 100
2. aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--start-time \$(date -d '2 hours ago' -u +%Y-%m-%dT%H:%M:%SZ) \
--end-time \$(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 --statistics Sum
Based on the above results, summarize:
- Incident start time
- Estimated number of affected users
- Top 3 root cause hypotheses
- Immediate response actions
"
Step 5: Auto-design Custom Metrics
claude -p "
Generate Node.js (AWS SDK v3) code to measure the following e-commerce business KPIs
as CloudWatch custom metrics.
Metrics to measure:
- Payment success count and failure count (every 1 minute)
- Cart abandonment rate (every 5 minutes)
- New member registrations (every 1 hour)
Namespace: MyApp/Business
Tag each metric with an environment label (Production/Staging)
"
// src/monitoring/business-metrics.ts
import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";
const cw = new CloudWatchClient({ region: process.env.AWS_REGION });
const NAMESPACE = "MyApp/Business";
const ENV = process.env.NODE_ENV ?? "development";
export async function recordPaymentSuccess() {
await cw.send(new PutMetricDataCommand({
Namespace: NAMESPACE,
MetricData: [{
MetricName: "PaymentSuccess",
Value: 1,
Unit: "Count",
Dimensions: [{ Name: "Environment", Value: ENV }],
}],
}));
}
export async function recordPaymentFailure(reason: string) {
await cw.send(new PutMetricDataCommand({
Namespace: NAMESPACE,
MetricData: [{
MetricName: "PaymentFailure",
Value: 1,
Unit: "Count",
Dimensions: [
{ Name: "Environment", Value: ENV },
{ Name: "Reason", Value: reason }, // Keep to a small fixed set of values; each distinct value creates a separate metric
],
}],
}));
}
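The functions above send one API request per payment. Since the prompt asks for counts every minute, an alternative is to count in memory and publish one aggregated value per metric per interval. The following is a minimal sketch of such a buffer; `CounterBuffer` and `MetricDatum` are hypothetical names, with `MetricDatum` mirroring the fields used in the `MetricData` entries above.

```typescript
interface MetricDatum {
  MetricName: string;
  Value: number;
  Unit: string;
  Timestamp: Date;
  Dimensions: Array<{ Name: string; Value: string }>;
}

class CounterBuffer {
  private counts: Record<string, number> = {};
  private environment: string;

  constructor(environment: string) {
    this.environment = environment;
  }

  // Record an event in memory; no network call happens here.
  increment(metricName: string, by: number = 1): void {
    this.counts[metricName] = (this.counts[metricName] || 0) + by;
  }

  // Return one datum per metric seen since the last drain, then reset the buffer.
  drain(now: Date = new Date()): MetricDatum[] {
    const data: MetricDatum[] = Object.keys(this.counts).map((name) => ({
      MetricName: name,
      Value: this.counts[name] || 0,
      Unit: "Count",
      Timestamp: now,
      Dimensions: [{ Name: "Environment", Value: this.environment }],
    }));
    this.counts = {};
    return data;
  }
}
```

In use, a timer would call `drain()` once a minute and, when the result is non-empty, pass it as `MetricData` to a single `PutMetricDataCommand`. The trade-off is that counts buffered in a process that crashes before the next flush are lost.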
4 Common Pitfalls
1. evaluationPeriods is too short
// ❌ Alarm fires on momentary spikes
evaluationPeriods: 1,
threshold: 10,
// ✅ Alarm fires only when 2 of the last 3 periods breach (reduces false positives)
evaluationPeriods: 3,
threshold: 10,
datapointsToAlarm: 2, // M-out-of-N: 2 breaching datapoints within 3 periods
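To see how the two settings interact, the evaluation can be simulated on a list of datapoints. This is a simplified sketch of the M-out-of-N rule for a greater-than-threshold alarm; it ignores missing-data handling, and `isInAlarm` is a hypothetical helper rather than an AWS API.

```typescript
// Returns true when at least `datapointsToAlarm` of the most recent
// `evaluationPeriods` datapoints exceed the threshold.
function isInAlarm(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
  datapointsToAlarm: number = evaluationPeriods
): boolean {
  const recent = datapoints.slice(-evaluationPeriods);
  const breaching = recent.filter((value) => value > threshold).length;
  return breaching >= datapointsToAlarm;
}

// A single momentary spike at the end of the series.
const spike = [2, 3, 50];
console.log(isInAlarm(spike, 10, 1));    // true: evaluationPeriods 1 fires on the spike
console.log(isInAlarm(spike, 10, 3, 2)); // false: only 1 of the last 3 periods breaches

// A sustained problem with one quiet period in between.
const sustained = [50, 3, 60];
console.log(isInAlarm(sustained, 10, 3, 2)); // true: 2 of 3 breach
console.log(isInAlarm(sustained, 10, 3));    // false: 3 of 3 would be required
```

The last two lines show why 2-of-3 is often a better fit than 3 consecutive breaches: it tolerates one quiet period during a real incident while still ignoring an isolated spike.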
2. Ignoring Log Insights costs
Log Insights charges per GB of data scanned, so a query over an unbounded time range or over every log group can produce an unexpectedly large bill. Always bound the range with --start-time and --end-time, and query only the log groups you need.
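When scripting queries, one way to enforce this is to compute the window in code and cap it. Note that `aws logs start-query` takes `--start-time` and `--end-time` as epoch seconds, whereas `filter-log-events` uses milliseconds. The sketch below builds a capped window; the function name and the 24-hour default cap are assumptions for illustration.

```typescript
interface QueryWindow {
  startTime: number; // epoch seconds
  endTime: number;   // epoch seconds
}

// Builds a time window ending at `nowMs`, never longer than `maxLookbackMinutes`.
function boundedQueryWindow(
  nowMs: number,
  lookbackMinutes: number,
  maxLookbackMinutes: number = 24 * 60
): QueryWindow {
  if (lookbackMinutes <= 0) {
    throw new Error("lookbackMinutes must be positive");
  }
  const minutes = Math.min(lookbackMinutes, maxLookbackMinutes);
  const endTime = Math.floor(nowMs / 1000);
  return { startTime: endTime - minutes * 60, endTime };
}

const lastHour = boundedQueryWindow(Date.now(), 60);
console.log("--start-time " + lastHour.startTime + " --end-time " + lastHour.endTime);
```

A request for a week of data is silently reduced to the cap here; throwing an error instead may be preferable if a truncated result could mislead an investigation.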
3. High-resolution custom metrics are expensive
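Metrics that AWS services publish at standard resolution (60 seconds) come at no extra charge, but custom metrics are billed per metric, and every unique combination of metric name and dimension values counts as a separate metric. Publishing at high resolution (1 second) multiplies PutMetricData calls, and high-resolution alarms cost more than standard ones. 1-minute aggregation is usually sufficient for business metrics.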
4. Not setting a log retention period
Log groups, including the ones Lambda creates automatically, default to “Never expire,” so storage costs keep growing indefinitely. Always set a retention period on log groups.
import * as logs from "aws-cdk-lib/aws-logs";

new logs.LogGroup(this, "AppLogGroup", {
  logGroupName: "/ecs/myapp",
  retention: logs.RetentionDays.ONE_MONTH, // Auto-delete after 30 days
});
Summary
| Task | Claude Code Contribution |
|---|---|
| Log analysis | Reads error logs and proposes root-cause hypotheses with remediation steps |
| Log Insights queries | Generates queries just from a description of your analysis goal |
| Alarm configuration | Generates CDK code in bulk from a description of your system |
| Dashboard | Generates widget definitions from a description of what you want to display |
| Incident investigation | Runs AWS CLI commands and analyzes the results |
“We’ll set up monitoring later” — and then an incident hits and you have no visibility. With Claude Code, you can have production-grade alarms and dashboards ready in 30 minutes.
Related Articles
- Claude Code × AWS ECS/Fargate Complete Guide
- Claude Code × AWS CodePipeline/CodeBuild Complete Guide
- Claude Code × AWS IAM Complete Guide
About the Author
Masa
Engineer obsessed with Claude Code. Runs claudecode-lab.com, a 10-language tech media with 2,000+ pages.