Claude Code × AWS CloudWatch Complete Guide | Log Analysis, Alarm Setup & Dashboard Automation

Boost AWS CloudWatch efficiency with Claude Code. Real-world code for log pattern analysis, automatic alarm configuration, metrics dashboards, and incident investigation.

“An error hit production! But there are too many logs — I don’t know where to look.” That’s a classic panic during incident response.

CloudWatch is AWS’s standard monitoring service, but logs can be so voluminous that critical information gets buried, and alarm configuration tends to be put off. I monitor an ECS + Lambda system at work, and having Claude Code read the logs to pinpoint the cause has cut our average incident-response time by 40%.

This article walks through practical steps for automating CloudWatch log analysis, alarm design, and dashboard creation with Claude Code.


CloudWatch Key Components

CloudWatch Logs      : Store and search logs from apps and AWS services
CloudWatch Metrics   : Numeric data such as CPU usage and request counts
CloudWatch Alarms    : Detect threshold breaches on metrics and notify SNS, etc.
CloudWatch Dashboards: Custom views that visualize metrics and logs
Logs Insights        : Purpose-built query language for searching and aggregating logs

Step 1: Delegate Log Pattern Analysis to Claude Code

During an incident, the first priority is understanding the pattern of error logs.

# Fetch error logs from the past hour (GNU date; on macOS use `date -v-1H +%s000`)
aws logs filter-log-events \
  --log-group-name "/ecs/myapp" \
  --start-time $(date -d "1 hour ago" +%s000) \
  --filter-pattern "ERROR" \
  --output json > error-logs.json

claude -p "
Analyze the following CloudWatch error logs and:

1. Classify errors by type (5xx, 4xx, DB connection errors, timeouts, etc.)
2. Identify the most frequent error
3. Pinpoint the time when errors spiked
4. Propose hypotheses for the root cause
5. Suggest next investigation actions

$(head -n 500 error-logs.json)
"
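For a large dump, a quick local pre-aggregation keeps the prompt small and focused. A minimal sketch, assuming the JSON event shape produced by filter-log-events (the error-type patterns are illustrative, not exhaustive):

```typescript
// Hypothetical pre-aggregation: bucket log events by coarse error type so
// only a summary (plus a few samples per bucket) needs to go into the prompt.
type LogEvent = { message: string };

function classifyErrors(events: LogEvent[]): Map<string, number> {
  const buckets = new Map<string, number>();
  // First matching rule wins; anything unmatched lands in "other".
  const rules: [string, RegExp][] = [
    ["timeout", /timed? ?out/i],
    ["db_connection", /(connection refused|ECONNREFUSED|too many connections)/i],
    ["http_5xx", /\b5\d{2}\b/],
    ["http_4xx", /\b4\d{2}\b/],
  ];
  for (const e of events) {
    const rule = rules.find(([, re]) => re.test(e.message));
    const key = rule ? rule[0] : "other";
    buckets.set(key, (buckets.get(key) ?? 0) + 1);
  }
  return buckets;
}
```

Feeding Claude the bucket counts plus two or three raw samples per bucket usually gets the same diagnosis from a fraction of the tokens.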

Auto-generate Log Insights Queries

claude -p "
Generate CloudWatch Log Insights queries for the following purposes:

1. Error rate by endpoint over the past hour (top 10)
2. Details of requests with latency above 500 ms
3. All operation logs for a specific user (user_id: 12345)
4. First-occurrence errors within 30 minutes of a deployment

Log format: JSON (timestamp, level, message, user_id, endpoint, duration_ms, status_code)
"

Example of generated Log Insights queries:

# Errors by endpoint (top 10)
fields @timestamp, endpoint, status_code
| filter status_code >= 400
| stats count() as error_count by endpoint
| sort error_count desc
| limit 10

# Requests with latency over 500 ms
fields @timestamp, endpoint, duration_ms, user_id
| filter duration_ms > 500
| sort duration_ms desc
| limit 50

# Operation logs for a specific user
fields @timestamp, level, message, endpoint
| filter user_id = "12345"
| sort @timestamp desc
| limit 100
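If queries like the user-log one above are run repeatedly with different values, it is safer to build the query string programmatically and validate anything that gets interpolated. A hypothetical helper (the function name and the allowed id charset are assumptions):

```typescript
// Build the "logs for a specific user" Logs Insights query, rejecting ids
// that could alter the query when interpolated into the string.
function userLogsQuery(userId: string, limit = 100): string {
  if (!/^[A-Za-z0-9_-]+$/.test(userId)) {
    throw new Error(`unsafe user_id: ${userId}`);
  }
  return [
    "fields @timestamp, level, message, endpoint",
    `| filter user_id = "${userId}"`,
    "| sort @timestamp desc",
    `| limit ${limit}`,
  ].join("\n");
}
```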

Step 2: Auto-generate Alarm Configuration

claude -p "
Design all required CloudWatch alarms for the system below.
Implement in CDK TypeScript.

[System Architecture]
- ECS Fargate (API server, 2–10 instances)
- RDS PostgreSQL
- ALB (Application Load Balancer)
- Lambda (batch processing)

[Alarm Requirements]
- Production: alarms that fire within 5 minutes
- Notification targets: SNS → Slack and PagerDuty
- Two-tier alarms (Warning / Critical)
- Outside business hours: Critical only
"
// lib/monitoring-stack.ts
import * as cdk from "aws-cdk-lib";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as actions from "aws-cdk-lib/aws-cloudwatch-actions";
import * as sns from "aws-cdk-lib/aws-sns";
import { Construct } from "constructs";

export class MonitoringStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const alertTopic = sns.Topic.fromTopicArn(
      this, "AlertTopic",
      `arn:aws:sns:${this.region}:${this.account}:prod-alerts`
    );
    const warnTopic = sns.Topic.fromTopicArn(
      this, "WarnTopic",
      `arn:aws:sns:${this.region}:${this.account}:prod-warnings`
    );

    // ALB 5xx error rate alarm
    const alb5xxAlarm = new cloudwatch.Alarm(this, "Alb5xxAlarm", {
      alarmName: "prod-alb-5xx-critical",
      alarmDescription: "ALB 5xx error rate exceeded 5%",
      metric: new cloudwatch.Metric({
        namespace: "AWS/ApplicationELB",
        metricName: "HTTPCode_Target_5XX_Count",
        dimensionsMap: { LoadBalancer: "app/myapp/xxx" },
        statistic: "Sum",
        period: cdk.Duration.minutes(5),
      }),
      threshold: 10,
      evaluationPeriods: 2,
      comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
      treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
    });
    alb5xxAlarm.addAlarmAction(new actions.SnsAction(alertTopic));

    // ECS CPU utilization alarm (Warning/Critical)
    const ecsCpuWarning = new cloudwatch.Alarm(this, "EcsCpuWarning", {
      alarmName: "prod-ecs-cpu-warning",
      metric: new cloudwatch.Metric({
        namespace: "AWS/ECS",
        metricName: "CPUUtilization",
        dimensionsMap: { ClusterName: "myapp-cluster", ServiceName: "myapp-service" },
        statistic: "Average",
        period: cdk.Duration.minutes(5),
      }),
      threshold: 70,
      evaluationPeriods: 3,
    });
    ecsCpuWarning.addAlarmAction(new actions.SnsAction(warnTopic));

    const ecsCpuCritical = new cloudwatch.Alarm(this, "EcsCpuCritical", {
      alarmName: "prod-ecs-cpu-critical",
      metric: new cloudwatch.Metric({
        namespace: "AWS/ECS",
        metricName: "CPUUtilization",
        dimensionsMap: { ClusterName: "myapp-cluster", ServiceName: "myapp-service" },
        statistic: "Average",
        period: cdk.Duration.minutes(5),
      }),
      threshold: 90,
      evaluationPeriods: 2,
    });
    ecsCpuCritical.addAlarmAction(new actions.SnsAction(alertTopic));

    // RDS connection count alarm
    const rdsConnectionAlarm = new cloudwatch.Alarm(this, "RdsConnectionAlarm", {
      alarmName: "prod-rds-connections-critical",
      metric: new cloudwatch.Metric({
        namespace: "AWS/RDS",
        metricName: "DatabaseConnections",
        dimensionsMap: { DBInstanceIdentifier: "myapp-db" },
        statistic: "Maximum",
        period: cdk.Duration.minutes(5),
      }),
      threshold: 80,  // approaching db.t3.micro's max_connections limit
      evaluationPeriods: 2,
    });
    rdsConnectionAlarm.addAlarmAction(new actions.SnsAction(alertTopic));

    // Lambda error rate alarm
    const lambdaErrorAlarm = new cloudwatch.Alarm(this, "LambdaErrorAlarm", {
      alarmName: "prod-lambda-errors-critical",
      metric: new cloudwatch.Metric({
        namespace: "AWS/Lambda",
        metricName: "Errors",
        dimensionsMap: { FunctionName: "myapp-batch" },
        statistic: "Sum",
        period: cdk.Duration.minutes(15),
      }),
      threshold: 5,
      evaluationPeriods: 1,
    });
    lambdaErrorAlarm.addAlarmAction(new actions.SnsAction(alertTopic));
  }
}

Step 3: Auto-generate a Custom Dashboard

claude -p "
Generate a CloudWatch dashboard in CDK that displays the following information.

[Dashboard Layout]
Row 1: Overall system health (ALB request count, 5xx rate, latency P50/P95/P99)
Row 2: ECS service (CPU, memory, running task count)
Row 3: RDS (connections, latency, CPU utilization)
Row 4: Lambda (invocations, errors, duration)
Row 5: Business metrics (new registrations, payment success rate) ← custom metrics
"
// Dashboard definition (excerpt)
const dashboard = new cloudwatch.Dashboard(this, "AppDashboard", {
  dashboardName: "myapp-production",
});

dashboard.addWidgets(
  new cloudwatch.Row(
    new cloudwatch.GraphWidget({
      title: "ALB Request Count",
      left: [new cloudwatch.Metric({
        namespace: "AWS/ApplicationELB",
        metricName: "RequestCount",
        statistic: "Sum",
        period: cdk.Duration.minutes(1),
      })],
      width: 8,
    }),
    new cloudwatch.GraphWidget({
      title: "ALB 5xx Error Rate (%)",
      left: [new cloudwatch.MathExpression({
        // Metric math ids must start with a lowercase letter, so ids like
        // "5xx" are invalid; hence the "m" prefix.
        expression: "m5xx / (m2xx + m3xx + m4xx + m5xx) * 100",
        usingMetrics: {
          m5xx: new cloudwatch.Metric({ metricName: "HTTPCode_Target_5XX_Count", namespace: "AWS/ApplicationELB", statistic: "Sum" }),
          m2xx: new cloudwatch.Metric({ metricName: "HTTPCode_Target_2XX_Count", namespace: "AWS/ApplicationELB", statistic: "Sum" }),
          m3xx: new cloudwatch.Metric({ metricName: "HTTPCode_Target_3XX_Count", namespace: "AWS/ApplicationELB", statistic: "Sum" }),
          m4xx: new cloudwatch.Metric({ metricName: "HTTPCode_Target_4XX_Count", namespace: "AWS/ApplicationELB", statistic: "Sum" }),
        },
        period: cdk.Duration.minutes(1),
      })],
      width: 8,
    }),
  )
);
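One detail worth knowing here: CloudWatch metric math requires each variable id to start with a lowercase letter (followed by letters, digits, or underscores), so ids such as `5xx` are rejected at deploy time. A tiny sketch of that rule as a checker (the function name is an assumption, the regex reflects the documented id constraint):

```typescript
// Validate a metric math variable id: first character must be a lowercase
// letter, the rest letters, digits, or underscores.
function isValidMetricMathId(id: string): boolean {
  return /^[a-z][a-zA-Z0-9_]*$/.test(id);
}
```

Running the keys of `usingMetrics` through a check like this in a unit test catches the problem before `cdk deploy` does.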

Step 4: Delegate Incident Investigation to Claude Code

claude -p "
I want to investigate a production incident. Run the following commands and analyze the results:

1. aws logs filter-log-events --log-group-name '/ecs/myapp' \
   --start-time \$(date -d '2 hours ago' +%s000) \
   --filter-pattern 'ERROR' --limit 100

2. aws cloudwatch get-metric-statistics \
   --namespace AWS/ApplicationELB \
   --metric-name HTTPCode_Target_5XX_Count \
   --start-time \$(date -d '2 hours ago' -u +%Y-%m-%dT%H:%M:%SZ) \
   --end-time \$(date -u +%Y-%m-%dT%H:%M:%SZ) \
   --period 300 --statistics Sum

Based on the above results, summarize:
- Incident start time
- Estimated number of affected users
- Top 3 root cause hypotheses
- Immediate response actions
"
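The metric output from command 2 can also be pre-scanned locally to anchor the incident start time before asking for hypotheses. A sketch over `get-metric-statistics` datapoints (the `Timestamp`/`Sum` field names follow the CLI's JSON output; the threshold is an assumption you would tune per system):

```typescript
// Earliest datapoint whose Sum exceeds the threshold is a reasonable first
// estimate of when the incident began. The CLI returns datapoints unsorted,
// so sort by timestamp before scanning.
type Datapoint = { Timestamp: string; Sum: number };

function findIncidentStart(points: Datapoint[], threshold: number): string | null {
  const sorted = [...points].sort(
    (a, b) => Date.parse(a.Timestamp) - Date.parse(b.Timestamp)
  );
  const first = sorted.find((p) => p.Sum > threshold);
  return first ? first.Timestamp : null;
}
```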

Step 5: Auto-design Custom Metrics

claude -p "
Generate Node.js (AWS SDK v3) code to measure the following e-commerce business KPIs
as CloudWatch custom metrics.

Metrics to measure:
- Payment success count and failure count (every 1 minute)
- Cart abandonment rate (every 5 minutes)
- New member registrations (every 1 hour)

Namespace: MyApp/Business
Tag each metric with an environment label (Production/Staging)
"
// src/monitoring/business-metrics.ts
import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: process.env.AWS_REGION });
const NAMESPACE = "MyApp/Business";
const ENV = process.env.NODE_ENV ?? "development";

export async function recordPaymentSuccess() {
  await cw.send(new PutMetricDataCommand({
    Namespace: NAMESPACE,
    MetricData: [{
      MetricName: "PaymentSuccess",
      Value: 1,
      Unit: "Count",
      Dimensions: [{ Name: "Environment", Value: ENV }],
    }],
  }));
}

export async function recordPaymentFailure(reason: string) {
  await cw.send(new PutMetricDataCommand({
    Namespace: NAMESPACE,
    MetricData: [{
      MetricName: "PaymentFailure",
      Value: 1,
      Unit: "Count",
      Dimensions: [
        { Name: "Environment", Value: ENV },
        { Name: "Reason", Value: reason },
      ],
    }],
  }));
}
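Calling PutMetricData once per event gets expensive at checkout volume: each request is billed, and a single request can carry many datapoints. A hypothetical batching wrapper (the class name is an assumption, the batch size of 20 is a conservative choice rather than an API limit, and `send` is injected so the buffering logic stays testable without AWS):

```typescript
// Queue datapoints and flush them in one call once the batch is full.
type Datum = { MetricName: string; Value: number; Unit: string };

class MetricBuffer {
  private queue: Datum[] = [];
  constructor(
    private send: (batch: Datum[]) => Promise<void>,
    private batchSize = 20,
  ) {}

  async record(datum: Datum): Promise<void> {
    this.queue.push(datum);
    if (this.queue.length >= this.batchSize) await this.flush();
  }

  async flush(): Promise<void> {
    if (this.queue.length === 0) return;
    // Drain the queue before sending so concurrent records start a new batch.
    const batch = this.queue.splice(0, this.queue.length);
    await this.send(batch);
  }
}
```

In production, `send` would wrap the `PutMetricDataCommand` call from the snippet above, and `flush()` would also run on a timer or at shutdown so trailing datapoints are not lost.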

4 Common Pitfalls

1. evaluationPeriods is too short

// ❌ Alarm fires on momentary spikes
evaluationPeriods: 1,
threshold: 10,

// ✅ Alarm only fires after 3 consecutive breaches (reduces false positives)
evaluationPeriods: 3,
threshold: 10,
datapointsToAlarm: 2,  // Alarm when 2 out of 3 periods breach threshold
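The M-out-of-N behaviour can be sanity-checked offline before deploying. A minimal simulation of the evaluation rule, assuming the GREATER_THAN_THRESHOLD operator used in this article (the function name is an assumption):

```typescript
// Simulate CloudWatch's M-out-of-N rule: ALARM when at least `m` of the most
// recent `n` datapoints exceed the threshold.
function wouldAlarm(
  datapoints: number[],
  threshold: number,
  n: number,   // evaluationPeriods
  m: number,   // datapointsToAlarm
): boolean {
  const window = datapoints.slice(-n);
  const breaches = window.filter((v) => v > threshold).length;
  return breaches >= m;
}
```

With `n = 1, m = 1` a single spike of 12 against a threshold of 10 alarms immediately, while `n = 3, m = 2` shrugs off the same one-off spike, which is exactly the false-positive reduction the ✅ configuration buys.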

2. Ignoring Logs Insights costs

Logs Insights charges per GB of data scanned. Running queries without limiting the time range can lead to unexpected bills. Always pass an explicit --start-time and --end-time, and keep the set of queried log groups narrow.

3. High-resolution custom metrics are expensive

Custom metrics are billed per metric per month regardless of resolution, but pushing 1-second high-resolution data means up to 60x more PutMetricData API calls than 1-minute data, and those calls are billed too. Standard 1-minute resolution is usually sufficient for business metrics.

4. Not setting a log retention period

Log groups default to "Never expire" (including the ones Lambda creates automatically), so storage costs grow indefinitely. Always set a retention period on log groups.

import * as logs from "aws-cdk-lib/aws-logs";

new logs.LogGroup(this, "AppLogGroup", {
  logGroupName: "/ecs/myapp",
  retention: logs.RetentionDays.ONE_MONTH,  // Auto-delete after 30 days
});

Summary

Task                  : Claude Code Contribution
Log analysis          : Reads error logs and proposes root-cause hypotheses with remediation steps
Log Insights queries  : Generates queries from a description of your analysis goal
Alarm configuration   : Generates CDK code in bulk from a description of your system
Dashboard             : Generates widget definitions from a description of what you want to display
Incident investigation: Runs AWS CLI commands and analyzes the results

“We’ll set up monitoring later” — and then an incident hits and you have no visibility. With Claude Code, you can have production-grade alarms and dashboards ready in 30 minutes.

#claude-code #aws #cloudwatch #monitoring #observability #devops

