
Is Your CloudWatch Bill Skyrocketing? A Practical Guide to Logs and Metrics Cost Optimization

CloudWatch can easily turn from a “default observability tool” into a billing black hole. Once containers, microservices, Lambda, API Gateway, and VPC Flow Logs are all onboarded, log ingestion, log retention, Custom Metrics, Alarms, and Dashboards grow in tandem. To reduce costs, the first step is not to delete monitoring, but to understand exactly where the money is going.

The prices below are typical public list prices for us-east-1, used here as an example; actual costs depend on your region and on current AWS pricing:

  • CloudWatch Logs standard ingestion: around $0.50/GB.
  • Log storage (archival): around $0.03/GB-month.
  • Logs Insights queries: around $0.005/GB scanned.
  • Custom Metrics: around $0.30/metric-month for the first 10,000 metrics.
  • Standard-resolution alarms: around $0.10/alarm metric-month.
  • Custom dashboards: around $3/dashboard-month.
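
To make these unit prices concrete, here is a minimal Python sketch that multiplies one month of usage by the approximate rates quoted in this article. The names (RATES, MonthlyUsage, estimate_monthly_cost) and the sample workload are illustrative assumptions, and the sketch ignores free-tier allowances and volume discounts.

```python
from dataclasses import dataclass

# Approximate us-east-1 list prices quoted in this article (USD).
# Assumption: check the current AWS pricing page for your region.
RATES = {
    "ingest_gb": 0.50,
    "storage_gb_month": 0.03,
    "insights_gb_scanned": 0.005,
    "custom_metric": 0.30,  # first-10,000 tier only
    "alarm": 0.10,
    "dashboard": 3.00,
}


@dataclass
class MonthlyUsage:
    ingest_gb: float = 0.0
    storage_gb_month: float = 0.0
    insights_gb_scanned: float = 0.0
    custom_metric: int = 0
    alarm: int = 0
    dashboard: int = 0


def estimate_monthly_cost(usage):
    """Return a {line_item: USD} breakdown, most expensive item first."""
    costs = {item: round(getattr(usage, item) * rate, 2) for item, rate in RATES.items()}
    return dict(sorted(costs.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical workload, not measured data.
example = MonthlyUsage(
    ingest_gb=2000,
    storage_gb_month=6000,
    insights_gb_scanned=500,
    custom_metric=3000,
    alarm=200,
    dashboard=10,
)
breakdown = estimate_monthly_cost(example)
for item, usd in breakdown.items():
    print(f"{item:22s} ${usd:>10,.2f}")
```

With this made-up workload, ingestion and custom metrics dominate while dashboards barely register, which is the pattern the rest of this guide assumes.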

Locate the Problem First: Is it Logs or Metrics that are Expensive?

In Cost Explorer, select CloudWatch as the Service, and group by Usage type. Common high-cost items include:

  • DataProcessing-Bytes: Log ingestion.
  • TimedStorage-ByteHrs: Log storage/retention.
  • MetricMonitorUsage: Custom metrics.
  • AlarmMonitorUsage: Alarms.
  • GMD-Metrics or API requests: GetMetricData queries.

Many teams assume dashboards are the main expense, but in reality, it is usually log ingestion or high-cardinality custom metrics that drive up the bill.
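
If you export the grouped Cost Explorer data, a short script can roll the usage types up into the categories above. The following Python sketch assumes you already have (usage type, cost) pairs in hand; the sample rows are invented, and the substring matching is an assumption meant to tolerate region or service prefixes on usage type names.

```python
from collections import defaultdict

# Markers from the list above, mapped to readable categories.
CATEGORY_BY_MARKER = {
    "DataProcessing-Bytes": "log ingestion",
    "TimedStorage-ByteHrs": "log storage",
    "MetricMonitorUsage": "custom metrics",
    "AlarmMonitorUsage": "alarms",
    "GMD-Metrics": "GetMetricData queries",
}


def summarize(rows):
    """Sum cost per category and return (category, cost) pairs, largest first."""
    totals = defaultdict(float)
    for usage_type, cost in rows:
        category = next(
            (name for marker, name in CATEGORY_BY_MARKER.items() if marker in usage_type),
            "other",
        )
        totals[category] += cost
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical export rows: (usage type, monthly cost in USD).
rows = [
    ("DataProcessing-Bytes", 1240.10),
    ("USW2-DataProcessing-Bytes", 310.40),
    ("CW:MetricMonitorUsage", 905.00),
    ("TimedStorage-ByteHrs", 122.75),
    ("CW:AlarmMonitorUsage", 18.20),
    ("DashboardsUsageHour", 9.00),
]
ranked = summarize(rows)
for category, cost in ranked:
    print(f"{category:24s} ${cost:,.2f}")
```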

Logs Cost Reduction 1: Set Retention Periods

By default, CloudWatch log groups keep logs indefinitely until a retention policy is set. For application logs, many teams only need 7, 14, 30, or 90 days of online retrieval.

View in bulk:

aws logs describe-log-groups \
  --query 'logGroups[*].[logGroupName,retentionInDays,storedBytes]' \
  --output table

Set retention:

aws logs put-retention-policy \
  --log-group-name /aws/ecs/prod-api \
  --retention-in-days 30

Tier retention by log type: 90 days for production error logs, 14 to 30 days for general application logs, 3 to 7 days for debugging logs, and move audit and compliance logs to S3 for long-term storage.
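
One way to apply such tiers consistently is to derive the target retention from the log group name. The Python sketch below is only an illustration: the naming patterns and the sample groups are hypothetical, and it only ever shortens retention, so a group that already has a stricter policy is left alone. Its output could feed a loop of put-retention-policy calls.

```python
import re

# Hypothetical naming convention: first matching pattern wins.
RULES = [
    (re.compile(r"debug"), 7),
    (re.compile(r"prod.*error"), 90),
]
DEFAULT_DAYS = 30

# A subset of the retention values CloudWatch Logs accepts.
ALLOWED_DAYS = {1, 3, 5, 7, 14, 30, 60, 90}


def retention_for(name):
    """Pick the target retention (days) for a log group name."""
    for pattern, days in RULES:
        if pattern.search(name):
            return days
    return DEFAULT_DAYS


def plan(log_groups):
    """log_groups: (name, current_days or None for 'never expire').

    Returns {name: target_days} only for groups that need a shorter policy.
    """
    changes = {}
    for name, current in log_groups:
        target = retention_for(name)
        assert target in ALLOWED_DAYS
        if current is None or current > target:
            changes[name] = target
    return changes


# Hypothetical inventory, e.g. parsed from describe-log-groups output.
inventory = [
    ("/aws/ecs/prod-api", None),
    ("/app/prod/payments-error", None),
    ("/app/staging/debug", 14),
    ("/aws/lambda/report", 14),
]
changes = plan(inventory)
print(changes)
```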

Logs Cost Reduction 2: Filter Before Ingestion, Not After

The most expensive part of CloudWatch Logs is typically ingestion. Once logs have entered CloudWatch, filtering them using subscription filters or Logs Insights will not save you the ingestion fees.

You should filter at the application, sidecar, Fluent Bit, or OpenTelemetry Collector stage:

[FILTER]
    # Drop any record whose "log" field mentions healthcheck.
    Name     grep
    Match    app.*
    Exclude  log healthcheck

Prioritize discarding high-frequency, low-value logs, such as health checks, 200 OK responses for static resources, duplicate debug logs, and excessively long payloads. Defaulting to INFO in production and temporarily raising it to DEBUG during troubleshooting is far cheaper than full-time debug logging.
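
The same idea can be applied inside the application itself. As a minimal sketch, the Python logging filter below drops health checks and successful static-asset requests before any handler ships them; the class names, the patterns, and the in-memory handler (a stand-in for whatever actually forwards logs) are assumptions for illustration.

```python
import logging
import re


class DropLowValueLogs(logging.Filter):
    """Reject records matching any low-value pattern before they are shipped."""

    def __init__(self, patterns):
        super().__init__()
        self._patterns = [re.compile(p) for p in patterns]

    def filter(self, record):
        message = record.getMessage()
        return not any(p.search(message) for p in self._patterns)


class ListHandler(logging.Handler):
    """Stand-in for a real shipping handler; keeps messages in memory."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
handler.addFilter(DropLowValueLogs([r"GET /healthcheck", r"GET /static/\S+ 200"]))

logger = logging.getLogger("app.demo")
logger.setLevel(logging.INFO)  # INFO by default; raise to DEBUG only while troubleshooting
logger.propagate = False
logger.addHandler(handler)

logger.info("GET /healthcheck 200")
logger.info("GET /static/app.css 200")
logger.info("POST /orders 500 upstream timeout")
logger.debug("cache miss for key=abc")  # below INFO, so never emitted

print(handler.messages)
```

Only the failed order request reaches the handler, so only that line would be billed for ingestion.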

Logs Cost Reduction 3: Clean Up Unused Log Groups

When Lambda, ECS, API Gateway, or Step Functions are decommissioned, their log groups are not automatically deleted. Start by listing the log groups that still hold stored data:

aws logs describe-log-groups \
  --query 'logGroups[?storedBytes>`0`].[logGroupName,storedBytes,retentionInDays]' \
  --output table

Then check each candidate's log streams (aws logs describe-log-streams returns a lastEventTimestamp per stream) to verify whether the group is still being written to. It is recommended to export groups to S3 before deletion to avoid accidentally deleting compliance data.
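
The staleness check itself is simple once the timestamps are collected. The Python sketch below assumes you have already gathered each group's lastEventTimestamp values (epoch milliseconds) into a dictionary; the group names, the 90-day threshold, and the fixed "now" are hypothetical. Groups it flags are candidates for review, not for automatic deletion.

```python
from datetime import datetime, timedelta, timezone


def to_ms(moment):
    """Convert a datetime to epoch milliseconds, the unit CloudWatch Logs uses."""
    return int(moment.timestamp() * 1000)


def is_stale(last_event_ms, now, max_idle_days=90):
    """True if no stream in the group has an event newer than max_idle_days.

    last_event_ms: lastEventTimestamp per stream; None for a stream without events.
    """
    seen = [ts for ts in last_event_ms if ts is not None]
    if not seen:
        return True  # no events at all; could also be a brand-new group
    newest = datetime.fromtimestamp(max(seen) / 1000, tz=timezone.utc)
    return now - newest > timedelta(days=max_idle_days)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)  # fixed for a reproducible example

# Hypothetical data: group name -> lastEventTimestamp of each stream.
groups = {
    "/aws/lambda/old-report": [to_ms(datetime(2023, 11, 1, tzinfo=timezone.utc))],
    "/aws/ecs/prod-api": [to_ms(datetime(2024, 5, 30, tzinfo=timezone.utc)), None],
    "/aws/lambda/never-used": [],
}
candidates = sorted(name for name, streams in groups.items() if is_stale(streams, now))
print(candidates)
```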

Logs Cost Reduction 4: Be Cautious with High-Volume Service Logs

VPC Flow Logs, ALB Access Logs, WAF Logs, and CloudFront Logs can be extremely large. Do not ingest all of them into CloudWatch by default.

A more common practice is:

  • Ingest short-term troubleshooting logs into CloudWatch.
  • Store long-term access logs in S3.
  • Use Athena to query historical data.
  • For VPC Flow Logs, capture only the necessary fields with a custom log format, or enable them only for critical subnets/ENIs.

The interactive experience of S3 + Athena is not as seamless as CloudWatch Logs, but the long-term retention cost is typically much lower, making it especially suitable for low-frequency audit queries.

Logs Cost Reduction 5: Alternative Solutions as Supplementary Choices

If the primary bottleneck is the ingestion and storage cost of CloudWatch Logs, you can evaluate alternative solutions. For example, S4 Logs is an alternative product optimized for CloudWatch Logs costs, suitable for scenarios with large log volumes, long retention periods, and where CloudWatch’s native query capabilities are not the sole hard requirement.

It should not be viewed as a “no-brainer replacement for CloudWatch”. If your team is highly dependent on Logs Insights, CloudWatch Alarms, and existing operations workflows, the migration cost must be taken into account. A more pragmatic approach is to select a high-traffic, low-risk log group for a pilot program, and then use the cost estimation tool to estimate the potential savings range.

Metrics Cost Reduction 1: Control Cardinality

The core risk of Custom Metrics is high cardinality. CloudWatch bills each unique combination of namespace, metric name, and dimension values as a separate metric. The following dimension design is very dangerous:

MetricName=Latency
Dimensions=service, endpoint, customer_id, pod_id, request_id

If there are 10,000 customer_id values and 50 endpoint values, those two dimensions alone allow up to 500,000 combinations, before pod_id and request_id multiply the count further. A safer approach is to retain only the dimensions truly needed for troubleshooting and alerting:

MetricName=LatencyP95
Dimensions=service, endpoint, environment

High-cardinality fields like customer_id, pod_id, and request_id are better suited for logs or traces than for dimensions of CloudWatch custom metrics.
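
A quick worst-case calculation shows how far apart the two designs are. In the Python sketch below, the 10,000 customers and 50 endpoints come from the example above, while the 4 services and 3 environments are assumed values; pod_id and request_id are left out because they are effectively unbounded. The cost ceiling uses the first-tier rate of about $0.30 per metric-month, so real bills at high volume would be lower per metric but still dominated by the same multiplication.

```python
from math import prod

FIRST_TIER_RATE = 0.30  # USD per metric-month, approximate figure from this article


def metric_count(distinct_values):
    """Worst-case number of billed metrics for one metric name.

    distinct_values: {dimension_name: number_of_distinct_values}
    """
    return prod(distinct_values.values())


def cost_ceiling(distinct_values, rate=FIRST_TIER_RATE):
    """Upper bound on monthly cost if every combination is actually reported."""
    return round(metric_count(distinct_values) * rate, 2)


risky = {"service": 4, "endpoint": 50, "customer_id": 10_000}
safer = {"service": 4, "endpoint": 50, "environment": 3}

print(metric_count(risky), cost_ceiling(risky))
print(metric_count(safer), cost_ceiling(safer))
```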

Metrics Cost Reduction 2: Aggregate Before Reporting

Do not call PutMetricData directly for every instance, every tenant, or every endpoint. Instead, aggregate at the application or collector layer, for example by reporting p50/p95/p99, count, and error_count once every 60 seconds.

The PutMetricData API also incurs request fees, typically around $0.01 per 1,000 requests. Lengthening the reporting interval from 10 seconds to 60 seconds generally does not affect most operational alerts, and it cuts PutMetricData requests roughly sixfold. The interval alone does not change the number of billed metrics; that count only falls when you also drop dimensions or metric names.
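
As a rough illustration of in-process aggregation, the Python sketch below collects samples for one window and emits a single summary on flush. The class and function names are hypothetical, the percentile uses the simple nearest-rank method, and the scheduling of flush (for example every 60 seconds) is left to the caller.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of a non-empty sorted list; p in (0, 100]."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


class WindowAggregator:
    """Collects raw samples and emits one summary per reporting window."""

    def __init__(self):
        self._latencies = []
        self._errors = 0

    def record(self, latency_ms, is_error=False):
        self._latencies.append(latency_ms)
        if is_error:
            self._errors += 1

    def flush(self):
        """Return the window summary and reset; None if nothing was recorded."""
        if not self._latencies:
            return None
        values = sorted(self._latencies)
        summary = {
            "count": len(values),
            "error_count": self._errors,
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
        }
        self._latencies, self._errors = [], 0
        return summary


def monthly_requests(interval_seconds, days=30):
    """PutMetricData calls per month for one reporter at a fixed interval."""
    return days * 24 * 3600 // interval_seconds


agg = WindowAggregator()
for latency in range(1, 101):  # synthetic latencies 1..100 ms
    agg.record(latency, is_error=(latency % 25 == 0))
summary = agg.flush()
print(summary)
print(monthly_requests(10), monthly_requests(60))
```

One reporter sends 259,200 requests per month at a 10-second interval and 43,200 at 60 seconds; the summary dictionary is what you would pass to PutMetricData or write out as a structured log line.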

Metrics Cost Reduction 3: Use EMF Correctly

Embedded Metric Format can extract metrics from structured logs, which is great for linking logs and metrics. However, EMF is not a “free metric.” It still incurs log ingestion costs and may generate custom metrics costs.

When using EMF, pay attention to the following:

  • Do not put user_id, order_id, or container_id into dimensions.
  • Control the number of namespaces.
  • Aggregate high-frequency events before outputting them to EMF.
  • Periodically check how many unique metrics are actually generated.
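
To show where each kind of field belongs, here is a minimal Python sketch that builds one EMF log line. The helper name, namespace, and field values are hypothetical. Low-cardinality fields are declared as dimensions, while request_id is written as a plain property: it remains searchable in the log event but is not part of the dimension set.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, properties=None, timestamp_ms=None):
    """Build one Embedded Metric Format log line as a JSON string.

    dimensions: {name: value}, low-cardinality only.
    metrics:    {name: (value, unit)}.
    properties: extra context fields that must NOT become dimensions.
    """
    doc = {
        "_aws": {
            "Timestamp": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions)],  # one dimension set
                    "Metrics": [{"Name": n, "Unit": u} for n, (_, u) in metrics.items()],
                }
            ],
        }
    }
    doc.update(dimensions)
    doc.update({name: value for name, (value, _) in metrics.items()})
    doc.update(properties or {})
    return json.dumps(doc)


line = emf_record(
    namespace="MyApp/Checkout",  # hypothetical namespace
    dimensions={"service": "checkout", "environment": "prod"},
    metrics={"LatencyP95": (182.0, "Milliseconds"), "ErrorCount": (3, "Count")},
    properties={"request_id": "req-123"},  # context only, not a dimension
    timestamp_ms=1700000000000,
)
print(line)
```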

Metrics Cost Reduction 4: Govern Alarms and Dashboards

A common problem with alerting is creating an alarm for every single dimension combination, resulting in thousands of alarms. We recommend:

  • Creating a small number of aggregated alarms for key SLOs.
  • Troubleshooting detailed issues via dashboards or logs.
  • Using composite alarms for related alarms to reduce noise.
  • Deleting dashboards that no one views.
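
The difference between per-combination and aggregated alarms is easy to quantify. The Python sketch below uses the approximate $0.10 per standard alarm from this article; the service and endpoint counts and the alarm names are assumptions, and composite alarms are priced separately, so their cost is not included. The second helper builds an OR rule expression of the kind a composite alarm takes, assuming alarm names without double quotes.

```python
ALARM_RATE = 0.10  # USD per standard-resolution alarm-month, approximate


def alarm_cost(alarm_count, rate=ALARM_RATE):
    """Monthly cost of a number of standard-resolution metric alarms."""
    return round(alarm_count * rate, 2)


def composite_rule(alarm_names):
    """Rule expression that fires when any of the child alarms is in ALARM."""
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


# Assumed fleet: 20 services, 50 endpoints each, 2 alarmed metrics.
per_combination = 20 * 50 * 2   # one alarm per service/endpoint/metric
aggregated = 20 * 2             # one alarm per service/metric

print(per_combination, alarm_cost(per_combination))
print(aggregated, alarm_cost(aggregated))

# Hypothetical alarm names.
rule = composite_rule(["checkout-latency-p95", "checkout-error-rate"])
print(rule)
```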

While the unit price of a dashboard is not necessarily high, frequent automatic refreshes can generate GetMetricData API costs, especially for always-on wall displays and external systems that poll metrics.

When to Consider S4 Metrics?

If the main issue with CloudWatch Custom Metrics is the cost driven by high cardinality, yet your business truly needs to retain metrics at a fine granularity, you can evaluate alternatives like S4 Metrics. Its purpose is to lower the cardinality cost of CloudWatch custom metrics, not to replace the entire observability system. It is suitable for starting with an A/B pilot on non-core namespaces or the highest-cost metric families.

An Actionable 7-Day Plan

  • Day 1: Use Cost Explorer to identify the top CloudWatch usage types.
  • Day 2: List the size and retention of all log groups.
  • Day 3: Change the default indefinite retention to a tiered 7/14/30/90 days retention.
  • Day 4: Pre-filter low-value logs at Fluent Bit or OTel Collector.
  • Day 5: Direct long-term logs to S3 + Athena.
  • Day 6: Export the custom metrics list and remove high-cardinality dimensions.
  • Day 7: Evaluate whether alternative solutions like S4 Logs / S4 Metrics are worth piloting.

The principle of CloudWatch cost reduction is simple: keep high-value signals, do not ingest low-value noise; store data that needs long-term retention in systems more suitable for long-term storage.

Disclosure: The author of this article is from abyo software (the developer of S4, a cost optimization product on AWS Marketplace).