Your Messages Are Expiring While CloudWatch Shows Green
March 21, 2026 · DeadQueue Team
Your CloudWatch alarm fires at queue depth 10. The messages are already 3 days old and expiring in 4 hours. You have no runbook link, no age context, and no idea this was a slow drain that started Tuesday.
The alarm worked exactly as configured. That’s the problem.
Why Engineers Default to CloudWatch for SQS
It makes sense. CloudWatch is already there. It’s the native AWS monitoring surface, it integrates directly with SQS metrics, and setting an alarm on ApproximateNumberOfMessagesVisible takes about two minutes. You configure a threshold, wire it to an SNS topic, and you’re paged when the queue backs up. That’s a reasonable starting point.
CloudWatch also has the right data. SQS publishes a solid set of metrics to the AWS/SQS namespace: message counts, age of the oldest message, number sent and deleted, in-flight counts. The raw signals exist. The problem isn’t missing data. It’s what CloudWatch alarms are designed to do with that data, and where the gaps open up when you’re monitoring DLQs specifically.
Teams that run into trouble with CloudWatch SQS alerts didn’t make obvious mistakes. They set up reasonable alarms based on reasonable assumptions. Then production showed them where those assumptions were wrong.
There are four gaps worth knowing about before you discover them the hard way.
Gap #1: Count-Only Monitoring Misses Slow Drains
A CloudWatch alarm set to threshold 10 on ApproximateNumberOfMessagesVisible will fire when depth hits 10. That seems clear. What it won’t do is fire if depth has been sitting at 8 for 72 hours.
That distinction matters more than it might seem.
A queue that spikes to 50 messages and drains back to zero in 10 minutes is not a crisis. Processing hiccupped, recovered, and caught up. Whatever backed up briefly is already handled. The spike looks alarming on a graph but the outcome is fine.
A queue that sits at 8 messages for three days is a different kind of problem entirely. Those 8 messages aren’t being processed. They’re aging. If your retention period is 4 days, they have 24 hours left. The alarm never fired because 8 is below threshold. No one was paged. The slow drain ran silently for 72 hours before the first message expired.
CloudWatch alarms are designed around absolute thresholds and point-in-time values. They evaluate a metric against a number you set. They don’t evaluate stagnation. They don’t evaluate rate of change over time. There’s no native “this metric hasn’t moved in N hours” alarm type. You can use Anomaly Detection alarms to flag unusual behavior, but those require a training period and are calibrated around normal patterns, not intentional stagnation detection.
The slow drain failure mode is also subtle because it looks like things are working. Processing isn’t stopped. If you’re watching depth, you see a low number. Nothing obviously broken. But the messages already in the queue keep aging toward expiration while you’re watching the wrong signal.
Rate of change matters as much as absolute depth. A queue at depth 8 that’s been at depth 8 for three days needs a different response than a queue at depth 8 that was at depth 6 twenty minutes ago. CloudWatch gives you the depth. You have to build the rate-of-change logic yourself.
For DLQs, this matters more than for source queues. Messages on a DLQ are already failed. They’re not getting retried automatically. Every hour they sit there is an hour closer to permanent deletion. A slow drain on a DLQ isn’t a processing hiccup. It’s a silent clock running out.
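The stagnation check CloudWatch lacks is straightforward to express once you have a history of depth samples. A minimal sketch, assuming you poll depth on a fixed interval (the function name, window size, and sampling scheme are all illustrative):

```python
def is_slow_drain(depths, min_samples=24):
    """Detect a stalled queue from a history of depth samples.

    depths: queue depth per poll, oldest first (e.g. one sample
    every 15 minutes, so 24 samples covers a 6-hour window).
    Flags a window where the queue held messages the whole time and
    ended no lower than it started, i.e. nothing actually drained.
    """
    if len(depths) < min_samples:
        return False
    window = depths[-min_samples:]
    return min(window) > 0 and window[-1] >= window[0]
```

A spike that drains back to zero never trips this check, while a queue parked at depth 8 for the whole window does, which is exactly the inversion of what a threshold alarm gives you.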
Gap #2: ApproximateNumberOfMessagesSent Is the Wrong Metric
A common alarm pattern: watch ApproximateNumberOfMessagesSent on your DLQ. When messages are sent to the DLQ, the count goes up, the alarm fires, you investigate. Seems solid.
It breaks when you start using SQS’s auto-redrive feature.
Auto-redrive moves messages from a DLQ back to the source queue for reprocessing. When messages are redriven and then fail again, they return to the DLQ. But here’s the thing: the return trip doesn’t increment ApproximateNumberOfMessagesSent on the DLQ.
That metric counts messages that were sent to the queue via SendMessage or SendMessageBatch API calls. The redrive path doesn’t go through those APIs. It goes through the StartMessageMoveTask mechanism, which is a different code path with different metric attribution. When a redriven message fails and lands back in the DLQ, it doesn’t look like a new arrival to CloudWatch.
The practical result: you can have messages accumulating in your DLQ after failed redrives, and your ApproximateNumberOfMessagesSent alarm never fires. The alarm is silent. Engineers who built coverage on that metric have a gap they don’t know about.
This isn’t a bug. It’s how the metrics are defined. ApproximateNumberOfMessagesSent counts API-level sends. Redrives are an internal SQS operation, not an external API call, and they’re not counted.
The problem compounds when teams use redrive workflows as part of their incident response. Someone notices messages piling up, triggers an auto-redrive to reprocess them, and then sets up an alarm on sent-count to watch for future failures. The alarm works for the initial failure scenario but misses the “failed again after redrive” scenario completely.
To actually catch redrived messages returning to the DLQ, you need to watch ApproximateNumberOfMessagesVisible directly, not sent-count. Depth goes up when messages arrive via any mechanism, including redrives. Sent-count only goes up for external API sends.
The subtlety is that this requires you to know the distinction exists. It’s not documented prominently in the CloudWatch SQS alarm setup flow. Most engineers who hit this find out after a production incident where messages accumulated silently and the alarm they trusted stayed quiet.
Gap #3: No Native Age-Based Alerting
The metric that matters most for DLQ health is ApproximateAgeOfOldestMessage. CloudWatch publishes it. You can build an alarm on it. But doing that correctly requires work that CloudWatch doesn’t do for you.
The first issue is the threshold calculation. A useful age alert isn’t “message is older than 48 hours.” It’s “message is older than 50% of the retention period.” Those are different numbers for every queue. If your DLQ has 14-day retention, 50% is 7 days, which in seconds is 604800. If it has 4-day retention, 50% is 2 days, which is 172800. You need to look up each queue’s retention period, do the math, and set a per-queue alarm threshold.
That’s manageable for five queues. It falls apart at fifty.
The second issue is the retention mismatch problem. SQS preserves the original enqueue timestamp when a message moves to the DLQ. A message that spent 3 days on the source queue arrives in the DLQ with 3 days already burned off its clock. If your source queue has 4-day retention and your DLQ also has 4-day retention, that message arrives with one day left, not four.
A CloudWatch alarm set at “alert when age exceeds 50% of DLQ retention” (2 days for a 4-day DLQ) will never fire for that message. The message expires while the alarm is in OK state, because the message aged out on the source queue, not on the DLQ.
The right DLQ configuration is maximum retention on the DLQ (14 days) with the source queue set shorter. That gives messages arriving from the source queue the most remaining time. But plenty of teams have matched retention periods or defaulted to 4 days everywhere, which creates exactly this orphan scenario.
CloudWatch doesn’t check for retention mismatches. It doesn’t warn you that your DLQ and source queue have the same retention period. It doesn’t calculate time-to-expiry accounting for age carried over from the source queue. Those are calculations you need to build yourself, outside of CloudWatch.
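Both calculations are small once you accept that the oldest-message age already includes time spent on the source queue (SQS preserves the original enqueue timestamp). A sketch of the two checks, with illustrative function names:

```python
def time_to_expiry(oldest_age_seconds, dlq_retention_seconds):
    """Seconds until the oldest message is deleted. The age reported by
    ApproximateAgeOfOldestMessage already counts time spent on the
    source queue, so no extra adjustment is needed here."""
    return dlq_retention_seconds - oldest_age_seconds

def retention_mismatch(source_retention_seconds, dlq_retention_seconds):
    """A DLQ should retain longer than its source queue; equal or
    shorter retention means messages can arrive with most of their
    clock already spent."""
    return dlq_retention_seconds <= source_retention_seconds
```

For the article's example, a message that burned 3 days on a 4-day source queue and lands on a 4-day DLQ has one day left, and the mismatch flag is raised at configuration time rather than at incident time.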
Gap #4: Alerts Without Context
When a CloudWatch alarm fires, you get: the alarm name, the metric value, the threshold, and the time it fired. That’s the standard SNS notification payload.
What you don’t get: any of the things you actually need to respond.
You don’t get the age of the oldest message. You don’t get time-to-expiry. You don’t get the retention period of the queue. You don’t get a link to the runbook for this queue. You don’t get the upstream context: which consumer is writing to this queue, what’s likely wrong, or where to start debugging. You don’t get information about whether a redrive is possible or whether the messages have already expired.
The result is an alert that tells you something is wrong and leaves every other question unanswered. At 2am, “orders-dlq-prod: ApproximateNumberOfMessagesVisible 12 greater than threshold 10” is almost useless. You know there are messages in the queue. You don’t know if they’re about to expire. You don’t know if this is the first time this happened or if it’s been growing for days. You don’t know where the runbook is.
You can add context to SNS notifications manually. You can build Lambda functions that enrich the alarm notification with additional queue metadata before routing it to Slack. You can write runbook URLs into alarm descriptions and parse them back out in your notification formatter. Each of these is a thing you have to build and maintain.
Most teams don’t. They live with the sparse alarm notifications because building the enrichment layer is a project, not an afternoon. So the on-call engineer gets paged with minimal information and has to reconstruct context from the AWS console before they can even start investigating.
What You’d Need to Build This Yourself
If you wanted to close all four gaps with native AWS tooling, here’s roughly what it takes.
For slow drain detection, you’d write a Lambda function that runs on a schedule (an EventBridge cron rule, every 15 minutes or so), reads the current depth for each DLQ, compares it to the value from the last run stored in DynamoDB or Parameter Store, and calculates the rate of change. When depth is non-zero and not decreasing, it triggers an alert. That’s maybe 100 lines of code, plus the scheduler, plus the state storage.
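The wiring for that scheduled check might look like the following. This is a sketch, not a drop-in implementation: the queue URL, table name, and key schema are assumptions, and the alerting side is left as a stub.

```python
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"  # placeholder
STATE_TABLE = "dlq-monitor-state"  # hypothetical DynamoDB table keyed by queue_url

def stalled(previous_depth, current_depth):
    """Depth is non-zero and hasn't moved down since the last poll."""
    return current_depth > 0 and current_depth >= previous_depth

def handler(event, context):
    """EventBridge-scheduled Lambda: read depth, compare to last run."""
    import boto3
    sqs = boto3.client("sqs")
    table = boto3.resource("dynamodb").Table(STATE_TABLE)
    depth = int(sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessagesVisible"],
    )["Attributes"]["ApproximateNumberOfMessagesVisible"])
    prev = table.get_item(Key={"queue_url": QUEUE_URL}).get("Item", {})
    if stalled(int(prev.get("depth", 0)), depth):
        pass  # publish to SNS / page the on-call here
    table.put_item(Item={"queue_url": QUEUE_URL, "depth": depth})
```

A single previous-value comparison is the crudest version; in practice you’d want a few samples of history so one flat interval doesn’t page anyone.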
For age-based alerting calibrated to retention, you’d extend that Lambda to call GetQueueAttributes on each queue, read the MessageRetentionPeriod, calculate 50% and 75% of that period in seconds, read the current ApproximateAgeOfOldestMessage, and compare. You could also publish a custom CloudWatch metric for “age as percentage of retention” and alarm on that, which moves the logic into CloudWatch but adds a custom metric and a PutMetricData call per queue per poll cycle.
For retention mismatch detection, you’d look up both the source queue and the DLQ, compare their retention periods, and flag mismatches. That requires maintaining a mapping of source queues to their corresponding DLQs, which isn’t something AWS provides out of the box. You’d build that mapping during setup and keep it updated as queues are added.
For context-rich alerts, you’d build a notification formatter that takes the alarm data, fetches additional queue attributes, looks up the runbook URL from a tag or config file, and formats a Slack message with the full picture. Then you’d wire that into your SNS routing.
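The formatter is the simplest of the four pieces once the enrichment data is in hand. An illustrative version (field names and the Slack payload shape are assumptions; the runbook URL would come from a queue tag or config file):

```python
def format_alert(alarm_name, depth, oldest_age_s, retention_s, runbook_url):
    """Build a Slack message payload with the context the bare
    CloudWatch/SNS notification lacks: age, expiry, and runbook."""
    remaining_s = retention_s - oldest_age_s
    pct = 100 * oldest_age_s // retention_s
    return {
        "text": (
            f":rotating_light: {alarm_name}\n"
            f"Depth: {depth} | Oldest message: {oldest_age_s // 3600}h "
            f"({pct}% of retention) | Expires in: {remaining_s // 3600}h\n"
            f"Runbook: {runbook_url}"
        )
    }
```

The same alarm from the 2am example now arrives as "depth 12, oldest message 72h, 75% of retention, expires in 24h, here's the runbook," which is enough to triage without opening the AWS console.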
The total is a few Lambda functions, a scheduler, some state storage, an SNS-to-Slack bridge, and a config layer to track the queue-to-DLQ mappings and runbook URLs. It’s probably two to four days of engineering work to do right, with ongoing maintenance as queues are added or retention periods change. And it needs tests, because a monitoring system that silently fails is worse than not having one.
That’s the actual cost of closing these gaps in CloudWatch. Not impossible. But real.
The Four Signals You Actually Need on a DLQ
After working through the gaps, the set of signals that actually covers DLQ health looks like this.
Depth plus rate of change. Absolute depth tells you if messages are present. Rate of change tells you if they’re being processed. A DLQ at depth 5 that’s been at depth 5 for 6 hours needs attention. A DLQ that just hit depth 5 during an active incident is normal. You need both data points to distinguish them.
Oldest message age as a percentage of the retention window. Not raw seconds. Percentage. A 48-hour-old message on a 4-day queue is at 50% of its retention window and getting close. A 48-hour-old message on a 14-day queue is at 14% of its window and has plenty of time. The same raw age number means completely different things depending on the queue’s retention period. Alerting on percentage gives you a calibrated signal that works across all your queues without per-queue threshold math.
Time-to-expiry. This is the number that actually matters when you’re deciding how urgently to respond. If the oldest message expires in 36 hours, you can handle it during business hours. If it expires in 4 hours, you page someone now. Time-to-expiry is what you need at 2am to make that call correctly, and it’s not a metric CloudWatch surfaces directly.
Retention mismatch flag. If your source queue and DLQ have the same retention period, messages will arrive in the DLQ already partially expired. That’s a configuration problem worth knowing about at setup time, before the first message fails. The mismatch flag surfaces it without waiting for a production incident to reveal it.
These four signals together cover what CloudWatch handles only partially. Depth for accumulation detection. Age percentage for expiry risk calibrated to the queue. Time-to-expiry for response prioritization. Retention mismatch for configuration health.
DeadQueue monitors all four signals across your SQS dead letter queues: depth trends, message age as a percentage of retention, time-to-expiry countdowns, and retention mismatch detection. Alerts route to Slack, email, or PagerDuty with queue name, runbook link, and full context. No per-queue alarm setup, no threshold math, no custom Lambda functions to maintain.
Free tier covers three queues. Connect your first queue at deadqueue.com.