How to Audit Your SQS DLQ Alert Configuration in 10 Minutes
March 3, 2026 · DeadQueue Team
Most teams have DLQ alerts set up. Few have checked whether they actually work.
It’s not about missing the setup step. It’s that the defaults are wrong, the metric names are confusing, and CloudWatch makes it easy to configure an alarm that looks correct but fires for the wrong reasons — or not at all.
This is a 10-minute audit. Three checks, real CLI commands, no fluff.
Step 1: Check your alarm metric
Pull your existing alarms and see what metric they’re using:
```shell
aws cloudwatch describe-alarms \
  --alarm-name-prefix "your-dlq" \
  --query 'MetricAlarms[*].[AlarmName,MetricName,Threshold]'
```
What wrong looks like: MetricName is NumberOfMessagesSent.
This is the most common mistake. NumberOfMessagesSent only increments when your application code explicitly calls SendMessage. When SQS automatically redrives a message to the DLQ after exhausting maxReceiveCount, that metric doesn’t move. Your alarm stays silent.
What right looks like: MetricName is ApproximateNumberOfMessagesVisible.
This reflects how many messages are sitting in the queue right now, regardless of how they got there. It catches both explicit application sends and automatic SQS redrives.
If your alarm is on NumberOfMessagesSent, change it.
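To switch the alarm over, you can recreate it on the right metric with `put-metric-alarm`. A minimal sketch — the alarm name, queue name, and SNS topic ARN below are placeholders you'd swap for your own:

```shell
# Recreate the DLQ alarm on queue depth, which counts automatic
# redrives as well as explicit sends. Threshold 0 with
# GreaterThanThreshold means any message in the DLQ fires the alarm.
# Alarm name, queue name, and topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "your-dlq-depth" \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=your-dlq \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:your-alerts-topic
```

`--treat-missing-data notBreaching` keeps the alarm quiet when the queue is idle and CloudWatch reports no datapoints.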
Step 2: Check for age-based alerting
Queue depth alone misses the slow drain. If three messages land in your DLQ over 10 days, a depth > 5 alarm never triggers. Meanwhile, those messages are aging toward deletion.
Check your current depth first:
```shell
aws sqs get-queue-attributes \
  --queue-url <your-dlq-url> \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesVisible
```
Then check whether you have an alarm on ApproximateAgeOfOldestMessage. Run the same describe-alarms command from Step 1 and look for it in the output. If you don’t have one, you have a gap.
ApproximateAgeOfOldestMessage tells you how long the oldest message has been sitting in the queue, in seconds. A message aging past 80% of your retention period without being processed is a signal worth alerting on. A depth alarm alone won’t catch it.
You need both: depth for volume spikes, age for slow accumulation.
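Creating the age alarm looks much like the depth alarm. A sketch, assuming a 14-day retention period and alerting at roughly 80% of it (1,209,600 s × 0.8 = 967,680 s) — names and the topic ARN are placeholders, and the threshold should be tuned to your own retention:

```shell
# Age alarm: fire when the oldest DLQ message passes ~80% of a
# 14-day retention window (967680 seconds).
# Alarm name, queue name, and topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "your-dlq-age" \
  --namespace AWS/SQS \
  --metric-name ApproximateAgeOfOldestMessage \
  --dimensions Name=QueueName,Value=your-dlq \
  --statistic Maximum \
  --period 3600 \
  --evaluation-periods 1 \
  --threshold 967680 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:your-alerts-topic
```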
Step 3: Check retention periods
Here’s the one most teams never think to check. Pull the retention period on your DLQ:
```shell
aws sqs get-queue-attributes \
  --queue-url <your-dlq-url> \
  --attribute-names MessageRetentionPeriod
```
Then pull the same for the source queue:
```shell
aws sqs get-queue-attributes \
  --queue-url <your-source-queue-url> \
  --attribute-names MessageRetentionPeriod
```
SQS preserves the original enqueue timestamp when it moves a message to the DLQ. A message that spent 3 days on the source queue arrives in the DLQ with 3 days already burned off its retention clock. If your DLQ retention is 4 days, that message has one day left.
If your DLQ retention is shorter than or equal to your source queue retention, messages can expire before anyone inspects them. The DLQ retention needs to be longer than the source queue’s, ideally at the 14-day maximum.
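Fixing this is a one-liner. The queue URL is a placeholder:

```shell
# Raise the DLQ retention to the 14-day maximum (1,209,600 seconds).
aws sqs set-queue-attributes \
  --queue-url <your-dlq-url> \
  --attributes MessageRetentionPeriod=1209600
```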
Summary
| Check | Metric / Setting | What it catches | What it misses |
|---|---|---|---|
| Alarm metric | ApproximateNumberOfMessagesVisible | Depth spikes, auto-redrives | Slow accumulation |
| Age alarm | ApproximateAgeOfOldestMessage | Slow drain, aging messages | Volume spikes |
| Retention periods | DLQ > source queue | Messages expiring before inspection | Nothing, if set correctly |
Three checks. Any one of them failing means you have a silent gap in your alerting.
If you’d rather automate this audit instead of running it manually every quarter, that’s what DeadQueue does. It checks your metric configuration, retention health, and message age across all your queues on setup.