SQS DLQ Expiration: The Hidden Mechanic Deleting Your Messages

March 11, 2026 · DeadQueue Team

You set your SQS DLQ’s MessageRetentionPeriod to 14 days, so you have two weeks to debug failed messages, right?

You don’t. And by the time you find out, the messages are already gone.


The 14-Day Lie

The mental model most engineers carry goes something like this: message fails on the source queue, SQS moves it to the DLQ, the DLQ retention clock starts fresh. You set 14 days, you get 14 days.

That’s not how it works.

SQS sets a message’s SentTimestamp exactly once: the moment it first arrives on the source queue. That timestamp never changes. When SQS redrives the message to the DLQ after exhausting maxReceiveCount, it carries the original timestamp with it. The DLQ’s MessageRetentionPeriod is calculated against that original value, not the arrival time at the DLQ.

This means the “14 days” on your DLQ is not 14 days of fresh debugging time. It’s 14 days minus however long the message spent on the source queue.

A message that spent 3 days retrying before hitting the DLQ has 11 days left, not 14. A message that spent 13 days retrying has one day left. The DLQ retention is a ceiling, not a guarantee.
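The arithmetic is easy to sketch. A minimal illustration in Python (the dates and retention values here are invented for the example):

```python
from datetime import datetime, timedelta, timezone

def dlq_window(sent_at: datetime, dlq_retention: timedelta, now: datetime) -> timedelta:
    """Time left before SQS deletes the message, counted from the
    original SentTimestamp, not from DLQ arrival."""
    return (sent_at + dlq_retention) - now

# Message first enqueued 3 days ago, spent that whole time retrying,
# then landed in a DLQ with 14-day retention:
sent = datetime(2026, 3, 1, tzinfo=timezone.utc)
now = sent + timedelta(days=3)          # moment of DLQ arrival
left = dlq_window(sent, timedelta(days=14), now)
print(left.days)  # 11 days of debugging time, not 14
```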

Most documentation mentions this in passing. Most engineers don’t read it until they’ve already lost something important.


How SQS Message Expiration Actually Works

Every SQS message has a SentTimestamp attribute. It’s set at enqueue time and is immutable after that. AWS uses it as the reference point for the entire retention calculation.

You can verify this yourself. Pull a message from your DLQ with:

aws sqs receive-message \
  --queue-url <your-dlq-url> \
  --attribute-names SentTimestamp ApproximateFirstReceiveTimestamp \
  --visibility-timeout 0 \
  --max-number-of-messages 1

The SentTimestamp will reflect when the message first hit the source queue, not when it arrived in the DLQ. There’s no “DLQ arrival time” attribute exposed by default because, from SQS’s perspective, the message’s age hasn’t reset.

The retention calculation is simple: SentTimestamp + MessageRetentionPeriod = expiration. SQS checks each message against that number and deletes it when the wall clock passes the expiration point. It doesn’t matter which queue the message is sitting in at that moment.
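In code, that calculation is one line: SentTimestamp comes back from receive-message as a string of epoch milliseconds, and the deadline is that value plus the current queue’s retention. A hypothetical helper (the timestamp below is made up):

```python
from datetime import datetime, timedelta, timezone

def expires_at(sent_timestamp_ms: str, retention_seconds: int) -> datetime:
    """Expiration = original SentTimestamp + the current queue's retention."""
    sent = datetime.fromtimestamp(int(sent_timestamp_ms) / 1000, tz=timezone.utc)
    return sent + timedelta(seconds=retention_seconds)

# A message first enqueued at 2026-01-01T00:00:00Z, sitting in a queue
# with the maximum 14-day retention:
deadline = expires_at("1767225600000", 14 * 24 * 3600)
print(deadline.isoformat())  # 2026-01-15T00:00:00+00:00
```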

MessageRetentionPeriod is configured per-queue, and it applies to messages currently in that queue. But the calculation is always original SentTimestamp + that queue's retention period. So if a message arrives in your DLQ with 2 days of age already on the clock, and your DLQ has a 14-day MessageRetentionPeriod, the message lives for 14 - 2 = 12 more days. Not 14.

This is why it matters to set DLQ retention to 14 days (the max) and keep source queue retention well below it: a message can spend at most the source queue’s retention period retrying, so the gap between the two settings is the minimum debugging window you’re guaranteed. It maximizes the remaining window, but it doesn’t reset the clock.

One more wrinkle: retention periods are per-queue settings, not per-message settings. You can’t set a longer retention on individual messages. The queue-level MessageRetentionPeriod applies uniformly. If you change the retention on a queue after messages are already sitting in it, the new period applies to those existing messages too, which can cause immediate mass expiration if you accidentally shorten the retention.
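That retroactive effect is easy to model. A toy example (the message ages and retention values are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def survives(sent_at: datetime, retention: timedelta, now: datetime) -> bool:
    """A message survives a retention change only if its original
    SentTimestamp plus the NEW retention is still in the future."""
    return sent_at + retention > now

now = datetime(2026, 3, 11, tzinfo=timezone.utc)
ages_days = [1, 3, 6, 10, 13]   # hypothetical message ages in the queue
old = timedelta(days=14)
new = timedelta(days=4)          # someone "tidies up" the retention

before = sum(survives(now - timedelta(days=a), old, now) for a in ages_days)
after  = sum(survives(now - timedelta(days=a), new, now) for a in ages_days)
print(before, after)  # 5 2 -- three messages expire the moment the change lands
```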


The Danger Zone: A Real-World Failure Scenario

Here’s a scenario that’s happened to real teams.

You have a payment webhook processor. The source queue has a 4-day MessageRetentionPeriod and a maxReceiveCount of 10. Your DLQ was left at the default MessageRetentionPeriod: 4 days, the same as the source queue. Nobody thought to change it, and nothing in the console flags it.

On a Tuesday afternoon, your downstream payment provider starts returning intermittent 503s. Your processor retries each message on a backoff schedule. The errors keep coming. SQS keeps redelivering. Each receive attempt increments the receive count.

By Friday evening, most messages have exhausted their 10 receives and landed in the DLQ. They spent roughly 3 days and 22 hours failing. They arrive in the DLQ with a little over 2 hours left before expiration: 4 days of retention counted from the original SentTimestamp, minus the 3 days and 22 hours already burned.

The on-call engineer gets a Slack alert that the DLQ has messages. It’s late Friday. They look at the DLQ depth: 47 messages. They see the 4-day retention configured on the queue and read it as four days of runway. They make a note to investigate Monday morning. They assume they have time.

Monday morning, there are 0 messages in the DLQ. SQS deleted all 47 of them within 2 hours of them arriving, because that’s all the time they had left.

The payment provider’s logs show the 503s. Your application logs show the failed processing attempts. But the actual message bodies, with the original webhook payloads, are gone. If you need to replay those events or audit exactly what was requested, you’re doing it from incomplete logs at best.

This scenario is not theoretical. The math is just: DLQ retention minus the time the message already spent on the source queue equals the real DLQ window. If you set maxReceiveCount high enough, or if retries run long enough, that window can collapse to nearly zero. And nothing in the AWS console warns you it’s happening.
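The collapse is worth quantifying. A sketch with a hypothetical backoff schedule (1 hour, doubling, capped at 12 hours) against a 4-day retention:

```python
def dlq_window_after_retries(receives: int, backoff_s, dlq_retention_s: int) -> int:
    """Seconds of debugging time left once maxReceiveCount is exhausted.
    backoff_s(i) is the delay in seconds before receive attempt i."""
    time_retrying = sum(backoff_s(i) for i in range(1, receives + 1))
    return dlq_retention_s - time_retrying

# Hypothetical backoff: 1h doubling, capped at 12h, across 10 receives.
backoff = lambda i: min(3600 * 2 ** (i - 1), 12 * 3600)
window_s = dlq_window_after_retries(10, backoff, 4 * 24 * 3600)
print(window_s / 3600)  # 9.0 hours left out of 96
```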


Why Your CloudWatch Alerts Won’t Save You

The default instinct after learning about this problem is to add more CloudWatch alarms. That helps, but there are gaps in what CloudWatch can see.

The most common DLQ alarm is on ApproximateNumberOfMessagesVisible. It tells you how many messages are in the queue right now. It doesn’t tell you how old they are. A DLQ sitting at 47 messages looks identical whether those messages have 12 days left or 12 minutes left. The depth metric is the same either way.

NumberOfMessagesSent has its own problem: it only increments when your application code explicitly calls SendMessage. SQS auto-redrives from source queue to DLQ don’t trigger it. So an alarm on this metric won’t fire when messages arrive in the DLQ through the normal failure path.

ApproximateAgeOfOldestMessage is better. It tells you how long the oldest visible message has been in the current queue. But this is age-in-queue, not time-until-expiration. A message that arrives in the DLQ with 2 hours left and sits there for 30 minutes will show an age of 30 minutes. It doesn’t surface the remaining TTL, and it doesn’t account for the time already burned on the source queue.
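The gap between the two numbers is easy to see side by side (the timestamps are invented to match the Friday-night scenario):

```python
from datetime import datetime, timedelta, timezone

now = datetime(2026, 3, 11, 12, 0, tzinfo=timezone.utc)
dlq_arrival = now - timedelta(minutes=30)               # moved to DLQ 30 min ago
sent = now - (timedelta(days=4) - timedelta(hours=2))   # nearly 4 days old
retention = timedelta(days=4)

age_in_queue = now - dlq_arrival          # roughly what the metric shows
remaining_ttl = (sent + retention) - now  # what actually matters
print(age_in_queue, remaining_ttl)  # 0:30:00 vs 2:00:00
```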

There’s also no built-in CloudWatch metric for “time until expiration” or “messages expiring in the next N hours.” SQS doesn’t expose that as a metric. You’d have to derive it yourself.

So you can have a DLQ with a “stable” depth, no NumberOfMessagesSent spikes, and an ApproximateAgeOfOldestMessage that looks reasonable, right up until SQS mass-deletes everything in a short window. The alarms stay green. The messages disappear.


The Solution

There are two paths: build it yourself, or use something that already does it.

DIY with Lambda: You can poll SentTimestamp on individual messages, compute remaining TTL against the DLQ’s own MessageRetentionPeriod (fetched with GetQueueAttributes), push a custom CloudWatch metric, and set an alarm on it. It works. The catch: you need one Lambda per DLQ, and you’re sampling, because you can’t read all messages without consuming them. Across a handful of queues with different configurations, the maintenance surface adds up. It’s a reasonable path if you have one or two queues and time to spare.
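The core of that DIY Lambda is a small pure function. The boto3 wiring around it (receive_message with a visibility timeout of 0 to sample, get_queue_attributes for the retention, put_metric_data to publish) is omitted here, and the timestamps are invented:

```python
from datetime import datetime, timezone

def min_remaining_ttl_s(sampled_sent_ts_ms, dlq_retention_s, now):
    """Smallest remaining TTL in seconds across a sample of messages.
    sampled_sent_ts_ms: SentTimestamp values (epoch-ms strings) sampled
    from the DLQ; dlq_retention_s: the DLQ's own MessageRetentionPeriod."""
    now_ms = now.timestamp() * 1000
    return min(
        (int(ts) + dlq_retention_s * 1000 - now_ms) / 1000
        for ts in sampled_sent_ts_ms
    )

# One sampled message, first enqueued a day before "now", in a DLQ
# with 4-day retention: 3 days of debugging time remain.
now = datetime.fromtimestamp(1767312000, tz=timezone.utc)
ttl = min_remaining_ttl_s(["1767225600000"], 4 * 24 * 3600, now)
print(ttl / 86400)  # 3.0
```

The alarm then fires on the published metric dropping below a threshold, rather than on queue depth.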

With DeadQueue: DeadQueue reads the original SentTimestamp on messages in your DLQs and tracks remaining TTL continuously. In the Friday-night scenario above, you’d have received a warning when those 47 payment messages arrived in the DLQ with 2 hours left: enough time to replay them before the window closed. Replay matters here because the message body is the webhook payload itself. Once SQS deletes it, you’re reconciling against provider logs that may not be complete. Having the original message means reprocessing correctly instead of reconstructing from incomplete data.

No Lambda to maintain, no per-queue configuration to manage, no sampling logic to debug.


If you have SQS DLQs in production, some of your messages are probably closer to expiration than you think. DeadQueue tracks the window and alerts you before it closes, not after.

Start monitoring your DLQs for free → connect your first 3 queues in under 2 minutes. See pricing.