The DLQ Graveyard: Why Dead Letter Queues Fill Up and Nobody Notices

March 18, 2026 · DeadQueue Team

Every SQS queue comes with an optional dead letter queue. You configure it during setup, point the main queue at it, set a max receive count, and move on. The DLQ is a safety net. Good engineering. The problem is safety nets don’t maintain themselves, and nobody actually watches them after they’re strung up.

Weeks later, the DLQ has 4,000 messages in it. Some of those messages failed because of a transient timeout six weeks ago. Others failed because a schema change broke deserialization and nobody caught it. The context for most of them is gone. The engineer who set up the queue is on a different team now, or left the company, or just doesn’t remember. You’re staring at a queue full of corpses with no tags, no runbook, and no idea where to start.

This is almost never a tooling problem. It’s an ownership problem. The tools exist: CloudWatch has the metrics, SQS has the retention settings, Lambda can replay. The gap is that nobody assigned the DLQ to a human being who is responsible for it. Fix the ownership, and the rest gets much easier.

Why Ownership Gets Dropped

DLQs are created as an afterthought. You’re configuring a queue, you see the dead letter section, you fill it in because you know you should, and you ship. The DLQ config is correct. The problem is that “configure the DLQ” and “own the DLQ” are treated as the same step, and they’re not.

Nobody’s pager fires specifically for DLQ ownership. Alerts, if they exist at all, usually go to a shared channel. Shared channels are where accountability goes to die. When something fails, the DLQ message lands in a queue that everyone is technically responsible for and nobody feels personally on the hook for. It’s everyone’s problem and therefore nobody’s.

The person who created the queue might be gone six months later. Even if they’re still around, they may have moved to a different service and no longer feel ownership. Teams rotate. Priorities shift. The DLQ keeps filling. There’s no forcing function that makes anyone stop and deal with it until the backlog is embarrassingly large or something downstream breaks.

Make DLQ Depth a First-Class Metric

If DLQ depth isn’t on a dashboard someone looks at every day, it will drift. That’s just what happens. Out of sight means out of mind, and SQS metrics are buried in CloudWatch by default. You have to go looking for them.

The floor here is a CloudWatch alarm on ApproximateNumberOfMessagesVisible greater than zero. Any message in a DLQ is a failed message that needs attention. Zero should be the steady state. When it’s not zero, someone should know immediately, not on the next quarterly review.

Here’s a minimal CloudFormation example:

Resources:
  OrderProcessingDLQDepthAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: order-processing-dlq-depth
      AlarmDescription: "Messages are accumulating in the order processing DLQ"
      MetricName: ApproximateNumberOfMessagesVisible
      Namespace: AWS/SQS
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: QueueName
          Value: !GetAtt OrderProcessingDLQ.QueueName
      AlarmActions:
        - !Ref DLQAlertTopic
      TreatMissingData: notBreaching

The key insight is that the metric needs to be visible by default, not buried three clicks into CloudWatch. Pin it to a dashboard. Put it next to the main queue depth metric. Treat a non-zero DLQ depth the same way you treat a non-zero error rate in your application. It’s a symptom, and symptoms need owners.

Set a Decay Policy

SQS has a maximum retention of 14 days. After that, messages expire automatically. Most teams set this and consider the problem solved. It’s not. The real issue is the accumulation that happens before day 14. You need a decay policy that forces decisions before messages become unrecoverable archaeology.

The metric you want here is ApproximateAgeOfOldestMessage. Set a separate alarm for this, independent from the depth alarm. Something like: if the oldest message in the DLQ is more than 24 hours old, alert. If it’s more than 72 hours old, escalate. The age alarm catches a different failure mode than the depth alarm. Depth tells you the queue is filling. Age tells you the queue is being ignored.
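A companion age alarm can be sketched in the same style as the depth alarm above. The 24-hour threshold, the alarm name, and the referenced queue and topic resources are illustrative; adjust them to your own escalation policy:

```yaml
  OrderProcessingDLQAgeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: order-processing-dlq-age
      AlarmDescription: "Oldest message in the order processing DLQ is over 24 hours old"
      MetricName: ApproximateAgeOfOldestMessage
      Namespace: AWS/SQS
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 86400        # 24 hours, in seconds
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: QueueName
          Value: !GetAtt OrderProcessingDLQ.QueueName
      AlarmActions:
        - !Ref DLQAlertTopic
      TreatMissingData: notBreaching
```

ApproximateAgeOfOldestMessage reports in seconds, so a 72-hour escalation tier is the same alarm with Threshold: 259200 and a higher-urgency topic.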

Aging out messages is actually fine, as long as you accept it explicitly. If your policy is “messages older than 7 days get archived to S3 and then discarded from the queue,” that’s a real policy. You know what you’re losing. Contrast that with the default: messages silently expire on day 14, nobody knows they were there, and if any of them were important, you’ll find out the hard way later. An explicit decay policy doesn’t prevent data loss. It makes data loss a deliberate choice instead of an accident.
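The decision at the heart of that policy can be kept as a tiny pure function, which a sweep job calls per message before copying the payload to S3 and deleting it from the queue. This is a minimal sketch under assumed thresholds; the 7-day window and the function names are illustrative, not a standard:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical decay policy: archive DLQ messages older than 7 days to S3,
# then delete them from the queue. The threshold is illustrative.
ARCHIVE_AFTER = timedelta(days=7)

def decay_action(sent_timestamp_ms: int, now: datetime) -> str:
    """Decide what to do with a DLQ message based on its age.

    sent_timestamp_ms is the message's SentTimestamp SQS attribute
    (epoch milliseconds). Returns "archive" when the message has aged
    past the policy window, otherwise "keep" so a human or a replay
    job can still act on it.
    """
    sent = datetime.fromtimestamp(sent_timestamp_ms / 1000, tz=timezone.utc)
    return "archive" if now - sent > ARCHIVE_AFTER else "keep"
```

A scheduled Lambda draining the DLQ would call this for each received message, write "archive" payloads to S3, and only then delete them, so nothing silently expires on day 14.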

Classify by Failure Type on Ingestion

Not all DLQ messages are the same. A message that failed because your downstream service was rate-limiting at 2 AM is a transient failure. It’s probably safe to replay after a delay. A message that failed because the payload is malformed JSON is a permanent failure. Replaying it will fail again. Sending the same alert for both types wastes attention and makes teams ignore the alerts.

Classification happens at ingestion, either via message attributes set by the producer or via a classification Lambda that reads the DLQ as messages land and applies a failure-type label. The classification logic doesn’t need to be perfect. Even a rough split between transient (throttles, timeouts, network errors) and permanent (schema violations, business logic rejections) is enough to apply different treatment.

Once classified, you can apply different TTLs and alert thresholds. Transient errors: alert after 1 hour of sitting in the DLQ, because they might be worth auto-replay once the downstream recovers. Permanent errors: alert immediately, because no amount of waiting will fix a malformed message, and someone needs to look at it now. This separation lets you build smarter automation: a replay function that only touches transient messages, a triage queue for permanent ones that routes to the right team.
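A rough classifier plus a per-class policy table might look like the sketch below. The marker strings and the policy values are assumptions for illustration, not an AWS convention; match them to whatever error text your producers actually emit:

```python
# Substrings that suggest a retry might succeed (transient) versus
# failures that will recur on every replay (permanent). Illustrative only.
TRANSIENT_MARKERS = ("timeout", "throttl", "rate limit", "connection reset", "503")
PERMANENT_MARKERS = ("schema", "deserialization", "malformed", "validation")

# Per-class treatment: how long a message may sit before alerting, and
# whether automated replay is allowed to touch it.
POLICY = {
    "transient": {"alert_after_seconds": 3600, "auto_replay": True},
    "permanent": {"alert_after_seconds": 0, "auto_replay": False},
}

def classify(error_message: str) -> str:
    """Rough split: transient failures may succeed on replay; permanent won't."""
    text = error_message.lower()
    if any(marker in text for marker in TRANSIENT_MARKERS):
        return "transient"
    if any(marker in text for marker in PERMANENT_MARKERS):
        return "permanent"
    # Unknown failures get the cautious treatment: alert now, never auto-replay.
    return "permanent"
```

An auto-replay function then checks POLICY[classify(err)]["auto_replay"] before touching a message, which is what keeps it away from malformed payloads.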

The Ownership Assignment Pattern

Every DLQ needs two tags before it goes to production: Owner and RunbookURL. These are not optional. A DLQ without an owner tag is a ticking clock, because when it fills up at 3 AM, the on-call engineer has to figure out whose problem this is before they can even start diagnosing the actual issue.

aws sqs tag-queue \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/order-processing-dlq \
  --tags '{
    "Owner": "payments-team",
    "OwnerEmail": "payments-oncall@yourcompany.com",
    "RunbookURL": "https://wiki.yourcompany.com/runbooks/order-processing-dlq",
    "Service": "order-processing",
    "Environment": "production"
  }'

The runbook URL is just as important as the owner. At 3 AM, the engineer who gets paged might not be the person who built the queue. The runbook should answer: what does this DLQ catch, what are the common failure modes, and what’s the procedure for each. “Check the logs and replay if safe” is a runbook. It doesn’t need to be elaborate. It needs to exist and be reachable from the alert.

Make these tags part of your queue creation checklist. Better yet, make them required in your IaC templates with a validation step. If a queue is missing Owner or RunbookURL, the deploy fails. That’s aggressive, but it’s the only way to make ownership non-optional. Soft rules get skipped under deadline pressure. Hard rules don’t.
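That validation step can be a short pre-deploy check. This is a sketch, assuming your pipeline can hand it a mapping of queue names to tag dicts; the function names are hypothetical, and the required tag keys follow the example above:

```python
# Pre-deploy gate: fail the deploy if any queue definition is missing
# the required ownership tags. Tag keys match the tagging example above.
REQUIRED_TAGS = ("Owner", "RunbookURL")

def missing_required_tags(tags: dict) -> list:
    """Return the required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]

def validate_queues(queues: dict) -> None:
    """queues maps queue name -> tag dict; raises SystemExit to block the deploy."""
    errors = {
        name: missing
        for name, tags in queues.items()
        if (missing := missing_required_tags(tags))
    }
    if errors:
        raise SystemExit(f"Deploy blocked, queues missing required tags: {errors}")
```

Run it in CI against the parsed IaC templates before anything reaches AWS; a non-zero exit is the hard rule doing its job.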

You Can Build This, or You Can Use Something Built for It

Building this yourself means writing Lambda functions to poll SQS, wiring up CloudWatch alarms for depth and age, building a dashboard that someone actually looks at, implementing the decay logic, and maintaining all of it as queues get added and removed. It’s doable. It’s also a non-trivial amount of undifferentiated infrastructure work that doesn’t move your product forward.

DeadQueue handles the depth and age monitoring, sends alerts with the context you actually need at 3 AM (which queue, how many messages, how old, who owns it, where’s the runbook), and doesn’t require you to wire up the plumbing yourself. The graveyard stays empty when someone is watching the gate. That’s the whole idea.