A downstream API was down for twenty minutes. Four thousand messages exhausted their retries and landed on the dead-letter queue. The API is back now, the messages are still good, and the obvious move is to put them back where they came from. So you reach for the broker console and start moving messages from orders.dlq to orders.
Stop there. A dead-letter queue is a holding area, not a graveyard — but moving messages out of it is a loaded operation, and “just move them back” is how you turn a recovered outage into a second incident.
Two ways a naive replay makes it worse
It re-poisons the queue. Your DLQ rarely holds only the outage’s victims. Mixed in are messages that died for a real reason — a malformed payload, a bug, an unroutable URN — and those will fail again the instant you replay them. If your replay just shovels everything back, the genuinely-broken messages fail, hit the DLQ again, and you replay them again. A loop that costs CPU, fills logs, and buries the messages that would actually succeed.
It re-fires side effects. A message that dead-lettered after three attempts may have partially run each time. The charge went through but the confirmation email threw; the message failed, retried, dead-lettered. Replay it naively and the handler runs start to finish again — a second charge, on top of the email it now finally sends. The queue moved a message; your customer got billed twice.
Both traps come from treating replay as “move bytes back.” It isn’t. It’s “reset a message and reprocess it, safely” — and each of those words is a safety catch.
Reset, but keep the identity
A dead-lettered message carries an extra block describing how it died — the reason, the original queue, the attempt count. To replay it you have to reset it: drop that block, and set the attempt counter back to zero so it gets a fresh retry budget instead of immediately re-exhausting it.
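As a minimal sketch of that reset, assuming the message is a plain dictionary: the field names here (`dead_letter` for the block describing how it died, `attempts` for the counter) are hypothetical and will differ per broker or envelope format.

```python
from copy import deepcopy


def reset_for_replay(dead: dict) -> dict:
    """Return a copy of a dead-lettered message that is ready to reprocess."""
    msg = deepcopy(dead)            # never mutate the copy still sitting on the DLQ
    msg.pop("dead_letter", None)    # drop the block describing how it died (assumed field name)
    msg["attempts"] = 0             # fresh retry budget (assumed field name)
    return msg                      # every other field is carried over unchanged


dead = {
    "id": "m-1",
    "trace_id": "t-9",
    "payload": {"order": 42},
    "attempts": 3,
    "dead_letter": {"reason": "timeout", "original_queue": "orders"},
}
replayable = reset_for_replay(dead)
```

Working on a copy means a failed replay attempt leaves the dead-lettered original intact.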
But reset is not the same as new. Preserve everything that identifies the message — its id, its payload, and above all its correlation id (trace_id). Two reasons:
- A replayed message you can’t trace is an operational blind spot. Keep the trace_id and the replay shows up in the same trace as the original failure.
- The preserved message id is what makes replay safe to retry. If your consumers are idempotent, deduping on the message id, then replaying a message that already half-succeeded is caught by the dedupe, not re-run. Idempotency and replay are partners: idempotency is what lets you replay a queue without holding your breath.
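One way to picture the dedupe is a ledger keyed on the preserved message id. The sketch below is an assumption, not a prescribed design: it dedupes per side effect (message id plus a step name) so that a half-succeeded message skips only the steps that already ran. `EffectLedger`, `run_once` and the in-memory set are hypothetical; a real system would need a durable store.

```python
class EffectLedger:
    """Remembers which (message id, step) pairs have already completed."""

    def __init__(self):
        self._done = set()  # in-memory stand-in for a durable dedupe store

    def run_once(self, message_id: str, step: str, effect) -> bool:
        key = (message_id, step)
        if key in self._done:
            return False        # already happened: skip it on retry or replay
        effect()                # may raise; the step is then NOT recorded
        self._done.add(key)
        return True


ledger = EffectLedger()
charges, emails = [], []


def handle(msg: dict, email_ok: bool = True) -> None:
    ledger.run_once(msg["id"], "charge", lambda: charges.append(msg["id"]))

    def send_email():
        if not email_ok:
            raise RuntimeError("mail server unavailable")
        emails.append(msg["id"])

    ledger.run_once(msg["id"], "email", send_email)


msg = {"id": "m-1", "trace_id": "t-9", "payload": {"order": 42}}
try:
    handle(msg, email_ok=False)   # original attempt: charge succeeds, email throws
except RuntimeError:
    pass
handle(msg)                       # replay with the same id: charge skipped, email sent
```

The dedupe only works because the replay kept the original id; a replay that minted a new id would charge again.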
Don’t aim it at production first
Even reset and idempotent, a replay is a write to a live system. Earn confidence in stages:
- Dry-run. Read the DLQ and report what would be replayed — counts, targets, reasons — and put every message back untouched. You learn the blast radius without firing a shot.
- Sandbox. Redirect the replay to a non-production queue whose consumers have their external side effects stubbed. The messages flow through real handler logic; nobody gets charged. This is where you find out the “good” messages are actually good.
- Then production, once the first two are boring.
The order matters more than any single step. Replaying straight to production because the messages “look fine” is the same confidence that moved them on the broker console in paragraph one.
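The first two stages can share one code path. Here is a rough sketch, assuming the DLQ is a plain list and `publish` is whatever function sends a message to a named queue; the `dry_run` and `sandbox_queue` parameters and the `dead_letter` field names are all hypothetical.

```python
from collections import Counter


def replay(dlq: list, publish, *, dry_run: bool = True, sandbox_queue: str = None) -> Counter:
    """Replay dead-lettered messages, or only report what would be replayed."""
    report = Counter()
    for msg in list(dlq):                      # iterate over a copy; dlq may shrink
        info = msg["dead_letter"]              # assumed field: how the message died
        target = sandbox_queue or info["original_queue"]
        report[(target, info["reason"])] += 1
        if dry_run:
            continue                           # leave the message on the DLQ untouched
        fresh = {k: v for k, v in msg.items() if k != "dead_letter"}
        fresh["attempts"] = 0                  # fresh retry budget
        publish(target, fresh)
        dlq.remove(msg)                        # only removed once it was published
    return report


dlq = [
    {"id": "m-1", "attempts": 3, "dead_letter": {"reason": "timeout", "original_queue": "orders"}},
    {"id": "m-2", "attempts": 3, "dead_letter": {"reason": "timeout", "original_queue": "orders"}},
]
sent = []
preview = replay(dlq, lambda queue, m: sent.append((queue, m)))  # dry-run by default
```

Defaulting `dry_run` to true is a deliberate choice in this sketch: the destructive path has to be asked for by name.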
Select; don’t shotgun
The outage’s victims share a signature — a window of time, a failure reason, a URN. Replay that set, not the whole queue. Selecting by reason is the difference between “replay the 4,000 that failed on a timeout” and “replay the 4,000 timeouts plus the 30 genuinely-poison messages that will just come back.” The poison messages aren’t your problem to replay; they’re your problem to fix, and they should stay on the DLQ where you can see them.
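Selection can be a simple filter over the dead-letter metadata. This sketch assumes each message records a failure reason and a failure timestamp; the `reason` and `failed_at` field names are hypothetical, and the times are invented for illustration.

```python
from datetime import datetime, timezone


def select_outage_victims(dlq: list, reason: str, start: datetime, end: datetime) -> list:
    """Pick only the messages that match the outage's signature."""
    return [
        m for m in dlq
        if m["dead_letter"]["reason"] == reason
        and start <= m["dead_letter"]["failed_at"] <= end
    ]


def at(minute: int) -> datetime:
    # Helper for the example data only: a time during a made-up outage hour.
    return datetime(2024, 1, 1, 12, minute, tzinfo=timezone.utc)


dlq = [
    {"id": "m-1", "dead_letter": {"reason": "timeout", "failed_at": at(5)}},
    {"id": "m-2", "dead_letter": {"reason": "malformed_payload", "failed_at": at(6)}},
    {"id": "m-3", "dead_letter": {"reason": "timeout", "failed_at": at(45)}},
]
victims = select_outage_victims(dlq, "timeout", at(0), at(20))
```

Everything the filter leaves out, including the malformed message, stays on the DLQ where it remains visible.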
The side effect you can’t reset away
Here is the honest hard part. Reset clears the envelope; idempotency dedupes exact replays; sandboxing protects production while you test. But a deliberate replay into production — the real one, the one you actually want — runs the handler, and the handler does what it does: it charges, it emails, it calls the third party. Idempotency stops a duplicate; it does not stop the intended reprocess from doing its job, side effects included.
The clean fix is to let a handler know it is running a replay so it can skip the external effects that already happened — re-run the database write, but don’t re-send the email. The catch is where to put that flag: a frozen message envelope has nowhere to add it. The answer is the same one distributed tracing uses for the same constraint — carry it out of band, as a transport header riding alongside the message, that the runtime surfaces to the handler. A replay-bypass marker the handler can check, set only on replayed messages, costing nothing on the normal path. It’s the most involved piece, and the right one to build last — after reset, dry-run, sandbox and select have made replay safe enough to be routine.
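A sketch of what the handler side might look like, assuming the runtime passes transport headers to the handler as a dictionary. The header name `x-replay`, the handler signature, and the dictionary and list standing in for the database and the mailer are all hypothetical.

```python
REPLAY_HEADER = "x-replay"  # hypothetical header name, set only by the replay tool


def handle_order(payload: dict, headers: dict, db: dict, outbox: list) -> None:
    is_replay = headers.get(REPLAY_HEADER) == "true"

    # The database write is an upsert here, so re-running it on replay is harmless.
    db[payload["order_id"]] = payload

    # External effect: skipped on replay because it may already have happened.
    if not is_replay:
        outbox.append(payload["order_id"])


db, outbox = {}, []
order = {"order_id": "o-1", "total": 30}
handle_order(order, {}, db, outbox)                       # normal delivery
handle_order(order, {REPLAY_HEADER: "true"}, db, outbox)  # replayed delivery
```

Because the marker lives in the headers, the message body and its id are identical on both paths, and a handler that never checks the header behaves exactly as before.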
A dead-letter queue is the most useful thing in your system on the worst day of the quarter. Treat replaying it like what it is — reprocessing live traffic — and it stays useful. Treat it like moving bytes, and the DLQ gets its sequel.
See also: the BabelQueue redrive-and-replay spec standardizes this reset-and-reprocess shape, and the dlq-redrive example is a runnable version of it.