Transactional Outbox: Dual-Write, At-Least-Once, and Idempotent Consumption

The symptom was small: one or two invoices a day sat in the database but the consumer never saw them. The user said “I issued it,” and there was no trace on the other side. Always at night, always during a deploy or a network blip. Once I laid the picture out, the cause was clear — the INSERT into the database and the publish to the queue were two separate systems, with no guarantee between them.

This post is the systems side of the transactional outbox pattern that closed that gap, the at-least-once repeats it opened in return, and finally encrypting the personal data that flows through the event at the field level. I wrote the decision-chain side as a log on muhammetsafak.com.tr; here I collect the decisions you have to make for the pattern to be durable in production.

Dual-write: you can’t stretch a transaction across two systems

The crux in one sentence: your local database transaction doesn’t cover RabbitMQ. The classic code is dangerous because it works most of the time:

DB::transaction(function () use ($invoice) {
    $invoice->save();                 // 1) write to MySQL
    $this->publishToRabbit($invoice); // 2) publish to the queue
});

There are two distinct failure modes:

save() succeeds, publish drops on a network error → the invoice is in the DB, the event isn’t. Lost event.
publish succeeds, then the transaction rolls back for another reason → the event went out, no counterpart in the DB. Phantom event.

DB::transaction doesn’t save you, because commit/rollback only wraps the MySQL side; the broker isn’t part of that transaction. The practical way to hold two systems in one atomic step is to reduce the write to a single system.

Outbox: write to one system first, let a separate process publish

The idea is plain: instead of pushing the message straight to the queue, write it as a row into an outbox table inside the same transaction. The invoice and the event either both exist or both don’t, in the same commit — dual-write collapses into a single write.

CREATE TABLE outbox (
    id            BINARY(16)   NOT NULL,        -- event id = idempotency key
    aggregate_id  BIGINT       NOT NULL,        -- ordering and partition key
    topic         VARCHAR(120) NOT NULL,
    payload       JSON         NOT NULL,
    status        ENUM('pending','published') NOT NULL DEFAULT 'pending',
    created_at    DATETIME(6)  NOT NULL,
    published_at  DATETIME(6)  NULL,
    PRIMARY KEY (id),
    KEY idx_dispatch (status, created_at)        -- the relay scan goes through this index
);

The write is no longer the business code’s concern:

DB::transaction(function () use ($invoice) {
    $invoice->save();
    Outbox::write('invoice.issued', $invoice->id, $this->payload($invoice));
});
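A minimal sketch of that single-commit behavior, in Python with an in-memory SQLite database standing in for MySQL. The table layout is simplified, and open_db, issue_invoice, and counts are hypothetical names used only for illustration:

```python
import json
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for MySQL; column types are simplified.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, total REAL NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id TEXT PRIMARY KEY, aggregate_id INTEGER NOT NULL, topic TEXT NOT NULL,"
        " payload TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def issue_invoice(conn, invoice_id, total, fail_after_write=False):
    # `with conn` commits if the block finishes and rolls back if it raises,
    # so the invoice row and the outbox row share one commit.
    with conn:
        conn.execute("INSERT INTO invoices (id, total) VALUES (?, ?)", (invoice_id, total))
        payload = json.dumps({"invoice_id": invoice_id, "total": total})
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, topic, payload) VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), invoice_id, "invoice.issued", payload),
        )
        if fail_after_write:
            raise RuntimeError("simulated failure before commit")


def counts(conn):
    # Returns (number of invoices, number of outbox events).
    invoices = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
    events = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return invoices, events
```

A failure after both inserts but before the commit leaves neither row behind, which is the property the direct publish could not offer.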

A separate relay does the queue push: it reads the pending rows, publishes them to RabbitMQ, and stamps them published on success. The critical fact here: the relay can fall into the “I published but crashed before stamping it published” state. So while the outbox solves dual-write, it gives you no free guarantee — it gives you at-least-once. The message isn’t lost, but it can repeat. Everything else arranges itself around this fact.

How do you feed the relay: polling or CDC?

The relay can see pending rows two ways.

Polling — the relay scans the table periodically:

SELECT id, topic, payload
FROM outbox
WHERE status = 'pending'
ORDER BY created_at
LIMIT 100;

Simple, works on every database, low operational overhead. It costs two things: latency up to the scan interval (a 1 s poll means up to ~1 s of queue latency) and a query that runs even when the table is idle. Without the idx_dispatch index this scan gets expensive as the table grows; even with it, you trade poll frequency against DB load. At small-to-medium volume, polling is the right answer: tune it frequent enough to keep latency acceptable, sparse enough not to hammer the DB (in practice 200 ms to 1 s).
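To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. It assumes events arrive uniformly in time and ignores query and publish duration; polling_cost is a hypothetical helper, not part of any library:

```python
def polling_cost(interval_ms: int) -> dict:
    """Rough cost of a fixed-interval poll, assuming uniformly arriving events."""
    return {
        "worst_case_latency_ms": interval_ms,       # event lands just after a scan
        "mean_latency_ms": interval_ms / 2,         # average wait until the next scan
        "scans_per_day": 86_400_000 / interval_ms,  # queries issued even when idle
    }
```

Under these assumptions a 200 ms poll issues 432,000 scans a day per relay against 86,400 for a 1 s poll, in exchange for cutting the worst-case added latency from 1 s to 200 ms.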

CDC (change data capture) — the relay listens to the database’s WAL/binlog rather than the table (Debezium is the typical tool). Every INSERT into outbox turns into an event almost instantly; polling latency and idle scanning disappear. The cost is operational: binlog access, a connector process, one more moving part. I turn CDC on when latency genuinely matters (sub-second) or when volume strains polling; otherwise I count polling’s simplicity as an advantage.

Rule: your latency budget and your operational budget conflict. CDC buys you latency and charges you an infrastructure piece in return. Start with polling; switch to CDC when there’s a measured reason.

Multiple relays: collision-free dispatch with `SKIP LOCKED`

A single relay is a bottleneck and a single point of failure. Running multiple relays over the same table creates a new risk: two of them grab the same row and publish the same message twice. The naive fix is to lock the table — which kills the parallelism.

The right tool is SELECT ... FOR UPDATE SKIP LOCKED. Each relay claims a batch of rows by locking them; instead of queuing behind rows another relay has locked, it skips them:

BEGIN;
  SELECT id, topic, payload
  FROM outbox
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 100
  FOR UPDATE SKIP LOCKED;     -- skip rows another relay holds, don't wait

  -- publish this batch, then:
  UPDATE outbox SET status = 'published', published_at = NOW(6)
  WHERE id IN (...);
COMMIT;

Without SKIP LOCKED, the relays queue for the same rows and the parallelism effectively runs serially. With it, each relay pulls a disjoint set and all of them move at once. PostgreSQL (9.5+) and MySQL (8.0+) support it.
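The claiming behavior can be pictured with a small in-memory analogy in Python. This is not how the database implements row locks; the shared locks dictionary and the claim_batch helper are assumptions made purely to show why the batches come out disjoint:

```python
def claim_batch(rows, locks, relay, limit):
    """Claim up to `limit` pending rows that no other relay holds, oldest first.

    rows  : list of (row_id, status) in created_at order
    locks : dict row_id -> relay name, shared by all relays (the "row locks")
    """
    claimed = []
    for row_id, status in rows:
        if len(claimed) == limit:
            break
        if status != "pending" or row_id in locks:
            continue  # skip locked rows instead of waiting on them
        locks[row_id] = relay
        claimed.append(row_id)
    return claimed
```

With six pending rows and a limit of two, a first relay claims rows 1 and 2, and a second relay calling right after it gets rows 3 and 4 rather than blocking on the first pair.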

A subtlety: if the relay crashes between publish and UPDATE ... published, the row stays pending and gets republished. That’s acceptable — we’re already at at-least-once; the fix is on the consumer side, below. The dangerous order is the reverse: stamping published first and then publishing. A crash there produces a lost event — exactly what we ran from. So the order is fixed: publish first, stamp second.
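The two orderings can be compared with a toy simulation in Python. It models a single row and a crash that lands exactly between the two steps; run_relay and after_restart are hypothetical names, and a real relay would of course work against the outbox table:

```python
def run_relay(order: str, crash_midway: bool):
    """Push one row through the relay; returns (published_messages, row_status).

    order is "publish_first" or "stamp_first". If crash_midway is true,
    the relay dies after the first step and never runs the second.
    """
    row = {"id": "e1", "status": "pending"}
    published = []

    def publish():
        published.append(row["id"])

    def stamp():
        row["status"] = "published"

    steps = [publish, stamp] if order == "publish_first" else [stamp, publish]
    steps[0]()
    if not crash_midway:
        steps[1]()
    return published, row["status"]


def after_restart(published, status):
    """A restarted relay republishes only rows still marked pending."""
    if status == "pending":
        return published + ["e1"], "published"
    return published, status
```

After a mid-step crash and restart, publish-first ends with the message sent twice (a duplicate the consumer can absorb), while stamp-first ends with the row marked published and nothing ever sent.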

Ordering: what you can guarantee, what you can’t

“Let events arrive in the order they were written” is an intuitive expectation, but global ordering is expensive and usually unnecessary. What you actually need is per-aggregate order: the same invoice’s created event should arrive before its updated; the order of two different invoices relative to each other is nobody’s concern.

You get this by partitioning on aggregate_id: events with the same aggregate_id go to the same queue partition (or, in RabbitMQ, the same queue via a consistent-hash exchange), and a single consumer processes that partition in order. Different aggregates flow in parallel.

outbox (in created_at order)
   │  hash(aggregate_id) % N
   ├── partition 0 ──► consumer-0   (events for aggregates A,D — ordered)
   ├── partition 1 ──► consumer-1   (events for aggregate B    — ordered)
   └── partition 2 ──► consumer-2   (events for aggregate C    — ordered)
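The routing in the diagram can be sketched in a few lines of Python. The choice of CRC32 as the hash is an assumption for illustration (any hash that is stable across processes works), and partition_for and route are hypothetical helpers:

```python
import zlib


def partition_for(aggregate_id: int, partitions: int) -> int:
    # crc32 gives the same result in every process and after every restart,
    # which a routing hash needs.
    return zlib.crc32(str(aggregate_id).encode("ascii")) % partitions


def route(events, partitions):
    """Group (aggregate_id, event_name) pairs by partition, preserving input order."""
    queues = {p: [] for p in range(partitions)}
    for aggregate_id, name in events:
        queues[partition_for(aggregate_id, partitions)].append((aggregate_id, name))
    return queues
```

Because every event of an aggregate lands in the same list in arrival order, the consumer of that partition sees created before updated for that aggregate, whatever happens on the other partitions.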

Two things still break ordering, and the design has to be ready for them: multiple relays under SKIP LOCKED can publish rows in a different order, and at-least-once repeats can cut in. So rather than leaning on a global order, you write the consumer to be resilient to out-of-order and duplicated events. The practical shield: put a monotonic version/sequence on the event and drop the stale one (lower version) at the consumer. If you genuinely need strict global ordering you drop to a single partition + single consumer — which then ties throughput to that one consumer; most systems don’t want to pay that.
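A minimal sketch of that version check, in Python. It assumes each event carries the full latest state of its aggregate and a per-aggregate version starting at 1; if events were deltas, dropping an older one would lose information. VersionedProjection is a hypothetical name:

```python
class VersionedProjection:
    """Keeps the latest state per aggregate and ignores stale or repeated versions."""

    def __init__(self):
        self.state = {}     # aggregate_id -> payload
        self.versions = {}  # aggregate_id -> highest version applied so far

    def apply(self, aggregate_id, version, payload) -> bool:
        if version <= self.versions.get(aggregate_id, 0):
            return False  # stale (lower) or duplicate (equal): drop it
        self.versions[aggregate_id] = version
        self.state[aggregate_id] = payload
        return True
```

If version 2 arrives before version 1, the late version 1 is dropped and the state stays at version 2; a repeated version 2 is dropped the same way.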

Once you have at-least-once: the consumer must be idempotent

Because the relay can republish, you have to assume every event arrives at least once, sometimes more. The fix isn’t to make the event unique — it’s to set up the consumer so that processing the same event twice has the same effect as once. Write the id of every processed event into a dedupe table, and do the work inside the same transaction:

DB::transaction(function () use ($event) {
    $inserted = DB::table('processed_messages')->insertOrIgnore([
        'message_id'   => $event['id'],
        'processed_at' => now(),
    ]);

    if ($inserted === 0) {
        return; // it's a repeat; produce no side effects
    }

    $this->applyBusinessEffect($event); // the real work — exactly once
});

The unique constraint on processed_messages.message_id is the heart of it: a second arrival of the same id is silently dropped by insertOrIgnore. Putting the business effect in the same transaction is mandatory — otherwise the “I processed it but crashed before stamping” gap opens. Where the dedupe key comes from, and designing a dedupe that survives a crash, is a topic of its own; I covered it in a separate note.
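The same shape can be tried end to end in Python, with in-memory SQLite standing in for the consumer's database. The ledger table is a hypothetical business effect, and INSERT OR IGNORE plays the role of insertOrIgnore:

```python
import sqlite3


def open_consumer_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed_messages ("
        " message_id TEXT PRIMARY KEY, processed_at TEXT NOT NULL)"
    )
    conn.execute("CREATE TABLE ledger (invoice_id INTEGER NOT NULL, amount REAL NOT NULL)")
    return conn


def handle(conn, event) -> bool:
    """Apply the event's effect at most once; returns False for a repeat."""
    with conn:  # dedupe row and business effect commit or roll back together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_messages (message_id, processed_at)"
            " VALUES (?, datetime('now'))",
            (event["id"],),
        )
        if cur.rowcount == 0:
            return False  # repeat: the primary key rejected the insert
        conn.execute(
            "INSERT INTO ledger (invoice_id, amount) VALUES (?, ?)",
            (event["invoice_id"], event["amount"]),
        )
        return True
```

If the business effect raises, the rollback also removes the dedupe row, so a later redelivery of the same event is processed rather than skipped.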

If the side effect is an external system (an e-invoice integrator, a payment API), dedupe alone isn’t enough — because the side effect is outside the transaction. There you have to carry the key to the external service: pass the same id as an idempotency key to the integrator, or ask “does this already exist?” before sending. I worked through that boundary in detail in the JIT reservation post, where sealing the number before the external call turns recovery into a lookup.
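Carrying the key across the boundary might look like the sketch below, in Python. FakeIntegrator is a hypothetical stand-in for an external service that honors idempotency keys; a real integrator's API, parameter names, and guarantees will differ and need to be checked against its documentation:

```python
class FakeIntegrator:
    """Hypothetical external API that returns the same result for a repeated key."""

    def __init__(self):
        self.documents = {}  # idempotency_key -> document number
        self.calls = 0

    def submit(self, idempotency_key, invoice):
        self.calls += 1
        if idempotency_key not in self.documents:
            self.documents[idempotency_key] = f"DOC-{len(self.documents) + 1}"
        return self.documents[idempotency_key]


def send_invoice(integrator, event):
    # Reuse the event id as the key so a redelivered event maps to the same request.
    return integrator.submit(idempotency_key=event["id"], invoice=event["payload"])
```

Sending the same event twice reaches the integrator twice but creates one document, because the decision about "already exists" is made on the side that owns the side effect.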

`processed_messages` can’t grow forever

The dedupe table accumulates a row per event; if you don’t prune it, it becomes a performance problem of its own. The key observation: you don’t need to keep an event id forever — only as long as the window in which a repeat could arrive. Relay retries and broker redeliveries are on the order of hours, not days.

In practice I pick a fixed retention window (say 7 days — comfortably above the longest possible redelivery) and prune older rows:

DELETE FROM processed_messages
WHERE processed_at < NOW() - INTERVAL 7 DAY
LIMIT 10000;   -- small batches at a time, keeping replica lag and locks in check

If message_id is a UUID, the table and its unique index grow by one wide key per event; pruning keeps that bounded. When choosing the window, frame the rule backwards: how late could the latest redelivery arrive? Make a safe upper bound of that your retention. Set the window too short and a late repeat misses the dedupe, so the side effect runs a second time.
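A batch-pruning loop could be sketched as follows, in Python against SQLite. Stock SQLite builds do not accept DELETE ... LIMIT, so the sketch limits through a rowid subquery instead; the prune helper and its defaults are assumptions that mirror the 7-day window and 10,000-row batch above:

```python
import sqlite3


def prune(conn, retention_days=7, batch_size=10_000):
    """Delete expired dedupe rows in small batches; returns the total removed."""
    total = 0
    while True:
        with conn:  # one short transaction per batch keeps locks brief
            cur = conn.execute(
                "DELETE FROM processed_messages WHERE rowid IN ("
                " SELECT rowid FROM processed_messages"
                " WHERE processed_at < datetime('now', ?) LIMIT ?)",
                (f"-{retention_days} days", batch_size),
            )
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total  # a short batch means nothing expired is left
```

Run from a scheduler, each call drains whatever has aged out since the last run without holding one long delete over the whole table.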

Why exactly-once is usually an illusion

In a distributed system “exactly-once delivery” is tempting but can’t be guaranteed end to end. The reason is the same dual-write, at the broker boundary: if the consumer did the work but crashed before getting its ack to the broker, the broker redelivers the message — because it never saw the ack. “I processed it” and “I reported that I processed it” are two separate steps, and a crash can fall between them. This is the practical face of the Two Generals problem.

So the goal isn’t exactly-once delivery, it’s exactly-once effect. And you get that by combining two cheap guarantees:

at-least-once delivery  +  idempotent consumer  =  effectively-once

That is, you stop trying to deliver exactly once and instead make the repeat harmless. Under systems that market “exactly-once,” this is usually exactly what’s there: at-least-once plus a dedupe layer. Look for the guarantee in the effect, not in the delivery.
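The equation can be watched in a toy simulation in Python. The broker here simply redelivers until it sees an ack, and the consumer crashes after doing the work on the first attempt; the in-memory seen set stands in for a durable dedupe table, and all names are hypothetical:

```python
def deliver_until_acked(message, consumer, max_attempts=5):
    """Toy broker: redeliver the same message until the consumer acks it."""
    for attempt in range(1, max_attempts + 1):
        try:
            consumer(message, attempt)
            return attempt  # ack received on this attempt
        except ConnectionError:
            continue  # no ack seen, so redeliver
    raise RuntimeError("gave up after max_attempts")


class Consumer:
    def __init__(self, idempotent: bool):
        self.idempotent = idempotent
        self.seen = set()  # stand-in for a durable dedupe table
        self.effects = 0   # how many times the business effect ran

    def __call__(self, message, attempt):
        if not (self.idempotent and message["id"] in self.seen):
            self.seen.add(message["id"])
            self.effects += 1
        if attempt == 1:
            # The work is done, but the ack never reaches the broker.
            raise ConnectionError("crashed before the ack reached the broker")
```

Both consumers receive the message twice; the naive one applies the effect twice, the idempotent one applies it once. Delivery is identical, only the effect differs.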

Personal data flowing through the event: field-level encryption

Once the outbox settles in, a new surface appears: the events’ payload carries personal data — customer name, tax ID, address — and that payload now sits in a durable table (outbox) and passes through the broker. If I leave the data as-is, I’m copying GDPR/KVKK-scoped fields, unencrypted, into more than one place.

I didn’t want to encrypt the whole payload — I needed fields like topic, aggregate_id, and invoice_id in the clear for relay routing and observability. The need is field-level encryption: only the sensitive fields encrypted, the rest in the clear. I marked the sensitive fields in the schema with x-gdpr-sensitive and encrypted only those at serialization time:

$schema = [
    'invoice_id'    => ['x-gdpr-sensitive' => false],
    'customer_name' => ['x-gdpr-sensitive' => true],
    'tax_id'        => ['x-gdpr-sensitive' => true],
    'total'         => ['x-gdpr-sensitive' => false],
];

$payload = collect($raw)->map(fn ($value, $field) =>
    ($schema[$field]['x-gdpr-sensitive'] ?? false)
        ? Crypt::encryptString((string) $value)  // only the sensitive field is encrypted
        : $value
)->all();

That the schema declares the sensitive field is no accident — combine it with schema validation at the edge of the queue and both questions, “which field is personal data, and is it the right type?”, get answered at the edge. The real engineering decision isn’t the encryption itself, it’s the three questions around it:

Key management and rotation. Laravel’s Crypt uses AES-256 (AES-256-CBC by default) keyed by APP_KEY. Bind yourself to a single key and rotation becomes a nightmare: the new key can’t open outbox rows written with the old one. The general fix is to store a key version (key id) alongside the encrypted value and keep a keyring that recognizes several keys at once: new writes use the current key, old values decrypt with their own version. Laravel 11 and later ship a simpler variant of this: key + previous_keys in config/app.php. No key id is stored; encryption always uses the current key, and decryption tries the current key first and then the previous keys in turn. That’s why you can rotate without migrating data.

Searching an encrypted field. Crypt::encryptString produces different ciphertext on every call (a random IV), which is correct for security but makes WHERE tax_id = ? impossible. If you need equality search you add a blind index: keep a deterministic HMAC of the searchable field in a separate column and query that. The encrypted column carries confidentiality, the blind index carries searchability; they’re separate columns because they do two separate jobs.

Losing the key is losing the data. Field encryption turns APP_KEY into an availability dependency: lose the key and the encrypted fields are permanent garbage. Key backup and access control are as much a part of the design as the encryption itself.
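For the blind index mentioned above, a minimal sketch in Python using only the standard library. The normalization step and the use of a dedicated index key, separate from the encryption key, are assumptions of this sketch; blind_index is a hypothetical helper:

```python
import hashlib
import hmac


def blind_index(value: str, index_key: bytes) -> str:
    """Deterministic keyed hash of a normalized value, for equality lookups only."""
    # Normalize first so " ab123 " and "AB123" hit the same index entry.
    normalized = value.strip().upper()
    return hmac.new(index_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
```

The result goes into its own indexed column; a lookup computes blind_index on the search term and compares that column, never the ciphertext. Because the output is deterministic, equal values are visibly equal in the index column, which is the price of searchability.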

When do you not build this pattern?

Outbox + idempotent consumption + field encryption isn’t cheap: it brings a table, a relay process (or a CDC connector), a dedupe table and its pruning, plus key management. If you don’t know what you’re buying in return, don’t pay for it. I don’t build it when:

Event loss is tolerable. If what you publish is a best-effort notification (cache invalidation, a “new content available” signal) and losing one or two doesn’t matter, a direct publish is enough. You build the outbox because loss matters.
The write and the publish are the same system. If your target is the same database (there’s no separate broker), there’s no dual-write; you don’t need an outbox.
Volume fits a single consumer and ordering is critical. At low volume a single-consumer queue gives you both order and uniqueness simply; building the multi-relay-and-partition machinery is solving a problem you don’t have.

The need for the pattern starts when three conditions arise together: event loss is unacceptable, production/consumption is parallel, and the data goes to a separate system (a broker). If any of the three is missing, there’s a plainer solution.

The three pieces come down to one discipline: write the data to a single system, suppress repeats at the consumer, seal the sensitive field before it leaves. The outbox makes trust in the queue unnecessary, idempotent consumption makes trust in delivery unnecessary, and field encryption makes trust in the payload unnecessary — each moves a guarantee out of the code and into the design.