Race Conditions and Gaps in Sequential Numbering: JIT Reservation

The error coming back from the official integrator was short: Duplicate / Already Exists. Two workers had produced the same invoice number in the same second, both sent it, and one got rejected. Invoice issuing stalled on that series, and every job behind it waited.

I’ve built this pattern the same way in several systems that number legal documents. Each time, the same two constraints collide and every obvious solution breaks. Below is both what breaks and the solution I use today — JIT reservation.

Two constraints at once

What makes legal document numbering (e-invoices, receipts, dispatch notes) hard is that two constraints must hold together:

Unique — the same number can’t be used twice.
Gap-free — no skipped numbers in the series. If 1454 is missing in an audit, the question “where did it go?” has to be answerable.

Each constraint alone is easy. AUTO_INCREMENT handles uniqueness; a counter handles gap-freeness. But the moment the number is sent to an external system that can fail (an integrator, the tax authority’s clearance API — GİB, in our case), holding both at once turns from a locking problem into a timing problem.

The shape of the system — and the migration that exposed it

This problem actually surfaced in the tail of an architectural migration. The flow used to be synchronous: the user clicks “Issue invoice” in the panel, and the invoice is issued inside that request while the user waits for the result on screen. In that model the numbering “worked” for years — because each request was effectively serialized by the user’s blocking wait; real concurrency never happened.

Then I decoupled the flow. Today different projects and frameworks publish their invoices to a shared RabbitMQ queue; a separate consumer/worker microservice continuously listens, consumes the messages, converts them to the e-invoice format, and forwards them to the integrator.

Project A (framework X) ──┐
Project B (framework Y) ──┤
Scheduler (cron)        ──┼──►  [ shared RabbitMQ ]  ──►  Consumer microservice (N workers)  ──►  Integrator API ──► Tax authority
User action ("issue")   ──┘

The button is no longer a synchronous call that issues the invoice inline. The boundary between HTTP and the queue runs exactly here because the external API call can take seconds, and I didn’t want to hold it inside a request-response cycle. The result is an event-driven shape: scalable, resilient, and independent of the producers.

But that decoupling exposed something that had been hidden for years. The synchronous model was serializing the numbering by accident; the moment I ran the consumer concurrently, that accident was gone and a race condition that had always been latent surfaced. The bug was small and easy to miss — precisely because it only appears once concurrency is real. From here on, the whole design knots around a single question: where, and when, does the number come from?

A migration doesn’t create a concurrency bug; it reveals the one a serial execution model was hiding.

Naive solution 1 — `MAX()` at send time (race condition)

When I first built it, I generated the number at send time:

SELECT MAX(invoice_no) + 1 FROM outgoing_invoices WHERE series = 'A';

With a single worker it worked flawlessly. The day I turned on a second worker, it broke:

Worker-1: SELECT MAX(...) → 1453
Worker-2: SELECT MAX(...) → 1453   (Worker-1 hasn't INSERTed yet)
Worker-1: sends 1454 to the API  ✓
Worker-2: sends 1454 to the API  ✗ Duplicate

A classic read-modify-write race condition. The window between the MAX() read and the write is wide enough for two workers to see the same value. As volume grows, more reads land inside that window and slower writes stretch it, so the problem gets worse under load, at the worst possible time.
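The interleaving in the trace above can be replayed deterministically. This is a minimal sketch in Python with an in-memory SQLite database standing in for the real one; the helper names `read_next` and `write` are made up for illustration, and the two "workers" are simply two reads issued before either write.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE outgoing_invoices (series TEXT, invoice_no INTEGER)")
db.execute("INSERT INTO outgoing_invoices VALUES ('A', 1453)")


def read_next(series):
    """The naive read half: MAX()+1 with no lock (hypothetical helper)."""
    (n,) = db.execute(
        "SELECT MAX(invoice_no) + 1 FROM outgoing_invoices WHERE series = ?",
        (series,),
    ).fetchone()
    return n


def write(series, n):
    """The write half, which in the real flow happens after the send."""
    db.execute("INSERT INTO outgoing_invoices VALUES (?, ?)", (series, n))


# Both reads happen before either write, exactly as in the trace.
worker_1 = read_next("A")
worker_2 = read_next("A")
write("A", worker_1)
write("A", worker_2)

print(worker_1, worker_2)  # 1454 1454
```

Nothing in the read path knows that another reader exists, so no amount of care on the write side can repair the collision afterwards.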

Naive solution 2 — generate the number upfront (gap)

My second reflex was the same as most developers’: generate and seal the number as early as possible, when the message goes onto the queue. The duplication stops, but a gap appears:

Invoice 1454 gets its number at enqueue and waits in the queue.
It hits a permanent error at the integrator (invalid tax ID, a business-rule rejection, a timeout that outlives its retries), or the user cancels while it’s still queued.
That invoice drops; the one behind it carries on with 1455.
A 1454 hole is left in the series — a gap you can’t account for in an audit.
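The steps above can be reproduced in a few lines. This is only an illustrative Python sketch: the invoice ids, the in-memory queue, and the `send` stand-in (which rejects one invoice permanently) are all assumptions, not the real integrator.

```python
from itertools import count

counter = count(1454)  # hypothetical series counter: next free number

# The number is sealed at enqueue time, before anything is sent.
queue = [{"id": name, "no": next(counter)} for name in ("inv-a", "inv-b", "inv-c")]


def send(msg):
    """Stand-in integrator: inv-a is rejected permanently."""
    return msg["id"] != "inv-a"


issued = [msg["no"] for msg in queue if send(msg)]

# Every number from the first one handed out to the last one issued
# should exist; anything missing is an unexplained hole.
first_handed_out = min(msg["no"] for msg in queue)
gaps = [n for n in range(first_handed_out, max(issued) + 1) if n not in issued]

print(issued)  # [1455, 1456]
print(gaps)    # [1454]
```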

That was the dilemma: generate late and you get the race, generate early and you get the gap. The point wasn’t to swap one method for the other; it was to hold both.

Approach	Unique	Gap-free	Failure mode
`MAX()+1` at send time	✗	✓	Collision under parallel workers
Generate at enqueue	✓	✗	Failure/cancel → gap
JIT reservation	✓	✓	— (cost below)

Why a built-in sequence isn’t enough

I said “a counter handles gap-freeness” — but the database’s own sequence (SEQUENCE, AUTO_INCREMENT) doesn’t; it’s designed to produce gaps. A sequence advances independently of the transaction for performance: even if the transaction that called nextval rolls back, the consumed value doesn’t come back. In PostgreSQL a failed INSERT’s id is permanently skipped; MySQL’s AUTO_INCREMENT behaves the same way. In most systems this is the desired behavior — nobody cares about holes in a primary key. In ours it’s exactly what we’re avoiding. So I manage gap-freeness in the application layer, with an explicit counter table.

The second objection: “won’t a single-row counter table become a bottleneck on its own?” It won’t, because that row’s lock is held only for the short transaction around UPDATE ... SET next_no = next_no + 1, on the order of milliseconds, not for the external call. And if load is genuinely high, splitting the series (A, B, C… as separate counters) reduces contention roughly in proportion to the number of series, provided the load spreads evenly; each series gets its own gap-free run and the series don’t wait on each other.

JIT reservation: sealing state one ms before the external call

The solution I landed on: generate and persist the number neither too early nor too late, but exactly one moment before the external API call. I added two parts:

An atomic sequence counter (invoice_sequences) — advances the series from a single source.
A reservation column (reserved_no) on the draft invoice — where the generated number is preserved even if the external call fails.

I hold the lock only for the reservation; the external call is outside it:

│── BEGIN TX ──────────────────│
│  lockForUpdate the invoice   │
│  if reserved_no is empty:    │   ← row lock held only here (single-digit ms)
│    sequence++ → reserved_no  │
│    status = SENDING          │
│── COMMIT ────────────────────│
                                └──► Integrator API call (200ms–2s, NO lock)
                                      ├─ success → INSERT into outgoing_invoices
                                      └─ failure → reserved_no STAYS, status = ERROR

I advance the counter atomically with a row lock:

BEGIN;
  -- The value read here is the number this worker takes.
  SELECT next_no FROM invoice_sequences WHERE series = 'A' FOR UPDATE;
  UPDATE invoice_sequences SET next_no = next_no + 1 WHERE series = 'A';
COMMIT;  -- the lock releases within ms; the next worker waits only that long

On the worker side there’s a single check that makes the reservation idempotent:

$invoice = DB::transaction(function () use ($invoiceId) {
    // Row lock on the draft invoice; held only until this closure returns.
    $invoice = Invoice::whereKey($invoiceId)->lockForUpdate()->firstOrFail();

    // On retry the number is PRESERVED, not regenerated.
    if ($invoice->reserved_no === null) {
        $invoice->reserved_no = $this->sequence->next('A'); // counter behind FOR UPDATE
        $invoice->status      = 'SENDING';
        $invoice->save();
    }

    return $invoice; // hand the model out of the closure's scope
}); // TX commits here; the lock releases

$this->integrator->send($invoice); // can take seconds; we hold no lock

The critical line is if ($invoice->reserved_no === null). When an invoice errors and gets retried, the column is already populated; I don’t generate a new number, I keep the existing one. No matter how many times the same invoice is retried, it consumes exactly one number. This is the half that closes the gap: a failed invoice doesn’t lose its number, it keeps it pinned and retries with the same one.
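The whole reserve-then-send flow, including a failed attempt and its retry, fits in a small self-contained sketch. It is written in Python against in-memory SQLite rather than Laravel so that it runs on its own; the `reserve` and `process` helpers, the `DRAFT` and `SENT` status names, and the lambda integrators are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:", isolation_level=None)
db.executescript("""
    CREATE TABLE invoice_sequences (series TEXT PRIMARY KEY, next_no INTEGER);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, series TEXT,
                           reserved_no INTEGER, status TEXT);
    INSERT INTO invoice_sequences VALUES ('A', 1454);
    INSERT INTO invoices VALUES (1, 'A', NULL, 'DRAFT'), (2, 'A', NULL, 'DRAFT');
""")


def reserve(invoice_id):
    """Short transaction: assign a number only if the invoice has none yet."""
    db.execute("BEGIN IMMEDIATE")
    series, reserved = db.execute(
        "SELECT series, reserved_no FROM invoices WHERE id = ?", (invoice_id,)
    ).fetchone()
    if reserved is None:  # the idempotency check
        (reserved,) = db.execute(
            "SELECT next_no FROM invoice_sequences WHERE series = ?", (series,)
        ).fetchone()
        db.execute(
            "UPDATE invoice_sequences SET next_no = next_no + 1 WHERE series = ?",
            (series,),
        )
        db.execute(
            "UPDATE invoices SET reserved_no = ?, status = 'SENDING' WHERE id = ?",
            (reserved, invoice_id),
        )
    db.execute("COMMIT")  # lock released before any external call
    return reserved


def process(invoice_id, send):
    number = reserve(invoice_id)
    ok = send(number)  # external call, outside any transaction
    db.execute(
        "UPDATE invoices SET status = ? WHERE id = ?",
        ("SENT" if ok else "ERROR", invoice_id),
    )
    return number, ok


first = process(1, lambda n: False)  # integrator fails
other = process(2, lambda n: True)   # another invoice goes through meanwhile
retry = process(1, lambda n: True)   # retry of invoice 1

print(first, other, retry)  # (1454, False) (1455, True) (1454, True)
```

Invoice 1 fails, invoice 2 takes 1455 in the meantime, and the retry of invoice 1 still goes out as 1454.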

Why early commit is non-negotiable

The heart of the pattern is the early commit: the transaction closes before the external call starts. Get it wrong and the whole pattern turns into a bottleneck.

The naive reflex is to wrap everything in a single transaction “to be atomic”: take the lock, generate the number, call the external API, then commit. This is a disaster. The integrator call takes 200ms to 2 seconds; if the row lock is held that whole time, every worker waiting on the same series queues up behind it. Throughput per series collapses to the latency of a single external call: at 200ms to 2 seconds per call, that is somewhere between five invoices a second and one every two seconds, for a system that should issue tens of invoices a second.

So I hold the lock only for the reservation and commit immediately. The lock releases in single-digit milliseconds; parallel workers go on reserving 1454, 1455, 1456, and so on. The slowness of the external call no longer blocks anyone.

Rule: a lock is held for DB state, not for external I/O. If there’s a network call inside a transaction, your lock duration is at the mercy of that network.
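A back-of-envelope model makes the difference concrete. The Python sketch below is an upper-bound estimate only, under assumed figures: a 5 ms reservation, 20 workers, and call latencies at the two ends of the 200ms to 2s range. The function name and parameters are hypothetical.

```python
def max_throughput(call_latency_s, workers, lock_held_during_call, reserve_s=0.005):
    """Rough upper bound on invoices per second for one series."""
    if lock_held_during_call:
        # Every worker serializes on the row lock for the whole call.
        return 1 / (reserve_s + call_latency_s)
    # Only the reservation is serialized; the calls overlap across workers.
    return min(1 / reserve_s, workers / (reserve_s + call_latency_s))


for latency in (0.2, 2.0):
    held = max_throughput(latency, workers=20, lock_held_during_call=True)
    early = max_throughput(latency, workers=20, lock_held_during_call=False)
    print(f"{latency}s call: lock held {held:.1f}/s, early commit {early:.1f}/s")

# 0.2s call: lock held 4.9/s, early commit 97.6/s
# 2.0s call: lock held 0.5/s, early commit 10.0/s
```

With the lock held across the call, adding workers changes nothing; with the early commit, throughput scales with the worker count until the reservation itself becomes the limit.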

Keeping multiple producers safe

Because multiple producers (the scheduler, the user button, other projects) feed the same series, the number must be reserved from a single point. They all pass through the same reserved_no check at the consumer stage. Whoever the producer is, the reservation source is one; number generation depends not on the producer but on the single piece of consumer code. Generating the number at the producer (at enqueue time) would drop us straight back into the “generate early → gap” trap above; that’s why the reservation lives in exactly one place, the consumer.

The second concern is queue starvation. The thousands of messages the scheduler pushes in bulk must not bury a live user’s ad-hoc request — otherwise the user waits behind the tail of a queue with 4,000 batch jobs in front. With a priority queue in RabbitMQ I slot user requests in at high priority; while batch work flows in the background, button-triggered work drops the loading screen from seconds to milliseconds. Priority doesn’t change correctness — the number still comes from a single source — it only adds fairness. (Why and how the queue itself stalls is a separate topic: why Laravel queues slow down in production.)
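The effect of priority on ordering can be shown without a broker. This is a toy Python model using `heapq`, not RabbitMQ itself; in RabbitMQ the equivalent is a queue declared with the `x-max-priority` argument and messages published with a `priority` property. The priority levels and payload names below are made up.

```python
import heapq
from itertools import count

BATCH, USER = 0, 9  # hypothetical priority levels; higher is served first
arrival = count()
queue = []


def publish(priority, payload):
    # heapq is a min-heap, so the priority is negated;
    # the arrival counter keeps FIFO order within one priority.
    heapq.heappush(queue, (-priority, next(arrival), payload))


for i in range(4000):
    publish(BATCH, f"batch-{i}")
publish(USER, "user-click")  # arrives last, behind 4,000 batch jobs

first = heapq.heappop(queue)[2]
second = heapq.heappop(queue)[2]
print(first, second)  # user-click batch-0
```

The order in which messages are consumed changes; the order in which numbers are reserved is still decided by the counter at the consumer.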

The dual-write problem and orphan recovery

The nastiest edge case is left. The API call succeeds, but while writing its response to the local DB the worker crashes or the network drops. At the integrator the invoice is legally issued; on your side it’s still in SENDING, with its number pinned. This is the classic dual-write problem: you can’t hold two separate systems in a single atomic operation. A worker shutting down mid-flight during a deploy or scale-down leaves the same trace — graceful shutdown reduces it but doesn’t eliminate it; the design still has to be able to clean it up.

The fix is an orphan recovery service that finds what’s left hanging. It scans for invoices stuck in SENDING for more than 5 minutes and doesn’t fire a blind retry — it first runs an idempotency check at the integrator:

Did an invoice with this number actually reach the integrator?
If yes: it repairs the local DB (writes to outgoing_invoices, closes the status). It does not generate a new number.
If no: it safely retries with the same reserved_no.

What makes this possible is that the number was sealed before the call. Because the number is in hand, the query is deterministic: “is this number over there?” Had the number been generated during the call, you wouldn’t even know what to ask. JIT reservation turns recovery from guesswork into a lookup.
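The decision the recovery service makes per invoice can be sketched as one function. This is a hedged Python outline, not the production service: `exists_at_integrator` and `resend` are hypothetical stand-ins for the integrator's lookup and send calls, and the `SENT` status name is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=5)


@dataclass
class Invoice:
    id: int
    reserved_no: int
    status: str
    updated_at: datetime


def recover(invoice, now, exists_at_integrator, resend):
    """Return the action taken for one invoice."""
    if invoice.status != "SENDING" or now - invoice.updated_at < STUCK_AFTER:
        return "skipped"  # not an orphan (yet)
    if exists_at_integrator(invoice.reserved_no):
        invoice.status = "SENT"  # repair local state; no new number
        return "repaired"
    ok = resend(invoice.reserved_no)  # retry with the SAME number
    invoice.status = "SENT" if ok else "ERROR"
    return "resent"


now = datetime(2024, 1, 1, 12, 0)
at_integrator = {1454}  # numbers the integrator already holds
invoices = [
    Invoice(1, 1454, "SENDING", now - timedelta(minutes=10)),  # call succeeded, local write lost
    Invoice(2, 1455, "SENDING", now - timedelta(minutes=10)),  # never arrived
    Invoice(3, 1456, "SENDING", now - timedelta(minutes=1)),   # still in flight
]
actions = [
    recover(inv, now, at_integrator.__contains__, lambda n: True)
    for inv in invoices
]
print(actions)  # ['repaired', 'resent', 'skipped']
```

Every branch is keyed on `reserved_no`, which is why the lookup is only possible when the number was sealed before the call.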

One note: if the integrator is failing permanently (not transiently), the orphan service must not retry forever. To stop hammering the external service I put a circuit breaker in front of it; the number is preserved, the invoice waits in ERROR, but hundreds of pointless requests a second don’t go to the integrator.
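A circuit breaker of the kind mentioned here can be very small. The Python sketch below is one common shape (closed, open, half-open after a cooldown), offered as an illustration rather than the implementation in use; the thresholds and the class itself are assumptions.

```python
class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_s=60.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now):
        """May a request go to the integrator right now?"""
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe through (half-open).
        return now - self.opened_at >= self.cooldown_s

    def record(self, ok, now):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now  # open, or re-open after a failed probe


breaker = CircuitBreaker()
calls = 0
for t in range(10):  # ten retry attempts, one second apart, integrator down
    if breaker.allow(now=float(t)):
        calls += 1
        breaker.record(ok=False, now=float(t))

print(calls)                    # 3
print(breaker.allow(now=62.0))  # True
```

While the breaker is open the invoice keeps its reserved number and simply waits; only the outbound requests are suppressed.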

When do you drop this pattern?

This machinery isn’t cheap — it brings a reservation column’s lifecycle, an atomic counter, an orphan service, and a mandatory idempotency check at the integrator. If you don’t know what you’re buying in return, don’t pay for it. I don’t build it when:

Gaps are allowed. If your number isn’t legally required to be gap-free (an internal reference, an order number, a log id), a plain DB sequence or AUTO_INCREMENT is enough. Building this mechanism to chase gaps is solving a problem you don’t have.
There’s no external call. If number generation and persistence can stay in the same transaction (no failure-prone I/O in between), you don’t need the reservation column; a single FOR UPDATE counter ends the problem.
Single worker, low volume. No parallelism, no race condition. A global advisory lock or a single-consumer queue is far simpler than the JIT machinery, and sufficient.

The need for the pattern starts when three conditions arise together: gap-freeness is a legal requirement, production is parallel, and the last step is an external call that can fail. If any one of the three is missing, there’s a plainer solution; pick it.

Generating sequential, gap-free numbers is less a locking problem than a timing one. You seal the number one moment before it leaves for the outside world: late enough to prevent the race, early enough to prevent the gap. Everything else — early commit, idempotent retry, orphan recovery — arranges itself around that single seal.