Realtime Recovery, Part 3 — Replay Without Stopping Live Traffic

Part 3 of a four-part series: Realtime Is Easy. Recovery Is The System.

Part 1 argued that replay is the core mechanism, not the emergency procedure. Part 2 gave you finalized data you can trust to reproduce. This part is the hard operational question that follows: how do you replay history into a system that is still serving live traffic — without a maintenance window, without double-counting, and without the two flows corrupting each other?

The naive answer is "stop the world, replay, restart." That's a maintenance window, and maintenance windows are exactly the operational debt this series is trying to avoid. The goal is replay as a background concurrent activity against a live system. That requires getting four things right.

1. Idempotent writes are the foundation

If you take one thing from this part: a replayed event must produce the same final state as the original event, no matter how many times it's applied.

Replay means the same event will be processed more than once — that's not an edge case, it's the definition. A reconnect replays the gap, which overlaps with events you already had. A backfill reprocesses windows that live traffic already touched. If applying an event twice counts it twice, replay is destructive: the act of recovering corrupts the thing you're recovering.

Idempotency is what makes replay safe to run casually, concurrently, and as often as you like. It turns "did this already get processed?" from a question you must answer correctly (and will eventually get wrong) into a question that doesn't matter, because applying it again is a no-op.

Two broad ways to get there, and they apply at different layers:

Idempotent inserts for raw/event-level data — keyed on a natural or derived identity so a re-insert is absorbed rather than duplicated.
Idempotent aggregation for derived data — either recompute the bucket from its inputs (naturally idempotent: same inputs, same bucket) or accumulate in a way that can detect and skip already-seen contributions.
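
To make the difference concrete, here is a minimal Python sketch contrasting a naive counter with an accumulator that skips already-seen contributions. The class names and the (event_id, amount) event shape are assumptions made for illustration; they do not come from any particular store or library.

```python
from dataclasses import dataclass, field


@dataclass
class NaiveTotal:
    """Non-idempotent: every application adds, so replay double-counts."""
    total: int = 0

    def apply(self, event_id: str, amount: int) -> None:
        self.total += amount


@dataclass
class IdempotentTotal:
    """Idempotent: remembers which event IDs were already folded in."""
    total: int = 0
    seen: set = field(default_factory=set)

    def apply(self, event_id: str, amount: int) -> None:
        if event_id in self.seen:
            return  # already counted; applying again is a no-op
        self.seen.add(event_id)
        self.total += amount


live = [("e1", 10), ("e2", 5), ("e3", 7)]
replayed_gap = [("e2", 5), ("e3", 7), ("e4", 1)]  # overlaps e2 and e3

naive, safe = NaiveTotal(), IdempotentTotal()
for event_id, amount in live + replayed_gap:
    naive.apply(event_id, amount)
    safe.apply(event_id, amount)

print(naive.total)  # 35: e2 and e3 were counted twice
print(safe.total)   # 23: the overlap was absorbed
```

In a real system the "seen" set would more likely live in the store itself (for example as the unique key on the raw events table) than in process memory; the in-memory set here only keeps the sketch self-contained.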

2. `DO NOTHING` vs `DO UPDATE` — choosing your conflict semantics

When a write collides with an existing row, you have two fundamentally different intentions, and conflating them is a common, expensive bug. In SQL terms it's ON CONFLICT DO NOTHING versus ON CONFLICT DO UPDATE, but the distinction is conceptual and applies in any store.

DO NOTHING — "first write wins, replay is a no-op." Use this for immutable facts. A raw event that happened, happened. If replay presents it again, the right behavior is to silently keep what's there. This is the workhorse for replaying raw data: overlap is expected and harmless, because the second write does nothing.

INSERT INTO events (source, event_id, ts, payload)
VALUES ($1, $2, $3, $4)
ON CONFLICT (source, event_id) DO NOTHING;

Replaying a gap that overlaps existing data costs you some wasted inserts and zero corruption.
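
The same statement can be exercised end to end. The sketch below uses Python's built-in sqlite3 module as a stand-in for the real database, which is an assumption for the sake of a runnable example: SQLite accepts the same ON CONFLICT ... DO NOTHING clause (version 3.24 or later) but uses ? placeholders instead of $1..$4. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        source   TEXT NOT NULL,
        event_id TEXT NOT NULL,
        ts       INTEGER NOT NULL,
        payload  TEXT NOT NULL,
        PRIMARY KEY (source, event_id)
    )
    """
)

# Same statement as above, with SQLite-style placeholders.
INSERT_FACT = """
    INSERT INTO events (source, event_id, ts, payload)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (source, event_id) DO NOTHING
"""


def count_events() -> int:
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


# Live traffic delivered e1..e3.
live = [("feed-a", "e1", 100, "a"), ("feed-a", "e2", 101, "b"), ("feed-a", "e3", 102, "c")]
conn.executemany(INSERT_FACT, live)
print(count_events())  # 3

# A reconnect replays e2..e4; e2 and e3 overlap what is already stored.
replayed_gap = [("feed-a", "e2", 101, "b"), ("feed-a", "e3", 102, "c"), ("feed-a", "e4", 103, "d")]
conn.executemany(INSERT_FACT, replayed_gap)
print(count_events())  # 4: only e4 was new

# Replaying the same gap again changes nothing.
conn.executemany(INSERT_FACT, replayed_gap)
print(count_events())  # 4
```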

DO UPDATE — "incorporate a correction." Use this when the new write is genuinely better than the old one — a correction from upstream, a recomputation that supersedes a provisional value. Here you want the new data to win, but only if it should. A blind DO UPDATE is dangerous during replay: an older replayed event can clobber a newer correction, walking your state backward. The fix is to make the update conditional on a version or timestamp:

INSERT INTO aggregates (bucket, version, value)
VALUES ($1, $2, $3)
ON CONFLICT (bucket) DO UPDATE
  SET value = EXCLUDED.value,
      version = EXCLUDED.version
  WHERE EXCLUDED.version > aggregates.version;   -- only move forward

The rule of thumb: DO NOTHING for facts, conditional DO UPDATE for corrections, never an unconditional DO UPDATE on anything replay can touch.
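
A small runnable sketch of the conditional update, again using Python's sqlite3 module as an assumed stand-in for the real database (SQLite 3.24+ supports the same DO UPDATE ... WHERE form). The bucket name and the values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE aggregates (
        bucket  TEXT PRIMARY KEY,
        version INTEGER NOT NULL,
        value   REAL NOT NULL
    )
    """
)

UPSERT = """
    INSERT INTO aggregates (bucket, version, value)
    VALUES (?, ?, ?)
    ON CONFLICT (bucket) DO UPDATE
      SET value = excluded.value,
          version = excluded.version
      WHERE excluded.version > aggregates.version
"""


def write(bucket: str, version: int, value: float):
    """Apply one write and return the stored (version, value) afterwards."""
    conn.execute(UPSERT, (bucket, version, value))
    return conn.execute(
        "SELECT version, value FROM aggregates WHERE bucket = ?", (bucket,)
    ).fetchone()


print(write("2024-01-01T00", 1, 10.0))  # (1, 10.0): provisional value
print(write("2024-01-01T00", 3, 12.5))  # (3, 12.5): the correction wins
print(write("2024-01-01T00", 2, 11.0))  # (3, 12.5): the stale replay is ignored
print(write("2024-01-01T00", 3, 12.5))  # (3, 12.5): a same-version replay is a no-op
```

Dropping the WHERE clause from UPSERT would let the third write walk the bucket back to version 2, which is the failure mode described above.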

3. Versioned aggregates: separating "newer" from "later-arriving"

The WHERE EXCLUDED.version > ... clause above is doing something subtle and important. It's the line between immutable and mutable aggregates, and between "this data is newer" and "this write arrived later."

During replay those two come apart. A write that arrives later may carry data that is older. Wall-clock arrival order is meaningless; you need an explicit notion of version to know which value should win.

A version can be:

a monotonic sequence number from the source,
the event-time of the latest contribution folded into the aggregate, or
a generation counter you bump each time you recompute a window.

With versioning in place, the merge rule is simple and total: the highest version wins, regardless of arrival order. Replay an old window and it loses to the correction that already superseded it. Replay a correction and it wins over the stale value. A same-version replay is a no-op under the strict > comparison, which is safe as long as a given version always carries the same value. The system converges: it always lands on the same final state for a given set of inputs, instead of depending on the order writes happened to land. That convergence is the runtime expression of the determinism Part 2 built into finalization.
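
The convergence claim is easy to check by brute force. The sketch below is a toy model, not a storage implementation: writes are (version, value) pairs, the merge function name is hypothetical, and it assumes a given version always carries the same value. It applies the same writes in every possible order and compares the outcomes against a plain last-write-wins rule.

```python
from itertools import permutations


def merge(state, write):
    """Highest version wins; state and write are (version, value) pairs."""
    if state is None or write[0] > state[0]:
        return write
    return state


# Three distinct writes plus a duplicate replay of version 2.
writes = [(1, 10.0), (2, 11.0), (3, 12.5), (2, 11.0)]

versioned_finals = set()
last_write_finals = set()
for order in permutations(writes):
    state = None
    for w in order:
        state = merge(state, w)
    versioned_finals.add(state)
    last_write_finals.add(order[-1])  # arrival order decides the winner

print(versioned_finals)        # {(3, 12.5)}: one final state for every ordering
print(len(last_write_finals))  # 3: the final state depends on what landed last
```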

This is also why immutable aggregates are a gift where you can afford them. An aggregate that's recomputed wholesale from its inputs needs no version arbitration between old and new values: recomputing it is inherently idempotent, provided each recomputation reads a complete and consistent view of those inputs. Reserve the mutable, versioned path for the cases where full recomputation is too expensive and you must merge incrementally.
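
A minimal sketch of the recompute-from-inputs path, under stated assumptions: the raw store is modeled as a dict keyed by event ID (standing in for a deduplicated raw table), buckets are 60 seconds wide, and the aggregate is a sum. All names are hypothetical.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket width

raw = {}  # event_id -> (ts, amount); stands in for the deduplicated raw table


def insert_fact(event_id: str, ts: int, amount: int) -> None:
    raw.setdefault(event_id, (ts, amount))  # first write wins


def recompute(raw_events: dict) -> dict:
    """Rebuild every bucket from scratch; a pure function of the raw events."""
    buckets = defaultdict(int)
    for ts, amount in raw_events.values():
        buckets[ts - ts % BUCKET_SECONDS] += amount
    return dict(buckets)


for event in [("e1", 5, 10), ("e2", 30, 5), ("e3", 65, 7)]:
    insert_fact(*event)
first = recompute(raw)

# A replay re-delivers e2 and e3, then the buckets are recomputed again.
for event in [("e2", 30, 5), ("e3", 65, 7)]:
    insert_fact(*event)
second = recompute(raw)

print(first)            # {0: 15, 60: 7}
print(first == second)  # True: same inputs, same buckets
```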

4. Replay in subwindows, merging into live state

Now the concurrency. You don't replay "history" as one monolithic job that holds the system hostage. You replay in bounded subwindows — chunks of the time range — and let each chunk merge into the same aggregates live traffic is writing to.

This works because of the first three properties, not in spite of them:

Writes are idempotent, so a replay chunk overlapping live data doesn't double-count.
Conflict semantics are explicit, so replayed facts don't clobber live corrections and vice versa.
Aggregates are versioned, so the merge converges no matter the interleaving.

Given those, replay and live ingest are just two writers against one set of aggregates, and correctness no longer depends on coordinating them.

A few practical notes on doing it without melting the system:

Avoid lock contention by partitioning the work, not by locking the data. Idempotent, version-arbitrated writes don't need a global lock — that's the point. If you find yourself reaching for a big lock to "protect" aggregates during replay, it usually means one of the first three properties is missing. Fix that instead.
Bound the subwindow size so any single chunk is cheap to retry and can't monopolize connections or memory. Small chunks also make progress observable and resumable: if replay dies halfway, you resume after the last completed chunk, and idempotency makes the partial chunk safe to redo.
Let live traffic keep priority. Replay is background work; it should yield to live ingest under pressure, not starve it. (How that pressure propagates, and what happens when replay is too aggressive, is the subject of Part 4.)
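
The notes above can be sketched as a small chunked replay loop. Everything here is an illustrative assumption: the function names, the dict used as a stand-in for a durable checkpoint, and the live_is_busy callback used to yield to live ingest. The process_chunk callback is assumed to be idempotent, as the earlier sections require.

```python
from typing import Callable, Dict, Iterator, Tuple

Window = Tuple[int, int]  # half-open [start, end), e.g. epoch seconds


def subwindows(start: int, end: int, size: int) -> Iterator[Window]:
    """Split [start, end) into consecutive chunks of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + size, end)
        yield (cursor, upper)
        cursor = upper


def replay(
    start: int,
    end: int,
    size: int,
    process_chunk: Callable[[int, int], None],
    progress: Dict[str, int],
    live_is_busy: Callable[[], bool] = lambda: False,
) -> bool:
    """Replay chunk by chunk; return True once the whole range is done."""
    resume_from = max(start, progress.get("done_until", start))
    for lo, hi in subwindows(resume_from, end, size):
        if live_is_busy():
            return False  # yield to live traffic; call again later to resume
        process_chunk(lo, hi)        # must be idempotent: a crashed chunk is redone
        progress["done_until"] = hi  # record progress only after the chunk completes
    return True


# Simulate a crash in the third chunk, then resume from the checkpoint.
processed = []
crash = {"pending": True}


def flaky_chunk(lo: int, hi: int) -> None:
    if (lo, hi) == (200, 300) and crash["pending"]:
        crash["pending"] = False
        raise RuntimeError("simulated crash mid-chunk")
    processed.append((lo, hi))


progress: Dict[str, int] = {}
try:
    replay(0, 350, 100, flaky_chunk, progress)
except RuntimeError:
    pass
print(progress)  # {'done_until': 200}

replay(0, 350, 100, flaky_chunk, progress)
print(processed)  # [(0, 100), (100, 200), (200, 300), (300, 350)]
```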

The shape of a correct replay

Put together, replaying a window into a live system looks like this:

Read the historical events for a bounded subwindow.
Insert raw facts with DO NOTHING — overlaps with live data are absorbed.
Recompute or incrementally update the affected aggregates with versioned conflict resolution — corrections win, stale replays lose.
Move to the next subwindow. Live traffic never stopped; it was writing the whole time, and the merge converged because every write was idempotent and version-arbitrated.
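
The four steps can be put together in one runnable sketch. It uses Python's sqlite3 module as an assumed stand-in for the real database, and it makes one simplifying choice that is not from the text: the version of a bucket is the number of raw events in it, which only works because the raw table here is append-only. Bucket width, subwindow size, table layout, and sample events are all invented for the example.

```python
import sqlite3

BUCKET = 60  # assumed bucket width in seconds; subwindows are multiples of it

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        source TEXT NOT NULL, event_id TEXT NOT NULL,
        ts INTEGER NOT NULL, amount INTEGER NOT NULL,
        PRIMARY KEY (source, event_id)
    );
    CREATE TABLE aggregates (
        bucket INTEGER PRIMARY KEY, version INTEGER NOT NULL, value INTEGER NOT NULL
    );
    """
)


def ingest(rows):
    """The single write path, shared by live ingest and replay."""
    # Step 2: raw facts, first write wins.
    conn.executemany(
        "INSERT INTO events (source, event_id, ts, amount) VALUES (?, ?, ?, ?) "
        "ON CONFLICT (source, event_id) DO NOTHING",
        rows,
    )
    # Step 3: recompute each touched bucket from the raw table.
    for bucket in sorted({ts - ts % BUCKET for _, _, ts, _ in rows}):
        # Assumption: the event count doubles as the version (append-only raw table).
        version, value = conn.execute(
            "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM events "
            "WHERE ts >= ? AND ts < ?",
            (bucket, bucket + BUCKET),
        ).fetchone()
        conn.execute(
            "INSERT INTO aggregates (bucket, version, value) VALUES (?, ?, ?) "
            "ON CONFLICT (bucket) DO UPDATE "
            "SET value = excluded.value, version = excluded.version "
            "WHERE excluded.version > aggregates.version",
            (bucket, version, value),
        )
    conn.commit()


def replay(history, start, end, size):
    for lo in range(start, end, size):  # Step 4: move to the next subwindow
        hi = min(lo + size, end)
        chunk = [row for row in history if lo <= row[2] < hi]  # Step 1: bounded read
        ingest(chunk)


history = [("feed", "e1", 10, 10), ("feed", "e2", 70, 5),
           ("feed", "e3", 130, 7), ("feed", "e4", 190, 1)]

ingest([history[0], history[2]])    # live traffic saw e1 and e3, missed e2 and e4
replay(history, 0, 120, 120)        # first subwindow fills in e2
ingest([("feed", "e5", 200, 3)])    # live traffic keeps writing during replay
replay(history, 120, 240, 120)      # second subwindow fills in e4

totals = dict(conn.execute("SELECT bucket, value FROM aggregates ORDER BY bucket").fetchall())
print(totals)  # {0: 10, 60: 5, 120: 7, 180: 4}
```

Replaying the full range again leaves the totals unchanged, since every raw insert is absorbed and no recomputed bucket carries a higher version.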

No maintenance window. No double-counting. No divergence between the live and historical paths — because, as Part 1 insisted, there is only one path, and replay is just feeding it from a different source.

What's left is making sure all of this survives the day someone replays two years of history into a system already running hot. That's where the theory meets the bill.

Previous: Part 2 — Watermarks and Intentional Lag Next: Part 4 — The Operational Reality
