Realtime Recovery, Part 1 — Recovery Is The System

Part 1 of a four-part series: Realtime Is Easy. Recovery Is The System.

The demo that lies to you

Every streaming system demos the same way. Events arrive on a socket, a consumer folds them into some running aggregate, and a dashboard ticks upward in front of the room. It looks like the hard problem has been solved. It hasn't. You've solved the easy 20% — the part that works as long as nothing goes wrong.

Live processing is easy because, in the demo, the world cooperates. The socket stays open. Events arrive in order. Nothing needs to be corrected after the fact. No one asks what the number was an hour ago, or whether it's the same number you'd get if you computed it again from scratch.

Production does not cooperate. The interesting engineering — the part that actually takes the years — is everything that happens when the stream stops behaving like a stream.

What "realtime" architectures quietly assume

Pull apart most realtime designs and you'll find a set of load-bearing assumptions that nobody wrote down:

Uninterrupted streams. The connection to the upstream source never drops, so the consumer never has to reason about a gap.
Infinite ordering guarantees. Events arrive in the order they happened, so "latest event wins" is a safe rule.
No backfill pressure. You never need to load six months of history into the same pipeline that's serving live traffic.
No corrections. The data you received is the data that was true, forever.

Every one of these is false in production, and each one is false in a way that quietly corrupts your aggregates rather than crashing loudly. That's what makes them dangerous. A crash you'll fix. A subtly wrong rolling average that drifts a fraction of a percent every time a socket reconnects, you'll discover six months later when someone reconciles against a source of truth and the numbers don't match.

The reality

Here's what actually happens, on a long enough timeline:

Upstream sources disconnect. Maintenance windows, rate limits, network partitions, your own deploys. The stream is not a stream; it's a sequence of sessions with gaps between them.
Events arrive late, and out of order. A partition heals and dumps a backlog. A slow producer flushes. The event timestamped 10:00:00 shows up after the one timestamped 10:00:05.
History gets corrected. The upstream source revises a value, or you discover a bug in your own parsing and need to recompute a window that already "finished."
You need to backfill. A new metric, a new instrument, a new customer who wants two years of history — all of it has to flow through the same logic that serves live data.

None of these are exotic. They are the normal operating conditions of any system that runs longer than a demo. The question isn't whether they'll happen — it's whether your design treats them as the main case or the exception.

The hidden requirement: deterministic replay

Once you accept that the stream breaks, corrects, and backfills, one requirement falls out of all of it: you must be able to reprocess the same inputs and get the same outputs. Deterministic replay.

This sounds obvious until you notice how many systems can't do it. They can't, usually for one of these reasons:

The aggregation depends on wall-clock arrival time, not event time, so replaying yesterday's data today produces different buckets.
Live and historical paths run different code, so "recomputing" a window doesn't actually reproduce what the live path did.
Writes aren't idempotent, so replaying an event double-counts it.
State is mutable and unversioned, so there's no way to tell a corrected value from the original.

Why replay correctness matters

Replay is not a disaster-recovery feature you hope never to use. It is the normal mechanism by which your system stays correct:

After a disconnect, you replay the gap.
After a bug fix, you replay the affected window.
After a backfill, you replay history into the same aggregates.
After a correction upstream, you replay to absorb it.

If replay isn't deterministic, every one of those operations introduces drift. The system doesn't converge on truth — it accumulates error. Your "realtime" numbers and your "recomputed" numbers disagree, and now you have two versions of reality and no principled way to choose between them.

Why the historical and live paths must converge

The single most expensive mistake in this space is maintaining two codepaths: a "fast" live path and a "correct" batch path. It feels reasonable — streaming code optimizes for latency, batch code optimizes for throughput, surely they should be written differently.

The problem is that they will diverge, and the divergence is invisible until it isn't. The live path rounds differently. The batch path handles a null case the live path never hit. Six months in, the two systems produce subtly different aggregates for the same input, and you cannot tell which is right because both are running the logic their authors intended — the logic just isn't the same logic.

The discipline that prevents this: one aggregation implementation, fed by two sources. The variance accumulator, the bucketing rule, the dedup logic — written once. Live ingest and historical replay are just two ways to push events at the same function. When they share the code, "recompute" provably reproduces "live," because they are the same computation. When they don't, you're hoping.

                  ┌─────────────────┐
  live ingest ───▶                 
                  │ one aggregation ──▶ aggregates
  replay ────────▶      core       
                  └─────────────────┘
        (two sources, one implementation, identical output)

Why "close enough" becomes operational debt

There's enormous temptation to wave off small inconsistencies. The rolling average is off by 0.1%. A reconnect dropped three events out of millions. Who cares?

The answer is that "close enough" is not a state — it's a rate. Every reconnect, every late batch, every backfill adds a little more error, and nothing ever removes it, because you have no deterministic mechanism to recompute truth. The error is monotonic. It only grows.

And it grows in the most expensive possible currency: trust. The first time someone catches your numbers disagreeing with a source of truth, every number you've ever produced is now suspect. You'll spend more time defending the pipeline than improving it. "Close enough" aggregates don't fail loudly; they quietly convert into a permanent tax on everyone who depends on the data.

The reframe

So here's the thesis for the rest of this series. Stop thinking of recovery as the error-handling around your realtime system. *Recovery is the system.* The live path is just the special case of replay where the data happens to be arriving right now.

Once you adopt that framing, the rest of the design follows:

You finalize results behind the present, not at it, so late data has somewhere to land. That's watermarks and intentional lag.
You make writes idempotent and aggregates versioned, so replay merges into live state instead of fighting it. That's replay without stopping live traffic.
You design for the cost and contention that replay creates under load, because a recovery mechanism that takes the system down during recovery isn't one. That's the operational reality.

The demo will always be easy. Build for the part that comes after the demo.

Next: Part 2 — Watermarks and Intentional Lag