
Realtime Recovery, Part 4 — The Operational Reality

Replay storms, cost amplification, stream growth, and backpressure — the things that make a recovery design either survive contact with production or not.

Part 4 of a four-part series: Realtime Is Easy. Recovery Is The System.

The first three parts built the design: recovery as the system (Part 1), watermarks for deterministic finalization (Part 2), and idempotent, versioned replay that merges into live traffic (Part 3). On paper it's clean. This part is about the day it meets production, because a recovery mechanism that takes the system down during recovery is not a recovery mechanism. It's a second outage with extra steps.

The thread running through everything below: replay is amplification. It takes a quantity of work that was once spread across days and compresses it into minutes. Every resource that was comfortable at live volume can be the bottleneck at replay volume.

Replay storms

The defining failure mode is the replay storm: a replay that generates more load than the live system was provisioned for, and that load cascades.

It usually starts innocently. A source was down for six hours, comes back, and the system dutifully replays the gap — at full speed, because nobody told it not to. Now you're ingesting six hours of events as fast as the pipe allows, on top of live traffic that never stopped. Or worse: several sources reconnect at once (they often do — the thing that knocked one offline knocked them all offline), and their replays superimpose.

The storm is dangerous because it's self-reinforcing. Replay load slows live processing; slow live processing widens the gap that needs replaying; the wider gap means more replay. Left unchecked, a transient outage turns into a system that can never catch up.

The defenses are all forms of deliberately not going as fast as possible:

  • Rate-limit replay explicitly. Replay should run at a configured fraction of capacity, leaving headroom for live traffic. "As fast as possible" is never the right speed for background recovery.
  • Bound concurrency. Cap how many subwindows replay in parallel, and how many sources replay at once. Staggering reconnect-driven replays prevents the superimposed-storm case.
  • Prioritize live over replay. When the two contend, live wins. Replay is catching up on the past; the past can wait a few more minutes. The present can't.
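
To make the first and third defenses concrete, here is a minimal sketch of a token-bucket limiter plus a "live wins" scheduler. Nothing here comes from the system described in this series: TokenBucket, replay_budget, next_item, and the default fraction of 0.5 are hypothetical names and values chosen for illustration, and the clock is injectable only so the behavior can be checked without sleeping.

```python
import time
from collections import deque


class TokenBucket:
    """Token bucket: about `rate` events/sec on average, bursts of up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at `burst`.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, n=1):
        """Take n tokens if available; return False rather than block."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def wait_time(self, n=1):
        """Seconds until n tokens would be available (0.0 if available now)."""
        self._refill()
        return max(0.0, (n - self.tokens) / self.rate)


def replay_budget(capacity, live_load, fraction=0.5):
    """Replay rate as a configured fraction of the headroom live traffic leaves."""
    headroom = max(0.0, capacity - live_load)
    return headroom * fraction


def next_item(live, replay, limiter):
    """Live always wins; replay runs only when live is idle and a token is free."""
    if live:
        return "live", live.popleft()
    if replay and limiter.try_acquire():
        return "replay", replay.popleft()
    return None
```

With a capacity of 1000 events/sec and 600 events/sec of live load, replay_budget gives 200 events/sec at the default fraction. The scheduler never hands out a replay item while a live item is waiting, which is the strict form of "live wins"; a real system might prefer a weighted scheme so that replay cannot be starved indefinitely.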

Cost amplification

Replay compresses time, and most cloud bills are priced by the unit of work, not by wall-clock. So replay doesn't just cost — it costs suddenly and visibly.

  • Compute spikes because you're doing days of aggregation in a burst.
  • I/O and request costs spike because every replayed event is reads and writes — and if your store charges per request or per I/O operation, a backfill is a lot of operations in a short window.
  • Egress and inter-service traffic spike if replay pulls history across a network boundary.

The trap is provisioning for live load and getting surprised by replay load. The fix is to treat replay as a first-class capacity scenario: know what a full backfill costs before you run one, and make the rate limit a cost lever, not just a stability lever. Slower replay is not less work, but it is often a smaller bill. The right answer to "this backfill is too expensive" is frequently "run it slower": the total number of operations stays the same, but the spike flattens into something your provisioned capacity absorbs instead of something you have to scale up for.
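
The arithmetic behind "same total, flatter spike" is short enough to write down. This is a sketch with made-up numbers: backfill_plan is a hypothetical helper, and the 2,000 events/sec live rate is an assumption, chosen so that a six-hour gap comes to 43.2 million events.

```python
def backfill_plan(backlog_events, live_rate, replay_rate):
    """Return (replay duration in seconds, peak combined events/sec)."""
    duration_s = backlog_events / replay_rate
    peak_rate = live_rate + replay_rate
    return duration_s, peak_rate


# Assumed scenario: a six-hour gap at 2,000 events/sec of live traffic.
backlog = 6 * 3600 * 2000  # 43,200,000 events either way

fast = backfill_plan(backlog, live_rate=2000, replay_rate=8000)
slow = backfill_plan(backlog, live_rate=2000, replay_rate=1000)
```

Under these assumptions the fast plan finishes in 5,400 seconds (1.5 hours) but peaks at 10,000 events/sec, five times live load. The slow plan takes 43,200 seconds (12 hours) and peaks at 3,000 events/sec. Both process the same 43.2 million events; only the capacity you need at the peak changes.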

CPU cache pressure and the cost of touching cold data

This one is subtle and bites systems that are otherwise well-tuned. Live processing has good locality: it touches recent buckets, recent keys, hot state that's already in cache. The working set is small and warm.

Replay destroys that locality. It sweeps across a huge range of historical buckets and keys, none of which are in cache, evicting the hot live state to make room. The result is that both paths slow down — replay is slow because it's all cache misses, and live is slow because replay just evicted its working set.

Mitigations are about keeping the two working sets from fighting:

  • Keep replay subwindows small and sequential so the cold working set stays bounded rather than thrashing the whole cache at once.
  • Isolate replay where the architecture allows it — separate workers, or even separate replicas — so cold historical sweeps don't evict the hot live state serving production.
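
Small, sequential subwindows are easy to sketch. The generator below assumes replay is addressed by time range and uses half-open intervals; subwindows and the 15-minute size are hypothetical choices for illustration, not something the earlier parts prescribe.

```python
from datetime import datetime, timedelta


def subwindows(start, end, size):
    """Yield half-open [lo, hi) windows covering [start, end), oldest first."""
    if size <= timedelta(0):
        raise ValueError("size must be positive")
    lo = start
    while lo < end:
        hi = min(lo + size, end)  # the final window may be shorter than `size`
        yield lo, hi
        lo = hi


# Assumed example: a six-hour gap replayed in 15-minute slices.
gap_start = datetime(2024, 1, 1, 0, 0)
gap_end = gap_start + timedelta(hours=6)
windows = list(subwindows(gap_start, gap_end, timedelta(minutes=15)))
```

Processing these one at a time, in order, means the cold working set at any moment is roughly one slice's worth of buckets and keys rather than the whole gap.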

Stream and storage growth

If you use a durable log or stream (Kafka, a Redis/Valkey stream, etc.) as the backbone between ingest and processing, replay interacts badly with retention.

  • Stream growth. Replaying into the same stream that feeds live consumers inflates it fast. A Redis/Valkey stream that's comfortably trimmed under live load can balloon when replay floods it, and an untrimmed stream is a memory problem that becomes an availability problem.
  • Disk amplification. Idempotent inserts still attempt the write. ON CONFLICT DO NOTHING is cheap on conflict, but it's not free: each attempt still costs a round trip and an index probe, and depending on the store there can still be WAL traffic and vacuum/compaction pressure from the churn. Replaying a large overlapping range generates real load on the write path even when most of it is no-ops.

Plan retention and trimming for replay volume, not live volume. Decide explicitly whether replay flows through the same stream as live traffic or a separate one — sharing is simpler but couples their growth; separating costs complexity but isolates the blast radius.
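
A rough way to size a shared stream for replay volume is to ask how much backlog accumulates while replay is running. The sketch below is a back-of-the-envelope model, not a measurement: peak_stream_backlog is a hypothetical helper, it assumes constant rates, and the 200 bytes per entry is an invented figure.

```python
def peak_stream_backlog(live_rate, replay_rate, drain_rate, replay_seconds,
                        bytes_per_entry):
    """Estimate (entries, bytes) of backlog built up in a shared stream."""
    # The stream only grows while producers outpace the consumers.
    surplus = max(0.0, live_rate + replay_rate - drain_rate)
    entries = surplus * replay_seconds
    return entries, entries * bytes_per_entry


# Assumed numbers: consumers drain 6,000 events/sec, entries are ~200 bytes.
flooded = peak_stream_backlog(2000, 8000, 6000, 5400, 200)
paced = peak_stream_backlog(2000, 1000, 6000, 43200, 200)
```

With these assumptions the fast replay leaves 21.6 million undrained entries (about 4.3 GB) in the stream by the time it finishes, while the paced replay never exceeds what consumers can drain. If the first number is larger than the memory you are willing to give the stream, either the replay rate or the topology has to change.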

Backpressure propagation

Everything above converges on one mechanism: backpressure, the system's ability to make a fast producer slow down when a downstream stage can't keep up.

Replay is the ultimate fast producer. Without backpressure, it will happily read history faster than aggregation can absorb, faster than the store can persist, faster than the stream can drain — and the overflow has to go somewhere: unbounded queues (memory exhaustion), dropped data (the corruption Part 1 warned about), or thrashing (the cache pressure above).

Backpressure has to propagate all the way back to the replay reader:

  • The store signals it's saturated →
  • aggregation slows its writes →
  • the stream/queue fills →
  • the replay reader stops reading.

That last link is the one people forget. If replay reads from an effectively unbounded historical source (a database, an object store), it has no natural rate limit: it will read as fast as it can unless something explicitly tells it to wait. Backpressure is what connects "the store is struggling" to "the reader should pause," and without it the rate limit from the replay-storm section is the only thing standing between you and an overflow. Build both: the rate limit as the deliberate ceiling, backpressure as the reactive safety valve.
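
The simplest in-process form of that last link is a bounded queue between the reader and the slow stage. This is a sketch under stated assumptions: run_replay, sink, and max_in_flight are hypothetical names, a single consumer thread stands in for aggregation plus the store, and error handling is omitted (a sink that raises would leave the reader blocked).

```python
import queue
import threading

_DONE = object()  # sentinel telling the consumer to stop


def run_replay(history, sink, max_in_flight=4):
    """Push `history` through a bounded queue; return the deepest queue seen."""
    q = queue.Queue(maxsize=max_in_flight)
    max_depth = 0

    def consumer():
        while True:
            item = q.get()
            if item is _DONE:
                return
            sink(item)  # the slow stage: aggregation and the store write

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in history:
        # put() blocks while the queue is full. That block is the backpressure:
        # the reader cannot pull more history until the sink makes room.
        q.put(event)
        max_depth = max(max_depth, q.qsize())
    q.put(_DONE)
    worker.join()
    return max_depth
```

However slow the sink is, the queue never holds more than max_in_flight events, so the reader's pace is set by the slowest downstream stage rather than by how fast history can be read. Across process or network boundaries the same idea needs an explicit signal (consumer lag, credits, or acknowledgements), but the shape is the same: a bounded buffer and a reader that waits.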

The operational checklist

If you're building or auditing one of these systems, the questions that separate "works in the demo" from "survives production":

  • Can replay be rate-limited and concurrency-bounded, and does live traffic take priority under contention?
  • Do you know the cost of a full backfill before running one, and is rate a cost lever?
  • Are replay's subwindows bounded so cold sweeps don't thrash the cache or monopolize connections?
  • Is retention/trimming sized for replay volume, and do you know whether replay shares the live stream or has its own?
  • Does backpressure propagate all the way to the replay reader, so a saturated store actually slows replay instead of overflowing?

If any answer is no, you don't have a recovery system. You have a recovery feature that works until the day you need it most.

Closing thesis

Across four parts the argument has been one idea seen from different angles. The live path is the easy part — it's the part that works when the world cooperates. The system you're actually building is the one that stays correct when the world doesn't: when sockets drop, when data arrives late, when history needs correcting, when two years of backfill have to flow through the same logic that's serving this second's traffic.

Watermarks make finalization deterministic. Idempotency and versioning make replay safe. Rate limits and backpressure make it survivable. But all of it is in service of a single property, and it makes a good place to end:

A realtime pipeline is only as good as its ability to reconstruct truth after failure.

Realtime is easy. Recovery is the system.


Previous: Part 3 — Replay Without Stopping Live Traffic
Back to the start: Part 1 — Recovery Is The System