Realtime Recovery, Part 2 — Watermarks and Intentional Lag

Part 2 of a four-part series: Realtime Is Easy. Recovery Is The System.

In Part 1 the claim was that recovery is the system and deterministic replay is the requirement everything hangs off of. This part is about the single design decision that does the most to make replay deterministic — and that most people get wrong on instinct.

The instinct is: when a time bucket's window ends, close it. The minute ends, finalize the minute. It feels obviously correct. It is the source of most of the instability in naive streaming systems.

The problem with finalizing at the present

Picture a system that aggregates events into one-minute buckets and finalizes each bucket the instant its window elapses. At 10:01:00.000 it stamps the 10:00–10:01 bucket "done" and publishes it.

Now reality intrudes:

An event timestamped 10:00:58 arrives at 10:01:02 — normal network jitter. The bucket is already closed. Either you drop the event (data loss) or you reopen a "finalized" bucket (so it wasn't finalized).
A producer that was buffering flushes a half-second of events from just before the boundary at 10:01:01. Same problem, bigger.
A socket reconnects after a 4-second blip and replays the gap. Now you're reopening a bucket you already published, and a longer outage would reopen several.

Each reopen is a correction to something a downstream consumer already saw. Every dashboard, every cache, every alert that read the "final" value now has to be told it changed. The closer you finalize to the present, the more often this happens, because the present is exactly where late and out-of-order data is most likely. You've built a system whose published results churn constantly.
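To make the churn concrete, here is a minimal sketch that counts how many events arrive after their bucket was already finalized. The event list is hypothetical, made up to mirror the three cases above, and the names `bucket_end` and `count_reopens` are illustrative rather than part of any real system.

```python
from typing import List, Tuple

BUCKET_SECONDS = 60  # one-minute buckets, as in the example above


def bucket_end(event_time: float) -> float:
    """End of the one-minute bucket that contains event_time (in seconds)."""
    return (event_time // BUCKET_SECONDS + 1) * BUCKET_SECONDS


def count_reopens(events: List[Tuple[float, float]], lag: float = 0.0) -> int:
    """Count events that arrive after their bucket was already finalized.

    Each event is (event_time, arrival_time) in seconds. A bucket is
    finalized at bucket_end + lag, so lag=0 is finalize-at-present.
    """
    return sum(
        1
        for event_time, arrival_time in events
        if arrival_time >= bucket_end(event_time) + lag
    )


# Times are seconds since 10:00:00. Hypothetical data for illustration.
events = [
    (58.0, 62.0),  # stamped 10:00:58, arrives 10:01:02 (jitter)
    (59.6, 61.0),  # buffered producer flush at 10:01:01
    (59.9, 61.0),
    (57.0, 64.0),  # replayed after a reconnect
    (30.0, 30.1),  # ordinary on-time event
]

print(count_reopens(events, lag=0.0))  # 4
print(count_reopens(events, lag=5.0))  # 0
```

With no lag, four of the five events would force a correction to a published bucket. With a five-second lag, none would.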

This is the core tension: finalize early and you're frequently wrong; finalize never and you can't publish anything. The resolution is to finalize late — but deliberately, by a known amount.

Decouple collection from finalization

The key move is to stop treating "collect" and "finalize" as the same clock.

Collection runs at the present. Events stream in and update open buckets in realtime. You're not adding latency to ingestion. The freshest data is visible immediately as provisional state.
Finalization runs behind the present, at a lagged watermark. A bucket is only sealed once the watermark passes its end.

The watermark is a moving timestamp that means: "I do not expect any more events older than this." It trails real time by a delay you choose — the intentional lag. Everything before the watermark is finalized and immutable. Everything after it is open and provisional.

   real time ──────────────────────────────────────────▶ now
                                                      │
   live ingest ──▶ open / provisional buckets   ◀────┘
                          ▲
                          │  events keep landing here
            ┌─────────────┘
            │
      watermark = now − lag
            │
            ▼
   ◀───────────  finalized / immutable buckets
        (no more writes expected here)

The lag is the width of the window in which you're willing to accept late data. Set it to cover your realistic worst case — the longest reconnect, the slowest producer flush, the typical out-of-order spread — plus a margin. Past the watermark, you assert completeness and seal.
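One way to sketch the split between the two clocks is shown below. This is an illustration under assumptions, not a reference implementation: the class name `WatermarkAggregator`, the sum aggregation, and the `rejected` list are all hypothetical. Too-late events are only set aside here; what to do with them is the subject of Part 3.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BUCKET_SECONDS = 60


class WatermarkAggregator:
    """Sums values per one-minute bucket and seals buckets behind a lagged watermark."""

    def __init__(self, lag_seconds: float) -> None:
        self.lag = lag_seconds
        self.watermark = float("-inf")
        self.open: Dict[int, float] = defaultdict(float)  # bucket start -> provisional sum
        self.final: Dict[int, float] = {}                  # bucket start -> sealed sum
        self.rejected: List[Tuple[float, float]] = []      # events later than the lag

    def ingest(self, event_time: float, value: float) -> None:
        """Collection: runs at the present and updates provisional state immediately."""
        start = int(event_time // BUCKET_SECONDS) * BUCKET_SECONDS
        if start + BUCKET_SECONDS <= self.watermark:
            # The bucket is already behind the watermark: never mutate sealed data.
            self.rejected.append((event_time, value))
            return
        self.open[start] += value

    def advance(self, now: float) -> List[int]:
        """Finalization: seal every open bucket whose end is at or before now - lag."""
        self.watermark = max(self.watermark, now - self.lag)
        sealed = sorted(s for s in self.open if s + BUCKET_SECONDS <= self.watermark)
        for start in sealed:
            self.final[start] = self.open.pop(start)
        return sealed


# Times are seconds since 10:00:00.
agg = WatermarkAggregator(lag_seconds=5)
agg.ingest(58.0, 1.0)            # lands in the open 10:00 bucket
agg.advance(now=62.0)            # watermark at 57s: nothing sealed yet
agg.ingest(59.0, 2.0)            # late arrival, but the bucket is still open
sealed = agg.advance(now=65.0)   # watermark at 60s: the 10:00 bucket is sealed
agg.ingest(59.5, 4.0)            # later than the lag: set aside, not applied
```

Ingestion never waits on the watermark, and the watermark never moves backwards. Those two choices are what keep the provisional side fresh and the sealed side immutable.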

Why this buys you determinism

This is the part that connects back to Part 1. A lagged watermark isn't just an ingestion nicety — it's what makes finalized data reproducible.

A finalized bucket is, by construction, one whose sealing waited out the full lag, so every event that arrives within the lateness you planned for is already in it. Recomputing it from the same inputs therefore yields the same output, every time, regardless of when you recompute. The arrival-time nondeterminism ("did that late event make it in before we closed?") is gone for everything inside that bound, because finalization waited until the answer was yes.

Contrast with finalize-at-present, where the output of a bucket depends on the race between event arrival and the close. Replay that day's events later and the race can resolve differently; you get a different number. Finalize-behind-watermark removes the race from everything you've sealed. The watermark is the boundary between "still racing" and "settled," and only settled data is allowed to be called final.
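The race can be shown with a small simulation. It is a sketch under stated assumptions: the event times, the uniform delay model, and the function names are invented for illustration, and the maximum delay is deliberately kept below the lag.

```python
import random
from typing import List, Tuple

BUCKET_END = 60.0  # the 10:00-10:01 bucket ends 60s after 10:00:00


def sealed_sum(arrivals: List[Tuple[float, float, float]], lag: float) -> float:
    """Sum the bucket's values, counting only events that arrived before the seal.

    Each arrival is (event_time, arrival_time, value). The bucket is
    sealed at BUCKET_END + lag.
    """
    return sum(
        value
        for event_time, arrival_time, value in arrivals
        if event_time < BUCKET_END and arrival_time < BUCKET_END + lag
    )


def simulate(seed: int, max_delay: float = 4.0) -> List[Tuple[float, float, float]]:
    """Same five events every run, with a different random network delay each time."""
    rng = random.Random(seed)
    event_times = [10.0, 35.0, 57.0, 58.5, 59.9]
    return [(t, t + rng.uniform(0.0, max_delay), 1.0) for t in event_times]


runs = [simulate(seed) for seed in range(20)]

# Distinct sealed values seen across the 20 replays of identical events.
at_present = {sealed_sum(run, lag=0.0) for run in runs}
behind_watermark = {sealed_sum(run, lag=5.0) for run in runs}

print(sorted(at_present))        # depends on which events won the race
print(sorted(behind_watermark))  # [5.0]
```

With a lag that covers the worst delay, every replay seals the same value. With no lag, the sealed value depends on the delays drawn in that run, so it will typically differ between replays.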

Provisional vs final is a feature, not a leak

Exposing two tiers of data — provisional (after the watermark, still moving) and final (before it, immutable) — feels like leaking implementation detail.

It's the opposite. It's an honest contract. Your consumers genuinely live in two regimes:

A trading screen wants the freshest possible number and accepts that it might tick. Give it provisional data and label it provisional.
A billing reconciliation or a settlement report cannot tolerate a number that changes after it's read. Give it only finalized data, and tell it the latest finalized timestamp so it knows how current it is.

The mistake is pretending there's one number that's both maximally fresh and permanently stable. There isn't. The watermark makes the tradeoff explicit and lets each consumer choose its side. Hiding it just means you picked for them, usually wrong for half of them.
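A read path that makes the contract explicit might look like the following sketch. The `Reading` type, the `read_bucket` function, and the `final_only` flag are hypothetical names chosen for illustration. The point is only that every answer carries its status and the latest finalized timestamp.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Reading:
    bucket_start: int
    value: float
    status: str               # "final" or "provisional"
    finalized_through: float  # the watermark: everything before it is sealed


def read_bucket(
    final: Dict[int, float],
    open_: Dict[int, float],
    watermark: float,
    bucket_start: int,
    final_only: bool = False,
) -> Optional[Reading]:
    """Return a labelled reading, or None if the caller's tier has nothing yet."""
    if bucket_start in final:
        return Reading(bucket_start, final[bucket_start], "final", watermark)
    if final_only or bucket_start not in open_:
        return None
    return Reading(bucket_start, open_[bucket_start], "provisional", watermark)


# Times are seconds since 10:00:00; the 10:00 bucket is sealed, 10:01 is open.
final = {0: 3.0}
open_ = {60: 1.5}
watermark = 60.0

screen = read_bucket(final, open_, watermark, 60)                    # trading screen
billing = read_bucket(final, open_, watermark, 60, final_only=True)  # settlement report
settled = read_bucket(final, open_, watermark, 0, final_only=True)

print(screen.status)   # provisional
print(billing)         # None
print(settled.status)  # final
```

The settlement consumer gets nothing for the open minute rather than a number that might change, and it can see from `finalized_through` how far behind the present it is reading.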

Choosing the lag

The lag is a real tuning knob with a real tradeoff:

Too short and you re-introduce the churn — late data lands after finalization and you're back to correcting published results.
Too long and finalized data is needlessly stale; consumers that need stability wait longer than they have to.

A few principles that help:

Measure your actual lateness distribution. Track the gap between event time and arrival time. Set the lag to cover the high percentile you care about (p99, p99.9), not the average. The tail is the whole point.
Different sources can warrant different lags. A flaky upstream with frequent reconnects needs more slack than a reliable one. The watermark can be tracked per source, with the system-wide watermark taken as the minimum across them.
The lag is a threshold, not a guarantee. Data later than the lag still happens. That's not a watermark failure; it's exactly the case that Part 3 handles with replay and idempotent corrections. The watermark handles the common lateness cheaply; replay handles the rare lateness correctly.
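The first two principles can be sketched in a few lines. The lateness samples and source names below are made up for illustration, and the nearest-rank percentile is one simple choice among several; the helper names are hypothetical.

```python
import math
from typing import Dict, List


def lag_for_percentile(lateness: List[float], percentile: float, margin: float = 0.0) -> float:
    """Nearest-rank percentile of observed lateness (arrival time - event time), plus a margin.

    percentile is on a 0-100 scale, e.g. 99 or 99.9.
    """
    if not lateness:
        raise ValueError("need at least one lateness sample")
    ordered = sorted(lateness)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1] + margin


def system_watermark(now: float, lag_by_source: Dict[str, float]) -> float:
    """System-wide watermark: the minimum of the per-source watermarks (now - lag)."""
    return min(now - lag for lag in lag_by_source.values())


# Hypothetical samples in seconds: almost everything is fast, the tail is not.
samples = [0.1] * 98 + [2.0, 4.0]

print(lag_for_percentile(samples, 50))                # 0.1
print(lag_for_percentile(samples, 99))                # 2.0
print(lag_for_percentile(samples, 99.9))              # 4.0
print(lag_for_percentile(samples, 99, margin=1.0))    # 3.0

# Hypothetical sources with different lags, evaluated at now = 100s.
print(system_watermark(100.0, {"reliable": 2.0, "flaky": 10.0}))  # 90.0
```

A lag sized to the median would be 0.1 seconds and would miss the two slow arrivals. Sizing to the tail is what covers them, and the slowest source sets the pace for the system-wide watermark.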

What you've actually built

With collection at the present and finalization behind a watermark, you get a system with three useful properties:

Freshness without lying. Live data is visible immediately, clearly marked provisional.
Stability where it's promised. Finalized data doesn't move, so downstream consumers can trust and cache it.
Determinism. Sealed buckets reproduce exactly on replay, because they were only sealed once they were complete.

That third property is what makes the next part possible. Once you can trust that finalized data is reproducible, you can replay history into a running system without fear — because replay of a settled window will agree with what's already there, and replay of an unsettled one is just more provisional data doing its job.

Previous: Part 1 — Recovery Is The System
Next: Part 3 — Replay Without Stopping Live Traffic