Every distributed system is a collection of carefully managed disappointments.
That sounds cynical. It isn't. It's the most useful frame I know for the work, because it points at the part that's actually hard and quietly skips the part everyone obsesses over instead.
The apologies are not edge cases. They're the contract.
The moment you put a network between two pieces of your system, you've agreed to a set of things that will happen to you. Not might. Will, given enough time and traffic:
- The network will fail, usually at the worst moment.
- Messages will arrive twice.
- Messages will arrive out of order.
- Machines will disappear without warning.
- Clocks will disagree about what time it is.
- Data will be stale by the time you read it.
None of these are bugs you can fix. They're properties of the medium. You can't write code that makes the network reliable any more than you can write code that makes the speed of light faster. So the question stops being "how do I prevent this" and becomes "what does my system do when it happens, because it will."
You are choosing which disappointment to absorb, whether you admit it or not
Here's the part that trips people up. You don't get to avoid all of these at once. Picking a defense against one of them usually signs you up for another.
- Retry on failure, and you've chosen duplicate delivery. Now you owe the system idempotency.
- Demand strict ordering, and you've chosen latency and head-of-line blocking.
- Insist on always-fresh reads, and you've chosen reduced availability when a replica is unreachable.
- Accept stale reads for availability, and you've chosen to explain to someone why their dashboard was wrong for ninety seconds.
Distributed systems engineering is mostly deciding which disappointments you're willing to tolerate, and then designing the apology so it doesn't hurt anyone.
The teams that get into trouble aren't the ones that chose wrong. They're the ones that never realized they were choosing. They retried without making writes idempotent. They assumed ordering they never enforced. The tradeoff got made by default, in the dark, and surfaced later as a corruption bug nobody could reproduce.
The happy path is the cheap part
This is why "it works in the demo" tells you almost nothing about a distributed system. The happy path, where every message arrives once, in order, while every machine is up and every clock agrees, is the easy 80%. Generated code and quick prototypes clear it without breaking a sweat.
The actual engineering lives in the other 20%:
- What happens when the same request is processed twice?
- What happens when an acknowledgment is lost but the work succeeded?
- What happens when a node comes back from the dead holding stale state and starts confidently acting on it?
- How does the system get back to a known-good state after any of that?
That last question is the whole game. I've argued before that recovery is the system, not a feature bolted onto it. Distributed systems are where that stops being a slogan and starts being the difference between a blip and a multi-day incident.
Designing the apology
If you accept that the disappointments are coming, the work gets concrete and almost calming. You stop trying to build something that never fails and start building something that fails on purpose, in ways you've already rehearsed.
- Idempotency so a duplicate message is a no-op, not a double charge.
- Tolerance for reordering so "out of order" is a sorting problem, not a corruption event.
- Staleness budgets so "how old is too old" is a number you chose, not a surprise you discover. This is the same instinct behind watermarks and intentional lag: naming the staleness you'll accept instead of pretending there is none.
- Timeouts and bounded retries so a slow dependency degrades you instead of taking you down with it.
- A defined recovery path so the answer to "now what" is documented, not improvised at 3am.
Every one of these is an apology written in advance. You're deciding, while calm, how the system will say sorry when reality refuses to cooperate, so that nobody has to invent the apology under pressure.
The one-line version
A distributed system never gets to avoid failure. It only gets to choose which failures it tolerates and how gracefully it apologizes for them. The happy path was never the work. The apology always was.