Observability Is Runtime Architecture

Most teams treat observability as a read-only layer over their systems. It is not.

The Incident

We recently hit an issue where memory usage exploded in production while local environments stayed flat. Same code. Same workload. Same deployment model.

Locally, the service sat at around 20 MB of memory. In production, it crossed 2 GB within minutes.

The root cause was simple once we saw it: local environments had Prometheus scraping metrics, and production did not.

Each /metrics scrape triggered a reset path that bounded growth elsewhere in the process. Without the scrape, nothing bounded that growth.
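
To make the mechanism concrete, here is a minimal sketch of how a scrape can end up bounding memory. It is not our actual code: the LatencyBuffer class, its method names, and the metric names are all hypothetical, and the real reset path may look quite different. The only assumption is that rendering metrics also clears some state that otherwise grows with traffic.

```python
class LatencyBuffer:
    """Hypothetical metric that buffers raw samples until the next scrape."""

    def __init__(self):
        self.samples = []

    def observe(self, value):
        # Grows on every request; nothing in this method bounds the list.
        self.samples.append(value)

    def render_metrics(self):
        # Stand-in for what a /metrics handler would call. Summarizing the
        # buffer also clears it, so the scrape is the only bound on memory.
        count = len(self.samples)
        total = sum(self.samples)
        self.samples.clear()
        return f"latency_count {count}\nlatency_sum {total}\n"


def simulate(requests, scrape_every=None):
    """Return the peak buffer length over a run of `requests` observations."""
    buf = LatencyBuffer()
    peak = 0
    for i in range(1, requests + 1):
        buf.observe(0.01)
        peak = max(peak, len(buf.samples))
        if scrape_every and i % scrape_every == 0:
            buf.render_metrics()  # simulated scrape
    return peak


peak_with_scrape = simulate(10_000, scrape_every=100)
peak_without_scrape = simulate(10_000)
```

With a simulated scrape every 100 requests, the buffer never holds more than 100 samples. With no scrape, it holds every sample ever observed, which is the same shape as the gap between our local and production memory curves.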

The bug itself is not the interesting part. The reminder is.

The Reminder

Observability is part of your runtime architecture whether you intend it to be or not.

Monitoring changes timing. Scraping changes execution paths. Tracing changes allocation patterns. Health checks and operational traffic change system behavior.

A lot of "works in staging" failures are really parity failures in disguise.

Parity Includes Observability

Teams usually think parity means:

infrastructure
configuration
datasets
deployment topology

In practice, it also means:

monitoring
scrape intervals
tracing overhead
health checks
operational traffic patterns

The Check

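We now run soak tests with the observability stack disabled as a standard pre-production check, so that any hidden dependence on scrapes, traces, or health checks shows up before production does it for us.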
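
One way such a check could look, as a rough sketch rather than our actual harness: drive the workload in-process with no scrape traffic and fail if traced memory grows past a budget. The soak function, the budget value, and the two example workloads are illustrative assumptions; tracemalloc is Python's standard-library allocation tracer.

```python
import tracemalloc
from collections import deque


def soak(step, iterations, budget_bytes):
    """Run `step` repeatedly with no scrape traffic.

    Returns (growth_in_bytes, within_budget).
    """
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        step()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth = current - baseline
    return growth, growth <= budget_bytes


# Hypothetical workloads: one retains every payload, one keeps the last 10.
leaky_store = []
bounded_store = deque(maxlen=10)

leak_growth, leak_ok = soak(
    lambda: leaky_store.append(bytes(1024)), 2_000, budget_bytes=256 * 1024
)
bounded_growth, bounded_ok = soak(
    lambda: bounded_store.append(bytes(1024)), 2_000, budget_bytes=256 * 1024
)
```

The leaky workload retains roughly 2 MB and fails the 256 KB budget, while the bounded one passes. A real soak run would last far longer and watch process-level memory, but the pass/fail shape is the same.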

The systems watching your application can quietly become part of the reason it behaves correctly.

Production has a way of teaching that lesson eventually.