On the difference between possible and durable
A demo proves something can work. A production system proves it keeps working — under load, under failure, under change, at three in the morning when nobody is watching. The gap between those two is where most engineering effort actually lives, and most engineering pain.
A demo validates possibility. A production system must guarantee reliability under constraints. Confusing the two is a category error.
The original version of this essay listed thirteen ideas, flat. Reading it back, I noticed two things. First, a long checklist flattens priority — every item reads as equally weighted. Second, the rigor a system needs scales to its blast radius: a research notebook and an ASIL-rated motor controller live on different planets, and pretending otherwise either over-engineers prototypes or under-engineers products. So I have grouped the ideas into four parts, condensed to ten, and made each one a thing you can do, not a thing you can nod at.
This monograph is meant to live next to your IDE. Open it when you start a feature; open it again before you ship one. The interactive figures are not decoration — they show, in miniature, the failure modes that catch teams off guard.
A result that cannot be re-run is not a result. It is a memory of one.
Notebooks let you mutate state without realizing it: cells run out of order, variables linger, a helpful colleague pip installs a newer library and the entire conclusion shifts. Production cannot tolerate this. Every output must be traceable to a specific commit, a specific dataset version, a specific dependency tree, and a specific sequence of operations.
Three habits move you most of the way there. Pin everything you import. Make every script idempotent and order-independent. Capture not just the code but the environment: lockfiles, container digests, hardware where it matters.
    # bad: implicit, drifty, ambient
    import numpy
    import pandas
    import torch
    df = pandas.read_csv("latest.csv")  # which "latest"?

    # good: explicit, versioned, reproducible
    # numpy==1.26.4, torch==2.3.1+cu121 (locked in pyproject.toml / requirements.lock)
    # dataset: s3://data/raw/2026-04-12/orders.parquet (immutable URI)
    # commit: 3f9a2c1 (recorded in MLflow run)
Container images referenced by digest (not :latest), a seed pinned for any stochastic step, and an experiment_id recorded with every artifact. If you can rebuild the same number tomorrow from a clean clone, you have reproducibility.
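One way to make "traceable to a commit, a dataset version, and a dependency tree" concrete is to write a small manifest next to every artifact. The sketch below is illustrative only: the function names and the idea of deriving experiment_id from a hash of the manifest are assumptions, not something the essay prescribes.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def package_versions(names):
    """Return the installed version of each distribution, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


def build_run_manifest(commit, dataset_uri, seed, packages):
    """Collect what is needed to re-run a result from a clean clone.

    `commit` and `dataset_uri` are passed in by the caller; this sketch does
    not shell out to git or touch the dataset.
    """
    manifest = {
        "commit": commit,
        "dataset_uri": dataset_uri,
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": package_versions(packages),
    }
    # Assumption: a stable hash of the manifest doubles as the experiment_id.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["experiment_id"] = hashlib.sha256(canonical).hexdigest()[:12]
    return manifest
```

Calling build_run_manifest("3f9a2c1", "s3://data/raw/2026-04-12/orders.parquet", 1234, ["numpy", "torch"]) twice in the same environment yields the same experiment_id; a changed dependency or seed yields a different one, which is the point.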
Most production incidents I have seen begin with the sentence: "someone changed the upstream table."
Data is not static. Columns get renamed. A field that was always non-null suddenly isn't. Units silently change from grams to kilograms because a procurement system updated. Without an enforced contract at every boundary, your code will absorb the change and produce a plausible-looking wrong answer. Plausible-looking wrong answers are the most expensive kind.
A data contract is just a schema you treat as a promise: type, nullability, allowed range, semantic meaning. Validate it on every read; fail loudly when it breaks; version it like an API.
Enforce the contract with a validation tool (pydantic, pandera, Great Expectations, Avro/Protobuf for messages) at every system boundary. Treat schema changes as breaking API changes: version, deprecate, migrate. Never trust an upstream "we'll let you know."
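To show the shape of "validate on every read, fail loudly", here is a minimal standard-library sketch. In a real system one of the libraries named above would likely do this job; the Field class, the ORDERS_CONTRACT_V1 fields, and the ranges are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


class ContractViolation(ValueError):
    """Raised when a record breaks the agreed schema."""


@dataclass(frozen=True)
class Field:
    name: str
    type_: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Hypothetical contract for an "orders" record; version it like an API.
ORDERS_CONTRACT_V1 = (
    Field("order_id", str),
    Field("weight_grams", float, min_value=0.0, max_value=50_000.0),
    Field("coupon_code", str, nullable=True),
)


def validate(record: dict, contract=ORDERS_CONTRACT_V1) -> dict:
    """Return the record unchanged, or raise ContractViolation."""
    for field in contract:
        if field.name not in record:
            raise ContractViolation(f"missing field: {field.name}")
        value = record[field.name]
        if value is None:
            if not field.nullable:
                raise ContractViolation(f"{field.name} must not be null")
            continue
        if not isinstance(value, field.type_):
            raise ContractViolation(
                f"{field.name}: expected {field.type_.__name__}, "
                f"got {type(value).__name__}"
            )
        if field.min_value is not None and value < field.min_value:
            raise ContractViolation(f"{field.name} below {field.min_value}")
        if field.max_value is not None and value > field.max_value:
            raise ContractViolation(f"{field.name} above {field.max_value}")
    return record
```

The design choice worth noting is that validate raises instead of coercing: a renamed column or a newly null field stops the pipeline at the boundary rather than flowing through as a plausible-looking wrong answer.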
Two pipelines that look the same but compute differently are a bug — they just haven't surfaced yet.
This is the ML version of the classic "works on my machine" problem. The training pipeline normalizes timestamps in UTC; the serving pipeline normalizes in local time. Training drops nulls; serving fills them with zero. Training one-hot encodes from a frozen vocabulary; serving builds the vocabulary on the fly. Each of these alone can shift model accuracy by a few points. Together they can make a model in production behave like a different model entirely.
The cure is structural, not procedural: one feature transformation library, called from both training and serving. If a transform exists in two places, the question is not "are they the same" but "when will they diverge."
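A sketch of what "one transformation library, called from both sides" can look like. The feature names, the vocabulary, and the choice to fill null amounts with zero are assumptions made for illustration; what matters is that both entry points go through the same function.

```python
from datetime import datetime, timezone

# Frozen vocabulary (hypothetical), built once at training time and shipped
# with the model. Serving never rebuilds it.
CATEGORY_VOCAB = ("books", "games", "music")


def transform(raw: dict) -> list:
    """The single source of truth for feature computation."""
    # Timestamps are always interpreted in UTC, never local time.
    ts = datetime.fromtimestamp(raw["ts_epoch"], tz=timezone.utc)
    # One documented null policy, applied everywhere (assumed: fill with 0.0).
    amount = raw.get("amount")
    amount = 0.0 if amount is None else float(amount)
    # One-hot against the frozen vocabulary; unseen categories map to zeros.
    one_hot = [1.0 if raw.get("category") == c else 0.0 for c in CATEGORY_VOCAB]
    return [float(ts.hour), amount] + one_hot


def build_training_matrix(rows):
    """Training entry point: batch of raw rows to feature vectors."""
    return [transform(row) for row in rows]


def featurize_request(payload):
    """Serving entry point: one raw payload to one feature vector."""
    return transform(payload)
```

With this layout, a change to the null policy or the timezone handling is made in one place and reaches training and serving together.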
"It runs in 200 ms on my laptop" tells you almost nothing about how it will feel to a user.
The number that matters is not the mean. It is the tail, typically P95 or P99. If a request fans out to ten services and each has an independent 1% chance of responding slowly, the user sees a slow page about 10% of the time (1 − 0.99^10 ≈ 9.6%). Garbage collection pauses, cold caches, model warmup, network jitter, contention: these are not edge cases. They are the everyday texture of a real system.
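The fan-out arithmetic and the gap between mean and tail are easy to check by hand. The sketch below assumes slow responses are independent across services and uses the nearest-rank definition of a percentile; both are simplifying assumptions, and the sample latencies are made up.

```python
import math


def p_any_slow(p_slow_per_call: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    return 1.0 - (1.0 - p_slow_per_call) ** fan_out


def percentile(samples, q: float) -> float:
    """Nearest-rank percentile: smallest sample with at least q% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and two slow ones.
latencies_ms = [20] * 98 + [900, 1500]
mean_ms = sum(latencies_ms) / len(latencies_ms)  # 43.6
p50_ms = percentile(latencies_ms, 50)            # 20
p99_ms = percentile(latencies_ms, 99)            # 900
fan_out_slow = p_any_slow(0.01, 10)              # about 0.096
```

The mean of 43.6 ms describes no request that actually happened: almost everyone saw 20 ms, and the unlucky few saw something forty times worse.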
A solution that is correct but economically unsustainable is not viable. Efficiency is a first-class metric, alongside accuracy.
"It scales" is doing a lot of work in most engineering meetings. What scales — the algorithm, the cost, the team's ability to operate it? An O(n²) step that is invisible at 1,000 records is your weekend at 1,000,000. A model that costs $0.002 per call is fine until you do 50 million calls a day.
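Both numbers in that paragraph reward a minute of arithmetic. The helper names below are hypothetical; the inputs are the ones quoted above.

```python
def daily_cost(cost_per_call: float, calls_per_day: int) -> float:
    """Spend per day at a flat per-call price."""
    return cost_per_call * calls_per_day


def quadratic_growth(n_small: int, n_large: int) -> float:
    """How many times more work an O(n^2) step does when n grows."""
    return (n_large / n_small) ** 2


per_day = daily_cost(0.002, 50_000_000)           # about 100,000 dollars a day
per_year = per_day * 365                          # about 36.5 million dollars
blow_up = quadratic_growth(1_000, 1_000_000)      # 1,000,000 times more work
```

A thousandfold increase in records is a millionfold increase in work for the quadratic step: something that took one millisecond at 1,000 records takes roughly seventeen minutes at 1,000,000.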
In a demo, the network is perfect. In production, the network is the main thing happening.
Every external call — to a database, an API, a model server — is an opportunity for partial outage, latency spikes, or wrong-but-plausible responses. Resilience is the practice of assuming all of these will happen and deciding, ahead of time, what the system should do.
The vocabulary is small and worth memorizing: timeouts bound the wait, retries with backoff and jitter recover from transient failures, circuit breakers stop you from hammering a sick downstream, fallbacks keep some answer flowing when the ideal one isn't available.
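The vocabulary above fits in a few dozen lines. This is a minimal sketch, not a production library: TransientError and CircuitOpenError are hypothetical exception types, the wrapped function is assumed to enforce its own timeout, and the thresholds are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a sick downstream."""


def call_with_retries(fn, attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(); on TransientError, back off exponentially with full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)


def with_fallback(fn, fallback_value, **retry_kwargs):
    """Return fn()'s result, or a degraded answer if retries are exhausted."""
    try:
        return call_with_retries(fn, **retry_kwargs)
    except TransientError:
        return fallback_value


class CircuitBreaker:
    """Open after `threshold` consecutive failures; probe after `cooldown` s."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("downstream marked unhealthy")
            # Cooldown elapsed: let one probe call through (half-open).
        try:
            result = fn()
        except TransientError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

The sleep, rng, and clock parameters exist so the behavior can be tested without waiting; in normal use the defaults apply. Jitter matters because many clients retrying on the same schedule will hit a recovering service in synchronized waves.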
If a system degrades and you cannot tell, it has already degraded for everyone using it.
Observability is the property of a system that lets you ask new questions of it after deployment, without shipping new code. It rests on three complementary pillars: logs, metrics, and traces.
None of the three is enough alone. Logs without metrics flood you. Metrics without traces tell you something is slow but not where. Traces without logs lose the texture of what each step actually did.
Metrics tell you that something is slow, traces show where (db_query), and logs explain why (timeout). Any pillar alone leaves you guessing.
Correlate logs, metrics, and traces with a shared request_id, and define a small set of SLIs (service-level indicators) you actually look at.
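A minimal sketch of what carrying a request_id through every signal can look like, using only the standard library. The JSON field names and the span helper are assumptions; a real system would more likely use a tracing library than hand-rolled spans.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

logger = logging.getLogger("app")


def new_request_id() -> str:
    """Generate an id once at the edge and pass it to every downstream step."""
    return uuid.uuid4().hex


def format_event(request_id: str, event: str, **fields) -> str:
    """One JSON object per log line, always carrying the request_id."""
    return json.dumps({"request_id": request_id, "event": event, **fields},
                      sort_keys=True)


@contextmanager
def span(request_id: str, name: str, sink=logger.info):
    """Time one step and emit its duration and outcome (a span in miniature)."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception as exc:
        status = type(exc).__name__  # the "why", e.g. TimeoutError
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        sink(format_event(request_id, "span", name=name, status=status,
                          duration_ms=round(duration_ms, 3)))
```

Wrapping a database call in `with span(request_id, "db_query"):` produces one line that answers where the time went and, if the call raised, what went wrong; because every line shares the same request_id, a single search reassembles the whole request.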
A model deployed today is being slowly outvoted by tomorrow's data.
The world changes. User behavior shifts. Sensor calibrations age. Marketing campaigns introduce new product categories the model has never seen. Whatever metric was true at deployment will quietly decay — sometimes gently, sometimes abruptly. The same is true of any non-ML system whose behavior depends on input distributions: rules, heuristics, even SQL queries against changing data.
Monitoring drift means watching two things: the inputs (has the data distribution shifted?) and the outputs (has performance degraded?). The first warns you early; the second is the truth.
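For the input side, one common choice (among several, and not necessarily the one this essay has in mind) is the Population Stability Index, which compares the binned distribution of a live sample against a reference sample. The sketch below uses equal-width bins over the reference range and a small floor to avoid dividing by zero; both are assumptions.

```python
import math


def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    0.0 means identical binned distributions; larger means more shift.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp values outside the reference range into the edge bins.
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below stays finite.
        return [max(c / len(sample), eps) for c in counts]

    ref_p = proportions(reference)
    live_p = proportions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_p, live_p))
```

Values above roughly 0.25 are often treated as a significant shift, but that threshold is a rule of thumb, not a law. Input drift only tells you to look; whether the model has actually degraded still has to be measured on labeled outcomes.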
Deploys should be the most boring part of your week.
Two truths about new code: it has bugs you haven't found, and the cost of finding them grows with the number of users exposed. Safe deployment is the practice of shrinking that exposure until the bugs surface — then either rolling forward confidently or rolling back without drama.
The toolkit is well established: continuous integration catches the obvious bugs in seconds; canary deploys route a small percentage of traffic to the new version while the rest stays on the known-good one; feature flags let you turn things off without a redeploy; blue-green keeps the previous version warm and routable. Together they make rollback a one-button operation instead of a one-night incident.
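Canary routing and feature flags are usually handled by a load balancer or a flag service, but the core logic is small enough to sketch. Everything named here (the salt, the FLAGS dictionary, the "new_ranker" flag) is hypothetical.

```python
import hashlib

# Hypothetical in-memory flag store; a real one would be remotely updatable.
FLAGS = {"new_ranker": True}


def bucket(user_id: str, salt: str = "canary-v1") -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_canary(user_id: str, percent: int) -> bool:
    """True for roughly `percent` percent of users, always the same ones."""
    return bucket(user_id) < percent


def choose_version(user_id: str, canary_percent: int) -> str:
    """Pick which version serves this user."""
    if not FLAGS.get("new_ranker", False):
        return "stable"  # kill switch: turned off without a redeploy
    return "canary" if in_canary(user_id, canary_percent) else "stable"
```

Hashing the user id, instead of picking at random per request, keeps each user on one version for the whole rollout, so a bug report can be traced to a version. Raising canary_percent widens exposure; flipping the flag to False is the rollback.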
A system without an owner is a system in slow decline. A system without input validation is a system you do not control.
Security in production is not glamorous; it is mostly hygiene. Secrets in a vault, never in a repo. Inputs validated at every trust boundary. Dependencies updated on a schedule. Least-privilege access, rotated credentials, audit logs. None of these are optional, and none of them are exciting — which is precisely why they get skipped.
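Two of those hygiene items, secrets kept out of the repo and inputs validated at the trust boundary, can be sketched briefly. The sketch assumes secrets are injected into the process environment at runtime (for example by a vault agent); the order-id format is made up.

```python
import os
import re


class MissingSecret(RuntimeError):
    """Raised at startup when a required secret was not injected."""


def get_secret(name: str, env=os.environ) -> str:
    """Read a secret from the runtime environment; never from source code."""
    value = env.get(name)
    if not value:
        raise MissingSecret(f"secret {name!r} is not set")
    return value


# Hypothetical format: 6 to 12 uppercase letters or digits.
_ORDER_ID = re.compile(r"[A-Z0-9]{6,12}")


def parse_order_id(raw) -> str:
    """Allow-list validation: accept only the expected shape, reject the rest."""
    if not isinstance(raw, str) or not _ORDER_ID.fullmatch(raw):
        raise ValueError("invalid order id")
    return raw
```

Failing fast on a missing secret surfaces a misconfiguration at deploy time instead of on the first request, and validating against an allow-list means unexpected input is rejected by default instead of needing to be anticipated.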
Ownership is the same idea applied to people. Who is paged when this breaks? If the answer is "nobody specific" or "it depends," the answer is nobody, and the system is slowly rotting. Every production component needs a named owner, a documented runbook, and an on-call rotation. Even small teams. Especially small teams.
"Hardcoded credentials and unchecked inputs are acceptable in demos." Reconsider: muscle memory built in demos travels to production. Build the right habits in the playground.
At minimum: a CODEOWNERS file, a runbook with the top three failure modes and what to do, and an on-call schedule that someone outside the team can read.
Before merging a feature for production — proportional to its blast radius — walk this list. Not all items apply to every project. The ones that do should be defensible.
- [...] request_id (SLIs defined and visible on a dashboard)
- CODEOWNERS; runbook written; on-call set (if it breaks at 3 AM, someone knows what to do)

A feature is not complete when it works in a demo. It is complete when it survives production.