Production readiness is not a checklist you complete once.
It is a set of properties that should stay true while the system changes, scales, fails, and recovers. A service is not production-ready merely because it deployed successfully. It is production-ready when its important invariants survive ordinary operation.
Working is not ready
Many systems look healthy in the first deploy because nothing has stressed them yet.
The difference shows up later:
- A pod starts, but traffic reaches it before it is ready.
- A process listens on a port, but cannot talk to its database.
- A service scales up, but its dependencies collapse under retries.
- A backup job exists, but no one has restored from it.
- A deployment rolls forward, but there is no known rollback path.
- A container runs, but with broader privileges than the workload needs.
“It works” describes a moment. Readiness describes behavior across time.
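The gap between "listening" and "ready" can be made explicit in configuration. As a sketch (the endpoint path and port are illustrative, and the `/healthz/ready` endpoint is assumed to verify downstream dependencies, not just the socket), a Kubernetes readiness probe keeps traffic away from a pod that accepts connections but cannot yet do real work:

```yaml
# Sketch: the pod receives traffic only once /healthz/ready succeeds --
# an endpoint assumed to check dependencies (database, cache),
# not merely that the process is listening.
readinessProbe:
  httpGet:
    path: /healthz/ready   # hypothetical endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```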
Invariants beat vibes
The best production checks are concrete enough to reject unsafe changes.
For a service, useful invariants might be:
- Every workload declares resource requests and limits.
- Readiness checks prove the service can handle real work.
- Liveness checks restart processes that cannot recover themselves.
- At least one instance remains available during planned disruption.
- Logs and metrics expose enough signal to debug without shell access.
- Network paths are deny-by-default, then opened deliberately.
- Secrets are not stored in application code or committed in plaintext to repository history.
- Backups are restored on a schedule, not only created on a schedule.
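Several of these invariants can be declared directly in Kubernetes terms. The sketch below is illustrative, not prescriptive (service name, image, ports, and thresholds are hypothetical): resource requests and limits, distinct readiness and liveness checks, a reduced privilege posture, availability during planned disruption, and deny-by-default networking.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders                 # hypothetical service
spec:
  replicas: 3
  selector:
    matchLabels: {app: orders}
  template:
    metadata:
      labels: {app: orders}
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.4.2   # illustrative
          resources:
            requests: {cpu: 250m, memory: 256Mi}
            limits: {cpu: "1", memory: 512Mi}
          readinessProbe:      # "can I take real work?"
            httpGet: {path: /healthz/ready, port: 8080}
          livenessProbe:       # "am I stuck beyond self-recovery?"
            httpGet: {path: /healthz/live, port: 8080}
          securityContext:     # no broader privileges than the workload needs
            runAsNonRoot: true
            allowPrivilegeEscalation: false
---
# At least one instance stays available during planned disruption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels: {app: orders}
---
# Deny-by-default: no ingress or egress until a policy opens a path.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
```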
These are not Kubernetes-only ideas. Kubernetes just makes the defaults visible. Any production platform needs answers for capacity, health, isolation, observability, secrets, and recovery.
Automate the boring gates
Production standards that rely on memory will drift.
If every service owner has to remember the same safety rules, the organization will eventually ship something without them. That does not mean every team is careless. It means manual consistency does not scale.
The better pattern is to turn readiness into boring gates:
- Templates for the default safe path.
- Policy checks in CI.
- Runtime admission checks where needed.
- Dashboards that expose missing signals.
- Runbooks that name the operational hazards.
This is where GitOps Is a Recovery Pattern helps. If Git is the route into production, readiness checks can happen before state is applied, and drift can be noticed after state is applied.
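One way to make such a gate concrete is an admission policy that rejects workloads missing resource declarations. The sketch below uses Kyverno, one of several policy engines; the policy name and message are illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits   # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-resources
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Every container must declare resource requests and limits."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    memory: "?*"
```

The same check can run earlier in CI against rendered manifests, so the admission controller is a backstop rather than the first place anyone hears about the rule.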
Recovery is part of readiness
A system is not ready until recovery has been tested.
Backups that have never been restored are assumptions. Failover that has never been practiced is a hope. Emergency access that depends on one person’s laptop is not an access model.
Readiness should include the uncomfortable questions:
- What happens if the primary database is unavailable?
- What happens if the deploy pipeline is unavailable?
- What happens if the secret store is unavailable?
- What happens if the operator account is locked out?
- What happens if the latest release must be rolled back quickly?
Those questions belong in design, not only in postmortems.
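Restore testing can itself be one of the boring gates. As a sketch (the image, scripts, and schedule are hypothetical), a CronJob that restores the latest backup into a scratch database and fails loudly when it cannot turns "backups exist" into "restores work":

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restore-drill            # hypothetical name
spec:
  schedule: "0 4 * * 1"          # weekly drill, Monday 04:00
  jobTemplate:
    spec:
      backoffLimit: 0            # a failed drill should alert, not retry silently
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: restore
              image: registry.example.com/restore-drill:latest  # illustrative
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # Hypothetical steps: fetch the latest backup, restore it
                  # into a scratch database, run a sanity check, and exit
                  # nonzero on any failure so the drill is visible.
                  ./fetch-latest-backup.sh &&
                  ./restore-to-scratch.sh &&
                  ./verify-rowcounts.sh
```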
Readiness should make change safer
The point of production readiness is not to slow teams down with ritual.
The point is to make routine change less frightening. When health checks, resource limits, observability, access paths, backups, and rollback behavior are built into the system, engineers can ship with less hidden risk.
That is the practical value of Failure-Mode Thinking as a Design Discipline. It moves the team from “will this work?” to “what must remain true when it does not?”