Failure-Mode Thinking as a Design Discipline

There are two ways to write code that touches production state.

The first writes for the happy path. Inputs arrive in the expected shape, the network stays up, the process runs to completion, and nothing else is competing for the same resource. Error handling exists, but it is added near the end, often as a wrapper around the optimistic version of the function.

The second writes from the opposite end. Every operation is designed by asking what happens if it crashes mid-flight, runs twice, or runs concurrently with itself. Idempotency is a property of the design, not a patch.

The difference is rarely about coding ability. It is about the frame the author started from.

Common signals of happy-path thinking

Errors silently discarded because “if it fails, it’s fine” — without specifying which errors are actually expected.
Locks that any process can release, because the lock value is a shared constant rather than something unique to the holder (a PID, UUID, or other process identity).
IDs allocated from a random number generator with no atomic guarantee. It works in tests because the keyspace is empty.
State machines whose steps are too coarse, so a crash mid-step forces resume from the beginning of the step rather than the actual point of failure.
Idempotency enforced outside the state machine — usually by a file existence check — instead of being a property of the state itself.

These are not unusual patterns. They are what reasonable code looks like when the author is thinking about success.
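As an illustration of the first signal, the sketch below contrasts a blanket suppression with one that names the single error it expects. The function names and the file-cleanup scenario are assumptions made for the example, not taken from any particular codebase.

```python
import os
import tempfile


def remove_quietly(path: str) -> None:
    """Happy-path version: every failure is discarded, expected or not."""
    try:
        os.remove(path)
    except Exception:
        pass  # "if it fails, it's fine", but which failures were meant?


def remove_if_present(path: str) -> bool:
    """Return True if a file was removed, False if it was already gone."""
    try:
        os.remove(path)
    except FileNotFoundError:
        # Expected: an earlier run, or a retry of this one, already
        # removed the file. Nothing else is suppressed.
        return False
    # Any other OSError (permissions, path is a directory) is not
    # expected here and propagates to the caller.
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "job.tmp")
        open(target, "w").close()
        print(remove_if_present(target))  # the file existed
        print(remove_if_present(target))  # the file was already gone
        remove_quietly(workdir)  # fails on a directory, and nobody finds out
```

Both functions survive a missing file. Only the second one tells the caller when something other than a missing file went wrong.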

What failure-mode thinking looks like

Silent error suppression is reserved for cases that are explicitly expected, with the reason written next to the suppression.
Lock values include holder identity so they cannot be released by accident.
Identifiers are monotonic, database-backed, or content-addressed — not probabilistic.
State machine steps are sized to match the smallest unit of work that can be safely retried.
Idempotency is checked against the state machine itself, not an external artifact that can drift.
Permanent and transient errors are distinguished, because retry semantics depend on which one occurred.

None of this is exotic. It is mostly a consistent habit of asking the same questions before writing a function.
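The point about lock values can be sketched in a few lines. This is a minimal in-memory illustration, assuming a hypothetical `LockTable` class whose dictionary stands in for a shared lock store; a real store would need its own atomic check-and-delete, and this sketch has no expiry.

```python
import threading
import uuid
from typing import Dict, Optional


class LockTable:
    """In-memory stand-in for a shared lock store (hypothetical)."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        # Makes check-and-set atomic within this one process.
        self._guard = threading.Lock()

    def acquire(self, name: str) -> Optional[str]:
        """Return a fresh holder token, or None if the lock is taken."""
        token = uuid.uuid4().hex
        with self._guard:
            if name in self._owners:
                return None
            self._owners[name] = token
            return token

    def release(self, name: str, token: str) -> bool:
        """Release only if `token` matches the current holder."""
        with self._guard:
            if self._owners.get(name) != token:
                return False  # stale or foreign caller: leave the lock alone
            del self._owners[name]
            return True


if __name__ == "__main__":
    locks = LockTable()
    mine = locks.acquire("nightly-report")
    print(locks.acquire("nightly-report"))            # second worker is refused
    print(locks.release("nightly-report", "guess"))   # wrong token changes nothing
    print(locks.release("nightly-report", mine))      # the holder can release
```

Because the token is generated per acquisition, a worker that wakes up late and tries to release a lock it no longer holds gets `False` back instead of silently freeing someone else's lock.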

Why it shows up in production and not in review

Happy-path code passes review because the path being read is the happy path. Tests pass because the test fixtures describe the expected world. The gaps appear later, in the conditions tests rarely cover: a node disappearing mid-operation, two workers picking up the same task, a queue redelivering after a process restart, a disk filling at the wrong moment.
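The redelivery case is easy to reproduce in a test once the handler reads its progress from the state machine. The sketch below is one way to do that, assuming a hypothetical `TaskStore` with three made-up steps; the dictionary stands in for durable storage, and the `crash_before` parameter exists only to simulate a process dying mid-task.

```python
from enum import Enum
from typing import Dict, List, Optional


class Step(Enum):
    RECEIVED = 1
    APPLIED = 2
    NOTIFIED = 3


# For each state: the side effect to perform and the state it leads to.
TRANSITIONS = {
    Step.RECEIVED: ("apply", Step.APPLIED),
    Step.APPLIED: ("notify", Step.NOTIFIED),
}


class TaskStore:
    """In-memory stand-in for durable task state (hypothetical)."""

    def __init__(self) -> None:
        self.steps: Dict[str, Step] = {}
        self.effects: List[str] = []  # records side effects for the demo

    def handle(self, task_id: str, crash_before: Optional[Step] = None) -> Step:
        """Advance the task from wherever it stopped; safe to call again."""
        step = self.steps.setdefault(task_id, Step.RECEIVED)
        while step in TRANSITIONS:
            effect, next_step = TRANSITIONS[step]
            if next_step is crash_before:
                raise RuntimeError(f"simulated crash before {next_step.name}")
            self.effects.append(f"{effect}:{task_id}")
            self.steps[task_id] = step = next_step
        return step


if __name__ == "__main__":
    store = TaskStore()
    try:
        store.handle("t1", crash_before=Step.NOTIFIED)  # dies after "apply"
    except RuntimeError:
        pass
    store.handle("t1")  # redelivery resumes at "notify"
    store.handle("t1")  # a further redelivery does nothing
    print(store.effects)
```

The sketch leaves one gap open on purpose: the side effect and the state write are two separate operations, so a crash between them would repeat the effect on resume. A real system would need the two to commit together, or the side effect itself to be idempotent.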

These conditions are normal in any system that runs on more than one machine for more than a few days.

Where the discipline pays off

Failure-mode thinking is the difference between code that works once and code that keeps working. It is not a clever architectural pattern. It is a question discipline applied at the smallest scale: every function, every state transition, every external call.
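For the external-call case, the question "what happens if this fails?" has two different answers depending on whether the failure is transient or permanent. A minimal sketch, assuming two hypothetical exception classes and a hypothetical `call_with_retry` helper; how a real client maps its errors onto these two classes is left open.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying, e.g. a timeout or dropped connection (hypothetical)."""


class PermanentError(Exception):
    """Retrying cannot help, e.g. invalid input (hypothetical)."""


def call_with_retry(op: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.1) -> T:
    """Retry `op` on TransientError only; anything else propagates at once."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == attempts:
                raise  # budget exhausted: surface the last transient failure
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    outcomes = iter([TransientError("timeout"), "ok"])

    def flaky() -> str:
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    print(call_with_retry(flaky, base_delay=0.01))
```

Retrying is only safe when the operation being retried can run twice without harm, which is why this helper belongs alongside idempotent operations rather than in place of them.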

The same framing transfers to operating a system that is already in production. Instead of asking only “what is the fix?”, a more useful first question is “which boundary is failing?” Application logic, runtime configuration, routing, lifecycle, storage durability, and platform behavior each imply different next checks. The habit of separating boundaries before guessing is the same habit that produced the code that did not need rescuing in the first place.
