A runbook is an interface between a human and a system under pressure.

That framing changes how I write them. A runbook is not just a memory dump. It should expose the smallest reliable set of actions a person needs to diagnose, operate, or recover something without guessing.

Good runbooks define inputs and outputs

For each procedure, I want to know:

  • When should this be used?
  • What access is required?
  • What information do I need before starting?
  • What command or action should I run?
  • What output should I expect?
  • What result means “stop and escalate”?

This is similar to designing an API. The caller is a future human, often tired, often interrupted, and sometimes not the original author.
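
To make the API comparison concrete, here is a minimal sketch of a procedure as a typed record, with one field per question above. The class name, field names, and the example command are all hypothetical, chosen for illustration only. They do not come from any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Procedure:
    """One runbook procedure, with a field for each question in the list above."""

    name: str
    when_to_use: str
    required_access: list[str]
    needed_inputs: list[str]
    command: str
    expected_output: str
    escalate_if: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty, so gaps show up in review."""
        return [key for key, value in vars(self).items() if not value]


# Hypothetical example: "queuectl" is an invented command, not a real CLI.
drain_check = Procedure(
    name="Check queue depth",
    when_to_use="Order processing latency alert is firing",
    required_access=["read access to the queue cluster"],
    needed_inputs=["queue name"],
    command="queuectl depth --queue orders",
    expected_output="a single integer, normally under 1000",
    escalate_if="",  # deliberately left blank to show the gap
)

print(drain_check.missing_fields())
```

Printing the missing fields makes an incomplete procedure visible before anyone has to rely on it. Here the escalation condition is the gap.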

Runbooks should name hazards

The most valuable parts of a runbook are often the warnings:

  • This command restarts active workers.
  • This step is safe to repeat.
  • This step is not safe to repeat.
  • This query is read-only.
  • This migration changes data shape.
  • This rollback only works before cleanup.

Hazards should be close to the action. A warning buried in a paragraph at the top is easy to miss.
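
One way to keep hazards next to the action is to store them on the step itself and render them directly above the command. This is a sketch under assumptions: the `Hazard` labels, the `Step` type, and the `workerctl` command are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Hazard(Enum):
    READ_ONLY = "read-only"
    SAFE_TO_REPEAT = "safe to repeat"
    NOT_SAFE_TO_REPEAT = "NOT safe to repeat"
    RESTARTS_WORKERS = "restarts active workers"


@dataclass(frozen=True)
class Step:
    command: str
    hazards: tuple[Hazard, ...] = ()


def render(step: Step) -> str:
    """Render a step with its hazard notes on the lines directly above the command."""
    lines = [f"NOTE: {hazard.value}" for hazard in step.hazards]
    lines.append(f"$ {step.command}")
    return "\n".join(lines)


# "workerctl" is an invented command, used here only as a placeholder.
restart = Step(
    command="workerctl restart",
    hazards=(Hazard.RESTARTS_WORKERS, Hazard.NOT_SAFE_TO_REPEAT),
)
print(render(restart))
```

Because the notes travel with the step, copying the step into another document or reordering the procedure does not separate the warning from the command.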

Runbooks become better through use

Every incident or maintenance window should leave the runbook slightly better than before.

The useful edits are usually small:

  • Add the missing precondition.
  • Replace a vague check with an exact command.
  • Add the expected output.
  • Remove a step that no longer exists.
  • Link to the dashboard that actually helped.
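
As an illustration of replacing a vague check with an exact command plus its expected output, here is a small sketch. The `check` helper is hypothetical, and the command is a stand-in: a child Python process that prints a status line, used so the example runs anywhere.

```python
import subprocess
import sys


def check(command: list[str], expected: str) -> bool:
    """Run an exact command and report whether it succeeded and printed the expected text."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    return result.returncode == 0 and expected in result.stdout


# Stand-in for a real health check: a child process that prints "status: ok".
healthy = check([sys.executable, "-c", "print('status: ok')"], "status: ok")
print(healthy)
```

"Verify the service is healthy" leaves the reader to guess. A command paired with the text it should print does not.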

The point is not to make documentation perfect. The point is to make the next operation cheaper and less dependent on memory.