The AI Runbook Pattern

Runbooks exist because not every operational problem can be fully automated. Some situations require reading system state, applying judgment, and recommending a course of action — often under time pressure, often at 2am, often by someone who didn’t build the system they’re looking at.

The traditional answer to this is documentation: write down the steps, keep them up to date, hope whoever is on call can find them and interpret them correctly. It works until it doesn’t. Steps go stale. Context is missing. The person on call is one team removed from the service.

The AI Runbook Pattern is a different answer. Instead of a document a human follows, you have an agent that reads the environment, reasons over what it finds, and produces a structured report. The human still makes the call. But the legwork — the log inspection, the metric correlation, the candidate hypotheses — is done by the agent.

What makes it different from automation

Full automation and the AI Runbook Pattern are often conflated, but they solve different problems.

Full automation is appropriate when the remediation is known, deterministic, and safe to apply without review. If a pod is crash-looping because of an OOM condition and your response is always to increase the memory limit and redeploy, you can automate that. The decision tree is short and the cost of getting it wrong is understood.

The AI Runbook Pattern is for everything else. Situations where the cause isn’t known. Where multiple things might be wrong simultaneously. Where the remediation depends on context that isn’t captured in any single metric. Where you want a second opinion before touching production.
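One way to picture the split is as a routing decision: failures with a known, pre-approved fix go to automation, and everything else goes to the agent for diagnosis. The sketch below is illustrative only. The `Incident` type, the `KNOWN_REMEDIATIONS` table, and the `route` function are hypothetical names, not part of Octopus or Kubernetes.

```python
from dataclasses import dataclass

# Hypothetical table: failure signatures that have a known, pre-approved fix.
KNOWN_REMEDIATIONS = {
    "OOMKilled": "raise memory limit and redeploy",
}


@dataclass
class Incident:
    signature: str  # e.g. the container's last termination reason


def route(incident: Incident) -> str:
    """Pick automation only when a pre-approved fix exists for this signature."""
    fix = KNOWN_REMEDIATIONS.get(incident.signature)
    if fix is not None:
        return f"automate: {fix}"
    # Unknown or ambiguous cause: gather evidence and leave the decision to a human.
    return "diagnose: run AI runbook and wait for human review"
```

The table is meant to stay short. Anything that doesn't match a signature exactly falls through to diagnosis rather than to a best-guess fix.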

In these situations, an agent that can read broadly and reason over what it finds is more useful than a script that checks a fixed set of conditions. The output isn’t an action — it’s a diagnosis and a recommendation, written in a format a human can read and act on.
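To make "a format a human can read and act on" concrete, here is a minimal sketch of what a structured report might look like before it is rendered. The `Report` class and its section names are assumptions made for illustration. The pattern itself doesn't prescribe a schema.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical shape for the agent's output: diagnosis, evidence, next steps."""

    summary: str
    evidence: list[str] = field(default_factory=list)
    recommendations: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        # Render fixed sections so every report reads the same way at 2am.
        lines = ["## Diagnosis", "", self.summary, "", "## Evidence", ""]
        lines += [f"- {item}" for item in self.evidence]
        lines += ["", "## Recommendations", ""]
        lines += [f"- {item}" for item in self.recommendations]
        return "\n".join(lines)
```

Keeping the sections fixed means the engineer always knows where to look for the diagnosis, the supporting evidence, and the suggested next step.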

A concrete example

A Kubernetes deployment fails. The pipeline step that triggered it marks the release as failed and stops. The on-call engineer gets paged.

Without the AI Runbook Pattern, the engineer opens a terminal, starts running kubectl commands, checks logs, checks events, checks the deployment history. Ten minutes in, they’ve identified that the new image has a misconfigured liveness probe that’s causing the pod to restart before it’s ready.

With the AI Runbook Pattern, the Octopus runbook that fires on deployment failure includes an AI step. The agent receives a prompt with the deployment name, namespace, and Octopus API credentials. It runs kubectl describe, checks pod events, reads the last 100 lines of container logs, and cross-references the deployment spec. It returns a markdown report: the liveness probe is firing before the application port is open, here are the relevant log lines, here’s the section of the deployment spec to review.
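The evidence-gathering part of that step can be sketched as a fixed set of read-only kubectl calls whose output is handed to the agent as context. This is a rough illustration under assumptions, not how any particular agent works internally. The function names `diagnostic_commands` and `collect` are hypothetical, and the sketch assumes `kubectl` is on the path and already authenticated against the right cluster.

```python
import subprocess


def diagnostic_commands(deployment: str, namespace: str, log_lines: int = 100) -> list[list[str]]:
    """Build the read-only kubectl calls described above (nothing here mutates state)."""
    ns = ["--namespace", namespace]
    return [
        ["kubectl", "describe", "deployment", deployment, *ns],
        ["kubectl", "get", "events", *ns, "--sort-by=.lastTimestamp"],
        # Logs for a deployment come from one of its pods, chosen by kubectl.
        ["kubectl", "logs", f"deployment/{deployment}", *ns, f"--tail={log_lines}"],
        ["kubectl", "get", "deployment", deployment, *ns, "-o", "yaml"],
    ]


def collect(commands: list[list[str]], run=subprocess.run) -> str:
    """Run each command and join the outputs into one context blob for the agent."""
    sections = []
    for cmd in commands:
        # check=False: a failing command is itself evidence, so keep its stderr.
        result = run(cmd, capture_output=True, text=True, check=False)
        sections.append(f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return "\n\n".join(sections)
```

Calling `collect(diagnostic_commands("web", "prod"))` would produce a single text blob to include in the prompt. The `run` parameter exists only so the function can be exercised without a cluster.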

The engineer still makes the fix. But they walk into the situation with a diagnosis instead of starting from scratch.

flowchart TD
    Trigger["Deployment failure\n(Octopus release)"]
    Runbook["AI Runbook step\n(Claude Code / Copilot CLI)"]
    Context["Environment context\n(kubectl, Octopus API,\npod logs, events)"]
    Report["Markdown report\n(diagnosis + recommendations)\nattached as Octopus artifact"]
    Human["On-call engineer\nreviews and acts"]

    Trigger --> Runbook
    Runbook --> Context
    Context --> Runbook
    Runbook --> Report
    Report --> Human

The human-in-the-loop is the point

The pattern deliberately stops short of autonomous remediation. This isn’t a limitation — it’s a design choice.

Autonomous remediation requires high confidence that the agent’s diagnosis is correct and that the proposed fix is safe to apply. In the general case, neither is guaranteed. An agent that can read logs and correlate events is useful. An agent that can also restart services, roll back deployments, or modify configurations — without a human reviewing the diagnosis first — is a liability.

The sweet spot is an agent that does the time-consuming observational work and presents its findings clearly. The human review step is fast because the groundwork is done. The decision to act remains with a person who can weigh context the agent might not have.
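One way to keep the agent on the observational side of that line is to check every command it proposes against an allowlist of read-only verbs. The sketch below is an assumption about how such a guard could look. `READ_ONLY_VERBS` and `is_read_only` are hypothetical names, and a check like this is a convenience rather than a security boundary. Credentials bound to a read-only Kubernetes RBAC role are the stronger control.

```python
# Hypothetical allowlist: kubectl verbs that only read cluster state.
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "events"}


def is_read_only(command: list[str]) -> bool:
    """Accept only `kubectl <verb> ...` where the verb is on the allowlist.

    Fails closed: anything unexpected (another binary, flags before the
    verb, an empty command) is rejected rather than guessed at.
    """
    return (
        len(command) >= 2
        and command[0] == "kubectl"
        and command[1] in READ_ONLY_VERBS
    )
```

With this in place, `kubectl get pods` passes while `kubectl rollout undo` or `kubectl delete` is refused, which leaves rollbacks and restarts to the person reading the report.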

As confidence in specific agent behaviours builds over time — as you learn which diagnoses it gets right reliably — the human checkpoint can move. But that’s a progression earned through operational experience, not a starting assumption.

The AI Runbook Pattern gives you a place to start that’s useful immediately and extensible deliberately.

This post is licensed under CC BY 4.0 by the author.