What Is Harness Engineering? A Tested Guide for Codex and Claude Code

You ask an AI agent to tidy a repository, and it edits unrelated configuration. You ask it to run the tests, and it reports success without leaving evidence that the command ran. These failures are common when a team starts delegating work to coding agents.

The prompt is only part of the problem. Even with a clear instruction, the agent’s visible context, available tools, stopping conditions, and definition of success often remain underspecified. Harness engineering is the work of designing that surrounding system: the footing an agent needs to do useful work without making every decision itself.

This guide first grounds the term in OpenAI’s published Codex case study. It then builds a small harness around a model provider’s API that limits file access, rejects overwrites, and ships with executable tests. The final section states exactly what was and was not verified.

Key takeaways

A harness is not one wrapper script. It combines repository knowledge, tools, permissions, tests, logs, recovery paths, and human approval.
OpenAI’s Codex case study made repository structure, application behavior, and quality constraints legible and enforceable for agents.
Agents can research, draft, and repeat; humans should retain judgment over deletion, production changes, external communication, and spending.
A lexical path-prefix check is not a complete sandbox. Symlinks, overwrites, process privileges, and operating-system isolation still matter.
“Tested” should name the command, result, and scope. A model’s success message is not evidence.

What harness engineering actually means

A prompt says what you want on this run. A harness defines the environment in which the instruction is attempted.

Layer	Question it answers	Minimal example
Context	What can the agent learn?	`AGENTS.md`, a focused directory, a versioned spec
Tools	What can it do?	Read, test, and create a draft
Permissions	Where must it stop?	Human approval for deletion or sending
Verification	What counts as done?	`npm test` exits with status 0
Observability	How can a failure be traced?	Command, diff, and relevant error output
Recovery	How is a bad run reversed?	Small commits, dry runs, and rollback steps

The model is only one component in this loop.

Human goal and approval policy
             ↓
Repository rules → AI agent → allowed tools
        ↑                        ↓
Specifications             tests, logs, diffs
        └── repair on failure; escalate on judgment ──┘

Changing the model cannot reveal a private design decision that exists only in someone’s memory. Conversely, a model becomes more dependable when the relevant knowledge is discoverable and the acceptance criteria are executable.

Why the term took off in 2026: OpenAI’s Codex case study

One major source of current interest is OpenAI’s February 11, 2026 article, Harness engineering: leveraging Codex in an agent-first world.

OpenAI reported that three engineers used Codex to produce roughly 1,500 pull requests over about five months. The useful lesson is not the throughput number on its own. The team redesigned the environment around the assumption that agents, rather than humans typing code, would perform the implementation work.

Their published approach included:

keeping plans and design decisions in versioned repository artifacts;
making the UI, logs, metrics, and traces directly inspectable by agents;
enforcing dependency direction and other invariants with structural tests and custom linters;
treating a failed run as evidence of a missing tool, rule, or abstraction rather than asking the model to “try harder”; and
using recurring cleanup to find stale documentation and accumulated drift.

This is not a recipe for an enormous system prompt. Knowledge that lives only in chat, an external document, or a person’s head is invisible during the run. The goal is progressive disclosure: a small stable entry point that links to the next relevant source, backed by checks that prevent important rules from becoming optional.
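
One such check, enforcing dependency direction, could be as small as the following sketch. It is not taken from OpenAI's repository: the layer names, the `src/<layer>/` layout, and the `findDirectionViolations` helper are all assumptions made for illustration, and the import edges are assumed to have been extracted already by a parser or bundler.

```javascript
// Hypothetical layer order, lowest first: a file may import only from its own layer or a lower one.
const LAYER_ORDER = ["domain", "services", "ui"];

// Assumes the layer is the first directory under src/, e.g. "src/ui/page.js" -> "ui".
function layerOf(filePath) {
  const match = /^src\/([^/]+)\//.exec(filePath);
  return match ? match[1] : null;
}

// edges: [{ from, to }] pairs of repository-relative paths, one per import statement.
function findDirectionViolations(edges) {
  const violations = [];
  for (const { from, to } of edges) {
    const fromRank = LAYER_ORDER.indexOf(layerOf(from));
    const toRank = LAYER_ORDER.indexOf(layerOf(to));
    if (fromRank === -1 || toRank === -1) continue; // outside the layered tree; not checked here
    if (toRank > fromRank) {
      violations.push(`${from} must not import ${to}: move the shared code down a layer`);
    }
  }
  return violations;
}

const sampleEdges = [
  { from: "src/ui/page.js", to: "src/domain/user.js" }, // allowed: ui -> domain
  { from: "src/domain/user.js", to: "src/ui/format.js" } // violation: domain -> ui
];
const violations = findDirectionViolations(sampleEdges);
console.log(violations);
```

Each violation message names a fix, so an agent that breaks the rule gets an instruction it can act on rather than a bare failure.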

Claude Code supports the same general design. The official Claude Agent SDK hooks documentation shows how a hook can inspect a tool request, deny it, modify its input, or record it for audit. Different products expose different controls, but the harness owns the boundary and feedback loop in each case.

What the agent should do, and what a human should decide

Do not begin with full autonomy. Automate reversible work first and require approval when an action affects customers, money, or production.

Good first automation	Delegate with conditions	Keep a human decision
Search files	Edit existing files	Delete production data
Run tests	Add a dependency	Send a customer email
Summarize a diff	Deploy to staging	Change billing or a contract
Create a draft	Push a branch	Process sensitive personal data

Ask two questions: can the action be reversed cheaply, and can it affect someone outside the team? Start with read access and temporary output. Promote an operation to automatic only after its successful and failed cases are observable.
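
The two questions can be turned into a small decision function. This is a sketch of one possible reading of the table above, not a rule stated by any vendor; the tier names and the `approvalTier` function are illustrative assumptions.

```javascript
// Maps the two questions to one of three tiers. The tier labels mirror the
// table columns and are illustrative, not an established taxonomy.
function approvalTier({ cheaplyReversible, affectsOutsiders }) {
  if (affectsOutsiders) return "human-decision"; // customers, money, or production
  if (!cheaplyReversible) return "delegate-with-conditions";
  return "automate";
}

console.log(approvalTier({ cheaplyReversible: true, affectsOutsiders: false })); // automate
console.log(approvalTier({ cheaplyReversible: false, affectsOutsiders: false })); // delegate-with-conditions
console.log(approvalTier({ cheaplyReversible: true, affectsOutsiders: true })); // human-decision
```

Checking the outside-effect question first means an action that touches customers is never auto-approved just because it looks easy to undo.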

Build a minimal harness

The example gives the model only two capabilities:

read text inside the sandbox directory; and
create a new text file inside the sandbox directory.

There is no delete, overwrite, shell, or network tool. The sample was checked with Node.js 22, and the SDK is pinned to the version used for this verification.

mkdir harness-demo
cd harness-demo
npm init -y
npm install @anthropic-ai/sdk@<version>
mkdir sandbox
echo "# meeting notes" > sandbox/note.md

Create policy.json:

{
  "workspace": "./sandbox",
  "maxSteps": 6,
  "maxToolResultChars": 4000
}

1. Enforce the file boundary in code

Create safe-files.mjs. A check such as candidate.startsWith(root) is insufficient on its own: a similarly named directory can match, and a symlink inside the workspace can resolve outside it. The read path below checks the resolved target, while writes are limited to new files.

import { open, readFile, realpath } from "node:fs/promises";
import path from "node:path";

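// Accept the root itself or any path beneath it. Appending path.sep stops a
// sibling such as "sandbox-old" from matching the "sandbox" prefix.
function assertInside(root, candidate) {
  if (candidate !== root && !candidate.startsWith(root + path.sep)) {
    throw new Error(`outside workspace: ${candidate}`);
  }
}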

export async function createFileGate(workspace) {
  const root = await realpath(path.resolve(workspace));

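  async function readText(relativePath) {
    // Lexical check first: rejects "../" traversal before touching the file system.
    const requested = path.resolve(root, relativePath);
    assertInside(root, requested);
    // Resolved check second: rejects a symlink whose target lies outside the root.
    const actual = await realpath(requested);
    assertInside(root, actual);
    return readFile(actual, "utf8");
  }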

  async function createText(relativePath, content) {
    const requested = path.resolve(root, relativePath);
    assertInside(root, requested);
    // The file does not exist yet, so resolve and check its parent directory instead.
    const actualParent = await realpath(path.dirname(requested));
    assertInside(root, actualParent);

    let handle;
    try {
      // "wx" creates the file exclusively and fails with EEXIST if the path already exists.
      handle = await open(requested, "wx", 0o600);
      await handle.writeFile(content, "utf8");
    } catch (error) {
      if (error.code === "EEXIST") {
        throw new Error(`refusing to overwrite: ${relativePath}`);
      }
      throw error;
    } finally {
      // handle is undefined when open() itself failed.
      await handle?.close();
    }
    return "created";
  }

  return { readText, createText };
}

This is an application-level guard, not a complete security boundary. Use a container, virtual machine, operating-system permissions, or the product’s sandbox when stronger isolation is required. Application checks do not neutralize administrator-level process access.
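
To see why the bare prefix check mentioned earlier falls short, compare it with the separator-aware comparison on a sibling directory that shares the prefix. The paths below are made up for the demonstration, and the sketch hard-codes `/` as the separator, whereas safe-files.mjs uses `path.sep`.

```javascript
const root = "/work/sandbox";
const sibling = "/work/sandbox-backup/secret.txt"; // a different directory that shares the prefix

// Naive check: plain string prefix.
const naive = sibling.startsWith(root);

// Boundary-aware check: the candidate must equal root or continue with a separator.
const bounded = sibling === root || sibling.startsWith(root + "/");

console.log({ naive, bounded }); // { naive: true, bounded: false }
```

Neither comparison can see symlinks, because both operate on strings. That is why the gate also resolves the real path before comparing.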

2. Expose only two tools to the model

Create agent.mjs. The model name is deliberately supplied through ANTHROPIC_MODEL rather than frozen in the article, because account access and model availability change.

import Anthropic from "@anthropic-ai/sdk";
import { readFile } from "node:fs/promises";
import { createFileGate } from "./safe-files.mjs";

const model = process.env.ANTHROPIC_MODEL;
if (!model) throw new Error("Set ANTHROPIC_MODEL to a model available to your account.");

const policy = JSON.parse(await readFile("./policy.json", "utf8"));
const gate = await createFileGate(policy.workspace);
const client = new Anthropic();

const tools = [
  {
    name: "read_file",
    description: "Read a UTF-8 text file inside the workspace",
    input_schema: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
      additionalProperties: false
    }
  },
  {
    name: "create_file",
    description: "Create a new UTF-8 file; existing files cannot be overwritten",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string" },
        content: { type: "string" }
      },
      required: ["path", "content"],
      additionalProperties: false
    }
  }
];

async function runTool(name, input) {
  if (name === "read_file") return gate.readText(input.path);
  if (name === "create_file") return gate.createText(input.path, input.content);
  throw new Error(`unknown tool: ${name}`);
}

const prompt = process.argv.slice(2).join(" ") ||
  "Read note.md and create summary.md with a three-line summary.";
const messages = [{ role: "user", content: prompt }];

for (let step = 0; step < policy.maxSteps; step += 1) {
  const response = await client.messages.create({
    model,
    max_tokens: 1200,
    system: "Use only the supplied tools. Never claim a file was created unless the tool succeeded.",
    tools,
    messages
  });
  messages.push({ role: "assistant", content: response.content });

  // No tool calls means the model has finished: print its final text and stop.
  const calls = response.content.filter((block) => block.type === "tool_use");
  if (calls.length === 0) {
    console.log(response.content.find((block) => block.type === "text")?.text ?? "done");
    process.exit(0);
  }

  const results = [];
  for (const call of calls) {
    try {
      const value = await runTool(call.name, call.input);
      results.push({
        type: "tool_result",
        tool_use_id: call.id,
        content: String(value).slice(0, policy.maxToolResultChars)
      });
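    } catch (error) {
      // Report the failure to the model as a tool error so it can repair or stop,
      // instead of crashing the run or claiming success.
      results.push({
        type: "tool_result",
        tool_use_id: call.id,
        is_error: true,
        content: error.message
      });
    }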
  }
  messages.push({ role: "user", content: results });
}

throw new Error(`step limit exceeded: ${policy.maxSteps}`);
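
The loop above does not yet record what the tools did. One way to add that observability is to wrap the tool runner so that every call leaves a record. The sketch below is an optional extension, not part of the verified sample: `withAudit`, the in-memory `auditLog` array, and the stand-in `fakeRunTool` are hypothetical names, and a real harness would write to an append-only file or log service instead.

```javascript
// Wrap any async tool runner so every call is recorded, whether it succeeds or fails.
// Only the tool name and path are logged; file contents are deliberately left out.
function withAudit(runTool, sink) {
  return async function auditedRunTool(name, input) {
    const entry = { at: new Date().toISOString(), tool: name, path: input?.path ?? null };
    try {
      const value = await runTool(name, input);
      sink.push({ ...entry, ok: true });
      return value;
    } catch (error) {
      sink.push({ ...entry, ok: false, error: error.message });
      throw error; // the caller still decides how to report the failure to the model
    }
  };
}

// Stand-in runner for demonstration only; agent.mjs would pass its own runTool.
async function fakeRunTool(name, input) {
  if (name !== "read_file") throw new Error(`unknown tool: ${name}`);
  return `contents of ${input.path}`;
}

const auditLog = [];
const audited = withAudit(fakeRunTool, auditLog);

const demo = audited("read_file", { path: "note.md" })
  .then(() => audited("delete_file", { path: "note.md" }))
  .catch(() => {}) // the rejection is expected; the log keeps the evidence
  .then(() => {
    console.log(auditLog);
    return auditLog;
  });
```

Because the wrapper rethrows, the existing try/catch in the loop would keep returning errors to the model exactly as before.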

3. Test the gate before calling a model

The critical boundary can be tested locally without spending API credits. Create safe-files.test.mjs:

import assert from "node:assert/strict";
import test from "node:test";
import { mkdtemp, mkdir, rm, symlink, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import path from "node:path";
import { createFileGate } from "./safe-files.mjs";

test("file gate blocks traversal, overwrite, and outside symlinks", async () => {
  const base = await mkdtemp(path.join(tmpdir(), "harness-test-"));
  const root = path.join(base, "sandbox");
  const outside = path.join(base, "outside.txt");

  try {
    await mkdir(root);
    await writeFile(path.join(root, "note.md"), "hello", "utf8");
    await writeFile(outside, "secret", "utf8");
    const gate = await createFileGate(root);

    assert.equal(await gate.readText("note.md"), "hello");
    await assert.rejects(() => gate.readText("../outside.txt"), /outside workspace/);
    await assert.rejects(() => gate.createText("note.md", "replace"), /refusing to overwrite/);

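    try {
      await symlink(outside, path.join(root, "outside-link.txt"), "file");
      await assert.rejects(() => gate.readText("outside-link.txt"), /outside workspace/);
    } catch (error) {
      // Skip only when the platform refuses to create symlinks (EPERM);
      // assertion failures carry a different code and are rethrown.
      if (error.code !== "EPERM") throw error;
    }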

    assert.equal(await gate.createText("summary.md", "safe"), "created");
  } finally {
    await rm(base, { recursive: true, force: true });
  }
});

Run the offline checks:

node --test safe-files.test.mjs
node --check agent.mjs

Only then set ANTHROPIC_API_KEY and ANTHROPIC_MODEL and run node agent.mjs. Do not place credentials in source control or in policy.json.

Three practical use cases

1. Software teams: implement and verify a pull request

Give the agent a focused issue, the relevant directories, and the test commands. “Code was written” is not the acceptance condition. Require a failing reproduction, a passing post-fix test, and a readable diff. Keep production deployment and migrations behind human approval.

2. Media operations: quality-gate an article

Separate article generation from checks for duplicate topics, depth, code syntax, links, and mobile rendering. A failed check should stop publication and return a specific remediation message. This site uses article-specific CI checks so that a content change cannot deploy merely because the prose says it was tested.
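
A gate of this kind can be modeled as a list of checks that each return either nothing or a remediation message. The sketch below is illustrative only: the two rules, the 50-word threshold, and the `runGate` helper are assumptions, not this site's actual CI checks.

```javascript
// Each check returns null on success or a specific remediation message on failure.
function minWords(limit) {
  return (article) => {
    const count = article.body.trim().split(/\s+/).filter(Boolean).length;
    return count >= limit ? null : `Body has ${count} words; expand it to at least ${limit}.`;
  };
}

function noPlaceholders(article) {
  return /\bTODO\b/.test(article.body)
    ? "Body contains a TODO marker; resolve it before publishing."
    : null;
}

// Publication is allowed only when every check passes.
function runGate(article, checks) {
  const failures = checks.map((check) => check(article)).filter((message) => message !== null);
  return { publish: failures.length === 0, failures };
}

const draft = { title: "Example", body: "Short draft. TODO add benchmarks." };
const gateResult = runGate(draft, [minWords(50), noPlaceholders]);
console.log(gateResult);
```

Running every check instead of stopping at the first failure gives the author or agent the full remediation list in one pass.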

3. Customer operations: classify an inquiry and draft a reply

The agent can classify a message and produce a reply draft with reasons. A human approves customer-record changes and the actual send. Pass only the personal information needed for classification, and avoid copying full message bodies into long-lived logs.
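
A minimal sketch of that data handling follows, assuming e-mail addresses are the only identifier to mask. The `redactEmails` and `toLogEntry` helpers and the ticket ID are hypothetical, and the pattern is deliberately simple, so a real deployment would need rules for names, phone numbers, and account numbers as well.

```javascript
// Replace e-mail addresses with a placeholder before the text reaches the classifier.
// The pattern is intentionally loose and does not validate addresses.
function redactEmails(text) {
  return text.replace(/[^\s@]+@[^\s@]+\.[^\s@]+/g, "[email]");
}

// Build the long-lived log entry from metadata only; the message body is never copied in.
function toLogEntry(ticketId, category, body) {
  return { ticketId, category, bodyLength: body.length, loggedAt: new Date().toISOString() };
}

const message = "Please update the invoice address for jane@example.com today.";
const forClassifier = redactEmails(message);
const logEntry = toLogEntry("T-1001", "billing", message); // hypothetical ticket ID

console.log(forClassifier); // Please update the invoice address for [email] today.
console.log(Object.keys(logEntry));
```

The log entry keeps enough to trace a decision (ticket, category, size, time) without retaining what the customer wrote.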

A simple ROI calculation

Measure saved review time and rework, not generated tokens. Suppose a team spends 20 minutes reviewing each of 15 weekly tasks: five hours per week. If the initial harness takes six hours to build and the remaining weekly effort, including harness maintenance, falls to one hour, the team saves four hours per week and recovers the setup time in roughly a week and a half (6 ÷ 4 = 1.5).
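
The arithmetic can be written out as a small function. This sketch assumes the one hour per week is the total human time remaining once the harness is in place; the function and parameter names are illustrative.

```javascript
// Weeks until the setup time is repaid, given weekly human hours before and after the harness.
function breakEvenWeeks({ setupHours, weeklyHoursBefore, weeklyHoursAfter }) {
  const weeklySaving = weeklyHoursBefore - weeklyHoursAfter;
  if (weeklySaving <= 0) return Infinity; // the harness never pays for itself
  return setupHours / weeklySaving;
}

const weeklyHoursBefore = (20 * 15) / 60; // 20 minutes x 15 tasks = 5 hours
const weeks = breakEvenWeeks({ setupHours: 6, weeklyHoursBefore, weeklyHoursAfter: 1 });
console.log(weeks); // 1.5
```

Substituting measured values for the three inputs shows quickly whether a proposed harness is worth building.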

That is an illustrative calculation, not a promised result. Measure these for two weeks before and after introduction:

human minutes per task;
rework rate;
defects caught before production; and
number of human-approval escalations.

A high volume of approval requests suggests that some proven low-risk operation can be scoped narrowly and promoted to automatic. A rise in defects or rework means the harness needs another check or clearer context, not broader autonomy.

Pitfalls and fixes

Treating a folder-name check as a sandbox

A path can appear to be inside the workspace while a symlink resolves elsewhere. Resolve targets, reject overwrites, and enforce operating-system permissions as a second boundary.

Writing “never do anything dangerous” in the prompt

Text is guidance, not enforcement. Do not expose a dangerous tool, or deny it in a pre-tool hook. See the Claude Code permissions guide for a concrete configuration.
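
The allow-or-deny decision itself is simple. The sketch below is a generic gate written for this article, not the Claude Agent SDK hook signature or a Claude Code configuration format; `ALLOWED_TOOLS` and `preToolDecision` are hypothetical names.

```javascript
// Generic pre-execution gate: anything not explicitly allowed is denied.
// This only illustrates the decision a real hook or permission rule would make.
const ALLOWED_TOOLS = new Set(["read_file", "create_file"]);

function preToolDecision(toolName) {
  return ALLOWED_TOOLS.has(toolName)
    ? { allow: true }
    : { allow: false, reason: `tool not exposed by this harness: ${toolName}` };
}

console.log(preToolDecision("read_file")); // { allow: true }
console.log(preToolDecision("run_shell").reason); // tool not exposed by this harness: run_shell
```

An allowlist fails closed: a tool added later stays blocked until someone decides to permit it.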

Accepting the model’s “tests passed” message

Record the command, exit status, and verification scope. UI work also needs direct interaction or screenshot evidence. The verification receipt workflow shows how to preserve that evidence.
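
A minimal sketch of such a record is shown below, using Node's built-in `child_process.spawnSync`. The receipt field names and the `runWithReceipt` helper are illustrative assumptions, not the format used by the verification receipt workflow mentioned above.

```javascript
import { spawnSync } from "node:child_process";

// Run a command and return a record of what ran and how it exited.
function runWithReceipt(command, args, scope) {
  const result = spawnSync(command, args, { encoding: "utf8" });
  return {
    command: [command, ...args].join(" "),
    exitStatus: result.status, // null if the process failed to start or was killed by a signal
    passed: result.status === 0,
    scope,
    ranAt: new Date().toISOString(),
    outputTail: `${result.stdout ?? ""}${result.stderr ?? ""}`.slice(-2000)
  };
}

// process.execPath is the current Node.js binary, so the demo runs wherever Node does.
const receipt = runWithReceipt(process.execPath, ["-e", "process.exit(3)"], "demo: exit status only");
console.log(receipt.exitStatus, receipt.passed); // 3 false
```

The harness, not the model, produces the receipt, so a "tests passed" claim can be checked against a recorded exit status.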

Sending every document on every run

Long context can bury the one constraint that matters. Give the agent a small entry point with links to focused, versioned sources. Track freshness and verification status so obsolete documentation is discoverable.

Summary

Harness engineering is not the art of making a prompt longer. It makes relevant knowledge discoverable, limits tools, blocks sensitive actions, evaluates results with commands, and feeds failures back into the repository as better rules and tests.

For a first step, pick one workflow and write four lines: input, allowed actions, acceptance command, and human-approval actions. Teams that need to add permissions, verification, and review gates to a real repository can use the Claude Code training and implementation consultation to map those boundaries to their current workflow.
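
The four-line description suggested above can be kept as a small object and checked before any agent runs. Everything in this sketch is a hypothetical example: the field names, the sample values, and the `validateWorkflow` helper.

```javascript
// All values are hypothetical examples for one workflow.
const workflow = {
  input: "A bug report with reproduction steps",
  allowedActions: ["read_file", "create_file", "run_tests"],
  acceptanceCommand: "npm test",
  humanApproval: ["deploy", "delete_data", "send_email"]
};

// Returns a list of problems; an empty list means the description is usable.
function validateWorkflow(spec) {
  const problems = [];
  for (const key of ["input", "acceptanceCommand"]) {
    if (typeof spec[key] !== "string" || spec[key].trim() === "") {
      problems.push(`${key} must be a non-empty string`);
    }
  }
  for (const key of ["allowedActions", "humanApproval"]) {
    if (!Array.isArray(spec[key])) problems.push(`${key} must be a list`);
  }
  if (problems.length === 0) {
    // An action cannot be both automatic and approval-gated.
    for (const action of spec.allowedActions) {
      if (spec.humanApproval.includes(action)) {
        problems.push(`${action} is listed as both allowed and approval-gated`);
      }
    }
  }
  return problems;
}

console.log(validateWorkflow(workflow)); // []
```

Keeping the description in the repository makes it part of the context the agent can read and gives reviewers a single place to change the boundary.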

What was actually tested

On July 21, 2026, the safe-files.mjs and safe-files.test.mjs blocks in this article were extracted to a temporary directory and run with Node.js. The fixture checks normal reads, new-file creation, ../ traversal rejection, and overwrite rejection. It also checks rejection of an outside symlink on systems where the test process can create one. agent.mjs received a syntax check.

A live Anthropic API call is not part of this verification scope because model access and cost vary by account. That distinction is intentional. “Published,” “syntax checked,” “tested offline,” and “called a paid external API” are four different claims, and a trustworthy harness records which one is true.