The Complete Guide to Harness Engineering: Building AI Agents the Claude Code Way
Learn harness engineering with runnable Claude Code-style examples for tools, context, permissions, verification, and agent loops.
The “just throw a prompt at ChatGPT” era is over. Since 2025, the center of gravity in AI engineering has shifted rapidly toward harness engineering. The term now recurs throughout Anthropic’s engineering blog posts and OpenAI’s agent research.
Yet ask someone “what is a harness?” and few can answer crisply. In this article we unpack harness engineering using runnable code and Claude Code’s own architecture as the case study. By the end, you will have everything you need to build your own agent from scratch.
A harness is the “scaffolding” that surrounds an AI
“Harness” originally referred to tack for a horse or a safety belt for a climber. In software, think of a “test harness”: the outer scaffolding that makes something actually run.
In the AI world, a harness is the wrapper layer surrounding an LLM. Concretely, it bundles everything the model needs in order to operate on real-world tasks:
- Tools: read files, execute commands, call APIs, and so on
- Context management: what to remember, what to forget, what to compress
- Control loop: when to call, when to stop, when to retry
- Permissions and safeguards: prevent destructive operations from running unattended
- Memory: knowledge that persists across sessions
A prompt is just one input into this harness. With a weak harness, even the cleverest prompt hits a performance ceiling. This is why we increasingly hear “prompt engineering alone is not enough.”
Why the harness matters: think in OODA loops
An LLM on its own can only “generate the next token.” To solve real-world tasks, something has to run an OODA loop (Observe → Orient → Decide → Act), a decision model borrowed from military strategy.
| Phase | Description | Owner |
|---|---|---|
| Observe | Read the environment (files, DB queries) | Harness |
| Orient | Shape the information and hand it to the LLM | Harness |
| Decide | Pick the next move | LLM |
| Act | Execute (run commands, call APIs) | Harness |
As you can see, three of the four phases belong to the harness. The LLM is only strong at Decide. The quality of the scaffolding around it determines the quality of the whole agent.
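To make the division of labor concrete, here is a minimal sketch of that loop. The `OodaHarness` interface, the `runOoda` function, and the toy counter environment are all hypothetical names invented for illustration, and the `decide` step is a stub standing in for a real model call.

```typescript
// Hypothetical types for illustration; none of these names come from a real SDK.
type Decision = { action: string; done: boolean };

interface OodaHarness<Obs> {
  observe(): Obs;                   // Harness: read the environment
  orient(obs: Obs): string;         // Harness: shape it into model input
  decide(prompt: string): Decision; // LLM: pick the next move (stubbed below)
  act(action: string): void;        // Harness: execute the move
}

function runOoda<Obs>(h: OodaHarness<Obs>, maxTurns: number): string[] {
  const log: string[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    const decision = h.decide(h.orient(h.observe()));
    if (decision.done) break;
    h.act(decision.action);
    log.push(decision.action);
  }
  return log;
}

// Toy environment: a counter that the stubbed "model" wants to raise to 3.
let counter = 0;
const actions = runOoda<number>(
  {
    observe: () => counter,
    orient: (n) => `counter is ${n}`,
    decide: (prompt) =>
      prompt === "counter is 3"
        ? { action: "", done: true }
        : { action: "increment", done: false },
    act: () => { counter += 1; },
  },
  10,
);
console.log(actions); // three "increment" entries
```

Only `decide` would touch the model in a real agent. Everything else, including the `maxTurns` guard, is ordinary code you control.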
Three harness levels, by example
Let’s solve the same “generate a blog post” task at three escalating harness levels.
Level 1: Raw API call (almost no harness)
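```typescript
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment
const client = new Anthropic();

const res = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 4096,
  messages: [{ role: "user", content: "Write a blog post" }],
});

// content is a list of typed blocks; print only the text ones
for (const block of res.content) {
  if (block.type === "text") console.log(block.text);
}
```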
Outcome: generic, hollow prose. Every run produces a different topic and structure.
Level 2: Add tools (medium harness)
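```typescript
// Continues from the Level 1 snippet: `Anthropic` and `client` are already defined.
// executeTool is the harness's own dispatcher; supply an implementation for your project.
declare function executeTool(name: string, input: unknown): Promise<string>;

const tools: Anthropic.Tool[] = [
  {
    name: "read_existing_posts",
    description: "Return the list of existing blog posts with their titles",
    input_schema: { type: "object", properties: {} },
  },
  {
    name: "write_post",
    description: "Write out an MDX file",
    input_schema: {
      type: "object",
      properties: {
        slug: { type: "string" },
        frontmatter: { type: "object" },
        body: { type: "string" },
      },
      required: ["slug", "frontmatter", "body"],
    },
  },
];

async function runAgent(userGoal: string) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: userGoal }];
  while (true) {
    const res = await client.messages.create({
      model: "claude-opus-4-6",
      max_tokens: 4096,
      tools,
      messages,
    });
    messages.push({ role: "assistant", content: res.content });

    // Only "tool_use" means there is work to do; end_turn, max_tokens, etc. all stop the loop
    if (res.stop_reason !== "tool_use") break;

    // The harness executes every tool call in the turn and returns one result per call
    const results: Anthropic.ToolResultBlockParam[] = [];
    for (const block of res.content) {
      if (block.type !== "tool_use") continue;
      const result = await executeTool(block.name, block.input);
      results.push({ type: "tool_result", tool_use_id: block.id, content: result });
    }
    messages.push({ role: "user", content: results });
  }
}
```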
Outcome: a non-duplicate topic is picked and a correctly structured MDX file is produced. Just adding tools dramatically changes quality.
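The loop calls `executeTool`, which this article never defines. One possible shape is sketched below, assuming an in-memory `posts` map in place of a real content directory. The `Post` and `WritePostInput` types and the `runTool` helper are hypothetical, and a real dispatcher would read and write files instead.

```typescript
// Hypothetical in-memory stand-in for the blog's content directory
type Post = { title: string; mdx: string };
const posts = new Map<string, Post>([
  ["hello-world", { title: "Hello World", mdx: "---\ntitle: Hello World\n---\nFirst post." }],
]);

type WritePostInput = { slug: string; frontmatter: Record<string, unknown>; body: string };

// Synchronous core: returns the text that is handed back to the model
function runTool(name: string, input: unknown): string {
  const slugs = Array.from(posts.keys());
  switch (name) {
    case "read_existing_posts":
      return JSON.stringify(slugs.map((slug) => ({ slug, title: posts.get(slug)?.title })));
    case "write_post": {
      const { slug, frontmatter, body } = input as WritePostInput;
      if (posts.has(slug)) {
        // Say what is wrong and how to fix it, so the model can recover on its own
        return `Error: slug '${slug}' already exists. Existing slugs: ${slugs.join(", ")}. Pick a new slug.`;
      }
      const header = Object.keys(frontmatter)
        .map((key) => `${key}: ${String(frontmatter[key])}`)
        .join("\n");
      posts.set(slug, {
        title: String(frontmatter["title"] ?? slug),
        mdx: `---\n${header}\n---\n${body}`,
      });
      return `Wrote ${slug}.mdx`;
    }
    default:
      return `Error: unknown tool '${name}'. Available tools: read_existing_posts, write_post.`;
  }
}

// Async wrapper matching the call shape used in the agent loop
const executeTool = async (name: string, input: unknown): Promise<string> =>
  runTool(name, input);
```

Returning errors as strings rather than throwing keeps the loop alive and lets the model see what went wrong.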
Level 3: A full Claude Code-grade harness
- Autonomous loop (user approval, error retries)
- Context compression (summarize long conversations to save tokens)
- Subagent delegation (translate in an isolated context)
- Prompt caching (reuse the static prefix instead of paying full price for it every turn)
- Hooks (auto-lint before commit)
Wiring all of that by hand is a serious project. That is exactly why studying Claude Code as a reference implementation pays off.
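To give a feel for one item on that list, here is a rough sketch of context compression. It is not how Claude Code implements it. The `Turn` type, the `compactHistory` function, and the four-characters-per-token estimate are assumptions for illustration, and `summarize` stands in for a model call that would write the summary.

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Rough estimate of about 4 characters per token (an assumption, not a real tokenizer)
const estimateTokens = (turns: Turn[]): number =>
  Math.ceil(turns.reduce((total, turn) => total + turn.content.length, 0) / 4);

// Replace everything except the last `keepRecent` turns with a single summary turn
function compactHistory(
  turns: Turn[],
  budgetTokens: number,
  keepRecent: number,
  summarize: (older: Turn[]) => string,
): Turn[] {
  if (estimateTokens(turns) <= budgetTokens || turns.length <= keepRecent) return turns;
  const older = turns.slice(0, turns.length - keepRecent);
  const recent = turns.slice(turns.length - keepRecent);
  const summary: Turn = {
    role: "user",
    content: `Summary of earlier conversation: ${summarize(older)}`,
  };
  return [summary, ...recent];
}

const history: Turn[] = [
  { role: "user", content: "a".repeat(400) },
  { role: "assistant", content: "b".repeat(400) },
  { role: "user", content: "latest question" },
];
const compacted = compactHistory(history, 100, 1, (older) => `${older.length} earlier turns omitted`);
console.log(compacted.length); // 2
```

The history is left untouched while it fits the budget, so short sessions pay nothing for the feature.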
Dissecting Claude Code’s harness
Claude Code is arguably the most polished agent harness Anthropic has shipped. It can be decomposed into the following five layers.
Layer 1: Tool design
Tools like Read, Edit, Write, Bash, Glob, Grep, and Agent ship out of the box. Pay attention to their granularity:
- Grep is not plain grep but a ripgrep wrapper: accurate and fast
- Edit is not a whole-file rewrite but a targeted string replacement: minimal diffs
- Agent spawns a subagent and isolates its context
Tool quality maps directly to agent quality. “Anything that works” is not good enough. Design tools with idempotency, clear error messages, and a single responsibility in mind.
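As an illustration of those three properties, here is a sketch of an Edit-style tool. It is not Claude Code's actual implementation. The `applyEdit` function and `EditResult` type are hypothetical, and the rule that the target text must match exactly once is a design assumption for this example.

```typescript
type EditResult = { ok: true; content: string } | { ok: false; error: string };

// Single responsibility: replace exactly one occurrence of oldText, or explain why not
function applyEdit(content: string, oldText: string, newText: string): EditResult {
  if (oldText === "") {
    return { ok: false, error: "oldText is empty. Pass the exact text to replace." };
  }
  const matches = content.split(oldText).length - 1;
  if (matches === 0) {
    return {
      ok: false,
      error: "oldText was not found. Re-read the file and copy the text exactly, including whitespace.",
    };
  }
  if (matches > 1) {
    return {
      ok: false,
      error: `oldText matches ${matches} places. Include more surrounding text so it is unique.`,
    };
  }
  // A replacer function keeps "$&"-style sequences in newText literal
  return { ok: true, content: content.replace(oldText, () => newText) };
}

const first = applyEdit("const a = 1;\nconst b = 1;", "a = 1", "a = 2");
console.log(first);
```

Repeating the same call on the edited content fails with a "not found" message instead of silently changing the file again, which is the behavior you want when the model retries.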
Layer 2: Layered context
```text
~/.claude/CLAUDE.md      ← global rules
./CLAUDE.md              ← project rules (auto-loaded)
~/.claude/memory/        ← long-term memory (across sessions)
  ├── user_profile.md
  ├── feedback_xxx.md
  └── project_xxx.md
conversation history     ← recent turns
tasks/plan               ← progress in the current session
```
Each layer has a different lifetime and purpose. Writing to the wrong place loses information quickly, or worse, keeps stale data alive. Use tasks for “this session only” and memory for “reusable across sessions.”
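If you are building your own harness, the same layering can be modeled in a few lines. The sketch below is an assumption about how one might do it, not a description of Claude Code internals. `Layer`, `assembleContext`, and `destinationFor` are hypothetical names.

```typescript
type Lifetime = "global" | "project" | "cross-session" | "session";
type Layer = { source: string; lifetime: Lifetime; text: string };

// Concatenate non-empty layers in the given order, labelling each with its source
function assembleContext(layers: Layer[]): string {
  return layers
    .filter((layer) => layer.text.trim() !== "")
    .map((layer) => `[${layer.source} | ${layer.lifetime}]\n${layer.text.trim()}`)
    .join("\n\n");
}

// Route a new note: reusable knowledge goes to memory, everything else stays in tasks
function destinationFor(note: { reusableAcrossSessions: boolean }): "memory" | "tasks" {
  return note.reusableAcrossSessions ? "memory" : "tasks";
}

const context = assembleContext([
  { source: "~/.claude/CLAUDE.md", lifetime: "global", text: "Answer concisely." },
  { source: "./CLAUDE.md", lifetime: "project", text: "Posts live in blog/ as MDX." },
  { source: "memory/user_profile.md", lifetime: "cross-session", text: "" },
  { source: "tasks/plan", lifetime: "session", text: "1. Draft outline" },
]);
console.log(context);
```

Labelling each block with its source makes it easier to spot which layer a stale instruction came from.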
Layer 3: Subagent delegation
With the Agent tool you can spawn another agent in its own context.
```text
# Main gives directives; the heavy lifting goes to the subagent
Agent(
  subagent_type: "general-purpose",
  prompt: "Translate blog/harness.mdx into English plus 8 more languages,
           save each under blog-{lang}/, then report back"
)
```
This keeps the main context from being polluted by noisy logs. Long build logs, intermediate translations, search dumps — any work where “you only want the deliverable” — can be offloaded wholesale.
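The isolation idea is easy to see in a toy model. In the sketch below, `runSubagent`, `Msg`, and `Model` are hypothetical names, and `fakeModel` is a stub standing in for real LLM calls. The subagent accumulates its own transcript, and only the final report crosses back into the main context.

```typescript
type Msg = { role: "user" | "assistant"; content: string };
// Stand-in for a model call; a real harness would call an LLM API here
type Model = (messages: Msg[]) => string;

function runSubagent(
  model: Model,
  prompt: string,
  steps: number,
): { report: string; transcriptLength: number } {
  // Fresh transcript: nothing from the main conversation leaks in
  const transcript: Msg[] = [{ role: "user", content: prompt }];
  let last = "";
  for (let i = 0; i < steps; i++) {
    last = model(transcript);
    transcript.push({ role: "assistant", content: last });
  }
  // Only the final message is returned; the transcript is discarded
  return { report: last, transcriptLength: transcript.length };
}

const mainContext: Msg[] = [{ role: "user", content: "Translate the post into 9 languages" }];
const fakeModel: Model = (msgs) => `step ${msgs.length}: noisy intermediate output`;

const outcome = runSubagent(fakeModel, "Translate blog/harness.mdx", 5);
mainContext.push({ role: "assistant", content: `Subagent report: ${outcome.report}` });

console.log(mainContext.length, outcome.transcriptLength); // 2 6
```

The subagent produced six messages, but the main context grew by one.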
Layer 4: Hooks (deterministic processing)
.claude/settings.json lets you hook shell commands before and after tool calls.
```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx tsc --noEmit" }
        ]
      }
    ]
  }
}
```
Type-checking now runs automatically after every edit. Anything that “should be handled deterministically rather than asked of the LLM every time” belongs in a hook.
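If you want the same mechanism in your own harness, the dispatch logic is small. The sketch below assumes the matcher is a regular expression that must match the whole tool name. That is a simplification chosen for this example, and `commandsFor` and the types are hypothetical, so check the Claude Code hooks documentation for the exact matching rules.

```typescript
type HookCommand = { type: "command"; command: string };
type HookEntry = { matcher: string; hooks: HookCommand[] };
type HookConfig = { hooks: { [event: string]: HookEntry[] } };

// Assumption: the matcher is a regular expression anchored to the full tool name
function commandsFor(config: HookConfig, event: string, toolName: string): string[] {
  const commands: string[] = [];
  for (const entry of config.hooks[event] ?? []) {
    if (!new RegExp(`^(?:${entry.matcher})$`).test(toolName)) continue;
    for (const hook of entry.hooks) commands.push(hook.command);
  }
  return commands;
}

const config: HookConfig = {
  hooks: {
    PostToolUse: [
      { matcher: "Edit|Write", hooks: [{ type: "command", command: "npx tsc --noEmit" }] },
    ],
  },
};

console.log(commandsFor(config, "PostToolUse", "Edit")); // [ 'npx tsc --noEmit' ]
console.log(commandsFor(config, "PostToolUse", "Read")); // []
```

The harness, not the model, decides whether the command runs, which is what makes the behavior deterministic.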
Layer 5: Permission modes
```json
{
  "permissions": {
    "allow": ["Read", "Grep", "Glob"],
    "deny": ["Bash(rm -rf*)", "Bash(git push --force*)"],
    "ask": ["Write", "Edit", "Bash"]
  }
}
```
Explicitly deny destructive commands, and require approval for writes. Accidents happen when something runs unattended, so this layer determines your operational safety.
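To reason about how such rules interact, it helps to write the evaluation order down. The following sketch is a simplified model for your own harness, not Claude Code's actual matcher. It assumes deny beats ask, ask beats allow, `*` matches any run of characters, and anything unmatched falls back to asking. `decide`, `ruleMatches`, and `Verdict` are hypothetical names.

```typescript
type Verdict = "allow" | "ask" | "deny";
type Permissions = { allow: string[]; deny: string[]; ask: string[] };

// Does a rule such as "Bash(rm -rf*)" or a bare "Read" cover this tool call?
function ruleMatches(rule: string, tool: string, arg: string): boolean {
  const parsed = /^(\w+)(?:\((.*)\))?$/.exec(rule);
  if (!parsed || parsed[1] !== tool) return false;
  const spec = parsed[2];
  if (spec === undefined) return true; // a bare tool name covers every invocation
  const pattern = spec
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp(`^${pattern}$`).test(arg);
}

// Assumed precedence for this sketch: deny, then ask, then allow
function decide(p: Permissions, tool: string, arg = ""): Verdict {
  if (p.deny.some((rule) => ruleMatches(rule, tool, arg))) return "deny";
  if (p.ask.some((rule) => ruleMatches(rule, tool, arg))) return "ask";
  if (p.allow.some((rule) => ruleMatches(rule, tool, arg))) return "allow";
  return "ask"; // unknown operations are never run unattended
}

const permissions: Permissions = {
  allow: ["Read", "Grep", "Glob"],
  deny: ["Bash(rm -rf*)", "Bash(git push --force*)"],
  ask: ["Write", "Edit", "Bash"],
};

console.log(decide(permissions, "Bash", "rm -rf /tmp/build")); // deny
console.log(decide(permissions, "Bash", "ls -la"));            // ask
console.log(decide(permissions, "Read", "README.md"));         // allow
```

Checking deny first means a broad "ask" or "allow" rule can never override an explicit prohibition.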
Five pitfalls to watch for
1. Too many tools
Give the model 30 tools and it will flounder over which to pick, and accuracy drops. As a rule of thumb, aim for 5 to 15 and push overflow capability into subagents.
2. Not exploiting prompt caching
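Skip the Claude API’s cache_control and your long system prompt is processed and billed at the full input rate on every turn. Cache the static parts, and mind the default 5-minute TTL. Note that the system prompt goes in the top-level system parameter, not in messages.
```typescript
const res = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 4096,
  // The system prompt is a top-level parameter; "system" is not a valid message role
  system: [
    { type: "text", text: longStaticInstructions,
      cache_control: { type: "ephemeral" } }, // ← cache everything up to this block
    { type: "text", text: dynamicContext },
  ],
  messages,
});
```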
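A quick back-of-the-envelope calculation shows why this matters. The sketch below counts the cost of the static prefix in "full-price token units". The write and read multipliers are assumptions for illustration, so check the current pricing page before relying on them. It also assumes every turn arrives before the cache entry expires. `prefixCost` is a hypothetical helper.

```typescript
// Cost of the static prefix over a session, in units of one full-price input token.
// writeMultiplier and readMultiplier are assumed values, not authoritative pricing.
function prefixCost(
  prefixTokens: number,
  turns: number,
  cached: boolean,
  writeMultiplier = 1.25,
  readMultiplier = 0.1,
): number {
  if (turns <= 0) return 0;
  if (!cached) return prefixTokens * turns;
  // First turn writes the cache; later turns read it (assuming no TTL expiry)
  return prefixTokens * writeMultiplier + prefixTokens * readMultiplier * (turns - 1);
}

const uncached = prefixCost(10_000, 10, false);
const withCache = prefixCost(10_000, 10, true);
console.log({ uncached, withCache });
```

Under these assumptions, a 10,000-token prefix over 10 turns costs 100,000 units uncached and roughly 21,500 units cached. A single-turn call is slightly more expensive with caching, because the cache write carries a surcharge.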
3. Error messages the LLM cannot read
A tool that returns only Error: undefined cannot be self-repaired by the model. Say what is wrong and how to fix it.
```typescript
throw new Error(
  `File '${path}' does not exist. ` +
  `Files currently in scripts/: ${list.join(", ")}`
);
```
4. Skipping human approval
Auto-approving destructive actions (delete, force push, DB updates) guarantees that disaster arrives eventually. Default to “ask for writes, deny for deletes.”
5. Never tidying memory
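Stale information keeps pulling the agent toward wrong assumptions. Memory needs regular pruning too: in Claude Code, /compact condenses the current conversation, while persistent memory files have to be reviewed and edited by hand (the /memory command opens them).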
Run your own mini harness
Finally, here is a minimal harness you can run locally with Node.js + TypeScript.
```typescript
// mini-harness.ts
// Run as an ES module (the file uses top-level await) with ANTHROPIC_API_KEY set
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync, writeFileSync } from "fs";

const client = new Anthropic();

const tools: Anthropic.Tool[] = [
  { name: "read_file",
    description: "Read a text file",
    input_schema: { type: "object", properties: { path: { type: "string" } }, required: ["path"] } },
  { name: "write_file",
    description: "Write out a text file",
    input_schema: { type: "object", properties: { path: { type: "string" }, content: { type: "string" } }, required: ["path", "content"] } },
];

// One executor per tool name; each returns the text handed back to the model
const executors: Record<string, (input: any) => string> = {
  read_file: ({ path }) => readFileSync(path, "utf-8"),
  write_file: ({ path, content }) => { writeFileSync(path, content); return `written ${path}`; },
};

async function loop(goal: string, maxSteps = 10) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: goal }];
  for (let i = 0; i < maxSteps; i++) {
    const res = await client.messages.create({
      model: "claude-opus-4-6", max_tokens: 4096, tools, messages,
    });
    messages.push({ role: "assistant", content: res.content });
    // Anything other than "tool_use" (end_turn, max_tokens, ...) ends the run
    if (res.stop_reason !== "tool_use") return res.content;

    // Answer every tool call; report failures to the model instead of crashing
    const results: Anthropic.ToolResultBlockParam[] = [];
    for (const block of res.content) {
      if (block.type !== "tool_use") continue;
      try {
        const run = executors[block.name];
        if (!run) throw new Error(`Unknown tool '${block.name}'`);
        results.push({ type: "tool_result", tool_use_id: block.id, content: run(block.input) });
      } catch (err) {
        results.push({ type: "tool_result", tool_use_id: block.id, content: String(err), is_error: true });
      }
    }
    messages.push({ role: "user", content: results });
  }
}

await loop("Read README.md and save a 3-line summary as TL;DR.md");
```
That alone gives you a mini-agent that can read an existing file and write out a new one. Add a Grep tool, a Bash tool, and an Agent tool on top, and you have a miniature Claude Code.
How to apply this in real work
The most common harness mistake is trying to build a universal agent first. In production work, the opposite pattern wins: start with a narrow loop that is easy to observe, easy to verify, and easy to roll back.
Before I let Claude Code touch a recurring workflow, I define four boundaries.
| Boundary | Decide first | Failure if you skip it |
|---|---|---|
| Input | Which files, logs, URLs, or tickets the agent may read | The agent reads too much and loses the real task |
| Output | Whether the deliverable is an MDX file, PR, report, or patch | The result becomes a polished explanation instead of a usable artifact |
| Verification | Which command, screenshot, public URL, or diff proves success | Broken output reaches production because generation looked convincing |
| Permission | Which operations are automatic, ask-first, or forbidden | Deploys, deletes, billing changes, or secrets leak into the wrong place |
For content operations, the harness is not just “write an article.” It is: read existing posts, choose a non-duplicate topic, write MDX, check code fences, run the build, deploy, and inspect the live URL. For engineering teams, add git diff, tests, review criteria, and rollback notes. That is when Claude Code stops being a chat box and becomes an operational layer around the work.
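The verification boundary is the easiest one to automate. As one example, here is a sketch of the "check code fences" step for a content pipeline. `checkCodeFences` and `FenceCheck` are hypothetical names, and the rule is deliberately simple: every line that starts with three backticks toggles a fence, and the file must end outside one.

```typescript
type FenceCheck = { ok: boolean; message: string };

// Built at runtime so this sample does not contain a literal fence of its own
const FENCE = "`".repeat(3);

function checkCodeFences(mdx: string): FenceCheck {
  let openedAt = 0; // 1-based line of the currently open fence, 0 when outside a fence
  mdx.split("\n").forEach((line, index) => {
    if (line.trim().indexOf(FENCE) === 0) {
      openedAt = openedAt === 0 ? index + 1 : 0;
    }
  });
  return openedAt === 0
    ? { ok: true, message: "All code fences are closed." }
    : { ok: false, message: `Code fence opened on line ${openedAt} is never closed.` };
}

const good = ["Intro", FENCE + "ts", "const x = 1;", FENCE, "Outro"].join("\n");
const broken = ["Intro", FENCE + "ts", "const x = 1;", "Outro"].join("\n");

console.log(checkCodeFences(good).message);   // All code fences are closed.
console.log(checkCodeFences(broken).message); // Code fence opened on line 2 is never closed.
```

A check like this belongs between "write MDX" and "run the build", and its message is written so that the agent can fix the file without human help.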
Recommended next step
If you are experimenting alone, copy the mini harness above and run it on a disposable folder first. Then read the Claude Code Permissions Guide to draw the line between safe automation and ask-first actions. If your real bottleneck is knowledge management rather than permissions, the Claude Code x Obsidian guide is the better next article.
For a lightweight reference, keep the free Claude Code Quick Reference Cheatsheet open beside your terminal. If you are designing a team workflow, content engine, or safer deployment loop, compare the self-serve material on the products page first. Book a consultation only when the hard part is the operating model: permissions, reviews, proof, ownership, and revenue path.
What I verified while updating this article
This site now uses a small harness for its own publishing flow: article generation, localized content checks, code-fence checks, build, deploy, and live URL inspection. Earlier versions relied too heavily on prompting and occasionally missed broken code blocks or failed deploys. Adding checks did not make the writing more magical; it made the failure modes visible. That is the practical value of harness engineering.
Summary: from prompt author to harness architect
| Old mindset | New mindset |
|---|---|
| A great prompt yields a great output | A great harness yields a great output |
| Pick a model | Design model + tools + context + permissions |
| One-shot questions | Continuous loop operation |
Claude Code is the best teaching material for absorbing this shift in perspective. Don’t just use it — break it apart and fold the ideas into your own agent. That is the posture required of AI engineers from 2026 onward.
Start by copy-pasting the mini harness above and running it. Ten minutes from now, you will have taken the first step toward your very own agent.
Related articles
- 10 Subagent Patterns in Claude Code
- CLAUDE.md Best Practices
- Token Optimization Techniques for Claude Code
- Claude Code Permissions Guide
About the Author
Masa
Engineer focused on practical Claude Code workflows. Runs claudecode-lab.com, a 10-language technical media site.