The Complete Guide to Harness Engineering: Building AI Agents the Claude Code Way
Prompts alone can't tame an LLM. Learn how to weave tools, context, and control loops into a harness, with runnable code and Claude Code's own design as a teacher.
The “just throw a prompt at ChatGPT” era is over. Since 2025, the center of gravity in AI engineering has shifted rapidly toward harness engineering. It is one of the most frequently repeated keywords in Anthropic’s internal blog posts and in OpenAI’s agent research.
Yet ask someone “what is a harness?” and few can answer crisply. In this article we unpack harness engineering using runnable code and Claude Code’s own architecture as the case study. By the end, you will have everything you need to build your own agent from scratch.
A harness is the “scaffolding” that surrounds an AI
“Harness” originally referred to tack for a horse or a safety belt for a climber. In software, think of a “test harness”: the outer scaffolding that makes something actually run.
In the AI world, a harness is the wrapper layer surrounding an LLM. Concretely, it bundles everything the model needs in order to operate on real-world tasks:
- Tools: read files, execute commands, call APIs, and so on
- Context management: what to remember, what to forget, what to compress
- Control loop: when to call, when to stop, when to retry
- Permissions and safeguards: prevent destructive operations from running unattended
- Memory: knowledge that persists across sessions
A prompt is just one input into this harness. With a weak harness, even the cleverest prompt hits a performance ceiling. This is why we increasingly hear “prompt engineering alone is not enough.”
Why the harness matters: think in OODA loops
An LLM on its own can only “generate the next token.” To solve real-world tasks, you must spin an OODA loop (Observe → Orient → Decide → Act) borrowed from military strategy.
| Phase | Description | Owner |
|---|---|---|
| Observe | Read the environment (files, DB queries) | Harness |
| Orient | Shape the information and hand it to the LLM | Harness |
| Decide | Pick the next move | LLM |
| Act | Execute (run commands, call APIs) | Harness |
As you can see, three of the four phases belong to the harness. The LLM is only strong at Decide. The quality of the scaffolding around it determines the quality of the whole agent.
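This division of labor can be made concrete as a loop skeleton. The names below (`observe`, `orient`, `decide`, `act`) are illustrative, not any real SDK API; the point is that only `decide` touches the model:

```typescript
// Illustrative OODA skeleton: the harness owns observe/orient/act,
// and only `decide` is delegated to the LLM.
type Decision = { done: boolean; action?: string };

interface OodaHandlers {
  observe: () => Promise<string>;                 // read the environment
  orient: (raw: string) => string;                // shape info for the model
  decide: (context: string) => Promise<Decision>; // the LLM call
  act: (action: string) => Promise<void>;         // execute the decision
}

async function runOoda(h: OodaHandlers, maxSteps = 10): Promise<number> {
  for (let step = 0; step < maxSteps; step++) {
    const raw = await h.observe();
    const context = h.orient(raw);
    const decision = await h.decide(context);
    if (decision.done || !decision.action) return step; // model says stop
    await h.act(decision.action);
  }
  return maxSteps; // safety valve: never loop forever
}
```

Notice that the model never touches the environment directly: everything it sees and everything it does goes through harness code you control.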
Three harness levels, by example
Let’s solve the same “generate a blog post” task at three escalating harness levels.
Level 1: Raw API call (almost no harness)
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const res = await client.messages.create({
model: "claude-opus-4-6",
max_tokens: 4096,
messages: [{ role: "user", content: "Write a blog post" }],
});
console.log(res.content[0].text);
Outcome: generic, hollow prose. Every run produces a different topic and structure.
Level 2: Add tools (medium harness)
const tools = [
{
name: "read_existing_posts",
description: "Return the list of existing blog posts with their titles",
input_schema: { type: "object", properties: {} },
},
{
name: "write_post",
description: "Write out an MDX file",
input_schema: {
type: "object",
properties: {
slug: { type: "string" },
frontmatter: { type: "object" },
body: { type: "string" },
},
required: ["slug", "frontmatter", "body"],
},
},
];
async function runAgent(userGoal: string) {
let messages = [{ role: "user", content: userGoal }];
while (true) {
const res = await client.messages.create({
model: "claude-opus-4-6",
max_tokens: 4096,
tools,
messages,
});
if (res.stop_reason === "end_turn") break;
// The harness executes the tool call
    const toolUse = res.content.find((c) => c.type === "tool_use");
    if (!toolUse) break; // model stopped without requesting a tool
    const result = await executeTool(toolUse.name, toolUse.input);
messages.push({ role: "assistant", content: res.content });
messages.push({
role: "user",
content: [{ type: "tool_result", tool_use_id: toolUse.id, content: result }],
});
}
}
Outcome: a non-duplicate topic is picked and a correctly structured MDX file is produced. Just adding tools dramatically changes quality.
Level 3: A full Claude Code-grade harness
- Autonomous loop (user approval, error retries)
- Context compression (summarize long conversations to save tokens)
- Subagent delegation (translate in an isolated context)
- Prompt caching (do not resend the static prefix)
- Hooks (auto-lint before commit)
Wiring all of that by hand is a serious project. That is exactly why studying Claude Code as a reference implementation pays off.
Dissecting Claude Code’s harness
Claude Code is the most polished agent harness inside Anthropic. It decomposes into the following five layers.
Layer 1: Tool design
Tools like Read, Edit, Write, Bash, Glob, Grep, and Agent ship out of the box. Pay attention to their granularity:
- `Grep` is not plain `grep` but a ripgrep wrapper: accurate and fast
- `Edit` is not a whole-file rewrite but a targeted string replacement: minimal diffs
- `Agent` spawns a subagent and isolates its context
Tool quality maps directly to agent quality. “Anything that works” is not good enough. Design tools with idempotency, clear error messages, and a single responsibility in mind.
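As a sketch of these principles, here is how a hypothetical `read_file` executor might bake them in (the function name and error wording are mine, not Claude Code's):

```typescript
import { existsSync, readdirSync, readFileSync } from "fs";
import { dirname } from "path";

// Single responsibility (read only), idempotent (safely re-runnable),
// and errors written for the model, not just for humans.
function readFileTool(input: { path?: string }): string {
  if (!input.path) {
    throw new Error(
      "Missing required field 'path'. Pass the file to read, e.g. { path: \"README.md\" }."
    );
  }
  if (!existsSync(input.path)) {
    const dir = dirname(input.path);
    const siblings = existsSync(dir)
      ? readdirSync(dir).join(", ")
      : "(directory not found)";
    // Tell the model what DOES exist so it can self-correct next turn
    throw new Error(`File '${input.path}' does not exist. Files in '${dir}': ${siblings}`);
  }
  return readFileSync(input.path, "utf-8");
}
```

The error carries the directory listing, so on its next turn the model can pick a file that actually exists instead of guessing again.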
Layer 2: Layered context
~/.claude/CLAUDE.md ← global rules
./CLAUDE.md ← project rules (auto-loaded)
~/.claude/memory/ ← long-term memory (across sessions)
├── user_profile.md
├── feedback_xxx.md
└── project_xxx.md
conversation history ← recent turns
tasks/plan ← progress in the current session
Each layer has a different lifetime and purpose. Writing to the wrong place loses information quickly, or worse, keeps stale data alive. Use tasks for “this session only” and memory for “reusable across sessions.”
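A minimal sketch of how a harness might assemble these layers, most general first (the paths are parameters here; Claude Code's real loader is more involved):

```typescript
import { existsSync, readFileSync } from "fs";
import { join } from "path";

// Concatenate context layers in priority order: global rules first,
// project rules second, so more specific layers can override.
function buildLayeredContext(globalDir: string, projectDir: string): string {
  const layers = [
    join(globalDir, "CLAUDE.md"),  // global rules
    join(projectDir, "CLAUDE.md"), // project rules
  ];
  return layers
    .filter((p) => existsSync(p))          // missing layers are simply skipped
    .map((p) => `# Source: ${p}\n${readFileSync(p, "utf-8")}`)
    .join("\n\n");
}
```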
Layer 3: Subagent delegation
With the Agent tool you can spawn another agent in its own context.
# Main gives directives; the heavy lifting goes to the subagent
Agent(
subagent_type: "general-purpose",
prompt: "Translate blog/harness.mdx into English plus 8 more languages,
save each under blog-{lang}/, then report back"
)
This keeps the main context from being polluted by noisy logs. Long build logs, intermediate translations, search dumps — any work where “you only want the deliverable” — can be offloaded wholesale.
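The isolation boundary can be sketched independently of any SDK. Here `run` stands in for "execute a full agent loop and return its final text"; the names are illustrative:

```typescript
// Hypothetical sketch of subagent isolation: the child gets a fresh
// message history, and only its final report crosses back to the parent.
type AgentFn = (messages: { role: string; content: string }[]) => Promise<string>;

async function delegate(run: AgentFn, task: string): Promise<string> {
  // Fresh context: the subagent never sees the parent's history,
  // and the parent never sees the subagent's intermediate noise.
  const isolated = [{ role: "user", content: task }];
  const finalReport = await run(isolated);
  return finalReport; // only the deliverable, not the work logs
}
```

Because `isolated` starts empty apart from the task, the subagent's context window is all its own, and the parent pays tokens only for the returned report.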
Layer 4: Hooks (deterministic processing)
.claude/settings.json lets you hook shell commands before and after tool calls.
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{ "type": "command", "command": "npx tsc --noEmit" }
]
}
]
}
}
Type-checking now runs automatically after every edit. Anything that “should be handled deterministically rather than asked of the LLM every time” belongs in a hook.
Layer 5: Permission modes
{
"permissions": {
"allow": ["Read", "Grep", "Glob"],
    "deny": ["Bash(rm -rf:*)", "Bash(git push --force:*)"],
"ask": ["Write", "Edit", "Bash"]
}
}
Explicitly deny destructive commands, and require approval for writes. Accidents happen when something runs unattended, so this layer determines your operational safety.
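If you are building your own harness, the same allow/deny/ask decision can be sketched with a simple matcher (a toy prefix/suffix glob, not Claude Code's actual pattern syntax):

```typescript
type Verdict = "allow" | "deny" | "ask";

interface Permissions { allow: string[]; deny: string[]; ask: string[] }

// Toy glob: a single "*" matches any substring in that position
function matches(pattern: string, call: string): boolean {
  const star = pattern.indexOf("*");
  if (star === -1) return call === pattern;
  return (
    call.startsWith(pattern.slice(0, star)) &&
    call.endsWith(pattern.slice(star + 1))
  );
}

function checkPermission(perms: Permissions, call: string): Verdict {
  // deny always wins; ask beats allow; unknown tools default to ask
  if (perms.deny.some((p) => matches(p, call))) return "deny";
  if (perms.ask.some((p) => matches(p, call))) return "ask";
  if (perms.allow.some((p) => matches(p, call))) return "allow";
  return "ask";
}
```

The ordering is the design decision: deny is checked first so no allow rule can ever override it, and anything unmatched falls through to a human.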
Five pitfalls to watch for
1. Too many tools
Give the model 30 tools and it will flounder over which to pick, degrading accuracy. The rule of thumb is 5-15. Push overflow capability into subagents.
2. Not exploiting prompt caching
Skip the Claude API’s cache_control and you resend your long system prompt in full every turn — and pay for it. Mind the 5-minute TTL and cache the static parts.
// `system` is a top-level request parameter, not a message role
system: [
  { type: "text", text: longStaticInstructions,
    cache_control: { type: "ephemeral" } }, // ← this line
  { type: "text", text: dynamicContext },
],
messages: [{ role: "user", content: userInput }],
3. Error messages the LLM cannot read
A tool that returns only `Error: undefined` gives the model nothing to act on, so it cannot self-repair. Say what is wrong and how to fix it.
throw new Error(
`File '${path}' does not exist. ` +
`Files currently in scripts/: ${list.join(", ")}`
);
4. Skipping human approval
Auto-approving destructive actions (delete, force push, DB updates) guarantees that disaster arrives eventually. Default to “ask for writes, deny for deletes.”
5. Never tidying memory
Stale information keeps pulling the agent toward wrong assumptions. Memory needs regular pruning too (in Claude Code, use /compact or edit files manually).
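In your own harness, pruning can be as simple as an age filter. A sketch with an injected clock so it stays testable (the entry shape is hypothetical):

```typescript
interface MemoryEntry { file: string; lastUsedMs: number } // ms since epoch

// Keep only entries touched within the last `maxAgeDays`;
// everything older is dropped before it can mislead the agent.
function pruneStale(entries: MemoryEntry[], nowMs: number, maxAgeDays = 30): MemoryEntry[] {
  const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
  return entries.filter((e) => nowMs - e.lastUsedMs <= maxAgeMs);
}
```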
Run your own mini harness
Finally, here is a minimal harness you can run locally with Node.js + TypeScript.
// mini-harness.ts
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync, writeFileSync } from "fs";
const client = new Anthropic();
const tools = [
{ name: "read_file",
description: "Read a text file",
input_schema: { type: "object", properties: { path: { type: "string" } }, required: ["path"] } },
{ name: "write_file",
description: "Write out a text file",
input_schema: { type: "object", properties: { path: { type: "string" }, content: { type: "string" } }, required: ["path", "content"] } },
];
const executors = {
read_file: ({ path }) => readFileSync(path, "utf-8"),
write_file: ({ path, content }) => { writeFileSync(path, content); return `written ${path}`; },
};
async function loop(goal: string, maxSteps = 10) {
const messages: any[] = [{ role: "user", content: goal }];
for (let i = 0; i < maxSteps; i++) {
const res = await client.messages.create({
model: "claude-opus-4-6", max_tokens: 4096, tools, messages,
});
messages.push({ role: "assistant", content: res.content });
if (res.stop_reason === "end_turn") return res.content;
const toolUse = res.content.find((c: any) => c.type === "tool_use") as any;
if (!toolUse) return res.content;
const result = executors[toolUse.name](toolUse.input);
messages.push({
role: "user",
content: [{ type: "tool_result", tool_use_id: toolUse.id, content: String(result) }],
});
}
}
await loop("Read README.md and save a 3-line summary as TL;DR.md");
That alone gives you a mini-agent that can read an existing file and write out a new one. Add a Grep tool, a Bash tool, and an Agent tool on top, and you have a miniature Claude Code.
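For instance, a Grep tool could look like this. The matching core is kept pure so it is easy to test; `grepExecutor` is what you would drop into the executors map (the names here are mine, not Claude Code's):

```typescript
import { readFileSync } from "fs";

// Pure matching core: returns grep-style "lineNumber:line" hits
function grepLines(text: string, pattern: string): string[] {
  const re = new RegExp(pattern);
  return text
    .split("\n")
    .map((line, i) => ({ line, i }))
    .filter(({ line }) => re.test(line))
    .map(({ line, i }) => `${i + 1}:${line}`);
}

// Tool definition for the model
const grepTool = {
  name: "grep",
  description: "Search a file for lines matching a regular expression",
  input_schema: {
    type: "object",
    properties: { path: { type: "string" }, pattern: { type: "string" } },
    required: ["path", "pattern"],
  },
};

// Executor for the harness side: no matches is a message, not a crash
const grepExecutor = ({ path, pattern }: { path: string; pattern: string }) =>
  grepLines(readFileSync(path, "utf-8"), pattern).join("\n") || "No matches";
```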
Summary: from prompt author to harness architect
| Old mindset | New mindset |
|---|---|
| A great prompt yields a great output | A great harness yields a great output |
| Pick a model | Design model + tools + context + permissions |
| One-shot questions | Continuous loop operation |
Claude Code is the best teaching material for absorbing this shift in perspective. Don’t just use it — break it apart and fold the ideas into your own agent. That is the posture required of AI engineers from 2026 onward.
Start by copy-pasting the mini harness above and running it. Ten minutes from now, you will have taken the first step toward your very own agent.
Related articles
- 10 Subagent Patterns in Claude Code
- CLAUDE.md Best Practices
- Token Optimization Techniques for Claude Code
- Claude Code Permissions Guide
About the Author
Masa
Engineer obsessed with Claude Code. Runs claudecode-lab.com, a 10-language tech media with 2,000+ pages.