Meta-Harness: what happens when AI agents optimize their own tooling
There’s a dirty secret in the AI agent world that anyone who’s shipped one already knows: the model is rarely the bottleneck.
The bottleneck is everything around the model. The system prompt. The tool definitions. The retry logic. The way you format context and manage conversation history. This wrapper code, called the “harness,” determines whether the same Claude or GPT instance solves 27% of problems or 76%.
That’s not a typo. That’s the actual gap.
A team led by Stanford researchers just published Meta-Harness, a system that automates the optimization of this harness layer. Instead of engineers manually tweaking prompts and tool interfaces for weeks, a coding agent reads its own execution logs, diagnoses why it failed, and proposes concrete improvements. On coding, math, and classification benchmarks, the system beat hand-tuned baselines built by teams who’d been optimizing for months.
The paper is dense. But if agents can improve their own infrastructure, the economics of deploying them change in ways worth paying attention to.
The harness problem is real and measurable
First, some context on why this matters so much.
Earlier this year, Can Boluk ran an experiment that should have been boring. He tested 16 LLMs on the same coding tasks using three different edit tool formats: patch (diff-style), string replacement (exact match), and his new hashline format (content hashes per line). Same models, same tasks, different harness.
Grok Code Fast went from 6.7% to 68.3%. A tenfold improvement by changing one interface. The model’s actual coding ability had been almost completely hidden behind mechanical failures in how it applied edits.
Across all 16 models, hashline matched or beat string replacement while cutting output tokens by roughly 20%. The conclusion: a bad harness doesn’t just slow a model down. It masks what the model can actually do.
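Boluk’s exact format isn’t reproduced here, but the core idea, addressing each line by a short content hash so a stale edit fails fast instead of silently mis-applying, can be sketched in a few lines (the function names and the 8-character hash length are my assumptions, not his spec):

```python
import hashlib

def line_hash(line: str, n: int = 8) -> str:
    """Short content hash identifying a line (length is an assumption)."""
    return hashlib.sha256(line.encode()).hexdigest()[:n]

def render_hashlines(text: str) -> str:
    """Present the file to the model with a hash prefix on every line."""
    return "\n".join(f"{line_hash(l)}|{l}" for l in text.splitlines())

def apply_edit(text: str, target_hash: str, replacement: str) -> str:
    """Replace the line whose hash matches; fail loudly on a stale or
    ambiguous hash instead of silently corrupting the file."""
    lines = text.splitlines()
    matches = [i for i, l in enumerate(lines) if line_hash(l) == target_hash]
    if len(matches) != 1:
        raise ValueError(f"hash {target_hash!r} matched {len(matches)} lines")
    lines[matches[0]] = replacement
    return "\n".join(lines)
```

The design choice doing the work: a hash either matches the current file content or it doesn’t, so the failure mode of exact-string replacement (near-misses that apply in the wrong place, or not at all) largely disappears.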
LangChain confirmed this at scale. Their coding agent jumped from 52.8% to 66.5% on Terminal-Bench 2.0, going from Top 30 to Top 5, purely through harness engineering. No model change. No fine-tuning. Just better scaffolding.
Five independent teams (OpenAI, Anthropic, Huntley, Horthy, and Vasilopoulos) converged on the same finding: coding agents become reliable only when you build the right infrastructure around them.
The catch is that optimizing harnesses takes serious engineering time. You run tasks, read logs, form hypotheses, tweak the prompt or tool definition, and try again. It’s iterative, slow, and requires exactly the kind of long-horizon reasoning that humans are good at.
Or that agents could be good at, if you gave them the right setup.
What Meta-Harness actually does
The Meta-Harness paper, by Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn, proposes a surprisingly simple loop:
- A coding agent (Claude Code) reads accumulated source code, execution traces, and scores from a filesystem
- It proposes an improved harness
- The new harness gets evaluated on held-out tasks
- All results get stored back to the filesystem
- Repeat
That’s it. The cleverness is in step 1: what the agent has access to.
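In code, the outer loop might look something like this. A minimal sketch: `propose` stands in for a full Claude Code session and `evaluate` for the held-out benchmark run, and all names here are mine, not the paper’s:

```python
import json
from pathlib import Path

def optimize_harness(workspace: Path, propose, evaluate, iterations: int = 10):
    """Outer loop: propose from full history, evaluate, persist everything."""
    for step in range(iterations):
        # 1. The proposer gets the whole workspace: every prior harness,
        #    trace, and score. It navigates it itself (grep/cat in the paper).
        harness_src = propose(workspace)
        cand_dir = workspace / f"candidate_{step}"
        cand_dir.mkdir(parents=True, exist_ok=True)
        (cand_dir / "harness.py").write_text(harness_src)
        # 2. Score the new candidate on held-out tasks.
        score, traces = evaluate(harness_src)
        # 3. Write the raw results back so future iterations can diagnose them.
        (cand_dir / "score.json").write_text(json.dumps({"score": score}))
        (cand_dir / "traces.json").write_text(json.dumps(traces))
    best = max(workspace.glob("candidate_*/score.json"),
               key=lambda p: json.loads(p.read_text())["score"])
    return best.parent
```

The point of the sketch is step 3: nothing gets summarized or discarded, which is what makes the diagnosis described below possible.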
Previous optimization methods like OpenEvolve (evolutionary search), TTT-Discover (test-time RL), and Self-Refine (self-critique) all compress their feedback. They pass the agent a scalar score, a brief summary, or the last few attempts. That’s roughly 26K tokens of context per iteration, max.
Meta-Harness gives the proposer the entire filesystem. Every prior candidate’s source code. Every execution trace. Every score. Around 10 million tokens of diagnostic context per step. The agent navigates this using standard tools like grep and cat, the same way a human engineer would debug a complex system.
The paper calls this “counterfactual diagnosis across execution traces.” In practice, it means the agent can ask questions like: “Harness version 12 scored 45% and version 14 scored 38%. What specific change between them caused the regression? Let me look at the traces for tasks 23 and 57 where scores diverged.” That kind of reasoning requires raw data, not summaries.
The median proposer session reads 82 files. That’s not a typo either. The agent is doing the kind of deep forensic analysis that a senior engineer would do when debugging a subtle performance regression. It reads dozens of log files, cross-references them, and forms a targeted hypothesis.
The results across three domains
Meta-Harness was tested on three different tasks.
Text classification
The task: classify texts into large label spaces (up to 215 categories for legal documents, 180 for chemical reactions). The harness controls how labels are presented to the model, how queries are formatted, and how responses are parsed.
Meta-Harness discovered a technique the authors call “Label-Primed Query,” a harness that achieved 48.6% accuracy versus the previous best (ACE) at 40.9%. That’s a 7.7-point improvement while using 4x fewer context tokens. The largest gains came on LawBench (+16 points), where the confusable label space was largest.
What’s notable: this harness design wasn’t in any prior literature. The agent invented a genuinely new approach by diagnosing failure patterns in the logs.
It also outperformed competing optimizers on efficiency. Meta-Harness matched OpenEvolve and TTT-Discover’s final accuracy in just 4 iterations with 10x fewer evaluations. When you factor in the cost of all those evaluation runs, the 10M-token-per-step approach is actually cheaper overall.
Math reasoning
On 200 IMO-level math problems using retrieval-augmented generation, Meta-Harness delivered a 4.7-point accuracy gain (34.1% to 38.8%) across five models that were never seen during the harness search. The optimized harness implemented filtering and branching logic using corpus metadata, strategies the agent discovered by analyzing which retrieval approaches worked for which problem types.
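The paper’s exact retrieval logic isn’t shown here, but metadata-based filtering and branching in a RAG harness looks roughly like this (every helper and field name below is hypothetical):

```python
def retrieve(query: str, corpus: list[dict], classify_topic, search) -> list[dict]:
    """Sketch of metadata-aware retrieval. `classify_topic` and `search`
    are assumed helpers standing in for whatever the harness actually uses."""
    topic = classify_topic(query)  # e.g. "geometry", "number theory"
    # Filter: restrict the search pool to documents tagged with the topic,
    # falling back to the full corpus if the filter leaves nothing.
    pool = [d for d in corpus if topic in d.get("topics", [])] or corpus
    # Branch: different retrieval depth per problem type.
    k = 3 if topic == "geometry" else 5
    return search(query, pool, k=k)
```

The interesting part is not the code but where it came from: the agent chose these branches by reading traces of which retrieval strategies succeeded on which problem types.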
Agentic coding (Terminal-Bench 2.0)
This is where it gets interesting for anyone building agent platforms.
Terminal-Bench 2.0 is the benchmark frontier labs watch most closely for practical agent capability. 89 Dockerized tasks covering code translation, ML setup, systems programming, bioinformatics, and cryptanalysis. Each task is validated by both humans and AI.
Meta-Harness + Claude Opus 4.6 achieved a 76.4% pass rate, placing #2 among all agents globally. Broken down: 100% on easy tasks, 81.1% on medium, 64.7% on hard.
But the more telling result is with a smaller model. Meta-Harness + Claude Haiku 4.5 hit 37.6%, ranking #1 among all Haiku-class agents, beating Goose (35.5%), Terminus-KIRA (33.7%), and stock Claude Code (27.5%).
That 10-point jump from 27.5% to 37.6% came entirely from harness optimization. Same small model, much better performance. The harness evolved system prompts, tool definitions, completion-checking logic, and context management by reading per-task execution traces.
One specific innovation the open-source artifact reveals: “environment bootstrapping.” Instead of letting the agent waste turns running ls, which python3, and other discovery commands, the optimized harness captures a sandbox snapshot before execution (working directory, file listings, available tools, installed packages, memory status) and injects it into the initial prompt. This eliminates 2-5 exploration turns per task.
That’s the kind of optimization a human engineer might eventually discover after watching hundreds of agent runs. The Meta-Harness proposer found it by reading the logs.
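A minimal version of environment bootstrapping, assuming a Python harness and a handful of common tools worth probing (the specific fields captured are my guess at what a useful snapshot includes):

```python
import os
import platform
import shutil
import subprocess
import sys

def snapshot_environment(max_files: int = 50) -> str:
    """Capture a sandbox snapshot to inject into the agent's first prompt,
    so it doesn't burn turns on `ls` / `which python3` style discovery."""
    cwd = os.getcwd()
    files = sorted(os.listdir(cwd))[:max_files]
    tools = {t: shutil.which(t) for t in ("python3", "git", "make", "gcc")}
    pkgs = subprocess.run(
        [sys.executable, "-m", "pip", "list", "--format=freeze"],
        capture_output=True, text=True,
    ).stdout.splitlines()[:30]
    return "\n".join([
        f"Working directory: {cwd}",
        f"Platform: {platform.platform()}",
        "Files: " + ", ".join(files),
        "Tools: " + ", ".join(f"{k}={v or 'missing'}" for k, v in tools.items()),
        "Installed packages (truncated): " + ", ".join(pkgs),
    ])
```

Prepending this string to the system prompt trades a few hundred cheap tokens for the 2-5 exploration turns the article describes, each of which would otherwise cost a full model round-trip.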
Why full history access matters
The paper includes a finding that challenges a common assumption in optimization research: you can’t compress away the history.
Previous text optimization methods deliberately compress feedback to save tokens. Self-Refine uses the model’s own critique (~1K tokens). OPRO tracks recent score pairs (~2K). TextGrad computes “textual gradients” (~15K). Even TTT-Discover, the most context-rich prior method, uses ~26K tokens of solution fragments.
Meta-Harness uses ~10M tokens per step. That’s roughly 400x more context.
The authors argue this isn’t wasteful. It’s necessary. Harness optimization involves long-horizon credit assignment. A change to the system prompt in iteration 3 might not show its effects until iteration 12 when it interacts with a tool definition change from iteration 8. Tracking these dependencies requires access to the raw data.
When they restricted the proposer’s access to only recent attempts or summary statistics, performance degraded significantly. The agent needs to grep through old logs, compare specific traces, and reason about multi-step causal chains. Summaries destroy exactly the information needed for this kind of diagnosis.
If you’re building self-improving agent systems, take note. Compression feels efficient but it throws away the long-range signal that matters most.
The meta-learning connection
Chelsea Finn, the paper’s senior author, is one of the originators of MAML (Model-Agnostic Meta-Learning), the work that put “learning to learn” on the map. Meta-Harness is explicitly positioned as a new instantiation of classic meta-learning ideas, but with even less imposed structure.
Where MAML learns initialization parameters across tasks, and where traditional hyperparameter search optimizes a fixed set of knobs, Meta-Harness just… lets an agent look at everything and propose code changes. The search space isn’t parameterized. The feedback isn’t compressed. The optimization isn’t gradient-based.
The authors frame it this way: “Meta-Harness is itself a harness: one whose purpose is to optimize other harnesses.” A meta-harness. The recursion is intentional.
Other groups are pushing in the same direction. Meta AI released Hyperagents, a framework where meta-agents modify task-solving agents’ code. The Darwin Gödel Machine grows a tree of agent codebases where new versions must empirically beat their ancestors. Agent0 creates a co-evolutionary loop between curriculum agents and executors.
The pattern across all of these: the most capable agents are the ones that can modify their own scaffolding. Meta-Harness is the most controlled demonstration of this so far, with the clearest before-and-after numbers.
What this means for agent builders
If you’re deploying AI agents, whether through platforms like Augmi or building your own, this paper has practical consequences.
The most obvious one: harness engineering has a better return than model selection. The same model with a better harness can outperform a larger model with a default harness. If you’re spending time choosing between GPT-5 and Claude Opus, you might get more return from optimizing how either model reads files and applies edits.
This kind of automated optimization also just became practical. Meta-Harness relies on coding agent capabilities that only became reliable in early 2026. The approach wouldn’t have worked a year ago. The proposer needs to be capable enough to read 82 files, form hypotheses, and write correct code changes. Now it can.
If you take one thing from this paper: log everything. The single most important prerequisite for Meta-Harness-style optimization is having full execution traces. If your agent platform discards logs or compresses them to summaries, you’re throwing away the exact information needed for agents to improve themselves. Store the raw traces.
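A sketch of what “store the raw traces” means in practice: append every step as JSONL, one directory per run, nothing summarized (the layout and field names here are assumptions, not a standard):

```python
import json
import time
import uuid
from pathlib import Path

LOG_ROOT = Path("traces")  # assumption: one flat directory per run

def log_step(run_id: str, step: dict) -> None:
    """Append one raw agent step (prompt, tool call, output) as JSONL.
    Raw, not summarized: summaries destroy the long-range signal that
    Meta-Harness-style diagnosis depends on."""
    run_dir = LOG_ROOT / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"ts": time.time(), **step}
    with open(run_dir / "trace.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

run = uuid.uuid4().hex
log_step(run, {"role": "tool", "name": "bash",
               "input": "ls", "output": "src tests"})
```

JSONL plus plain directories is deliberately boring: it is exactly the layout a future proposer agent can navigate with grep and cat.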
Smaller models benefit the most from this. The 10-point jump on Haiku (27.5% to 37.6%) is proportionally much larger than the improvement on Opus. Smaller models are more constrained by their harness because they have less capacity to compensate for bad scaffolding. If you’re running cost-sensitive deployments, harness optimization is where you want to spend your time.
And the recursive improvement loop is real, if bounded. An agent that can optimize its own harness creates a positive feedback cycle: better harness, better performance, better harness optimization. Humans still define the evaluation metrics and choose when to deploy improved harnesses. But the gap between human-in-the-loop and fully autonomous got smaller with this paper.
The uncomfortable question
There’s something unsettling about a system that reads its own failure logs and rewrites its own scaffolding. Meta-Harness still operates within defined boundaries. The evaluation metrics are fixed, the search is bounded, a human decides when to deploy changes.
But consider what it already does. The “Label-Primed Query” harness Meta-Harness invented for text classification didn’t exist in any prior paper. The agent found it by reading raw logs and reasoning about failure modes. That’s not optimization. That’s invention.
We’re not at agents that fully maintain themselves. But the harness layer, that unglamorous wrapper code between the model and the world, turns out to be where the most important optimization happens. And it turns out agents are pretty good at optimizing it.
Meta-Harness was developed by Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn at Stanford University, with compute resources from KRAFTON AI. The paper, project page, and code are publicly available.
