Harness Engineering
This proposal introduces harness engineering as a first-class discipline within the AI Workflow Conduction framework. A harness is everything around an LLM that is not the model itself: the tools it can call, the loop that decides when to stop, the memory it keeps between steps, the guardrails that block dangerous actions, and the sensors that verify its output. The working identity is:
Agent = Model + Harness
Core Insight: As frontier models converge in raw capability, the harness around them becomes the differentiator. A capable model with a poor harness stalls, picks the wrong tool, or runs past its permission boundary. A well-designed harness gives the same model dramatically higher reliability on the same task.
Related Proposals: Claude Skills Adoption, Agent-Friendly Knowledge Base, AI-First Context Infrastructure, Continuous Context Cleanup
Problem Statement
The Ceiling of "More Context"
Context §1.2 already establishes that specification chaos, review velocity gaps, and role confusion limit AI effectiveness. The response across AI-First Context Infrastructure and Agent-Friendly Knowledge Base has been to give agents better context.
Better context raises the floor. It does not fix every failure mode.
A capable model with a full, well-curated context still fails in ways that context alone cannot address:
| Failure mode | Symptom | Why context alone will not fix it |
|---|---|---|
| Agent stall | Loop never terminates; agent repeats the same tool call | The harness, not the context, controls loop termination |
| Wrong tool | Agent picks a generic tool when a specific one exists | Tool granularity is a harness design choice |
| Permission breach | Agent takes a destructive action without confirmation | Permission boundaries live in the harness, not the prompt |
| Silent drift | Output looks plausible; nobody catches the regression | Sensors (tests, reviewers) are a harness concern |
| Over-long session | Context window exhausted mid-task | Memory management and summarization are harness features |
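The first row shows why loop termination belongs to the harness: the model cannot be relied on to notice that it is repeating itself. Below is a minimal sketch of a stall check, assuming each tool call is recorded as a name plus a JSON-serializable argument dictionary. `StallDetector` and `max_repeats` are hypothetical names for illustration, not part of any agent runtime.

```python
import json
from collections import deque


class StallDetector:
    """Flags a stall when the same tool call repeats back to back."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        # Keep only the most recent calls; older history is not needed here.
        self._recent: deque = deque(maxlen=max_repeats)

    def record(self, tool_name: str, arguments: dict) -> bool:
        """Record one call; return True if the loop should stop or escalate."""
        # Canonical JSON so that key order cannot hide a repeated call.
        # Assumes the arguments are JSON-serializable.
        fingerprint = json.dumps([tool_name, arguments], sort_keys=True)
        self._recent.append(fingerprint)
        return (
            len(self._recent) == self.max_repeats
            and len(set(self._recent)) == 1
        )


detector = StallDetector(max_repeats=3)
for _ in range(3):
    stalled = detector.record("read_file", {"path": "a.py"})
# stalled is now True: three identical calls in a row
```

The check is deliberately narrow. It catches exact repeats only; a loop that alternates between two calls would need a wider window or a different fingerprint.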
Three Paradigms, Not One
Prompt engineering, context engineering, and harness engineering address three different failure surfaces. They compound; they do not replace each other.
Knowledge compounds across layers. Yesterday's prompt techniques live inside today's SDKs. Today's context expertise informs tomorrow's harness design. A team jumping straight to "adopt harness engineering" without a working context layer will hit the same problems the context-engineering proposals were written to solve.
Definition
Agent Equals Model Plus Harness
Martin Fowler frames the identity directly in Harness engineering for coding agent users:
"Harness engineering refers to everything in an AI coding agent except the model itself."
Parallel Web Systems extends the definition beyond coding:
"The harness is what connects an AI model to the outside world, enabling it to use tools, remember information between steps, and interact with complex environments."
The harness has five observable components:
- Tools — the callable surface the model reaches through. Coarse-grained or fine-grained, domain-general or domain-specific.
- Loop control — stop conditions, escalation triggers, budget limits, multi-agent coordination.
- Memory and state — working context, session log, long-term memory, summarization and retrieval.
- Guardrails — permission boundaries, schema validation, safety filters.
- Sensors — tests, linters, type checkers, review agents, runtime monitors that observe output after the agent acts.
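The five components can be sketched as one small loop around a model. This is an illustrative toy, not a real runtime: the `Harness` class, the `Action` shape, and `scripted_model` (a stand-in for an LLM call) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical shape: the model returns (tool_name, argument),
# or ("done", answer) when it believes the task is finished.
Action = tuple[str, str]


@dataclass
class Harness:
    model: Callable[[list[str]], Action]
    tools: dict[str, Callable[[str], str]]            # Tools
    allowed: set[str]                                 # Guardrails
    sensor: Callable[[str], bool]                     # Sensors
    max_steps: int = 5                                # Loop control
    memory: list[str] = field(default_factory=list)   # Memory and state

    def run(self, task: str) -> str:
        self.memory.append(f"task: {task}")
        for _ in range(self.max_steps):
            name, arg = self.model(self.memory)
            if name == "done":
                if self.sensor(arg):
                    return arg
                # Sensor rejected the answer: feed the signal back and retry.
                self.memory.append(f"sensor rejected: {arg}")
                continue
            if name not in self.allowed or name not in self.tools:
                self.memory.append(f"blocked: {name}")
                continue
            self.memory.append(f"{name} -> {self.tools[name](arg)}")
        return "escalate: step budget exhausted"


def scripted_model(memory: list[str]) -> Action:
    # Stand-in for an LLM: tries a forbidden tool, then a permitted one.
    if not any(m.startswith("blocked") for m in memory):
        return ("delete_repo", "")
    if not any(m.startswith("add_one") for m in memory):
        return ("add_one", "41")
    return ("done", memory[-1].split(" -> ")[1])


harness = Harness(
    model=scripted_model,
    tools={"add_one": lambda s: str(int(s) + 1)},
    allowed={"add_one"},
    sensor=lambda answer: answer.isdigit(),
)
result = harness.run("increment 41")  # "42", after one blocked call
```

Every branch in `run` is a harness decision rather than a model decision: the guardrail blocks `delete_repo`, the sensor gates the final answer, and the step budget forces escalation instead of an endless loop.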
Guides and Sensors
Fowler splits a harness into two control types based on when they act:
Guides shape output before the agent acts. They raise the probability of a good first attempt.
Sensors observe output after the agent acts. They catch bad attempts and feed the signal back into the loop.
Computational and Inferential Controls
A second axis cuts across Guides and Sensors: how the control itself runs.
| Type | Latency | Cost | Determinism | Examples |
|---|---|---|---|---|
| Computational | Milliseconds to seconds | Near zero | Deterministic | Linters, type checkers, unit tests, schema validators |
| Inferential | Seconds to minutes | LLM call | Non-deterministic | Review agents, LLM-as-judge, semantic-drift detectors |
Computational controls catch mechanical problems reliably. Inferential controls add semantic judgment where mechanical rules cannot express intent. A mature harness uses both.
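One way to combine the two types is to run the cheap deterministic checks first and spend an LLM call only when they pass. The sketch below assumes Python source as the agent's output; `run_controls` and `fake_judge` are hypothetical names, and `fake_judge` is only a placeholder for a real LLM-as-judge call.

```python
import ast
from typing import Callable, Optional

# A check returns None on success or a short failure message.
Check = Callable[[str], Optional[str]]


def syntax_check(source: str) -> Optional[str]:
    """Computational control: deterministic and fast."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"syntax error on line {exc.lineno}: {exc.msg}"
    return None


def run_controls(
    source: str, computational: list[Check], inferential: list[Check]
) -> list[str]:
    """Run deterministic checks first; call inferential ones only if they pass."""
    failures = [msg for check in computational if (msg := check(source))]
    if failures:
        return failures
    return [msg for check in inferential if (msg := check(source))]


judge_calls: list[str] = []


def fake_judge(source: str) -> Optional[str]:
    # Placeholder for an LLM-as-judge call; a real judge would be
    # slower, cost money, and could answer differently on each run.
    judge_calls.append(source)
    return None if "def " in source else "no function defined"


broken = run_controls("def f(:", [syntax_check], [fake_judge])
# broken holds one syntax error, and fake_judge was never called
```

The ordering is the design choice here: a syntax error is reported in milliseconds at near-zero cost, so the inferential control only ever sees output that is already mechanically sound.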
Operating System Analogy
A common framing, borrowed from the Chinese-language analysis by KodeLAB, maps the agent onto a computer:
- Model is the CPU — raw computation.
- Context window is RAM — working memory, bounded.
- Harness is the operating system — schedules work, manages resources, mediates access to tools and memory.
A naked LLM is a CPU without an OS. It can compute. It cannot do useful work on its own.
Proposed Solution
Adopt a shared organizational harness layered over the model. It has three layers:
Layer 1: Base Harness
A standard agent runtime configuration shared across all teams.
| Element | Recommended Default |
|---|---|
| Agent runtime | Claude Code (or an equivalent with skills, hooks, MCP support) |
| Permission defaults | Minimal — no write or execute without explicit allow-list |
| Hooks | Pre-tool-use hook enforcing the project AGENTS.md boundary |
| Settings | Committed .claude/settings.json per project |
| Model | Latest generally available frontier model, unless a project pins otherwise |
The point of Layer 1 is not to pick the right runtime once and stop. It is to make the choice explicit and the configuration shared, so that every team inherits the same defaults and diverges only with justification.
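The "minimal permission" default in the table can be illustrated with a deny-by-default decision function. This is a sketch of the policy logic only; it does not reproduce the hook contract of Claude Code or any other runtime. The action kinds, `ToolRequest`, and the allow-list entries are hypothetical.

```python
import posixpath
from dataclasses import dataclass
from fnmatch import fnmatchcase

READ_ONLY = {"read", "search"}           # assumed action kinds, always permitted
ALLOWED_WRITE_GLOBS = ["docs/*.md"]      # hypothetical project allow-list
ALLOWED_COMMANDS = {"pytest -q"}         # exact command lines only


@dataclass(frozen=True)
class ToolRequest:
    action: str   # e.g. "read", "write", "execute"
    target: str   # file path or command line


def decide(request: ToolRequest) -> str:
    """Return "allow" or "deny"; anything not explicitly listed is denied."""
    if request.action in READ_ONLY:
        return "allow"
    if request.action == "write":
        # Normalize first so "docs/../secrets.md" cannot slip past the glob.
        # Note that fnmatch's "*" also matches "/", so nested paths match too.
        path = posixpath.normpath(request.target)
        matched = any(fnmatchcase(path, glob) for glob in ALLOWED_WRITE_GLOBS)
        return "allow" if matched else "deny"
    if request.action == "execute":
        # Exact match: a prefix match would let "pytest -q; rm -rf /" through.
        return "allow" if request.target in ALLOWED_COMMANDS else "deny"
    return "deny"


decision = decide(ToolRequest("write", "src/app.py"))  # "deny"
```

A pre-tool-use hook would call something like `decide` before each tool invocation and block the call on "deny". Unknown action kinds fall through to "deny", which keeps the default minimal as new tools are added.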
Layer 2: Organizational Guides and Sensors
The existing proposals in this chapter are the organization's guides and sensors. Harness engineering is the framing that ties them together.
| Role | Component | Existing Proposal |
|---|---|---|
| Guide | AI-accessible context surface | AI-First Context Infrastructure |
| Guide | Markdown knowledge base | Agent-Friendly Knowledge Base |
| Guide | Shared Claude Skills | Claude Skills Adoption |
| Guide | Specification retrieval | Internal Spec Platform |
| Guide | Terminology constraints | Ubiquitous Language |
| Guide | Existing-fact specs | Spec Extraction |
| Guide | Requirement source of truth | Global Requirement Store |
| Guide | Component inventory | Design System |
| Guide | Component ownership model | shadcn/ui Foundation |
| Guide | Spec hierarchy | Multi-Product Spec Management |
| Guide | Document graph | Frontmatter Spec Coordination |
| Guide | Project AI guidance file | CLAUDE.md Standards (planned) |
| Sensor | Linters and type checkers | Tooling baseline |
| Sensor | Continuous cleanup review | Continuous Context Cleanup |
| Sensor | Tech stack alignment | Tech Radar and Roadmaps |
| Loop Control | AI-first decision points | AI-First Decision Making |
| Loop Control | Elaboration sessions | AI-DLC Mob Elaboration |
Layer 3: The Steering Loop
When the same failure mode recurs, iterate the harness, not the prompt.
Diagnostic rubric:
| Symptom | Likely layer |
|---|---|
| Wrong output shape, formatting drift | Prompt |
| Missing fact, outdated reference | Context |
| Agent stall, wrong tool, permission breach, silent regression | Harness |
The steering loop has a named owner. Harness changes go through a lightweight review, the same as any other infrastructure change.
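The rubric above can be applied mechanically to tagged incidents, which gives the steering-loop owner a quick view of which layer is producing the most failures. The sketch assumes incidents are tagged with the symptom phrases from the table; `RUBRIC` and `triage` are hypothetical names.

```python
from collections import Counter

# Symptom tags and their likely layer, copied from the rubric table.
RUBRIC = {
    "wrong output shape": "prompt",
    "formatting drift": "prompt",
    "missing fact": "context",
    "outdated reference": "context",
    "agent stall": "harness",
    "wrong tool": "harness",
    "permission breach": "harness",
    "silent regression": "harness",
}


def triage(symptoms: list[str]) -> Counter:
    """Count tagged incidents per layer; unknown tags land in 'unclassified'."""
    return Counter(
        RUBRIC.get(symptom.strip().lower(), "unclassified")
        for symptom in symptoms
    )


counts = triage(
    ["Agent stall", "wrong tool", "missing fact", "agent stall", "flaky network"]
)
# counts["harness"] == 3, counts["context"] == 1, counts["unclassified"] == 1
```

A growing "unclassified" bucket is itself a signal: it suggests the rubric is missing a failure class and needs a new row.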
Implementation Roadmap
Four phases, staged to avoid the "shipped the harness, nobody uses it" failure.
Phase 1: Baseline the Harness
Deliverables:
- Standard .claude/settings.json committed in a reference repository.
- AGENTS.md template published.
- Permission hook blocking unauthorized write or execute actions.
- One-page "What runs on your machine" doc for every engineer.
Exit criteria:
- Every active project has an AGENTS.md file.
- Default permission boundary is enforced by a pre-tool-use hook.
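The first exit criterion is easy to verify automatically. The sketch below assumes a hypothetical layout in which every immediate subdirectory of a root folder is one active project; `projects_missing_agents_file` is an illustrative name.

```python
from pathlib import Path


def projects_missing_agents_file(root: Path, filename: str = "AGENTS.md") -> list[str]:
    """Return the names of project directories under root that lack the file."""
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and not (child / filename).is_file()
    )
```

In CI, a job could call this on the checkout root and fail the build when the returned list is non-empty, printing the missing project names.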
Phase 2: Seed the Guides
Deliverables:
- Shared skill library (see Claude Skills Adoption).
- Knowledge base migrated to Git-backed Markdown (see Agent-Friendly Knowledge Base).
- Ubiquitous language glossary published (see Ubiquitous Language).
Exit criteria:
- Two or more teams are consuming shared skills.
- Agents can retrieve domain knowledge without manual paste.
Phase 3: Seed the Sensors
Deliverables:
- CI-integrated linters and type checkers on every repository.
- LLM-as-judge review pipeline for PRs above a size threshold.
- Continuous context cleanup process running (see Continuous Context Cleanup).
Exit criteria:
- Sensor signal is written back into the agent loop (not just human dashboards).
- At least one class of regression has been caught by sensors pre-merge.
Phase 4: Institutionalize the Steering Loop
Deliverables:
- Named owner for the organizational harness.
- Monthly harness retro reviewing recurring failure modes.
- Documented cycle-time target, measured from issue observed to harness updated.
Exit criteria:
- Harness changes are tracked in the same backlog as product work.
- Recurrence rate of named failure modes is trending down month over month.
Success Metrics
| Metric | Target | How to Measure |
|---|---|---|
| Recurrence rate of named failure modes | Declining month over month | Track tagged issues per failure class |
| Skill reuse rate across teams | > 50% of active skills used by more than one team | Skill invocation logs |
| Agent sessions requiring human rescue | < 10% of sessions per week | Session telemetry or self-report |
| Issue-to-harness-update cycle time | < 1 sprint (median) | Time from issue opened to harness change merged |
| Shared AGENTS.md adoption | 100% of active projects | File presence check in CI |
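Most of these metrics reduce to small calculations over logs. The functions below are a sketch under assumed input shapes: the `rescued` flag on a session record, the skill-to-teams mapping, and the (opened, merged) timestamp pairs are hypothetical and would need to match whatever telemetry is actually collected.

```python
from datetime import datetime
from statistics import median


def rescue_rate(sessions: list[dict]) -> float:
    """Share of sessions where a human had to step in (target: below 0.10)."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if s["rescued"]) / len(sessions)


def skill_reuse_rate(teams_by_skill: dict[str, set[str]]) -> float:
    """Share of active skills used by more than one team (target: above 0.50)."""
    if not teams_by_skill:
        return 0.0
    shared = sum(1 for teams in teams_by_skill.values() if len(teams) > 1)
    return shared / len(teams_by_skill)


def median_cycle_days(issues: list[tuple[datetime, datetime]]) -> float:
    """Median days from issue opened to harness change merged.

    Raises statistics.StatisticsError when the list is empty.
    """
    return median(
        (merged - opened).total_seconds() / 86400 for opened, merged in issues
    )


def is_declining(monthly_counts: list[int]) -> bool:
    """True when every month has fewer recurrences than the month before."""
    return all(
        later < earlier
        for earlier, later in zip(monthly_counts, monthly_counts[1:])
    )
```

`is_declining` is strict: a single flat month counts as a miss. Whether that is the right bar, or a moving average would serve better, is a choice for the harness owner.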
Anti-Patterns
These are failure patterns observed in teams adopting agent frameworks without a harness discipline.
| Anti-pattern | What it looks like | Why it backfires |
|---|---|---|
| Prompt stuffing | Every new failure is answered with more text in the system prompt | Prompts grow unreadable; context budget shrinks; root cause is usually a missing tool or sensor |
| Context bloat | Every new failure is answered with more documents piped into context | Signal-to-noise drops; model output quality declines |
| Harness sprawl | Multiple competing skills, hooks, or MCP servers that overlap | Agents pick the wrong one; maintenance burden compounds |
| Orphan harness | Harness exists, no named owner, nobody updates it | Drift accumulates silently; teams quietly stop using it |
| Single-layer thinking | Treating the harness as a replacement for context engineering | Missing knowledge still produces wrong code; the three layers compound and do not substitute for one another |
CLAUDE.md Integration
A project's CLAUDE.md (or AGENTS.md) is the primary Guide at the project level. At minimum it declares:
- Which skills are in scope for this project.
- Which tools the agent is permitted to invoke without confirmation.
- Where to find the project's specification, design system, and knowledge base.
- What the steering-loop owner expects to be notified about.
See CLAUDE.md Standards (planned) for the full schema.
Related Proposals
| Role in Harness | Proposal |
|---|---|
| Guide | AI-First Context Infrastructure |
| Guide | Agent-Friendly Knowledge Base |
| Guide | Claude Skills Adoption |
| Guide | Internal Spec Platform |
| Guide | Ubiquitous Language |
| Guide | Spec Extraction |
| Guide | Global Requirement Store |
| Guide | Design System |
| Guide | shadcn/ui Foundation |
| Guide | Multi-Product Spec Management |
| Guide | Frontmatter Spec Coordination |
| Sensor | Continuous Context Cleanup |
| Sensor | Tech Radar and Roadmaps |
| Loop Control | AI-First Decision Making |
| Loop Control | AI-DLC Mob Elaboration |
References
- Martin Fowler, Harness engineering for coding agent users — https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html
- Parallel Web Systems, What is an agent harness in the context of large-language models? — https://parallel.ai/articles/what-is-an-agent-harness
- KodeLAB, Harness Engineering: AI Agent 從提示詞工程、上下文工程演進的新顯學 — https://klab.tw/2026/04/from-prompt-to-harness-engineering/
- ABMedia, Harness Engineering 是什麼? AI 的下一個戰場不是模型,而是模型外面的那層架構 — https://abmedia.io/harness-engineering-ai-agent-framework-explained
- awesome-harness-engineering — https://github.com/ai-boost/awesome-harness-engineering
- YouTube talk, Harness Engineering: 有時候語言模型不是不夠聰明,只是沒有人類好好引導 — https://www.youtube.com/watch?v=R6fZR_9kmIw
- Anthropic, Effective Context Engineering for AI Agents — https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents