
Harness Engineering

This proposal introduces harness engineering as a first-class discipline within the AI Workflow Conduction framework. A harness is everything around an LLM that is not the model itself: the tools it can call, the loop that decides when to stop, the memory it keeps between steps, the guardrails that block dangerous actions, and the sensors that verify its output. The working identity is:

Agent = Model + Harness

Core Insight: As frontier models converge in raw capability, the harness around them becomes the differentiator. A capable model with a poor harness stalls, picks the wrong tool, or runs past its permission boundary. A well-designed harness gives the same model dramatically higher reliability on the same task.

Related Proposals: Claude Skills Adoption, Agent-Friendly Knowledge Base, AI-First Context Infrastructure, Continuous Context Cleanup

Problem Statement

The Ceiling of "More Context"

Context §1.2 already establishes that specification chaos, review velocity gaps, and role confusion limit AI effectiveness. The response across AI-First Context Infrastructure and Agent-Friendly Knowledge Base has been to give agents better context.

Better context raises the floor. It does not fix every failure mode.

A capable model with a full, well-curated context still fails in ways that context alone cannot address:

| Failure mode | Symptom | Why context alone will not fix it |
| --- | --- | --- |
| Agent stall | Loop never terminates; agent repeats the same tool call | The harness, not the context, controls loop termination |
| Wrong tool | Agent picks a generic tool when a specific one exists | Tool granularity is a harness design choice |
| Permission breach | Agent takes a destructive action without confirmation | Permission boundaries live in the harness, not the prompt |
| Silent drift | Output looks plausible; nobody catches the regression | Sensors (tests, reviewers) are a harness concern |
| Over-long session | Context window exhausted mid-task | Memory management and summarization are harness features |

Three Paradigms, Not One

Prompt engineering, context engineering, and harness engineering address three different failure surfaces. They compound; they do not replace each other.

Knowledge compounds across layers. Yesterday's prompt techniques live inside today's SDKs. Today's context expertise informs tomorrow's harness design. A team jumping straight to "adopt harness engineering" without a working context layer will hit the same problems the context-engineering proposals were written to solve.

Definition

Agent Equals Model Plus Harness

Martin Fowler frames the identity directly in Harness engineering for coding agent users:

"Harness engineering refers to everything in an AI coding agent except the model itself."

Parallel Web Systems extends the definition beyond coding:

"The harness is what connects an AI model to the outside world, enabling it to use tools, remember information between steps, and interact with complex environments."

The harness has five observable components:

  1. Tools — the callable surface the model reaches through. Coarse-grained or fine-grained, domain-general or domain-specific.
  2. Loop control — stop conditions, escalation triggers, budget limits, multi-agent coordination.
  3. Memory and state — working context, session log, long-term memory, summarization and retrieval.
  4. Guardrails — permission boundaries, schema validation, safety filters.
  5. Sensors — tests, linters, type checkers, review agents, runtime monitors that observe output after the agent acts.
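
The five components above can be sketched as one small loop. This is a minimal illustration under stated assumptions, not any particular runtime's API: `Step`, `Harness`, `run_agent`, the `model` callable, and the "same call twice in a row" stall rule are all hypothetical names and choices made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    tool: str           # name of the tool the model wants to call
    arg: str            # single string argument, kept simple for the sketch
    done: bool = False  # the model signals that it considers the task finished


@dataclass
class Harness:
    tools: dict[str, Callable[[str], str]]           # 1. Tools
    allowed: set[str]                                # 4. Guardrails (allow-list)
    sensors: list[Callable[[str], str | None]]       # 5. Sensors: finding or None
    max_steps: int = 10                              # 2. Loop control (step budget)
    memory: list[str] = field(default_factory=list)  # 3. Memory and state


def run_agent(model: Callable[[list[str]], Step], harness: Harness) -> str:
    """Drive a hypothetical model until it finishes, stalls, or runs out of budget."""
    last_call = None
    for _ in range(harness.max_steps):
        step = model(harness.memory)
        if step.done:
            return "done"
        call = (step.tool, step.arg)
        if call == last_call:
            return "stalled"  # identical call twice in a row: stop and escalate
        last_call = call
        if step.tool not in harness.allowed or step.tool not in harness.tools:
            harness.memory.append(f"blocked: {step.tool}")
            continue
        result = harness.tools[step.tool](step.arg)
        harness.memory.append(f"{step.tool}({step.arg}) -> {result}")
        for sensor in harness.sensors:
            finding = sensor(result)
            if finding:
                harness.memory.append(f"sensor: {finding}")  # feed the signal back
    return "budget_exhausted"
```

Nothing in the loop depends on which model is behind the `model` callable, which is the sense in which the harness can be engineered separately from the model.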

Guides and Sensors

Fowler splits a harness into two control types based on when they act:

Guides shape output before the agent acts. They raise the probability of a good first attempt.

Sensors observe output after the agent acts. They catch bad attempts and feed the signal back into the loop.

Computational and Inferential Controls

A second axis cuts across Guides and Sensors: how the control itself runs.

| Type | Latency | Cost | Determinism | Examples |
| --- | --- | --- | --- | --- |
| Computational | Milliseconds to seconds | Near zero | Deterministic | Linters, type checkers, unit tests, schema validators |
| Inferential | Seconds to minutes | LLM call | Non-deterministic | Review agents, LLM-as-judge, semantic-drift detectors |

Computational controls catch mechanical problems reliably. Inferential controls add semantic judgment where mechanical rules cannot express intent. A mature harness uses both.
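
One way to combine the two axes is to tag every control with both its timing and its kind, then run the cheap deterministic sensors first. The sketch below is an assumption-laden illustration: `Control`, `Timing`, `Kind`, and `run_sensors` are hypothetical names, and skipping the inferential pass while mechanical findings remain is one possible policy, not a rule stated by the sources cited here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Timing(Enum):
    GUIDE = "guide"    # shapes output before the agent acts
    SENSOR = "sensor"  # observes output after the agent acts


class Kind(Enum):
    COMPUTATIONAL = "computational"  # deterministic, near-zero cost
    INFERENTIAL = "inferential"      # LLM call, non-deterministic


@dataclass(frozen=True)
class Control:
    name: str
    timing: Timing
    kind: Kind
    check: Callable[[str], Optional[str]]  # returns a finding, or None if clean


def run_sensors(controls: List[Control], output: str) -> List[str]:
    """Run computational sensors first; run inferential ones only if those pass."""
    sensors = [c for c in controls if c.timing is Timing.SENSOR]
    findings: List[str] = []
    for kind in (Kind.COMPUTATIONAL, Kind.INFERENTIAL):
        for control in sensors:
            if control.kind is kind:
                finding = control.check(output)
                if finding:
                    findings.append(f"{control.name}: {finding}")
        if findings:
            break  # assumed policy: fix mechanical problems before paying for an LLM call
    return findings
```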

Operating System Analogy

A common framing, borrowed from the Chinese-language analysis by KodeLAB, maps the parts of an agent onto the parts of a computer:

  • Model is the CPU — raw computation.
  • Context window is RAM — working memory, bounded.
  • Harness is the operating system — schedules work, manages resources, mediates access to tools and memory.

A naked LLM is a CPU without an OS. It can compute. It cannot do useful work on its own.

Proposed Solution

Adopt a shared organizational harness around the model. It has three layers:

Layer 1: Base Harness

A standard agent runtime configuration shared across all teams.

| Element | Recommended Default |
| --- | --- |
| Agent runtime | Claude Code (or an equivalent with skills, hooks, MCP support) |
| Permission defaults | Minimal: no write or execute without explicit allow-list |
| Hooks | Pre-tool-use hook enforcing the project AGENTS.md boundary |
| Settings | Committed .claude/settings.json per project |
| Model | Latest generally available frontier model, unless a project pins otherwise |

The point of Layer 1 is not to pick the right runtime once and stop. It is to make the choice explicit and the configuration shared, so that every team inherits the same defaults and diverges only with justification.
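
The "no write or execute without explicit allow-list" default amounts to a default-deny check. The following is a minimal sketch of that logic only. It does not reproduce Claude Code's hook interface or the settings.json schema, and `ToolCall`, `check_permission`, and the three action names are hypothetical. Treating reads as always allowed is also an assumption drawn from the wording of the default.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class ToolCall:
    action: str  # "read", "write", or "execute" (hypothetical categories)
    target: str  # file path or command line


def check_permission(call: ToolCall, allow_list: Set[Tuple[str, str]]) -> bool:
    """Default-deny: anything other than a read must be explicitly allow-listed."""
    if call.action == "read":
        return True
    # Writes, executes, and any unknown action fall through to the allow-list.
    return (call.action, call.target) in allow_list


# Example allow-list a project might commit alongside its settings.
ALLOW = {("write", "docs/notes.md"), ("execute", "pytest")}
print(check_permission(ToolCall("write", "src/app.py"), ALLOW))  # False
```

A real hook would likely match on patterns rather than exact strings; exact matching keeps the sketch unambiguous.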

Layer 2: Organizational Guides and Sensors

The existing proposals in this chapter are the organization's guides and sensors. Harness engineering is the framing that ties them together.

| Role | Component | Existing Proposal |
| --- | --- | --- |
| Guide | AI-accessible context surface | AI-First Context Infrastructure |
| Guide | Markdown knowledge base | Agent-Friendly Knowledge Base |
| Guide | Shared Claude Skills | Claude Skills Adoption |
| Guide | Specification retrieval | Internal Spec Platform |
| Guide | Terminology constraints | Ubiquitous Language |
| Guide | Existing-fact specs | Spec Extraction |
| Guide | Requirement source of truth | Global Requirement Store |
| Guide | Component inventory | Design System |
| Guide | Component ownership model | shadcn/ui Foundation |
| Guide | Spec hierarchy | Multi-Product Spec Management |
| Guide | Document graph | Frontmatter Spec Coordination |
| Guide | Project AI guidance file | CLAUDE.md Standards (planned) |
| Sensor | Linters and type checkers | Tooling baseline |
| Sensor | Continuous cleanup review | Continuous Context Cleanup |
| Sensor | Tech stack alignment | Tech Radar and Roadmaps |
| Loop Control | AI-first decision points | AI-First Decision Making |
| Loop Control | Elaboration sessions | AI-DLC Mob Elaboration |

Layer 3: The Steering Loop

When the same failure mode recurs, iterate the harness, not the prompt.

Diagnostic rubric:

| Symptom | Likely layer |
| --- | --- |
| Wrong output shape, formatting drift | Prompt |
| Missing fact, outdated reference | Context |
| Agent stall, wrong tool, permission breach, silent regression | Harness |
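
The rubric above can be kept as a lookup table so that tagged issues are routed to a layer consistently. This is an illustrative sketch: `RUBRIC`, `likely_layer`, and `recurring_harness_failures` are hypothetical names, and the recurrence threshold of two occurrences is an assumption rather than a figure from this proposal.

```python
from collections import Counter
from typing import List

# Symptom -> layer, transcribed from the diagnostic rubric.
RUBRIC = {
    "wrong output shape": "prompt",
    "formatting drift": "prompt",
    "missing fact": "context",
    "outdated reference": "context",
    "agent stall": "harness",
    "wrong tool": "harness",
    "permission breach": "harness",
    "silent regression": "harness",
}


def likely_layer(symptom: str) -> str:
    """Return the layer to investigate first, or 'unclassified' if not in the rubric."""
    return RUBRIC.get(symptom.strip().lower(), "unclassified")


def recurring_harness_failures(observed: List[str], min_count: int = 2) -> List[str]:
    """List harness-layer symptoms seen at least min_count times (assumed threshold)."""
    counts = Counter(s.strip().lower() for s in observed)
    return sorted(
        s for s, n in counts.items()
        if n >= min_count and likely_layer(s) == "harness"
    )
```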

The steering loop has a named owner. Harness changes go through a lightweight review, the same as any other infrastructure change.

Implementation Roadmap

Four phases, staged to avoid the "shipped the harness, nobody uses it" failure.

Phase 1: Baseline the Harness

Deliverables:

  • Standard .claude/settings.json committed in a reference repository.
  • AGENTS.md template published.
  • Permission hook blocking unauthorized write or execute actions.
  • One-page "What runs on your machine" doc for every engineer.

Exit criteria:

  • Every active project has an AGENTS.md file.
  • Default permission boundary is enforced by a pre-tool-use hook.
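
The first exit criterion is mechanical enough to check automatically. The sketch below assumes, purely for illustration, that every active project is a direct subdirectory of one root directory; `projects_missing_agents_md` is a hypothetical helper, not an existing tool.

```python
from pathlib import Path
from typing import List


def projects_missing_agents_md(root: Path) -> List[str]:
    """Return names of project directories under root that lack an AGENTS.md file."""
    return sorted(
        project.name
        for project in root.iterdir()
        if project.is_dir() and not (project / "AGENTS.md").is_file()
    )


if __name__ == "__main__":
    import sys

    missing = projects_missing_agents_md(Path(sys.argv[1] if len(sys.argv) > 1 else "."))
    for name in missing:
        print(f"missing AGENTS.md: {name}")
    sys.exit(1 if missing else 0)  # a non-zero exit fails a CI job
```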

Phase 2: Seed the Guides

Deliverables:

Exit criteria:

  • Two or more teams are consuming shared skills.
  • Agents can retrieve domain knowledge without manual paste.

Phase 3: Seed the Sensors

Deliverables:

  • CI-integrated linters and type checkers on every repository.
  • LLM-as-judge review pipeline for PRs above a size threshold.
  • Continuous context cleanup process running (see Continuous Context Cleanup).

Exit criteria:

  • Sensor signal is written back into the agent loop (not just human dashboards).
  • At least one class of regression has been caught by sensors pre-merge.

Phase 4: Institutionalize the Steering Loop

Deliverables:

  • Named owner for the organizational harness.
  • Monthly harness retro reviewing recurring failure modes.
  • Documented cycle-time target, measured from issue observed to harness updated.

Exit criteria:

  • Harness changes are tracked in the same backlog as product work.
  • Recurrence rate of named failure modes is trending down month over month.

Success Metrics

| Metric | Target | How to Measure |
| --- | --- | --- |
| Recurrence rate of named failure modes | Declining month over month | Track tagged issues per failure class |
| Skill reuse rate across teams | > 50% of active skills used by more than one team | Skill invocation logs |
| Agent sessions requiring human rescue | < 10% of sessions per week | Session telemetry or self-report |
| Issue-to-harness-update cycle time | < 1 sprint (median) | Timestamp from issue open to harness change merged |
| Shared AGENTS.md adoption | 100% of active projects | File presence check in CI |
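
The first four metrics reduce to simple arithmetic once the raw data is collected. The functions below are a sketch under assumptions: the input shapes (a skill-to-teams mapping, a list of per-session rescue flags, cycle times in days, monthly issue counts) and all function names are hypothetical, since the proposal does not fix a telemetry format.

```python
from statistics import median
from typing import Dict, List, Set


def skill_reuse_rate(teams_by_skill: Dict[str, Set[str]]) -> float:
    """Fraction of active skills used by more than one team (target: > 0.5)."""
    if not teams_by_skill:
        return 0.0
    shared = sum(1 for teams in teams_by_skill.values() if len(teams) > 1)
    return shared / len(teams_by_skill)


def rescue_rate(needed_rescue: List[bool]) -> float:
    """Fraction of sessions in a week that needed human rescue (target: < 0.1)."""
    return sum(needed_rescue) / len(needed_rescue) if needed_rescue else 0.0


def median_cycle_days(cycle_days: List[float]) -> float:
    """Median days from issue opened to harness change merged."""
    return median(cycle_days)


def is_declining(monthly_counts: List[int]) -> bool:
    """True if each month's count of a failure class is below the previous month's."""
    return len(monthly_counts) >= 2 and all(
        later < earlier for earlier, later in zip(monthly_counts, monthly_counts[1:])
    )
```

For instance, with two active skills of which one is used by two teams, `skill_reuse_rate` returns 0.5, which sits exactly at the target boundary and does not yet satisfy "> 50%".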

Anti-Patterns

These are failure patterns observed in teams adopting agent frameworks without a harness discipline.

| Anti-pattern | What it looks like | Why it backfires |
| --- | --- | --- |
| Prompt stuffing | Every new failure is answered with more text in the system prompt | Prompts grow unreadable; context budget shrinks; root cause is usually a missing tool or sensor |
| Context bloat | Every new failure is answered with more documents piped into context | Signal-to-noise drops; model output quality declines |
| Harness sprawl | Multiple competing skills, hooks, or MCP servers that overlap | Agents pick the wrong one; maintenance burden compounds |
| Orphan harness | Harness exists, no named owner, nobody updates it | Drift accumulates silently; teams quietly stop using it |
| Single-layer thinking | Treating harness as a replacement for context engineering | Missing knowledge still produces wrong code; the three layers compound and do not substitute for one another |

CLAUDE.md Integration

A project's CLAUDE.md (or AGENTS.md) is the primary Guide at the project level. At minimum it declares:

  • Which skills are in scope for this project.
  • Which tools the agent is permitted to invoke without confirmation.
  • Where to find the project's specification, design system, and knowledge base.
  • What the steering-loop owner expects to be notified about.

See CLAUDE.md Standards (planned) for the full schema.
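
Until that schema exists, the four minimum declarations can be checked with a small structure. This sketch is not the planned schema: `ProjectGuide`, its field names, and `missing_declarations` are hypothetical, and `None` is used to distinguish "not declared" from "declared as empty" (a project may legitimately permit no tools without confirmation).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

REQUIRED_LOCATIONS = ("specification", "design_system", "knowledge_base")


@dataclass
class ProjectGuide:
    skills_in_scope: Optional[List[str]] = None             # skills in scope for the project
    tools_without_confirmation: Optional[List[str]] = None  # tools allowed without asking
    knowledge_locations: Optional[Dict[str, str]] = None    # where spec, design system, KB live
    notify_owner_about: Optional[List[str]] = None          # events the steering-loop owner wants


def missing_declarations(guide: ProjectGuide) -> List[str]:
    """Return the names of minimum declarations the guide does not yet make."""
    missing = [
        name
        for name in ("skills_in_scope", "tools_without_confirmation", "notify_owner_about")
        if getattr(guide, name) is None
    ]
    locations = guide.knowledge_locations or {}
    missing += [
        f"knowledge_locations.{key}" for key in REQUIRED_LOCATIONS if key not in locations
    ]
    return missing
```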

| Role in Harness | Proposal |
| --- | --- |
| Guide | AI-First Context Infrastructure |
| Guide | Agent-Friendly Knowledge Base |
| Guide | Claude Skills Adoption |
| Guide | Internal Spec Platform |
| Guide | Ubiquitous Language |
| Guide | Spec Extraction |
| Guide | Global Requirement Store |
| Guide | Design System |
| Guide | shadcn/ui Foundation |
| Guide | Multi-Product Spec Management |
| Guide | Frontmatter Spec Coordination |
| Sensor | Continuous Context Cleanup |
| Sensor | Tech Radar and Roadmaps |
| Loop Control | AI-First Decision Making |
| Loop Control | AI-DLC Mob Elaboration |

References