Project Forge: Architecture for a Better AI Coding Harness

The Thesis

SWE-bench Pro proves it: same model, basic scaffold = 23%. Same model, optimized scaffold = 45%+. The scaffold accounts for a 22-point swing. Model swaps account for ~1 point at the frontier. The harness IS the product.

Five independent teams converged on the same finding in early 2026: the scaffold, not the model, is the lever. All of them identified the same four pillars: context architecture, agent specialization, persistent memory, and structured execution. All of them observed performance degrading beyond ~40% context utilization. And all of them concluded that better models make harness engineering MORE important, not less.


What's Wrong With Claude Code (as a harness)

1. Context is disposable

Compaction throws away information mid-task. The model forgets files it read 20 turns ago. PostCompact hooks help but can't recover what's been compressed. There's no tiered memory system -- everything is conversation history or nothing.

2. Single-agent bottleneck

Every task runs sequentially through one model. Agent Teams exist but are bolted on. There's no native task queue, no parallel execution, no shared workspace state between agents.

3. All-or-nothing permissions

Either constant approval prompts or --dangerously-skip-permissions. No per-path, per-tool, per-operation granularity. Our enforcer built 10 hooks and 22 deny rules to approximate what should be a policy engine.

4. Fixed tool set

Can't add custom tools, can't modify tool behavior, can't intercept tool calls before they reach the model. Hooks are reactive (fire after events), not proactive (shape behavior before events).

5. No model routing

One model does everything: planning, coding, review, documentation, debugging. OPENDEV uses five model roles with fallback chains. The benchmarks show Claude dominates refactoring (4.9/5) while GPT dominates DevOps (4.7/5). A good harness routes to the right model.

6. No structured execution

No phases. No checkpoints. No "plan then execute then verify" loop. Our enforcer bolted this on with commands, but it's prompt instructions, not runtime enforcement. The model can skip phases because nothing structurally prevents it.


Architecture: Four Layers

┌─────────────────────────────────────────────────────────────────┐
│                        USER INTERFACE                           │
│  Terminal TUI  │  Headless CLI  │  Web UI  │  Editor Extension  │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                     ORCHESTRATOR (the brain)                    │
│                                                                 │
│  ┌──────────┐  ┌───────────┐  ┌──────────┐  ┌───────────────┐  │
│  │ Workflow │  │   Model   │  │ Context  │  │    Policy     │  │
│  │ Engine   │  │  Router   │  │ Manager  │  │    Engine     │  │
│  └──────────┘  └───────────┘  └──────────┘  └───────────────┘  │
│                                                                 │
│  ┌──────────┐  ┌───────────┐  ┌──────────┐  ┌───────────────┐  │
│  │  Agent   │  │  Memory   │  │ Artifact │  │   Provider    │  │
│  │  Pool    │  │  Store    │  │ Manager  │  │   Manager     │  │
│  └──────────┘  └───────────┘  └──────────┘  └───────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                     TOOL EXECUTION LAYER                        │
│  Sandbox │ File Ops │ Shell │ LSP │ Git │ Web │ MCP │ Custom   │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                     PERSISTENCE LAYER                           │
│  Session DB │ Memory Store │ Plan Files │ Git History │ Metrics │
└─────────────────────────────────────────────────────────────────┘

Layer 1: Orchestrator

1.1 Workflow Engine

The core loop. Not a ReAct loop -- a structured phase machine with enforced transitions.

PLAN → EXECUTE → VERIFY → COMMIT → (next task or DONE)

Each phase is a discrete API call with its own system prompt, tool set, and model. The model cannot skip phases because the harness controls which phase runs next. This is the fundamental difference from Claude Code, where phases are prompt instructions the model can ignore.

interface Phase {
  name: string;
  model: ModelSpec;                    // which model runs this phase
  tools: Tool[];                       // tools available in this phase
  systemPrompt: PromptTemplate;        // phase-specific instructions
  maxTurns: number;                    // hard turn limit
  exitCondition: (state: TaskState) => PhaseResult;  // structured exit
  onFailure: FailureHandler;           // what to do on failure
}

interface Workflow {
  phases: Phase[];
  transitions: Map<string, TransitionRule>;  // PLAN->EXECUTE requires plan artifact
  checkpoints: CheckpointPolicy;             // git commit between phases
}

Why this matters: When the PLAN phase produces a plan artifact, the EXECUTE phase receives it as structured input, not as conversation history that gets compacted away. Each phase starts with a fresh context window containing only what it needs.

1.2 Model Router

Route tasks to the model best suited for each workload. Based on confirmed benchmarks:

interface ModelRouter {
  route(task: TaskSpec): ModelSpec;
}

// Routing table (from Milvus, YUV.AI, Terminal-Bench):
const ROUTES = {
  plan:          { primary: 'claude-opus',   reason: 'best at ambiguous prompts' },
  refactor:      { primary: 'claude-opus',   reason: '4.9/5 refactoring' },
  typeSafety:    { primary: 'claude-opus',   reason: '4.7/5 vs 4.2/5' },
  docs:          { primary: 'claude-opus',   reason: '4.9/5 vs 4.4/5' },
  architecture:  { primary: 'gpt-5.4',      reason: '4.8/5 vs 4.3/5' },
  devops:        { primary: 'gpt-5.4',      reason: 'Terminal-Bench 77.3% vs 65.4%' },
  concurrency:   { primary: 'gpt-5.4',      reason: 'Claude blind spot: 0/2 in Milvus' },
  review:        { primary: 'gpt-5.4',      fallback: 'claude-opus', reason: 'GPT recall + Claude precision' },
  security:      { primary: 'claude-opus',   reason: '53% vs 33% solo detection' },
};

With provider fallback chain: primary model → OpenRouter (built-in sub-routing) → alternate family → budget model → lint-only.
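The fallback chain can be sketched as a walk over an ordered list, stopping at the first healthy provider. A minimal Go sketch; the model names and the health map are illustrative placeholders, not a confirmed API:

```go
package main

import (
	"errors"
	"fmt"
)

// pickModel walks a fallback chain (primary → OpenRouter → alternate
// family → budget model) and returns the first healthy entry. If nothing
// is healthy, the caller degrades to lint-only mode.
func pickModel(chain []string, healthy map[string]bool) (string, error) {
	for _, m := range chain {
		if healthy[m] {
			return m, nil
		}
	}
	return "", errors.New("all providers down: degrade to lint-only")
}

func main() {
	chain := []string{"claude-opus", "openrouter", "gpt-5.4", "budget-model"}
	healthy := map[string]bool{"openrouter": true, "gpt-5.4": true}
	m, _ := pickModel(chain, healthy)
	fmt.Println(m) // claude-opus is unhealthy, so the router falls through
}
```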

1.3 Context Manager

Context exhaustion is the #1 problem in long-running AI sessions. Performance degrades beyond ~40% context utilization. The solution is a three-tier context architecture:

Tier 1: ACTIVE CONTEXT (in the API call)
  - Current phase system prompt
  - Current task specification
  - Relevant file contents (loaded on demand, not accumulated)
  - Working memory from Memory Store
  - Recent tool results (last 3-5)
  Target: <40% of context window

Tier 2: SESSION STATE (on disk, loaded into Tier 1 on demand)
  - Full plan with task statuses
  - All file paths read/written this session
  - Build/test/lint results
  - Error history (what failed and why)

Tier 3: PROJECT KNOWLEDGE (persistent across sessions)
  - Project map (file inventory, structure)
  - CLAUDE.md / AGENTS.md conventions
  - Memory store (learned patterns, gotchas)
  - Decision log (why choices were made)
  - Git history summary

Key insight from OPENDEV: Compaction should be *progressive*, not destructive. When Tier 1 is full, compress the oldest tool results first (they're the least relevant). Then compress file reads (summarize what was found). Never compress the task spec or system prompt.

interface ContextManager {
  // Budget tracking
  currentUtilization(): number;          // 0.0 - 1.0
  budget: { target: number; hard_cap: number; emergency: number };  // 0.40 / 0.70 / 0.90

  // Progressive compaction
  compact(level: 'gentle' | 'moderate' | 'aggressive'): void;
  // gentle:     truncate tool outputs >500 lines to summary
  // moderate:   compress file reads to "read X, found Y"
  // aggressive: summarize entire conversation to structured state

  // Tier promotion/demotion
  promote(key: string): void;   // Tier 2/3 → Tier 1
  demote(key: string): void;    // Tier 1 → Tier 2
}
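The budget thresholds imply a simple escalation rule for which compaction level to run. A sketch in Go, using the thresholds from the interface above (the function name is illustrative):

```go
package main

import "fmt"

// compactLevel maps context utilization (0.0-1.0) to a compaction level
// using the budget thresholds: target 0.40, hard cap 0.70, emergency 0.90.
// Below target, no compaction runs at all.
func compactLevel(utilization float64) string {
	switch {
	case utilization >= 0.90:
		return "aggressive" // summarize entire conversation to structured state
	case utilization >= 0.70:
		return "moderate" // compress file reads to "read X, found Y"
	case utilization >= 0.40:
		return "gentle" // truncate tool outputs >500 lines to summary
	default:
		return "none"
	}
}

func main() {
	for _, u := range []float64{0.35, 0.55, 0.75, 0.95} {
		fmt.Printf("%.2f -> %s\n", u, compactLevel(u))
	}
}
```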

1.4 Policy Engine

Replaces: settings.json deny rules + 10 bash guard hooks + permissions layer.

Policies are declarative rules evaluated at the orchestrator level, BEFORE tool calls reach the model. Not regex on command strings -- structured rules on structured tool calls.

interface Policy {
  match: ToolMatcher;       // which tool calls this applies to
  decision: 'allow' | 'deny' | 'prompt' | 'audit';
  conditions?: Condition[]; // optional: only when conditions are met
  reason: string;
}

// Examples:
const policies: Policy[] = [
  // Allow all file reads
  { match: { tool: 'read_file' }, decision: 'allow', reason: 'reads are safe' },

  // Allow writes to source, deny writes to config
  { match: { tool: 'write_file', pathGlob: 'src/**' }, decision: 'allow', reason: 'source code' },
  { match: { tool: 'write_file', pathGlob: '.claude/**' }, decision: 'deny', reason: 'enforcer files' },
  { match: { tool: 'write_file', pathGlob: '*.config.*' }, decision: 'prompt', reason: 'config change' },

  // Allow specific bash commands, deny dangerous ones
  { match: { tool: 'bash', commandPrefix: ['git', 'status'] }, decision: 'allow', reason: 'read-only git' },
  { match: { tool: 'bash', commandPrefix: ['git', 'push'] }, decision: 'deny', reason: 'orchestrator handles push' },
  { match: { tool: 'bash', commandPrefix: ['rm', '-rf'] }, decision: 'deny', reason: 'destructive' },

  // Symlink resolution happens BEFORE policy evaluation
  // No regex needed -- the policy engine resolves the real path first
];

Why this is better than hooks: Hooks fire after the tool call is dispatched. The policy engine evaluates before the API call is made. The model never sees denied tools in its available set. A model can't use ./helper.sh to hide git push because the bash tool executor parses the command tree (like Codex's tree-sitter approach) and evaluates each sub-command against policy.
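The sub-command check can be illustrated with a toy evaluator. A real engine would parse a proper command tree with tree-sitter as described above; this sketch naively splits on shell connectors just to show the all-sub-commands-must-pass rule:

```go
package main

import (
	"fmt"
	"strings"
)

// evalCommandTree approximates the parsed-tree check: every sub-command
// must pass policy, and one denied prefix denies the whole call.
// Splitting on ; | & is a simplification standing in for tree-sitter.
func evalCommandTree(command string, deniedPrefixes []string) string {
	subs := strings.FieldsFunc(command, func(r rune) bool {
		return r == ';' || r == '|' || r == '&'
	})
	for _, sub := range subs {
		sub = strings.TrimSpace(sub)
		for _, p := range deniedPrefixes {
			if strings.HasPrefix(sub, p) {
				return "deny: " + p
			}
		}
	}
	return "allow"
}

func main() {
	denied := []string{"git push", "rm -rf"}
	// The second sub-command trips the policy even though the first is fine.
	fmt.Println(evalCommandTree("git status && git push", denied))
}
```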

1.5 Agent Pool

Multiple agents running in parallel, each with its own context window, coordinated through shared state.

interface AgentPool {
  // Spawn agents with specific capabilities
  spawn(spec: AgentSpec): Agent;

  // Coordination
  sharedState: SharedWorkspace;     // files, git state, build results
  taskQueue: TaskQueue;             // prioritized work items
  resultCollector: ResultCollector; // gather outputs from parallel agents

  // Supervision
  supervisor: SupervisorAgent;      // reviews agent outputs before commit
}

// Agent roles (from the enforcer's operational experience):
type AgentRole =
  | 'planner'        // reads codebase, produces structured plan
  | 'implementer'    // executes one task, commits
  | 'reviewer'       // reviews a diff, produces verdict
  | 'tester'         // writes/fixes tests
  | 'security'       // security-focused review
  | 'documenter'     // updates docs after changes
  ;

Parallel execution model: For a plan with 50 tasks, instead of running sequentially (Claude Code today), group independent tasks and run them in parallel branches (git worktrees). Each agent gets a fresh context with only its task. Supervisor merges results.
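Grouping independent tasks can be done greedily by file overlap. A minimal sketch, assuming each task declares the files it will touch (a real scheduler would also honor task dependencies):

```go
package main

import "fmt"

type task struct {
	id    string
	files []string
}

// parallelGroups packs tasks into waves whose file sets do not overlap,
// so each wave can run in parallel git worktrees without merge conflicts.
func parallelGroups(tasks []task) [][]string {
	var waves [][]string
	var claimed []map[string]bool // files already taken per wave
	for _, t := range tasks {
		placed := false
		for i, c := range claimed {
			conflict := false
			for _, f := range t.files {
				if c[f] {
					conflict = true
					break
				}
			}
			if !conflict {
				for _, f := range t.files {
					c[f] = true
				}
				waves[i] = append(waves[i], t.id)
				placed = true
				break
			}
		}
		if !placed {
			c := map[string]bool{}
			for _, f := range t.files {
				c[f] = true
			}
			claimed = append(claimed, c)
			waves = append(waves, []string{t.id})
		}
	}
	return waves
}

func main() {
	waves := parallelGroups([]task{
		{"T1", []string{"a.go"}},
		{"T2", []string{"b.go"}},
		{"T3", []string{"a.go", "c.go"}},
	})
	fmt.Println(waves) // T1 and T2 share no files; T3 conflicts with T1
}
```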


Layer 2: Tool Execution

2.1 Sandbox

Native OS-level sandboxing, not regex guards. Options by platform:

The sandbox is the FIRST enforcement layer. Regex guards become unnecessary when the OS prevents the operation.

2.2 File Operations

Same tools as Claude Code (read, write, edit, glob, grep) but with LSP integration:

LSP gives the agent the same understanding of code structure that a human IDE user has. OPENDEV implements this with a four-layer abstraction. It's the single biggest tool improvement over Claude Code.

2.3 Shell Execution

Six-stage execution (from OPENDEV):

1. Policy check (parsed command tree, not string regex)

2. Approval (if policy says 'prompt')

3. Sandbox enforcement (OS-level)

4. Execution with timeout

5. Output capture + truncation

6. Background task management (long-running builds, test suites)

Background tasks are critical. npm test might take 60 seconds. The agent shouldn't block -- it should start the test, continue working, and check results when needed.


Layer 3: Memory and Persistence

3.1 Dual Memory Architecture (from OPENDEV)

Episodic Memory: Project-specific knowledge accumulated across sessions.

Stored as structured entries with relevance scores. Injected into context when relevant (not always).

Working Memory: Current session state.

Persisted to disk every N turns. Survives crashes and rate limits.

3.2 Event-Driven Reminders (from OPENDEV)

The model forgets instructions over long sessions ("instruction fade-out"). Reminders inject critical rules when specific events are detected:

interface Reminder {
  trigger: EventDetector;     // when to fire
  template: string;           // what to inject
  priority: number;           // context budget priority
  maxFirings: number;         // don't spam
}

// Examples:
{ trigger: 'file_write_to_test', template: 'TEST RULES: never weaken assertions...', priority: 90 },
{ trigger: 'context_above_60pct', template: 'COMPACT: context is filling up...', priority: 95 },
{ trigger: 'error_repeated_3x', template: 'STUCK: you have hit this error 3 times...', priority: 100 },
{ trigger: 'task_running_20min', template: 'TIMEOUT: this task is taking too long...', priority: 85 },
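The maxFirings guard is a small counter per trigger. A sketch of that anti-spam gate, assuming per-session counters:

```go
package main

import "fmt"

// reminderGate enforces maxFirings so a reminder can't spam the context
// window every time its trigger event fires.
type reminderGate struct {
	maxFirings int
	fired      map[string]int // firings so far, per trigger
}

func (g *reminderGate) shouldFire(trigger string) bool {
	if g.fired == nil {
		g.fired = map[string]int{}
	}
	if g.fired[trigger] >= g.maxFirings {
		return false
	}
	g.fired[trigger]++
	return true
}

func main() {
	g := &reminderGate{maxFirings: 2}
	for i := 0; i < 3; i++ {
		fmt.Println(g.shouldFire("error_repeated_3x"))
	}
}
```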

3.3 Plan as First-Class Artifact

Not conversation history. Not a markdown file the model might forget to update. A structured data store that the harness manages:

interface Plan {
  tasks: Task[];
  metadata: {
    source: string;           // which spec generated this
    created: Date;
    lastCheckpoint: string;   // git commit hash
  };
}

interface Task {
  id: string;
  description: string;
  status: 'pending' | 'active' | 'done' | 'failed' | 'blocked';
  commit?: string;            // hash when done
  dependencies: string[];     // task IDs that must complete first
  files: string[];            // files this task will modify (for parallel scheduling)
  attempts: Attempt[];        // history of tries
}

The harness updates task status, not the model. When a task commit succeeds, the harness marks it done. When it fails, the harness marks it failed with the error. The model can't fabricate completion because it doesn't control the plan state.
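Because the harness owns the plan state, "what can run next" is a pure function of task statuses and dependencies. A sketch of that readiness check (field names mirror the Task interface above in simplified form):

```go
package main

import "fmt"

type planTask struct {
	id     string
	status string   // pending | active | done | failed | blocked
	deps   []string // task IDs that must complete first
}

// readyTasks returns pending tasks whose dependencies are all done.
// Only the harness calls this and flips statuses; the model never does.
func readyTasks(tasks []planTask) []string {
	done := map[string]bool{}
	for _, t := range tasks {
		if t.status == "done" {
			done[t.id] = true
		}
	}
	var ready []string
	for _, t := range tasks {
		if t.status != "pending" {
			continue
		}
		ok := true
		for _, d := range t.deps {
			if !done[d] {
				ok = false
				break
			}
		}
		if ok {
			ready = append(ready, t.id)
		}
	}
	return ready
}

func main() {
	fmt.Println(readyTasks([]planTask{
		{id: "T1", status: "done"},
		{id: "T2", status: "pending", deps: []string{"T1"}},
		{id: "T3", status: "pending", deps: []string{"T2"}},
	}))
}
```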


Layer 4: Quality Control

4.1 ROI Filter (from enforcer)

Applied automatically at every layer. Users rubber-stamp everything, so the system must filter.

interface ROIFilter {
  classify(finding: Finding): 'fix' | 'fix_if_easy' | 'drop';

  // Tier 1 AUTO-FIX: security, data loss, breaking bugs, scaling blockers
  // Tier 2 FIX IF REASONABLE: reliability, test gaps on critical paths
  // Tier 3 AUTO-DROP: style, theory, pattern migrations, marginal gains

  // "Nothing to fix" is valid. Offer to lower threshold.
}
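The three tiers reduce to a category lookup. A sketch in Go; the category names are illustrative, since the interface leaves the exact taxonomy open:

```go
package main

import "fmt"

// classifyFinding maps a finding category to the three ROI tiers above.
func classifyFinding(category string) string {
	switch category {
	case "security", "data-loss", "breaking-bug", "scaling-blocker":
		return "fix" // Tier 1 AUTO-FIX
	case "reliability", "critical-path-test-gap":
		return "fix_if_easy" // Tier 2 FIX IF REASONABLE
	default:
		return "drop" // Tier 3 AUTO-DROP: style, theory, marginal gains
	}
}

func main() {
	fmt.Println(classifyFinding("security"))
	fmt.Println(classifyFinding("style"))
}
```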

4.2 Cross-Model Verification

Every agent output is reviewed by a different model family before commit. The Milvus data: solo detection 53%, cross-model debate 80%.

Implementer (Claude) → Reviewer (GPT) → Commit
Implementer (GPT)    → Reviewer (Claude) → Commit
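The pairing rule is simply "reviewer from a different family than the implementer". A sketch; detecting the family by name prefix is an illustrative shortcut, not a confirmed mechanism:

```go
package main

import (
	"fmt"
	"strings"
)

// reviewerFor picks a reviewer from a different model family than the
// implementer, per the cross-model verification rule above.
func reviewerFor(implementer string) string {
	if strings.HasPrefix(implementer, "claude") {
		return "gpt-5.4"
	}
	return "claude-opus"
}

func main() {
	fmt.Println(reviewerFor("claude-opus")) // reviewed by the GPT family
	fmt.Println(reviewerFor("gpt-5.4"))     // reviewed by the Claude family
}
```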

4.3 Deterministic Checks

Always run. Can't be fooled by clever phrasing.


Implementation Strategy

Phase 1: Minimum Viable Harness (2-3 weeks)

Build the orchestrator core in Go. Target: run a structured plan against the Anthropic API with phase enforcement.

Week 1: Workflow engine + context manager + file tools
         Can execute: read files → plan → execute tasks → verify
Week 2: Policy engine + model router + Codex integration
         Can route: Claude for code, GPT for review
Week 3: Memory store + session persistence + parallel agents
         Can survive: rate limits, crashes, multi-session work

Phase 2: Production Hardening (2-3 weeks)

Sandbox integration, LSP tools, background tasks, cleanup system, web UI.

Phase 3: Self-Improvement Loop

Use the harness to build the harness. Benchmark against Claude Code on real tasks. Instrument everything. Iterate on the scaffold until the numbers prove it's better.


Technology Choices

Applying the enforcer's own principle: LTS/stable over hype.

Ships as one binary. curl -fsSL install.sh | bash and you're done.


What This Replaces From the Enforcer

Enforcer (5,077 lines of bash)                Forge (native Go)
------------------------------                -----------------
10 bash guard hooks                           Policy engine with tree-sitter command parsing
Regex command matching                        Structured rules on parsed command trees
settings.json deny list                       Declarative policy files (YAML)
PostCompact hook to restore state             Three-tier context with progressive compaction
/build command (3000 lines of markdown)       Workflow engine with enforced phase machine
/scope + /scan-and-repair commands            Built-in workflows as Go code
Codex review/execute bash scripts             Native multi-model routing with goroutines
cleanup.sh                                    Artifact manager built into session lifecycle
runclaude + env parsing                       Native config with OS sandbox integration
Subagent supervision hooks                    Parallel agent pool with typed supervisor
Cross-verify bash script                      Native cross-model verification pipeline
detect-self-skip keyword/semantic grep        Plan state managed by harness (model can't fabricate)
ROI filter in prompt instructions             ROI filter in workflow engine (findings never reach model)

The enforcer was 5,000+ lines of bash patching over someone else's runtime. Forge is a purpose-built Go runtime where every enforcer lesson is a compiled, tested function.


Fork vs Build: The Decision

Option A: Fork OpenCode (MIT Licensed, Go)

What it is: 90K+ GitHub stars, 640+ contributors, client-server architecture, MIT licensed. Built by the SST team. Go backend, Bubble Tea TUI, Tauri desktop app, SolidJS web UI.

What it already has:

What it's missing (what we'd add):

Risk: OpenCode is a fast-moving project (90K stars, active development). Maintaining a fork means merging upstream changes. But the MIT license means we can take what we need.

Option B: Fork Aider (Apache 2.0, Python)

Verdict: Wrong language, wrong architecture, wrong scope. Aider is a mature interactive pair-programmer, not an autonomous orchestration system. Python, single-process, no plugin system, no client-server split.

Option C: Build from Scratch (Go)

Pros: Total control, clean architecture, no legacy constraints.

Cons: 8-10 weeks to feature parity with OpenCode's basics. Rebuilding solved problems.

Option D: Fork OpenCode + Graft Forge Orchestrator (RECOMMENDED)

The play: Fork OpenCode. Add the Forge orchestrator as a new package inside the monorepo. Replace the agent loop with the workflow engine. Keep everything else.

opencode/               (forked)
  packages/
    opencode/            ← existing Go backend (tools, providers, LSP, git, session DB)
      src/
        agent/           ← REPLACE: swap ReAct loop for Forge workflow engine
        tool/            ← KEEP: file ops, shell, LSP, MCP, web
        provider/        ← KEEP + EXTEND: add model router layer
        session/         ← KEEP + EXTEND: add three-tier context
        server/          ← KEEP: HTTP/SSE API
    forge/               ← NEW Go package
      workflow/          # Phase machine with enforced transitions
      context/           # Three-tier context manager with budget tracking
      policy/            # Declarative policy engine with tree-sitter
      router/            # Model routing by task type + fallback chains
      pool/              # Parallel agent pool with supervisor
      memory/            # Episodic + working memory
      verify/            # Cross-model verification pipeline
      roi/               # Impact-effort filter
      reminders/         # Event-driven instruction refresh
    app/                 ← KEEP: SolidJS shared UI components
    desktop/             ← KEEP: Tauri desktop app
    web/                 ← KEEP: documentation site
    sdk/                 ← KEEP: TypeScript SDK for external integrations

Why fork, not orchestrate externally:

What we keep from OpenCode (70% of the codebase):

What we replace (the agent core, ~15% of the codebase):

What we add (Forge packages, ~15% new code):

Effort estimate:


Implementation Plan: Phase 1 (Weeks 1-2)

Week 1: Fork + Forge Core

Day 1: Fork and build

git clone https://github.com/sst/opencode.git forge
cd forge
go build -o forge ./packages/opencode
./forge  # verify it runs

Day 2-3: Workflow engine (forge/workflow/)

package workflow

import (
    "context"
    "fmt"
)

type Phase struct {
    Name          string
    Model         ModelSpec           // which model runs this phase
    Tools         []string            // tool names available in this phase
    SystemPrompt  string              // phase-specific instructions
    MaxTurns      int                 // hard turn limit
    ExitCondition func(TaskState) PhaseResult
    OnFailure     FailureHandler
}

type Engine struct {
    phases      []Phase
    transitions map[string]TransitionRule  // PLAN->EXECUTE requires plan artifact
    current     int
    state       *TaskState
}

// Run executes the workflow: PLAN → EXECUTE → VERIFY → COMMIT
// The model cannot skip phases because the engine controls which phase runs next.
func (e *Engine) Run(ctx context.Context, task TaskSpec) error {
    for e.current < len(e.phases) {
        phase := e.phases[e.current]
        result, err := e.runPhase(ctx, phase)
        if err != nil { return e.handleFailure(phase, err) }

        next, ok := e.transitions[phase.Name]
        if !ok { break }
        if !next.Satisfied(result) {
            return fmt.Errorf("phase %s did not produce required artifact", phase.Name)
        }
        e.current++
    }
    return nil
}

Day 4-5: Context manager (forge/context/)

package context

type Tier int
const (
    Active  Tier = iota  // In the API call (<40% of window)
    Session              // On disk, promoted on demand
    Project              // Persistent across sessions
)

type Manager struct {
    budget   Budget        // target: 0.40, hard_cap: 0.70
    active   []ContextItem // currently in the API call
    session  *SessionStore // disk-backed session state
    project  *ProjectStore // persistent memory
}

// Compact progressively reduces context when budget is exceeded.
// Gentle: truncate tool outputs. Moderate: summarize file reads. Aggressive: full state summary.
func (m *Manager) Compact(level CompactLevel) {
    switch level {
    case Gentle:
        m.truncateToolOutputs(500) // lines
    case Moderate:
        m.summarizeFileReads()
    case Aggressive:
        m.summarizeToState()
    }
}

// PrepareContext assembles the API call with only what this phase needs.
// Each phase starts with a fresh context -- no accumulated history.
func (m *Manager) PrepareContext(phase Phase, task TaskSpec) []Message {
    msgs := []Message{
        {Role: "system", Content: phase.SystemPrompt},
    }
    // Add task spec
    msgs = append(msgs, Message{Role: "user", Content: task.Render()})
    // Add relevant memory from project store
    for _, mem := range m.project.Relevant(task) {
        msgs = append(msgs, Message{Role: "system", Content: mem.Render()})
    }
    // Add working memory from session
    msgs = append(msgs, m.session.WorkingMemory()...)
    return msgs
}

Week 2: Policy + Router

Day 1-2: Policy engine (forge/policy/)

package policy

type Engine struct {
    rules  []Rule
    parser *treesitter.Parser  // for bash command decomposition
}

// Evaluate checks a tool call against all rules BEFORE it reaches the model.
// Returns deny with reason, or allow.
func (e *Engine) Evaluate(call ToolCall) Decision {
    // For bash commands: parse the command tree, evaluate each sub-command
    if call.Tool == "bash" {
        commands := e.parser.Parse(call.Input.Command)
        for _, cmd := range commands {
            if d := e.evaluateCommand(cmd); d.Action == Deny {
                return d
            }
        }
    }
    // For file operations: resolve symlinks first, then match path rules
    if call.Tool == "write_file" || call.Tool == "edit_file" {
        realPath := resolveSymlink(call.Input.Path) // fail-closed
        call.Input.Path = realPath
    }
    return e.matchRules(call)
}

Day 3-4: Model router (forge/router/)

package router

type Router struct {
    routes    map[TaskType]ModelSpec
    fallbacks map[string][]ModelSpec  // provider → fallback chain
    circuits  map[string]*CircuitBreaker
}

// Route selects the best model for a task type, with fallback.
func (r *Router) Route(taskType TaskType) (ModelSpec, error) {
    primary := r.routes[taskType]
    if r.circuits[primary.Provider].IsOpen() {
        return r.fallback(primary.Provider)
    }
    return primary, nil
}

Day 5: Integration test
