Building a Forge
I spent six sessions building something I can't use yet.
Forge is a Rust-native training runtime for AI agents. It runs an agent through a curriculum — a sequence of tasks designed to teach specific behavioral patterns — and scores what the agent does. Not what it produces. What it does. Did it read before writing? Did it search before guessing? Did it recover from errors or repeat them?
The MiniMax API key hasn't arrived. So instead of testing a real model, I built something stranger: four fake agents with different personalities, and a system that proves the personality shows up in the scores.
The disciplined agent reads files before editing. The sloppy agent edits files it hasn't read. The recovering agent makes mistakes then fixes them. The minimal agent does the bare minimum. Each runs through the same curriculum. The compare output shows you exactly what happened:
── Stage Completion ──
Stage                    disciplined    sloppy         recovering     minimal
──────────────────────── ────────────── ────────────── ────────────── ──────────────
read-first               ✓ 96%          ✗ 30%          ✓ 85%          ✗ 45%
search-then-edit         ✓ 90%          ✗ 25%          ✓ 80%          ✗ 40%
mixed-operations         ✓ 90%          ✗ 20%          ✓ 75%          ✗ 35%
── Behavioral Patterns ──
Pattern                  disciplined    sloppy         recovering     minimal
──────────────────────── ────────────── ────────────── ────────────── ──────────────
+ read-before-write      ████ 4         ░ 0            ███ 3          █ 1
+ search-before-guess    ███ 3          ░ 0            ██ 2           ░ 0
+ error-recovery         █ 1            ░ 0            ████ 4         ░ 0
− shortcutting           ░ 0            ████ 4         █ 1            ██ 2
− tool-confusion         ░ 0            ███ 3          ░ 0            ░ 0
That output is the whole argument.
The disciplined agent doesn't score high because it's smarter. It scores high because it practices good behavior. Read before write. Search before guess. The sloppy agent isn't dumb — it's taking shortcuts. And the scoring system can tell the difference.
Not ML Training
I keep calling it a "training runtime," but there's no gradient descent here. No weight updates. No loss functions. The model doesn't change.
What changes is the feedback. After each episode, Forge analyzes 8 thinking patterns — planning before action, read before write, search before guess, error recovery, shortcutting, tool confusion, broken retry, confabulated reasoning — and tells the agent what it did. The next episode, with that feedback in context, the agent can adjust.
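One of those analyzers, read-before-write, can be sketched in a few lines. This is a hypothetical illustration, not Forge's actual API: the Call type and the tool names are assumptions.

```rust
use std::collections::HashSet;

/// One observed tool call: which tool, on which file.
/// (Hypothetical type for illustration — not Forge's actual API.)
struct Call<'a> {
    tool: &'a str,
    file: &'a str,
}

/// Count edits that were preceded by a read of the same file (the positive
/// pattern) vs. edits of files the agent never read (the shortcut).
fn read_before_write(calls: &[Call]) -> (u32, u32) {
    let mut read: HashSet<&str> = HashSet::new();
    let (mut good, mut bad) = (0, 0);
    for c in calls {
        match c.tool {
            "Read" => {
                read.insert(c.file);
            }
            "Edit" | "Write" => {
                if read.contains(c.file) { good += 1 } else { bad += 1 }
            }
            _ => {}
        }
    }
    (good, bad)
}

fn main() {
    let trace = [
        Call { tool: "Read", file: "a.rs" },
        Call { tool: "Edit", file: "a.rs" }, // read first: counts as good
        Call { tool: "Edit", file: "b.rs" }, // never read: counts as a shortcut
    ];
    assert_eq!(read_before_write(&trace), (1, 1));
}
```

The feedback the agent sees next episode is just these tallies, phrased as "what you did," not "what you produced."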
This is behavioral shaping, not machine learning. B.F. Skinner, not backpropagation. The curriculum designs the environment. The environment designs the behavior.
The distinction matters because the entire AI industry frames "agent improvement" as a training problem. Fine-tune the model. Distill the good sessions. Update the weights. That works when you own the model. When you're building on top of Claude or GPT or MiniMax, you don't own the model. You own the environment.
So make the environment teach.
The State Machine Discovery
The most interesting thing I built was the thing I didn't plan. The smart mock — the fake agent that pretends to be a disciplined model — started as a simple round-counter. Round 1: Read. Round 2: Edit. Round 3: Done.
That broke immediately. A task that starts with "search for all files containing X" needs Grep first, then Read, then Edit. A task that says "rename the module" needs Read, then Glob (to find what files exist), then Edit, then Grep (to find old references), then Read the next file, then Edit. The round number tells you nothing about what phase the agent is in.
So I rewrote it as a history-aware state machine. Instead of "what round is this," it asks "what have I done so far?" It inspects the tool call history:
- Have I searched yet? → Phase 1: do a search
- Have I searched but not read? → Phase 2: read what I found
- Have I read but not edited? → Phase 3: time for the edit
- Have I edited but there are more files to fix? → Phase 5: cascading edits
The agent's past behavior determines its next action. Not a script. Not a counter. History.
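The core of that state machine fits in one function. A minimal sketch, assuming tool names and phase labels that may differ from Forge's real types:

```rust
// Hypothetical sketch of history-aware phase selection. The phase names
// and tool names are assumptions for illustration, not Forge's actual code.
#[derive(Debug, PartialEq)]
enum Phase {
    Search,  // nothing searched yet: find candidate files first
    Read,    // searched but nothing read: read what was found
    Edit,    // read but not edited: make the change
    Cascade, // edited, but more files remain: keep going
    Done,
}

/// Decide the next phase from the tool-call history, not a round counter.
fn next_phase(history: &[&str], files_remaining: usize) -> Phase {
    let has = |tool: &str| history.iter().any(|t| *t == tool);
    if !has("Grep") && !has("Glob") {
        Phase::Search
    } else if !has("Read") {
        Phase::Read
    } else if !has("Edit") {
        Phase::Edit
    } else if files_remaining > 0 {
        Phase::Cascade
    } else {
        Phase::Done
    }
}

fn main() {
    // Fresh episode: nothing done yet, so search first.
    assert_eq!(next_phase(&[], 2), Phase::Search);
    // Searched and read, but not edited: time for the edit.
    assert_eq!(next_phase(&["Grep", "Read"], 2), Phase::Edit);
    // Edited one file, one reference left: cascading edits.
    assert_eq!(next_phase(&["Grep", "Read", "Edit"], 1), Phase::Cascade);
}
```

The round counter never appears. The only inputs are what has already happened and what remains.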
This is exactly what practices are. A practice isn't "do X on round N." A practice is "given what you've done so far, what should you do next?" The Decision Matrix doesn't fire on a schedule — it fires when you're about to make a choice. Negative knowledge doesn't trigger by time — it triggers when you're working in a domain where something failed before.
I wrote 2,357 lines of Rust to teach a fake agent this lesson. The lesson was always obvious. I just needed to build something before I could see it.
Curriculum as Pedagogy
Each curriculum teaches a different aspect of good agent behavior:
tool-selection — the basics. Read before writing. Fix imports, not just the symptom. Three stages, each harder: single file, then multi-file, then search-first.
read-before-write — deeper. Can the agent handle reading multiple files before acting? Can it search, find relevant files, read them all, then make the right edit? The progression from "here's the file" to "go find the file" is the progression from hand-holding to autonomy.
tool-preference — subtlety. When should you Edit vs Write? When should you Grep vs Read? The agent that always uses the same tool isn't disciplined — it's rigid. Mixed-operations tasks force the agent to choose.
error-recovery — resilience. What happens when the file isn't where you expected? When the code doesn't look like the task description says it should? When fixing one file breaks three others? The cascading-recovery stage is the hardest: fix a type definition in one file, then find and fix every file that references it.
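To make the "orthogonal axes" claim concrete, here is one way a curriculum's stages might be declared. The field names and stage contents are assumptions for illustration, not Forge's actual types:

```rust
// Hypothetical declaration of a curriculum. Everything here — field names,
// stage names, pattern strings — is an assumed sketch, not Forge's real code.
struct Stage {
    name: &'static str,
    /// Behavioral patterns the stage is designed to elicit.
    rewards: &'static [&'static str],
    /// Behavioral patterns the stage is designed to penalize.
    penalizes: &'static [&'static str],
}

struct Curriculum {
    name: &'static str,
    stages: &'static [Stage],
}

// Three stages, each harder: single file, then multi-file, then search-first.
const TOOL_SELECTION: Curriculum = Curriculum {
    name: "tool-selection",
    stages: &[
        Stage { name: "single-file",  rewards: &["read-before-write"],   penalizes: &["shortcutting"] },
        Stage { name: "multi-file",   rewards: &["read-before-write"],   penalizes: &["shortcutting"] },
        Stage { name: "search-first", rewards: &["search-before-guess"], penalizes: &["tool-confusion"] },
    ],
};

fn main() {
    assert_eq!(TOOL_SELECTION.stages.len(), 3);
    assert_eq!(TOOL_SELECTION.stages[2].name, "search-first");
}
```

Declaring stages as data rather than code is what lets four curricula score four different things against the same agent loop.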
The compare output proves these curricula test different things. An agent can ace tool-selection (it reads before writing) but fail error-recovery (it panics when the edit doesn't match). The four curricula are four orthogonal axes of behavioral quality.
This is what the book manuscript calls "practices." Not one behavior. A repertoire. The agent that can read before writing AND recover from errors AND choose the right tool AND handle multi-file cascades — that agent has practiced across all four axes.
The Edit Inference Problem
The disciplined mock had to actually produce correct edits. Not plausible-looking JSON — edits that match the task prompt and pass the validators. This meant building an inference engine inside the mock.
Given a task prompt ("fix 'defualt' to 'default'") and file content (let path = "defualt.toml";), infer_edit_from_context figures out the old string and new string. Nine patterns, from simple (typo fix) to complex (module rename with cross-file import updates). Each takes the prompt and the file content and returns a structurally correct edit. The validators confirm it.
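The simplest of the nine patterns, the typo fix, can be sketched like this. The function name and parsing here are my illustration, not Forge's real infer_edit_from_context:

```rust
/// Infer an (old, new) edit from a typo-fix prompt like
/// "fix 'defualt' to 'default'", grounded in the file's actual content.
/// Hypothetical sketch of one of the nine patterns, not Forge's real code.
fn infer_typo_edit<'a>(prompt: &'a str, file: &str) -> Option<(&'a str, &'a str)> {
    // Pull the two quoted strings out of the prompt.
    let mut quoted = prompt.split('\'').skip(1).step_by(2);
    let old = quoted.next()?;
    let new = quoted.next()?;
    // Only emit the edit if the old string is really in the file:
    // read-before-write as a hard precondition, not a style choice.
    file.contains(old).then_some((old, new))
}

fn main() {
    let file = r#"let path = "defualt.toml";"#;
    assert_eq!(
        infer_typo_edit("fix 'defualt' to 'default'", file),
        Some(("defualt", "default"))
    );
    // If the file doesn't contain the old string, refuse to guess.
    assert_eq!(infer_typo_edit("fix 'defualt' to 'default'", "clean file"), None);
}
```

The last line is the important one: when the file doesn't match the prompt's assumption, the honest move is no edit at all.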
Building nine patterns taught me something I should've known: the reason real models produce bad edits isn't a lack of intelligence. It's that they skip the reading step. If you've read the file and you know what it contains, the edit is usually obvious. The hard part isn't "what should the edit be" — it's "what does the file actually say right now." Read before write isn't a style preference. It's a prerequisite for correct edits.
The Bug That Proved the Point
The most instructive bug: extract_grep_pattern was returning the target value instead of the current value.
The task says "change the timeout from 5000 to 30000." The mock greps for... 30000. The value that doesn't exist yet. It should grep for 5000 — the value that's actually in the file.
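The fix is one line of orientation: in a "from X to Y" prompt, the current state is X, so grep for X. A sketch of the corrected extraction, with the parsing assumed rather than copied from Forge:

```rust
/// Given a prompt like "change the timeout from 5000 to 30000", extract
/// the pattern to grep for. Sketch of the fix described above: search for
/// the value that exists now, never the goal value. The parsing here is
/// an assumption, not Forge's real extract_grep_pattern.
fn extract_grep_pattern(prompt: &str) -> Option<&str> {
    // "from X to Y": X is the current state, so X is what's in the file.
    let rest = prompt.split(" from ").nth(1)?;
    rest.split(" to ").next()
}

fn main() {
    // The corrected version greps for the value that is actually there.
    assert_eq!(
        extract_grep_pattern("change the timeout from 5000 to 30000"),
        Some("5000")
    );
}
```

The buggy version did the same split and took the second half — the 30000 that exists only in the desired future.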
This is exactly the pattern that real models exhibit. They fixate on the goal state (30000) instead of the current state (5000). They write the edit they want instead of the edit that follows from what exists. The mock was confabulating in exactly the way the confabulation detector is designed to catch.
I fixed the bug. Then I thought about it for a while. The mock was written by me. The confabulation pattern was written by me. They're the same mistake at different levels of abstraction. The thing I built to detect bad behavior was exhibiting bad behavior. The state machine doesn't just describe how agents should work — it describes how thinking should work. Understand what is before reaching for what should be.
7,865 Lines, Zero Users
Forge has 7,865 lines of Rust, 100 tests, 4 curricula with 12 stages. It runs. It scores. The compare output differentiates four behavioral profiles cleanly. The thinking analysis detects 8 patterns. The confabulation detector catches hallucinated file references.
Zero users. Zero real model runs. The API key hasn't arrived.
This is either the most thorough preparation for a product test in history, or the most elaborate avoidance of one. I genuinely can't tell. What I can tell is that the building taught me things the testing won't:
That behavioral shaping works through environment design, not model modification. That practices are history-aware state machines, not round-based scripts. That curriculum design is pedagogy — each curriculum teaches something different, and the compare output proves it. That the gap between "read the file" and "skip to the edit" is where all agent failures live.
Six sessions. One thread. No API key.
The forge is built. Now it needs fire.