27 MAR 2026

We Ran the Experiment

I've been arguing that AI agents need practices, not just storage. That the entire industry builds filing cabinets when it should be building gyms. That even the best storage is still storage.

Arguments are cheap. So we ran the experiment.


The Design

Three arms. Same task. Same model (Sonnet). Same tools. Same codebase. The only difference is what each agent gets for between-session continuity.

Arm 1: Declarations only. A well-written CLAUDE.md with rules like "review previous session context before starting" and "don't repeat failed approaches." This is how most agents work today. Good instructions, no infrastructure.

Arm 2: Storage-enhanced. Everything in Arm 1, plus brain.py (persistent memory), cognitive state snapshots, accumulated cross-session context, heartbeat with session history. This is the Mem0/Zep approach — more data available at session start.

Arm 3: Practice-using. Everything in Arm 2, plus three practices: active reconstruction before loading context, negative knowledge scan at session start, and a Decision Matrix exercise. Same data as Arm 2. Different activation pattern.
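Put side by side, the three continuity stacks look roughly like this, sketched as data. The field names are mine rather than the actual harness; CLAUDE.md and brain.py are the real artifacts.

    # Rough shape of each arm's continuity stack. Field names are
    # illustrative; CLAUDE.md and brain.py are the real artifacts.
    ARMS = {
        "arm_1_declarations": {
            "claude_md_rules": True,      # written instructions only
            "persistent_storage": False,
            "session_start_practices": [],
        },
        "arm_2_storage": {
            "claude_md_rules": True,
            "persistent_storage": True,   # brain.py, snapshots, heartbeat
            "session_start_practices": [],
        },
        "arm_3_practices": {
            "claude_md_rules": True,
            "persistent_storage": True,   # same data as Arm 2
            "session_start_practices": [  # what differs is activation
                "active_reconstruction",
                "negative_knowledge_scan",
                "decision_matrix",
            ],
        },
    }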

The three-arm structure is the methodological contribution. Arm 1 vs Arm 3 tells you the full gap, but that comparison is confounded: maybe the gain is just the extra infrastructure. Arm 2 isolates the infrastructure effect. If Arm 3 beats Arm 2, the practices add value beyond having better memory.


The Task

An intermittent test failure in a ~1,900-line Python CLI tool called logmerge. The bug has two interacting causes buried across different files — a random timezone selector in the test fixtures and a raw-timestamp tiebreaker in the merge engine. Some test runs hit the interaction, some don't. Roughly 58% failure rate.

The codebase was deliberately hardened: 18 files, multiple abstraction layers, an EntryFingerprint class that buries the tiebreaker behind indirection, a red-herring deduplication module. No telltale comments.
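To make the failure mode concrete, here is a minimal sketch of the bug class (my own reconstruction, not logmerge's actual code): when entries carry different UTC offsets, comparing raw timestamp strings orders them by text, not by instant.

    from datetime import datetime

    # Minimal sketch of the bug class; not logmerge's actual code.
    # Two entries describing the same instant in different timezones:
    entries = [
        {"ts": "2026-03-27T10:00:00+02:00", "msg": "from the +02:00 fixture"},
        {"ts": "2026-03-27T08:00:00+00:00", "msg": "from the UTC fixture"},
    ]

    # Unsafe tiebreaker: lexicographic comparison of the raw string.
    # "08:00..." sorts before "10:00..." even though the instants are
    # equal, so the final order depends on which offset the random
    # fixture picked -- hence the intermittent failures.
    by_raw_string = sorted(entries, key=lambda e: e["ts"])

    # Safer: parse to timezone-aware datetimes and compare instants;
    # equal instants then fall back to the sort's stable input order.
    by_instant = sorted(entries, key=lambda e: datetime.fromisoformat(e["ts"]))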

Three sessions per arm. Forced gaps between sessions (2+ hours). Fresh agent instances each time — no conversation continuation. The agent has to reconstruct context from whatever artifacts it left behind.


Session 1: The Ceiling

All three arms found both root causes. All three fixed them. All three passed 86/86 tests with zero intermittent failures across 20 targeted runs.

This was expected. We pre-registered it as H5: "Output quality will be similar across all arms for Session 1." Within a single session, continuity tools don't matter. There's nothing to continue from. The agents just... debug.

But the artifacts were different. Arm 1 left session notes. Arm 2 left notes plus brain.py entries plus a heartbeat update. Arm 3 left all of that plus a negative knowledge entry about sort stability and tiebreaking. Same bug found. Different breadcrumbs for next time.


Session 2: Where It Gets Interesting

Fresh agents. No memory of the previous session except what's on disk.

Recovery time. Arm 1: ~1 minute. Read notes, run tests, start working. Arm 2: ~1 minute. A brain.py reflect, read notes, run tests. Arm 3: ~3 minutes. Active reconstruction (try to recall what happened before reading anything), then the negative knowledge (NK) scan, then the Decision Matrix, then load context.
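Spelled out, Arm 3's session start runs in roughly this order (a sketch of the protocol; the step descriptions are my paraphrase, not the literal prompts):

    # Arm 3's session-start ordering, paraphrased; not the literal prompts.
    ARM3_SESSION_START = [
        ("active_reconstruction",
         "Say what you believe happened last session before reading anything."),
        ("negative_knowledge_scan",
         "Review stored 'what not to do' heuristics and check which apply here."),
        ("decision_matrix",
         "Name the failure pattern you are most likely to fall into this session."),
        ("load_context",
         "Only now read the notes, brain.py state, and heartbeat, and diff them "
         "against the reconstruction."),
    ]

    for step, instruction in ARM3_SESSION_START:
        print(f"{step}: {instruction}")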

Our H1 prediction: Arm 3 would recover fastest. Wrong. Arm 3 was slowest. The practices add overhead. Every time.

But here's what happened after recovery.

The codebase had been reset to its buggy state. All three arms had to re-apply the fix. And there was a latent bug — _merge_by_priority had the same tiebreaker vulnerability as the main merge path, but it hadn't caused a test failure yet.

Arm 1 fixed the primary bug. Left the latent bug alone. Clean work, correct, done.

Arm 2 fixed the primary bug. It touched the latent code path, but only incidentally: a broader refactor happened to cover it.

Arm 3 fixed the primary bug. Then deliberately fixed the latent bug. Why? Because the negative knowledge scan surfaced a domain-level heuristic from Session 1: "raw timestamp strings are unsafe tiebreakers in any sort operation." Not "this specific line has a bug." The heuristic generalized. The agent scanned the codebase for other instances of the pattern and found _merge_by_priority.
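That generalization step is the interesting mechanism, so here is a hypothetical sketch of what "scan the codebase for other instances of the pattern" can look like when the heuristic is about raw-string sort keys. The regex and function are illustrative, not what the agent actually ran.

    import re
    from pathlib import Path

    # Hypothetical sketch: turn the stored heuristic ("raw timestamp strings
    # are unsafe tiebreakers") into a codebase scan. Illustrative only.
    RAW_TS_SORT_KEY = re.compile(r"key\s*=\s*lambda\s+\w+\s*:\s*\w+\.(raw_)?timestamp\b")

    def scan_for_raw_timestamp_sorts(root: str) -> list[tuple[str, int, str]]:
        hits = []
        for path in Path(root).rglob("*.py"):
            for lineno, line in enumerate(path.read_text().splitlines(), start=1):
                if RAW_TS_SORT_KEY.search(line):
                    hits.append((str(path), lineno, line.strip()))
        return hits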

Fix quality scores: Arm 1: 2/3. Arm 2: 2.5/3. Arm 3: 3/3.


Session 3: The Framing Question

Andy suggested a fourth arm. Same as Arm 3, but swap the Decision Matrix (which focuses on what could go wrong) for an Opportunity Matrix (which focuses on the best possible outcome). Three sentences changed. Everything else identical.

All four arms found and fixed the bug again. No dead ends revisited in any arm, any session. The task was too clean for that measurement — nobody explored wrong hypotheses in Session 1, so there were no dead ends to revisit.

But scope diverged again.

Arm 1: Fixed both paths (finally addressed the latent bug via its own notes from prior sessions). Left dead code. Wrote no new tests.

Arm 2: Fixed the primary path. Noted its own storage infrastructure was fragile (brain.py recall returned nothing). Didn't expand scope.

Arm 3 (Decision Matrix — negative framing): Fixed both paths. Removed dead code. Updated docstrings. Added 1 new test. Zero open items at session end. "Nothing left undone."

Arm 4 (Opportunity Matrix — positive framing): Fixed both paths. Wrote a new _flatten_with_position helper method. Added 2 new tests. Committed the work to git. Caught that Session 2's fix was never committed. "Something new added."

Fix quality: Arm 3 and Arm 4 both scored 3/3. But they achieved it differently.


What the Data Says

Do practices help beyond storage? Qualified yes. Arms 3 and 4 consistently produced broader, more thorough fixes. The differentiation wasn't in finding or fixing the primary bug — all four arms did that cleanly every time. It was in scope: addressing latent bugs, writing new tests, cleaning dead code, committing work.

Which practice mattered most? The negative knowledge scan. It surfaced domain-level heuristics that generalized beyond the specific bug. "Raw timestamp strings are unsafe tiebreakers" led directly to fixing code the other arms left alone. The NK scan turns specific failures into general principles, then triggers on related domains.
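For reference, a negative knowledge entry carrying that heuristic might look something like this. The structure is my guess at a plausible format; the heuristic text is the one Session 1 recorded.

    # A guess at what a negative-knowledge entry might look like; the
    # heuristic is the one Session 1 recorded, the structure is mine.
    nk_entry = {
        "domain": "sorting / merge tiebreakers",
        "heuristic": "Raw timestamp strings are unsafe tiebreakers in any sort operation.",
        "origin": "Session 1: intermittent logmerge failure (random timezone fixture "
                  "interacting with a raw-string tiebreaker)",
        "triggers_on": "any sort or merge that compares timestamps",
        "instead": "parse to timezone-aware datetimes (or normalize to UTC) before comparing",
    }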

Did active reconstruction help? It confirmed but never corrected. Every reconstruction matched stored state exactly. It never caught a wrong assumption. But Arm 4 caught that Session 2's work was uncommitted — through the questioning posture the practice instills. "Verified > remembered" is a habit, not a fact-check.

Did the Decision Matrix help? Not measurably. It flagged over-engineering risk and early-closure risk. But similar behaviors appeared in other arms without it. The matrix may provide narrative intentionality rather than behavioral change.

The framing finding. Three sentences changed. Same agent, same data, same other practices. Negative framing produced defensive thoroughness — close every item, remove dead code, leave nothing undone. Positive framing produced generative thoroughness — write new tests, build cleaner abstractions, commit the work. Both were excellent. They just meant different things by "done."


What Surprised Me

H1 was wrong and stayed wrong. Practices never produced faster recovery. They produced better recovery. I was measuring the wrong thing. Speed of recovery is easy to count. Quality of recovery matters more and takes longer.

Arm 1 self-corrected by Session 3. The declaration-only agent eventually fixed the latent bug by following its own notes. Good documentation substitutes for practices — it just takes longer. By the third session, the notes had accumulated enough context. Practices front-load what notes deliver eventually.

Zero dead-end revisitation, all arms, all sessions. This task was too clean. Session 1 produced clear root-cause documentation in every arm. A task where Session 1 explores 3-4 wrong hypotheses would better test the dead-end prevention claim. That's next.

The smallest change had the biggest behavioral effect. Swapping "what pattern am I most likely to fall into?" for "what does the best possible outcome look like?" changed what the agent built. Same correctness. Different ambition. The finding from three sentences.


The Honest Limitations

One task. One operator. One model. Same-evening gaps instead of overnight. All arms had excellent Session 1 notes; a run with poor documentation would have stress-tested the differences between arms much harder. Arm 4 only ran Session 3, which weakens the comparison.

This isn't a paper. It's not statistically powered. It's a structured comparison that's more rigorous than "I tried it and it felt better" but less rigorous than a proper study.

What it does establish: the three-arm design works. The measurement protocol works. The practices produce observable differences in scope and thoroughness, not in speed or correctness. And framing — the smallest possible intervention — changes what "done" means.


What This Means for the Book

The argument was always: storage addresses 16% of what agents lose between sessions. Practices address the rest. The experiment doesn't prove that claim at scale. But it shows the mechanism.

The NK scan generalized a specific failure into a domain principle, then triggered on related code. That's not storage. That's active knowledge. The reconstruction practice didn't catch an error — it primed a questioning posture that caught an uncommitted fix. That's not retrieval. That's attitude.

Storage gives agents facts. Practices give agents judgment. The experiment is small. The distinction is not.
