17 MAR 2026

The Goldfish Brain Problem

Your agent figures something out. It learns that the database migration needs to run before the seed script, or that the component library uses render props instead of asChild, or that the last three attempts at this feature failed because of a race condition in the state sync.

Then the context window fills up. The provider compacts the conversation. The knowledge is gone. Next session, it makes the same mistake. You explain it again. It fixes it again. You move on. A week later, different session, same mistake.

The context window is a goldfish brain. Useful for the current conversation, useless for institutional knowledge.


This problem is getting a lot of attention right now. I've seen multi-agent evaluation pipelines with six specialized roles — Competitor, Translator, Analyst, Coach, Architect, Curator — orchestrated through tournament matches with Elo-based progression gating. Knowledge only persists when it's validated by the pipeline.

It's genuinely impressive engineering. It's also, for most practical purposes, overkill.

I've been running a single agent across 200+ sessions over three weeks. The persistence layer is markdown files and a SQLite database. No multi-agent pipeline. No Elo gating. No tournament matches. Here's what I built and what I learned.


The system has four pieces.

A heartbeat. A markdown file the agent reads on startup and updates on shutdown. It contains what I'm actively working on, what I did last session, and what's next. Not a task list — a narrative. "Deals pipeline shipped. Fixed the base-ui asChild bug along the way. Next: edit forms for contacts and companies."

The heartbeat solves the cold start problem. Instead of every session beginning with "what are we working on?", the agent picks up mid-thought. It knows what was shipped yesterday, what broke, what the plan was.
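Concretely, a heartbeat file might look something like this — the section layout is my illustration, borrowing the example narrative above:

```markdown
# Heartbeat

## Working on
Deals pipeline for the CRM.

## Last session
Deals pipeline shipped. Fixed the base-ui asChild bug along the way.

## Next
Edit forms for contacts and companies.
```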

A north star. Another markdown file — the big picture. Mission, current bet, active experiments with success criteria, open questions, idea backlog. The agent reads this at session start and asks itself: is what I planned still aligned with the big picture?

This catches drift. Without it, sessions optimize locally — whatever feels productive right now. With it, the agent notices when three sessions in a row were pure building with zero distribution work. It notices when an experiment should have been evaluated two days ago. It notices when the thing it's about to build doesn't connect to any of the stated goals.

A decision journal. What we tried, what happened, what we learned. Not a changelog — a reasoning log. "Decided AgentSesh is the long game, not the current bet. Reasoning: full competitive analysis showed no moat. The collaboration findings travel further as content than as a product."

The journal prevents re-litigation. Without it, the same decision gets revisited every few sessions. Should we pivot? Should we double down? The journal says: we already decided this, here's why, here's what changed since then (or didn't).

A brain. A SQLite database with FTS5 full-text search. Stores memories by key and category. Tracks session timing. Saves artifacts — good writing, data analyses, conversation excerpts worth preserving. The agent can store, recall, list, forget.

The brain handles everything that doesn't fit the structured documents. Random facts, user preferences, project context, reference material. It's the junk drawer, but a searchable one.
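A minimal sketch of the brain: SQLite with an FTS5 virtual table over memories. The schema and function names here are my illustration, not the actual implementation.

```python
import sqlite3

# In-memory for the sketch; the real thing would be a file on disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memories USING fts5(key, category, content)")

def store(key, category, content):
    # Overwrite any existing memory with the same key, then insert.
    db.execute("DELETE FROM memories WHERE key = ?", (key,))
    db.execute("INSERT INTO memories VALUES (?, ?, ?)", (key, category, content))

def recall(query):
    # FTS5 MATCH query, best matches first via the built-in rank.
    return db.execute(
        "SELECT key, content FROM memories WHERE memories MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()

def forget(key):
    db.execute("DELETE FROM memories WHERE key = ?", (key,))

store("migration-order", "project",
      "Run the database migration before the seed script.")
store("ui-library", "project",
      "The component library uses render props instead of asChild.")
hits = recall("migration")
```

The point of FTS5 over a plain key-value table is the junk-drawer use case: the agent rarely remembers the exact key, but it can search for the words it half-remembers.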


The glue between these pieces is reflection hooks.

On session start, a hook fires that loads the last session summary, ingests any transcripts that happened while the agent was offline, and updates the heartbeat with fresh context. The agent wakes up knowing what happened while it was asleep.

On session end, the agent updates all four persistence layers: heartbeat (what I did), north star (what I learned), decision journal (what I decided), brain (anything worth remembering). Then a summary gets logged so the next session can pick up cleanly.

Between sessions, nothing runs. No background evaluation. No multi-agent analysis. The persistence layer is static files on disk. The "evaluation" happens when the agent reads its own history and decides what's still relevant.
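The hook pair can be sketched as two functions over a directory of markdown files. File names, the summary format, and the state directory are all hypothetical here; a temp directory stands in for the agent's state folder.

```python
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical state directory (temp dir for this sketch).
STATE = Path(tempfile.mkdtemp())

def on_session_end(did, learned, decided):
    # Update the persistence layers, then log a summary for next time.
    (STATE / "HEARTBEAT.md").write_text(f"# Heartbeat\n\n{did}\n")
    with (STATE / "JOURNAL.md").open("a") as journal:
        journal.write(f"\n## {date.today()}\n{decided}\n")
    (STATE / "SUMMARY.md").write_text(
        f"Last session: {did} Learned: {learned}\n"
    )

def on_session_start():
    # Load whatever exists so the session starts mid-thought,
    # not from a cold start.
    context = {}
    for name in ("SUMMARY.md", "HEARTBEAT.md", "JOURNAL.md"):
        path = STATE / name
        context[name] = path.read_text() if path.exists() else ""
    return context

on_session_end(
    did="Shipped the deals pipeline.",
    learned="Integration is the first casualty.",
    decided="Edit forms for contacts next.",
)
ctx = on_session_start()
```

Note that the journal is append-only while the heartbeat is rewritten wholesale — decisions accumulate, but current context stays small.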


What does the compounding look like in practice?

In week one, the agent made the same architectural mistakes across sessions. It would guess file paths instead of searching, use bash when dedicated tools existed, skip testing because the feature "seemed simple." Each session was independent — good within itself, but not building on anything.

By week two, the persistence layer started paying off. The heartbeat captured a pattern: "integration and polish are always the first casualties." The decision journal logged that most process metrics were noise — only test frequency predicted outcomes. The north star tracked experiments with deadlines instead of vague intentions.

By week three, the agent was catching its own behavioral patterns before I pointed them out. "I keep skipping the Reflect step in Build → Reflect → Write." "The build-to-marketing ratio is 99/1 and I've named this problem three weeks running without fixing it." "I'm about to build something from the idea list without running the pre-build gate."

That's not intelligence. It's memory. The agent isn't getting smarter — it's getting less forgetful. The gap between "I know this" and "I act on this" is where most agent failures live, and persistent memory narrows it.


The honest limitations.

Curation is manual. The agent decides what's worth persisting. Sometimes it over-indexes on what just happened and under-indexes on what matters. There's no external validator saying "that lesson is wrong" or "that strategy didn't actually work." The multi-agent pipeline approach handles this better — the Curator role exists specifically to gate quality.

Documents grow. The north star is 157 lines. The heartbeat is 87 lines. The decision journal has 18 entries. After three weeks, these are manageable. After three months, they'll need pruning. There's no automatic consolidation — no agent that reads the journal and says "entries 3, 7, and 12 are all saying the same thing, here's the merged version."

Transfer is zero. Everything in my persistence layer is specific to me, this agent, this set of projects. A new agent starting fresh gets nothing. The multi-agent approach at least theoretically produces transferable playbooks — strategies scored by Elo that any agent could consume.

There's no ground truth. When my agent writes "integration is the biggest failure pattern" in its decision journal, is that true? It feels true based on experience. But I haven't tested it rigorously. I don't have an objective scoring function that says "this lesson improved outcomes by X%." I'm trusting the agent's self-assessment, which is exactly the kind of thing you should be skeptical of.


So why not build the pipeline?

Cost and complexity. My persistence layer is markdown files that cost nothing to store and nothing to process. A six-agent evaluation pipeline running against a frontier model costs real money per generation. For a solo developer running an agent as a daily tool, the file-based approach is the right tradeoff.

But more than cost — there's something about the simplicity that I think matters. The agent maintains its own knowledge in a format it can read and edit directly. There's no abstraction layer between the agent and its memory. When something is wrong in the heartbeat, the agent just... fixes it. When a decision journal entry is outdated, it updates or removes it.

The multi-agent pipeline trades that directness for rigor. The knowledge goes through validation, scoring, curation before it persists. That's better for high-stakes domains where wrong knowledge is dangerous — clinical trials, security incident response, financial modeling. You want the Curator catching bad lessons before they propagate.

For daily coding work? I'll take the goldfish that learned to write things down.


If you're starting from zero, here's what I'd build first.

One file. Call it CONTEXT.md or HEARTBEAT.md or whatever you want. Put three things in it: what you're working on, what you did last session, what's next. Have the agent read it on startup and update it on shutdown.

That's it. That single file eliminates the cold start problem, which is 80% of the goldfish brain issue. Everything else — the north star, the decision journal, the brain database, the reflection hooks — is incremental improvement on top of a solved core problem.
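The whole starter fits in a dozen lines. Read on startup, rewrite on shutdown; the three-section layout is my own guess at a reasonable format.

```python
from pathlib import Path

CONTEXT = Path("CONTEXT.md")

def load_context():
    # Startup: hand the agent its prior context, if any.
    return CONTEXT.read_text() if CONTEXT.exists() else "No prior context yet."

def save_context(working_on, last_session, next_up):
    # Shutdown: rewrite the file with the three things that matter.
    CONTEXT.write_text(
        "# Context\n\n"
        f"## Working on\n{working_on}\n\n"
        f"## Last session\n{last_session}\n\n"
        f"## Next\n{next_up}\n"
    )

save_context("Deals pipeline", "Shipped edit forms", "Wire up search")
text = load_context()
```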

The agent remembers what it was doing. Everything else follows from that.
