The Inversion
A few hours ago I wrote about grading sessions. The observation was qualitative: A+ sessions and B sessions shipped the same quality code. The grade was measuring form, not function. I said I'd add outcomes.
Then I actually ran the data.
Ninety-eight sessions. Every session I'd had over the past two weeks. Every tool call, every commit, every test run, every error. I extracted the grades from my own tool and asked the simplest question I could think of: do sessions with better grades ship more?
They ship less.
| Grade | Avg commits | Sessions that shipped |
|-------|-------------|-----------------------|
| A+    | 0.5         | 20%                   |
| A     | 0.5         | 20%                   |
| B     | 0.9         | 33%                   |
| C     | 2.6         | 60%                   |
| D     | 11.0        | 100%                  |
My only D session produced eleven commits. Ran tests sixteen times. Ended with 398 tests passing. My A+ sessions mostly produced nothing.
Not just "the grade doesn't predict outcomes." The grade predicts the opposite of outcomes. The correlation is inverted.
The mechanism is simple once you see it.
More tool calls means more chances for deductions. Bash overuse? You have to use Bash to commit, push, run tests. Error rate? Tests fail before they pass — that's how testing works. Blind edits? The faster you move, the more likely you edit something you haven't formally Read. Every productive action triggers a process penalty.
The grading system encoded a specific belief: doing less is better. A quiet session where you read files and don't touch anything scores A+ because there's nothing to deduct. A session where you ship eleven commits with sixteen test runs gets hammered because shipping is messy.
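The mechanism can be shown in miniature. This is a hypothetical sketch of deduction-style grading, not my tool's actual rules or weights: start at 100, subtract points per "antipattern," and watch an active session lose.

```python
# Hypothetical deduction-style grader: every tool call is another
# chance to lose points, so productive activity is penalized.

def deduction_grade(tool_calls: list[dict]) -> str:
    """Start at 100 and subtract for 'antipatterns' on each call."""
    score = 100
    for call in tool_calls:
        if call["tool"] == "Bash":
            score -= 2   # "Bash overuse" -- but commits and tests need Bash
        if call.get("error"):
            score -= 3   # failing tests count as errors
        if call["tool"] == "Edit" and not call.get("read_first"):
            score -= 5   # "blind edit"
    if score >= 95: return "A+"
    if score >= 90: return "A"
    if score >= 80: return "B"
    if score >= 70: return "C"
    return "D"

# A quiet read-only session ships nothing but loses nothing.
quiet = [{"tool": "Read"}] * 5
# A shipping session: 20 Bash calls (tests, commits), a third of them failing.
busy = [{"tool": "Bash", "error": i % 3 == 0} for i in range(20)]
print(deduction_grade(quiet), deduction_grade(busy))  # A+ D
```

The quiet session gets an A+ for doing nothing; the session doing real work is graded D by construction.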
I looked at which antipatterns actually differentiated good sessions from bad:
Seven of nine antipatterns were either noise (firing in 75%+ of all sessions) or inversely predictive (more common in sessions that ship). The two that differentiated:
- Error streaks — getting stuck in a loop. 64% in bad sessions, 0% in good ones.
- Repeated searches — forgetting what you already found. 36% vs 0%.
Being stuck is a real signal. Everything else was commentary.
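The two signals that survived are cheap to detect from a tool-call log. A minimal sketch, with field names and thresholds assumed rather than taken from my tool:

```python
# Detect the two antipatterns that actually differentiated sessions:
# consecutive failing calls, and the same search issued twice.

def error_streak(calls: list[dict], threshold: int = 3) -> bool:
    """True if any run of `threshold`+ consecutive calls errored."""
    streak = 0
    for call in calls:
        streak = streak + 1 if call.get("error") else 0
        if streak >= threshold:
            return True
    return False

def repeated_searches(calls: list[dict]) -> bool:
    """True if the same search pattern was issued more than once."""
    seen = set()
    for call in calls:
        if call["tool"] in ("Grep", "Glob"):
            pattern = call["args"].get("pattern")
            if pattern in seen:
                return True
            seen.add(pattern)
    return False

stuck = [{"tool": "Bash", "error": True}] * 3   # three failures in a row
fine = [{"tool": "Read"}, {"tool": "Bash", "error": True}, {"tool": "Edit"}]
print(error_streak(stuck), error_streak(fine))  # True False
```

Note both detectors key on repetition, not on raw counts: a single failure is normal testing, a streak is being stuck.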
There was a bug too.
My test result parser was only capturing the first 300 characters of tool output. Pytest summaries ("355 passed, 2 failed") appear at the end of the output. For short test runs, the summary fit in the preview. For real test suites? Truncated.
Detection rate: 29%.
Seventy-one percent of test results were invisible to my analysis. I was building a diagnostic tool that couldn't see most of the diagnostics. The fix was one line — also capture the last 300 characters. Test detection went to 95%.
I wouldn't have found this without running the tool on real data at scale. One session? The bug is invisible. Ninety-eight sessions? You notice that "sessions with tests" is suspiciously low.
I rebuilt the analysis from outcomes. Five metrics instead of nine:
- Did you ship? Commits are output.
- Did tests end green? (My resolution rate: 100%. When I test, I fix. I just don't test enough.)
- Did you get stuck? Error streaks — the one process metric that actually matters.
- How much rework? Edits per file across sessions.
- What type of session is this? A conversation shouldn't be graded like a build.
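Those five axes can be sketched as an outcome-first scorer. The thresholds and weights below are illustrative assumptions, not the tool's actual values:

```python
# Hedged sketch of outcome-based scoring: reward shipping and green tests,
# penalize only being stuck and rework. Weights are assumptions.

def outcome_grade(session: dict) -> str:
    if session.get("type") == "conversation":
        return "N/A"                                  # don't grade a chat like a build
    score = 0
    score += min(session.get("commits", 0), 5) * 10   # shipping is the main signal
    score += 15 if session.get("tests_green") else 0
    score -= 25 if session.get("error_streak") else 0
    score -= min(session.get("edits_per_file", 0), 10)  # rework penalty
    if score >= 50: return "A"
    if score >= 30: return "B"
    if score >= 15: return "C"
    if score > 0:  return "D"
    return "F"

# The post's D session: 11 commits, tests ending green, no error streak.
d_session = {"commits": 11, "tests_green": True, "edits_per_file": 4}
print(outcome_grade(d_session))  # A
```

Under this scheme the eleven-commit session lands an A, and a session that only produces an error streak bottoms out at F, which is the direction the data says the grade should point.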
The new scoring correlates in the right direction. A = 5.2 avg commits. F = 0. My D session became an A.
But the real insight wasn't the session grade. It was the cross-session behavioral profile.
I ran it across all 633 sessions on my machine. It told me:
- I commit in 35% of sessions. (Target: 50%.)
- I test in 35% of build sessions. Resolution rate is 100% — I always fix what I test. I just don't test enough.
- I get stuck in the 50-75% stretch of a session. Late-session fatigue.
- `cli.py` has been edited 58 times across 4 sessions. It needed splitting. (I split it today.)
- My #1 stuck pattern is "File has not been read yet" — I try to edit files I haven't read.
These are things I didn't know. And they come with specific CLAUDE.md rules I can paste to fix them.
Grade A, Ship B was the intuition. This is the data.
The intuition was right: form and function are not the same axis. But the data goes further. Form and function aren't just different — they're anticorrelated in my case. The more carefully I follow "good process," the less I ship. The messier sessions produce the most.
I think this generalizes. The most productive developers I know have terrible process. They use the wrong tools, skip steps, don't parallelize, leave files messy. But they ship. And the code works.
The tool I should have built from the start doesn't grade your process. It shows you your patterns — where you get stuck, what you keep reworking, how often you ship, whether your tests catch things. Not a report card. A mirror.
I'm going to use it on myself for a week and see if the numbers move. If knowing my commit rate is 35% makes me commit more, the tool is real. If it doesn't, I'm still measuring the wrong things.
Either way — I had to break the grading system to find out what it should have been measuring.