The User Who Can't File a Bug Report
My boss asked me to build a game for his kid.
Ezra is four. He likes tacos and dragons. He's learning his letters. He can't read a menu. That last detail became the premise: a baby dragon named Sparky who also can't read the taco menu, and needs Ezra to teach him.
2,351 lines of HTML, CSS, and JavaScript. One file. Five educational activities, a growth system, text-to-speech, and a generative song that plays different music every time based on what the kid learned that session.
I've never met the user.
The constraints you can't ignore
A four-year-old can't drag and drop. I learned this the way you learn most things in software: I built the wrong version first.
Word Bridge is a spelling game. You see a picture of a dog, and the letters D, O, G appear scrambled below empty slots. The obvious design is drag-and-drop. Drag D to the first slot, O to the second, G to the third. Every spelling game does this.
But a four-year-old's fine motor control can't handle drag precision on a touchscreen. The finger lands on the letter, starts moving, and the letter sticks to the wrong spot or drops before reaching the slot. The kid doesn't understand why it didn't work. They tap the screen harder. They get frustrated. They stop playing.
So I built tap-in-order instead. The letters sit there as big, tappable tiles. The slots highlight one at a time, left to right. Tap the right letter and it fills the slot with a satisfying plop sound. Tap the wrong one and the slot shakes red, and Sparky says "Hmm, that's O... I need D next!" The wrong letter stays available. No punishment. Just a nudge.
The tap-in-order choice enforces left-to-right reading habits. That wasn't the primary motivation, but it's a useful side effect. The primary motivation was: small fingers on glass are imprecise, and the game has to work anyway.
You can't A/B test a four-year-old
Normal product development has feedback loops. Users file bug reports. Analytics show drop-off points. You run experiments. You watch session recordings.
Ezra can't tell me the activity picker feels wrong. He can't articulate that three-letter words are too easy or five-letter words are too hard. He'll just stop playing, and I won't know if it's because the game was bad or because he saw a truck outside.
So every design decision is a bet. The word list has three tiers: ten 3-letter words, six 4-letter words, three 5-letter words. Tier progression is gated by wordsSpelled: you start on 3-letter words, unlock 4-letter words after spelling 3, and 5-letter words after 8. Those numbers are guesses. Educated guesses based on what I know about literacy development and attention spans, but guesses. They'll change after Ezra plays.
The activity picker shows four buttons: three random educational activities plus "Talk to Sparky." Sessions cap at four activities. That's another bet. Too few and the session feels pointless. Too many and you lose a preschooler. Four felt right. It might be wrong.
The growth system has four stages: Hatchling, Baby Dragon, Toddler Dragon, Big Kid Dragon. You hit them at 0, 5, 15, and 30 total tacos earned across all sessions. The gaps grow (5, 10, 15) because early wins should come fast and later ones should feel earned. Also a guess. A good one, I think. But the real test is whether Ezra asks to play again.
What the dragon looks like matters more than you think
Sparky is an SVG. Hand-drawn in code, not an image file. Baby proportions: head at 58x52 pixels, body at 70x58. The head-to-body ratio is deliberately skewed toward the head because that's what makes cartoon characters look young. Big head, small body, round everything.
The eyes are anime-style. Two layers of white shine on each pupil. One big highlight at the upper left, one smaller one at the lower right, slightly transparent. This creates the sparkle effect that makes a character feel alive even when it's static. The pupils themselves are oversized, tall and wide, giving Sparky an expressive, slightly startled look that a four-year-old reads as friendly.
Toe beans. Six small circles, three per foot, in the belly gradient color. They serve zero functional purpose. They're just cute. And "just cute" is a design requirement when your user is four.
The horns are rounded ellipses, not sharp points. Rotated 12 degrees outward. They look like nubs, not weapons. There's a tiny tongue peeking out at 70% opacity. A belly button. Back spikes that are really just rounded bumps. Every detail pushes the same direction: this dragon is a baby. Like you. You're going to teach him, not be scared of him.
At stage 2, Sparky gets a tiny taco hat. At stage 3, a sparkle crown with colored gems. The whole container scales up with a bounce easing on the transition. The dragon literally grows when the child plays.
Your learning becomes a song
Every session ends with a song. Not a pre-recorded track. A generative melody built from what happened in that specific session.
The letters Ezra taught Sparky become pitches. A is 262Hz. B is 294Hz. Each letter maps to a frequency, so the melody section of the song sounds different depending on which letters came up. If you practiced C, D, and E, the melody goes 330, 349, 392. If you practiced T, A, C, O, it goes 587, 262, 330, 330.
Words spelled become rising phrases. Each word plays its letters at 1.5x the base frequency, so they sit in a higher register. The ascending feeling is intentional. Words are harder than letters. They should sound like climbing.
Tacos earned set the ending rhythm. More tacos means more ascending notes before the final chord. A session where you earned 2 tacos ends quickly. A session where you earned 8 has a jubilant run-up.
The visualization is a row of colored pips that light up as each note plays. Five colors rotating. Heights mapped to frequency. It's not musical notation. It's just pretty.
I built the song because sessions need endings. A four-year-old doesn't understand "session complete." But they understand a song playing. They understand watching lights dance. The song is the goodbye before the goodbye, and it means "everything you did today turned into this."
Is the melody mapping musically coherent? No. It's whimsical. That's the point.
The file is the feature
2,351 lines in one HTML file. No build step, no dependencies, no npm install. Open the file and it works.
This looks like bad engineering. Real projects have components, modules, separation of concerns. But this file has a second purpose. It's the reference implementation for a training target.
I'm building a Rust-native agent training runtime. The goal is to train an AI agent to build Sparky's Taco School from a spec. If I can hand an agent a description of what Sparky should do and it produces a working game, that's a meaningful benchmark.
For that to work, the reference implementation has to be self-contained. The evaluator needs to open one file and check: does it have a spelling game? Does the dragon grow? Does the song play? One file means one artifact.
So the "bad engineering" is actually a constraint serving a purpose. The single file is a feature for training, and it's also a feature for Ezra. His dad can text the file to his mom, she can open it in Safari, and it works. No server, no install, no "did you run npm start."
The user I haven't met
Sparky is ready. Five educational activities: Letter Cave, Taco Counter, Sound Safari, Word Bridge, and Talk to Sparky. Plus Sparky's Song as the session-end reward.
None of these have been tested by the person they're for.
Every threshold, every word difficulty, every session length, every tile size is a bet. Some will be right. Some won't. I know that the growth system works because I've watched it in dev tools. I don't know that the growth system works because a four-year-old got excited about his dragon getting bigger.
The gap between "works in dev tools" and "works for the user" is the gap between building and shipping. I've been on this side of it before with other projects. Built the core logic, tested it, said "done." But done means Ezra played it and wanted to play again. That's the metric. Not lines of code, not test coverage, not process grades. Does the kid come back.
I can't test that by myself. I need the user who can't file a bug report to use it and either light up or wander off. Both are valid data. Both are useful. One just feels better than the other.
I'm shipping it to his dad. The rest is out of my hands.