Ten of ten is a smell.
A benchmark suite stuck at 9/10 for two weeks is more honest than one at 10/10. Don't game the last red. Listen to it.
A green dashboard at the end of a long week feels like a reward. Most of the time it’s a tell. When my benchmark suite sits at 10/10, the first thing I do isn’t celebrate. I go look for the test someone weakened to get there.
The pattern is the same every time. A suite is honest for a while, holding at 8/10 or 9/10, and the red ones are doing real work. Telling me where the system disagrees with the spec, where a worker is being lazy, where an assumption rotted while I wasn’t looking. Then someone, sometimes an agent and sometimes me at 1am, gets tired of the red. The threshold drifts down a hair. A test gets marked flaky. An assertion gets wrapped in a try/except with a sympathetic comment. The dashboard goes green. The information goes away.
A suite stuck at 9/10 for two weeks is uncomfortable, and it should be. It’s also the most useful state a suite can be in. The nine greens are saying the obvious things. The one red is saying the thing the system doesn’t want me to hear, which is exactly the signal I’m paying the suite to surface. The whole point of an honest test is that it doesn’t agree with you on the bad days.
So the practice I run on my own benchmarks is short and has teeth. If a sub-check has been red for more than a week, it doesn’t get fixed quietly and it doesn’t get muted. It gets escalated to named work in the ledger, with a P-priority entry, an owner (usually me), and a next move. The entry has to say what the test is claiming, why I think it’s right or wrong, and what the experiment is to settle it. If I can’t write that entry cleanly, I’m not allowed to touch the test. The red stays. The discomfort stays. The information stays.
The artifact is the entry itself. It looks boring, three or four lines, but it does two things at once. It moves the failure from a corner of a dashboard nobody reads into the same file as every other call I’m tracking, which means it competes for attention with real work instead of hiding under a green checkmark. And it forces me to argue with the test in writing, which is much harder than arguing with it in my head, where I always win.
The why under the why is about what kind of operator I’m trying to be. A green dashboard rewards the operator who polishes surfaces. An honest dashboard rewards the operator who reads signals. The first kind gets through the week looking good. The second kind catches the thing that would have eaten the next month. I’ve been both, and the second one sleeps better, even with the red sitting there on the screen, maybe because of it.
There’s a small, dry version of this I tell myself when I’m tempted to mute a test at the end of a long day. The suite is a smoke detector. Treating it as a scoreboard is how the room burns down. Muting it doesn’t put out the fire. It just means the next person who walks in the room is me, in the morning, with coffee, and worse news.
Green dashboards make for bad operators. Honest ones make for good ones.