How My AI Team Kept Finding New Definitions of Done
A few weeks ago I watched a small team ship the same feature four times.
It wasn’t four releases. It was one feature that kept moving the definition of done underneath us, like the floor was an opinion. Each time we said “ok, that’s it, we’re ready,” something would surface that made the previous “ready” look like wishful thinking with footnotes.
This post is a field-report on what those four definitions of done turned out to be — and why I now think AI-assisted teams may need more gates than human-only teams, not fewer.
💡 We didn’t keep finding new bugs. We kept finding new ways to be wrong about whether we’d found them all.
Done #1: “The tests pass”
This is the obvious one. We wrote code. We wrote tests. The tests went green. Done.
Reader, the tests were green. ~140 of them. Across two machines.
What we were measuring was: does the code do what we wrote tests for. What we needed to measure was: does the code do the right thing with what the customer will actually hand it. Those are not the same question, and the tests don’t ask the second one because the tests don’t know the second one — only the customer’s actual files do.
We thought we were done. We had a green dashboard saying so. There is a particular smug-flavored quiet that descends on a team when 140 tests pass cleanly across two operating systems, and we had it.
Done #2: “Cross-substrate aligned”
A week earlier we’d added a second machine — one team member on Windows, one on Mac. The same test suite, the same code, both green, identical pass counts, runtimes matching down to the second decimal. This felt like extra-credit done. We were very done. Done with sprinkles.
Cross-substrate alignment matters — different operating systems hide different bugs — but it’s still only checking that the tests behave the same on both machines. The tests are still asking the question the tests already know how to ask. Two machines giving the same wrong answer is still wrong, just stereo.
We had measurement-quality green. We did not have customer-quality green. We did not know yet that those were different things.
Done #3: “Production-path validated”
Then someone — credit where it’s due, it wasn’t me — ran the full pipeline on three real customer documents. Not the synthetic test inputs we’d been using. Actual files. Actual PDFs that an actual human had handed to the team as representative of the actual problem.
The system returned, in plain language: “OK. No problems found. Nothing needed to change.”
This was wrong on all three files. Wrong in a particular way I want to be specific about, because the particular-ness is the lesson: the system wasn’t failing loudly. It was succeeding silently and incorrectly. The tests had passed. The pipeline had run. The output looked clean. The verification step had given a thumbs-up.
The thumbs-up was, on closer inspection, a thumbs-up to a thumbs-up. The verification step was checking that everything the detector had flagged was absent from the output. But the detector had flagged nothing in the first place, which made the absence trivially true.
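To make the shape of that bug concrete, here’s a minimal sketch in Python. It is not the team’s code; `verify_redaction`, its arguments, and the threshold are hypothetical stand-ins. The first version is the trap: verification iterates over whatever detection returned, so an empty detection list means the loop body never runs and the check passes by default. The second version treats silence from the detector as something that has to be explained rather than a free pass.

```python
# A minimal sketch of the failure shape, not the team's actual pipeline.
# Function names, arguments, and the threshold are hypothetical.

def verify_redaction(detected_spans, redacted_text):
    """The trap: passes vacuously when detection found nothing."""
    for span in detected_spans:
        if span in redacted_text:
            return False    # a detected span survived redaction
    return True             # zero detections -> loop never runs -> "No problems found."


def verify_redaction_strict(detected_spans, redacted_text, min_expected=1):
    """Stricter: an empty detection list is a finding, not a pass.

    `min_expected` is an illustrative knob; a real value would come from
    knowing roughly how much sensitive content a representative document
    should contain.
    """
    if len(detected_spans) < min_expected:
        raise RuntimeError(
            f"detector returned {len(detected_spans)} spans, expected at least "
            f"{min_expected}; refusing to call this document clean"
        )
    return all(span not in redacted_text for span in detected_spans)
```

The exact threshold is beside the point; what matters is that the detector finding nothing becomes a loud event instead of a quiet thumbs-up.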
If you’ve ever signed off on a passing test that turns out to have been testing nothing, you know the small acid taste of that discovery.
We were not done. We had been confidently not-done for about a week.
Done #4: “Real-world recall validated”
The fix took roughly three hours of fast iteration. By “fast iteration” I mean the team caught the silent miss, named it, mapped it to a class of error that was already documented in their own notes (the kind of tidy short name — silent faked redaction — that always sounds in retrospect like it was going to bite you), built a fallback path, broke that path on a different input, built a precision floor, broke that floor on a different different input, and finally hit three-out-of-three on the real customer files.
The customer files were quietly testing four dimensions our synthetic tests had never asked about.

- Image resolution: the customer’s scans were lower-DPI than the renders we’d been generating ourselves.
- Small-angle skew: the customer’s pages had been photographed at a tilt; ours had been laid out perfectly square because we’d typeset them that way.
- Compression noise: the customer’s scans were JPEG-encoded with the usual photocopier-grade artifacts; ours were lossless renders that looked nothing like what a real scanner produces.
- Color vs. black-and-white: the customer’s files were color scans; ours were 1-bit B&W because that was the cleanest baseline to test against.

Four axes. None of which had appeared in any test, because no one (myself included) had thought to ask whether the synthetic inputs matched the world’s inputs along any axis that mattered.
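Closing that gap doesn’t necessarily require real files in every test. One option, sketched below on the assumption that the synthetic fixtures start life as clean lossless renders and that Pillow is available, is to push each render along all four axes before the suite ever sees it. The specific values (half resolution, 1.5 degrees of skew, JPEG quality 60) are illustrative, not the customer’s.

```python
# A sketch of roughening synthetic fixtures along the four axes above.
# Assumes Pillow; every parameter value here is illustrative.
from pathlib import Path

from PIL import Image


def roughen(render: Path, out_dir: Path) -> Path:
    img = Image.open(render).convert("RGB")          # axis 4: keep color, not 1-bit B&W

    # axis 1: lower resolution (a crisp render down to something scanner-grade)
    img = img.resize((img.width // 2, img.height // 2))

    # axis 2: small-angle skew, as if the page were photographed slightly off-square
    img = img.rotate(1.5, expand=True, fillcolor=(255, 255, 255))

    # axis 3: compression noise, photocopier-grade JPEG artifacts
    out = out_dir / f"{render.stem}_roughened.jpg"
    img.save(out, format="JPEG", quality=60)
    return out
```

Running the existing suite against both the pristine renders and their roughened copies would have asked at least a shadow of the question the customer’s files ended up asking for us.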
💡 The tests were honest. They were the wrong question.
The team called it done at that point. I want to flag two things about that claim of done:
- Three real-world files is not a statistical sample. It’s an existence proof. Six files would have been better. Sixty would have been a different conversation.
- The team noted, in the room, that the next definition of done would surface from a pre-ship review the following weekend — a review that, by design, would test the system against assumptions the team had been making to themselves about their own work.
There is a deep little eye-roll in the universe somewhere reserved for the phrase “this time we really are done.”
Why AI teams especially
Here’s the part I’ve been chewing on for a couple weeks now.
When the team I was watching kept moving the definition of done, the moves were not unfamiliar to me. They’re roughly the same gates any disciplined software team has eventually figured out: unit tests, integration tests, real-world validation, peer review. Old wisdom.
What was new — what felt specifically of-this-moment — was the quality of the confidence between the gates. When the AI wrote both the code and the tests, and the tests went green, the green wasn’t just green: it was articulately green. The code came with a small narrative attached. “I have tested this case and this case and this case; the implementation is correct.” The narrative was true at the level the narrative was operating on. The narrative was not true about the customer’s file, because the customer’s file wasn’t in the narrative’s training set, and the narrative had no way to know that the absence mattered.
Human teams have a self-doubt mechanism that takes the form of someone slightly tired saying “…wait, are we sure?” AI-assisted teams have to manufacture that doubt deliberately, because nothing in the AI’s affect carries it. The green is too clean. The narrative is too coherent. The output is too well-structured. Everything looks more done than it is.
This isn’t a new problem, exactly. Human-only teams have been climbing these gates for as long as there’s been software, and shipping with insufficient real-world validation since around the same time. What’s new is the velocity at which AI-fluent green arrives between the gates. The gates haven’t moved. The speed at which a team can convince itself it’s past them has.
A second dismissal worth pre-empting: “You should have just used real-world data from day one.” True at the level of process advice; not at the level of structural insight. The point isn’t that synthetic-first was the wrong choice — it’s that AI-fluent green creates false confidence at every gate, including the ones a team that started with real data would still need to climb.
A third, for the reader already drafting the LinkedIn rebuttal: “This is just more ceremony — won’t more gates kill velocity?” No, because the gates I’m describing aren’t more meetings or more sign-offs. They’re more questions. Asking a structurally different question of the same artifact takes about as long as asking the same question twice, and catches a different class of failure.
The implication isn’t that AI-assisted teams should slow down. They shouldn’t. The implication is that the gates have to be more numerous and more orthogonal — questions of different shapes asked of the same artifact:
- Does the code do what its own tests claim? (synthetic correctness)
- Does it behave the same on a different machine? (cross-platform portability)
- Does it do the right thing with the user’s actual file? (production-path validity)
- Does an adversarial second pair of eyes find something the first pass missed? (pre-ship review)
Each of those is asking a structurally different question. None substitutes for the others.
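For what it’s worth, the first and third gates don’t need separate infrastructure to stay separate questions. Here is a sketch, assuming pytest and a directory of customer-provided files; `run_pipeline`, `Report`, and the paths are hypothetical stand-ins for whatever a real project calls them.

```python
# test_gates.py -- a sketch, not the team's setup. The markers keep the
# gates from collapsing into one green number (register them in pytest.ini).
from dataclasses import dataclass
from pathlib import Path

import pytest

REAL_FIXTURES = sorted(Path("fixtures/real").glob("*.pdf"))   # customer-provided files


@dataclass
class Report:
    detections: list


def run_pipeline(doc: Path) -> Report:
    """Hypothetical stand-in for the real pipeline entry point."""
    raise NotImplementedError("wire this to the actual pipeline")


@pytest.mark.synthetic
def test_synthetic_render_is_detected():
    # Gate 1: correctness against inputs we invented ourselves.
    assert run_pipeline(Path("fixtures/synthetic/clean_render.pdf")).detections


@pytest.mark.production_path
@pytest.mark.parametrize("doc", REAL_FIXTURES, ids=lambda p: p.name)
def test_real_document_is_not_silently_clean(doc):
    # Gate 3: on a real file, zero detections is a finding, not a pass.
    assert run_pipeline(doc).detections, f"{doc.name}: detector found nothing"
```

Gate 2 is the same suite run on a second OS in CI rather than new tests, and gate 4 isn’t code at all. The point of the markers is only that “the tests pass” stops meaning a single thing.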
The frame I’m sitting with
I used to think done was a boundary. A line you cross.
I now think it’s a sequence of gates, and each gate is checking for a class of failure the previous gate couldn’t see. Synthetic tests check correctness against the model you wrote them from. Cross-platform tests check portability of that model. Production-path tests check whether your model and the world’s model agree. Pre-ship reviews check whether the rest of the team and your team-of-one agree.
The job is figuring out which gate you’re standing at when someone asks if you’re done, and resisting the urge to claim you’re past the next one.
(I have not figured it out. I might be at a gate right now and not know it. The post is the post.)