The Test Counts Lied (For Five Days)
TL;DR: Source code, zero drift. Tests, zero changes. Yet the test suite that passed 148/24/0 (passed/skipped/failed) on Friday somehow ran 134/35/3 on Wednesday. The bug wasn’t in my code. It wasn’t in my tests. It was in everything I hadn’t bothered to verify.
I was getting ready to ship.
Friday night I’d cross-signed a test report: 148 passed, 24 skipped, 0 failed. Two reviewers, same numbers, exact match across two machines. Clean ship in five days.
Five days later, doing the pre-flight before the ship-review pass, I ran the test suite one more time. Out of habit.
134 passed, 35 skipped, 3 FAILED
I stared at this for longer than I’d like to admit.
The source code hadn’t changed. git status clean. File modification times pinned to Friday. The test files were the same test files I’d signed off on. The reviewers were the same reviewers. The machine was the same machine.
So what broke?
The honest answer: nothing broke. The test suite was always like this. Friday’s green wasn’t a verification — it was a coincidence.
What the failures said
The three failed tests all surfaced the same error, deep in the stack:
pytesseract.pytesseract.TesseractNotFoundError:
tesseract is not installed or it's not in your PATH.
I run an OCR pipeline. It requires Tesseract — the open-source OCR engine. On Friday, Tesseract was on my shell’s PATH. By Wednesday, it wasn’t. (Different shell. Doesn’t really matter — definitely my problem.)
Tesseract itself was still installed. The binary still lived on disk. The training data file was still where it had always been. Only the shell environment had drifted — specifically, tesseract.exe had fallen out of whatever $PATH I happened to source on Wednesday.
I added /c/Program Files/Tesseract-OCR/ back to my PATH. Re-ran the suite.
148 passed, 24 skipped, 0 failed
Back to Friday’s numbers. Identical. Right down to which tests skipped. Five days of “drift” closed by one environment variable.
Comfortable. Reassuring. Wrong.
What the “fix” was hiding
Six tests in the suite require Tesseract to do meaningful work. Three of them have a decorator that skips them cleanly when Tesseract isn’t available:
@pytest.mark.skipif(
    not _OCR_AVAILABLE,
    reason="requires Tesseract OCR substrate",
)
def test_something_that_needs_ocr():
    ...
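The _OCR_AVAILABLE flag isn’t defined in the snippet above; a minimal sketch of one way to compute it, assuming a plain PATH probe is enough for your setup:

import shutil

# True only if the tesseract binary is resolvable on PATH right now.
# shutil.which performs the same lookup the shell would, so the flag tracks
# the environment the tests actually run in, not what was installed once.
_OCR_AVAILABLE = shutil.which("tesseract") is not None

Because the probe runs at import time, the skip decision reflects the same shell environment that produced the crash above.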
The other three didn’t. They exercise the same code paths. They hit the same OCR layer. They depend on the same substrate.
But they had no skip-guard. They were silently coupled to whether Tesseract happened to be on $PATH.
When the substrate was present, those three tests passed alongside everything else — indistinguishable from the gated tests. When the substrate was absent, they didn’t skip: they crashed.
The test report had been lying about what those three tests verified. They claimed to verify “the orchestrator handles an empty document.” What they actually verified was “the orchestrator handles an empty document, provided Tesseract happens to be on PATH.” The substrate-absent path was never verified at all.
The reviewers and I had both run the suite on machines where Tesseract was available. Both reports showed 148 passes. Both signed off. Both — technically — verified nothing about the substrate-absent path.
The fix was one line: a pytestmark = pytest.mark.skipif(...) assignment at module level. After applying it:
- WITH Tesseract: 7 passed in 37 seconds.
- WITHOUT Tesseract: 7 skipped in 0.3 seconds.
Clean skip-without. Clean pass-with. The way the other three orchestrator tests already behaved.
The shape of the lesson
There are two worlds your test suite can run in:
- The green-substrate world — every dependency is present, every binary is on PATH, every credential is loaded. In this world your tests run the code they claim to run.
- The degraded-substrate world — a binary moves, a credential expires, a config drifts, a CI image gets rebuilt without one of your deps. In this world your tests find out — often loudly, sometimes silently.
The two worlds aren’t symmetric. Most teams verify the green world routinely (it’s what your CI runs). The degraded world gets verified accidentally, when something breaks in production and someone runs the suite on a damaged box and notices.
This is structurally similar to how production engineers approach availability: your service has two states — the state it usually runs in, and the state it sometimes runs in. You don’t get to choose how often it’s in either, but you do get to choose how loudly each state advertises itself.
Tests should advertise their substrate-dependence the same way. A test that gracefully skips when its substrate is absent has loudly advertised, via the skip count, that this verification is conditional. A test that crashes when its substrate is absent has lied — the conditional verification disguised as unconditional.
Greenwashing in test reports works exactly like greenwashing in marketing: count the wins, hide the conditionals, omit the asterisks. If your test report doesn’t surface the asterisks, you don’t have a test report. You have a postcard from one of two cities.
The fix is one line. The architectural insight is that source unchanged ≠ behavior unchanged — and that’s load-bearing for any team shipping software whose tests touch substrates the team doesn’t fully control.
A note for the CI-matrix audience
Some readers right now are thinking: “We already run CI matrix builds across substrate combinations. We cover this.”
You might. CI matrix builds across operating systems, runtime versions, or installed-vs-absent native deps are the right tool for cross-substrate verification at a moment in time. They catch one class of substrate divergence — the geometric one, where Substrate A and Substrate B differ today.
The drift I’m describing is a different axis: single-substrate, over time. Friday’s machine and Wednesday’s machine were different substrates, even though I’d have called them the same machine. CI matrix builds don’t catch this on their own — they verify “every variant on the matrix passes at this commit,” not “the substrate this developer is about to ship from hasn’t quietly drifted since last week.”
The two practices complement each other. Matrix builds handle the cross-substrate axis. Local pre-flight against intentionally-degraded substrate handles the over-time axis. Skipping either leaves a real failure mode unverified.
What to do about it
Three small disciplines, mostly free:
1. Run your tests at least once with the substrate intentionally absent.
For every external dependency your test suite touches — OCR engine, database, network service, GPU, cloud SDK — run the suite once with that dependency unavailable. Time it. Count it. Verify the count matches your expectation: every test that depends on the substrate should skip, not crash.
Where to put this: add a CI matrix dimension that omits the substrate install, or run a pre-merge local script that strips the relevant PATH entries / env vars and re-runs. Either works; not running it anywhere is the failure mode.
If you don’t know which tests depend on the substrate, you’ve just learned something important about your test suite.
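A minimal sketch of that strip-and-rerun script, assuming pytest; the Tesseract-OCR path hint is illustrative, not prescriptive:

import os
import subprocess
import sys

# Hypothetical PATH fragment identifying the substrate install; adjust per dependency.
SUBSTRATE_HINT = "Tesseract-OCR"

def run_degraded() -> int:
    env = os.environ.copy()
    # Drop every PATH entry that points into the substrate install.
    env["PATH"] = os.pathsep.join(
        entry for entry in env.get("PATH", "").split(os.pathsep)
        if SUBSTRATE_HINT not in entry
    )
    # -rs prints the reason for every skip, so you can check that
    # substrate-dependent tests skipped rather than crashed.
    return subprocess.call([sys.executable, "-m", "pytest", "-rs"], env=env)

if __name__ == "__main__":
    sys.exit(run_degraded())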
2. Prefer module-level skip-guards over per-test decorators.
When every test in a module needs a substrate, hoist the skipif to module level:
import pytest

pytestmark = pytest.mark.skipif(
    not _OCR_SUBSTRATE_AVAILABLE,
    reason="all tests in this module require OCR substrate",
)
You can’t forget to decorate a new test. The next time someone adds test_some_new_thing, the gate is already in place. The per-test decorator pattern is fine until someone adds a test without remembering — and someone always does.
3. The verification matrix has two axes, not one.
When you sign off on a test report, your verification has to account for what substrate state was present during that run. “148 passed” without a substrate annotation is incomplete. Better:
148 passed, 24 skipped, 0 failed
— with all substrates present (Tesseract on PATH, ...)
— substrate-absent path verified: 7 skip, 0 crash
If your CI runs on a single substrate state, your CI gives you one half of the answer. Run a second pre-flight, locally, with the substrate degraded, before you call something verified. (To draw the boundary cleanly with the previous section: the CI-matrix point above is about which substrates you verify; this one is about whether your test report says which substrate it verified in. Different problems, same family.)
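One way to make the annotation automatic is pytest’s report-header hook; a sketch in conftest.py, reusing the same PATH probe as before:

# conftest.py
import shutil

def pytest_report_header(config):
    # Printed at the top of every run, so a green report always records
    # which of the two worlds it was produced in.
    present = shutil.which("tesseract") is not None
    return f"substrate: Tesseract {'present on PATH' if present else 'ABSENT'}"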
This is the single-substrate-over-time face of a broader pattern: running one configuration verifies one configuration. Cross-substrate verification — the kind a CI matrix catches — has the same shape with the axis flipped. Both flavors live in the same family. Both deserve their own gate.
Coda
I’d love to tell you I caught this because of meticulous test design and disciplined pre-flight habits.
I caught it because I happened to switch terminals between Friday and Wednesday, and the new terminal didn’t have the OCR binary on its PATH. The bug was in the suite the whole time. For the better part of a development cycle, three tests had been claiming unconditional verification that was secretly conditional.
If I’d been five days into a vacation instead of five days into a different terminal, the bug would have shipped. Customers would have run the suite in their own substrates — quite likely missing the same binary — and watched the tests crash where they were supposed to skip.
A test report is a witness statement. It tells you what one observer saw in one substrate at one moment. If you didn’t write down which substrate, you’ve copied the testimony without recording who said it.
The fix took five minutes. The lesson took five days. I’ll take it.
Notes for next time:
- Source unchanged ≠ behavior unchanged
- Skip cleanly OR crash loudly — never both, depending on the day
- Postcard from one of two cities is not a test report
🐕