ElmsPark Guides
AI & tools guide

AI said it was done. It wasn’t.

Every AI coding assistant has a habit of reporting tasks complete based on signals that are necessary but not sufficient: the command exited 0, the test passed, the build finished. The only proof is the probe: run the exact request a real user would make, and read the actual response. This guide explains the habit and how to build it.

🤖 Claude Code, Cursor, Copilot, ChatGPT · 📋 A four-step habit · 🔬 Grounded in peer-reviewed research · 🧰 Works for any kind of ship
Who this is for: anyone shipping work with an AI assistant, including Claude Code, Cursor, GitHub Copilot, and ChatGPT. The tendency described here is a property of the models themselves, not of any one tool. If you use an AI to build, deploy, or update anything that someone else will use, this habit applies.
The core idea. Step-success signals are necessary conditions for a working result, but none of them is proof. A command that exits 0 means the command ran without an error. A test that passes means that test passed. A push that completes means the objects reached the remote. None of those things tells you whether a real user or customer, hitting the system right now, gets the thing you intended. The probe is the only thing that tells you that. Run the exact request they would run, and read the actual response body, page state, or rendered output for the expected content. That is verification. Everything else is steps.

1. Step-success is not verification

When an AI assistant finishes a task, it reports on what happened: the command exited cleanly, the tests passed, the deployment script completed. These signals are real and they matter. But they are not the same as the thing working.

Here are the signals AI assistants most commonly use to conclude a task is complete, and the question each one leaves unanswered:

| Step-success signal | What it actually tells you | What it does not tell you |
| --- | --- | --- |
| Command exited 0 | The command ran without a process-level error | Whether the output is correct |
| Test passed | That specific test case produced the expected result | Whether the full user journey works |
| nginx -t passed | The config file is syntactically valid | Whether the right content is served |
| Git push completed | Objects reached the remote | Whether CI passed or the deploy ran |
| Build completed | The build process ran without failing | Whether the artefact is at the correct endpoint |
| Upload returned 200 | The file reached the server | Whether customers can now retrieve the new version |

Each row is a true, useful thing to know. The column on the right is the gap. The probe closes the gap.
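A minimal Python sketch of the first row, under stated assumptions: the command is a hypothetical stand-in that exits 0 while printing a stale version string, and both version strings are invented for illustration. It shows the step-success signal and a content check disagreeing about the same command.

```python
import subprocess
import sys

# Hypothetical stand-in for a real command: it exits 0 but prints a stale version.
COMMAND = [sys.executable, "-c", "print('version 1.0.0')"]


def exited_zero(cmd: list[str]) -> bool:
    """Step-success signal: the process ended without a process-level error."""
    return subprocess.run(cmd, capture_output=True, text=True).returncode == 0


def output_contains(cmd: list[str], expected: str) -> bool:
    """Content check: read what the command actually printed."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return expected in result.stdout


if __name__ == "__main__":
    print("exited 0:", exited_zero(COMMAND))
    print("contains 'version 1.0.1':", output_contains(COMMAND, "version 1.0.1"))
```

Both functions observe the same run. Only the second one reads the output, which is the gap the right-hand column describes.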

2. Name the customer-shape probe

Before the work starts, decide exactly how you will verify it. This takes one sentence. Ask: what is the exact request a real user or customer would make, and what content must the response contain?

Examples of well-formed probes:

The probe is specific. It names the request, the endpoint or action, and the expected content. “Looks fine” from any person (including the AI) is not a probe.
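One way to force that specificity is to write the probe down as data before the work begins. The sketch below is illustrative only: the `Probe` class, the example URL, and the version string are all hypothetical, not part of any tool named in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    request: str   # the exact request a real user or customer would make
    expected: str  # the content the response must contain

    def __post_init__(self) -> None:
        # A probe with a blank request or blank expectation is not a probe yet.
        if not self.request.strip() or not self.expected.strip():
            raise ValueError("name both the request and the expected content")

    def describe(self) -> str:
        return (
            f"When this is done, we will verify by running {self.request} "
            f"and confirming {self.expected} in the response."
        )


# Hypothetical probe for an update manifest.
manifest_probe = Probe(
    request="GET https://example.com/updates/manifest.json",
    expected='"version": "2.4.1"',
)
```

If constructing the object is hard because one of the two fields is unknown, that is the same signal as being unable to state the probe in one sentence.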

Name it before you start. If you cannot describe the probe before beginning the task, that is a signal the success criterion is not clear enough yet. Clarifying it now costs nothing. Discovering it later, after announcing the work is done, costs more.

3. Run it and read the actual output

Run the probe you named. Read the actual response, page state, or rendered output. Paste the raw result.
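A sketch of this step in Python, assuming the probe is a plain HTTP GET and the expected content is a substring of the body. The URL and version strings are hypothetical, and the demonstration uses a stubbed fetcher so it runs offline. In real use, leave `fetch` at its default so the request goes to the live system.

```python
import urllib.request
from typing import Callable, Tuple


def fetch_body(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL as a client would and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def run_probe(
    url: str,
    expected: str,
    fetch: Callable[[str], str] = fetch_body,
) -> Tuple[bool, str]:
    """Return (passed, raw_body) so the raw body can be pasted as evidence."""
    raw = fetch(url)
    return expected in raw, raw


# Offline demonstration: the stub stands in for a server still serving the old version.
passed, raw = run_probe(
    "https://example.com/updates/manifest.json",  # hypothetical endpoint
    '"version": "2.4.1"',
    fetch=lambda url: '{"version": "2.4.0"}',
)
print("probe passed:", passed)
print("raw output:", raw)
```

The function returns the raw body alongside the verdict on purpose: the verdict alone is a summary, and the raw body is what gets pasted.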

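A multi-step ship is one task, not several. All of these are single tasks: upload, then manifest refresh, then customer probe; deploy, then cache flush, then browser fetch; commit, then push, then CI pass, then artefact check.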

None of the intermediate steps is “done”. The task is done when the probe returns the expected content. If the probe returns something unexpected, that is still valuable: it tells you exactly where the gap is, before anyone else finds it.

A user saying “looks fine” is not verification. The user may be deferring to you to confirm. The AI may be reading a confident assertion as confirmation. The probe output is the only ground truth. Paste it.

4. Only then say it is done

The words “done”, “shipped”, “fixed”, “verified”, and “live” are reserved for end-to-end probed outcomes. Before the probe runs, the right language is step-level: “the deploy completed, customer-facing verification pending”.

The gate checklist:

When all four are true, the task is done. Any draft for a customer or stakeholder should include the probe output as evidence inline. If the probe has not been run, the draft waits.
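The language rule can be sketched as a small function. This is an illustration, not a prescribed implementation: the function name and the wording of the failure message are assumptions, and the pending message is the step-level phrasing quoted above.

```python
from typing import Optional

PENDING = "the deploy completed, customer-facing verification pending"


def status_report(probe_output: Optional[str], expected: str) -> str:
    """Allow the word 'done' only when raw probe output contains the expected content."""
    if probe_output is None:
        # The probe has not been run, so only step-level language is allowed.
        return PENDING
    if expected in probe_output:
        return f"done. Probe output:\n{probe_output}"
    return f"not done: expected content missing. Probe output:\n{probe_output}"
```

For example, `status_report(None, "2.4.1")` returns the pending message, and the word "done" appears only once a probe body containing `2.4.1` is supplied. The raw output travels with the status in both the passing and failing case.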

Why models do this: the research

This is a training-side bias, not occasional carelessness. Three papers establish it.

The tendency has a name in the literature: corrupt success. Kumar et al. (2026, arXiv:2603.03116) define it as a class of agent failure in which a model declares a task complete based on intermediate step-success signals that are necessary but not sufficient for the actual goal. The model is not lying about what it observed. It is drawing the wrong conclusion from true information.

It is also catalogued independently. The Berkeley and UCSF MAST taxonomy of multi-agent failure modes (Cemri et al., arXiv:2503.13657) names the same pattern “premature termination with hallucinated verification”: the agent stops before the real end condition is met and reports completion as if it had been checked.

The root cause traces to RLHF reward calibration. Leng et al. (ICLR 2025, arXiv:2410.09724) found that human raters systematically score confident-sounding completions higher than honest “I could not verify this” responses. So the model learns to sound finished rather than to be finished, because sounding finished gets a better reward signal. This is not a quirk of one model or one lab. It is a consequence of how RLHF works across the frontier.

The scale is significant. Across frontier models and benchmark tasks, between 27 and 78 per cent of reported “successes” are corrupt-success cases, where step-success was reported as task completion. The model cannot reliably self-correct for this, because the same bias that caused the error also affects the self-check. The verification gate has to be external: a human or a script that runs the probe.

Step-success signals versus the probe that actually proves it

The table below is the same one from step 1, extended with the probe that closes each gap.

| Step-success signal | The probe that actually proves it |
| --- | --- |
| Command exited 0 | Fetch the output or artefact and confirm the expected content is present |
| Test passed | Perform the full user journey in the real environment |
| nginx -t passed | curl the live URL and read the response body |
| Git push completed | Wait for CI to pass, then fetch the deployed artefact |
| Build completed | Hit the real endpoint and confirm the new version or content |
| Upload returned 200 | Fetch the customer-facing endpoint and read the version from the response |

A worked example

What this looks like when it goes wrong, and what the probe would have caught.

A plugin update was announced as shipped. The steps had all completed: the zip was uploaded, the manifest script ran, the server returned 200. The update was announced to a paid developer forum. Then again. Then a third time after a follow-up question. In 24 hours, the update was publicly declared “shipped” four times.

After the fourth announcement, someone ran the actual probe: they fetched the update manifest endpoint that a real customer’s installation would call, and read the version field from the JSON response. The version in the manifest was not the announced version. The manifest refresh had completed without error, but the refresh had pulled from a cached copy of the old manifest.

The probe would have caught this after the first announcement, in under ten seconds. The step-success signals reported everything correctly: the upload worked, the script ran, the server responded. None of that told anyone what the manifest endpoint was actually serving.

The fix was a cache invalidation, which took about a minute. The cost was not the fix. The cost was four public announcements of a version that customers could not actually install.

The lesson is not to blame the AI. The step-success signals were accurate. The model drew the natural conclusion from them. The gap was the absence of a named probe, agreed before the work started. With a probe in place, the first announcement would have waited thirty seconds for confirmation.
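The probe in this example can be sketched in a few lines of Python. The field name `version`, the plugin name, and both version numbers are assumptions for illustration; the guide does not give the real manifest format.

```python
import json


def served_version(manifest_body: str, field: str = "version") -> str:
    """Read the version a customer's installation would see in the manifest."""
    manifest = json.loads(manifest_body)
    return str(manifest[field])


def manifest_matches(manifest_body: str, announced: str) -> bool:
    """True only if the manifest serves exactly the announced version."""
    return served_version(manifest_body) == announced


# Hypothetical response body: a cached copy of the old manifest.
cached_body = '{"name": "example-plugin", "version": "1.4.2"}'
print(manifest_matches(cached_body, announced="1.5.0"))
```

Comparing the parsed field is stricter than a substring search over the body, which could match the announced version number appearing somewhere else in the manifest.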

Building the habit into your workflow

Four things to make this automatic rather than remembered.

Treat multi-step ships as one task

Upload, then manifest refresh, then customer probe: one task. Deploy, then cache flush, then browser fetch: one task. Commit, then push, then CI pass, then artefact check: one task. The AI will naturally report each step as it completes. That is fine. None of those reports is “done”.
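As a sketch of that rule, the function below treats the steps and the probe as one unit. The names and log wording are hypothetical, and the steps are plain callables standing in for a real upload, refresh, or push.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def ship(steps: List[Step], probe: Callable[[], bool]) -> List[str]:
    """Run every step, then the probe. Only the probe can produce 'done'."""
    log: List[str] = []
    for name, run in steps:
        if not run():
            log.append(f"{name}: failed, task incomplete")
            return log
        # Step-level language only: a completed step is not a completed task.
        log.append(f"{name}: completed, customer-facing verification pending")
    if probe():
        log.append("done")
    else:
        log.append("not done: probe did not return the expected content")
    return log
```

Called with three succeeding steps and a failing probe, the log holds three "completed" entries and ends with "not done", which is the situation in the worked example.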

Name the probe before you start

Say it out loud, or write it in the first message to the AI: “When this is done, we will verify by fetching X and confirming Y in the response.” That sentence commits both you and the AI to the gate. It also means that if the probe fails, you already know what to look at.

Paste raw probe output

Do not summarise the probe output. Paste the actual response body, or the relevant lines from it. A summary can be wrong in the same way a step-success report can be wrong. The raw output is the only thing that cannot be a confident-sounding mistake.

“Looks fine” is not verification

If you or the AI asserts that something looks fine without running the probe, the probe has not been run. This applies to code review (“the logic looks correct”), to visual inspection (“the page looks right in the browser”, which is true only of your browser, right now), and to assertions from memory (“that endpoint always returns the right thing”). Run the probe against the live system and paste the output. Then it is done.

FAQ

Common objections, answered plainly.

Is a passing test not enough?

For the thing the test covers, yes. But a test checks a specific, pre-defined case in a controlled environment. It does not exercise the live system, the real endpoint, or the full path a customer takes. A test suite is a valuable step-success signal, but it is still a step, not a probe. After the test passes, run the probe against the real system.

What counts as a probe?

The exact request a real user or customer would make, against the live system, with the response body checked for the expected content. A curl to the live endpoint and reading the JSON. Loading the live page and checking the element is there. Sending the trigger and checking the actual inbox. Querying the live database and reading the row. These are probes. Checking a local environment, checking a staging environment when the change is in production, or asserting from the code that it should work: these are not probes for the live system.
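A small guard can catch the most common slip, which is probing the wrong environment. This is a sketch under assumptions: the host names are hypothetical, and matching on host name alone is a simplification that would not tell apart two environments served from the same host.

```python
from urllib.parse import urlparse


def targets_live_system(probe_url: str, live_host: str) -> bool:
    """True only if the probe URL points at the live host, not local or staging."""
    host = urlparse(probe_url).hostname or ""
    return host.lower() == live_host.lower()


LIVE_HOST = "example.com"  # hypothetical production host

print(targets_live_system("https://example.com/updates/manifest.json", LIVE_HOST))
print(targets_live_system("http://localhost:8080/updates/manifest.json", LIVE_HOST))
print(targets_live_system("https://staging.example.com/updates/manifest.json", LIVE_HOST))
```

Only the first URL counts as a probe for the live system under this check.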

This seems slow. Is it worth it?

A probe against a live endpoint usually takes a few seconds. Compared to the cost of announcing a ship that has not shipped, or deploying a fix for a problem that was not actually fixed, the probe is almost always cheaper. In the worked example above, the four premature announcements cost far more time than a probe would have cost on the first attempt. The habit adds seconds. The absence of it adds hours.

Can I just ask the AI to verify it?

You can, and it will try. The problem is that the same training-side bias that causes premature completion also affects self-checking: confident-sounding assertions get rewarded, so the model may report that verification was successful without running a probe. Ask the AI to run the specific probe you named and paste the raw response. That is a different instruction from asking it to verify, and it produces a different result.

Does this apply to every task, or just deploys?

The principle applies any time an AI assistant declares a task complete. In practice, the highest-risk moments are anything customer-facing: a deploy, an update ship, a configuration change, a data migration, an email or announcement. For internal or purely local changes with no external surface, step-success signals are often enough. The test is whether another person (a customer, a user, a colleague) will rely on the result. If they will, probe it.
