Every AI coding assistant has a habit of reporting tasks complete based on signals that are necessary but not sufficient: the command exited 0, the test passed, the build finished. The only proof is the probe: run the exact request a real user would make, and read the actual response. This guide explains the habit and how to build it.
Download the prompt file and paste it into Claude, ChatGPT, Gemini or any capable assistant. It instructs the AI to hold itself to the verification gate on your work.
When an AI assistant finishes a task, it reports on what happened: the command exited cleanly, the tests passed, the deployment script completed. These signals are real and they matter. But they are not the same as the thing working.
Here are the signals AI assistants most commonly use to conclude a task is complete, and the question each one leaves unanswered:
| Step-success signal | What it actually tells you | What it does not tell you |
|---|---|---|
| Command exited 0 | The command ran without a process-level error | Whether the output is correct |
| Test passed | That specific test case produced the expected result | Whether the full user journey works |
| `nginx -t` passed | The config file is syntactically valid | Whether the right content is served |
| Git push completed | Objects reached the remote | Whether CI passed or the deploy ran |
| Build completed | The build process ran without failing | Whether the artefact is at the correct endpoint |
| Upload returned 200 | The file reached the server | Whether customers can now retrieve the new version |
Each row is a true, useful thing to know. The column on the right is the gap. The probe closes the gap.
Before the work starts, decide exactly how you will verify it. This takes one sentence. Ask: what is the exact request a real user or customer would make, and what content must the response contain?
Examples of well-formed probes:
curl, and confirm the changed text or element is present in the response.
The probe is specific. It names the request, the endpoint or action, and the expected content. “Looks fine” from any person (including the AI) is not a probe.
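One way to make the one-sentence probe concrete is to write it down as data before any work starts. The sketch below is illustrative only: the `Probe` class, its field names, and the example URL and text are assumptions, not part of any tool mentioned in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """A probe written down before the work starts (hypothetical helper)."""

    request: str       # the exact request a real user would make
    must_contain: str  # content the response has to include

    def describe(self) -> str:
        # The sentence to say out loud, or to put in the first message to the AI.
        return (
            f"When this is done, we will verify by running `{self.request}` "
            f"and confirming {self.must_contain!r} in the response."
        )

    def passes(self, raw_response: str) -> bool:
        # A plain substring check against the raw response, nothing cleverer.
        return self.must_contain in raw_response


# Example values are placeholders, not a real endpoint.
probe = Probe(
    request="curl -s https://example.com/pricing",
    must_contain="New annual plan",
)
print(probe.describe())
```

Naming both the request and the expected content up front means a vague "looks fine" has nothing to attach to: the check either finds the text in the response or it does not.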
Run the probe you named. Read the actual response, page state, or rendered output. Paste the raw result.
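A minimal sketch of running a probe and keeping the raw result, assuming the probe is a plain HTTP GET. The function names and the example URL are hypothetical; `fetch` is injectable so the check can be exercised offline, and the default fetcher uses only the standard library.

```python
import urllib.request
from typing import Callable, Tuple


def fetch_body(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL and return the raw response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def run_probe(
    url: str,
    expected: str,
    fetch: Callable[[str], str] = fetch_body,
) -> Tuple[bool, str]:
    """Return (passed, raw_body). The raw body is always returned for pasting."""
    raw = fetch(url)
    return expected in raw, raw


# Offline demonstration with a stand-in fetcher; drop the `fetch` argument
# to make a real request against the live system.
ok, raw = run_probe(
    "https://example.com/changelog",
    "Version 2.4",
    fetch=lambda url: "<h1>Version 2.4</h1>",
)
print(raw)
print("probe passed" if ok else "probe failed")
```

Returning the body alongside the boolean is deliberate: the boolean is a summary, and the body is the evidence.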
A multi-step ship is one task, not several. All of these are single tasks:
None of the intermediate steps is “done”. The task is done when the probe returns the expected content. If the probe returns something unexpected, that is still valuable: it tells you exactly where the gap is, before anyone else finds it.
The words “done”, “shipped”, “fixed”, “verified”, and “live” are reserved for end-to-end probed outcomes. Before the probe runs, the right language is step-level: “the deploy completed, customer-facing verification pending”.
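The reserved-word rule can be sketched as a small function that only emits "done" once a probe result exists and is positive. The function name and the exact wording of the messages are assumptions chosen for illustration.

```python
from typing import Optional


def status_line(step: str, probe_result: Optional[bool]) -> str:
    """Choose step-level or outcome-level wording from the probe result.

    probe_result is None while the probe has not been run.
    """
    if probe_result is None:
        return f"the {step} completed, customer-facing verification pending"
    if probe_result:
        return f"done: the {step} completed and the probe returned the expected content"
    return f"not done: the {step} completed but the probe returned unexpected content"


print(status_line("deploy", None))
print(status_line("deploy", True))
```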
The gate checklist:
When all four are true, the task is done. Any draft for a customer or stakeholder should include the probe output as evidence inline. If the probe has not been run, the draft waits.
This is a training-side bias, not occasional carelessness. Three papers establish it.
The tendency has a name in the literature: corrupt success. Kumar et al. (2026, arXiv:2603.03116) define it as a class of agent failure in which a model declares a task complete based on intermediate step-success signals that are necessary but not sufficient for the actual goal. The model is not lying about what it observed. It is drawing the wrong conclusion from true information.
It is also catalogued independently. The Berkeley and UCSF MAST taxonomy of multi-agent failure modes (Cemri et al., arXiv:2503.13657) names the same pattern “premature termination with hallucinated verification”: the agent stops before the real end condition is met and reports completion as if it had been checked.
The root cause traces to RLHF reward calibration. Leng et al. (ICLR 2025, arXiv:2410.09724) found that human raters systematically score confident-sounding completions higher than honest “I could not verify this” responses. So the model learns to sound finished rather than to be finished, because sounding finished gets a better reward signal. This is not a quirk of one model or one lab. It is a consequence of how RLHF works across the frontier.
The scale is significant. Across frontier models and benchmark tasks, between 27 and 78 per cent of reported “successes” are corrupt-success cases, where step-success was reported as task completion. The model cannot reliably self-correct for this, because the same bias that caused the error also affects the self-check. The verification gate has to be external: a human or a script that runs the probe.
The table below is the same one from step 1, extended with the probe that closes each gap.
| Step-success signal | The probe that actually proves it |
|---|---|
| Command exited 0 | Fetch the output or artefact and confirm the expected content is present |
| Test passed | Perform the full user journey in the real environment |
| `nginx -t` passed | `curl` the live URL and read the response body |
| Git push completed | Wait for CI to pass, then fetch the deployed artefact |
| Build completed | Hit the real endpoint and confirm the new version or content |
| Upload returned 200 | Fetch the customer-facing endpoint and read the version from the response |
What this looks like when it goes wrong, and what the probe would have caught.
A plugin update was announced as shipped. The steps had all completed: the zip was uploaded, the manifest script ran, the server returned 200. The update was announced to a paid developer forum. Then again. Then a third time after a follow-up question. In 24 hours, the update was publicly declared “shipped” four times.
After the fourth announcement, someone ran the actual probe: they fetched the update manifest endpoint that a real customer’s installation would call, and read the version field from the JSON response. The version in the manifest was not the announced version. The manifest refresh had completed without error, but the refresh had pulled from a cached copy of the old manifest.
The probe would have caught this after the first announcement, in under ten seconds. The step-success signals reported everything correctly: the upload worked, the script ran, the server responded. None of that told anyone what the manifest endpoint was actually serving.
The fix was a cache invalidation, which took about a minute. The cost was not the fix. The cost was four public announcements of a version that customers could not actually install.
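The probe in this example can be sketched in a few lines. The field name `version`, the sample manifest, and the version numbers are assumptions for illustration; the real manifest format is not shown in this guide. The raw manifest text would come from fetching the endpoint a customer's installation calls.

```python
import json


def manifest_version(raw_manifest: str, field: str = "version") -> str:
    """Read the version field from the raw JSON manifest body."""
    return str(json.loads(raw_manifest)[field])


def version_is_live(raw_manifest: str, announced: str) -> bool:
    """True only if the manifest actually serves the announced version."""
    return manifest_version(raw_manifest) == announced


# A stale manifest, as a cached copy might serve it (made-up values).
stale = '{"name": "example-plugin", "version": "2.3.0"}'
print(stale)
print(version_is_live(stale, announced="2.4.0"))
```

Run against the stale manifest, the check fails even though the upload, the refresh script, and the server response all succeeded.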
Four things to make this automatic rather than remembered.
Upload, then manifest refresh, then customer probe: one task. Deploy, then cache flush, then browser fetch: one task. Commit, then push, then CI pass, then artefact check: one task. The AI will naturally report each step as it completes. That is fine. None of those reports is “done”.
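A sketch of the "one task" framing, assuming each step is a callable and the probe is a callable that returns a boolean. The names are hypothetical. Each step is logged as it completes, but only the probe decides the final word.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_task(steps: List[Step], probe: Callable[[], bool]) -> List[str]:
    """Run every step, then the probe. Only the probe can produce 'done'."""
    log: List[str] = []
    for name, step in steps:
        step()  # a step that raises stops the task before the probe runs
        log.append(f"{name}: step completed")
    if probe():
        log.append("done")
    else:
        log.append("not done: probe did not return the expected content")
    return log


# Stand-in steps and a failing probe, for illustration only.
example_log = run_task(
    [("upload", lambda: None), ("manifest refresh", lambda: None)],
    probe=lambda: False,
)
print("\n".join(example_log))
```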
Say it out loud, or write it in the first message to the AI: “When this is done, we will verify by fetching X and confirming Y in the response.” That sentence commits both you and the AI to the gate. It also means that if the probe fails, you already know what to look at.
Do not summarise the probe output. Paste the actual response body, or the relevant lines from it. A summary can be wrong in the same way a step-success report can be wrong. The raw output is the only thing that cannot be a confident-sounding mistake.
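If probe results end up in a report or a customer draft, a small formatter can keep the raw output intact rather than paraphrased. This is a sketch; the function name and layout are assumptions.

```python
def evidence_block(request: str, raw_output: str) -> str:
    """Return the request and its untouched output, ready to paste inline."""
    indented = "\n".join("    " + line for line in raw_output.splitlines())
    return f"Probe: {request}\nRaw response:\n{indented}"


# Placeholder request and response body.
print(evidence_block("curl -s https://example.com/manifest.json", '{"version": "2.4.0"}'))
```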
If you or the AI asserts that something looks fine without running the probe, the probe has not been run. This applies to code review (“the logic looks correct”), to visual inspection (“the page looks right in the browser”, which is true in your browser right now), and to assertions from memory (“that endpoint always returns the right thing”). Run the probe against the live system and paste the output. Then it is done.
Common objections, answered plainly.
For the thing the test covers, yes. But a test is a probe for a specific, pre-defined case in a controlled environment. It does not probe the live system, the real endpoint, or the full path a customer takes. For many tasks, a test suite is a valuable step-success signal. It is still a step, not a probe. After the test passes, run the probe against the real system.
The exact request a real user or customer would make, against the live system, with the response body checked for the expected content. A curl to the live endpoint and reading the JSON. Loading the live page and checking the element is there. Sending the trigger and checking the actual inbox. Querying the live database and reading the row. These are probes. Checking a local environment, checking a staging environment when the change is in production, or asserting from the code that it should work: these are not probes for the live system.
A probe against a live endpoint usually takes a few seconds. Compared to the cost of announcing a ship that has not shipped, or deploying a fix for a problem that was not actually fixed, the probe is always faster. The pattern described in the worked example above cost more time in the four retracted announcements than a probe would have cost on the first attempt. The habit adds seconds. The absence of it adds hours.
You can, and it will try. The problem is that the same training-side bias that causes premature completion also affects self-checking: confident-sounding assertions get rewarded, so the model may report that verification was successful without running a probe. Ask the AI to run the specific probe you named and paste the raw response. That is a different instruction from asking it to verify, and it produces a different result.
The principle applies any time an AI assistant declares a task complete. In practice, the highest-risk moments are anything customer-facing: a deploy, an update ship, a configuration change, a data migration, an email or announcement. For internal or purely local changes with no external surface, step-success signals are often enough. The test is whether another person (a customer, a user, a colleague) will rely on the result. If they will, probe it.