How to Test Whether Your Prompt Actually Works

A prompt that worked once is not a prompt you can trust. This is something I see small business owners miss constantly. They find a prompt that produced a great output the first time, they deploy it for a recurring task, and three months later they are frustrated that "the AI got worse." The AI did not get worse. The prompt was never tested. They got lucky on the first run, assumed the luck was reliability, and built a workflow on top of a coin flip.

Testing is the discipline that separates prompts that scale from prompts that just worked once. I run the protocol below on every prompt before I recommend anyone use it for real work.

The 10-run rule

Before you deploy any prompt for recurring use, run it ten times with ten different realistic inputs. Not the same input ten times — ten different inputs representative of the cases you will actually hit in practice. Vary the hard cases, the easy cases, and the edge cases.

Then grade each output on a simple pass/fail: is it acceptable to send to a customer or paste into a document without rewriting? No in-between. "Mostly good" is a fail. A deployed prompt runs without your supervision, and with no one there to catch the misses, mostly good becomes bad.

If you get 9 or 10 passes out of 10 runs, the prompt is deployable. If you get 7 or 8, the prompt needs tightening. If you get 6 or fewer, the prompt is broken — do not deploy it, rebuild it.
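
The thresholds above can be written down as a tiny function so the verdict is never a judgment call. This is a minimal sketch: the function name is hypothetical, and the return values simply reuse the YES / NEEDS FIX / NO labels from the test log template.

```python
def deployment_verdict(passes: int) -> str:
    """Map the number of passing runs (out of 10) to a deploy decision."""
    if not 0 <= passes <= 10:
        raise ValueError("passes must be between 0 and 10")
    if passes >= 9:
        return "YES"        # 9 or 10 passes: deployable
    if passes >= 7:
        return "NEEDS FIX"  # 7 or 8 passes: tighten the prompt and re-test
    return "NO"             # 6 or fewer: broken, rebuild it


print(deployment_verdict(8))
```

A prompt with 8 passes comes back as NEEDS FIX, which means tighten and re-run the full ten, not deploy and hope.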

What to look for on each run

When you read each output, grade it against these five axes. A failure on any one is a failure for the run.

Structural consistency. Does the output have the shape you asked for? Right number of paragraphs. Right number of options. Right sections. If you asked for a 4-paragraph email and got 2, that is a fail even if the 2 paragraphs were great.

Tone consistency. Is it in the voice you specified? Too corporate? Too casual? Does it suddenly switch register halfway through?

Content depth. Is it substantive or surface-level? Watch for AI-generic hedging language like "it depends on your specific situation" when your prompt told it the specific situation.

Factual accuracy. Did it hallucinate? Did it invent a product name, a policy detail, or a number you did not provide? Hallucinations are automatic fails.

Edge-case handling. If you gave it weird input — a customer with a long complaint, an invoice with an unusual amount, a name in a language the AI might stumble on — did it handle it gracefully, or did it break?
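
If you track grades in a script rather than on paper, the "fail on any one axis" rule might look like the sketch below. The class and field names are my own shorthand for the five axes, not part of any tool, and the True/False judgments still come from you reading the output.

```python
from dataclasses import dataclass, fields


@dataclass
class RunGrade:
    """One human judgment per axis; True means the output was acceptable."""
    structure: bool
    tone: bool
    depth: bool
    accuracy: bool
    edge_cases: bool

    def passed(self) -> bool:
        # A run passes only if every axis passed.
        return all(getattr(self, f.name) for f in fields(self))

    def failed_axes(self) -> list[str]:
        # Names of the axes that failed, in declaration order.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


grade = RunGrade(structure=True, tone=True, depth=True, accuracy=False, edge_cases=True)
print(grade.passed(), grade.failed_axes())
```

Recording which axis failed, not just that the run failed, tells you which fix to try first.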


How to iterate when you fail

If the prompt fails the 10-run test, do not rewrite the whole thing. Change one variable at a time and re-test. Otherwise you will never know which change fixed what.

Common fixes, in the order I try them:

Add a constraint. Most failures I see are not the prompt failing to do something — they are the prompt doing something extra that you did not want. Tightening the constraints is almost always the first fix.
Add an example. If the outputs are inconsistent in shape or tone, one or two well-chosen examples pull the model into line faster than any amount of description.
Tighten the role. If the tone is wrong, the role is often too generic. A more specific role produces a more specific voice.
Expand the context. If outputs are hallucinating details, the AI is filling gaps. Close the gaps.
Lock the output format. If the shape keeps drifting, specify it more literally. "Three bullet points, each starting with a verb, each no longer than 18 words" is better than "a few bullet points."
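
A literal format spec has a side benefit: part of it can be checked mechanically. Here is a rough sketch for the bullet example above. It assumes bullets start with "-", "*", or "•", the function name is hypothetical, and it checks only the count and the word limit. Whether each bullet starts with a verb still needs a human read.

```python
def check_bullets(output: str, count: int = 3, max_words: int = 18) -> list[str]:
    """Return a list of shape problems; an empty list means the shape is right."""
    bullets = [
        line.strip()
        for line in output.splitlines()
        if line.strip().startswith(("-", "*", "•"))
    ]
    problems = []
    if len(bullets) != count:
        problems.append(f"expected {count} bullets, found {len(bullets)}")
    for i, bullet in enumerate(bullets, start=1):
        # Drop the bullet marker, then count the remaining words.
        words = bullet.lstrip("-*• ").split()
        if len(words) > max_words:
            problems.append(f"bullet {i} has {len(words)} words (limit {max_words})")
    return problems


sample = "- Send the invoice today\n- Confirm the delivery date"
print(check_bullets(sample))
```

An empty list covers structural consistency only. Tone, depth, and accuracy are still graded by reading.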

The testing template

Keep a log of every prompt test. Future-you will thank present-you. Here is the format I use:

Prompt Test Log · v1.0

PROMPT: [Prompt name and version]
DATE: [YYYY-MM-DD]
MODEL: [e.g. Claude Opus 4.7, GPT-5, Gemini 2.5]
TESTED BY: [your name]

RUNS (10 total)

Run 1
  Input: [summary]
  Result: [PASS / FAIL]
  Notes: [what was right or wrong]

(repeat for runs 2-10)

OVERALL: [n/10 PASS]
DEPLOY? [YES / NO / NEEDS FIX]

CHANGES FOR NEXT VERSION:
- [change]
- [change]
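
If you would rather keep the log as data than as a text file, the same fields could be stored as below. This is only a sketch: the class names, field names, and the one-JSON-object-per-line format are assumptions of mine, chosen to mirror the template.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Run:
    input_summary: str
    passed: bool
    notes: str = ""


@dataclass
class PromptTestLog:
    prompt: str      # prompt name and version
    date: str        # YYYY-MM-DD
    model: str
    tested_by: str
    runs: list[Run] = field(default_factory=list)

    def overall(self) -> str:
        # Count passing runs and report them against the total recorded.
        passes = sum(run.passed for run in self.runs)
        return f"{passes}/{len(self.runs)} PASS"

    def to_json_line(self) -> str:
        # One JSON object per test session, suitable for appending to a file.
        return json.dumps(asdict(self))


log = PromptTestLog("Refund reply v2", "2025-01-15", "example-model", "Sam")
log.runs.append(Run("long angry complaint", True))
log.runs.append(Run("unusual invoice amount", False, "invented a policy"))
print(log.overall())
```

Appending each session to the same file gives you a history to compare against when you re-test later.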

When to re-test

A tested prompt is not permanently tested. Re-run your 10-run test when any of these happen:

You switch to a new model (every major version — a prompt that worked on one generation can behave differently on the next)
You change anything in the prompt itself (even "small" changes — if it is worth changing, it is worth re-testing)
Something in your business changes that the prompt references (pricing, services, hours, policies)
Quarterly, as a general hygiene check, even if nothing else changed
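
The four triggers above are easy to forget, so it can help to encode them as a check. The sketch below rests on several assumptions of mine: "quarterly" is treated as 90 days, a prompt change is detected by comparing a hash of the prompt text, and whether business facts changed is something you report yourself. All function and parameter names are hypothetical.

```python
import hashlib
from datetime import date, timedelta


def prompt_fingerprint(prompt_text: str) -> str:
    """Hash the prompt so even a 'small' edit shows up as a change."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def retest_reasons(
    *,
    tested_model: str,
    current_model: str,
    tested_fingerprint: str,
    current_prompt: str,
    business_facts_changed: bool,
    last_tested: date,
    today: date,
    max_age_days: int = 90,  # assumption: "quarterly" means roughly 90 days
) -> list[str]:
    """Return every reason a re-test is due; an empty list means none apply."""
    reasons = []
    if current_model != tested_model:
        reasons.append("model changed")
    if prompt_fingerprint(current_prompt) != tested_fingerprint:
        reasons.append("prompt text changed")
    if business_facts_changed:
        reasons.append("business details changed")
    if today - last_tested >= timedelta(days=max_age_days):
        reasons.append("quarterly check due")
    return reasons


print(retest_reasons(
    tested_model="model-a",
    current_model="model-b",
    tested_fingerprint=prompt_fingerprint("Write a refund reply."),
    current_prompt="Write a refund reply.",
    business_facts_changed=False,
    last_tested=date(2025, 1, 1),
    today=date(2025, 5, 1),
))
```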

That is the whole protocol. It is more work than most people want to do, which is exactly why the people who do it end up with AI that produces consistent, trustworthy output while their competitors are still flipping coins.

The other half of this equation is what your prompts know about your business before you even get to the task. That is what I cover in Giving AI the Context It Needs About Your Business.
