The vendor demo went great. The AI answered three questions perfectly. The pricing is reasonable. You're about to sign.
Pause. Run this protocol first. It takes 90 minutes and saves you from the 80% of AI tools that look good in demo and fail in production.
Step 1: Build your real test set (30 minutes)
Pull 20 actual examples from your business of the work the AI is supposed to do.
If it's an inbox triage tool, pull 20 representative emails from last week — easy ones, hard ones, weird ones, vendor pitches, the angry-client one.
If it's a proposal drafter, pull 20 past briefs and the proposals you wrote from them.
If it's a meeting notes tool, pull 20 transcripts (or recordings) of past calls — short ones, long ones, the rambly one with the chatty client.
The point: real data, not the canned examples the vendor would prefer you use. Mix easy and hard. Include the edge cases that always trip up your team.
Step 2: Run the AI on the test set (30 minutes)
Most vendors will give you a free trial or sandbox. Run all 20 examples through it. Capture the output. Don't tell the vendor what you're doing; just feed in the inputs and grab the outputs.
If the vendor won't let you do this without a paid commitment, that's a red flag in itself. Walk away or push for a structured pilot.
Step 3: Score the output yourself (20 minutes)
For each of the 20 outputs, score it on three dimensions:
Accuracy (0-3):
- 0: factually wrong or hallucinated
- 1: mostly right but with errors that would need correction
- 2: correct but rough
- 3: correct and ready to use
Tone match (0-3):
- 0: sounds nothing like you/your team would write
- 1: generic corporate, would need rewriting
- 2: close to your voice, light editing
- 3: indistinguishable from your team's writing
Time saved vs. doing it yourself (0-3):
- 0: faster to do it manually
- 1: marginal time savings
- 2: clear time savings, but you're editing
- 3: significant time savings, light review only
Total possible: 9 per example × 20 examples = 180 points.
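A spreadsheet works fine for the tally, but a few lines of Python make the arithmetic hard to get wrong. This is a minimal sketch, not a prescribed tool: the dictionary keys and the function name `total_points` are illustrative choices made up for this example.

```python
# Illustrative names throughout; adapt the keys to however you record scores.
DIMENSIONS = ("accuracy", "tone", "time_saved")
MAX_SCORE = 3  # each dimension is scored 0-3


def total_points(scores):
    """Return (points earned, points possible) for a list of scored examples.

    Each item in `scores` is a dict holding a whole-number score from 0 to 3
    for every name in DIMENSIONS.
    """
    earned = 0
    for number, row in enumerate(scores, start=1):
        for dimension in DIMENSIONS:
            value = row[dimension]
            # Reject anything outside the 0-3 scale instead of silently adding it.
            if value not in range(MAX_SCORE + 1):
                raise ValueError(
                    f"example {number}: {dimension} must be 0-{MAX_SCORE}, got {value!r}"
                )
            earned += value
    possible = len(scores) * len(DIMENSIONS) * MAX_SCORE
    return earned, possible
```

Calling `total_points` on a list of 20 scored examples returns the points earned alongside the 180 possible, so a typo like a score of 4 gets caught before it inflates the total.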
Step 4: Calculate the actual numbers
Three thresholds to clear:
Threshold 1: Average 2+ on accuracy across the easiest 75% of cases.
Of your 20 examples, the 15 easiest cases should average 2+ on accuracy (correct but rough or better). If the AI is wrong on the easy cases, it'll be much worse on the hard ones. Don't buy.
Threshold 2: Average 1+ on tone across all 20 cases.
If the average is below 1 (generic corporate output at best), the tool will need too much editing to be net-positive. The accuracy might be there, but the workflow drag will eat the savings.
Threshold 3: Average 2+ on time-saved across all 20 cases.
If the AI clears this bar, it's a real product. If it's break-even or slower, you've found a tool that demos well and creates extra work in production.
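The three thresholds above can be checked mechanically once the scores are written down. The sketch below is one hedged way to do it. It assumes each example carries a `difficulty` rating that you assign yourself (lower means easier); the post doesn't say how to rank the cases, so that field and the function name `check_thresholds` are hypothetical.

```python
from statistics import mean


def check_thresholds(scores):
    """Check the three thresholds for a list of scored examples.

    Each item is a dict with "accuracy", "tone", and "time_saved" (0-3 each)
    plus "difficulty", a hypothetical rating you assign yourself
    (lower means easier).
    """
    ranked = sorted(scores, key=lambda row: row["difficulty"])
    easy_count = (len(ranked) * 3) // 4  # easiest 75%: 15 of 20
    easiest = ranked[:easy_count]
    return {
        # Threshold 1: the easiest cases average 2+ on accuracy.
        "accuracy_on_easy_cases": mean(r["accuracy"] for r in easiest) >= 2,
        # Threshold 2: all cases average 1+ on tone.
        "tone": mean(r["tone"] for r in scores) >= 1,
        # Threshold 3: all cases average 2+ on time saved.
        "time_saved": mean(r["time_saved"] for r in scores) >= 2,
    }
```

The function returns one pass/fail flag per threshold rather than a single verdict, so you can see which one the tool missed.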
What good actually looks like
A real production-grade AI tool, on a real test set, typically scores:
- Accuracy: 2.4 average (most cases correct, occasional rough)
- Tone: 1.8 average (close to your voice with light editing)
- Time saved: 2.5 average (consistent meaningful savings)
- Total: ~135/180 (75%)
Anything above 130/180 is buy-territory. Anything below 100/180 is "demo theater": it works in scripted scenarios but breaks on real work. The space between is "talk to the vendor about your specific use case before committing."
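As a rough illustration, the score bands can be written as a small function. The cutoffs of 130 and 100 come from the text and apply to a 180-point test set. Scaling them in proportion for a different number of examples is an assumption of this sketch, and the name `verdict` is made up for it.

```python
def verdict(total, possible=180):
    """Map a total score to the bands described above.

    The bands (above 130, below 100) are stated for a 180-point test set.
    Scaling them in proportion for other sizes is an assumption of this sketch.
    """
    if possible <= 0 or not 0 <= total <= possible:
        raise ValueError("total must be between 0 and a positive possible score")
    scale = possible / 180
    if total > 130 * scale:
        return "buy territory"
    if total < 100 * scale:
        return "demo theater"
    # Exactly 100, exactly 130, and everything between land here.
    return "talk to the vendor"
```

For example, averages of 2.4, 1.8, and 2.5 add up to 6.7 points per example, or 134 across 20 examples, which lands just inside buy territory.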
The 5 questions to ask the vendor BEFORE you buy
After your test, before signing, ask:
1. "Can I see another customer's real production output?"
Not the demo. Real customer work. Redacted as needed.
2. "What happens when accuracy drops on a specific use case?"
Real vendors have a process: monitoring, alerts, retraining, escalation. Theater vendors say "that doesn't really happen."
3. "What's the contract structure if I want to leave?"
Annual contracts with auto-renewal are red flags at this stage of the AI market. Month-to-month or 90-day cancellation is standard for a real product.
4. "Who do I escalate to when the agent gets it wrong on a customer-facing task?"
If the answer is "submit a ticket and we'll respond in 5 business days," you're a small fish at a big vendor. Find a smaller vendor or a productized provider where you have a human contact.
5. "What happens if my underlying systems change (new CRM, new email, new tools)?"
Most businesses change part of their stack every 2-3 years. The vendor's answer reveals whether the tool locks you in or moves with you.
When to skip the test entirely
Two cases where this protocol is overkill:
Case 1: Free or consumer-tier AI ($20/month or less). Just try it for two weeks. The cost of being wrong is one month's fee, $20 at most.
Case 2: Productized agent at fixed price under $5K with a 7-day delivery and review gate. The vendor is taking the risk by committing to a fixed scope. A Day-2 test set and a Day-5 demo on YOUR data are built into the delivery process. The protocol above is essentially what a real productized agent build does as part of its workflow.
For everything else (annual SaaS contracts, custom builds, enterprise platforms), run the protocol. 90 minutes of real-data testing catches the 80% of tools that demo great and fail in production.
What this means for you
Don't sign on the demo. Test on your real data first. If the vendor won't let you, that's the test result.
The next post in this series covers what AI shouldn't be used for: the five things that look like AI use cases but actually aren't.
Get started
Want a real number for your specific situation?
A 30-minute audit call walks through your workflows and produces a fixed price for the 2-3 things worth automating first.