How do I test an AI tool to see if it will actually work for my business before paying?

Pull 20 real examples of the work the AI is supposed to do: actual emails, transcripts, or briefs from your business. Run them through the tool's free trial. Score each output on accuracy, tone match to your business, and time saved versus doing it manually. A production-grade tool scores above 130/180 on this scale. Below 100 means it demos well and fails in real work.

What questions should I ask an AI vendor before signing a contract?

Five questions matter: Can you show me a real customer's production output (not the demo)? What happens when accuracy drops on a specific use case? What's the cancellation structure? Who do I escalate to when the agent gets something wrong on a customer-facing task? And what happens if my underlying systems change? Annual auto-renewing contracts and 'submit a ticket' support are red flags at this stage of the market.

When is it okay to skip testing an AI tool and just buy it?

Two cases. First, free or consumer-tier tools at $20/month or less: just try it for two weeks; the cost of being wrong is trivial. Second, a productized agent at a fixed price under $5K with a 7-day delivery and a review gate built into the process. In that setup, the vendor is already testing on your real data at Day 2 and showing you a live demo at Day 5. For annual SaaS contracts, custom builds, or anything above $5K, run the 20-example test first.

How to test if an AI tool will work for your business (before you pay), Alchmy

The vendor demo went great. The AI answered three questions perfectly. The pricing is reasonable. You're about to sign.

Pause. Run this protocol first. It takes 90 minutes and saves you from the 80% of AI tools that look good in demo and fail in production.

Step 1: Build your real test set (30 minutes)

Pull 20 actual examples from your business of the work the AI is supposed to do.

If it's an inbox triage tool, pull 20 representative emails from last week: easy ones, hard ones, weird ones, vendor pitches, the angry-client one.

If it's a proposal drafter, pull 20 past briefs and the proposals you wrote from them.

If it's a meeting notes tool, pull 20 transcripts (or recordings) of past calls: short ones, long ones, the rambly one with the chatty client.

The point: real data, not the canned examples the vendor would prefer you use. Mix easy and hard. Include the edge cases that always trip up your team.

Step 2: Run the AI on the test set (30 minutes)

Most vendors will give you a free trial or sandbox. Run all 20 examples through it. Capture the output. Don't tell the vendor what you're doing; just feed in the inputs and grab the outputs.

If the vendor won't let you do this without a paid commitment, that's a red flag in itself. Walk away or push for a structured pilot.
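One way to keep the captured outputs organized for scoring is a simple spreadsheet. Here is a minimal sketch in Python, assuming you have pasted each input and the tool's output into a list of pairs by hand. The function name `build_score_sheet` and the column names are illustrative only and do not come from any vendor's tooling.

```python
import csv
import io

# Column names are assumptions for this sketch; rename them to suit your sheet.
COLUMNS = ["example", "input", "output", "accuracy", "tone", "time_saved"]


def build_score_sheet(pairs):
    """Return CSV text with one row per (input, output) pair.

    The three score columns are left blank so you can fill them in
    by hand after reading each output.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for number, (source_text, tool_output) in enumerate(pairs, start=1):
        writer.writerow({
            "example": number,
            "input": source_text,
            "output": tool_output,
            "accuracy": "",
            "tone": "",
            "time_saved": "",
        })
    return buffer.getvalue()
```

Writing the returned text to a `.csv` file gives you a sheet that opens in any spreadsheet program, with the score columns ready to fill in.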

Step 3: Score the output yourself (20 minutes)

For each of the 20 outputs, score it on three dimensions:

Accuracy (0-3):
- 0: factually wrong or hallucinated
- 1: mostly right but with errors that would need correction
- 2: correct but rough
- 3: correct and ready to use

Tone match (0-3):
- 0: sounds nothing like you/your team would write
- 1: generic corporate, would need rewriting
- 2: close to your voice, light editing
- 3: indistinguishable from your team's writing

Time saved vs. doing it yourself (0-3):
- 0: faster to do it manually
- 1: marginal time savings
- 2: clear time savings, but you're editing
- 3: significant time savings, light review only

Total possible: 9 per example × 20 examples = 180 points.
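If you would rather tally the rubric in code than in a spreadsheet, a minimal sketch in Python follows. It assumes each output is recorded as three integer scores from 0 to 3. The `Score` class and the helper names are hypothetical, chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

MAX_PER_DIMENSION = 3  # each dimension is scored 0-3
DIMENSIONS = 3         # accuracy, tone match, time saved


@dataclass
class Score:
    """Scores for one output (hypothetical structure for this sketch)."""
    accuracy: int
    tone: int
    time_saved: int

    def __post_init__(self):
        # Reject anything outside the 0-3 rubric range.
        for name in ("accuracy", "tone", "time_saved"):
            value = getattr(self, name)
            if not 0 <= value <= MAX_PER_DIMENSION:
                raise ValueError(f"{name} must be 0-3, got {value}")

    def total(self) -> int:
        return self.accuracy + self.tone + self.time_saved


def total_points(scores: list[Score]) -> int:
    """Sum of all three dimensions across every scored example."""
    return sum(score.total() for score in scores)


def max_points(n_examples: int) -> int:
    """Highest possible total: 9 points per example."""
    return n_examples * DIMENSIONS * MAX_PER_DIMENSION
```

With 20 examples, `max_points(20)` gives the 180-point ceiling used in the rest of this protocol.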

Step 4: Calculate the actual numbers

Three thresholds to clear:

Threshold 1: Average 2+ on accuracy across the easiest 75% of cases.

Of your 20 examples, the 15 easiest cases should average 2+ on accuracy (correct but rough or better). If the AI is wrong on the easy cases, it'll be much worse on the hard ones. Don't buy.

Threshold 2: Average 1+ on tone across all 20 cases.

If the average is below 1 (consistently corporate-bland output), the tool will need too much editing to be net-positive. The accuracy might be there but the workflow drag will eat the savings.

Threshold 3: Average 2+ on time-saved across all 20 cases.

If the AI clearly saves time on average, it's a real product. If it's break-even or slower, you've found a tool that demos well and creates extra work in production.
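The three thresholds can be checked mechanically once the scores are recorded. The sketch below is one possible way to do it in Python. It assumes you have ranked the examples by difficulty yourself (1 = easiest); that `difficulty` field is an assumption of this sketch, since the protocol only says to identify the easiest cases. The function name `check_thresholds` is also illustrative.

```python
from statistics import mean


def check_thresholds(rows):
    """Return pass/fail for each of the three thresholds.

    Each row is a dict with the keys 'difficulty' (your own ranking,
    1 = easiest), 'accuracy', 'tone', and 'time_saved' (each 0-3).
    """
    ranked = sorted(rows, key=lambda row: row["difficulty"])
    easy = ranked[: len(ranked) * 3 // 4]  # easiest 75%: 15 of 20

    return {
        # Threshold 1: easiest cases average 2+ on accuracy
        "accuracy_on_easy": mean(row["accuracy"] for row in easy) >= 2,
        # Threshold 2: all cases average 1+ on tone
        "tone": mean(row["tone"] for row in rows) >= 1,
        # Threshold 3: all cases average 2+ on time saved
        "time_saved": mean(row["time_saved"] for row in rows) >= 2,
    }
```

A tool clears the bar only if all three values come back `True`. Keeping them separate shows which dimension is the problem when one fails.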

What good actually looks like

A real production-grade AI tool, on a real test set, typically scores:
- Accuracy: 2.4 average (most cases correct, occasional rough)
- Tone: 1.8 average (close to your voice with light editing)
- Time saved: 2.5 average (consistent meaningful savings)
- Total: ~135/180 (75%)

Anything above 130/180 is buy-territory. Anything below 100/180 is "demo theater": it works in scripted scenarios but breaks on real work. The space between is "talk to the vendor about your specific use case before committing."
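The arithmetic behind these bands is short: averages of 2.4, 1.8, and 2.5 add up to 6.7 points per example, and 6.7 × 20 examples is 134, which is where the roughly 135/180 figure comes from. A small Python sketch of the same calculation and the three verdict bands follows. The function names are illustrative, and treating exactly 100 and exactly 130 as the middle band is an assumption, since the text only says "above 130" and "below 100".

```python
BUY_ABOVE = 130           # totals above this are buy-territory
DEMO_THEATER_BELOW = 100  # totals below this are "demo theater"


def projected_total(avg_accuracy, avg_tone, avg_time_saved, n_examples=20):
    """Total points implied by the three per-dimension averages."""
    return (avg_accuracy + avg_tone + avg_time_saved) * n_examples


def verdict(total):
    """Map a total out of 180 onto the three bands described above."""
    if total > BUY_ABOVE:
        return "buy"
    if total < DEMO_THEATER_BELOW:
        return "demo theater"
    return "talk to the vendor"
```

For example, `verdict(round(projected_total(2.4, 1.8, 2.5)))` lands in the buy band.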

The 5 questions to ask the vendor BEFORE you buy

After your test, before signing, ask:

1. "Can I see another customer's real production output?"

Not the demo. Real customer work. Redacted as needed.

2. "What happens when accuracy drops on a specific use case?"

Real vendors have a process: monitoring, alerts, retraining, escalation. Theater vendors say "that doesn't really happen."

3. "What's the contract structure if I want to leave?"

Annual contracts with auto-renewal are red flags at this stage of the AI market. Month-to-month or 90-day cancellation is standard for a real product.

4. "Who do I escalate to when the agent gets it wrong on a customer-facing task?"

If the answer is "submit a ticket and we'll respond in 5 business days," you're a small fish at a big vendor. Find a smaller vendor or a productized provider where you have a human contact.

5. "What happens if my underlying systems change (new CRM, new email, new tools)?"

Every business changes its stack every 2-3 years. The vendor's answer reveals whether the tool is locked-in or portable.

When to skip the test entirely

Two cases where this protocol is overkill:

Case 1: Free or consumer-tier AI ($20/month or less). Just try it for two weeks. The cost of being wrong is one month's fee: $20 or less.

Case 2: Productized agent at fixed price under $5K with a 7-day delivery and review gate. The vendor is taking the risk by committing to a fixed scope. A Day-2 test set and a Day-5 demo on YOUR data are built into the delivery process. The protocol above is essentially what a real productized agent build does as part of its workflow.

For everything else (annual SaaS contracts, custom builds, enterprise platforms), run the protocol. 90 minutes of real-data testing catches the 80% of tools that demo great and fail in production.
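The skip rule can be written down as a small decision function. This is a sketch under stated assumptions, not a definitive policy: the function and parameter names are hypothetical, and reading "7-day delivery" as "seven days or fewer" is an interpretation.

```python
def can_skip_test(monthly_price=None, fixed_price=None,
                  delivery_days=None, has_review_gate=False):
    """Return True when the 20-example protocol is overkill.

    Prices are in US dollars. Pass monthly_price for subscriptions,
    or fixed_price plus delivery details for a productized agent.
    """
    # Case 1: free or consumer-tier tool at $20/month or less
    if monthly_price is not None and monthly_price <= 20:
        return True

    # Case 2: fixed price under $5K, delivery within 7 days (assumed
    # to mean 7 or fewer), and a review gate built into the process
    if (
        fixed_price is not None
        and fixed_price < 5000
        and delivery_days is not None
        and delivery_days <= 7
        and has_review_gate
    ):
        return True

    # Everything else: run the protocol first
    return False
```

For instance, `can_skip_test(monthly_price=49)` returns `False`, so a $49/month subscription still warrants the test.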

What this means for you

Don't sign on the demo. Test on your real data first. If the vendor won't let you, that's the test result.

The next post in this series covers what AI shouldn't be used for: the five things that look like AI use cases but actually aren't.

Get started

Want a real number for your specific situation?

30-minute audit call walks through your workflows and outputs a fixed price for the 2-3 things worth automating first.

