
What are AI evals, and why every serious AI rollout needs them

Most AI rollouts are run on vibes. The team builds an assistant, tries a few questions, decides it "seems good," and ships it. Then a model gets updated, or someone tweaks a prompt, and a customer gets a confidently wrong answer. Nobody saw it coming because nobody was measuring. Evals are the fix, and they are the single most under-used discipline in AI for small and mid-sized businesses (SMBs).

This is the plain-English version: what an eval is, why every serious rollout needs one, and how a small business runs a lightweight version without a data team.

Evals in 30 words

An eval is a repeatable test of AI output quality: a fixed set of questions, the answers you would accept, a scoring rubric, and a number you track over time.

Think of it as unit testing for AI

Engineers do not ship code and hope. They write tests that prove the code does what it should, and re-run them after every change. An eval is the same idea for AI behaviour.

You assemble a set of representative inputs - real questions, real documents, real tasks - and next to each one you write the answer you would accept. Then you run your AI against the whole set and score how it did. That score is your baseline. From then on, every time something changes - a new model version, an edited prompt, an updated document set - you run the same test and watch whether the number goes up or down.
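
As a rough illustration of that loop, here is a minimal Python sketch. The `ask_ai` function is a hypothetical stand-in for whatever assistant you are testing, and the cases and canned answers are invented for the example. The exact-match comparison is only the simplest possible check, not a recommendation.

```python
# A minimal eval loop: fixed cases, accepted answers, one score.
# `ask_ai` is a hypothetical placeholder; swap in a call to your own assistant.

CASES = [
    {"question": "What are your opening hours?", "accepted": "9am to 5pm, Monday to Friday"},
    {"question": "Do you offer refunds?", "accepted": "Yes, within 30 days"},
    {"question": "Do you ship overseas?", "accepted": "No, UK only"},
]

def ask_ai(question: str) -> str:
    """Stand-in for the real assistant, returning canned answers for the demo."""
    canned = {
        "What are your opening hours?": "9am to 5pm, Monday to Friday",
        "Do you offer refunds?": "Yes, within 30 days",
        "Do you ship overseas?": "Yes, worldwide",  # a confidently wrong answer
    }
    return canned[question]

def run_eval(cases) -> float:
    """Return the fraction of cases where the output matches the accepted answer."""
    passed = 0
    for case in cases:
        output = ask_ai(case["question"])
        if output.strip().lower() == case["accepted"].strip().lower():
            passed += 1
    return passed / len(cases)

baseline = run_eval(CASES)
print(f"Baseline score: {baseline:.0%}")
```

With these invented cases the assistant gets two of three right. Re-running `run_eval` on the same cases after a change gives a number that can be compared directly with the baseline.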

OpenAI frames it the same way: an eval needs two core ingredients, a data set and the grading criteria that decide whether an output is right. That is the whole concept. Everything else is detail.

Why an owner should care

Evals are not an engineering nicety. They buy you three things that sit squarely in an owner’s lap.

  • Trust. “We think it works” becomes “it scores 92% on our 50-case test set.” You know how good your AI actually is, in a number, not a feeling.
  • Early warning. AI does not stay still. Vendors update models, prompts get edited, documents change - and any of those can quietly make output worse on cases you thought were solved. That is a regression. An eval catches it before a customer does.
  • Evidence. When a buyer doing due diligence, an auditor, or a regulator asks how you control your AI, an eval is the concrete answer. It is the difference between a governed system and a hopeful one.

That last point compounds. AI governance only means something if it can point at something. A policy that says “we ensure quality” is wallpaper. A policy backed by a tracked eval score is control you can demonstrate.

How an SMB runs lightweight evals

You do not need a data science team or a platform. A useful eval has three parts, and you can stand it up in a spreadsheet.

1. A test set

Collect 20 to 50 real inputs that represent the actual work - the questions staff and customers genuinely ask, including the awkward ones. Next to each, write the answer you would accept as correct. This set is the asset. Build it from real cases, not made-up ones, and include edge cases: the ambiguous question, the one where the right answer is “that is not in our documents.”
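
One way to picture the test set is as a small table with one row per case. The sketch below assumes a CSV export with hypothetical column names (`id`, `input`, `accepted_answer`, `kind`). The three rows are invented for illustration and stand in for your own real cases.

```python
import csv
import io

# Invented example rows; in practice this would be your exported spreadsheet.
TEST_SET_CSV = """id,input,accepted_answer,kind
1,What is the returns window?,30 days from delivery,typical
2,Can I return a custom order?,No; custom orders are non-refundable,awkward
3,What is your policy on drone delivery?,That is not in our documents,not_in_documents
"""

def load_test_set(text: str) -> list[dict]:
    """Parse the CSV and reject rows missing an input or an accepted answer."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        if not row["input"].strip() or not row["accepted_answer"].strip():
            raise ValueError(f"Case {row['id']} is missing an input or accepted answer")
    return rows

cases = load_test_set(TEST_SET_CSV)
edge_cases = [c for c in cases if c["kind"] != "typical"]
print(f"{len(cases)} cases loaded, {len(edge_cases)} of them edge cases")
```

The `kind` column is optional, but tagging edge cases makes it easy to check that the set still contains them as it grows.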

2. A rubric

Decide how you score each output, and keep it concrete. Anthropic’s guidance is clear on this: grade empirically, not vaguely - mark each answer correct or incorrect, or rate it on a 1 to 5 scale, rather than relying on a hand-wave of “good enough.” A concrete rubric is what makes the score comparable from one run to the next.
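
To make "concrete" tangible, a correct/incorrect rubric can be written down as explicit checks. The sketch below is one possible approach, not Anthropic's method: a hypothetical `Rubric` lists phrases an answer must contain and phrases it must not. Keyword matching is crude and will miss paraphrases, so treat it as a starting point.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    """Hypothetical rubric: facts an answer must state and claims it must not make."""
    must_include: list[str]
    must_not_include: list[str] = field(default_factory=list)

def grade(answer: str, rubric: Rubric) -> bool:
    """Return True (correct) only if every required phrase appears and no forbidden one does."""
    text = answer.lower()
    has_required = all(p.lower() in text for p in rubric.must_include)
    has_forbidden = any(p.lower() in text for p in rubric.must_not_include)
    return has_required and not has_forbidden

# Invented rubric for a refund question.
refund_rubric = Rubric(must_include=["30 days"], must_not_include=["60 days", "no refunds"])

print(grade("You can return items within 30 days of delivery.", refund_rubric))  # True
print(grade("Refunds are available for 60 days.", refund_rubric))                # False
```

Because the checks are written down, the same answer always gets the same grade, whoever runs the test and whenever they run it.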

3. A tracked score

Run the AI against the full set, grade every answer against the rubric, and write down the total. That number is what you watch. Re-run it whenever you change the model, the prompt or the documents, and before anything ships to customers. A score that moves is a signal; a score you never measured is a blind spot.
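
A short sketch of what watching the number can look like in practice. It assumes you keep per-case results from each run; the case ids and pass/fail values below are invented. Alongside the total, it lists the cases that passed before a change and fail after it.

```python
# Per-case results (True = correct) from two runs of the same test set.
# The case ids and results are invented for illustration.
previous_run = {"q01": True, "q02": True, "q03": False, "q04": True, "q05": True}
current_run = {"q01": True, "q02": False, "q03": True, "q04": True, "q05": False}

def score(results: dict[str, bool]) -> float:
    """Fraction of cases graded correct."""
    return sum(results.values()) / len(results)

def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Cases that passed before the change and fail after it."""
    return sorted(case for case, ok in before.items() if ok and not after.get(case, False))

print(f"Previous: {score(previous_run):.0%}, current: {score(current_run):.0%}")
print("Regressed cases:", regressions(previous_run, current_run))
```

Here the total drops from 80% to 60%, and the per-case view shows why: two cases broke while one was fixed. Checking individual cases matters because a fix and a break can cancel out and leave the total unchanged.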

Two notes from Anthropic worth stealing. Favour volume over polish: grading more cases automatically beats grading a handful painstakingly by hand, because you want coverage you will actually re-run. And make the test set mirror the real spread of work, edge cases included, so the score means something.

Evals and hallucinations

An eval will not stop a model from occasionally making things up. What it does is measure how often that happens on the work that matters to you, and tell you whether a change helped or hurt. Put a few questions in your set where the only correct answer is “I do not know” or “that is not in the source material,” and you can score whether your AI invents answers under pressure. You manage what you measure. Evals turn hallucination from a background anxiety into a number you can drive down on purpose.
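
The "I do not know" questions can be scored separately from the rest. The sketch below assumes a simple phrase check for detecting a refusal. The phrase list and the sample outputs are invented, and you would adjust them to how your own assistant words its refusals.

```python
# Phrases that count as an honest refusal. This list is an assumption; tune it to
# how your own assistant words "I do not know".
ABSTAIN_PHRASES = ("i do not know", "i don't know", "not in the source material", "not in our documents")

def abstained(answer: str) -> bool:
    """True if the answer contains one of the refusal phrases."""
    text = answer.lower()
    return any(phrase in text for phrase in ABSTAIN_PHRASES)

def invention_rate(answers: list[str]) -> float:
    """Share of unanswerable questions where the AI gave an answer instead of abstaining."""
    invented = sum(1 for a in answers if not abstained(a))
    return invented / len(answers)

# Invented outputs for four questions whose only correct response is a refusal.
outputs = [
    "I do not know; that is not in the source material.",
    "Our warranty covers accidental damage for five years.",
    "That is not in our documents.",
    "I don't know.",
]
print(f"Invented an answer on {invention_rate(outputs):.0%} of unanswerable questions")
```

In this example the assistant invents an answer on one of four unanswerable questions. That rate is the number to track and drive down across changes.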

The takeaway

If an AI feature matters enough to put in front of staff or customers, it matters enough to measure. Start small: 30 real cases, a correct/incorrect rubric, a score in a spreadsheet, re-checked on every change. That is not an enterprise programme - it is the line between an AI rollout you can defend and one you are quietly hoping holds.

Frequently asked questions

What is an AI eval in simple terms?
An eval is a repeatable test for an AI feature. You write down a set of representative questions, the answers you would accept as good, and a way to score how close the AI gets. Then you run your AI against that set and record the score. When you change the prompt, swap the model, or update your documents, you run the same test again and see whether the score went up or down. It is unit testing for AI behaviour.

Why do evals matter for a business, not just engineers?
Because without them you are trusting an AI feature on vibes. Evals give you three things owners care about: trust (you know how well it actually performs), early warning (you catch quality dropping before a customer does), and evidence (you can show a buyer, auditor or regulator that your AI is measured and controlled). The difference between “we think it works” and “it scores 92% on our 50-case test set, tracked monthly” is the difference between a hopeful rollout and a governed one.

Why do I need evals if the AI worked fine when we launched?
Because AI does not stay still. Vendors update models, you tweak prompts, your documents change, and any of those can quietly make output worse on cases you thought were solved - a regression. Without an eval you only find out when a customer complains. With one, you re-run the test after every change and see the score move first. The model that was great in January is not guaranteed to behave identically in June.

How does an SMB run lightweight evals without a data team?
Three pieces. First, a test set: 20 to 50 real questions or inputs that represent the work, with the answer you would accept written next to each. Second, a rubric: a simple, concrete way to score each output - correct/incorrect, or a 1-5 scale, which Anthropic recommends over vague judgement. Third, a tracked score: run the AI against the set, grade it, write down the number, and re-check it whenever something changes. That is enough to be useful.

How do evals help with hallucinations?
An eval will not stop a model from making things up, but it measures how often it does on the work that matters to you, and it tells you whether a change made that better or worse. By including questions where the correct response is “I do not know” or “that is not in the documents”, you can score whether the AI invents answers under pressure. You manage what you measure - evals turn hallucination from an anxiety into a number you can drive down.
