What is an AI benchmark, and should you trust the leaderboards?

Every time a new AI model launches, the press release leads with a wall of scores: 88 per cent on this, state-of-the-art on that, top of the leaderboard. Those numbers are benchmarks. An AI benchmark is a standardised test that scores a model on a task - coding, reasoning, knowledge or agentic work. They are useful for a rough sort of which models are strong, but they are widely misread, and the gap between a high score and a good result on your work is where a lot of money gets wasted.

What a benchmark is

A benchmark is a standardised test for AI models. Give every model the same set of tasks, score how many it gets right, and line the scores up. The point is comparison: a single number that lets you say model A beats model B at a particular skill. The 2026 names you will run into sort into a few families.

Coding and agentic work

SWE-bench Verified hands the model real software bugs from real projects and asks it to fix them end to end. It is a human-checked subset, so the tasks are known to be solvable. When a vendor claims a coding score, this is usually it.
Terminal-Bench tests “agentic” work: whether a model can complete multi-step jobs in a command line, taking a sequence of actions rather than just answering.

Knowledge and reasoning

MMLU is a broad general-knowledge exam across many subjects, long the default yardstick.
GPQA Diamond is a set of graduate-level science questions.
Humanity’s Last Exam is deliberately brutal: extremely hard questions across every discipline. The best models score around 35 per cent where human experts average around 90 per cent. It exists because models had started topping the easier tests.
ARC-AGI tests fluid reasoning on novel puzzles the model cannot have memorised - closer to raw problem-solving than recall.

Computer use

Online-Mind2Web tests whether an agent can actually complete tasks across live websites - shopping, forms, travel bookings - rather than on a frozen snapshot. It was built to catch agents that look good in the lab and fall over on the real web.

Why an owner should be sceptical

Here is the uncomfortable part. A leaderboard position is a starting shortlist, not an answer. Four reasons to treat the headline numbers with care.

They are usually self-reported. The flashy figures come from the vendor’s own launch materials, run on its own setup, with the flattering numbers up front and the weak ones quietly omitted. One 2026 model launched with a wall of impressive scores; independent testing a day later found a hallucination rate north of 80 per cent on a benchmark the press release skipped.
They run on different harnesses. The scaffolding around a model - how it is prompted, how many attempts it gets, what tools it is given - varies between labs, so two “SWE-bench” scores are often not strictly like-for-like.
They can be contaminated. Benchmarks are public, so their questions circulate online, and a model trained on a big slice of the web may have absorbed them. When that happens the model is partly remembering answers rather than working them out, and the score overstates real ability.
They rarely predict your results. This is the big one. A benchmark measures performance on its tasks, not yours. A model that tops a coding chart may be mediocre at drafting your customer emails or reading your invoices.

The research has a name for this. The 2025 paper behind Online-Mind2Web called it an “illusion of progress” - agents that looked strong on static tests failed once the websites were live. The lesson generalises: a benchmark score is evidence, not proof.
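
To see how much the harness alone can move a number, here is a small Python sketch. The outcomes are invented for illustration, not taken from any real benchmark run, and the task names are placeholders. The point is only that "solved on the first try" and "solved in any of three tries" give different scores for exactly the same model behaviour.

```python
# Hypothetical outcomes for one model on five tasks, three attempts each.
# True means that attempt solved the task. All values are made up.
ATTEMPTS = {
    "task_1": [True, True, True],
    "task_2": [False, True, False],
    "task_3": [False, False, True],
    "task_4": [False, False, False],
    "task_5": [True, False, True],
}


def first_attempt_rate(attempts):
    """Share of tasks solved on the first try only."""
    return sum(tries[0] for tries in attempts.values()) / len(attempts)


def any_attempt_rate(attempts):
    """Share of tasks solved by at least one of the attempts."""
    return sum(any(tries) for tries in attempts.values()) / len(attempts)


if __name__ == "__main__":
    print(f"first attempt only: {first_attempt_rate(ATTEMPTS):.0%}")
    print(f"best of three:      {any_attempt_rate(ATTEMPTS):.0%}")
```

With these invented results, the same model scores 40 per cent under one counting rule and 80 per cent under the other, which is why it helps to ask how a published figure was counted before comparing it with another.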

What to do instead

Do not throw the leaderboards out. Use them for what they are good at, and add the step that matters.

Use leaderboards to build a shortlist. They are a fine first filter for which two or three models are worth your time.
Then run your own small eval on your real tasks. Take ten to thirty actual jobs from your business - the emails, summaries, classifications or document reads you genuinely need. Write down what a good answer looks like. Run your shortlisted models against them.
Pick the model that wins on your work, not the one that wins on a public chart. If the second-place model is better at your specific job, use the second-place model.

This does not need a data-science team. A spreadsheet, a couple of dozen real examples and an afternoon will tell you more about which model to buy than any leaderboard, because it tests the only thing you care about: your work.
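
If you would rather script it than use a spreadsheet, the steps above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended tool: the task names, the model labels (model_a, model_b) and the 0 to 2 grading scale are all assumptions made up for the example, and the scores stand in for marks you would give by hand after reading each output.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hand-graded results: one row per (task, model) pair.
# Assumed scale: 0 = unusable, 1 = needs edits, 2 = good as is.
RESULTS = [
    ("refund email", "model_a", 2),
    ("refund email", "model_b", 1),
    ("invoice read", "model_a", 1),
    ("invoice read", "model_b", 2),
    ("meeting summary", "model_a", 2),
    ("meeting summary", "model_b", 2),
    ("ticket triage", "model_a", 0),
    ("ticket triage", "model_b", 2),
]


def rank_models(results):
    """Return (model, average score) pairs, best first."""
    scores_by_model = defaultdict(list)
    for _task, model, score in results:
        scores_by_model[model].append(score)
    averages = {model: mean(scores) for model, scores in scores_by_model.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, average in rank_models(RESULTS):
        print(f"{model}: {average:.2f} average over your tasks")
```

A plain average is the simplest way to line the models up. If a handful of tasks matter far more to the business than the rest, weighting those rows more heavily would be a reasonable variation.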

Honest limits

Benchmarks are not useless. They genuinely separate strong models from weak ones at a broad level, and a model that is bad across the board on public tests is unlikely to surprise you on your tasks.
Your own eval is not perfect either. A small test set has blind spots, so treat it as a decision aid, not gospel. And the picture moves - new models and benchmarks arrive constantly, so today’s leader is not a permanent fact.
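
One way to put a rough number on the small-test-set caveat is a confidence interval around each model's pass rate. The sketch below uses the Wilson score interval, a standard statistical formula that is our choice of illustration rather than anything the sources above prescribe. The counts (15 and 13 passes out of 20) are invented.

```python
from math import sqrt


def wilson_interval(wins, total, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate (z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical results: two models graded pass/fail on the same 20 tasks.
    for name, wins in [("model_a", 15), ("model_b", 13)]:
        low, high = wilson_interval(wins, 20)
        print(f"{name}: {wins}/20 passed, plausible range {low:.0%} to {high:.0%}")
```

On twenty tasks the two ranges overlap heavily, so a two-task gap is weak evidence on its own. A wider gap, or a second batch of tasks, would make the decision safer.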

The one-line version

An AI benchmark is a standardised test that scores a model on a task - SWE-bench Verified for coding, Terminal-Bench for agentic work, MMLU, GPQA and Humanity’s Last Exam for knowledge, ARC-AGI for reasoning, Online-Mind2Web for computer use. Be sceptical: scores are self-reported, run on different harnesses, can be contaminated, and rarely predict your results. Use leaderboards to build a shortlist, then run your own small eval on your actual tasks and pick the model that wins on your work.

Sources

[1] Artificial Analysis - Intelligence Index and methodology. An independent index that aggregates multiple benchmarks (including Terminal-Bench Hard, GPQA Diamond and Humanity's Last Exam) into one score, with published methodology explaining how models are evaluated.
[2] SWE-bench - leaderboard and methodology. The SWE-bench project page describing how models are scored on resolving real GitHub issues, including the human-validated SWE-bench Verified subset.
[3] An Illusion of Progress? Assessing the Current State of Web Agents. Research paper introducing the Online-Mind2Web benchmark and showing that web agents which look strong on static tests often fail on live websites, a caution against reading benchmark scores as real-world ability.

Citations are to publicly available reports, advisor publications and practitioner research. Inclusion does not imply endorsement of XLev by any cited organisation.

Frequently asked questions

What is an AI benchmark in plain English?

A benchmark is a standardised test for AI models. You give every model the same set of tasks - solve these coding tickets, answer these graduate-level science questions, complete these multi-step jobs - score how many it gets right, and put the scores side by side. The point is comparison: a single number that lets you say model A beats model B on a particular skill. Benchmarks come in families for different abilities - coding, reasoning, general knowledge, and 'agentic' work where the model has to take a series of actions rather than just answer a question.

What are the main AI benchmarks in 2026?

A handful come up again and again. SWE-bench Verified tests real-world coding by asking the model to fix actual software bugs end to end. Terminal-Bench tests agentic work - completing multi-step tasks in a command line. MMLU, GPQA Diamond and Humanity's Last Exam test knowledge and reasoning, with Humanity's Last Exam deliberately brutal: the best models score around 35 per cent where experts hit around 90 per cent. ARC-AGI tests fluid reasoning on novel puzzles. Online-Mind2Web tests computer use - whether an agent can actually complete tasks across live websites.

Why shouldn't I just pick the model at the top of the leaderboard?

Four reasons. The scores are usually self-reported by the vendor in its own launch materials, with the flattering numbers up front and the weak ones left out. They are run on different harnesses - the scaffolding around the model varies between labs, so the figures are not strictly like-for-like. They can be contaminated, where benchmark questions leak into the training data and the model has effectively seen the test. And most importantly, a leaderboard score rarely predicts how the model performs on your specific work. Top of the chart is a starting shortlist, not an answer.

What is benchmark contamination?

Contamination is when the questions and answers from a benchmark end up in the data a model was trained on. Because these tests are public, their contents circulate on the web, and a model trained on a big slice of the web may have absorbed them. When that happens the model is partly remembering the answers rather than working them out, so the score overstates real ability. It is one of the main reasons a high benchmark number does not always translate into strong performance on fresh, real-world tasks the model has never seen.

What should I do instead of trusting leaderboards?

Run your own small eval on your actual tasks. Take ten to thirty real jobs from your business - the emails, summaries, classifications or document reads you actually need - write down what a good answer looks like, and run your shortlisted models against them. The model that wins on your work is the one to use, regardless of where it sits on a public chart. Use the leaderboards to build the shortlist and pick which models to test; use your own eval to make the decision.
