Playbook · 8 min read

How to run an AI pilot in your business (without wasting three months)

Most AI pilots that go nowhere were never really pilots. They were open-ended experiments with no owner, no metric and no end date, and they died of slow neglect three months in. That is the failure mode worth avoiding, and it has almost nothing to do with whether the AI is capable.

A pilot is a decision-making exercise. The deliverable is not “we tried some AI” - it is a confident call: scale, iterate, or kill. Here is the method we use to get to that call in four to eight weeks.

Pick one workflow using a three-part selection rule

The single biggest predictor of a useful pilot is the workflow you choose. Score your candidates on three things and pick the one that is high on all three at once.

High volume. The task happens many times a day or week. Volume makes the savings compound and gives you enough output to judge in weeks rather than quarters.
A clear right answer. You can look at the output and tell good from bad without a debate. Drafting, triage, summarising, data entry and first-pass classification qualify. Open-ended judgement calls do not.
Low blast-radius. If the AI gets one wrong, a human catches it before it reaches a customer, a regulator or the books. You want room to be wrong cheaply while you learn.

The mistake is starting on the hardest, highest-stakes, rarest task in the business because it feels the most impressive. That is the worst possible pilot. Start where volume is high, the answer is clear, and a mistake is harmless.
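The selection rule can be sketched in a few lines of Python. This is only an illustration: the 1-to-5 scale, the candidate workflows and the names `Workflow`, `pilot_score` and `pick_pilot` are assumptions made up for the example, not part of any tool. Scoring by the weakest of the three criteria is one simple way to express "high on all three at once".

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    volume: int        # 1 (rare) to 5 (many times a day); assumed scale
    clear_answer: int  # 1 (open-ended judgement) to 5 (good vs bad is obvious)
    low_blast: int     # 1 (errors reach customers) to 5 (errors are caught cheaply)


def pilot_score(w: Workflow) -> int:
    # The minimum rewards workflows that are high on all three at once:
    # a single weak criterion drags the whole score down.
    return min(w.volume, w.clear_answer, w.low_blast)


def pick_pilot(candidates: list[Workflow]) -> Workflow:
    # Ties on the minimum are broken by the total, which is an arbitrary choice.
    return max(
        candidates,
        key=lambda w: (pilot_score(w), w.volume + w.clear_answer + w.low_blast),
    )


# Hypothetical candidates and scores, for illustration only.
candidates = [
    Workflow("Invoice data entry", volume=5, clear_answer=5, low_blast=4),
    Workflow("Contract negotiation", volume=1, clear_answer=2, low_blast=1),
    Workflow("Support ticket triage", volume=5, clear_answer=4, low_blast=4),
]

best = pick_pilot(candidates)
print(best.name)
```

The rare, high-stakes task scores lowest however impressive it would look, which is the point of the rule.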

Define one success metric, and only one

Before anything runs, write down the single number this pilot is judged on. For most SMBs it is one of two:

Wage-equivalent hours saved - the hours of staff time the AI removes, valued at the loaded hourly cost of the person who used to do the work. This becomes a dollar figure a budget owner understands.
Cycle-time - how long the task takes start to finish. Useful when the win is speed, for example a quote that now goes out in two hours instead of two days.

One metric. Not five. A pilot judged on five metrics is judged on none, because you can always find one that looks good. Pick the number that, if it moves, makes the decision for you.
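As a rough worked example, wage-equivalent hours saved is the hours removed multiplied by the loaded hourly cost. The figures below are invented for illustration, and the function name is our own. We also assume the pilot hours include the time the human reviewer spends checking output, since that time is still being paid for.

```python
def wage_equivalent_saving(
    baseline_hours: float,
    pilot_hours: float,
    loaded_hourly_cost: float,
) -> float:
    """Dollar value of the staff time removed over the same period."""
    # Assumption: pilot_hours includes human review time, not just AI run time.
    hours_saved = baseline_hours - pilot_hours
    return hours_saved * loaded_hourly_cost


# Hypothetical week: the task took 20 hours before and 8 hours during the pilot,
# done by someone with a loaded cost of $55 an hour.
weekly_saving = wage_equivalent_saving(
    baseline_hours=20.0,
    pilot_hours=8.0,
    loaded_hourly_cost=55.0,
)
print(weekly_saving)  # 660.0
```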

Capture a baseline first

This is the step everyone skips and the one that decides whether your pilot produces a real answer.

In the week before the pilot starts, measure the metric the old way. How many hours does this task actually consume now? How long does the cycle actually take? Write it down. A pilot with no baseline cannot be judged - “it feels faster” is not a result you can take to a spend decision, and it is the line that kills good pilots in the meeting that matters.

If you only have time for one piece of discipline, make it this one.
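A baseline does not need tooling. A minimal sketch, assuming someone simply logs the minutes spent on each task during the baseline week (the numbers are made up for the example):

```python
from statistics import median

# Hypothetical log for the baseline week: minutes per task, done the old way.
baseline_minutes = [34, 41, 28, 52, 37, 45, 30, 39]

baseline = {
    "tasks": len(baseline_minutes),
    # Total staff time, for a wage-equivalent hours metric.
    "total_hours": round(sum(baseline_minutes) / 60, 2),
    # Typical time per task, for a cycle-time metric.
    "median_minutes": median(baseline_minutes),
}
print(baseline)
```

Record whichever of these figures matches the one metric you chose, and keep the raw log so the comparison at the end of the pilot uses the same method.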

Scope it tightly and run it with a human in the loop

Fix the scope on day one and hold it. One workflow, one team, one metric, one end date. The most common way a pilot dies is scope creep - it quietly expands to cover three more tasks, the metric blurs, and after ten weeks nothing can be measured because the thing being measured kept changing.

Run it with a named person reviewing every output before it is used. The human in the loop does two jobs. They keep the blast-radius low by catching errors. And they build the judgement to know when output is good enough to trust, which is how you earn the right to remove the check. Human-in-the-loop is not a permanent tax. It is how you find out, from your own observed error rate, what is safe to automate further.
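The observed error rate is simply the share of reviewed outputs the reviewer had to reject or materially correct. A minimal sketch follows; the `Review` record and the sample data are hypothetical, and what counts as a material correction is something you would need to define for your own workflow.

```python
from dataclasses import dataclass


@dataclass
class Review:
    output_id: str
    accepted: bool  # True if the reviewer used the output without a material fix


def observed_error_rate(reviews: list[Review]) -> float:
    if not reviews:
        raise ValueError("no reviews logged, so there is no observed error rate")
    rejected = sum(1 for r in reviews if not r.accepted)
    return rejected / len(reviews)


# Hypothetical pilot log: 200 reviewed outputs, every 20th one rejected.
reviews = [Review(f"out-{i}", accepted=(i % 20 != 0)) for i in range(1, 201)]

rate = observed_error_rate(reviews)
print(f"{rate:.1%}")  # 5.0%
```

Because the reviewer sees every output during the pilot, this rate covers all the work rather than a sample, which is what makes it a sound basis for deciding which checks to relax.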

Measure honestly, then make one of three calls

At the end of the window, compare the metric against the baseline and make the decision.

Scale. The numbers cleared the bar. Extend to more people and more volume, and start removing review steps where the observed error rate justifies it.
Iterate. Promising but not there. You can usually name the one thing holding it back - the prompt, the context, the tool, or the workflow around it. Fix that and re-run a short cycle. Do not iterate forever.
Kill. It did not clear the bar and there is no obvious fix. Stop, write down what you learned, and move the effort to a better-chosen workflow. A clean kill is a good outcome, not a failure.
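The three calls can be written down as a small rule before the pilot starts, so the decision is not renegotiated at the end. This is a sketch under stated assumptions: the metric is one where lower is better (hours or cycle-time), the bar is expressed as a fractional improvement on the baseline, and `fix_identified` stands for the team's own judgement that one nameable thing is holding the result back. None of these thresholds come from the article's data.

```python
def pilot_call(
    baseline: float,
    result: float,
    target_improvement: float,
    fix_identified: bool,
) -> str:
    """Return 'scale', 'iterate' or 'kill' for a lower-is-better metric."""
    if baseline <= 0:
        raise ValueError("a positive baseline is required to judge the pilot")
    improvement = (baseline - result) / baseline
    if improvement >= target_improvement:
        return "scale"    # the numbers cleared the bar
    if fix_identified:
        return "iterate"  # promising, with one nameable thing to fix
    return "kill"         # missed the bar and no obvious fix


# Hypothetical outcomes against a 20-hour baseline and a 30% improvement bar.
print(pilot_call(20.0, 8.0, 0.30, fix_identified=False))   # scale
print(pilot_call(20.0, 17.0, 0.30, fix_identified=True))   # iterate
print(pilot_call(20.0, 19.0, 0.30, fix_identified=False))  # kill
```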

The common failure modes

Almost every dead pilot fits one of these, and none of them is “the AI couldn’t do the work”:

No single owner, so it drifts.
No metric, so success is a matter of opinion.
No baseline, so even a real win cannot be proven.
Scope creep, so nothing can be measured by the end.
No human in the loop, so trust never forms and one bad output sinks the whole thing.

The numbers back this up. Deloitte Access Economics and Amazon found in November 2025 that two-thirds of Australian SMBs use AI but only 5% are “fully AI-enabled” - most stall in exactly these shallow, unscoped trials. Meanwhile MYOB’s April 2026 data shows AI-using SMEs growing 2.8 times faster than the rest. The gap between those two facts is mostly method.

Pick one workflow. Set one metric. Capture a baseline. Keep a human in the loop. Decide within eight weeks. That is the whole game.

Frequently asked questions

How long should an AI pilot take?
Four to eight weeks. Less than four and you won't gather enough real-world output to judge it. More than eight and you've stopped piloting and started a slow, unmanaged rollout. The clock matters because a pilot is a decision-making exercise, not a destination - the deliverable is a confident scale, iterate or kill call, and that call gets harder, not easier, the longer you sit in limbo.

How do I choose which workflow to pilot first?
Score candidate workflows on three things and pick the one that's high on all three at once: volume (it happens many times a day or week, so savings compound), a clear right answer (you can tell good output from bad without a debate), and low blast-radius (if the AI gets one wrong, a human catches it before it reaches a customer or the books). High-volume, clear-answer, low-risk work - drafting, triage, data entry, first-pass summaries - is where pilots succeed. Avoid starting on rare, high-stakes, judgement-heavy tasks.

What single metric should I measure?
Pick one. For most SMBs it's wage-equivalent hours saved (hours of staff time the AI removes, valued at the loaded hourly cost of the person who used to do the work) or cycle-time (how long the task takes start to finish). Capture a baseline for that metric in the week before the pilot starts. A pilot with no baseline can't be judged - 'it feels faster' is not a result you can take to a budget decision.

Why do most AI pilots fail?
Rarely because the AI couldn't do the work. They fail on process: no single owner, no defined metric, no baseline to measure against, a scope that quietly expanded until nothing could be judged, or no human checking the output so trust never formed. The Deloitte/Amazon research found only 5% of AI-using Australian SMBs are fully AI-enabled - most stall at shallow, unscoped trials. Fix the process and the technology is rarely the constraint.

What does 'a human in the loop' actually mean?
A named person reviews the AI's output before it's used, for the whole pilot. They're doing two jobs: catching errors so the blast-radius stays low, and building the judgement to know when the output is trustworthy enough to automate further. Human-in-the-loop is not a permanent tax - it's how you earn the right to remove the check later. You decide what to automate based on the error rate you actually observed, not on a vendor's promise.
