AI settings explained: temperature, tokens and thinking effort

Behind every AI model sits a handful of dials that change how it behaves: temperature, max output tokens, top-p and, on newer models, a reasoning or thinking effort setting. They sound technical, and mostly they are someone else's job, but the ideas are simple and worth knowing.

Here is the headline first, so nobody panics: if you type into a chat box, you will probably never touch any of these, and that is completely fine. The defaults are sensible. This is a guide for when you build or tune something, not homework for everyday use.

Temperature: consistent or creative

Temperature is the dial people ask about most. It controls how random the model’s word choices are.

Low temperature - the model plays it safe and picks the most likely next word. You get consistent, predictable, repeatable answers. Ask the same question twice and you get nearly the same reply.
High temperature - the model takes more chances. You get more varied, surprising and creative output, with more risk of it wandering off track or inventing things.

On the Claude API the range is 0 to 1. Some other tools run to 2. You do not need to memorise the numbers, just the direction: down for consistency, up for variety.

The business translation is straightforward. Low for facts and structure, higher for ideas.
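For anyone who wants to see the mechanism, here is a minimal sketch of how temperature reshapes the odds between candidate words. The three words and their scores are made up for illustration, and real models apply this over tens of thousands of tokens rather than three. The sketch also requires a temperature above zero, whereas APIs typically treat 0 as "always pick the top word".

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, rescaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words.
scores = {"reliable": 2.0, "robust": 1.0, "flamboyant": 0.1}

for t in (0.2, 1.0):
    probs = softmax_with_temperature(list(scores.values()), t)
    print(t, {word: round(p, 3) for word, p in zip(scores, probs)})
```

At 0.2 almost all of the probability lands on the top-scoring word; at 1.0 the other candidates get a real chance. That is the whole "consistent versus creative" trade in miniature.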

Max output tokens: the length cap

Max output tokens is a ceiling on how long the answer can be. Tokens are the small chunks of text models work in - roughly 1.3 tokens to a word - so this is, in effect, a word limit.

It is a cap, not a target. The model can finish well short of it. But if the answer runs long and hits the limit, it gets cut off mid-sentence, which looks like the AI froze. Two reasons to set it:

Cost control. Output tokens are billed, and they are usually the expensive direction. Capping length caps spend on a busy automation.
Stopping runaways. It prevents a model from producing a 2,000-word essay when you wanted a paragraph.

In everyday chat it is set high enough that you never bump into it. It earns its keep in custom builds, where you are processing volume and want predictable length and cost.
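The arithmetic behind a cap can be sketched in a few lines. This is a rough planning aid, not a billing calculator: the 1.3 tokens-per-word figure is the approximation used above, and the price per million tokens is a placeholder, not any vendor's real rate.

```python
import math

TOKENS_PER_WORD = 1.3  # rough figure from the text; varies by model and language

def words_to_tokens(words):
    """Estimate the token cap needed for an answer of about `words` words."""
    return math.ceil(words * TOKENS_PER_WORD)

def worst_case_output_cost(max_output_tokens, calls, price_per_million_tokens):
    """Upper bound on output spend if every call runs all the way to the cap."""
    return max_output_tokens * calls * price_per_million_tokens / 1_000_000

cap = words_to_tokens(150)  # a paragraph of roughly 150 words
# Placeholder price of 10.0 per million output tokens, for illustration only.
print(cap, worst_case_output_cost(cap, calls=10_000, price_per_million_tokens=10.0))
```

Because the cap is a ceiling, actual spend will usually come in under this figure; the point is that the worst case is known in advance.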

Top-p: the other randomness dial

Top-p, sometimes called nucleus sampling, is a second control over randomness that overlaps with temperature. Instead of dialling up boldness directly, it limits how wide a pool of candidate words the model is allowed to choose from.

For business purposes you can largely ignore it, with one rule worth knowing: change temperature or top-p, not both at once. Anthropic and OpenAI both give this advice in their documentation, because the two settings pull on the same lever and adjusting them together makes the result hard to reason about. Pick temperature, leave top-p at its default, and you have one clean dial instead of two tangled ones.
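To make the "pool of candidates" idea concrete, here is a small sketch of the filtering step. The words and probabilities are invented, and the function name is ours; it illustrates the general technique rather than any vendor's actual implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely words until their combined probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, running = {}, 0.0
    for word, p in ranked:
        kept[word] = p
        running += p
        if running >= top_p:
            break
    total = sum(kept.values())
    # Rescale the survivors so their probabilities add up to 1 again.
    return {word: p / total for word, p in kept.items()}

# Hypothetical probabilities for four candidate next words.
candidates = {"reliable": 0.6, "robust": 0.25, "sturdy": 0.1, "flamboyant": 0.05}

print(top_p_filter(candidates, 0.8))  # narrow pool: only the two likeliest survive
print(top_p_filter(candidates, 1.0))  # full pool: every candidate stays
```

A lower top-p trims the unlikely words before the model chooses, which is why it overlaps with temperature: both end up deciding how often the long shots get picked.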

Reasoning effort: how hard it thinks

This is the newer and increasingly the more important dial. On the latest models you can set how hard the model thinks before it answers - often called reasoning effort or thinking effort.

Low effort is faster and cheaper. Fine for simple, well-defined tasks.
Higher effort makes the model work through a problem more carefully before replying. It helps on hard reasoning, analysis and multi-step work, at the cost of more time and more tokens.

GPT-5.5, for instance, offers levels from none through low, medium and high up to xhigh. The trade-off is the same each step up: better thinking, slower and dearer. Match it to the job - low for routine work, high only for the genuinely hard problems where a wrong answer costs more than the extra wait.

When to turn each one

For real business work, a short rule covers most of it.

Factual, structured, repeatable tasks - extracting data, classifying enquiries, drafting to a template, anything where two different answers would be a problem: low temperature.
Brainstorming and creative first drafts - naming, campaign angles, idea lists: higher temperature, where a bit of surprise is the point.
Hard reasoning and analysis on a newer model: reach for higher reasoning effort rather than fiddling with temperature.
Long or costly automations: set a sensible max output tokens so length and spend stay predictable.
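If you are building something, the rule above can live in code as a small lookup of starting points. Everything here is an assumption for illustration: the task names, the setting names and the numbers are ours, and the real parameter names differ between vendors.

```python
def suggest_settings(task):
    """Map a task type to starting values. All names and numbers are illustrative."""
    presets = {
        # Repeatable work: keep randomness low and answers short.
        "extraction": {"temperature": 0.0, "max_output_tokens": 500},
        # Idea generation: allow more variety.
        "brainstorm": {"temperature": 0.9, "max_output_tokens": 1000},
        # Hard analysis on a newer model: turn up effort, leave temperature alone.
        "analysis": {"reasoning_effort": "high", "max_output_tokens": 4000},
    }
    if task not in presets:
        raise ValueError(f"unknown task type: {task}")
    return dict(presets[task])

print(suggest_settings("extraction"))
print(suggest_settings("analysis"))
```

Treat the values as a first guess to test against real examples, not as settings to copy.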

Honest limits

A few caveats, because these dials are easy to over-think.

Most people never touch them, and should not. In a chat box you control variety through your prompt - “give me ten wild options” or “be precise, stick to the facts” - not through hidden settings. The dials are a building tool, not a daily one.
The newest models drop the old ones. Temperature, top-p and similar sampling controls are unsupported on Claude Opus 4.7 and later, and on GPT-5 reasoning models. Set them and you get an error. Those models manage their own sampling and expose reasoning effort instead - on a model that thinks before it answers, how hard it thinks is a more useful knob than how random its word choices are. So do not assume every dial exists on every model.
Settings do not fix a weak prompt. No temperature value rescues vague instructions, and no reasoning effort invents facts the model was never given. Get the prompt and context right first. The dials fine-tune a good setup; they do not save a bad one.
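For builders, the "not every dial exists on every model" caveat suggests a small guard before each request. The sketch below is hypothetical throughout: the model names are placeholders and the capability table is something you would maintain yourself from each vendor's current documentation.

```python
SAMPLING_KEYS = {"temperature", "top_p"}

# Hypothetical capability table; placeholder model names, not real ones.
MODEL_SUPPORTS_SAMPLING = {
    "example-chat-model": True,
    "example-reasoning-model": False,
}

def prepare_settings(model, settings):
    """Drop sampling controls for models that would reject them."""
    if MODEL_SUPPORTS_SAMPLING[model]:
        return dict(settings)
    return {key: value for key, value in settings.items() if key not in SAMPLING_KEYS}

requested = {"temperature": 0.2, "max_output_tokens": 500}
print(prepare_settings("example-chat-model", requested))
print(prepare_settings("example-reasoning-model", requested))
```

Stripping the unsupported settings quietly is one design choice; raising an error instead is equally reasonable if you would rather know when a configuration no longer applies.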

The one-line version

The AI dials are simple once named. Temperature sets consistent versus creative - low for facts, higher for ideas. Max output tokens caps length and cost. Top-p is a second randomness control you can leave alone; if you do adjust it, do not change temperature at the same time. Reasoning effort, on newer models, sets how hard the model thinks and is fast becoming the dial that matters. Most owners never touch any of it - this is for when you build and tune, not when you type.

Frequently asked questions

What is temperature in AI, in plain terms?: Temperature controls how random the model's word choices are. At a low temperature the model plays it safe and picks the most likely next word, so you get consistent, predictable, repeatable answers - ask the same thing twice and you get nearly the same reply. At a high temperature it takes more chances, so you get more varied, surprising and creative output, with more risk of going off track. On the Claude API the range is 0 to 1; some other tools go to 2. Low for facts and structure, higher for ideas.
What are max output tokens?: Max output tokens is a cap on how long the answer can be, measured in tokens - the small chunks of text models work in, where roughly 1.3 tokens make up a word. It is a ceiling, not a target: the model can finish well short of it, but it will be cut off mid-sentence if it hits the limit. You set it to stop runaway answers and to control cost, since output tokens are billed and are usually the expensive direction. For most chat use it is set high enough that you never notice it.
Should I change temperature for business work?: Only if you are building or tuning something, and even then sparingly. For factual, structured or repeatable tasks - extracting data, classifying enquiries, drafting to a fixed template, anything where two different answers would be a problem - keep temperature low. For brainstorming, naming, campaign ideas or first-draft creative writing, a higher temperature gives you more range to choose from. If you are just typing into a chat box, leave it alone; the defaults are sensible and you control variety through your prompt instead.
What is reasoning or thinking effort?: On newer models you can set how hard the model thinks before it answers, often called reasoning effort or thinking effort. Low effort is faster and cheaper and is fine for simple tasks. Higher effort makes the model work through a problem more carefully before replying, which helps on hard reasoning, analysis and multi-step work, at the cost of more time and more tokens. GPT-5.5, for example, offers levels from none through to xhigh. On the latest models this dial is increasingly replacing temperature as the main thing worth tuning.
Why can't I change temperature on the newest models?: Because the latest reasoning models drop it. Temperature, top-p and similar sampling controls are unsupported on Claude Opus 4.7 and later, and on GPT-5 reasoning models, and setting them returns an error. These models manage their own internal sampling and expose reasoning effort instead as the dial that matters. It is a deliberate shift: on a model that thinks before it answers, how hard it thinks is a more useful control than how random its word choices are. So on the newest models, reach for reasoning effort, not temperature.
