gpt-5-nano · reasoning effort
Same model and task; only the reasoning-effort setting changes. It shows how much long-context capability gpt-5-nano gains from more reasoning, from minimal (where it fails even the easiest rung) up to high.
Score by level
Each level holds the model and the test fixed. Only reasoning effort changes. A higher score is better, and a tighter confidence interval means the score is pinned down more precisely.
Reasoning effort gates how far gpt-5-nano gets on long context
At minimal reasoning, gpt-5-nano can't hold even the easiest rung. Turn reasoning up and real capability appears, climbing from nothing to a ceiling around H6 @ 8K, then flattening out: medium and high hold the same rung and their score intervals overlap. So reasoning effort is a genuine lever for nano on long-context work, with sharply diminishing returns past medium.
What we compared
Same model and the same task throughout: gpt-5-nano on state tracking, where it has to follow a chain of references through a long context. The ladder, the 0 to 100 scoring, and exact pass-rate marking were all held fixed. The only thing we changed was the reasoning-effort setting: minimal, low, medium, high. (The medium arm reuses the trials we already had on file, so it cost nothing to add.)
| Reasoning effort | Score, 0 to 100 (CI) | Holds through |
|---|---|---|
| minimal | 0.0 (floored) | fails even 2K |
| low | 22.7 (22.0 to 24.7) | H4 @ 8K |
| medium | 29.6 (27.6 to 33.6) | H6 @ 8K |
| high | 33.6 (29.6 to 37.5) | H6 @ 8K |
A note on the endpoints: through the OpenAI Chat Completions API we use, "off" behaves the same as minimal (gpt-5 always reasons a little), and "xhigh" behaves the same as high (xhigh is only exposed on the Responses API, which we don't call yet). So those two settings aren't separately measurable here, and minimal and high are the real floor and ceiling of this sweep.
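The endpoint collapse described in this note can be written down as a small lookup. This is only a sketch: `effective_effort`, `MEASURED`, and `ALIASES` are hypothetical names introduced for illustration, and the mapping encodes just what this report observed on its own API path, not anything taken from OpenAI's documentation.

```python
# Hypothetical helper: maps a requested reasoning-effort label to the
# setting that is actually distinguishable in this sweep.
MEASURED = ("minimal", "low", "medium", "high")

# Assumption taken from the note above: on the API path used here,
# "off" behaves like minimal and "xhigh" behaves like high.
ALIASES = {"off": "minimal", "xhigh": "high"}


def effective_effort(requested: str) -> str:
    """Return the measured setting that a requested label collapses to."""
    label = requested.strip().lower()
    label = ALIASES.get(label, label)
    if label not in MEASURED:
        raise ValueError(f"unknown reasoning effort: {requested!r}")
    return label


if __name__ == "__main__":
    for name in ("off", "minimal", "low", "medium", "high", "xhigh"):
        print(f"{name:>7} -> {effective_effort(name)}")
```

Six requested labels collapse to four measured ones, which is why the table has four rows.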
Reading the numbers
At minimal effort nano is floored: it passes only about 1 in 8 on the easiest 2K rung and nothing above that. With almost no reasoning budget it simply can't track state across the context. Low effort gets it onto the ladder, holding H4 @ 8K. Medium takes it to H6 @ 8K, which matches nano's reference number. High holds the same rung and does a little better at 16K, but the gain over medium is small and the two confidence intervals overlap.
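The overlap claim can be checked directly from the table. The sketch below copies the three scored rows and tests whether two intervals share any value; `RESULTS` and `intervals_overlap` are hypothetical names. Interval overlap is a rough reading aid here, not a formal significance test.

```python
# Scores and confidence intervals copied from the table in this report:
# setting -> (score, interval low, interval high).
RESULTS = {
    "low": (22.7, 22.0, 24.7),
    "medium": (29.6, 27.6, 33.6),
    "high": (33.6, 29.6, 37.5),
}


def intervals_overlap(a: str, b: str) -> bool:
    """True if the confidence intervals of settings a and b share any value."""
    _, a_lo, a_hi = RESULTS[a]
    _, b_lo, b_hi = RESULTS[b]
    return a_lo <= b_hi and b_lo <= a_hi


if __name__ == "__main__":
    for a, b in (("low", "medium"), ("medium", "high")):
        print(a, "vs", b, "overlap:", intervals_overlap(a, b))
```

With these numbers, low and medium are cleanly separated while medium and high overlap, which is the pattern the paragraph above describes.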
What it means for picking a setting
The useful range is minimal to medium: that's where almost all the capability appears, going from "can't do it" to a real H6 @ 8K ceiling. Past medium the curve flattens, so for nano on this kind of work, medium is the sensible default and pushing to high (or, once supported, xhigh) buys very little.
Method note
Scores come from a difficulty ladder where reference hops and context length rise together, and a model has to hit a 90% pass rate to hold a rung. Only the reasoning-effort setting varies across the table; the model, task, ladder, and scoring are identical. This is one task type under one prompt setup and does not speak to other benchmarks.
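A minimal sketch of the rung-holding rule, under stated assumptions: the 90% threshold comes from this note, but the function name, the rung labels, and the rule that the walk stops at the first missed rung are illustrative guesses, and the pass rates in the example are made up rather than measured. It does not reproduce how the 0 to 100 score is derived.

```python
PASS_THRESHOLD = 0.90  # from the method note: 90% pass rate to hold a rung


def highest_rung_held(pass_rates):
    """Return the label of the last rung held before the first miss, or None.

    pass_rates: list of (label, pass_rate) pairs ordered easiest to hardest.
    Assumption: a model "holds through" a rung only if it also held every
    easier rung, so the walk stops at the first miss.
    """
    held = None
    for label, rate in pass_rates:
        if rate < PASS_THRESHOLD:
            break
        held = label
    return held


if __name__ == "__main__":
    # Made-up pass rates for illustration only; labels are placeholders.
    example = [("rung-1", 1.00), ("rung-2", 0.95), ("rung-3", 0.90), ("rung-4", 0.40)]
    print(highest_rung_held(example))

    # A floored run: about 1 in 8 on the easiest rung holds nothing.
    print(highest_rung_held([("rung-1", 0.125), ("rung-2", 0.0)]))
```

Under this rule, a run that passes about 1 in 8 on the easiest rung holds no rung at all, which is how the minimal setting ends up floored.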