The GPT-5 family, ranked by capability
How the GPT-5 variants compare on the capability ceiling for long-context code reasoning, from nano up to the full model and the Codex variant. Same instrument and scoring; only the model changes.
Score by level
Each level holds the test fixed; only the model changes. A higher score is better, and a tighter confidence interval is better still.
How the GPT-5 family compares on long-context reasoning
Bigger models hold up better, and the jump is large: nano scores 33.6, mini 66.0, and the full model 73.2. The order isn't perfectly clean, though. Codex lands at 62.8, below mini, even though it's the code-specialized variant, and the full model has an odd dip in the middle of its curve.
What we compared
We measured four models in the GPT-5 family on the same task: gpt-5-nano, gpt-5-mini, gpt-5, and gpt-5-codex. The task is state tracking, where the model has to follow a chain of references through a long context. The only thing we set out to vary is the model. Everything else was held fixed: the same difficulty ladder, the same 0 to 100 scoring, exact pass-rate marking, and a reasoning effort of medium on every model. The one thing that wasn't fully controlled is the route: Codex is only reachable for us through OpenRouter, while the other three run on the OpenAI direct API, and through that route Codex's usable context tops out around 256K rather than the full model's 400K.
| Model | Score (CI) | Holds through | Breaks at |
|---|---|---|---|
| gpt-5 | 73.2 (63.4 to 73.8) | H24 @ 200K | H32 @ 350K |
| gpt-5-mini | 66.0 (63.4 to 67.1) | H18 @ 128K | H24 @ 200K |
| gpt-5-codex | 62.8 (58.5 to 66.4) | H14 @ 64K | H18 @ 128K |
| gpt-5-nano | 33.6 (29.6 to 36.6) | H6 @ 8K | H8 @ 16K |
The big story is the climb from nano to the full model: 33.6, then 66.0, then 73.2, about a 40-point spread. All four have a confirmed ceiling, so each number is bracketed on both sides.
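The spread and the interval widths can be checked directly from the table. The sketch below is only arithmetic on the published numbers; the names `RESULTS`, `spread`, and `ci_width` are illustrative and not part of the original instrument.

```python
# Score, CI lower bound, CI upper bound, copied from the table above.
RESULTS = {
    "gpt-5": (73.2, 63.4, 73.8),
    "gpt-5-mini": (66.0, 63.4, 67.1),
    "gpt-5-codex": (62.8, 58.5, 66.4),
    "gpt-5-nano": (33.6, 29.6, 36.6),
}


def spread(results):
    """Gap in points between the highest and lowest score."""
    scores = [score for score, _, _ in results.values()]
    return round(max(scores) - min(scores), 1)


def ci_width(results, model):
    """Width in points of one model's confidence interval."""
    _, low, high = results[model]
    return round(high - low, 1)


if __name__ == "__main__":
    print("spread:", spread(RESULTS))
    for name in RESULTS:
        print(name, "interval width:", ci_width(RESULTS, name))
```

The spread comes out at 39.6 points, which is the "about 40" quoted above.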
Reading the numbers
Nano falls off early and steeply. It passes H6 @ 8K cleanly, drops to about 19% at H10 @ 32K, and scores zero above that. Its interval (29.6 to 36.6) is wide relative to its score, but the ceiling is not in doubt.
The full GPT-5 has a wrinkle. It passes the harder H24 @ 200K at 100% but only 87.5% at the easier H18 @ 128K just before it. That 87.5% sits just under the 90% hold threshold, so the ceiling placement at H24 @ 200K rests on the clean pass at the harder rung, and the dip is real. Its interval is also wide (63.4 to 73.8), so its edge over mini is not firmly established.
Codex degrades gracefully but earlier than mini. It holds H14 @ 64K cleanly, slips to 87.5% at H18 @ 128K, and is down to about 50% by H24 @ 200K. Its OpenRouter route can't take the 350K rung at all. On this long-context reasoning task it therefore sits just below mini, despite being the code-specialized model. If your work is tracking state across a long context rather than writing code, this comparison does not point to Codex.
What it means for picking a model
Going from nano to mini is the big practical step: about 32 points, and usable context stretches from 8K to 128K. The full model adds about 7 more points over mini, but the intervals overlap, so the gap between them is soft. Codex comes in just under mini here, with a smaller usable window through the route we have.
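To make "the intervals overlap" concrete, here is a rough check on the published intervals. It is a sketch, not a significance test: overlapping intervals only suggest that a gap is not firmly established. The names `CI` and `intervals_overlap` are illustrative.

```python
# Confidence intervals (low, high) copied from the results table.
CI = {
    "gpt-5": (63.4, 73.8),
    "gpt-5-mini": (63.4, 67.1),
    "gpt-5-codex": (58.5, 66.4),
    "gpt-5-nano": (29.6, 36.6),
}


def intervals_overlap(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Compare each model with the next one down the ranking.
    ranked = ["gpt-5", "gpt-5-mini", "gpt-5-codex", "gpt-5-nano"]
    for upper, lower in zip(ranked, ranked[1:]):
        print(upper, "vs", lower, "overlap:", intervals_overlap(CI[upper], CI[lower]))
```

On these numbers the full model overlaps mini and mini overlaps Codex, while nano is clearly separated from the rest.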
Method note
Scores come from a difficulty ladder where reference hops and context length rise together, and a model has to hit a 90% pass rate to hold a rung. Every model ran at medium reasoning effort with exact scoring, so the only thing moving across the table is the model itself (and, for Codex, the API route). This is one task type under one prompt setup, and it does not speak to other reasoning or coding benchmarks.
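As an illustration of the hold rule, the sketch below walks up a ladder and reports the last rung held and the first rung that falls below 90%. It uses the Codex pass rates quoted in the text, reading "cleanly" as 100% and "about 50%" as 0.50, both of which are assumptions. The function name `holds_and_breaks` is hypothetical, and the first-failure rule is a simplification, not the instrument's actual placement logic.

```python
HOLD_THRESHOLD = 0.90  # pass rate needed to hold a rung, per the method note

# Codex pass rates as quoted in the text. "Cleanly" is read as 1.0 and
# "about 50%" as 0.50; both are assumptions for this sketch.
CODEX_LADDER = [
    ("H14 @ 64K", 1.0),
    ("H18 @ 128K", 0.875),
    ("H24 @ 200K", 0.50),
]


def holds_and_breaks(ladder, threshold=HOLD_THRESHOLD):
    """Return (last rung held, first rung failed), walking up the ladder.

    Stops at the first rung whose pass rate is below the threshold.
    """
    held = None
    for rung, pass_rate in ladder:
        if pass_rate < threshold:
            return held, rung
        held = rung
    return held, None


if __name__ == "__main__":
    print(holds_and_breaks(CODEX_LADDER))
```

For Codex this reproduces the table row: holds through H14 @ 64K, breaks at H18 @ 128K. The rule does not model how the real instrument treats a non-monotonic curve such as the full model's dip at H18 @ 128K.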