varies: model 4 levels

The GPT-5 family, ranked by capability

How the GPT-5 variants compare on the capability ceiling for long-context code reasoning, from nano up to the full model and the Codex variant. Same instrument and scoring; only the model changes.

Held constant

query mode: multihop
reasoning effort: medium
reliability thresholds: sustain ≥ 90%, break < 80%
scoring: exact pass-rate
task: state tracking (T5)

Not held constant: model (gpt-5, gpt-5-codex, gpt-5-mini, gpt-5-nano); provider (openai, openrouter).

Score by level

Each level is one model run on the same test; only the model changes between levels. A higher score is better, and a tighter confidence interval is better still.

gpt-5

ceiling found

73.2/100

CI 63.4 to 73.8

[Chart axis labels: H10 @ 32K, H14 @ 64K, H18 @ 128K, H24 @ 200K, H32 @ 350K, H40 @ 524K, H52 @ 786K, H64 @ 1049K]

sustains H24 · 200K; breaks H32 · 350K; sharp cliff

gpt-5-mini

ceiling found

66.0/100

CI 63.4 to 67.1

[Chart axis labels: H10 @ 32K, H14 @ 64K, H18 @ 128K, H24 @ 200K, H32 @ 350K, H40 @ 524K, H52 @ 786K, H64 @ 1049K]

sustains H18 · 128K; breaks H24 · 200K; sharp cliff

gpt-5-codex

ceiling found

62.8/100

CI 58.5 to 66.4

[Chart axis labels: 16K, H10 @ 32K, H14 @ 64K, H18 @ 128K, H24 @ 200K, H32 @ 350K, H40 @ 524K, H52 @ 786K, H64 @ 1049K]

sustains H14 · 64K; breaks H18 · 128K; moderate

gpt-5-nano

ceiling found

33.6/100

CI 29.6 to 36.6

[Chart axis labels, partially recovered: 16K, H10 @ 32K, H40 @ 524K, H52 @ 786K, H64 @ 1049K]

sustains H6 · 8K; breaks H8 · 16K; moderate

Analysis generated by anthropic/claude-sonnet-4.6 · v1

How the GPT-5 family compares on long-context reasoning

Bigger models hold up better, and the jump is large: nano scores 33.6, mini 66.0, and the full model 73.2. The order isn't perfectly clean, though. Codex lands at 62.8, below mini, even though it's the code-specialized variant, and the full model has an odd dip in the middle of its curve.

What we compared

We measured four models in the GPT-5 family on the same task: gpt-5-nano, gpt-5-mini, gpt-5, and gpt-5-codex. The task is state tracking, where the model has to follow a chain of references through a long context. The only thing we set out to vary is the model. Everything else was held fixed: the same difficulty ladder, the same 0 to 100 scoring, exact pass-rate marking, and medium reasoning effort on every model. The one thing not fully controlled is the route. Codex is only reachable for us through OpenRouter, while the other three run on the OpenAI direct API. Through that route, Codex's usable context tops out around 256K, compared with 400K for the full model.

Model	Score (CI)	Holds through	Breaks at
gpt-5	73.2 (63.4 to 73.8)	H24 @ 200K	H32 @ 350K
gpt-5-mini	66.0 (63.4 to 67.1)	H18 @ 128K	H24 @ 200K
gpt-5-codex	62.8 (58.5 to 66.4)	H14 @ 64K	H18 @ 128K
gpt-5-nano	33.6 (29.6 to 36.6)	H6 @ 8K	H8 @ 16K

The big story is the climb from nano to the full model: 33.6, then 66.0, then 73.2, about a 40-point spread. All four have a confirmed ceiling, so each number is bracketed on both sides.
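The gaps behind that spread are simple subtraction. As a small sketch, using only the three scores quoted above (the variable names are illustrative, not part of the benchmark tooling):

```python
# Scores as reported in the table above (0 to 100 scale).
scores = {"gpt-5-nano": 33.6, "gpt-5-mini": 66.0, "gpt-5": 73.2}

# Round to one decimal place to avoid floating-point noise in the differences.
nano_to_mini = round(scores["gpt-5-mini"] - scores["gpt-5-nano"], 1)
mini_to_full = round(scores["gpt-5"] - scores["gpt-5-mini"], 1)
total_spread = round(scores["gpt-5"] - scores["gpt-5-nano"], 1)

print(f"nano -> mini: {nano_to_mini} points")
print(f"mini -> full: {mini_to_full} points")
print(f"nano -> full: {total_spread} points")
```

Most of the spread comes from the first step; the second step is much smaller.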

Reading the numbers

Nano falls off early and steeply. It passes H6 @ 8K cleanly, drops to about 19% at H10 @ 32K, and scores zero above that. Its interval, 29.6 to 36.6, is wide relative to its score (gpt-5's is the widest in absolute terms), but the ceiling is not in doubt.

The full GPT-5 has a wrinkle. It passes the harder H24 @ 200K at 100% but manages only 87.5% at the easier H18 @ 128K just before it. 87.5% sits below the 90% sustain threshold but above the 80% break threshold, so it does not count as a break and the ceiling placement holds. The dip is real, though. Its interval is also wide (63.4 to 73.8), so its edge over mini is not firmly established.
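To make the threshold reading concrete, here is a minimal sketch that sorts a single rung's pass rate using the two reliability thresholds listed under "Held constant" (sustain ≥ 90%, break < 80%). The function name and the "in-between" label for the 80 to 90% band are assumptions made for illustration; the report does not name that band or publish its scoring code.

```python
SUSTAIN_MIN = 90.0  # from the report: a rung is sustained at a pass rate >= 90%
BREAK_MAX = 80.0    # from the report: a rung breaks at a pass rate < 80%


def classify_rung(pass_rate: float) -> str:
    """Classify one rung's pass rate (in percent) against the two thresholds."""
    if pass_rate >= SUSTAIN_MIN:
        return "sustain"
    if pass_rate < BREAK_MAX:
        return "break"
    # Assumed label: the report does not name the 80-90% band.
    return "in-between"


for rate in (100.0, 87.5, 50.0):
    print(f"{rate:5.1f}% -> {classify_rung(rate)}")
```

Under this reading, 87.5% neither sustains nor breaks a rung, while 100% sustains and 50% breaks.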

Codex degrades gracefully but earlier than mini. It holds H14 @ 64K cleanly, slips to 87.5% at H18 @ 128K, and is down to about 50% by H24 @ 200K. Its OpenRouter route can't take the H32 @ 350K rung at all. So on this long-context reasoning task it sits just below mini, despite being the code-specialized model. If your work is tracking state across a long context rather than writing code, this comparison does not point to Codex.

What it means for picking a model

Going from nano to mini is the big practical step: about 32 points, and usable context stretches from 8K to 128K. The full model adds about 7 more points over mini, but the intervals overlap, so the gap between them is soft. Codex comes in just under mini here, with a smaller usable window through the route we have.
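The overlap claim can be checked directly from the reported intervals. This is a rough sketch: the bounds are copied from the table above, the helper name is illustrative, and interval overlap is only a coarse indicator, not a formal significance test.

```python
# (low, high) confidence bounds as reported in the table above.
intervals = {
    "gpt-5": (63.4, 73.8),
    "gpt-5-mini": (63.4, 67.1),
    "gpt-5-codex": (58.5, 66.4),
    "gpt-5-nano": (29.6, 36.6),
}


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Return True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


pairs = [
    ("gpt-5", "gpt-5-mini"),
    ("gpt-5-mini", "gpt-5-codex"),
    ("gpt-5-mini", "gpt-5-nano"),
]
for left, right in pairs:
    verdict = "overlap" if overlaps(intervals[left], intervals[right]) else "do not overlap"
    print(f"{left} vs {right}: {verdict}")
```

The gpt-5 and mini intervals overlap, as do mini and Codex, while nano's interval sits well clear of the others.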

Method note

Scores come from a difficulty ladder where reference hops and context length rise together, and a model has to hit a 90% pass rate to hold a rung. Every model ran at medium reasoning effort with exact scoring, so the only thing moving across the table is the model itself (and, for Codex, the API route). This is one task type under one prompt setup, and it does not speak to other reasoning or coding benchmarks.