The Premise

Advertised windows describe capacity. They do not describe capability.

A model that advertises a 400K-token window can read 400K tokens. Whether it can actually reason over them is a different question, and that is the one that matters when the output ships. Hard work means tracing dependencies across a long file and pulling together facts that sit far apart. In practice the range where a model holds a difficult task is much smaller than the range where it can simply read, the gap is different for every model, and nothing on a spec sheet tells you where it falls.

Project Peak measures this directly. For each model we climb a ladder where the task gets harder and the context gets longer at the same time, and we find the hardest step the model still gets right about 90% of the time. We grade it against how it does on the easy steps, so the result describes the model and not the test, and we report it as a 0 to 100 score with a confidence interval, on synthetic code tasks that are scored exactly.

The goal is narrow and concrete. We want a defensible per-model number for how a model does on hard, long-context code work, reported with error bars and a falloff shape, that you can route, budget, and compare against. It is a measured number with a confidence interval rather than a vibe or a one-off demo.