The cleanest number in AI is usually the least useful one: one model, one benchmark, one score. A new arXiv paper argues that this habit is making the field underestimate what large language model systems can actually do in production.
The paper, “The Capability Frontier: Benchmarks Miss 82% of Model Performance,” looks at 21 LLMs across 16 widely used benchmarks covering coding, reasoning, medicine, factuality, instruction following, and agentic tasks. Its core claim is sharp: if you only score a single model on a single run, you miss the performance available from choosing among models and sampling more than one answer when budget allows.
That matters because real AI systems rarely look like leaderboard entries. A support bot, coding agent, analyst copilot, or medical summarization tool may route different prompts to different models, retry hard cases, ask for multiple generations, or keep the cheapest answer that clears a quality bar. The authors call the best achievable cost-performance curve across those choices the “Capability Frontier.” In their analysis, correcting for single-model evaluation reduced error rates by 54%. Correcting for both single-model and single-run evaluation raised the improvement to 82%, and the resulting frontier matched state-of-the-art accuracy at 85% lower cost.
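To make the idea of a cost-performance curve concrete, here is a minimal sketch of how one might compute such a frontier from a set of evaluated configurations. It is an illustration, not the paper's method: the `Config` type, the configuration names, and every cost and accuracy figure are invented for this example.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    cost: float      # hypothetical cost per 1,000 queries (made-up numbers)
    accuracy: float  # hypothetical fraction of benchmark items answered correctly


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep only configs that no cheaper-or-equal config matches or beats on accuracy."""
    frontier: List[Config] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to priciest; on a cost tie, look at the more accurate one first.
    for cfg in sorted(configs, key=lambda c: (c.cost, -c.accuracy)):
        if cfg.accuracy > best_accuracy:
            frontier.append(cfg)
            best_accuracy = cfg.accuracy
    return frontier


# Illustrative configurations: single runs, multi-sample voting, and a router.
configs = [
    Config("small x1", 1.0, 0.62),
    Config("small x5 vote", 5.0, 0.70),
    Config("large x1", 8.0, 0.68),
    Config("router small/large", 4.0, 0.74),
    Config("large x5 vote", 40.0, 0.80),
]

frontier = pareto_frontier(configs)
for cfg in frontier:
    print(f"{cfg.name}: cost={cfg.cost}, accuracy={cfg.accuracy}")
```

In this toy data, the single-run "large x1" entry that a leaderboard would report does not sit on the frontier at all, because the hypothetical router is both cheaper and more accurate.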
The practical takeaway is not “leaderboards are fake.” It is that leaderboards answer a narrower question than many buyers and builders think. They tell you how one model performed under one evaluation setup. They do not tell you what a carefully assembled model portfolio can do when the task mix is messy, the budget is explicit, and different models fail on different questions.
For Daily AI Paper readers, this is especially relevant because model routing has moved from infrastructure trivia to product strategy. The cheapest reliable AI stack may not be a single flagship model. It may be a router, a few specialized models, a retry policy, and a measurement loop that knows when extra tokens actually buy better answers.
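One simple shape such a stack can take is a cascade: try the cheapest model first and escalate only when its answer fails a quality bar. The sketch below assumes each model returns a confidence score alongside its answer, which is a strong assumption in practice; the `Model` class, the toy models, and the cost units are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Model:
    name: str
    cost: float  # hypothetical cost units per call
    answer: Callable[[str], Tuple[str, float]]  # returns (text, confidence in [0, 1])


def cascade(prompt: str, models: List[Model], quality_bar: float) -> Tuple[str, str, float]:
    """Call models from cheapest to priciest; stop at the first answer that clears the bar.

    Returns (answer text, name of the model that produced it, total cost spent).
    If no model clears the bar, the most confident answer seen is returned.
    """
    spent = 0.0
    best_text, best_conf, best_name = "", -1.0, ""
    for model in sorted(models, key=lambda m: m.cost):
        text, confidence = model.answer(prompt)
        spent += model.cost  # every attempt is paid for, including failed ones
        if confidence > best_conf:
            best_text, best_conf, best_name = text, confidence, model.name
        if confidence >= quality_bar:
            break
    return best_text, best_name, spent


# Toy stand-ins for real models: the small one is only confident on short prompts.
small = Model("small", 1.0, lambda p: ("small answer", 0.9 if len(p) < 20 else 0.4))
large = Model("large", 20.0, lambda p: ("large answer", 0.95))

easy = cascade("2 + 2?", [large, small], quality_bar=0.8)
hard = cascade("Summarize this long contract clause by clause", [large, small], quality_bar=0.8)
print(easy)  # the small model's answer is accepted
print(hard)  # escalates, so the cost of both calls is paid
```

The measurement loop mentioned above corresponds to tracking `spent` against answer quality over real traffic, since a cascade only saves money if the cheap model clears the bar often enough to offset the wasted calls.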
There is also a warning here. The paper’s frontier is partly oracle-based: it measures what is theoretically available when the system can select the right model or generation. Production teams still need signals that approximate that choice before the answer is known. Without that, extra sampling can become an expensive ritual, and routing can become a fancy way to move errors around.
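The gap between an oracle and a realizable selector is easy to see in a toy example. The sketch below compares oracle accuracy (a question counts as solved if any sample is correct) with majority voting, one common selector that needs no ground truth. The sample answers and truths are invented for illustration and are not drawn from the paper.

```python
from collections import Counter
from typing import List


def oracle_accuracy(samples: List[List[str]], truths: List[str]) -> float:
    """Fraction of questions where at least one sampled answer is correct."""
    return sum(truth in answers for answers, truth in zip(samples, truths)) / len(truths)


def majority_vote_accuracy(samples: List[List[str]], truths: List[str]) -> float:
    """Fraction of questions where the most frequent sampled answer is correct."""
    hits = 0
    for answers, truth in zip(samples, truths):
        top_answer, _count = Counter(answers).most_common(1)[0]
        hits += top_answer == truth
    return hits / len(truths)


# Three samples per question (hypothetical data).
samples = [
    ["42", "42", "41"],        # majority is correct
    ["7", "9", "9"],           # the correct answer was sampled but outvoted
    ["blue", "blue", "blue"],  # the correct answer was never sampled
    ["x", "y", "x"],           # majority is correct
]
truths = ["42", "7", "green", "x"]

oracle = oracle_accuracy(samples, truths)
voted = majority_vote_accuracy(samples, truths)
print(f"oracle: {oracle:.2f}, majority vote: {voted:.2f}")
```

The second question is the interesting case: the right answer exists among the samples, but a selector without access to the truth picks the wrong one. The difference between the two scores is the part of the oracle frontier that a production system still has to earn with a better selection signal.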
The bigger point is that AI evaluation is becoming a systems problem. If your product uses one model in one way, a standard benchmark may be enough to shortlist options. If your product uses multiple models, retries, graders, agents, or tool calls, you need to evaluate the whole operating pattern. The frontier is no longer just which model is smartest. It is which system turns a dollar of inference into the most dependable work.