Latency vs benchmark score: the tradeoff nobody talks about

2026-05-25 · Avery NXR

There is a strange disconnect at the center of the AI dev tools industry.

The benchmarks measure one thing. The users experience another.

Benchmarks score a model on a single completion. Given a prompt, how good is the answer? HumanEval, MMLU, SWE-bench — all of them assume that the unit of work is one input and one output, and the question is how often the output is correct.

Developers do not write code in single completions. They write code in sessions. A real coding session is dozens of prompts, each followed by reading the output, deciding what to change, and prompting again. The output of any single prompt matters less than the rhythm of the whole loop.

Once you start measuring the loop, the math gets interesting.

The loop math

A realistic scaffolding session looks like this. The developer writes a prompt. The model responds. The developer reads the response, finds something to adjust, writes a follow-up. They repeat until the output is good enough to commit.

Empirically, "good enough to commit" takes seven to twelve prompts. Fewer for trivial work, more for ambitious work. The shape of the distribution is similar across users and across tasks.

Now consider two models. Model A is a frontier cloud LLM. Per-prompt quality is high. Per-prompt latency is two seconds — the round trip to the data center plus the time to generate. Model B is a small local model. Per-prompt quality is somewhat lower. Per-prompt latency is two hundred milliseconds.

If a session takes eight prompts on Model A and ten prompts on Model B — Model A is slightly smarter, so it needs fewer rounds — the total time-in-loop is:

Model A: 8 × 2s = 16 seconds of waiting
Model B: 10 × 0.2s = 2 seconds of waiting

Model B is "less smart" on a per-prompt basis. Model B is eight times faster on a per-session basis, counting only the time spent waiting. The session is what the developer actually experiences.
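The arithmetic above can be checked in a few lines of Python. This is a minimal sketch using only the numbers from this post. The function names are ours, and it assumes every prompt in a session costs the same latency, which real sessions will not.

```python
def total_wait_seconds(prompts: int, latency_s: float) -> float:
    """Total time spent waiting on the model across one session."""
    return prompts * latency_s


def break_even_prompts(reference_wait_s: float, latency_s: float) -> float:
    """Prompts a model could take before its total wait matches a reference wait."""
    return reference_wait_s / latency_s


# Numbers taken from the example in the text.
model_a_wait = total_wait_seconds(prompts=8, latency_s=2.0)   # 16 seconds
model_b_wait = total_wait_seconds(prompts=10, latency_s=0.2)  # 2 seconds

wait_ratio = model_a_wait / model_b_wait
extra_rounds_budget = break_even_prompts(model_a_wait, latency_s=0.2)

print(f"Model A waits {model_a_wait:.1f}s, Model B waits {model_b_wait:.1f}s")
print(f"Model A waits {wait_ratio:.0f}x longer per session")
print(f"Model B matches Model A's wait only after {extra_rounds_budget:.0f} prompts")
```

The break-even figure is the useful one: it shows how many extra rounds the lower-quality model can afford before its speed advantage is gone.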

The cognitive cost of latency

The math above only counts wall-clock time. It does not count what is arguably the bigger cost: the cognitive interruption of every wait.

There is a well-established threshold in human-computer interaction research at about one hundred milliseconds. Below that threshold, a response feels instantaneous. Above it, the user becomes aware that they are waiting, even if only barely. As the wait grows past one second, the user's attention starts to drift. Past three seconds, they often switch tasks entirely.

A two-second cloud response sits in the worst region of that curve. Long enough that the user notices. Short enough that they don't get a proper break. Just long enough to lose flow without compensating with rest.

A two-hundred-millisecond local response sits just above that threshold, close enough to "instant" that the wait barely registers and far below the one-second mark where attention drifts. The session stays in flow.

This is why the sixteen-second-versus-two-second comparison above understates the real difference. The developer using Model B is not just saving fourteen seconds. They are staying in a productive cognitive state through all ten iterations, something the developer waiting on Model A cannot do.
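The thresholds described in this section can be written down as a small classifier. This is a sketch, not a model of human perception: the band edges (0.1, 1 and 3 seconds) are taken literally from the text above, the band labels and the function name are ours, and real attention degrades gradually instead of at hard cutoffs.

```python
def perceived_wait(latency_s: float) -> str:
    """Map a response latency to the perception band described in the text."""
    if latency_s < 0:
        raise ValueError("latency cannot be negative")
    if latency_s < 0.1:
        return "instant"             # below ~100 ms: no perceived wait
    if latency_s <= 1.0:
        return "noticeable"          # the user is aware of waiting
    if latency_s <= 3.0:
        return "attention drifts"    # long enough to lose flow
    return "likely task switch"      # the user often moves to something else


for latency in (0.05, 2.0, 5.0):
    print(f"{latency:>4.2f}s -> {perceived_wait(latency)}")
```

Under these cutoffs, the two-second cloud response lands in the band where attention drifts on every single prompt in the session.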

Where benchmarks still matter

We do not want to overstate the case. Benchmarks matter. A model that is wrong on a single completion is going to make a developer write more prompts to fix it. There is a quality floor below which no reduction in latency will save you. If every output is unusable, no number of fast outputs adds up to a working application.

The right framing is not "latency beats quality." It is "below a quality floor, more quality matters more than less latency. Above that floor, less latency matters more than more quality."

The question is where the floor is. We think it is well below the frontier. A model that gets eighty percent of the answer right on a focused, narrow domain — like scaffolding Next.js applications — clears the floor with room to spare. Above the floor, the workflow speed dominates.

What this changes about how we choose models

The industry-default model picker is "use the best model that benchmarks allow." That works when you are doing one-off completions — answering a question, summarizing a document, writing a one-shot piece of code.

It is the wrong picker when the unit of work is a session.

The right picker is "use the smallest model that clears the quality floor, and minimize latency from there." Sometimes that is a frontier cloud LLM, when the task is open-ended enough to need the breadth. Often, when the task is narrow and the iteration loop matters, it is a small local model.
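As a sketch, that picker is a filter followed by a minimum. Everything here is illustrative: the `Candidate` type, the `quality` score (a task-specific success rate between 0 and 1), and the example numbers are assumptions made for the example, not measurements. Latency stands in for "smallest," on the assumption that a smaller local model is also the faster one.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float     # hypothetical task-specific success rate, 0..1
    latency_s: float   # typical per-prompt latency in seconds


def pick_model(candidates: Sequence[Candidate],
               quality_floor: float) -> Optional[Candidate]:
    """Return the lowest-latency candidate that clears the quality floor."""
    eligible = [c for c in candidates if c.quality >= quality_floor]
    if not eligible:
        return None  # nothing clears the floor, so speed cannot help
    # Among eligible models, prefer low latency; break ties on higher quality.
    return min(eligible, key=lambda c: (c.latency_s, -c.quality))


# Illustrative numbers only.
candidates = [
    Candidate("frontier-cloud", quality=0.92, latency_s=2.0),
    Candidate("small-local", quality=0.80, latency_s=0.2),
]

narrow_task = pick_model(candidates, quality_floor=0.75)
open_ended_task = pick_model(candidates, quality_floor=0.90)

print(narrow_task.name)       # small-local
print(open_ended_task.name)   # frontier-cloud
```

The same function returns different models for different floors, which is the point: the floor depends on the task, and the choice follows from it.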

Avery NXR is built around this picker. The local SLM clears the quality floor for the specific job of scaffolding Next.js applications. From there, we minimized the latency by putting the model on the developer's machine. The session experience is the result.

The thing we hope happens

We expect benchmarks for AI dev tools to evolve in the next year or two. The first generation of benchmarks scored single completions. The next generation will need to score sessions.

A session benchmark would measure: how long does it take a developer to get from prompt to shipped code? How many prompts do they need? How long is the total wait time? How often do they abandon the session entirely?
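To make those questions concrete, here is one way the raw data and summary could look. No such benchmark exists yet, so this is a hypothetical sketch: the `Session` fields, the `summarize` function, and the sample sessions are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Session:
    prompts: int        # prompts the developer sent
    wait_s: float       # total time spent waiting on the model
    duration_s: float   # wall-clock time from first prompt to the end
    shipped: bool       # False means the session was abandoned


def summarize(sessions: Sequence[Session]) -> Dict[str, Optional[float]]:
    """Aggregate the four session-level metrics named in the text."""
    if not sessions:
        raise ValueError("need at least one session")
    shipped = [s for s in sessions if s.shipped]
    return {
        "median_time_to_ship_s": (
            median([s.duration_s for s in shipped]) if shipped else None
        ),
        "mean_prompts": mean([s.prompts for s in sessions]),
        "mean_wait_s": mean([s.wait_s for s in sessions]),
        "abandon_rate": 1 - len(shipped) / len(sessions),
    }


# Made-up sample data, for shape only.
sample = [
    Session(prompts=8, wait_s=16.0, duration_s=300.0, shipped=True),
    Session(prompts=10, wait_s=2.0, duration_s=240.0, shipped=True),
    Session(prompts=12, wait_s=24.0, duration_s=420.0, shipped=True),
    Session(prompts=6, wait_s=12.0, duration_s=180.0, shipped=False),
]

report = summarize(sample)
for metric, value in report.items():
    print(f"{metric}: {value}")
```

Time to ship is computed over shipped sessions only, since an abandoned session never reaches that point. Abandonment is reported separately so that a tool cannot look fast by losing its hardest sessions.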

When that benchmark exists, we suspect the leaderboard is going to look very different from today's. Models that score lower on HumanEval might score higher on the session benchmark, because the session benchmark rewards what the user actually feels.

We are building for the session benchmark before it exists. We expect it to arrive in time.