Why Avery NXR runs a local Small Language Model instead of calling a cloud LLM

2026-05-25 · Avery NXR

When developers look under the hood of Avery NXR, the first question is almost always the same: why didn't you just call GPT-4 or Claude?

It's a fair question. The frontier models are extraordinary. They write better prose. They reason across larger problems. They have read more code than any human ever will. For a lot of tasks, such as debugging an unfamiliar stack trace, restructuring an essay, or brainstorming an architecture, they are the right tool.

Scaffolding a Next.js application is not one of those tasks.

The job we're actually doing

Avery NXR has a narrow goal. Given a short prompt, produce a production-ready Next.js + Prisma + TypeScript repository. Not a snippet. Not a starter template you have to finish by hand. A repository you can clone, run, and ship.

That job has a few unusual properties:

It is bounded. The model never has to write a Haskell library or explain quantum mechanics. It only has to know one framework, one ORM, and one type system, deeply.

It is repetitive. Most of what makes a real Next.js app — auth, billing, dashboards, CRUD, jobs, emails, file uploads — is structurally similar across applications. The variation lives in business logic, not in the scaffolding.

It is latency-sensitive. Developers iterate. They run a prompt, look at the output, adjust, run again. The faster that loop runs, the more refinements happen before the result feels right.

It is privacy-sensitive. The prompt for a real application is, by definition, a description of something that doesn't exist yet. It is often the most confidential thing on a developer's machine.

A frontier cloud model can do this job. But the frontier model is sized for tasks Avery NXR is never going to ask of it, and that mismatch shows up as cost, latency, and a prompt that has to cross the public internet on every request.

What we built instead

The engine inside Avery NXR is a Small Language Model fine-tuned on millions of Next.js patterns — real applications, real generators, real refactors. It is small enough to ship inside the desktop app. It is fast enough to respond before a developer notices it started thinking. It is narrow enough that it doesn't waste capacity on things it will never be asked to do.

It runs on the developer's machine. No API key. No usage meter. No "your request was sent to OpenAI." The prompt, the codebase, and every decision the model makes stay on the laptop.

The tradeoff is real. The SLM cannot reason about a 50-file refactor the way a frontier model can. It cannot improvise novel solutions outside its domain. We are not pretending it is a smarter model than the frontier — it is a different shape of model, optimized for a different shape of job.

Why the math works

On a single completion, a frontier cloud model probably writes slightly better code than the SLM. On a session — the unit a developer actually experiences — the SLM wins.

A real scaffolding workflow takes seven to twelve prompts before the output is good enough to commit. On a cloud model with two-second round trips, that's fourteen to twenty-four seconds of waiting per session, plus the cognitive cost of the interruption each time. On the local SLM, at around two hundred milliseconds per response, the same workflow spends roughly two seconds waiting in total. The developer never leaves flow.

A model that is eighty percent as smart but answers in two hundred milliseconds isn't worse. It's a different product. The same way a REPL isn't a worse compiler.
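
The session arithmetic above can be checked with a few lines of TypeScript. This is a minimal sketch, not a benchmark: it assumes every prompt in a session costs the same fixed wait (two seconds for the cloud round trip, two hundred milliseconds for the local model, both taken from the figures quoted here), and the function and constant names are made up for illustration.

```typescript
// Illustrative only: these names are hypothetical and not part of Avery NXR.

// Total time spent waiting on the model over one session, assuming every
// prompt costs the same fixed latency (a simplification).
function sessionWaitMs(prompts: number, latencyPerPromptMs: number): number {
  if (!Number.isInteger(prompts) || prompts < 0) {
    throw new RangeError("prompts must be a non-negative integer");
  }
  if (latencyPerPromptMs < 0) {
    throw new RangeError("latencyPerPromptMs must be non-negative");
  }
  return prompts * latencyPerPromptMs;
}

// Figures quoted in the text; treated here as fixed per-prompt costs.
const CLOUD_ROUND_TRIP_MS = 2000; // "two-second round trips"
const LOCAL_RESPONSE_MS = 200;    // "two hundred milliseconds"
const MIN_PROMPTS = 7;
const MAX_PROMPTS = 12;

const cloudRange = [
  sessionWaitMs(MIN_PROMPTS, CLOUD_ROUND_TRIP_MS),
  sessionWaitMs(MAX_PROMPTS, CLOUD_ROUND_TRIP_MS),
];
const localRange = [
  sessionWaitMs(MIN_PROMPTS, LOCAL_RESPONSE_MS),
  sessionWaitMs(MAX_PROMPTS, LOCAL_RESPONSE_MS),
];

console.log(`cloud: ${cloudRange[0]} to ${cloudRange[1]} ms per session`);
console.log(`local: ${localRange[0]} to ${localRange[1]} ms per session`);
// cloud: 14000 to 24000 ms per session
// local: 1400 to 2400 ms per session
```

Under these assumptions the cloud session spends 14 to 24 seconds waiting and the local session about 1.4 to 2.4 seconds. The tenfold gap holds at any session length, because it depends only on the ratio of the two per-prompt latencies.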

Privacy, almost as a side effect

We did not start this project to take a stand on data sovereignty. We started it because we wanted the loop to be faster.

The privacy properties come along for free. Because the model runs locally, the prompt never leaves the machine. Because the prompt never leaves the machine, there is no question of who trained on it, who stored it, who indexed it, or who got subpoenaed for it.

For engineers at a regulated company — finance, healthcare, defense — that property turns Avery NXR from "tool we'd like to try" into "tool we are allowed to use." That was not the goal, but it is one of the more important consequences.

What this thesis predicts

If we're right, more dev tools will move in this direction over the next two years. Not because cloud LLMs are bad — they're not — but because a lot of narrow, latency-sensitive, privacy-sensitive jobs are better served by small models that know one thing deeply than by giant models that know everything shallowly.

We are betting that scaffolding is the first one. We don't think it is the last.