Customer support: where AI cost scales with company growth, forever
· Avery NXR
Customer support is one of the workflows that has been most thoroughly transformed by AI in the past three years. The transformation has been mostly to the good. Triage that used to take a human ten minutes now happens in seconds. Response drafts that used to be written from scratch are now generated as starting points. Knowledge-base lookups that used to require navigating a clunky internal tool now happen inline in the agent's interface.
Every one of those wins is, in most implementations today, a cloud LLM call. And every one of those cloud LLM calls is on a meter.
The economics of support at scale
Customer support has a uniquely uncomfortable cost structure when AI is layered into it. The cost per ticket is small. The number of tickets scales with the number of customers. The number of customers scales with the company's growth. The result is an AI bill that grows precisely as the company grows — forever.
Here is the math at a mid-stage company. Twenty support agents, each handling roughly fifty tickets per day. That is a thousand tickets per day, or about thirty thousand per month.
For each ticket, a typical AI-enhanced support workflow does several things. It classifies the ticket (intent, urgency, queue). It enriches the context (relevant customer history, prior tickets, product version). It suggests a knowledge base article. It drafts a first response. After the agent sends the response, it may also evaluate the response for quality and tag the ticket for trends analysis.
A reasonable token budget per ticket, across all these operations, is about ten thousand input tokens and twelve hundred output tokens. At frontier pricing, that is about $0.048 per ticket — so about $1,440 per month, or $17,280 per year, for this team.
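The per-ticket figure can be reproduced with a few lines of arithmetic. The sketch below assumes prices of $3 per million input tokens and $15 per million output tokens. Those rates are not stated in this article; they are an assumption chosen because they reproduce the $0.048 figure. A 30-day month is also assumed.

```python
# Assumed prices in USD per million tokens (not stated in the article).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def cost_per_ticket(input_tokens: int, output_tokens: int) -> float:
    """Cloud LLM cost in USD for all AI operations on one ticket."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def monthly_cost(tickets_per_day: int, per_ticket: float, days: int = 30) -> float:
    """Monthly spend, assuming a 30-day month by default."""
    return tickets_per_day * days * per_ticket


per_ticket = cost_per_ticket(input_tokens=10_000, output_tokens=1_200)
per_month = monthly_cost(tickets_per_day=1_000, per_ticket=per_ticket)

print(f"per ticket: ${per_ticket:.3f}")
print(f"per month:  ${per_month:,.0f}")
print(f"per year:   ${per_month * 12:,.0f}")
```

Changing the two price constants or the token budget shows how sensitive the annual figure is to each input.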
Now scale this. A larger support org handling fifteen thousand tickets per day (about three hundred agents at the same fifty tickets each) is at roughly $259,000 per year for the same workflows. We have talked to support leaders at growing companies whose AI bill for support tooling is approaching $500,000 per year. That number tracks the company's customer base directly; as the company doubles, the bill doubles.
Why support is well-suited to a local SLM
The properties that make a workflow well-matched to a specialized local model are all present.
Support is narrow. The vocabulary, the product knowledge, the policy framework, the tone: all of it is specific to one company. A model trained on a company's own support tickets and knowledge base can be expected to outperform a general-purpose model on that company's routine work.
Support is repetitive. Most tickets fall into a small number of intent categories. Within each category, the patterns of useful response are also small in number. Specialization beats generalization on repetition.
Support is high-volume. The cost-per-call structure of cloud LLMs is brutal here. Every doubling of customer base doubles the AI bill. Local inference flips the curve: the cost is set by the hardware you provision, not by the number of calls, so a doubling of the customer base leaves the bill unchanged until that hardware runs out of capacity.
Support is privacy-sensitive. Tickets contain customer information, account details, billing data, and in many cases personally identifiable information of various kinds. Many companies are uncomfortable routing this data to a third-party cloud LLM, even if their contracts permit it. For regulated industries (healthcare support, financial services support, government services), local inference is often the only acceptable architecture.
Support is latency-sensitive. The agent is staring at the screen, waiting for the AI to suggest a response. A 200ms suggestion feels like the AI is thinking with them. A 2-second suggestion feels like a delay. Across hundreds of agents and tens of thousands of suggestions per day, the cumulative cost of latency on agent productivity is significant.
The architecture that wins
A support organization running on a local SLM has a configuration that looks like this.
The model is fine-tuned on the company's own ticket history, knowledge base, and policy documents. The fine-tune captures the specifics — product terminology, common issues, the company's preferred tone, the policies that govern responses.
The model is deployed in a way that puts inference close to the agent. For some companies, that means a server in each region. For others, with smaller-scale agent counts, it can mean the model running on the agent's own workstation. In both cases, the round-trip latency is dramatically lower than a cloud LLM, and the cost is fixed rather than variable.
The cloud LLM remains available for the unusual cases — the genuinely novel ticket that requires reasoning beyond the local model's training. A well-designed pipeline routes 95 percent of tickets to the local model and the remaining 5 percent to the cloud, getting most of the cost savings while preserving the ability to handle outlier cases.
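One way to sketch that routing is a confidence threshold: the local model drafts first, and the ticket escalates to the cloud only when the local model reports low confidence. Everything named here is hypothetical: `local_draft` and `cloud_draft` are stand-ins for real model calls, the keyword heuristic exists only so the example runs, and the 0.80 threshold is an illustrative value that would need tuning against real tickets to land near a 95/5 split.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    text: str
    confidence: float  # 0.0 to 1.0; assumed to be reported by the model


# Hypothetical threshold; tune on real traffic to hit the desired split.
CONFIDENCE_THRESHOLD = 0.80


def local_draft(ticket: str) -> Draft:
    """Stand-in for a local fine-tuned model call (toy heuristic only)."""
    familiar_topics = ("password", "refund", "invoice")
    is_familiar = any(topic in ticket.lower() for topic in familiar_topics)
    return Draft(text=f"[local draft] {ticket}", confidence=0.95 if is_familiar else 0.40)


def cloud_draft(ticket: str) -> Draft:
    """Stand-in for a cloud LLM call, used only for outlier tickets."""
    return Draft(text=f"[cloud draft] {ticket}", confidence=1.0)


def route(ticket: str) -> tuple[str, Draft]:
    """Try the local model first; escalate when its confidence is too low."""
    draft = local_draft(ticket)
    if draft.confidence >= CONFIDENCE_THRESHOLD:
        return "local", draft
    return "cloud", cloud_draft(ticket)


for ticket in ("How do I reset my password?", "Webhook payloads arrive out of order"):
    backend, draft = route(ticket)
    print(backend, "->", draft.text)
```

In this shape the cloud bill scales with the escalated fraction only, so the threshold is the knob that trades cost against coverage of unusual tickets.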
The audit trail improves. Every model action is logged locally; every decision is reviewable; every response the model drafted before the agent edited it is preserved. This is useful for training, for quality assurance, and for the eventual compliance conversation about how AI was used in customer interactions.
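A minimal sketch of such a log, assuming one JSON object per line in a local file. The field names and the `audit_record` and `append_record` helpers are illustrative assumptions, not a prescribed schema; the point is that the model's draft and the text the agent actually sent are stored side by side.

```python
import json
from datetime import datetime, timezone


def audit_record(ticket_id: str, model_version: str, model_draft: str, sent_text: str) -> dict:
    """Build one audit entry that preserves the model's draft next to what was sent."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "ticket_id": ticket_id,
        "model_version": model_version,
        "model_draft": model_draft,
        "sent_text": sent_text,
        "agent_edited": model_draft != sent_text,
    }


def append_record(path: str, record: dict) -> None:
    """Append the record as a single JSON line to a local log file."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record, ensure_ascii=False) + "\n")


record = audit_record(
    ticket_id="T-1001",
    model_version="support-slm-v1",  # hypothetical version label
    model_draft="You can reset your password from the login page.",
    sent_text="Hi Sam, you can reset your password from the login page.",
)
print(json.dumps(record, indent=2))
```

An append-only file is the simplest form; the same record could equally go to a database table if reviewers need to query by ticket or model version.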
The privacy story that closes deals
For a lot of support organizations, the cost story is compelling but not decisive. The privacy story is what closes the conversation.
Healthcare support teams cannot send ticket content to a third-party cloud LLM without significant compliance overhead. Financial services support teams operate under similar constraints. Government services teams have stricter rules still. For all of these, a local-inference architecture is not just better; it is often the only architecture that is practical to get approved.
Even at companies without strict regulatory constraints, the privacy posture matters. Customer trust is fragile; the disclosure that "we send your support conversations to a third-party AI provider" lands differently in different markets and with different audiences. A local-inference architecture lets a company say, truthfully, "your support conversation stays inside our systems."
We do not lead with this argument in most conversations because it can feel preachy. But it is a real argument, and it is the closing argument in many of the conversations we have had.
When the cloud LLM is still right
There are a few cases where a cloud LLM is still the right answer for support work.
Brand-new product launches where the model has no historical ticket data to fine-tune on. In the first six months of a new product, the cloud LLM's breadth compensates for the local model's lack of specific training data.
Very small teams where the volume does not justify the local infrastructure investment. Below some volume threshold, the math tips back in favor of the cloud. The threshold is lower than people think — typically a few thousand tickets per month is enough to justify going local — but it is not zero.
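The threshold can be estimated by dividing the fixed monthly cost of running locally by the cloud cost per ticket. The sketch below uses the $0.048 per-ticket figure from earlier in this article; the $150 and $500 monthly figures for local infrastructure are purely hypothetical placeholders, since real costs depend on hardware, hosting, and maintenance. It also ignores the small share of tickets that would still be routed to the cloud.

```python
import math
from decimal import Decimal


def break_even_tickets(fixed_monthly_cost: str, cloud_cost_per_ticket: str = "0.048") -> int:
    """Smallest monthly ticket count at which cloud spend reaches the fixed local cost.

    Amounts are passed as strings and handled as Decimal to avoid
    floating-point rounding at the boundary.
    """
    return math.ceil(Decimal(fixed_monthly_cost) / Decimal(cloud_cost_per_ticket))


# Hypothetical fixed monthly costs for a local deployment, in USD.
for fixed_cost in ("150", "500"):
    tickets = break_even_tickets(fixed_cost)
    print(f"${fixed_cost}/month local cost breaks even at {tickets:,} tickets/month")
```

Under these assumed costs the break-even lands in the low thousands to roughly ten thousand tickets per month, which is the range the paragraph above describes.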
Unusual workloads where the model needs to reason about content it has not seen patterns for. For most support orgs, this is a minority of tickets, but it is a real minority.
For everything else — and "everything else" is the vast majority of support work at any company past the early stage — the local-SLM case is strong.
The pattern continues
We keep noting that Avery NXR is not the tool for these workflows. It scaffolds Next.js applications. The reason we keep noting it is that the architectural pattern is the same.
Narrow workload. Repetitive operations. High volume. Privacy-sensitive content. Latency-sensitive interactions. Cost scales with usage in the cloud-LLM architecture, and is fixed in the local-SLM architecture.
If you are running a support organization today and your cloud AI bill is climbing in lockstep with your customer base, the architecture is the lever. The work itself — the triage, the routing, the drafting, the scoring — is well-suited to a specialized local model. The economics are well-suited to local execution. The privacy story is well-suited to local execution. The latency is well-suited to local execution.
The reason most teams have not made the switch is the same reason they have not made the switch for document processing or email processing — the tooling is still maturing. The companies that build the right vertical tools for support, with the right local-SLM architecture and the right business model, are going to have a lot of demand.