Internal knowledge base Q&A: when every employee question becomes a cloud LLM call
· Avery NXR
The internal knowledge base is one of the most thoroughly disliked artifacts in modern company operations. Documents that were written three years ago and haven't been updated since. Information scattered across four tools. Search that returns thirty results when you wanted one. Pages that say "this is out of date — see the new page" without linking to the new page.
AI has done a remarkable job patching this particular failure mode of company knowledge management. Instead of searching the knowledge base, employees now ask a chatbot that searches for them, reads the results, and gives them an answer. It works. People love it. Adoption inside companies has been faster than almost any other AI deployment we have seen.
The bill is climbing along with the adoption.
The economics of "ask anything"
The promise of an internal Q&A system is that an employee should be able to ask any question about any aspect of the company's operations and get a useful answer in seconds. Realizing that promise requires running every question through a model that has access to the company's documents.
Consider a company with five hundred employees, each asking the internal Q&A system an average of fifteen questions per workday. That is seventy-five hundred questions per day or, at roughly twenty workdays a month, about a hundred and fifty thousand per month.
Each query, in a typical implementation, retrieves several documents from the knowledge base and passes them to a cloud LLM along with the question. A reasonable token budget per query is twelve thousand input tokens (the retrieved documents plus context) and three hundred output tokens (the answer). At frontier pricing, each query costs about $0.041.
Across a hundred and fifty thousand queries per month, that is about $6,150 per month, or $73,800 per year for one company.
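The arithmetic above can be reproduced in a few lines. This is a sketch under stated assumptions: the article says only "frontier pricing", so the per-token prices below ($3 per million input tokens, $15 per million output tokens) are illustrative values chosen because they land on roughly $0.041 per query, and the twenty workdays per month is inferred from the monthly total.

```python
# Worked version of the cost estimate above.
# ASSUMPTIONS (not from the article): the two per-million-token prices
# and the number of workdays per month.
EMPLOYEES = 500
QUESTIONS_PER_WORKDAY = 15
WORKDAYS_PER_MONTH = 20        # assumption
INPUT_TOKENS = 12_000
OUTPUT_TOKENS = 300
INPUT_PRICE_PER_M = 3.00       # assumption, USD per million input tokens
OUTPUT_PRICE_PER_M = 15.00     # assumption, USD per million output tokens


def cost_per_query(input_tokens, output_tokens, in_price, out_price):
    """Cost in USD of one query, given prices per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


queries_per_month = EMPLOYEES * QUESTIONS_PER_WORKDAY * WORKDAYS_PER_MONTH
per_query = cost_per_query(
    INPUT_TOKENS, OUTPUT_TOKENS, INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M
)
monthly = queries_per_month * per_query
annual = monthly * 12

print(f"queries per month: {queries_per_month:,}")
print(f"cost per query:    ${per_query:.4f}")
print(f"monthly cost:      ${monthly:,.0f}")
print(f"annual cost:       ${annual:,.0f}")
```

With these assumed prices the unrounded per-query cost is $0.0405, which gives $6,075 per month. The figures in the text round the per-query cost up to $0.041 before multiplying, which is where $6,150 comes from. Almost all of the cost is input tokens, so the size of the retrieved context is the lever that matters most.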
These numbers scale with company size, with the depth of integration, and with how successfully the AI Q&A tool gets adopted. We have talked to companies where the internal Q&A bill is approaching $250,000 per year — bigger than the engineering budget for the team running the tool.
Why this is a strong local-SLM workload
The properties that favor local inference are all present.
The work is narrow in an unusual sense. The model does not need to know everything about everything. It needs to know one company's documents, one company's vocabulary, one company's history. A model fine-tuned on a company's own knowledge base will usually outperform a general-purpose model on questions about that company.
The work is repetitive in pattern. Most internal questions cluster into a small number of categories — process questions, policy questions, "where do I find" questions, "who owns" questions. A model that has seen thousands of internal queries from this company will get better at the company's specific patterns of question over time.
The volume is enough to make the cloud bill substantial without being so high that local infrastructure is impossible. A hundred and fifty thousand queries per month is tractable for a modestly sized local model running on commodity hardware. The math works on a single mid-grade GPU server, or even on individual employee machines for the heaviest users.
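To make "tractable" concrete, the monthly volume can be converted into a request rate. This is a rough sketch: the eight-hour workday and the peak multiplier are assumptions for illustration, not figures from the article.

```python
# Convert the daily query volume into an approximate request rate.
# ASSUMPTIONS: questions arrive within an 8-hour workday, and the busiest
# moments run at 5x the average rate. Both numbers are illustrative.
QUERIES_PER_DAY = 7_500
WORK_HOURS_PER_DAY = 8   # assumption
PEAK_FACTOR = 5          # assumption


def average_qps(queries_per_day, work_hours):
    """Average queries per second over the working day."""
    return queries_per_day / (work_hours * 3600)


avg = average_qps(QUERIES_PER_DAY, WORK_HOURS_PER_DAY)
peak = avg * PEAK_FACTOR

print(f"average: {avg:.2f} queries/s, assumed peak: {peak:.2f} queries/s")
```

Under these assumptions the average load is roughly one query every four seconds. Whether a particular server sustains the assumed peak depends on model size and prompt length, so the real answer comes from benchmarking on the target hardware.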
The privacy posture matters more here than in many other workloads. The knowledge base contains everything — financial data, customer information, internal strategy, personnel documents, security procedures. Sending every employee question and every retrieved document to a third-party cloud LLM is, for many companies, an explicit policy violation. For regulated companies, it is a non-starter.
The latency matters. When an employee asks a question, they are sitting at their screen, waiting. A 200ms response feels like an internal tool that is part of the workflow. A 2-second response feels like waiting for the bathroom. Across thousands of queries per day, the cumulative cost of latency in employee attention is measurable.
The architecture that makes this work
A local Q&A system for an internal knowledge base has a clear shape.
A small, specialized model is fine-tuned on the company's own corpus: documentation, policies, and historical Q&A logs if available. The initial fine-tune is an up-front investment that pays back across millions of queries, with periodic refreshes as the knowledge base changes.
A retrieval layer indexes the knowledge base and identifies the relevant documents for each query. This is the classic RAG architecture, but with the cloud LLM step replaced by a local model. Retrieval can be local too, or the setup can be hybrid, with the index in a cloud vector database and inference running locally.
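The retrieve-then-generate flow described above can be sketched in plain Python. This is a toy, not a reference implementation: the bag-of-words cosine ranking stands in for a real index, `local_generate` is a hypothetical stub where a locally hosted model would be called, and the sample documents are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_docs):
    """Assemble the retrieved documents and the question into one prompt."""
    context = "\n\n".join(context_docs)
    return (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def local_generate(prompt):
    # HYPOTHETICAL stub: a real system would call the local model here.
    return f"[local model output for a {len(prompt)}-character prompt]"


def answer(question, documents, k=2):
    """Retrieve context, build the prompt, and run local inference."""
    return local_generate(build_prompt(question, retrieve(question, documents, k)))


# Invented sample documents, for illustration only.
DOCS = [
    "Expense reports are submitted through the finance portal by the 5th of each month.",
    "The on-call rotation for the payments service is owned by the platform team.",
    "New laptops are requested through the IT help desk form.",
]

question = "Who owns the payments on-call rotation?"
top = retrieve(question, DOCS, k=1)
print(top[0])
print(answer(question, DOCS))
```

The point of the shape is that only `local_generate` changes when moving from a cloud LLM to a local one. The retrieval and prompt-building steps are the same in both architectures.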
The system runs either on a server the company owns, or in some configurations directly on employees' machines for the fastest possible response. The deployment shape depends on the volume and the latency requirements; both deployments are viable.
The cost flips from per-question to fixed. The company pays for the model, the hardware, and the index, and volume can then grow ten-fold without the cost moving, as long as the hardware has headroom.
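The difference between the two cost curves can be shown with a small comparison. The $0.041 per query comes from the estimate earlier in the article; the fixed monthly cost and the capacity ceiling are hypothetical placeholders, since the article gives no hardware figures.

```python
# Compare a per-query cloud bill with a fixed local bill at two volumes.
CLOUD_COST_PER_QUERY = 0.041       # from the article's estimate
LOCAL_FIXED_MONTHLY = 2_000.0      # HYPOTHETICAL: hardware, power, upkeep
LOCAL_CAPACITY_PER_MONTH = 2_000_000  # HYPOTHETICAL capacity ceiling


def cloud_monthly_cost(queries, cost_per_query=CLOUD_COST_PER_QUERY):
    """Cloud cost grows linearly with the number of queries."""
    return queries * cost_per_query


def local_monthly_cost(
    queries,
    fixed_monthly=LOCAL_FIXED_MONTHLY,
    capacity=LOCAL_CAPACITY_PER_MONTH,
):
    """Local cost is flat until the assumed hardware capacity is exceeded."""
    if queries > capacity:
        raise ValueError("volume exceeds assumed capacity; more hardware needed")
    return fixed_monthly


for volume in (150_000, 1_500_000):
    cloud = cloud_monthly_cost(volume)
    local = local_monthly_cost(volume)
    print(f"{volume:>9,} queries: cloud ${cloud:>8,.0f}   local ${local:>6,.0f}")
```

The capacity check is the caveat on the fixed-cost claim: the local bill stays flat only while the existing hardware can absorb the load, after which it steps up rather than growing per query.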
The thing that improves more than the cost
The cost savings are the obvious win. The less obvious win is that a model trained on the company's own data is genuinely better at answering the company's questions than a general model with retrieval is.
A general-purpose cloud LLM, given a question and some retrieved documents, will produce a competent answer. A model that has been fine-tuned on the company's own writing — the tone, the vocabulary, the structure of decisions — produces an answer that sounds like it came from someone who works there.
The difference matters in a few ways. The fine-tuned model is better at disambiguating internal jargon. It is better at handling the specific shape of policy and process the company uses. It is better at noticing when a retrieved document is out of date because it knows what the current document on the same topic looks like.
The fine-tuned local model is a better tool, in addition to being a cheaper and more private one. The architecture pivot improves the answer quality, not just the operational metrics.
The cases where the cloud LLM still wins
There are a few cases where keeping the Q&A workload on a cloud LLM is the right choice.
Companies with knowledge bases that change so fast that fine-tuning would be perpetually out of date. Most companies don't have this problem; their knowledge bases change incrementally and a model can be re-fine-tuned every few weeks. But some do, and for them a cloud-LLM-plus-fresh-retrieval architecture is genuinely better.
Companies whose volume is so low that the local infrastructure investment doesn't pay back. Below a few thousand queries per month, the math tips toward cloud. The threshold is lower than people guess, but it exists.
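The threshold can be estimated as the volume at which the cloud bill equals the fixed local cost. In this sketch the $0.041 per query is the article's estimate, while the fixed monthly costs are hypothetical values chosen to show how the threshold moves.

```python
import math

CLOUD_COST_PER_QUERY = 0.041  # from the article's estimate


def break_even_queries(fixed_monthly, cloud_cost_per_query=CLOUD_COST_PER_QUERY):
    """Smallest monthly volume at which the cloud bill reaches the fixed cost."""
    return math.ceil(fixed_monthly / cloud_cost_per_query)


# HYPOTHETICAL fixed monthly costs for a local deployment, in USD.
for fixed in (100, 200, 2_000):
    threshold = break_even_queries(fixed)
    print(f"fixed ${fixed:>5,}/month -> break-even at {threshold:,} queries/month")
```

At $0.041 per query, a break-even of a few thousand queries per month corresponds to a fixed cost on the order of $100 to $200 a month, which is closer to running on existing machines than to a dedicated GPU server. A larger fixed cost pushes the threshold up proportionally.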
Companies whose questions are heavily open-ended, requiring reasoning that the local model has not been trained for. For most internal Q&A, this is a minority; most internal questions are pattern-matched against existing documents. But in research-heavy organizations, the share of open-ended questions can be high enough that the local model's specialization is wasted.
For the median company — five hundred to ten thousand employees, with a typical knowledge base, with a typical mix of question types — the local-SLM case is strong.
The pattern, the fifth time
Avery NXR scaffolds Next.js applications, not internal Q&A. The pattern repeats: narrow workload, repetitive operations, high volume, privacy-sensitive content, latency-sensitive interactions, cost that scales with usage in the cloud-LLM architecture and is fixed in the local-SLM architecture.
Internal knowledge base Q&A is a workload where the case for local execution gets stronger every year, as adoption grows, as bills compound, and as employee expectations for the system rise. The companies that solve this with a well-trained local model and a sensible business model are going to find a lot of customers.
We are watching this category. We are not building in it. But the pattern — the same pattern we are building Avery NXR around — is going to produce excellent companies here over the next few years.