Data labeling at scale: using AI to train AI, on a meter
· Avery NXR
There is a particular irony in the modern ML pipeline that doesn't get talked about enough.
To train a custom model — a classifier, an entity extractor, a recommender, a moderation system — you need labeled training data. Lots of it. The "lots of it" used to mean teams of human annotators clicking through spreadsheets for months. Today, the "lots of it" mostly means a cloud LLM that has been prompted to produce labels for whatever data the team needs to train on.
So the workflow is: pay a cloud LLM provider to label your training data, so that you can train your own model, so that you can stop paying the cloud LLM provider.
The bill for this intermediate step is, at many ML teams, larger than people realize.
The math
A representative ML team building a domain-specific classifier needs somewhere between fifty thousand and five hundred thousand labeled examples to get to production quality, depending on the complexity of the task.
For each unlabeled example, a typical pipeline runs it through a cloud LLM with a structured prompt asking for the label, the confidence, and the reasoning. A representative example uses about a thousand input tokens (the example plus the labeling rubric) and a hundred and fifty output tokens (the structured label).
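As a rough illustration of what "a structured prompt asking for the label, the confidence, and the reasoning" can look like in practice, here is a minimal sketch. The names `LabelResult`, `build_prompt`, and `parse_label`, and the choice of JSON as the reply format, are assumptions made for this example and do not come from any particular pipeline.

```python
import json
from dataclasses import dataclass


@dataclass
class LabelResult:
    label: str
    confidence: float
    reasoning: str


def build_prompt(rubric: str, example: str) -> str:
    """Combine the labeling rubric and one unlabeled example into a prompt."""
    return (
        f"{rubric}\n\n"
        f"Example:\n{example}\n\n"
        'Respond with JSON only: {"label": str, "confidence": float, "reasoning": str}'
    )


def parse_label(raw: str, allowed_labels: set[str]) -> LabelResult:
    """Parse and validate a model reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
        label = data["label"]
        confidence = float(data["confidence"])
        reasoning = str(data["reasoning"])
    except (KeyError, TypeError) as exc:
        raise ValueError(f"malformed reply: {exc}") from exc
    if not isinstance(label, str) or label not in allowed_labels:
        raise ValueError(f"unexpected label: {label!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return LabelResult(label, confidence, reasoning)
```

Validating every reply against the rubric's allowed labels matters at this volume: a malformed label that slips through once in a thousand replies becomes hundreds of bad rows in a few-hundred-thousand-example dataset.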
At frontier pricing, a thousand input tokens and a hundred and fifty output tokens come to roughly $0.005 per label.
A medium-sized labeling job — two hundred thousand examples — costs $1,000. A larger one — a million examples — costs $5,000. Not huge numbers in isolation.
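The arithmetic is easy to reproduce. In the sketch below, the token counts and the $0.005 figure come from the text above, while the per-million-token prices are hypothetical placeholders chosen only because they land near that figure. Substitute your provider's actual rates.

```python
def cost_per_label(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one labeling call, given prices per million tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def job_cost(num_examples: int, usd_per_label: float) -> float:
    """Total cost of labeling a dataset at a flat per-label price."""
    return num_examples * usd_per_label


# Hypothetical prices (assumptions, not quotes from any provider).
ASSUMED_USD_PER_M_INPUT = 3.00
ASSUMED_USD_PER_M_OUTPUT = 15.00

per_label = cost_per_label(1_000, 150,
                           ASSUMED_USD_PER_M_INPUT, ASSUMED_USD_PER_M_OUTPUT)
print(f"per label: ${per_label:.5f}")

# Job sizes from the text, priced at the rounded $0.005 per label.
for n in (200_000, 1_000_000):
    print(f"{n:>9,} examples: ${job_cost(n, 0.005):,.0f}")
```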
The catch is that ML teams don't label one dataset. They label many. They iterate. They re-label when the task definition shifts. They label new datasets as the product grows. A team running a serious ML practice can spend $50,000 to $200,000 per year on cloud-LLM-based labeling alone, with most of that money producing training data for models that the team will eventually deploy locally.
Why labeling is a strong local-SLM workload
Labeling has every property that makes a workload a good fit for a local small language model (SLM).
Labeling is narrow. The model needs to apply one rubric to one kind of input. A model fine-tuned on a sample of human-labeled examples will outperform a general model that has to figure out the rubric from instructions.
It is repetitive. The same rubric, the same input shape, the same output shape, repeated hundreds of thousands of times.
It is extremely high-volume. The bill is small per-label but grows quickly across realistic dataset sizes.
It is privacy-sensitive when the unlabeled data is. Customer messages being labeled for sentiment, medical records being labeled for symptom presence, internal documents being labeled for sensitive content — all of this is data the ML team probably wants to keep inside their controlled environment.
It is latency-insensitive. Labeling is batch work, so nobody is waiting on any individual response, and the local model can run on commodity hardware. The constraint is throughput, not response time.
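To make "the constraint is throughput" concrete, here is a back-of-the-envelope estimate of how long a batch job takes. The tokens-per-example figure comes from the pricing section above. The aggregate throughput number is purely an assumption for illustration, and the sketch treats input and output tokens as equally expensive to process, which real inference servers do not.

```python
def labeling_hours(num_examples: int, tokens_per_example: int,
                   tokens_per_second: float) -> float:
    """Wall-clock hours to push a dataset through a model at a given throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_examples * tokens_per_example / tokens_per_second / 3600


TOKENS_PER_EXAMPLE = 1_000 + 150      # input + output tokens, from the text
ASSUMED_TOKENS_PER_SECOND = 2_000.0   # hypothetical aggregate throughput

hours = labeling_hours(200_000, TOKENS_PER_EXAMPLE, ASSUMED_TOKENS_PER_SECOND)
print(f"{hours:.1f} hours")  # prints "31.9 hours"
```

Under these assumptions a two-hundred-thousand-example job finishes in a day or two of unattended running, which is acceptable for batch work even though it would be far too slow for anything interactive.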
The recursive structure of the use case
The interesting thing about labeling as a workflow is its recursive structure.
The team's goal is to train their own model. The model they're training is, in some sense, a specialized small model that will outperform a generic cloud model on the specific task. The labeling step is an intermediate dependency in that goal.
Using a cloud LLM for the labeling step is, structurally, paying a generalist to teach a specialist. The generalist's labels are good enough to train on, but the bill is high and the data has to leave the controlled environment.
Using a local SLM for the labeling step — even a smaller, less capable model than the cloud LLM — closes the loop. The labels are produced by a model on the team's hardware. The training data never leaves. The cost is fixed rather than per-label.
For a team doing repeated labeling across many datasets, the recursive economics get attractive fast. Once the team has trained a labeling model, that model can label more data for the next model, which can in turn label data for the one after it. The marginal cost per labeled example drops to roughly the cost of electricity.
What the architecture looks like
A labeling workflow on a local SLM has a structure like this.
The team starts with a small sample of human-labeled data — a few thousand examples. This sample is used to fine-tune a small open-source model into a labeling model for the specific task.
The labeling model runs on infrastructure the team controls. The full dataset flows through it, producing labels at high throughput. The output goes into the training pipeline.
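The "full dataset flows through it" step is, at its core, a batching loop. The sketch below assumes the fine-tuned model is exposed as a callable, here named `label_batch`, that takes a list of texts and returns one label per text. That interface is hypothetical, so adapt it to whatever inference stack you actually run.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of up to `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def label_dataset(
    examples: Iterable[str],
    label_batch: Callable[[list[str]], list[str]],  # hypothetical model interface
    batch_size: int = 64,
) -> Iterator[tuple[str, str]]:
    """Stream (example, label) pairs without loading the whole dataset at once."""
    for batch in batched(examples, batch_size):
        labels = label_batch(batch)
        if len(labels) != len(batch):
            raise RuntimeError("model returned a different number of labels than inputs")
        yield from zip(batch, labels)


# Stand-in for a real model, used only to show the call shape.
def toy_model(texts: list[str]) -> list[str]:
    return ["long" if len(t) > 20 else "short" for t in texts]


for example, label in label_dataset(["hi", "a considerably longer message"], toy_model):
    print(label, "<-", example)
```

Because the loop is a generator, the labeled pairs can be written straight into the training pipeline as they are produced.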
For quality control, humans review a small sample of the labels, typically 1-5% of the dataset. The review surfaces systematic errors that can be corrected by additional fine-tuning. The cycle repeats until label quality is acceptable.
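One way to sketch the review step: draw a reproducible random sample for the human reviewers, then break disagreements down by the label the model assigned, so that systematic errors stand out from random noise. The function names and the 2% default are assumptions for this example.

```python
import random


def draw_review_sample(labeled: list, fraction: float = 0.02, seed: int = 0) -> list:
    """Pick a reproducible random subset of labeled items for human review."""
    rng = random.Random(seed)
    k = min(len(labeled), max(1, round(len(labeled) * fraction)))
    return rng.sample(labeled, k)


def disagreement_by_label(reviewed: list[tuple[str, str]]) -> dict[str, float]:
    """Given (model_label, human_label) pairs, return the share of reviewed
    items on which the human disagreed, grouped by the model's label."""
    totals: dict[str, int] = {}
    wrong: dict[str, int] = {}
    for model_label, human_label in reviewed:
        totals[model_label] = totals.get(model_label, 0) + 1
        if model_label != human_label:
            wrong[model_label] = wrong.get(model_label, 0) + 1
    return {label: wrong.get(label, 0) / n for label, n in totals.items()}
```

A label whose disagreement rate sits well above the others is a candidate for targeted fine-tuning data in the next round.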
The bill becomes the cost of the labeling model (the initial fine-tune plus any corrective rounds) and the hardware to run it. Once those are paid for, the marginal cost of labeling each additional example is close to zero.
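Whether the fixed cost pays back depends on volume, and the break-even point is a one-line calculation. In the sketch below, the $0.005 cloud price comes from the pricing section, while the fixed cost is a made-up placeholder. Plug in your own fine-tuning and hardware numbers.

```python
import math
from typing import Optional


def break_even_labels(fixed_cost_usd: float, cloud_usd_per_label: float,
                      local_usd_per_label: float = 0.0) -> Optional[int]:
    """Number of labels at which a fixed local investment matches cloud spend.

    Returns None if the local marginal cost is not lower than the cloud price,
    in which case the investment never pays back.
    """
    saving_per_label = cloud_usd_per_label - local_usd_per_label
    if saving_per_label <= 0:
        return None
    return math.ceil(fixed_cost_usd / saving_per_label)


ASSUMED_FIXED_COST_USD = 10_000.0  # hypothetical: fine-tuning plus hardware

n = break_even_labels(ASSUMED_FIXED_COST_USD, cloud_usd_per_label=0.005)
print(f"break-even at about {n:,} labels")
```

With these placeholder numbers the break-even lands at roughly two million labels, which a single dataset rarely reaches but a team that labels and re-labels many datasets can.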
Where the cloud LLM is still the right call
Cloud-LLM-based labeling is genuinely better in a few cases.
The first is one-off labeling tasks where the team won't reuse the labeling pipeline. The infrastructure investment doesn't pay back on a single project.
The second is tasks where the model's reasoning quality is the bottleneck, such as complex multi-step reasoning over long contexts. A frontier cloud LLM may produce higher-quality labels than a small fine-tuned model can.
The third is early prototyping, when the team is still figuring out whether labeling is needed and what the rubric should be. Cloud LLMs are faster to iterate against than fine-tuned local models in this phase.
For everything else — the recurring labeling work that runs across multiple projects in any serious ML practice — the local-SLM case is strong.
The pattern, with extra irony
Avery NXR is a Next.js scaffolding tool, not a labeling tool, but the same architectural pattern applies.
Labeling is a narrow, repetitive, high-volume, privacy-sensitive workload. The economics that favor a specialized local model for code scaffolding are the same economics that favor a specialized local model for labeling.
The extra irony in this use case is that the labeling workflow exists to produce specialized local models. Doing the labeling on a cloud LLM is letting the generalist do the work that produces specialists. Doing the labeling on a local SLM closes the loop — the specialists are produced by other specialists, and the dependency on the generalist falls away.
ML teams that recognize this, and build their labeling infrastructure on local models, will operate with a structural cost advantage over teams that keep paying the cloud LLM to do the intermediate work.