Fraud detection and transaction review: where every dollar is on the AI line
· Avery NXR
Modern financial systems run a lot of fraud detection AI.
Every transaction that crosses a risk threshold gets scored. Every flagged transaction gets a structured review explanation. Every suspicious pattern gets summarized for a human reviewer. Every confirmed fraud case gets analyzed for the lessons it offers about future detection. Every regulatory report gets drafted with AI assistance. The pipeline is dense, the volume is enormous, and the data is the most sensitive in the company.
In the cloud-LLM-default architecture, all of this is happening on a meter. The bill is real. The privacy posture is, for many financial institutions, untenable.
The math on a real-sized institution
A midsize fintech or regional bank processes a meaningful share of its transaction volume through some kind of AI-enhanced fraud workflow.
A representative institution: ten million transactions per month. Of those, perhaps 5 percent (five hundred thousand) cross a threshold that triggers AI-augmented review — by the standard scoring model, by velocity rules, by anomaly detection, by combinations of the above.
Each flagged transaction goes through one or more LLM operations: contextualize the transaction against the customer's history, generate a structured risk explanation, suggest next actions, draft a review note for the analyst. A reasonable token budget per transaction is two thousand input tokens (transaction details, customer history, merchant data) and four hundred output tokens (the structured explanation).
At frontier pricing, about $0.012 per flagged transaction. Across five hundred thousand per month, that's $6,000 per month, or $72,000 per year.
The numbers get larger fast. A larger institution — a hundred million transactions per month, with the same flagging rate — is paying $720,000 per year. A truly large institution pushes well into seven figures.
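The arithmetic above is easy to reproduce. The sketch below assumes a frontier price split of $3 per million input tokens and $15 per million output tokens. That split is our assumption, not a quote from any particular vendor; we chose it because it yields the $0.012 per flagged transaction used here. The function names are illustrative.

```python
# Assumed frontier-tier prices in USD per million tokens (illustrative, not a vendor quote).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def cost_per_flagged(input_tokens: int = 2_000, output_tokens: int = 400) -> float:
    """Cloud LLM cost of one flagged-transaction review at the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def annual_cloud_cost(monthly_transactions: int, flag_rate: float = 0.05) -> float:
    """Yearly explanation-layer bill: flagged volume x per-review cost x 12 months."""
    flagged_per_month = monthly_transactions * flag_rate
    return flagged_per_month * cost_per_flagged() * 12


if __name__ == "__main__":
    for monthly in (10_000_000, 100_000_000):
        print(f"{monthly:,} transactions/month -> ${annual_cloud_cost(monthly):,.0f}/year")
```

Run as a script, it prints $72,000 per year for the ten-million-transaction institution and $720,000 for the hundred-million one. Nothing in the model bends the curve: the bill is a straight line in transaction volume.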
These numbers exclude the cost of the upstream scoring models, the case management systems, and the regulatory reporting tools. The AI explanation layer on top is the line item we're examining here.
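Holding each review to the two-thousand-token input budget means trimming customer history before the prompt is sent. Below is a minimal sketch under a few assumptions. It uses a crude four-characters-per-token estimate, where a production pipeline would count with the deployed model's own tokenizer. The field names are hypothetical. It keeps the most recent history and drops older entries once the budget is spent.

```python
from dataclasses import dataclass, field

# Assumption: roughly 4 characters per token. A real pipeline would count tokens
# with the deployed model's own tokenizer instead.
CHARS_PER_TOKEN = 4
INPUT_TOKEN_BUDGET = 2_000


def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


@dataclass
class FlaggedTransaction:
    txn_id: str
    details: str   # e.g. amount, channel, timestamp
    merchant: str  # merchant data
    history: list[str] = field(default_factory=list)  # customer history, most recent first


def build_review_prompt(txn: FlaggedTransaction, budget: int = INPUT_TOKEN_BUDGET) -> str:
    """Assemble the review prompt, keeping as much recent history as the budget allows."""
    header = (
        "Contextualize this flagged transaction against the customer's history, "
        "give a structured risk explanation, suggest next actions, and draft a review note.\n"
        f"Transaction {txn.txn_id}: {txn.details}\n"
        f"Merchant: {txn.merchant}\n"
        "Customer history (most recent first):\n"
    )
    used = estimate_tokens(header)
    kept: list[str] = []
    for entry in txn.history:
        cost = estimate_tokens(entry + "\n")
        if used + cost > budget:
            break  # this entry and everything older are dropped
        kept.append(entry + "\n")
        used += cost
    return header + "".join(kept)
```

Because the history is ordered most recent first, the budget is spent on the behavior most likely to matter for the current decision.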
Why this is one of the strongest local-SLM cases in the operational stack
Fraud detection has every property we look for in a local-SLM workload, and several are extreme.
It is extreme on privacy. Transaction data is among the most heavily regulated categories of financial data. PCI DSS, GLBA, sectoral regulations across many jurisdictions, and customer identification rules all constrain where this data can go and who can process it. Most financial institutions, even the ones comfortable with cloud AI for other workloads, are not comfortable sending transaction-level data to a third-party cloud LLM.
It is extreme on regulatory pressure. Regulators in many jurisdictions have started examining how AI is used in fraud and AML workflows. The conversation about explainability, bias, model governance, and data residency is moving fast. Institutions running cloud-LLM-first architectures are increasingly finding that the architecture is what regulators want to discuss first.
It is highly latency-sensitive in the real-time case. For card-present transactions, the decision window is hundreds of milliseconds. A cloud LLM round trip in the loop is incompatible with that budget; a local SLM can fit inside it.
It is narrow. The model needs to know one institution's transaction patterns, customer base, merchant relationships, historical fraud cases. A model fine-tuned on these specifics will outperform a general model on the institution's own work.
It is high-volume. The cost scales linearly with transaction volume, which scales with the institution's growth. The cloud LLM bill never stops growing.
What changes with local inference
A fraud workflow on a local SLM looks like this.
A model is fine-tuned on the institution's transaction history, fraud case database, and risk-explanation corpus. The fine-tune captures the institution's specific patterns and language.
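Turning the fraud case database into fine-tuning data is mostly a formatting job. The minimal sketch below assumes that reviewed cases carry the analyst's signed-off explanation and an outcome label. The field names and the prompt/completion JSONL layout are our assumptions. Many fine-tuning toolchains accept some variant of that layout, but the exact schema depends on the one you use.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class ReviewedCase:
    case_id: str
    transaction_summary: str
    customer_context: str
    analyst_explanation: str  # explanation a human reviewer approved
    outcome: str              # e.g. "confirmed_fraud" or "false_positive"


def to_training_record(case: ReviewedCase) -> dict:
    """One supervised example: case context in, approved explanation out."""
    prompt = (
        f"Transaction: {case.transaction_summary}\n"
        f"Customer context: {case.customer_context}\n"
        "Write a structured risk explanation for the analyst."
    )
    completion = f"Outcome: {case.outcome}\nExplanation: {case.analyst_explanation}"
    return {"prompt": prompt, "completion": completion}


def to_jsonl_lines(cases: Iterable[ReviewedCase]) -> Iterator[str]:
    """Yield one JSON line per usable case, skipping cases with no approved explanation."""
    for case in cases:
        if not case.analyst_explanation.strip():
            continue
        yield json.dumps(to_training_record(case), ensure_ascii=False)
```

Writing the output is a matter of joining these lines with newlines into a file that never leaves the institution's infrastructure. Training only on analyst-approved explanations is what makes the fine-tune speak in the institution's own review language.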
The model runs on infrastructure the institution controls — on-premises, in a private cloud, or in a sovereign cloud that meets the regulator's requirements. The deployment is documented and approved by compliance and risk.
Transactions flow through the scoring pipeline. Flagged transactions hit the local model, which produces structured explanations and analyst-facing notes. The audit trail is local, versioned, and reviewable.
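One way to make the local audit trail reviewable is to store each structured explanation together with the model version that produced it, then chain the records by hash so that any later edit is detectable. The sketch below implements that idea with the standard library only. The `RiskExplanation` fields, the hash-chaining scheme, and the version string `fraud-slm-v3` are our assumptions, not a regulatory requirement.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # prev_hash of the first record in the log


@dataclass(frozen=True)
class RiskExplanation:
    txn_id: str
    risk_level: str                  # e.g. "low", "medium", "high"
    reasons: tuple[str, ...]
    suggested_actions: tuple[str, ...]
    analyst_note: str


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so identical content always hashes identically.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_audit_record(log: list[dict], expl: RiskExplanation, model_version: str) -> dict:
    """Append an explanation to the log, linked to the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {
        "explanation": asdict(expl),
        "model_version": model_version,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS_HASH
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True
```

Recording the model version on every entry is what lets a reviewer tie a given explanation back to the specific fine-tune that was approved at the time.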
For real-time card-present cases, the model runs at the edge, close enough to the transaction processing to meet the latency budget. Two-hundred-millisecond local inference fits inside a card authorization window; two-second cloud inference does not.
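Fitting inside the authorization window also means having a plan for the rare call that runs long. The minimal sketch below assumes a 200 ms budget for the model and a hypothetical upstream risk `score` on each transaction. It uses an illustrative 0.8 threshold for the rule-based fallback. `slow_model_stub` stands in for a real local model call.

```python
import concurrent.futures
import time
from typing import Callable

# Assumed share of the authorization window given to the model (seconds).
MODEL_BUDGET_S = 0.200
# Hypothetical upstream-score threshold, used only when the model misses its deadline.
FALLBACK_REVIEW_THRESHOLD = 0.8

# Long-lived pool so a late model call never blocks the authorization path.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def decide(
    txn: dict,
    model_call: Callable[[dict], str],
    budget_s: float = MODEL_BUDGET_S,
) -> tuple[str, str]:
    """Return (decision, source): the model's answer if it arrives in time, else a rule-based fallback."""
    future = _pool.submit(model_call, txn)
    try:
        return future.result(timeout=budget_s), "model"
    except concurrent.futures.TimeoutError:
        future.cancel()  # only succeeds if the call hasn't started; a late result is ignored
        decision = "review" if txn["score"] >= FALLBACK_REVIEW_THRESHOLD else "approve"
        return decision, "fallback"


def slow_model_stub(txn: dict) -> str:
    """Stand-in for a local model call that overruns its budget."""
    time.sleep(1.0)
    return "decline"
```

The design choice is that the deadline belongs to the authorization path, not to the model. A late answer is discarded rather than waited for, and the `source` tag makes it easy to monitor how often the fallback fires.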
The cost flips. The institution pays for the model, the hardware, and the engineering work to deploy and maintain it. The marginal cost of analyzing each additional transaction is close to zero. Volume can grow without the bill moving, up to the capacity of the deployed hardware.
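Because the local cost is mostly fixed, the comparison reduces to a break-even volume. The sketch below assumes the $0.012 cloud cost per flagged transaction from the cost math earlier. You would supply the local fixed annual cost (hardware, fine-tuning, engineering time) and any small per-transaction running cost yourself. The figure in the example is purely illustrative.

```python
def break_even_flagged_per_month(
    local_fixed_annual: float,
    cloud_cost_per_flagged: float = 0.012,
    local_marginal_per_flagged: float = 0.0,
) -> float:
    """Monthly flagged volume above which local inference is cheaper than the cloud."""
    saving_per_flagged = cloud_cost_per_flagged - local_marginal_per_flagged
    if saving_per_flagged <= 0:
        raise ValueError("local inference never breaks even at these prices")
    return local_fixed_annual / (12 * saving_per_flagged)


if __name__ == "__main__":
    # Illustrative only: a hypothetical $144,000/year all-in local deployment cost.
    volume = break_even_flagged_per_month(144_000)
    print(f"break-even at about {volume:,.0f} flagged transactions per month")
```

With that illustrative input, the function returns one million flagged transactions per month. Because the local side is mostly fixed, every flagged transaction above the break-even point widens the gap.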
What's better, beyond cost and privacy
A model fine-tuned on the institution's own fraud history is a meaningfully better tool.
It knows the specific patterns that have predicted fraud in this institution's customer base. It knows which merchants tend to appear in chargebacks. It knows which customer behaviors are consistent with the customer's profile and which are anomalous in the specific way the institution's typical fraud cases are anomalous.
A general cloud LLM has none of this prior. It produces competent explanations, but it is rediscovering patterns the institution already knows.
The fine-tuned local model produces better analyst-facing explanations. The analysts trust the model more. The false positive rate, the false negative rate, and the analyst review time all improve.
When the cloud LLM is still defensible
There are a few cases where cloud-LLM-based fraud workflows are still the right answer.
For brand-new institutions or new product lines without enough historical data to fine-tune. In the first six to twelve months, the cloud LLM's breadth compensates for the missing training data.
For institutions whose regulators have not yet objected to cloud LLM usage in this category. As of this writing, the regulatory pressure varies dramatically by jurisdiction. In some markets, the cloud LLM is still acceptable; in others, it is not.
For workflows that are batch rather than latency-sensitive (quarterly fraud trend reports, retrospective case reviews), the latency argument doesn't apply, which leaves cost and privacy as the deciding axes. For these workflows, the cloud may still be acceptable, depending on the institution's privacy posture.
For everything else — the high-volume, real-time, regulator-relevant transaction review work that constitutes most of the AI in modern fraud operations — the local-SLM case is strong, and the privacy case is closer to mandatory than to optional.
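The cases above can be read as a routing policy. The sketch below encodes one such reading, assuming four yes/no properties per workflow. The property names and the order of the checks are our interpretation of this section; in practice, the real policy would be set together with compliance and risk.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Workflow:
    name: str
    latency_sensitive: bool  # e.g. card-present authorization
    cloud_permitted: bool    # regulator and internal privacy posture both allow it
    enough_history: bool     # enough labeled data to fine-tune a local model
    high_volume: bool        # volume large enough that per-call pricing dominates


def route(w: Workflow) -> Backend:
    """Pick an inference backend for a fraud workflow (one possible policy)."""
    if w.latency_sensitive:
        return Backend.LOCAL  # a cloud round trip does not fit the decision window
    if not w.cloud_permitted:
        return Backend.LOCAL  # the data cannot leave controlled infrastructure
    if not w.enough_history:
        return Backend.CLOUD  # new institution or product line: lean on the cloud model's breadth
    # Batch work where the cloud is allowed: cost is what is left to decide on.
    return Backend.LOCAL if w.high_volume else Backend.CLOUD
```

Putting the latency and permission checks first reflects the section's ordering. Those are constraints, while history and volume are trade-offs.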
The pattern, in financial services
Avery NXR is not a fraud detection tool; it scaffolds Next.js applications. The architectural pattern repeats all the same.
Fraud detection is a narrow, repetitive, extreme-volume, extreme-privacy, latency-sensitive workload. The economics, the privacy story, and the regulatory pressure all point toward local inference as the right architecture for this category.
The vendors that build excellent fraud and AML tooling on local infrastructure — with appropriate fine-tuning, edge-deployment, and evidence packages for regulatory review — are going to find willing buyers in every financial institution worth talking to. The cloud-LLM-default products will hold the market until the regulatory pressure forces the architectural shift.
We expect that shift to be relatively rapid in financial services compared to other operational categories, because the regulatory pressure compounds with the cost pressure and the privacy pressure. The institutions that move first will be ahead of the curve on cost, on privacy, and on regulatory standing simultaneously.