
Nowadays, AI systems are pretty good at a lot of different things - coding, content creation, answering everyday questions. They can be great for brainstorming, generating ideas, and writing texts. But when it comes to numerical data, where numbers need to be exact every single time, AI turns out to be good at one more thing: lying.
Determinism means that given the same input, the system always produces the same output. LLMs are by design non-deterministic. They are optimized to generate the most probable next token, not to guarantee correctness. For text that is usually fine. A summary phrased differently each time, a headline with slightly different wording - close enough is often good enough.
But accuracy requirements don't live on a spectrum. Either the answer is correct or it isn't. And the problem isn't just that AI can be wrong. It's that it can be wrong confidently, consistently, and in ways that are hard to detect until the damage is done.
Reports and numerical data are where this problem becomes impossible to ignore, and impossible to hide. With textual outputs, inconsistency is relatively manageable. You can constrain tone, format, and structure through careful prompting. The model might phrase things differently each time, and for a summary or a headline that rarely matters. Numbers don't work that way. 35 and 37 are not "close enough". And are you sure 35 is actually the correct answer? Treating a confident-sounding answer as a correct one is exactly how you end up with a system users don't trust.
That's exactly what I ran into on one of the AI platforms I worked on. We were processing large volumes of data daily from multiple sources: things like product pricing, customer activity, marketing performance, shipping, and more. The system could answer natural language questions about that data. Impressive in a demo. Useless in production.
The tricky part wasn't that the system was obviously broken. It was that it mostly worked. Ask about pricing, get a clean answer. Ask the exact same question five minutes later, get a slightly different one. "€35" the first time, "€37" the second. Nothing changed - not the question, not the data, not the user. Only the answer did. And you might not even notice, because the difference is subtle. Both answers look reasonable. Neither looks wrong. That's the more dangerous version of the problem: the wrong answer that doesn't look wrong.
And when it happens, where do you even start debugging? Is the problem in the AI logic? Is it in the data pipeline? Maybe the data source had downtime and the data never arrived in the first place. Maybe the query ran correctly but against a stale snapshot. You have no idea which of those things went wrong, or if anything went wrong at all. No error message. No warning. Just a different number, delivered with the same confidence.
Users couldn't tell when to trust it. So they stopped trusting it entirely, and went back to checking answers manually. The feature defeated itself.
Trust isn't something you patch with a better prompt.
To understand the problem better, it helps to know how the system was built. At its core, it was a typical RAG (Retrieval-Augmented Generation) system combining conversation history, semantic search, and an SQL generation layer in the background that retrieved data from vector storage and a data warehouse, and returned answers to users.
The SQL generation layer was powered by Vanna AI, an open-source framework that uses large language models to translate natural language questions into SQL queries, letting users interact with databases conversationally without writing a single line of code. Vanna AI works with three core files: a DDL file that describes the database schema, a docs file with additional context, and a YAML file containing question-answer pairs used for training. These files are embedded into a vector database, ChromaDB in this case, which stores them as numerical representations of meaning. When a user asked a question, the system retrieved the most semantically similar examples and fed them to the LLM as context, so it could generate a more accurate query. That query was then executed against Amazon Redshift, where the actual data lived, and shown to the user.
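To make the retrieval step concrete, here is a minimal sketch of how stored question-SQL pairs get ranked against an incoming question. Everything in it is illustrative: the pairs, the table names, and the use of difflib as a similarity measure are stand-ins, since the real system compares embedding vectors in ChromaDB.

```python
from difflib import SequenceMatcher

# Hypothetical question-SQL pairs, standing in for the YAML training file.
TRAINING_PAIRS = [
    ("What was the average product price last week?",
     "SELECT AVG(price) FROM product_prices WHERE ..."),
    ("How many active users did we have yesterday?",
     "SELECT COUNT(DISTINCT user_id) FROM user_activity WHERE ..."),
    ("What was the total shipping cost per region this month?",
     "SELECT region, SUM(cost) FROM shipments GROUP BY region"),
]

def most_similar_pairs(question: str, k: int = 2):
    """Rank stored pairs by textual similarity to the incoming question.

    The real system compares embedding vectors in a vector database;
    SequenceMatcher is only a stand-in so the example runs without one.
    """
    return sorted(
        TRAINING_PAIRS,
        key=lambda pair: SequenceMatcher(None, question.lower(), pair[0].lower()).ratio(),
        reverse=True,
    )[:k]

examples = most_similar_pairs("What is the average price of our products?")
# examples[0] is the pricing pair; the top matches are injected into the
# LLM prompt as few-shot examples before the SQL is generated.
```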

And it worked. In the beginning.
But as the platform grew - more users, more active usage, more data collected every day - cracks started to appear. A single Vanna AI instance shared across multiple LLMs and many different question types simply wasn't enough. The context window grew too large, and whenever a question came in that the model hadn't been trained on, there was no reliable way to know whether the returned data was correct without manually checking the original source.
This diagram shows the full architecture, from data sources at the bottom to the frontend at the top. The interesting part is everything in between: how a natural language question gets routed, validated, and turned into a trustworthy answer. The following sections explain exactly how that works.
It was clear from the start that I couldn't change the nature of how LLMs work. But I could minimize inconsistency and maximize determinism through architecture. The core idea had two parts.
First, split one large Vanna AI instance into multiple smaller, specialized ones, each scoped to a specific query category and data source. Second, add multiple validation layers directly into the response flow. At a high level: an incoming query gets classified, routed to the right specialized instance with a minimal context window, and instead of generating one SQL query and trusting it blindly, the system generates three independent ones in parallel. Those three results are then cross-validated, and the best answer is returned to the user along with a confidence score and supporting benchmarks.
Previously: one large shared context, one SQL, one result, no signal of whether it was right.
Now: minimal focused context per category, three independent queries, cross-validated results, and an explicit measure of how much to trust the answer or where the problem could be.
This is what the step-by-step pipeline looks like under the hood: blue steps were already part of the original system, yellow ones are what made it actually trustworthy and deterministic. The following sections focus on the parts that matter most.

The first step was splitting the single large Vanna instance into multiple smaller, specialized ones and implementing a category detection layer that routes each query to the right one.
For example, questions about product pricing pull from one data source, questions about user activity from another. Eight categories in total, each with its own scoped context. The model only ever sees what is relevant to the question at hand.
This is where context window reduction starts to matter. A smaller, focused context means less noise, less room for the model to hallucinate, and more consistent SQL generation. Instead of one instance trying to know everything, each instance knows exactly what it needs to know and nothing more. A smaller context doesn't automatically make the model more consistent, but when there's less irrelevant information to work through, there are fewer things that can go wrong, and the outputs tend to be more stable across runs.
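A minimal sketch of the routing idea follows. The categories and keywords are hypothetical, and keyword matching only stands in for the LLM-based classifier so the example stays self-contained:

```python
# Hypothetical categories and trigger keywords; in production the
# classification is done by an LLM, not substring matching.
CATEGORY_KEYWORDS = {
    "pricing": ["price", "pricing", "cost"],
    "user_activity": ["user", "session", "login"],
    "shipping": ["shipping", "delivery"],
}

def detect_category(question: str) -> str:
    """Route a question to the specialized instance for its category."""
    q = question.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return category
    return "general"   # fallback instance with a broader context

detect_category("What was the average product price last month?")   # "pricing"
```

Each returned category maps to its own Vanna instance with its own scoped vector store, so the prompt only ever contains that category's schema and examples.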
Combining sources across categories is a whole different challenge. More on that later.

Once the right category is detected, the system doesn't generate one SQL query and hope for the best. It generates three, in parallel. Why three? Because a single output gives you no signal. Two can disagree with no way to break the tie. Three gives you something to compare against - and a meaningful way to detect when something is off.
The SQL generation happens through the category-specific instance selected in the previous step. That instance has a minimal, scoped context loaded from its own vector store (in our case ChromaDB with the schema description for that category, supporting documentation, and question-SQL training pairs collected specifically for it). Schema and documentation are injected in full rather than retrieved by similarity. Partial schema context is one of the most common causes of hallucinated column names and broken joins, so the tradeoff in prompt size is worth it. The similar question-SQL pairs, on the other hand, are retrieved by vector similarity: the system finds the closest past queries to the current question and includes them as few-shot examples.
On top of that, the last few SQL queries from the current conversation are injected into the system prompt as explicit prior context. On a reporting platform, users rarely ask questions in isolation. They look at a result and immediately want to slice it differently: add a column, narrow the date range. With the previous queries in the prompt, the model has the exact SQL structure it should be extending, not just a vague memory of what was discussed.
The instance then calls the underlying LLM three times independently, each going through this full prompt construction. Because temperature is non-zero but low, the three outputs are usually similar but not identical, which is exactly the point. Each query then goes through validation before anything touches real data. If it passes, it executes and the result is captured. All three results, successes and failures, are passed forward. Nothing is dropped, because a failed query is still a signal.
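The fan-out can be sketched roughly as below. The generator is a deterministic stub standing in for the LLM call (a real call at low but non-zero temperature varies slightly between runs), and the validation is a placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_sql(question: str, seed: int) -> str:
    """Stand-in for one LLM call through the category-scoped instance.
    The seed only simulates the slight variation between the three runs."""
    alias = ["p", "prod", "products"][seed % 3]
    return f"SELECT AVG({alias}.price) FROM products {alias}"

def is_valid(sql: str) -> bool:
    # Placeholder check: the real pipeline parses the SQL and validates it
    # against the schema before anything touches the warehouse.
    return sql.strip().upper().startswith("SELECT")

question = "What is the average product price?"
with ThreadPoolExecutor(max_workers=3) as pool:
    candidates = list(pool.map(lambda s: generate_sql(question, s), range(3)))

# Failures are kept alongside successes: a failed query is still a signal.
results = [(sql, is_valid(sql)) for sql in candidates]
```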

This is where the actual validation happens. Two LLM comparisons fire concurrently.
The first compares the three SQL queries against each other, pair by pair: SQL 1 vs 2, SQL 1 vs 3, SQL 2 vs 3. Each pair is classified as identical, logically equivalent, or different. Logically equivalent means different syntax that would produce the same meaningful result: a different JOIN order, different table aliases, columns listed in a different sequence in the SELECT, or WHERE conditions written in a different order. Those aren't real differences and aren't penalized.
The second comparison looks at the actual result sets. It takes the rows from each execution and compares them, classified the same way: identical, logically equivalent, or different. Logically equivalent covers cases where the data is correct but column names vary slightly (price vs total_price, avg_sales vs sales) or where row order differs but the values match.
Both comparisons run at the same time. By the time they finish, the system has a full picture of where the three queries agreed, where they did not, and whether the differences even matter. Two queries can produce different SQL and still be counted as equivalent if their results are the same. Two queries can be syntactically near-identical and still diverge on data if they hit different branches of the schema. Both signals feed directly into the confidence score in Step 5.
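The result-set comparison can be sketched like this. It is a deliberate simplification: dropping column names and ignoring row order approximates the "logically equivalent" class, whereas the production check is done by an LLM that can reason about subtler cases:

```python
from itertools import combinations

def classify_results(rows_a: list, rows_b: list) -> str:
    """Compare two result sets the way the pipeline's second check does.

    Simplification: column names are dropped and row order is ignored,
    so renamed columns or reordered rows count as equivalent.
    """
    if rows_a == rows_b:
        return "identical"
    values_a = sorted(tuple(row.values()) for row in rows_a)
    values_b = sorted(tuple(row.values()) for row in rows_b)
    return "equivalent" if values_a == values_b else "different"

results = [
    [{"price": 35.0}],
    [{"total_price": 35.0}],   # same value, renamed column
    [{"price": 37.0}],         # real disagreement worth flagging
]
verdicts = {
    (i, j): classify_results(results[i], results[j])
    for i, j in combinations(range(3), 2)
}
# verdicts[(0, 1)] is "equivalent"; verdicts[(0, 2)] is "different"
```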

With all that comparison data in hand, the system selects the best query from the three candidates. Each query is evaluated based on whether it was valid, whether it returned results, and how its output compares to the others. The goal is simple: find the most reliable answer, not just the first one that worked.

Generating the response
A natural language response is then generated from the selected query and its results. The model receives the original question, the SQL that was executed, and the full result set. Both the selected SQL and the natural language response move to the final step.
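A selection step like this can be expressed as a small scoring function. The weights below are assumptions for illustration, but they capture the stated priorities: validity and agreement with the other candidates dominate, and returning any rows at all adds a small bonus:

```python
def score_candidate(valid: bool, row_count: int, agreements: int) -> int:
    """Rank one of the three candidates. Weights are illustrative."""
    score = 0
    if valid:
        score += 4
    if row_count > 0:
        score += 1
    score += 2 * agreements   # how many of the other two had equivalent results
    return score

candidates = [
    {"valid": True,  "rows": 12, "agreements": 2},   # agrees with both others
    {"valid": True,  "rows": 12, "agreements": 1},
    {"valid": False, "rows": 0,  "agreements": 0},   # failed query, kept as signal
]
best = max(candidates,
           key=lambda c: score_candidate(c["valid"], c["rows"], c["agreements"]))
```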
Before anything reaches the user, a final set of checks runs against the data itself. This step doesn't look at the SQL or the model output. It looks at the source (warehouse) directly.
The first check is whether the data is recent enough to be useful. If any referenced table has not been updated in a while, a warning is attached with the table name and the last available date. Data pipelines have lag. A user asking about last week should not silently receive results from two weeks ago.
If the query targets a specific date or range, the system checks whether all expected days are actually present in the data. If days are missing, the user is told which ones and how many.
If anything the query depends on is not configured correctly or has missing data, the user is told before they try to interpret the result.
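The freshness and completeness checks above amount to simple date arithmetic against warehouse metadata. A sketch, with the two-day staleness threshold as an assumed value:

```python
from datetime import date, timedelta

def check_freshness(last_loaded: date, today: date, max_lag_days: int = 2):
    """Warn when a referenced table has not been updated recently enough.
    The two-day threshold is an assumption for this sketch."""
    lag = (today - last_loaded).days
    if lag > max_lag_days:
        return f"warning: table is {lag} days stale (last load {last_loaded})"
    return None

def missing_days(present: set, start: date, end: date) -> list:
    """Return requested days that are absent from the loaded data."""
    expected = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    return sorted(expected - present)

today = date(2024, 5, 10)
stale_warning = check_freshness(date(2024, 5, 3), today)   # triggers a warning
gaps = missing_days({date(2024, 5, 6), date(2024, 5, 8)},
                    date(2024, 5, 6), date(2024, 5, 8))    # May 7 is missing
```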
All of this feeds into a confidence score that summarizes everything collected across the full pipeline: SQL disagreement across the three candidates, data freshness, completeness, and validation results (stale data, empty results, failed validations, and so on). The final score maps to three levels: high, medium, or low.
The user doesn't just get an answer. They get a score, a level, and an explicit list of reasons: which checks passed, which failed, and by how much - a clear signal of whether that answer can be trusted, and if it can't, why not. Instead of forcing users to second-guess every result or manually verify the source, the system does that work for them and surfaces it directly. No more confident-sounding answers with nothing behind them.
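The aggregation into a level can be sketched as a weighted sum. The weights and thresholds below are assumptions, not the production values; the point is only that every signal collected upstream contributes to one explicit number:

```python
def confidence_level(agreement_ratio: float, fresh: bool, complete: bool) -> str:
    """Collapse pipeline signals into one level.

    agreement_ratio is the fraction of candidate pairs judged equivalent
    (0..1). Weights and thresholds here are illustrative assumptions.
    """
    score = 0.6 * agreement_ratio + 0.2 * fresh + 0.2 * complete
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

# Full agreement on fresh, complete data earns the highest level;
# one-in-three agreement with missing days drops it to the lowest.
confidence_level(1.0, True, True)      # "high"
confidence_level(1 / 3, True, False)   # "low"
```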

Some questions don't belong to a single category. When that happens, the system splits it into smaller sub-queries, each routed to its own specialized category instance and processed independently through the same pipeline. Once all results are ready, they are merged and returned to the user as a single coherent answer.
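The split-and-merge flow can be sketched as below. Splitting on " and " is purely illustrative; the real decomposition is done by an LLM, and each sub-question runs through the full pipeline (category detection, triple generation, validation, scoring) before the answers are merged:

```python
def split_question(question: str) -> list:
    """Stand-in for the LLM step that breaks a multi-category question
    into single-category sub-questions."""
    return [part.strip() for part in question.split(" and ")]

def run_pipeline(sub_question: str) -> str:
    # Placeholder: each sub-question would go through the full pipeline
    # independently and come back with its own answer and confidence.
    return f"[answer to: {sub_question}]"

parts = split_question("Show average product price and total shipments last week")
merged = " ".join(run_pipeline(p) for p in parts)
```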
AI won't stop lying about numbers. But with enough determinism built into the architecture around it, you can at least catch it when it does, so users know when to trust the answer and when to dig deeper.
The improvement wasn't in the model. It was in the architecture around it. Four things together pushed the system toward determinism: specialized instances with minimal, scoped context; three independently generated queries instead of one; cross-validation of both the SQL and the results; and data-level checks surfaced through an explicit confidence score.
None of these are revolutionary ideas on their own. But together they turn something unreliable into something teams can actually use. They measurably improved both determinism and accuracy - not to 100%, because data can still be missing, things can be misconfigured, and edge cases will always exist. What the architecture does guarantee is that when something is off, the system says so, instead of confidently returning a wrong answer. Because an AI that tells you it might be wrong is infinitely more useful than one that never doubts itself.
Fosleen builds AI-powered web and mobile applications for businesses that need real results — not just a website. From complex automation to full-scale platforms, we turn your goals into production-ready software, on time and built to scale.
Ready to build something that works?
Let's talk about your project.