Designing Evals: The New Discipline Every AI Team Needs in 2026

27/5/2026 · 11 min read

Every AI product team I have worked with in 2026 has had the same conversation — usually three months in, usually after a regression nobody caught — about why they should have built evals from day one.

In the first year of an AI product, vibes-based testing works. The team is small, the surface is small, and the founders are personally clicking through the prompts after every change. Quality is held together by the fact that the same three people are seeing every interesting output and noticing when something feels off.

That model breaks the moment the product crosses a few thousand users, or the team adds a second engineer touching the prompts, or the model itself gets upgraded. Suddenly nobody is seeing the whole surface. Regressions ship without anyone noticing. Quality decays in ways that are invisible until a customer screenshot lands in the shared channel. The discipline that replaces vibes — the one every serious AI team is now investing in — is evals.

What An Eval Actually Is, And What It Is Not

"Eval" is one of those words that everyone uses and nobody defines the same way. The most useful working definition in 2026 is this: an eval is a repeatable measurement of model or system output, against a fixed input and a defined notion of what "good" looks like. The repeatable part is what distinguishes it from a vibes check. The defined notion of good is what distinguishes it from a benchmark.

Evals are not benchmarks. A benchmark measures how a model performs on a shared, public set of inputs, the same set everyone else runs. That is useful for model selection and almost useless for product quality. Your users are not sending HumanEval or MMLU questions; they are asking questions and pasting data no benchmark ever saw.

Evals are not unit tests either, though they look like them at first. Unit tests pass or fail. Evals score on a spectrum, against criteria that themselves require judgment, on outputs that can be different on every run even with identical inputs. That looseness is the whole reason evals are a different discipline rather than just "tests for AI code".

The Four Types Of Evals Every AI Team Needs

Different evals answer different questions. The mistake I see most often is picking one type and treating it as the answer to every product quality question. In practice the teams getting this right run four overlapping kinds of evals, and use each one for what it is good at.

1. Reference evals

A fixed input, an expected output, an exact-match or fuzzy comparison. Useful for tasks with one right answer: extraction, classification, structured generation, code outputs. Building the dataset is the bottleneck, but its value compounds, because every example you add keeps checking every future change. Reference evals catch the regressions where the new model "sounds fine" but actually gets a fact wrong.
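To make the exact-match versus fuzzy distinction concrete, here is a minimal sketch of a reference scorer. The function names, the case format (an `id` and an `expected` field) and the 0.9 similarity threshold are all assumptions chosen for illustration, not a recommendation; the fuzzy comparison uses the standard library's `difflib`, which is only one of many possible similarity measures.

```python
from difflib import SequenceMatcher


def score_reference(output: str, expected: str, fuzzy_threshold: float = 0.9) -> dict:
    """Score one output against its reference answer (hypothetical helper)."""
    # Normalise whitespace and case so trivial formatting differences do not fail the case.
    out = " ".join(output.split()).lower()
    exp = " ".join(expected.split()).lower()
    similarity = SequenceMatcher(None, out, exp).ratio()
    return {
        "exact": out == exp,
        "similarity": similarity,
        # The threshold is an illustrative assumption; tune it per task.
        "passed": out == exp or similarity >= fuzzy_threshold,
    }


def pass_rate(cases: list[dict], outputs: dict[str, str]) -> float:
    """Fraction of cases that pass; a missing output is scored as an empty string."""
    if not cases:
        return 0.0
    passed = sum(
        score_reference(outputs.get(case["id"], ""), case["expected"])["passed"]
        for case in cases
    )
    return passed / len(cases)
```

For extraction or classification you would probably rely on the `exact` field alone; the fuzzy path is there for outputs where small wording differences are acceptable.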

2. LLM-as-judge evals

The output is too open-ended for reference comparison, so another model scores it against a rubric. Good for tone, helpfulness, format adherence, faithfulness to sources. The judge prompt is itself a load-bearing artefact and needs to be evaluated against human ratings before you trust it. The judge's biases become your product's biases, so this is not the place to be lazy.

3. Human evals

Real people score real outputs. Slow, expensive, irreplaceable. Used to calibrate the LLM judges, to validate that the automated metrics correlate with what users actually care about, and to catch the systematic failure modes the model-based evals are blind to. The minimum cadence is "before every significant model change", not "when something feels off".

4. Production behavioural evals

Implicit signals from real users — thumbs, regenerates, abandons, edits to generated content, time to first response, downstream task completion. These are the evals that capture what your product is actually doing in the wild, not what you wish it were doing in a curated test set. Without them the offline evals slowly drift away from reality.

Building The Dataset Is Most Of The Work

The eval framework you choose — Inspect, Promptfoo, Braintrust, OpenAI's evals, an internal tool — is a footnote. The dataset is the whole game. Without representative, well-labelled, regularly-refreshed evaluation data, the framework is a fast way to measure the wrong thing precisely.

The first version of your dataset comes from production traffic, sampled with intent. Not random — random sampling under-represents the long tail of interesting failure modes. Stratify by user segment, by feature surface, by output length, by the cases your team already knows are hard. Aim for two hundred examples across the surfaces that matter; you can grow from there.
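Sampling "with intent" can be as simple as bucketing the traffic first and drawing from each bucket. The sketch below assumes production records are plain dicts with hypothetical `segment` and `surface` fields; the field names and the per-bucket quota are placeholders for whatever your logs actually contain.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Take up to `per_stratum` records from each stratum defined by `key(record)`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    buckets = defaultdict(list)
    for record in records:
        buckets[key(record)].append(record)
    sample = []
    for stratum in sorted(buckets):  # sorted so the output order is deterministic
        items = buckets[stratum]
        # Small strata contribute everything they have rather than being skipped.
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


# Illustrative traffic: 95 records from one segment, 5 from another.
logs = [{"segment": "free", "surface": "chat", "id": i} for i in range(95)]
logs += [{"segment": "enterprise", "surface": "chat", "id": 100 + i} for i in range(5)]
picked = stratified_sample(logs, key=lambda r: (r["segment"], r["surface"]), per_stratum=5)
```

A uniform random draw of ten records from this traffic would usually contain zero or one enterprise example; the stratified draw contains five.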

Labelling is where the team usually flinches. The work is tedious and feels low-status. The teams who treat labelling as a domain-expert task — a senior engineer or a product manager spending half a day a week with the dataset — end up with evals that catch real regressions. The teams that outsource labelling to "whoever has time" end up with a noisy dataset that nobody trusts and everybody ignores.

Refresh the dataset on a cadence. Every time a real production failure escapes the evals, add a regression example to the set. Over twelve months the dataset becomes the memory of every bug your product has ever shipped, and the eval pass-rate becomes a meaningful proxy for "is this version of the system worse than the last one".

LLM-As-Judge: The Sharp Edge Most Teams Cut Themselves On

Using a model to score model outputs is the seductive shortcut at the centre of every modern eval stack. It is also where the most subtle failure modes live. The failure mode is not "the judge is bad". It is "the judge is biased in a direction you did not measure, and your eval scores are now lying to you with confidence".

The biases that bite hardest are well-documented and worth memorising. Judges prefer longer outputs over shorter ones, even when shorter is more correct. Judges prefer outputs that look like their own writing style. Judges score outputs from the same model family more leniently than outputs from other model families. Judges anchored on a numerical scale cluster around the middle of the range, hiding real quality differences in the noise.

The mitigations are equally well-documented and equally often skipped. Calibrate the judge against human ratings before trusting it. Use pairwise comparisons rather than absolute scores when the rubric is subjective. Strip identifying markers from the candidates before scoring. Run the same candidate past multiple judges and measure inter-judge agreement; if the judges disagree with each other the eval signal is noise.
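Measuring agreement, whether judge against human or judge against judge, needs a number that corrects for chance. One common choice is Cohen's kappa; the sketch below is a plain-Python version for two raters assigning categorical labels. It is my choice of statistic for illustration, not the only valid one, and the label values are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled at random with their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


human = ["good", "good", "bad", "bad"]
judge = ["good", "good", "bad", "good"]
kappa = cohens_kappa(human, judge)  # 0.5: raw agreement is 0.75, chance agreement is 0.5
```

Raw agreement of 75% sounds respectable until the chance correction halves it, which is exactly the kind of overconfidence the calibration step is meant to catch.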

Plugging Evals Into The Development Loop

An eval suite that runs once a quarter to satisfy a leadership ask is theatre. An eval suite that runs on every prompt change, every model bump and every retrieval tweak is a quality system. The difference is the integration into the development loop, not the sophistication of the metrics.

A fast subset of the evals runs on every pull request that touches a prompt, a tool definition or a system message. Sub-five-minute feedback loop. Catches the dumb regressions before review.

The full eval suite runs on a nightly schedule and on every release candidate, with results posted to a shared dashboard. The dashboard shows pass-rate over time, regressions by category, and which examples newly broke or newly passed since the previous run.

Production behavioural signals stream into the same dashboard, so the offline pass-rate and the real-user signal can be compared. When they diverge, that divergence itself becomes the next thing to investigate.

Every shipped fix for a real production bug ends with a new eval case committed alongside the change. The team builds a regression suite for free, one incident at a time.
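The "newly broke or newly passed" view is the part of the dashboard worth building first, and it is a small amount of code. This sketch assumes each run is stored as a mapping from example id to a pass/fail boolean; that storage format and the function name are assumptions for illustration.

```python
def diff_runs(previous: dict[str, bool], current: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two eval runs keyed by example id (True means the example passed)."""
    shared = previous.keys() & current.keys()
    return {
        "newly_broken": sorted(i for i in shared if previous[i] and not current[i]),
        "newly_passing": sorted(i for i in shared if not previous[i] and current[i]),
        # Examples only in one run are reported separately so they cannot hide a regression.
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
    }


last_night = {"refund-01": True, "refund-02": False, "tone-07": True}
tonight = {"refund-01": False, "refund-02": True, "tone-07": True, "tone-08": True}
report = diff_runs(last_night, tonight)
```

Here the pass-rate went up between runs, from two of three to three of four, while `refund-01` regressed. That is why the per-example diff matters more than the headline number.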

The Failure Modes That Quietly Poison An Eval Suite

The painful failure mode is not "we have no evals". It is "we have evals that give us false confidence". A bad eval suite is worse than no eval suite because the team trusts the green dashboard and stops looking at outputs, while quality drifts in ways the evals are structurally blind to.

The first poison is dataset overfitting. The team tweaks the prompts until the eval pass-rate is high, the prompts quietly specialise to the test set, and real-world performance does not match the score. The fix is a held-out evaluation set that the team never sees and never optimises against, rotated quarterly.
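One way to get a held-out set that nobody hand-picks, and that rotates on schedule, is to assign examples by hashing their id together with a rotation label. This is a sketch of that idea rather than a prescribed method; the function name, the `2026-Q2` style label and the 20% fraction are all assumptions.

```python
import hashlib


def is_held_out(example_id: str, rotation: str, fraction: float = 0.2) -> bool:
    """Deterministically assign an example to the held-out set for a given rotation."""
    digest = hashlib.sha256(f"{rotation}:{example_id}".encode("utf-8")).digest()
    # Map the first four bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < fraction


example_ids = [f"case-{i}" for i in range(1000)]
held_out = [i for i in example_ids if is_held_out(i, rotation="2026-Q2")]
dev_set = [i for i in example_ids if not is_held_out(i, rotation="2026-Q2")]
```

Changing the rotation label reshuffles the assignment, so each quarter a different slice is hidden without anyone choosing which examples those are.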

The second poison is stale rubrics. The rubric the team wrote a year ago no longer reflects what users care about. The eval keeps passing while the product slowly disappoints. The fix is to review the rubric every quarter against fresh user research and production behavioural data, and to be willing to throw out an eval that no longer measures the right thing.

The third poison is metric monoculture. The team optimises one number — a single pass-rate, a single helpfulness score — and the system silently degrades on every dimension that number does not capture. The fix is a small portfolio of evals across different axes (accuracy, helpfulness, format, safety, cost, latency) and the discipline to look at all of them before declaring a release ready.
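The "look at all of them" discipline is easier to keep when the release check refuses to collapse the axes into one number. A minimal sketch, assuming each metric has either a floor or a ceiling; the metric names and threshold values below are invented for illustration.

```python
def release_gate(metrics: dict[str, float], thresholds: dict[str, tuple[str, float]]) -> list[str]:
    """Return one message per violated or missing axis; an empty list means ready."""
    failures = []
    for name, (direction, limit) in thresholds.items():
        if direction not in ("min", "max"):
            raise ValueError(f"unknown direction for {name}: {direction}")
        if name not in metrics:
            # A metric that was never computed is treated as a failure, not a pass.
            failures.append(f"{name}: missing")
            continue
        value = metrics[name]
        ok = value >= limit if direction == "min" else value <= limit
        if not ok:
            failures.append(f"{name}: {value} violates {direction} {limit}")
    return failures


# Hypothetical thresholds: floors for quality axes, ceilings for cost and latency.
thresholds = {
    "accuracy": ("min", 0.90),
    "helpfulness": ("min", 0.80),
    "p95_latency_s": ("max", 3.0),
    "cost_per_request_usd": ("max", 0.02),
}
candidate = {"accuracy": 0.94, "helpfulness": 0.85, "p95_latency_s": 4.2}
problems = release_gate(candidate, thresholds)
```

The candidate improves accuracy and helpfulness, yet the gate still reports two problems: latency is over its ceiling and cost was never measured.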

Who Owns Evals, And When Do You Need A Dedicated Role

Eval ownership in 2026 follows a predictable arc. In the first six months of an AI product, the founding engineer or PM owns it as a side responsibility. Over the following year, it becomes a shared discipline across the team: everyone contributes examples, everyone reads the dashboard, everyone is expected to add a regression case when they fix a bug. Past a certain product complexity, you need a dedicated owner.

The trigger for the dedicated role is usually one of three things. The eval surface is now large enough that nobody has the full picture in their head. The team is running multiple parallel experiments and the eval pipeline cannot keep up. The company has external commitments (regulatory, contractual, safety) that require documented evaluation processes a part-time owner cannot maintain.

The title for this role is still in flux — "AI quality engineer", "evals engineer", "AI evaluation lead". The shape is consistent: someone who owns the dataset, the framework, the rubrics, the dashboard and the relationship with the human raters, and who reports product quality numbers up to leadership the same way a data team reports business metrics. Not a research role, not a pure engineering role, an applied quality role that did not really exist three years ago.

Conclusion: Evals Are The Quality System, Not A Project

The teams shipping the best AI products in 2026 do not have better prompts than their competitors. They have better feedback loops on what their prompts are actually doing.

Evals are the unsexy machinery behind that feedback loop. They are the equivalent of the regression test suite, the staging environment, the deploy pipeline — the boring infrastructure that distinguishes a product team from a demo team. The companies that internalise this early build a compounding advantage. The companies that keep treating evals as "something we will do once it stops being fine" learn the lesson the hard way, usually in public.

If you take one thing from this, take this: start the dataset before you need it. Two hundred labelled examples, gathered intentionally from production, scored by a domain expert, is the seed crystal everything else grows around. The framework you pick, the metrics you compute, the dashboard you build — all downstream of having a dataset worth measuring against.

And remember that evals are a discipline, not a project. They get worse if you stop tending them. They get better if the team treats them as load-bearing. Pick one person to be accountable. Give them the time. Read the dashboard out loud at the same meeting where you read the product metrics. That is how evals stop being theatre and start being the quality system your AI product actually deserves.
