In February 2026, a fintech company deployed a customer support agent that passed every test in their CI/CD pipeline. Unit tests covered every tool function. Integration tests validated the API connections. End-to-end tests confirmed the agent could answer ten representative questions correctly. The agent went live on a Monday. By Wednesday, it had told three customers their loan applications were approved when they were actually under review. The agent was not hallucinating in the traditional sense — it had retrieved the correct status from the database. It just interpreted "pending final review" as a positive signal and communicated it as an approval. No test in the pipeline had checked for that kind of semantic misinterpretation because no test was designed to evaluate reasoning, only output format.
This is not an isolated incident. It is the predictable result of testing non-deterministic software with deterministic tools. According to LangChain’s 2026 State of AI Agents report, 57% of organisations now have agents in production. Quality is the number one barrier to further deployment, cited by 32% of respondents. Not cost. Not latency. Not infrastructure. Quality — meaning the agent does something wrong, and the team either cannot detect it beforehand or cannot reproduce it afterwards.
The reason is structural. Every QA pipeline in software engineering was built on one assumption: given the same input, the software produces the same output. AI agents violate this assumption on every request. An agent processing the same user query may select different tools, retrieve different context chunks, take a different number of reasoning steps, and generate a different final response. Run it ten times, get ten different execution traces. Traditional testing has no vocabulary for "the output is different every time but should still be correct."
Why Traditional Testing Breaks
Software testing has three foundational layers: unit tests verify individual functions, integration tests verify components working together, and end-to-end tests verify the complete user flow. Each layer assumes determinism. Each layer breaks differently when applied to AI agents.
Unit Tests: Testing the Wrong Thing
You can unit test an agent’s tool functions — the database query, the API call, the data transformation. These are deterministic. They work. But they test the plumbing, not the decision-making. The agent’s value is not in calling a function. It is in deciding which function to call, when, with what parameters, and how to interpret the result. Unit tests cannot cover tool selection logic because that logic lives inside the LLM, not in your code.
Integration Tests: Snapshot Fragility
Traditional integration tests compare actual output against expected output. For AI agents, this means snapshot testing — saving the agent’s response to a known query and failing if the response changes. The problem: the response changes every time. You can pin the temperature to zero, fix the random seed, and freeze the model version. The output will still vary because LLM inference is not mathematically deterministic even at temperature zero. Floating-point non-associativity in batched GPU kernels, batch-size differences, and provider-side infrastructure changes all introduce variance. Teams that build snapshot-based agent tests spend more time updating snapshots than catching real bugs.
End-to-End Tests: Combinatorial Explosion
A traditional web application has a finite number of user paths. You can enumerate them. An agent has a combinatorial explosion of possible execution paths — every tool selection, every reasoning step, every context retrieval creates a branch. A simple agent with five tools and an average of four reasoning steps has 5^4 = 625 possible execution paths. You cannot write end-to-end tests for 625 paths. And that is a simple agent.
- Deterministic assertions break when the same input produces different (but correct) outputs across runs
- Snapshot testing requires constant updates as model behaviour drifts with provider changes
- Code coverage metrics are meaningless — 100% code coverage says nothing about agent decision quality
- Mocking the LLM removes the exact behaviour you need to test (reasoning, tool selection, interpretation)
- Edge cases are infinite — you cannot enumerate all possible reasoning paths through a tool set
The New Testing Paradigm: Behavioural Properties Over Exact Outputs
Testing non-deterministic systems is not a new problem. It is how the physical sciences have worked for centuries. You do not test whether an experiment produces exactly 9.81 m/s² for gravitational acceleration. You test whether the result falls within an acceptable range, whether it is consistent across trials, and whether it exhibits the expected behavioural properties. AI agent testing needs the same paradigm shift: from exact output matching to behavioural property validation.
A behavioural property test does not assert "the agent’s response is exactly X." It asserts "the agent’s response satisfies properties A, B, and C." For a customer support agent, those properties might be: the response addresses the user’s question (relevance), the response contains only information present in retrieved context (groundedness), the response does not promise actions the agent cannot take (safety), and the response maintains the brand’s tone (consistency).
Five Shifts from Deterministic to Probabilistic Testing
Stop asserting "output === expected". Start asserting "output satisfies these properties." Define properties as rubrics with clear pass/fail criteria. A customer support response must address the user’s question, cite only retrieved information, and not make promises the system cannot keep. Each property is scored independently.
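A property-based assertion can be sketched in a few lines. The predicates below are deliberately crude, keyword-level checks, purely illustrative stand-ins for the LLM-as-judge scorers described later; the function names and the example data are invented for this sketch:

```python
import re

def addresses_question(response: str, question_keywords: list[str]) -> bool:
    """Crude relevance check: the response mentions the question's key terms."""
    text = response.lower()
    return any(kw.lower() in text for kw in question_keywords)

def grounded_in_context(response: str, context: str) -> bool:
    """Crude groundedness check: every number in the response appears in the context."""
    return all(num in context for num in re.findall(r"\d+", response))

def no_forbidden_promises(response: str, forbidden: list[str]) -> bool:
    """Crude safety check: the response avoids commitment language."""
    text = response.lower()
    return not any(phrase in text for phrase in forbidden)

def check_properties(response, question_keywords, context, forbidden):
    # Each property is scored independently, never as exact-output matching.
    return {
        "relevance": addresses_question(response, question_keywords),
        "groundedness": grounded_in_context(response, context),
        "safety": no_forbidden_promises(response, forbidden),
    }

result = check_properties(
    response="Your order 4821 is pending final review.",
    question_keywords=["order", "status"],
    context="Order 4821: status pending final review.",
    forbidden=["approved", "guaranteed"],
)
assert all(result.values())
```

The point is the shape, not the predicates: three independent pass/fail dimensions instead of one `output === expected` comparison.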
Run the same test input three to five times and average the scores. A single run tells you nothing about reliability. Statistical evaluation absorbs non-deterministic variance. If an agent passes 4 out of 5 runs on a given test case, you have an 80% pass rate, a real metric you can track and set thresholds against.
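The arithmetic is trivial, which is the point: a pass rate over repeated runs is something you can threshold, trend, and alert on, unlike a single boolean. A minimal sketch:

```python
def pass_rate(run_results: list[bool]) -> float:
    """Fraction of repeated runs that passed a given test case."""
    return sum(run_results) / len(run_results)

# One test case, run five times against the non-deterministic agent.
runs = [True, True, False, True, True]
rate = pass_rate(runs)
assert rate == 0.8  # 4 of 5 runs passed
```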
Use a separate LLM to evaluate the agent’s outputs against defined rubrics. This scales evaluation beyond what human review can cover. Braintrust’s evaluation platform lets teams define scorers in natural language — "Does this response answer the user’s question without introducing information not present in the context?" — and run them automatically on every pipeline execution.
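The mechanics of a natural-language scorer can be sketched without any platform. In this sketch, `call_llm` is a placeholder for your provider's chat-completion client (stubbed here so the example runs), and the rubric wording and prompt layout are illustrative, not Braintrust's actual API:

```python
RUBRIC = (
    "Does this response answer the user's question without introducing "
    "information not present in the context? Reply PASS or FAIL, then explain."
)

def build_judge_prompt(question: str, context: str, response: str) -> str:
    """Assemble the judge's input: rubric, then the material to score."""
    return (
        f"{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Response: {response}\n"
        f"Verdict:"
    )

def parse_verdict(judge_output: str) -> bool:
    """Binary scoring: the judge's reply must lead with PASS."""
    return judge_output.strip().upper().startswith("PASS")

# Stubbed judge for demonstration; in practice this is a real model call,
# ideally to a different model family than the agent under test.
def call_llm(prompt: str) -> str:
    return "PASS: the response only restates information from the context."

prompt = build_judge_prompt(
    question="What is the status of loan application 7731?",
    context="Application 7731: status pending final review.",
    response="Application 7731 is pending final review.",
)
assert parse_verdict(call_llm(prompt)) is True
```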
Replace binary pass/fail with continuous quality scores. An agent output scored 0.85 on relevance, 0.92 on groundedness, and 0.78 on completeness gives you actionable information. Set CI/CD quality gates that block deployment when any score drops below threshold. Braintrust integrates directly into CI/CD pipelines with automated deployment blocking.
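A quality gate over continuous scores reduces to a per-dimension threshold check. The dimension names and threshold values below are examples, not recommendations:

```python
# Example thresholds; tune these per product and tighten over time.
THRESHOLDS = {"relevance": 0.80, "groundedness": 0.90, "completeness": 0.70}

def gate(scores: dict[str, float]) -> bool:
    """Deployment proceeds only if every dimension meets its threshold."""
    return all(scores.get(dim, 0.0) >= minimum for dim, minimum in THRESHOLDS.items())

assert gate({"relevance": 0.85, "groundedness": 0.92, "completeness": 0.78})
# Groundedness 0.88 is below its 0.90 threshold, so this deployment blocks:
assert not gate({"relevance": 0.85, "groundedness": 0.88, "completeness": 0.78})
```

Note that a missing dimension defaults to 0.0 and blocks the gate, which is the safe failure mode for an evaluator that errored out.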
Testing does not stop at deployment. Run automated evaluators on a sample of production traffic continuously. LangChain’s report found that nearly a quarter of organisations running any evaluations combine offline (pre-deployment) and online (production) evaluation. Human review remains essential, used as a primary method by 59.8% of teams, but LLM-as-judge, at 53.3%, handles the volume.
“You do not test a weather model by checking if it predicts exactly 72°F tomorrow. You test whether its predictions fall within acceptable error bounds across hundreds of forecasts. Agent testing is the same discipline applied to software for the first time.”
LLM-as-Judge: The Evaluation Engine That Also Needs Testing
LLM-as-judge is now the dominant automated evaluation approach for AI agents. The concept is straightforward: use one LLM to evaluate another LLM’s output against a defined rubric. The execution is full of traps.
How It Works
A judge LLM receives the agent’s input, output, and any retrieved context. It also receives a scoring rubric — a detailed description of what "good" looks like for this evaluation dimension. The judge scores the output against the rubric, typically on a binary (pass/fail) or ordinal (1-5) scale, and provides reasoning for the score. This runs automatically on every test case, every CI/CD pipeline execution, and optionally on a sample of production traffic.
What Goes Wrong
The first pitfall is self-bias. Research from Microsoft and others has found that judge LLMs subtly favour outputs from models in their own family: a GPT-4o judge scores GPT-4o outputs higher than Claude outputs of equivalent quality. The mitigation is to use a different model family for judging than for agent execution, or to use multiple judges and require consensus.
The second pitfall is rubric vagueness. A rubric that says "evaluate whether the response is helpful" produces inconsistent scores because "helpful" means different things in different contexts. Effective rubrics include concrete examples: "A score of 5 means the response fully resolves the user’s stated problem and anticipates likely follow-up questions. Example: [specific example]. A score of 1 means the response does not address the user’s question. Example: [specific example]."
The third pitfall is scale granularity. LLMs struggle to calibrate fine-grained numerical scores consistently. Asking a judge to distinguish between a 6 and a 7 on a 10-point scale produces noise, not signal. Binary scoring (pass/fail) or 3-point scales (poor/acceptable/excellent) produce dramatically more reliable results. Pairwise comparisons — "is response A better than response B?" — correlate even better with human judgement than direct scoring.
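Pairwise verdicts still need aggregating into a trackable number. One common convention, used here as an illustrative sketch rather than any platform's API, is a win rate with ties counted as half a win:

```python
def win_rate(verdicts: list[str]) -> float:
    """Aggregate pairwise judge verdicts ('A', 'B', or 'tie') into
    a win rate for response A, counting ties as half a win."""
    wins = sum(1.0 if v == "A" else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return wins / len(verdicts)

# Four pairwise comparisons between prompt variant A and variant B:
assert win_rate(["A", "A", "B", "tie"]) == 0.625
```

A win rate persistently above 0.5 across enough comparisons is the signal that variant A is genuinely better, not judge noise.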
| Judge Approach | Reliability | Cost | Best For |
|---|---|---|---|
| Binary (pass/fail) | Highest consistency across runs | Lowest — minimal tokens per evaluation | Safety checks, compliance gates, format validation |
| Ordinal (1-5 scale) | Moderate — calibration drift over time | Moderate | Quality tracking, regression detection, A/B comparison |
| Pairwise comparison | Highest correlation with human judgement | Higher — evaluates two outputs per run | Model selection, prompt optimization, A/B testing |
| Chain-of-thought judge | Improved over direct scoring | Highest — judge reasons before scoring | Complex evaluations where scoring rationale matters |
Simulation Testing: Stress-Testing Agents Before Production
The most significant advance in agent testing in 2026 is simulation-based evaluation — generating hundreds or thousands of synthetic test scenarios and running the agent against them before deployment. This addresses the combinatorial explosion problem: you cannot manually write tests for every possible user input, but you can generate them.
Maxim AI’s platform leads this category, combining scenario generation with automated evaluation in a single workflow. Teams define user personas (e.g., "frustrated customer who has been transferred three times"), scenario parameters (e.g., "account locked, two failed password resets"), and expected behavioural properties. The platform generates hundreds of variations and evaluates the agent’s responses across all of them, surfacing failure clusters and edge cases that manual testing would never find.
The approach works because it tests the agent’s decision-making across a realistic distribution of inputs, not just the ten "happy path" queries a QA engineer wrote. A customer support agent tested against 10 hand-crafted queries might score 100%. The same agent tested against 500 generated scenarios covering frustrated users, ambiguous requests, multi-intent messages, and adversarial prompts might score 73% — and the 27% failure cases reveal exactly where the agent breaks.
- Multi-intent messages: "Cancel my subscription and also where is my refund?" — agents often handle the first intent and ignore the second
- Adversarial framing: "My lawyer told me you have to refund me" — tests whether the agent makes commitments it should not
- Context overload: messages with ten paragraphs of background before the actual question — tests whether the agent identifies the real ask
- Language variation: the same question phrased formally, casually, angrily, and with typos — tests robustness to input variation
- Edge-case tool results: empty database responses, API timeouts, partial data — tests how the agent handles degraded tool outputs
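The variation axes above can be crossed mechanically before an LLM fills in the actual message text. The persona, intent, and style categories below are illustrative placeholders, not Maxim AI's schema; a minimal sketch of the enumeration step:

```python
from itertools import product

# Illustrative scenario axes; a simulation platform would generate the
# concrete user message for each combination with an LLM.
personas = ["frustrated, transferred three times", "first-time user", "adversarial"]
intents = ["cancel subscription", "refund status", "account locked"]
styles = ["formal", "casual", "with typos"]

scenarios = [
    {"persona": p, "intent": i, "style": s}
    for p, i, s in product(personas, intents, styles)
]
assert len(scenarios) == 27  # 3 x 3 x 3 combinations from three small axes
```

Three small axes already yield 27 scenarios; add a few more axes and a handful of values each, and the hundreds-of-scenarios scale the platforms advertise falls out of the same cross product.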
The Evaluation Tooling Landscape in 2026
The agent evaluation market is consolidating around a few major platforms, each with a distinct philosophy. None of them does everything. Most production teams use at least two.
| Platform | Philosophy | Key Strength | Key Limitation |
|---|---|---|---|
| Braintrust | Evaluation-first, CI/CD native | Integrates quality gates directly into deployment pipelines. 80x faster trace queries. Natural language scorer definition. | Proprietary, closed-source. No self-hosting option. |
| Langfuse | Open-source, composable | MIT licensed, self-hostable. Full control over data. Flexible evaluation with custom evaluators and prompt versioning. | Requires more setup. Less opinionated — you build your own evaluation workflow. |
| Maxim AI | Simulation + evaluation + observability | End-to-end platform. Agent simulation across thousands of scenarios. Cross-functional collaboration between engineering and product. | Newer platform. Smaller community than open-source alternatives. |
| Galileo | Hallucination specialisation | Luna-2 SLMs: 88% accuracy, 152ms latency, 97% cost reduction vs GPT-4. Research-backed hallucination detection. | Narrower scope — primarily hallucination and guardrails, not general evaluation. |
| Arize Phoenix | Open-source tracing + commercial evaluation | Deep agent trace analysis. Multi-step workflow evaluation. Strong hallucination scoring. | Commercial features (Arize AX) required for full evaluation capabilities. |
The emerging pattern is layered tooling. Teams use an open-source platform (Langfuse or Arize Phoenix) for tracing and basic evaluation, a specialised platform (Braintrust or Maxim) for CI/CD integration and simulation testing, and a focused tool (Galileo) for specific quality dimensions like hallucination detection. This mirrors how the broader observability market matured — no single vendor won; teams assembled stacks.
Building an Agent Testing Pipeline: A Practical Architecture
Here is the testing architecture that works for production AI agents in 2026. It has four layers, each catching different failure categories.
The Four-Layer Agent Testing Pipeline
Unit test your tool functions, your data transformers, your API clients — everything that is not the LLM. Use traditional testing frameworks (Jest, pytest, Go test) for these. They are deterministic. Test them deterministically. Do not mock the LLM here because the LLM is not involved. This layer catches broken tools, malformed API calls, and data transformation errors.
Maintain a dataset of 50-200 representative inputs covering your key scenarios, edge cases, and known failure modes. Run the agent against this dataset on every CI/CD pipeline execution. Evaluate outputs using LLM-as-judge scorers for relevance, groundedness, safety, and domain-specific properties. Average scores across 3 runs per input to absorb variance. Set quality gate thresholds — deployment blocks if any dimension drops below threshold. This layer catches prompt regressions, model drift, and quality degradation.
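The Layer 2 loop is small enough to sketch end to end. Here `run_agent` and `score_response` are placeholders for your agent invocation and your LLM-as-judge scorer (stubbed so the example runs); the dataset record shape is an assumption for illustration:

```python
def evaluate_dataset(dataset, run_agent, score_response, runs_per_input=3):
    """Run each golden-dataset input several times, score every run,
    and average per input to absorb non-deterministic variance."""
    results = {}
    for item in dataset:
        scores = [
            score_response(item, run_agent(item["input"]))
            for _ in range(runs_per_input)
        ]
        results[item["id"]] = sum(scores) / len(scores)
    return results

# Stubbed demo with a one-item dataset and a judge that always passes:
dataset = [{"id": "q1", "input": "Where is my refund?"}]
out = evaluate_dataset(
    dataset,
    run_agent=lambda query: "stub response",
    score_response=lambda item, response: 1.0,
)
assert out == {"q1": 1.0}
```

The per-input averages then feed the quality-gate check: if any dimension's dataset-wide average drops below threshold, the pipeline blocks.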
Before major releases, run simulation-based testing using Maxim AI or equivalent tooling. Generate 500-1000 synthetic scenarios based on production traffic patterns. Evaluate agent performance across the full distribution. This layer catches failures you did not know to test for — the multi-intent messages, adversarial framings, and context overload scenarios that manual test authoring misses.
After deployment, run automated evaluators on 5-10% of production traffic. Compare quality scores against baseline. Alert on statistical deviations. Human review a random sample weekly. This layer catches production-only failures — real user inputs that differ from your test data, model provider changes that affect behaviour, and gradual quality drift that is invisible in pre-deployment testing.
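One practical detail in that layer is how to pick the 5-10% sample. Random sampling works, but hashing the interaction ID makes the decision deterministic and reproducible, so a flagged interaction can be re-examined later. A sketch under that assumption (the ID format is invented):

```python
import hashlib

def in_sample(interaction_id: str, rate: float = 0.05) -> bool:
    """Deterministically sample ~5% of interactions by hashing the ID:
    the same ID always lands in or out of the sample."""
    digest = hashlib.sha256(interaction_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < rate

# Over many IDs the sample converges on the target rate:
sampled = [i for n in range(10_000) if in_sample(i := f"req-{n}")]
assert 350 < len(sampled) < 650  # roughly 5% of 10,000
```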
The Human-in-the-Loop Problem
LLM-as-judge scales evaluation. It does not replace human judgement. LangChain’s report found that 59.8% of teams still rely on human review as a primary evaluation method, even as 53.3% also use LLM-as-judge. The two approaches complement each other in specific ways.
Humans are better at evaluating subjective quality dimensions: tone, empathy, cultural sensitivity, brand alignment. They catch failure modes that automated evaluators miss because they bring domain expertise and contextual understanding that no rubric can fully encode. But human evaluation does not scale. A human reviewer can evaluate maybe 50-100 agent interactions per day with meaningful attention. A production agent handles thousands.
The practical architecture: LLM-as-judge handles volume evaluation on every interaction. Human review handles calibration — reviewing a statistically representative sample to ensure the automated evaluators are still aligned with human standards. When the automated judge and human reviewers disagree on more than 20% of a sample, the rubric needs updating, not the human.
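The calibration check itself is a simple agreement rate over a shared sample. A minimal sketch, with the 80% bar taken from the disagreement threshold above:

```python
def agreement_rate(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of calibration-sample items where the automated judge
    and the human reviewer reached the same verdict."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

judge = [True, True, False, True, False]
human = [True, True, False, False, False]
rate = agreement_rate(judge, human)
assert rate == 0.8  # at the boundary: one more disagreement means fix the rubric
```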
“Human review is the calibration instrument for automated evaluation, not the evaluation itself. You calibrate a thermometer against a reference standard periodically. You do not use the reference standard to measure every temperature. Same principle.”
What a Mature Agent Testing Practice Looks Like
- Quality scores are tracked per deployment, per dimension, over time — and the team can show the trend
- CI/CD quality gates automatically block deployments when evaluation scores drop below threshold
- A golden dataset of 50+ curated examples with human-assigned ground truth is maintained and updated monthly
- LLM judge agreement with human reviewers is measured quarterly and exceeds 80%
- Simulation testing runs before every major release with 500+ synthetic scenarios
- Production evaluation runs continuously on 5-10% of traffic with real-time alerting
- The team can answer "what is our agent’s current quality score?" at any time without manual review
Most teams are nowhere near this. The LangChain report found that most teams start with offline evaluations due to their lower barrier to entry. That is fine. Start with Layer 2 — a curated dataset and LLM-as-judge in CI/CD. Then add Layer 4 (production monitoring). Then Layer 3 (simulation). The layers are additive. Each one catches failures the previous layers miss.
What to Do This Week
If your AI agents are in production and your testing pipeline is still built for deterministic software, here is the minimum viable change.
Minimum Viable Agent Testing — This Week
Pull 30 representative queries from production logs. Add 10 known edge cases. Add 10 adversarial inputs. For each, write a brief description of what a good response looks like — not an exact expected output, but behavioural properties. This is your golden dataset. It will take one afternoon.
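A golden-dataset record can be as simple as a dict per example. The field names and content below are invented for illustration; the essential choice is that `properties` holds behavioural descriptions, not an exact expected output:

```python
golden_example = {
    "id": "prod-0042",                # illustrative ID
    "input": "I was charged twice this month, can you fix it?",
    "source": "production_log",       # or "edge_case" / "adversarial"
    "properties": [
        "acknowledges the duplicate charge",
        "does not promise a refund the system cannot issue",
        "offers a concrete next step",
    ],
}
assert "expected_output" not in golden_example  # properties, not snapshots
```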
Pick your highest-risk failure mode — hallucination, relevance, safety, or completeness. Write a scoring rubric with examples for each score level. Use Braintrust, Langfuse, or even a simple script that calls an LLM API with the rubric. Run it against your golden dataset. This gives you a baseline quality score.
Add the evaluation as a pipeline step. Run the agent against the golden dataset on every PR or deployment. Average scores across 3 runs. Set a threshold — start generous and tighten over time. Block deployment if the score drops below threshold. You now have a quality gate.
Log 5% of production agent interactions. Run the same LLM-as-judge scorer on the logged interactions daily. Compare against your baseline. Alert if the score drops by more than 10%. You now have production quality monitoring.
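The alerting rule is one comparison. A sketch of the 10%-drop check, reading the drop as relative to the baseline score:

```python
def should_alert(baseline: float, current: float, tolerance: float = 0.10) -> bool:
    """Alert when the production score falls more than `tolerance`
    (as a fraction of baseline) below the golden-dataset baseline."""
    return current < baseline * (1 - tolerance)

assert not should_alert(baseline=0.85, current=0.80)  # within 10% of 0.85
assert should_alert(baseline=0.85, current=0.74)      # below the 0.765 floor
```

Whether "drops by more than 10%" means relative or absolute points is worth pinning down explicitly in your alerting config; this sketch assumes relative.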
This is not a complete testing practice. It is the foundation that turns "we hope the agent works" into "we measure whether the agent works." Everything else — simulation testing, multi-dimensional evaluation, human calibration, statistical analysis — builds on this foundation.
Where Fordel Builds
We architect and implement AI agent systems with testing built into the foundation. That means evaluation datasets designed from real production traffic patterns. LLM-as-judge scorers calibrated against human ground truth. CI/CD quality gates that prevent quality regressions from reaching users. Simulation testing that stress-tests agents against scenarios your team has not imagined yet.
If your agents are in production and your testing pipeline still treats them like deterministic software, the quality failures are already happening — you just cannot see them yet. We can help you build the testing infrastructure that makes agent quality measurable, trackable, and enforceable. Reach out.