In February 2026, a fintech company deployed a customer support agent that passed every test in their CI/CD pipeline. Unit tests covered every tool function. Integration tests validated the API connections. End-to-end tests confirmed the agent could answer ten representative questions correctly. The agent went live on a Monday. By Wednesday, it had told three customers their loan applications were approved when they were actually under review. The agent was not hallucinating in the traditional sense — it had retrieved the correct status from the database. It just interpreted "pending final review" as a positive signal and communicated it as an approval. No test in the pipeline had checked for that kind of semantic misinterpretation because no test was designed to evaluate reasoning, only output format.
This is not an isolated incident. It is the predictable result of testing non-deterministic software with deterministic tools. According to LangChain’s 2026 State of AI Agents report, 57% of organisations now have agents in production. Quality is the number one barrier to further deployment, cited by 32% of respondents. Not cost. Not latency. Not infrastructure. Quality — meaning the agent does something wrong, and the team either cannot detect it beforehand or cannot reproduce it afterwards.
The reason is structural. Every QA pipeline in software engineering was built on one assumption: given the same input, the software produces the same output. AI agents violate this assumption on every request. An agent processing the same user query may select different tools, retrieve different context chunks, take a different number of reasoning steps, and generate a different final response. Run it ten times, get ten different execution traces. Traditional testing has no vocabulary for "the output is different every time but should still be correct."
Why Traditional Testing Breaks
Software testing has three foundational layers: unit tests verify individual functions, integration tests verify components working together, and end-to-end tests verify the complete user flow. Each layer assumes determinism. Each layer breaks differently when applied to AI agents.
Unit Tests: Testing the Wrong Thing
You can unit test an agent’s tool functions — the database query, the API call, the data transformation. These are deterministic. They work. But they test the plumbing, not the decision-making. The agent’s value is not in calling a function. It is in deciding which function to call, when, with what parameters, and how to interpret the result. Unit tests cannot cover tool selection logic because that logic lives inside the LLM, not in your code.
Integration Tests: Snapshot Fragility
Traditional integration tests compare actual output against expected output. For AI agents, this means snapshot testing — saving the agent’s response to a known query and failing if the response changes. The problem: the response changes every time. You can pin the temperature to zero, fix the random seed, and freeze the model version. The output will still vary because LLM inference is not mathematically deterministic even at temperature zero. Floating-point non-associativity in batched GPU kernels, batch-size differences, and provider-side infrastructure changes all introduce variance. Teams that build snapshot-based agent tests spend more time updating snapshots than catching real bugs.
End-to-End Tests: Combinatorial Explosion
A traditional web application has a finite number of user paths. You can enumerate them. An agent has a combinatorial explosion of possible execution paths — every tool selection, every reasoning step, every context retrieval creates a branch. A simple agent with five tools and an average of four reasoning steps has 5^4 = 625 possible execution paths. You cannot write end-to-end tests for 625 paths. And that is a simple agent.
- Deterministic assertions break when the same input produces different (but correct) outputs across runs
- Snapshot testing requires constant updates as model behaviour drifts with provider changes
- Code coverage metrics are meaningless — 100% code coverage says nothing about agent decision quality
- Mocking the LLM removes the exact behaviour you need to test (reasoning, tool selection, interpretation)
- Edge cases are infinite — you cannot enumerate all possible reasoning paths through a tool set
The New Testing Paradigm: Behavioural Properties Over Exact Outputs
Testing non-deterministic systems is not a new problem. It is how the physical sciences have worked for centuries. You do not test whether an experiment produces exactly 9.81 m/s² for gravitational acceleration. You test whether the result falls within an acceptable range, whether it is consistent across trials, and whether it exhibits the expected behavioural properties. AI agent testing needs the same paradigm shift: from exact output matching to behavioural property validation.
A behavioural property test does not assert "the agent’s response is exactly X." It asserts "the agent’s response satisfies properties A, B, and C." For a customer support agent, those properties might be: the response addresses the user’s question (relevance), the response contains only information present in retrieved context (groundedness), the response does not promise actions the agent cannot take (safety), and the response maintains the brand’s tone (consistency).
Five Shifts from Deterministic to Probabilistic Testing
Stop asserting "output === expected". Start asserting "output satisfies these properties." Define properties as rubrics with clear pass/fail criteria. A customer support response must address the user’s question, cite only retrieved information, and not make promises the system cannot keep. Each property is scored independently.
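A property-based assertion can be sketched in a few lines. The predicates below are deliberately crude, keyword-level checks, purely illustrative stand-ins for the LLM-as-judge scorers described later; the function names and the example data are invented for this sketch:

```python
import re

def addresses_question(response: str, question_keywords: list[str]) -> bool:
    """Crude relevance check: the response mentions the question's key terms."""
    text = response.lower()
    return any(kw.lower() in text for kw in question_keywords)

def grounded_in_context(response: str, context: str) -> bool:
    """Crude groundedness check: every number in the response appears in the context."""
    return all(num in context for num in re.findall(r"\d+", response))

def no_forbidden_promises(response: str, forbidden: list[str]) -> bool:
    """Crude safety check: the response avoids commitment language."""
    text = response.lower()
    return not any(phrase in text for phrase in forbidden)

def check_properties(response, question_keywords, context, forbidden):
    # Each property is scored independently, never as exact-output matching.
    return {
        "relevance": addresses_question(response, question_keywords),
        "groundedness": grounded_in_context(response, context),
        "safety": no_forbidden_promises(response, forbidden),
    }

result = check_properties(
    response="Your order 4821 is pending final review.",
    question_keywords=["order", "status"],
    context="Order 4821: status pending final review.",
    forbidden=["approved", "guaranteed"],
)
assert all(result.values())
```

The point is the shape, not the predicates: three independent pass/fail dimensions instead of one `output === expected` comparison.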
Run the same test input three to five times and average the scores. A single run tells you nothing about reliability. Statistical evaluation absorbs non-deterministic variance. If an agent passes 4 out of 5 runs on a given test case, you have an 80% pass rate, a real metric you can track and set thresholds against.
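The arithmetic is trivial, which is the point: a pass rate over repeated runs is something you can threshold, trend, and alert on, unlike a single boolean. A minimal sketch:

```python
def pass_rate(run_results: list[bool]) -> float:
    """Fraction of repeated runs that passed a given test case."""
    return sum(run_results) / len(run_results)

# One test case, run five times against the non-deterministic agent.
runs = [True, True, False, True, True]
rate = pass_rate(runs)
assert rate == 0.8  # 4 of 5 runs passed
```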
Use a separate LLM to evaluate the agent’s outputs against defined rubrics. This scales evaluation beyond what human review can cover. Braintrust’s evaluation platform lets teams define scorers in natural language — "Does this response answer the user’s question without introducing information not present in the context?" — and run them automatically on every pipeline execution.
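The mechanics of a natural-language scorer can be sketched without any platform. In this sketch, `call_llm` is a placeholder for your provider's chat-completion client (stubbed here so the example runs), and the rubric wording and prompt layout are illustrative, not Braintrust's actual API:

```python
RUBRIC = (
    "Does this response answer the user's question without introducing "
    "information not present in the context? Reply PASS or FAIL, then explain."
)

def build_judge_prompt(question: str, context: str, response: str) -> str:
    """Assemble the judge's input: rubric, then the material to score."""
    return (
        f"{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Response: {response}\n"
        f"Verdict:"
    )

def parse_verdict(judge_output: str) -> bool:
    """Binary scoring: the judge's reply must lead with PASS."""
    return judge_output.strip().upper().startswith("PASS")

# Stubbed judge for demonstration; in practice this is a real model call,
# ideally to a different model family than the agent under test.
def call_llm(prompt: str) -> str:
    return "PASS: the response only restates information from the context."

prompt = build_judge_prompt(
    question="What is the status of loan application 7731?",
    context="Application 7731: status pending final review.",
    response="Application 7731 is pending final review.",
)
assert parse_verdict(call_llm(prompt)) is True
```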
Replace binary pass/fail with continuous quality scores. An agent output scored 0.85 on relevance, 0.92 on groundedness, and 0.78 on completeness gives you actionable information. Set CI/CD quality gates that block deployment when any score drops below threshold. Braintrust integrates directly into CI/CD pipelines with automated deployment blocking.
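A quality gate over continuous scores reduces to a per-dimension threshold check. The dimension names and threshold values below are examples, not recommendations:

```python
# Example thresholds; tune these per product and tighten over time.
THRESHOLDS = {"relevance": 0.80, "groundedness": 0.90, "completeness": 0.70}

def gate(scores: dict[str, float]) -> bool:
    """Deployment proceeds only if every dimension meets its threshold."""
    return all(scores.get(dim, 0.0) >= minimum for dim, minimum in THRESHOLDS.items())

assert gate({"relevance": 0.85, "groundedness": 0.92, "completeness": 0.78})
# Groundedness 0.88 is below its 0.90 threshold, so this deployment blocks:
assert not gate({"relevance": 0.85, "groundedness": 0.88, "completeness": 0.78})
```

Note that a missing dimension defaults to 0.0 and blocks the gate, which is the safe failure mode for an evaluator that errored out.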
Testing does not stop at deployment. Run automated evaluators on a sample of production traffic continuously. LangChain’s report found that nearly a quarter of organisations running any evaluations combine offline (pre-deployment) and online (production) evaluation. Human review remains essential, used as a primary method by 59.8% of teams, but LLM-as-judge, at 53.3%, handles the volume.
“You do not test a weather model by checking if it predicts exactly 72°F tomorrow. You test whether its predictions fall within acceptable error bounds across hundreds of forecasts. Agent testing is the same discipline applied to software for the first time.”
LLM-as-Judge: The Evaluation Engine That Also Needs Testing
LLM-as-judge is now the dominant automated evaluation approach for AI agents. The concept is straightforward: use one LLM to evaluate another LLM’s output against a defined rubric. The execution is full of traps.
How It Works
A judge LLM receives the agent’s input, output, and any retrieved context. It also receives a scoring rubric — a detailed description of what "good" looks like for this evaluation dimension. The judge scores the output against the rubric, typically on a binary (pass/fail) or ordinal (1-5) scale, and provides reasoning for the score. This runs automatically on every test case, every CI/CD pipeline execution, and optionally on a sample of production traffic.
What Goes Wrong
The first pitfall is self-bias. Research from Microsoft and others has found that judge LLMs subtly favour outputs from models in their own family: a GPT-4o judge scores GPT-4o outputs higher than Claude outputs of equivalent quality. The mitigation is to use a different model family for judging than for agent execution, or to use multiple judges and require consensus.
The second pitfall is rubric vagueness. A rubric that says "evaluate whether the response is helpful" produces inconsistent scores because "helpful" means different things in different contexts. Effective rubrics include concrete examples: "A score of 5 means the response fully resolves the user’s stated problem and anticipates likely follow-up questions. Example: [specific example]. A score of 1 means the response does not address the user’s question. Example: [specific example]."
The third pitfall is scale granularity. LLMs struggle to calibrate fine-grained numerical scores consistently. Asking a judge to distinguish between a 6 and a 7 on a 10-point scale produces noise, not signal. Binary scoring (pass/fail) or 3-point scales (poor/acceptable/excellent) produce dramatically more reliable results. Pairwise comparisons — "is response A better than response B?" — correlate even better with human judgement than direct scoring.
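Pairwise verdicts still need aggregating into a trackable number. One common convention, used here as an illustrative sketch rather than any platform's API, is a win rate with ties counted as half a win:

```python
def win_rate(verdicts: list[str]) -> float:
    """Aggregate pairwise judge verdicts ('A', 'B', or 'tie') into
    a win rate for response A, counting ties as half a win."""
    wins = sum(1.0 if v == "A" else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return wins / len(verdicts)

# Four pairwise comparisons between prompt variant A and variant B:
assert win_rate(["A", "A", "B", "tie"]) == 0.625
```

A win rate persistently above 0.5 across enough comparisons is the signal that variant A is genuinely better, not judge noise.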
| Judge Approach | Reliability | Cost | Best For |
|---|---|---|---|
| Binary (pass/fail) | Highest consistency across runs | Lowest — minimal tokens per evaluation | Safety checks, compliance gates, format validation |
| Ordinal (1-5 scale) | Moderate — calibration drift over time | Moderate | Quality tracking, regression detection, A/B comparison |
| Pairwise comparison | Highest correlation with human judgement | Higher — evaluates two outputs per run | Model selection, prompt optimization, A/B testing |
| Chain-of-thought judge | Improved over direct scoring | Highest — judge reasons before scoring | Complex evaluations where scoring rationale matters |
Simulation Testing: Stress-Testing Agents Before Production
The most significant advance in agent testing in 2026 is simulation-based evaluation — generating hundreds or thousands of synthetic test scenarios and running the agent against them before deployment. This addresses the combinatorial explosion problem: you cannot manually write tests for every possible user input, but you can generate them.
Maxim AI’s platform leads this category, combining scenario generation with automated evaluation in a single workflow. Teams define user personas (e.g., "frustrated customer who has been transferred three times"), scenario parameters (e.g., "account locked, two failed password resets"), and expected behavioural properties. The platform generates hundreds of variations and evaluates the agent’s responses across all of them, surfacing failure clusters and edge cases that manual testing would never find.
The approach works because it tests the agent’s decision-making across a realistic distribution of inputs, not just the ten "happy path" queries a QA engineer wrote. A customer support agent tested against 10 hand-crafted queries might score 100%. The same agent tested against 500 generated scenarios covering frustrated users, ambiguous requests, multi-intent messages, and adversarial prompts might score 73% — and the 27% failure cases reveal exactly where the agent breaks.
- Multi-intent messages: "Cancel my subscription and also where is my refund?" — agents often handle the first intent and ignore the second
- Adversarial framing: "My lawyer told me you have to refund me" — tests whether the agent makes commitments it should not
- Context overload: messages with ten paragraphs of background before the actual question — tests whether the agent identifies the real ask
- Language variation: the same question phrased formally, casually, angrily, and with typos — tests robustness to input variation
- Edge-case tool results: empty database responses, API timeouts, partial data — tests how the agent handles degraded tool outputs
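The variation axes above can be crossed mechanically before an LLM fills in the actual message text. The persona, intent, and style categories below are illustrative placeholders, not Maxim AI's schema; a minimal sketch of the enumeration step:

```python
from itertools import product

# Illustrative scenario axes; a simulation platform would generate the
# concrete user message for each combination with an LLM.
personas = ["frustrated, transferred three times", "first-time user", "adversarial"]
intents = ["cancel subscription", "refund status", "account locked"]
styles = ["formal", "casual", "with typos"]

scenarios = [
    {"persona": p, "intent": i, "style": s}
    for p, i, s in product(personas, intents, styles)
]
assert len(scenarios) == 27  # 3 x 3 x 3 combinations from three small axes
```

Three small axes already yield 27 scenarios; add a few more axes and a handful of values each, and the hundreds-of-scenarios scale the platforms advertise falls out of the same cross product.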
The Evaluation Tooling Landscape in 2026
The agent evaluation market is consolidating around a few major platforms, each with a distinct philosophy. None of them does everything. Most production teams use at least two.
| Platform | Philosophy | Key Strength | Key Limitation |
|---|---|---|---|
| Braintrust | Evaluation-first, CI/CD native | Integrates quality gates directly into deployment pipelines. 80x faster trace queries. Natural language scorer definition. | Proprietary, closed-source. No self-hosting option. |
| Langfuse | Open-source, composable | MIT licensed, self-hostable. Full control over data. Flexible evaluation with custom evaluators and prompt versioning. | Requires more setup. Less opinionated — you build your own evaluation workflow. |
| Maxim AI | Simulation + evaluation + observability | End-to-end platform. Agent simulation across thousands of scenarios. Cross-functional collaboration between engineering and product. | Newer platform. Smaller community than open-source alternatives. |
| Galileo | Hallucination specialisation | Luna-2 SLMs: 88% accuracy, 152ms latency, 97% cost reduction vs GPT-4. Research-backed hallucination detection. | Narrower scope — primarily hallucination and guardrails, not general evaluation. |
| Arize Phoenix | Open-source tracing + commercial evaluation | Deep agent trace analysis. Multi-step workflow evaluation. Strong hallucination scoring. | Commercial features (Arize AX) required for full evaluation capabilities. |
The emerging pattern is layered tooling. Teams use an open-source platform (Langfuse or Arize Phoenix) for tracing and basic evaluation, a specialised platform (Braintrust or Maxim) for CI/CD integration and simulation testing, and a focused tool (Galileo) for specific quality dimensions like hallucination detection. This mirrors how the broader observability market matured — no single vendor won; teams assembled stacks.
Building an Agent Testing Pipeline: A Practical Architecture
Here is the testing architecture that works for production AI agents in 2026. It has four layers, each catching different failure categories.
The Four-Layer Agent Testing Pipeline
Unit test your tool functions, your data transformers, your API clients — everything that is not the LLM. Use traditional testing frameworks (Jest, pytest, Go test) for these. They are deterministic. Test them deterministically. Do not mock the LLM here because the LLM is not involved. This layer catches broken tools, malformed API calls, and data transformation errors.
Maintain a dataset of 50-200 representative inputs covering your key scenarios, edge cases, and known failure modes. Run the agent against this dataset on every CI/CD pipeline execution. Evaluate outputs using LLM-as-judge scorers for relevance, groundedness, safety, and domain-specific properties. Average scores across 3 runs per input to absorb variance. Set quality gate thresholds — deployment blocks if any dimension drops below threshold. This layer catches prompt regressions, model drift, and quality degradation.
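The Layer 2 loop is small enough to sketch end to end. Here `run_agent` and `score_response` are placeholders for your agent invocation and your LLM-as-judge scorer (stubbed so the example runs); the dataset record shape is an assumption for illustration:

```python
def evaluate_dataset(dataset, run_agent, score_response, runs_per_input=3):
    """Run each golden-dataset input several times, score every run,
    and average per input to absorb non-deterministic variance."""
    results = {}
    for item in dataset:
        scores = [
            score_response(item, run_agent(item["input"]))
            for _ in range(runs_per_input)
        ]
        results[item["id"]] = sum(scores) / len(scores)
    return results

# Stubbed demo with a one-item dataset and a judge that always passes:
dataset = [{"id": "q1", "input": "Where is my refund?"}]
out = evaluate_dataset(
    dataset,
    run_agent=lambda query: "stub response",
    score_response=lambda item, response: 1.0,
)
assert out == {"q1": 1.0}
```

The per-input averages then feed the quality-gate check: if any dimension's dataset-wide average drops below threshold, the pipeline blocks.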
Before major releases, run simulation-based testing using Maxim AI or equivalent tooling. Generate 500-1000 synthetic scenarios based on production traffic patterns. Evaluate agent performance across the full distribution. This layer catches failures you did not know to test for — the multi-intent messages, adversarial framings, and context overload scenarios that manual test authoring misses.
After deployment, run automated evaluators on 5-10% of production traffic. Compare quality scores against baseline. Alert on statistical deviations. Human review a random sample weekly. This layer catches production-only failures — real user inputs that differ from your test data, model provider changes that affect behaviour, and gradual quality drift that is invisible in pre-deployment testing.
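One practical detail in that layer is how to pick the 5-10% sample. Random sampling works, but hashing the interaction ID makes the decision deterministic and reproducible, so a flagged interaction can be re-examined later. A sketch under that assumption (the ID format is invented):

```python
import hashlib

def in_sample(interaction_id: str, rate: float = 0.05) -> bool:
    """Deterministically sample ~5% of interactions by hashing the ID:
    the same ID always lands in or out of the sample."""
    digest = hashlib.sha256(interaction_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < rate

# Over many IDs the sample converges on the target rate:
sampled = [i for n in range(10_000) if in_sample(i := f"req-{n}")]
assert 350 < len(sampled) < 650  # roughly 5% of 10,000
```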
The Human-in-the-Loop Problem
LLM-as-judge scales evaluation. It does not replace human judgement. LangChain’s report found that 59.8% of teams still rely on human review as a primary evaluation method, even as 53.3% also use LLM-as-judge. The two approaches complement each other in specific ways.
Humans are better at evaluating subjective quality dimensions: tone, empathy, cultural sensitivity, brand alignment. They catch failure modes that automated evaluators miss because they bring domain expertise and contextual understanding that no rubric can fully encode. But human evaluation does not scale. A human reviewer can evaluate maybe 50-100 agent interactions per day with meaningful attention. A production agent handles thousands.
The practical architecture: LLM-as-judge handles volume evaluation on every interaction. Human review handles calibration — reviewing a statistically representative sample to ensure the automated evaluators are still aligned with human standards. When the automated judge and human reviewers disagree on more than 20% of a sample, the rubric needs updating, not the human.
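The calibration check itself is a simple agreement rate over a shared sample. A minimal sketch, with the 80% bar taken from the disagreement threshold above:

```python
def agreement_rate(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of calibration-sample items where the automated judge
    and the human reviewer reached the same verdict."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

judge = [True, True, False, True, False]
human = [True, True, False, False, False]
rate = agreement_rate(judge, human)
assert rate == 0.8  # at the boundary: one more disagreement means fix the rubric
```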
“Human review is the calibration instrument for automated evaluation, not the evaluation itself. You calibrate a thermometer against a reference standard periodically. You do not use the reference standard to measure every temperature. Same principle.”
What a Mature Agent Testing Practice Looks Like
- Quality scores are tracked per deployment, per dimension, over time — and the team can show the trend
- CI/CD quality gates automatically block deployments when evaluation scores drop below threshold
- A golden dataset of 50+ curated examples with human-assigned ground truth is maintained and updated monthly
- LLM judge agreement with human reviewers is measured quarterly and exceeds 80%
- Simulation testing runs before every major release with 500+ synthetic scenarios
- Production evaluation runs continuously on 5-10% of traffic with real-time alerting
- The team can answer "what is our agent’s current quality score?" at any time without manual review
Most teams are nowhere near this. The LangChain report found that most teams start with offline evaluations due to their lower barrier to entry. That is fine. Start with Layer 2 — a curated dataset and LLM-as-judge in CI/CD. Then add Layer 4 (production monitoring). Then Layer 3 (simulation). The layers are additive. Each one catches failures the previous layers miss.
What to Do This Week
If your AI agents are in production and your testing pipeline is still built for deterministic software, here is the minimum viable change.
Minimum Viable Agent Testing — This Week
Pull 30 representative queries from production logs. Add 10 known edge cases. Add 10 adversarial inputs. For each, write a brief description of what a good response looks like — not an exact expected output, but behavioural properties. This is your golden dataset. It will take one afternoon.
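A golden-dataset record can be as simple as a dict per example. The field names and content below are invented for illustration; the essential choice is that `properties` holds behavioural descriptions, not an exact expected output:

```python
golden_example = {
    "id": "prod-0042",                # illustrative ID
    "input": "I was charged twice this month, can you fix it?",
    "source": "production_log",       # or "edge_case" / "adversarial"
    "properties": [
        "acknowledges the duplicate charge",
        "does not promise a refund the system cannot issue",
        "offers a concrete next step",
    ],
}
assert "expected_output" not in golden_example  # properties, not snapshots
```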
Pick your highest-risk failure mode — hallucination, relevance, safety, or completeness. Write a scoring rubric with examples for each score level. Use Braintrust, Langfuse, or even a simple script that calls an LLM API with the rubric. Run it against your golden dataset. This gives you a baseline quality score.
Add the evaluation as a pipeline step. Run the agent against the golden dataset on every PR or deployment. Average scores across 3 runs. Set a threshold — start generous and tighten over time. Block deployment if the score drops below threshold. You now have a quality gate.
Log 5% of production agent interactions. Run the same LLM-as-judge scorer on the logged interactions daily. Compare against your baseline. Alert if the score drops by more than 10%. You now have production quality monitoring.
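The alerting rule is one comparison. A sketch of the 10%-drop check, reading the drop as relative to the baseline score:

```python
def should_alert(baseline: float, current: float, tolerance: float = 0.10) -> bool:
    """Alert when the production score falls more than `tolerance`
    (as a fraction of baseline) below the golden-dataset baseline."""
    return current < baseline * (1 - tolerance)

assert not should_alert(baseline=0.85, current=0.80)  # within 10% of 0.85
assert should_alert(baseline=0.85, current=0.74)      # below the 0.765 floor
```

Whether "drops by more than 10%" means relative or absolute points is worth pinning down explicitly in your alerting config; this sketch assumes relative.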
This is not a complete testing practice. It is the foundation that turns "we hope the agent works" into "we measure whether the agent works." Everything else — simulation testing, multi-dimensional evaluation, human calibration, statistical analysis — builds on this foundation.
Where Fordel Builds
We architect and implement AI agent systems with testing built into the foundation. That means evaluation datasets designed from real production traffic patterns. LLM-as-judge scorers calibrated against human ground truth. CI/CD quality gates that prevent quality regressions from reaching users. Simulation testing that stress-tests agents against scenarios your team has not imagined yet.
If your agents are in production and your testing pipeline still treats them like deterministic software, the quality failures are already happening — you just cannot see them yet. We can help you build the testing infrastructure that makes agent quality measurable, trackable, and enforceable. Reach out.