In November 2025, a team running four LangChain agents discovered a $47,000 cloud bill. An Analyzer agent and a Verifier agent had entered a conversational loop — ping-ponging requests back and forth for eleven days before anyone on the team noticed. The agents were not malfunctioning in any way a traditional health check would detect. CPU usage was normal. Memory was stable. HTTP status codes were 200. The agents were simply talking to each other, burning tokens, accomplishing nothing.
This is not an edge case. It is the default outcome when you deploy AI agents without purpose-built observability. According to PwC’s 2026 Agent Survey, 79% of organisations have now adopted AI agents in some capacity. But most cannot trace a failure through a multi-step workflow, explain why costs doubled on a Tuesday, or determine whether an agent hallucinated a tool result or the tool itself returned garbage.
Traditional application monitoring — the Datadog dashboards, the Prometheus metrics, the PagerDuty alerts you already have — tells you whether your infrastructure is healthy. It does not tell you whether your agent chose the right tool, interpreted the result correctly, or spent $200 on a task that should have cost $0.30. AI agent observability is a fundamentally different problem, and the industry is only now building the tools to solve it.
Why Traditional Monitoring Fails for AI Agents
Application Performance Monitoring (APM) was designed for deterministic software. A REST endpoint receives a request, queries a database, transforms data, and returns a response. The path is predictable. The cost is fixed. If latency spikes, you check the database. If errors increase, you check the logs. The debugging model is linear: input, processing, output.
AI agents break every assumption in this model. An agent receives a prompt, reasons about it, selects a tool, executes the tool, interprets the result, decides whether to use another tool, and repeats — possibly dozens of times — before producing a final output. The path is non-deterministic. The cost varies by orders of magnitude depending on which tools the agent selects and how many reasoning steps it takes. Two identical inputs can produce radically different execution traces. Traditional monitoring cannot tell you:
- Which tool the agent selected at each step and why it chose that tool over alternatives
- Whether the agent hallucinated a tool result instead of actually calling the tool
- How many LLM inference calls a single user request triggered (and what each cost)
- Whether a quality regression is caused by a prompt change, a model update, or a retrieval failure
- Whether an agent is stuck in a reasoning loop — all health metrics will show green
The $47K LangChain incident is the canonical example. Four agents in a multi-agent system were healthy by every traditional metric. CPU, memory, network, HTTP response codes — all normal. The agents were functioning exactly as designed: receiving messages, processing them with LLM calls, generating responses, sending them to the next agent. The problem was semantic, not infrastructural. The agents were doing useless work in a loop, and no traditional monitoring system is designed to detect "your AI is busy accomplishing nothing."
The Four Pillars of Agent Observability
AI agent observability requires four capabilities that traditional monitoring does not provide. Every production agent system needs all four.
Trace Reconstruction
A trace in agent observability is the complete decision path for a single interaction. Every LLM call, every tool invocation, every retrieval step, every intermediate reasoning decision gets captured with full context. Think of it as the call stack for your AI system — it shows not just what happened, but how the agent decided and why.
Without traces, debugging an agent failure is like debugging a microservices architecture with only the final HTTP response. You know something went wrong. You have no idea where. Agent traces must capture the prompt sent to the model, the model’s response, which tool was selected, the tool’s input and output, and the agent’s interpretation of that output — for every step in the chain.
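As a minimal sketch of what a trace step must carry, the record below captures each of those fields. The names are illustrative, not taken from any particular SDK:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class TraceStep:
    """One step in an agent trace: what the model saw, did, and concluded."""
    step: int
    prompt: str                       # prompt sent to the model
    model_response: str               # raw model output
    tool_name: Optional[str] = None   # tool the agent selected, if any
    tool_input: Optional[dict] = None
    tool_output: Optional[Any] = None
    interpretation: str = ""          # the agent's reading of the tool result

@dataclass
class AgentTrace:
    """Complete decision path for a single interaction."""
    request_id: str
    steps: list = field(default_factory=list)

    def record(self, step: TraceStep) -> None:
        self.steps.append(step)

trace = AgentTrace(request_id="req-001")
trace.record(TraceStep(step=1, prompt="Find Q3 revenue",
                       model_response="call db_lookup",
                       tool_name="db_lookup", tool_input={"q": "q3_rev"},
                       tool_output=1200000, interpretation="Revenue is $1.2M"))
```

The point is completeness per step, not the shape of the container: platforms like Langfuse store exactly this information as nested spans.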
Quality Evaluation
Traditional software is either correct or incorrect. Agent output exists on a spectrum. A customer support agent might give an answer that is technically correct but misses the user’s actual question. A data extraction agent might return 90% of the fields correctly but hallucinate the remaining 10%. A coding agent might generate syntactically valid code that introduces a subtle security vulnerability.
Production agent observability requires automated evaluation — LLM-as-judge metrics that score every output for relevance, factuality, completeness, and safety. Tools like Braintrust’s Loop can generate custom evaluators from natural language descriptions, then run them automatically on every interaction. Arize Phoenix provides built-in hallucination scorers that flag outputs where the agent’s response is not grounded in the retrieved context.
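The shape of an LLM-as-judge evaluator is simple; the hard part is the rubric. A hedged sketch, with `call_judge_model` as a placeholder for a real LLM client (swap in OpenAI, Anthropic, or whatever you run):

```python
JUDGE_PROMPT = """Score the RESPONSE against the CONTEXT on a 1-5 scale for:
relevance, factuality, completeness. Reply exactly as:
relevance=N factuality=N completeness=N

CONTEXT: {context}
QUESTION: {question}
RESPONSE: {response}"""

def call_judge_model(prompt: str) -> str:
    # Placeholder stub; in production this is a real LLM inference call.
    return "relevance=4 factuality=5 completeness=3"

def evaluate(question: str, response: str, context: str, threshold: int = 3):
    """Score one interaction and flag any dimension below the threshold."""
    raw = call_judge_model(JUDGE_PROMPT.format(
        context=context, question=question, response=response))
    scores = {k: int(v) for k, v in (pair.split("=") for pair in raw.split())}
    flagged = [k for k, v in scores.items() if v < threshold]
    return scores, flagged
```

Run this on a sample of production traffic and alert on the flag rate, not on individual scores.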
Cost Attribution
A single agent request can trigger anywhere from 1 to 50+ LLM inference calls, each consuming prompt and completion tokens at different rates depending on the model. Without per-request cost tracking, you cannot answer basic questions: How much does it cost to serve one customer query? Which agent workflows are economical and which are burning money? Is that cost spike from increased traffic or from an agent entering a recursive loop?
Cost attribution must be granular — broken down by request, by agent, by tool call, by model. It must be real-time, not discovered on the monthly invoice. And it must include circuit breakers: hard budget ceilings per run, per day, and per agent, with automatic termination when thresholds are breached.
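A minimal sketch of per-request attribution with a hard ceiling follows. The prices are illustrative only — real per-token rates vary by provider and change frequently:

```python
from collections import defaultdict

# Illustrative (input $/M tokens, output $/M tokens) — NOT real current prices.
PRICES = {
    "model-a": (2.50, 10.00),
    "model-b": (3.00, 15.00),
}

class CostTracker:
    """Per-request and per-agent cost attribution with a hard budget ceiling."""
    def __init__(self, ceiling_usd: float):
        self.ceiling = ceiling_usd
        self.by_request = defaultdict(float)
        self.by_agent = defaultdict(float)

    def record(self, request_id: str, agent: str, model: str,
               in_tokens: int, out_tokens: int) -> float:
        in_rate, out_rate = PRICES[model]
        cost = in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate
        self.by_request[request_id] += cost
        self.by_agent[agent] += cost
        if self.by_request[request_id] > self.ceiling:
            raise RuntimeError(f"budget ceiling breached for {request_id}")
        return cost
```

Raising on breach is the circuit breaker: the caller terminates the run rather than absorbing the overage.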
Behavioural Drift Detection
AI agents drift. A model provider ships a minor version update and your agent’s tool selection patterns change. A retrieval index gets stale and the agent starts hallucinating more frequently. A prompt that worked perfectly for three months suddenly produces worse results because the model’s behaviour shifted.
Drift detection requires baselining — establishing what "normal" looks like for your agent in terms of tool selection frequency, average reasoning steps per request, token consumption per task type, and quality scores. When any of these metrics deviate from baseline, the observability system must alert before the impact reaches users or the invoice.
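The simplest workable deviation check is a z-score against the baseline window. A sketch (real systems add seasonality handling and multiple-comparison corrections):

```python
from statistics import mean, stdev

def drift_alert(baseline: list, current: float, z_threshold: float = 3.0) -> bool:
    """Flag when a metric (tool-call frequency, steps per request, tokens per
    task, quality score) deviates more than z_threshold standard deviations
    from its baseline window."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

Run it per metric, per agent, on a rolling window; a single metric drifting is a warning, several drifting together usually means a model or prompt change landed.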
Production Agent Observability Non-Negotiables
- Automatic termination when cost thresholds are breached. No agent should be able to spend unlimited tokens without human intervention. The $47K incident happened because there was no ceiling.
- Exponential backoff on tool failures. Without this, a single API error can trigger thousands of retries, each consuming inference tokens to reason about the failure.
- Semantic loop detection. Compare the last N agent messages using embedding similarity. If the agent is producing semantically identical outputs across consecutive steps, it is stuck. Kill the run and alert.
- A named escalation owner. Someone specific gets paged when thresholds are breached. Not a Slack channel. Not an email alias. A person with the authority to kill a runaway agent at 3am.
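The backoff non-negotiable is a few lines of wrapper code. A sketch with jittered exponential delays, so failed tools are retried cheaply instead of being reasoned about by the model:

```python
import random
import time

def call_tool_with_backoff(tool, *args, max_retries=5,
                           base_delay=0.5, max_delay=30.0):
    """Retry a failing tool with exponential backoff and jitter, instead of
    letting the agent spend inference tokens reasoning about each transient
    error."""
    for attempt in range(max_retries):
        try:
            return tool(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise  # exhausted: surface the error, don't let the agent improvise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids retry storms
```

Crucially, on final failure this raises rather than returning a fabricated value — which is exactly the silent-hallucination failure mode described later.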
The Tooling Landscape: Who Owns Agent Observability
The agent observability market split into two camps in 2026: AI-native startups that built purpose-built platforms from scratch, and incumbent observability vendors bolting AI monitoring onto existing APM products. Both approaches have trade-offs.
AI-Native Platforms
| Platform | Approach | Pricing | Best For |
|---|---|---|---|
| Langfuse | Open-source, self-hostable. MIT licensed. Trace viewing, prompt versioning, cost tracking. | Free self-hosted; Cloud from $29/mo | Teams that want full control. On-premises deployments. Privacy-sensitive workloads. |
| Arize Phoenix | Open-source observability + commercial Arize AX. Deep agent evaluation. Multi-step trace analysis. | Phoenix free; AX from $50/mo | Teams needing strong evaluation. Complex multi-agent workflows. Hallucination detection. |
| Braintrust | Evaluation-first platform. CI/CD deployment blocking. 80x faster trace queries than alternatives. | Free 1M spans/mo; Pro $249/mo | Teams building evaluation into CI/CD. Fast iteration on prompt quality. Collaborative development. |
| Helicone | Proxy-based. Zero-code integration via URL swap. Real-time cost and latency dashboards. | Free tier; Pro from $80/mo | Fastest integration path. Teams that want instant visibility without SDK changes. |
| AgentOps | Purpose-built for multi-agent systems. Session replay for agent interactions. | Free tier available | CrewAI/AutoGen users. Teams debugging complex agent-to-agent interactions. |
Incumbent Platforms
Datadog launched LLM Observability in 2025 and expanded it in early 2026 with automatic instrumentation for Google’s Agent Development Kit, native OpenAI and Anthropic tracing, and Bits AI — an assistant that scans error logs and documentation to suggest fixes. Dynatrace introduced its Intelligence platform at Perform 2026, adding automatic instrumentation of AI/ML workloads with under 50 milliseconds overhead per traced request, capturing token usage, model performance, and LLM call chains. New Relic AI Monitoring provides low-overhead instrumentation of LLM calls with token usage tracking across AI pipelines.
OpenTelemetry: The Standard That Changes Everything
The most important development in agent observability is not any single platform. It is OpenTelemetry’s GenAI Semantic Conventions — a standardised schema for collecting telemetry from AI agent systems.
OpenTelemetry (OTEL) already won the observability standards war for traditional applications. It is the CNCF’s second-most active project after Kubernetes. Every major observability vendor supports it. The GenAI semantic conventions extend this standard to cover LLM operations, agent operations, and tool calls. Among other things, the conventions define:
- gen_ai.system — which model provider handled the request (openai, anthropic, etc.)
- gen_ai.request.model — the specific model used for each inference call
- gen_ai.usage.input_tokens / output_tokens — token consumption per call
- gen_ai.agent.name — which agent in a multi-agent system handled the task
- Agent operation spans: create_agent, invoke_agent with full parent-child trace relationships
- Tool call spans with input parameters and return values
The practical impact: if your agent system emits OTEL-standard telemetry, you can send that data to any observability backend — Datadog, Langfuse, Arize, Grafana, or your own storage — without rewriting instrumentation. You avoid vendor lock-in at the collection layer.
Datadog was among the first incumbents to announce native support for the OTEL GenAI semantic conventions. Langfuse, Arize, and Braintrust all support OTEL ingest. The convention draft was built in collaboration with Google (based on their AI agent white paper), and is being extended to cover frameworks like CrewAI, AutoGen, and LangGraph.
> “OpenTelemetry GenAI Semantic Conventions are to agent observability what REST conventions were to API monitoring. They give every tool in the ecosystem a common language for describing what an AI agent did, how long it took, and what it cost.”
The Real Failure Modes: What to Monitor and Why
Agent failures in production do not look like traditional software failures. They are subtler, more expensive, and harder to detect. Here are the five failure modes that observability must catch.
Semantic Loops
The agent enters a reasoning cycle where it repeatedly generates semantically similar outputs without making progress. All health metrics stay green. The agent is "working." It is accomplishing nothing. Detection requires embedding similarity comparison across consecutive agent outputs — if similarity exceeds a threshold (typically 0.92+) for N consecutive steps, the agent is looping.
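A loop detector needs only an embedding function and a similarity window. The sketch below uses a toy bag-of-words embedding so it is self-contained; in production you would substitute a real embedding model:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in; production systems use a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_looping(outputs: list, threshold: float = 0.92, window: int = 3) -> bool:
    """True if the last `window` consecutive outputs are semantically
    near-identical — the signature of a stuck agent."""
    if len(outputs) < window:
        return False
    recent = [embed(o) for o in outputs[-window:]]
    return all(cosine(recent[i], recent[i + 1]) >= threshold
               for i in range(window - 1))
```

When `is_looping` fires, kill the run and page the owner; do not let the agent keep reasoning about why it is stuck.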
Silent Hallucination
The agent skips a failed tool call and fabricates the result to "complete" the task. This is the most dangerous failure mode because the output looks correct. The agent returns a confident, well-formatted answer. The answer is wrong. Detection requires grounding checks — comparing the agent’s output against the actual tool responses in the trace. If the agent references data that no tool returned, it hallucinated.
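A crude but useful grounding check can run on every trace: every number the agent states should appear somewhere in an actual tool response. This is a naive substring sketch; production graders use LLM-as-judge or entailment models, but this version catches fully fabricated figures:

```python
import re

def grounding_violations(final_output: str, tool_outputs: list) -> list:
    """Return numeric claims in the agent's output that appear in no actual
    tool response. Non-empty result = likely hallucinated data."""
    grounded = " ".join(tool_outputs)
    claimed = re.findall(r"\d[\d,]*(?:\.\d+)?", final_output)
    return [n for n in claimed if n not in grounded]
```

An empty list is not proof of grounding, but a non-empty one is a high-precision hallucination signal worth alerting on.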
Model Regression
A model provider updates their model (sometimes without announcement), and your agent’s behaviour changes. Tool selection patterns shift. Output quality drifts. Costs increase because the model now takes more reasoning steps. Detection requires continuous evaluation against a baseline dataset and alerting on statistical deviations in quality scores, tool selection frequencies, and token consumption.
Cost Explosion
A single request triggers an unexpectedly deep reasoning chain. An agent that usually completes in 3 tool calls suddenly takes 30. A retry loop on a failed API turns a $0.05 request into a $50 one. Detection requires per-request cost tracking with hard ceilings and real-time alerting when any single request exceeds a cost threshold.
Cascading Agent Failure
In multi-agent systems, one agent’s degraded output becomes another agent’s degraded input. Quality degrades across the chain but no single agent shows a clear failure. The final output is poor, but each individual agent’s metrics look acceptable. Detection requires end-to-end trace correlation with quality evaluation at each handoff point, not just at the final output.
| Failure Mode | Detection Method | Traditional Monitoring Catches It? |
|---|---|---|
| Semantic loop | Embedding similarity over consecutive outputs | No — all health metrics stay green |
| Silent hallucination | Grounding check: output vs actual tool responses | No — agent returns 200 with well-formatted response |
| Model regression | Baseline comparison of quality scores and tool selection | No — infrastructure metrics unchanged |
| Cost explosion | Per-request cost tracking with hard ceilings | Partially — only visible on monthly cloud bill |
| Cascading agent failure | End-to-end quality evaluation at each handoff | No — individual agent metrics look normal |
Building an Agent Observability Stack: A Practical Architecture
There is no single tool that covers all four pillars of agent observability today. Production teams are building layered stacks. Here is the architecture that works.
Agent Observability Stack in 2026
Use the OTEL SDK to emit standardised spans for every LLM call, tool invocation, and agent operation. This is your collection layer. It decouples instrumentation from the backend, so you can swap observability vendors without rewriting code. If your agent framework (LangChain, CrewAI, LangGraph) has an OTEL integration, use it. If not, wrap your LLM client calls with OTEL spans manually — it is 10–15 lines of code per call site.
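To show the shape of those 10–15 lines without pulling in the SDK as a dependency here, the sketch below mirrors the GenAI semantic-convention attribute names with a plain context manager. With the real SDK, the context manager is `tracer.start_as_current_span(...)` and `SPANS` is your configured exporter:

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in for an OTEL exporter shipping spans to a collector

@contextmanager
def genai_span(system: str, model: str, agent: str):
    """Minimal stand-in for an OTEL GenAI span, using the semantic-convention
    attribute names from the list above."""
    span = {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.agent.name": agent,
        "gen_ai.usage.input_tokens": 0,
        "gen_ai.usage.output_tokens": 0,
    }
    start = time.monotonic()
    try:
        yield span
    finally:
        span["duration_s"] = time.monotonic() - start
        SPANS.append(span)

with genai_span("openai", "gpt-4o", "analyzer") as span:
    # ...call the model here, then record actual usage on the span...
    span["gen_ai.usage.input_tokens"] = 412
    span["gen_ai.usage.output_tokens"] = 96
```

Because the attribute names follow the convention, any OTEL-compatible backend can aggregate cost and latency from these spans without custom parsing.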
Send OTEL data to Langfuse, Arize, or Braintrust for rich trace visualisation, prompt versioning, and automated evaluation. This is where your AI engineers live — debugging agent behaviour, testing prompt changes, running evaluation suites. Choose based on your constraints: Langfuse if you need self-hosting, Arize if you need deep evaluation, Braintrust if you need CI/CD integration.
Send the same OTEL data to Datadog, Dynatrace, or New Relic so your infrastructure team can correlate agent performance with infrastructure metrics. When an agent is slow, you need to know whether the bottleneck is the LLM provider, a tool API, a database query, or network latency. Only your APM can answer this.
Budget ceilings, rate limiters, and loop detectors should run as middleware or sidecar processes, not as checks inside your agent code. An agent cannot reliably enforce its own spending limit — the same way a process cannot reliably enforce its own memory limit. Use a proxy layer (Helicone, a custom OTEL processor, or your API gateway) to enforce cost and rate limits before requests reach the LLM provider.
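The enforcement layer can be sketched as a proxy object that sits between the agent and the provider client. Everything here is illustrative (the cost estimate would come from your cost tracker, and in practice this runs as middleware or a gateway plugin, not in-process):

```python
import time

class GuardrailProxy:
    """Sits between the agent and the LLM provider: enforces a per-run budget
    and a requests-per-minute cap before any call reaches the model."""
    def __init__(self, llm_call, budget_usd: float, max_rpm: int):
        self.llm_call = llm_call
        self.budget = budget_usd
        self.spent = 0.0
        self.max_rpm = max_rpm
        self.call_times = []  # timestamps of calls in the last minute

    def __call__(self, prompt: str, est_cost: float) -> str:
        now = time.monotonic()
        self.call_times = [t for t in self.call_times if now - t < 60]
        if len(self.call_times) >= self.max_rpm:
            raise RuntimeError("rate limit: too many LLM calls per minute")
        if self.spent + est_cost > self.budget:
            raise RuntimeError("budget ceiling: run terminated")
        self.call_times.append(now)
        self.spent += est_cost
        return self.llm_call(prompt)
```

Because the proxy raises before the provider is called, a runaway agent is stopped at the boundary rather than merely observed after the fact.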
Maintain a set of representative inputs with expected outputs. Run your agent against this dataset on every deployment and on a daily schedule. Compare quality scores, tool selection patterns, token consumption, and latency against the baseline. Alert on statistical deviations. This catches model regressions and prompt drift before they reach production users.
The Economics: What Agent Observability Costs and Saves
Agent observability is not free. You are paying for trace storage, evaluation compute (LLM-as-judge calls cost tokens), and platform licensing. A reasonable estimate for a mid-sized production agent system: $500–$2,000 per month in observability tooling costs.
Set that against the alternative. Inference workloads now consume over 55% of AI-optimised infrastructure spending in 2026 — surpassing training costs for the first time. A single undetected loop or retry storm can burn through $10K–$50K before anyone notices. One report found that teams without proper agent cost controls are spending 30x more than necessary on LLM inference, with the excess driven by over-qualified model selection, missing caches, and uncontrolled agent autonomy.
> “The question is not whether you can afford agent observability. It is whether you can afford to run agents without it. Inference costs exceed training costs for the first time in 2026. Every unmonitored agent is an open chequebook.”
What to Do on Monday
If you are running AI agents in production today without purpose-built observability, here is the minimum viable instrumentation to deploy this week.
- Add per-request cost tracking. Log the model, token count (input and output), and calculated cost for every LLM call. Store it in your existing metrics system. This alone would have caught the $47K incident within hours instead of eleven days.
- Set a hard budget ceiling. No single agent run should be able to spend more than a defined amount without termination. Start with $5 per run and adjust based on your workloads.
- Add a step counter. If an agent exceeds N reasoning steps (start with 20), terminate and alert. This catches semantic loops without the complexity of embedding similarity.
- Log tool call inputs and outputs. When an agent calls a tool, capture what it sent and what it received. When the agent’s final output does not match tool responses, you have a hallucination.
- Evaluate one metric automatically. Pick the highest-risk failure mode for your use case (hallucination, completeness, relevance) and run an LLM-as-judge scorer on a sample of production traffic. Even 5% sampling gives early warning.
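The first three items above collapse into one small guard around the agent loop. A minimal sketch, where `step_fn` is a placeholder for whatever runs one reasoning step in your framework and reports its cost:

```python
def run_agent(step_fn, max_steps: int = 20, budget_usd: float = 5.00):
    """Monday-morning guard: terminate on step count or spend, whichever
    trips first. `step_fn(step)` runs one reasoning step and returns
    (done, output, step_cost_usd)."""
    spent = 0.0
    for step in range(max_steps):
        done, output, cost = step_fn(step)
        spent += cost
        if spent > budget_usd:
            raise RuntimeError(f"budget ceiling breached at step {step}: ${spent:.2f}")
        if done:
            return output
    raise RuntimeError(f"step ceiling hit: {max_steps} steps without completion")
```

Twenty steps and $5 are starting points, not recommendations; tune both against your observed baselines once the tracking from step one is in place.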
This is not a complete observability stack. It is the minimum instrumentation that prevents the catastrophic failures — the loops, the cost explosions, the undetected hallucinations — while you build out the full architecture.
Where Fordel Builds
We build production AI agent systems for clients in SaaS, finance, healthcare, and insurance. Agent observability is not an add-on we bolt on after deployment. It is part of the architecture from day one — instrumented traces, cost controls, automated evaluation, and drift detection wired into the system before the first agent goes live.
If you are running agents in production without visibility into what they are doing, what they are costing, and whether their output quality is degrading, we can help you build the observability stack that turns blind deployment into controlled operation. Reach out.