In November 2025, a team running four LangChain agents discovered a $47,000 cloud bill. An Analyzer agent and a Verifier agent had entered a conversational loop — ping-ponging requests back and forth for eleven days before anyone on the team noticed. The agents were not malfunctioning in any way a traditional health check would detect. CPU usage was normal. Memory was stable. HTTP status codes were 200. The agents were simply talking to each other, burning tokens, accomplishing nothing.
This is not an edge case. It is the default outcome when you deploy AI agents without purpose-built observability. According to PwC’s 2026 Agent Survey, 79% of organisations have now adopted AI agents in some capacity. But most cannot trace a failure through a multi-step workflow, explain why costs doubled on a Tuesday, or determine whether an agent hallucinated a tool result or the tool itself returned garbage.
Traditional application monitoring — the Datadog dashboards, the Prometheus metrics, the PagerDuty alerts you already have — tells you whether your infrastructure is healthy. It does not tell you whether your agent chose the right tool, interpreted the result correctly, or spent $200 on a task that should have cost $0.30. AI agent observability is a fundamentally different problem, and the industry is only now building the tools to solve it.
Why Traditional Monitoring Fails for AI Agents
Application Performance Monitoring (APM) was designed for deterministic software. A REST endpoint receives a request, queries a database, transforms data, and returns a response. The path is predictable. The cost is fixed. If latency spikes, you check the database. If errors increase, you check the logs. The debugging model is linear: input, processing, output.
AI agents break every assumption in this model. An agent receives a prompt, reasons about it, selects a tool, executes the tool, interprets the result, decides whether to use another tool, and repeats — possibly dozens of times — before producing a final output. The path is non-deterministic. The cost varies by orders of magnitude depending on which tools the agent selects and how many reasoning steps it takes. Two identical inputs can produce radically different execution traces. Traditional monitoring cannot tell you:
- Which tool the agent selected at each step and why it chose that tool over alternatives
- Whether the agent hallucinated a tool result instead of actually calling the tool
- How many LLM inference calls a single user request triggered (and what each cost)
- Whether a quality regression is caused by a prompt change, a model update, or a retrieval failure
- Whether an agent is stuck in a reasoning loop — all health metrics will show green
The $47K LangChain incident is the canonical example. Four agents in a multi-agent system were healthy by every traditional metric. CPU, memory, network, HTTP response codes — all normal. The agents were functioning exactly as designed: receiving messages, processing them with LLM calls, generating responses, sending them to the next agent. The problem was semantic, not infrastructural. The agents were doing useless work in a loop, and no traditional monitoring system is designed to detect "your AI is busy accomplishing nothing."
The Four Pillars of Agent Observability
AI agent observability requires four capabilities that traditional monitoring does not provide. Every production agent system needs all four.
Trace Reconstruction
A trace in agent observability is the complete decision path for a single interaction. Every LLM call, every tool invocation, every retrieval step, every intermediate reasoning decision gets captured with full context. Think of it as the call stack for your AI system — it shows not just what happened, but how the agent decided and why.
Without traces, debugging an agent failure is like debugging a microservices architecture with only the final HTTP response. You know something went wrong. You have no idea where. Agent traces must capture the prompt sent to the model, the model’s response, which tool was selected, the tool’s input and output, and the agent’s interpretation of that output — for every step in the chain.
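As a minimal sketch of what a trace step must carry, the record below captures each of those fields. The names are illustrative, not taken from any particular SDK:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class TraceStep:
    """One step in an agent trace: what the model saw, did, and concluded."""
    step: int
    prompt: str                       # prompt sent to the model
    model_response: str               # raw model output
    tool_name: Optional[str] = None   # tool the agent selected, if any
    tool_input: Optional[dict] = None
    tool_output: Optional[Any] = None
    interpretation: str = ""          # the agent's reading of the tool result

@dataclass
class AgentTrace:
    """Complete decision path for a single interaction."""
    request_id: str
    steps: list = field(default_factory=list)

    def record(self, step: TraceStep) -> None:
        self.steps.append(step)

trace = AgentTrace(request_id="req-001")
trace.record(TraceStep(step=1, prompt="Find Q3 revenue",
                       model_response="call db_lookup",
                       tool_name="db_lookup", tool_input={"q": "q3_rev"},
                       tool_output=1200000, interpretation="Revenue is $1.2M"))
```

The point is completeness per step, not the shape of the container: platforms like Langfuse store exactly this information as nested spans.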
Quality Evaluation
Traditional software is either correct or incorrect. Agent output exists on a spectrum. A customer support agent might give an answer that is technically correct but misses the user’s actual question. A data extraction agent might return 90% of the fields correctly but hallucinate the remaining 10%. A coding agent might generate syntactically valid code that introduces a subtle security vulnerability.
Production agent observability requires automated evaluation — LLM-as-judge metrics that score every output for relevance, factuality, completeness, and safety. Tools like Braintrust’s Loop can generate custom evaluators from natural language descriptions, then run them automatically on every interaction. Arize Phoenix provides built-in hallucination scorers that flag outputs where the agent’s response is not grounded in the retrieved context.
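The shape of an LLM-as-judge evaluator is simple; the hard part is the rubric. A hedged sketch, with `call_judge_model` as a placeholder for a real LLM client (swap in OpenAI, Anthropic, or whatever you run):

```python
JUDGE_PROMPT = """Score the RESPONSE against the CONTEXT on a 1-5 scale for:
relevance, factuality, completeness. Reply exactly as:
relevance=N factuality=N completeness=N

CONTEXT: {context}
QUESTION: {question}
RESPONSE: {response}"""

def call_judge_model(prompt: str) -> str:
    # Placeholder stub; in production this is a real LLM inference call.
    return "relevance=4 factuality=5 completeness=3"

def evaluate(question: str, response: str, context: str, threshold: int = 3):
    """Score one interaction and flag any dimension below the threshold."""
    raw = call_judge_model(JUDGE_PROMPT.format(
        context=context, question=question, response=response))
    scores = {k: int(v) for k, v in (pair.split("=") for pair in raw.split())}
    flagged = [k for k, v in scores.items() if v < threshold]
    return scores, flagged
```

Run this on a sample of production traffic and alert on the flag rate, not on individual scores.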
Cost Attribution
A single agent request can trigger anywhere from 1 to 50+ LLM inference calls, each consuming prompt and completion tokens at different rates depending on the model. Without per-request cost tracking, you cannot answer basic questions: How much does it cost to serve one customer query? Which agent workflows are economical and which are burning money? Is that cost spike from increased traffic or from an agent entering a recursive loop?
Cost attribution must be granular — broken down by request, by agent, by tool call, by model. It must be real-time, not discovered on the monthly invoice. And it must include circuit breakers: hard budget ceilings per run, per day, and per agent, with automatic termination when thresholds are breached.
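A minimal sketch of per-request attribution with a hard ceiling follows. The prices are illustrative only — real per-token rates vary by provider and change frequently:

```python
from collections import defaultdict

# Illustrative (input $/M tokens, output $/M tokens) — NOT real current prices.
PRICES = {
    "model-a": (2.50, 10.00),
    "model-b": (3.00, 15.00),
}

class CostTracker:
    """Per-request and per-agent cost attribution with a hard budget ceiling."""
    def __init__(self, ceiling_usd: float):
        self.ceiling = ceiling_usd
        self.by_request = defaultdict(float)
        self.by_agent = defaultdict(float)

    def record(self, request_id: str, agent: str, model: str,
               in_tokens: int, out_tokens: int) -> float:
        in_rate, out_rate = PRICES[model]
        cost = in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate
        self.by_request[request_id] += cost
        self.by_agent[agent] += cost
        if self.by_request[request_id] > self.ceiling:
            raise RuntimeError(f"budget ceiling breached for {request_id}")
        return cost
```

Raising on breach is the circuit breaker: the caller terminates the run rather than absorbing the overage.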
Behavioural Drift Detection
AI agents drift. A model provider ships a minor version update and your agent’s tool selection patterns change. A retrieval index gets stale and the agent starts hallucinating more frequently. A prompt that worked perfectly for three months suddenly produces worse results because the model’s behaviour shifted.
Drift detection requires baselining — establishing what "normal" looks like for your agent in terms of tool selection frequency, average reasoning steps per request, token consumption per task type, and quality scores. When any of these metrics deviate from baseline, the observability system must alert before the impact reaches users or the invoice.
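The simplest workable deviation check is a z-score against the baseline window. A sketch (real systems add seasonality handling and multiple-comparison corrections):

```python
from statistics import mean, stdev

def drift_alert(baseline: list, current: float, z_threshold: float = 3.0) -> bool:
    """Flag when a metric (tool-call frequency, steps per request, tokens per
    task, quality score) deviates more than z_threshold standard deviations
    from its baseline window."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

Run it per metric, per agent, on a rolling window; a single metric drifting is a warning, several drifting together usually means a model or prompt change landed.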
Production Agent Observability Non-Negotiables
- Automatic termination when cost thresholds are breached. No agent should be able to spend unlimited tokens without human intervention. The $47K incident happened because there was no ceiling.
- Exponential backoff on tool failures. Without this, a single API error can trigger thousands of retries, each consuming inference tokens to reason about the failure.
- Semantic loop detection. Compare the last N agent messages using embedding similarity. If the agent is producing semantically identical outputs across consecutive steps, it is stuck. Kill the run and alert.
- A named escalation owner. Someone specific gets paged when thresholds are breached. Not a Slack channel. Not an email alias. A person with the authority to kill a runaway agent at 3am.
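The backoff non-negotiable is a few lines of wrapper code. A sketch with jittered exponential delays, so failed tools are retried cheaply instead of being reasoned about by the model:

```python
import random
import time

def call_tool_with_backoff(tool, *args, max_retries=5,
                           base_delay=0.5, max_delay=30.0):
    """Retry a failing tool with exponential backoff and jitter, instead of
    letting the agent spend inference tokens reasoning about each transient
    error."""
    for attempt in range(max_retries):
        try:
            return tool(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise  # exhausted: surface the error, don't let the agent improvise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids retry storms
```

Crucially, on final failure this raises rather than returning a fabricated value — which is exactly the silent-hallucination failure mode described later.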
The Tooling Landscape: Who Owns Agent Observability
The agent observability market split into two camps in 2026: AI-native startups that built purpose-built platforms from scratch, and incumbent observability vendors bolting AI monitoring onto existing APM products. Both approaches have trade-offs.
AI-Native Platforms
| Platform | Approach | Pricing | Best For |
|---|---|---|---|
| Langfuse | Open-source, self-hostable. MIT licensed. Trace viewing, prompt versioning, cost tracking. | Free self-hosted; Cloud from $29/mo | Teams that want full control. On-premises deployments. Privacy-sensitive workloads. |
| Arize Phoenix | Open-source observability + commercial Arize AX. Deep agent evaluation. Multi-step trace analysis. | Phoenix free; AX from $50/mo | Teams needing strong evaluation. Complex multi-agent workflows. Hallucination detection. |
| Braintrust | Evaluation-first platform. CI/CD deployment blocking. 80x faster trace queries than alternatives. | Free 1M spans/mo; Pro $249/mo | Teams building evaluation into CI/CD. Fast iteration on prompt quality. Collaborative development. |
| Helicone | Proxy-based. Zero-code integration via URL swap. Real-time cost and latency dashboards. | Free tier; Pro from $80/mo | Fastest integration path. Teams that want instant visibility without SDK changes. |
| AgentOps | Purpose-built for multi-agent systems. Session replay for agent interactions. | Free tier available | CrewAI/AutoGen users. Teams debugging complex agent-to-agent interactions. |
Incumbent Platforms
Datadog launched LLM Observability in 2025 and expanded it in early 2026 with automatic instrumentation for Google’s Agent Development Kit, native OpenAI and Anthropic tracing, and Bits AI — an assistant that scans error logs and documentation to suggest fixes. Dynatrace introduced its Intelligence platform at Perform 2026, adding automatic instrumentation of AI/ML workloads with under 50 milliseconds overhead per traced request, capturing token usage, model performance, and LLM call chains. New Relic AI Monitoring provides low-overhead instrumentation of LLM calls with token usage tracking across AI pipelines.
OpenTelemetry: The Standard That Changes Everything
The most important development in agent observability is not any single platform. It is OpenTelemetry’s GenAI Semantic Conventions — a standardised schema for collecting telemetry from AI agent systems.
OpenTelemetry (OTEL) already won the observability standards war for traditional applications. It is the CNCF’s second-most active project after Kubernetes. Every major observability vendor supports it. The GenAI semantic conventions extend this standard to cover LLM operations, agent operations, and tool calls. Among other things, the conventions define:
- gen_ai.system — which model provider handled the request (openai, anthropic, etc.)
- gen_ai.request.model — the specific model used for each inference call
- gen_ai.usage.input_tokens / output_tokens — token consumption per call
- gen_ai.agent.name — which agent in a multi-agent system handled the task
- Agent operation spans: create_agent, invoke_agent with full parent-child trace relationships
- Tool call spans with input parameters and return values
The practical impact: if your agent system emits OTEL-standard telemetry, you can send that data to any observability backend — Datadog, Langfuse, Arize, Grafana, or your own storage — without rewriting instrumentation. You avoid vendor lock-in at the collection layer.
Datadog was among the first incumbents to announce native support for the OTEL GenAI semantic conventions. Langfuse, Arize, and Braintrust all support OTEL ingest. The convention draft was built in collaboration with Google (based on their AI agent white paper), and is being extended to cover frameworks like CrewAI, AutoGen, and LangGraph.
> “OpenTelemetry GenAI Semantic Conventions are to agent observability what REST conventions were to API monitoring. They give every tool in the ecosystem a common language for describing what an AI agent did, how long it took, and what it cost.”
The Real Failure Modes: What to Monitor and Why
Agent failures in production do not look like traditional software failures. They are subtler, more expensive, and harder to detect. Here are the five failure modes that observability must catch.
Semantic Loops
The agent enters a reasoning cycle where it repeatedly generates semantically similar outputs without making progress. All health metrics stay green. The agent is "working." It is accomplishing nothing. Detection requires embedding similarity comparison across consecutive agent outputs — if similarity exceeds a threshold (typically 0.92+) for N consecutive steps, the agent is looping.
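A loop detector needs only an embedding function and a similarity window. The sketch below uses a toy bag-of-words embedding so it is self-contained; in production you would substitute a real embedding model:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in; production systems use a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_looping(outputs: list, threshold: float = 0.92, window: int = 3) -> bool:
    """True if the last `window` consecutive outputs are semantically
    near-identical — the signature of a stuck agent."""
    if len(outputs) < window:
        return False
    recent = [embed(o) for o in outputs[-window:]]
    return all(cosine(recent[i], recent[i + 1]) >= threshold
               for i in range(window - 1))
```

When `is_looping` fires, kill the run and page the owner; do not let the agent keep reasoning about why it is stuck.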
Silent Hallucination
The agent skips a failed tool call and fabricates the result to "complete" the task. This is the most dangerous failure mode because the output looks correct. The agent returns a confident, well-formatted answer. The answer is wrong. Detection requires grounding checks — comparing the agent’s output against the actual tool responses in the trace. If the agent references data that no tool returned, it hallucinated.
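A crude but useful grounding check can run on every trace: every number the agent states should appear somewhere in an actual tool response. This is a naive substring sketch; production graders use LLM-as-judge or entailment models, but this version catches fully fabricated figures:

```python
import re

def grounding_violations(final_output: str, tool_outputs: list) -> list:
    """Return numeric claims in the agent's output that appear in no actual
    tool response. Non-empty result = likely hallucinated data."""
    grounded = " ".join(tool_outputs)
    claimed = re.findall(r"\d[\d,]*(?:\.\d+)?", final_output)
    return [n for n in claimed if n not in grounded]
```

An empty list is not proof of grounding, but a non-empty one is a high-precision hallucination signal worth alerting on.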
Model Regression
A model provider updates their model (sometimes without announcement), and your agent’s behaviour changes. Tool selection patterns shift. Output quality drifts. Costs increase because the model now takes more reasoning steps. Detection requires continuous evaluation against a baseline dataset and alerting on statistical deviations in quality scores, tool selection frequencies, and token consumption.
Cost Explosion
A single request triggers an unexpectedly deep reasoning chain. An agent that usually completes in 3 tool calls suddenly takes 30. A retry loop on a failed API turns a $0.05 request into a $50 one. Detection requires per-request cost tracking with hard ceilings and real-time alerting when any single request exceeds a cost threshold.
Cascading Agent Failure
In multi-agent systems, one agent’s degraded output becomes another agent’s degraded input. Quality degrades across the chain but no single agent shows a clear failure. The final output is poor, but each individual agent’s metrics look acceptable. Detection requires end-to-end trace correlation with quality evaluation at each handoff point, not just at the final output.
| Failure Mode | Detection Method | Traditional Monitoring Catches It? |
|---|---|---|
| Semantic loop | Embedding similarity over consecutive outputs | No — all health metrics stay green |
| Silent hallucination | Grounding check: output vs actual tool responses | No — agent returns 200 with well-formatted response |
| Model regression | Baseline comparison of quality scores and tool selection | No — infrastructure metrics unchanged |
| Cost explosion | Per-request cost tracking with hard ceilings | Partially — only visible on monthly cloud bill |
| Cascading agent failure | End-to-end quality evaluation at each handoff | No — individual agent metrics look normal |
Building an Agent Observability Stack: A Practical Architecture
There is no single tool that covers all four pillars of agent observability today. Production teams are building layered stacks. Here is the architecture that works.
Agent Observability Stack in 2026
Use the OTEL SDK to emit standardised spans for every LLM call, tool invocation, and agent operation. This is your collection layer. It decouples instrumentation from the backend, so you can swap observability vendors without rewriting code. If your agent framework (LangChain, CrewAI, LangGraph) has an OTEL integration, use it. If not, wrap your LLM client calls with OTEL spans manually — it is 10–15 lines of code per call site.
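To show the shape of those 10–15 lines without pulling in the SDK as a dependency here, the sketch below mirrors the GenAI semantic-convention attribute names with a plain context manager. With the real SDK, the context manager is `tracer.start_as_current_span(...)` and `SPANS` is your configured exporter:

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in for an OTEL exporter shipping spans to a collector

@contextmanager
def genai_span(system: str, model: str, agent: str):
    """Minimal stand-in for an OTEL GenAI span, using the semantic-convention
    attribute names from the list above."""
    span = {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.agent.name": agent,
        "gen_ai.usage.input_tokens": 0,
        "gen_ai.usage.output_tokens": 0,
    }
    start = time.monotonic()
    try:
        yield span
    finally:
        span["duration_s"] = time.monotonic() - start
        SPANS.append(span)

with genai_span("openai", "gpt-4o", "analyzer") as span:
    # ...call the model here, then record actual usage on the span...
    span["gen_ai.usage.input_tokens"] = 412
    span["gen_ai.usage.output_tokens"] = 96
```

Because the attribute names follow the convention, any OTEL-compatible backend can aggregate cost and latency from these spans without custom parsing.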
Send OTEL data to Langfuse, Arize, or Braintrust for rich trace visualisation, prompt versioning, and automated evaluation. This is where your AI engineers live — debugging agent behaviour, testing prompt changes, running evaluation suites. Choose based on your constraints: Langfuse if you need self-hosting, Arize if you need deep evaluation, Braintrust if you need CI/CD integration.
Send the same OTEL data to Datadog, Dynatrace, or New Relic so your infrastructure team can correlate agent performance with infrastructure metrics. When an agent is slow, you need to know whether the bottleneck is the LLM provider, a tool API, a database query, or network latency. Only your APM can answer this.
Budget ceilings, rate limiters, and loop detectors should run as middleware or sidecar processes, not as checks inside your agent code. An agent cannot reliably enforce its own spending limit — the same way a process cannot reliably enforce its own memory limit. Use a proxy layer (Helicone, a custom OTEL processor, or your API gateway) to enforce cost and rate limits before requests reach the LLM provider.
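The enforcement layer can be sketched as a proxy object that sits between the agent and the provider client. Everything here is illustrative (the cost estimate would come from your cost tracker, and in practice this runs as middleware or a gateway plugin, not in-process):

```python
import time

class GuardrailProxy:
    """Sits between the agent and the LLM provider: enforces a per-run budget
    and a requests-per-minute cap before any call reaches the model."""
    def __init__(self, llm_call, budget_usd: float, max_rpm: int):
        self.llm_call = llm_call
        self.budget = budget_usd
        self.spent = 0.0
        self.max_rpm = max_rpm
        self.call_times = []  # timestamps of calls in the last minute

    def __call__(self, prompt: str, est_cost: float) -> str:
        now = time.monotonic()
        self.call_times = [t for t in self.call_times if now - t < 60]
        if len(self.call_times) >= self.max_rpm:
            raise RuntimeError("rate limit: too many LLM calls per minute")
        if self.spent + est_cost > self.budget:
            raise RuntimeError("budget ceiling: run terminated")
        self.call_times.append(now)
        self.spent += est_cost
        return self.llm_call(prompt)
```

Because the proxy raises before the provider is called, a runaway agent is stopped at the boundary rather than merely observed after the fact.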
Maintain a set of representative inputs with expected outputs. Run your agent against this dataset on every deployment and on a daily schedule. Compare quality scores, tool selection patterns, token consumption, and latency against the baseline. Alert on statistical deviations. This catches model regressions and prompt drift before they reach production users.
The Economics: What Agent Observability Costs and Saves
Agent observability is not free. You are paying for trace storage, evaluation compute (LLM-as-judge calls cost tokens), and platform licensing. A reasonable estimate for a mid-sized production agent system: $500–$2,000 per month in observability tooling costs.
Set that against the alternative. Inference workloads now consume over 55% of AI-optimised infrastructure spending in 2026 — surpassing training costs for the first time. A single undetected loop or retry storm can burn through $10K–$50K before anyone notices. One report found that teams without proper agent cost controls are spending 30x more than necessary on LLM inference, with the excess driven by over-qualified model selection, missing caches, and uncontrolled agent autonomy.
> “The question is not whether you can afford agent observability. It is whether you can afford to run agents without it. Inference costs exceed training costs for the first time in 2026. Every unmonitored agent is an open chequebook.”
What to Do on Monday
If you are running AI agents in production today without purpose-built observability, here is the minimum viable instrumentation to deploy this week.
- Add per-request cost tracking. Log the model, token count (input and output), and calculated cost for every LLM call. Store it in your existing metrics system. This alone would have caught the $47K incident within hours instead of eleven days.
- Set a hard budget ceiling. No single agent run should be able to spend more than a defined amount without termination. Start with $5 per run and adjust based on your workloads.
- Add a step counter. If an agent exceeds N reasoning steps (start with 20), terminate and alert. This catches semantic loops without the complexity of embedding similarity.
- Log tool call inputs and outputs. When an agent calls a tool, capture what it sent and what it received. When the agent’s final output does not match tool responses, you have a hallucination.
- Evaluate one metric automatically. Pick the highest-risk failure mode for your use case (hallucination, completeness, relevance) and run an LLM-as-judge scorer on a sample of production traffic. Even 5% sampling gives early warning.
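The first three items above collapse into one small guard around the agent loop. A minimal sketch, where `step_fn` is a placeholder for whatever runs one reasoning step in your framework and reports its cost:

```python
def run_agent(step_fn, max_steps: int = 20, budget_usd: float = 5.00):
    """Monday-morning guard: terminate on step count or spend, whichever
    trips first. `step_fn(step)` runs one reasoning step and returns
    (done, output, step_cost_usd)."""
    spent = 0.0
    for step in range(max_steps):
        done, output, cost = step_fn(step)
        spent += cost
        if spent > budget_usd:
            raise RuntimeError(f"budget ceiling breached at step {step}: ${spent:.2f}")
        if done:
            return output
    raise RuntimeError(f"step ceiling hit: {max_steps} steps without completion")
```

Twenty steps and $5 are starting points, not recommendations; tune both against your observed baselines once the tracking from step one is in place.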
This is not a complete observability stack. It is the minimum instrumentation that prevents the catastrophic failures — the loops, the cost explosions, the undetected hallucinations — while you build out the full architecture.
Where Fordel Builds
We build production AI agent systems for clients in SaaS, finance, healthcare, and insurance. Agent observability is not an add-on we bolt on after deployment. It is part of the architecture from day one — instrumented traces, cost controls, automated evaluation, and drift detection wired into the system before the first agent goes live.
If you are running agents in production without visibility into what they are doing, what they are costing, and whether their output quality is degrading, we can help you build the observability stack that turns blind deployment into controlled operation. Reach out.