The shift from monitoring to observability is a shift from known-unknowns to unknown-unknowns. Monitoring checks predefined conditions: is CPU above 80%? Is the error rate above 1%? Is the response time above 500ms? Observability lets you ask arbitrary questions of your system without having anticipated those questions in advance.
This distinction matters because modern distributed systems fail in ways that cannot be predicted. A request that traverses 15 microservices, 3 databases, 2 caches, and an AI model inference endpoint has thousands of potential failure modes. You cannot write a monitoring check for each one. You need the ability to trace a single request through the entire system and understand its behavior.
The Three Pillars (and Why They Are Not Enough)
The traditional framing of observability centers on three pillars: logs, metrics, and traces. This framing is useful but incomplete.
Logs capture discrete events. Metrics capture aggregated measurements over time. Traces capture the journey of a single request through distributed services. The insight comes from correlating all three: a metric shows latency increased, a trace pinpoints which service introduced the latency, and a log from that service reveals the root cause.
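As a toy sketch of that correlation (service name and fields hypothetical), a single request-scoped trace ID is the join key that ties all three pillars together:

```python
import logging
import uuid

# One request-scoped trace ID joins the three pillars: every log line,
# the metric sample, and the trace span all carry the same identifier.
trace_id = uuid.uuid4().hex

logging.basicConfig(format="%(levelname)s trace=%(trace_id)s %(message)s",
                    level=logging.INFO)
log = logging.LoggerAdapter(logging.getLogger("checkout"),
                            {"trace_id": trace_id})

log.info("cache miss, falling back to database")      # log: discrete event
latency_ms = 412                                      # metric: one sample
span = {"service": "checkout", "trace_id": trace_id,  # trace: one span
        "duration_ms": latency_ms}
```

In a real system the instrumentation library stamps the trace ID onto log records for you; the point is that without a shared ID, the three pillars are three silos.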
OpenTelemetry: The Convergence
OpenTelemetry (OTel) has become the standard for instrumenting applications. It provides a vendor-neutral SDK for generating traces, metrics, and logs, with exporters for every major observability backend (Datadog, Grafana, Honeycomb, New Relic, Jaeger). The key insight behind OTel is that instrumentation should be separated from the observability backend — instrument once, export to any vendor.
For AI applications, OTel is particularly valuable because AI pipelines are inherently distributed: an API request triggers RAG retrieval, prompt construction, model inference, output parsing, and response formatting — each potentially running on different services. OTel traces capture this entire chain with timing, input/output sizes, and error states at each step.
Instrument every layer a request touches:
- HTTP server and client libraries (auto-instrumentation available for most frameworks)
- Database clients — query timing, connection pool metrics, query text for slow query analysis
- LLM API calls — model name, token count, latency, prompt/completion token breakdown
- Queue consumers — message processing time, batch size, lag
- Custom business logic — spans around critical code paths with domain-specific attributes
Observability for AI Systems
AI systems introduce observability challenges that traditional web applications do not face. Model inference is non-deterministic: the same input can produce different outputs, which makes issues harder to reproduce. Model quality degrades silently over time as data distributions shift. And cost scales with token usage rather than request count, which makes it hard to predict from traffic alone.
| Signal | What to Track | Alert Threshold |
|---|---|---|
| Inference latency (p50/p95/p99) | Response time from model | p99 > 2x baseline |
| Token usage per request | Input + output tokens | Mean > 150% of baseline |
| Error rate by error type | Rate limits, timeouts, format errors | Any error type > 1% |
| Output quality score | LLM-as-judge or heuristic quality | Rolling avg drops > 10% |
| Cost per request | Token cost + infrastructure cost | Daily cost > 120% of budget |
| Cache hit rate | Semantic cache effectiveness | Hit rate drops below 30% |
Implementing Observability-Driven Development
Add OTel instrumentation to your service framework before writing business logic. Tracing should be automatic for all HTTP and database operations from day one.
Service Level Objectives (e.g., 99.9% availability, p95 latency < 200ms) give you a framework for deciding what matters. Alert on SLO error-budget burn rate, not on individual metric thresholds.
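The burn-rate arithmetic is simple. A sketch for a 99.9% availability SLO over a 30-day window (the 14.4x threshold follows the common fast-burn alerting convention, not anything specific to this document):

```python
SLO = 0.999
ERROR_BUDGET = 1 - SLO   # 0.1% of requests may fail over the window
WINDOW_HOURS = 30 * 24   # 30-day SLO window = 720 hours

def burn_rate(errors: int, total: int) -> float:
    """How many times faster than 'exactly exhausting the budget at the
    end of the window' we are currently failing. 1.0 means on pace."""
    return (errors / total) / ERROR_BUDGET

# A sustained burn rate of 14.4 spends 2% of the monthly error budget
# every hour (0.02 * 720 hours = 14.4), the classic fast-burn page.
print(burn_rate(errors=144, total=100_000))
```

Pairing a fast window (page at high burn rate) with a slow window (ticket at low burn rate) catches both outages and slow leaks.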
A dashboard shows you that something is wrong. A debug workflow takes you from "something is wrong" to "here is why" in minutes. Design your observability around the debugging journey.
When model quality drops, is it because inference latency increased (timeouts truncating responses)? Because a new model version was deployed? Because RAG retrieval is returning irrelevant documents? Only telemetry that correlates these dimensions (latency, deployment version, retrieval relevance) can tell the causes apart.
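A toy version of that correlation step on synthetic records (all field names hypothetical): slice the quality score by a deployment attribute already recorded on each trace.

```python
from collections import defaultdict
from statistics import mean

# Synthetic per-request records, as exported from traces.
requests = [
    {"model_version": "v2", "retrieval_score": 0.82, "quality": 0.91},
    {"model_version": "v2", "retrieval_score": 0.84, "quality": 0.89},
    {"model_version": "v3", "retrieval_score": 0.81, "quality": 0.64},
    {"model_version": "v3", "retrieval_score": 0.80, "quality": 0.60},
]

by_version = defaultdict(list)
for r in requests:
    by_version[r["model_version"]].append(r["quality"])

# Retrieval scores are stable across versions, so the quality drop
# correlates with the v3 deployment, not with RAG relevance.
summary = {v: round(mean(scores), 2) for v, scores in by_version.items()}
print(summary)  # {'v2': 0.9, 'v3': 0.62}
```

The same query is one `GROUP BY` in any observability backend; the prerequisite is that the deployment version was attached as a span attribute in the first place.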
“The best engineering teams treat observability as a feature, not infrastructure. Every sprint includes observability work because every feature that ships without observability is a feature that cannot be debugged in production.”