Agent demos are easy. You chain a few LLM calls, wire up some tools, and show the system completing a task. The audience is impressed. Then you try to deploy it. The tool returns a malformed response. The LLM loops. The context window fills. Costs spike. A user's edge case crashes the whole flow.
Production agent architecture is distributed systems engineering applied to non-deterministic components. The patterns that make backend services reliable — idempotency, circuit breakers, observability, graceful degradation — all apply. The difference is that your components now include an LLM that can surprise you.
Pattern 1: The Orchestrator-Worker Split
The most reliable agent architectures separate orchestration from execution. An orchestrator agent plans and routes — it decides what needs to happen, in what order, with what tools. Worker agents execute discrete, bounded tasks and return structured results. The orchestrator never executes directly; workers never plan.
This separation has concrete benefits. Workers can be tested in isolation with fixed inputs. Their failure modes are bounded and predictable. The orchestrator's behavior is visible as a sequence of routing decisions. When something goes wrong, you know immediately whether the failure was in planning or execution.
The anti-pattern is a single monolithic agent that both plans and executes. It works for demos. It becomes unmaintainable when the task space grows, because the system prompt balloons, context fills with execution history, and reasoning quality degrades.
- Orchestrator emits structured task objects, never raw text instructions to workers.
- Workers return structured results with a status field: success, failure, needs_human.
- Orchestrator never retries workers inline — failures go to a retry queue.
- Workers are stateless — all state lives in the orchestrator's state object.
- Each worker has a documented input schema and output schema. Treat them like microservices.
Pattern 2: Explicit State Machines
Every agent workflow has implicit states: gathering information, validating inputs, executing actions, waiting for approval, completing. Make them explicit. A well-designed agent state machine defines what transitions are valid from each state, what triggers them, and what happens when a transition fails.
The payoff is observability and recovery. When an agent run fails, you know exactly which state it was in. Recovery means reinserting the job at the failed state, not re-running from scratch. For long-running workflows that involve expensive tool calls, this is the difference between a 5-second retry and a 5-minute retry.
| State | Entry Condition | Valid Transitions | Failure Handling |
|---|---|---|---|
| INTAKE | New job received | VALIDATING, REJECTED | Return error to caller |
| VALIDATING | Intake complete | PLANNING, REJECTED | Dead-letter queue |
| PLANNING | Validation passed | EXECUTING, FAILED | Retry up to 3x then escalate |
| EXECUTING | Plan approved | REVIEWING, FAILED | Checkpoint and retry from last step |
| REVIEWING | Execution complete | COMPLETE, REVISING | Human escalation |
| COMPLETE | Review passed | — | N/A |
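The transition table above can be enforced with a few lines of code; making illegal transitions raise loudly is what turns the implicit workflow into an explicit state machine. A minimal sketch (terminal states like REJECTED and FAILED, and the REVISING loop back to EXECUTING, are elided for brevity):

```python
# Valid transitions from the table above; anything else is a bug, not a retry.
TRANSITIONS: dict[str, set[str]] = {
    "INTAKE": {"VALIDATING", "REJECTED"},
    "VALIDATING": {"PLANNING", "REJECTED"},
    "PLANNING": {"EXECUTING", "FAILED"},
    "EXECUTING": {"REVIEWING", "FAILED"},
    "REVIEWING": {"COMPLETE", "REVISING"},
    "COMPLETE": set(),          # terminal: no outgoing transitions
}


class InvalidTransition(Exception):
    pass


def transition(state: str, new_state: str) -> str:
    """Move to new_state only if the table allows it."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise InvalidTransition(f"{state} -> {new_state} is not allowed")
    return new_state
```

Persisting the current state with each job is what makes recovery cheap: reinsert the job at its recorded state instead of replaying the whole run.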
Pattern 3: Tool Call Hardening
Tools are the most common failure point in production agent systems. External APIs return unexpected formats. Rate limits hit. Services go down. The agent tries to parse a 500 error response as a valid tool result and hallucinates from there.
Every tool wrapper in production should:
- Validate input before calling the external API (the LLM sometimes generates invalid arguments).
- Validate the output schema before returning to the agent.
- Implement retry logic with exponential backoff.
- Respect rate limits with a token bucket or leaky bucket implementation.
- Return a structured error object rather than throwing exceptions. A structured error gives the LLM something to reason about; an exception gives it nothing.
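A hardening wrapper along these lines might look as follows. This is a sketch under assumptions: `validate_input` and `validate_output` stand in for your schema checks, and rate limiting (the token bucket) is omitted to keep the example short:

```python
import random
import time


def call_tool_hardened(tool_fn, args: dict, validate_input, validate_output,
                       max_retries: int = 3, base_delay: float = 0.5) -> dict:
    """Wrap an external tool call; always return a structured result,
    never raise into the agent loop."""
    # 1. Validate LLM-generated arguments before touching the external API.
    if not validate_input(args):
        return {"ok": False, "error": "invalid_arguments", "detail": str(args)[:200]}

    for attempt in range(max_retries):
        try:
            result = tool_fn(**args)
            # 2. Validate the response schema before handing it to the agent,
            #    so a 500 body never gets parsed as a valid tool result.
            if not validate_output(result):
                return {"ok": False, "error": "malformed_response",
                        "detail": str(result)[:200]}
            return {"ok": True, "result": result}
        except Exception as exc:
            if attempt < max_retries - 1:
                # 3. Exponential backoff with jitter before the next attempt.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
            else:
                # 4. Structured error the LLM can reason about.
                return {"ok": False, "error": "tool_failure", "detail": str(exc)}
```

The `ok`/`error` shape is arbitrary; what matters is that every code path returns an object the orchestrator and the LLM can inspect.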
Pattern 4: Cost and Token Budget Management
Unmanaged agent cost is the reason many internal agent projects get cancelled. A workflow that costs $0.02 in development can cost $2.00 in production when real users hit edge cases, the agent loops, or context grows beyond what was tested. Multiply by thousands of daily runs and you have a cost center that kills ROI.
Implementing Token Budget Management
Set a hard limit for total tokens (input + output) per agent run. This is a business decision as much as a technical one — what is the maximum acceptable cost for one execution? Start conservative and widen as you validate.
Every call to an LLM must log prompt tokens, completion tokens, model, and timestamp. Aggregate per-run. This is non-negotiable for production. You cannot optimize what you cannot measure.
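A minimal in-process ledger for this logging might look like the sketch below; in production these records would go to your metrics pipeline rather than a list, and the class names here are illustrative:

```python
import time
from dataclasses import dataclass


@dataclass
class LLMCallRecord:
    run_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    timestamp: float


class TokenLedger:
    """Log every LLM call; aggregate token usage per agent run."""

    def __init__(self):
        self.records: list[LLMCallRecord] = []

    def log(self, run_id: str, model: str,
            prompt_tokens: int, completion_tokens: int) -> None:
        self.records.append(LLMCallRecord(
            run_id, model, prompt_tokens, completion_tokens, time.time()))

    def run_total(self, run_id: str) -> int:
        """Total tokens (input + output) consumed by one run."""
        return sum(r.prompt_tokens + r.completion_tokens
                   for r in self.records if r.run_id == run_id)
```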
For workflows that accumulate history, add periodic summarization: compress the last N messages into a summary and drop the originals from context. Implement this as a node in your state machine, triggered when context exceeds a threshold.
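The summarization node can be sketched as a pure function over the message list. Assumptions: `summarize_fn` is one LLM call (ideally on a small model), `keep_last` and the message shape are placeholders for your own conventions:

```python
def maybe_summarize(messages: list[dict], token_count: int, threshold: int,
                    summarize_fn, keep_last: int = 4) -> list[dict]:
    """If context exceeds the threshold, compress older messages into a
    summary and keep only the most recent ones."""
    if token_count <= threshold or len(messages) <= keep_last:
        return messages  # under budget: leave context untouched
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary_text = summarize_fn(older)  # one call, e.g. on a small model
    summary_msg = {"role": "system",
                   "content": f"Summary of earlier context: {summary_text}"}
    return [summary_msg] + recent
```

Triggering this from the state machine (rather than ad hoc inside the agent loop) keeps context growth observable as an explicit event in the trace.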
Not every step needs the most capable model. Tool call parsing, format validation, and simple classification can run on smaller models at 10x lower cost. Reserve the large model for complex reasoning steps.
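Model routing can be as simple as a lookup table keyed by step type. The step names and model identifiers below are hypothetical; the only design decision that matters is the fail-safe default:

```python
# Hypothetical routing table: cheap model for mechanical steps,
# the capable model reserved for reasoning.
MODEL_BY_STEP = {
    "parse_tool_call": "small-model",
    "validate_format": "small-model",
    "classify_intent": "small-model",
    "plan": "large-model",
    "synthesize_answer": "large-model",
}


def pick_model(step: str) -> str:
    # Unknown steps default to the capable model -- fail safe on quality,
    # not on cost.
    return MODEL_BY_STEP.get(step, "large-model")
```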
If a run has consumed 80% of its token budget, emit an alert and route to a fallback path. Do not let runaway loops exhaust budgets silently. The fallback can be a human escalation, a simplified answer, or a graceful decline.
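The budget check itself is small; what matters is that it runs before every LLM call and that the fallback path is wired in from day one. A sketch, with `alert_fn` and `fallback_fn` standing in for your alerting and fallback handlers:

```python
def check_budget(consumed: int, budget: int, alert_fn, fallback_fn):
    """Enforce the per-run token budget: alert and reroute at 80%,
    hard-stop at 100%. Returns None when the run may continue."""
    if consumed >= budget:
        # Hard limit reached: never continue silently.
        return fallback_fn("budget_exhausted")
    if consumed >= 0.8 * budget:
        alert_fn(f"run at {consumed}/{budget} tokens")
        # Route to a fallback: human escalation, simplified answer,
        # or a graceful decline.
        return fallback_fn("budget_warning")
    return None
```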
Pattern 5: Human-in-the-Loop Escalation
Fully autonomous agents are appropriate for bounded, low-stakes tasks. For anything involving financial decisions, customer-facing communication, compliance actions, or irreversible operations, you need human escalation paths. Building these in from the start is far easier than retrofitting them after an incident.
The key design question is not "should the agent escalate" but "what information does the human need to make a good decision quickly." A human escalation that dumps raw agent context is nearly unusable. A good escalation presents: what the agent was trying to do, what decision it needs made, the relevant options, and the expected impact of each. Design that interface with the same care as any user-facing product.
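That interface can be made concrete as a structured escalation payload. Every field name and example value here is illustrative; the point is the shape, which mirrors the four items above:

```python
from dataclasses import dataclass


@dataclass
class EscalationRequest:
    """What a human needs to decide quickly -- not a raw context dump."""
    goal: str               # what the agent was trying to do
    decision_needed: str    # the specific question for the human
    options: list[dict]     # each: {"label": ..., "expected_impact": ...}
    run_id: str             # link back to the full trace, if they want it
```

An example instance might be `EscalationRequest(goal="process refund request", decision_needed="Approve a $40 refund outside policy?", options=[...], run_id="r-123")`; the full agent context stays one click away behind `run_id` instead of being pasted into the ticket.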
Observability as a First-Class Concern
Traces, not just logs. Every agent run should produce a trace: the sequence of LLM calls, tool calls, state transitions, and decisions made. Structured traces enable both real-time monitoring and post-hoc debugging. LangSmith, Langfuse, and Arize Phoenix all provide agent-specific tracing. Pick one before you deploy.
The metrics that matter in production: run success rate, average cost per run, p95 latency, tool error rate by tool, escalation rate, and context utilization percentage. Alert on run success rate dropping below baseline and cost-per-run exceeding 2x the rolling average. Everything else is nice to have.
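The two alert rules above reduce to a few comparisons; a sketch, assuming your metrics pipeline already computes the baseline success rate and the rolling average cost:

```python
def should_alert(success_rate: float, baseline_success_rate: float,
                 cost_per_run: float, rolling_avg_cost: float) -> list[str]:
    """Return the alerts that should fire, per the two rules that matter:
    success rate below baseline, cost per run above 2x the rolling average."""
    alerts = []
    if success_rate < baseline_success_rate:
        alerts.append("success_rate_below_baseline")
    if cost_per_run > 2 * rolling_avg_cost:
        alerts.append("cost_per_run_above_2x_rolling_avg")
    return alerts
```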
“You cannot debug a system you cannot observe. Agent systems are not special — they follow the same rule as every other distributed component.”