
Context Engineering: Why Your AI Agent Fails and Your Prompts Cannot Fix It

Prompt engineering gets you a demo. Context engineering gets you a production system. Over 70% of errors in modern LLM applications stem not from insufficient model capability but from incomplete, irrelevant, or poorly structured context. This article breaks down what context engineering actually is, why it replaced prompt engineering as the discipline that matters, and how to implement it in production AI systems.

Author: Abhishek Sharma · Fordel Studios

In September 2025, Anthropic published a blog post titled "Effective Context Engineering for AI Agents" that racked up nearly 500,000 views in weeks. The post did not introduce a new model or a new API. It described a discipline: the systematic practice of deciding what information an AI model needs, when it needs it, and how it should be structured. The post resonated because it named something every production AI team had already discovered the hard way — the model is not the bottleneck. The context is.

The term "context engineering" has since become the dominant framing for a set of problems that prompt engineering never solved. In 2024, a team struggling with unreliable AI agent behaviour would tweak their prompts. In 2026, that same team is redesigning the information pipeline that feeds the model. The shift is not semantic. It is architectural.

  • 70%+ of errors in production LLM applications stem from context failures, not model capability gaps (source: production AI industry analysis)
  • 40% of enterprise apps will feature task-specific AI agents by late 2026, up from less than 5% in 2025 (source: Gartner, August 2025)
  • 500K views on Anthropic's context engineering blog post within weeks (published September 2025, alongside the Claude Sonnet 4.5 release)
···

What Context Engineering Actually Is

Prompt engineering is what you do inside the context window. Context engineering is how you decide what fills the window.

That distinction sounds small until you work on a production system. A prompt engineer writes better instructions, refines few-shot examples, and experiments with phrasing. A context engineer builds the system that determines which documents get retrieved, how conversation history is compressed, what tool definitions are exposed, which memory is surfaced, and how all of it is structured before the model sees a single token.

Anthropic framed it as answering one question: "What configuration of context is most likely to generate our model's desired behaviour?" That question subsumes prompt engineering entirely. The prompt is one input. The context is the entire information environment.

| Dimension | Prompt Engineering | Context Engineering |
| --- | --- | --- |
| Focus | How you phrase the instruction | What information the model has access to |
| Scope | The prompt text itself | Retrieval, memory, tools, history, structure — the full pipeline |
| Failure mode | Wrong phrasing leads to wrong output | Missing or irrelevant context leads to hallucination or task failure |
| Scalability | Manual iteration per use case | Systematic pipeline that serves all agent tasks |
| When it matters | Simple single-turn completions | Multi-step agents, tool use, production systems |
| Debugging | Rewrite the prompt | Trace what context was assembled, what was missing, what was noise |

Neo4j's engineering blog put it directly: prompt engineering asks how to communicate with the model; context engineering asks what the model needs to know. In production agent systems — where the model is making tool calls, accessing databases, maintaining session state, and deciding when to escalate — the "what" dominates the "how" by an order of magnitude.

Why Prompt Engineering Stopped Working

Prompt engineering worked when the typical AI application was a single-turn completion: user sends a query, model returns a response, done. The entire context fit in a system prompt plus user message. You could hand-tune it.

Production AI agents in 2026 are not single-turn completions. They are multi-step systems that retrieve documents, call external tools, maintain conversation history, access long-term memory, decide which tool definitions to expose, and route to humans when confidence drops. When a system does all of this, the prompt is maybe 5% of what the model sees. The other 95% is assembled dynamically — retrieved documents, tool schemas, compressed history, memory lookups, structured data from APIs.

If that 95% is wrong, no prompt can save you.

What Production AI Agents Actually Do
  • Retrieve documents from vector stores and knowledge graphs based on the current task
  • Call external tools — APIs, databases, code interpreters — and incorporate the results
  • Maintain conversation history across sessions, compressing and summarising as needed
  • Access long-term memory about the user, the organisation, or previous interactions
  • Decide which of dozens of tool definitions to expose for a given step
  • Route to human operators when confidence drops below task-specific thresholds
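The shape of that 95% can be sketched as a minimal agent step. Everything here is a hypothetical stub (`SYSTEM_PROMPT`, `retrieve`, `relevant`, and `call_model` stand in for real retrieval, memory, and model calls); the point is that the authored prompt is one small field in a context assembled fresh on every turn.

```python
SYSTEM_PROMPT = "You are a support agent. Use only the provided context."

def retrieve(query):            # stub for vector-store / graph retrieval
    return [f"doc relevant to: {query}"]

def relevant(memory, query):    # stub for relevance-scored memory lookup
    return any(word in memory for word in query.split())

def call_model(context):        # stub for the actual LLM call
    return f"answer built from {len(context['docs'])} docs"

def agent_step(query, history, memories):
    # The context is assembled dynamically on every step; the authored
    # system prompt is a small fraction of what the model actually sees.
    context = {
        "system": SYSTEM_PROMPT,
        "docs": retrieve(query),
        "history": history[-6:],   # bounded window, not unbounded growth
        "memory": [m for m in memories if relevant(m, query)],
    }
    response = call_model(context)
    return response, history + [(query, response)]
```

If the assembly in `context` is wrong, no amount of editing `SYSTEM_PROMPT` will repair the step.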

This is not a theoretical argument. The production AI teams at companies like Anthropic, LangChain, and LlamaIndex converged on the same conclusion independently: agent failures are context failures. The model had the capability. It did not have the information.

Most agent failures are not model failures. They are context failures — missing data, poorly formatted inputs, misused tools, or unhandled edge cases that no amount of prompt rewriting will fix.
Anthropic Applied AI Team
···

The Two Failure Modes Nobody Talks About

Context Rot

Context rot is the degradation of model performance as the context window fills with information. Larger context windows — 200K tokens on Claude, 1M on Gemini — do not solve this. They mask it.

The problem is attentional: the more information in the window, the harder it is for the model to identify what matters. A 200K context window stuffed with every possibly relevant document performs worse than a 32K window with only the high-signal content. The model's attention becomes diluted. It starts missing critical details that are buried under volume.

In production, context rot manifests as agents that work well in testing (short contexts) but degrade in real usage (long sessions with accumulated history). The fix is not a bigger window. It is a better pipeline — one that aggressively filters, compresses, and prioritises what enters the context.

Mode Collapse

Mode collapse in production AI is the systematic reduction of output diversity through alignment training and repetitive context patterns. An agent that always receives the same structured context — same tool definitions, same system prompt, same retrieval format — gradually narrows its behavioural range. It becomes predictable in ways that reduce its usefulness for edge cases.

The practical impact: your agent handles the 80% case well and fails silently on the 20% that matters most. It produces outputs that look right but lack the reasoning diversity needed for novel situations. Context engineering addresses this by varying the structure, not just the content, of what the model receives.

···

The Six Techniques That Actually Matter

LangChain's engineering team formalised context engineering into four core strategies — Write, Select, Compress, and Isolate. Combined with retrieval and memory patterns from across the ecosystem, these yield the six techniques that define the discipline in 2026:

Production Context Engineering Techniques

01
Write: Structured instructions and tool definitions

System prompts, tool schemas, and few-shot examples are the authored context — the part you control directly. The principle from Anthropic: find the smallest set of high-signal tokens that maximise the likelihood of your desired outcome. Every token in the system prompt that does not directly improve task performance is noise that competes for the model's attention. Tool definitions should describe exactly what each tool does, when to use it, and what the expected output format looks like. Vague tool descriptions are one of the most common causes of incorrect tool selection in production agents.
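As a concrete illustration, here is what a high-signal tool definition might look like in the JSON-schema style used by current tool-calling APIs. The tool, its name, and its fields are hypothetical, not from the article; what matters is that it states what the tool does, when to use it, and what comes back.

```python
# Hypothetical tool definition: precise purpose, explicit trigger
# condition, and a described output format.
search_invoices = {
    "name": "search_invoices",
    "description": (
        "Search the billing database for a customer's invoices. Use this "
        "only when the user asks about a specific invoice, amount, or "
        "billing period. Returns a JSON list of invoices, each with "
        "id, date, amount, and status."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "Exact customer ID as shown in the CRM.",
            },
            "date_from": {
                "type": "string",
                "description": "ISO 8601 start date, inclusive. Optional.",
            },
        },
        "required": ["customer_id"],
    },
}
```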

02
Select: Dynamic retrieval from RAG and knowledge graphs

RAG remains the foundation of context selection, but the 2026 approach is substantially more sophisticated. Vector retrieval handles precise factual queries. GraphRAG — which constructs knowledge graphs and community summaries — handles open-ended questions requiring multi-hop reasoning and global perspective. The combination covers over 90% of enterprise knowledge needs. The key engineering decision: retrieval must be task-aware. A generic "retrieve the top 5 most similar documents" approach produces mediocre results. Production systems use query rewriting, contextual filtering, and relevance scoring calibrated per task type.
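A minimal sketch of that routing, with `vector_search`, `graph_search`, and the query rewriter all as hypothetical stubs (a real rewriter would be an LLM call, and the backends would be a vector store and a knowledge graph):

```python
def vector_search(query, top_k, min_score):      # stub retrieval backend
    return [f"chunk matching '{query}'"]

def graph_search(query, max_hops):               # stub graph backend
    return [f"entity path for '{query}'"]

def rewrite_query(query, history):
    # Stand-in for an LLM rewrite: expand a terse follow-up such as
    # "what about Q3?" with the previous turn so retrieval can match it.
    if history and len(query.split()) < 4:
        return f"{history[-1]} (follow-up: {query})"
    return query

def retrieve(query, intent, history=()):
    q = rewrite_query(query, list(history))
    if intent == "multi_hop_reasoning":
        return graph_search(q, max_hops=2)       # relationships, not similarity
    return vector_search(q, top_k=5, min_score=0.75)
```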

03
Compress: Summarisation and semantic filtering

Conversation history, tool results, and retrieved documents all grow over time. Without compression, context rot sets in. Production systems implement rolling summarisation of conversation history (keeping recent turns verbatim, summarising older ones), semantic deduplication of retrieved content, extraction of key entities and decisions from long tool outputs, and progressive compression that preserves decision-relevant information while discarding procedural detail. The goal is minimum viable context — the smallest, most relevant set of high-signal tokens.
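Rolling summarisation can be sketched in a few lines. The one-line summary here is a toy stand-in for an LLM summarisation call; the structure (recent turns verbatim, older turns collapsed) is the point.

```python
def compress_history(turns, keep_recent=4):
    """Keep the newest turns verbatim; collapse older ones into one summary."""
    if len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Toy summariser: keep the first clause of each old turn, which is
    # where decisions tend to live in this illustrative format.
    summary = "Earlier in this session: " + "; ".join(
        t.split(".")[0] for t in older
    )
    return [summary] + recent
```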

04
Isolate: Scoped context per sub-task

Multi-agent systems and complex workflows benefit from context isolation — giving each sub-agent or processing step only the context it needs. A planning agent does not need the raw tool outputs from an execution agent. A summarisation step does not need the full retrieval results. Isolation prevents cross-contamination of context between unrelated sub-tasks and keeps each step's context window lean. LangGraph implements this through separate state channels per agent in a graph workflow.
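A framework-free sketch of the isolation boundary, loosely analogous to per-channel state (the state keys and step here are invented for illustration): each step declares the channels it needs and sees nothing else.

```python
def run_step(state, needs, step):
    """Give the step only the channels it declares; merge its outputs back."""
    scoped = {key: state[key] for key in needs if key in state}
    return {**state, **step(scoped)}

state = {
    "plan": ["fetch logs", "summarise failures"],
    "raw_tool_output": "ERROR timeout at 02:14 ... (8 KB of logs)",
    "summary": None,
}

# The summariser sees only the raw output, never the plan; a later
# planning step would see the summary, never the raw 8 KB payload.
state = run_step(
    state,
    needs=["raw_tool_output"],
    step=lambda s: {"summary": s["raw_tool_output"][:25]},
)
```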

05
Remember: Persistent memory across sessions

Session memory (what happened in this conversation) and long-term memory (what we know about this user, this codebase, this organisation) are distinct engineering problems. Tools like Mem0, Zep, and LangGraph's memory stores provide the infrastructure. The engineering challenge is deciding what to remember, when to surface it, and how to prevent stale memories from degrading current performance. Memory that is always injected becomes noise. Memory that is never surfaced is wasted storage. The production pattern is relevance-scored memory retrieval — surfacing memories only when they score above a confidence threshold for the current task.
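The relevance-scored pattern can be sketched as follows. The term-overlap score is a toy stand-in for an embedding similarity or LLM relevance judgement; the threshold gate is the part that generalises.

```python
def surface_memories(memories, query, threshold=0.5):
    """Return only memories scoring above the threshold for this query."""
    terms = query.lower().split()

    def score(memory):
        # Toy relevance: fraction of query terms present in the memory.
        text = memory.lower()
        return sum(term in text for term in terms) / len(terms)

    return [m for m in memories if score(m) >= threshold]
```

Memory below the threshold stays in storage; it is not injected as noise.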

06
Verify: Context validation before inference

The most overlooked technique. Before sending assembled context to the model, validate it: Is the retrieved content actually relevant to the current query? Are tool definitions consistent with available tools? Is the conversation history coherent, or has compression introduced contradictions? Are there PII or compliance-sensitive elements that should be redacted? Context validation catches failures before they become model failures. It is the difference between debugging a hallucination after the fact and preventing it at the pipeline level.
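A validation pass might look like the sketch below. The token estimate and the PII check are deliberately crude stand-ins (a production system would use a real tokenizer and proper DLP tooling); the shape, a checklist that returns problems before inference, is what matters.

```python
import re

def validate_context(payload, available_tools, max_tokens=32_000):
    """Return a list of problems; an empty list means the context passes."""
    problems = []
    # Tool definitions must match tools the runtime can actually execute.
    for tool in payload.get("tools", []):
        if tool["name"] not in available_tools:
            problems.append(f"tool '{tool['name']}' not in available tools")
    # Crude size estimate (~4 chars/token); swap in a real tokenizer.
    estimated_tokens = len(payload.get("text", "")) // 4
    if estimated_tokens > max_tokens:
        problems.append(f"context too large: ~{estimated_tokens} tokens")
    # Naive PII check for email addresses; production systems use real DLP.
    if re.search(r"\b[\w.+-]+@[\w-]+\.\w+\b", payload.get("text", "")):
        problems.append("possible PII (email address) found; redact it")
    return problems
```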

The Tooling Landscape

The context engineering tooling ecosystem in 2026 has consolidated around a few key players, each handling different parts of the pipeline:

| Tool / Framework | Role | When to use it |
| --- | --- | --- |
| LangGraph (LangChain) | Orchestration and agent workflows with explicit context flows | Multi-step agents where context must be routed, compressed, and isolated per step |
| LlamaIndex | Data ingestion, indexing, and retrieval pipelines | When your context comes from documents, databases, or APIs that need structured access |
| Mem0 / Zep | Persistent memory layer across sessions | When agents need to remember user preferences, past interactions, or organisational knowledge |
| Pinecone / Weaviate / pgvector | Vector storage for semantic retrieval | The retrieval backbone for RAG-based context selection |
| Neo4j / knowledge graphs | Structured relationship data for multi-hop reasoning | When context requires understanding relationships between entities, not just document similarity |
| Anthropic / OpenAI structured outputs | Enforcing output format from the model | When downstream systems need predictable context formats from model outputs |

The 2026 best practice, confirmed across multiple production teams, is to combine frameworks: LlamaIndex for data ingestion and indexing, LangGraph for orchestration and agent logic, a vector store for retrieval, and a memory layer for persistence. No single tool handles the full context engineering pipeline.

···

A Production Context Engineering Architecture

Here is what a production context engineering pipeline looks like for an AI agent handling complex enterprise tasks:

Building a Context Engineering Pipeline

01
Query analysis and intent classification

Before retrieving anything, classify the incoming query. Is this a factual lookup, a multi-step reasoning task, a tool-use request, or a conversational follow-up? The classification determines which retrieval strategies activate, which tools are exposed, and how much conversation history to include. This step prevents the most common context engineering failure: treating every query the same way and retrieving the same generic context regardless of intent.
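The classification itself can be as simple as the sketch below. The keyword heuristic is a stand-in for a trained classifier or an LLM call, and the category names are invented for illustration; what matters is that the label drives everything downstream.

```python
def classify_intent(query):
    """Route the query to a context-assembly strategy."""
    q = query.lower()
    if any(w in q for w in ("compare", "relationship", "why does", "impact of")):
        return "multi_hop_reasoning"        # knowledge-graph retrieval
    if any(w in q for w in ("create", "update", "delete", "send", "run")):
        return "tool_use"                   # load only the relevant tools
    if len(q.split()) <= 3:
        return "conversational_followup"    # lean on history, skip retrieval
    return "factual_lookup"                 # tight vector-store query
```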

02
Task-aware retrieval

Based on intent, execute targeted retrieval. Factual queries hit the vector store with tight similarity thresholds. Reasoning tasks query the knowledge graph for entity relationships. Tool-use requests load only the relevant tool definitions — not all 50 tools the agent has access to. The retrieval layer should support query rewriting (reformulating the user query for better retrieval results) and contextual filtering (narrowing results by metadata like recency, source authority, or domain).

03
Context assembly and compression

Assemble retrieved content, conversation history, memory, and tool definitions into a single context payload. Then compress: deduplicate overlapping retrieved content, summarise older conversation turns, extract key values from verbose tool outputs. The target is minimum viable context — enough for the model to complete the task, nothing more. A useful heuristic: if removing a context element does not degrade task performance, it should not be there.
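Deduplication of overlapping retrieved content can be sketched cheaply. The word-overlap (Jaccard) measure here is a stand-in for embedding-based semantic deduplication, which a production pipeline would use instead.

```python
def dedupe_chunks(chunks, threshold=0.8):
    """Drop chunks that heavily overlap a chunk already kept."""
    kept = []
    for chunk in chunks:
        words = set(chunk.lower().split())
        duplicate = any(
            len(words & set(k.lower().split()))
            / max(len(words | set(k.lower().split())), 1) >= threshold
            for k in kept
        )
        if not duplicate:
            kept.append(chunk)
    return kept
```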

04
Validation and safety checks

Before inference, validate the assembled context. Check for PII that should be redacted, verify that retrieved content is actually relevant (not just semantically similar), confirm tool definitions match available tools, and ensure the total token count is within the model's effective processing range — which is not the same as the maximum context window. Log the assembled context for debugging and audit.

05
Inference with structured output

Send the validated context to the model with structured output constraints where needed. After inference, evaluate the response: did the model use the provided context correctly? Did it hallucinate information not in the context? Did it select the right tools? This evaluation feeds back into the retrieval and assembly layers for continuous improvement.
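The post-inference evaluation can be partially automated. The sketch below assumes a hypothetical JSON response schema with `answer` and `sources` keys (invented for illustration): it checks that the output is well-formed, carries the required keys, and cites only sources that actually appear in the supplied context.

```python
import json

def check_response(response_text, context_chunks, required_keys):
    """Return a list of issues; an empty list means the response passes."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    issues = [f"missing key: {key}" for key in required_keys - data.keys()]
    # A cited source the context never contained is a grounding failure.
    for source in data.get("sources", []):
        if not any(source in chunk for chunk in context_chunks):
            issues.append(f"cited source '{source}' is not in the context")
    return issues
```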

06
Post-inference context update

After the model responds, update the context state: add the interaction to conversation history, update memory with new information worth persisting, log the full context-response pair for observability. If this is a multi-step workflow, pass only the relevant context forward to the next step — do not accumulate the full history of every step.
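That hand-off discipline can be sketched as follows, with the state keys invented for illustration: record the turn, then pass forward only the named keys plus a bounded history window, never the full accumulated state.

```python
def advance_state(state, user_turn, response, carry_forward):
    """Record the turn; hand the next step a lean, explicit subset of state."""
    history = state.get("history", []) + [(user_turn, response)]
    next_state = {key: state[key] for key in carry_forward if key in state}
    next_state["history"] = history[-6:]   # bounded window, not everything
    return next_state
```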

···

What This Changes About How You Build

Context engineering is not a new library to install. It is a shift in where your engineering effort goes. Teams that adopt it report three consistent changes:

How Context Engineering Changes Development
  • Debugging shifts from "why did the model say that" to "what context did the model see when it said that" — which is a far more tractable problem. You can inspect, reproduce, and fix context assembly bugs. You cannot easily fix a model's internal reasoning.
  • Evaluation becomes context-centric. Instead of measuring model accuracy on benchmarks, you measure retrieval precision, context relevance scores, and compression fidelity. These metrics are actionable — a low retrieval precision score tells you exactly what to fix.
  • Cost drops significantly. Context engineering naturally leads to minimum viable context, which means fewer tokens per request. Teams using the plan-and-execute pattern — where a frontier model creates a strategy that cheaper models execute with targeted context — report cost reductions of up to 90% compared to sending everything to a frontier model.
The era of the prompt engineer is not over — it has been subsumed. Context engineering includes prompt engineering the way software engineering includes writing code. The code matters. But the system around it matters more.
···

Where Fordel Builds

We build production AI agents for clients in finance, legal, healthcare, and SaaS. Context engineering is not a phase of our projects — it is the architecture. Every agent we ship has task-aware retrieval, context validation, compression pipelines, and observability built into the context layer from day one.

If your AI agent works in demos and breaks in production, the problem is almost certainly not the model. It is what the model sees. We can tell you exactly where your context pipeline is failing and what it takes to fix it. No pitch deck. If that conversation is useful, reach out.
