In September 2025, Anthropic published a blog post titled "Effective Context Engineering for AI Agents" that racked up nearly 500,000 views in weeks. The post did not introduce a new model or a new API. It described a discipline: the systematic practice of deciding what information an AI model needs, when it needs it, and how it should be structured. The post resonated because it named something every production AI team had already discovered the hard way — the model is not the bottleneck. The context is.
The term "context engineering" has since become the dominant framing for a set of problems that prompt engineering never solved. In 2024, a team struggling with unreliable AI agent behaviour would tweak their prompts. In 2026, that same team is redesigning the information pipeline that feeds the model. The shift is not semantic. It is architectural.
What Context Engineering Actually Is
Prompt engineering is what you do inside the context window. Context engineering is how you decide what fills the window.
That distinction sounds small until you work on a production system. A prompt engineer writes better instructions, refines few-shot examples, and experiments with phrasing. A context engineer builds the system that determines which documents get retrieved, how conversation history is compressed, what tool definitions are exposed, which memory is surfaced, and how all of it is structured before the model sees a single token.
Anthropic framed it as answering one question: "What configuration of context is most likely to generate our model's desired behaviour?" That question subsumes prompt engineering entirely. The prompt is one input. The context is the entire information environment.
| Dimension | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | How you phrase the instruction | What information the model has access to |
| Scope | The prompt text itself | Retrieval, memory, tools, history, structure — the full pipeline |
| Failure mode | Wrong phrasing leads to wrong output | Missing or irrelevant context leads to hallucination or task failure |
| Scalability | Manual iteration per use case | Systematic pipeline that serves all agent tasks |
| When it matters | Simple single-turn completions | Multi-step agents, tool use, production systems |
| Debugging | Rewrite the prompt | Trace what context was assembled, what was missing, what was noise |
Neo4j's engineering blog put it directly: prompt engineering asks how to communicate with the model; context engineering asks what the model needs to know. In production agent systems — where the model is making tool calls, accessing databases, maintaining session state, and deciding when to escalate — the "what" dominates the "how" by an order of magnitude.
Why Prompt Engineering Stopped Working
Prompt engineering worked when the typical AI application was a single-turn completion: user sends a query, model returns a response, done. The entire context fit in a system prompt plus user message. You could hand-tune it.
Production AI agents in 2026 are not single-turn completions. They are multi-step systems, and when a system operates this way, the hand-written prompt is maybe 5% of what the model sees. The other 95% is assembled dynamically — retrieved documents, tool schemas, compressed history, memory lookups, structured data from APIs. If that 95% is wrong, no prompt can save you. A production agent must:
- Retrieve documents from vector stores and knowledge graphs based on the current task
- Call external tools — APIs, databases, code interpreters — and incorporate the results
- Maintain conversation history across sessions, compressing and summarising as needed
- Access long-term memory about the user, the organisation, or previous interactions
- Decide which of dozens of tool definitions to expose for a given step
- Route to human operators when confidence drops below task-specific thresholds
This is not a theoretical argument. The production AI teams at companies like Anthropic, LangChain, and LlamaIndex converged on the same conclusion independently: agent failures are context failures. The model had the capability. It did not have the information.
“Most agent failures are not model failures. They are context failures — missing data, poorly formatted inputs, misused tools, or unhandled edge cases that no amount of prompt rewriting will fix.”
The Two Failure Modes Nobody Talks About
Context Rot
Context rot is the degradation of model performance as the context window fills with information. Larger context windows — 200K tokens on Claude, 1M on Gemini — do not solve this. They mask it.
The problem is attentional: the more information in the window, the harder it is for the model to identify what matters. A 200K context window stuffed with every possibly relevant document performs worse than a 32K window with only the high-signal content. The model's attention becomes diluted. It starts missing critical details that are buried under volume.
In production, context rot manifests as agents that work well in testing (short contexts) but degrade in real usage (long sessions with accumulated history). The fix is not a bigger window. It is a better pipeline — one that aggressively filters, compresses, and prioritises what enters the context.
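The filtering step can be sketched as token-budgeted packing: score candidate chunks, then admit only the highest-signal ones until the budget is spent. This is a minimal illustration; the scores, the example chunks, and the rough 4-characters-per-token estimate are all assumptions.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def pack_context(chunks: list[tuple[float, str]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit the token budget."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected

chunks = [
    (0.92, "Refund policy: customers may return items within 30 days."),
    (0.31, "Company picnic scheduled for July."),
    (0.88, "Refunds are issued to the original payment method."),
    (0.12, "Office plants watered on Fridays."),
]
# Only the two high-signal refund chunks survive a tight 30-token budget.
context = pack_context(chunks, budget=30)
```

The point is the inversion: the budget is fixed first, and content competes for it — rather than the window being filled because it is there.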
Mode Collapse
Mode collapse in production AI is the systematic reduction of output diversity through alignment training and repetitive context patterns. An agent that always receives the same structured context — same tool definitions, same system prompt, same retrieval format — gradually narrows its behavioural range. It becomes predictable in ways that reduce its usefulness for edge cases.
The practical impact: your agent handles the 80% case well and fails silently on the 20% that matters most. It produces outputs that look right but lack the reasoning diversity needed for novel situations. Context engineering addresses this by varying the structure, not just the content, of what the model receives.
The Six Techniques That Actually Matter
LangChain's engineering team formalised context engineering into four core strategies — Write, Select, Compress, and Isolate. Combined with retrieval and memory patterns from across the ecosystem, six techniques define the discipline in 2026:
Production Context Engineering Techniques
1. Authored Context
System prompts, tool schemas, and few-shot examples are the authored context — the part you control directly. The principle from Anthropic: find the smallest set of high-signal tokens that maximise the likelihood of your desired outcome. Every token in the system prompt that does not directly improve task performance is noise that competes for the model's attention. Tool definitions should describe exactly what each tool does, when to use it, and what the expected output format looks like. Vague tool descriptions are one of the most common causes of incorrect tool selection in production agents.
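A well-authored tool definition spells out when to use the tool, when not to, and what comes back. The sketch below loosely follows common function-calling schema shapes; the tool name, fields, and return format are illustrative, not a specific vendor's API.

```python
# Illustrative tool definition: the description covers purpose, usage
# boundaries, and output shape, so the model can select it correctly.
lookup_invoice = {
    "name": "lookup_invoice",
    "description": (
        "Fetch a single invoice by its ID. Use this when the user asks about "
        "a specific invoice's status, amount, or due date. Do NOT use it for "
        "aggregate questions like 'total spend this quarter'. "
        "Returns JSON: {id, status, amount_cents, currency, due_date}."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_id": {
                "type": "string",
                "description": "Invoice ID, e.g. 'INV-2041'.",
            }
        },
        "required": ["invoice_id"],
    },
}
```

Compare that description with a vague one like "Looks up invoices" — the explicit negative case ("Do NOT use it for aggregate questions") is what prevents the most common mis-selection.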
2. Task-Aware Retrieval
RAG remains the foundation of context selection, but the 2026 approach is substantially more sophisticated. Vector retrieval handles precise factual queries. GraphRAG — which constructs knowledge graphs and community summaries — handles open-ended questions requiring multi-hop reasoning and global perspective. The combination covers over 90% of enterprise knowledge needs. The key engineering decision: retrieval must be task-aware. A generic "retrieve the top 5 most similar documents" approach produces mediocre results. Production systems use query rewriting, contextual filtering, and relevance scoring calibrated per task type.
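Task-aware retrieval is, at its core, a dispatch on intent. A minimal sketch, with the two backends stubbed out — in production they would wrap a vector store and a graph database:

```python
def vector_search(query: str, k: int = 5, min_score: float = 0.8) -> list[str]:
    """Stub for a vector-store lookup (e.g. Pinecone, pgvector)."""
    return [f"doc matching '{query}' (score >= {min_score}, top {k})"]

def graph_search(query: str, hops: int = 2) -> list[str]:
    """Stub for a knowledge-graph traversal (e.g. Neo4j)."""
    return [f"entity paths for '{query}' within {hops} hops"]

def retrieve(query: str, intent: str) -> list[str]:
    """Pick a retrieval strategy per classified intent, not one-size-fits-all."""
    if intent == "factual_lookup":
        return vector_search(query, k=3, min_score=0.85)  # tight threshold
    if intent == "multi_hop_reasoning":
        return graph_search(query, hops=2)  # relationships, not similarity
    if intent == "followup":
        return []  # rely on conversation history instead of new retrieval
    return vector_search(query)  # sensible default

results = retrieve("Who approved supplier X's contract?", "multi_hop_reasoning")
```

Note the `followup` branch: sometimes the correct retrieval strategy is to retrieve nothing and let history carry the turn.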
3. Compression
Conversation history, tool results, and retrieved documents all grow over time. Without compression, context rot sets in. Production systems implement rolling summarisation of conversation history (keeping recent turns verbatim, summarising older ones), semantic deduplication of retrieved content, extraction of key entities and decisions from long tool outputs, and progressive compression that preserves decision-relevant information while discarding procedural detail. The goal is minimum viable context — the smallest, most relevant set of high-signal tokens.
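Rolling summarisation can be sketched in a few lines: keep the last N turns verbatim and collapse everything older into a single summary line. The summariser here is a trivial truncation stand-in; in production it would be a call to a cheap model.

```python
def summarise(turns: list[str]) -> str:
    """Stand-in summariser: a real system would call a small LLM here."""
    return "Summary of earlier conversation: " + " / ".join(t[:30] for t in turns)

def compress_history(turns: list[str], keep_verbatim: int = 4) -> list[str]:
    """One summary line plus the most recent turns, verbatim."""
    if len(turns) <= keep_verbatim:
        return list(turns)
    older, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    return [summarise(older)] + recent

history = [f"turn {i}" for i in range(10)]
compressed = compress_history(history, keep_verbatim=4)
```

Ten turns become five lines; the recent turns the model actually needs verbatim are untouched.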
4. Context Isolation
Multi-agent systems and complex workflows benefit from context isolation — giving each sub-agent or processing step only the context it needs. A planning agent does not need the raw tool outputs from an execution agent. A summarisation step does not need the full retrieval results. Isolation prevents cross-contamination of context between unrelated sub-tasks and keeps each step's context window lean. LangGraph implements this through separate state channels per agent in a graph workflow.
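Isolation amounts to projecting shared state down to a declared per-agent view. A minimal sketch — the state keys, sub-agent names, and view table are all hypothetical:

```python
# Shared workflow state; only some keys are relevant to each step.
state = {
    "goal": "Produce a quarterly spend report",
    "plan": ["fetch invoices", "aggregate", "summarise"],
    "raw_tool_output": "<thousands of lines of invoice JSON>",
    "aggregates": {"total_cents": 1234500},
    "history": ["user asked for the Q3 report"],
}

# Which keys each sub-agent is allowed to see.
VIEWS = {
    "planner":    {"goal", "history"},
    "executor":   {"goal", "plan", "raw_tool_output"},
    "summariser": {"goal", "aggregates"},
}

def context_for(agent: str, state: dict) -> dict:
    """Project shared state onto the agent's declared view."""
    return {k: v for k, v in state.items() if k in VIEWS[agent]}

planner_ctx = context_for("planner", state)
```

The planner never sees the raw tool output, so a verbose execution step cannot pollute the planning step's window.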
5. Memory
Session memory (what happened in this conversation) and long-term memory (what we know about this user, this codebase, this organisation) are distinct engineering problems. Tools like Mem0, Zep, and LangGraph's memory stores provide the infrastructure. The engineering challenge is deciding what to remember, when to surface it, and how to prevent stale memories from degrading current performance. Memory that is always injected becomes noise. Memory that is never surfaced is wasted storage. The production pattern is relevance-scored memory retrieval — surfacing memories only when they score above a confidence threshold for the current task.
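The relevance-threshold pattern can be sketched as follows. The scorer here is naive word overlap purely for illustration; a production system would score with embeddings via a memory layer such as Mem0 or Zep.

```python
def relevance(memory: str, query: str) -> float:
    """Toy scorer: fraction of query words that appear in the memory."""
    q = set(query.lower().split())
    m = set(memory.lower().split())
    return len(q & m) / len(q) if q else 0.0

def surface_memories(memories: list[str], query: str,
                     threshold: float = 0.4) -> list[str]:
    """Inject only memories that clear the relevance threshold for this task."""
    scored = [(relevance(m, query), m) for m in memories]
    return [m for score, m in sorted(scored, reverse=True) if score >= threshold]

memories = [
    "user prefers invoices grouped by supplier",
    "user's dog is named Biscuit",
    "user timezone is Europe/London",
]
hits = surface_memories(memories, "group the invoices by supplier")
```

Only the supplier-grouping preference clears the threshold; the other two memories stay out of the window instead of becoming noise.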
6. Context Validation
The most overlooked technique. Before sending assembled context to the model, validate it: Is the retrieved content actually relevant to the current query? Are tool definitions consistent with available tools? Is the conversation history coherent, or has compression introduced contradictions? Are there PII or compliance-sensitive elements that should be redacted? Context validation catches failures before they become model failures. It is the difference between debugging a hallucination after the fact and preventing it at the pipeline level.
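A validation pass is ordinary defensive code: run checks over the assembled payload and refuse or repair it before inference. A minimal sketch — the payload shape, the checks, and the email-only PII pattern are illustrative, not exhaustive:

```python
import re

# Toy PII detector: real systems check many more patterns than email.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def validate_context(payload: dict, available_tools: set[str],
                     max_tokens: int = 8000) -> list[str]:
    """Return a list of problems found; an empty list means the payload passes."""
    problems = []
    text = payload.get("context", "")
    if EMAIL.search(text):
        problems.append("unredacted email address in context")
    unknown = set(payload.get("tools", [])) - available_tools
    if unknown:
        problems.append(f"tool definitions not backed by real tools: {sorted(unknown)}")
    if len(text) // 4 > max_tokens:  # rough token estimate
        problems.append("context exceeds effective token budget")
    return problems

issues = validate_context(
    {"context": "Contact jane@example.com about the refund.",
     "tools": ["lookup_invoice"]},
    available_tools={"lookup_invoice", "create_ticket"},
)
```

Each problem string maps to a concrete pipeline fix, which is exactly why context failures are more debuggable than model failures.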
The Tooling Landscape
The context engineering tooling ecosystem in 2026 has consolidated around a few key players, each handling different parts of the pipeline:
| Tool / Framework | Role | When to use it |
|---|---|---|
| LangGraph (LangChain) | Orchestration and agent workflows with explicit context flows | Multi-step agents where context must be routed, compressed, and isolated per step |
| LlamaIndex | Data ingestion, indexing, and retrieval pipelines | When your context comes from documents, databases, or APIs that need structured access |
| Mem0 / Zep | Persistent memory layer across sessions | When agents need to remember user preferences, past interactions, or organisational knowledge |
| Pinecone / Weaviate / pgvector | Vector storage for semantic retrieval | The retrieval backbone for RAG-based context selection |
| Neo4j / knowledge graphs | Structured relationship data for multi-hop reasoning | When context requires understanding relationships between entities, not just document similarity |
| Anthropic / OpenAI structured outputs | Enforcing output format from the model | When downstream systems need predictable context formats from model outputs |
The 2026 best practice, confirmed across multiple production teams, is to combine frameworks: LlamaIndex for data ingestion and indexing, LangGraph for orchestration and agent logic, a vector store for retrieval, and a memory layer for persistence. No single tool handles the full context engineering pipeline.
A Production Context Engineering Architecture
Here is what a production context engineering pipeline looks like for an AI agent handling complex enterprise tasks:
Building a Context Engineering Pipeline
Step 1: Classify Intent
Before retrieving anything, classify the incoming query. Is this a factual lookup, a multi-step reasoning task, a tool-use request, or a conversational follow-up? The classification determines which retrieval strategies activate, which tools are exposed, and how much conversation history to include. This step prevents the most common context engineering failure: treating every query the same way and retrieving the same generic context regardless of intent.
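As a sketch, classification can start as keyword heuristics and be swapped for a small model later; the rules and intent labels below are assumptions for illustration.

```python
def classify_intent(query: str) -> str:
    """Map a query to one of four coarse intents via toy keyword rules.
    Production systems typically use a small classifier model instead."""
    q = query.lower()
    if any(w in q for w in ("run", "execute", "send", "create", "update")):
        return "tool_use"
    if any(w in q for w in ("why", "compare", "relationship", "impact")):
        return "multi_hop_reasoning"
    if len(q.split()) <= 4:
        return "followup"  # short fragments usually lean on history
    return "factual_lookup"
```

The returned label then drives everything downstream: which retrieval strategy fires, which tool definitions load, and how much history is included.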
Step 2: Targeted Retrieval
Based on intent, execute targeted retrieval. Factual queries hit the vector store with tight similarity thresholds. Reasoning tasks query the knowledge graph for entity relationships. Tool-use requests load only the relevant tool definitions — not all 50 tools the agent has access to. The retrieval layer should support query rewriting (reformulating the user query for better retrieval results) and contextual filtering (narrowing results by metadata like recency, source authority, or domain).
Step 3: Assemble and Compress
Assemble retrieved content, conversation history, memory, and tool definitions into a single context payload. Then compress: deduplicate overlapping retrieved content, summarise older conversation turns, extract key values from verbose tool outputs. The target is minimum viable context — enough for the model to complete the task, nothing more. A useful heuristic: if removing a context element does not degrade task performance, it should not be there.
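Deduplication during assembly can be sketched with normalised-text comparison; a production system would compare embeddings to catch paraphrases, but the mechanism is the same.

```python
def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical chunks compare equal."""
    return " ".join(text.lower().split())

def dedupe(chunks: list[str]) -> list[str]:
    """Drop chunks whose normalised text has already been included."""
    seen, kept = set(), []
    for chunk in chunks:
        key = normalise(chunk)
        if key not in seen:
            seen.add(key)
            kept.append(chunk)
    return kept

retrieved = [
    "Refunds are issued within 5 business days.",
    "refunds are issued within 5   business days.",  # near-duplicate
    "Refunds require proof of purchase.",
]
payload = dedupe(retrieved)
```

Overlapping retrieval results are one of the quietest sources of wasted tokens: two retrievers returning the same passage doubles its footprint without adding signal.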
Step 4: Validate
Before inference, validate the assembled context. Check for PII that should be redacted, verify that retrieved content is actually relevant (not just semantically similar), confirm tool definitions match available tools, and ensure the total token count is within the model's effective processing range — which is not the same as the maximum context window. Log the assembled context for debugging and audit.
Step 5: Infer and Evaluate
Send the validated context to the model with structured output constraints where needed. After inference, evaluate the response: did the model use the provided context correctly? Did it hallucinate information not in the context? Did it select the right tools? This evaluation feeds back into the retrieval and assembly layers for continuous improvement.
Step 6: Update State
After the model responds, update the context state: add the interaction to conversation history, update memory with new information worth persisting, log the full context-response pair for observability. If this is a multi-step workflow, pass only the relevant context forward to the next step — do not accumulate the full history of every step.
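The update-then-forward pattern can be sketched as two small functions. Everything here is hypothetical: the state keys, the `remember:` marker used to flag memory-worthy facts, and the forwarding view.

```python
def update_state(state: dict, user_msg: str, reply: str) -> dict:
    """Append the turn to history and persist any flagged fact to memory."""
    new = dict(state)
    new["history"] = state.get("history", []) + [("user", user_msg),
                                                 ("assistant", reply)]
    # Hypothetical convention: the agent flags persistable facts with 'remember:'.
    if "remember:" in reply:
        fact = reply.split("remember:", 1)[1].strip()
        new["memory"] = state.get("memory", []) + [fact]
    return new

def forward_context(state: dict, needed: set[str]) -> dict:
    """Pass only the relevant slice to the next step, not the whole history."""
    return {k: v for k, v in state.items() if k in needed}

state = update_state({}, "group invoices by supplier",
                     "Done. remember: user prefers grouping by supplier")
next_ctx = forward_context(state, {"memory"})
```

The forwarding step is what keeps a multi-step workflow from re-accumulating the very context bloat the earlier steps worked to remove.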
What This Changes About How You Build
Context engineering is not a new library to install. It is a shift in where your engineering effort goes. Teams that adopt it report three consistent changes:
- Debugging shifts from "why did the model say that" to "what context did the model see when it said that" — which is a far more tractable problem. You can inspect, reproduce, and fix context assembly bugs. You cannot easily fix a model's internal reasoning.
- Evaluation becomes context-centric. Instead of measuring model accuracy on benchmarks, you measure retrieval precision, context relevance scores, and compression fidelity. These metrics are actionable — a low retrieval precision score tells you exactly what to fix.
- Cost drops significantly. Context engineering naturally leads to minimum viable context, which means fewer tokens per request. Teams using the plan-and-execute pattern — where a frontier model creates a strategy that cheaper models execute with targeted context — report cost reductions of up to 90% compared to sending everything to a frontier model.
“The era of the prompt engineer is not over — it has been subsumed. Context engineering includes prompt engineering the way software engineering includes writing code. The code matters. But the system around it matters more.”
Where Fordel Builds
We build production AI agents for clients in finance, legal, healthcare, and SaaS. Context engineering is not a phase of our projects — it is the architecture. Every agent we ship has task-aware retrieval, context validation, compression pipelines, and observability built into the context layer from day one.
If your AI agent works in demos and breaks in production, the problem is almost certainly not the model. It is what the model sees. We can tell you exactly where your context pipeline is failing and what it takes to fix it. No pitch deck. If that conversation is useful, reach out.