In March 2025, Shopify engineers published an internal post-mortem on their documentation search system. The system used a standard RAG pipeline: embed the query, retrieve the top five chunks from a vector database, pass them to GPT-4, generate an answer. It worked in demos. In production, it hallucinated answers in roughly 17% of cases. Not because the model was bad. Because the retrieval was wrong, and the system had no way to know.
The chunks came back in the wrong order. Relevant information was split across document boundaries. The embedding model lost precision on Shopify-specific terminology like “metafield” and “theme block.” The system retrieved confidently, generated confidently, and was wrong 17% of the time with no mechanism to detect or correct it.
This is the fundamental problem with traditional RAG. It is a one-shot pipeline: retrieve, generate, done. There is no reasoning about whether the retrieved documents actually answer the question. No ability to reformulate the query if the first retrieval misses. No self-correction. In 2024, this was state of the art. In 2026, it is a known architectural deficiency, and the industry is converging on a replacement: agentic RAG.
What Traditional RAG Actually Does
A traditional RAG pipeline has three stages: indexing, retrieval, and generation. During indexing, documents are split into chunks, each chunk is embedded into a vector, and the vectors are stored in a database. During retrieval, the user’s query is embedded using the same model, a similarity search finds the top-k nearest vectors, and the corresponding chunks are returned. During generation, the retrieved chunks are concatenated into the LLM’s context window alongside the query, and the model generates an answer.
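The three stages can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, and `generate` stubs out the LLM call.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline calls an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def index_corpus(docs: list[str]) -> list[tuple[str, Counter]]:
    # Stage 1: indexing. Chunk (here: one chunk per doc), embed, store.
    return [(d, embed(d)) for d in docs]

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 5) -> list[str]:
    # Stage 2: retrieval. Embed the query, take the top-k nearest chunks.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def generate(query: str, context: list[str]) -> str:
    # Stage 3: generation. In production this is an LLM call; stubbed here.
    return f"Answer to {query!r} grounded in {len(context)} chunks."
```

Note what is missing: nothing checks whether the chunks `retrieve` returns are actually relevant before `generate` runs. That gap is the whole story of this article.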
The architecture assumes that the retrieval step will return the right documents. This assumption breaks in predictable ways:
- Semantic gap: the query embedding and document embeddings occupy different semantic spaces, especially for domain-specific terminology that the general-purpose embedding model was never trained on.
- Chunking boundary errors: the answer spans two chunks, but only one is retrieved. The model generates a partial or fabricated answer from incomplete context.
- Ranking failures: the relevant document is in the top-20 results but not the top-5. Standard top-k retrieval misses it entirely.
- Query ambiguity: the user’s question requires information from multiple sources, but the single embedding captures only one facet of the query.
- Temporal drift: the embedding index was built months ago. New documents use different terminology. The embeddings no longer represent the current corpus accurately.
Traditional RAG has no feedback loop. If the retrieval returns irrelevant documents, the model does not know. It generates from whatever context it receives. If the generation hallucinates, there is no verification step. The system returns the answer with the same confidence regardless of whether the underlying retrieval was accurate.
“The fundamental flaw in traditional RAG is not the retrieval algorithm or the generation model. It is the absence of a reasoning layer between retrieval and generation — a layer that asks: did I get what I needed?”
Agentic RAG: Retrieval as a Control Loop
Agentic RAG replaces the linear retrieve-generate pipeline with an autonomous control loop. An AI agent orchestrates the retrieval process, making decisions at each step: what to retrieve, from which sources, whether the retrieval was sufficient, and whether to iterate.
The core loop follows the ReAct (Reason and Act) pattern. The agent receives a query, reasons about what information it needs, takes a retrieval action, observes the results, evaluates whether the retrieved context is sufficient to answer the question, and decides whether to continue retrieving or generate a final answer. This loop can execute multiple times per query.
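The loop structure can be sketched as follows. The `retrieve`, `is_sufficient`, and `reformulate` callables are placeholders for the agent's tools and reasoning steps (in a real system, the latter two are LLM calls); the names are mine, not from any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentState:
    query: str
    context: list[str] = field(default_factory=list)
    iterations: int = 0

def react_loop(
    query: str,
    retrieve: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    reformulate: Callable[[str, list[str]], str],
    max_iterations: int = 3,
) -> AgentState:
    """Reason, act (retrieve), observe, evaluate; repeat until the context
    is judged sufficient or the iteration budget runs out."""
    state = AgentState(query=query)
    current_query = query
    while state.iterations < max_iterations:
        state.iterations += 1
        docs = retrieve(current_query)           # act
        state.context.extend(docs)               # observe
        if is_sufficient(query, state.context):  # evaluate
            break
        current_query = reformulate(query, state.context)  # reason again
    return state
```

The `max_iterations` parameter is not an implementation detail: as the production section below argues, an unbounded version of this loop is a cost liability.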
The key architectural difference is agency over the retrieval process:
| Capability | Traditional RAG | Agentic RAG |
|---|---|---|
| Query handling | Single embedding, single retrieval | Query decomposition into sub-queries, multiple targeted retrievals |
| Source selection | One vector database | Multiple knowledge bases, APIs, databases, web search |
| Retrieval evaluation | None — trusts the retrieval blindly | Agent evaluates relevance and completeness after each retrieval |
| Self-correction | None | Reformulates queries when retrieval quality is low, retries with different strategies |
| Reasoning | No intermediate reasoning | Chain-of-thought reasoning between retrieval steps to plan next actions |
| Output verification | None | Agent verifies generated answer against source documents before returning |
Concretely, when a user asks an agentic RAG system a complex question like “How did our Q4 revenue compare to our top three competitors, and what market factors drove the differences?” the agent decomposes this into sub-tasks: retrieve internal Q4 revenue data, identify the top three competitors, retrieve their public financial data, retrieve market analysis from relevant timeframes, and synthesise the comparison. A traditional RAG pipeline would embed the entire question as a single vector and retrieve whatever five chunks happened to be nearest — almost certainly missing most of the required information.
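The decomposition step can be sketched like this. In production the sub-queries come from an LLM; here `decompose` is stubbed with a hand-written decomposition of the revenue example above, and the function names are illustrative assumptions.

```python
from typing import Callable

def decompose(query: str) -> list[str]:
    # Stub: in production an LLM proposes sub-queries for the given question.
    return [
        "internal Q4 revenue",
        "top three competitors",
        "competitor Q4 financials",
        "Q4 market analysis",
    ]

def multi_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    decomposer: Callable[[str], list[str]] = decompose,
) -> list[str]:
    """Retrieve per sub-query and merge results in order, dropping
    duplicates, so each facet of the question gets its own search."""
    seen: set[str] = set()
    merged: list[str] = []
    for sub in decomposer(query):
        for doc in retrieve(sub):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```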
The Four Patterns That Matter
The agentic RAG space has coalesced around four distinct architectural patterns, each addressing a specific failure mode in traditional RAG:
Corrective RAG: The Retrieval Evaluator
Corrective RAG (CRAG), introduced by Yan et al. in 2024, adds a lightweight retrieval evaluator between the retrieval and generation stages. After the initial retrieval, a classifier scores each retrieved document as Correct, Ambiguous, or Incorrect, and the score triggers one of three corrective paths. If documents score Correct, generation proceeds normally. If Ambiguous, a knowledge refinement step extracts and re-ranks relevant information. If Incorrect, the system falls back to web search or alternative retrieval sources.
The evaluator is typically a small, fine-tuned model — not the primary LLM — which keeps the latency overhead under 100 milliseconds per evaluation. This pattern is particularly effective for factual question answering where the retrieval corpus may be incomplete or outdated.
Self-RAG: Reflection Tokens
Self-RAG, introduced by Asai et al. in 2023 and refined through 2025, takes a different approach: rather than adding an external evaluator, it trains the generation model to emit special reflection tokens that evaluate its own retrieval and generation quality. The model generates tokens like [IsRel] (is this document relevant?), [IsSup] (does the document support the claim?), and [IsUse] (is the response useful?) alongside its regular output.
This self-evaluation happens at generation time with minimal latency overhead. The trade-off is that it requires fine-tuning the LLM to emit these reflection tokens, which locks you into specific model versions and makes model upgrades more complex. Self-RAG achieves strong results on benchmarks but the fine-tuning requirement limits its adoption in production systems that need model flexibility.
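On the consuming side, a system using a Self-RAG-style model needs to parse the reflection tokens out of the generation and decide whether to accept the answer. A minimal sketch, assuming the tokens surface as bracketed text (the real model emits them as special vocabulary items, so the exact surface form depends on your serving stack):

```python
import re

# Assumed surface form of the reflection tokens, e.g. "[IsRel:yes]".
REFLECTION = re.compile(r"\[(IsRel|IsSup|IsUse):(yes|no|partial)\]")

def parse_reflections(generation: str) -> dict[str, str]:
    """Strip reflection tokens from a generation; return the clean text
    plus the model's self-judgements as a dict."""
    tags = {m.group(1): m.group(2) for m in REFLECTION.finditer(generation)}
    text = REFLECTION.sub("", generation).strip()
    return {"text": text, **tags}

def accept(generation: str) -> bool:
    # Keep only answers the model itself judged relevant and supported.
    r = parse_reflections(generation)
    return r.get("IsRel") == "yes" and r.get("IsSup") == "yes"
```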
Adaptive RAG: Query Routing
Adaptive RAG, formalised by Jeong et al. in 2024, addresses the problem of query complexity. Not every question needs the same retrieval strategy. A simple factual lookup (“What is our refund policy?”) can be answered with basic retrieval. A multi-hop reasoning question (“Which products had the highest return rate among customers who also purchased accessories?”) requires iterative multi-step retrieval.
Adaptive RAG uses a trained classifier — typically a T5-large model — to route queries into three tiers. Tier A (simple) routes to basic single-retrieval RAG. Tier B (moderate) routes to multi-step retrieval with chain-of-thought reasoning. Tier C (complex) routes to full agentic retrieval with query decomposition, multi-source retrieval, and iterative refinement. By not over-engineering simple queries, Adaptive RAG reduces average latency and cost while maintaining quality on complex questions.
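The routing skeleton looks like this. The heuristic classifier below is a crude stand-in for the trained model (query length plus multi-hop marker words, both my own assumptions), but the tier dispatch is the essence of the pattern.

```python
import re
from typing import Callable

# Assumed multi-hop marker words; a real system trains a classifier instead.
MULTI_HOP_MARKERS = {"compare", "compared", "among", "which", "between"}

def classify_complexity(query: str) -> str:
    words = re.findall(r"\w+", query.lower())
    if MULTI_HOP_MARKERS & set(words):
        return "C"  # complex: full agentic retrieval
    if len(words) <= 6:
        return "A"  # simple: single retrieval
    return "B"      # moderate: multi-step with chain-of-thought

def route(
    query: str,
    simple_rag: Callable[[str], str],
    multistep_rag: Callable[[str], str],
    agentic_rag: Callable[[str], str],
) -> tuple[str, str]:
    """Dispatch the query to the handler for its complexity tier."""
    tier = classify_complexity(query)
    handler = {"A": simple_rag, "B": multistep_rag, "C": agentic_rag}[tier]
    return tier, handler(query)
```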
GraphRAG: Retrieval Over Relationships
Microsoft Research introduced GraphRAG in early 2024, and by March 2026 it has reached production maturity. Instead of retrieving individual text chunks based on vector similarity, GraphRAG first constructs a knowledge graph from the document corpus — entities, relationships, and community structures — and then retrieves over the graph at query time.
The approach works in two modes. Local search retrieves specific entities and their immediate neighbourhood in the graph, similar to how a traditional RAG system retrieves relevant chunks but with relationship context preserved. Global search uses community summaries — pre-computed summaries of clustered entity groups — to answer questions that require understanding themes and patterns across the entire corpus.
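Local search is, at its core, a bounded neighbourhood expansion over the entity graph. A minimal sketch over a plain adjacency dict (the real GraphRAG implementation also attaches text units and relationship descriptions to each entity):

```python
from collections import deque

def local_search(graph: dict[str, list[str]], entity: str, hops: int = 1) -> set[str]:
    """Collect an entity's neighbourhood out to `hops` edges, so
    relationship context travels with the retrieved entity."""
    frontier = deque([(entity, 0)])
    seen = {entity}
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen
```

Global search is different in kind: it reads pre-computed community summaries rather than walking edges at query time, which is why it scales to corpus-wide thematic questions.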
Microsoft’s benchmarks show GraphRAG achieving 80% accuracy compared to 50% for traditional vector retrieval on questions requiring cross-document reasoning — a 1.6x improvement that makes it compelling for enterprise use cases where documents reference each other extensively. The trade-off is indexing cost: building the knowledge graph requires LLM calls to extract entities and relationships from every document, which can be expensive for large corpora.
“GraphRAG does not replace vector search. It adds a retrieval dimension that vector search cannot provide: the ability to answer questions about relationships between concepts, not just questions about individual concepts.”
The Production Reality: Where Agentic RAG Breaks
The patterns above work elegantly in research papers and conference demos. Production exposes their failure modes. Teams deploying agentic RAG in 2026 are discovering that the control loop introduces its own category of problems.
Runaway Retrieval Loops
An agentic RAG system that cannot find a satisfactory answer will keep retrieving. Without explicit loop limits, a single query can trigger dozens of retrieval calls, each incurring embedding computation, vector search, and LLM reasoning costs. A query that costs $0.02 in a traditional RAG pipeline can cost $0.50 or more in an unbounded agentic loop. At scale, this is the difference between a viable product and a financial liability.
The solution is straightforward but often missed: set a maximum iteration count (typically 3–5 loops), implement a total token budget per query, and define explicit “I don’t know” exit conditions. If the agent cannot ground an answer within the budget, it should say so rather than continuing to search.
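Those three guards fit in a small wrapper around the loop. A sketch, with the `step` callable standing in for one reason-plus-retrieve iteration (it returns a grounded answer or `None`, plus the tokens it consumed):

```python
from typing import Callable, Optional

def bounded_agentic_answer(
    query: str,
    step: Callable[[str], tuple[Optional[str], int]],
    max_iterations: int = 4,
    token_budget: int = 8000,
) -> dict:
    """Run the retrieval loop under hard limits and exit with an explicit
    'I don't know' signal instead of searching forever."""
    spent = 0
    for _ in range(max_iterations):
        answer, tokens = step(query)
        spent += tokens
        if answer is not None:
            return {"answer": answer, "grounded": True, "tokens": spent}
        if spent >= token_budget:
            break  # token budget exhausted before grounding
    # Explicit uncertainty signal: never fabricate past the budget.
    return {"answer": None, "grounded": False, "tokens": spent}
```

Callers can then render `grounded: False` as an honest "I don't know" rather than a low-confidence guess.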
Latency Multiplication
Each iteration in the agentic loop adds latency. A single retrieval takes 50–200ms. An LLM reasoning step takes 500ms–2 seconds. Three iterations means 1.5–6 seconds per query, compared to the 200–500ms of a traditional RAG pipeline. For customer-facing applications where response time matters, this latency is often unacceptable.
Adaptive RAG partially addresses this by routing simple queries through a fast single-retrieval path. But for complex queries that genuinely require multiple retrieval steps, the latency is inherent. Teams must decide: do you optimise for answer quality (more iterations) or response time (fewer iterations)? The answer depends on the use case, and most production systems implement tiered latency budgets — 200ms for autocomplete, 2 seconds for search, 10 seconds for complex analysis.
Embedding Drift
One of the least discussed reliability problems in production RAG is embedding drift. Your vector index was built using a specific embedding model at a specific point in time. New documents added later may use updated terminology. If you update the embedding model, existing vectors no longer align with new ones. One team reported that their hallucination rate jumped 15% after re-chunking their legal policy corpus — the new chunk boundaries split critical clauses across vectors in ways the original chunking did not.
In agentic RAG, embedding drift compounds because the agent may issue multiple queries with different phrasings, and each query interacts with the drift differently. Monitoring embedding quality over time — tracking retrieval precision, re-ranking effectiveness, and hallucination rates per index version — is not optional. It is the difference between a system that degrades silently and one that alerts you before users notice.
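The per-index-version monitoring can start very simply: keep a precision sample per index version and alert when the current version falls meaningfully below the baseline. A sketch (the function name and tolerance value are mine):

```python
from statistics import mean

def drift_alert(
    precision_history: dict[str, list[float]],
    baseline_version: str,
    current_version: str,
    tolerance: float = 0.05,
) -> bool:
    """Compare mean retrieval precision per index version; flag when the
    current index has degraded beyond tolerance relative to the baseline."""
    baseline = mean(precision_history[baseline_version])
    current = mean(precision_history[current_version])
    return (baseline - current) > tolerance
```

The same shape works for re-ranking effectiveness and hallucination rate; the point is that every metric is keyed by index version, so a re-chunk or model swap shows up as a step change rather than a mystery.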
The Framework Landscape
Three frameworks dominate agentic RAG implementation in 2026. Each optimises for a different developer mental model.
| Framework | Approach | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| LangGraph (LangChain) | Graph-based state machines with nodes and edges defining agent workflows | Fine-grained control over agent flow, built-in persistence and checkpointing, strong community | Frequent breaking changes between versions, steep learning curve, custom syntax with operator overloading | Teams needing precise control over multi-step agent workflows |
| LlamaIndex Workflows | Event-driven async pipelines with Pythonic abstractions | Simpler API, faster time-to-production, deep integration with data connectors and index types | Less control over complex orchestration patterns, smaller ecosystem than LangChain | Smaller teams prioritising speed to production |
| Microsoft GraphRAG | Knowledge graph construction and graph-based retrieval | Unmatched cross-document reasoning, open-source, strong enterprise backing | Expensive indexing (requires LLM calls per document), limited to graph-based retrieval patterns | Enterprise knowledge bases with high relationship density between documents |
A common production pattern is combining frameworks: use Adaptive RAG routing at the entry point, LangGraph or LlamaIndex for the agentic retrieval loop, and GraphRAG as one of the retrieval sources alongside traditional vector search. The A-RAG framework (February 2026) formalised this by exposing keyword search, semantic search, and chunk-level retrieval as separate tools the agent can choose between, improving QA accuracy by 5–13% over flat retrieval.
Building an Agentic RAG System: The Decision Framework
Not every RAG system needs to be agentic. The additional complexity — more LLM calls, higher latency, more failure modes — is only justified when the one-shot pipeline demonstrably fails. Here is how to decide:
When to Upgrade from Traditional to Agentic RAG
1. Measure the baseline. Deploy a traditional RAG pipeline first. Measure hallucination rate, retrieval precision, and user satisfaction. If your hallucination rate is below 5% and users are satisfied, agentic RAG adds cost and complexity without proportional benefit. If your rate is above 10%, the control loop will meaningfully improve quality.
2. Diagnose the failure mode. Is retrieval returning irrelevant documents (a ranking problem)? Is the answer split across documents (a chunking problem)? Does the query require information from multiple sources (a decomposition problem)? Each failure mode points to a different agentic pattern: re-ranking for the first, Corrective RAG for the second, Adaptive RAG with query decomposition for the third.
3. Start with the smallest intervention. Add a lightweight retrieval evaluator between retrieval and generation. This single change — scoring retrieved documents before passing them to the LLM — catches the majority of retrieval failures with minimal latency overhead. Only escalate to full agentic retrieval with query decomposition and multi-source retrieval when the evaluator consistently flags insufficient context.
4. Set budgets before you scale. Define a maximum iteration count (3–5), a total token budget per query, and a latency ceiling. Build your agentic loop with these constraints from day one, not retrofitted after the first invoice. The agent should have a clear exit condition: if it cannot ground an answer within budget, it returns an explicit uncertainty signal rather than generating a low-confidence response.
5. Monitor continuously. Track retrieval precision per query type, embedding quality per index version, hallucination rate over time, and cost per query. Agentic RAG systems are dynamic — their behaviour changes as the corpus evolves, as embedding models are updated, and as query patterns shift. Without continuous monitoring, quality degrades silently.
Evaluation: How to Know If It Works
RAG evaluation in 2026 has standardised around three dimensions, each requiring different metrics:
| Dimension | Metrics | Tools | What It Catches |
|---|---|---|---|
| Retrieval Quality | Precision@k, Recall@k, MRR (Mean Reciprocal Rank), NDCG | RAGAS, Maxim AI, custom evaluation harnesses | Wrong documents retrieved, relevant documents ranked too low, missing context |
| Generation Quality | Faithfulness (answer grounded in context), Answer Relevance, Hallucination Rate | RAGAS, Braintrust, DeepEval, LLM-as-Judge | Hallucinated claims, answers that ignore retrieved context, off-topic responses |
| End-to-End Quality | Answer Correctness, User Satisfaction, Task Completion Rate | Human evaluation, A/B testing, implicit feedback signals | System works component-by-component but fails holistically |
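The retrieval-quality metrics in the table are simple to compute yourself against a labelled set of relevant documents. A minimal sketch (function names are mine, not from any particular evaluation library):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant
    document per query, counting 0 when none was retrieved."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```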
For agentic RAG specifically, you need additional metrics that traditional RAG evaluation does not cover: average iterations per query (are your agents looping excessively?), retrieval evaluator accuracy (is the Corrective RAG classifier making good decisions?), query decomposition quality (are sub-queries meaningful?), and cost per query (is the reasoning loop economically viable?).
“You cannot evaluate an agentic RAG system by evaluating retrieval and generation separately. The agent’s decisions — when to retrieve, when to reformulate, when to stop — are the system. Evaluate the loop, not the components.”
What the Next 12 Months Look Like
Three developments will reshape the agentic RAG landscape by March 2027:
Longer Context Windows Do Not Kill RAG
Every time a model vendor announces a longer context window, someone declares RAG dead. It is not. Filling a 1-million-token context window at current frontier model pricing is expensive, and doing it for every query is economically insane for any system handling more than a few hundred queries per day. RAG retrieves the relevant subset of a corpus rather than feeding all of it into the model. The economic argument for RAG gets stronger as corpora grow, not weaker.
What longer context windows do change is the generation stage. With more room in the context window, agentic RAG systems can retrieve more aggressively — pulling 20–50 chunks instead of 5 — and let the model sort through them during generation. This shifts the bottleneck from retrieval precision (getting exactly the right 5 documents) to retrieval recall (getting all potentially relevant documents into the context).
MCP as the Agent-to-Retrieval Interface
The Model Context Protocol is becoming the standard interface between AI agents and data sources. In an agentic RAG architecture, each knowledge source — vector database, graph database, SQL database, API endpoint — can be exposed as an MCP server with typed tool definitions. The agent discovers available retrieval tools via MCP, selects the appropriate one based on the query, and invokes it with structured parameters.
This standardisation means agentic RAG systems no longer need custom integration code for each data source. A LangGraph agent that queries Pinecone today can query a Neo4j knowledge graph tomorrow by connecting to a different MCP server, without changing the agent’s code. The A-RAG framework already implements this pattern, exposing keyword, semantic, and chunk-level retrieval as MCP-compatible tools.
Hybrid Retrieval Becomes Default
Pure vector search is giving way to hybrid approaches that combine dense embeddings with sparse retrieval (BM25 or SPLADE). Models like BGE-M3 generate both dense and sparse representations in a single forward pass, enabling hybrid retrieval without maintaining separate indices. The Higress-RAG framework demonstrated that dual hybrid retrieval achieves over 90% recall on enterprise datasets — significantly higher than either dense or sparse retrieval alone.
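A common way to combine dense and sparse result lists without tuning score weights is reciprocal rank fusion (RRF), which scores documents by rank position only. A sketch (the constant `k=60` is the value commonly used in the RRF literature, not anything mandated by the frameworks above):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each document scores sum(1 / (k + rank))
    across every ranking that contains it, then sort by fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both the dense and the sparse ranking rise to the top, which is exactly the behaviour that cuts down corrective iterations in an agentic loop.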
For agentic RAG, hybrid retrieval reduces the number of iterations needed. When the first retrieval is more accurate, the agent needs fewer corrective loops, which reduces latency and cost. Expect hybrid retrieval to be the default configuration in production RAG systems within the year.
Where Fordel Builds
Every RAG system we deploy at Fordel starts with measurement. We instrument retrieval precision, hallucination rate, and user satisfaction before deciding whether agentic patterns are warranted. When they are, we implement the minimum viable loop: Corrective RAG with a lightweight evaluator, explicit iteration budgets, and cost monitoring from day one. We escalate to full agentic retrieval with query decomposition and multi-source access only when the data justifies the complexity.
The 17% hallucination rate that Shopify discovered is not unusual. We have seen similar numbers in enterprise deployments across healthcare, legal, and financial services — domains where a wrong answer is not just a bad experience but a liability. If your RAG pipeline is guessing and you do not know how often it guesses wrong, that is the first problem to solve. We can show you exactly where your retrieval breaks and what architecture fixes it. No pitch deck. If that conversation is useful, reach out.