In March 2025, Shopify engineers published an internal post-mortem on their documentation search system. The system used a standard RAG pipeline: embed the query, retrieve the top five chunks from a vector database, pass them to GPT-4, generate an answer. It worked in demos. In production, it hallucinated answers in roughly 17% of cases. Not because the model was bad. Because the retrieval was wrong, and the system had no way to know.
The chunks came back in the wrong order. Relevant information was split across document boundaries. The embedding model lost precision on Shopify-specific terminology like “metafield” and “theme block.” The system retrieved confidently, generated confidently, and was wrong 17% of the time with no mechanism to detect or correct it.
This is the fundamental problem with traditional RAG. It is a one-shot pipeline: retrieve, generate, done. There is no reasoning about whether the retrieved documents actually answer the question. No ability to reformulate the query if the first retrieval misses. No self-correction. In 2024, this was state of the art. In 2026, it is a known architectural deficiency, and the industry is converging on a replacement: agentic RAG.
What Traditional RAG Actually Does
A traditional RAG pipeline has three stages: indexing, retrieval, and generation. During indexing, documents are split into chunks, each chunk is embedded into a vector, and the vectors are stored in a database. During retrieval, the user’s query is embedded using the same model, a similarity search finds the top-k nearest vectors, and the corresponding chunks are returned. During generation, the retrieved chunks are concatenated into the LLM’s context window alongside the query, and the model generates an answer.
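The three stages can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, and `generate` stubs out the LLM call.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline calls an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def index_corpus(docs: list[str]) -> list[tuple[str, Counter]]:
    # Stage 1: indexing. Chunk (here: one chunk per doc), embed, store.
    return [(d, embed(d)) for d in docs]

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 5) -> list[str]:
    # Stage 2: retrieval. Embed the query, take the top-k nearest chunks.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def generate(query: str, context: list[str]) -> str:
    # Stage 3: generation. In production this is an LLM call; stubbed here.
    return f"Answer to {query!r} grounded in {len(context)} chunks."
```

Note what is missing: nothing checks whether the chunks `retrieve` returns are actually relevant before `generate` runs. That gap is the whole story of this article.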
The architecture assumes that the retrieval step will return the right documents. This assumption breaks in predictable ways:
- Semantic gap: the query embedding and document embeddings occupy different semantic spaces, especially for domain-specific terminology that the general-purpose embedding model was never trained on.
- Chunking boundary errors: the answer spans two chunks, but only one is retrieved. The model generates a partial or fabricated answer from incomplete context.
- Ranking failures: the relevant document is in the top-20 results but not the top-5. Standard top-k retrieval misses it entirely.
- Query ambiguity: the user’s question requires information from multiple sources, but the single embedding captures only one facet of the query.
- Temporal drift: the embedding index was built months ago. New documents use different terminology. The embeddings no longer represent the current corpus accurately.
Traditional RAG has no feedback loop. If the retrieval returns irrelevant documents, the model does not know. It generates from whatever context it receives. If the generation hallucinates, there is no verification step. The system returns the answer with the same confidence regardless of whether the underlying retrieval was accurate.
“The fundamental flaw in traditional RAG is not the retrieval algorithm or the generation model. It is the absence of a reasoning layer between retrieval and generation — a layer that asks: did I get what I needed?”
Agentic RAG: Retrieval as a Control Loop
Agentic RAG replaces the linear retrieve-generate pipeline with an autonomous control loop. An AI agent orchestrates the retrieval process, making decisions at each step: what to retrieve, from which sources, whether the retrieval was sufficient, and whether to iterate.
The core loop follows the ReAct (Reason and Act) pattern. The agent receives a query, reasons about what information it needs, takes a retrieval action, observes the results, evaluates whether the retrieved context is sufficient to answer the question, and decides whether to continue retrieving or generate a final answer. This loop can execute multiple times per query.
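The loop structure can be sketched as follows. The `retrieve`, `is_sufficient`, and `reformulate` callables are placeholders for the agent's tools and reasoning steps (in a real system, the latter two are LLM calls); the names are mine, not from any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentState:
    query: str
    context: list[str] = field(default_factory=list)
    iterations: int = 0

def react_loop(
    query: str,
    retrieve: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    reformulate: Callable[[str, list[str]], str],
    max_iterations: int = 3,
) -> AgentState:
    """Reason, act (retrieve), observe, evaluate; repeat until the context
    is judged sufficient or the iteration budget runs out."""
    state = AgentState(query=query)
    current_query = query
    while state.iterations < max_iterations:
        state.iterations += 1
        docs = retrieve(current_query)           # act
        state.context.extend(docs)               # observe
        if is_sufficient(query, state.context):  # evaluate
            break
        current_query = reformulate(query, state.context)  # reason again
    return state
```

The `max_iterations` parameter is not an implementation detail: as the production section below argues, an unbounded version of this loop is a cost liability.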
The key architectural difference is agency over the retrieval process:
| Capability | Traditional RAG | Agentic RAG |
|---|---|---|
| Query handling | Single embedding, single retrieval | Query decomposition into sub-queries, multiple targeted retrievals |
| Source selection | One vector database | Multiple knowledge bases, APIs, databases, web search |
| Retrieval evaluation | None — trusts the retrieval blindly | Agent evaluates relevance and completeness after each retrieval |
| Self-correction | None | Reformulates queries when retrieval quality is low, retries with different strategies |
| Reasoning | No intermediate reasoning | Chain-of-thought reasoning between retrieval steps to plan next actions |
| Output verification | None | Agent verifies generated answer against source documents before returning |
Concretely, when a user asks an agentic RAG system a complex question like “How did our Q4 revenue compare to our top three competitors, and what market factors drove the differences?” the agent decomposes this into sub-tasks: retrieve internal Q4 revenue data, identify the top three competitors, retrieve their public financial data, retrieve market analysis from relevant timeframes, and synthesise the comparison. A traditional RAG pipeline would embed the entire question as a single vector and retrieve whatever five chunks happened to be nearest — almost certainly missing most of the required information.
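The decomposition step can be sketched like this. In production the sub-queries come from an LLM; here `decompose` is stubbed with a hand-written decomposition of the revenue example above, and the function names are illustrative assumptions.

```python
from typing import Callable

def decompose(query: str) -> list[str]:
    # Stub: in production an LLM proposes sub-queries for the given question.
    return [
        "internal Q4 revenue",
        "top three competitors",
        "competitor Q4 financials",
        "Q4 market analysis",
    ]

def multi_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    decomposer: Callable[[str], list[str]] = decompose,
) -> list[str]:
    """Retrieve per sub-query and merge results in order, dropping
    duplicates, so each facet of the question gets its own search."""
    seen: set[str] = set()
    merged: list[str] = []
    for sub in decomposer(query):
        for doc in retrieve(sub):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```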
The Four Patterns That Matter
The agentic RAG space has coalesced around four distinct architectural patterns, each addressing a specific failure mode in traditional RAG:
Corrective RAG: The Retrieval Evaluator
Corrective RAG (CRAG), introduced by Yan et al. in 2024, adds a lightweight retrieval evaluator between the retrieval and generation stages. After the initial retrieval, a classifier scores each retrieved document as Correct, Ambiguous, or Incorrect, and the score triggers one of three corrective paths. If documents score Correct, generation proceeds normally. If Ambiguous, a knowledge refinement step extracts and re-ranks relevant information. If Incorrect, the system falls back to web search or alternative retrieval sources.
The evaluator is typically a small, fine-tuned model — not the primary LLM — which keeps the latency overhead under 100 milliseconds per evaluation. This pattern is particularly effective for factual question answering where the retrieval corpus may be incomplete or outdated.
Self-RAG: Reflection Tokens
Self-RAG, introduced by Asai et al. in 2023 and refined through 2025, takes a different approach: rather than adding an external evaluator, it trains the generation model to emit special reflection tokens that evaluate its own retrieval and generation quality. The model generates tokens like [IsRel] (is this document relevant?), [IsSup] (does the document support the claim?), and [IsUse] (is the response useful?) alongside its regular output.
This self-evaluation happens at generation time with minimal latency overhead. The trade-off is that it requires fine-tuning the LLM to emit these reflection tokens, which locks you into specific model versions and makes model upgrades more complex. Self-RAG achieves strong results on benchmarks but the fine-tuning requirement limits its adoption in production systems that need model flexibility.
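On the consuming side, a system using a Self-RAG-style model needs to parse the reflection tokens out of the generation and decide whether to accept the answer. A minimal sketch, assuming the tokens surface as bracketed text (the real model emits them as special vocabulary items, so the exact surface form depends on your serving stack):

```python
import re

# Assumed surface form of the reflection tokens, e.g. "[IsRel:yes]".
REFLECTION = re.compile(r"\[(IsRel|IsSup|IsUse):(yes|no|partial)\]")

def parse_reflections(generation: str) -> dict[str, str]:
    """Strip reflection tokens from a generation; return the clean text
    plus the model's self-judgements as a dict."""
    tags = {m.group(1): m.group(2) for m in REFLECTION.finditer(generation)}
    text = REFLECTION.sub("", generation).strip()
    return {"text": text, **tags}

def accept(generation: str) -> bool:
    # Keep only answers the model itself judged relevant and supported.
    r = parse_reflections(generation)
    return r.get("IsRel") == "yes" and r.get("IsSup") == "yes"
```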
Adaptive RAG: Query Routing
Adaptive RAG, formalised by Jeong et al. in 2024, addresses the problem of query complexity. Not every question needs the same retrieval strategy. A simple factual lookup (“What is our refund policy?”) can be answered with basic retrieval. A multi-hop reasoning question (“Which products had the highest return rate among customers who also purchased accessories?”) requires iterative multi-step retrieval.
Adaptive RAG uses a trained classifier — typically a T5-large model — to route queries into three tiers. Tier A (simple) routes to basic single-retrieval RAG. Tier B (moderate) routes to multi-step retrieval with chain-of-thought reasoning. Tier C (complex) routes to full agentic retrieval with query decomposition, multi-source retrieval, and iterative refinement. By not over-engineering simple queries, Adaptive RAG reduces average latency and cost while maintaining quality on complex questions.
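The routing skeleton looks like this. The heuristic classifier below is a crude stand-in for the trained model (query length plus multi-hop marker words, both my own assumptions), but the tier dispatch is the essence of the pattern.

```python
import re
from typing import Callable

# Assumed multi-hop marker words; a real system trains a classifier instead.
MULTI_HOP_MARKERS = {"compare", "compared", "among", "which", "between"}

def classify_complexity(query: str) -> str:
    words = re.findall(r"\w+", query.lower())
    if MULTI_HOP_MARKERS & set(words):
        return "C"  # complex: full agentic retrieval
    if len(words) <= 6:
        return "A"  # simple: single retrieval
    return "B"      # moderate: multi-step with chain-of-thought

def route(
    query: str,
    simple_rag: Callable[[str], str],
    multistep_rag: Callable[[str], str],
    agentic_rag: Callable[[str], str],
) -> tuple[str, str]:
    """Dispatch the query to the handler for its complexity tier."""
    tier = classify_complexity(query)
    handler = {"A": simple_rag, "B": multistep_rag, "C": agentic_rag}[tier]
    return tier, handler(query)
```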
GraphRAG: Retrieval Over Relationships
Microsoft Research introduced GraphRAG in early 2024, and by March 2026 it has reached production maturity. Instead of retrieving individual text chunks based on vector similarity, GraphRAG first constructs a knowledge graph from the document corpus — entities, relationships, and community structures — and then retrieves over the graph at query time.
The approach works in two modes. Local search retrieves specific entities and their immediate neighbourhood in the graph, similar to how a traditional RAG system retrieves relevant chunks but with relationship context preserved. Global search uses community summaries — pre-computed summaries of clustered entity groups — to answer questions that require understanding themes and patterns across the entire corpus.
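Local search is, at its core, a bounded neighbourhood expansion over the entity graph. A minimal sketch over a plain adjacency dict (the real GraphRAG implementation also attaches text units and relationship descriptions to each entity):

```python
from collections import deque

def local_search(graph: dict[str, list[str]], entity: str, hops: int = 1) -> set[str]:
    """Collect an entity's neighbourhood out to `hops` edges, so
    relationship context travels with the retrieved entity."""
    frontier = deque([(entity, 0)])
    seen = {entity}
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen
```

Global search is different in kind: it reads pre-computed community summaries rather than walking edges at query time, which is why it scales to corpus-wide thematic questions.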
Microsoft’s benchmarks show GraphRAG achieving 80% accuracy compared to 50% for traditional vector retrieval on questions requiring cross-document reasoning — a 1.6x improvement that makes it compelling for enterprise use cases where documents reference each other extensively. The trade-off is indexing cost: building the knowledge graph requires LLM calls to extract entities and relationships from every document, which can be expensive for large corpora.
“GraphRAG does not replace vector search. It adds a retrieval dimension that vector search cannot provide: the ability to answer questions about relationships between concepts, not just questions about individual concepts.”
The Production Reality: Where Agentic RAG Breaks
The patterns above work elegantly in research papers and conference demos. Production exposes their failure modes. Teams deploying agentic RAG in 2026 are discovering that the control loop introduces its own category of problems.
Runaway Retrieval Loops
An agentic RAG system that cannot find a satisfactory answer will keep retrieving. Without explicit loop limits, a single query can trigger dozens of retrieval calls, each incurring embedding computation, vector search, and LLM reasoning costs. A query that costs $0.02 in a traditional RAG pipeline can cost $0.50 or more in an unbounded agentic loop. At scale, this is the difference between a viable product and a financial liability.
The solution is straightforward but often missed: set a maximum iteration count (typically 3–5 loops), implement a total token budget per query, and define explicit “I don’t know” exit conditions. If the agent cannot ground an answer within the budget, it should say so rather than continuing to search.
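Those three guards fit in a small wrapper around the loop. A sketch, with the `step` callable standing in for one reason-plus-retrieve iteration (it returns a grounded answer or `None`, plus the tokens it consumed):

```python
from typing import Callable, Optional

def bounded_agentic_answer(
    query: str,
    step: Callable[[str], tuple[Optional[str], int]],
    max_iterations: int = 4,
    token_budget: int = 8000,
) -> dict:
    """Run the retrieval loop under hard limits and exit with an explicit
    'I don't know' signal instead of searching forever."""
    spent = 0
    for _ in range(max_iterations):
        answer, tokens = step(query)
        spent += tokens
        if answer is not None:
            return {"answer": answer, "grounded": True, "tokens": spent}
        if spent >= token_budget:
            break  # token budget exhausted before grounding
    # Explicit uncertainty signal: never fabricate past the budget.
    return {"answer": None, "grounded": False, "tokens": spent}
```

Callers can then render `grounded: False` as an honest "I don't know" rather than a low-confidence guess.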
Latency Multiplication
Each iteration in the agentic loop adds latency. A single retrieval takes 50–200ms. An LLM reasoning step takes 500ms–2 seconds. Three iterations means 1.5–6 seconds per query, compared to the 200–500ms of a traditional RAG pipeline. For customer-facing applications where response time matters, this latency is often unacceptable.
Adaptive RAG partially addresses this by routing simple queries through a fast single-retrieval path. But for complex queries that genuinely require multiple retrieval steps, the latency is inherent. Teams must decide: do you optimise for answer quality (more iterations) or response time (fewer iterations)? The answer depends on the use case, and most production systems implement tiered latency budgets — 200ms for autocomplete, 2 seconds for search, 10 seconds for complex analysis.
Embedding Drift
One of the least discussed reliability problems in production RAG is embedding drift. Your vector index was built using a specific embedding model at a specific point in time. New documents added later may use updated terminology. If you update the embedding model, existing vectors no longer align with new ones. One team reported that their hallucination rate jumped 15% after re-chunking their legal policy corpus — the new chunk boundaries split critical clauses across vectors in ways the original chunking did not.
In agentic RAG, embedding drift compounds because the agent may issue multiple queries with different phrasings, and each query interacts with the drift differently. Monitoring embedding quality over time — tracking retrieval precision, re-ranking effectiveness, and hallucination rates per index version — is not optional. It is the difference between a system that degrades silently and one that alerts you before users notice.
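The per-index-version monitoring can start very simply: keep a precision sample per index version and alert when the current version falls meaningfully below the baseline. A sketch (the function name and tolerance value are mine):

```python
from statistics import mean

def drift_alert(
    precision_history: dict[str, list[float]],
    baseline_version: str,
    current_version: str,
    tolerance: float = 0.05,
) -> bool:
    """Compare mean retrieval precision per index version; flag when the
    current index has degraded beyond tolerance relative to the baseline."""
    baseline = mean(precision_history[baseline_version])
    current = mean(precision_history[current_version])
    return (baseline - current) > tolerance
```

The same shape works for re-ranking effectiveness and hallucination rate; the point is that every metric is keyed by index version, so a re-chunk or model swap shows up as a step change rather than a mystery.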
The Framework Landscape
Three frameworks dominate agentic RAG implementation in 2026. Each optimises for a different developer mental model.
| Framework | Approach | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| LangGraph (LangChain) | Graph-based state machines with nodes and edges defining agent workflows | Fine-grained control over agent flow, built-in persistence and checkpointing, strong community | Frequent breaking changes between versions, steep learning curve, custom syntax with operator overloading | Teams needing precise control over multi-step agent workflows |
| LlamaIndex Workflows | Event-driven async pipelines with Pythonic abstractions | Simpler API, faster time-to-production, deep integration with data connectors and index types | Less control over complex orchestration patterns, smaller ecosystem than LangChain | Smaller teams prioritising speed to production |
| Microsoft GraphRAG | Knowledge graph construction and graph-based retrieval | Unmatched cross-document reasoning, open-source, strong enterprise backing | Expensive indexing (requires LLM calls per document), limited to graph-based retrieval patterns | Enterprise knowledge bases with high relationship density between documents |
A common production pattern is combining frameworks: use Adaptive RAG routing at the entry point, LangGraph or LlamaIndex for the agentic retrieval loop, and GraphRAG as one of the retrieval sources alongside traditional vector search. The A-RAG framework (February 2026) formalised this by exposing keyword search, semantic search, and chunk-level retrieval as separate tools the agent can choose between, improving QA accuracy by 5–13% over flat retrieval.
Building an Agentic RAG System: The Decision Framework
Not every RAG system needs to be agentic. The additional complexity — more LLM calls, higher latency, more failure modes — is only justified when the one-shot pipeline demonstrably fails. Here is how to decide:
When to Upgrade from Traditional to Agentic RAG
1. Measure the baseline. Deploy a traditional RAG pipeline first. Measure hallucination rate, retrieval precision, and user satisfaction. If your hallucination rate is below 5% and users are satisfied, agentic RAG adds cost and complexity without proportional benefit. If your rate is above 10%, the control loop will meaningfully improve quality.
2. Diagnose the failure mode. Is retrieval returning irrelevant documents (a ranking problem)? Is the answer split across documents (a chunking problem)? Does the query require information from multiple sources (a decomposition problem)? Each failure mode points to a different agentic pattern: re-ranking for the first, Corrective RAG for the second, Adaptive RAG with query decomposition for the third.
3. Start with the smallest intervention. Add a lightweight retrieval evaluator between retrieval and generation. This single change — scoring retrieved documents before passing them to the LLM — catches the majority of retrieval failures with minimal latency overhead. Only escalate to full agentic retrieval with query decomposition and multi-source retrieval when the evaluator consistently flags insufficient context.
4. Set budgets before you scale. Define a maximum iteration count (3–5), a total token budget per query, and a latency ceiling. Build your agentic loop with these constraints from day one, not retrofitted after the first invoice. The agent should have a clear exit condition: if it cannot ground an answer within budget, it returns an explicit uncertainty signal rather than generating a low-confidence response.
5. Monitor continuously. Track retrieval precision per query type, embedding quality per index version, hallucination rate over time, and cost per query. Agentic RAG systems are dynamic — their behaviour changes as the corpus evolves, as embedding models are updated, and as query patterns shift. Without continuous monitoring, quality degrades silently.
Evaluation: How to Know If It Works
RAG evaluation in 2026 has standardised around three dimensions, each requiring different metrics:
| Dimension | Metrics | Tools | What It Catches |
|---|---|---|---|
| Retrieval Quality | Precision@k, Recall@k, MRR (Mean Reciprocal Rank), NDCG | RAGAS, Maxim AI, custom evaluation harnesses | Wrong documents retrieved, relevant documents ranked too low, missing context |
| Generation Quality | Faithfulness (answer grounded in context), Answer Relevance, Hallucination Rate | RAGAS, Braintrust, DeepEval, LLM-as-Judge | Hallucinated claims, answers that ignore retrieved context, off-topic responses |
| End-to-End Quality | Answer Correctness, User Satisfaction, Task Completion Rate | Human evaluation, A/B testing, implicit feedback signals | System works component-by-component but fails holistically |
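The retrieval-quality metrics in the table are simple to compute yourself against a labelled set of relevant documents. A minimal sketch (function names are mine, not from any particular evaluation library):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant
    document per query, counting 0 when none was retrieved."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```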
For agentic RAG specifically, you need additional metrics that traditional RAG evaluation does not cover: average iterations per query (are your agents looping excessively?), retrieval evaluator accuracy (is the Corrective RAG classifier making good decisions?), query decomposition quality (are sub-queries meaningful?), and cost per query (is the reasoning loop economically viable?).
“You cannot evaluate an agentic RAG system by evaluating retrieval and generation separately. The agent’s decisions — when to retrieve, when to reformulate, when to stop — are the system. Evaluate the loop, not the components.”
What the Next 12 Months Look Like
Three developments will reshape the agentic RAG landscape by March 2027:
Longer Context Windows Do Not Kill RAG
Every time a model vendor announces a longer context window, someone declares RAG dead. It is not. Filling a 1-million-token context window at current frontier model pricing is expensive, and doing it for every query is economically insane for any system handling more than a few hundred queries per day. RAG retrieves the relevant subset of a corpus rather than feeding all of it into the model. The economic argument for RAG gets stronger as corpora grow, not weaker.
What longer context windows do change is the generation stage. With more room in the context window, agentic RAG systems can retrieve more aggressively — pulling 20–50 chunks instead of 5 — and let the model sort through them during generation. This shifts the bottleneck from retrieval precision (getting exactly the right 5 documents) to retrieval recall (getting all potentially relevant documents into the context).
MCP as the Agent-to-Retrieval Interface
The Model Context Protocol is becoming the standard interface between AI agents and data sources. In an agentic RAG architecture, each knowledge source — vector database, graph database, SQL database, API endpoint — can be exposed as an MCP server with typed tool definitions. The agent discovers available retrieval tools via MCP, selects the appropriate one based on the query, and invokes it with structured parameters.
This standardisation means agentic RAG systems no longer need custom integration code for each data source. A LangGraph agent that queries Pinecone today can query a Neo4j knowledge graph tomorrow by connecting to a different MCP server, without changing the agent’s code. The A-RAG framework already implements this pattern, exposing keyword, semantic, and chunk-level retrieval as MCP-compatible tools.
Hybrid Retrieval Becomes Default
Pure vector search is giving way to hybrid approaches that combine dense embeddings with sparse retrieval (BM25 or SPLADE). Models like BGE-M3 generate both dense and sparse representations in a single forward pass, enabling hybrid retrieval without maintaining separate indices. The Higress-RAG framework demonstrated that dual hybrid retrieval achieves over 90% recall on enterprise datasets — significantly higher than either dense or sparse retrieval alone.
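A common way to combine dense and sparse result lists without tuning score weights is reciprocal rank fusion (RRF), which scores documents by rank position only. A sketch (the constant `k=60` is the value commonly used in the RRF literature, not anything mandated by the frameworks above):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each document scores sum(1 / (k + rank))
    across every ranking that contains it, then sort by fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both the dense and the sparse ranking rise to the top, which is exactly the behaviour that cuts down corrective iterations in an agentic loop.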
For agentic RAG, hybrid retrieval reduces the number of iterations needed. When the first retrieval is more accurate, the agent needs fewer corrective loops, which reduces latency and cost. Expect hybrid retrieval to be the default configuration in production RAG systems within the year.
Where Fordel Builds
Every RAG system we deploy at Fordel starts with measurement. We instrument retrieval precision, hallucination rate, and user satisfaction before deciding whether agentic patterns are warranted. When they are, we implement the minimum viable loop: Corrective RAG with a lightweight evaluator, explicit iteration budgets, and cost monitoring from day one. We escalate to full agentic retrieval with query decomposition and multi-source access only when the data justifies the complexity.
The 17% hallucination rate that Shopify discovered is not unusual. We have seen similar numbers in enterprise deployments across healthcare, legal, and financial services — domains where a wrong answer is not just a bad experience but a liability. If your RAG pipeline is guessing and you do not know how often it guesses wrong, that is the first problem to solve. We can show you exactly where your retrieval breaks and what architecture fixes it. No pitch deck. If that conversation is useful, reach out.