
We Built a Multi-Tenant AI Pipeline and Here's What Actually Happened

A Series B SaaS client wanted AI features for all 340 tenants. We built the pipeline, shipped it, and then watched it melt down in ways nobody predicted. This is the honest story of what went wrong and how we rebuilt it.

Abhishek Sharma · Head of Engineering @ Fordel Studios

This happened in late 2025. I am changing client details, but the engineering is exactly as it happened.

How did we end up building an AI pipeline for 340 tenants?

The client was a Series B SaaS platform in the HR tech space — think applicant tracking with workflow automation. About 340 paying tenants, ranging from 5-person startups to a 2,000-employee enterprise. They had just closed their B round and the board wanted AI features. Specifically: AI-generated job descriptions, candidate summarization, and an interview question generator.

Straightforward, right? Three features, all text generation, all using the same model. We scoped it at six weeks. The architecture was clean: a shared inference service behind their existing API gateway, tenant context injected via system prompts, responses cached by tenant and input hash. We chose Claude as the primary model with GPT-4o as fallback.

We even built a cost allocation layer. Every API call tagged with tenant ID, model used, tokens consumed. We had dashboards. We had alerts. We felt good about this.

340 · Active tenants at launch (mix of SMB to enterprise)
6 weeks · Original scope estimate (came in at 11)
72 hrs · Time to first major incident (after production launch)
···

What did the architecture look like on day one?

The initial design was a shared inference service — a single Node.js service sitting behind their API gateway. Every tenant hit the same endpoint. The service resolved the tenant from the JWT, loaded their configuration (which features were enabled, any custom prompt overrides), and made the model call.
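For concreteness, here is a minimal sketch of that request path. The helper names (`loadTenantConfig`, `callModel`, `defaultPrompt`) are illustrative stand-ins, not the client's actual code:

```typescript
// Sketch of the day-one shared inference endpoint. Helper names are illustrative.
import express from "express";
import jwt from "jsonwebtoken";

interface TenantConfig {
  enabledFeatures: string[];
  promptOverrides: Record<string, string | undefined>;
}

// Stand-ins for the real config store and model client.
declare function loadTenantConfig(tenantId: string): Promise<TenantConfig>;
declare function defaultPrompt(feature: string): string;
declare function callModel(req: { system: string; input: string }): Promise<string>;

const app = express();
app.use(express.json());

app.post("/ai/:feature", async (req, res) => {
  // Resolve the tenant from the JWT; every tenant hits this same endpoint.
  const token = (req.headers.authorization ?? "").replace(/^Bearer /, "");
  const { tenantId } = jwt.verify(token, process.env.JWT_SECRET!) as { tenantId: string };

  // Load per-tenant configuration: enabled features and custom prompt overrides.
  const config = await loadTenantConfig(tenantId);
  if (!config.enabledFeatures.includes(req.params.feature)) {
    return res.status(403).json({ error: "feature not enabled" });
  }

  // Tenant context is injected via the system prompt, then one shared model call.
  const system = config.promptOverrides[req.params.feature] ?? defaultPrompt(req.params.feature);
  const output = await callModel({ system, input: req.body.input });
  res.json({ output });
});
```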

We had a Redis-backed cache layer. Same tenant, same input, same model — serve from cache. Cache hit rates were around 60% in staging because HR teams tend to write similar job descriptions. We had per-tenant rate limits enforced at the gateway level: 100 requests per minute for standard plans, 500 for enterprise.
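The caching scheme is simple to sketch. Assuming `ioredis` and a one-hour TTL (the TTL is my assumption, not a detail from the original design):

```typescript
import { createHash } from "node:crypto";
import Redis from "ioredis";

const redis = new Redis();

// Same tenant, same model, same normalized input => same cache key.
function cacheKey(tenantId: string, model: string, input: string): string {
  const hash = createHash("sha256").update(input.trim()).digest("hex");
  return `ai-cache:${tenantId}:${model}:${hash}`;
}

async function cachedCompletion(
  tenantId: string,
  model: string,
  input: string,
  generate: () => Promise<string>,
): Promise<{ output: string; cacheHit: boolean }> {
  const key = cacheKey(tenantId, model, input);
  const hit = await redis.get(key);
  if (hit !== null) return { output: hit, cacheHit: true };

  const output = await generate();
  await redis.set(key, output, "EX", 3600); // assumed one-hour TTL
  return { output, cacheHit: false };
}
```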

Cost tracking was a Postgres table. Every inference call wrote a row: tenant ID, feature, model, prompt tokens, completion tokens, latency, cache hit boolean. We had a nightly aggregation job that rolled this up into per-tenant cost reports. The client loved the dashboards.
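That table maps to a straightforward insert. Column names below are inferred from the fields listed above; the schema itself is a sketch, not the client's actual DDL:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// One row per inference call; the nightly job aggregates these into per-tenant reports.
async function recordInference(row: {
  tenantId: string;
  feature: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  cacheHit: boolean;
}): Promise<void> {
  await pool.query(
    `INSERT INTO inference_usage
       (tenant_id, feature, model, prompt_tokens, completion_tokens, latency_ms, cache_hit, created_at)
     VALUES ($1, $2, $3, $4, $5, $6, $7, now())`,
    [row.tenantId, row.feature, row.model, row.promptTokens,
     row.completionTokens, row.latencyMs, row.cacheHit],
  );
}
```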

On paper, this was solid. In practice, we had made three assumptions that were all wrong.

···

What went wrong in the first 72 hours?

We launched on a Wednesday. By Friday afternoon, I got a call from the client's VP of Engineering. The platform was unusable. Not just the AI features — the entire platform was degraded.

Here is what happened, in order.

The noisy neighbor problem nobody modeled

Three enterprise tenants discovered the AI features on launch day and went all in. One of them had built an internal tool that called the candidate summarization endpoint in a loop — processing their entire backlog of 12,000 candidates in one batch. Another had a Chrome extension that auto-generated job descriptions on every form focus event. The third was running what appeared to be a benchmarking script.

Within 24 hours, these three tenants had consumed 89% of the total token budget we had allocated for the entire month. Our per-tenant rate limits were set at the request level, not the token level. A single request that generated a 2,000-token candidate summary counted the same as a request that generated a 50-token job title suggestion.

We had rate limits on requests. We did not have rate limits on tokens. In the LLM world, that is like metering water by the number of times you turn the faucet, not by how much flows through.
Post-incident review notes

The cascading failure

The shared inference service was running on three containers behind an ALB. The three heavy tenants saturated all three containers with long-running inference calls. Claude API calls for 12,000-character candidate profiles were taking 8-12 seconds each. The containers hit their connection limits. New requests from all 337 other tenants started queuing.

The API gateway had a 30-second timeout. Queued requests started timing out. The client's frontend retried on timeout — with no exponential backoff, because nobody had anticipated the AI service being the bottleneck. Every timeout generated two more requests. The queue grew exponentially.
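For contrast, the standard client-side pattern that was missing here is exponential backoff with jitter. A minimal sketch (parameters are illustrative):

```typescript
// Retry with exponential backoff and full jitter, capped at 30 seconds.
// This is the pattern the frontend lacked; attempt counts are illustrative.
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      const delayMs = Math.random() * Math.min(30_000, 1_000 * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```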

The gateway's connection pool to the inference service was exhausted, and the pressure spilled over into the gateway's connection pools for other services. The user authentication service, which shared the same gateway, started failing health checks. The load balancer marked the failing auth instance as unhealthy and routed auth traffic to the remaining two instances, which then also fell over.

By Friday at 3 PM IST, new users could not log in. Existing sessions were fine, but any API call that needed token refresh failed. The AI features had taken down authentication.

89% · Token budget consumed by 3 tenants (in the first 24 hours)
337 · Tenants affected by the cascading failure (out of 340 total)
4.5 hrs · Total platform degradation (auth + AI features down)
···

How did we stop the bleeding?

The immediate fix took about 90 minutes. We killed the inference service entirely — took it offline. This immediately freed the gateway's connection pool and auth recovered within minutes. We sent an in-app banner: "AI features temporarily unavailable for maintenance."

Then we had the uncomfortable conversation with the client. The three enterprise tenants were their biggest accounts. We could not just throttle them without a business conversation. The tenant running the 12,000-candidate batch was their single largest customer by ARR.

We spent the weekend on a triage fix — not the right fix, the fast fix. We needed AI features back online by Monday without the same failure mode.

The weekend triage: what we shipped in 48 hours

First, we split the inference service into two deployment groups: one for enterprise tenants (the top 15 by plan tier) and one for everyone else. Separate containers, separate connection pools, separate gateway routes. If enterprise tenants melted their pool, standard tenants would be unaffected.

Second, we added token-based rate limiting. Not just request counts — actual token budgets per tenant per day. We used a sliding window in Redis. When a tenant hit 80% of their daily token budget, we started returning 429s with a Retry-After header and a human-readable message explaining their usage.
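A minimal sketch of that sliding window, assuming `ioredis` and a sorted set per tenant scored by timestamp. Key names and the Retry-After value are illustrative, not the production values:

```typescript
import Redis from "ioredis";
import type { Request, Response, NextFunction } from "express";

const redis = new Redis();
const WINDOW_MS = 24 * 60 * 60 * 1000; // trailing 24-hour window

// Record a completed call's token count in the tenant's sorted set.
async function recordTokens(tenantId: string, tokens: number): Promise<void> {
  const now = Date.now();
  // The member encodes the count; the random suffix keeps members unique.
  await redis.zadd(`tokens:${tenantId}`, now, `${now}.${Math.random()}|${tokens}`);
}

// Sum tokens consumed inside the window, trimming expired entries as we go.
async function tokensUsed(tenantId: string): Promise<number> {
  const key = `tokens:${tenantId}`;
  await redis.zremrangebyscore(key, 0, Date.now() - WINDOW_MS);
  const members = await redis.zrange(key, 0, -1);
  return members.reduce((sum, m) => sum + Number(m.split("|")[1]), 0);
}

// Per the triage fix: once a tenant crosses 80% of their daily budget,
// return 429 with Retry-After and a human-readable explanation.
function tokenBudgetGuard(dailyBudget: number) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const tenantId = (req as any).tenantId as string; // set by earlier auth middleware (assumed)
    const used = await tokensUsed(tenantId);
    if (used >= dailyBudget * 0.8) {
      res
        .status(429)
        .set("Retry-After", "3600") // illustrative hint: try again in an hour
        .json({
          error: `You have used ${used} of your ${dailyBudget} daily tokens. ` +
                 `Requests resume as older usage ages out of the 24-hour window.`,
        });
      return;
    }
    next();
  };
}
```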

Third, we added circuit breakers between the gateway and the inference service. If the inference service error rate exceeded 20% over a 30-second window, the circuit opened and all AI requests got a graceful degradation response: cached results if available, a "temporarily unavailable" message if not.
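The breaker itself is small. A sketch with the thresholds from above (20% errors over 30 seconds); the cooldown period is my own placeholder:

```typescript
// Minimal circuit breaker: trips when error rate exceeds 20% over a 30s window.
class CircuitBreaker {
  private outcomes: { at: number; ok: boolean }[] = [];
  private openedAt = 0;

  constructor(
    private windowMs = 30_000,
    private maxErrorRate = 0.2,
    private cooldownMs = 15_000, // placeholder: how long the circuit stays open
  ) {}

  private errorRate(now: number): number {
    this.outcomes = this.outcomes.filter((o) => now - o.at <= this.windowMs);
    const errors = this.outcomes.filter((o) => !o.ok).length;
    return this.outcomes.length === 0 ? 0 : errors / this.outcomes.length;
  }

  // Run fn, or degrade gracefully: cached result if available, apology if not.
  async call<T>(fn: () => Promise<T>, degraded: () => T): Promise<T> {
    const now = Date.now();
    if (this.openedAt !== 0 && now - this.openedAt < this.cooldownMs) {
      return degraded(); // circuit open: short-circuit every AI request
    }
    try {
      const result = await fn();
      this.outcomes.push({ at: now, ok: true });
      this.openedAt = 0; // a success after cooldown closes the circuit
      return result;
    } catch (err) {
      this.outcomes.push({ at: now, ok: false });
      if (this.errorRate(now) > this.maxErrorRate) this.openedAt = now;
      return degraded(); // individual failures also degrade rather than error out
    }
  }
}
```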

This got us through Monday. It was not pretty, but it worked.

···

What did the real fix look like?

The weekend triage bought us time, but it was held together with duct tape. Over the next three weeks, we rebuilt the pipeline properly. Here is what changed.

Tenant isolation as infrastructure, not application logic

The core mistake was treating tenant isolation as an application concern. We had if-statements checking tenant budgets in our Node.js code. That is the wrong layer. Tenant isolation needs to be infrastructure.

We moved to a queue-based architecture. Every AI request went into a per-tenant SQS queue. A pool of workers consumed from all queues using weighted fair queuing — enterprise tenants got higher priority, but no tenant could starve another. If a tenant's queue depth exceeded a threshold, new requests got rejected at the API level before ever hitting the queue.
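A sketch of the consumer side, using `@aws-sdk/client-sqs`. The weighting scheme here (receive up to `weight` messages per pass) is a simple approximation of weighted fair queuing, not the exact production scheduler:

```typescript
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// One queue per tenant; enterprise tenants carry a higher weight (e.g. 3 vs 1).
interface TenantQueue {
  tenantId: string;
  url: string;
  weight: number;
}

// One scheduling pass: receive up to `weight` messages from each queue, so
// heavy tenants get more throughput but can never starve anyone else.
async function drainOnce(
  queues: TenantQueue[],
  handle: (tenantId: string, body: string) => Promise<void>,
): Promise<void> {
  for (const q of queues) {
    const { Messages = [] } = await sqs.send(
      new ReceiveMessageCommand({
        QueueUrl: q.url,
        MaxNumberOfMessages: Math.min(q.weight, 10), // SQS caps a single receive at 10
        WaitTimeSeconds: 0,
      }),
    );
    for (const msg of Messages) {
      await handle(q.tenantId, msg.Body ?? "");
      await sqs.send(
        new DeleteMessageCommand({ QueueUrl: q.url, ReceiptHandle: msg.ReceiptHandle! }),
      );
    }
  }
}

// Workers loop over drainOnce; the API-side depth check that rejects new
// requests can read the queue's ApproximateNumberOfMessages attribute.
```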

This completely decoupled request ingestion from inference execution. The API could accept requests at full speed. The workers processed them at a rate the model APIs could handle. No more connection pool exhaustion. No more cascading failures.

Token budgets as a first-class billing concept

We worked with the client's product team to make token budgets a visible, billable concept. Each plan tier got a monthly token allocation. Usage was tracked in real-time and exposed in the tenant's admin dashboard. Overages were billed at a per-token rate, with configurable hard caps.
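The billing math reduces to a few lines. Rates and cap values here are invented placeholders, not the client's pricing:

```typescript
// Monthly allocation + per-token overage, with an optional hard cap.
interface PlanBilling {
  monthlyTokenAllocation: number; // included in the base subscription
  overageRatePer1kTokens: number; // e.g. dollars per 1k tokens (placeholder)
  hardCapTokens?: number;         // configurable ceiling; requests blocked beyond it
}

function monthlyOverage(plan: PlanBilling, tokensUsed: number): { charge: number; blocked: boolean } {
  const blocked = plan.hardCapTokens !== undefined && tokensUsed >= plan.hardCapTokens;
  const billableTokens = Math.min(tokensUsed, plan.hardCapTokens ?? Infinity);
  const overage = Math.max(0, billableTokens - plan.monthlyTokenAllocation);
  return { charge: (overage / 1000) * plan.overageRatePer1kTokens, blocked };
}
```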

This turned a cost problem into a revenue feature. The client's enterprise customers were willing to pay significantly more for higher token budgets. The biggest tenant — the one that batch-processed 12,000 candidates — upgraded to a custom plan that cost 4x their original subscription. The AI features went from a cost center to a profit center.

The moment we made AI usage visible and billable, the incentive structure fixed itself. Tenants who were accidentally burning tokens started optimizing their usage. Tenants who genuinely needed high throughput started paying for it.
Client VP of Engineering

Model routing and fallback

We added intelligent model routing. Not every request needs the most expensive model. Job title suggestions? That is a Haiku-class task. Full candidate summaries for executive roles? That is Opus territory. Interview question generation? Sonnet handles it fine.

We built a routing layer that selected the model based on feature type, tenant tier, and current budget consumption. When a tenant was approaching their budget limit, we automatically downgraded to cheaper models with a quality disclaimer in the UI. This reduced average per-request cost by about 40% without meaningful quality degradation for most use cases.
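Condensed, the routing decision looks something like this. The feature-to-model mapping follows the examples above; the 90% downgrade threshold is illustrative:

```typescript
type Feature = "job_title" | "job_description" | "candidate_summary" | "interview_questions";
type ModelClass = "haiku" | "sonnet" | "opus";

// Pick a model from feature type, tenant tier, and current budget consumption.
function routeModel(feature: Feature, enterpriseTier: boolean, budgetUsedFraction: number): ModelClass {
  // Near the budget limit: downgrade to the cheapest model
  // (the UI surfaces a quality disclaimer when this happens).
  if (budgetUsedFraction >= 0.9) return "haiku";

  switch (feature) {
    case "job_title":
      return "haiku"; // short, cheap suggestion: a Haiku-class task
    case "job_description":
    case "interview_questions":
      return "sonnet"; // Sonnet handles these fine
    case "candidate_summary":
      return enterpriseTier ? "opus" : "sonnet"; // executive-level summaries get Opus
  }
}
```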

···

What would we do differently if we started over?

Plenty. Here is the honest list.

What We Would Change
  • Token-based rate limiting from day one, not request-based. In the LLM world, request count is a vanity metric. Tokens are the real unit of cost and capacity.
  • Queue-based architecture from the start. Synchronous inference behind a shared gateway is a ticking time bomb in any multi-tenant system. The queue adds 200-400ms of latency, which is invisible when the inference itself takes 3-8 seconds.
  • Separate infrastructure for AI services. Never share a gateway connection pool between AI inference and core platform services like auth. The latency profiles are fundamentally different — auth calls take 20ms, inference calls take 8 seconds. Mixing them in the same pool is asking for trouble.
  • Load test with adversarial tenants, not average tenants. Our load tests modeled the median tenant. We should have modeled the worst tenant — the one who writes a script to batch-process their entire database through your AI endpoint on launch day.
  • Make cost visibility a launch feature, not a follow-up. If tenants can see their usage from day one, they self-regulate. If they cannot, they treat it as unlimited.
···

Is multi-tenant AI actually harder than regular multi-tenancy?

Yes, and it is harder in a specific way that catches teams off guard.

Traditional multi-tenant systems have predictable resource consumption. A database query takes 5-50ms. An API call takes 20-200ms. You can model capacity with standard queueing theory. The variance is manageable.

LLM inference has enormous variance. A simple completion might take 500ms and cost $0.001. A complex one might take 12 seconds and cost $0.15. That is a 150x cost difference and a 24x latency difference, all on the same endpoint. Traditional capacity planning does not account for this.

The other difference is cost. In a traditional SaaS, the marginal cost of serving one more API request is negligible — it is mostly compute you have already paid for. In an AI-powered SaaS, every request has a direct, variable cost. A noisy tenant does not just degrade performance for others; they directly eat into your gross margin.

This is why we now treat AI cost isolation as an infrastructure problem on par with data isolation. You would never let one tenant's data leak into another tenant's queries. You should not let one tenant's token consumption leak into another tenant's experience.

···

How did it end?

The rebuilt pipeline has been running for about four months now. Zero cascading failures. Token budget overages generate revenue instead of outages. The queue-based architecture handles traffic spikes gracefully — during their January hiring surge, request volume tripled and the only visible effect was slightly longer queue times for the lowest-priority tenants.

The client's AI features are now their primary sales differentiator. They have closed three enterprise deals specifically because of the candidate summarization feature. The token billing model adds about 15% to their average revenue per account.

We shipped it in 11 weeks instead of six. The first six weeks were the original build. The weekend was triage. The next three weeks were the real rebuild. The last two were hardening, load testing (with adversarial scenarios this time), and documentation.

If we had built it right the first time, it would have taken eight weeks instead of six. Two extra weeks for the queue architecture and token-based billing. That is the math on cutting corners with AI infrastructure: you save two weeks up front and spend five weeks fixing it.

The lesson was not that we were bad engineers. The lesson was that our mental models for multi-tenant infrastructure did not account for the cost and latency characteristics of LLM inference. We were building with traditional assumptions in a non-traditional domain.
Internal retrospective
Build with us

Need this kind of thinking applied to your product?

We build AI agents, full-stack platforms, and engineering systems. Same depth, applied to your problem.

Loading comments...