MarkTechPost

Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture that Rethinks How LLMs are Served at Scale

Read the full article on MarkTechPost.

What Happened

For years, the way large language models handle inference has been stuck inside a box, literally: the high-bandwidth RDMA networks that make modern LLM serving work have confined both prefill and decode to the same datacenter, sometimes even the same rack. A team of researchers at Moonshot AI and Tsinghua University is now proposing PrfaaS, a cross-datacenter KVCache architecture that breaks that constraint.

Our Take

PrfaaS decouples the prefill and decode stages of LLM inference across datacenters, shuttling the KVCache over standard WAN links instead of requiring RDMA-bound clusters. The researchers tested the architecture on 13B and 33B models split between datacenters in Beijing and Shanghai, reporting sub-5% latency overhead.
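To make the split concrete, here is a toy sketch of the prefill/decode disaggregation described above: one site runs prefill and produces a KV cache, the cache is serialized and compressed for the WAN hop, and a second site resumes decoding from it. All function names, the cache layout, and the use of `zlib`/`pickle` are illustrative assumptions, not the PrfaaS API; a real system would move per-layer attention tensors, not Python lists.

```python
import pickle
import zlib

def prefill(prompt_tokens):
    # Hypothetical stand-in for the prefill stage: in a real serving stack
    # this is a forward pass over the prompt that materializes per-layer
    # key/value tensors. Here we fake a 4-layer cache from the token ids.
    return {f"layer_{i}": [t * (i + 1) for t in prompt_tokens] for i in range(4)}

def ship_over_wan(kv_cache):
    # Serialize and compress the KVCache before crossing the WAN link.
    # Making this transfer cheap is the core idea behind PrfaaS-style
    # disaggregation: it lets decode live in a different datacenter.
    return zlib.compress(pickle.dumps(kv_cache))

def decode(payload, max_new_tokens=3):
    # The decode site restores the cache and generates tokens without
    # ever re-running prefill locally.
    kv_cache = pickle.loads(zlib.decompress(payload))
    last = kv_cache["layer_0"][-1]
    return [last + n for n in range(1, max_new_tokens + 1)]

payload = ship_over_wan(prefill([1, 2, 3]))
print(decode(payload))  # -> [4, 5, 6]
```

The design point the sketch captures is that the only state crossing regions is the compressed cache payload; prompt processing and token generation never need to share a rack.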

This breaks the assumption that low-latency inference demands colocated hardware. Teams relying on GPT-4 or Claude for global RAG deployments will see higher egress costs, but PrfaaS shows that geographic separation of prefill and decode is viable, challenging the reflex to scale up within a single region. Latency-sensitive agents can now optimize for data locality, not just model size.

Global inference teams with users across regions should pilot PrfaaS-like splits instead of overprovisioning in one zone; single-region shops can ignore it.

What To Do

Do split prefill and decode across regions instead of replicating full stacks everywhere, because WAN-efficient KVCache transfer cuts cloud spend by 30% in multi-region RAG.

Builder's Brief

Who

teams running global LLM inference

What changes

infrastructure topology for multi-region RAG

When

months

Watch for

cloud providers offering geographically split inference tiers

What Skeptics Say

WAN jitter under real-world load could destabilize decode timing, making this impractical outside controlled backbones.
