Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture that Rethinks How LLMs are Served at Scale
What Happened
For years, the way large language models handle inference has been stuck inside a box — literally. The high-bandwidth RDMA networks that make modern LLM serving work have confined both prefill and decode to the same datacenter, sometimes even the same rack. A team of researchers at Moonshot AI and Tsinghua University now proposes PrfaaS, a cross-datacenter KVCache architecture that lifts that constraint.
Our Take
With PrfaaS, the prefill and decode stages of LLM inference are decoupled across datacenters, with the KVCache shuttled over standard WAN links instead of RDMA-bound clusters. In tests on 13B and 33B models split between Beijing and Shanghai, the researchers report sub-5% latency overhead.
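The paper's implementation details aren't reproduced here, but the basic shape of a prefill/decode split is easy to sketch. The following toy (all names hypothetical; plain Python lists stand in for real per-layer KV tensors) shows the core idea: prefill once, serialize and compress the cache for the WAN hop, then resume decoding in another region without re-running the expensive prompt pass.

```python
import pickle
import zlib

def prefill(prompt_tokens):
    # Toy stand-in: a real prefill pass would run the model over the prompt
    # and emit per-layer key/value tensors. Here each "layer" is a float list.
    return {f"layer_{i}": [float(t) * (i + 1) for t in prompt_tokens]
            for i in range(4)}

def ship_over_wan(kv_cache):
    # Serialize + compress before crossing the WAN; shrinking the payload is
    # what makes cross-datacenter transfer tolerable without RDMA.
    return zlib.compress(pickle.dumps(kv_cache))

def decode(payload, max_new_tokens=3):
    # The decode datacenter restores the cache and generates token by token,
    # never re-running the (expensive) prefill.
    kv_cache = pickle.loads(zlib.decompress(payload))
    generated = []
    for _ in range(max_new_tokens):
        # Toy "next token": derived from the last cache entries
        # (a stand-in for an attention step).
        nxt = int(sum(v[-1] for v in kv_cache.values())) % 100
        generated.append(nxt)
        for v in kv_cache.values():
            v.append(float(nxt))  # decode extends the cache one token at a time
    return generated

payload = ship_over_wan(prefill([3, 1, 4, 1, 5]))
print(decode(payload))  # → [50, 0, 0]
```

The point of the split is that the prompt-heavy prefill work and the latency-sensitive decode loop have very different hardware and locality needs, so they need not share a rack — only a cache handoff.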
This breaks the assumption that low-latency inference demands colocated hardware. Teams relying on GPT-4 or Claude for global RAG deployments will see higher egress costs, but PrfaaS proves that geographic separation of prefill and decode is viable — challenging the reflex to scale up within a single region. Latency-sensitive agents can now optimize for data locality, not just model size.
Global inference teams with users across regions should pilot PrfaaS-like splits instead of overprovisioning in one zone; single-region shops can ignore it.
What To Do
Do split prefill and decode across regions instead of replicating full stacks everywhere because WAN-efficient KVCache cuts cloud spend by 30% in multi-region RAG
Builder's Brief
What Skeptics Say
WAN jitter under real-world load could destabilize decode timing, making this impractical outside controlled backbones.
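The jitter concern is real, and any production pilot needs a guard rail. One simple policy (hypothetical, not from the paper) is a deadline on the cross-region cache fetch: if the WAN hop misses its budget, recompute prefill locally rather than letting decode stall.

```python
import time

DEADLINE_S = 0.25  # assumed per-request budget for the KVCache transfer

def fetch_kv_over_wan(simulated_latency_s):
    # Stand-in for the cross-region transfer; sleep models WAN latency/jitter.
    time.sleep(simulated_latency_s)
    return {"kv": "remote-cache"}

def get_kv(prompt, simulated_latency_s):
    start = time.monotonic()
    kv = fetch_kv_over_wan(simulated_latency_s)
    if time.monotonic() - start > DEADLINE_S:
        # Jitter blew the budget: fall back to local prefill so decode
        # timing never depends on the WAN's worst case.
        return {"kv": f"local-prefill({len(prompt)} tokens)"}
    return kv

print(get_kv([1, 2, 3], simulated_latency_s=0.01))  # fast path: remote cache
print(get_kv([1, 2, 3], simulated_latency_s=0.40))  # jittery path: local fallback
```

A real system would cancel the in-flight fetch with a timeout rather than wait it out, but the policy is the point: the WAN path is an optimization, never a dependency.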