Running AI models is turning into a memory game
What Happened
When we talk about the cost of AI infrastructure, the focus is usually on Nvidia and GPUs, but memory is an increasingly important part of the picture.
Our Take
Here's what nobody talks about: HBM (high-bandwidth memory) is the actual bottleneck now, not H100 compute. Running Llama 3.1 at scale means your spend is driven as much by memory capacity as by FLOPs. And while the HBM itself comes from SK Hynix, Samsung, and Micron, Nvidia has most of that supply locked up, co-packaged with its GPUs. Nvidia's laughing all the way to earnings.
This is infrastructure getting expensive in unexpected places. Cloud providers can't optimize it away: memory is a physical constraint, not an algorithmic one. So anyone running production LLMs is about to get a rude awakening on TCO. Open weights help (you own the hardware); closed APIs don't (cloud markups on HBM will kill you).
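To make the point concrete, here's a rough sketch of where the memory goes when serving a big open model. The 405B parameter count is Llama 3.1's; the layer/head figures and the serving config (sequence length, batch size) are illustrative assumptions, not vendor specs, so treat the outputs as order-of-magnitude only:

```python
# Back-of-envelope memory footprint for serving a large LLM.
# All serving-config numbers below are illustrative assumptions.

def model_memory_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB (bf16/fp16 = 2 bytes per parameter)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per token, per sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

weights = model_memory_gb(405)                 # 405B params at bf16
kv = kv_cache_gb(layers=126, kv_heads=8, head_dim=128,
                 seq_len=128_000, batch=8)     # assumed long-context batch
print(f"weights: {weights:.0f} GB, KV cache: {kv:.0f} GB")
# weights: 810 GB, KV cache: 528 GB
```

Even before activations and framework overhead, that's north of 1.3 TB of accelerator memory for one replica, which is why the bill is a memory bill, not a compute bill.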
What To Do
Ask your cloud vendor exactly how much they're charging per GB of model memory — that number will shock you.
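If your vendor won't quote a per-GB figure, you can derive a crude one yourself by attributing an instance's hourly price to its HBM. The price and memory size below are hypothetical placeholders; plug in your actual instance numbers:

```python
# Crude effective $/GB-hour of accelerator memory for a cloud instance.
# The $60/hr price and 8x80 GB HBM figures are hypothetical placeholders.

def memory_rate(hourly_price_usd: float, hbm_gb: float) -> float:
    """Dollars per GB of HBM per hour, attributing full instance cost to memory."""
    return hourly_price_usd / hbm_gb

# e.g. an 8-GPU instance with 80 GB HBM per GPU at a hypothetical $60/hr
print(f"${memory_rate(60, 8 * 80):.3f} per GB-hour")
# $0.094 per GB-hour
```

Multiply that rate by the footprint your model actually needs and compare it against buying the hardware; the gap is the markup you're negotiating over.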