TechCrunch

Running AI models is turning into a memory game

Read the full article on TechCrunch.

What Happened

When we talk about the cost of AI infrastructure, the focus is usually on Nvidia and GPUs -- but memory is an increasingly important part of the picture.

Our Take

Here's what nobody talks about: HBM (high-bandwidth memory) is the actual bottleneck now, not raw H100 compute. Running Llama 3.1 at scale means you're effectively buying more memory than compute, and while the chips come from SK Hynix, Samsung, and Micron, Nvidia is the dominant buyer, so that supply reaches you bundled into GPU pricing. Nvidia's laughing all the way to earnings.
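To see why memory dominates, here's a back-of-envelope sketch. The architecture numbers (80 layers, 8 KV heads, head dim 128) match the published Llama 3.1 70B config; the context length and batch assumptions are illustrative, not from the article.

```python
# Rough memory footprint for serving a Llama-3.1-70B-class model.
# Weights in FP16/BF16 cost 2 bytes per parameter; the KV cache stores
# one key and one value vector per layer per token.

def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the model weights."""
    return n_params * bytes_per_param

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_val: int = 2) -> int:
    """KV cache cost per token: 2 (K and V) x layers x heads x head dim."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val

# Llama 3.1 70B: 80 layers, 8 KV heads (grouped-query attention), head dim 128
per_token = kv_cache_bytes_per_token(80, 8, 128)
weights_gb = weight_bytes(70e9) / 1e9
kv_gb_128k = per_token * 128_000 / 1e9  # one sequence at full 128k context

print(f"weights: {weights_gb:.0f} GB")                 # 140 GB
print(f"KV cache per token: {per_token / 1e6:.2f} MB") # 0.33 MB
print(f"KV cache @ 128k context: {kv_gb_128k:.0f} GB per sequence")  # 42 GB
```

At 140 GB of weights alone, the model already overflows a single 80 GB H100 before serving a single request; every concurrent long-context sequence then adds tens of GB of KV cache. The bill scales with memory, not FLOPs.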

This is infrastructure getting expensive in unexpected places. Cloud providers can't optimize it away: memory is a physical constraint, not an algorithmic one. So anyone running production LLMs is in for a rude awakening on TCO. Open weights help (you own the hardware); closed APIs don't (cloud markups on HBM will eat you alive).

What To Do

Ask your cloud vendor exactly how much they're charging per GB of model memory — that number will shock you.
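If the vendor won't quote a per-GB figure, you can back one out yourself from an instance's hourly rate and its total accelerator memory. The numbers below are placeholders, not real pricing; plug in your own quote.

```python
# Hypothetical sketch: derive an effective price per GB-hour of accelerator
# memory from an instance's hourly rate. The $40/hour rate and 8 x 80 GB
# HBM configuration are made-up placeholders for illustration.

def price_per_gb_hour(hourly_rate_usd: float, total_hbm_gb: float) -> float:
    """Effective cost of one GB of accelerator memory for one hour."""
    return hourly_rate_usd / total_hbm_gb

# e.g. an 8-GPU node with 80 GB of HBM per GPU at a placeholder $40/hour
rate = price_per_gb_hour(40.0, 8 * 80)
print(f"${rate:.4f} per GB-hour")  # $0.0625 per GB-hour
```

Multiply that by the footprint of your deployment (weights plus peak KV cache) and by hours per month, and compare it against what the same DRAM would cost to own. That gap is the markup the article is warning about.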
