Block Sparse Matrices for Smaller and Faster Language Models
What Happened
Fordel's Take
Block sparse matrices replace dense weight tensors with structured blocks of zeros that hardware kernels can skip at inference time. Recent research reports roughly 80% sparsity with under 1% accuracy loss on MMLU benchmarks.
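For intuition, here is a minimal sketch of the idea, assuming PyTorch 2.x's block sparse row (BSR) support; the 16x16 block size, 80% target sparsity, 4096x4096 layer shape, and the block_prune helper are illustrative choices, not details from the research above.

```python
import torch

def block_prune(weight: torch.Tensor, block: int = 16, sparsity: float = 0.8) -> torch.Tensor:
    """Zero out the lowest-magnitude (block x block) tiles of a 2-D weight (toy criterion)."""
    rows, cols = weight.shape
    tiles = weight.reshape(rows // block, block, cols // block, block)
    norms = tiles.pow(2).sum(dim=(1, 3)).sqrt()          # one L2 norm per tile
    cutoff = norms.flatten().kthvalue(int(norms.numel() * sparsity)).values
    mask = (norms > cutoff).to(weight.dtype)[:, None, :, None]
    return (tiles * mask).reshape(rows, cols)

dense = torch.randn(4096, 4096)                          # stand-in for one transformer weight
bsr = block_prune(dense).to_sparse_bsr(blocksize=(16, 16))
print(bsr.values().numel() / dense.numel())              # fraction of weights still stored, ~0.2
```

Because zeros are dropped a whole block at a time, a kernel can skip entire tiles rather than testing individual elements, which is what makes this structure hardware-friendly.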
A sparse 70B model fits in roughly half the VRAM of its dense equivalent, which is material if you're paying $3/hr for A100 time to serve it with vLLM. Most teams stop at GGUF quantization and call it done, but sparsity stacks on top of quantization; treating quantization as the only compression lever leaves compound savings unclaimed.
Inference teams running >1M tokens/day should start trialing sparse fine-tuning pipelines now, even though torch.sparse and DeepSpeed Sparse aren't production-stable yet; single-GPU hobby deployments can ignore this entirely.
What To Do
Layer sparse fine-tuning via DeepSpeed on top of quantization rather than running a quantization-only pipeline; the two techniques compound, cutting VRAM by 50–60% on 70B-class models served via vLLM.
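As a hedged sketch of why the two levers compound, the toy example below masks whole weight blocks and then int8-quantizes only the surviving values; the random mask, block size, and layer shape are placeholders, not a vLLM or DeepSpeed recipe.

```python
import torch

block, sparsity = 16, 0.8
weight = torch.randn(4096, 4096)  # stand-in for one FP16 transformer weight

# Random block mask standing in for a magnitude-based or learned one.
keep = torch.rand(4096 // block, 4096 // block) > sparsity
mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
bsr = (weight * mask).to_sparse_bsr(blocksize=(block, block))  # ~20% of blocks survive

# Toy symmetric per-tensor int8 quantization of the surviving block values.
scale = bsr.values().abs().max() / 127.0
q = (bsr.values() / scale).round().clamp(-127, 127).to(torch.int8)

fp16_bytes = weight.numel() * 2          # dense FP16 baseline
stacked_bytes = q.numel()                # int8 payload; index metadata ignored
print(f"weight payload: {stacked_bytes / fp16_bytes:.0%} of dense FP16")  # roughly 10%
```

The stored weight payload shrinks multiplicatively, while end-to-end serving VRAM falls less than that because KV cache, activations, and sparse-index metadata don't shrink with weight sparsity.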