My Journey to a serverless transformers pipeline on Google Cloud
What Happened
Fordel's Take
Someone shipped a working Hugging Face Transformers inference pipeline on Google Cloud Run: fully serverless, with no always-on GPU instance. The model loads from Cloud Storage on cold start and serves through a containerized FastAPI endpoint.
Cold starts on Cloud Run with a 1B+ parameter model can hit 30–60 seconds — that's not a footnote, it's a product decision. Most teams reflexively reach for a dedicated VM or Vertex AI endpoint, burning $200–400/month for workloads that get 50 requests/day. Minimum instances solve latency but kill the cost argument entirely.
Teams with bursty, low-frequency inference — document parsing, async classification, nightly batch — should try this pattern with Cloud Run min-instances=0 and a sub-1B model like DistilBERT. For anything user-facing with hard p99 latency requirements, skip it.
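A scale-to-zero deployment of such a service might look like the following, assuming a prebuilt container image; the service name, project, image path, and region are placeholders, not details from the original post:

```shell
# Deploy a containerized inference service that scales to zero when idle.
# --min-instances 0 is what makes it serverless in cost terms;
# --timeout must comfortably cover the cold-start model load.
gcloud run deploy inference-svc \
  --image gcr.io/my-project/inference:latest \
  --region us-central1 \
  --memory 4Gi \
  --cpu 2 \
  --min-instances 0 \
  --max-instances 3 \
  --timeout 300 \
  --allow-unauthenticated
```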
What To Do
Use Cloud Run with min-instances=0 and a quantized sub-1B model instead of Vertex AI endpoints because idle Vertex endpoints cost $150+/month for workloads under 1k daily requests.
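The cost argument is easy to sanity-check with back-of-envelope arithmetic. The per-second prices below are illustrative assumptions, not quotes from Google's pricing pages; plug in current numbers before relying on the result:

```python
# Back-of-envelope monthly cost of scale-to-zero Cloud Run inference
# versus an idle dedicated endpoint. All prices are assumed/rounded.
REQS_PER_DAY = 50
SECONDS_PER_REQ = 30                # worst case: cold start + inference
VCPU_PRICE_PER_SEC = 0.000024       # assumed Cloud Run vCPU-second price
MEM_PRICE_PER_GIB_SEC = 0.0000025   # assumed memory GiB-second price
VCPUS, MEM_GIB = 2, 4               # matches a small inference container

per_req = SECONDS_PER_REQ * (
    VCPUS * VCPU_PRICE_PER_SEC + MEM_GIB * MEM_PRICE_PER_GIB_SEC
)
serverless_monthly = per_req * REQS_PER_DAY * 30

idle_endpoint_monthly = 150.0  # the idle-endpoint figure cited above

print(round(serverless_monthly, 2), idle_endpoint_monthly)
```

Even with every request paying the full cold start, the serverless bill at this volume lands around a few dollars a month, two orders of magnitude below the idle endpoint.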