My Journey to a serverless transformers pipeline on Google Cloud
What Happened
Fordel's Take
Someone shipped a working Hugging Face Transformers inference pipeline on Google Cloud Run: fully serverless, with no always-on GPU instance. The model loads from Cloud Storage on cold start and serves through a containerized FastAPI endpoint.
Cold starts on Cloud Run with a 1B+ parameter model can hit 30–60 seconds — that's not a footnote, it's a product decision. Most teams reflexively reach for a dedicated VM or Vertex AI endpoint, burning $200–400/month for workloads that get 50 requests/day. Minimum instances solve latency but kill the cost argument entirely.
Teams with bursty, low-frequency inference — document parsing, async classification, nightly batch — should try this pattern with Cloud Run min-instances=0 and a sub-1B model like DistilBERT. For anything user-facing with hard p99 latency requirements, skip it.
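A scale-to-zero deployment of such a service might look like the following, assuming a prebuilt container image; the service name, project, image path, and region are placeholders, not details from the original post:

```shell
# Deploy a containerized inference service that scales to zero when idle.
# --min-instances 0 is what makes it serverless in cost terms;
# --timeout must comfortably cover the cold-start model load.
gcloud run deploy inference-svc \
  --image gcr.io/my-project/inference:latest \
  --region us-central1 \
  --memory 4Gi \
  --cpu 2 \
  --min-instances 0 \
  --max-instances 3 \
  --timeout 300 \
  --allow-unauthenticated
```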
What To Do
Use Cloud Run with min-instances=0 and a quantized sub-1B model instead of Vertex AI endpoints because idle Vertex endpoints cost $150+/month for workloads under 1k daily requests.
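The cost argument is easy to sanity-check with back-of-envelope arithmetic. The per-second prices below are illustrative assumptions, not quotes from Google's pricing pages; plug in current numbers before relying on the result:

```python
# Back-of-envelope monthly cost of scale-to-zero Cloud Run inference
# versus an idle dedicated endpoint. All prices are assumed/rounded.
REQS_PER_DAY = 50
SECONDS_PER_REQ = 30                # worst case: cold start + inference
VCPU_PRICE_PER_SEC = 0.000024       # assumed Cloud Run vCPU-second price
MEM_PRICE_PER_GIB_SEC = 0.0000025   # assumed memory GiB-second price
VCPUS, MEM_GIB = 2, 4               # matches a small inference container

per_req = SECONDS_PER_REQ * (
    VCPUS * VCPU_PRICE_PER_SEC + MEM_GIB * MEM_PRICE_PER_GIB_SEC
)
serverless_monthly = per_req * REQS_PER_DAY * 30

idle_endpoint_monthly = 150.0  # the idle-endpoint figure cited above

print(round(serverless_monthly, 2), idle_endpoint_monthly)
```

Even with every request paying the full cold start, the serverless bill at this volume lands around a few dollars a month, two orders of magnitude below the idle endpoint.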