An estimated 87% of ML models never reach production. The technical reasons are consistent: training is not reproducible, there is no automated evaluation gate, deployment is manual, and monitoring does not exist. The organizational reason is also consistent: the team that builds the model is not the team responsible for operating it, and nobody owns the pipeline that connects the two.
The Production ML Pipeline
A production ML pipeline is not just model training. It is a system that ingests data, validates data quality, trains models, evaluates models against production baselines, deploys approved models, monitors live performance, and triggers retraining when performance degrades. Every step must be automated, logged, and reproducible.
Pipeline Stages
**Data ingestion and validation.** Automated data pulls from source systems with schema validation, distribution checks, and anomaly detection. If the training data distribution shifts significantly from the baseline, halt the pipeline and alert: training on anomalous data produces anomalous models.
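A minimal sketch of the halt-on-drift check, using a simple mean-shift heuristic. Real pipelines typically use PSI or KS tests via tools like Great Expectations or Evidently; `drift_check` and its threshold are illustrative, not a standard API:

```python
import statistics

def drift_check(baseline: list, current: list, z_threshold: float = 3.0) -> bool:
    """Flag a feature whose current mean has shifted more than
    z_threshold baseline standard deviations from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    shift = abs(statistics.mean(current) - mu) / sigma
    return shift > z_threshold  # True -> halt the pipeline and alert
```

In practice this runs per feature, and a single flagged feature is enough to stop training.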
**Feature engineering.** Transform raw data into model features with version-controlled feature definitions. Feature stores (Feast, Tecton) ensure that training-time features match serving-time features, eliminating a common source of training-serving skew, which silently degrades production accuracy.
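The core discipline, with or without a feature store, is a single version-controlled feature definition consumed by both the training pipeline and the serving path. A hypothetical sketch (`user_features` and its fields are invented for illustration):

```python
# One version-controlled feature definition used by BOTH training
# and serving, so the two paths cannot silently diverge.
FEATURE_VERSION = "v3"

def user_features(raw: dict) -> dict:
    """Pure function from a raw record to model features."""
    return {
        "feature_version": FEATURE_VERSION,
        "account_age_days": raw["account_age_days"],
        "orders_per_month": raw["order_count"]
                            / max(raw["account_age_days"] / 30, 1),
    }
```

Stamping the feature version into every record makes skew between training and serving detectable after the fact.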
**Training.** Reproducible training runs with pinned dependencies, versioned datasets, logged hyperparameters, and tracked experiments. MLflow and Weights & Biases are the most widely adopted experiment tracking platforms.
**Evaluation.** Every trained model is evaluated against the current production model on your evaluation suite. Only candidates that exceed the production baseline on all critical metrics proceed to deployment.
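The gate itself can be a few lines; the hard work is the evaluation suite that feeds it. A minimal sketch, assuming higher is better for every metric (real gates also handle lower-is-better metrics like latency):

```python
def passes_gate(candidate: dict, production: dict, critical: list) -> bool:
    """Promote only if the candidate beats the production baseline
    on every critical metric (higher assumed better here)."""
    return all(candidate[m] > production[m] for m in critical)
```

A model that wins on average but loses on one critical metric does not ship.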
**Canary deployment.** Route 5-10% of production traffic to the new model, monitor key metrics, and roll back automatically if any metric degrades beyond its threshold. Promote to full deployment only after 24-48 hours of stable canary metrics.
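Canary routing is commonly done by hashing a stable request or user id, so each caller consistently hits the same model for the whole canary window. A sketch of the routing decision (the 5% default mirrors the range above):

```python
import hashlib

def route_to_canary(request_id: str, canary_pct: float = 0.05) -> bool:
    """Deterministically route ~canary_pct of traffic to the new model.
    Hashing the id keeps routing sticky per caller."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < canary_pct * 10_000
```

Stickiness matters: a user who bounces between old and new model mid-session produces metrics that are hard to attribute to either.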
**Monitoring.** Continuous tracking of prediction accuracy, feature drift, latency, and business metrics, with retraining triggered automatically when monitored metrics cross their thresholds.
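The retraining trigger can be as simple as a rolling-window threshold on one monitored metric. A hypothetical sketch (`RetrainTrigger` is illustrative; production systems usually add hysteresis and human alerting before auto-retraining):

```python
from collections import deque

class RetrainTrigger:
    """Fire when the rolling mean of a monitored metric (e.g. accuracy)
    drops below a threshold over the last `window` observations."""
    def __init__(self, threshold: float, window: int = 100):
        self.threshold = threshold
        self.values = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        self.values.append(value)
        full = len(self.values) == self.values.maxlen
        # Only fire once the window is full, to avoid noisy cold starts.
        return full and sum(self.values) / len(self.values) < self.threshold
```
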
MLOps Tooling in 2026
| Tool | Category | Strengths | Best For |
|---|---|---|---|
| MLflow | Experiment tracking + registry | Open-source, broad adoption | Most teams, especially Databricks users |
| Weights & Biases | Experiment tracking + viz | Best experiment visualization | Research-heavy teams |
| Kubeflow | Pipeline orchestration | K8s native, flexible | Teams with K8s expertise |
| Vertex AI | Managed ML platform | Google Cloud integrated | GCP-native organizations |
| SageMaker | Managed ML platform | AWS integrated | AWS-native organizations |
| dbt + Great Expectations | Data pipeline + validation | Data quality focus | Feature pipeline quality gates |
The LLM Pipeline Shift
LLM-based applications shift the pipeline emphasis. Traditional ML pipelines focus on training and retraining. LLM applications often use foundation models without fine-tuning, so the pipeline focuses instead on prompt management, RAG corpus updates, evaluation suite maintenance, and cost optimization.
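Prompt management usually starts with content-addressed versioning: hash the template so that any edit, however small, produces a new version id that can be logged alongside every response. A minimal sketch:

```python
import hashlib

def prompt_version(template: str) -> str:
    """Content-addressed prompt version: any edit to the template
    yields a new id, so outputs trace back to the exact prompt."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
```

Logging this id with each request turns "the model got worse last Tuesday" into "outputs changed when prompt version X replaced version Y."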
The production concerns differ too. An LLM pipeline must manage prompt versions (small prompt changes can cause large output changes), RAG index freshness (stale retrieval data produces outdated responses), and model provider reliability (API rate limits, latency spikes, and the occasional behavior change when providers update their models). At a minimum, that means:
- Prompt versioning and A/B testing infrastructure
- RAG index update pipeline with freshness SLOs
- Evaluation suite that runs on every prompt or model change
- Semantic caching to reduce cost and latency for repeated queries
- Fallback model configuration for provider outages
- Cost tracking and alerting per model, per endpoint, per use case
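Of these, semantic caching has the most direct cost impact. The sketch below shows only the simplest tier, normalized exact-match lookup; real semantic caches (GPTCache, for example) match on embedding similarity instead, and `PromptCache` is a hypothetical name:

```python
import hashlib

class PromptCache:
    """Simplest cache tier: normalized exact-match on the query text.
    Shows only the cache-key and hit/miss mechanics; embedding-based
    similarity matching would replace _key in a real semantic cache."""
    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))  # None on a miss

    def put(self, query: str, response: str):
        self._store[self._key(query)] = response
```

Even this trivial tier catches retries and case/whitespace variants before they become billable API calls.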
“The difference between a demo and a production ML system is not the model — it is the 90% of engineering work that surrounds the model: data pipelines, evaluation gates, deployment automation, and monitoring.”