Getting ML Models to Production: The 90% Nobody Talks About
The data science community has a dirty secret: most models never make it to production. Research papers, conference talks, and blog posts focus almost exclusively on model architecture, training techniques, and benchmark scores. But getting a model from a Jupyter notebook to a production system that serves real users, handles edge cases, stays reliable under load, and maintains accuracy as data drifts — that is the actual hard problem. It accounts for roughly 90% of the effort in a real ML project, and it receives roughly 10% of the attention.
At LockedIn Labs, we build ML systems that run in production. Not demos, not proofs of concept, not notebooks that a data scientist runs manually — production systems with SLAs, monitoring, automated retraining pipelines, and on-call engineers. This article covers the engineering infrastructure that makes production ML work: feature stores, model serving, A/B testing, drift monitoring, and CI/CD for ML. If you have built a model that works on your laptop and you need it to work in production, this is the practical guide.
Feature Stores: The Foundation of Production ML
The single biggest source of bugs in production ML systems is training-serving skew — the difference between the features the model saw during training and the features it receives during inference. A data scientist computes features in a Pandas notebook using historical data. An engineer reimplements those features in a production service using a different language, different libraries, and different data access patterns. Subtle differences in timestamp handling, null value imputation, or aggregation logic produce features that look similar but are mathematically different. The model’s predictions degrade, and nobody understands why, because the model code has not changed.
A feature store eliminates training-serving skew by providing a single, shared system for computing, storing, and serving features. Features are defined once as transformation logic. The feature store executes that same logic for both batch training and real-time serving, guaranteeing consistency. The offline store holds historical feature values for training and batch prediction. The online store holds the latest feature values for low-latency inference. A materialization pipeline continuously computes features from raw data and writes them to both stores.
Feature definition example using Feast
```python
from datetime import timedelta

from feast import Entity, Feature, FeatureStore, FeatureView, FileSource, ValueType
# Note: this uses the pre-0.21 Feast API (Feature/ValueType). Newer releases
# use Field objects with a schema= parameter and rename event_timestamp_column
# to timestamp_field.

# Define the entity (the thing we compute features for)
customer = Entity(
    name="customer_id",
    value_type=ValueType.INT64,
    description="Unique customer identifier",
)

# Define the data source
customer_transactions = FileSource(
    path="s3://feature-store/customer_transactions.parquet",
    event_timestamp_column="event_timestamp",
)

# Define the feature view (transformation + storage config)
customer_spending_features = FeatureView(
    name="customer_spending",
    entities=["customer_id"],
    ttl=timedelta(days=90),
    features=[
        Feature(name="total_spend_30d", dtype=ValueType.FLOAT),
        Feature(name="avg_order_value_30d", dtype=ValueType.FLOAT),
        Feature(name="transaction_count_30d", dtype=ValueType.INT64),
        Feature(name="days_since_last_purchase", dtype=ValueType.INT64),
        Feature(name="preferred_category", dtype=ValueType.STRING),
    ],
    online=True,  # Serve from online store for real-time inference
    batch_source=customer_transactions,
)

store = FeatureStore(repo_path=".")

# Training: get historical features for a set of entities and timestamps
training_df = store.get_historical_features(
    entity_df=entity_df,  # customer_id + event_timestamp pairs
    features=[
        "customer_spending:total_spend_30d",
        "customer_spending:avg_order_value_30d",
    ],
).to_df()

# Serving: get latest features for real-time inference
online_features = store.get_online_features(
    features=[
        "customer_spending:total_spend_30d",
        "customer_spending:avg_order_value_30d",
    ],
    entity_rows=[{"customer_id": 12345}],
).to_dict()
```

The ROI of a feature store extends beyond skew prevention. Feature reuse across models eliminates redundant computation — once you compute customer spending patterns for the churn model, the same features are available for the recommendation model and the fraud model. Feature monitoring provides a single dashboard for tracking feature distributions, staleness, and quality across all models. Feature discovery lets new team members find and understand existing features instead of recomputing them from scratch.
Model Serving: From Artifact to API
Model serving is the infrastructure that turns a trained model artifact into a low-latency, high-availability API endpoint. This sounds simple until you account for the requirements: sub-100ms inference latency for real-time predictions, throughput scaling from 10 to 10,000 requests per second, GPU memory management for models that consume gigabytes of VRAM, model version management with zero-downtime deployments, and graceful degradation when the model service is overloaded or unavailable.
We use a layered serving architecture. The model server (typically TorchServe, Triton Inference Server, or a custom FastAPI service) handles the actual inference — loading the model into memory, managing GPU allocation, batching requests for throughput, and executing the forward pass. An inference gateway sits in front of the model server, handling authentication, rate limiting, request validation, feature enrichment (calling the feature store for real-time features), and response formatting. A load balancer distributes traffic across multiple model server replicas, with health checks that verify not just that the server is responsive but that it is producing valid predictions.
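In outline, the gateway's request path can be sketched in a few lines (a simplified illustration with hypothetical class and method names; a real gateway also handles authentication, rate limiting, and load-balanced calls to the model server pool):

```python
# Sketch of an inference gateway's request path. The feature store and
# model server clients are injected, so they can be swapped for stubs
# in tests or for Feast/Triton clients in production.

class InferenceGateway:
    def __init__(self, feature_store, model_server):
        self.feature_store = feature_store
        self.model_server = model_server

    def handle(self, request: dict) -> dict:
        # 1. Validate the request
        if "customer_id" not in request:
            return {"status": 400, "error": "customer_id required"}
        # 2. Enrich: fetch real-time features from the online store
        features = self.feature_store.get_online_features(request["customer_id"])
        if features is None:
            return {"status": 503, "error": "features unavailable"}
        # 3. Infer: call the model server replica pool
        score = self.model_server.predict(features)
        # 4. Format the response
        return {"status": 200, "customer_id": request["customer_id"], "score": score}
```

Keeping enrichment and validation in the gateway, rather than in the model server, means the model server stays a pure tensor-in, tensor-out service that is easy to replicate and benchmark.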
Request batching is a critical optimization for GPU-based models. GPUs are throughput-optimized processors — running 32 inferences in a single batch is barely slower than running one inference. Without batching, each request uses a tiny fraction of the GPU’s capacity, and you need many expensive GPUs to handle load. With dynamic batching, the model server accumulates incoming requests for a short window (typically 5 to 50 milliseconds), groups them into a batch, and executes a single forward pass. This can improve throughput by 10 to 30 times with a minimal latency increase. Triton Inference Server provides this out of the box. For custom serving solutions, we implement it using an async request queue with configurable batch size and maximum wait time.
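The accumulate-and-flush loop can be sketched with an asyncio queue (a simplified stand-in for what Triton's dynamic batcher does natively; class and parameter names here are illustrative):

```python
# Sketch of dynamic request batching: collect requests for up to
# max_wait_ms or until max_batch_size is reached, then run one
# batched forward pass and resolve each caller's future.
import asyncio


class DynamicBatcher:
    def __init__(self, predict_batch, max_batch_size=32, max_wait_ms=10):
        self.predict_batch = predict_batch  # fn: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()
        self._worker = None

    async def start(self):
        self._worker = asyncio.create_task(self._run())

    async def predict(self, x):
        # Each caller enqueues its input with a future and awaits the result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def _run(self):
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch_size:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs = [x for x, _ in batch]
            outputs = self.predict_batch(inputs)  # one batched forward pass
            for (_, fut), y in zip(batch, outputs):
                fut.set_result(y)
```

Tuning is a latency/throughput trade: a longer `max_wait_ms` fills bigger batches and raises GPU utilization, but every request in the batch pays the wait as added tail latency.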
A/B Testing ML Models: Beyond Offline Metrics
Offline evaluation tells you whether the new model is better on your test set. A/B testing tells you whether it is better in production, with real users, real data distributions, and real business outcomes. The gap between these two can be enormous. A model that improves click-through rate on historical data might decrease revenue in production because it recommends cheaper products. A model that improves accuracy on the test set might increase user complaints because its errors cluster on a specific demographic that was underrepresented in training data.
We implement A/B testing for ML models through the same inference gateway that handles serving. The gateway assigns each user to a model variant deterministically based on a hash of their user ID, ensuring consistent treatment throughout their session. Traffic allocation starts conservative — 5% to the new model, 95% to the existing model — and ramps gradually as confidence in the new model grows. The experiment runs until we have statistical significance on the primary business metric, with guardrail metrics that trigger automatic rollback if they degrade beyond threshold.
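Deterministic assignment can be as simple as hashing the user ID with an experiment-specific salt (a sketch; the function name and bucket scheme are illustrative):

```python
# Sketch of hash-based variant assignment: the same user always lands in
# the same bucket for a given experiment, so treatment is consistent
# across sessions and services.
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_pct: float) -> str:
    """Map (experiment, user_id) to a stable bucket in [0, 1) and
    return 'candidate' if it falls below the treatment percentage."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "candidate" if bucket < treatment_pct else "control"
```

Because each user's bucket is fixed, ramping from 5% to 50% only moves control users into the candidate group; nobody who has already seen the new model is silently switched back.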
Our A/B Testing Framework for ML Models
- Primary metrics: the business outcome the model is optimizing — conversion rate, revenue per user, engagement time, fraud catch rate.
- Guardrail metrics: metrics that must not degrade — latency p99, error rate, user complaint rate, demographic fairness measures.
- Model health metrics: prediction distribution, confidence score distribution, feature coverage, inference latency per variant.
- Statistical rigor: sequential testing with alpha spending, pre-registered minimum detectable effect, automated significance checks.
Monitoring Drift: When Production Data Stops Looking Like Training Data
ML models are the only software components that degrade silently. A bug in a traditional software system produces an error — a stack trace, a failed assertion, an exception. A drifting ML model continues to return predictions with the same HTTP 200 status code. The predictions just get gradually worse. By the time anyone notices, the model has been producing suboptimal results for weeks or months.
Drift comes in two flavors. Data drift means the distribution of input features has changed since training. Concept drift means the relationship between features and the target variable has changed. Both are common and both degrade model performance, but they require different remediation. Data drift might be addressable by retraining on recent data. Concept drift might require a fundamental model redesign or new features.
We monitor drift at three levels. Feature-level drift tracking compares the distribution of each input feature in production against the training distribution, using statistical tests (Population Stability Index for categorical features, Kolmogorov-Smirnov test for continuous features). When a feature drifts beyond threshold, the monitoring system generates an alert and records the drift magnitude for trend analysis. Prediction-level monitoring tracks the distribution of model outputs — if a binary classifier that historically predicted positive 30% of the time suddenly predicts positive 60% of the time, something has changed. Performance-level monitoring compares model predictions against ground truth labels when they become available. For many ML systems, ground truth arrives with a delay — a fraud label is confirmed days after the transaction, a churn label is confirmed weeks after the prediction. The monitoring system handles this delayed feedback loop by joining predictions with delayed labels and computing accuracy metrics on a rolling basis.
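The two feature-level tests can be sketched with numpy and scipy (function names and the epsilon smoothing for unseen categories are our illustrative choices):

```python
# Sketch of feature-level drift checks: a two-sample Kolmogorov-Smirnov
# test for continuous features, Population Stability Index for
# categorical ones.
import numpy as np
from scipy import stats


def ks_drift(reference: np.ndarray, production: np.ndarray, alpha: float = 0.05) -> bool:
    """Flag drift when the KS two-sample test rejects at level alpha."""
    _, p_value = stats.ks_2samp(reference, production)
    return p_value < alpha


def psi(reference: np.ndarray, production: np.ndarray) -> float:
    """Population Stability Index over all categories seen in either window.
    Common rule of thumb: PSI > 0.2 indicates significant drift."""
    cats = np.union1d(reference, production)
    eps = 1e-6  # smooth zero proportions so the log term stays finite
    ref_p = np.array([np.mean(reference == c) for c in cats]) + eps
    prod_p = np.array([np.mean(production == c) for c in cats]) + eps
    return float(np.sum((prod_p - ref_p) * np.log(prod_p / ref_p)))
```

In practice these run per feature on a schedule, comparing a recent production window against the frozen training-time reference distribution.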
Drift monitoring pipeline configuration
```yaml
monitoring:
  feature_drift:
    schedule: "every 6 hours"
    reference_dataset: "s3://ml-artifacts/training-data/v3.parquet"
    tests:
      continuous_features:
        method: "ks_test"   # Kolmogorov-Smirnov
        threshold: 0.05     # p-value threshold
      categorical_features:
        method: "psi"       # Population Stability Index
        threshold: 0.2      # PSI > 0.2 = significant drift
    alerts:
      slack_channel: "#ml-monitoring"
      pagerduty_severity: "warning"
  prediction_drift:
    schedule: "every 1 hour"
    metrics:
      - name: "prediction_distribution"
        method: "chi_squared"
        window: "24h"
      - name: "confidence_calibration"
        method: "expected_calibration_error"
        threshold: 0.05
  performance_monitoring:
    schedule: "daily"
    label_delay: "7d"  # Ground truth arrives 7 days after prediction
    metrics:
      - "accuracy"
      - "precision"
      - "recall"
      - "auc_roc"
    alert_on: "10% degradation vs trailing 30-day average"
  auto_retrain:
    trigger: "performance degradation > 15% for 3 consecutive days"
    pipeline: "ml-pipelines/retrain-v2"
    approval: "manual"  # Require human approval before deploying retrained model
```

CI/CD for ML: Treating Models as Software
The ML community has spent a decade building better models and approximately zero time building better deployment processes. The typical deployment workflow at many organizations involves a data scientist exporting a model from a notebook, uploading it to a shared drive or S3 bucket, and sending a Slack message to the engineering team asking them to deploy it. This is the equivalent of deploying software by emailing a ZIP file to the ops team. It does not scale, it cannot be audited, and it cannot be rolled back.
We build ML CI/CD pipelines that provide the same guarantees as software CI/CD: automated testing, versioned artifacts, staged rollouts, and automated rollback. The pipeline is triggered by either a code change (new feature engineering, model architecture changes) or a data change (new training data available, retraining schedule). It executes in stages. The training stage runs the training job on versioned data, producing a model artifact with metadata (training data hash, hyperparameters, training metrics, the git commit that produced it). The evaluation stage runs the model against the benchmark suite, comparing results against the current production model and against minimum quality thresholds. The registration stage publishes the model to a model registry with semantic versioning. The deployment stage rolls out the model using the canary deployment pattern — 5% traffic initially, ramping to full deployment over 48 hours if quality metrics hold.
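The evaluation stage's promotion gate can be sketched as a pure function over metric dictionaries (names, metrics, and thresholds here are hypothetical):

```python
# Sketch of an evaluation-stage gate: a candidate model must clear
# minimum quality floors AND not regress meaningfully against the
# current production model before it is registered for deployment.

def evaluation_gate(candidate: dict, production: dict,
                    minimums: dict, max_regression: float = 0.01):
    """Return (passed, reasons). All metrics are higher-is-better."""
    reasons = []
    for metric, floor in minimums.items():
        if candidate.get(metric, 0.0) < floor:
            reasons.append(f"{metric} below floor {floor}")
    for metric, prod_value in production.items():
        if candidate.get(metric, 0.0) < prod_value - max_regression:
            reasons.append(f"{metric} regressed vs production ({prod_value})")
    return (not reasons, reasons)
```

Returning the list of failure reasons, rather than a bare boolean, gives the pipeline something concrete to post in the audit trail and in the alert that blocks the deployment.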
Every stage produces artifacts and metadata that create a complete audit trail. Given any prediction the model made in production, you can trace back to the exact model version, the training data it was trained on, the evaluation results it achieved, the code commit that produced it, and the engineer who approved its deployment. This traceability is not just good engineering practice — it is a regulatory requirement in healthcare, finance, and other regulated industries where model decisions must be explainable and auditable.
The Organizational Challenge: Bridging Data Science and Engineering
The hardest part of production ML is not technical — it is organizational. Data scientists and software engineers have different skills, different tools, different workflows, and different definitions of done. A data scientist considers a model done when it achieves good metrics on a test set. An engineer considers it done when it runs in production with monitoring, alerting, documentation, and an on-call rotation. Bridging this gap requires either ML engineers who span both worlds, or a well-defined handoff process with clear contracts between the teams.
We advocate for the ML engineer role — engineers who understand both model development and production systems. They work alongside data scientists during model development to ensure the model is designed for production from the start: the feature engineering uses production-compatible data sources, the model architecture fits within the serving latency budget, the evaluation metrics align with business outcomes. They own the infrastructure — the feature store, the serving platform, the monitoring system, the CI/CD pipeline — and they are on call when production models degrade.
Production ML by the Numbers
- 87%: of ML models never reach production, according to industry surveys
- 90%: of production ML effort is infrastructure, not model development
- 3-6 months: typical model refresh cycle before drift degrades performance meaningfully
Getting ML models to production is an engineering discipline, not a research problem. The models themselves are the easy part — the open-source community produces new architectures and training techniques every week. The hard part is the infrastructure, the processes, and the organizational culture that keep those models reliable, accurate, and valuable in production over months and years. The organizations that invest in MLOps infrastructure early build a compounding advantage: every subsequent model is faster to deploy, cheaper to operate, and more reliable in production. The ones that treat each model as a one-off project spend more time fighting infrastructure than improving models.
Need to get your ML models to production?
Our ML engineering team builds the infrastructure that takes models from notebooks to production-grade systems.