# Model Serving
Status: Roadmap (Q4 2025)
Infrastructure for deploying and managing ML model inference servers.
## Planned Capabilities
### Model Server Provisioning
TorchServe (PyTorch models):
```python
from senren import ModelServer

torchserve = ModelServer(
    name="recommendation-model",
    type="torchserve",
    model_url="s3://models/recommendation-v2.pt",
    replicas=3,
    memory_gb=8,
    gpu=True,
    regions=["aws:us-east-1", "gcp:us-central1"],
)
```
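Once provisioned, the server would presumably expose TorchServe's standard inference REST API. A minimal client sketch, assuming a placeholder endpoint URL (how senren surfaces the URL is not yet defined) and a handler that accepts JSON:

```python
import requests

# Placeholder host: senren's mechanism for exposing the server URL is not yet defined.
ENDPOINT = "http://recommendation-model.example.com:8080"

# TorchServe serves predictions at /predictions/<model-name>.
response = requests.post(
    f"{ENDPOINT}/predictions/recommendation-model",
    json={"user_id": 42, "context": "homepage"},  # Payload shape depends on the model's handler
    timeout=5,
)
response.raise_for_status()
print(response.json())
```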
TensorFlow Serving:
```python
tf_serving = ModelServer(
    name="ranking-model",
    type="tensorflow-serving",
    model_url="s3://models/ranking/",
    replicas=5,
    memory_gb=4,
    regions=["aws:us-east-1"],
)
```
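As with TorchServe, the resulting server would presumably speak TensorFlow Serving's standard REST API. A sketch, assuming a placeholder endpoint and that the SavedModel is served under the name "ranking":

```python
import requests

# Placeholder host and model name; both are assumptions.
ENDPOINT = "http://ranking-model.example.com:8501"

# TensorFlow Serving's REST API expects a JSON body with an "instances" list.
response = requests.post(
    f"{ENDPOINT}/v1/models/ranking:predict",
    json={"instances": [[0.3, 1.2, 0.0, 4.5]]},
    timeout=5,
)
response.raise_for_status()
print(response.json()["predictions"])
```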
### A/B Testing Infrastructure
Traffic splitting:
```python
from senren import ModelDeployment, ModelVariant

deployment = ModelDeployment(
    name="recommendation",
    variants=[
        ModelVariant(
            name="control",
            model="recommendation-v1",
            traffic_percentage=90,
        ),
        ModelVariant(
            name="treatment",
            model="recommendation-v2",
            traffic_percentage=10,
        ),
    ],
    regions=["aws:us-east-1"],
)
```
Planned features:

- Automatic traffic routing
- Canary deployments (gradual rollout; see the sketch below)
- Shadow traffic (parallel testing without affecting users)
- Automatic rollback on latency/error spikes
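There is no dedicated canary API yet, but a gradual rollout can already be expressed with the traffic-splitting primitive above. The following is a sketch only, not a committed interface, and the step percentages are illustrative:

```python
from senren import ModelDeployment, ModelVariant

# Sketch: step the treatment model up through a series of traffic splits.
CANARY_STEPS = [1, 5, 25, 50, 100]

def canary_split(treatment_pct: int) -> ModelDeployment:
    """Build a deployment sending `treatment_pct`% of traffic to the new model."""
    return ModelDeployment(
        name="recommendation",
        variants=[
            ModelVariant(
                name="control",
                model="recommendation-v1",
                traffic_percentage=100 - treatment_pct,
            ),
            ModelVariant(
                name="treatment",
                model="recommendation-v2",
                traffic_percentage=treatment_pct,
            ),
        ],
        regions=["aws:us-east-1"],
    )
```

In practice each step would be applied, observed for latency and error regressions, and rolled back if the planned automatic-rollback checks fire.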
### Shadow Traffic Testing
Parallel inference:
```python
shadow = ModelDeployment(
    name="recommendation",
    primary="recommendation-v1",
    shadow="recommendation-v2",  # Receives a copy of traffic, results ignored
    regions=["aws:us-east-1"],
)
```
Use cases:

- Test new model performance under production load
- Compare latency/throughput before rollout (see the sketch below)
- Validate model behavior on real traffic
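For example, once per-request latencies have been collected for both the primary and the shadow model (the collection mechanism is an assumption; senren's monitoring hooks are not yet specified), the comparison reduces to simple percentile math:

```python
import statistics

# Illustrative samples: per-request latencies in milliseconds for each variant.
primary_latencies_ms = [12.1, 13.4, 11.9, 15.0, 12.7]
shadow_latencies_ms = [14.8, 16.2, 15.1, 18.3, 15.5]

def p95(samples: list[float]) -> float:
    """95th-percentile latency of a sample list."""
    ordered = sorted(samples)
    index = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[index]

print(f"primary p95: {p95(primary_latencies_ms):.1f} ms")
print(f"shadow  p95: {p95(shadow_latencies_ms):.1f} ms")
delta = statistics.median(shadow_latencies_ms) - statistics.median(primary_latencies_ms)
print(f"median latency delta: {delta:.1f} ms")
```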
## Current Workaround
Deploy model servers manually using Kubernetes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: torchserve
          image: pytorch/torchserve:latest
          resources:
            limits:
              memory: "8Gi"
              nvidia.com/gpu: 1
```
Limitations:

- Manual multi-region deployment (see the sketch below)
- No built-in A/B testing
- No shadow traffic support
- Manual monitoring setup
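Multi-region deployment in the workaround means repeating the apply yourself, for example by looping over one kubectl context per cluster. A sketch; the context names and manifest filename are placeholders:

```python
import subprocess

# One kubectl context per regional cluster; names depend on your kubeconfig.
CONTEXTS = ["aws-us-east-1", "gcp-us-central1"]

for context in CONTEXTS:
    # Apply the manifest above (saved as model-server.yaml) to each cluster.
    subprocess.run(
        ["kubectl", "--context", context, "apply", "-f", "model-server.yaml"],
        check=True,
    )
```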
## Timeline
Q4 2025: Model serving with A/B testing and shadow traffic.
See the roadmap for details.