Deploying Models as Services

Your model performs well offline. Congratulations — you have completed 30% of the project. The remaining 70% is getting it into production: packaging the model so it survives deserialization across environments, wrapping it in an API that handles concurrent requests without crashing, containerizing everything so it deploys identically on any host, and choosing an infrastructure that matches your latency and cost budget.

Most data scientists treat deployment as someone else’s problem. And then they wonder why their model — the one that achieved state-of-the-art metrics on a held-out test set — is producing garbage predictions in production, or timing out under load, or failing silently because the input schema changed and nobody validated it.

Deployment is not an afterthought. It is where your model meets reality, and reality is hostile. Users send malformed inputs. Traffic spikes at 2 AM. Dependencies update and break serialization. The container that worked on your machine uses a different glibc on the production host. These are not edge cases. These are Tuesdays.

The Four Stages

Every model deployment follows the same progression, whether you are serving a logistic regression or a 70B-parameter language model:

Stage 1: Packaging. You must serialize your trained model into a format that can be loaded in a different process, on a different machine, potentially in a different language. Pickle is the Python default, and it is a security liability: unpickling a file can execute arbitrary code, so a model file from an untrusted source is effectively an executable. ONNX and TorchScript are the usual production alternatives.

Stage 2: API. You must wrap the model in an HTTP service that accepts requests, validates inputs, runs inference, and returns structured responses. FastAPI gives you async support, automatic validation, and OpenAPI documentation with minimal boilerplate.

Stage 3: Containerization. You must package the API, its dependencies, and the model weights into a Docker image that runs identically everywhere. ML images are notoriously bloated — a naive PyTorch container can exceed 4GB. Multi-stage builds and careful dependency management bring this down to a few hundred megabytes.

Stage 4: Infrastructure. You must choose where to run the container. Serverless functions offer zero cost at zero traffic but suffer cold starts that can take 30 seconds for large models. Dedicated endpoints provide consistent latency at consistent cost. Managed platforms like SageMaker and Vertex AI abstract the infrastructure but lock you in and charge a premium.
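To make the Stage 1 warning about pickle concrete, here is a minimal sketch of why loading a pickle is equivalent to running code. The `TamperedModel` class and the `PICKLE_DEMO` environment variable are hypothetical names chosen for illustration; the payload only sets an environment variable, where a real attacker could run anything.

```python
import os
import pickle


class TamperedModel:
    """Hypothetical stand-in for a model file that someone has replaced."""

    def __reduce__(self):
        # pickle stores "call exec(<string>)" as the recipe for rebuilding
        # this object, so the string runs inside pickle.loads().
        code = "__import__('os').environ['PICKLE_DEMO'] = 'code ran on load'"
        return (exec, (code,))


payload = pickle.dumps(TamperedModel())

os.environ.pop("PICKLE_DEMO", None)   # start from a clean state
loaded = pickle.loads(payload)        # no method is ever called on the result

print(os.environ.get("PICKLE_DEMO"))  # set by the payload, not by this script
print(loaded)                         # exec returns None: there is no model here
```

Nothing in the calling code opts in to running the payload; the call to `pickle.loads` is enough. That is the reason to prefer formats that describe a computation graph as data rather than as instructions for rebuilding Python objects.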

Deployment Pipeline

Each stage introduces failure modes that the previous stage did not anticipate. A model that loads fine locally may fail when deserialized in a container with a different NumPy version. An API that handles single requests perfectly may collapse under batch traffic. A container that runs on Docker Desktop may exceed the memory limit on a cloud instance. An infrastructure choice that seemed cheap at 100 requests per day becomes ruinously expensive at 100,000.
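The last point, that a cheap choice at 100 requests per day can become expensive at 100,000, is a break-even calculation. The sketch below uses made-up prices (they are placeholders, not quotes from any provider) and assumes a single dedicated instance can absorb the whole load; it ignores free tiers and cold-start effects.

```python
# All prices are assumed placeholders for illustration only.
SERVERLESS_USD_PER_REQUEST = 0.0004   # pay-per-use price, assumed
DEDICATED_USD_PER_DAY = 12.0          # one always-on instance, assumed


def daily_cost_serverless(requests_per_day: int) -> float:
    """Cost grows linearly with traffic and is zero at zero traffic."""
    return requests_per_day * SERVERLESS_USD_PER_REQUEST


def daily_cost_dedicated(requests_per_day: int) -> float:
    """Flat cost, assuming one instance is enough at this traffic level."""
    return DEDICATED_USD_PER_DAY


def break_even_requests_per_day() -> float:
    """Traffic level at which both options cost the same per day."""
    return DEDICATED_USD_PER_DAY / SERVERLESS_USD_PER_REQUEST


for requests in (100, 10_000, 100_000):
    print(
        f"{requests:>7} req/day  "
        f"serverless ${daily_cost_serverless(requests):8.2f}  "
        f"dedicated ${daily_cost_dedicated(requests):8.2f}"
    )
print(f"break-even near {break_even_requests_per_day():,.0f} requests/day")
```

With these assumed numbers the serverless option wins easily at low traffic and loses well before 100,000 requests per day. The useful habit is to compute the crossover for your own prices before committing to either option.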

What Separates Hobby Deployment from Production Deployment

The differences are not in the model. They are in everything around the model:

Concern	Hobby	Production
Serialization	`pickle.dump(model, f)`	ONNX export with opset version pinning
Input validation	`assert len(features) == 10`	Pydantic model with type coercion and range checks
Error handling	Stack trace in response body	Structured error codes, no information leakage
Concurrency	Single-threaded Flask	Async FastAPI with connection pooling
Container	`FROM python:3.11` (1.2GB)	Multi-stage build with slim base (300MB)
Health checks	None	Liveness + readiness + model-loaded probes
Scaling	“It works on my machine”	Auto-scaling with request-based metrics
Security	Model loads via pickle	Model loaded from verified artifact store
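The input-validation and error-handling rows can be illustrated without any framework. The sketch below is not Pydantic or FastAPI; it is a plain-Python stand-in that shows the same ideas: coerce types, check ranges, and return structured error codes instead of a stack trace. The names `handle_request`, `parse_features`, the error codes, the feature bounds, and the averaging "model" are all hypothetical.

```python
import math

N_FEATURES = 10                        # mirrors the table's len(features) == 10
FEATURE_MIN, FEATURE_MAX = -1e6, 1e6   # assumed bounds for illustration


class ValidationError(Exception):
    """Carries a machine-readable code alongside a safe, human-readable message."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code
        self.message = message


def parse_features(payload) -> list:
    if not isinstance(payload, dict) or "features" not in payload:
        raise ValidationError("missing_field", "Body must be an object with a 'features' list.")
    raw = payload["features"]
    if not isinstance(raw, list) or len(raw) != N_FEATURES:
        raise ValidationError("wrong_length", f"'features' must be a list of {N_FEATURES} numbers.")
    features = []
    for i, value in enumerate(raw):
        try:
            number = float(value)      # type coercion: "3.5" becomes 3.5
        except (TypeError, ValueError):
            raise ValidationError("not_a_number", f"features[{i}] is not numeric.") from None
        if not math.isfinite(number) or not FEATURE_MIN <= number <= FEATURE_MAX:
            raise ValidationError("out_of_range", f"features[{i}] is outside the accepted range.")
        features.append(number)
    return features


def predict(features: list) -> float:
    """Placeholder for real inference: returns the mean of the features."""
    return sum(features) / len(features)


def handle_request(payload) -> tuple:
    """Return (HTTP status, JSON-serializable body) for one prediction request."""
    try:
        return 200, {"prediction": predict(parse_features(payload))}
    except ValidationError as err:
        return 422, {"error": {"code": err.code, "message": err.message}}
    except Exception:
        # Log the traceback server-side; the client sees only a generic code.
        return 500, {"error": {"code": "internal_error", "message": "Prediction failed."}}


print(handle_request({"features": ["1.5"] + [1] * 9}))
print(handle_request({"features": [1, 2, 3]}))
```

A bare `assert` offers neither property: it disappears when Python runs with `-O`, and when it fails the client receives whatever the framework does with an unhandled `AssertionError`.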

These differences compound. A hobby deployment that cuts one corner in each category will fail in production. A production deployment that handles each category correctly will survive the traffic spike, the malformed input, the dependency conflict, and the 3 AM page — because it was designed with the assumption that all of those will happen.
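As one example of designing for failure, consider the Security row. One possible reading of "verified artifact store" is checksum pinning: the deployment records the SHA-256 digest of the model file at release time and refuses to load anything that does not match. That interpretation is an assumption, and the function name `load_verified_artifact` and the file name below are hypothetical.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def load_verified_artifact(path: Path, expected_sha256: str) -> bytes:
    """Return the file's bytes only if they match the pinned SHA-256 digest."""
    data = path.read_bytes()  # read once, so the verified bytes are the loaded bytes
    actual = hashlib.sha256(data).hexdigest()
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError("model artifact does not match its pinned checksum")
    return data


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.onnx"   # hypothetical file name
    original = b"pretend these are model weights"
    artifact.write_bytes(original)
    pinned = hashlib.sha256(original).hexdigest()  # recorded at release time

    weights = load_verified_artifact(artifact, pinned)
    print("loaded", len(weights), "bytes")

    artifact.write_bytes(b"tampered weights")      # simulate a swapped file
    try:
        load_verified_artifact(artifact, pinned)
    except ValueError as err:
        print("refused to load:", err)
```

A checksum only proves the file is the one that was pinned. It does not make an unsafe format safe, so it complements rather than replaces the move away from pickle.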

Roadmap

Section 10.1 covers model packaging — why pickle is dangerous, how ONNX provides a universal export format, and when TorchScript is the right choice for PyTorch models. Section 10.2 builds the inference API with FastAPI: input validation, model lifecycle management, batch prediction, and request batching for GPU efficiency. Section 10.3 tackles containerization with Docker: multi-stage builds, image size optimization, and Docker Compose for local development. Section 10.4 lays out the infrastructure decision — serverless versus dedicated versus managed — with a quantitative framework for choosing based on your actual workload characteristics.

Every code example in this chapter is deployment-ready. You will not see `app.run(debug=True)` or `FROM python:latest`. You will see the patterns that survive the first week in production, because the first week is where most deployments die.

The prerequisite for this chapter is Chapter 8 (evaluation), because you should not deploy a model you cannot evaluate properly. If your offline metrics are unreliable, your production metrics will be worse.