Machine Learning

ML Model Serving API

Python FastAPI TensorFlow

Overview

A production-ready API for serving machine learning models with automatic request batching, model versioning, and high GPU utilization.

Key Features

Request Batching

Groups multiple incoming requests into a single batch so the GPU runs one larger inference call instead of many small ones, which raises GPU throughput.

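import asyncio
import numpy as np
import tensorflow as tf
from fastapi import FastAPI

app = FastAPI()
model = tf.keras.models.load_model('model.h5')

class RequestBatcher:
    def __init__(self, max_batch_size=32, timeout=0.01):
        self.queue = []  # pending (input, future) pairs
        self.max_batch_size = max_batch_size
        self.timeout = timeout  # max seconds a request waits for a full batch
        self._flush_task = None

    async def add_request(self, data):
        # Each caller gets a future that resolves to its own prediction.
        future = asyncio.get_running_loop().create_future()
        self.queue.append((data, future))
        if len(self.queue) >= self.max_batch_size:
            await self.process_batch()
        elif self._flush_task is None or self._flush_task.done():
            # Flush a partial batch once the timeout expires.
            self._flush_task = asyncio.create_task(self._flush_after_timeout())
        return await future

    async def _flush_after_timeout(self):
        await asyncio.sleep(self.timeout)
        await self.process_batch()

    async def process_batch(self):
        # Swap the queue out first so new arrivals start a fresh batch.
        pending, self.queue = self.queue, []
        if not pending:
            return
        batch = np.array([data for data, _ in pending])
        # model.predict blocks, so run it in a worker thread.
        predictions = await asyncio.to_thread(model.predict, batch)
        for (_, future), prediction in zip(pending, predictions):
            if not future.done():
                future.set_result(prediction)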
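
To try the batching idea without TensorFlow or a GPU, the sketch below uses a different but related design: a single background worker drains an asyncio.Queue and closes a batch when it is full or when a timeout expires. The names QueueBatcher, fake_predict, submit, and demo are hypothetical and exist only for this illustration; fake_predict stands in for model.predict. It is a minimal sketch under those assumptions, not the project's actual implementation.

```python
import asyncio


def fake_predict(batch):
    # Hypothetical stand-in for model.predict: doubles each input.
    return [x * 2 for x in batch]


class QueueBatcher:
    def __init__(self, predict_fn, max_batch_size=32, timeout=0.01):
        self.predict_fn = predict_fn
        self.max_batch_size = max_batch_size
        self.timeout = timeout
        self.queue = asyncio.Queue()
        self.batch_sizes = []  # recorded only so the demo can inspect them

    async def submit(self, data):
        # Enqueue the input and wait for the worker to resolve its future.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((data, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives, then start the clock.
            pending = [await self.queue.get()]
            deadline = loop.time() + self.timeout
            while len(pending) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                pending.append(item)
            self.batch_sizes.append(len(pending))
            inputs = [data for data, _ in pending]
            try:
                # Run the (possibly blocking) predict call off the event loop.
                results = await asyncio.to_thread(self.predict_fn, inputs)
            except Exception as exc:
                # Propagate the failure to every caller in this batch.
                for _, future in pending:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(pending, results):
                if not future.done():
                    future.set_result(result)


async def demo():
    batcher = QueueBatcher(fake_predict, max_batch_size=4, timeout=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Each caller receives only its own prediction, and no batch exceeds max_batch_size. A single worker also means only one predict call runs at a time, which may or may not be desirable depending on the hardware.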

Model Versioning

  • A/B testing between model versions
  • Gradual rollout with canary deployments
  • Rollback capability
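
One way the three capabilities above could fit together is a small router that sends a configurable percentage of traffic to a canary version. The sketch below is an assumption about how this might be done, not a description of the project's code: VersionRouter, choose, set_canary_percent, promote, and rollback are hypothetical names, and the "models" are plain strings for illustration.

```python
import hashlib


class VersionRouter:
    """Routes requests between a stable and a canary model version (illustrative)."""

    def __init__(self, stable, canary=None, canary_percent=0):
        self.stable = stable
        self.canary = canary
        self.canary_percent = canary_percent  # share of traffic, 0 to 100

    def choose(self, request_id):
        # Hash the request (or user) ID into a bucket 0..99 so the same ID
        # is always served by the same version while the split is unchanged.
        if self.canary is None:
            return self.stable
        digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        return self.canary if bucket < self.canary_percent else self.stable

    def set_canary_percent(self, percent):
        # Gradual rollout: raise this step by step (for example 1, 10, 50, 100).
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.canary_percent = percent

    def promote(self):
        # The canary becomes the new stable version.
        if self.canary is not None:
            self.stable, self.canary, self.canary_percent = self.canary, None, 0

    def rollback(self):
        # Drop the canary; all traffic returns to the stable version.
        self.canary, self.canary_percent = None, 0


router = VersionRouter("v1", canary="v2", canary_percent=10)
version = router.choose("user-42")  # either "v1" or "v2", stable per ID
```

Setting canary_percent to 50 gives an A/B split. Hashing the ID rather than drawing a random number keeps each user on one version, which makes the comparison between versions easier to interpret.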

Performance

  • Throughput: 1000 requests/second on a single GPU
  • Latency: 15ms p50, 30ms p99
  • GPU Utilization: 85% average
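
For readers unfamiliar with the p50 and p99 notation, the sketch below shows one common way to derive such figures from recorded per-request latencies, using the nearest-rank percentile definition. How the numbers above were actually measured is not stated; percentile and summarize are hypothetical helper names, and the sample data is synthetic.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms, window_seconds):
    # Throughput is the number of completed requests divided by the window length.
    return {
        "throughput_rps": len(latencies_ms) / window_seconds,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Synthetic example: 100 requests with latencies 1..100 ms over a 2-second window.
stats = summarize(list(range(1, 101)), window_seconds=2)
```

p50 is the median latency, while p99 describes the slowest 1% of requests, so a large gap between the two usually points to tail-latency problems such as queueing behind a full batch.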

Monitoring

  • Model drift detection
  • Prediction latency monitoring
  • Input distribution tracking
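
As an illustration of what input distribution tracking and drift detection can look like, the sketch below computes the Population Stability Index (PSI) between a reference sample (for example, training data) and a window of live inputs for one numeric feature. This technique is chosen here as an example; the project's actual drift metric is not specified. The function names, bin edges, and the 0.25 alert threshold are assumptions, the latter being a commonly quoted rule of thumb rather than a fixed standard.

```python
import math


def histogram_fractions(values, edges):
    """Fraction of values in each bin; edges must be sorted ascending."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        # Bin index = number of edges the value has reached or passed.
        idx = sum(1 for e in edges if v >= e)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(reference, live, edges, eps=1e-6):
    ref = histogram_fractions(reference, edges)
    cur = histogram_fractions(live, edges)
    psi = 0.0
    for r, c in zip(ref, cur):
        # Clamp empty bins so the logarithm stays finite.
        r, c = max(r, eps), max(c, eps)
        psi += (c - r) * math.log(c / r)
    return psi


DRIFT_THRESHOLD = 0.25  # assumed alert level, tune per feature

reference = [i / 100 for i in range(100)]   # synthetic training-time inputs
live = [x + 0.5 for x in reference]          # synthetic shifted live inputs
edges = [0.25, 0.5, 0.75]
drifted = population_stability_index(reference, live, edges) > DRIFT_THRESHOLD
```

The same comparison can be applied to the model's output scores to watch for prediction drift, with one PSI value tracked per feature or output over time.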