Mukund's Portfolio | ML Model Serving API

Overview

A production-ready API for serving machine learning models with automatic request batching, model versioning, and high GPU utilization.

Key Features

Request Batching

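Groups incoming requests into a single batch, bounded by a maximum batch size and a short timeout, so the GPU runs one forward pass for many requests instead of one per request.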

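import asyncio
import numpy as np
import tensorflow as tf
from fastapi import FastAPI

app = FastAPI()
model = tf.keras.models.load_model('model.h5')

class RequestBatcher:
    def __init__(self, max_batch_size=32, timeout=0.01):
        self.queue = []  # pending (input, future) pairs
        self.max_batch_size = max_batch_size
        self.timeout = timeout  # seconds to wait before flushing a partial batch
        self._timer = None

    async def add_request(self, data):
        # Each caller awaits a future that resolves to its own prediction.
        future = asyncio.get_running_loop().create_future()
        self.queue.append((data, future))
        if len(self.queue) >= self.max_batch_size:
            await self.process_batch()
        elif self._timer is None:
            self._timer = asyncio.create_task(self._flush_after_timeout())
        return await future

    async def _flush_after_timeout(self):
        await asyncio.sleep(self.timeout)
        self._timer = None
        await self.process_batch()

    async def process_batch(self):
        # Swap the queue out first so requests arriving during inference start a new batch.
        pending, self.queue = self.queue, []
        if not pending:
            return
        batch = np.array([data for data, _ in pending])
        # model.predict blocks, so run it in a worker thread (Python 3.9+).
        predictions = await asyncio.to_thread(model.predict, batch)
        for (_, future), prediction in zip(pending, predictions):
            future.set_result(prediction)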

Model Versioning

A/B testing between model versions
Gradual rollout with canary deployments
Rollback capability
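
One way a gradual rollout like this could work is weighted routing: each request is assigned to a model version in proportion to a configured integer weight, and a rollback amounts to setting the canary's weight to zero. The sketch below is illustrative only. VersionRouter, the version names "v1" and "v2", and the hash-based assignment are assumptions, not the project's actual implementation.

```python
import hashlib


class VersionRouter:
    """Hypothetical weighted router for canary rollouts between model versions."""

    def __init__(self, weights):
        # weights maps version name -> integer share of traffic, e.g. {"v1": 90, "v2": 10}
        self.weights = dict(weights)

    def choose(self, request_id):
        # Hash the request ID so the same ID always lands on the same version.
        active = [(name, w) for name, w in sorted(self.weights.items()) if w > 0]
        total = sum(w for _, w in active)
        if total == 0:
            raise ValueError("no model version has a positive weight")
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        point = int.from_bytes(digest[:8], "big") % total
        for name, w in active:
            if point < w:
                return name
            point -= w

    def rollback(self, version):
        # Stop sending traffic to a version without removing its entry.
        self.weights[version] = 0


# Send roughly 10% of traffic to the canary, then roll it back.
router = VersionRouter({"v1": 90, "v2": 10})
assignments = [router.choose(f"request-{i}") for i in range(10_000)]
canary_share = assignments.count("v2") / len(assignments)

router.rollback("v2")
after_rollback = {router.choose(f"request-{i}") for i in range(1_000)}
```

Hashing the request ID rather than drawing a random number keeps assignments stable across retries, which makes A/B comparisons easier to interpret.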

Performance

Throughput: 1000 requests/second on a single GPU
Latency: 15ms p50, 30ms p99
GPU Utilization: 85% average
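
Figures such as p50 and p99 are percentiles of the per-request latency distribution. As a sketch of how such numbers can be derived from raw timings, the snippet below uses the nearest-rank method. The function name and the sample values are made up for illustration and are not the project's measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative latencies in milliseconds (not real measurements).
latencies_ms = [12, 14, 15, 15, 16, 13, 18, 22, 29, 35]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

With only a handful of samples the p99 is simply the slowest request, so tail percentiles are only meaningful over a large window of traffic.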

Monitoring

Model drift detection
Prediction latency monitoring
Input distribution tracking
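
One common way to track input distributions and flag drift is the Population Stability Index (PSI), which compares the binned distribution of a feature in a reference sample against recent traffic. The sketch below is an assumption about how such a check might look, not the project's monitoring code. The function name, the bin count, and the 0.2 alert threshold (a widely used rule of thumb) are illustrative.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a recent sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant reference data

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data for illustration only.
reference = [i / 100 for i in range(1000)]    # spread evenly over [0, 10)
same = [i / 50 for i in range(500)]           # same range, fewer samples
shifted = [5 + i / 200 for i in range(1000)]  # concentrated on [5, 10)

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
drift_detected = psi_shifted > 0.2
```

Binning on the reference sample's range keeps the bin edges fixed over time, so PSI values from different days stay comparable.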