Back to Projects
Data Engineering

Distributed Log Analyzer .

Kafka Python ClickHouse
Distributed Log Analyzer

Problem Statement

Processing millions of log events per minute from microservices, extracting insights, and detecting anomalies in real-time.

Architecture

App Servers → Kafka → Stream Processor → ClickHouse → Grafana

Components

  1. Producers: Apps send structured logs to Kafka topics
  2. Stream Processor: Python consumers aggregate and enrich data
  3. Storage: ClickHouse for column-oriented analytics
  4. Visualization: Grafana dashboards for monitoring
from kafka import KafkaConsumer
import clickhouse_driver

consumer = KafkaConsumer('app-logs', bootstrap_servers=['localhost:9092'])
client = clickhouse_driver.Client('localhost')

for message in consumer:
    log = json.loads(message.value)
    
    # Extract metrics
    if log['level'] == 'ERROR':
        client.execute(
            'INSERT INTO error_logs VALUES',
            [(log['timestamp'], log['service'], log['message'])]
        )

Features

  • Real-time Processing: Sub-second latency from log to dashboard
  • Anomaly Detection: Statistical models flag unusual patterns
  • Alerting: PagerDuty integration for critical errors
  • Retention: 30-day hot storage, 1-year cold storage in S3

Scale

  • Processes 5M events/minute at peak
  • 99.9% delivery guarantee
  • Sub-100ms query latency on ClickHouse