Distributed Systems + ML

Distributed ML Platform

100 sensors, three models, one stream, no batch. Sub-300ms end to end, on a pipeline that sustains 700+ msg/s.

Year2025

RoleDistributed Systems + ML

ScopeStream Processing, Distributed ML, Real-time Analytics

DeviceData Platform

ToolsApache Kafka, Apache Spark, Elasticsearch, Kibana, Docker, PySpark MLlib

Linksgithub.com/zaineli/distributed-ml-platform Project report (PDF)

01Context

Sensor data, end to end, in real time.

A distributed machine-learning inference and analytics platform for real-time IoT sensor data, built for NUST's Parallel & Distributed Computing course with Ali Abbas. Data is ingested, processed, scored by ML models, and visualized, all as a streaming pipeline.

It combines Apache Kafka for ingestion, Apache Spark Structured Streaming for distributed processing and inference, three MLlib models for analytics, and Elasticsearch + Kibana for live dashboards.

02The Problem

Volume, velocity, and the need for answers now.

IoT sensor networks generate more data, faster, than traditional systems handle: millions of points per second, real-time latency requirements, heterogeneous formats, and a need to turn it all into actionable insight. The architecture has to be horizontally scalable and fault-tolerant from the start.

03Architecture

Kafka → Spark → Elasticsearch → Kibana.

Sensors publish to Kafka, which partitions and replicates for throughput and fault tolerance. Spark Structured Streaming consumes the topics and runs three models in parallel across worker nodes. Scored records land in Elasticsearch (sharded for parallel indexing) and surface in Kibana dashboards. The whole stack is containerized with Docker Compose.

Three models, one stream

StatusFlagPredictor: a Random Forest classifier for operational status (95.2%, <10ms).
TemperatureAnomalyDetector: Logistic Regression for anomalies (92.1%, <5ms).
HumidityRegressor: a Gradient-Boosted Tree regressor (RMSE 2.3, <15ms).

Parallel and fault-tolerant by design

Data parallelism via Kafka partitioning, Spark RDDs, and Elasticsearch sharding.
Reliability via Kafka replication, Spark checkpointing, and Elasticsearch replicas.

Distributed ML Platform conceptual overview: collection, streaming, analytics, storage, and visualization layers — The conceptual stack: collection → streaming → parallel analytics → storage → visualization, with fault tolerance and horizontal scaling at every layer.

Throughput vs sensor-count chart — Throughput scales with sensor count: Kafka ingestion to ~1,000 msg/s; the full ML pipeline sustains ~700 msg/s.

04Performance

What it sustained.

Deployed on a 4-node cluster (8 cores, 16GB RAM each) with 100 simulated sensors at 1 message/second, the platform held linear scalability and sub-second latency end to end.

Full pipeline

700+ msg/s

End-to-end latency

< 300ms

Status accuracy

95.2%

Anomaly accuracy

92.1%

Inference

< 15ms

Cluster

4 nodes · 32 cores

05What Shipped

The platform

Streaming pipeline

Kafka ingestion into Spark Structured Streaming with checkpointing.

Parallel ML inference

Three MLlib models scored on every record across worker nodes.

Searchable store

Elasticsearch with sharding and replicas for fast, redundant indexing.

Live dashboards

Kibana views for sensor overview, anomalies, and alert history.

Live Kibana dashboard with sensor map, throughput timeline, and anomaly stream — The live Kibana dashboard: sensor map, throughput over time, and the anomaly/alert stream, updating as scored records land in Elasticsearch.

06What I Learned

Portable lessons

01Kafka partition sizing decides your throughput ceiling.

02Model complexity is a latency budget. Spend it deliberately.

03Make the pipeline fault-tolerant before you make it fast.