
Real-Time Enterprise Data Platform

End-to-end streaming platform on Apache Kafka and Azure Event Hubs, processing 50M+ events per day across distributed microservices with sub-100ms end-to-end latency.

  • 50M+ events processed daily
  • Sub-100ms p99 latency
  • 12 source systems integrated
  • Apache Kafka
  • Azure Event Hubs
  • Apache Spark
  • Delta Lake
  • dbt
  • Azure

The Problem

A large UK retail group had twelve operational systems — POS, inventory, loyalty, e-commerce, fulfilment, and others — each maintaining its own data store, with no real-time integration between them. Inventory allocation decisions were made on data that was up to four hours old. The merchandising team had no visibility into live sales velocity. Fraud detection was running on previous-day snapshots.

The existing data warehouse was batch-loaded nightly via a combination of SFTP file drops and JDBC extracts. It worked as a reporting system, but it couldn't support the operational use cases the business needed to compete.

The ask was a platform that could ingest events from all twelve systems in real time, make the data available for operational queries within seconds, and serve as the foundation for ML feature pipelines and real-time analytics.

Architecture

The platform follows a streaming-first medallion architecture with three tiers:

Bronze — raw, immutable events from all source systems, landed in Delta Lake on ADLS Gen2 via Kafka-to-Delta connectors. Nothing is ever deleted from Bronze; it provides full replay capability.

Silver — cleaned, deduplicated, and validated records with schema enforcement. Maintained by Spark Structured Streaming jobs running continuously on Azure Databricks, with Delta Live Tables managing quality constraints and pipeline dependencies.

Gold — business-domain aggregations (inventory positions, sales velocity by SKU, loyalty redemption rates) updated on sub-minute cadences for operational consumers.

Source Systems (12)
  → Debezium CDC / Kafka Producers
  → Azure Event Hubs (Kafka-compatible)
  → Bronze Delta Tables (ADLS Gen2)
  → Spark Structured Streaming (Databricks)
  → Silver Delta Tables
  → Delta Live Tables (Gold aggregations)
  → Serving Layer (Databricks SQL / API)
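The Bronze-to-Silver step above boils down to deduplication plus a required-field check. A minimal pure-Python sketch of those rules (the real pipeline runs them as Spark Structured Streaming transformations; the field names here are illustrative, not the platform's actual schema):

```python
# Illustrative sketch of Silver-layer cleaning rules. In production these run
# as Spark Structured Streaming transformations; field names are hypothetical.

REQUIRED_FIELDS = {"event_id", "source_system", "event_time", "payload"}

def to_silver(bronze_events):
    """Deduplicate by event_id (keeping the latest event_time) and drop
    records that fail the required-field check."""
    latest = {}
    for event in bronze_events:
        if not REQUIRED_FIELDS <= event.keys():
            continue  # the real pipeline quarantines these; dropped here
        key = event["event_id"]
        if key not in latest or event["event_time"] > latest[key]["event_time"]:
            latest[key] = event
    return sorted(latest.values(), key=lambda e: e["event_time"])
```

In the actual pipeline, Delta Live Tables expresses the same constraints declaratively as expectations, so failed records can be quarantined rather than silently dropped.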

Change Data Capture

For the six systems with relational backends (SQL Server, PostgreSQL), I deployed Debezium via Azure Container Apps. Debezium captures every row-level change as a structured event on a Kafka topic, with exactly-once semantics via Kafka transactions.

This replaced the previous pattern of scheduled JDBC extracts, which were fragile (full-table scans on production databases), slow (hour-long extraction windows), and incomplete (they missed deletes entirely).
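A Debezium change event carries the operation type plus before/after row images, which is exactly what makes deletes visible to downstream consumers. A sketch of applying such events to an in-memory table (the envelope fields follow Debezium's documented shape; the `id` key and row contents are illustrative):

```python
def apply_change(state, change):
    """Apply one Debezium-style change event to a dict keyed by row id.
    'op' is c(reate), u(pdate), d(elete), or r(ead/snapshot);
    'before'/'after' hold the row images. Row shape is illustrative."""
    op = change["op"]
    if op in ("c", "u", "r"):
        row = change["after"]
        state[row["id"]] = row
    elif op == "d":
        # Deletes carry only a 'before' image -- the case the old
        # JDBC extracts missed entirely.
        state.pop(change["before"]["id"], None)
    return state
```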

Event Schema Management

All events are validated against schemas registered in a Confluent Schema Registry (hosted on Azure). Producers are schema-validated at publish time; consumers are protected from breaking changes by schema evolution rules. Introducing a new event type requires a schema review, which we made lightweight — a pull request against the schema registry repo, reviewed by the platform team.
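The core of the evolution rules is backward compatibility: a consumer on the new schema must still be able to read events written with the old one, so any newly added field needs a default. A simplified sketch of that check (a real registry checks far more — types, unions, aliases):

```python
def is_backward_compatible(old_schema, new_schema):
    """Simplified BACKWARD-compatibility check for record schemas:
    every field added in new_schema must carry a default value so that
    consumers on new_schema can read data written with old_schema.
    Schema dicts here mimic Avro's {'fields': [{'name', 'default'?}]} shape."""
    old_names = {f["name"] for f in old_schema["fields"]}
    return all(
        f["name"] in old_names or "default" in f
        for f in new_schema["fields"]
    )
```

The pull-request review gate sits on top of exactly this kind of automated check: the registry rejects incompatible schemas, and the human review covers naming and semantics.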

Latency Profile

End-to-end latency from a source system event to a queryable Gold record:

  • Median: 45ms
  • p99: 87ms
  • p99.9: 340ms (typically during Databricks autoscale events)

Results

  • 12 source systems integrated into a single streaming platform, replacing fragmented batch pipelines and point-to-point integrations
  • Inventory allocation accuracy improved by 23% in the first quarter after go-live, measured against out-of-stock rates and markdown frequency
  • Fraud detection model retrained daily on fresh features rather than previous-day snapshots; false positive rate dropped by 18%
  • Merchandising team self-serve adoption: 40+ analysts now query Gold tables directly via Databricks SQL, up from a handful who had access to the legacy warehouse

Key Technical Decisions

Azure Event Hubs over self-managed Kafka — the team didn't have dedicated Kafka platform engineers, and the operational overhead of self-managing a Kafka cluster on Azure VMs was unjustifiable. Event Hubs provides Kafka protocol compatibility with managed operations; the tradeoff is lower configurability for advanced Kafka features, none of which we needed.
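The Kafka compatibility is wire-level: a standard Kafka client pointed at the Event Hubs endpoint with SASL PLAIN auth, using the literal username `$ConnectionString` and the namespace connection string as the password. A sketch of the client settings under that documented pattern (the namespace name is a placeholder; key names may need adapting to your client library):

```python
def event_hubs_kafka_config(namespace, connection_string):
    """Kafka client settings for Azure Event Hubs' Kafka-compatible endpoint.
    Per Azure's documented pattern: SASL_SSL on port 9093, SASL PLAIN with
    the literal username '$ConnectionString', and the namespace connection
    string as the password. Namespace name is a placeholder."""
    return {
        "bootstrap_servers": f"{namespace}.servicebus.windows.net:9093",
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        "sasl_plain_username": "$ConnectionString",
        "sasl_plain_password": connection_string,
    }
```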

Delta Lake over Parquet — ACID transactions are essential when streaming writers and batch readers share the same tables. Delta's optimistic concurrency model handles this cleanly; raw Parquet would have required hand-rolled reader/writer coordination, adding significant operational complexity.

dbt for Gold layer — while the Bronze-to-Silver path is pure Spark Structured Streaming, the Silver-to-Gold aggregations are modelled in dbt running on Databricks SQL. This gives the analytics team ownership of Gold definitions using familiar SQL tooling, without requiring them to write Spark code.
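A Gold model in this setup is an ordinary dbt SQL file selecting from a Silver table. A minimal sketch — the model, table, and column names here are assumptions for illustration, not the platform's actual schema:

```sql
-- models/gold/sales_velocity_by_sku.sql (illustrative names throughout)
{{ config(materialized='table') }}

select
    sku,
    date_trunc('hour', event_time) as sale_hour,
    count(*)                       as units_sold,
    sum(net_amount)                as revenue
from {{ ref('silver_sales_events') }}
group by 1, 2
```

Because the model is plain SQL plus `ref()`, analysts can change an aggregation definition through a normal pull request, with dbt handling the dependency graph and Databricks SQL handling execution.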