Why Your Data Lakehouse Needs a Streaming-First Design

Batch pipelines still dominate most lakehouses, but the cost of that choice compounds over time. Here's the architectural case for building streaming-first from day one.

  • Data Architecture
  • Apache Kafka
  • Streaming
  • Azure

Most lakehouses are built batch-first. Ingestion jobs run every hour, every fifteen minutes, sometimes every five. Dashboards show "near real-time" data. Stakeholders accept a lag they've been told is unavoidable.

It isn't.

The assumption that batch is simpler, cheaper, or more reliable than streaming is one of the most expensive beliefs in enterprise data engineering. And it becomes more expensive every year you carry it.

The Real Cost of Batch Lag

When your ingestion layer is batch-based, every downstream system inherits that latency. A fraud detection model running on hourly snapshots is making decisions on hour-old signals. An inventory allocation system refreshing every fifteen minutes creates arbitrage windows your competitors exploit.

More insidiously, batch lag hides itself. Teams build workflows around known delays. Analysts learn not to trust the last partition. Eventually, "the data is always a bit behind" becomes institutional knowledge rather than a solvable engineering problem.

The compounding cost isn't just latency — it's the operational complexity of managing freshness guarantees across a system built on fundamentally asynchronous batch semantics.
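The staleness cost is easy to reason about concretely. A quick sketch, with hypothetical interval and runtime values, of how old the "freshest" visible data actually is under a periodic batch schedule:

```python
# Illustrative staleness arithmetic for a periodic batch pipeline.
# The interval and runtime values below are hypothetical.

def staleness_minutes(interval: float, runtime: float) -> tuple[float, float]:
    """Return (average, worst-case) age in minutes of the freshest queryable data.

    A record landing just after a run starts must wait a full interval for
    the next run, plus that run's duration, before it becomes queryable.
    """
    worst = interval + runtime
    average = interval / 2 + runtime
    return average, worst

avg, worst = staleness_minutes(interval=60, runtime=10)
print(avg, worst)  # 40.0 70 -- an "hourly" pipeline serves data 40 to 70 minutes old
```

The point of the arithmetic: an "hourly" pipeline never serves hour-fresh data, and shrinking the interval only shrinks the first term, not the run-duration term.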

What Streaming-First Actually Means

Streaming-first doesn't mean abandoning batch entirely. It means designing your architecture so that streams are the primary interface between systems, and batch is a derived pattern applied deliberately.

In practice, this looks like:

  • Change Data Capture (CDC) at the database tier using Debezium, publishing every row mutation to Kafka topics before any downstream system sees the change
  • Event-driven ingestion via Kafka producers, where applications emit structured events rather than expecting periodic polling
  • Incremental processing using Apache Spark Structured Streaming or Flink, maintaining stateful aggregations in near real-time rather than recomputing from scratch on a schedule
  • Micro-batch serving layers that update Silver and Gold Delta tables continuously rather than in scheduled windows
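To make the CDC bullet concrete, here is a minimal sketch of materializing Debezium-style change events into a keyed table. The envelope shape (`op`, `before`, `after`) follows Debezium's conventions; the in-memory dict is a simplified stand-in for a Bronze Delta table, and the event payloads are hypothetical:

```python
# Minimal sketch: materializing Debezium-style CDC events into a keyed table.
# The envelope fields (op / before / after) follow Debezium's conventions;
# the in-memory dict stands in for a Bronze-layer table.

def apply_cdc(table: dict, event: dict, key: str = "id") -> None:
    op = event["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update: upsert the new row image
        row = event["after"]
        table[row[key]] = row
    elif op == "d":                # delete: drop the row identified by the old image
        table.pop(event["before"][key], None)

table = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 1, "status": "paid"}, "after": None},
]
for e in events:
    apply_cdc(table, e)
print(table)  # {} -- the row was created, updated, then deleted
```

Because every mutation flows through the same envelope, any downstream consumer can rebuild current state, or any intermediate state, from the topic alone.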

The lakehouse format — Delta Lake, Apache Iceberg — makes this tractable. Both support ACID transactions, schema evolution, and time travel, solving the consistency problems that previously made streaming-to-lake workloads operationally nightmarish.

The Azure Implementation Pattern

On Azure, the stack I've used most effectively for streaming-first lakehouses looks like this:

Azure Event Hubs (Kafka-compatible) 
  → Azure Databricks Structured Streaming 
  → Delta Lake (ADLS Gen2) 
  → Delta Live Tables for medallion orchestration

Event Hubs gives you Kafka protocol compatibility without the operational overhead of self-managed Kafka — critical for teams that don't have dedicated platform engineers. Databricks Structured Streaming handles the stateful transformation layer, and Delta Live Tables manages pipeline quality and dependency ordering declaratively.
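The Kafka protocol compatibility means standard Kafka clients connect with ordinary client configuration. A reference sketch, where the namespace name and connection string are placeholders you substitute for your own:

```properties
# Kafka client settings for the Event Hubs Kafka endpoint (port 9093).
# <namespace> and <event-hubs-connection-string> are placeholders.
bootstrap.servers=<namespace>.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="$ConnectionString" \
  password="<event-hubs-connection-string>";
```

Note the literal string `$ConnectionString` as the SASL username — the namespace connection string itself goes in the password field.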

Where Kafka is already present (common in large financial services environments), the same pattern applies with the Kafka brokers taking the place of Event Hubs.

The Migration Path

"Streaming-first" sounds like a greenfield luxury. In reality, most teams are migrating from existing batch pipelines. The practical migration path I've used successfully:

  1. Introduce CDC alongside existing batch ingestion — run both in parallel, validate parity
  2. Replace the batch ingestion layer once CDC parity is confirmed and monitored
  3. Convert downstream aggregations from batch jobs to streaming incrementals, one at a time
  4. Deprecate the batch scheduler — once dependencies are migrated, the scheduler becomes redundant
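The parity validation in step 1 can be as simple as comparing row counts plus an order-insensitive checksum between the batch-loaded and CDC-materialized tables. A hypothetical sketch over in-memory rows (in practice you'd compute the same fingerprint with an aggregate query on each table):

```python
# Order-insensitive table fingerprint for batch-vs-CDC parity checks.
# The sample rows are hypothetical; real tables would be fingerprinted
# via an equivalent aggregate over each store.
import hashlib

def table_fingerprint(rows):
    """Return (row_count, checksum); XOR of per-row hashes ignores row order."""
    acc = 0
    for row in rows:
        canonical = repr(sorted(row.items())).encode()   # column order also normalized
        acc ^= int.from_bytes(hashlib.sha256(canonical).digest()[:8], "big")
    return len(rows), acc

batch_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
cdc_rows   = [{"id": 2, "amount": 20}, {"id": 1, "amount": 10}]  # same data, different order

print(table_fingerprint(batch_rows) == table_fingerprint(cdc_rows))  # True
```

Running both pipelines in parallel and alerting on fingerprint divergence gives you an objective cutover criterion rather than a judgment call.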

This approach keeps the platform stable throughout migration and gives business stakeholders continuous access to validated data.

When Batch Still Makes Sense

Not everything benefits from streaming. Historical backfills, large-scale ML feature computation, and end-of-period financial reconciliations are legitimately batch workloads. The point isn't to eliminate batch — it's to stop using it as a default when it isn't the right tool.

Design your architecture so that streaming is the path of least resistance for new workloads, and batch is a deliberate, well-understood exception.

Your stakeholders will notice the difference. And once they've experienced genuinely fresh data, they won't tolerate going back.