Building Real-time Data Pipelines
14 min read

Batch ETL pipelines were the dominant data architecture pattern for two decades. The data engineering industry is now in the middle of an irreversible shift toward stream processing — not because batch processing is fundamentally broken, but because business decisions that used to tolerate T+1 data are increasingly under competitive pressure to require sub-second freshness.

01

Kafka as the Central Nervous System

Apache Kafka has become the de facto standard for event streaming infrastructure at scale. Its key architectural properties — sequential disk writes for high throughput, consumer group semantics for parallel processing, configurable retention for replay capability, and the immutable log as a source of truth — make it uniquely suited as the backbone of a streaming data platform.

At scale, Kafka cluster sizing is a first-principles exercise: message throughput (MB/s), retention duration, replication factor, and consumer lag tolerance together determine broker count, disk configuration, and network bandwidth requirements. Confluent's managed Kafka (Confluent Cloud) and AWS MSK have made this operationally accessible, trading some tunability for no longer requiring deep Kafka operations expertise as a prerequisite.
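The first-principles arithmetic above can be sketched in a few lines. This is a simplified back-of-envelope model, not a sizing tool — the parameter values (40% disk headroom, 10 TB brokers) are illustrative assumptions:

```python
import math

def size_kafka_cluster(
    ingress_mb_s: float,      # producer throughput into the cluster
    retention_hours: float,   # how long messages must be replayable
    replication_factor: int,  # copies of each partition
    broker_disk_tb: float,    # usable disk per broker
    headroom: float = 0.4,    # keep disks well below full (assumption)
) -> dict:
    # Total retained data = ingress * retention * replication factor.
    retained_tb = ingress_mb_s * 3600 * retention_hours * replication_factor / 1e6
    usable_per_broker_tb = broker_disk_tb * (1 - headroom)
    brokers_for_disk = math.ceil(retained_tb / usable_per_broker_tb)
    # Replication multiplies intra-cluster network traffic.
    network_mb_s = ingress_mb_s * replication_factor
    return {
        "retained_tb": round(retained_tb, 1),
        # Never fewer brokers than the replication factor requires.
        "min_brokers": max(brokers_for_disk, replication_factor),
        "cluster_network_mb_s": network_mb_s,
    }

# 100 MB/s ingress, 7-day retention, RF=3, 10 TB disks per broker:
print(size_kafka_cluster(100, 168, 3, 10.0))
```

Note how retention dominates: at 100 MB/s and a week of replay capability, disk — not CPU — sets the broker count.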

02

Flink: The Stream Processing Engine

Apache Flink occupies the stream processing tier — consuming from Kafka, applying stateful transformations, and producing results to downstream sinks. Flink's defining capability is exactly-once processing semantics with stateful operators that can maintain rolling aggregations, join streams with temporal windows, and handle late-arriving events through watermark-based time management.
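To make the watermark mechanics concrete, here is a minimal pure-Python sketch (not Flink code) of a tumbling-window count: the watermark trails the maximum event time seen by a fixed lateness bound, late events inside the bound still land in their window, and a window is emitted only once the watermark passes its end. Window size and bound are arbitrary illustrative values:

```python
from collections import defaultdict

WINDOW = 10          # tumbling window size, in seconds of event time
LATENESS_BOUND = 5   # watermark = max event time seen - this bound

def windowed_counts(events):
    """events: iterable of (event_time, key); yields (window_start, key, count)."""
    windows = defaultdict(int)     # (window_start, key) -> count
    max_ts = float("-inf")
    for ts, key in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - LATENESS_BOUND
        start = (ts // WINDOW) * WINDOW
        if start + WINDOW > watermark:
            windows[(start, key)] += 1   # window still open: accept event
        # else: too late, dropped (Flink would offer a side output instead)
        # Emit and evict every window the watermark has now passed.
        for w_start, k in [w for w in windows if w[0] + WINDOW <= watermark]:
            yield (w_start, k, windows.pop((w_start, k)))

stream = [(1, "a"), (4, "a"), (12, "b"), (7, "a"), (16, "b"), (21, "a")]
print(list(windowed_counts(stream)))
```

The out-of-order event `(7, "a")` arrives after `(12, "b")` but is still counted in the `[0, 10)` window, because the watermark had only advanced to 7 — exactly the late-arrival tolerance that watermarks buy you.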

The Flink programming model — DataStream API for procedural code, Table API and SQL for declarative transformations — accommodates both data engineers and data analysts. Flink SQL in particular has made real-time analytics accessible to teams without deep Java/Scala expertise, enabling complex streaming joins and aggregations with familiar SQL syntax.
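As a flavor of the declarative style, a windowed aggregation over a Kafka-backed table can be expressed with Flink SQL's windowing table-valued functions. The table and column names below are hypothetical:

```sql
-- Count clicks per country in 1-minute tumbling windows of event time.
-- `clicks` is assumed to be a table with an event_time watermark defined.
SELECT window_start, country, COUNT(*) AS clicks
FROM TABLE(
  TUMBLE(TABLE clicks, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, country;
```

The same statement runs continuously over the stream, emitting one row per window and country as watermarks close each window — no Java or Scala required.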

03

Operational Patterns for Production Pipelines

Production streaming pipelines introduce operational challenges that do not exist in batch workflows. State management — Flink's RocksDB state backend can grow to hundreds of gigabytes for long-window aggregations — requires capacity planning and tuning of incremental checkpointing. Schema evolution, handled through Confluent Schema Registry or AWS Glue Schema Registry, must accommodate producer and consumer versioning simultaneously.
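The core rule a schema registry enforces for backward compatibility can be sketched in a toy form: a new consumer schema can still read old records only if every field it adds carries a default. The dict-based schema shape here is a simplification for illustration, not the Avro specification:

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """fields: name -> {"type": str, "default": optional value}.

    Toy backward-compatibility rule: the new (reader) schema may add
    fields only with defaults, and may not change a field's type.
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                return False   # new required field: old records can't decode
        elif old_fields[name]["type"] != spec["type"]:
            return False       # type change (no promotion in this toy rule)
    return True

v1 = {"user_id": {"type": "long"}, "url": {"type": "string"}}
v2_ok = dict(v1, referrer={"type": "string", "default": ""})   # safe addition
v2_bad = dict(v1, referrer={"type": "string"})                 # no default

print(backward_compatible(v1, v2_ok))   # compatible
print(backward_compatible(v1, v2_bad))  # would break old consumers' data
```

A real registry checks the full Avro/Protobuf/JSON Schema resolution rules (type promotion, aliases, unions), but the failure mode it guards against is exactly this one.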

Monitoring a streaming pipeline requires different mental models. Lag (consumer position relative to latest offset), processing latency, checkpoint duration, and state backend size are the primary health indicators. Setting alert thresholds and runbooks for each is essential before a pipeline reaches production.
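The lag computation behind the primary health indicator is simple enough to state directly: per-partition lag is the log-end offset minus the committed consumer offset. A minimal sketch, with illustrative (not recommended) alert thresholds:

```python
# Illustrative thresholds -- real values depend on topic throughput and SLOs.
LAG_WARN = 10_000
LAG_CRIT = 100_000

def lag_status(end_offsets: dict, committed: dict) -> tuple:
    """Both dicts map partition -> offset. Returns (total_lag, level)."""
    # Per-partition lag = latest offset in the log - consumer's position.
    total = sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)
    if total >= LAG_CRIT:
        return total, "critical"
    if total >= LAG_WARN:
        return total, "warning"
    return total, "ok"

print(lag_status({0: 5_000, 1: 4_000}, {0: 4_990, 1: 3_985}))       # healthy
print(lag_status({0: 120_000, 1: 80_000}, {0: 50_000, 1: 40_000}))  # paging
```

In practice you would feed this from the consumer group's committed offsets and the brokers' log-end offsets, and alert on lag *trend* (growing vs. stable) as much as on the absolute value.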

04

The Lambda Architecture Is Dead; Long Live Kappa

Jay Kreps' Kappa Architecture — running a single stream processing system for both real-time and historical reprocessing — has largely superseded the Lambda Architecture's awkward dual-path batch+stream design. With Kafka's long retention and Flink's ability to replay from committed offsets, historical backfill and real-time processing can run through the same code path.
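The "same code path" claim is the whole point of Kappa, and it reduces to this: one transformation, two starting offsets. A toy sketch where a plain list stands in for a Kafka topic partition:

```python
def process(records):
    """The single processing path: a running sum per key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals

# The retained log -- in Kafka, a topic partition with long retention.
log = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

backfill = process(log)      # historical reprocessing: replay from offset 0
live = process(log[3:])      # real-time: resume from a committed offset

print(backfill)  # {'a': 9, 'b': 6}
print(live)      # {'b': 4, 'a': 5}
```

There is no second batch codebase to keep in sync — backfill is just the same job pointed at an earlier offset, which is what Kafka retention plus Flink's replay-from-offset makes possible.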

Iceberg and Delta Lake, combined with streaming writers, provide the ACID transactional layer that enables upserts and deletes in the streaming data lake — closing the last capability gap that kept some use cases on Lambda Architecture designs.

Key Takeaway

"Real-time data pipelines built on Kafka and Flink are no longer bleeding-edge infrastructure — they are production-proven technology running at the world's largest data organizations. The investment in stream processing fundamentals pays dividends across analytics latency, operational responsiveness, and eventually cost, as expensive batch recomputation is replaced by incremental stream updates."

Topics

Apache Kafka · Apache Flink · Streaming · Data Engineering · Real-time