Every data team starts with batch processing. You extract data overnight, transform it in the morning, and load it into a warehouse by noon. It works until the business needs fresher data — and the gap between what batch can deliver and what the business requires is widening every year.
Real-time data pipelines are not just faster versions of batch pipelines. They require fundamentally different architecture, different failure modes, and different operational practices. The technology is mature — Kafka, Flink, and their ecosystems are production-proven at massive scale — but the engineering practices around them are still catching up.
This article covers what we have learned building and operating real-time data pipelines for enterprises processing millions of events per second. Not the getting-started tutorial — the operational patterns that determine whether your pipeline survives contact with production.
The Streaming Architecture Stack
A production streaming architecture has four layers: ingestion, transport, processing, and serving. Each layer has distinct technology choices and operational concerns. Getting the boundaries between these layers right is the most important architectural decision you will make.
Ingestion: Getting Data In
Ingestion is the boundary between your source systems and your streaming infrastructure. For database change capture, Debezium is the standard — it reads the database transaction log and produces change events to Kafka with minimal impact on the source database. For application events, producers write directly to Kafka topics. For third-party integrations, Kafka Connect provides a framework for connectors that handle offset management, error recovery, and schema evolution automatically.
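A minimal Debezium source connector registration, as it might be submitted to the Kafka Connect REST API, looks like the sketch below. The hostnames, credentials path, and table names are hypothetical placeholders, and the exact property set varies by Debezium version (this sketch assumes the 2.x property names, e.g. `topic.prefix`):

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db.internal",
    "database.port": "5432",
    "database.user": "cdc_reader",
    "database.password": "${file:/etc/kafka/secrets.properties:cdc_password}",
    "database.dbname": "orders",
    "topic.prefix": "orders",
    "table.include.list": "public.orders,public.order_items",
    "snapshot.mode": "initial"
  }
}
```

The password is resolved through a Kafka Connect config provider rather than stored inline, which keeps credentials out of the connector definition.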
The critical design decision at the ingestion layer is schema management. Every event must have a well-defined schema, and that schema must evolve without breaking downstream consumers. We use the Confluent Schema Registry with Avro encoding for all production topics. The registry enforces backward and forward compatibility, which means producers can add fields and consumers can tolerate both old and new schemas. Without schema enforcement, your streaming pipeline becomes a conveyor belt for chaos.
// Avro schema with evolution support
{
  "type": "record",
  "name": "OrderEvent",
  "namespace": "com.example.events",
  "fields": [
    { "name": "order_id", "type": "string" },
    { "name": "customer_id", "type": "string" },
    { "name": "total_amount", "type": "double" },
    { "name": "currency", "type": "string", "default": "USD" },
    { "name": "items", "type": { "type": "array", "items": "string" } },
    { "name": "created_at", "type": { "type": "long", "logicalType": "timestamp-millis" } },
    // New field with a default value, so the change is backward compatible
    { "name": "source_region", "type": "string", "default": "us-east-1" }
  ]
}

Transport: Kafka as the Central Nervous System
Kafka is the durable, ordered, partitioned log that connects everything. In our architecture, Kafka is not just a message bus — it is the system of record for all events. With retention set to 7–30 days (depending on the topic), consumers can reprocess events, recover from failures, and build new derived views without touching the source systems. Topic design is critical: partition by the key that your consumers need to process in order (typically entity ID), and size partitions for your expected throughput with headroom for spikes.
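Ordering guarantees follow directly from how keys map to partitions: Kafka's default partitioner hashes the serialized key (with murmur2) and takes the result modulo the partition count, so every event with the same key lands on the same partition. A simplified sketch of that idea, using the JDK's string hash purely for illustration:

```java
public class PartitionSketch {
    // Simplified stand-in for Kafka's default partitioner, which applies
    // murmur2 to the serialized key bytes and takes the result modulo the
    // partition count. The JDK String hash here is for illustration only.
    public static int partitionFor(String key, int numPartitions) {
        int hash = key.hashCode();
        // Mask off the sign bit so the modulo result is non-negative
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition,
        // which preserves per-entity ordering
        System.out.println(partitionFor("customer-42", 12));
        System.out.println(partitionFor("customer-42", 12));
    }
}
```

One consequence worth noting: because the mapping depends on the partition count, adding partitions to an existing topic changes where keys land and breaks per-key ordering across the resize.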
Stream Processing with Flink
Apache Flink is our default stream processing engine for workloads that require stateful computation, windowing, or complex event processing. For simple transformations and routing, Kafka Streams is sufficient and operationally simpler. But for anything involving time windows, aggregations, joins across streams, or exactly-once processing guarantees, Flink is the battle-tested choice.
Exactly-Once Semantics: The Hard Part
Exactly-once processing is the most misunderstood concept in stream processing. It does not mean every event is processed exactly once — it means the observable side effects of processing are consistent with having processed every event exactly once. The implementation relies on transactional coordination between Kafka consumer offsets, Flink’s checkpoint state, and the downstream sink. When a failure occurs, Flink restores from the last checkpoint, reprocesses events from the stored Kafka offsets, and produces output that is indistinguishable from uninterrupted processing.
// Flink exactly-once configuration
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

// Enable checkpointing every 30 seconds
env.enableCheckpointing(30_000);
env.getCheckpointConfig()
    .setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig()
    .setMinPauseBetweenCheckpoints(10_000);
env.getCheckpointConfig()
    .setCheckpointTimeout(120_000);
env.getCheckpointConfig()
    .setMaxConcurrentCheckpoints(1);

// Configure state backend for large state
env.setStateBackend(new EmbeddedRocksDBStateBackend());
env.getCheckpointConfig()
    .setCheckpointStorage("s3://checkpoints/pipeline-v2/");

Windowing Strategies
Time-based windowing in streaming systems is deceptively complex because events arrive out of order. A sale that happened at 11:59 PM might arrive at 12:03 AM — does it belong in yesterday’s window or today’s? Flink solves this with watermarks: markers that flow through the stream and assert that no events with earlier timestamps are expected, which tells windows when it is safe to close. We configure watermarks based on the observed lateness distribution for each source. For most pipelines, a 5-minute watermark delay handles 99.9% of late events. Events that arrive after the watermark are routed to a side output for manual reconciliation rather than silently dropped.
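The mechanics can be modeled in a few lines. This is an illustrative sketch of bounded-out-of-orderness watermarking, not Flink's actual API: the watermark trails the largest event timestamp seen by a fixed bound, and anything older than the current watermark is diverted to a late-event list standing in for a side output.

```java
import java.util.ArrayList;
import java.util.List;

public class WatermarkSketch {
    // Illustrative model of bounded-out-of-orderness watermarking.
    // The watermark trails the largest event timestamp seen so far by a
    // fixed bound; events older than the watermark are treated as late.
    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;
    final List<Long> onTime = new ArrayList<>();
    final List<Long> late = new ArrayList<>();   // stand-in for a side output

    WatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    long currentWatermark() {
        return maxTimestampSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestampSeen - maxOutOfOrdernessMs;
    }

    void accept(long eventTimestampMs) {
        if (eventTimestampMs < currentWatermark()) {
            late.add(eventTimestampMs);          // would go to the side output
        } else {
            onTime.add(eventTimestampMs);
            maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        }
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(300_000); // 5-minute bound
        w.accept(1_000_000);
        w.accept(2_000_000);  // advances the watermark to 1_700_000
        w.accept(1_600_000);  // older than the watermark: late
        System.out.println("on time: " + w.onTime.size() + ", late: " + w.late.size());
    }
}
```

The trade-off is visible in the constructor argument: a larger bound catches more late events but delays every window's results by that amount.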
Backpressure Management
Backpressure occurs when a downstream operator cannot keep up with the upstream production rate. In Flink, backpressure propagates naturally through the network buffers — a slow sink causes upstream operators to slow down automatically. This is a feature, not a bug. The operational concern is detecting sustained backpressure before it causes checkpoint timeouts. We monitor the backpressure ratio on every operator and alert when it exceeds 50% for more than 5 minutes. The usual fixes are increasing parallelism for the bottleneck operator, optimizing the sink batch size, or provisioning more resources for the downstream system.
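The "propagates naturally" behavior is just bounded buffering between stages. A self-contained analogy (not Flink code) using a bounded queue: when the consumer is slow, the producer's `put()` blocks on the full buffer, so the fast stage is automatically paced to the slow one.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackpressureSketch {
    // Runs a fast producer against a deliberately slow consumer through a
    // small bounded buffer, and returns how long the producer side took.
    static long runDemo(int events, int capacity) throws InterruptedException {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < events; i++) {
                    buffer.take();
                    Thread.sleep(1); // slow sink
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        long start = System.nanoTime();
        for (int i = 0; i < events; i++) {
            buffer.put(i); // blocks when the buffer is full: backpressure
        }
        consumer.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // The producer could finish in microseconds, but the bounded buffer
        // forces it to run at roughly the consumer's rate instead.
        System.out.println("producer paced to ~" + runDemo(100, 10) + " ms");
    }
}
```

Flink's network buffers play the role of the bounded queue here, which is why a slow sink throttles everything upstream without any explicit coordination.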
Monitoring and Operational Patterns
Streaming pipelines that run 24/7 require monitoring that is both comprehensive and actionable. We organize our monitoring into three tiers: infrastructure metrics (broker health, disk usage, network throughput), pipeline metrics (consumer lag, processing latency, error rates), and business metrics (events processed per second, end-to-end latency for key workflows, data freshness SLAs).
Consumer lag is the single most important metric. It measures the gap between the latest event produced and the latest event consumed. A growing lag means your consumer is falling behind — and if it grows faster than it shrinks, you will eventually hit retention limits and start losing data. We alert on lag growth rate rather than absolute lag, because a brief spike during a traffic burst is normal and self-correcting, while sustained growth indicates a systemic problem.
# Critical streaming pipeline alerts
groups:
  - name: streaming-pipeline
    rules:
      - alert: ConsumerLagGrowing
        expr: |
          deriv(kafka_consumer_group_lag[10m]) > 100
        for: 15m
        labels:
          severity: warning
      - alert: CheckpointFailing
        expr: |
          flink_job_lastCheckpointDuration > 120000
            or flink_job_numberOfFailedCheckpoints > 0
        for: 5m
        labels:
          severity: critical
      - alert: BackpressureSustained
        expr: |
          flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500
        for: 5m
        labels:
          severity: warning
      - alert: EndToEndLatencyHigh
        expr: |
          histogram_quantile(0.99, rate(pipeline_e2e_latency_seconds_bucket[5m])) > 30
        for: 10m
        labels:
          severity: critical

Dead letter queues deserve special attention. Every stage of the pipeline that can fail should route failed events to a dead letter topic with the original event, the error message, the timestamp, and the pipeline stage that failed. Without DLQ infrastructure, failed events are silently dropped, and you discover the data loss days later when a dashboard shows incorrect numbers. We run automated reconciliation jobs that compare counts between source topics and sink destinations, alerting on discrepancies above 0.01%.
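The envelope written to the dead letter topic matters as much as the topic itself. A minimal sketch of one possible shape — the field names here are assumptions, not a standard format — carrying everything needed to diagnose and replay the failure:

```java
import java.time.Instant;

public class DeadLetterSketch {
    // Illustrative dead-letter envelope: the original payload plus enough
    // context to diagnose and replay the failure. Field names are
    // assumptions for this sketch, not a standard format.
    public record DeadLetter(String originalPayload,
                             String errorMessage,
                             String pipelineStage,
                             Instant failedAt) {}

    public static DeadLetter wrap(String payload, Exception error, String stage) {
        return new DeadLetter(payload, error.getMessage(), stage, Instant.now());
    }

    public static void main(String[] args) {
        DeadLetter dl = wrap("{\"order_id\":\"o-1\"}",
                new IllegalArgumentException("unknown currency"),
                "enrich-orders");
        System.out.println(dl.pipelineStage() + ": " + dl.errorMessage());
    }
}
```

In practice the envelope would be serialized with the same schema discipline as any other topic, so that replay tooling can consume the DLQ without special-casing.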
Building for Production, Not for Demos
A real-time data pipeline that works in a demo processes a few thousand events per minute with a single consumer and no failure handling. A production pipeline processes millions of events per second across dozens of partitions, handles schema evolution without downtime, recovers from broker failures automatically, and provides end-to-end latency guarantees that the business can depend on.
The gap between these two is not technology — it is engineering rigor. Schema management, exactly-once semantics, backpressure handling, dead letter queues, consumer lag monitoring, and checkpoint tuning are the unglamorous work that determines whether your streaming architecture is an asset or a liability.
At LockedIn Labs, we build and operate real-time data pipelines for enterprises where data freshness is a competitive advantage. Our data engineering team brings production experience with Kafka clusters processing billions of events per day, Flink jobs running stateful computations across terabytes of state, and the operational practices that keep it all running reliably.