Stream Processing: Real-Time Data Processing at Scale

Stream processing is the continuous computation of data as it arrives, in contrast to batch processing where data is collected over time and processed in bulk. In a world where businesses need real-time dashboards, instant fraud detection, and live recommendations, stream processing has become essential infrastructure. Technologies like Kafka Streams, Apache Flink, and Apache Spark Streaming enable processing millions of events per second with sub-second latency.

Batch vs Stream Processing

| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Data handling | Process bounded dataset at once | Process unbounded stream continuously |
| Latency | Minutes to hours | Milliseconds to seconds |
| Triggering | Scheduled (cron, daily) | Continuous (event arrival) |
| State | Full dataset available | Incremental, windowed views |
| Example tools | Hadoop MapReduce, Spark Batch | Kafka Streams, Flink, Spark Streaming |
| Use case | Monthly reports, ETL jobs | Real-time alerts, live dashboards |
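The "data handling" row is the crux of the table: a batch job sums a complete dataset, while a stream processor maintains a running total and can emit an up-to-date result per event. A minimal framework-free Python sketch of the contrast:

```python
# Batch: the full (bounded) dataset is available before processing starts.
def batch_total(orders):
    return sum(orders)

# Stream: events arrive one at a time; state is updated incrementally
# and a result can be emitted after every event.
def stream_totals(order_stream):
    total = 0.0
    for order in order_stream:
        total += order
        yield total  # emit an up-to-date result per event

orders = [10.0, 25.0, 5.0]
print(batch_total(orders))          # one result, after all data is seen
print(list(stream_totals(orders)))  # one result per arriving event
```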

Kafka Streams

Kafka Streams is a client library for building stream processing applications on top of Apache Kafka. Unlike Flink or Spark Streaming, it does not require a separate cluster — it runs as a regular Java application.

// Kafka Streams: Real-time order value aggregation (Java)
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.*;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-analytics");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

StreamsBuilder builder = new StreamsBuilder();

// Read from orders topic
KStream<String, Order> orders = builder.stream("orders");

// Real-time aggregation: total order value per customer per hour
KTable<Windowed<String>, Double> hourlyTotals = orders
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
    .aggregate(
        () -> 0.0,                              // Initializer
        (key, order, total) -> total + order.getTotal(),  // Aggregator
        Materialized.as("hourly-order-totals")
    );

// Write results to output topic
hourlyTotals.toStream()
    .map((windowed, total) -> KeyValue.pair(
        windowed.key(),
        "Window: " + windowed.window().startTime() + " Total: $" + total
    ))
    .to("order-analytics-output");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

Apache Flink

Apache Flink is a distributed stream processing framework designed for stateful computations over unbounded data streams. It provides exactly-once processing guarantees, sophisticated windowing, and event-time processing.

# Apache Flink: Fraud detection (Python/PyFlink)
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows
from pyflink.common.time import Time

env = StreamExecutionEnvironment.get_execution_environment()

# Read from Kafka (kafka_source is a configured Kafka connector,
# defined elsewhere in the application)
transactions = env.add_source(kafka_source)

# Detect suspicious patterns: more than 3 transactions
# from the same card in a 1-minute window
alerts = (transactions
    .key_by(lambda t: t['card_id'])
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .process(FraudDetectionFunction())
)

# FraudDetectionFunction checks:
# - More than 3 transactions in 1 minute
# - Transactions from different countries within 5 minutes
# - Single transaction exceeding 10x average amount
alerts.add_sink(alert_sink)

env.execute("Fraud Detection Pipeline")

Windowing

Windows are fundamental to stream processing — they define how to group unbounded data into finite chunks for computation.

Window Types

| Window Type | Description | Use Case |
|---|---|---|
| Tumbling | Fixed-size, non-overlapping windows | Hourly aggregations, daily reports |
| Sliding (Hopping) | Fixed-size windows that overlap | Moving averages, trend detection |
| Session | Dynamic windows based on activity gaps | User session analysis, clickstream |
| Global | Single window per key (with custom trigger) | Accumulating counters, batch triggers |

# Windowing examples (conceptual)

# Tumbling window: Count events per 5-minute window
# |--window 1--|--window 2--|--window 3--|
# |  events: 5 |  events: 3 |  events: 8 |
# 10:00       10:05       10:10       10:15

# Sliding window: Count events in 10-min window, sliding every 5 min
# |----window 1----|
#      |----window 2----|
#           |----window 3----|
# An event at 10:07 appears in both window 1 (10:00-10:10)
# and window 2 (10:05-10:15)

# Session window: Group events with gaps less than 5 minutes
# User clicks: 10:00, 10:02, 10:03, [gap], 10:15, 10:17
# Session 1: 10:00-10:03 (3 events)
# Session 2: 10:15-10:17 (2 events)
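The tumbling-window diagram above boils down to a few lines of framework-free Python: each event timestamp is assigned to a window by integer division on the window size. This is a conceptual sketch; real engines additionally handle window closing, triggers, and late data.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_s=300):
    """Count events per fixed, non-overlapping window.

    events: iterable of epoch-second timestamps.
    A timestamp's window start is (ts // window_size_s) * window_size_s.
    """
    counts = defaultdict(int)
    for ts in events:
        window_start = (ts // window_size_s) * window_size_s
        counts[window_start] += 1
    return dict(counts)

# Events at +30s, +60s, +360s relative to 10:00
print(tumbling_window_counts([30, 60, 360]))
# -> two events in the 10:00-10:05 window, one in 10:05-10:10
```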

Event Time vs Processing Time

A critical distinction in stream processing:

  • Event time: When the event actually occurred (embedded in the event payload). Handles out-of-order events correctly.
  • Processing time: When the event is processed by the stream processor. Simpler but inaccurate if events arrive late.

Flink provides first-class support for event-time processing via watermarks — timestamps that indicate "all events before this time have arrived." Late events (arriving after the watermark) can be handled via side outputs or allowed-lateness configuration.
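A common watermark generator (the strategy Flink calls bounded out-of-orderness) can be approximated as "maximum event time seen so far, minus a fixed allowance." A plain-Python sketch, assuming a 5-second out-of-orderness bound:

```python
class BoundedOutOfOrdernessWatermark:
    """Emit watermark = max_event_time - max_out_of_orderness."""

    def __init__(self, max_out_of_orderness_s=5):
        self.bound = max_out_of_orderness_s
        self.max_event_time = float("-inf")

    def on_event(self, event_time):
        # Watermarks only advance; a late event never moves them backward.
        self.max_event_time = max(self.max_event_time, event_time)

    def current_watermark(self):
        return self.max_event_time - self.bound

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_s=5)
for t in [100, 103, 101, 107]:  # event times arriving out of order
    wm.on_event(t)
print(wm.current_watermark())   # 107 - 5 = 102
# An event with event time 101 arriving now is "late" (101 < 102)
# and would go to a side output or be dropped.
```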

Stateful Processing

Many stream processing operations require maintaining state — counts, aggregations, session data, or ML model parameters. Stateful processing is what makes stream processing powerful but also complex.

# Stateful stream processor: Running average
class RunningAverageProcessor:
    def __init__(self):
        self.state = {}  # key -> {sum, count}
    
    def process(self, key, value):
        if key not in self.state:
            self.state[key] = {"sum": 0.0, "count": 0}
        
        self.state[key]["sum"] += value
        self.state[key]["count"] += 1
        
        avg = self.state[key]["sum"] / self.state[key]["count"]
        return {"key": key, "average": avg, "count": self.state[key]["count"]}
    
    def checkpoint(self):
        # Persist state for fault tolerance. save_to_storage is a
        # placeholder for a durable store (e.g. RocksDB, object storage).
        save_to_storage(self.state)

Flink can use RocksDB as its state backend, enabling state larger than available memory. Kafka Streams uses RocksDB-backed state stores with changelog topics for fault tolerance.

Exactly-Once Semantics

Achieving exactly-once processing in stream processing means each input event affects the output exactly once, even in the presence of failures. This requires:

  • Checkpointing: Periodically snapshot the processor state. On failure, restore from the last checkpoint.
  • Idempotent sinks: Output operations must be idempotent so replaying from a checkpoint does not produce duplicates.
  • Transactional writes: In Kafka Streams, consume-process-produce is wrapped in a Kafka transaction.
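The idempotent-sink requirement can be sketched by keying writes on a deterministic event ID, so that replaying events after a checkpoint restore overwrites rather than appends. This is an in-memory illustration; a real sink would use a database upsert or Kafka transactions.

```python
class IdempotentSink:
    """Upsert-style sink: writing the same event ID twice is a no-op."""

    def __init__(self):
        self.store = {}

    def write(self, event_id, value):
        self.store[event_id] = value  # upsert, not append

sink = IdempotentSink()
sink.write("order-42", 100.0)
# Failure -> restore from checkpoint -> the same event is replayed:
sink.write("order-42", 100.0)
print(len(sink.store))  # still one record despite the replay
```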

Stream Processing Framework Comparison

| Feature | Kafka Streams | Apache Flink | Spark Streaming |
|---|---|---|---|
| Deployment | Library (no cluster) | Dedicated cluster | Spark cluster |
| Latency | Milliseconds | Milliseconds | Seconds (micro-batch) |
| Exactly-once | Yes (Kafka only) | Yes (any source/sink) | Yes (limited) |
| State management | RocksDB + changelog | RocksDB + checkpoints | In-memory + checkpoints |
| Data source | Kafka only | Kafka, Kinesis, files, etc. | Kafka, Kinesis, files, etc. |
| Best for | Kafka-native, simple topologies | Complex event processing | Unified batch + stream |

Real-World Use Cases

Real-Time Fraud Detection: Banks process credit card transactions through Flink, checking each transaction against fraud rules (unusual location, rapid succession, amount anomalies) within milliseconds. Suspicious transactions are flagged before they complete.

Live Dashboards: An e-commerce platform uses Kafka Streams to compute real-time metrics — orders per minute, revenue by region, conversion rates — and pushes updates to dashboards via WebSocket every second.

Recommendation Engines: Netflix processes billions of user events (plays, searches, ratings) through stream processing to update recommendation models in near-real-time, ensuring suggestions reflect recent viewing patterns.

Frequently Asked Questions

When should I use stream processing vs batch processing?

Use stream processing when you need real-time results (seconds of latency), when data arrives continuously, or when timeliness directly impacts business value (fraud detection, real-time pricing). Use batch processing for historical analysis, large-scale ETL jobs, or when data completeness matters more than speed. Many systems use both — the Lambda architecture processes data through both batch and stream paths.

How does Kafka Streams handle failures?

Kafka Streams uses changelog topics to back up state stores. Each state change is written to a compacted Kafka topic. On failure, the replacement instance rebuilds state by replaying the changelog topic. Combined with Kafka's consumer group rebalancing, this provides automatic fault tolerance without external coordination.
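The restore step above can be modeled as replaying a compacted key-value log: because the changelog topic is compacted, the rebuilt state keeps only the latest value per key. Illustrative Python, not the Kafka Streams API:

```python
def restore_state_from_changelog(changelog):
    """Rebuild a state store by replaying (key, value) changelog records.

    Compaction semantics: the last record per key wins.
    """
    state = {}
    for key, value in changelog:
        state[key] = value
    return state

changelog = [("alice", 10.0), ("bob", 5.0), ("alice", 25.0)]
print(restore_state_from_changelog(changelog))
# alice's state reflects only her latest aggregate value
```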

What are watermarks in stream processing?

Watermarks are timestamps that flow through the stream alongside events. A watermark at time T means "all events with event time less than T have arrived." This allows the system to close windows and emit results even when events arrive out of order. Events arriving after the watermark (late events) can be handled via side outputs or configured allowed lateness.

Can I join two streams in real-time?

Yes, stream-stream joins are supported by Flink and Kafka Streams. A common pattern is joining a stream of orders with a stream of payments to detect unpaid orders. Joins are windowed — you specify a time window within which matching events are expected. This is more complex than batch joins because both streams are unbounded and events may arrive at different times.
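A windowed stream-stream join can be sketched without a framework: index one side by key, then match pairs whose timestamps fall within the join window. A simplified Python sketch of the orders/payments example; the field names `order_id` and `ts` are assumptions.

```python
def windowed_join(orders, payments, window_s=600):
    """Join orders to payments on order_id when the timestamps are
    within window_s seconds of each other.

    orders, payments: lists of dicts with 'order_id' and 'ts' (epoch seconds).
    Returns matched (order, payment) pairs.
    """
    payments_by_id = {}
    for p in payments:
        payments_by_id.setdefault(p["order_id"], []).append(p)

    matches = []
    for o in orders:
        for p in payments_by_id.get(o["order_id"], []):
            if abs(o["ts"] - p["ts"]) <= window_s:
                matches.append((o, p))
    return matches

orders = [{"order_id": "A", "ts": 1000}, {"order_id": "B", "ts": 2000}]
payments = [{"order_id": "A", "ts": 1300}]  # within 10 min of order A
print(windowed_join(orders, payments))
# Order B has no payment inside the window -> candidate for an "unpaid" alert
```

Real engines buffer both sides in state and expire buffered events once the window closes, since neither stream ever ends.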

Related Articles