
Lambda and Kappa Architecture: Processing Big Data



When your system needs to process millions of events per second and also answer queries that span months or years of historical data, you face a fundamental architectural challenge: how do you combine real-time processing with batch processing? The Lambda and Kappa architectures are two influential answers to this question.

In this guide, we explore both architectures in depth — how they work, their trade-offs, when to use each, and the tools that power them. For related concepts, see Event-Driven Architecture, Stream Processing, and Apache Kafka.

The Big Data Challenge

Traditional request-response systems process one event at a time. But many modern applications need to:

  • Process millions of events per second in real time (fraud detection, live dashboards)
  • Run complex analytical queries over terabytes or petabytes of historical data
  • Guarantee correctness even when real-time processing has errors or delays
  • Handle late-arriving data that must be incorporated into existing results

No single processing paradigm handles all these requirements well. Batch processing excels at correctness and scale but introduces latency. Stream processing provides low latency but can struggle with complex aggregations and exactly-once guarantees. Lambda architecture combines both.

Lambda Architecture

Proposed by Nathan Marz (creator of Apache Storm), the Lambda architecture uses three layers to deliver both real-time and batch results:

Batch Layer

The batch layer stores the complete, immutable master dataset and periodically recomputes batch views by running MapReduce or Spark jobs over the entire dataset. This layer prioritizes correctness over speed.

# Batch layer: Spark job computing daily aggregations
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, count, avg, window

spark = SparkSession.builder.appName("DailyAggregation").getOrCreate()

# Read the complete master dataset
events = spark.read.parquet("s3://data-lake/events/")

# Compute batch views
daily_metrics = (
    events
    .groupBy(
        window("timestamp", "1 day"),
        "product_id"
    )
    .agg(
        count("*").alias("total_events"),
        sum("revenue").alias("total_revenue"),
        avg("duration").alias("avg_duration")
    )
)

# Write batch views
daily_metrics.write.mode("overwrite").parquet("s3://batch-views/daily-metrics/")

Speed Layer

The speed layer processes events in real time, producing approximate views that compensate for the latency of the batch layer. It only needs to handle the data that has arrived since the last batch computation.

// Speed layer: Kafka Streams real-time aggregation
KStream<String, Event> events = builder.stream("events-topic");

KTable<Windowed<String>, Metrics> realTimeMetrics = events
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .aggregate(
        Metrics::new,
        (key, event, metrics) -> metrics.add(event),
        Materialized.<String, Metrics, WindowStore<Bytes, byte[]>>as("realtime-metrics-store")
    );

// Expose via interactive queries or write to a fast-access store
realTimeMetrics.toStream().to("realtime-metrics-topic");

Serving Layer

The serving layer merges results from the batch and speed layers to answer queries. When a query comes in, it reads the precomputed batch view (which may be hours old) and overlays the real-time data from the speed layer to produce a complete, up-to-date result.

# Serving layer: merging batch and real-time views
class QueryService:
    def get_metrics(self, product_id, time_range):
        # Get the precomputed batch view (complete but delayed)
        batch_result = self.batch_store.get(product_id, time_range)

        # Get real-time updates since last batch run
        realtime_result = self.realtime_store.get(
            product_id,
            since=batch_result.last_computed_at
        )

        # Merge: batch provides the base, real-time fills the gap
        return self.merge(batch_result, realtime_result)
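
The merge function above is left abstract; a minimal pure-Python sketch (with a hypothetical Metrics shape, not from the original) shows the idea, assuming the batch and real-time views never cover overlapping events:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    event_count: int
    total_revenue: float

def merge(batch: Metrics, realtime: Metrics) -> Metrics:
    # Assumes the real-time view only covers events newer than the
    # batch view's last_computed_at, so the two never overlap and
    # additive metrics can simply be summed.
    return Metrics(
        event_count=batch.event_count + realtime.event_count,
        total_revenue=batch.total_revenue + realtime.total_revenue,
    )

merged = merge(Metrics(1000, 250.0), Metrics(42, 10.5))
# merged.event_count == 1042, merged.total_revenue == 260.5
```

Note that non-additive metrics (averages, distinct counts) need more care: the layers must exchange partial aggregates (sums and counts, sketches) rather than final values.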

Lambda Architecture Data Flow

| Layer   | Input               | Processing        | Latency      | Accuracy    |
|---------|---------------------|-------------------|--------------|-------------|
| Batch   | Complete dataset    | MapReduce / Spark | Hours        | Exact       |
| Speed   | Recent events only  | Stream processing | Seconds      | Approximate |
| Serving | Batch + Speed views | Query merging     | Milliseconds | Near-exact  |

Lambda Architecture: Pros and Cons

Advantages:

  • Guarantees eventual correctness through batch recomputation
  • Handles late-arriving data naturally (batch layer reprocesses everything)
  • Fault tolerant — if the speed layer crashes, the batch layer compensates
  • Battle-tested at companies like Twitter, LinkedIn, and Netflix

Disadvantages:

  • Code duplication: You must implement the same logic twice — once for batch (Spark) and once for streaming (Kafka Streams / Flink)
  • Operational complexity: Three layers to maintain, monitor, and debug
  • Merging is hard: The serving layer merge logic can be complex and error-prone
  • High resource cost: Running both batch and streaming infrastructure

Kappa Architecture

Proposed by Jay Kreps (co-creator of Apache Kafka), the Kappa architecture simplifies Lambda by removing the batch layer entirely. Instead of maintaining two parallel processing systems, Kappa uses a single stream processing engine for everything.

The core insight: if your event log is replayable (like a Kafka topic with long retention), you can reprocess historical data simply by replaying the log through a new version of your stream processor; no separate batch system is needed.

How Kappa Works

  1. All data enters as an immutable event log (e.g., Kafka with infinite retention)
  2. A stream processing engine consumes the log and produces derived views
  3. When you need to recompute (due to a bug fix or schema change), deploy a new version of the stream processor that replays from the beginning of the log
  4. Once the new processor has caught up, switch queries to the new views and decommission the old ones

# Kappa architecture: single stream processing job handles both
# real-time and historical reprocessing

# Flink job that can process from any offset
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Source: Kafka topic with long retention (months/years)
t_env.execute_sql("""
    CREATE TABLE events (
        event_id STRING,
        product_id STRING,
        revenue DECIMAL(10, 2),
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Single processing logic for both real-time and reprocessing
t_env.execute_sql("""
    INSERT INTO metrics_sink
    SELECT
        product_id,
        TUMBLE_START(event_time, INTERVAL '1' HOUR) as window_start,
        COUNT(*) as event_count,
        SUM(revenue) as total_revenue
    FROM events
    GROUP BY
        product_id,
        TUMBLE(event_time, INTERVAL '1' HOUR)
""")
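
The replay-and-switch workflow (steps 3 and 4 above) can be illustrated with a pure-Python sketch, using an in-memory list as a stand-in for the Kafka log (all names hypothetical):

```python
# The immutable log is the source of truth; derived views are disposable
# and can always be rebuilt by replaying the log from the beginning.
log = [
    {"product_id": "a", "revenue": 10.0},
    {"product_id": "a", "revenue": 5.0},
    {"product_id": "b", "revenue": 7.0},
]

def processor_v1(events):
    # v1 has a bug: it counts events but forgets to sum revenue
    view = {}
    for e in events:
        view.setdefault(e["product_id"], {"count": 0})["count"] += 1
    return view

def processor_v2(events):
    # v2 fixes the bug; "deploying" it just means replaying the same log
    view = {}
    for e in events:
        agg = view.setdefault(e["product_id"], {"count": 0, "revenue": 0.0})
        agg["count"] += 1
        agg["revenue"] += e["revenue"]
    return view

old_view = processor_v1(log)   # the view serving queries today
new_view = processor_v2(log)   # built by replaying the log from offset 0
serving_view = new_view        # switch queries over, retire old_view
```

In a real deployment, "replaying the log" means starting the new Flink job with `scan.startup.mode = 'earliest-offset'` (as in the table definition above) and writing to a fresh sink until it catches up.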

Kappa Architecture: Pros and Cons

Advantages:

  • Single codebase: Write the processing logic once
  • Simpler operations: One system to maintain instead of three
  • No merge logic: Results come from a single source
  • Natural fit for event sourcing: The event log is the source of truth

Disadvantages:

  • Reprocessing can be slow: Replaying years of data through a stream processor takes time
  • Resource-intensive reprocessing: Requires significant compute during replay
  • Storage costs: Keeping the full event log indefinitely is expensive
  • Complex aggregations: Some analytical queries are more natural in batch SQL than streaming SQL

Lambda vs Kappa: When to Use Which

| Criteria               | Lambda                        | Kappa                             |
|------------------------|-------------------------------|-----------------------------------|
| Code complexity        | High (dual codebase)          | Low (single codebase)             |
| Operational complexity | High (3 layers)               | Medium (1 layer + replay)         |
| Reprocessing speed     | Fast (optimized batch)        | Slower (stream replay)            |
| Late data handling     | Excellent (batch fixes all)   | Good (watermarks, triggers)       |
| Cost                   | Higher (batch + stream infra) | Lower (stream infra only)         |
| Best for               | Complex analytics + real-time | Event-driven, simpler aggregations |
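
To make the late-data row concrete, here is a minimal pure-Python sketch of watermark-based windowing (parameters and names are hypothetical): events that arrive behind the watermark are dropped by the stream processor, which is exactly the gap a Lambda batch layer would later repair.

```python
WINDOW = 60            # tumbling window size in seconds of event time
ALLOWED_LATENESS = 30  # watermark trails the max observed event time by 30s

windows = {}           # window start -> event count
max_event_time = 0

def process(event_time):
    """Assign an event to its window, or drop it if it is behind the watermark."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    if event_time < watermark:
        return False   # too late for the stream processor to include
    window_start = (event_time // WINDOW) * WINDOW
    windows[window_start] = windows.get(window_start, 0) + 1
    return True

process(100)         # accepted into window [60, 120)
process(200)         # accepted; watermark advances to 170
late = process(150)  # 150 < 170: dropped as too late
```

A larger allowed lateness catches more stragglers at the cost of holding window state open longer; in Lambda, the batch layer removes this trade-off by eventually reprocessing everything.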

Choose Lambda when:

  • You need complex analytical queries that are easier to express in batch SQL
  • Your batch processing infrastructure already exists (Hadoop, Spark)
  • Correctness guarantees are critical and you want batch as a safety net
  • Reprocessing must be fast (optimized batch engines are faster than stream replay)

Choose Kappa when:

  • Your event log is the source of truth (event sourcing pattern)
  • You want to avoid maintaining two codebases
  • Your aggregations are relatively simple (counts, sums, averages)
  • You are building a new system without legacy batch infrastructure

Tools and Ecosystem

| Component         | Lambda Tools            | Kappa Tools                                     |
|-------------------|-------------------------|-------------------------------------------------|
| Event Log         | HDFS, S3, Kafka         | Kafka (long retention), Pulsar                  |
| Batch Processing  | Hadoop MapReduce, Spark | N/A                                             |
| Stream Processing | Storm, Spark Streaming  | Kafka Streams, Flink, Spark Structured Streaming |
| Serving Store     | HBase, Cassandra, Druid | Druid, ClickHouse, Elasticsearch                |
| Query Engine      | Presto, Hive, Drill     | Flink SQL, ksqlDB                               |

Modern Alternatives

The industry is converging on architectures that blur the line between Lambda and Kappa:

Unified Batch and Streaming

Frameworks like Apache Flink and Apache Spark Structured Streaming can handle both batch and streaming workloads with a single API. This gives you the simplicity of Kappa with the batch capabilities of Lambda:

# Spark Structured Streaming: same code for batch and streaming
# Streaming mode
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "events")
      .load())

# Batch mode (same logic, different source)
df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "events")
      .load())

# Processing logic is identical for both modes
result = (df.groupBy("product_id", window("timestamp", "1 hour"))
          .agg(sum("revenue"), count("*")))

The Lakehouse Architecture

The lakehouse combines the scalability of data lakes with the reliability and performance of data warehouses. Tools like Delta Lake, Apache Iceberg, and Apache Hudi add ACID transactions, time travel, and schema enforcement to data lake storage (S3, HDFS). This enables a single storage layer that supports both batch analytics and streaming ingestion.

Real-Time OLAP

Databases like Apache Druid, ClickHouse, and Apache Pinot can ingest streaming data and serve analytical queries with sub-second latency. They essentially combine the speed and serving layers into a single system, simplifying the architecture significantly.

Real-World Implementations

  • LinkedIn: Pioneered Lambda architecture for activity data processing using Kafka + Hadoop + Voldemort
  • Netflix: Uses a variant of Lambda for real-time analytics with Kafka, Flink, and Druid
  • Uber: Built a Kappa-style architecture for real-time pricing with Kafka and Flink
  • Spotify: Uses a lakehouse approach with Cloud Dataflow and BigQuery for both batch and streaming analytics

Lambda and Kappa are not dogmatic prescriptions — they are reference architectures that guide your thinking. Most production systems end up somewhere in between, using the principles of both while leveraging modern tools that reduce the sharp distinctions. Start with Kappa if you are building from scratch and your needs are straightforward. Consider Lambda elements when you need complex historical analytics or your stream processor cannot efficiently reprocess large datasets. And keep an eye on unified frameworks and lakehouse architectures — the future is convergence. For the event infrastructure underlying both architectures, see Apache Kafka and Event-Driven Architecture.
