Data Lakes and Data Warehouses: Modern Data Architecture
As organizations collect ever-growing volumes of data, choosing the right storage and analytics architecture becomes critical. Data lakes, data warehouses, and the newer data lakehouse paradigm each solve different problems. Understanding their trade-offs, ingestion patterns, and governance models is essential for building scalable, cost-effective data platforms. This guide covers schema strategies, ETL vs ELT, leading tools, and modern architectural patterns like data mesh and medallion architecture.
Data Warehouse Fundamentals
What Is a Data Warehouse?
A data warehouse is a centralized repository of structured, curated data optimized for analytical queries. Data is cleaned, transformed, and loaded into a predefined schema before it can be queried. This schema-on-write approach enforces data quality at ingestion time, making queries fast and results trustworthy.
Warehouses excel at business intelligence (BI), dashboards, and ad-hoc SQL analytics. They are the backbone of most reporting systems and pair naturally with OLAP workloads.
Schema-on-Write
In schema-on-write, you define tables, columns, and data types before loading data. Any record that does not conform is rejected or transformed during ingestion. This guarantees that every row in the warehouse matches the expected format, enabling aggressive query optimization through columnar storage, compression, and indexing.
-- Example: creating a fact table in a warehouse
-- (partitioning/clustering DDL syntax varies by engine; shown in generic form)
CREATE TABLE fact_orders (
order_id BIGINT NOT NULL,
customer_id BIGINT NOT NULL,
order_date DATE NOT NULL,
total_amount DECIMAL(12,2) NOT NULL,
status VARCHAR(20) NOT NULL,
region VARCHAR(50) NOT NULL
)
PARTITION BY (order_date)
CLUSTER BY (region);
Popular Data Warehouses
| Warehouse | Cloud Provider | Key Strengths | Pricing Model |
|---|---|---|---|
| Snowflake | Multi-cloud | Separation of storage/compute, instant scaling, data sharing | Pay per second of compute + storage |
| BigQuery | Google Cloud | Serverless, petabyte-scale, built-in ML | Pay per query (bytes scanned) or flat-rate |
| Amazon Redshift | AWS | Deep AWS integration, Redshift Spectrum for S3 | On-demand or reserved instances |
| Azure Synapse | Azure | Unified analytics, Spark integration | Dedicated or serverless pools |
Data Lake Fundamentals
What Is a Data Lake?
A data lake is a vast storage repository that holds raw data in its native format—structured, semi-structured, or unstructured—until it is needed. Unlike a warehouse, a data lake uses schema-on-read: the schema is applied only when the data is queried, not when it is stored. This makes ingestion extremely fast and flexible.
Data lakes are built on cheap, scalable object storage such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). They store everything from CSV and JSON files to Parquet, Avro, images, and log files.
Schema-on-Read
Schema-on-read defers structure decisions to query time. A data engineer or data scientist defines the schema when they read the data, not when they write it. This is powerful for exploratory analytics and machine learning, where the same raw data might be read in many different ways.
# Reading raw Parquet files with schema-on-read in PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LakeRead").getOrCreate()
# Schema is inferred or specified at read time
orders = spark.read.parquet("s3://data-lake/raw/orders/")
orders.printSchema()
# Apply transformations at query time
daily_revenue = (
orders
.filter(orders.status == "completed")
.groupBy("order_date")
.agg({"total_amount": "sum"})
)
daily_revenue.show()
Data Lake Challenges
Without governance, data lakes become data swamps—disorganized repositories where nobody can find or trust the data. Common pitfalls include:
- No catalog or metadata management, so users cannot discover datasets
- No access controls, leading to compliance violations
- Duplicate, stale, or contradictory data with no lineage tracking
- Poor file formats (many small files) that degrade query performance
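The small-files problem is concrete enough to put numbers on: query engines pay a fixed overhead per file opened, so compaction jobs coalesce many tiny files into a few large ones. A minimal sketch of how such a job might size its output, using a 128 MB target that matches common Parquet guidance (the helper name is illustrative):

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # ~128 MB, a common Parquet target size

def plan_compaction(file_sizes_bytes):
    """Given the sizes of many small files, return how many
    output files a compaction job should coalesce them into."""
    total = sum(file_sizes_bytes)
    return max(1, math.ceil(total / TARGET_FILE_BYTES))

# 10,000 files of ~1 MB each collapse into 75 right-sized files
small_files = [1_000_000] * 10_000
print(plan_compaction(small_files))  # -> 75
```

In Spark this sizing decision typically feeds a `coalesce` or `repartition` call before the write; Delta Lake and Iceberg offer built-in compaction (`OPTIMIZE`, `rewrite_data_files`) that automates it.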
The Data Lakehouse Paradigm
What Is a Data Lakehouse?
A data lakehouse combines the flexibility and low cost of a data lake with the reliability, performance, and governance features of a data warehouse. It stores data in open file formats on object storage but adds a transactional metadata layer that provides ACID transactions, schema enforcement, and time travel.
The lakehouse eliminates the need to maintain separate lake and warehouse systems, reducing data duplication, ETL complexity, and cost. For a deeper understanding of ACID guarantees, see our guide on ACID vs BASE.
Key Lakehouse Technologies
| Technology | Backed By | Format | Key Features |
|---|---|---|---|
| Delta Lake | Databricks | Parquet + transaction log | ACID transactions, time travel, schema evolution, Z-ordering |
| Apache Iceberg | Netflix / Apache | Parquet or ORC + metadata | Hidden partitioning, snapshot isolation, partition evolution |
| Apache Hudi | Uber / Apache | Parquet + timeline | Upserts, incremental processing, compaction |
# Creating a Delta Lake table with ACID guarantees
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
# Write with schema enforcement (orders_df is an existing orders DataFrame)
(
    orders_df.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("s3://data-lake/delta/orders")
)
# Time travel: read a previous version
old_orders = (
    spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("s3://data-lake/delta/orders")
)
# Upsert (merge) new data
delta_table = DeltaTable.forPath(spark, "s3://data-lake/delta/orders")
delta_table.alias("target").merge(
new_data.alias("source"),
"target.order_id = source.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
ETL vs ELT
ETL: Extract, Transform, Load
In ETL, data is extracted from sources, transformed in a staging area (cleaning, joining, aggregating), and then loaded into the target warehouse. This was the dominant pattern when compute was expensive and warehouses had limited processing power.
# Classic ETL pipeline pseudocode
# 1. Extract from source database
raw_data = extract_from_postgres("SELECT * FROM orders WHERE updated_at > last_run")
# 2. Transform in application layer
cleaned = raw_data.drop_duplicates().fill_nulls()
enriched = cleaned.join(dim_customers, on="customer_id")
aggregated = enriched.group_by("region", "month").sum("amount")
# 3. Load into warehouse
aggregated.write_to_warehouse("fact_monthly_revenue")
ELT: Extract, Load, Transform
In ELT, raw data is extracted and loaded directly into the target system (lake or warehouse), and transformations happen inside the target using its compute engine. Modern cloud warehouses like Snowflake and BigQuery have massive compute power, making in-warehouse transformations faster and simpler than external ETL.
Tools like dbt (data build tool) have popularized the ELT approach by enabling analysts to define transformations as SQL models with built-in testing and documentation.
-- dbt model: transform raw orders into a clean fact table
-- models/fact_orders.sql
WITH raw_orders AS (
SELECT * FROM {{ source('raw', 'orders') }}
),
cleaned AS (
SELECT
order_id,
customer_id,
order_date,
COALESCE(total_amount, 0) AS total_amount,
UPPER(TRIM(status)) AS status,
region
FROM raw_orders
WHERE order_id IS NOT NULL
)
SELECT
c.*,
d.customer_name,
d.customer_segment
FROM cleaned c
LEFT JOIN {{ ref('dim_customers') }} d
ON c.customer_id = d.customer_id
ETL vs ELT Comparison
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | External staging area | Inside target system |
| Best for | Legacy warehouses, complex transformations | Cloud warehouses, SQL-based transforms |
| Data freshness | Batch-oriented, higher latency | Near real-time possible |
| Scalability | Limited by staging compute | Scales with warehouse compute |
| Tooling | Informatica, Talend, SSIS | dbt, Snowflake tasks, BigQuery scheduled queries |
| Data lake compatibility | Moderate | Excellent |
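The ELT flow can be sketched end-to-end with the standard library: raw rows are loaded untouched, and all cleaning happens inside the target engine in SQL. Here an in-memory SQLite database stands in for a cloud warehouse; table and column names are illustrative.

```python
import sqlite3

# "Warehouse" target: in-memory SQLite stands in for Snowflake/BigQuery
conn = sqlite3.connect(":memory:")

# 1. Extract + Load: raw rows land untransformed, nulls, whitespace and all
conn.execute("CREATE TABLE raw_orders (order_id, status, total_amount)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " completed ", 120.0), (2, "COMPLETED", None), (3, "pending", 40.0)],
)

# 2. Transform: cleaning happens inside the target, using its SQL engine
conn.execute("""
    CREATE TABLE fact_orders AS
    SELECT order_id,
           UPPER(TRIM(status)) AS status,
           COALESCE(total_amount, 0) AS total_amount
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")

row = conn.execute(
    "SELECT SUM(total_amount) FROM fact_orders WHERE status = 'COMPLETED'"
).fetchone()
print(row[0])  # -> 120.0
```

Because the raw table is preserved, the transformation can be rewritten and re-run at any time, which is exactly the flexibility dbt builds on.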
Medallion Architecture
Bronze, Silver, Gold Layers
The medallion architecture (popularized by Databricks) organizes a lakehouse into three layers of increasing data quality. This pattern provides a clear, auditable data lineage from raw ingestion to business-ready datasets.
| Layer | Purpose | Data Quality | Consumers |
|---|---|---|---|
| Bronze (Raw) | Ingest raw data as-is from sources | Unvalidated, may contain duplicates | Data engineers, debugging |
| Silver (Cleaned) | Clean, deduplicate, conform, join | Validated, standardized schemas | Data scientists, analysts |
| Gold (Business) | Aggregate, model for specific use cases | Business-ready, optimized for queries | BI dashboards, reports, APIs |
# Medallion architecture pipeline example
# BRONZE: Raw ingestion (append-only)
raw_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # required option; address is illustrative
    .option("subscribe", "user-events")
    .load()
)
(
    raw_events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/checkpoints/bronze")
    .start("s3://lake/bronze/user_events")
)
# SILVER: Cleaned and deduplicated
from pyspark.sql.functions import col, current_timestamp

bronze_events = spark.read.format("delta").load("s3://lake/bronze/user_events")
silver_events = (
    bronze_events
    .dropDuplicates(["event_id"])
    .filter("event_type IS NOT NULL")
    .withColumn("event_date", col("timestamp").cast("date"))
    .withColumn("processed_at", current_timestamp())
)
(
    silver_events.write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .save("s3://lake/silver/user_events")
)
# GOLD: Business aggregates
from pyspark.sql.functions import countDistinct

silver = spark.read.format("delta").load("s3://lake/silver/user_events")
daily_active_users = (
    silver
    .filter("event_type = 'page_view'")
    .groupBy("event_date")
    .agg(countDistinct("user_id").alias("dau"))
)
(
    daily_active_users.write.format("delta")
    .mode("overwrite")
    .save("s3://lake/gold/daily_active_users")
)
Data Governance and Cataloging
Why Governance Matters
Without governance, data platforms become unusable. Teams cannot find datasets, do not know if data is fresh, and cannot determine who has access. Governance covers metadata management, access control, data lineage, quality monitoring, and compliance (GDPR, HIPAA, SOC2).
Key Governance Tools
| Tool | Category | Key Features |
|---|---|---|
| Apache Atlas | Metadata / Lineage | Entity classification, lineage graphs, Hadoop ecosystem integration |
| DataHub (LinkedIn) | Data Catalog | Search and discovery, lineage, schema history, ownership tracking |
| Unity Catalog (Databricks) | Unified Governance | Fine-grained ACLs, lineage, cross-workspace sharing |
| AWS Glue Data Catalog | Schema Registry | Automatic schema discovery, Athena/Redshift integration |
| Great Expectations | Data Quality | Expectation suites, validation, profiling, alerting |
# Great Expectations: defining data quality checks
# (fluent API; exact entry points vary across GX versions)
import great_expectations as gx
context = gx.get_context()
# Define expectations for the silver orders table
validator = context.sources.pandas_default.read_csv("silver_orders.csv")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("total_amount", min_value=0, max_value=1000000)
validator.expect_column_values_to_be_in_set("status", ["pending", "completed", "cancelled", "refunded"])
validator.expect_column_pair_values_a_to_be_greater_than_b("ship_date", "order_date", or_equal=True)
results = validator.validate()
print(f"Validation passed: {results.success}")
Data Mesh Architecture
What Is Data Mesh?
Data mesh is a decentralized approach to data architecture proposed by Zhamak Dehghani. Instead of a centralized data team owning all pipelines and warehouses, data mesh distributes data ownership to domain teams who produce, own, and serve their data as products.
Four Principles of Data Mesh
- Domain ownership: Each business domain (orders, payments, customers) owns its data end-to-end, including ingestion, transformation, quality, and serving.
- Data as a product: Domain teams treat their datasets as products with SLAs, documentation, discoverability, and quality guarantees.
- Self-serve data platform: A platform team provides infrastructure (storage, compute, catalog, access control) as a self-serve product so domain teams do not reinvent the wheel.
- Federated computational governance: Global policies (naming conventions, security, interoperability standards) are defined centrally but enforced computationally through automation.
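"Enforced computationally" can be as simple as a shared CI check that every domain's pipeline runs. A minimal sketch, assuming a hypothetical global `<domain>.<layer>.<dataset>` naming convention (the policy itself is illustrative):

```python
import re

# Hypothetical federated policy: dataset names must follow <domain>.<layer>.<dataset>
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*\.(bronze|silver|gold)\.[a-z][a-z0-9_]*$")

def check_dataset_name(name: str) -> bool:
    """Return True if a dataset name complies with the global naming policy."""
    return NAME_PATTERN.match(name) is not None

print(check_dataset_name("payments.silver.settlements"))  # True
print(check_dataset_name("Misc Stuff/final_v2"))          # False
```

The point is the enforcement mechanism, not the rule: policies are defined once centrally, then executed automatically in every domain's deployment pipeline rather than reviewed by hand.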
Data Mesh vs Centralized Architecture
| Aspect | Centralized (Data Warehouse/Lake) | Data Mesh |
|---|---|---|
| Ownership | Central data team | Domain teams |
| Bottleneck | Central team becomes a bottleneck | Distributed, scales with org |
| Data quality | Central team responsible | Domain producers accountable |
| Best for | Small/medium orgs, unified analytics | Large orgs with many domains |
| Complexity | Simpler infrastructure | Higher organizational complexity |
Partitioning and Performance Optimization
Choosing File Formats
File format choice dramatically impacts query performance and storage cost in data lakes. Columnar formats are preferred for analytical workloads because they enable column pruning (reading only needed columns) and better compression.
| Format | Type | Best For | Compression |
|---|---|---|---|
| Parquet | Columnar | Analytics, Spark, Athena, BigQuery | Excellent (Snappy, Zstd) |
| ORC | Columnar | Hive ecosystem, Presto | Excellent (Zlib, Snappy) |
| Avro | Row-based | Streaming, Kafka, schema evolution | Good (Deflate, Snappy) |
| CSV/JSON | Row-based | Data exchange, small datasets | Poor (text-based) |
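The column-pruning advantage of columnar formats can be seen with plain Python lists: in a columnar layout each column is stored contiguously, so a query that needs one column touches only that column's values, while a row layout drags every field of every record along.

```python
# Row layout: reading one field still means scanning whole records
rows = [
    {"order_id": 1, "region": "EU", "total_amount": 120.0},
    {"order_id": 2, "region": "US", "total_amount": 40.0},
    {"order_id": 3, "region": "EU", "total_amount": 75.0},
]

# Columnar layout: each column stored (and compressed) contiguously
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "total_amount": [120.0, 40.0, 75.0],
}

# SELECT SUM(total_amount): the columnar layout scans one list ...
print(sum(columns["total_amount"]))          # -> 235.0
# ... while the row layout must visit every record in full
print(sum(r["total_amount"] for r in rows))  # -> 235.0, but with more I/O
```

Parquet and ORC apply the same idea on disk, with per-column compression and statistics (min/max per chunk) that enable further skipping.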
For deeper coverage of partitioning strategies, see our guide on database partitioning. For distributing data across nodes, check out sharding strategies.
Partition Strategies for Data Lakes
-- Hive-style partitioning in a data lake
-- Directory structure:
-- s3://lake/silver/orders/year=2024/month=01/day=15/part-00000.parquet
-- s3://lake/silver/orders/year=2024/month=01/day=16/part-00000.parquet
-- Query with partition pruning (only reads relevant directories)
SELECT region, SUM(total_amount) AS revenue
FROM silver.orders
WHERE year = 2024 AND month = 1
GROUP BY region
ORDER BY revenue DESC;
-- Iceberg hidden partitioning (no need for partition columns in queries)
CREATE TABLE catalog.silver.orders (
order_id BIGINT,
order_date TIMESTAMP,
region STRING,
total_amount DECIMAL(12,2)
) USING iceberg
PARTITIONED BY (month(order_date), region);
-- Iceberg automatically prunes partitions from predicates
SELECT * FROM catalog.silver.orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
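Hive-style partition pruning is essentially path filtering: the engine parses `key=value` directory segments and skips every file whose partition values fail the predicate. A pure-Python sketch of that mechanism, using the directory structure shown above:

```python
def partition_values(path):
    """Parse key=value pairs out of a Hive-style partition path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

paths = [
    "s3://lake/silver/orders/year=2024/month=01/day=15/part-00000.parquet",
    "s3://lake/silver/orders/year=2024/month=01/day=16/part-00000.parquet",
    "s3://lake/silver/orders/year=2024/month=02/day=01/part-00000.parquet",
]

# WHERE year = 2024 AND month = 1: only the January directories are read
pruned = [
    p for p in paths
    if (pv := partition_values(p))["year"] == "2024" and int(pv["month"]) == 1
]
print(len(pruned))  # -> 2
```

Iceberg's hidden partitioning does this mapping internally from the predicate on `order_date`, which is why queries never need to mention `year` or `month` columns.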
Real-World Architecture Patterns
Lambda Architecture
Lambda architecture maintains two parallel pipelines: a batch layer for accurate, complete processing of historical data and a speed layer for low-latency processing of real-time streams. A serving layer merges results from both. While powerful, this approach has high operational complexity due to maintaining two separate codebases.
Kappa Architecture
Kappa architecture simplifies Lambda by using a single stream-processing pipeline for both real-time and historical reprocessing. All data flows through a log (like Apache Kafka), and reprocessing is done by replaying the log through an updated processor. This reduces complexity but requires a robust streaming platform.
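The replay idea behind Kappa can be shown with an in-memory log: when processing logic changes, the new processor version rebuilds its state by re-reading the log from offset zero. Here a plain list stands in for a Kafka topic, and the event fields are illustrative.

```python
# The log: an append-only record of raw events (stand-in for a Kafka topic)
event_log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

def process_v1(log):
    """Original processor: total amount per user."""
    totals = {}
    for e in log:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

def process_v2(log):
    """Updated logic: ignore amounts under 6.
    Reprocessing history = replaying the same log through the new code."""
    totals = {}
    for e in log:
        if e["amount"] >= 6:
            totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

print(process_v1(event_log))  # {'a': 17, 'b': 5}
print(process_v2(event_log))  # {'a': 17}
```

Because both versions consume the identical log, there is one codebase per processor version rather than separate batch and speed implementations, which is the complexity Lambda pays for.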
Choosing the Right Architecture
Decision Framework:
1. Data Volume and Variety
- Structured data only, less than 10TB --> Data Warehouse (Snowflake, BigQuery)
- Mixed formats, ML workloads, over 10TB --> Data Lakehouse (Delta Lake, Iceberg)
- Raw dump with future unknown use cases --> Data Lake (S3 + Glue Catalog)
2. Team and Organization
- Small data team, SQL-first --> Warehouse + dbt
- Large org, many domains, distributed --> Data Mesh + Lakehouse
- Startup, iterate fast --> ELT into warehouse
3. Latency Requirements
- Batch (hourly/daily) is acceptable --> Standard ELT pipeline
- Near real-time (minutes) --> Streaming into lakehouse
- True real-time (seconds) --> Kappa + stream processing
4. Budget
- Predictable, controlled costs --> Reserved warehouse capacity
- Pay-per-query, variable workloads --> Serverless (BigQuery, Athena)
- Minimize storage cost --> Data lake on S3/GCS
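The volume-and-variety branch of the framework can be encoded as a first-pass heuristic; the thresholds and labels below are illustrative, not prescriptive, and real decisions weigh team shape, latency, and budget as well.

```python
def suggest_architecture(structured_only: bool, volume_tb: float, ml_workloads: bool) -> str:
    """First-pass suggestion following the decision framework above.
    Thresholds are illustrative, not prescriptive."""
    if structured_only and volume_tb < 10 and not ml_workloads:
        return "data warehouse"
    if ml_workloads or volume_tb >= 10:
        return "data lakehouse"
    return "data lake"

print(suggest_architecture(structured_only=True, volume_tb=2, ml_workloads=False))   # data warehouse
print(suggest_architecture(structured_only=False, volume_tb=50, ml_workloads=True))  # data lakehouse
```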
For understanding how these architectures interact with transactional systems, see our guide on OLTP vs OLAP. To learn about choosing between SQL and NoSQL for source systems, check out SQL vs NoSQL databases.
Frequently Asked Questions
When should I use a data lake vs a data warehouse?
Use a data warehouse when you have structured data and need fast, reliable SQL analytics for BI and reporting. Use a data lake when you have diverse data formats (logs, images, JSON), need cheap storage at massive scale, or run machine learning workloads. In most modern architectures, a data lakehouse gives you both capabilities in one system.
What is the difference between ETL and ELT?
ETL transforms data before loading it into the target. ELT loads raw data first and transforms it inside the target system. ELT is preferred with modern cloud warehouses because they have abundant compute power, and keeping raw data accessible enables more flexible transformations over time.
Is data mesh right for my organization?
Data mesh works best for large organizations with multiple autonomous domains that each generate and consume significant data. If you have fewer than 5 data-producing domains or a small data team, a centralized lakehouse with clear ownership conventions will likely be simpler and more effective. Data mesh adds organizational complexity that only pays off at scale.
How do I prevent my data lake from becoming a data swamp?
Implement governance from day one: use a data catalog (DataHub, Glue Catalog) for discoverability, enforce naming conventions and folder structures, set up automated data quality checks (Great Expectations), track lineage, define ownership for every dataset, and use the medallion architecture to separate raw from curated data. Treat your data platform like a product, not a dumping ground.
Should I use Delta Lake or Apache Iceberg?
Both are excellent table formats for lakehouses. Choose Delta Lake if you are in the Databricks ecosystem or use Spark heavily. Choose Apache Iceberg if you need multi-engine support (Spark, Trino, Flink, Dremio) or want engine-agnostic open standards. Both provide ACID transactions, time travel, and schema evolution. Iceberg has stronger partition evolution capabilities, while Delta Lake has deeper Databricks integration and optimization features like Z-ordering and liquid clustering.