Data Lakes and Data Warehouses: Modern Data Architecture

As organizations collect ever-growing volumes of data, choosing the right storage and analytics architecture becomes critical. Data lakes, data warehouses, and the newer data lakehouse paradigm each solve different problems. Understanding their trade-offs, ingestion patterns, and governance models is essential for building scalable, cost-effective data platforms. This guide covers schema strategies, ETL vs ELT, leading tools, and modern architectural patterns like data mesh and medallion architecture.

Data Warehouse Fundamentals

What Is a Data Warehouse?

A data warehouse is a centralized repository of structured, curated data optimized for analytical queries. Data is cleaned, transformed, and loaded into a predefined schema before it can be queried. This schema-on-write approach enforces data quality at ingestion time, making queries fast and results trustworthy.

Warehouses excel at business intelligence (BI), dashboards, and ad-hoc SQL analytics. They are the backbone of most reporting systems and pair naturally with OLAP workloads.

Schema-on-Write

In schema-on-write, you define tables, columns, and data types before loading data. Any record that does not conform is rejected or transformed during ingestion. This guarantees that every row in the warehouse matches the expected format, enabling aggressive query optimization through columnar storage, compression, and indexing.

-- Example: creating a fact table in a warehouse
CREATE TABLE fact_orders (
  order_id      BIGINT       NOT NULL,
  customer_id   BIGINT       NOT NULL,
  order_date    DATE         NOT NULL,
  total_amount  DECIMAL(12,2) NOT NULL,
  status        VARCHAR(20)  NOT NULL,
  region        VARCHAR(50)  NOT NULL
)
PARTITION BY (order_date)
CLUSTER BY (region);

| Warehouse | Cloud Provider | Key Strengths | Pricing Model |
| --- | --- | --- | --- |
| Snowflake | Multi-cloud | Separation of storage/compute, instant scaling, data sharing | Pay per second of compute + storage |
| BigQuery | Google Cloud | Serverless, petabyte-scale, built-in ML | Pay per query (bytes scanned) or flat-rate |
| Amazon Redshift | AWS | Deep AWS integration, Redshift Spectrum for S3 | On-demand or reserved instances |
| Azure Synapse | Azure | Unified analytics, Spark integration | Dedicated or serverless pools |

Data Lake Fundamentals

What Is a Data Lake?

A data lake is a vast storage repository that holds raw data in its native format—structured, semi-structured, or unstructured—until it is needed. Unlike a warehouse, a data lake uses schema-on-read: the schema is applied only when the data is queried, not when it is stored. This makes ingestion extremely fast and flexible.

Data lakes are built on cheap, scalable object storage such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). They store everything from CSV and JSON files to Parquet, Avro, images, and log files.

Schema-on-Read

Schema-on-read defers structure decisions to query time. A data engineer or data scientist defines the schema when they read the data, not when they write it. This is powerful for exploratory analytics and machine learning, where the same raw data might be read in many different ways.

# Reading raw Parquet files with schema-on-read in PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LakeRead").getOrCreate()

# Schema is inferred or specified at read time
orders = spark.read.parquet("s3://data-lake/raw/orders/")
orders.printSchema()

# Apply transformations at query time
daily_revenue = (
    orders
    .filter(orders.status == "completed")
    .groupBy("order_date")
    .agg({"total_amount": "sum"})
)
daily_revenue.show()

Data Lake Challenges

Without governance, data lakes become data swamps—disorganized repositories where nobody can find or trust the data. Common pitfalls include:

  • No catalog or metadata management, so users cannot discover datasets
  • No access controls, leading to compliance violations
  • Duplicate, stale, or contradictory data with no lineage tracking
  • Poor file formats (many small files) that degrade query performance
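The small-files problem in particular is usually fixed by periodic compaction jobs that rewrite many tiny objects into fewer large ones. The sketch below illustrates the idea with local CSV part files standing in for object-storage objects; the `part-*.csv` naming and the `compact_csv_files` helper are invented for illustration, not taken from any particular tool.

```python
import csv
import glob
import os

def compact_csv_files(input_dir: str, output_path: str) -> int:
    """Merge many small CSV part files into one larger file.

    Assumes every part shares the same header row, as files
    written by distributed workers typically do. Returns the
    number of part files that were merged.
    """
    parts = sorted(glob.glob(os.path.join(input_dir, "part-*.csv")))
    writer = None
    with open(output_path, "w", newline="") as out:
        for path in parts:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)  # emit the header only once
                writer.writerows(reader)
    return len(parts)
```

In a real lake the same job would rewrite Parquet files, for example with Spark's `repartition` before writing, or with a table format's built-in compaction command.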

The Data Lakehouse Paradigm

What Is a Data Lakehouse?

A data lakehouse combines the flexibility and low cost of a data lake with the reliability, performance, and governance features of a data warehouse. It stores data in open file formats on object storage but adds a transactional metadata layer that provides ACID transactions, schema enforcement, and time travel.

The lakehouse eliminates the need to maintain separate lake and warehouse systems, reducing data duplication, ETL complexity, and cost. For a deeper understanding of ACID guarantees, see our guide on ACID vs BASE.

Key Lakehouse Technologies

| Technology | Backed By | Format | Key Features |
| --- | --- | --- | --- |
| Delta Lake | Databricks | Parquet + transaction log | ACID transactions, time travel, schema evolution, Z-ordering |
| Apache Iceberg | Netflix / Apache | Parquet or ORC + metadata | Hidden partitioning, snapshot isolation, partition evolution |
| Apache Hudi | Uber / Apache | Parquet + timeline | Upserts, incremental processing, compaction |

# Creating a Delta Lake table with ACID guarantees
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("DeltaExample")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write with schema enforcement
(orders_df.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("s3://data-lake/delta/orders"))

# Time travel: read a previous version
old_orders = (spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("s3://data-lake/delta/orders"))

# Upsert (merge) new data
delta_table = DeltaTable.forPath(spark, "s3://data-lake/delta/orders")
delta_table.alias("target").merge(
    new_data.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

ETL vs ELT

ETL: Extract, Transform, Load

In ETL, data is extracted from sources, transformed in a staging area (cleaning, joining, aggregating), and then loaded into the target warehouse. This was the dominant pattern when compute was expensive and warehouses had limited processing power.

# Classic ETL pipeline pseudocode
# 1. Extract from source database
raw_data = extract_from_postgres("SELECT * FROM orders WHERE updated_at > last_run")

# 2. Transform in application layer
cleaned = raw_data.drop_duplicates().fill_nulls()
enriched = cleaned.join(dim_customers, on="customer_id")
aggregated = enriched.group_by("region", "month").sum("amount")

# 3. Load into warehouse
aggregated.write_to_warehouse("fact_monthly_revenue")
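To make the flow concrete, here is a runnable miniature of the same pattern using Python's built-in sqlite3 for both the source and the target; the table and column names are invented for illustration.

```python
import sqlite3

# Source system with raw, partly messy rows
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EU", 20.0), ("US", 35.5), ("EU", None), ("US", 4.5)],
)

# 1. Extract
rows = source.execute("SELECT region, amount FROM orders").fetchall()

# 2. Transform in the application layer, outside the target
totals = {}
for region, amount in rows:
    if amount is not None:  # cleaning happens before load
        totals[region] = totals.get(region, 0.0) + amount

# 3. Load only the finished aggregate into the warehouse
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_revenue (region TEXT, revenue REAL)")
warehouse.executemany("INSERT INTO fact_revenue VALUES (?, ?)", totals.items())
```

Note that the warehouse never sees the raw rows: only the cleaned aggregate is loaded, which is exactly what makes later re-analysis of the raw data impossible without re-running the pipeline.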

ELT: Extract, Load, Transform

In ELT, raw data is extracted and loaded directly into the target system (lake or warehouse), and transformations happen inside the target using its compute engine. Modern cloud warehouses like Snowflake and BigQuery have massive compute power, making in-warehouse transformations faster and simpler than external ETL.

Tools like dbt (data build tool) have popularized the ELT approach by enabling analysts to define transformations as SQL models with built-in testing and documentation.

-- dbt model: transform raw orders into a clean fact table
-- models/fact_orders.sql

WITH raw_orders AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

cleaned AS (
    SELECT
        order_id,
        customer_id,
        order_date,
        COALESCE(total_amount, 0) AS total_amount,
        UPPER(TRIM(status)) AS status,
        region
    FROM raw_orders
    WHERE order_id IS NOT NULL
)

SELECT
    c.*,
    d.customer_name,
    d.customer_segment
FROM cleaned c
LEFT JOIN {{ ref('dim_customers') }} d
    ON c.customer_id = d.customer_id

ETL vs ELT Comparison

| Aspect | ETL | ELT |
| --- | --- | --- |
| Transform location | External staging area | Inside target system |
| Best for | Legacy warehouses, complex transformations | Cloud warehouses, SQL-based transforms |
| Data freshness | Batch-oriented, higher latency | Near real-time possible |
| Scalability | Limited by staging compute | Scales with warehouse compute |
| Tooling | Informatica, Talend, SSIS | dbt, Snowflake tasks, BigQuery scheduled queries |
| Data lake compatibility | Moderate | Excellent |
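The contrast with ETL is easiest to see in code. This sketch, again with sqlite3 standing in for a cloud warehouse, loads raw rows untouched and then transforms them inside the engine with SQL, much as a dbt model would; the table names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1-2. Extract and load the raw data exactly as it arrives
conn.execute("CREATE TABLE raw_orders (order_id, status, total_amount)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, " completed ", 20.0),
        (2, "pending", None),
        (None, "completed", 15.0),  # bad row is kept in raw, filtered later
    ],
)

# 3. Transform inside the target engine, using its SQL compute
conn.execute(
    """
    CREATE TABLE fact_orders AS
    SELECT order_id,
           UPPER(TRIM(status)) AS status,
           COALESCE(total_amount, 0) AS total_amount
    FROM raw_orders
    WHERE order_id IS NOT NULL
    """
)
```

Because the raw table is preserved, the transformation can be changed and re-run at any time without going back to the source system.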

Medallion Architecture

Bronze, Silver, Gold Layers

The medallion architecture (popularized by Databricks) organizes a lakehouse into three layers of increasing data quality. This pattern provides a clear, auditable data lineage from raw ingestion to business-ready datasets.

| Layer | Purpose | Data Quality | Consumers |
| --- | --- | --- | --- |
| Bronze (Raw) | Ingest raw data as-is from sources | Unvalidated, may contain duplicates | Data engineers, debugging |
| Silver (Cleaned) | Clean, deduplicate, conform, join | Validated, standardized schemas | Data scientists, analysts |
| Gold (Business) | Aggregate, model for specific use cases | Business-ready, optimized for queries | BI dashboards, reports, APIs |

# Medallion architecture pipeline example
from pyspark.sql.functions import col, countDistinct, current_timestamp

# BRONZE: Raw ingestion (append-only)
raw_events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
    .option("subscribe", "user-events")
    .load())

(raw_events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/checkpoints/bronze")
    .start("s3://lake/bronze/user_events"))

# SILVER: Cleaned and deduplicated
bronze_events = spark.read.format("delta").load("s3://lake/bronze/user_events")

silver_events = (
    bronze_events
    .dropDuplicates(["event_id"])
    .filter("event_type IS NOT NULL")
    .withColumn("event_date", col("timestamp").cast("date"))
    .withColumn("processed_at", current_timestamp())
)

(silver_events.write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .save("s3://lake/silver/user_events"))

# GOLD: Business aggregates
silver = spark.read.format("delta").load("s3://lake/silver/user_events")

daily_active_users = (
    silver
    .filter("event_type = 'page_view'")
    .groupBy("event_date")
    .agg(countDistinct("user_id").alias("dau"))
)

(daily_active_users.write.format("delta")
    .mode("overwrite")
    .save("s3://lake/gold/daily_active_users"))

Data Governance and Cataloging

Why Governance Matters

Without governance, data platforms become unusable. Teams cannot find datasets, do not know if data is fresh, and cannot determine who has access. Governance covers metadata management, access control, data lineage, quality monitoring, and compliance (GDPR, HIPAA, SOC2).

Key Governance Tools

| Tool | Category | Key Features |
| --- | --- | --- |
| Apache Atlas | Metadata / Lineage | Entity classification, lineage graphs, Hadoop ecosystem integration |
| DataHub (LinkedIn) | Data Catalog | Search and discovery, lineage, schema history, ownership tracking |
| Unity Catalog (Databricks) | Unified Governance | Fine-grained ACLs, lineage, cross-workspace sharing |
| AWS Glue Data Catalog | Schema Registry | Automatic schema discovery, Athena/Redshift integration |
| Great Expectations | Data Quality | Expectation suites, validation, profiling, alerting |

# Great Expectations: defining data quality checks
import great_expectations as gx

context = gx.get_context()

# Define expectations for the silver orders table
validator = context.sources.pandas_default.read_csv("silver_orders.csv")

validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("total_amount", min_value=0, max_value=1000000)
validator.expect_column_values_to_be_in_set("status", ["pending", "completed", "cancelled", "refunded"])
validator.expect_column_pair_values_a_to_be_greater_than_b("ship_date", "order_date", or_equal=True)

results = validator.validate()
print(f"Validation passed: {results.success}")

Data Mesh Architecture

What Is Data Mesh?

Data mesh is a decentralized approach to data architecture proposed by Zhamak Dehghani. Instead of a centralized data team owning all pipelines and warehouses, data mesh distributes data ownership to domain teams who produce, own, and serve their data as products.

Four Principles of Data Mesh

  • Domain ownership: Each business domain (orders, payments, customers) owns its data end-to-end, including ingestion, transformation, quality, and serving.
  • Data as a product: Domain teams treat their datasets as products with SLAs, documentation, discoverability, and quality guarantees.
  • Self-serve data platform: A platform team provides infrastructure (storage, compute, catalog, access control) as a self-serve product so domain teams do not reinvent the wheel.
  • Federated computational governance: Global policies (naming conventions, security, interoperability standards) are defined centrally but enforced computationally through automation.
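Federated computational governance in practice often means small automated checks that run in CI for every domain's datasets. The sketch below validates dataset names and required product metadata against global conventions; the `<domain>.<layer>.<table>` naming pattern and the required fields are invented for illustration.

```python
import re

# Global conventions, defined centrally by the governance group
NAMING = re.compile(r"^[a-z_]+\.(bronze|silver|gold)\.[a-z_]+$")
REQUIRED_METADATA = {"owner", "description", "sla_hours"}

def check_dataset(name: str, metadata: dict) -> list:
    """Return a list of policy violations (empty means compliant)."""
    errors = []
    if not NAMING.match(name):
        errors.append(f"{name}: name must match <domain>.<layer>.<table>")
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        errors.append(f"{name}: missing metadata fields {sorted(missing)}")
    return errors
```

Domain teams keep full ownership of their pipelines, but a check like this, wired into their deployment process, enforces the shared standards automatically instead of through review meetings.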

Data Mesh vs Centralized Architecture

| Aspect | Centralized (Data Warehouse/Lake) | Data Mesh |
| --- | --- | --- |
| Ownership | Central data team | Domain teams |
| Bottleneck | Central team becomes a bottleneck | Distributed, scales with org |
| Data quality | Central team responsible | Domain producers accountable |
| Best for | Small/medium orgs, unified analytics | Large orgs with many domains |
| Complexity | Simpler infrastructure | Higher organizational complexity |

Partitioning and Performance Optimization

Choosing File Formats

File format choice dramatically impacts query performance and storage cost in data lakes. Columnar formats are preferred for analytical workloads because they enable column pruning (reading only needed columns) and better compression.

| Format | Type | Best For | Compression |
| --- | --- | --- | --- |
| Parquet | Columnar | Analytics, Spark, Athena, BigQuery | Excellent (Snappy, Zstd) |
| ORC | Columnar | Hive ecosystem, Presto | Excellent (Zlib, Snappy) |
| Avro | Row-based | Streaming, Kafka, schema evolution | Good (Deflate, Snappy) |
| CSV/JSON | Row-based | Data exchange, small datasets | Poor (text-based) |
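The benefit of columnar layouts can be seen even in a toy in-memory comparison: summing one field touches every record in a row layout but only a single contiguous list in a columnar one, which is what lets engines skip unread columns entirely. The data below is invented for illustration.

```python
# Row-oriented: each record stored together (like CSV/JSON/Avro)
rows = [
    {"order_id": 1, "region": "EU", "total_amount": 20.0},
    {"order_id": 2, "region": "US", "total_amount": 35.5},
    {"order_id": 3, "region": "EU", "total_amount": 12.0},
]

# Column-oriented: each column stored contiguously (like Parquet/ORC)
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "total_amount": [20.0, 35.5, 12.0],
}

# Row layout: every record must be deserialized to reach one field
row_total = sum(r["total_amount"] for r in rows)

# Columnar layout: only the needed column is read, and runs of
# identical neighboring values (e.g. "EU", "EU") compress well
col_total = sum(columns["total_amount"])
```

Both produce the same answer, but at lake scale the columnar read scans a fraction of the bytes.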

For deeper coverage of partitioning strategies, see our guide on database partitioning. For distributing data across nodes, check out sharding strategies.

Partition Strategies for Data Lakes

-- Hive-style partitioning in a data lake
-- Directory structure:
-- s3://lake/silver/orders/year=2024/month=01/day=15/part-00000.parquet
-- s3://lake/silver/orders/year=2024/month=01/day=16/part-00000.parquet

-- Query with partition pruning (only reads relevant directories)
SELECT region, SUM(total_amount) AS revenue
FROM silver.orders
WHERE year = 2024 AND month = 1
GROUP BY region
ORDER BY revenue DESC;

-- Iceberg hidden partitioning (no need for partition columns in queries)
CREATE TABLE catalog.silver.orders (
    order_id    BIGINT,
    order_date  TIMESTAMP,
    region      STRING,
    total_amount DECIMAL(12,2)
) USING iceberg
PARTITIONED BY (month(order_date), region);

-- Iceberg automatically prunes partitions from predicates
SELECT * FROM catalog.silver.orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

Real-World Architecture Patterns

Lambda Architecture

Lambda architecture maintains two parallel pipelines: a batch layer for accurate, complete processing of historical data and a speed layer for low-latency processing of real-time streams. A serving layer merges results from both. While powerful, this approach has high operational complexity due to maintaining two separate codebases.

Kappa Architecture

Kappa architecture simplifies Lambda by using a single stream-processing pipeline for both real-time and historical reprocessing. All data flows through a log (like Apache Kafka), and reprocessing is done by replaying the log through an updated processor. This reduces complexity but requires a robust streaming platform.
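Kappa-style reprocessing can be sketched in a few lines: derived state is never patched in place, it is rebuilt by replaying the log through the (possibly updated) processor. The event shapes and the `count_views` processor here are invented for illustration.

```python
def replay(log, processor):
    """Rebuild derived state from scratch by replaying an
    append-only event log through a processor function."""
    state = {}
    for event in log:
        processor(state, event)
    return state

def count_views(state, event):
    """v1 processor: count page views per user."""
    if event["type"] == "page_view":
        state[event["user"]] = state.get(event["user"], 0) + 1

log = [
    {"type": "page_view", "user": "a"},
    {"type": "click", "user": "a"},
    {"type": "page_view", "user": "b"},
    {"type": "page_view", "user": "a"},
]

views = replay(log, count_views)  # {'a': 2, 'b': 1}
```

Upgrading the logic means deploying a v2 processor and replaying the same log from the start; no separate batch codebase is needed, which is exactly the simplification Kappa offers over Lambda.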

Choosing the Right Architecture

Decision Framework:

1. Data Volume and Variety
   - Structured data only, less than 10TB     --> Data Warehouse (Snowflake, BigQuery)
   - Mixed formats, ML workloads, over 10TB   --> Data Lakehouse (Delta Lake, Iceberg)
   - Raw dump with future unknown use cases   --> Data Lake (S3 + Glue Catalog)

2. Team and Organization
   - Small data team, SQL-first               --> Warehouse + dbt
   - Large org, many domains, distributed     --> Data Mesh + Lakehouse
   - Startup, iterate fast                    --> ELT into warehouse

3. Latency Requirements
   - Batch (hourly/daily) is acceptable       --> Standard ELT pipeline
   - Near real-time (minutes)                 --> Streaming into lakehouse
   - True real-time (seconds)                 --> Kappa + stream processing

4. Budget
   - Predictable, controlled costs            --> Reserved warehouse capacity
   - Pay-per-query, variable workloads        --> Serverless (BigQuery, Athena)
   - Minimize storage cost                    --> Data lake on S3/GCS

For understanding how these architectures interact with transactional systems, see our guide on OLTP vs OLAP. To learn about choosing between SQL and NoSQL for source systems, check out SQL vs NoSQL databases.

Frequently Asked Questions

When should I use a data lake vs a data warehouse?

Use a data warehouse when you have structured data and need fast, reliable SQL analytics for BI and reporting. Use a data lake when you have diverse data formats (logs, images, JSON), need cheap storage at massive scale, or run machine learning workloads. In most modern architectures, a data lakehouse gives you both capabilities in one system.

What is the difference between ETL and ELT?

ETL transforms data before loading it into the target. ELT loads raw data first and transforms it inside the target system. ELT is preferred with modern cloud warehouses because they have abundant compute power, and keeping raw data accessible enables more flexible transformations over time.

Is data mesh right for my organization?

Data mesh works best for large organizations with multiple autonomous domains that each generate and consume significant data. If you have fewer than 5 data-producing domains or a small data team, a centralized lakehouse with clear ownership conventions will likely be simpler and more effective. Data mesh adds organizational complexity that only pays off at scale.

How do I prevent my data lake from becoming a data swamp?

Implement governance from day one: use a data catalog (DataHub, Glue Catalog) for discoverability, enforce naming conventions and folder structures, set up automated data quality checks (Great Expectations), track lineage, define ownership for every dataset, and use the medallion architecture to separate raw from curated data. Treat your data platform like a product, not a dumping ground.

Should I use Delta Lake or Apache Iceberg?

Both are excellent table formats for lakehouses. Choose Delta Lake if you are in the Databricks ecosystem or use Spark heavily. Choose Apache Iceberg if you need multi-engine support (Spark, Trino, Flink, Dremio) or want engine-agnostic open standards. Both provide ACID transactions, time travel, and schema evolution. Iceberg has stronger partition evolution capabilities, while Delta Lake has deeper Databricks integration and optimization features like Z-ordering and liquid clustering.
