Data Lakes and Data Warehouses: Modern Data Architecture
As organizations collect ever-growing volumes of data, choosing the right storage and analytics architecture becomes critical. Data lakes, data warehouses, and the newer data lakehouse paradigm each solve different problems. Understanding their trade-offs, ingestion patterns, and governance models is essential for building scalable, cost-effective data platforms. This guide covers schema strategies, ETL vs ELT, leading tools, and modern architectural patterns like data mesh and medallion architecture.
Data Warehouse Fundamentals
What Is a Data Warehouse?
A data warehouse is a centralized repository of structured, curated data optimized for analytical queries. Data is cleaned, transformed, and loaded into a predefined schema before it can be queried. This schema-on-write approach enforces data quality at ingestion time, making queries fast and results trustworthy.
Warehouses excel at business intelligence (BI), dashboards, and ad-hoc SQL analytics. They are the backbone of most reporting systems and pair naturally with OLAP workloads.
Schema-on-Write
In schema-on-write, you define tables, columns, and data types before loading data. Any record that does not conform is rejected or transformed during ingestion. This guarantees that every row in the warehouse matches the expected format, enabling aggressive query optimization through columnar storage, compression, and indexing.
-- Example: creating a fact table in a warehouse
-- (partitioning/clustering DDL syntax varies by engine; shown in generic form)
CREATE TABLE fact_orders (
order_id BIGINT NOT NULL,
customer_id BIGINT NOT NULL,
order_date DATE NOT NULL,
total_amount DECIMAL(12,2) NOT NULL,
status VARCHAR(20) NOT NULL,
region VARCHAR(50) NOT NULL
)
PARTITION BY (order_date)
CLUSTER BY (region);
Popular Data Warehouses
| Warehouse | Cloud Provider | Key Strengths | Pricing Model |
|---|---|---|---|
| Snowflake | Multi-cloud | Separation of storage/compute, instant scaling, data sharing | Pay per second of compute + storage |
| BigQuery | Google Cloud | Serverless, petabyte-scale, built-in ML | Pay per query (bytes scanned) or flat-rate |
| Amazon Redshift | AWS | Deep AWS integration, Redshift Spectrum for S3 | On-demand or reserved instances |
| Azure Synapse | Azure | Unified analytics, Spark integration | Dedicated or serverless pools |
Data Lake Fundamentals
What Is a Data Lake?
A data lake is a vast storage repository that holds raw data in its native format—structured, semi-structured, or unstructured—until it is needed. Unlike a warehouse, a data lake uses schema-on-read: the schema is applied only when the data is queried, not when it is stored. This makes ingestion extremely fast and flexible.
Data lakes are built on cheap, scalable object storage such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). They store everything from CSV and JSON files to Parquet, Avro, images, and log files.
Schema-on-Read
Schema-on-read defers structure decisions to query time. A data engineer or data scientist defines the schema when they read the data, not when they write it. This is powerful for exploratory analytics and machine learning, where the same raw data might be read in many different ways.
# Reading raw Parquet files with schema-on-read in PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LakeRead").getOrCreate()
# Schema is inferred or specified at read time
orders = spark.read.parquet("s3://data-lake/raw/orders/")
orders.printSchema()
# Apply transformations at query time
daily_revenue = (
orders
.filter(orders.status == "completed")
.groupBy("order_date")
.agg({"total_amount": "sum"})
)
daily_revenue.show()
Data Lake Challenges
Without governance, data lakes become data swamps—disorganized repositories where nobody can find or trust the data. Common pitfalls include:
- No catalog or metadata management, so users cannot discover datasets
- No access controls, leading to compliance violations
- Duplicate, stale, or contradictory data with no lineage tracking
- Poor file formats (many small files) that degrade query performance
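The small-files problem is concrete enough to put numbers on: query engines pay a fixed overhead per file opened, so compaction jobs coalesce many tiny files into a few large ones. A minimal sketch of how such a job might size its output, using a 128 MB target that matches common Parquet guidance (the helper name is illustrative):

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # ~128 MB, a common Parquet target size

def plan_compaction(file_sizes_bytes):
    """Given the sizes of many small files, return how many
    output files a compaction job should coalesce them into."""
    total = sum(file_sizes_bytes)
    return max(1, math.ceil(total / TARGET_FILE_BYTES))

# 10,000 files of ~1 MB each collapse into 75 right-sized files
small_files = [1_000_000] * 10_000
print(plan_compaction(small_files))  # -> 75
```

In Spark this sizing decision typically feeds a `coalesce` or `repartition` call before the write; Delta Lake and Iceberg offer built-in compaction (`OPTIMIZE`, `rewrite_data_files`) that automates it.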
The Data Lakehouse Paradigm
What Is a Data Lakehouse?
A data lakehouse combines the flexibility and low cost of a data lake with the reliability, performance, and governance features of a data warehouse. It stores data in open file formats on object storage but adds a transactional metadata layer that provides ACID transactions, schema enforcement, and time travel.
The lakehouse eliminates the need to maintain separate lake and warehouse systems, reducing data duplication, ETL complexity, and cost. For a deeper understanding of ACID guarantees, see our guide on ACID vs BASE.
Key Lakehouse Technologies
| Technology | Backed By | Format | Key Features |
|---|---|---|---|
| Delta Lake | Databricks | Parquet + transaction log | ACID transactions, time travel, schema evolution, Z-ordering |
| Apache Iceberg | Netflix / Apache | Parquet or ORC + metadata | Hidden partitioning, snapshot isolation, partition evolution |
| Apache Hudi | Uber / Apache | Parquet + timeline | Upserts, incremental processing, compaction |
# Creating a Delta Lake table with ACID guarantees
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
# Write with schema enforcement (orders_df is an existing orders DataFrame)
(
    orders_df.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("s3://data-lake/delta/orders")
)
# Time travel: read a previous version
old_orders = (
    spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("s3://data-lake/delta/orders")
)
# Upsert (merge) new data
delta_table = DeltaTable.forPath(spark, "s3://data-lake/delta/orders")
delta_table.alias("target").merge(
new_data.alias("source"),
"target.order_id = source.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
ETL vs ELT
ETL: Extract, Transform, Load
In ETL, data is extracted from sources, transformed in a staging area (cleaning, joining, aggregating), and then loaded into the target warehouse. This was the dominant pattern when compute was expensive and warehouses had limited processing power.
# Classic ETL pipeline pseudocode
# 1. Extract from source database
raw_data = extract_from_postgres("SELECT * FROM orders WHERE updated_at > last_run")
# 2. Transform in application layer
cleaned = raw_data.drop_duplicates().fill_nulls()
enriched = cleaned.join(dim_customers, on="customer_id")
aggregated = enriched.group_by("region", "month").sum("amount")
# 3. Load into warehouse
aggregated.write_to_warehouse("fact_monthly_revenue")
ELT: Extract, Load, Transform
In ELT, raw data is extracted and loaded directly into the target system (lake or warehouse), and transformations happen inside the target using its compute engine. Modern cloud warehouses like Snowflake and BigQuery have massive compute power, making in-warehouse transformations faster and simpler than external ETL.
Tools like dbt (data build tool) have popularized the ELT approach by enabling analysts to define transformations as SQL models with built-in testing and documentation.
-- dbt model: transform raw orders into a clean fact table
-- models/fact_orders.sql
WITH raw_orders AS (
SELECT * FROM {{ source('raw', 'orders') }}
),
cleaned AS (
SELECT
order_id,
customer_id,
order_date,
COALESCE(total_amount, 0) AS total_amount,
UPPER(TRIM(status)) AS status,
region
FROM raw_orders
WHERE order_id IS NOT NULL
)
SELECT
c.*,
d.customer_name,
d.customer_segment
FROM cleaned c
LEFT JOIN {{ ref('dim_customers') }} d
ON c.customer_id = d.customer_id
ETL vs ELT Comparison
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | External staging area | Inside target system |
| Best for | Legacy warehouses, complex transformations | Cloud warehouses, SQL-based transforms |
| Data freshness | Batch-oriented, higher latency | Near real-time possible |
| Scalability | Limited by staging compute | Scales with warehouse compute |
| Tooling | Informatica, Talend, SSIS | dbt, Snowflake tasks, BigQuery scheduled queries |
| Data lake compatibility | Moderate | Excellent |
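The ELT flow can be sketched end-to-end with the standard library: raw rows are loaded untouched, and all cleaning happens inside the target engine in SQL. Here an in-memory SQLite database stands in for a cloud warehouse; table and column names are illustrative.

```python
import sqlite3

# "Warehouse" target: in-memory SQLite stands in for Snowflake/BigQuery
conn = sqlite3.connect(":memory:")

# 1. Extract + Load: raw rows land untransformed, nulls, whitespace and all
conn.execute("CREATE TABLE raw_orders (order_id, status, total_amount)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " completed ", 120.0), (2, "COMPLETED", None), (3, "pending", 40.0)],
)

# 2. Transform: cleaning happens inside the target, using its SQL engine
conn.execute("""
    CREATE TABLE fact_orders AS
    SELECT order_id,
           UPPER(TRIM(status)) AS status,
           COALESCE(total_amount, 0) AS total_amount
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")

row = conn.execute(
    "SELECT SUM(total_amount) FROM fact_orders WHERE status = 'COMPLETED'"
).fetchone()
print(row[0])  # -> 120.0
```

Because the raw table is preserved, the transformation can be rewritten and re-run at any time, which is exactly the flexibility dbt builds on.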
Medallion Architecture
Bronze, Silver, Gold Layers
The medallion architecture (popularized by Databricks) organizes a lakehouse into three layers of increasing data quality. This pattern provides a clear, auditable data lineage from raw ingestion to business-ready datasets.
| Layer | Purpose | Data Quality | Consumers |
|---|---|---|---|
| Bronze (Raw) | Ingest raw data as-is from sources | Unvalidated, may contain duplicates | Data engineers, debugging |
| Silver (Cleaned) | Clean, deduplicate, conform, join | Validated, standardized schemas | Data scientists, analysts |
| Gold (Business) | Aggregate, model for specific use cases | Business-ready, optimized for queries | BI dashboards, reports, APIs |
# Medallion architecture pipeline example
# BRONZE: Raw ingestion (append-only)
raw_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # required option; address is illustrative
    .option("subscribe", "user-events")
    .load()
)
(
    raw_events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/checkpoints/bronze")
    .start("s3://lake/bronze/user_events")
)
# SILVER: Cleaned and deduplicated
from pyspark.sql.functions import col, current_timestamp

bronze_events = spark.read.format("delta").load("s3://lake/bronze/user_events")
silver_events = (
    bronze_events
    .dropDuplicates(["event_id"])
    .filter("event_type IS NOT NULL")
    .withColumn("event_date", col("timestamp").cast("date"))
    .withColumn("processed_at", current_timestamp())
)
(
    silver_events.write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .save("s3://lake/silver/user_events")
)
# GOLD: Business aggregates
from pyspark.sql.functions import countDistinct

silver = spark.read.format("delta").load("s3://lake/silver/user_events")
daily_active_users = (
    silver
    .filter("event_type = 'page_view'")
    .groupBy("event_date")
    .agg(countDistinct("user_id").alias("dau"))
)
(
    daily_active_users.write.format("delta")
    .mode("overwrite")
    .save("s3://lake/gold/daily_active_users")
)
Data Governance and Cataloging
Why Governance Matters
Without governance, data platforms become unusable. Teams cannot find datasets, do not know if data is fresh, and cannot determine who has access. Governance covers metadata management, access control, data lineage, quality monitoring, and compliance (GDPR, HIPAA, SOC2).
Key Governance Tools
| Tool | Category | Key Features |
|---|---|---|
| Apache Atlas | Metadata / Lineage | Entity classification, lineage graphs, Hadoop ecosystem integration |
| DataHub (LinkedIn) | Data Catalog | Search and discovery, lineage, schema history, ownership tracking |
| Unity Catalog (Databricks) | Unified Governance | Fine-grained ACLs, lineage, cross-workspace sharing |
| AWS Glue Data Catalog | Schema Registry | Automatic schema discovery, Athena/Redshift integration |
| Great Expectations | Data Quality | Expectation suites, validation, profiling, alerting |
# Great Expectations: defining data quality checks
# (fluent API; exact entry points vary across GX versions)
import great_expectations as gx
context = gx.get_context()
# Define expectations for the silver orders table
validator = context.sources.pandas_default.read_csv("silver_orders.csv")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("total_amount", min_value=0, max_value=1000000)
validator.expect_column_values_to_be_in_set("status", ["pending", "completed", "cancelled", "refunded"])
validator.expect_column_pair_values_a_to_be_greater_than_b("ship_date", "order_date", or_equal=True)
results = validator.validate()
print(f"Validation passed: {results.success}")
Data Mesh Architecture
What Is Data Mesh?
Data mesh is a decentralized approach to data architecture proposed by Zhamak Dehghani. Instead of a centralized data team owning all pipelines and warehouses, data mesh distributes data ownership to domain teams who produce, own, and serve their data as products.
Four Principles of Data Mesh
- Domain ownership: Each business domain (orders, payments, customers) owns its data end-to-end, including ingestion, transformation, quality, and serving.
- Data as a product: Domain teams treat their datasets as products with SLAs, documentation, discoverability, and quality guarantees.
- Self-serve data platform: A platform team provides infrastructure (storage, compute, catalog, access control) as a self-serve product so domain teams do not reinvent the wheel.
- Federated computational governance: Global policies (naming conventions, security, interoperability standards) are defined centrally but enforced computationally through automation.
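"Enforced computationally" can be as simple as a shared CI check that every domain's pipeline runs. A minimal sketch, assuming a hypothetical global `<domain>.<layer>.<dataset>` naming convention (the policy itself is illustrative):

```python
import re

# Hypothetical federated policy: dataset names must follow <domain>.<layer>.<dataset>
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*\.(bronze|silver|gold)\.[a-z][a-z0-9_]*$")

def check_dataset_name(name: str) -> bool:
    """Return True if a dataset name complies with the global naming policy."""
    return NAME_PATTERN.match(name) is not None

print(check_dataset_name("payments.silver.settlements"))  # True
print(check_dataset_name("Misc Stuff/final_v2"))          # False
```

The point is the enforcement mechanism, not the rule: policies are defined once centrally, then executed automatically in every domain's deployment pipeline rather than reviewed by hand.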
Data Mesh vs Centralized Architecture
| Aspect | Centralized (Data Warehouse/Lake) | Data Mesh |
|---|---|---|
| Ownership | Central data team | Domain teams |
| Bottleneck | Central team becomes a bottleneck | Distributed, scales with org |
| Data quality | Central team responsible | Domain producers accountable |
| Best for | Small/medium orgs, unified analytics | Large orgs with many domains |
| Complexity | Simpler infrastructure | Higher organizational complexity |
Partitioning and Performance Optimization
Choosing File Formats
File format choice dramatically impacts query performance and storage cost in data lakes. Columnar formats are preferred for analytical workloads because they enable column pruning (reading only needed columns) and better compression.
| Format | Type | Best For | Compression |
|---|---|---|---|
| Parquet | Columnar | Analytics, Spark, Athena, BigQuery | Excellent (Snappy, Zstd) |
| ORC | Columnar | Hive ecosystem, Presto | Excellent (Zlib, Snappy) |
| Avro | Row-based | Streaming, Kafka, schema evolution | Good (Deflate, Snappy) |
| CSV/JSON | Row-based | Data exchange, small datasets | Poor (text-based) |
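The column-pruning advantage of columnar formats can be seen with plain Python lists: in a columnar layout each column is stored contiguously, so a query that needs one column touches only that column's values, while a row layout drags every field of every record along.

```python
# Row layout: reading one field still means scanning whole records
rows = [
    {"order_id": 1, "region": "EU", "total_amount": 120.0},
    {"order_id": 2, "region": "US", "total_amount": 40.0},
    {"order_id": 3, "region": "EU", "total_amount": 75.0},
]

# Columnar layout: each column stored (and compressed) contiguously
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "total_amount": [120.0, 40.0, 75.0],
}

# SELECT SUM(total_amount): the columnar layout scans one list ...
print(sum(columns["total_amount"]))          # -> 235.0
# ... while the row layout must visit every record in full
print(sum(r["total_amount"] for r in rows))  # -> 235.0, but with more I/O
```

Parquet and ORC apply the same idea on disk, with per-column compression and statistics (min/max per chunk) that enable further skipping.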
For deeper coverage of partitioning strategies, see our guide on database partitioning. For distributing data across nodes, check out sharding strategies.
Partition Strategies for Data Lakes
-- Hive-style partitioning in a data lake
-- Directory structure:
-- s3://lake/silver/orders/year=2024/month=01/day=15/part-00000.parquet
-- s3://lake/silver/orders/year=2024/month=01/day=16/part-00000.parquet
-- Query with partition pruning (only reads relevant directories)
SELECT region, SUM(total_amount) AS revenue
FROM silver.orders
WHERE year = 2024 AND month = 1
GROUP BY region
ORDER BY revenue DESC;
-- Iceberg hidden partitioning (no need for partition columns in queries)
CREATE TABLE catalog.silver.orders (
order_id BIGINT,
order_date TIMESTAMP,
region STRING,
total_amount DECIMAL(12,2)
) USING iceberg
PARTITIONED BY (month(order_date), region);
-- Iceberg automatically prunes partitions from predicates
SELECT * FROM catalog.silver.orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
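Hive-style partition pruning is essentially path filtering: the engine parses `key=value` directory segments and skips every file whose partition values fail the predicate. A pure-Python sketch of that mechanism, using the directory structure shown above:

```python
def partition_values(path):
    """Parse key=value pairs out of a Hive-style partition path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

paths = [
    "s3://lake/silver/orders/year=2024/month=01/day=15/part-00000.parquet",
    "s3://lake/silver/orders/year=2024/month=01/day=16/part-00000.parquet",
    "s3://lake/silver/orders/year=2024/month=02/day=01/part-00000.parquet",
]

# WHERE year = 2024 AND month = 1: only the January directories are read
pruned = [
    p for p in paths
    if (pv := partition_values(p))["year"] == "2024" and int(pv["month"]) == 1
]
print(len(pruned))  # -> 2
```

Iceberg's hidden partitioning does this mapping internally from the predicate on `order_date`, which is why queries never need to mention `year` or `month` columns.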
Real-World Architecture Patterns
Lambda Architecture
Lambda architecture maintains two parallel pipelines: a batch layer for accurate, complete processing of historical data and a speed layer for low-latency processing of real-time streams. A serving layer merges results from both. While powerful, this approach has high operational complexity due to maintaining two separate codebases.
Kappa Architecture
Kappa architecture simplifies Lambda by using a single stream-processing pipeline for both real-time and historical reprocessing. All data flows through a log (like Apache Kafka), and reprocessing is done by replaying the log through an updated processor. This reduces complexity but requires a robust streaming platform.
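The replay idea behind Kappa can be shown with an in-memory log: when processing logic changes, the new processor version rebuilds its state by re-reading the log from offset zero. Here a plain list stands in for a Kafka topic, and the event fields are illustrative.

```python
# The log: an append-only record of raw events (stand-in for a Kafka topic)
event_log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

def process_v1(log):
    """Original processor: total amount per user."""
    totals = {}
    for e in log:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

def process_v2(log):
    """Updated logic: ignore amounts under 6.
    Reprocessing history = replaying the same log through the new code."""
    totals = {}
    for e in log:
        if e["amount"] >= 6:
            totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

print(process_v1(event_log))  # {'a': 17, 'b': 5}
print(process_v2(event_log))  # {'a': 17}
```

Because both versions consume the identical log, there is one codebase per processor version rather than separate batch and speed implementations, which is the complexity Lambda pays for.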
Choosing the Right Architecture
Decision Framework:
1. Data Volume and Variety
- Structured data only, less than 10TB --> Data Warehouse (Snowflake, BigQuery)
- Mixed formats, ML workloads, over 10TB --> Data Lakehouse (Delta Lake, Iceberg)
- Raw dump with future unknown use cases --> Data Lake (S3 + Glue Catalog)
2. Team and Organization
- Small data team, SQL-first --> Warehouse + dbt
- Large org, many domains, distributed --> Data Mesh + Lakehouse
- Startup, iterate fast --> ELT into warehouse
3. Latency Requirements
- Batch (hourly/daily) is acceptable --> Standard ELT pipeline
- Near real-time (minutes) --> Streaming into lakehouse
- True real-time (seconds) --> Kappa + stream processing
4. Budget
- Predictable, controlled costs --> Reserved warehouse capacity
- Pay-per-query, variable workloads --> Serverless (BigQuery, Athena)
- Minimize storage cost --> Data lake on S3/GCS
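The volume-and-variety branch of the framework can be encoded as a first-pass heuristic; the thresholds and labels below are illustrative, not prescriptive, and real decisions weigh team shape, latency, and budget as well.

```python
def suggest_architecture(structured_only: bool, volume_tb: float, ml_workloads: bool) -> str:
    """First-pass suggestion following the decision framework above.
    Thresholds are illustrative, not prescriptive."""
    if structured_only and volume_tb < 10 and not ml_workloads:
        return "data warehouse"
    if ml_workloads or volume_tb >= 10:
        return "data lakehouse"
    return "data lake"

print(suggest_architecture(structured_only=True, volume_tb=2, ml_workloads=False))   # data warehouse
print(suggest_architecture(structured_only=False, volume_tb=50, ml_workloads=True))  # data lakehouse
```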
For understanding how these architectures interact with transactional systems, see our guide on OLTP vs OLAP. To learn about choosing between SQL and NoSQL for source systems, check out SQL vs NoSQL databases.
Frequently Asked Questions
When should I use a data lake vs a data warehouse?
Use a data warehouse when you have structured data and need fast, reliable SQL analytics for BI and reporting. Use a data lake when you have diverse data formats (logs, images, JSON), need cheap storage at massive scale, or run machine learning workloads. In most modern architectures, a data lakehouse gives you both capabilities in one system.
What is the difference between ETL and ELT?
ETL transforms data before loading it into the target. ELT loads raw data first and transforms it inside the target system. ELT is preferred with modern cloud warehouses because they have abundant compute power, and keeping raw data accessible enables more flexible transformations over time.
Is data mesh right for my organization?
Data mesh works best for large organizations with multiple autonomous domains that each generate and consume significant data. If you have fewer than 5 data-producing domains or a small data team, a centralized lakehouse with clear ownership conventions will likely be simpler and more effective. Data mesh adds organizational complexity that only pays off at scale.
How do I prevent my data lake from becoming a data swamp?
Implement governance from day one: use a data catalog (DataHub, Glue Catalog) for discoverability, enforce naming conventions and folder structures, set up automated data quality checks (Great Expectations), track lineage, define ownership for every dataset, and use the medallion architecture to separate raw from curated data. Treat your data platform like a product, not a dumping ground.
Should I use Delta Lake or Apache Iceberg?
Both are excellent table formats for lakehouses. Choose Delta Lake if you are in the Databricks ecosystem or use Spark heavily. Choose Apache Iceberg if you need multi-engine support (Spark, Trino, Flink, Dremio) or want engine-agnostic open standards. Both provide ACID transactions, time travel, and schema evolution. Iceberg has stronger partition evolution capabilities, while Delta Lake has deeper Databricks integration and optimization features like Z-ordering and liquid clustering.