📌 Distributed File Systems — HDFS, GFS & the Architecture Behind Big Data Storage
When a single machine can no longer hold or process your data, you need a file system that spans hundreds or thousands of nodes. Distributed File Systems (DFS) solve this by splitting files into chunks, replicating them across machines, and providing a unified namespace so applications see one logical file system. Google File System (GFS) pioneered this approach, and Hadoop Distributed File System (HDFS) brought it to the open-source world. Together, they form the backbone of virtually every large-scale data pipeline in existence.
This guide covers the architecture of GFS and HDFS, how replication and fault tolerance work, the read/write data flow, MapReduce integration, practical commands, and when you should (and shouldn't) reach for a distributed file system. For a broader look at storage paradigms, see our guide on Storage & Databases in System Design.
🔍 Why Distributed File Systems Exist
Traditional file systems like ext4 or NTFS are designed for a single disk on a single machine. They hit hard limits quickly:
- Capacity — A single server tops out at tens of terabytes. Modern datasets are petabytes or more.
- Throughput — One disk's sequential read speed (~200 MB/s for HDD) is a bottleneck when you need to scan terabytes.
- Reliability — A single disk failure means data loss unless you maintain external backups.
- Compute locality — Moving terabytes of data to a compute node is slow; moving compute to data is fast.
Distributed file systems address all four by spreading data across a cluster, reading in parallel from many disks, replicating every block, and enabling frameworks like MapReduce to process data where it lives.
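The throughput point is easy to make concrete with back-of-the-envelope arithmetic. The sketch below assumes a 10 TB dataset and roughly 200 MB/s per disk; these are illustrative numbers, not benchmarks.

```java
public class ScanTime {
    // Seconds to scan a dataset at a given per-disk throughput, reading
    // from `disks` disks in parallel.
    // 10 TB at 200 MB/s on one disk: 50,000 s (about 14 hours).
    // The same 10 TB spread across 500 disks: 100 s.
    static long scanSeconds(long datasetBytes, long mbPerSecPerDisk, int disks) {
        long aggregateBytesPerSec = mbPerSecPerDisk * 1_000_000L * disks;
        return datasetBytes / aggregateBytesPerSec;
    }
}
```

The parallelism is the whole story here: aggregate bandwidth scales linearly with the number of disks, which is why a full-table scan that takes half a day on one machine finishes in minutes on a modest cluster.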
⚙️ Google File System (GFS) Architecture
GFS was described in the landmark 2003 paper by Ghemawat, Gobioff, and Leung. It was built around a set of deliberate assumptions: files are large (multi-GB), appends are far more common than random writes, and hardware failures are the norm, not the exception.
Core Components
| Component | Role |
|---|---|
| GFS Master | Single node that holds all metadata: the namespace (directory tree), file-to-chunk mapping, and chunk-to-chunkserver mapping. It never stores actual data. |
| Chunkservers | Commodity Linux machines that store 64 MB chunks as regular files on local disks. Each chunk is identified by a globally unique 64-bit chunk handle. |
| GFS Client | A library linked into applications. It contacts the master for metadata, then communicates directly with chunkservers for data — keeping the master off the data path. |
Key Design Decisions
- Large chunk size (64 MB) — Reduces metadata stored on the master and minimises client-master interactions.
- Single master — Simplifies design. The master's in-memory state makes metadata operations fast, but it is a single point of failure mitigated by operation log replication and shadow masters.
- Lease-based mutation ordering — The master grants a lease to one replica (the primary), which serialises concurrent mutations to a chunk.
- Record append — GFS's signature operation: atomically append data at an offset chosen by GFS, enabling concurrent producers without locking.
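The record-append idea can be sketched in a few lines: the storage side, not the client, chooses the offset, so concurrent producers never contend on a position. This is a toy illustration of the semantics, not GFS's actual implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RecordAppend {
    private final AtomicLong nextOffset = new AtomicLong(0);

    // The server picks the offset atomically; concurrent producers
    // therefore never need to coordinate among themselves.
    long append(byte[] record) {
        return nextOffset.getAndAdd(record.length);
    }
}
```

Each caller learns where its record landed only after the fact, which is exactly the contract GFS offers: "your data was appended somewhere at least once", rather than "your data is at the offset you asked for".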
🧩 Hadoop Distributed File System (HDFS) Architecture
HDFS is the open-source implementation inspired by GFS. It ships as part of the Apache Hadoop project and shares the same master/worker topology, with different naming conventions.
HDFS Components
| HDFS Term | GFS Equivalent | Description |
|---|---|---|
| NameNode | GFS Master | Manages namespace, block mapping, and replication. Stores metadata in memory for speed, persisted via FsImage and EditLog. |
| DataNode | Chunkserver | Stores actual data blocks on local disks. Sends periodic heartbeats and block reports to the NameNode. |
| Secondary NameNode | Shadow Master (loose analogy) | Periodically merges the FsImage and EditLog so the edit log does not grow without bound. It is not a hot standby (use an HA NameNode for that). |
Block Size
The default HDFS block size is 128 MB (increased from the original 64 MB). A 1 GB file is split into 8 blocks, each independently replicated and distributed. Large blocks reduce NameNode memory usage (fewer blocks to track) and amortise disk seek time over larger sequential reads.
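The arithmetic behind these numbers can be sketched as follows. The roughly 150 bytes of NameNode heap per block is a widely quoted rule of thumb, not an exact constant; treat it as an estimate.

```java
public class BlockMath {
    // Number of blocks needed for a file (ceiling division).
    static long blocksFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Rough NameNode heap cost, using the common ~150 bytes/block
    // rule of thumb (an approximation, not an exact figure).
    static long estimatedHeapBytes(long totalBlocks) {
        return totalBlocks * 150;
    }
}
```

A 1 GB file at 128 MB blocks yields 8 blocks; a full petabyte yields about 8.4 million blocks, which at the rule-of-thumb rate is on the order of a gigabyte of NameNode heap. Halve the block size and both numbers double.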
🔁 Replication Strategy
Both GFS and HDFS replicate every chunk/block — typically 3 copies. The placement strategy is critical for both durability and read performance.
HDFS Rack-Aware Replication (Default Replication Factor = 3)
- Replica 1 — Placed on the DataNode where the writer is running (or a random node if the writer is external).
- Replica 2 — Placed on a DataNode in a different rack to survive an entire rack failure (power, top-of-rack switch).
- Replica 3 — Placed on a different node in the same rack as Replica 2, balancing durability with network efficiency (intra-rack bandwidth is typically much higher than cross-rack).
This rack-awareness policy ensures that no single rack failure can destroy all copies of a block, while keeping two replicas on the same rack to speed up reads and reduce cross-rack traffic during writes.
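The placement rules above can be sketched as a small function. This is a simplified illustration of the default policy, not the real HDFS placement code (which also weighs load, free space, and stale nodes).

```java
import java.util.*;

public class RackAwarePlacement {
    // Simplified sketch of the default HDFS policy: replica 1 on the
    // writer's node, replica 2 on a node in a different rack, replica 3
    // on another node in replica 2's rack.
    static List<String> place(String writerNode, Map<String, String> nodeToRack) {
        String writerRack = nodeToRack.get(writerNode);
        String remoteNode = nodeToRack.keySet().stream()
                .filter(n -> !nodeToRack.get(n).equals(writerRack))
                .findFirst().orElseThrow();
        String remoteRack = nodeToRack.get(remoteNode);
        String sibling = nodeToRack.keySet().stream()
                .filter(n -> nodeToRack.get(n).equals(remoteRack) && !n.equals(remoteNode))
                .findFirst().orElseThrow();
        return List.of(writerNode, remoteNode, sibling);
    }
}
```

Note the invariant the sketch preserves: the three replicas always span exactly two racks, so one rack failure leaves at least one copy, while two of the three writes stay on cheap intra-rack links.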
🛡️ Fault Tolerance Mechanisms
Hardware failures are expected at scale. A 10,000-node cluster will see multiple disk and node failures every day. Distributed file systems handle this transparently.
Heartbeats
Every DataNode sends a heartbeat to the NameNode every 3 seconds (dfs.heartbeat.interval). If heartbeats stop arriving, the NameNode waits roughly 10.5 minutes with the default settings (a value derived from dfs.namenode.heartbeat.recheck-interval together with the heartbeat interval) before declaring the DataNode dead and initiating re-replication of all blocks that were stored on that node.
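The dead-node timeout is not a single setting: HDFS computes it as twice the recheck interval plus ten heartbeat intervals, which with the defaults (a 5-minute recheck, a 3-second heartbeat) comes to 630 seconds, about 10.5 minutes.

```java
public class HeartbeatTimeout {
    // HDFS marks a DataNode dead after
    //   2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval.
    // With the defaults (300,000 ms recheck, 3,000 ms heartbeat) this is
    // 630,000 ms, i.e. 10.5 minutes.
    static long deadNodeIntervalMs(long recheckIntervalMs, long heartbeatIntervalMs) {
        return 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
    }
}
```

The generous window is deliberate: re-replicating every block from a node is expensive, so the NameNode tolerates brief network blips rather than triggering a replication storm on every hiccup.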
Block Reports
DataNodes periodically send a full inventory of their blocks to the NameNode. This allows the NameNode to detect under-replicated, over-replicated, or corrupt blocks and schedule corrective actions.
Checksums
Each block is stored with a CRC-32C checksum. DataNodes verify checksums on every read. If corruption is detected, the client transparently reads from another replica, and the NameNode schedules a fresh replica to be copied from a healthy one while the corrupt copy is discarded.
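Java's standard library ships a CRC-32C implementation (java.util.zip.CRC32C, available since Java 9), which makes the verify-on-read idea easy to sketch:

```java
import java.util.zip.CRC32C;

public class BlockChecksum {
    static long checksum(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // A read is served only if the checksum recorded at write time
    // still matches the bytes on disk.
    static boolean verify(byte[] data, long storedChecksum) {
        return checksum(data) == storedChecksum;
    }
}
```

Even a single flipped bit changes the CRC, so silent disk corruption is caught at read time instead of propagating into downstream computations.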
NameNode High Availability
In production, HDFS runs with an Active/Standby NameNode pair. Both share the edit log via a quorum of JournalNodes. If the active NameNode fails, the standby takes over within seconds. This eliminates the single-point-of-failure concern inherited from GFS.
📖 Read and Write Flow
HDFS Write Path
- Client calls `create()` on the NameNode, which allocates a new file entry and returns a list of DataNodes for the first block.
- Client builds a pipeline: DataNode1 → DataNode2 → DataNode3.
- Client streams data packets (64 KB default) to DataNode1, which forwards to DataNode2, which forwards to DataNode3.
- Acknowledgements flow back through the pipeline. Once all replicas confirm, the packet is considered durable.
- After all blocks are written, the client calls `close()`, and the NameNode persists the block locations.
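The pipeline steps above can be modelled in a few lines of plain Java (no Hadoop): each node stores the packet and forwards it downstream, and the ack only travels back once the end of the chain is reached.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

public class WritePipeline {
    // Toy model of the HDFS write pipeline: the first DataNode stores the
    // packet, then forwards it to the rest of the chain; the ack returns
    // only after the last node in the chain has stored it.
    static boolean send(byte[] packet, List<ByteArrayOutputStream> pipeline) {
        if (pipeline.isEmpty()) {
            return true; // reached the end of the chain: ack flows upstream
        }
        pipeline.get(0).writeBytes(packet);                         // store on this replica
        return send(packet, pipeline.subList(1, pipeline.size()));  // forward downstream
    }
}
```

Chaining the transfer this way means the client uploads each packet once, while replication bandwidth is paid by the (typically faster) links between DataNodes.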
HDFS Read Path
- Client calls `open()` on the NameNode, which returns the block locations sorted by proximity (same node > same rack > different rack).
- Client reads directly from the closest DataNode for each block — the NameNode is not on the data path.
- If a DataNode fails mid-read, the client transparently fails over to the next replica.
🚀 MapReduce Integration
DFS and MapReduce are designed as a pair. MapReduce exploits data locality: the JobTracker (or YARN ResourceManager) schedules map tasks on the nodes that already hold the input blocks, eliminating network transfer for the map phase.
Here is a classic word-count MapReduce example in Java:
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Emits (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the per-word counts delivered by the shuffle phase.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```
The framework splits the HDFS input file into InputSplits (aligned with block boundaries), runs mappers in parallel across the cluster, shuffles intermediate key-value pairs, and feeds them to reducers. The output is written back to HDFS.
Common HDFS Shell Commands
```bash
# List files in a directory
hdfs dfs -ls /user/data/

# Upload a local file to HDFS
hdfs dfs -put localfile.csv /user/data/

# Download from HDFS to local
hdfs dfs -get /user/data/output.csv ./

# Check replication factor of a file
hdfs dfs -stat %r /user/data/localfile.csv

# Set replication factor to 5
hdfs dfs -setrep 5 /user/data/localfile.csv

# Check filesystem health
hdfs fsck /user/data/ -files -blocks -locations

# Show disk usage summary
hdfs dfs -du -s -h /user/data/
```
📊 GFS vs HDFS Comparison
| Feature | GFS | HDFS |
|---|---|---|
| Origin | Google (2003 paper) | Apache Hadoop (open-source) |
| Chunk/Block Size | 64 MB | 128 MB (default) |
| Master Node | Single GFS Master + shadow masters | NameNode with HA (Active/Standby) |
| Worker Nodes | Chunkservers | DataNodes |
| Default Replication | 3x | 3x |
| Write Model | Record append (atomic, concurrent) | Write-once, append-only (single writer) |
| Consistency | Relaxed (concurrent mutations can leave regions consistent but undefined) | Strong (single-writer, pipeline acks) |
| Caching | No client-side data caching | Short-circuit local reads, centralized cache |
| Ecosystem | Internal Google (Bigtable, MapReduce) | Hive, Spark, HBase, Pig, Flink, etc. |
| Availability | Proprietary | Open-source (Apache 2.0) |
🌐 Modern Alternatives
While HDFS remains widely used, the landscape has evolved significantly:
- Cloud Object Stores — Amazon S3, Azure Blob Storage, and Google Cloud Storage offer effectively infinite capacity, built-in replication, and pay-per-use pricing. Many organisations have moved from HDFS to S3 + compute engines like Spark or Presto.
- Ceph — A unified storage system providing object, block, and file storage. It uses the CRUSH algorithm for data placement, eliminating the single-master bottleneck.
- MinIO — S3-compatible object storage you can run on-premises. Popular for hybrid cloud and Kubernetes-native workloads.
- Alluxio (formerly Tachyon) — A virtual distributed file system that sits between compute engines and storage backends, providing caching and unified access.
- Apache Ozone — A next-generation object store for Hadoop, designed to handle billions of small files that HDFS struggles with due to NameNode memory constraints.
For a deeper comparison of storage technologies, check out the System Design Comparison Tool on SWEHelper.
💡 When to Use a Distributed File System
Use DFS When:
- Your data exceeds what a single machine can store or process.
- You need high-throughput sequential reads over large datasets (log processing, ETL, analytics).
- Fault tolerance is non-negotiable — you cannot afford data loss from individual node failures.
- You want data locality — processing data where it resides to avoid network bottlenecks.
- Your workload is write-once, read-many (WORM) — batch analytics, data lakes, ML training data.
Avoid DFS When:
- You need low-latency random reads/writes — use a database (sharded database) or key-value store instead.
- You have millions of small files (the "small files problem") — each file consumes NameNode memory, so billions of tiny files can overwhelm metadata management.
- Your data fits on a single machine — the operational complexity of a cluster is not justified.
- You need POSIX compliance with strong random-write semantics — DFS is optimised for append-heavy workloads.
Use the Complexity Calculator to estimate whether your data volumes justify a distributed approach.
❓ Frequently Asked Questions
Q1: What happens when a DataNode fails in HDFS?
The NameNode detects the failure when heartbeats stop arriving (after roughly 10.5 minutes with the default settings). It then identifies all blocks that were stored on the failed node and schedules re-replication — copying those blocks from surviving replicas to other healthy DataNodes to restore the configured replication factor. This process is automatic and transparent to applications reading or writing data.
Q2: Why is the HDFS block size so large (128 MB) compared to a regular filesystem (4 KB)?
Large blocks serve two purposes. First, they reduce metadata overhead — the NameNode tracks every block in memory, so fewer blocks means less memory consumption. A 1 PB cluster with 128 MB blocks has roughly 8 million blocks; at 4 KB it would be over 250 billion. Second, large blocks amortise disk seek time: once the head is positioned, sequential throughput dominates, making large reads highly efficient for analytical workloads.
Q3: Can HDFS handle real-time or low-latency workloads?
HDFS is designed for batch processing with high throughput, not low-latency random access. For real-time needs, use Apache HBase (built on top of HDFS) for random read/write access, or consider dedicated systems like Redis or Memcached for caching, and traditional databases for transactional workloads.
Q4: How does rack awareness improve fault tolerance?
By placing replicas across at least two different racks, HDFS ensures that an entire rack failure (power outage, network switch failure) cannot destroy all copies of any block. Without rack awareness, all three replicas could end up on the same rack, making the system vulnerable to correlated failures. The trade-off is increased cross-rack write traffic, but this is a small price for significantly improved durability.
Q5: What is the difference between HDFS Federation and HDFS High Availability?
HDFS HA addresses the single-point-of-failure problem by running an Active and Standby NameNode pair sharing the same namespace. HDFS Federation addresses the scalability problem by allowing multiple independent NameNodes, each managing a separate namespace volume. Federation scales metadata capacity horizontally (more NameNodes = more files), while HA ensures any single NameNode can survive failures. In practice, production clusters use both together.
Distributed file systems remain a foundational concept in system design. Understanding GFS and HDFS architecture gives you the vocabulary to reason about data partitioning, replication, and fault tolerance across any distributed system. For more system design deep-dives, explore the System Design Blog on SWEHelper.