
Design Dropbox: A Cloud File Storage and Sync System

Dropbox synchronizes files across billions of devices, storing over 1 exabyte of data for 700+ million users. Designing a system like Dropbox tests your understanding of file chunking, deduplication, conflict resolution, metadata management, and efficient sync protocols. This guide covers the complete architecture for a cloud file sync service.

1. Requirements

Functional Requirements

  • Upload and download files from any device.
  • Automatic sync: changes on one device are reflected on all devices.
  • File versioning: view and restore previous versions of a file.
  • Selective sync: choose which folders to sync on each device.
  • Share files and folders with other users via links or direct sharing.
  • Offline access: work on files without internet and sync when reconnected.
  • Conflict resolution when the same file is edited on multiple devices simultaneously.

Non-Functional Requirements

  • Durability: 99.999999999% (11 nines). No file data can ever be lost.
  • Consistency: All devices must eventually see the same file state.
  • Low bandwidth: Only transfer changed portions of files, not entire files.
  • Scalability: Handle 700M+ users, exabytes of data.
  • Low latency: Sync should complete within seconds for small changes.

2. Capacity Estimation

  • Total users: 700 million
  • Daily active users: 100 million
  • Average files per user: 200
  • Average file size: 1 MB
  • Total storage: 700M × 200 × 1 MB = 140 PB (before dedup)
  • Deduplication savings: ~60% (many shared files, common documents)
  • Actual storage needed: ~56 PB (after dedup)
  • File syncs per day: 100M users × 10 syncs = 1 billion
  • Sync operations per second: 1B / 86,400 ≈ 11,500/sec
  • Upload bandwidth: ~50 GB/sec (variable; delta sync reduces this significantly)
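The arithmetic behind these estimates is easy to sanity-check with a few lines of back-of-the-envelope Python (not production code):

```python
# Back-of-the-envelope check of the capacity estimates above.
USERS = 700_000_000
FILES_PER_USER = 200
AVG_FILE_MB = 1

total_mb = USERS * FILES_PER_USER * AVG_FILE_MB
total_pb = total_mb / 1_000_000_000          # 1 PB = 10^9 MB
after_dedup_pb = total_pb * (1 - 0.60)       # ~60% dedup savings

daily_syncs = 100_000_000 * 10               # 100M DAU x 10 syncs/day
syncs_per_sec = daily_syncs / 86_400

print(f"raw storage:  {total_pb:.0f} PB")        # 140 PB
print(f"after dedup:  {after_dedup_pb:.0f} PB")  # 56 PB
print(f"sync ops/sec: {syncs_per_sec:.0f}")      # 11574
```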

3. High-Level Design

  • Desktop Client (Sync Engine): monitors local file changes, chunks files, uploads deltas
  • API Gateway: authentication, rate limiting, request routing
  • Metadata Service: stores file/folder hierarchy, versions, sharing permissions
  • Block Service: handles upload/download of file chunks (blocks)
  • Block Storage (S3): stores actual file blocks with high durability
  • Notification Service: notifies other devices when files change (long polling/WebSocket)
  • Sync Service: manages sync state per device, resolves conflicts
  • Metadata Database: relational DB for file tree, user data, permissions
  • Message Queue: decouples sync events from notification delivery

4. Detailed Component Design

4.1 File Chunking

Instead of uploading entire files, we split them into chunks (blocks). This enables: delta sync (only upload changed chunks), deduplication (identical chunks stored once), parallel upload, and resumable uploads.

import hashlib

def chunk_file(filepath, chunk_size=4*1024*1024):  # 4 MB chunks
    chunks = []
    with open(filepath, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            chunk_hash = hashlib.sha256(data).hexdigest()
            chunks.append({
                'hash': chunk_hash,
                'size': len(data),
                'data': data,
                'offset': f.tell() - len(data)
            })
    return chunks

# Content-defined chunking (Rabin fingerprinting) is better
# because it handles insertions without re-chunking the entire file
def content_defined_chunking(data, min_size=2*1024*1024, max_size=8*1024*1024):
    """Uses a rolling hash to find chunk boundaries based on content"""
    chunks = []
    offset = 0
    while offset < len(data):
        boundary = find_next_boundary(data, offset, min_size, max_size)
        chunk_data = data[offset:boundary]
        chunk_hash = hashlib.sha256(chunk_data).hexdigest()
        chunks.append({'hash': chunk_hash, 'size': len(chunk_data), 'offset': offset})
        offset = boundary
    return chunks

def find_next_boundary(data, offset, min_size, max_size, mask=0x3FFFFF):
    """Simplified stand-in for a Rabin fingerprint: declare a boundary where
    the rolling hash's low 22 bits are all ones (average chunk ~4 MB)."""
    end = min(offset + max_size, len(data))
    pos = offset + min_size
    h = 0
    while pos < end:
        h = ((h << 1) + data[pos]) & 0xFFFFFFFF  # bytes older than 32 shifts drop out
        if (h & mask) == mask:
            return pos + 1
        pos += 1
    return end  # no boundary found before max_size (or end of data)

Why content-defined chunking? Fixed-size chunks are problematic when data is inserted in the middle of a file: all subsequent chunk boundaries shift, and every chunk after the insertion appears "new" even though the content is unchanged. Content-defined chunking uses a rolling hash (Rabin fingerprint) to find boundaries based on content patterns, making boundaries stable across insertions.
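The boundary-shift problem is easy to demonstrate with a toy example (self-contained; tiny 8-byte chunks stand in for 4 MB ones so the effect is visible):

```python
import hashlib

def fixed_chunks(data, size=8):
    """Hash fixed-size chunks of a byte buffer."""
    return [hashlib.sha256(data[i:i+size]).hexdigest()
            for i in range(0, len(data), size)]

original = bytes(range(64))                    # 8 chunks of 8 bytes
edited = original[:4] + b"XX" + original[4:]   # 2-byte insertion near the front

before = fixed_chunks(original)
after = fixed_chunks(edited)
shared = len(set(before) & set(after))
# Every boundary after the insertion shifted, so no chunks can be reused:
print(f"{shared} of {len(before)} chunks reusable")  # 0 of 8 chunks reusable
```

With content-defined chunking the same edit would invalidate only the one or two chunks whose content actually changed.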

4.2 Delta Sync

When a file is modified, the client computes the new chunk list and compares it with the server's chunk list:

async function syncFile(filepath) {
    const localChunks = chunkFile(filepath);
    const localHashes = localChunks.map(c => c.hash);

    // Get server's chunk list for this file
    const serverHashes = new Set(await metadataService.getChunkList(filepath));

    // Find new chunks that don't exist on server (Set lookup is O(1))
    const newChunks = localChunks.filter(c => !serverHashes.has(c.hash));

    // Upload only new chunks
    for (const chunk of newChunks) {
        await blockService.upload(chunk.hash, chunk.data);
    }

    // Update metadata with new chunk list
    await metadataService.updateFile(filepath, {
        chunks: localHashes,
        size: localChunks.reduce((sum, c) => sum + c.size, 0),
        modified_at: Date.now()
    });
}

For a 100 MB file where only 4 MB changed, only 1 chunk is uploaded instead of the entire file. This dramatically reduces bandwidth usage.

4.3 Deduplication

Chunks are content-addressed: their storage key is the SHA-256 hash of the content. If two users upload the same file, or if a chunk appears in multiple files, it is stored only once.

async function uploadChunk(chunkHash, chunkData) {
    // Check if chunk already exists
    const exists = await blockStorage.exists(chunkHash);
    if (exists) {
        // Increment reference count
        await blockStorage.incrementRefCount(chunkHash);
        return { status: "deduplicated", hash: chunkHash };
    }

    // Upload new chunk
    await blockStorage.put(chunkHash, compress(chunkData));
    return { status: "uploaded", hash: chunkHash };
}

Deduplication saves approximately 60% of storage at Dropbox's scale, since many users store the same popular documents, photos, and software installers.
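The upload path above can be modeled with a minimal in-memory content-addressed store (a sketch of the idea, not the actual Block Service API):

```python
import hashlib

class BlockStore:
    """Toy content-addressed store: key = SHA-256 of the chunk, with refcounts."""
    def __init__(self):
        self.blocks = {}      # hash -> bytes
        self.refcounts = {}   # hash -> int

    def put(self, data: bytes) -> tuple[str, bool]:
        h = hashlib.sha256(data).hexdigest()
        if h in self.blocks:
            self.refcounts[h] += 1
            return h, True            # deduplicated: no bytes stored
        self.blocks[h] = data
        self.refcounts[h] = 1
        return h, False               # newly stored

    def release(self, h: str):
        """Drop a reference; delete the block when nothing points at it."""
        self.refcounts[h] -= 1
        if self.refcounts[h] == 0:
            del self.blocks[h], self.refcounts[h]

store = BlockStore()
h1, dup1 = store.put(b"same chunk")
h2, dup2 = store.put(b"same chunk")     # second upload is a dedup hit
print(dup1, dup2, store.refcounts[h1])  # False True 2
```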

4.4 Conflict Resolution

When the same file is edited on two devices before either change syncs, a conflict occurs. Dropbox handles this with optimistic concurrency (a version check on write) plus conflicted copies:

  1. Device A modifies file.txt and syncs. Server version becomes V2 (from A).
  2. Device B modifies file.txt (based on V1) and tries to sync.
  3. Server detects that B's base version (V1) does not match current version (V2).
  4. Server does not overwrite V2; it accepts B's version as a separate "conflicted copy", e.g. file (Device B's conflicted copy).txt.
  5. Both versions are preserved; the user manually resolves the conflict.

async function handleFileUpdate(userId, filePath, newChunks, baseVersion) {
    const currentVersion = await metadata.getCurrentVersion(filePath);

    if (baseVersion === currentVersion.version) {
        // No conflict: apply update normally
        await metadata.createNewVersion(filePath, newChunks, currentVersion.version + 1);
        await notifyOtherDevices(userId, filePath, "updated");
    } else {
        // Conflict detected: save as conflicted copy
        const conflictName = generateConflictName(filePath, userId);
        await metadata.createFile(conflictName, newChunks);
        await notifyOtherDevices(userId, conflictName, "conflict_created");
    }
}

4.5 Notification and Sync Protocol

When a file changes on one device, other devices need to be notified:

  • Long polling / WebSocket: Desktop clients maintain a persistent connection to the Notification Service. When a sync event occurs, the server pushes a notification.
  • Mobile push: APNs/FCM for mobile devices that are not actively connected.

The sync protocol uses a journal-based approach: the server maintains a sync journal (ordered list of changes). Each client tracks its position in the journal and applies changes sequentially. See WebSocket patterns for real-time notification design.
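A minimal sketch of the journal cursor mechanism (field names are illustrative, not the real protocol):

```python
# Journal-based sync: each device replays entries after its last-seen ID.
journal = []   # append-only per user; ids are monotonically increasing

def record_change(file_id, action):
    entry = {"id": len(journal) + 1, "file_id": file_id, "action": action}
    journal.append(entry)
    return entry

def pull_changes(last_seen_id):
    """What a reconnecting device fetches: everything after its cursor."""
    return [e for e in journal if e["id"] > last_seen_id]

record_change("a.txt", "create")
record_change("a.txt", "update")
record_change("b.txt", "create")

# A device that last synced at journal id 1 replays entries 2 and 3:
pending = pull_changes(last_seen_id=1)
print([e["id"] for e in pending])  # [2, 3]
```

Because entries are applied in order, every device converges to the same file state no matter how long it was offline.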

5. Database Schema

CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    storage_quota_bytes BIGINT DEFAULT 2147483648,
    storage_used_bytes BIGINT DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE files (
    id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    parent_folder_id BIGINT,
    name VARCHAR(255) NOT NULL,
    is_folder BOOLEAN DEFAULT FALSE,
    size_bytes BIGINT DEFAULT 0,
    current_version INT DEFAULT 1,
    content_hash VARCHAR(64),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    modified_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    deleted_at TIMESTAMP,
    UNIQUE (user_id, parent_folder_id, name)
);

CREATE INDEX idx_files_parent ON files(parent_folder_id);
CREATE INDEX idx_files_user ON files(user_id);

CREATE TABLE file_versions (
    id BIGINT PRIMARY KEY,
    file_id BIGINT NOT NULL REFERENCES files(id),
    version_number INT NOT NULL,
    size_bytes BIGINT,
    chunk_hashes TEXT NOT NULL, -- ordered chunk hash list (e.g. JSON array)
    modified_by_device_id BIGINT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (file_id, version_number)
);

CREATE TABLE chunks (
    hash VARCHAR(64) PRIMARY KEY,
    size_bytes INT NOT NULL,
    reference_count INT DEFAULT 1,
    storage_url TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE sync_journal (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id BIGINT NOT NULL,
    file_id BIGINT NOT NULL,
    action ENUM('create','update','delete','move','rename'),
    version_number INT,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_journal_user ON sync_journal(user_id, id);

CREATE TABLE devices (
    id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    name VARCHAR(100),
    last_sync_journal_id BIGINT DEFAULT 0,
    last_seen TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

6. Key Trade-offs

  • Fixed vs content-defined chunking: Fixed-size chunking is simpler but inefficient for insertions (all chunk boundaries after the insertion shift). Content-defined chunking (Rabin fingerprint) is more complex but produces stable boundaries, dramatically reducing re-upload for small edits. Dropbox uses content-defined chunking.
  • Chunk size: Smaller chunks (1 MB) enable finer-grained delta sync and better deduplication but increase metadata overhead. Larger chunks (8 MB) carry less metadata but waste bandwidth on small changes. 4 MB is a common balance.
  • Conflict resolution strategy: Last-writer-wins discards one version silently (data loss risk). Conflicted copies preserve both versions (no data loss, but the user must resolve). Operational transformation (as in Google Docs) enables real-time collaboration but is complex. Dropbox uses conflicted copies for safety.
  • Metadata store (SQL vs NoSQL): File trees are hierarchical and relational (parent-child), so SQL is a natural fit. Dropbox uses sharded MySQL for metadata. NoSQL would require denormalization and lose transactional guarantees for folder operations. See SQL vs NoSQL.

7. Scaling Considerations

7.1 Block Storage

Chunks are stored in a distributed object storage system; Dropbox built its own, Magic Pocket, after migrating off S3. The storage system must provide 11-nines durability through replication and erasure coding, with data replicated across multiple data centers and geographic regions.

7.2 Metadata Sharding

The metadata database is sharded by user_id. This co-locates all of a user's files, folders, and sync journal on the same shard, enabling efficient transactions for operations like folder rename (which updates many rows). Use consistent hashing for shard assignment.
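A hedged sketch of consistent-hash shard assignment by user_id (a hash ring with virtual nodes; shard names and parameters are illustrative):

```python
import bisect
import hashlib

class ShardRing:
    """Consistent hashing: map user_id -> shard via virtual nodes on a ring."""
    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{shard}#{v}"), shard)
            for shard in shards for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, user_id: int) -> str:
        h = self._hash(str(user_id))
        i = bisect.bisect(self.keys, h) % len(self.ring)  # wrap around the ring
        return self.ring[i][1]

ring = ShardRing(["shard-1", "shard-2", "shard-3"])
# Deterministic: all of user 42's metadata rows land on the same shard.
assert ring.shard_for(42) == ring.shard_for(42)
```

Adding a shard to the ring remaps only about 1/N of users, rather than reshuffling everyone as naive `hash % N` would.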

7.3 Sync Efficiency

  • Compression: Chunks are compressed (LZ4 or zstd) before upload, reducing bandwidth by 30-50% for text files.
  • Streaming sync: For large batch changes (e.g., initial sync), use streaming protocols instead of individual requests.
  • Bandwidth throttling: The client limits upload speed to avoid saturating the user's network connection.
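Compression's effect on a chunk is easy to verify with a stdlib codec (zlib here as a stand-in for LZ4/zstd, which are third-party libraries; note that highly repetitive text compresses far better than typical documents):

```python
import zlib

text_chunk = b"The quick brown fox jumps over the lazy dog. " * 1000
compressed = zlib.compress(text_chunk, level=6)
ratio = 1 - len(compressed) / len(text_chunk)
print(f"saved {ratio:.0%} of {len(text_chunk)} bytes")
```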

7.4 Caching

Hot metadata (recently active users' file trees) is cached in Redis/Memcached. Block deduplication lookups (does this chunk hash exist?) are cached to avoid hitting storage on every upload.


8. Frequently Asked Questions

Q1: How does delta sync reduce bandwidth for large files?

When a file is modified, only the changed chunks need to be uploaded. For example, editing a paragraph in a 100 MB document changes only one 4 MB chunk. The client computes chunk hashes, compares with the server's list, and uploads only the new/changed chunks. With content-defined chunking, even insertions in the middle of the file only affect 1-2 chunks rather than re-uploading everything.

Q2: How does deduplication work across users?

Each chunk is content-addressed by its SHA-256 hash. When User B uploads a file that User A already has, the system checks if each chunk hash already exists in block storage. If it does, no upload is needed; the system just creates a metadata reference to the existing chunk and increments the reference count. This is why popular files (OS installers, common documents) are stored only once regardless of how many users have them.

Q3: How does the sync journal work?

The sync journal is an append-only log of all file changes per user, ordered by a monotonically increasing ID. Each device tracks the last journal entry it has processed. When a device comes online, it fetches all journal entries after its last processed ID and applies them sequentially. This ensures all devices converge to the same state, regardless of when they sync.

Q4: What happens during initial sync of a large folder?

For initial sync (e.g., 50 GB of files), the client chunks all files locally, computes hashes, and sends the hash list to the server. The server responds with which hashes are already known (dedup check). Only truly new chunks are uploaded. Uploads happen in parallel (typically 4-8 concurrent uploads) with resumable upload support. The process can take hours but is restartable without losing progress.
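The parallel, resumable upload described above can be sketched with a thread pool and a set of already-uploaded hashes (all names are illustrative; in practice the set is persisted locally so a restart skips completed work):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

uploaded = set()  # persisted in a real client, so restarts resume cheaply

def upload_chunk(chunk: bytes) -> str:
    h = hashlib.sha256(chunk).hexdigest()
    if h in uploaded:
        return h  # resume/dedup: already on the server, skip the transfer
    # ... network PUT to the Block Service would go here ...
    uploaded.add(h)
    return h

chunks = [bytes([i]) * 1024 for i in range(16)]
with ThreadPoolExecutor(max_workers=8) as pool:  # 4-8 concurrent uploads
    hashes = list(pool.map(upload_chunk, chunks))

print(len(hashes), "chunks uploaded")  # 16 chunks uploaded
```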
