
Database Sharding: Scaling Databases Horizontally

When a single database server can no longer handle your workload, sharding is the technique that lets you distribute data across multiple servers. This guide covers sharding strategies, shard key selection, resharding, cross-shard queries, and real-world implementations used by companies like Instagram, Discord, and Uber.

What Is Database Sharding?

Sharding is a form of horizontal partitioning where data is split across multiple independent database instances (shards). Each shard holds a subset of the total data and operates independently. Unlike replication, which copies all data to every node, sharding divides the data so each node stores only a fraction of it.

The goal is to distribute both data storage and query load across multiple machines. A table with 1 billion rows might be split into 10 shards of 100 million rows each, where each shard runs on its own server with its own CPU, memory, and disk.

When Do You Need Sharding?

Sharding adds significant complexity. Only shard when you have exhausted simpler scaling techniques:

  1. Vertical scaling — upgrade to a bigger server (more CPU, RAM, faster SSDs)
  2. Query optimization and indexing — ensure queries are efficient
  3. Read replicas — offload read traffic
  4. Caching — reduce database load with Redis or Memcached
  5. Table partitioning — partition within a single database
  6. Sharding — when none of the above are sufficient

Sharding Strategies

Hash-Based Sharding

Hash-based sharding applies a hash function to the shard key and uses modulo to determine which shard stores the data. This distributes data evenly across shards.

# Hash-based shard assignment:
#   shard_number = hash(user_id) % number_of_shards
#
# Example with 4 shards (hash values illustrative):
#   user_id = 12345 -> hash(12345) = 7823 -> 7823 % 4 = 3 -> Shard 3
#   user_id = 67890 -> hash(67890) = 2156 -> 2156 % 4 = 0 -> Shard 0

import zlib

# Application-level routing. Use a hash that is stable across processes:
# Python's built-in hash() is randomized for strings, so crc32 is used here.
def get_shard(user_id, num_shards):
    return zlib.crc32(str(user_id).encode()) % num_shards

# Each shard has the same schema:
#   Shard 0: users WHERE hash(user_id) % 4 = 0
#   Shard 1: users WHERE hash(user_id) % 4 = 1
#   Shard 2: users WHERE hash(user_id) % 4 = 2
#   Shard 3: users WHERE hash(user_id) % 4 = 3

Pros: Even data distribution, simple to implement. Cons: Adding or removing shards requires rehashing and moving data (resharding). Range queries across shards are expensive.

Consistent Hashing

Consistent hashing solves the resharding problem. Instead of simple modulo, keys and servers are mapped onto a ring. When a shard is added or removed, only a fraction of keys need to be redistributed. This is used by DynamoDB, Cassandra, and many distributed systems. Learn more in our consistent hashing guide.
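The ring can be sketched in a few lines of Python. This is a minimal illustration with made-up shard names, not the scheme any particular database uses; virtual nodes (many ring positions per shard) smooth out the distribution:

```python
import bisect
import hashlib

def _position(key: str) -> int:
    # Stable 32-bit position on the ring (md5 used only for illustration)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, shards, vnodes=100):
        self._ring = sorted((_position(f"{s}#{i}"), s)
                            for s in shards for i in range(vnodes))
        self._positions = [pos for pos, _ in self._ring]

    def get_shard(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key
        idx = bisect.bisect(self._positions, _position(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
before = {k: ring.get_shard(k) for k in map(str, range(1000))}

# Add a fourth shard: only roughly a quarter of the keys move,
# versus ~75% with naive modulo rehashing.
bigger = HashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
moved = sum(1 for k in before if bigger.get_shard(k) != before[k])
```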

Range-Based Sharding

Range-based sharding assigns continuous ranges of the shard key to each shard. For example, users with IDs 1-1,000,000 go to Shard 1, IDs 1,000,001-2,000,000 to Shard 2, and so on.

# Range-based sharding example:
#   Shard 1: orders WHERE created_at BETWEEN '2024-01-01' AND '2024-03-31'
#   Shard 2: orders WHERE created_at BETWEEN '2024-04-01' AND '2024-06-30'
#   Shard 3: orders WHERE created_at BETWEEN '2024-07-01' AND '2024-09-30'

# Shard routing; ISO-8601 date strings compare correctly as strings
def get_shard_by_date(order_date):
    if order_date < '2024-04-01':
        return 'shard_1'
    elif order_date < '2024-07-01':
        return 'shard_2'
    else:
        return 'shard_3'

Pros: Range queries on the shard key are efficient (query hits only relevant shards). Easy to understand. Cons: Uneven distribution (hot shards) if data is skewed. Recent data shard gets all new writes.

Geographic Sharding

Geographic sharding routes data based on geographic location. US users go to US shards, EU users to EU shards. This reduces latency and can help with data residency compliance (GDPR).
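The routing itself is usually a small static map. The region names and shard identifiers below are hypothetical; in practice the region comes from the user's profile or a geo-IP lookup:

```python
# Hypothetical region-to-shard map for geographic sharding
REGION_SHARDS = {
    "US": "pg-us-east",
    "EU": "pg-eu-west",
    "APAC": "pg-ap-south",
}

def get_geo_shard(user_region: str) -> str:
    # Fall back to a default shard for unmapped regions
    return REGION_SHARDS.get(user_region, "pg-us-east")
```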

Directory-Based Sharding

A lookup table (directory) maps each shard key to its shard. This provides maximum flexibility — you can move individual keys between shards — but the directory itself becomes a single point of failure and a potential bottleneck.
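A directory is just an explicit key-to-shard map. The sketch below keeps it in an in-memory dict for illustration; a real directory would live in a replicated, highly available store precisely because it is a single point of failure:

```python
class ShardDirectory:
    """Directory-based routing: an explicit user_id -> shard lookup table."""

    def __init__(self):
        self._directory = {}

    def assign(self, user_id: int, shard: str) -> None:
        self._directory[user_id] = shard

    def move(self, user_id: int, new_shard: str) -> None:
        # Individual keys can be rebalanced without touching anything else,
        # which is the flexibility hash- and range-based schemes lack
        self._directory[user_id] = new_shard

    def get_shard(self, user_id: int) -> str:
        return self._directory[user_id]

directory = ShardDirectory()
directory.assign(42, "shard-1")
directory.move(42, "shard-3")   # e.g. a hot user moved off a busy shard
```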

Shard Key Selection

Choosing the right shard key is the most critical decision in a sharding strategy. A bad shard key leads to hot shards, cross-shard queries, and painful resharding.

| Criteria | Good shard key | Bad shard key |
| --- | --- | --- |
| Cardinality | High (user_id, order_id) | Low (status, country) |
| Distribution | Even across shards | Skewed (celebrity user_id) |
| Query pattern | Most queries include the shard key | Queries rarely filter by the shard key |
| Monotonicity | Non-sequential (UUIDs with hash) | Auto-increment (all writes hit one shard) |

Cross-Shard Queries

The biggest challenge with sharding is queries that span multiple shards. A JOIN between a user table sharded by user_id and an orders table sharded by order_id requires scatter-gather: the query is sent to all shards, results are collected, and then merged. This is slow and resource-intensive.

-- Cross-shard query (scatter-gather)
-- "Find all orders for a given product across all user shards"
-- This requires querying ALL shards since orders are sharded by user_id

-- Strategy: Co-locate related data on the same shard
-- If orders are sharded by user_id (same as users), then
-- "Find all orders for user_id=42" only hits ONE shard

-- Anti-pattern: Sharding orders by order_id
-- Now "Find all orders for user_id=42" must query ALL shards

Mitigation strategies:

  • Co-locate related data on the same shard (shard users and orders by user_id)
  • Maintain denormalized copies of frequently joined data
  • Use a secondary index service (like Elasticsearch) for cross-shard searches
  • Accept eventual consistency for cross-shard aggregations
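When a scatter-gather cannot be avoided, it amounts to fanning the same query out to every shard and merging the results. The per-shard data and query function below are hypothetical, illustrating why sharding orders by user_id makes a product lookup touch every shard:

```python
# Hypothetical shards: orders are sharded by user_id, so a query
# by product_id cannot be routed and must scan every shard.
SHARDS = {
    "shard-0": [{"order_id": 1, "user_id": 4, "product_id": 7}],
    "shard-1": [{"order_id": 2, "user_id": 9, "product_id": 7}],
    "shard-2": [{"order_id": 3, "user_id": 2, "product_id": 5}],
}

def orders_for_product(product_id: int) -> list:
    results = []
    for shard_name, rows in SHARDS.items():      # scatter: query every shard
        results.extend(r for r in rows if r["product_id"] == product_id)
    results.sort(key=lambda r: r["order_id"])    # gather: merge and sort
    return results
```

Real routers parallelize the scatter step, but the cost model is the same: latency is set by the slowest shard, and every shard pays for the query.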

Resharding

Resharding — changing the number of shards or the sharding strategy — is one of the most operationally painful database tasks. When you add shards, data must be redistributed while the system continues serving traffic.

Approaches to resharding include:

  • Double-write migration — write to both old and new shards during the migration
  • Backfill and switchover — copy data to the new shards, then switch traffic over
  • Virtual shards — create many more logical shards than physical servers, then remap logical shards to physical servers when scaling
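The virtual-shard approach can be sketched as a two-level mapping, with made-up server names. Keys hash to a logical shard exactly once; scaling out only edits the logical-to-physical map, so no key ever rehashes:

```python
import zlib

# Many logical shards, few physical servers (1024 is an arbitrary choice)
NUM_LOGICAL_SHARDS = 1024

def logical_shard(user_id: int) -> int:
    # This mapping is permanent; it never changes when servers are added
    return zlib.crc32(str(user_id).encode()) % NUM_LOGICAL_SHARDS

# Two physical servers initially, each owning half the logical shards
mapping = {ls: f"server-{ls % 2}" for ls in range(NUM_LOGICAL_SHARDS)}

# Scale out: migrate a quarter of the logical shards to a new server.
# Only whole logical shards move; individual keys are never rehashed.
for ls in range(0, NUM_LOGICAL_SHARDS, 4):
    mapping[ls] = "server-2"

def physical_server(user_id: int) -> str:
    return mapping[logical_shard(user_id)]
```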

Vitess: MySQL Sharding at Scale

Vitess is an open-source database clustering system for horizontal scaling of MySQL. Originally developed at YouTube, it now powers Slack, Square, and many others. Vitess provides automatic sharding, connection pooling, query rewriting, and transparent resharding.

-- Vitess VSchema: Define sharding strategy
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}

-- Application sends queries to Vitess gateway (VTGate)
-- VTGate routes to the correct shard transparently
SELECT * FROM users WHERE user_id = 12345;
-- VTGate: routes to shard holding hash(12345)

Real-World Sharding Examples

Instagram

Instagram shards PostgreSQL by user_id. Each shard contains all data for a subset of users (photos, likes, comments). They use logical sharding with thousands of logical shards mapped to a smaller number of physical servers. This makes resharding easier — just move logical shards between physical servers.

Discord

Discord shards messages by channel_id combined with a time bucket. Each shard stores messages for a set of channels. This ensures all messages in a conversation are on the same shard. They migrated from MongoDB to Cassandra to handle trillions of messages.
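A composite (channel_id, time-bucket) shard key can be sketched as below. The bucket width and hash function are assumptions for illustration, not Discord's actual scheme; the point is that all messages for a channel within one time window land on the same shard, so reading recent history hits a single shard while no single channel's history grows one partition without bound:

```python
import zlib

BUCKET_DAYS = 10  # hypothetical bucket width

def message_bucket(timestamp_days: int) -> int:
    # The time bucket bounds how much of one channel lives in one partition
    return timestamp_days // BUCKET_DAYS

def message_shard(channel_id: int, timestamp_days: int, num_shards: int) -> int:
    # Composite shard key: channel plus time bucket
    key = f"{channel_id}:{message_bucket(timestamp_days)}"
    return zlib.crc32(key.encode()) % num_shards
```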

Sharding vs Partitioning

| Aspect | Sharding | Partitioning |
| --- | --- | --- |
| Scope | Across multiple servers | Within a single server |
| Complexity | High (routing, cross-shard queries) | Low (handled by the database engine) |
| Scalability | Nearly unlimited horizontal scale | Limited by single-server resources |
| Transactions | Distributed transactions needed | Single-server transactions |

For related scaling techniques, explore database replication, distributed databases, and partitioning strategies.

Frequently Asked Questions

How do I handle auto-increment IDs across shards?

Auto-increment IDs do not work across shards because each shard has its own sequence. Use UUIDs, Snowflake IDs (Twitter's approach: timestamp + machine ID + sequence), or a centralized ID generation service. Snowflake IDs are sortable by time and globally unique, making them ideal for sharded systems.
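A Snowflake-style generator is short enough to sketch. The bit layout follows Twitter's published scheme (41-bit millisecond timestamp, 10-bit machine ID, 12-bit sequence); the epoch below is an arbitrary assumption:

```python
import threading
import time

class SnowflakeGenerator:
    """Snowflake-style 64-bit IDs: timestamp | machine_id | sequence."""
    EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024        # 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000) - self.EPOCH_MS
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits
                if self.sequence == 0:
                    # Sequence exhausted this millisecond: spin to the next
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - self.EPOCH_MS
            else:
                self.sequence = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=7)
ids = [gen.next_id() for _ in range(5)]
# IDs are strictly increasing, hence sortable by creation time
```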

Can I use foreign keys across shards?

No. Foreign key constraints cannot span database servers. You must enforce referential integrity at the application level. This is a fundamental trade-off of sharding — you give up database-level referential integrity for horizontal scalability.
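An application-level check looks roughly like this, with hypothetical shard-connection objects standing in for real database clients. Note the inherent race: unlike a database constraint, the check and the insert are not atomic:

```python
class ReferentialIntegrityError(Exception):
    pass

def insert_order(user_shard, order_shard, user_id: int, order: dict) -> None:
    # Cross-shard existence check replaces the foreign key constraint
    if not user_shard.user_exists(user_id):
        raise ReferentialIntegrityError(f"user {user_id} does not exist")
    order_shard.insert(order)
    # Not atomic: the user could be deleted between the check and the
    # insert. Real systems pair this with soft deletes or periodic
    # consistency sweeps that find and repair orphaned rows.
```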

How do I run analytics queries on sharded data?

Use a separate analytics pipeline. Replicate data from all shards into a data warehouse (Redshift, BigQuery) or a read-only aggregated database. Do not run heavy analytics queries directly on your sharded production databases.

When should I use a distributed database instead of manual sharding?

If you need horizontal scalability with SQL support, consider NewSQL databases like CockroachDB or TiDB. They handle sharding transparently. Manual sharding gives you more control but requires significant engineering effort. For most teams, a managed distributed database is the better choice.

What happens if a shard goes down?

Each shard should have its own replication for high availability. A primary shard with one or more replicas ensures that if the primary fails, a replica can be promoted. The data on the failed shard is unavailable until failover completes, but other shards continue operating normally.
