Database Sharding & Partitioning Strategies¶

What Is Sharding?¶

Sharding (also called partitioning) means breaking a large dataset into smaller, independent pieces — called shards or partitions — and spreading them across multiple machines, so that each machine holds and serves only a subset of the data.

In Designing Data-Intensive Applications (Chapter 6), Martin Kleppmann defines a partition as a way of organizing data such that each record belongs to exactly one partition. Even though each partition is its own small database, a single operation may need to touch several of them. The terminology differs by system but the idea is identical:

System	Name for a shard
MongoDB, Elasticsearch, SolrCloud	shard
HBase	region
Bigtable, Spanner	tablet
Cassandra, Riak	vnode (virtual node)
Couchbase	vBucket
PostgreSQL, Kafka	partition

The key mental model: a shard is a fully functional slice of the data. Each shard can be read from and written to independently, which is exactly what makes the system scale.

flowchart TB
    subgraph before["Before: one node holds everything"]
        ALL[(All users<br/>A–Z · 5 TB · 50k QPS)]
    end
    subgraph after["After: data split across shards"]
        S1[(Shard 1<br/>A–H)]
        S2[(Shard 2<br/>I–P)]
        S3[(Shard 3<br/>Q–Z)]
    end
    before --> after

Partitioning vs. sharding — a note on terminology

DDIA uses partitioning and sharding interchangeably, and so do most engineers. You will, however, hear an informal distinction:

Partitioning sometimes means splitting a table within a single database engine (e.g., PostgreSQL declarative table partitioning, where one server still owns all the partitions).
Sharding sometimes means distributing those pieces across multiple independent database instances/servers.

The distinction that matters is whether the pieces live on one machine (helps query performance and maintenance) or many machines (helps with capacity and throughput). If an interviewer uses the terms precisely, clarify which they mean; the scaling story in this page is about spreading across machines.

Why Do We Need Sharding?¶

Every database eventually hits the ceiling of a single machine. That machine has a finite amount of CPU, RAM, disk capacity, disk I/O, and network bandwidth. You can grow a single node for a while — buy a bigger box, add RAM, attach faster SSDs. This is vertical scaling (scaling up), and it is the right first move because it requires no application changes. But it runs out:

Data volume — when the dataset no longer fits on one machine's disk (or the working set no longer fits in RAM, so every query hits disk).
Write throughput — a single leader can only absorb so many writes per second; replication adds read replicas but every replica still has to apply every write, so it does not help write scaling.
Cost and blast radius — the largest machines are disproportionately expensive, and one big node is a single point of failure for the entire dataset.

Sharding is the answer to horizontal scaling (scaling out): instead of one machine doing all the work, the data — and the read/write load that comes with it — is divided across many commodity machines. As Kleppmann frames it, the main reason to partition data is scalability: spreading the data and the query load across many nodes so that throughput grows roughly linearly as you add machines.

Sharding vs. replication — they solve different problems

These two scaling techniques are almost always used together, but they are not the same thing (DDIA, Ch. 5–6):

Replication keeps copies of the same data on multiple nodes — for fault tolerance and read scaling.
Sharding splits the dataset into different subsets on different nodes — for write scaling and capacity.

A production system typically does both: it shards the data into partitions, then replicates each shard across a few nodes so that no single failure loses a partition. Sharding gives you capacity; replication gives you durability and availability.

Why Sharding Matters in Interviews¶

Once you understand what and why, the senior/staff-level skill is execution. At L6, interviewers expect you to choose the right partition key, anticipate hot spots, and plan for resharding — not just say "we'll shard the data."

Note

Every system design question eventually leads to "how do you scale the data layer?" Your sharding answer determines whether your design works at 10x the initial traffic.

Partitioning Strategies¶

Key-Based (Hash) Partitioning¶

Compute shard = hash(key) % num_shards. Data distributes evenly if the hash function is uniform.

Pros	Cons
Even distribution	Adding/removing shards reshuffles most keys (modulo change)
Simple routing	Range queries across shards are expensive
No hotspots (if hash is good)	Cannot exploit data locality

Range Partitioning¶

Assign contiguous key ranges to shards (e.g., A-M on shard 1, N-Z on shard 2).

Pros	Cons
Efficient range queries	Uneven distribution (some ranges are hotter)
Natural ordering	Requires periodic rebalancing
Easy to understand	Sequential keys create write hotspots

Consistent Hashing¶

Map both keys and nodes to a ring. Each key is assigned to the first node clockwise from its hash position. Adding a node only moves keys between neighbors.

Pros	Cons
Minimal key movement on node add/remove	Uneven load with few nodes (solved with virtual nodes)
No central routing table needed	More complex than simple modulo
Used by Cassandra, DynamoDB, Riak	Ring metadata must be distributed

Directory-Based Partitioning¶

A lookup service maps keys to shards. Maximum flexibility but adds a dependency.

Pros	Cons
Complete flexibility in placement	Lookup service is a single point of failure
Can move individual keys	Additional latency for lookup
Supports heterogeneous shards	Must be highly available and fast

Choosing a Partition Key¶

The partition key determines data distribution, query patterns, and hotspot risk.

Consideration	Guidance
Cardinality	High cardinality (e.g., user_id) distributes well; low cardinality (e.g., country) creates hot shards
Query patterns	If most queries filter by `tenant_id`, shard by `tenant_id` to avoid scatter-gather
Write distribution	If writes are timestamped and sequential, hashing the timestamp avoids a single hot shard
Join locality	Co-locate data that is frequently joined (e.g., user + orders on the same shard)

Common Partition Key Choices¶

System	Partition Key	Rationale
Chat system	`conversation_id`	All messages in a conversation are on one shard; efficient reads
E-commerce orders	`user_id`	Most queries are "my orders"; avoids cross-shard queries
Time-series metrics	`hash(metric_name)`	Avoids sequential write hotspots; range queries use time index within shard
Multi-tenant SaaS	`tenant_id`	Data isolation; compliance; per-tenant backup/restore

Warning

Anti-pattern: Sharding by auto-increment ID puts all recent writes on the last shard. Use a hash of the ID or a compound key.

Hot Spot Mitigation¶

Technique	Description
Salting	Append a random suffix to the key before hashing (e.g., `user_123_salt_7`); distributes writes but complicates reads (must query all salt variants)
Key splitting	Split a hot partition into sub-partitions; route based on secondary key
Caching	Cache hot reads in front of the shard; reduces read load on the hot shard
Rate limiting per shard	Protect hot shards from overwhelming writes; backpressure to clients
Virtual nodes	In consistent hashing, more virtual nodes per physical node smooths distribution

Cross-Shard Operations¶

Operation	Challenge	Solutions
Cross-shard queries	Must fan out to all shards; aggregate results	Application-level scatter-gather; or denormalize data to avoid cross-shard reads
Cross-shard joins	No native SQL join across shards	Denormalize; or use a materialized view updated via CDC
Cross-shard transactions	No single-DB ACID guarantee	2PC (if same DB engine); Saga pattern; or design to avoid cross-shard transactions
Global secondary indexes	Index entries span multiple shards	Local indexes (per-shard, scatter on query) or global index (updated async, eventually consistent)

Tip

Staff-level answer: "I'd design the schema so that 95% of queries hit a single shard. For the rare cross-shard query (e.g., admin dashboard), I'd build a read-optimized materialized view in a separate analytical store, updated via CDC from the sharded OLTP database."

Resharding Strategies¶

As data grows, the initial shard count becomes insufficient. Resharding is one of the hardest operational challenges.

Strategy	Description	Downtime
Logical sharding	Use many more logical shards than physical nodes (e.g., 1024 logical shards on 16 nodes); move logical shards between nodes without re-hashing	Zero (with careful migration)
Dual-write migration	Write to both old and new shard layout; backfill historical data; cut over reads; stop old writes	Zero (but complex)
Shadow traffic	Route a copy of traffic to the new shard layout; compare results; switch when consistent	Zero
Stop-the-world	Take the system offline; redistribute data; bring back up	Minutes to hours