Database Sharding & Partitioning Strategies¶
What Is Sharding?¶
Sharding (also called partitioning) means breaking a large dataset into smaller, independent pieces — called shards or partitions — and spreading them across multiple machines, so that each machine holds and serves only a subset of the data.
In Designing Data-Intensive Applications (Chapter 6), Martin Kleppmann defines a partition as a way of organizing data such that each record belongs to exactly one partition. Even though each partition is its own small database, a single operation may need to touch several of them. The terminology differs by system but the idea is identical:
| System | Name for a shard |
|---|---|
| MongoDB, Elasticsearch, SolrCloud | shard |
| HBase | region |
| Bigtable, Spanner | tablet |
| Cassandra, Riak | vnode (virtual node) |
| Couchbase | vBucket |
| PostgreSQL, Kafka | partition |
The key mental model: a shard is a fully functional slice of the data. Each shard can be read from and written to independently, which is exactly what makes the system scale.
flowchart TB
subgraph before["Before: one node holds everything"]
ALL[(All users<br/>A–Z · 5 TB · 50k QPS)]
end
subgraph after["After: data split across shards"]
S1[(Shard 1<br/>A–H)]
S2[(Shard 2<br/>I–P)]
S3[(Shard 3<br/>Q–Z)]
end
before --> after
Partitioning vs. sharding — a note on terminology
DDIA uses partitioning and sharding interchangeably, and so do most engineers. You will, however, hear an informal distinction:
- Partitioning sometimes means splitting a table within a single database engine (e.g., PostgreSQL declarative table partitioning, where one server still owns all the partitions).
- Sharding sometimes means distributing those pieces across multiple independent database instances/servers.
The distinction that matters is whether the pieces live on one machine (helps query performance and maintenance) or many machines (helps with capacity and throughput). If an interviewer uses the terms precisely, clarify which they mean; the scaling story in this page is about spreading across machines.
Why Do We Need Sharding?¶
Every database eventually hits the ceiling of a single machine. That machine has a finite amount of CPU, RAM, disk capacity, disk I/O, and network bandwidth. You can grow a single node for a while — buy a bigger box, add RAM, attach faster SSDs. This is vertical scaling (scaling up), and it is the right first move because it requires no application changes. But it runs out:
- Data volume — when the dataset no longer fits on one machine's disk (or the working set no longer fits in RAM, so every query hits disk).
- Write throughput — a single leader can only absorb so many writes per second; replication adds read replicas but every replica still has to apply every write, so it does not help write scaling.
- Cost and blast radius — the largest machines are disproportionately expensive, and one big node is a single point of failure for the entire dataset.
Sharding is the answer to horizontal scaling (scaling out): instead of one machine doing all the work, the data — and the read/write load that comes with it — is divided across many commodity machines. As Kleppmann frames it, the main reason to partition data is scalability: spreading the data and the query load across many nodes so that throughput grows roughly linearly as you add machines.
Sharding vs. replication — they solve different problems
These two scaling techniques are almost always used together, but they are not the same thing (DDIA, Ch. 5–6):
- Replication keeps copies of the same data on multiple nodes — for fault tolerance and read scaling.
- Sharding splits the dataset into different subsets on different nodes — for write scaling and capacity.
A production system typically does both: it shards the data into partitions, then replicates each shard across a few nodes so that no single failure loses a partition. Sharding gives you capacity; replication gives you durability and availability.
Why Sharding Matters in Interviews¶
Once you understand what and why, the senior/staff-level skill is execution. At L6, interviewers expect you to choose the right partition key, anticipate hot spots, and plan for resharding — not just say "we'll shard the data."
Note
Every system design question eventually leads to "how do you scale the data layer?" Your sharding answer determines whether your design works at 10x the initial traffic.
Partitioning Strategies¶
Key-Based (Hash) Partitioning¶
Compute shard = hash(key) % num_shards. Data distributes evenly if the hash function is uniform.
| Pros | Cons |
|---|---|
| Even distribution | Adding/removing shards reshuffles most keys (modulo change) |
| Simple routing | Range queries across shards are expensive |
| No hotspots (if hash is good) | Cannot exploit data locality |
Range Partitioning¶
Assign contiguous key ranges to shards (e.g., A-M on shard 1, N-Z on shard 2).
| Pros | Cons |
|---|---|
| Efficient range queries | Uneven distribution (some ranges are hotter) |
| Natural ordering | Requires periodic rebalancing |
| Easy to understand | Sequential keys create write hotspots |
Consistent Hashing¶
Map both keys and nodes to a ring. Each key is assigned to the first node clockwise from its hash position. Adding a node only moves keys between neighbors.
| Pros | Cons |
|---|---|
| Minimal key movement on node add/remove | Uneven load with few nodes (solved with virtual nodes) |
| No central routing table needed | More complex than simple modulo |
| Used by Cassandra, DynamoDB, Riak | Ring metadata must be distributed |
Directory-Based Partitioning¶
A lookup service maps keys to shards. Maximum flexibility but adds a dependency.
| Pros | Cons |
|---|---|
| Complete flexibility in placement | Lookup service is a single point of failure |
| Can move individual keys | Additional latency for lookup |
| Supports heterogeneous shards | Must be highly available and fast |
Choosing a Partition Key¶
The partition key determines data distribution, query patterns, and hotspot risk.
| Consideration | Guidance |
|---|---|
| Cardinality | High cardinality (e.g., user_id) distributes well; low cardinality (e.g., country) creates hot shards |
| Query patterns | If most queries filter by tenant_id, shard by tenant_id to avoid scatter-gather |
| Write distribution | If writes are timestamped and sequential, hashing the timestamp avoids a single hot shard |
| Join locality | Co-locate data that is frequently joined (e.g., user + orders on the same shard) |
Common Partition Key Choices¶
| System | Partition Key | Rationale |
|---|---|---|
| Chat system | conversation_id |
All messages in a conversation are on one shard; efficient reads |
| E-commerce orders | user_id |
Most queries are "my orders"; avoids cross-shard queries |
| Time-series metrics | hash(metric_name) |
Avoids sequential write hotspots; range queries use time index within shard |
| Multi-tenant SaaS | tenant_id |
Data isolation; compliance; per-tenant backup/restore |
Warning
Anti-pattern: Sharding by auto-increment ID puts all recent writes on the last shard. Use a hash of the ID or a compound key.
Hot Spot Mitigation¶
| Technique | Description |
|---|---|
| Salting | Append a random suffix to the key before hashing (e.g., user_123_salt_7); distributes writes but complicates reads (must query all salt variants) |
| Key splitting | Split a hot partition into sub-partitions; route based on secondary key |
| Caching | Cache hot reads in front of the shard; reduces read load on the hot shard |
| Rate limiting per shard | Protect hot shards from overwhelming writes; backpressure to clients |
| Virtual nodes | In consistent hashing, more virtual nodes per physical node smooths distribution |
Cross-Shard Operations¶
| Operation | Challenge | Solutions |
|---|---|---|
| Cross-shard queries | Must fan out to all shards; aggregate results | Application-level scatter-gather; or denormalize data to avoid cross-shard reads |
| Cross-shard joins | No native SQL join across shards | Denormalize; or use a materialized view updated via CDC |
| Cross-shard transactions | No single-DB ACID guarantee | 2PC (if same DB engine); Saga pattern; or design to avoid cross-shard transactions |
| Global secondary indexes | Index entries span multiple shards | Local indexes (per-shard, scatter on query) or global index (updated async, eventually consistent) |
Tip
Staff-level answer: "I'd design the schema so that 95% of queries hit a single shard. For the rare cross-shard query (e.g., admin dashboard), I'd build a read-optimized materialized view in a separate analytical store, updated via CDC from the sharded OLTP database."
Resharding Strategies¶
As data grows, the initial shard count becomes insufficient. Resharding is one of the hardest operational challenges.
| Strategy | Description | Downtime |
|---|---|---|
| Logical sharding | Use many more logical shards than physical nodes (e.g., 1024 logical shards on 16 nodes); move logical shards between nodes without re-hashing | Zero (with careful migration) |
| Dual-write migration | Write to both old and new shard layout; backfill historical data; cut over reads; stop old writes | Zero (but complex) |
| Shadow traffic | Route a copy of traffic to the new shard layout; compare results; switch when consistent | Zero |
| Stop-the-world | Take the system offline; redistribute data; bring back up | Minutes to hours |
Warning
Staff-level insight: Always over-provision logical shards at design time. Starting with 1024 logical shards on 8 physical nodes is far easier to scale (just move logical shards) than starting with 8 logical shards and needing to re-hash everything.
Replication Within Shards¶
Each shard should be replicated for durability and read scaling:
| Topology | Description |
|---|---|
| Leader-follower | One leader handles writes; followers serve reads; failover promotes a follower |
| Leader-leader | Both nodes accept writes; conflict resolution needed (LWW, vector clocks) |
| Leaderless (quorum) | Any node accepts writes; quorum ensures overlap (Dynamo-style) |
Sharding in Real Systems¶
| System | Sharding Approach |
|---|---|
| MySQL (Vitess) | Application-level sharding; Vitess adds routing, resharding, and schema management |
| PostgreSQL (Citus) | Extension that distributes tables across worker nodes; co-located joins |
| MongoDB | Built-in sharding with configurable shard key; automatic balancing |
| Cassandra | Consistent hashing with virtual nodes; automatic rebalancing |
| Google Spanner | Range-based splits with automatic split/merge; Paxos per split |
| DynamoDB | Hash partitioning; automatic splitting of hot partitions |
Interview Checklist¶
| Topic | What to Cover |
|---|---|
| Why shard? | "A single node cannot handle X writes/sec or Y TB of data" |
| Partition key | Choose based on query patterns, cardinality, and write distribution |
| Hot spots | Identify and mitigate with salting, splitting, or caching |
| Cross-shard | Design to minimize; use denormalization or materialized views for the rest |
| Resharding | Logical shards > physical nodes; dual-write migration for zero downtime |
| Replication | Each shard has leader-follower replication for durability |
| Secondary indexes | Local (per-shard) vs. global (async) with clear trade-offs |
Further Reading¶
| Resource | Topic | Why This Matters |
|---|---|---|
| Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6 ("Partitioning") | The canonical treatment of sharding | Chapter 6 is the foundation for everything on this page. Kleppmann motivates partitioning as the path to scalability, then works through the two core strategies (partitioning by key range vs. by hash of key) and exactly why each creates or avoids hot spots. He then covers the two hardest real-world problems: partitioning secondary indexes (local/document-partitioned vs. global/term-partitioned) and rebalancing (why hash % N is a trap, and why fixed-number-of-partitions and dynamic partitioning are used instead). Read it alongside Chapter 5 (Replication) to understand why the two are always combined. |
| Designing Data-Intensive Applications — Chapter 5 ("Replication") | How each shard stays durable | Sharding splits data; replication copies it. Chapter 5 explains leader-follower, multi-leader, and leaderless replication — the mechanisms used to keep each individual shard available and durable when a node fails. |
| Vitess / Citus documentation | Production sharding in MySQL & PostgreSQL | Real systems that bolt sharding onto existing relational databases. Vitess (used by YouTube, Slack) and Citus (PostgreSQL extension) show how routing, resharding, and cross-shard query handling are implemented in practice — the operational reality behind the strategies above. |
Last updated: 2026-06-01