Cloud Storage (Google Drive)¶
What We're Building¶
A multi-tenant cloud file system that lets users upload, sync, share, and version files across devices—conceptually aligned with products like Google Drive or Dropbox. Users expect near-real-time sync, reliable offline edits, deduplicated storage at scale, and fine-grained sharing.
Core capabilities in scope:
- Upload and download of files and folders with hierarchical paths
- Continuous sync between clients and cloud (delta updates, not full re-uploads)
- Block-level storage with content-defined chunking and content-addressable deduplication
- Metadata service for names, ACLs, versions, and parent/child relationships
- Notifications when remote changes occur (long polling or WebSocket)
- Conflict detection and resolution strategies for concurrent edits
- File history and rollback to prior versions
Why Cloud Storage Is Hard¶
| Challenge | Description |
|---|---|
| Scale of data | Exabytes under management; must avoid storing duplicate bytes |
| Sync fan-out | One change may affect many devices; hot files amplify traffic |
| Consistency | Metadata must stay coherent while blocks are immutable and replicated |
| Conflicts | Offline edits and concurrent writers require explicit policies |
| Latency sensitivity | Users perceive sluggish sync as product failure |
Note
In interviews, separate user-visible paths (metadata graph) from bytes on disk (content-addressed blocks). That split is the backbone of the design.
Comparison to Adjacent Systems¶
| System | Similarity | Difference |
|---|---|---|
| Object storage (S3) | Immutable blobs, large scale | Cloud drive adds hierarchy, sync, and per-user ACLs |
| CDN | Edge caching | CDN is read-heavy and URL-centric; Drive is read/write with strong user semantics |
| Git | Content hashing, dedup | Git is repository-centric; Drive optimizes for non-technical users and binary workloads |
Step 1: Requirements¶
Functional Requirements¶
| ID | Requirement | Notes |
|---|---|---|
| F1 | Upload / download files up to product limit (e.g., 5 TB max object size in many products; interview: state an assumption) | Resume interrupted transfers |
| F2 | Folder hierarchy | Move, rename, create; stable internal IDs separate from display path |
| F3 | Sync across devices for the same user | Must handle sleep, flaky networks, and reconnect |
| F4 | Sharing with users and groups | Viewer/commenter/editor roles (illustrative) |
| F5 | Version history | Restore prior versions; cap retention by policy |
| F6 | Search / list (optional in scope) | Often delegated to search index fed by metadata CDC |
Non-Functional Requirements¶
| Dimension | Target (illustrative) | Rationale |
|---|---|---|
| Availability | 99.9%–99.99% for API | Enterprise expectations; block storage higher than metadata in some designs |
| Durability | 11 nines for committed user bytes (provider-style claim) | Achieved via replication + repair; interview: discuss erasure coding vs triple replication |
| Latency (metadata) | P99 < 200–300 ms for list/small reads | Interactive UI |
| Latency (sync notification) | Seconds for “someone else changed this” | Not always instant; often eventual with bounded delay |
| Consistency | Per-user linearizable metadata ops (idealized) or per-resource serializability | Trade cost vs UX |
| Security | TLS in transit; encryption at rest; ACL checks on every metadata path | Audit logs for sharing changes |
Warning
Do not claim POSIX semantics for the whole namespace unless asked. Many products offer close-to-POSIX behavior on a mounted drive with documented exceptions (e.g., eventual consistency on metadata in some edge cases).
API Design¶
Representative REST-style surface (names illustrative). Many products also use gRPC internally.
| Operation | Purpose | Notes |
|---|---|---|
| `POST /v1/uploads` | Create upload session | Returns `upload_id`, chunk size hints, URLs for block PUTs |
| `PUT /v1/uploads/{upload_id}/blocks/{block_hash}` | Idempotent block upload | Content-addressed; skip if exists |
| `POST /v1/files/{id}/commit` | Finalize file revision | Points file to manifest of block hashes |
| `GET /v1/changes` | Delta feed | Since cursor; returns file/folder mutations |
| `GET /v1/files/{id}/content` | Download | Redirects to signed URLs or streams blocks |
| `POST /v1/shares` | Share resource | Grants principal + role |
| `GET /v1/notify` | Long poll | Hold until change or timeout |
| `WS /v1/stream` | WebSocket | Push notifications for subscribed folders |
Change feed (conceptual JSON)
{
  "cursor": "chg_7f91…",
  "events": [
    {
      "type": "file.updated",
      "file_id": "fil_abc",
      "revision": 42,
      "etag": "W/\"v42\"",
      "path": "/Reports/Q3.pdf"
    }
  ]
}
Tip
Treat block uploads as idempotent by hash. Treat file commits as idempotent with a client mutation id to avoid duplicate revisions on retries.
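The second half of the tip, deduplicating commits by a client mutation id, can be sketched as follows. This is a minimal in-memory model; the class and field names are assumptions, not a real product API.

```python
class CommitService:
    """Illustrative commit path: retries with the same mutation id are no-ops."""

    def __init__(self):
        self.revisions = {}       # file_id -> latest committed revision number
        self.seen_mutations = {}  # (file_id, mutation_id) -> revision already assigned

    def commit(self, file_id: str, mutation_id: str, manifest: list[str]) -> int:
        key = (file_id, mutation_id)
        if key in self.seen_mutations:
            # Retry of an already-applied commit: return the same revision,
            # do not create a duplicate.
            return self.seen_mutations[key]
        rev = self.revisions.get(file_id, 0) + 1
        self.revisions[file_id] = rev
        self.seen_mutations[key] = rev
        return rev
```

A client that times out and retries the same commit gets back the original revision number instead of forking history.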
Technology Selection & Tradeoffs¶
A cloud storage stack spans three planes: immutable bytes (object/block layer), authoritative metadata (tree, versions, ACLs), and sync semantics (how clients converge). Interviewers expect you to name realistic building blocks and justify trade-offs, not to pick one vendor dogmatically.
Object storage backend¶
| Option | Pros | Cons | When it wins |
|---|---|---|---|
| Custom chunk store (hash-partitioned volumes, EC, scrubbers) | Full control over placement, cost, on-prem; can co-design with dedup/GC | Years of engineering; you own reliability, upgrades, and incident response | Hyperscaler or regulated environments building a proprietary plane |
| Ceph / MinIO (S3-compatible cluster) | Mature replication/EC; ops patterns exist; self-hostable | Operability at exabyte scale is non-trivial; feature gaps vs public cloud (multi-region, compliance SKUs) | Private cloud, hybrid, or teams that want S3 API without AWS |
| Managed S3-compatible (e.g., S3, GCS, Azure Blob) | Durability/availability SLAs, global replication, lifecycle tiers, compliance certs | Cost at scale; less control over internals; egress/operation pricing | Fastest path for most products; default in interviews unless “on-prem” is stated |
Why it matters: Metadata references chunks by hash; the object layer only needs PUT-by-hash, GET-by-hash, lifecycle/GC hooks, and strong durability. The hard part is reference counting and async deletion in your metadata plane—not the raw blob PUT.
Metadata store¶
| Option | Pros | Cons | When it wins |
|---|---|---|---|
| PostgreSQL (sharded) | ACID, constraints (`UNIQUE (parent_id, name)`), rich queries, mature tooling | Cross-shard transactions painful; need careful shard key (`owner_id` / `namespace_id`) | Default interview answer for tree + ACLs + transactional commits |
| etcd (or similar consistent KV) | Strong consistency, watches for coordination | Poor fit for large trees, heavy listing queries, and billions of rows—designed for small critical state | Locks, quotas, rate-limit counters, not the full file catalog |
| Custom B-tree / LSM on disk | Ultimate performance/cost tuning | You reimplement SQL, migrations, backup—rarely justified | Extreme embedded or legacy systems; not a typical greenfield choice |
Why it matters: File names and hierarchy need transactional invariants; search/list workloads need secondary indexes. A relational model maps cleanly; pure KV is usually paired with another system for graph/path queries unless you accept heavy client-side logic.
Sync protocol¶
| Approach | Pros | Cons | When it wins |
|---|---|---|---|
| Rsync-like delta sync (rolling hash, send missing segments) | Minimizes bytes on repeated similar files; great for low uplink | Complex client; server may still be chunk/manifest based—align with CDC policy | Bandwidth-sensitive clients; backup tools; complement to chunk stores |
| Full file replace | Simple mental model | Wastes bandwidth on large files; fights user expectations for “sync” | Small files only; rarely the main strategy at scale |
| Block-level + content-defined chunking + manifest commit | Stable dedup, resumable uploads, idempotent PUT(hash) | Requires manifest versioning and GC | Industry-typical for Drive/Dropbox-class products |
Why it matters: Interviews reward CDC + content-addressed blocks because edits localize to a few chunks; “upload whole file every time” fails the efficiency bar unless scope is explicitly tiny files.
Conflict resolution¶
| Strategy | Pros | Cons | When it wins |
|---|---|---|---|
| Last-writer-wins (LWW) | Simple; single head revision | Silent overwrite—bad for shared folders and offline | Low-stakes caches; not ideal as the only story for collaboration |
| Version vectors / DAG | Captures causality; enables merge tools and audit | UX complexity; still need policy for binaries | Advanced sync; technical users; foundation for branch + merge flows |
| User manual merge / conflict copies | Safe for binary; clear accountability | Noisy folders; user burden | Default for generic cloud drives on binary files |
Our choice (interview narrative):
- Bytes: Managed S3-compatible object storage (or Ceph/MinIO if hybrid/on-prem) for durability and operational leverage; content-addressed chunks with erasure coding behind the API.
- Metadata: PostgreSQL (or sharded Spanner/Cockroach-class SQL) for transactional tree + ACL + revision history; etcd only for coordination (locks, leases), not the main catalog.
- Sync: CDC chunking + block upload + manifest commit with cursor-based change feed and push notifications; optional rsync-style second pass only for niche bandwidth savings—not as the primary store of truth.
- Conflicts: Optimistic concurrency on commit (`etag`/base revision); for binaries, conflict copies or explicit user resolution; reserve LWW for clearly defined single-writer resources.
Rationale: Optimize for deduplicated storage, clear consistency story on metadata, and honest conflict UX—without building a custom object store unless the prompt demands it.
CAP Theorem Analysis¶
CAP (Consistency, Availability, Partition tolerance) says that under a network partition, you cannot have both strong linearizability and full read/write availability for the same data plane. Real products partition responsibilities: different subsystems pick different points on the spectrum.
For cloud storage:
- File reads (content) should stay highly available: clients can often read replicated chunks; temporary metadata staleness may block “latest” path resolution, but bytes addressed by known hash remain readable (AP-leaning for immutable blobs).
- Sync conflicts need careful consistency on metadata (which revision is head, who is allowed to commit)—typically CP-leaning for the commit path (reject or branch on conflict), while notifications and change feeds are eventually consistent with bounded lag.
| Subsystem | Typical CAP stance | Interview phrasing |
|---|---|---|
| File metadata (path, size, head revision) | CP on commit: transactional updates, version checks | “We serialize commits per file or use compare-and-swap on revision.” |
| File content (immutable chunks by hash) | AP for read: multiple replicas; eventual visibility of new hashes after commit | “Chunks are immutable; once committed, reads don’t need quorum metadata.” |
| Sync state (per-device cursor, local vs server revision) | AP with eventual convergence; repair via change feed | “Devices are sources of truth for pending work; server is source of truth for committed state.” |
| Sharing permissions | CP when enforcing on sensitive operations; cached reads may be eventually fresh | “Writes to ACLs go through authoritative store; reads may use cached policy with short TTL.” |
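The CP commit path in the table above is essentially a compare-and-swap on the head revision: a commit carries the base revision the client last saw, and the server rejects it if the head has moved. A minimal in-memory sketch (names are illustrative, not a real API):

```python
class MetadataStore:
    """Compare-and-swap commit on a file's head revision (CP commit path)."""

    def __init__(self):
        self.head = {}  # file_id -> (revision, manifest_id)

    def commit(self, file_id: str, base_revision: int, manifest_id: str):
        current_rev, _ = self.head.get(file_id, (0, None))
        if base_revision != current_rev:
            # Conflict: caller must refresh (or branch into a conflict copy).
            return (False, current_rev)
        self.head[file_id] = (current_rev + 1, manifest_id)
        return (True, current_rev + 1)
```

In a real system the check-and-update runs inside one transaction (or a conditional write), so two writers racing on the same base revision cannot both win.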
Note
PACELC extension: even without a partition, you trade latency (L) vs consistency (C)—e.g., reading ACL from a cache is faster but may be briefly stale. Mentioning PACELC signals seniority.
SLA and SLO Definitions¶
SLA = contract with customers (credits, legal). SLO = internal target; SLI = what you measure. Below: illustrative SLOs for a consumer/enterprise-grade drive; tune numbers to the prompt.
| Category | SLI | Example SLO | Measurement window |
|---|---|---|---|
| Upload latency | Time from last byte of chunk received to ack (or from commit request to success) | P99 < 500 ms for commit; chunk PUT P99 < 1 s for ≤ 32 MB chunk under normal load | 30-day rolling |
| Download latency | Time to first byte (TTFB) for signed URL or gateway stream | P99 TTFB < 300 ms same region; higher cross-region (disclose) | 30-day rolling |
| Sync latency | Wall-clock from server commit to change visible on subscribed client (feed or push) | P95 < 10 s; P99 < 60 s (mobile/long-poll may widen tail) | 30-day rolling |
| Data durability | Probability of permanent loss of committed user object | 11 nines for committed bytes after ack (provider-style claim; explain replication + EC + repair) | Annual / incident-based review |
| Availability | Successful metadata read/write and auth checks vs all attempts | 99.9% monthly (consumer); 99.95%+ (business)—exclude customer-caused throttling if defined | Monthly |
| Conflict resolution accuracy | Share of conflicts correctly classified (no silent wrong winner) vs detected conflicts | 99.99% detection rate for concurrent commits to same base revision; 0 silent LWW on shared folders if policy forbids | Per release + sampled audits |
Error budget policy (how teams operate):
| Element | Policy |
|---|---|
| Budget | e.g., 0.1% monthly unavailability = ~43 minutes/month at 99.9% |
| Burn alerts | Page on fast burn (budget exhausted in days); ticket on slow burn |
| Trade-offs | If sync latency SLO slips, throttle non-critical features (preview gen) before dropping durability paths |
| Freeze | If budget exhausted, freeze launches until reliability work ships |
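The budget arithmetic in the table is simple to reproduce. A small helper, assuming a 30-day window:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per window for an availability SLO.

    E.g., a 99.9% SLO over 30 days allows (1 - 0.999) * 30 * 24 * 60 = 43.2 min.
    """
    return (1.0 - slo) * days * 24 * 60
```

A 99.95% business tier cuts that to about 21.6 minutes per month, which is why burn-rate alerting matters more as the SLO tightens.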
Warning
Never conflate durability (bits not lost) with availability (API up). You can be available and wrong if you serve stale metadata—separate SLIs.
Database Schema¶
Logical schema (illustrative SQL-oriented). Adjust types (UUID, BIGINT), indexing, and soft-delete to your scale story.
files — one row per logical file (node); path may be materialized for perf or derived from closure table—not both without a clear source of truth.
| Column | Type | Notes |
|---|---|---|
| `id` | UUID / BIGINT | Primary key; stable across renames |
| `name` | VARCHAR | Display name; sibling uniqueness with `parent_id` |
| `path` | TEXT | Optional denormalized path for fast listing; or omit and use `parent_id` chain |
| `parent_id` | FK → `files.id` | NULL or sentinel for root |
| `size` | BIGINT | Logical size (bytes) for latest committed revision |
| `content_hash` | BYTEA / CHAR(64) | Hash of content (manifest root or whole-file); aligns with dedup story |
| `version` / `head_revision` | BIGINT | Monotonic revision for optimistic locking |
| `owner_id` | UUID | Billing / primary owner |
| `permissions` | ENUM / JSONB | Default visibility (e.g. `private`, `anyone_with_link`) or bitmask; fine-grained grants live in `sharing` |
| `created_at`, `updated_at` | TIMESTAMPTZ | Audit |
| `deleted_at` | TIMESTAMPTZ | Soft delete for sync trash |
file_versions — immutable snapshots for history and GC.
| Column | Type | Notes |
|---|---|---|
| `id` | BIGINT | Surrogate PK |
| `file_id` | FK → `files.id` | |
| `version` | BIGINT | Matches commit; `UNIQUE (file_id, version)` |
| `storage_key` | TEXT | Manifest id, or pointer to manifest table row |
| `timestamp` | TIMESTAMPTZ | Commit time (server authoritative) |
sharing — ACL edges.
| Column | Type | Notes |
|---|---|---|
| `file_id` | FK → `files.id` | Resource (file or folder node) |
| `user_id` | UUID | Principal |
| `permission_level` | ENUM / SMALLINT | viewer, commenter, editor, owner |
| `granted_by`, `granted_at` | UUID, TIMESTAMPTZ | Audit |
sync_state — per device convergence (optional per file row, or summary + separate pending table).
| Column | Type | Notes |
|---|---|---|
| `device_id` | UUID | Client-registered device |
| `file_id` | FK → `files.id` | |
| `local_version` | BIGINT | Last known applied server revision on device |
| `synced_at` | TIMESTAMPTZ | Last successful reconcile |
Tip
At scale, sync_state is often sharded with the user or stored client-side (SQLite) with server cursors—the table above is the server-side model when you track enterprise devices centrally.
Step 2: Back-of-the-Envelope Estimation¶
Assumptions¶
- Total users: 1 billion registered; DAU: 200 million
- Average stored data per paying user: 120 GB (heavy tail dominates)
- Average file size: 1.5 MB (mix of docs and photos)
- Average chunk size after CDC: 4 MB (policy-dependent; smaller for text-heavy tenants)
- Deduplication ratio (effective bytes vs logical): 1.6× (illustrative)
- Metadata operations per user per active day: 200 (opens, lists, small writes)
- Change notifications delivered per DAU per day: 30 (fan-out to devices)
Storage¶
Logical bytes ≈ 1B × 120 GB = 120 EB (upper bound if all users paid and filled—illustrative only)
After dedup (1.6×): ~75 EB effective (still illustrative; real systems use tiering and cold storage)
Per-object metadata overhead (names, ACLs, indices): order of KBs per file; dominates only for tiny files
Note
Interview math should stay order-of-magnitude. The goal is to show you understand dedup, small-file overhead, and metadata cost, not to predict revenue.
Traffic¶
Assume 200M DAU each perform 200 metadata ops/day:
200M × 200 / 86,400 ≈ 463,000 metadata RPS average; peak ×3 ≈ 1.4M RPS (illustrative peak)
Block upload/download throughput is dominated by large files and LAN-speed clients; provision backbone and edge separately.
Sync and Notifications¶
200M DAU × 30 notifications/day ≈ 6B notifications/day
Average ≈ 69,000/s; peak ×5 ≈ 350,000/s (fan-out to devices varies by clustering of activity)
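The averages above follow from one formula: daily events per user × DAU ÷ seconds per day. A tiny calculator reproducing the numbers, using the illustrative assumptions stated earlier:

```python
DAU = 200_000_000
SECONDS_PER_DAY = 86_400

def avg_rps(events_per_user_per_day: int, dau: int = DAU) -> float:
    """Average requests/second from per-user daily event counts."""
    return dau * events_per_user_per_day / SECONDS_PER_DAY

metadata_rps = avg_rps(200)  # ≈ 463,000 average; ×3 peak ≈ 1.4M
notify_rps = avg_rps(30)     # ≈ 69,000/s average; ×5 peak ≈ 350,000/s
```

Keep this order-of-magnitude: the point is showing the peak multiplier and the units, not false precision.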
| Resource | Rough scale | Planning note |
|---|---|---|
| Metadata DB QPS | High hundreds of thousands peak | Shard by user_id or namespace_id |
| Block storage throughput | Terabits/s aggregate | Erasure-coded volumes + CDN for public links |
| Notification connections | Many millions concurrent | WebSocket gateways regionalized |
Step 3: High-Level Design¶
At a high level: clients talk to API / sync services, which persist metadata in a transactional store (often sharded SQL or a strongly consistent KV with careful modeling), and file bytes in object/block storage addressed by cryptographic hash. A notification service pushes change events to online clients; others catch up via long polling or periodic change feed pulls.
Request path (upload): client chunks file with CDC, hashes chunks, uploads missing blocks to block service, commits manifest in metadata service.
Read path: client lists folder via metadata, resolves file to block list, fetches blocks (often via signed URLs), verifies hashes.
Important
Content-addressed storage means identical chunks map to one physical object. Metadata records which hashes belong to which file revision.
Step 4: Deep Dive¶
4.1 File Chunking and Deduplication¶
Fixed-size chunking is simple but shifts boundaries when bytes insert at the front, hurting dedup. Content-defined chunking (CDC) uses rolling hashes (e.g., Rabin fingerprinting) to select cut points based on local byte patterns so inserts shift only nearby chunks.
Rabin-style rolling hash (concept): maintain a sliding window of length \(W\); update hash in O(1) per byte; when hash mod P == R (or low bits hit a threshold), emit a chunk boundary. Parameters control average chunk size.
| Approach | Pros | Cons |
|---|---|---|
| Fixed 4 MB chunks | Easy | Poor dedup on small edits shifting all tail bytes |
| CDC + Rabin | Stable boundaries around edits | CPU overhead; parameter tuning |
| Whole-file hash | Trivial | No dedup across files |
Deduplication: after chunking, compute SHA-256 (or BLAKE3) per chunk. Block service stores once keyed by hash; metadata references hashes.
Java: Rabin fingerprint rolling (illustrative core)
public final class RabinFingerprint {
    // 2^31 - 1 (Mersenne prime). Keeping residues below 2^31 means the
    // product of two residues fits in a signed 64-bit long, so a * b in
    // mulMod cannot overflow (a 61-bit modulus would overflow here).
    private static final long MOD = (1L << 31) - 1;

    private final int windowSize;
    private final long base;
    private final long dropFactor; // base^(windowSize - 1) mod MOD

    public RabinFingerprint(int windowSize, long base) {
        this.windowSize = windowSize;
        this.base = base % MOD;
        long p = 1;
        for (int i = 1; i < windowSize; i++) {
            p = mulMod(p, this.base);
        }
        this.dropFactor = p;
    }

    private static long mulMod(long a, long b) {
        return Math.floorMod(a * b, MOD); // safe: |a|, |b| < 2^31
    }

    /** Slide the window one byte: drop the oldest byte, append the newest. */
    public long roll(long currentHash, byte outByte, byte inByte) {
        long drop = mulMod(outByte & 0xFFL, dropFactor);
        long shifted = mulMod(Math.floorMod(currentHash - drop, MOD), base);
        return Math.floorMod(shifted + (inByte & 0xFFL), MOD);
    }
}
Note
Production CDC adds minimum/maximum chunk size guards to avoid tiny or huge chunks. The chunker implementations in restic and rsync are good offline reading.
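To make the min/max guards concrete, here is a toy content-defined chunker using a gear-style rolling hash. The byte table, parameters, and thresholds are all illustrative; production systems (FastCDC-style) tune these carefully.

```python
import hashlib

# Deterministic per-byte "random" table (gear-hash style); illustrative only.
_GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
         for i in range(256)]

def cdc_chunks(data: bytes, min_size=256, avg_bits=10, max_size=4096):
    """Split data at content-defined boundaries with min/max size guards."""
    chunks, start, h = [], 0, 0
    mask = (1 << avg_bits) - 1  # boundary when low avg_bits of hash are zero
    for i, b in enumerate(data):
        # Shift-and-add rolling hash; bytes older than 64 positions shift out.
        h = ((h << 1) + _GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if length < min_size:
            continue  # guard: never emit a tiny chunk
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # final partial chunk
    return chunks
```

Because cut points depend only on nearby bytes, an insertion early in the file perturbs only the chunks around the edit; the chunker re-synchronizes at the next content-defined boundary.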
4.2 Sync Protocol (Client-Server)¶
Clients maintain a local journal of file events from the OS file watcher, debounce rapid edits, chunk changed files, then reconcile with the server using revision tokens (etag, revision, or vector clock per file).
Notification options:
| Mechanism | When it fits | Drawbacks |
|---|---|---|
| WebSocket | Always-on desktop/mobile foreground | Stateful infra, reconnect storms |
| Long polling | Mobile background limitations | Higher latency, many held requests |
| Periodic pull | Low-priority devices | Slower sync |
Go: minimal sync client loop (polling variant)
package sync

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

type ChangeResponse struct {
	Cursor string  `json:"cursor"`
	Events []Event `json:"events"`
}

type Event struct {
	Type     string `json:"type"`
	FileID   string `json:"file_id"`
	Revision int64  `json:"revision"`
	ETag     string `json:"etag"`
}

type Client struct {
	BaseURL     string
	HTTP        *http.Client
	SinceCursor string
}

func (c *Client) Poll(ctx context.Context) error {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			req, err := http.NewRequestWithContext(ctx, http.MethodGet,
				c.BaseURL+"/v1/changes?since="+c.SinceCursor, nil)
			if err != nil {
				return err
			}
			resp, err := c.HTTP.Do(req)
			if err != nil {
				continue // transient network error; retry on next tick
			}
			if resp.StatusCode == http.StatusOK {
				var cr ChangeResponse
				if err := json.NewDecoder(resp.Body).Decode(&cr); err == nil {
					c.SinceCursor = cr.Cursor
					// enqueue local reconciliation against cr.Events
				}
			}
			resp.Body.Close()
		}
	}
}
Tip
Always combine push with pull as a backstop. Mobile OS may kill sockets; cursor-based change feeds make recovery simple.
4.3 Metadata Database Design¶
Model nodes (files/folders) with stable node_id. Store parent_id, name, owner_id, revision, content manifest id, and deleted flag. Avoid encoding full path as primary key; maintain a path cache updated asynchronously or computed via traversal for correctness.
| Table (illustrative) | Purpose |
|---|---|
| `nodes` | One row per file/folder; revision, owner |
| `manifests` | Ordered list of (offset, chunk_hash, length) |
| `acl_entries` | Resource, principal, role |
| `revisions` | Historical manifests for version history |
Sharding key: owner_id or namespace_id keeps co-located data for a tenant. Cross-shard moves (shared drives) require two-phase workflows or a centralized coordinator.
Warning
Uniqueness of (parent_id, name) should be enforced transactionally to prevent duplicate siblings—classic tree constraint.
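The constraint is easy to demonstrate with SQLite (schema is a minimal illustration of the `nodes` table, not the full design). Note the caveat in the comment: SQL `UNIQUE` treats NULLs as distinct, which is one reason to prefer a sentinel root id over a NULL `parent_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER,        -- NULL bypasses UNIQUE; use a sentinel for root
        name TEXT NOT NULL,
        UNIQUE (parent_id, name)  -- no duplicate siblings under one parent
    )
""")
conn.execute("INSERT INTO nodes (parent_id, name) VALUES (1, 'Q3.pdf')")
try:
    # Second sibling with the same name under the same parent must fail.
    conn.execute("INSERT INTO nodes (parent_id, name) VALUES (1, 'Q3.pdf')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

In a sharded production store the same invariant is enforced inside the commit transaction on the shard that owns the parent folder.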
4.4 Block Storage Architecture¶
Block service exposes:
- `PUT /blocks/{hash}` with Content-Type and length
- Quorum or single-leader replication inside the cell
- Background scrubbing to detect bit rot
- Optional encryption with per-tenant keys (wrap DEKs in KMS)
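The PUT-by-hash contract above can be sketched as a tiny in-memory content-addressed store; the class shape is an assumption for illustration.

```python
import hashlib

class BlockStore:
    """In-memory sketch of an idempotent, content-addressed block store."""

    def __init__(self):
        self._blocks = {}  # sha256 hex -> bytes

    def put(self, data: bytes) -> tuple[str, bool]:
        """Store by SHA-256 of content; returns (hash, newly_stored)."""
        h = hashlib.sha256(data).hexdigest()
        if h in self._blocks:
            return h, False  # dedup hit: identical chunk already stored
        self._blocks[h] = data
        return h, True

    def get(self, h: str) -> bytes:
        data = self._blocks[h]
        # Verify on read: a mismatch would indicate bit rot or tampering.
        assert hashlib.sha256(data).hexdigest() == h
        return data
```

Because the key is derived from the content, retried uploads and duplicate chunks across users collapse to one physical object, which is exactly what makes `PUT` idempotent.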
| Layer | Role |
|---|---|
| API gateways | AuthN/Z, rate limits |
| Chunk routers | Route hash to the correct volume / cell |
| Volume servers | Store immutable chunks on disk |
| Erasure coding | Reduce raw overhead vs 3× replication |
Hot spots: extremely popular files dedup to the same hashes—mitigate with CDN for public links and read caching in the block layer.
4.5 Conflict Resolution¶
Conflicts arise when two writers commit against the same base revision or when offline edits replay against a newer server revision.
| Strategy | Behavior | UX |
|---|---|---|
| Last-writer-wins (LWW) | Higher revision timestamp wins | Simple; data loss risk |
| First-commit wins | Second commit rejected | Forces manual retry |
| Mergeable types | CRDT / OT for text | Great for docs; hard for binaries |
| Branch copies | Save conflict copy | Safest for binaries; cluttered folders |
For Google Docs–style products, operational transforms live in a separate editor service; for binary Drive files, conflict copies are common.
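For binary files, the base-revision check plus conflict-copy policy reduces to a small decision function. A sketch (the naming convention for conflict copies is an assumption; products vary):

```python
def resolve_commit(head_revision: int, base_revision: int, path: str):
    """Decide a binary-file commit outcome under a conflict-copy policy."""
    if base_revision == head_revision:
        return ("commit", path)  # fast path: client saw the latest revision
    # Concurrent edit detected: preserve both versions under distinct names.
    stem, dot, ext = path.rpartition(".")
    if dot:
        conflict = f"{stem} (conflict copy).{ext}"
    else:
        conflict = f"{path} (conflict copy)"
    return ("conflict_copy", conflict)
```

Nothing is silently overwritten: the losing writer's bytes land in a visibly named sibling, and the user (or a merge tool) decides.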
4.6 File Versioning and History¶
Each successful commit creates an immutable manifest snapshot. GC reclaims unreferenced chunks asynchronously after retention policies expire.
| Policy | Description |
|---|---|
| Retention window | Keep last N versions or T days |
| Legal hold | Block GC for custodied users |
| Diff storage | Optional binary deltas between manifests for savings |
Point-in-time restore = metadata pointer swap to an older manifest_id plus ACL checks.
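The GC described above is a mark-sweep over manifests. A minimal sketch with in-memory stand-ins for the chunk index and manifest table:

```python
def sweep_unreferenced(all_chunks: set[str], manifests: dict[str, list[str]]) -> set[str]:
    """Return chunk hashes referenced by no live manifest (GC candidates).

    all_chunks: every chunk hash known to the block store.
    manifests: manifest_id -> ordered list of chunk hashes it references.
    """
    live: set[str] = set()
    for chunk_hashes in manifests.values():  # mark phase
        live.update(chunk_hashes)
    return all_chunks - live                 # sweep phase: unreachable chunks
```

In production the sweep is deferred behind a tombstone window (and blocked by legal hold) so that in-flight commits referencing a chunk cannot race its deletion.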
4.7 Sharing and Permissions¶
Represent sharing as ACL edges on node_id with roles. Evaluate deny vs allow carefully; most products use allow lists with owner override.
Evaluation path: resolve node_id from path, walk ancestors for inherited permissions (if supported), merge with explicit grants on the node.
| Role | Typical capabilities |
|---|---|
| Viewer | Read, download |
| Commenter | Read + annotate (product-specific) |
| Editor | Modify content, add children |
| Owner | Share, delete, transfer |
Audit: append-only audit log for compliance (who granted access to whom).
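The evaluation path (explicit grants merged with inherited ancestor grants) can be sketched in a few lines; the dict-based maps stand in for the `nodes` and `acl_entries` tables and are illustrative only.

```python
ROLE_RANK = {"viewer": 1, "commenter": 2, "editor": 3, "owner": 4}

def effective_role(node_id: str, user_id: str, parents: dict, grants: dict):
    """Walk from node to root, returning the highest granted role (or None).

    parents: child node_id -> parent node_id (None at root).
    grants: (node_id, user_id) -> role string.
    """
    best = None
    current = node_id
    while current is not None:
        role = grants.get((current, user_id))
        if role and (best is None or ROLE_RANK[role] > ROLE_RANK[best]):
            best = role  # explicit grant can upgrade an inherited one
        current = parents.get(current)
    return best
```

Real systems cache this resolution aggressively (short-TTL policy caches, as in the CAP discussion) because it runs on every metadata read.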
4.8 Offline Support¶
Clients queue operations in a persistent outbox while offline: chunking, hash, and manifest computation can proceed locally; uploads resume with range or chunk session APIs.
File watcher (Python, watchdog-style sketch)
import hashlib
import time
from pathlib import Path

class LocalChangeQueue:
    def __init__(self):
        self.pending = []

    def on_modified(self, path: Path):
        # debounce in real code
        self.pending.append((time.time(), path))

    def drain(self):
        # coalesce multiple events for same path
        by_path = {}
        for ts, p in self.pending:
            by_path[p] = ts
        self.pending.clear()
        return list(by_path.keys())

def fingerprint_preview(path: Path, n: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        h.update(f.read(n))
    return h.hexdigest()
Desktop clients run a lightweight DB (SQLite) tracking local inode state, server revision, and upload status.
Step 5: Scaling & Production¶
| Concern | Technique |
|---|---|
| Metadata hotspots | Shard by tenant; cache path→id; rate limit noisy APIs |
| Block hot hashes | CDN for public; internal read caches |
| Notification storms | Coalesce events per folder; sample low-priority clients |
| GC of chunks | Mark-sweep from manifests; tombstone windows |
| Multi-region | Metadata per region + replication; blocks geo-replicated or regional with cross-region CRR |
| Compliance | Data residency pinned shards; encryption keys in regional KMS |
Observability: trace IDs across API → metadata → block PUT; metrics on dedup ratio, chunk upload latency, conflict rate, and sync backlog depth.
Interview Tips¶
| Do | Avoid |
|---|---|
| Separate metadata and block planes | Conflating file paths with raw object keys only |
| Discuss CDC + hashing for dedup | Claiming whole-file upload is enough at Dropbox scale |
| Mention etag/revision concurrency | Ignoring offline/conflict scenarios |
| Cover WebSocket + cursor fallback | Assuming pure push always works on mobile |
| Call out small-file overhead | Ignoring listing performance and metadata cost |
Tip
If time permits, relate to backup products (restic, Borg)—interviewers like knowing you understand production chunking trade-offs.
Summary¶
- Chunking: CDC with Rabin fingerprints stabilizes chunk boundaries; cryptographic hashes enable deduplicated block storage.
- Sync: Clients watch local FS, upload missing chunks, commit manifests; servers broadcast changes via WebSocket or clients poll change feeds with cursors.
- Metadata: Tree of nodes + manifests + ACLs; shard by tenant; transactional sibling name uniqueness.
- Conflicts: Choose explicit policies; binary files often become conflict copies; collaborative docs may use OT/CRDTs in a separate service.
- Versions: Immutable manifests; retention and legal hold; async GC for unreferenced chunks.
Interview Checklist¶
- Clarify max file size, sharing model, offline expectations, and collaboration (binary vs collaborative editing)
- Quantify metadata QPS, storage with dedup, and notification fan-out
- Draw clients → API → metadata DB + block store + notifier
- Explain CDC and why fixed chunking fails on inserts
- Walk upload: chunk, hash, conditional block put, manifest commit with etag
- Walk download: resolve manifest, fetch blocks, verify hashes
- Describe conflict handling for offline edits
- Discuss versioning, GC, and encryption
- Mention failure modes: partial uploads, stuck notifications, split-brain on clients
Sample Interview Dialogue¶
Interviewer: “Design a system like Google Drive.”
Candidate: “I’ll split the problem into metadata—folders, permissions, file revisions—and bulk storage for bytes. Blocks are content-addressed by hash so duplicates store once. Clients run a sync agent with a file watcher, chunk files using content-defined chunking so small edits don’t reshuffle the entire file, upload only unknown chunks, then commit a manifest in a transaction.”
Interviewer: “How does sync stay fast?”
Candidate: “Each file has a revision or etag. Clients fetch deltas from a change feed using a cursor. For online devices, we push notifications over WebSocket; mobile may rely on long polling or periodic pulls. The cursor makes recovery simple after reconnect.”
Interviewer: “What about two people editing the same file?”
Candidate: “If it’s binary, we usually detect base revision mismatch on commit. Options are reject and refresh, last-winner, or write a conflict copy. Collaborative text is a different subsystem—OT/CRDT—but for Drive-like binaries, conflict copies are common.”
Interviewer: “How do you delete unused chunks?”
Candidate: “Manifests reference chunk hashes. After deletes and retention windows, a GC pass marks unreachable hashes and deletes them lazily, with safeguards for snapshots and legal hold.”
Further Reading (Patterns)¶
| Topic | Why it matters |
|---|---|
| Content-defined chunking | Traditional fixed-size chunking breaks when bytes are inserted (all subsequent chunks shift). Content-defined chunking (Rabin fingerprinting) uses rolling hashes to find cut points based on content, so edits only affect the modified chunk. This is how Dropbox and Google Drive achieve efficient deduplication and delta sync — only changed chunks are uploaded, not entire files. |
| Merkle trees | A Merkle tree (hash tree) recursively hashes data blocks into a single root hash. To verify a specific block, you only need the sibling hashes along the path to the root (O(log n) proof). Cloud storage uses this for integrity verification (detect corruption without re-reading everything) and efficient sync (compare root hashes → drill down to find exactly which chunks differ between client and server). |
| Operational transformation | Google Docs pioneered real-time collaborative editing using OT — an algorithm that transforms concurrent edit operations so they converge to the same document state regardless of arrival order. Understanding OT (and its successor, CRDTs) is essential for cloud storage systems that support simultaneous editing, because it solves the fundamental problem of conflict resolution without locking. |
| Rsync algorithm | Andrew Tridgell's rsync algorithm (1996) efficiently synchronizes files over a network by computing rolling checksums (Adler-32) and strong checksums (MD5) on the destination, then using them to identify matching blocks on the source. Only non-matching blocks are transferred. This is the conceptual foundation for delta-sync in cloud storage — minimizing bandwidth by transferring only the differences between local and remote file versions. |
Note
Use this page as a structured outline, not a claim about any vendor’s internal architecture. Trade names are for orientation only.