Design a Metrics and Monitoring System (Datadog / Prometheus–style)¶
What We're Building¶
We are designing a metrics collection, storage, querying, and alerting platform comparable in scope to Datadog, Prometheus, Grafana Mimir, Thanos, or Google Monarch: agents ship time-series samples, the backend stores them efficiently at scale, users query and visualize them, and alert rules fire when conditions breach SLOs.
| Capability | Why it matters |
|---|---|
| High-cardinality ingestion | Modern microservices emit huge label combinations; the system must not collapse under cardinality explosions |
| Durable, compressed storage | Time-series data is append-heavy; compression (e.g., Gorilla-style) dominates cost |
| Fast analytical queries | Dashboards and incident response need sub-second queries over large windows |
| Reliable alerting | Missed pages cost revenue; false positives burn out on-call engineers |
| Horizontal scale | Ingestion and query load grow with services, regions, and tenants |
Real-world scale (illustrative)¶
| System | Scale signal (public / typical talk) |
|---|---|
| Prometheus | Single-server focus; federation and remote write for scale-out |
| Datadog | Trillions of points/day; multi-tenant SaaS |
| Google Monarch | Planet-scale; hierarchical aggregation |
| Cortex / Mimir / Thanos | Long-term Prometheus-compatible storage at cluster scale |
Reference systems (interview vocabulary)¶
| System | Positioning | Typical talking point |
|---|---|---|
| Prometheus | Pull-first; local TSDB; single binary simplicity | Federation + remote write for scale; TSDB block format |
| Datadog / Dynatrace / New Relic | SaaS agents; unified APM + metrics | Multi-tenant isolation; custom billing per host or per span |
| Google Monarch | Global hierarchy; Borg-aware | Aggregation tree; SLO objects as first-class |
| AWS CloudWatch | Regional control plane; tight AWS integration | Metric streams to Kinesis; cross-account dashboards |
| Grafana Mimir / Cortex | Horizontally scaled Prometheus-compatible backend | Shuffle sharding, limits, object store backend |
Why monitoring matters¶
- SLOs and error budgets tie product reliability to engineering velocity; metrics are the feedback loop.
- Incident detection depends on latency histograms, saturation (CPU, queue depth), and golden signals (latency, traffic, errors, saturation).
- Capacity planning uses historical trends; without retention and rollups, forecasting is guesswork.
- Debugging pairs metrics with traces and logs—but metrics are the cheapest signal at high volume.
Note
In interviews, scope cardinality, retention, and multi-tenancy early. Many designs fail on “we’ll just index everything” or “one global Prometheus.”
Key Concepts Primer¶
This section grounds the later steps in time-series semantics, compression intuition, observability methodologies, and cardinality economics—the vocabulary senior interviewers expect when you say “Prometheus-style” or “Mimir-scale.”
Time-series vs relational data¶
| Dimension | Relational (OLTP) | Time-series (metrics) |
|---|---|---|
| Primary access pattern | Point lookups, joins on keys | Range scans on (series_id, time) |
| Mutability | UPDATE/DELETE rows | Append-mostly; deletes are retention / block drops |
| Schema | Fixed columns + evolving migrations | Sparse wide model: metric + dynamic labels |
| Ordering | Application-defined | Time is first-class; many algorithms assume per-series monotonic t |
| Consistency | Transactions | Eventually merged replicas; idempotent samples |
Tip
In interviews, say: “Each series is a narrow table of (t, v) ordered by t; the database problem is billions of these narrow tables, not one wide fact table.”
Gorilla compression — step-by-step (with bit-level intuition)¶
Goal: exploit smoothness in time and similarity in value for floating-point samples on a single series.
- First timestamp (`t₀`): store raw (e.g., 64-bit UNIX ms).
- Next timestamps: compute the first delta `dᵢ = tᵢ − tᵢ₋₁`. For regular scrapes, `dᵢ` is almost constant (e.g., 15000 ms).
- Delta-of-delta: `dodᵢ = dᵢ − dᵢ₋₁`. For perfectly regular intervals, `dodᵢ = 0` → encode with few bits (often a 1–2-bit control code for "same as last interval").
- First float value (`v₀`): store raw IEEE-754 bits.
- Next values: `xorᵢ = bits(vᵢ) XOR bits(vᵢ₋₁)`. Metrics often change only slightly in the mantissa → leading zero bytes dominate the XOR; Gorilla encodes the length of the meaningful bits plus payload, not always the full 64 bits.
Bit-level toy example (values only):
| Step | Meaning | Bits (conceptual) |
|---|---|---|
| `v₀ = 1.0` | baseline | full 64 |
| `v₁ = 1.01` | XOR shrinks if exponents align | ~12–20 meaningful bits typical in the paper's examples |
| `v₂ = 1.00` | XOR with `v₁` | again short payload |
Timestamps follow the same idea: dod = 0 might cost 2 bits total for that tick, vs 64 bits raw.
Pull vs push — models and diagrams¶
| Aspect | Pull (scrape) | Push (remote write / OTLP) |
|---|---|---|
| Who initiates | Server `GET /metrics` | Client `POST` to collector |
| Clock anchor | Scraper’s clock defines sample time | Client must stamp or server receives time |
| NAT / private IP | Target must be reachable | Works outbound-only |
| Cardinality on server | Controlled by targets discovered | Easy to accidentally flood |
Pull (scrape): the control plane drives collection; the scraper discovers targets and fetches `/metrics` on an interval. Push: the data plane drives delivery; agents batch samples and send them outbound to a collector.
Note
Production stacks often use both: OTel push into a collector that also scrapes legacy exporters.
Golden signals (with concrete examples)¶
Google’s four golden signals—latency, traffic, errors, and saturation—map cleanly to metric types:
| Signal | What it answers | Example metrics |
|---|---|---|
| Latency | How slow for users? | Histogram of http_request_duration_seconds, p99 panel |
| Traffic | How much demand? | rate(http_requests_total[5m]) by route |
| Errors | How many failures? | 5xx ratio, rate(errors_total[5m]) |
| Saturation | How full is the pipe? | CPU %, queue depth, disk almost full, thread pool exhausted |
RED vs USE — when to reach for which¶
| Method | Stands for | Best for | Typical scope |
|---|---|---|---|
| RED | Rate, Errors, Duration | Request-driven services (microservices, HTTP/gRPC) | Per-service dashboards |
| USE | Utilization, Saturation, Errors | Resources: CPU, disk, network, memory | Nodes, pools, queues |
Concrete pairing:
- RED: `rate(requests_total)`, `rate(errors_total)`, histogram `duration_seconds`.
- USE: CPU utilization %, run queue or scheduler wait (saturation), disk I/O errors (errors).
Tip
In interviews: RED = service-centric, USE = resource-centric; both complement golden signals (naming differs, intent overlaps).
Cardinality — definition, failure mode, and concrete math¶
Cardinality = number of distinct time series matching your selectors—equivalently, distinct label value combinations you actually emit.
Why it kills systems:
- Index RAM: each new series adds posting-list entries, symbol table strings, and head-block structures.
- Query fan-out: `sum by (pod)(...)` across 50k pods = merge 50k streams.
- Alert evaluation: the same explosion repeats every `eval_interval`.
Concrete math (illustrative):
Suppose:
- Base series per metric: 1 (no labels).
- You add label `pod` with 5,000 values.
- You add label `path` with 200 tracked routes.
If independent (simplified upper bound): 1 × 5,000 × 200 = 1,000,000 series for one metric family—not counting other labels.
If you add `trace_id` (effectively unbounded cardinality) to a metric scraped every 15 s, nearly every scrape mints a brand-new series: the series count grows at roughly the scrape rate, indexes cannot keep up, and memory use and compaction lag grow without bound.
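The multiplication above can be checked with a few lines; this is an illustrative upper bound assuming labels vary independently, not a measured number:

```python
def series_upper_bound(label_cardinalities: list[int]) -> int:
    """Worst-case distinct series for one metric family: the product of
    per-label value counts (assumes label values vary independently)."""
    total = 1
    for n in label_cardinalities:
        total *= n
    return total

# 1 base series × 5,000 pods × 200 routes = 1,000,000 series
print(series_upper_bound([5_000, 200]))  # 1000000
```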
| Knob | Effect |
|---|---|
| Drop / rewrite labels at ingest | Caps worst-case series count |
| Recording rules | Pre-aggregate to low-cardinality metrics |
| Hard limits per tenant | Reject or sample excess labels |
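As a sketch of the first knob, a hypothetical ingest-time allowlist (label names invented for illustration) caps which keys can mint new series:

```python
ALLOWED_LABELS = {"job", "method", "status", "region"}  # hypothetical allowlist

def sanitize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Drop labels outside the allowlist so unbounded keys
    (trace_id, user_id, ...) cannot create new series per request."""
    return {k: v for k, v in labels.items() if k in ALLOWED_LABELS}

print(sanitize_labels({"job": "api", "trace_id": "abc123", "method": "GET"}))
# trace_id is dropped; series identity stays bounded
```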
Step 1: Requirements Clarification¶
Questions to ask the interviewer¶
| Question | Why it matters |
|---|---|
| Push vs pull for ingestion? | Agent architecture, security, and scrape intervals |
| Single-tenant vs SaaS? | Isolation, noisy-neighbor controls, billing |
| Max cardinality per metric / globally? | Index and memory limits; label allowlists |
| Query language: PromQL, SQL, or custom? | Parser, planner, and compatibility with Grafana |
| Long-term retention vs raw resolution? | Downsampling, object storage tiering |
| Multi-region active-active or primary + DR? | Replication, clock skew, query consistency |
| Compliance (encryption at rest, audit)? | KMS, per-tenant keys, log redaction |
| Integration with PagerDuty, Slack, email? | Notification routing and escalation |
Functional requirements¶
| Area | Requirements |
|---|---|
| Metric ingestion | Accept samples: metric name, labels (dimensions), timestamp, numeric value; support counters, gauges, histograms/summaries |
| Time-series storage | Persist ordered (timestamp, value) pairs per series; efficient compression and retention policies |
| Querying / aggregation | Range queries, instant queries, sum by, rate, histogram quantiles, top-k |
| Dashboards | Pre-aggregated panels; variable refresh; templated labels |
| Alerting | Rule definitions, evaluation window, for duration, recovery detection |
| Anomaly detection (optional) | Baseline vs threshold; seasonality—often later phase or delegated to ML service |
Non-functional requirements (targets for this walkthrough)¶
| NFR | Target | Rationale |
|---|---|---|
| Ingestion throughput | 100K+ samples/sec per shard (cluster scales horizontally) | Large microservice estates |
| Query latency | p99 < 1 s for dashboard-sized queries | Human-in-the-loop; tighter for critical APIs if needed |
| Retention | 1 year raw or rolled-up; older data at coarser resolution | Cost vs precision |
| Availability | 99.99% for ingestion + query API (excluding client misconfig) | SLO-driven; meta-monitoring separate |
| Durability | No silent data loss; WAL + replication | Trust in alerts and capacity graphs |
Warning
99.99% for the entire stack including third-party notification delivery is unrealistic—split SLAs: ingestion, query, notification delivery (best-effort with retries).
Whiteboard checklist (expand each if asked)¶
| Bucket | Detail |
|---|---|
| Ingestion | Parse text (Prometheus exposition) or OTLP; validate label names; reject invalid timestamps |
| Storage | Immutable blocks after close; compaction rewrites for space; delete = retention policy (not random row delete) |
| Query | Instant + range API; subqueries; optional SQL façade for BI |
| Dashboards | Recording rules precompute expensive expressions; cache panel results |
| Alerting | Silences, inhibition (warning suppresses if critical fires), routes |
| Anomaly | Often phase 2: seasonal baselines need long windows and clean data |
Technology Selection & Tradeoffs¶
Choosing the right components for a metrics system is not about picking "the best" technology—it is about understanding why each choice fits the workload's access patterns, consistency needs, and operational cost. This section maps each major component to the alternatives considered and the rationale for selection.
Time-Series Database (TSDB) selection¶
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| Custom TSDB (Prometheus TSDB-style) | Purpose-built for `(series_id, t, v)` scans; Gorilla compression; LSM-like block layout; no external deps | Must build compaction, retention, index from scratch; single-node unless you add remote write | You need tight control over block format and compression; interview default |
| InfluxDB | Built-in retention policies, continuous queries, Flux language; time-aware schema | Cardinality limits in OSS; commercial clustering; different query ecosystem from PromQL | InfluxDB-native stack; moderate cardinality; team already operates it |
| TimescaleDB (PostgreSQL extension) | Full SQL; joins with relational metadata; familiar pg tooling; hypertable partitioning | Higher per-sample overhead than columnar TSDB; compression is good but not Gorilla-level; scaling requires Citus or manual sharding | You already run PostgreSQL; need SQL joins (e.g., alert definitions in same DB); moderate scale |
| ClickHouse / Apache Druid | Columnar OLAP; excellent for aggregation-heavy queries; high compression on sorted columns | Not purpose-built for per-series streaming appends; ingestion model differs from TSDB; more complex ops | Analytics-heavy workloads; long-range aggregations; you already have a streaming pipeline into OLAP |
| Cassandra / ScyllaDB wide-column | Proven linear horizontal scale; write-optimized LSM; tunable consistency | No native time-series compression; cardinality = partition key explosion; range scans less efficient than TSDB block format | Already operating C* at scale; willing to trade query expressiveness for write throughput |
Our choice for this design: Custom TSDB block format (Prometheus-compatible) for hot data + object storage (S3/GCS) for cold blocks. Rationale:
- Access pattern is append-only writes + range scans on (series, time_range) — fits TSDB block layout perfectly.
- Gorilla compression gives ~10x space savings over general-purpose columnar.
- PromQL compatibility means drop-in support for Grafana dashboards — the de facto visualization standard.
- Object storage for cold tiers removes the need for a separate OLAP cluster for long-range queries.
Message queue / ingestion pipeline¶
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| Kafka | High throughput; durable log; replay capability; mature ecosystem | Operational overhead (ZooKeeper/KRaft); latency floor ~5–50 ms per batch | High-throughput ingestion; need replay for reprocessing; multi-consumer fan-out |
| gRPC direct push (Prometheus remote write) | Simple; no intermediate broker; low latency | No buffering if ingester is down; back-pressure must be handled at the agent | Prometheus ecosystem; agents handle local WAL and retry |
| NATS / NATS JetStream | Lightweight; low latency; simple ops | Smaller ecosystem; less mature persistence guarantees than Kafka | Latency-sensitive ingestion; smaller scale; simpler ops team |
| AWS Kinesis / GCP Pub/Sub | Managed; auto-scaling; integrated with cloud ecosystem | Vendor lock-in; cost at very high throughput; latency varies | Cloud-native deployment; team prefers managed services |
Our choice: gRPC remote write for the primary push path (Prometheus-compatible, low latency, agents buffer locally) + Kafka as an optional fan-out layer for consumers that need replay (alerting, analytics). This avoids adding a mandatory broker on the hot ingestion path while providing durability through the agent's local WAL.
Metadata and rule storage¶
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| PostgreSQL | ACID; rich queries for alert rules, dashboards, user configs; mature tooling | Not suited for high-cardinality time-series data; separate from TSDB | Alert rules, dashboard definitions, tenant configs — low-volume, high-consistency data |
| etcd / Consul | Distributed KV with strong consistency; watch primitives for config propagation | Limited query capability; not designed for large datasets | Ring membership, service discovery, small config blobs |
| Object storage (S3) | Cheap; durable; works for block metadata | High latency for point reads; no transactions | Block meta.json, compacted index files |
Our choice: PostgreSQL for alert rules, dashboard definitions, and tenant metadata. etcd for ingester ring membership and distributed coordination. Object storage for TSDB block metadata and archived blocks.
Cache layer¶
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| Redis | Sub-ms latency; rich data structures (sorted sets, streams); cluster mode | Memory-bound cost; persistence is secondary | Query result caching; recent alert state; tenant limit counters |
| Memcached | Simple; multi-threaded; good for uniform key-value caching | No data structures beyond key-value; no persistence | Simple query result caching at high throughput |
| In-process LRU (e.g., Ristretto, groupcache) | No network hop; lowest latency | Limited to single-process memory; cache coherence across replicas | Posting list caches; chunk header caches in store-gateway |
Our choice: In-process LRU for hot posting lists and chunk headers in store-gateways (avoids network hop on every query), Redis for query frontend result caching and tenant rate-limit counters.
CAP Theorem Analysis¶
The monitoring system spans multiple data paths, each with different consistency requirements. Rather than applying CAP uniformly, we analyze each path:
Write path (ingestion)¶
| CAP property | Choice | Rationale |
|---|---|---|
| Consistency | Eventual | Replicated ingesters may have slightly different head blocks; queries merge and deduplicate |
| Availability | Prioritized | Dropping samples during ingestion is worse than temporarily inconsistent reads — samples are irrecoverable |
| Partition tolerance | Required | Multi-AZ deployment means network partitions are expected |
Classification: AP system — Availability over Consistency. Ingesters accept writes independently; replication is async within the ring. During a partition, both sides continue accepting samples; deduplication at query time resolves overlaps.
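A minimal sketch of query-time deduplication under these assumptions: replicas hold overlapping `(timestamp_ms, value)` streams for one series, and identical timestamps collapse during the merge:

```python
def merge_replicas(*replicas: list[tuple[int, float]]) -> list[tuple[int, float]]:
    """Merge sample streams from replicated ingesters into one
    time-ordered series, keeping a single sample per timestamp."""
    merged: dict[int, float] = {}
    for replica in replicas:
        for ts, value in replica:
            merged[ts] = value  # overlapping timestamps collapse to one sample
    return sorted(merged.items())

a = [(1000, 1.0), (2000, 2.0)]
b = [(2000, 2.0), (3000, 3.0)]   # overlap at t=2000 from the other replica
print(merge_replicas(a, b))      # [(1000, 1.0), (2000, 2.0), (3000, 3.0)]
```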
Read path (queries)¶
| CAP property | Choice | Rationale |
|---|---|---|
| Consistency | Tunable | Dashboard queries tolerate staleness (seconds); SLO queries may need tighter bounds |
| Availability | Prioritized | A monitoring dashboard that is down during an incident is worse than showing slightly stale data |
Classification: AP with tunable staleness — Queriers merge results from ingesters and store-gateways; if one source is partitioned, results may be partial but the query still returns (with a staleness annotation).
Alerting path¶
| CAP property | Choice | Rationale |
|---|---|---|
| Consistency | Stronger preference | Missed alerts have direct business impact; duplicate alerts are tolerable |
| Availability | High | Alert evaluation must run even during partial failures |
Classification: AP with compensating mechanisms — Alert rules are evaluated against the available data; if data is missing, rules may use absent() or for duration to require sustained conditions before firing. This prevents false negatives from transient partitions.
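The `for` duration behavior can be sketched as a tiny evaluator; this is simplified — real rulers track per-label-set state across `inactive → pending → firing`:

```python
def alert_state(breach_history: list[bool], for_ticks: int) -> str:
    """'firing' only after the condition holds for `for_ticks` consecutive
    evaluations; a single clean evaluation resets the streak."""
    consecutive = 0
    for breached in breach_history:
        consecutive = consecutive + 1 if breached else 0
    if consecutive >= for_ticks:
        return "firing"
    return "pending" if consecutive > 0 else "inactive"

print(alert_state([True, True], for_ticks=5))          # pending
print(alert_state([True] * 5, for_ticks=5))            # firing
print(alert_state([True, False, True], for_ticks=5))   # pending (gap reset it)
```

This is how sustained-condition rules avoid paging on transient partitions: a brief data gap resets or pauses the streak instead of firing.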
Metadata (alert rules, dashboards)¶
| CAP property | Choice | Rationale |
|---|---|---|
| Consistency | Strong (CP) | Alert rule changes must be consistently applied; a stale rule could suppress a critical alert |
| Availability | Acceptable degradation during partition | Alert rules change infrequently; brief unavailability of the rule API is acceptable |
Classification: CP — PostgreSQL with synchronous replication for alert rules; etcd for ring membership with Raft consensus.
Note
In interviews, do not say "we pick AP" without specifying which data path. Monitoring systems are mixed-mode: AP for samples, CP for configuration. This nuance is what separates senior answers.
SLA and SLO Definitions¶
A monitoring system is infrastructure for other systems' SLOs — its own reliability must exceed the systems it monitors. Define separate SLAs for each capability:
Internal SLOs (engineering targets)¶
| Capability | SLI (what we measure) | SLO (target) | Error budget | Consequence of miss |
|---|---|---|---|---|
| Ingestion availability | % of valid samples accepted (non-4xx, non-5xx) | 99.99% | 4.32 min/month downtime equivalent | Gaps in dashboards; SLO miscalculation |
| Ingestion latency | p99 time from agent push to WAL ack | < 500 ms | — | Agent buffers grow; back-pressure |
| Query availability | % of queries returning non-5xx | 99.9% | 43.2 min/month | Dashboard outages during incidents |
| Query latency | p99 for dashboard-sized queries (< 10K series, < 24h range) | < 1 s | — | Slow incident response |
| Alert evaluation lag | Time between scheduled eval and actual eval completion | < 30 s | — | Late pages; missed SLO violations |
| Alert notification delivery | % of firing alerts that reach at least one notification channel | 99.9% | — | Silent failures; incidents go unnoticed |
| Data durability | % of acknowledged samples retrievable after 1 hour | 99.999% | — | Trust in historical data |
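The downtime-equivalent column follows from simple arithmetic on a 30-day month:

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per month implied by an availability SLO."""
    return (1 - slo) * days * 24 * 60

print(round(monthly_error_budget_minutes(0.9999), 2))  # 4.32
print(round(monthly_error_budget_minutes(0.999), 1))   # 43.2
```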
External SLA (customer-facing, if SaaS)¶
| Tier | Availability | Query latency (p99) | Retention | Price signal |
|---|---|---|---|---|
| Free | 99.5% | < 5 s | 15 days | — |
| Pro | 99.9% | < 2 s | 90 days | Per-host |
| Enterprise | 99.99% | < 1 s | 1 year+ | Per-host + custom metrics |
Error budget policy¶
| Error budget state | Action |
|---|---|
| > 50% remaining | Normal velocity; deploy freely |
| 25–50% remaining | Reduce risky deploys; increase canary soak time |
| < 25% remaining | Feature freeze; focus on reliability work |
| Exhausted | Incident review required; no non-critical changes |
Tip
In interviews, mention error budgets and burn rate alerts (e.g., "if we burn 10% of monthly budget in 1 hour, page immediately"). This shows you understand SLO-driven operations, not just uptime numbers.
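The burn-rate arithmetic behind that example, assuming a 30-day budget period (the alerting thresholds in the comments are illustrative, not prescriptive):

```python
def burn_rate(budget_fraction_consumed: float, window_hours: float,
              budget_period_hours: float = 30 * 24) -> float:
    """How many times faster than 'exactly exhaust the budget at period end'
    we are currently burning."""
    return budget_fraction_consumed * budget_period_hours / window_hours

print(burn_rate(0.10, 1.0))  # ~72: page immediately
print(burn_rate(0.05, 6.0))  # ~6: slower burn; a ticket, not a page
```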
Database Schema and Data Model¶
Time-series storage (internal block format)¶
The TSDB does not use a traditional relational schema — it uses a block-based file format. However, the logical data model is:
Series:
series_id uint64 (hash of metric name + sorted labels)
metric_name string ("http_request_duration_seconds")
labels map[string]string ({"job":"api", "method":"GET", "status":"200"})
Sample:
series_id uint64
timestamp_ms int64
value float64
Symbol table and posting lists (index structure)¶
SymbolTable:
symbol_id uint32
value string (interned label key or value)
PostingList:
label_key symbol_id
label_value symbol_id
series_ids []uint64 (sorted, compressed with delta encoding)
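A sketch of why sorted posting lists work well: gaps between sorted series IDs are small integers (which varint-encode compactly), and label selectors become list intersections. The IDs below are invented for illustration:

```python
def delta_encode(series_ids: list[int]) -> list[int]:
    """First ID raw, then gaps; sorted input makes gaps small."""
    return [series_ids[0]] + [b - a for a, b in zip(series_ids, series_ids[1:])]

def intersect(a: list[int], b: list[int]) -> list[int]:
    """Series matching BOTH label pairs, e.g. job="api" AND method="GET"."""
    return sorted(set(a) & set(b))

job_api = [3, 9, 12, 40]       # series with job="api"
method_get = [9, 12, 55]       # series with method="GET"
print(delta_encode(job_api))   # [3, 6, 3, 28]
print(intersect(job_api, method_get))  # [9, 12]
```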
Relational schema (PostgreSQL — metadata)¶
-- Tenant / organization
CREATE TABLE tenants (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
plan TEXT NOT NULL DEFAULT 'free', -- free | pro | enterprise
max_series BIGINT NOT NULL DEFAULT 100000,
max_samples_sec BIGINT NOT NULL DEFAULT 10000,
retention_days INT NOT NULL DEFAULT 15,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Alert rule definitions
CREATE TABLE alert_rules (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
name TEXT NOT NULL,
expr TEXT NOT NULL, -- PromQL expression
for_duration INTERVAL NOT NULL DEFAULT '5 minutes',
severity TEXT NOT NULL DEFAULT 'warning', -- critical | warning | info
labels JSONB NOT NULL DEFAULT '{}',
annotations JSONB NOT NULL DEFAULT '{}', -- runbook_url, dashboard_url
eval_interval INTERVAL NOT NULL DEFAULT '1 minute',
enabled BOOLEAN NOT NULL DEFAULT TRUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_alert_rules_tenant ON alert_rules(tenant_id) WHERE enabled;
-- Dashboard definitions
CREATE TABLE dashboards (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
title TEXT NOT NULL,
panels JSONB NOT NULL DEFAULT '[]', -- array of panel definitions
variables JSONB NOT NULL DEFAULT '[]', -- template variables
refresh_sec INT NOT NULL DEFAULT 30,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Notification channels (PagerDuty, Slack, email)
CREATE TABLE notification_channels (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
name TEXT NOT NULL,
type TEXT NOT NULL, -- pagerduty | slack | email | webhook
config JSONB NOT NULL, -- encrypted at rest; channel-specific config
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Alert routing rules
CREATE TABLE alert_routes (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
match_labels JSONB NOT NULL DEFAULT '{}', -- e.g., {"severity": "critical"}
channel_id UUID NOT NULL REFERENCES notification_channels(id),
group_wait INTERVAL NOT NULL DEFAULT '30 seconds',
group_interval INTERVAL NOT NULL DEFAULT '5 minutes',
repeat_interval INTERVAL NOT NULL DEFAULT '4 hours',
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Silence rules (temporary suppression)
CREATE TABLE silences (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
matchers JSONB NOT NULL, -- label matchers to silence
starts_at TIMESTAMPTZ NOT NULL,
ends_at TIMESTAMPTZ NOT NULL,
created_by TEXT NOT NULL,
comment TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_silences_active ON silences(tenant_id, ends_at) WHERE ends_at > now();
Storage sizing by table¶
| Table | Row size (est.) | Rows at scale | Total size | Growth rate |
|---|---|---|---|---|
| `tenants` | ~200 B | 10K | ~2 MB | Negligible |
| `alert_rules` | ~500 B | 500K (50 per tenant) | ~250 MB | Slow |
| `dashboards` | ~5 KB (panels JSONB) | 100K | ~500 MB | Slow |
| `notification_channels` | ~300 B | 50K | ~15 MB | Negligible |
| `silences` | ~300 B | 10K active | ~3 MB | Self-pruning (TTL) |
Note
The metadata store is tiny compared to time-series data. PostgreSQL handles this comfortably on a single primary + replica. The real storage challenge is the TSDB blocks in object storage.
API Design¶
Ingestion API¶
| Method | Path | Purpose | Auth |
|---|---|---|---|
| `POST` | `/api/v1/push` | Remote write (Prometheus-compatible protobuf) | API key / mTLS |
| `POST` | `/api/v1/otlp/v1/metrics` | OTLP metrics ingestion (gRPC or HTTP) | API key |
| `GET` | `/metrics` | Scrape endpoint (pull model — target exposes this) | — |
Query API¶
| Method | Path | Purpose | Auth |
|---|---|---|---|
| `GET` | `/api/v1/query` | Instant query: `?query=up&time=<ts>` | API key |
| `GET` | `/api/v1/query_range` | Range query: `?query=rate(http_requests_total[5m])&start=&end=&step=` | API key |
| `GET` | `/api/v1/series` | Find series matching label selectors | API key |
| `GET` | `/api/v1/labels` | List all label names | API key |
| `GET` | `/api/v1/label/{name}/values` | List values for a label | API key |
Alert management API¶
| Method | Path | Purpose | Auth |
|---|---|---|---|
| `GET` | `/api/v1/rules` | List alert rules (paginated) | API key |
| `POST` | `/api/v1/rules` | Create alert rule | API key |
| `PUT` | `/api/v1/rules/{id}` | Update alert rule | API key |
| `DELETE` | `/api/v1/rules/{id}` | Delete alert rule | API key |
| `GET` | `/api/v1/alerts` | List currently firing alerts | API key |
| `POST` | `/api/v1/silences` | Create silence | API key |
| `DELETE` | `/api/v1/silences/{id}` | Expire silence early | API key |
Dashboard API¶
| Method | Path | Purpose | Auth |
|---|---|---|---|
| `GET` | `/api/v1/dashboards` | List dashboards | API key |
| `POST` | `/api/v1/dashboards` | Create dashboard | API key |
| `PUT` | `/api/v1/dashboards/{id}` | Update dashboard | API key |
| `DELETE` | `/api/v1/dashboards/{id}` | Delete dashboard | API key |
Example request/response¶
// POST /api/v1/push (simplified JSON representation of protobuf payload)
{
"timeseries": [
{
"labels": [
{"name": "__name__", "value": "http_requests_total"},
{"name": "method", "value": "GET"},
{"name": "status", "value": "200"}
],
"samples": [
{"timestamp_ms": 1712000000000, "value": 42.0},
{"timestamp_ms": 1712000015000, "value": 43.0}
]
}
]
}
// Response: 200 OK (empty body on success)
// Response: 429 Too Many Requests (rate limited)
// Response: 400 Bad Request {"error": "max_series_limit_exceeded", "limit": 100000}
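Agents typically retry 429s and transient 5xxs with exponential backoff before giving up; a sketch of such a retry schedule (the base delay and cap are hypothetical tuning values):

```python
def backoff_schedule(max_retries: int, base_s: float = 0.5,
                     cap_s: float = 30.0) -> list[float]:
    """Exponential backoff delays an agent might sleep between
    retries of a rate-limited or failed push."""
    return [min(cap_s, base_s * (2 ** i)) for i in range(max_retries)]

print(backoff_schedule(6))  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
```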
// GET /api/v1/query_range?query=rate(http_requests_total[5m])&start=1712000000&end=1712003600&step=60
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {"__name__": "http_requests_total", "method": "GET"},
"values": [
[1712000000, "1.5"],
[1712000060, "1.7"]
]
}
]
}
}
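Parsing the matrix response above is mechanical; note that values arrive as `[unix_seconds, "string"]` pairs, so each value needs a float cast:

```python
# The query_range response body from the example above, as a Python dict.
response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"__name__": "http_requests_total", "method": "GET"},
             "values": [[1712000000, "1.5"], [1712000060, "1.7"]]}
        ],
    },
}

for series in response["data"]["result"]:
    # Values are [timestamp, "stringified float"]; cast before plotting.
    points = [(ts, float(v)) for ts, v in series["values"]]
    print(series["metric"].get("method"), points)
# GET [(1712000000, 1.5), (1712000060, 1.7)]
```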
Tip
In interviews, mention that the push API uses protobuf + snappy compression for efficiency (not JSON). The JSON examples above are for readability. Prometheus remote write protocol is the de facto standard.
Step 2: Estimation¶
Assumptions (adjust in the interview)¶
- 100,000 samples/sec ingested (cluster aggregate)
- Average series: 20 labels × ~24 B each + metric name ~40 B → ~520 B metadata (order of magnitude)
- 8 bytes timestamp + 8 bytes float per sample (before compression)
- Compression ratio ~10× on disk (Gorilla-style; workload-dependent)
- Replication factor 3 for durability
Metrics volume¶
| Quantity | Calculation | Result |
|---|---|---|
| Samples per day | `100e3 × 86400` | 8.64×10⁹ samples/day |
| Uncompressed sample payload | 16 B (ts + value) | — |
| Logical sample bytes/day | `8.64e9 × 16` | ~138 GB/day (values only) |
| With metadata churn (indexes) | +50–200% | Depends on cardinality churn |
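The table's numbers reproduce directly:

```python
SAMPLES_PER_SEC = 100_000
BYTES_PER_SAMPLE = 16  # 8 B timestamp + 8 B float64, uncompressed

samples_per_day = SAMPLES_PER_SEC * 86_400
logical_bytes_per_day = samples_per_day * BYTES_PER_SAMPLE

print(f"{samples_per_day:.2e} samples/day")         # 8.64e+09 samples/day
print(f"{logical_bytes_per_day / 1e9:.1f} GB/day")  # 138.2 GB/day
```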
Network and fan-out (order of magnitude)¶
| Path | Rough math | Comment |
|---|---|---|
| Ingest to gateway | `100K × ~100 B/sample ≈ 10 MB/s` | Payload + framing; gRPC keeps overhead low |
| Replication | × replication factor inside cluster | Erasure coding optional for cold tiers |
| Query responses | Highly variable | JSON for ad-hoc; Arrow / protobuf for bulk |
Yearly storage envelope (illustrative)¶
Using ~14 GB/day logical compressed values (from above) × 365 ≈ 5.1 TB/year per replica group before index overhead. Real clusters add posting lists, symbol tables, and compaction temporary space—plan 2× for operations headroom in interviews.
Storage (time-series compression)¶
Compression in production often combines:
- Delta-of-delta timestamps (consecutive points ~regular intervals → tiny integers).
- XOR / leading-zero encoding for floats (Gorilla paper).
- Block-level packing + general-purpose ZSTD on cold blocks.
Rough on-disk after 10× compression on values: ~14 GB/day per logical copy before replication. With 3× replication: ~42 GB/day physical (ignoring compaction overhead).
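Continuing the arithmetic with the same assumptions (~138 GB/day logical, ~10× compression, 3× replication):

```python
LOGICAL_GB_PER_DAY = 138.24   # 100K samples/s × 16 B × 86,400 s
COMPRESSION = 10              # Gorilla-style; workload-dependent
REPLICATION = 3

compressed = LOGICAL_GB_PER_DAY / COMPRESSION
physical = compressed * REPLICATION
yearly_tb = compressed * 365 / 1000

print(f"{compressed:.1f} GB/day per logical copy")   # 13.8
print(f"{physical:.1f} GB/day physical")             # 41.5
print(f"{yearly_tb:.2f} TB/year per replica group")  # 5.05
```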
Query patterns¶
| Pattern | Cost driver |
|---|---|
| Instant query | Single evaluation at t; may scan recent blocks |
| Range query | start…end with step; fan-out to many series |
| Aggregations | sum by (service) shuffles series by label hash |
| Cardinality queries | count by (pod)—dangerous if pod explodes |
Tip
State clearly: hot data in SSD, cold in object storage (S3/GCS) with downsampled blocks for cheap long-range scans.
Query load assumptions (for read-path sizing)¶
| Assumption | Example | Notes |
|---|---|---|
| Concurrent dashboard users | 50–200 | Each panel = 1–5 queries |
| Refresh interval | 30 s | Caches dedupe identical queries |
| Expensive queries | `topk`, high-cardinality `count` | Rate-limit per tenant |
If the query mix runs at 500 QPS with 50 ms of mean server work per query, you need enough querier replicas and queue depth to absorb that concurrency. Interviews often stop at ingestion math; offering query sizing shows maturity.
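The concurrency implied by those numbers follows from Little's law; the per-replica concurrency below is a hypothetical sizing input, not a measured figure:

```python
QPS = 500
MEAN_SERVICE_S = 0.05          # 50 ms mean server work per query
CONCURRENCY_PER_REPLICA = 8    # hypothetical worker slots per querier

in_flight = QPS * MEAN_SERVICE_S                      # Little's law: L = λW
replicas = -(-in_flight // CONCURRENCY_PER_REPLICA)   # ceiling division

print(in_flight)       # 25.0 concurrent queries on average
print(int(replicas))   # 4 replicas minimum, before headroom
```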
Step 3: High-Level Design¶
Flow:
- Agents scrape or receive application metrics; they batch and send to Ingestion Gateway (auth, rate limits, routing).
- Ingester appends to WAL, buffers in memory, flushes immutable blocks to the time-series store (columnar / LSM-like).
- Query frontend parses PromQL-style queries, plans execution (pushdown aggregations, prune shards), merges results.
- Alerting engine periodically evaluates rules (often same query engine), drives state machine, sends to PagerDuty/Slack via notification router (dedupe, silences).
- Dashboard UI calls the query API (often Grafana in front).
Note
Read path and write path are often separated (different services) so spikes in queries do not starve ingestion.
Sequence: scrape / push → durable block¶
Component responsibilities (talk track)¶
| Component | Responsibility |
|---|---|
| Gateway | AuthN/Z, rate limits, admission control, routing key |
| Ingester | WAL, replication, cut blocks, ship to storage |
| Compactor | Merge small blocks; downsample; retention deletes |
| Store gateway | Serve queries from object storage + cache |
| Querier | Parse PromQL, fan-out, merge, dedupe |
| Ruler / Alertmanager | Rule eval, routing, silences |
Step 4: Deep Dive¶
4.1 Time-series data model and compression¶
Series identity: (metric_name, sorted label map) → unique series ID (internal uint64). Each series is an ordered sequence of (timestamp_ms, float64) (or int for counters).
| Concept | Description |
|---|---|
| Metric name | e.g., http_request_duration_seconds |
| Labels | Dimensions: method, path, pod, region |
| Samples | (t, v) pairs; monotonic counters need rate() in queries |
| Histogram | Multiple series suffixes _bucket, _sum, _count or native histogram type |
Delta-of-delta timestamps (Gorilla-style intuition): if deltas between timestamps are near-constant, second-order delta encodes to few bits.
XOR float compression: store first value raw; for next, XOR with previous; leading zeros often dominate → encode run of zeros compactly.
from __future__ import annotations
from dataclasses import dataclass
from typing import List, Tuple
@dataclass
class GorillaStyleEncoder:
"""Illustrative tick encoder: delta-of-delta timestamps + XOR floats."""
timestamps_ms: List[int]
values: List[float]
def encode_timestamp_deltas(self) -> List[int]:
if not self.timestamps_ms:
return []
deltas: List[int] = []
prev_ts = self.timestamps_ms[0]
prev_delta = 0
for ts in self.timestamps_ms[1:]:
delta = ts - prev_ts
dod = delta - prev_delta # delta-of-delta
deltas.append(dod)
prev_ts, prev_delta = ts, delta
return deltas
@staticmethod
def xor_float_bits(a: float, b: float) -> int:
import struct
ab = struct.pack(">d", a)
bb = struct.pack(">d", b)
ai = int.from_bytes(ab, "big")
bi = int.from_bytes(bb, "big")
return ai ^ bi
# Example: near-regular 15s scrape → small delta-of-delta
enc = GorillaStyleEncoder(
timestamps_ms=[1_000_000 + i * 15_000 for i in range(5)],
values=[0.12, 0.13, 0.11, 0.10, 0.09],
)
print(enc.encode_timestamp_deltas()) # mostly zeros after first tick
print(bin(GorillaStyleEncoder.xor_float_bits(enc.values[0], enc.values[1])))
Label hashing → series ID: production systems intern label strings in a symbol table and store sorted label pairs so (job="a", env="b") equals (env="b", job="a").
def canonical_series_key(metric: str, labels: dict[str, str]) -> str:
    """Human-readable key; production would hash to uint64."""
    pairs = [f'{k}="{labels[k]}"' for k in sorted(labels)]
    return metric + "{" + ",".join(pairs) + "}"
print(canonical_series_key("up", {"job": "api", "instance": "10.0.0.1:9100"}))
Delta-of-delta intuition (bits): if scrape interval is exactly 15 s, first-order delta is 15000 ms every time; second-order delta is 0. Encoders emit a short control code for “same as last interval” vs “patch.”
def leading_zero_bytes(x: int) -> int:
"""How many leading zero bytes in 64-bit XOR (illustrative metric)."""
if x == 0:
return 8
b = x.bit_length()
return max(0, 8 - (b + 7) // 8)
Tip
Mention the Gorilla (Facebook) paper explicitly—it is interview currency. Note ZSTD on frozen blocks for cold tiers.
4.2 Write path and ingestion¶
Push vs pull:
| Mode | Pros | Cons |
|---|---|---|
| Pull (scrape) | Simple mental model; targets expose /metrics; Prometheus-native | Requires reachable endpoints; NAT/firewall pain |
| Push | Works behind NAT; batching to gateway | Need dedupe, back-pressure, auth |
Production systems often support both: OTel push to collector → same pipeline as scrape.
Batching: agents buffer N samples or T ms, snappy-compress payloads, use HTTP/2 or gRPC.
WAL + memory: ingester writes append-only WAL before acknowledging; on crash, replay WAL. Memory holds active head block; when full or time window closes, flush immutable block.
LSM-tree analogy: recent writes go to memtable; flush produces SSTable-like files; compaction merges overlapping windows and drops expired data per retention.
import json
from collections import defaultdict
from pathlib import Path
from typing import DefaultDict, Dict, List
class IngesterBuffer:
"""Simplified in-memory shard buffer → flush to JSONL 'blocks' (illustrative)."""
def __init__(self, wal_path: Path, block_dir: Path) -> None:
self.wal_path = wal_path
self.block_dir = block_dir
self.buf: DefaultDict[str, List[Dict]] = defaultdict(list)
def append_wal(self, record: Dict) -> None:
with self.wal_path.open("a", encoding="utf-8") as f:
f.write(json.dumps(record) + "\n")
def ingest(self, series_key: str, ts: int, value: float) -> None:
rec = {"series": series_key, "ts": ts, "v": value}
self.append_wal(rec)
self.buf[series_key].append(rec)
def flush_block(self, block_id: str) -> None:
self.block_dir.mkdir(parents=True, exist_ok=True)
out = self.block_dir / f"block-{block_id}.jsonl"
with out.open("w", encoding="utf-8") as f:
for series, rows in self.buf.items():
for r in sorted(rows, key=lambda x: x["ts"]):
f.write(json.dumps(r) + "\n")
self.buf.clear()
Back-pressure: gateway returns 429 or drops lowest-priority tenants when ingestion lag > SLO; adaptive batching reduces scrape frequency for noisy jobs.
from dataclasses import dataclass
from typing import List
@dataclass
class Batcher:
max_points: int
max_wait_ms: int
buf: List[dict]
def should_flush(self, elapsed_ms: int) -> bool:
return len(self.buf) >= self.max_points or elapsed_ms >= self.max_wait_ms
Warning
High-cardinality labels (e.g., user_id) will OOM the ingester and index—enforce limits, drop rules, or aggregations at the edge.
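A minimal sketch of one such guardrail—a per-tenant active-series cap enforced at admission (limits and names are illustrative, not any product's API):

```python
from typing import Dict, Set

class SeriesLimiter:
    """Reject new series for a tenant once its active-series cap is hit.
    Existing series keep ingesting, so dashboards degrade gracefully."""

    def __init__(self, max_series_per_tenant: int) -> None:
        self.max_series = max_series_per_tenant
        self.active: Dict[str, Set[str]] = {}

    def admit(self, tenant: str, series_key: str) -> bool:
        seen = self.active.setdefault(tenant, set())
        if series_key in seen:
            return True   # known series: always allowed
        if len(seen) >= self.max_series:
            return False  # new series over cap: drop or 429
        seen.add(series_key)
        return True

lim = SeriesLimiter(max_series_per_tenant=2)
print(lim.admit("t1", 'up{pod="a"}'))  # True
print(lim.admit("t1", 'up{pod="b"}'))  # True
print(lim.admit("t1", 'up{pod="c"}'))  # False — cap reached
print(lim.admit("t1", 'up{pod="a"}'))  # True — existing series
```

Production systems add TTL-based expiry of idle series and per-metric (not just per-tenant) caps.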
4.3 Storage engine¶
Columnar / time-oriented layout: store columns (timestamps, values) per series or per block of time; good for scanning one metric across many series (aggregation).
Block-based: immutable 2-hour (example) blocks with index (label → posting lists) + chunks for values.
Retention: drop blocks older than policy; object storage for archive.
Downsampling / rollups: for data older than 30 days, keep 5-minute aggregates; older → 1-hour—trades precision for cost.
| Tier | Age | Resolution | Use case |
|---|---|---|---|
| Hot | 0–7 d | Native scrape (e.g. 15 s) | Incident drill-down |
| Warm | 7–90 d | 1 min (recording rules) | Weekly trends |
| Cold | 90 d–1 y | 5–60 min | Capacity, finance |
| Archive | > 1 y | Daily / weekly | Compliance (optional) |
from dataclasses import dataclass
from typing import List
@dataclass
class Rollup:
window_sec: int
def downsample(
self, timestamps: List[int], values: List[float]
) -> tuple[List[int], List[float]]:
bucket: dict[int, List[float]] = {}
w_ms = self.window_sec * 1000
for ts, v in zip(timestamps, values):
b = (ts // w_ms) * w_ms
bucket.setdefault(b, []).append(v)
out_ts = sorted(bucket)
out_v = [sum(bucket[t]) / len(bucket[t]) for t in out_ts]
return out_ts, out_v
Posting lists: an inverted index from label=value → sorted list of series IDs containing that pair. The query job="api" AND region="us" intersects the two posting lists—start with the smallest list so a tiny side bounds the work.
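A toy version of building those posting lists from series label sets (the structure is illustrative; real indexes delta-encode the ID lists):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def build_postings(
    series: Dict[int, Dict[str, str]]
) -> Dict[Tuple[str, str], List[int]]:
    """Invert series_id -> label map into (label, value) -> sorted series IDs."""
    postings: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for sid in sorted(series):          # iterate in ID order so lists stay sorted
        for k, v in series[sid].items():
            postings[(k, v)].append(sid)
    return dict(postings)

idx = build_postings({
    1: {"job": "api", "region": "us"},
    2: {"job": "api", "region": "eu"},
    3: {"job": "db", "region": "us"},
})
print(idx[("job", "api")])    # [1, 2]
print(idx[("region", "us")])  # [1, 3]
```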
4.4 Query engine¶
PromQL-style (conceptual): selectors metric{label="v"}, range vectors rate(http_requests_total[5m]), aggregations sum by (job) (...).
Aggregation operators: vector matching one-to-one, many-to-one with group_left.
Fan-out: querier asks all ingesters / store-gateways holding data for time range; merge deduplicated samples (replication-aware).
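The replication-aware merge can be sketched as a k-way merge that keeps one sample per timestamp (first replica wins here; real systems also consider replica health and stale markers):

```python
import heapq
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp_ms, value)

def merge_dedupe(replicas: List[List[Sample]]) -> List[Sample]:
    """k-way merge of per-replica sorted streams; drop duplicate timestamps."""
    out: List[Sample] = []
    last_ts = None
    for ts, v in heapq.merge(*replicas):
        if ts != last_ts:          # first sample seen for this timestamp wins
            out.append((ts, v))
            last_ts = ts
    return out

a = [(1000, 1.0), (2000, 2.0)]
b = [(1000, 1.0), (2000, 2.1), (3000, 3.0)]  # replica lag / jitter
print(merge_dedupe([a, b]))  # [(1000, 1.0), (2000, 2.0), (3000, 3.0)]
```

`heapq.merge` assumes each input stream is already sorted by timestamp, which ingesters guarantee per series.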
PromQL-style planning: AST and decomposition¶
Real engines parse the expression into an AST, then build a logical plan (selectors, functions, aggregations) and a physical plan (shard lists, index scans, pushdown). Below is a minimal illustrative pipeline: tokenize → AST → extract selectors → per-shard subplans.
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union
@dataclass
class VectorSelector:
"""e.g. http_requests_total{job=\"api\"}"""
metric: str
matchers: Dict[str, str]
@dataclass
class Call:
"""Function call: rate(arg, window)"""
name: str
args: List[Union["Call", VectorSelector, str]]
@dataclass
class Aggregate:
"""sum by (job) (inner)"""
op: str
by_labels: Tuple[str, ...]
inner: Union[Call, VectorSelector]
@dataclass
class QueryPlan:
"""Logical plan root."""
root: Union[Aggregate, Call, VectorSelector]
selectors: List[VectorSelector] = field(default_factory=list)
def collect_selectors(self, node: object) -> None:
if isinstance(node, VectorSelector):
self.selectors.append(node)
elif isinstance(node, Call):
for a in node.args:
self.collect_selectors(a)
elif isinstance(node, Aggregate):
self.collect_selectors(node.inner)
def parse_promql_style(expr: str) -> QueryPlan:
"""Tiny stand-in for a real parser — extend with a proper grammar."""
# Illustrative: only recognizes one pattern for demo.
if "sum by" in expr:
inner = Call(
"rate",
[
VectorSelector("http_requests_total", {"job": "api"}),
"5m",
],
)
root = Aggregate("sum", ("job",), inner)
else:
root = VectorSelector("up", {})
plan = QueryPlan(root=root)
plan.collect_selectors(root)
return plan
def decompose_to_shards(
plan: QueryPlan, num_shards: int
) -> List[Dict[str, object]]:
"""Emit one executable subplan per shard (same matchers; shard prunes series)."""
subplans: List[Dict[str, object]] = []
for s in range(num_shards):
subplans.append(
{
"shard_id": s,
"matchers": [sel.matchers for sel in plan.selectors],
"pushdown": ["sum"] if isinstance(plan.root, Aggregate) else [],
}
)
return subplans
Planner responsibilities (talk track): predicate pushdown (apply job="api" before materializing series), hash aggregation (sum by), parallel scans per ingester, then merge with dedupe if replication overlaps.
Posting list intersection (sorted merge + early exit)¶
Inverted index maps (label, value) → sorted series IDs. AND queries intersect lists smallest-first to shrink work; iterators merge like merge-sort with early termination when one list ends.
from typing import List
def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
"""Two-pointer merge; lists must be sorted ascending."""
i, j = 0, 0
out: List[int] = []
while i < len(a) and j < len(b):
if a[i] == b[j]:
out.append(a[i])
i += 1
j += 1
elif a[i] < b[j]:
i += 1
else:
j += 1
return out
def intersect_many(lists: List[List[int]]) -> List[int]:
"""Chain intersection; sort lists by length ascending first."""
if not lists:
return []
ordered = sorted(lists, key=len)
acc = ordered[0]
for nxt in ordered[1:]:
acc = intersect_sorted(acc, nxt)
if not acc:
break
return acc
rate() and increase() with counter reset detection¶
Counters are monotonic until process restart (reset). Prometheus-style rate uses all samples in the range, detects a reset when v[k] < v[k-1], and treats the drop as a restart rather than a negative delta.
from typing import List, Tuple
def increase_range(
timestamps: List[float], values: List[float]
) -> float:
"""Total increase over [t0, tn] with counter resets (no extrapolation)."""
if len(values) < 2:
return 0.0
total = 0.0
for k in range(1, len(values)):
dv = values[k] - values[k - 1]
if dv < 0:
# reset: treat previous high-water mark as segment end
total += values[k]
else:
total += dv
return total
def rate_per_second(
timestamps: List[float], values: List[float]
) -> float:
"""Average rate across window from increase / duration."""
if len(timestamps) < 2:
return 0.0
dt = timestamps[-1] - timestamps[0]
if dt <= 0:
return 0.0
return increase_range(timestamps, values) / dt
Note
Production rate() also handles stale markers, extrapolation at window edges, and alignment to evaluation steps—the sketch above shows reset accounting only.
histogram_quantile() — linear interpolation across buckets¶
Classic Prometheus histograms expose le buckets. Given cumulative counts per bucket at time t, pick brackets around target quantile φ·total count and linearly interpolate inside that bucket (assuming uniform mass between le boundaries).
from typing import List, Tuple
def histogram_quantile(phi: float, buckets: List[Tuple[float, float]]) -> float:
"""
buckets: sorted (le_upper_bound, cumulative_count).
Returns estimated quantile in native units (e.g. seconds).
"""
if not buckets or not (0.0 <= phi <= 1.0):
return float("nan")
total = buckets[-1][1]
if total <= 0:
return float("nan")
target = phi * total
prev_le, prev_c = buckets[0]
if target <= prev_c:
return prev_le
for i in range(1, len(buckets)):
le, c = buckets[i]
if c >= target:
# Linear interpolate within [prev_le, le]
rank_in_bucket = target - prev_c
bucket_width = max(le - prev_le, 1e-12)
count_in_bucket = max(c - prev_c, 1e-12)
frac = rank_in_bucket / count_in_bucket
return prev_le + frac * bucket_width
prev_le, prev_c = le, c
return buckets[-1][0]
Query pushdown optimization¶
| Pushdown | What moves toward storage | Why it wins |
|---|---|---|
| Label predicates | job="api" applied before decoding chunks | Shrinks posting lists |
| Aggregations | sum by (region) in compactor / store | Less data over the network |
| Time slicing | Chunk files wholly outside [start, end] are skipped | Scans are I/O-bound; skipped files cost nothing |
Shard pruning: if matchers include region="us-east", only touch ingesters owning that tenant/region slice—critical for latency.
Note
Parallelism is keyed by time range and shard; protect with query concurrency limits per tenant.
4.5 Alerting pipeline¶
Alert rule: expr + for duration + labels + annotations.
Evaluation: scheduler runs every eval_interval; executes query; compares to threshold.
State machine: INACTIVE → PENDING when the expression first evaluates true → FIRING once it has stayed true for the full for duration → back to INACTIVE when the expression clears.
Notification routing: route by severity, team label, on-call schedule (PagerDuty service / escalation policy).
Deduplication: same alert key (rule, labels) within group_wait → one notification; group_interval for updates.
from enum import Enum, auto
from typing import Dict, Optional
class AlertState(Enum):
INACTIVE = auto()
PENDING = auto()
FIRING = auto()
class Alert:
def __init__(self, name: str, labels: Dict[str, str], for_sec: int) -> None:
self.name = name
self.labels = labels
self.for_sec = for_sec
self.state = AlertState.INACTIVE
self.true_since: Optional[float] = None
def tick(self, now: float, condition_met: bool) -> None:
if not condition_met:
self.state = AlertState.INACTIVE
self.true_since = None
return
if self.state == AlertState.INACTIVE:
self.state = AlertState.PENDING
self.true_since = now
return
if self.state == AlertState.PENDING and self.true_since is not None:
if now - self.true_since >= self.for_sec:
self.state = AlertState.FIRING
Notification deduplication (sketch):
from time import time
from typing import Dict, Tuple
AlertKey = Tuple[str, Tuple[Tuple[str, str], ...]]
class NotificationDeduper:
"""Drop duplicate notifications for the same alert key within window_sec."""
def __init__(self, window_sec: int) -> None:
self.window_sec = window_sec
self.seen: Dict[AlertKey, float] = {}
def should_send(self, key: AlertKey) -> bool:
now = time()
last = self.seen.get(key)
if last is not None and now - last < self.window_sec:
return False
self.seen[key] = now
return True
On-call integration: Alertmanager sends JSON webhook to PagerDuty Events API v2; include runbook URL and dashboard deep link in annotations.
4.6 Scalability — sharding and federation¶
Metric-based sharding: hash(series_id) → shard or range on tenant_id + metric prefix. Avoid hotspots by shuffle sharding tenants across nodes.
Hierarchical federation: regional Prometheus federates up to global read-only queries for SLI dashboards; avoids shipping all raw data globally.
Cross-datacenter: dual write or async replication with eventual consistency; queries may use local DC for freshness or global view with staleness bounds.
def shard_for_series(series_id: int, num_shards: int) -> int:
return series_id % num_shards
def federate_selectors(global_query: str) -> list[str]:
"""Illustrative: split one global metric into per-region selectors."""
regions = ["us-east", "eu-west", "ap-south"]
return [f'{global_query}{{region="{r}"}}' for r in regions]
Cross-datacenter aggregation: write locally, query globally with staleness annotations; avoid synchronous cross-region writes on the hot path.
| Pattern | When to use |
|---|---|
| Dedicated global metrics | SLIs that must be comparable across regions |
| Region-local dashboards | Latency inside one DC—global merge is misleading |
| Replication | DR and query availability—not a substitute for correct aggregation semantics |
Shuffle sharding (code sketch)¶
Instead of placing tenant T on one shard (blast radius = full shard), assign K random shards from N total; replicate work across those K so correlated failure needs multiple bad nodes to affect the tenant.
import hashlib
from typing import List, Set
def shard_subset(
tenant_id: str, n_shards: int, k: int, seed: str = "monitoring"
) -> List[int]:
"""Deterministic shuffle-shard pick: k distinct shards in [0, n_shards)."""
if k > n_shards:
raise ValueError("k cannot exceed n_shards")
h = hashlib.blake2b(f"{seed}:{tenant_id}".encode(), digest_size=16).digest()
# Expand hash space by rehashing with counter until we have k picks.
picks: Set[int] = set()
ctr = 0
while len(picks) < k:
block = hashlib.blake2b(h + ctr.to_bytes(4, "big"), digest_size=8).digest()
picks.add(int.from_bytes(block, "big") % n_shards)
ctr += 1
return sorted(picks)
Interview line: blast radius scales roughly as ~(K/N)² for independent failures when two replicas must agree—orders-of-magnitude smaller than single-shard placement.
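That estimate can be checked combinatorially: assuming independent failures, the chance that two specific failed nodes both fall inside one tenant's K-shard subset is C(K,2)/C(N,2), which is on the order of (K/N)²:

```python
from math import comb

def p_two_failures_hit_tenant(n_shards: int, k: int) -> float:
    """P(both of 2 failed nodes land in a tenant's k-node subset)."""
    return comb(k, 2) / comb(n_shards, 2)

n, k = 64, 4
print(p_two_failures_hit_tenant(n, k))  # ~0.003 -> ~0.3% of tenants affected
print((k / n) ** 2)                     # ~0.0039 -- same order as the estimate
```

Compare with single-shard placement, where any one of the two failed nodes hitting a tenant's shard (~2/N ≈ 3% here) takes that tenant out entirely.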
Consistent hashing for ingester ring membership¶
Ring: hash each ingester ID onto the token space 0 to 2⁶³−1; each tenant or series maps to the first alive owner clockwise. Adding or removing a node moves only the neighboring key ranges—unlike mod N, which reshuffles everything.
from bisect import bisect_right
from typing import Dict, List, Tuple
class HashRing:
"""Minimal ring: sorted tokens -> ingester; virtual nodes omitted for brevity."""
def __init__(self, nodes: List[Tuple[str, int]]) -> None:
# nodes: (name, token)
self.ring = sorted(nodes, key=lambda x: x[1])
def owner(self, key_hash: int) -> str:
tokens = [t for _, t in self.ring]
idx = bisect_right(tokens, key_hash) % len(self.ring)
return self.ring[idx][0]
Virtual nodes (not shown): each physical ingester claims many tokens on the ring to reduce variance in load.
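A sketch of the virtual-node variant: each physical ingester derives many ring tokens from hash(name#i) (the token derivation here is an assumption, not any product's scheme):

```python
import hashlib
from bisect import bisect_right
from typing import List, Tuple

def vnode_tokens(name: str, vnodes: int) -> List[Tuple[int, str]]:
    """Derive many ring tokens per physical node from hash(name#i)."""
    toks: List[Tuple[int, str]] = []
    for i in range(vnodes):
        h = hashlib.blake2b(f"{name}#{i}".encode(), digest_size=8).digest()
        toks.append((int.from_bytes(h, "big"), name))
    return toks

def build_ring(names: List[str], vnodes: int = 64) -> List[Tuple[int, str]]:
    ring: List[Tuple[int, str]] = []
    for n in names:
        ring.extend(vnode_tokens(n, vnodes))
    return sorted(ring)  # sorted by token; owner lookup bisects this list

def owner(ring: List[Tuple[int, str]], key_hash: int) -> str:
    tokens = [t for t, _ in ring]
    return ring[bisect_right(tokens, key_hash) % len(ring)][1]

ring = build_ring(["ing-a", "ing-b", "ing-c"])
print(owner(ring, 12345))  # deterministic owner for this key
```

With ~64 tokens per node, per-node load variance drops roughly with the square root of the token count, and a departing node's ranges scatter across all survivors instead of landing on one neighbor.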
Thanos / Mimir — store-gateway role¶
In long-term Prometheus-compatible stacks, blocks live in object storage (S3/GCS). Store-gateway components:
- Discover block meta.json objects (often via bucket index).
- Cache block metadata and downloaded chunks in local SSD.
- Serve StoreAPI reads to queriers for historical time ranges.
- Participate in query planning to prune blocks by labels and time.
Thanos vs Mimir naming differs slightly, but the pattern is the same: querier fans out to ingesters (hot) + store-gateways (cold), then merges and dedupes.
Distributed read path (full diagram)¶
Narrative: query frontend may split range queries and cache; querier executes PromQL, pushes down matchers, contacts ingesters for recent samples and store-gateways for historical blocks; gateways fetch chunk data from object storage with aggressive caching.
Tip
Name-check Thanos / Cortex / Mimir for long-term Prometheus; Monarch for hierarchical global monitoring at Google scale.
4.7 Histogram and quantile computation¶
Why quantiles are hard in distributed systems¶
A global p99 latency across all pods requires order statistics over all request durations. Shipping raw samples is bandwidth-heavy; shipping only per-pod p99 loses mixing—the p99 of p99s is not global p99 when traffic differs across pods. You need either mergeable summaries or centralized evaluation with enough data.
t-digest — intuition and minimal Python sketch¶
t-digest clusters samples near centroids with adaptive bin sizes: dense regions get more clusters; tails get fewer but large bins. Merging digests is approximate but bounded error for quantiles.
from dataclasses import dataclass, field
from typing import List, Tuple
@dataclass
class TDigestCentroid:
mean: float
weight: float
@dataclass
class TDigestLite:
"""Highly simplified t-digest (educational; not production-accurate)."""
centroids: List[TDigestCentroid] = field(default_factory=list)
compression: float = 100.0
def update(self, x: float, w: float = 1.0) -> None:
self.centroids.append(TDigestCentroid(mean=x, weight=w))
self.centroids.sort(key=lambda c: c.mean)
self._compress()
def _compress(self) -> None:
"""Merge neighbors if cluster size small enough — real t-digest uses size limits."""
if len(self.centroids) < 2:
return
merged: List[TDigestCentroid] = []
acc = self.centroids[0]
for c in self.centroids[1:]:
if abs(c.mean - acc.mean) < (acc.weight + c.weight) / self.compression:
tw = acc.weight + c.weight
acc = TDigestCentroid(
mean=(acc.mean * acc.weight + c.mean * c.weight) / tw, weight=tw
)
else:
merged.append(acc)
acc = c
merged.append(acc)
self.centroids = merged
def quantile(self, q: float) -> float:
"""Weighted order-statistic walk over centroids."""
total = sum(c.weight for c in self.centroids)
target = q * total
cum = 0.0
for c in self.centroids:
if cum + c.weight >= target:
return c.mean
cum += c.weight
return self.centroids[-1].mean
Warning
Real t-digest uses strict scaling functions and merge rules—use a battle-tested library in production.
Native histograms vs classic histograms¶
| Aspect | Classic (_bucket with le) | Native (Prometheus native histogram) |
|---|---|---|
| Layout | Many counter series per bucket | Structured histogram type + sparse buckets |
| Cardinality | High (one series per bucket × labels) | Lower for sparse buckets |
| Scrape | Text lines per bucket | Binary / structured exposition |
| Quantile | histogram_quantile on cumulative counts | Engine may use better internal representation |
Mergeable sketches for global percentiles¶
| Sketch | Merges across nodes? | Typical use |
|---|---|---|
| t-digest | Yes (approximate) | Latency percentiles |
| HLL | Yes (approximate) | Cardinality estimates |
| GK / KLL | Yes (with care) | Alternative quantile tracks |
Pattern: each ingester builds a digest for its traffic; querier merges digests for tenant-wide φ-quantile with error bounds proportional to compression parameter.
Error bounds and accuracy guarantees¶
- Histogram buckets: error depends on bucket widths; linear interpolation assumes uniform mass within a bucket—bad if bucket spans multi-modal latency.
- t-digest: error ε typically ~1/compression relative to CDF—tunable vs memory.
- Distributed merge: merging approximate structures accumulates error—document SLO on query accuracy separately from ingestion SLO.
Note
For interviews, tie this back to SLO dashboards: global p99 is a product question, not just a math question—traffic-weighted merging matters.
Step 5: Scaling & Production¶
Failure handling¶
| Failure | Mitigation |
|---|---|
| Ingester crash | WAL replay; another replica serves duplicate data until compacted |
| Zone outage | Replicate shards across AZs; read from healthy replicas |
| Query OOM | Limits on max series, max range, max points; query queue with timeout |
| Notification provider down | Retry with backoff; dead-letter queue; multi-channel fallback |
Meta-monitoring¶
Monitor the monitors: ingestion lag, WAL age, compaction backlog, query latency, alert evaluation lag, notification success rate. Use separate infrastructure or vendor for paging the paging system.
| Signal | What breaks if it degrades |
|---|---|
| Ingestion lag | Dashboards look “fine” while reality is on fire |
| Compaction backlog | Disk fills; queries slow; read amplification spikes |
| Rule eval lag | Late alerts; thundering herd after catch-up |
| Memberlist / ring health | Wrong shard routing → partial data loss visibility |
Runbooks and human factors¶
- SLO-based alerts beat static thresholds; pair metrics with error budgets.
- Cascading failures: protect the control plane with bulkheads (separate etcd/consul for monitoring).
- Game days for “metrics backend down”—operators should know fallback (synthetic checks only, etc.).
Trade-offs¶
| Choice | Upside | Downside |
|---|---|---|
| Strong vs eventual replication | Linearizable reads | Latency + coordination cost |
| Long raw retention | Fine-grained forensics | Storage cost |
| High cardinality | Powerful drill-down | Memory and index explosion |
| Pull-only | Idempotent scrape | Short-lived jobs and NAT'd targets need a push gateway |
Warning
Cardinality and query cost are the #1 production failure modes—interviewers reward guardrails (limits, recording rules, aggressive downsampling).
Interview Tips — Google follow-up questions¶
| Topic | What they probe |
|---|---|
| Cardinality explosion | How do you cap labels? Recording rules? Adaptive sampling? |
| Exactly-once ingestion | At-least-once typical; idempotent keys for push; dedupe windows |
| Clock skew | NTP; reject too-future timestamps; out-of-order handling in TSDB |
| Histogram quantiles | t-digest, HLL vs exact; error bounds |
| Multi-tenancy fairness | Quotas, shuffle sharding, noisy neighbor isolation |
| Cost | Cold storage, query pushdown, aggregations at write (recording rules) |
| vs Logs/traces | Metrics for aggregates; exemplars link samples to traces |
Follow-up drill (short answers)¶
| Question | Strong answer shape |
|---|---|
| “How is this different from InfluxDB?” | Column/time layout vs full database features; PromQL ecosystem vs InfluxQL |
| “Global percentiles?” | Mergeable sketches (t-digest) per shard + merge; exact global p99 needs heavy data movement |
| “Billing in SaaS?” | Ingested samples, query cost units, cardinality peaks—meter at gateway |
| “ML anomaly?” | Offline train on rollups; online scorer on stream—keep separate from hot path |
Note
Close with what you’d build first: ingestion + WAL + query on recent blocks, then long-term store, then alerting—mirrors MVP → scale path.
MVP vs scale (explicit)¶
| Phase | Build | Defer |
|---|---|---|
| MVP | Scrape/push → WAL → query last 14 d | Multi-year archive |
| Scale | Horizontal ingesters, object store blocks | Fancy ML |
| Mature | Global federation, tiered storage | — |
Capacity Planning and Storage Sizing (Extended)¶
This section consolidates the storage and infrastructure planning into a single reference, building on the estimation in Step 2.
Storage breakdown by tier (1 year projection)¶
| Tier | Data type | Resolution | Compression | Size per replica | Replicas | Total physical |
|---|---|---|---|---|---|---|
| Hot (SSD) | Raw samples, 0–7 days | 15 s native | Gorilla ~10x | ~98 GB | 3 | ~294 GB |
| Warm (object store) | Compacted blocks, 7–90 days | 15 s native | Gorilla + ZSTD | ~1.1 TB | 2 (erasure coded) | ~2.2 TB |
| Cold (object store) | Downsampled, 90 d–1 year | 5 min | ZSTD | ~80 GB | 2 | ~160 GB |
| Metadata (PostgreSQL) | Rules, dashboards, tenants | — | — | ~1 GB | 2 | ~2 GB |
| Index (posting lists) | Label → series mappings | — | Delta encoded | ~20 GB per shard | 3 | ~60 GB |
Total year-1 envelope: ~2.7 TB, dominated by warm-tier object storage. At **$0.023/GB/month** for S3 standard, the whole envelope runs **~$62/month**—monitoring storage is cheap relative to its value.
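The hot-tier figure is consistent with, for example, ~1.2 M active series scraped every 15 s at ~2 bytes/sample after compression—these inputs are assumptions; substitute your own Step 2 numbers:

```python
def hot_tier_gb(active_series: float, scrape_sec: float,
                bytes_per_sample: float, days: float) -> float:
    """Raw hot-tier footprint per replica, before replication."""
    samples_per_day = active_series * 86_400 / scrape_sec
    return samples_per_day * bytes_per_sample * days / 1e9

print(round(hot_tier_gb(1.2e6, 15, 2.0, 7), 1))  # ~97 GB per replica
print(round(2_700 * 0.023, 1))                   # ~$62/month for ~2.7 TB in S3
```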
Infrastructure sizing (illustrative cluster)¶
| Component | Instance type | Count | Key resource | Notes |
|---|---|---|---|---|
| Ingester | 8 vCPU, 32 GB RAM, 500 GB SSD | 3–6 | RAM (active series), SSD (WAL) | Scale with ingestion rate |
| Querier | 8 vCPU, 16 GB RAM | 2–4 | CPU (query evaluation) | Scale with concurrent queries |
| Store-gateway | 4 vCPU, 16 GB RAM, 200 GB SSD cache | 2–3 | SSD (block cache), RAM (index cache) | Scale with query range depth |
| Compactor | 4 vCPU, 8 GB RAM | 1–2 | CPU, temp disk | Runs periodically |
| Query frontend | 2 vCPU, 4 GB RAM | 2 | Stateless; query splitting and caching | Behind LB |
| Alerting ruler | 4 vCPU, 8 GB RAM | 2 | CPU (rule evaluation) | HA pair |
| PostgreSQL | 4 vCPU, 16 GB RAM, 100 GB SSD | 2 (primary + replica) | Storage, IOPS | Metadata only |
| Redis | 4 GB RAM | 1–2 | RAM | Query cache + rate limits |
Summary¶
| Layer | Core idea |
|---|---|
| Model | Metric + labels = series; compress time and value streams |
| Write | WAL, batched flush, block files, retention + rollups |
| Read | Label index → posting lists; fan-out + merge; PromQL planning |
| Alerts | Pending → firing with for; dedupe + routes |
| Scale | Shard by series/tenant; federation for hierarchy |
| Technology | Custom TSDB + object store (AP for samples, CP for config); PostgreSQL for metadata; Redis for caching |
| SLAs | 99.99% ingestion; 99.9% query; separate SLOs per capability with error budgets |
This walkthrough is a structured narrative for system design interviews—not a production architecture for any one product. Adapt numbers, SLAs, and compliance to the prompt and your experience.