Prometheus at Trading-Firm Scale: Federation, Thanos vs Mimir, and the Cardinality Trap

How Akuna Capital's 500+ test environments caused a Prometheus cardinality explosion - and the label design, recording rules, and Thanos/Mimir tradeoffs that fixed it.

#prometheus #thanos #mimir #observability #cardinality #trading-infrastructure

When I was at Akuna Capital, we maintained over 500 concurrent test environments at peak. Each environment was a fully isolated simulation of the production trading stack: exchange simulators, strategy processes, risk managers, and order management systems. Every environment emitted metrics to Prometheus. The first time I tried to run a query across all environments, the Prometheus pod crashed with an OOM kill.

The TSDB head - the in-memory structure Prometheus uses for recent data - had grown to 38GB. The OOM kill was not a configuration issue. It was a fundamental misunderstanding of how Prometheus scales, and fixing it required redesigning our entire metrics architecture.

The Cardinality Problem

Prometheus stores data in a time series database (TSDB). Each unique combination of a metric name and its label set is a separate time series, and each active time series requires a live entry in the TSDB head block. The head block lives in memory. There is no free lunch here: more time series means more memory, roughly linearly.

The cardinality of a metric set is the number of unique label-value combinations. The problem compounds multiplicatively:

series_count = unique_values_dim_1 × unique_values_dim_2 × ... × unique_values_dim_N

At Akuna, our initial instrumentation looked roughly like this:

order_latency_microseconds{exchange="binance", strategy="mm_v2", env="test-042", order_id="OID-8472901"}

Four label dimensions. Actual cardinality:

  • exchange: 12 values (the exchanges we traded on)
  • strategy: 8 values
  • env: 500+ values
  • order_id: millions of unique values (a new order ID per order)

The order_id label alone made this metric unusable at scale. 12 × 8 × 500 × 1,000,000 = 48 billion time series. Even without the order_id label, 12 × 8 × 500 = 48,000 series for a single metric. Multiply by 200+ metrics across the stack and you are at 9.6 million series - enough to OOM a 32GB Prometheus node.

Label Design Discipline

The fix starts with a rule: labels are for aggregation dimensions, not for identification. If the answer to “will I ever want to sum by or group by this label?” is yes, it belongs as a label. If the answer is “I just want to be able to look up this specific instance later,” it does not belong as a label - use exemplars instead.

Concretely:

Do label on: exchange, strategy, instrument_class, environment_tier (prod/staging/test), region.

Do not label on: order_id, trade_id, session_id, request_id, client_ip. These have unbounded cardinality and you can never aggregate over them meaningfully.

Use exemplars for trace linking. Exemplars, defined by the OpenMetrics exposition format, let you attach a trace ID to a histogram sample without creating a new time series. That gives you a path from a metric anomaly to the specific order or trace that caused it:

// Go example - attaching a trace ID as an exemplar to a histogram observation.
// With() returns a prometheus.Observer, so assert to prometheus.ExemplarObserver first.
latencyHistogram.With(prometheus.Labels{
    "exchange": order.Exchange,
    "strategy": order.Strategy,
}).(prometheus.ExemplarObserver).ObserveWithExemplar(latencySeconds, prometheus.Labels{
    "traceID": span.SpanContext().TraceID().String(),
})

This gives you the best of both worlds: low-cardinality metrics for aggregation, and a pointer to the high-cardinality trace data for investigation.

After removing order_id and env (solving the latter by federating per-environment), our TSDB head dropped from 38GB to 1.8GB. The same query that OOM-killed the server now returned in 400ms.

Recording Rules: Pre-Computing the Expensive Aggregations

Even with controlled cardinality, ad-hoc queries over large windows are slow. The right answer is recording rules - Prometheus evaluates these on a schedule and writes the result as a new, pre-computed time series.

The most important recording rules to define for a trading setup are the ones you query constantly: strategy-level aggregations of exchange-level metrics.

groups:
  - name: trading_aggregations
    interval: 30s
    rules:
      # Per-strategy order rate across all exchanges
      - record: trading:orders_per_second:strategy:5m
        expr: |
          sum(rate(orders_submitted_total[5m])) by (strategy)

      # Per-exchange error rate - pre-computed so dashboards load instantly
      - record: trading:order_error_rate:exchange:5m
        expr: |
          sum(rate(orders_error_total[5m])) by (exchange)
          /
          sum(rate(orders_submitted_total[5m])) by (exchange)

      # P99 latency per strategy - histogram_quantile is expensive, pre-compute it
      - record: trading:order_latency_p99_seconds:strategy:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(order_latency_seconds_bucket[5m])) by (le, strategy)
          )

      # P99 across all strategies - for the top-level health dashboard
      - record: trading:order_latency_p99_seconds:global:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(order_latency_seconds_bucket[5m])) by (le)
          )

The rule about pre-computing: if a query appears in a dashboard that loads on every engineer’s morning review, it should be a recording rule. histogram_quantile over a rate over 5+ minutes is expensive enough to cause noticeable dashboard latency on any Prometheus instance handling more than ~50,000 active series.
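The same pre-computed series can back alerts as cheaply as they back dashboards. A minimal sketch, assuming the trading:order_error_rate:exchange:5m rule above is loaded; the group name, alert name, 1% threshold, and severity label are illustrative assumptions to tune, not recommended values:

```yaml
groups:
  - name: trading_alerts                      # hypothetical group name
    rules:
      - alert: ExchangeOrderErrorRateHigh     # hypothetical alert name
        # Reads the recorded series instead of re-running rate() over raw counters
        expr: trading:order_error_rate:exchange:5m > 0.01   # assumed 1% threshold
        for: 5m
        labels:
          severity: page                      # assumed routing label
        annotations:
          summary: "Order error rate above 1% on {{ $labels.exchange }}"
```

Because the expression is a single series lookup, the alert evaluation cost stays flat no matter how many raw series sit behind the recording rule.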

Federation Architecture for 500+ Environments

The Prometheus federation model solves the multi-environment problem. Instead of a single global Prometheus scraping everything, you deploy a hierarchy:

Leaf Prometheus: one per environment. Scrapes all services in that environment, holds 6 hours of data, evaluates per-environment recording rules. No cross-environment queries run here.

Federation Prometheus: scrapes selected, pre-aggregated metrics from all leaf Prometheis via the /federate endpoint. Holds 24 hours of data. Only receives recording rule outputs from leaves - not raw metrics.

Long-term storage Prometheus: writes to Thanos or Mimir for retention beyond 24 hours.
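A leaf configuration might look roughly like the sketch below. The environment name, job, and target are illustrative assumptions. The key idea is that env moves out of the instrumentation and into external_labels, which Prometheus attaches when it serves /federate or sends remote_write, so it never multiplies series inside the leaf:

```yaml
# prometheus.yml for a single leaf (environment name is illustrative)
global:
  scrape_interval: 15s
  evaluation_interval: 30s
  external_labels:
    env: "test-042"        # added on /federate and remote_write, not stored per series locally

rule_files:
  - "/etc/prometheus/rules/*.yml"   # the trading:* recording rules live here

scrape_configs:
  - job_name: "order-management-system"
    static_configs:
      - targets: ["oms:8080"]

# Six-hour local retention is set on the command line:
#   --storage.tsdb.retention.time=6h
```

With honor_labels: true on the federation tier, the env label set here survives the federated scrape, which keeps the 500+ environments distinguishable upstream.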

The federation scrape config on the middle tier:

# Federated Prometheus - only pulls recording rule outputs from leaves
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"trading:.*"}'  # Only pull recording rule outputs (namespaced with "trading:")
    static_configs:
      - targets:
          - 'prom-test-001:9090'
          - 'prom-test-002:9090'
          # ... all 500+ leaf Prometheus addresses

The {__name__=~"trading:.*"} selector is the critical constraint. It means the federation layer only receives the pre-aggregated, low-cardinality metrics from each leaf - not the raw per-order metrics. This keeps the federation Prometheus’s cardinality bounded regardless of how many leaf environments exist.
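Maintaining 500+ addresses by hand in static_configs gets brittle as environments come and go. One alternative is file-based service discovery, sketched below; this is not the configuration we ran, and the directory path, file name, and refresh interval are assumptions:

```yaml
# Federation tier: same job, but targets come from files that tooling can rewrite
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"trading:.*"}'
    file_sd_configs:
      - files:
          - '/etc/prometheus/leaves/*.yml'   # hypothetical directory
        refresh_interval: 1m
---
# /etc/prometheus/leaves/test.yml (hypothetical target file, a separate document)
- targets:
    - 'prom-test-001:9090'
    - 'prom-test-002:9090'
```

Prometheus picks up changes to the target files without a reload, so whatever creates and destroys test environments can own those files.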

Thanos vs Mimir: When to Use Which

Both Thanos and Mimir solve the same problems: high-availability Prometheus, long-term storage, and global query across multiple Prometheus instances. The tradeoffs are real and worth understanding before committing to either.

Thanos is store-and-query oriented. It uses a sidecar pattern: a Thanos sidecar runs next to each Prometheus, uploads completed TSDB blocks to object storage (S3, GCS), and makes them queryable via Thanos Query, with a Store Gateway serving the uploaded blocks. The data model is block-based: every 2 hours, Prometheus writes a new TSDB block, Thanos uploads it, and it becomes available for long-range queries.

Thanos strengths: simple deployment model, works well with existing Prometheus setups, excellent for long-retention queries (weeks to months), the compactor reduces storage cost significantly via downsampling.

Thanos weaknesses: the sidecar approach means write failures on the Prometheus side aren’t retried through Thanos (it uploads already-written blocks, not in-flight data). High write rates cause block upload backlogs. The query layer is stateless but adds latency compared to direct Prometheus.

Mimir is write-path oriented. It implements the Prometheus remote_write API directly, replacing Prometheus’s storage with a distributed write path. Distributors accept writes and forward them to ingesters, which hold recent samples in memory and periodically flush blocks to object storage. The query frontend is horizontally scalable and caches query results aggressively.

Mimir strengths: designed for high write rates (Grafana Labs runs it at tens of millions of samples/second), better horizontal scaling for write-heavy workloads, native multi-tenancy, faster query-path caching.

Mimir weaknesses: operationally heavier (ingesters, compactors, distributors, store-gateways, query frontends are separate components), higher complexity for small teams.
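On the Prometheus side, pointing at Mimir is mostly a matter of changing the remote_write target. A minimal sketch, assuming a hypothetical mimir host and tenant name; /api/v1/push is Mimir's remote_write path, and X-Scope-OrgID is the header it uses to identify the tenant:

```yaml
remote_write:
  - url: "http://mimir:8080/api/v1/push"   # hypothetical host and port
    headers:
      X-Scope-OrgID: "trading-test"        # assumed tenant ID; needed when multi-tenancy is enabled
```

The tenant header is where the native multi-tenancy shows up in practice: each team or environment tier can write under its own ID and be limited independently.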

For our trading infrastructure at ZeroCopy, we use Thanos. The write rate is moderate (tens of thousands of samples per second), the team is small, and the block-upload model fits well with our DOKS deployment. If I were running a brokerage or exchange with millions of samples per second across hundreds of strategies, I would look seriously at Mimir.

The remote_write Buffer Problem

One operational detail that burned us: the Prometheus remote_write buffer.

When Prometheus writes to a remote storage backend (Thanos Receive, Mimir, or a custom endpoint), it reads samples from the WAL and queues them in memory for sending. If the remote endpoint is unavailable, Prometheus retries, using the WAL as its buffer. That buffer is bounded: if the remote stays down long enough for the WAL to be truncated past unsent data, those samples are dropped.

More dangerously, if your Prometheus instance itself restarts, the in-memory portion of the remote_write queue is lost. Any samples that had been scraped but not yet persisted to the remote backend vanish.

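The mitigation is to give the WAL more headroom and to tune the queue. Note that the WAL segment size is set with a command-line flag rather than in prometheus.yml, and Prometheus refuses to load a config file containing keys it does not recognize:

# Command-line flag (not a prometheus.yml key):
#   --storage.tsdb.wal-segment-size=256MB   # Larger WAL = more buffer before disk pressure

# prometheus.yml
global:
  scrape_interval: 15s

remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000          # Samples buffered in memory per shard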
      max_shards: 200          # Parallel write workers
      min_shards: 1
      max_samples_per_send: 500
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
    write_relabel_configs:
      # Drop metrics you don't need in long-term storage to reduce cost
      - source_labels: [__name__]
        regex: "go_.*|process_.*"
        action: drop

The write_relabel_configs at the end is a significant cost lever. Go runtime metrics (go_gc_duration_seconds, go_memstats_*) and process metrics are scraped by default and have no trading-specific value. Dropping them at the remote_write relabeling stage reduces long-term storage cost by 15-25% in most deployments I have seen.
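If long-term storage only needs the aggregated view, the deny-list can be inverted into an allow-list. A sketch of that alternative, assuming every series worth keeping long-term is a recording rule output under the trading: prefix; anything that does not match is never sent:

```yaml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    write_relabel_configs:
      # Forward only recording-rule outputs; raw series stay in local retention
      - source_labels: [__name__]
        regex: "trading:.*"
        action: keep
```

The tradeoff is that a newly added raw metric is invisible in long-term storage until someone writes a recording rule for it, which is either a feature or a trap depending on how the team works.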

A Working prometheus.yml for a Trading Setup

This is the configuration I would start with for a trading system with 3-5 services and moderate volume:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "trading-prod"
    env: "production"

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "order-management-system"
    static_configs:
      - targets: ["oms:8080"]
    metric_relabel_configs:
      # CRITICAL: drop any metric that has an order_id label
      # This prevents cardinality explosion if OMS accidentally emits it
      - source_labels: [order_id]
        regex: ".+"
        action: drop

  - job_name: "strategy-engine"
    static_configs:
      - targets: ["strategy:8081"]

  - job_name: "risk-manager"
    static_configs:
      - targets: ["risk:8082"]

remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*|process_.*"
        action: drop

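# Local retention and WAL segment size are set on the command line rather than under storage.tsdb:
#   --storage.tsdb.retention.time=24h        # Local retention; long-term in Thanos
#   --storage.tsdb.wal-segment-size=128MB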

The metric_relabel_configs on the OMS scrape config is a safety net. If a developer adds an order_id label to a metric in the OMS, Prometheus will drop every series that carries that label rather than let cardinality explode. It is a defensive measure and will cause metrics to disappear - which is visible and debuggable - rather than OOM-killing the server, which is a surprise on a Wednesday.
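The same guard can cover the whole list of identifier labels and be paired with a hard ceiling on samples per scrape. A sketch under stated assumptions: the 50,000 limit is an arbitrary placeholder, and the label list simply repeats the "do not label on" names from earlier:

```yaml
scrape_configs:
  - job_name: "order-management-system"
    sample_limit: 50000     # assumed ceiling; a scrape that exceeds it is treated as failed
    static_configs:
      - targets: ["oms:8080"]
    metric_relabel_configs:
      # source_labels values are joined with ";" before matching, so a series with
      # none of these labels yields ";;;;". The regex matches only when at least
      # one character is not a separator, i.e. when any of the labels is set.
      - source_labels: [order_id, trade_id, session_id, request_id, client_ip]
        regex: ".*[^;].*"
        action: drop
```

A plain ".+" regex would be a bug here: with several source labels, the joined separators alone would match it and every series from the job would be dropped.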

How This Breaks in Production

Failure mode 1: Order ID or trade ID added to a label. Symptom: Prometheus memory grows steadily and then OOM-kills with no warning. Root cause: a developer added order_id to a metric for debugging and forgot to remove it. The TSDB head grows at the rate of orders submitted - thousands per minute - because every new order ID creates a series that stays in the head until the next head truncation. Fix: add cardinality alerts and metric relabeling guards before this happens, not after.

Failure mode 2: histogram_quantile queries on raw metrics in production dashboards. Symptom: Grafana dashboards take 30-60 seconds to load, causing engineers to refresh constantly, creating query storms that make Prometheus itself fall behind on scrapes. Root cause: expensive range query running on every dashboard load. Fix: convert to recording rules.

Failure mode 3: Federation layer pulling raw metrics. Symptom: federation Prometheus OOMs even though leaf instances are fine. Root cause: the match[] selector on the /federate endpoint is too broad, pulling raw per-environment metrics instead of pre-aggregated recording rule outputs. Fix: namespace all recording rules with trading: prefix and restrict federation to that namespace.

Failure mode 4: Thanos sidecar falls behind on block uploads. Symptom: long-range queries return gaps; data visible in leaf Prometheus but not in Thanos Query. Root cause: leaf Prometheus is producing blocks faster than the sidecar can upload (network congestion, object storage rate limiting). Fix: tune sidecar concurrency and alert on thanos_shipper_upload_failures_total.

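Failure mode 5: remote_write queue full, samples dropped silently. Symptom: Thanos/Mimir shows gaps in time series that correspond exactly to Prometheus restart times or network partitions. Root cause: the remote_write queue falls behind or exhausts its retries, and Prometheus discards samples without alerting by default. Fix: alert on increases in prometheus_remote_storage_samples_failed_total and prometheus_remote_storage_samples_dropped_total. Be aware that the dropped counter also counts samples removed by write_relabel_configs, so a bare > 0 threshold fires constantly if you drop go_* and process_* metrics there.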

Failure mode 6: No cardinality monitoring. Symptom: cardinality grows slowly over weeks as features are added, until one day Prometheus OOMs with no clear cause. Root cause: no baseline established for expected series count, no alert when cardinality grows beyond threshold. Fix: alert on prometheus_tsdb_head_series > 500000 (tune to your scale) and review the series count trend every week.
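Several of these fixes reduce to alerting rules on Prometheus's and Thanos's own metrics. A starting-point sketch: the group name, alert names, for durations, and severities are assumptions. The remote-write rule watches prometheus_remote_storage_samples_failed_total instead of the dropped counter, a substitution worth flagging, because the dropped counter also increments for samples intentionally removed by write_relabel_configs:

```yaml
groups:
  - name: prometheus_self_monitoring        # hypothetical group name
    rules:
      # Failure mode 6: series count creeping past the expected baseline
      - alert: PrometheusHeadSeriesHigh
        expr: prometheus_tsdb_head_series > 500000   # tune to your scale
        for: 15m
        labels:
          severity: warning

      # Failure mode 5: samples that remote_write gave up on sending
      - alert: PrometheusRemoteWriteFailing
        expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: page

      # Failure mode 4: sidecar block uploads failing (requires scraping the sidecar)
      - alert: ThanosShipperUploadFailing
        expr: increase(thanos_shipper_upload_failures_total[1h]) > 0
        labels:
          severity: warning
```

These rules should be evaluated by a Prometheus that is not the one being watched, or at least routed through an Alertmanager that also receives a dead-man's-switch alert, since an OOM-killed server cannot report its own death.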
