Latency Histograms Done Right: HDR, t-digest, and Building Sub-Millisecond Dashboards
Why Prometheus summary metrics lie about your P99, how HDR histograms caught a hidden 2ms tail at Akuna Capital, and building accurate sub-millisecond latency dashboards.
For about 18 months, our Akuna Capital latency dashboards showed P99 order-processing latency at a stable 50µs. The strategy team was happy with the number. The PnL attribution team wasn’t asking questions. And then we instrumented a new latency probe that used hardware timestamps instead of clock_gettime, and the P99 came back at 2ms.
The 50µs figure wasn’t wrong. It was something worse: it was a mathematically valid calculation of a number that didn’t mean what we thought it meant. The problem wasn’t our infrastructure - it was our instrumentation model. Specifically, we were using Prometheus summaries, and Prometheus summaries have a correctness problem that is subtle enough to cost you millions of dollars before you notice.
Why Prometheus Summaries Are Wrong for Latency
A Prometheus summary calculates quantiles on the client - inside the instrumented process, before the data ever reaches Prometheus. This sounds convenient, but it has a fundamental flaw: the quantiles are not aggregatable.
Here is the problem concretely. Suppose you have four strategy processes, each reporting P99 latency:
- Process 1 P99: 45µs
- Process 2 P99: 55µs
- Process 3 P99: 70µs
- Process 4 P99: 2000µs
What is the true fleet-wide P99? It is not the average of these four numbers (542.5µs), and it is not necessarily the maximum (2000µs) either. The true P99 depends on the underlying latency distribution, and the order volume, of all four processes. But Prometheus summaries discard the distribution - they only store the pre-computed quantile values.
When you write the PromQL query avg(latency_summary{quantile="0.99"}), you are computing the average of four P99s, which is statistically meaningless. It is not the P99 of the combined distribution. Process 4 - the one with the 2ms tail - carries exactly the same weight in the average as Process 1, regardless of how many orders each one handles. The result, roughly 542µs, has no correct interpretation.
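To make the point concrete, here is a small sketch that compares the average of per-process P99s with the P99 of the pooled samples. The sample values and the equal per-process order volume are invented for illustration; they are chosen only so that the per-process P99s match the four figures listed above.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a list of latencies in µs."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer arithmetic
    return ordered[rank - 1]

# Invented samples: each process handles 1,000 orders, 2% of them slow.
processes = {
    "proc1": [30] * 980 + [45] * 20,
    "proc2": [35] * 980 + [55] * 20,
    "proc3": [40] * 980 + [70] * 20,
    "proc4": [40] * 980 + [2000] * 20,
}

per_process_p99 = {name: p99(s) for name, s in processes.items()}
avg_of_p99s = sum(per_process_p99.values()) / len(per_process_p99)

# The only valid fleet-wide P99 comes from the combined samples.
pooled = [x for s in processes.values() for x in s]
pooled_p99 = p99(pooled)

print(per_process_p99)
print("average of P99s:", avg_of_p99s)
print("P99 of pooled samples:", pooled_p99)
```

With this particular traffic mix the pooled P99 is nowhere near the averaged figure, and a different split of order volume between the processes would move it again. That is the point: the four P99 values alone do not contain enough information to recover it.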
In our case, the process with the bad tail was handling a specific exchange. The summary’s per-instance P99 of 2ms was being averaged with three other processes showing 45-70µs, producing a “global P99” of 542µs that appeared reasonable enough not to trigger an alert. We were blind to a 2ms tail event hitting 1% of orders on one exchange.
The number that mattered - a 2ms P99 on one exchange - only became visible after we switched to histograms, which preserve enough of the distribution to compute quantiles both fleet-wide and per exchange.
Prometheus Histograms: The Right Model
Prometheus histograms capture the distribution rather than the quantile. Instead of computing “what is the P99?” at instrumentation time, they record how many observations fell into each latency bucket. The buckets are preserved through Prometheus scraping and can be aggregated correctly across instances.
The PromQL query to compute P99 across all instances:
histogram_quantile(0.99,
  sum(rate(order_latency_seconds_bucket[5m])) by (le)
)
The sum(...) by (le) aggregates the bucket counts across all instances before computing the quantile. This is mathematically valid - you are computing the quantile from the combined distribution, not averaging pre-computed quantiles.
The critical gotcha: bucket boundaries must match your resolution. Prometheus client libraries ship default bucket boundaries designed for request latencies measured in seconds. The Go client’s defaults are .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10; the Python client adds .075, .75, and 7.5. For sub-millisecond trading latencies, these are useless: the smallest boundary is 5ms, so the entire interesting range - 1µs to 999µs - falls in the first bucket.
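The following is a simplified model of what histogram_quantile does with summed buckets, written as a sketch rather than a port of the Prometheus implementation. The function names and the bucket counts are made up for illustration. It shows the merge-then-interpolate step, and what comes out when every observation lands in a 5ms first bucket.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last entry being (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Past the last finite bucket: report that bucket's upper bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

def merge(*instances):
    """Sum cumulative counts per bucket across instances (the sum by (le) step)."""
    bounds = [bound for bound, _ in instances[0]]
    return [(b, sum(inst[i][1] for inst in instances)) for i, b in enumerate(bounds)]

# Hypothetical per-instance buckets with boundaries at 100µs, 1ms, 10ms.
instance_a = [(0.0001, 990), (0.001, 1000), (0.01, 1000), (math.inf, 1000)]
instance_b = [(0.0001, 900), (0.001, 980), (0.01, 1000), (math.inf, 1000)]
fleet = merge(instance_a, instance_b)
fleet_p99 = histogram_quantile(0.99, fleet)

# Default-style buckets: every sub-millisecond observation sits in le=0.005.
coarse = [(0.005, 1000), (0.01, 1000), (math.inf, 1000)]
coarse_p99 = histogram_quantile(0.99, coarse)

print(fleet_p99, coarse_p99)
```

In the coarse case the estimate is interpolated between 0 and 5ms, so it reflects the bucket layout and says nothing about the real latency.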
You must declare explicit buckets for your workload:
# Python example - microsecond-resolution buckets for trading
from prometheus_client import Histogram

ORDER_LATENCY = Histogram(
    'order_latency_seconds',
    'Order processing latency',
    labelnames=['exchange', 'strategy'],
    buckets=[
        # Below 100µs: fine-grained to catch sub-100µs improvements
        0.000010,  # 10µs
        0.000025,  # 25µs
        0.000050,  # 50µs
        0.000075,  # 75µs
        0.000100,  # 100µs
        # 100µs-1ms: medium resolution
        0.000200,  # 200µs
        0.000400,  # 400µs
        0.000700,  # 700µs
        0.001000,  # 1ms
        # 1ms-10ms: coarse - if you're here, something is wrong
        0.002500,  # 2.5ms
        0.005000,  # 5ms
        0.010000,  # 10ms
        # Above 10ms: you have a serious problem
        0.050000,  # 50ms
        0.100000,  # 100ms
        0.500000,  # 500ms
        1.000000,  # 1s
    ]
)
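The number of buckets does increase cardinality - each bucket boundary is an additional time series - but it is bounded and predictable. The 16 explicit boundaries above produce 17 bucket series per label combination (the client adds an implicit le="+Inf" bucket), plus one _sum and one _count series. With 12 exchange label values and 8 strategy label values, that is 19 × 12 × 8 = 1,824 series. Manageable.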
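Counting the implicit le="+Inf" bucket and the _sum and _count series as well, the arithmetic can be sketched as below. The helper name is hypothetical, and some client libraries may export additional bookkeeping series on top of this.

```python
def histogram_series(explicit_buckets, label_value_counts):
    """Series exposed by one classic Prometheus histogram metric.

    Each label combination gets one _bucket series per explicit boundary,
    one more for le="+Inf", plus a _sum and a _count series.
    """
    combinations = 1
    for n in label_value_counts:
        combinations *= n
    per_combination = explicit_buckets + 1 + 2
    return per_combination * combinations

# 16 explicit buckets, 12 exchanges, 8 strategies
total_series = histogram_series(16, [12, 8])
print(total_series)
```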
HDR Histogram: Correct Percentile-Over-Percentile Aggregation
HdrHistogram (High Dynamic Range Histogram, at hdrhistogram.org) is a different instrument: it records values across a wide, preconfigured range at a fixed number of significant digits, and it includes built-in support for correcting coordinated omission.
Coordinated omission is a subtle measurement bias. When your test harness sends a request and the system is busy, the harness waits. During that wait, it is not sending new requests. The latency of the slow request is correctly captured, but all the requests that would have been sent during that period - and would also have been slow - are never sent and never measured. The result is a latency histogram that systematically underestimates tail latencies under load.
Gil Tene (who designed HdrHistogram) explains it this way: if your car’s speedometer only measured speed when you pressed the gas pedal, it would show high speeds most of the time (you press the gas when you want to move fast) and underreport time stuck in traffic (when you’re not pressing the gas). Traditional latency benchmarks have the same flaw.
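HdrHistogram’s corrected-recording method (hdr_record_corrected_value in C, record_corrected_value in Python, recordValueWithExpectedInterval in Java, record_correct in the Rust crate) accounts for coordinated omission by inserting synthetic samples representing the “should have been sent but wasn’t” requests. You pass the expected interval between samples, and any recorded value larger than that interval is back-filled with progressively smaller values. This gives an accurate picture of the latency distribution under load rather than under idealized test conditions.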
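The back-filling idea can be sketched in a few lines. This is an illustration of the technique using a plain list, not the library’s code; the function name and the sample values are invented.

```python
def record_corrected(samples, value, expected_interval):
    """Record value, plus synthetic samples for requests that were never sent."""
    samples.append(value)
    if expected_interval <= 0:
        return
    # Each request that should have been sent during the stall would have
    # waited one interval less than the one before it.
    missing = value - expected_interval
    while missing >= expected_interval:
        samples.append(missing)
        missing -= expected_interval

samples = []
record_corrected(samples, 50, 100)    # faster than the interval: no back-fill
record_corrected(samples, 1000, 100)  # 1ms stall: nine synthetic samples added
print(len(samples), sorted(samples))
```

A single 1ms stall at a 100µs expected interval therefore contributes ten samples to the tail instead of one, which is what an open-loop load generator would have observed.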
For Rust, the hdrhistogram crate is the standard:
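use hdrhistogram::Histogram;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Values are in µs: 1µs to 10s range, 3 significant figures
    let mut hist = Histogram::<u64>::new_with_bounds(1, 10_000_000, 3)?;

    // Placeholder inputs for illustration; in practice these come from your probe
    let latency_micros: u64 = 250;
    let expected_interval_micros: u64 = 100;

    // Use ONE of the two calls per observation, or the sample is counted twice.
    // Normal recording:
    //     hist.record(latency_micros)?;
    // Coordinated-omission-corrected recording
    // (expected_interval = how often you expect to sample, in µs):
    hist.record_correct(latency_micros, expected_interval_micros)?;

    // Query percentiles
    let p50 = hist.value_at_quantile(0.50);
    let p99 = hist.value_at_quantile(0.99);
    let p999 = hist.value_at_quantile(0.999);
    let max = hist.max();
    println!("P50: {}µs, P99: {}µs, P99.9: {}µs, MAX: {}µs",
        p50, p99, p999, max);
    Ok(())
}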
HDR histograms are also mergeable - you can combine histograms from multiple processes and compute correct aggregate percentiles. This is the property that Prometheus summaries lack.
t-digest: Probabilistic Aggregation with Bounded Error
t-digest is a different approach to the same problem. Instead of fixed bucket boundaries (like a Prometheus histogram) or fixed-precision buckets over a preconfigured range (like HDR), t-digest is a sketch: it summarizes the data as a bounded number of adaptive centroids.
The key property: accuracy is controlled by a configurable compression parameter, and the centroids are kept smallest near the extremes of the distribution. As a result, the error is smallest for percentiles far from the median (P99, P99.9) and largest near the median - which is exactly the trade-off you want for tail latency. Treat this as an accuracy characteristic that holds well in practice rather than a hard worst-case guarantee.
ClickHouse exposes t-digest through its quantileTDigest() function (the plain quantile() function uses reservoir sampling instead). If you’re storing trade execution data in ClickHouse for analytics, quantileTDigest is the right aggregation function:
-- ClickHouse: correct P99 across all orders in the last hour
SELECT
    exchange,
    strategy,
    quantileTDigest(0.99)(latency_microseconds) AS p99_us,
    quantileTDigest(0.999)(latency_microseconds) AS p999_us,
    max(latency_microseconds) AS max_us
FROM order_executions
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY exchange, strategy
ORDER BY p99_us DESC
For Prometheus-based alerting, use Prometheus histograms. For analytics queries over stored trade data in ClickHouse, use quantileTDigest. The tools are complementary.
Building the Dashboard: Separate Panels for Each Percentile
The single biggest dashboard mistake I see is putting P50, P99, P99.9, and MAX on the same panel with the same Y-axis. In a healthy trading system, P50 might be 30µs, P99 might be 200µs, P99.9 might be 800µs, and MAX might be 5ms. On a linear scale, the P50 line is invisible at the bottom. On a logarithmic scale, the relative difference between P50 and P99 is compressed.
The correct structure is four separate panels, each with its own Y-axis scaled to the expected range for that percentile:
Panel 1: P50 (typical latency). Y-axis: 0-200µs. This is your baseline - it should be stable. A P50 regression is a broad regression affecting most orders.
Panel 2: P99 (tail latency). Y-axis: 0-2ms. This is your SLO target. Alert threshold line at 500µs. Spikes here affect 1 in 100 orders.
Panel 3: P99.9 (extreme tail). Y-axis: 0-10ms. Spikes here are rare but indicate pathological cases: lock contention, GC pauses, kernel scheduler hiccups. Useful for debugging rather than alerting.
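Panel 4: MAX (ceiling). Y-axis: 0-100ms. The maximum single observation in the window. MAX is noisy and should never be used for SLOs, but it catches catastrophic one-off events - a 5-second GC pause, a temporary network partition - that percentile-based metrics smooth over. Note that a Prometheus histogram does not record a maximum, so this panel needs a separately exported metric, such as a gauge tracking the largest observation since the last scrape.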
The Grafana PromQL query for a correct P99 panel:
# Panel title: "Order Latency - P99 by Exchange (5m window)"
# Unit: microseconds (µs), set in panel options; Prometheus returns seconds, so the query multiplies by 1e6
histogram_quantile(0.99,
  sum(
    rate(order_latency_seconds_bucket{env="prod"}[5m])
  ) by (le, exchange)
) * 1e6
The by (le, exchange) keeps the distribution separate per exchange before computing the quantile. This gives you a P99 line per exchange on the same panel, which is exactly the view you want when investigating whether a latency spike is exchange-specific or system-wide.
For the fleet-wide P99 (all exchanges aggregated):
histogram_quantile(0.99,
  sum(rate(order_latency_seconds_bucket{env="prod"}[5m])) by (le)
) * 1e6
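If you generate dashboards from code, the percentile panels described above can be expressed as data, with the same query template reused for each quantile. This is a sketch only: the helper name and the dictionary layout are hypothetical and are not a Grafana API. The MAX panel is left out because it cannot be derived from histogram buckets.

```python
def quantile_query(q, by=(), metric="order_latency_seconds_bucket", window="5m"):
    """Build a PromQL string: sum bucket rates by le (plus extra labels), then take the quantile."""
    group = ", ".join(("le",) + tuple(by))
    return (
        f"histogram_quantile({q}, "
        f'sum(rate({metric}{{env="prod"}}[{window}])) by ({group})) * 1e6'
    )

# One panel per percentile, each with its own Y-axis ceiling in µs.
panels = [
    {"title": "P50", "quantile": 0.5, "y_max_us": 200},
    {"title": "P99", "quantile": 0.99, "y_max_us": 2_000},
    {"title": "P99.9", "quantile": 0.999, "y_max_us": 10_000},
]
for panel in panels:
    panel["fleet"] = quantile_query(panel["quantile"])
    panel["per_exchange"] = quantile_query(panel["quantile"], by=("exchange",))

for panel in panels:
    print(panel["title"], panel["per_exchange"])
```

Keeping the aggregation inside the template makes it harder to slip an avg(...) around a quantile by accident.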
The most important anti-pattern to avoid: never compute avg(histogram_quantile(...)) or avg(latency_summary{quantile="0.99"}). Averaging percentiles is mathematically invalid. The correct operation is histogram_quantile applied to summed bucket counts. If you find yourself averaging percentiles, you are not measuring what you think you are measuring.
The Coordinated Omission Problem in CI Benchmarks
This matters for CI performance benchmarks too, not just production dashboards. At Akuna, our CI latency benchmarks used a simple time.time() approach: record start, submit order to simulator, wait for ACK, record end. Because the next order is not submitted until the previous ACK arrives, this closed-loop pattern suffers from coordinated omission.
When we compared CI benchmark results to production traces, the CI P99 was consistently 3-5x lower than production P99. Part of this was environment - CI machines are not bare metal, production is. But part of it was coordinated omission: the benchmark would hit a slow period, pause, and then continue, never capturing the latency of the “would have been submitted during the pause” requests.
Switching to HdrHistogram with coordinated omission correction in our Criterion.rs benchmarks (Rust) brought the CI numbers within 20% of production numbers for the same hardware class. That was close enough to catch real regressions.
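The size of the bias is easy to see in a toy simulation. The sketch below models a single-threaded server and compares a closed-loop harness (send, wait, send next) with measuring each request against its intended send time on a fixed schedule. The workload numbers are invented and the model is deliberately minimal.

```python
def simulate(service_times, interval):
    """Return (closed_loop, open_loop) latencies for a single-server system.

    closed_loop: time from actual send to completion; the harness never sends
                 while a request is in flight, so it only sees service time.
    open_loop:   time from the intended send time on a fixed schedule.
    """
    closed_loop = list(service_times)
    open_loop = []
    server_free_at = 0
    for i, service in enumerate(service_times):
        intended = i * interval
        start = max(intended, server_free_at)  # queue behind a stalled request
        server_free_at = start + service
        open_loop.append(server_free_at - intended)
    return closed_loop, open_loop

# Invented workload: 100 requests of 1µs each, with one 200µs stall at request 50,
# sent on a 10µs schedule.
service_times = [1] * 100
service_times[50] = 200
closed, opened = simulate(service_times, interval=10)

slow_closed = sum(1 for x in closed if x > 1)
slow_open = sum(1 for x in opened if x > 1)
print("slow requests seen by closed loop:", slow_closed)
print("slow requests seen on fixed schedule:", slow_open)
```

The closed-loop harness reports one slow request out of 100. Measured against the intended schedule, the same stall delays every request queued behind it, which is the part of the tail a closed-loop benchmark never records.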
How This Breaks in Production
Failure mode 1: Prometheus summary used for fleet-wide latency SLO. Symptom: SLO dashboard shows P99 = 100µs, but users report intermittent slow fills. Root cause: summary quantiles are averaged across instances; a single instance with a 2ms P99 is averaged with three healthy instances, producing a fleet P99 that appears normal. Fix: migrate to histograms; the combined distribution cannot be recovered from summary quantiles after the fact.
Failure mode 2: Default Prometheus histogram bucket boundaries. Symptom: histogram_quantile returns a suspiciously constant value just under 5ms (about 0.00495) for P99 on a sub-millisecond system. Root cause: all observations fall into the first bucket (le="0.005"), and histogram_quantile interpolates linearly between 0 and that bucket’s upper bound, so the result reflects the bucket layout rather than your latency. Fix: define µs-resolution bucket boundaries at instrumentation time, before data is collected.
Failure mode 3: Averaging P99 across instances in Grafana. Symptom: P99 dashboard looks smooth and low even during incidents. Root cause: PromQL query uses avg(...) instead of histogram_quantile(0.99, sum(...) by (le)). The averaging masks the worst-performing instance. Fix: sum the bucket rates by le first, then apply histogram_quantile to the result.
Failure mode 4: Single-panel latency with shared Y-axis. Symptom: P50 line flatlines at zero, MAX line dominates the entire chart, making it impossible to see P99 trend. Root cause: P50 at 30µs and MAX at 5ms on the same linear-scale panel. Fix: four separate panels with separate Y-axis ranges.
Failure mode 5: MAX used as the SLO metric. Symptom: alerts fire constantly for transient one-off spikes that last 1-2 observations, saturating the on-call with noise. Root cause: MAX is the worst possible choice for SLO measurement - it captures single extreme outliers rather than a stable percentile. Fix: use P99 or P99.9 for SLOs; reserve MAX for debugging.
Failure mode 6: Coordinated omission in CI benchmarks. Symptom: CI benchmarks show 50µs P99; production traces show 500µs P99; teams dismiss the gap as “environment differences.” Root cause: the CI load generator uses a closed-loop pattern (send request, wait, send next request) that introduces coordinated omission and underreports tail latency under load. Fix: use HdrHistogram with record_correct in CI benchmarks; deploy to staging and compare histograms before concluding that environment explains the full gap.