Coordinated Omission and Why Your P99 Latency Is Probably a Lie
Gil Tene's coordinated omission problem explained with real numbers: how a naive benchmark showed 50µs P99 when the actual P99 was 2ms, and how to fix it with HDRHistogram.
Our Akuna benchmark showed 50µs P99. We built our market-making models around that number. Our risk limits assumed it. Our SLA commitments to the desk referenced it. The number was wrong by 40x. The actual P99 under load was 2ms.
We had a textbook case of coordinated omission - a measurement artifact so common in latency benchmarks that Gil Tene (CTO of Azul Systems, author of HDRHistogram) considers it the defining flaw of most published performance numbers. The failure is not in the system under test. It is in the measurement methodology. The benchmark cooperates with - is “coordinated” with - the system’s slowdowns, and systematically fails to measure the latency that real clients experience.
This post explains the problem precisely, shows why it leads to 10x-100x underestimates of tail latency, and covers how to fix it.
The Naive Benchmark Loop
Every benchmark I have seen before learning about this problem looks like this:
import time

def naive_benchmark(client, n_requests=100_000):
    latencies = []
    for i in range(n_requests):
        start = time.perf_counter_ns()
        client.request()  # blocks until the response arrives
        end = time.perf_counter_ns()
        latencies.append(end - start)
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    p999 = latencies[int(len(latencies) * 0.999)]
    print(f"P50: {p50/1000:.1f}µs P99: {p99/1000:.1f}µs P99.9: {p999/1000:.1f}µs")
The problem is the loop structure. This benchmark sends the next request only after the previous one completes. If the system experiences a 10ms pause (GC, kernel scheduler, network congestion), the benchmark does not send any requests during those 10ms. It simply waits. When it resumes, it records the 10ms response time for the one request that was in flight, then immediately sends the next request - which completes in 50µs again.
The 10ms pause affects 1 out of every ~200 requests at 50µs average. In the latency histogram, each pause registers as a single slow sample, so together the pauses occupy only the top 0.5% of the distribution. If you have 100,000 samples, the P99 is at sample 99,000 - and the 10ms pauses show up only about 500 times, placing them at P99.5 and above, not P99. Your P99 is still 50µs.
But here is what really happened: every client that would have been served during those 10ms of pause instead had to wait. In a real system with real concurrent clients arriving every 50µs, about 200 clients were held up by that one pause event. The first of them waited almost the full pause (50µs + 10ms = 10.05ms), the ones arriving later in the pause waited less, and none of them saw 50µs. Your benchmark measured the latency of the one request that was in flight, not the latency experienced by the 200 requests that were blocked by the pause.
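The effect can be reproduced without a real server. The following is a minimal simulation sketch, not our Akuna harness: it assumes a single-threaded server with a fixed 50µs service time, a 10ms stall on every 10,000th request, and clients arriving every 100µs. All names and parameters here are illustrative assumptions.

```python
def percentile(sorted_values, pct):
    """Same index convention as the naive benchmark above."""
    index = min(int(len(sorted_values) * pct / 100), len(sorted_values) - 1)
    return sorted_values[index]


def simulate(n_requests=100_000, service_ns=50_000, pause_ns=10_000_000,
             pause_every=10_000, interval_ns=100_000):
    """Return (serial_latencies, constant_rate_latencies), both sorted, in ns."""
    serial, constant_rate = [], []
    server_free_ns = 0  # virtual time at which the server has cleared its backlog
    for i in range(n_requests):
        # Every pause_every-th request hits a stall (a stand-in for a GC pause).
        service = service_ns + (pause_ns if (i + 1) % pause_every == 0 else 0)

        # Serial benchmark: the next request is sent only after this one
        # returns, so each sample is just the service time.
        serial.append(service)

        # Constant-rate clients: request i arrives on schedule whether or not
        # the server is ready, and queues behind whatever is still in progress.
        arrival_ns = i * interval_ns
        finish_ns = max(server_free_ns, arrival_ns) + service
        constant_rate.append(finish_ns - arrival_ns)
        server_free_ns = finish_ns
    return sorted(serial), sorted(constant_rate)


serial, constant_rate = simulate()
for label, data in (("serial", serial), ("constant-rate", constant_rate)):
    print(f"{label:>13}: P50={percentile(data, 50) / 1000:.0f}µs "
          f"P99={percentile(data, 99) / 1000:.0f}µs "
          f"max={percentile(data, 100) / 1000:.0f}µs")
```

Under these assumptions both views agree on the median and on the maximum, while the P99 differs by well over an order of magnitude: the serial loop reports the bare service time and the constant-rate view lands in the millisecond range.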
The Mathematical Definition of the Problem
The coordinated omission problem can be stated precisely:
A benchmark that issues requests serially (one at a time) measures the service time distribution, not the response time distribution. In a real system under load, slow responses cause downstream queuing. The benchmark’s measurement of “how long each request took” does not account for the concurrent requests that were blocked waiting for the slow request to complete.
The difference between what the benchmark measures and what clients experience:
Benchmark measures:
P99 = 99th percentile of (actual service time for each request sent)
What real clients experience:
P99 = 99th percentile of (time from when client WANTED to send until response received)
If the system has a 10ms pause every 200 requests at 50µs average, the benchmark P99 is 50µs (the pause is above P99). But for clients submitting requests at 20,000 req/s (one every 50µs), 200 clients arrive during each 10ms pause and wait for up to 10ms. Their P99 is not 50µs - it is on the order of the pause itself, milliseconds rather than microseconds.
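One way to write the distinction down, as a sketch that assumes a single FIFO server and a fixed send schedule (the symbols $t_i$, $\Delta$, $S_i$, $C_i$ and $R_i$ are introduced here for illustration only):

```latex
\begin{aligned}
t_i &= t_0 + i\,\Delta && \text{(intended send time of request } i\text{)} \\
C_i &= \max(C_{i-1},\, t_i) + S_i && \text{(completion time, with service time } S_i\text{)} \\
R_i &= C_i - t_i = S_i + \max(0,\, C_{i-1} - t_i) && \text{(response time seen by the client)}
\end{aligned}
```

The serial benchmark reports percentiles of $S_i$. Clients experience $R_i$, and the extra term $\max(0, C_{i-1} - t_i)$ is exactly the queuing delay that coordinated omission leaves out.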
Measuring This Correctly with HDRHistogram
Gil Tene’s solution is to track the intended send time rather than the actual send time. The key insight: instead of measuring end - start, where start is when you actually sent the request, measure end - intended_start, where intended_start is when you should have sent the request according to your target rate.
import time
from hdrh.histogram import HdrHistogram

def correct_benchmark(client, target_rate_hz, n_requests=100_000):
    """
    Correct latency benchmark that accounts for coordinated omission.
    target_rate_hz: intended request rate (e.g., 20000 for 20K req/s)
    n_requests: total number of requests to send
    """
    # Values are in nanoseconds: 1ns to 10s, 3 significant digits
    histogram = HdrHistogram(1, 10_000_000_000, 3)
    interval_ns = int(1_000_000_000 / target_rate_hz)
    start_time_ns = time.perf_counter_ns()
    for i in range(n_requests):
        # When we INTENDED to send this request
        intended_start_ns = start_time_ns + i * interval_ns
        # Spin-wait until the intended send time (more accurate than sleep).
        # If we are already behind schedule, send immediately; never skip.
        while time.perf_counter_ns() < intended_start_ns:
            pass
        client.request()
        end_ns = time.perf_counter_ns()
        # Record latency from INTENDED start, not actual start
        # This captures the queuing delay caused by previous slow requests
        latency_ns = end_ns - intended_start_ns
        histogram.record_value(latency_ns)
    p50 = histogram.get_value_at_percentile(50)
    p99 = histogram.get_value_at_percentile(99)
    p999 = histogram.get_value_at_percentile(99.9)
    p9999 = histogram.get_value_at_percentile(99.99)
    print(f"P50: {p50/1000:.1f}µs")
    print(f"P99: {p99/1000:.1f}µs")
    print(f"P99.9: {p999/1000:.1f}µs")
    print(f"P99.99: {p9999/1000:.1f}µs")
    return histogram
The critical line is latency_ns = end_ns - intended_start_ns. If a 10ms pause occurs at request 1000 (intended send time T1000), requests 1001 through 1200 all have intended_start_ns values that are in the past when they are finally sent. Their recorded latency includes the time they were blocked by the pause. This is the correct measurement.
HDRHistogram also provides record_corrected_value(value, expected_interval) which retroactively fills in the intermediate values, useful if you cannot always track intended send times:
from hdrh.histogram import HdrHistogram

# Alternative: HDRHistogram's built-in coordinated omission correction
histogram = HdrHistogram(1, 10_000_000_000, 3)  # 1ns to 10s, values in ns
expected_interval_ns = 1_000_000_000 // 20_000  # 50µs at 20K req/s
# measured_latencies: per-request latencies (ns) collected by a serial loop
for latency_ns in measured_latencies:
    # If this request took longer than expected, record intermediate
    # values to fill in the implicit waiting time for subsequent requests
    histogram.record_corrected_value(latency_ns, expected_interval_ns)
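To make "fills in the intermediate values" concrete, here is a pure-Python illustration of the samples such a correction adds for one slow request. It is a sketch of the behaviour as I understand it, not the library's code, and `corrected_samples` is a hypothetical helper name.

```python
def corrected_samples(value_ns, expected_interval_ns):
    """List the samples a corrected recording would add for one measurement.

    The measured value is always recorded. If it exceeds the expected
    interval, one extra sample is added for each request that should have
    been sent while we were waiting, each one interval shorter than the last.
    """
    samples = [value_ns]
    if expected_interval_ns <= 0:
        return samples
    missing = value_ns - expected_interval_ns
    while missing >= expected_interval_ns:
        samples.append(missing)
        missing -= expected_interval_ns
    return samples


# A 200µs response at a 50µs expected interval stands in for four requests.
print(corrected_samples(200_000, 50_000))
# A single 10ms stall at a 50µs interval expands into this many samples.
print(len(corrected_samples(10_000_000, 50_000)))
```

The back-filled samples assume the stalled requests would have arrived exactly on schedule, which is why tracking intended send times directly is preferable whenever the harness allows it.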
The Numbers: What Coordinated Omission Actually Hides
At Akuna, running the same workload against our order router with both methods:
Measurement Method           P50     P95     P99     P99.9   Max
─────────────────────────────────────────────────────────────────
Naive (serial, no CO fix)    47µs    51µs    53µs    89µs    12ms
Corrected (HDRHistogram)     48µs    55µs    2.1ms   8.4ms   12ms
The P50 is almost identical - the median request is unaffected. The P99 is 40x different. The P99.9 is 94x different. The max is the same, because the max is always the actual slowest event regardless of measurement method.
In practice, the P99 difference comes from two sources:
- GC pauses (for JVM-based components): 10-100ms every 2-5 seconds. At 20,000 req/s, a 10ms GC pause blocks 200 requests. Without correction, each pause is a single slow sample and shows up only at P99.99 or above. With correction, one 10ms pause touches at least 200 out of 100,000 requests (0.2%, about P99.8); a few pauses per run, or a single 100ms pause (2,000 requests, 2%), push the effect down to P99 and below.
- Kernel scheduler latency: even on a tuned system, SCHED_FIFO threads occasionally see 100-500µs scheduler wakeup latency. At high request rates, these scheduler events affect enough concurrent requests to move the P99 significantly.
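The arithmetic behind "how far down the distribution does a pause reach" is simple enough to sketch. `pause_footprint` is a hypothetical helper, and it gives a lower bound: it counts only requests that arrive during the pause, not the ones delayed afterwards while the backlog drains.

```python
def pause_footprint(pause_ms, n_pauses, rate_hz, total_requests):
    """Return (affected_requests, percentile_above_which_they_sit)."""
    per_pause = int(rate_hz * pause_ms / 1000)  # arrivals during one pause
    affected = min(per_pause * n_pauses, total_requests)
    percentile = 100.0 * (total_requests - affected) / total_requests
    return affected, percentile


# One 10ms pause in a 100,000-request run at 20,000 req/s
print(pause_footprint(10, 1, 20_000, 100_000))
# Two 100ms pauses in the same run
print(pause_footprint(100, 2, 20_000, 100_000))
```

The longer or more frequent the pauses, the lower the percentile at which they start to dominate the corrected histogram.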
Why This Matters for Trading Decisions
The gap between naive P99 and corrected P99 is not just a measurement hygiene concern. At Akuna, the incorrect P50=50µs P99=50µs picture led us to design a model that assumed all order acknowledgements arrived within 100µs - so the strategy could reprice based on the first acknowledgement and immediately issue a new order on the same venue.
The correct picture (P99=2ms) means: 1% of the time, the acknowledgement takes 2ms instead of 50µs. During those 2ms, our quote on the venue is stale - we are still showing the old price. Any adverse market movement during those 2ms represents a loss we had not priced into our spread. At the trade volumes we were running, this added up to measurable adverse selection.
The fix was simple once we had the correct measurement: widen the spread slightly to account for the tail latency, and implement a client-side timeout that repriced quotes pessimistically if an acknowledgement was not received within 500µs. But we could not have made this change correctly without knowing the true P99.
Practical Implementation: Load Testing Your Trading System
A complete load test harness that correctly measures latency:
#!/usr/bin/env python3
"""
trading_latency_bench.py - Correct latency benchmark for trading systems
Requires: pip install hdrh (the usage example also needs requests)
"""
import time
from dataclasses import dataclass
from typing import Callable
from hdrh.histogram import HdrHistogram

# Spin for the last stretch before each send; time.sleep() can overshoot,
# and any overshoot would be recorded as latency. 200µs is a guess; tune it.
SPIN_MARGIN_NS = 200_000

@dataclass
class BenchmarkConfig:
    target_rate_hz: int        # Requests per second
    duration_seconds: int      # How long to run
    warmup_seconds: int = 10   # Warmup period (excluded from results)

def run_benchmark(send_fn: Callable, config: BenchmarkConfig) -> HdrHistogram:
    """
    Run a latency benchmark with coordinated omission correction.
    send_fn: callable that sends one request and returns when complete
    """
    histogram = HdrHistogram(1, 60_000_000_000, 3)  # 1ns to 60s
    interval_ns = int(1_000_000_000 / config.target_rate_hz)
    total_requests = config.target_rate_hz * (config.duration_seconds + config.warmup_seconds)
    warmup_requests = config.target_rate_hz * config.warmup_seconds
    start_ns = time.perf_counter_ns()
    for i in range(total_requests):
        intended_start_ns = start_ns + i * interval_ns
        # Wait until intended start time: sleep for most of the gap, then spin
        remaining_ns = intended_start_ns - time.perf_counter_ns()
        if remaining_ns > SPIN_MARGIN_NS:
            time.sleep((remaining_ns - SPIN_MARGIN_NS) / 1e9)
        while time.perf_counter_ns() < intended_start_ns:
            pass
        # Send request (blocks until the response arrives)
        send_fn()
        req_end_ns = time.perf_counter_ns()
        # Skip warmup
        if i < warmup_requests:
            continue
        # Record latency from INTENDED start
        latency_ns = req_end_ns - intended_start_ns
        histogram.record_value(latency_ns)
    return histogram

def print_histogram(h: HdrHistogram, label: str = "Results"):
    print(f"\n{'='*50}")
    print(f"{label}")
    print(f"{'='*50}")
    for pct in [50, 75, 90, 95, 99, 99.9, 99.99, 100]:
        val_us = h.get_value_at_percentile(pct) / 1000
        print(f"P{pct:<6}: {val_us:>10.1f}µs")
    print(f"Total samples: {h.get_total_count():,}")

# Usage
if __name__ == "__main__":
    import requests  # example dependency

    # Reuse one connection so TCP setup is not measured on every request
    session = requests.Session()

    def send_order():
        session.post("http://localhost:9876/api/orders",
                     json={"side": "BUY", "qty": 1.0, "price": 43000.0},
                     timeout=5)

    config = BenchmarkConfig(
        target_rate_hz=1000,  # 1K orders/second
        duration_seconds=60,
        warmup_seconds=10,
    )
    print(f"Running benchmark: {config.target_rate_hz} req/s for {config.duration_seconds}s...")
    hist = run_benchmark(send_order, config)
    print_histogram(hist, "Order Router Latency (coordinated-omission corrected)")
How This Breaks in Production
Failure 1: Spin-wait consuming 100% CPU. A benchmark loop that spin-waits between requests, as correct_benchmark does, gets accurate inter-arrival timing by busy-polling the clock, which keeps one CPU core at 100% for the entire benchmark run. If the benchmark runs on the same machine as the system under test, you are artificially loading the system. Fix: use a dedicated core (isolated via isolcpus) for the benchmark loop, or use a separate machine.
Failure 2: Intended rate exceeding service rate - runaway latencies. If your target rate is 20,000 req/s but the service can only handle 15,000 req/s, the benchmark falls further behind its schedule with every request: roughly 0.25 seconds of extra lag per second of wall-clock time. After 40 seconds, intended_start_ns is 10 seconds in the past, and every recorded latency is 10+ seconds. The histogram only shows the queue backlog, not the actual service latency. Note that a single serial sender cannot exceed one request per service time, so a 50µs service time caps it at 20,000 req/s. Fix: verify the service can sustain the target rate before running a coordinated-omission-correct benchmark. Start with a rate sweep to find the sustainable throughput limit, then benchmark below that limit.
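A first estimate of the sustainable rate can be computed offline from service times recorded in a serial run, before touching the real system. This is a sketch under the assumption of a single serial sender; `final_lag_ns` and `highest_sustainable_rate` are hypothetical helper names, and the 1ms lag threshold in the example is arbitrary.

```python
def final_lag_ns(service_times_ns, rate_hz):
    """How far behind schedule a serial sender ends up at rate_hz."""
    if not service_times_ns:
        return 0
    interval_ns = 1_000_000_000 // rate_hz
    finish_ns = 0
    for i, service in enumerate(service_times_ns):
        intended_ns = i * interval_ns
        # A request starts at its intended time, or later if we are behind.
        finish_ns = max(finish_ns, intended_ns) + service
    last_intended_ns = (len(service_times_ns) - 1) * interval_ns
    return finish_ns - last_intended_ns


def highest_sustainable_rate(service_times_ns, candidate_rates_hz, max_lag_ns):
    """Highest candidate rate whose final lag stays within max_lag_ns."""
    sustainable = [rate for rate in candidate_rates_hz
                   if final_lag_ns(service_times_ns, rate) <= max_lag_ns]
    return max(sustainable) if sustainable else None


# 15,000 requests at a constant 60µs service time (about 16.7K req/s capacity)
service_times = [60_000] * 15_000
print(highest_sustainable_rate(service_times, [5_000, 10_000, 15_000, 20_000],
                               max_lag_ns=1_000_000))
print(final_lag_ns(service_times, 20_000) / 1e6, "ms behind at 20K req/s")
```

A sweep against the live system is still needed to confirm the number, since real service times change under load.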
Failure 3: HDRHistogram overflow. The histogram is configured with max_value = 10_000_000 (10ms when values are in nanoseconds) but the actual system has occasional 30ms pauses. In the Python hdrh package, record_value returns False for an out-of-range value instead of raising, so unless you check the return value the sample is silently discarded. The histogram reports P99.9 = 9.9ms (near the max recordable value) rather than 30ms. Fix: always configure HDRHistogram with a max value well above your expected worst case - I use 60 seconds. The histogram is efficient even with a large range; the memory overhead grows only logarithmically with the range.
Failure 4: Measuring the wrong thing. The benchmark measures time at the client, but the component causing the latency is a downstream service. The client-side P99 is 2ms because the downstream service has P99=2ms, while your routing layer adds <10µs. Optimizing the routing layer will not help. Fix: instrument latency at every boundary - client to router, router to service, service to downstream. The histogram at each boundary tells you where the tail latency originates.
Failure 5: Warmup period too short. JVM JIT compilation, CPU cache warming, and connection pool initialization all produce elevated latency during the first 30-120 seconds of a benchmark. If warmup is only 10 seconds, JIT compilation events appear in the results as P99.9 outliers. Fix: instrument warmup phase separately and watch for the point where latency stabilizes. For JVM systems, 60-120 seconds of warmup is typical; for native code, 10-30 seconds.
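One simple way to "watch for the point where latency stabilizes" is to compare the medians of consecutive windows of samples. The following is a heuristic sketch, not a standard method; `warmup_end_index`, the window size and the 5% tolerance are all assumptions to adjust for your system.

```python
import statistics


def warmup_end_index(latencies_ns, window=1_000, tolerance=0.05):
    """Return the start index of the first window whose median is within
    `tolerance` (relative) of the next window's median, or None if the
    series never settles."""
    medians = [statistics.median(latencies_ns[i:i + window])
               for i in range(0, len(latencies_ns) - window + 1, window)]
    for k in range(len(medians) - 1):
        if abs(medians[k] - medians[k + 1]) <= tolerance * medians[k + 1]:
            return k * window
    return None


# Synthetic run: two slow warmup windows, then steady state at 50µs
samples = [200_000] * 1_000 + [100_000] * 1_000 + [50_000] * 3_000
print(warmup_end_index(samples))
```

Because warmup effects often show up in the tail before they show up in the median, it may be worth running the same comparison on a high percentile of each window as well.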
Failure 6: Reporting mean instead of percentiles. Mean latency hides bimodal distributions. A system with 99% of requests at 50µs and 1% at 10ms has a mean of ~149µs - which is worse than 50µs but much better than 10ms. The mean makes the system look mediocre; the P99 makes it look terrible. Both are correct descriptions but serve different purposes. Fix: always report the full distribution: P50, P95, P99, P99.9. Never use mean as the primary latency metric for a trading system.
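The bimodal example is easy to check with synthetic data. This sketch uses the same index-based percentile convention as the naive benchmark earlier in the post; the sample counts are chosen only to give a 99%/1% split.

```python
import statistics

# 99% of requests at 50µs, 1% at 10ms (values in microseconds)
latencies_us = sorted([50] * 9_900 + [10_000] * 100)

mean_us = statistics.fmean(latencies_us)
p50_us = latencies_us[len(latencies_us) // 2]
p99_us = latencies_us[len(latencies_us) * 99 // 100]

print(f"mean={mean_us}µs  P50={p50_us}µs  P99={p99_us}µs")
```

The mean sits between the two modes and corresponds to no request that was actually served, while P50 and P99 each describe one of the two populations.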
Related reading: Determinism Under Load: Tail Latency Engineering covers how to systematically reduce the P99 once you can measure it correctly. SLOs for Trading Systems covers how to set contractual latency bounds backed by real measurements.