Continuous Performance Benchmarking: Catching the 5% Regression That Costs $50K/Day

The PR description said: “Refactor: replace HashMap with BTreeMap in strategy state for cleaner iteration order.” The review thread had four comments, all about code style. The PR was approved and merged.

Three days later, we noticed P99 order latency had increased by 3µs. That sounds small. At Akuna, with the volume and position sizing we ran on that strategy, a sustained 3µs regression on the hot path translated to roughly $50,000-$70,000 per day in worse execution quality: fills happening a tick later, adverse selection on more orders, worse alpha capture.

The engineer who wrote the PR had no idea. The reviewers had no idea. Nobody looked at the benchmark numbers because there were no benchmark numbers in CI at the time. We caught it by chance - a latency regression test that ran weekly (not on every PR) happened to run on day 3.

That incident is why I believe CI-integrated performance benchmarking is not optional for trading infrastructure. This is the system I built after.

Why Code Review Cannot Catch Performance Regressions

Code review is effective at catching correctness issues: logic errors, missing edge cases, security vulnerabilities. It is almost completely ineffective at catching performance regressions unless the regression is obvious.

A HashMap to BTreeMap change looks like a refactor with clear benefits (ordered, predictable iteration that helps debugging). The reviewer correctly evaluates: “Is the code still correct?” - Yes. “Is the code cleaner?” - Arguably yes, you get deterministic iteration. “Is this more expensive?” - HashMap has O(1) average lookup, BTreeMap has O(log n) lookup. But the map held only 8 entries at the time (the strategy maintained state for 8 instruments). O(log 8) vs O(1) for 8 entries. How much could it matter?

It turns out that at the call frequency of the hot path (the lookup ran on every market data event, roughly 500,000 times per second across all instruments), the constant factors matter more than the asymptotics. A HashMap lookup is one hash computation plus a probe; a BTreeMap lookup is a run of key comparisons while walking the tree. In our case that difference added 3µs per call. 3µs × 500K calls/second = 1.5 seconds of additional CPU time per second, on top of the 0.2 CPU-seconds per second the same lookups used to cost.
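
The arithmetic is easy to reproduce. The sketch below converts a per-call cost and a call rate into CPU-seconds consumed per wall-clock second; the function name and the integer-nanosecond representation are illustrative choices, not part of the original system.

```python
def cpu_seconds_per_second(cost_ns_per_call: int, calls_per_second: int) -> float:
    """CPU time consumed per wall-clock second by a call of the given cost."""
    return cost_ns_per_call * calls_per_second / 1_000_000_000


# Figures from the incident: 3 µs of added cost at roughly 500,000 calls/second.
added = cpu_seconds_per_second(3_000, 500_000)
print(f"{added} CPU-seconds of extra work per second")
```

Any result above 1.0 means the added work alone would saturate more than one core, which is why a per-call cost that looks negligible in review is not negligible at this call rate.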

No amount of code review catches this. You need a number, and you need that number to be compared against a baseline.

Criterion.rs: Statistical Benchmarking for Rust

Criterion.rs is the standard benchmarking library for Rust. Its key property is statistical rigor: rather than running a benchmark once and reporting the result, Criterion runs it many times, flags outliers, and uses a bootstrapped t-test to determine whether a measured performance change is statistically significant.

This matters for CI integration. Cloud CI machines are noisy - execution time varies due to shared CPU, memory bandwidth contention, and unpredictable scheduling. Without statistical significance testing, you’ll generate false regression reports on every PR just from scheduler noise. With it, a regression report fires only when the measured change is unlikely to be noise.
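
To make the idea of significance testing concrete, here is a minimal sketch in Python. It deliberately swaps in a permutation test on the difference of means, because that fits in the standard library; it is not Criterion's own analysis. The function name, the sample timings, and the 0.05 cutoff are all assumptions for illustration.

```python
import random
from statistics import mean


def permutation_p_value(baseline, current, rounds=2000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(mean(current) - mean(baseline))
    pooled = list(baseline) + list(current)
    n = len(baseline)
    hits = 0
    for _ in range(rounds):
        # Randomly reassign the pooled timings to two groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n:]) - mean(pooled[:n]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)


# Hypothetical timings in nanoseconds, invented for illustration.
baseline_ns = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99]
shifted_ns = [t + 10 for t in baseline_ns]

print("shifted run flagged:  ", permutation_p_value(baseline_ns, shifted_ns) < 0.05)
print("identical run flagged:", permutation_p_value(baseline_ns, baseline_ns) < 0.05)
```

The shifted sample is flagged and the identical one is not. The point is that the decision depends on how the difference compares to the spread of the samples, not on the raw difference alone.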

A typical benchmark for trading code:

// benches/order_routing.rs
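use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
use trading_engine::{Order, OrderRouter};
// build_test_strategy, build_test_market_data, generate_order_batch and
// process_batch are project test helpers; bring them into scope from wherever
// your crate defines them.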

fn bench_order_routing(c: &mut Criterion) {
    let router = OrderRouter::new_test_instance();
    let order = Order::test_order();

    // Simple throughput benchmark
    c.bench_function("route_single_order", |b| {
        b.iter(|| {
            // black_box prevents the optimizer from eliminating the call
            black_box(router.route(black_box(&order)))
        })
    });
}

fn bench_strategy_evaluation(c: &mut Criterion) {
    let mut group = c.benchmark_group("strategy_eval");

    // Benchmark across different instrument counts
    for instrument_count in [4, 8, 16, 32].iter() {
        group.bench_with_input(
            BenchmarkId::new("evaluate_signals", instrument_count),
            instrument_count,
            |b, &n| {
                let strategy = build_test_strategy(n);
                let market_data = build_test_market_data(n);
                b.iter(|| {
                    black_box(strategy.evaluate(black_box(&market_data)))
                })
            }
        );
    }
    group.finish();
}

// CRITICAL: use iter_batched for benchmarks with setup cost
fn bench_order_batch(c: &mut Criterion) {
    c.bench_function("process_order_batch_100", |b| {
        b.iter_batched(
            // Setup: create fresh batch each iteration (not counted in timing)
            || generate_order_batch(100),
            // Benchmark: process the batch
            |batch| black_box(process_batch(black_box(batch))),
            criterion::BatchSize::SmallInput,
        )
    });
}

criterion_group!(
    benches,
    bench_order_routing,
    bench_strategy_evaluation,
    bench_order_batch
);
criterion_main!(benches);

The black_box calls are critical. Without them, the Rust optimizer may determine that the return value of router.route() is unused and eliminate the call entirely, benchmarking nothing. black_box creates an opaque barrier that prevents this optimization.

The CI Integration Architecture

The benchmark CI integration has two phases: baseline establishment and regression detection.

Baseline establishment runs on every merge to the main branch. It runs all benchmarks and stores the results as a JSON file in a known location (S3 bucket, artifact store, or a dedicated bench-results git branch). The baseline is keyed by branch and benchmark name.

Regression detection runs on every PR. It runs the same benchmarks and compares against the baseline. If any benchmark has regressed beyond the threshold, the PR check fails.

The threshold question is important. Using a fixed percentage threshold (fail if > 5% slower) sounds clean but has problems: it treats a 5% regression on a 1µs hot-path benchmark the same as a 5% regression on a 100ms background task. The 50ns regression on the hot path costs money; the 5ms regression on the background task is irrelevant.

Better: define per-benchmark thresholds based on the PnL impact of that specific code path.

# bench-thresholds.yml
benchmarks:
  # Hot path - order routing is on the critical path, 2% regression = investigate
  route_single_order:
    regression_threshold_pct: 2
    severity: fail  # Fail the PR

  # Signal evaluation - important but not as latency-critical
  strategy_eval/evaluate_signals/8:
    regression_threshold_pct: 5
    severity: fail

  # Background tasks - slow regressions matter less
  position_reconciliation:
    regression_threshold_pct: 20
    severity: warn  # Just warn, don't fail
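
The comparison logic that consumes these thresholds can be sketched in a few lines of Python. This is a hypothetical outline, not the actual scripts/check-bench-regression.py: the function name, the 5% default for unlisted benchmarks, and the timings are assumptions, and the thresholds are written as a dict so the example needs no YAML parser.

```python
def check_regressions(baseline_ns, current_ns, thresholds, default_pct=5.0):
    """Return (name, pct_change, severity) for every benchmark over its threshold."""
    findings = []
    for name, current in sorted(current_ns.items()):
        base = baseline_ns.get(name)
        if not base:
            continue  # new benchmark (or zero baseline): nothing to compare against
        pct_change = (current - base) / base * 100.0
        rule = thresholds.get(name, {})
        limit = rule.get("regression_threshold_pct", default_pct)
        if pct_change > limit:
            findings.append((name, round(pct_change, 2), rule.get("severity", "fail")))
    return findings


# Thresholds mirror bench-thresholds.yml; the timings are invented for illustration.
thresholds = {
    "route_single_order": {"regression_threshold_pct": 2, "severity": "fail"},
    "strategy_eval/evaluate_signals/8": {"regression_threshold_pct": 5, "severity": "fail"},
    "position_reconciliation": {"regression_threshold_pct": 20, "severity": "warn"},
}
baseline = {
    "route_single_order": 1000.0,
    "strategy_eval/evaluate_signals/8": 4000.0,
    "position_reconciliation": 100_000_000.0,
}
current = {
    "route_single_order": 1030.0,
    "strategy_eval/evaluate_signals/8": 4100.0,
    "position_reconciliation": 125_000_000.0,
}

findings = check_regressions(baseline, current, thresholds)
for name, pct, severity in findings:
    print(f"{severity.upper()}: {name} is {pct}% slower than baseline")

# Only "fail" findings should break the PR check; "warn" findings are reported only.
exit_code = 1 if any(severity == "fail" for _, _, severity in findings) else 0
```

With these numbers, the 3% slowdown on the hot path fails the check, the 2.5% slowdown on signal evaluation passes, and the 25% slowdown on reconciliation only warns.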

The GitHub Actions Workflow

The critical detail is the runner. Standard GitHub Actions runners are shared cloud VMs. They are fine for correctness tests. They are completely wrong for performance benchmarks due to noise.

You need a dedicated bare-metal runner for benchmarks. At Akuna, we had a dedicated physical server for this. At ZeroCopy, it is a dedicated DigitalOcean Droplet (not a shared cloud instance) that runs only benchmark jobs, with CPU frequency scaling disabled and other benchmark-disrupting processes stopped.

# .github/workflows/benchmarks.yml
name: Performance Benchmarks

on:
  pull_request:
    paths:
      - 'workspace/engine/**'  # Only run when engine code changes
      - 'services/quantfund/src/**'

  push:
    branches: [main, develop]
    paths:
      - 'workspace/engine/**'

jobs:
  benchmark:
    # CRITICAL: dedicated runner, not shared ubuntu-latest
    runs-on: [self-hosted, benchmark-runner]
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          toolchain: stable

      # Restore previous baseline from cache
      - name: Restore benchmark baseline
        uses: actions/cache/restore@v4
        with:
          path: bench-baseline/
          key: bench-baseline-${{ github.base_ref || 'main' }}-latest
          restore-keys: |
            bench-baseline-main-

      # Run benchmarks and capture bencher-format text output
      - name: Run benchmarks
        working-directory: workspace/engine
        run: |
          cargo bench --bench order_routing -- \
            --output-format bencher \
            | tee /tmp/bench-results.txt

          # Also generate Criterion JSON for detailed analysis
          cargo bench --bench order_routing -- \
            --save-baseline pr-${{ github.event.pull_request.number }}

      # Compare against baseline
      - name: Check for regressions
        run: |
          python3 scripts/check-bench-regression.py \
            --baseline bench-baseline/ \
            --current target/criterion/ \
            --thresholds bench-thresholds.yml \
            --output regression-report.md

      # Comment regression report on PR
      - name: Post regression report
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = fs.readFileSync('regression-report.md', 'utf8');
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: report
            });

      # On merge to main, update the baseline
      - name: Update baseline
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        run: |
          cp -r target/criterion/ bench-baseline/
          # Update rolling 7-day baseline average
          python3 scripts/update-rolling-baseline.py \
            --new-results target/criterion/ \
            --rolling-days 7 \
            --output bench-baseline/rolling-average.json

      - name: Save updated baseline
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        uses: actions/cache/save@v4
        with:
          path: bench-baseline/
          key: bench-baseline-main-${{ github.sha }}

The Noise Problem and Rolling Baseline

Comparing a PR’s benchmarks against a single-point baseline (the last main branch commit) introduces noise: if the baseline run happened during a period of high load on the benchmark machine, the baseline will be artificially slow, and the PR will appear to be faster than it is. Conversely, if the PR runs on a loaded machine, it will appear regressed.

The solution is a rolling baseline. Instead of comparing against one reference run, compare against the rolling average (or median) of the last N runs on main. Single-run noise gets averaged out.

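# scripts/update-rolling-baseline.py
import argparse
import json
from datetime import datetime, timedelta
from pathlib import Path
from statistics import median


def update_rolling_baseline(new_results_dir: Path, rolling_days: int, rolling_file: Path) -> None:
    # rolling_file is the JSON file itself (e.g. bench-baseline/rolling-average.json),
    # matching the --output argument passed by the workflow.

    # Load existing rolling data, or start fresh on the first run
    if rolling_file.exists():
        with open(rolling_file) as f:
            rolling_data = json.load(f)
    else:
        rolling_data = {"runs": []}

    # Load new results
    new_run = {
        "timestamp": datetime.now().isoformat(),
        "benchmarks": {}
    }

    # Criterion writes <benchmark id>/new/estimates.json. Grouped benchmarks nest
    # deeper (e.g. strategy_eval/evaluate_signals/8/new), so search recursively and
    # use the path relative to the results directory as the benchmark name.
    for estimates_file in sorted(new_results_dir.glob("**/new/estimates.json")):
        bench_name = estimates_file.parent.parent.relative_to(new_results_dir).as_posix()
        with open(estimates_file) as f:
            estimates = json.load(f)
        new_run["benchmarks"][bench_name] = {
            "mean_ns": estimates["mean"]["point_estimate"],
            "std_dev_ns": estimates["std_dev"]["point_estimate"],
        }

    rolling_data["runs"].append(new_run)

    # Keep only the last rolling_days worth of runs
    cutoff = datetime.now() - timedelta(days=rolling_days)
    rolling_data["runs"] = [
        run for run in rolling_data["runs"]
        if datetime.fromisoformat(run["timestamp"]) > cutoff
    ]

    # Compute rolling median for each benchmark
    rolling_data["rolling_medians"] = {}
    all_bench_names = set()
    for run in rolling_data["runs"]:
        all_bench_names.update(run["benchmarks"].keys())

    for bench_name in all_bench_names:
        values = [
            run["benchmarks"][bench_name]["mean_ns"]
            for run in rolling_data["runs"]
            if bench_name in run["benchmarks"]
        ]
        if values:
            rolling_data["rolling_medians"][bench_name] = median(values)

    rolling_file.parent.mkdir(parents=True, exist_ok=True)
    with open(rolling_file, "w") as f:
        json.dump(rolling_data, f, indent=2)


def main() -> None:
    parser = argparse.ArgumentParser(description="Update the rolling benchmark baseline")
    parser.add_argument("--new-results", type=Path, required=True)
    parser.add_argument("--rolling-days", type=int, default=7)
    parser.add_argument("--output", type=Path, required=True,
                        help="path of the rolling baseline JSON file")
    args = parser.parse_args()
    update_rolling_baseline(args.new_results, args.rolling_days, args.output)


if __name__ == "__main__":
    main()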
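
The reason to take a median over the window rather than a mean shows up as soon as one run lands on a loaded machine. A small illustration with invented numbers (the run values and the PR result are hypothetical):

```python
from statistics import mean, median

# Hypothetical mean_ns values from seven runs on main; one ran on a loaded machine.
runs_ns = [1000, 1004, 998, 1002, 1400, 1001, 999]


def pct_vs_baseline(current_ns, baseline_ns):
    """Percentage change of current relative to baseline (positive = slower)."""
    return (current_ns - baseline_ns) / baseline_ns * 100.0


pr_result_ns = 1003  # a PR that changed nothing performance-relevant

print(f"mean baseline:   {mean(runs_ns):.1f} ns")
print(f"median baseline: {median(runs_ns)} ns")
print(f"PR vs mean:   {pct_vs_baseline(pr_result_ns, mean(runs_ns)):+.1f}%")
print(f"PR vs median: {pct_vs_baseline(pr_result_ns, median(runs_ns)):+.1f}%")
```

Against the mean, the single slow run inflates the baseline enough that the unchanged PR looks like a roughly 5% improvement, which would also mask a real regression of similar size. Against the median, the same PR sits within a fraction of a percent of the baseline.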

Python Exchange Connector Benchmarks

Not all trading code is Rust. Exchange connector code (which handles WebSocket parsing, JSON deserialization, and protocol mapping) is often in Python. Python’s timeit module is adequate for simple benchmarks, but pytest-benchmark provides better CI integration:

# tests/bench_exchange_connector.py
import pytest
from quantfund.connectors.binance import BinanceConnector

@pytest.fixture
def connector():
    return BinanceConnector(test_mode=True)

@pytest.fixture
def raw_ticker_message():
    # Representative sample of real Binance ticker WebSocket message
    return b'{"e":"bookTicker","u":123456,"s":"BTCUSDT","b":"50000.10","B":"1.5","a":"50000.20","A":"2.1"}'

def test_bench_ticker_parse(benchmark, connector, raw_ticker_message):
    """Benchmark the hot path: parse raw WebSocket bytes to internal tick struct"""
    result = benchmark(connector.parse_ticker_message, raw_ticker_message)
    assert result is not None

    # pytest-benchmark records the timing stats; thresholds are checked by the CI comparison step

# conftest.py - register the benchmark marker
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "benchmark: mark test as a performance benchmark"
    )

# Run with: pytest tests/bench_exchange_connector.py --benchmark-json=bench-results.json
# Then compare in CI: python scripts/check-bench-regression.py --results bench-results.json
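
For the comparison step, the JSON written by --benchmark-json has to be reduced to one number per benchmark. The sketch below assumes the report has a top-level "benchmarks" list whose entries carry "fullname" and a "stats" object with times in seconds, which is the pytest-benchmark layout as I understand it; verify against a real file before relying on it. The helper name and the sample values are hypothetical.

```python
import json


def mean_ns_by_benchmark(report: dict) -> dict:
    """Map each benchmark's full name to its mean time in nanoseconds."""
    return {
        bench["fullname"]: bench["stats"]["mean"] * 1e9  # stats are in seconds
        for bench in report["benchmarks"]
    }


# Trimmed, hand-written stand-in for a --benchmark-json file; values are invented.
sample_report = json.loads("""
{"benchmarks": [
  {"name": "test_bench_ticker_parse",
   "fullname": "tests/bench_exchange_connector.py::test_bench_ticker_parse",
   "stats": {"mean": 2.5e-06, "stddev": 1.0e-07, "rounds": 10000}}
]}
""")

means = mean_ns_by_benchmark(sample_report)
for name, value in means.items():
    print(f"{name}: {value:.0f} ns")
```

Converting to nanoseconds keeps the Python results in the same unit as the Criterion estimates, so one set of thresholds and one comparison routine can serve both.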

How This Breaks in Production

Failure mode 1: Benchmarks run on shared CI runners. Symptom: benchmark CI is noisy - same PR shows +5% regression one run and -3% improvement the next. Engineers stop trusting the results and start ignoring failures. Root cause: shared cloud runners have non-deterministic CPU timing due to noisy neighbors. Fix: dedicated physical or reserved-instance runner with CPU frequency scaling disabled (cpupower frequency-set --governor performance).

Failure mode 2: Comparing against single-point baseline. Symptom: a PR that makes no performance changes shows a regression failure because the baseline happened to run on a loaded machine. Root cause: comparing against a single reference run rather than a rolling average. Fix: rolling 7-day median baseline reduces noise by 4-5x.

Failure mode 3: Benchmarks not covering the hot path. Symptom: performance regression ships to production; post-hoc analysis shows the regressed code is not covered by any benchmark. Root cause: benchmarks were written for the “interesting” code (strategy evaluation) but not for the mundane hot-path code (map lookups, message parsing). Fix: benchmark every function on the hot path, not just the strategically interesting ones.

Failure mode 4: black_box omitted, benchmarks measure nothing. Symptom: benchmarks report 0.1ns execution time for a function that should take 1µs. Root cause: the Rust optimizer has eliminated the benchmarked code because its return value is unused. Fix: always wrap inputs and outputs with black_box() in Criterion benchmarks.

Failure mode 5: No threshold calibration. Symptom: benchmark CI either never fails (threshold too high to catch real regressions) or fails constantly (threshold too low, catching noise). Root cause: a flat percentage threshold was chosen without regard for which code paths are latency-sensitive. Fix: define per-benchmark thresholds based on PnL sensitivity of the code path.

Failure mode 6: Benchmark runner not isolated. Symptom: benchmark results vary significantly between weekday runs (runner is handling other jobs) and weekend runs (runner is idle). Root cause: the dedicated benchmark runner is not actually dedicated - it’s running other CI jobs in parallel. Fix: benchmark jobs must have exclusive access to the runner while running; use GitHub Actions runner labels and concurrency groups to enforce this.
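
Failure modes 1 and 6 share a symptom: repeated runs of the same code disagree. One way to catch this before engineers lose trust is to benchmark an unchanged commit several times and check the run-to-run spread. The following is a sketch under assumptions: the 1% limit, the function names, and the sample timings are all invented for illustration and should be calibrated on your own runner.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples_ns):
    """Sample standard deviation relative to the mean (0.01 == 1%)."""
    return stdev(samples_ns) / mean(samples_ns)


def runner_is_quiet(samples_ns, max_cv=0.01):
    """True when repeated runs of identical code agree to within max_cv."""
    return coefficient_of_variation(samples_ns) <= max_cv


# Hypothetical timings (ns) for the same commit benchmarked five times.
isolated_runner = [1000, 1003, 998, 1001, 999]
shared_runner = [1000, 1050, 970, 1020, 960]

print("isolated runner quiet:", runner_is_quiet(isolated_runner))
print("shared runner quiet:  ", runner_is_quiet(shared_runner))
```

If the spread on identical code is already larger than the tightest regression threshold, no threshold tuning will help; the runner has to be fixed first.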
