Skip to content

Infrastructure

Profiling Production Trading Systems with perf, eBPF, and Off-CPU Flame Graphs

Debugging a P99 latency spike at Akuna Capital using perf record, Brendan Gregg flame graphs, eBPF offcpu analysis, and the critical difference between on-CPU and off-CPU profiles.

12 min
#perf #ebpf #flame-graphs #profiling #latency #trading

At Akuna, we had a profiling meeting that I still think about. The quant had a chart showing P50 = 18µs, P99 = 140µs, P99.9 = 890µs. The P50 was fine. The P99 was acceptable. The P99.9 was alarming - nearly 50x the P50 meant occasional trades were being submitted almost a millisecond after the signal, at which point the signal was worthless.

Someone suggested running a profiler to find the hot function. That is the wrong instinct for this type of problem, and it is a common mistake. A CPU profiler (sampling or instrumentation) shows you where CPU time is spent. P99.9 latency problems are almost never caused by too much CPU work - they are caused by the thread being not on the CPU for a period of time. The thread is runnable but blocked, sleeping, or waiting for a kernel operation.

This is the distinction between on-CPU profiling and off-CPU profiling. The standard tools - perf record, gprof, callgrind - only show you on-CPU time. Off-CPU time is invisible to them. An eBPF-based off-CPU profiler shows you exactly where time is lost waiting.

Part 1: On-CPU Profiling with perf

Before reaching for eBPF, understand what perf gives you natively.

perf record works by setting a hardware performance monitoring unit (PMU) counter to overflow at a specified rate (frequency), which triggers a PMI (performance monitoring interrupt) that records the current instruction pointer and optionally unwinds the call stack. This is statistical sampling - it captures what the CPU is executing when the sample fires.

# Basic CPU profile of a running process
sudo perf record -F 99 -g -p <pid> -- sleep 30

# -F 99: sample at 99 Hz (prime number to avoid lockstepping with periodic workloads)
# -g: capture call graphs (requires frame pointers; see below)
# -p <pid>: attach to running process
# -- sleep 30: run for 30 seconds then stop

# Generate report
sudo perf report --stdio
# Or interactive:
sudo perf report

The call graph collection (-g) requires frame pointers. Most production trading engines are compiled with -O2 -fomit-frame-pointer which removes frame pointers for performance, making call graphs unwind incorrectly. Options:

# Option 1: Compile with frame pointers (recommended for profiling builds)
g++ -O2 -fno-omit-frame-pointer -g trading_engine.cpp -o trading_engine

# Option 2: Use DWARF unwinding (no recompile needed, slower sampling)
sudo perf record -F 99 --call-graph dwarf -p <pid>

# Option 3: Use LBR (Last Branch Record) - Intel only, very low overhead
sudo perf record -F 99 --call-graph lbr -p <pid>

# Option 4: Use eBPF-based stack walking (bcc tools, covered below)

Generating Flame Graphs

Raw perf report output is hard to read for large call trees. Brendan Gregg’s flamegraph tool transforms it into an intuitive visualization:

# Install flamegraph tools
git clone https://github.com/brendangregg/FlameGraph /opt/flamegraph

# Record with perf
sudo perf record -F 99 -g -p <pid> -- sleep 30
sudo perf script | \
    /opt/flamegraph/stackcollapse-perf.pl | \
    /opt/flamegraph/flamegraph.pl > trading_engine_oncpu.svg

# Open the SVG in a browser - it is interactive (clickable to zoom)

Reading a flame graph: the x-axis is alphabetical (not temporal), the y-axis is call depth (bottom = deepest frame, top = where CPU time is spent), and width is proportional to the fraction of total samples where this function was on the stack. Wide towers near the top indicate where the CPU spends most time.

For a well-optimized trading engine, the on-CPU flame graph should show the strategy evaluation functions consuming the most time, with message parsing and order construction as secondary. If you see unexpected wide towers - a memory allocator, a mutex implementation, a memcpy deep in a protocol decoder - those are candidates for optimization.

# Profile a specific event type instead of CPU cycles
# Cache miss profile - find data-hungry code paths
sudo perf record -e cache-misses -g -p <pid> -- sleep 10
sudo perf script | /opt/flamegraph/stackcollapse-perf.pl | \
    /opt/flamegraph/flamegraph.pl --title "Cache Misses" > cache_misses.svg

# Branch mispredictions - find prediction-hostile code
sudo perf record -e branch-misses -g -p <pid> -- sleep 10

# TLB miss profile - confirm your huge pages are working
sudo perf record -e dTLB-load-misses -g -p <pid> -- sleep 10

Part 2: The Off-CPU Gap

The P99.9 problem at Akuna was not visible in the on-CPU flame graph. The flame graph showed a clean, narrow execution path: recv_packet → parse_md → update_book → eval_signal → submit_order. Nothing surprising. Nothing wide. Total CPU time per cycle: ~18µs.

So where did the extra 870µs go at P99.9?

The answer is off-CPU time: periods when the thread is runnable (not blocked on I/O or sleeping) but not scheduled. This happens when:

  1. The scheduler preempts the trading thread (something else got priority)
  2. The trading thread is waiting on a mutex or futex
  3. A page fault or TLB shootdown stalls the thread
  4. A syscall blocks briefly (even clock_gettime can take >1µs if it crosses a vDSO fault)

Off-CPU analysis traces exactly these events by hooking into the scheduler’s context_switch tracepoint. Every time the trading thread is descheduled, we record the time. Every time it is rescheduled, we record the wake-up callstack. The off-CPU flame graph shows where the thread spends time waiting.

Part 3: eBPF Off-CPU Analysis

BCC (BPF Compiler Collection) provides an offcputime tool that implements off-CPU analysis with minimal overhead:

# Install bcc-tools
apt-get install bpfcc-tools linux-headers-$(uname -r)

# Capture off-CPU time for a specific thread
# Use -t for thread ID (find with 'ps -eLo pid,tid,comm | grep trading')
sudo /usr/share/bcc/tools/offcputime -df -p <pid> 30 > offcpu_raw.txt

# Generate off-CPU flame graph
cat offcpu_raw.txt | \
    /opt/flamegraph/flamegraph.pl --title "Off-CPU Time" \
    --countname=us --colors=io > offcpu.svg

The offcputime script with the -df flags captures both the kernel stack and user-space stack at the moment of descheduling. The resulting flame graph shows you what the trading thread was doing (user stack) and what the kernel was doing (kernel stack) every time it was forced off the CPU, weighted by the time it spent off-CPU.

The Custom eBPF Script for Trading Analysis

The stock offcputime is good for exploration. For production monitoring, a custom eBPF program that attaches to specific callsites and measures latency directly is more precise:

/* trading_latency.bpf.c - eBPF program to measure off-CPU latency */
/* Compile with: clang -O2 -target bpf -c trading_latency.bpf.c -o trading_latency.bpf.o */

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Map: thread ID → timestamp of last context switch out */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, u64);
} switch_time SEC(".maps");

/* Map: latency histogram (log2 buckets, 0-1024µs) */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 64);
    __type(key, u32);
    __type(value, u64);
} offcpu_hist SEC(".maps");

/* Hook: called when a thread is switched out */
SEC("tracepoint/sched/sched_switch")
int on_sched_switch(struct trace_event_raw_sched_switch *ctx) {
    u32 prev_pid = ctx->prev_pid;
    u64 ts = bpf_ktime_get_ns();

    /* Record switch-out time for the trading thread */
    if (prev_pid == TRADING_TID) {  /* set via BPF map from userspace */
        bpf_map_update_elem(&switch_time, &prev_pid, &ts, BPF_ANY);
    }

    return 0;
}

/* Hook: called when a thread is scheduled back in */
SEC("tracepoint/sched/sched_wakeup")
int on_sched_wakeup(struct trace_event_raw_sched_wakeup *ctx) {
    u32 pid = ctx->pid;
    u64 *switch_ts;

    if (pid != TRADING_TID) return 0;

    switch_ts = bpf_map_lookup_elem(&switch_time, &pid);
    if (!switch_ts) return 0;

    u64 now = bpf_ktime_get_ns();
    u64 offcpu_ns = now - *switch_ts;
    u64 offcpu_us = offcpu_ns / 1000;

    /* Log2 bucket */
    u32 bucket = 0;
    while (offcpu_us > 1 && bucket < 63) {
        offcpu_us >>= 1;
        bucket++;
    }

    u64 *count = bpf_map_lookup_elem(&offcpu_hist, &bucket);
    if (count) __sync_fetch_and_add(count, 1);

    bpf_map_delete_elem(&switch_time, &pid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

And the Python userspace component using BCC:

#!/usr/bin/env python3
"""off_cpu_trading.py - monitor trading thread off-CPU time"""
from bcc import BPF
import ctypes
import time
import sys

TRADING_PID = int(sys.argv[1])  # PID of trading engine
TRADING_TID = int(sys.argv[2])  # specific thread ID if multithreaded

bpf_text = open("trading_latency.bpf.c").read()
bpf_text = bpf_text.replace("TRADING_TID", str(TRADING_TID))

b = BPF(text=bpf_text)
b.attach_tracepoint(tp="sched:sched_switch", fn_name="on_sched_switch")
b.attach_tracepoint(tp="sched:sched_wakeup", fn_name="on_sched_wakeup")

print(f"Monitoring TID {TRADING_TID} for off-CPU latency... Ctrl-C to stop")

try:
    time.sleep(30)
except KeyboardInterrupt:
    pass

print("\nOff-CPU Latency Distribution (µs):")
hist = b["offcpu_hist"]
for i in range(64):
    count = hist[ctypes.c_int(i)].value
    if count > 0:
        bucket_min = 1 << i
        bucket_max = (1 << (i + 1)) - 1
        bar = '#' * min(40, count // 10)
        print(f"  {bucket_min:8d}-{bucket_max:8d} µs: {count:8d} {bar}")

What We Found at Akuna

Running this on the trading engine during a period of high P99.9, the off-CPU histogram showed:

Off-CPU Latency Distribution (µs):
           1-          1 µs:   847293 ###################  ← normal scheduler noise
           2-          3 µs:   312847 #######
           4-          7 µs:   147823 ###
           8-         15 µs:    23847
          16-         31 µs:    12394
          32-         63 µs:      847                      ← starting to look unusual
          64-        127 µs:      423
         128-        255 µs:      389                      ← 389 events of 128-255µs
         256-        511 µs:      312                      ← 312 events of 256-511µs
         512-       1023 µs:       47                      ← 47 events >512µs

The tail events - the 47 instances of >512µs off-CPU time - corresponded precisely to the P99.9 latency spikes. The stack at wakeup showed: futex_wait → pthread_mutex_lock → __malloc_hook. The trading thread was blocked waiting for a memory allocator mutex.

The trading loop itself contained no allocations. But the market data parser had a code path that allocated a std::string on error conditions - specifically when the checksum failed on a malformed packet. Under normal operation this path was never taken. But during one brief period of high packet rate, a single malformed packet was received (a common occurrence on multicast market data feeds), the error path ran, malloc was called, and it blocked on a mutex that another thread was holding.

The fix was three lines: pre-allocate an error message buffer at startup, and use snprintf into the pre-allocated buffer instead of constructing a std::string.

Production Implementation: Always-On Low-Overhead eBPF

For continuous production monitoring, we run a lightweight eBPF program that only records events above a threshold:

#!/usr/bin/env python3
"""Continuous off-CPU spike monitor - production mode"""
from bcc import BPF
import prometheus_client
import threading
import time

SPIKE_THRESHOLD_US = 50  # Only alert on off-CPU events > 50µs

bpf_prog = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_PERF_OUTPUT(spike_events);

struct spike_data_t {
    u32 pid;
    u64 duration_us;
    char comm[16];
};

TRACEPOINT_PROBE(sched, sched_switch) {
    u32 prev = args->prev_pid;
    u64 ts = bpf_ktime_get_ns();
    start.update(&prev, &ts);
    return 0;
}

TRACEPOINT_PROBE(sched, sched_wakeup) {
    u32 pid = args->pid;
    u64 *ts = start.lookup(&pid);
    if (!ts) return 0;

    u64 duration_us = (bpf_ktime_get_ns() - *ts) / 1000;
    start.delete(&pid);

    if (duration_us > THRESHOLD_US) {
        struct spike_data_t data = {};
        data.pid = pid;
        data.duration_us = duration_us;
        bpf_get_current_comm(data.comm, sizeof(data.comm));
        spike_events.perf_submit(args, &data, sizeof(data));
    }
    return 0;
}
""".replace("THRESHOLD_US", str(SPIKE_THRESHOLD_US))

b = BPF(text=bpf_prog)

off_cpu_spikes = prometheus_client.Counter(
    'trading_off_cpu_spikes_total',
    'Number of off-CPU events above threshold',
    ['process', 'severity']
)
off_cpu_max = prometheus_client.Gauge(
    'trading_off_cpu_max_us',
    'Maximum observed off-CPU duration in µs'
)

def handle_spike(cpu, data, size):
    event = b["spike_events"].event(data)
    comm = event.comm.decode('utf-8', errors='replace').rstrip('\x00')
    us = event.duration_us

    severity = "critical" if us > 500 else "warning" if us > 100 else "info"
    off_cpu_spikes.labels(process=comm, severity=severity).inc()
    off_cpu_max.set(max(off_cpu_max._value.get(), us))

    if us > 100:
        print(f"[SPIKE] {comm} (PID {event.pid}): {us}µs off-CPU")

b["spike_events"].open_perf_buffer(handle_spike)

prometheus_client.start_http_server(9101)
print("Off-CPU monitor running, metrics on :9101")

while True:
    b.perf_buffer_poll(timeout=100)

How This Breaks in Production

1. perf record degrades performance. At high sampling frequencies (>1000 Hz), the PMI interrupt overhead becomes measurable on a lightly loaded system. Use 99 Hz for production profiling. The PMI fires on the trading core and briefly preempts the trading loop - even 99 interrupts per second at 5µs each is 495µs/second of overhead. Use --cpu to restrict sampling to specific CPUs if you only want to profile the trading thread.

2. BPF program attachment causes one-time latency spike. When you attach an eBPF program to a tracepoint, the kernel briefly stops each CPU, installs the hook, and resumes. This takes 1-5ms. Do not attach eBPF programs during active trading hours - attach during the pre-open window.

3. DWARF unwinding in perf is unreliable with JIT code. If your trading engine uses any JIT-compiled components (LLVM-based signal evaluation, Java-based strategy, etc.), perf record --call-graph dwarf will produce corrupt stack traces for the JIT region. Use --call-graph lbr (Intel) or compile with frame pointers and use --call-graph fp.

4. eBPF verifier rejects programs on older kernels. The BPF verifier’s complexity limit varies by kernel version. Programs that work on 5.15 may be rejected on 4.18 with “BPF program is too complex” or “exceeded program size limit.” The offcputime tool from BCC ships with multiple implementation variants for different kernel versions - always test on the target kernel.

5. Flame graph merges unrelated paths. perf’s aggregation merges stacks across all CPU modes (user, kernel, hypervisor). If your trading process and a kernel thread share a common function name (e.g., both call into a vDSO function), their stacks will be merged in the flame graph. Use --user-callchains and --kernel-callchains separately when the distinction matters.

6. Off-CPU analysis misidentifies voluntary sleeps as latency. If your trading engine calls nanosleep() or usleep() deliberately (for heartbeat timing, rate limiting, etc.), the off-CPU profiler will report those as “off-CPU time” - technically accurate but misleading. Filter known-voluntary sleep callsites from your analysis: grep -v "nanosleep\|usleep" offcpu_raw.txt | /opt/flamegraph/flamegraph.pl > filtered.svg.

Related reading: The Anatomy of a Sub-50µs Trade provides the full latency budget that profiling should measure against. Lock-Free Queues for Market Data covers the allocator-free patterns that eliminate the class of bug we found. CPU Pinning, isolcpus, and nohz_full covers reducing the scheduler events that show up in off-CPU profiles. Real-Time Scheduling on Linux covers SCHED_FIFO which reduces involuntary preemption.

Continue Reading

Enjoyed this?

Get one deep infrastructure insight per week.

Free forever. Unsubscribe anytime.

You're in. Check your inbox.