The Anatomy of a Sub-50µs Trade: Tracing a Packet from NIC to Strategy and Back

At Akuna Capital, we ran market-making strategies on 12 exchanges concurrently. Each exchange had its own dedicated NIC queue, its own pinned CPU core, and its own SPSC ring buffer connecting the market data handler to the strategy thread. The system worked well - until the day our order acknowledgement latency reports came back showing twice the latency our models had been assuming.

The number in the model was 18µs round-trip. The measured number was 37µs. The discrepancy had been there for months; it had just never been caught because our P&L attribution didn’t separate execution quality from alpha quality cleanly enough. When I finally wired up proper per-hop latency instrumentation using hardware timestamps on the Solarflare SFN8522 NICs, the answer was immediately obvious: our order encoder was running on NUMA node 0, but the lookup tables it read on every encode were in memory on NUMA node 1. Every outbound order was paying a cross-NUMA memory penalty of roughly 80ns on every single cache miss during encode.

That 80ns figure sounds small. Multiply it by the number of cache lines touched during an order encode - roughly 12-15 for a typical binary protocol message - and you get 960ns to 1.2µs of pure memory-access overhead, before a single instruction of encoding logic runs. At 12 exchanges, that compound loss was material.

This post walks through the complete life of a packet, from NIC arrival to strategy decision to wire transmit, with measured numbers at each stage. This is the map I wish I had when I started.

Why Every Nanosecond Has a Dollar Value

Before tracing the packet, it is worth being precise about why latency matters in ways that go beyond the intuitive “faster is better.”

In market-making, your edge comes from being able to update quotes faster than the market moves against you. If the fair value of BTC/USD moves by $10 and your system takes 50µs to see that move and reprice, any market order that arrives in that 50µs window will trade against a stale quote. The loss per trade is small - maybe $0.50 in fill improvement the counterparty gets at your expense - but at thousands of trades per day across 12 venues, it accumulates into several thousand dollars per day of adverse selection.

At Akuna, a persistent 18µs→37µs latency regression on a single exchange corresponded to roughly $4,000-$6,000 per day in measurable P&L degradation across all instruments on that venue, detectable within two weeks of the regression starting. That is the dollar cost of a NUMA misconfiguration.

The failure modes are not dramatic. There is no crash, no alert, no error log. The system processes every message correctly. The P&L just quietly leaks.

The Full Stack: What a Packet Touches

Here is the complete path from NIC receive to order transmit, with typical latency contribution for each stage on a well-tuned Linux server with kernel bypass:

Stage                           Typical Latency    Notes
─────────────────────────────────────────────────────────────────────────────
NIC hardware RX + timestamp     ~100-200ns         Hardware timestamping on SFN8522
DMA to host memory              ~150-300ns         Depends on PCIe gen + NUMA
ef_vi / DPDK RX poll            ~200-500ns         Kernel bypass poll loop overhead
Application RX parse            ~300-800ns         Protocol decode, checksum verify
Ring buffer write (SPSC)        ~50-100ns          Cache-line-aligned SPSC, same NUMA
Strategy thread wakeup          ~100-300ns         CPU already spinning (SCHED_FIFO)
Market data parse (MD→internal) ~500ns-2µs         Depends on book depth, depth limit
Strategy evaluation             ~1-5µs             Signal computation, alpha model
Order construction              ~200-500ns         Fill-in fields, sequence numbers
Binary encode (FIX/binary)      ~500ns-1.5µs       Cross-NUMA adds ~1µs here (our bug)
Ring buffer read (SPSC)         ~50-100ns
ef_vi / DPDK TX submit          ~300-500ns         DMA descriptor write + doorbell
NIC TX DMA + wire               ~100-200ns         Wire latency starts here
─────────────────────────────────────────────────────────────────────────────
Total (well-tuned, same NUMA)   ~4-12µs            One-way to wire
Total (our broken config)       ~14-20µs           With the cross-NUMA encoder bug (Stage 4)

The numbers above are one-way to wire. Round-trip (order sent + ACK received + parsed) on a co-located exchange adds another 4-12µs for the exchange matching engine to echo back the ACK, giving you 8-40µs total depending on configuration quality. The sub-50µs figures commonly cited in HFT marketing include the exchange processing time; the “your infrastructure” contribution should be under 15µs one-way on a well-tuned box.
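
As a sanity check on the table, the per-stage bounds can be summed mechanically. This is a minimal sketch: the struct and function names are made up for illustration, and the numbers are simply the well-tuned column copied from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the stage table: best-case and worst-case latency in ns. */
struct stage {
    const char *name;
    uint32_t min_ns;
    uint32_t max_ns;
};

/* Values copied from the table above (well-tuned, same-NUMA figures). */
static const struct stage k_stages[] = {
    { "NIC hardware RX + timestamp", 100,  200 },
    { "DMA to host memory",          150,  300 },
    { "ef_vi / DPDK RX poll",        200,  500 },
    { "Application RX parse",        300,  800 },
    { "Ring buffer write (SPSC)",     50,  100 },
    { "Strategy thread wakeup",      100,  300 },
    { "Market data parse",           500, 2000 },
    { "Strategy evaluation",        1000, 5000 },
    { "Order construction",          200,  500 },
    { "Binary encode",               500, 1500 },
    { "Ring buffer read (SPSC)",      50,  100 },
    { "ef_vi / DPDK TX submit",      300,  500 },
    { "NIC TX DMA + wire",           100,  200 },
};

/* Sum the per-stage bounds into a one-way, NIC-to-wire budget. */
static void sum_budget(const struct stage *stages, size_t n,
                       uint64_t *total_min_ns, uint64_t *total_max_ns)
{
    assert(stages != NULL && total_min_ns != NULL && total_max_ns != NULL);
    *total_min_ns = 0;
    *total_max_ns = 0;
    for (size_t i = 0; i < n; ++i) {
        *total_min_ns += stages[i].min_ns;
        *total_max_ns += stages[i].max_ns;
    }
}
```

Summing the rows gives 3,550ns and 12,000ns, which is where the ~4-12µs total comes from. Keeping the budget in a table like this makes it easy to diff against measured per-stage numbers.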

Stage 1: NIC Receive - The Race to Host Memory

When a UDP packet arrives at the NIC (we used multicast market data from the exchanges), the NIC’s hardware receive logic does several things before the host CPU sees anything:

1. Validates the Ethernet frame (CRC check)
2. Applies RSS (Receive Side Scaling) to determine which RX queue to use
3. Applies flow steering rules if configured (we used ethtool ntuple rules to pin each exchange’s multicast group to a specific queue)
4. Applies a hardware timestamp (on Solarflare SFN8522 with hwtstamp_config enabled)
5. Writes the frame into a DMA buffer whose physical address was pre-registered with the NIC

Steps 1-4 happen entirely on the NIC ASIC, typically in 100-200ns from packet arrival at the physical layer. Step 5 is where PCIe and NUMA matter.

The DMA write goes to a physical memory region you registered with the NIC during queue setup. If that physical memory is on NUMA node 1 but your NIC is attached to the PCIe root complex of NUMA node 0, the DMA write crosses the QPI/UPI interconnect. This adds approximately 60-80ns per cache line written, and a single packet descriptor plus payload might touch 4-8 cache lines.

# Verify which NUMA node your NIC is attached to
cat /sys/class/net/eth0/device/numa_node
# Expected: 0 or 1

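# Verify how many huge pages are reserved on each NUMA node (2MB pages assumed)
cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
# One line per node; the pool backing your DMA buffers must be on the NIC's node

# Check the node layout and per-node memory
numactl --hardware
# Look for 'available: 2 nodes (0-1)' and the memory sizes

# Check PCIe attachment
lspci -vvv | grep -A 10 "Ethernet" | grep "NUMA node"
# Prints the "NUMA node" line for each Ethernet device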

On a dual-socket server, getting this right means: NIC on NUMA node 0, huge page memory pool on NUMA node 0, trading process bound to NUMA node 0 CPUs. If any of these three is on the wrong node, you pay the cross-socket penalty on the hot path.
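
One way to make this rule hard to violate is to check it at process start and refuse to trade if it does not hold. The sketch below reads the same sysfs file as the shell command above; the function names are hypothetical, and it only verifies the NIC side (buffer and CPU placement need their own checks).

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a sysfs-style file.
 * Returns 0 on success, -1 if the file is missing or malformed. */
static int read_int_file(const char *path, int *out)
{
    assert(path != NULL && out != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* NUMA node the NIC's PCIe device is attached to, or -1 if unknown.
 * The kernel itself reports -1 on single-node machines and for virtual
 * devices, so callers should treat -1 as "cannot verify". */
static int nic_numa_node(const char *ifname)
{
    char path[256];
    int node = -1;
    int n = snprintf(path, sizeof path,
                     "/sys/class/net/%s/device/numa_node", ifname);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    if (read_int_file(path, &node) != 0)
        return -1;
    return node;
}

/* Startup guard: nonzero only if the NIC is on the node we pinned to. */
static int nic_on_expected_node(const char *ifname, int expected_node)
{
    return nic_numa_node(ifname) == expected_node;
}
```

Calling nic_on_expected_node("eth0", 0) before any buffer is allocated, and exiting if it returns zero, turns a silent latency regression into a loud startup failure.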

ef_vi (Solarflare’s kernel bypass API) registers whatever memory you hand it for DMA, so NUMA placement is decided when you allocate the buffer, not when you register it:

/* ef_vi registration - the buffer's NUMA node was fixed when it was allocated */
ef_pd_alloc(&pd, dh, ifindex, EF_PD_DEFAULT);   /* ifindex: e.g. if_nametoindex("eth0") */
ef_memreg_alloc(&memreg, dh, &pd, dh,
                hugepage_buf,      /* must be physically on NUMA 0 */
                hugepage_buf_len);

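If you use mmap(MAP_HUGETLB) to allocate the buffer without a NUMA policy, the kernel places each page on the node of whichever CPU first touches it, falling back to another node if that one has no free huge pages. Where the buffer ends up therefore depends on where your thread happened to be scheduled, and it can change between reboots. Always use mbind() or allocate via numactl --membind=0.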

Stage 2: Kernel Bypass - Why ef_vi, DPDK, and AF_XDP Exist

The Linux kernel’s network stack is not designed for latency. It is designed for correctness, generality, and CPU efficiency across millions of concurrent connections. Every packet that goes through the full kernel stack passes through:

softirq scheduling (the kernel has to take a software interrupt, which can be deferred)
Socket buffer (skb) allocation from a slab allocator
Protocol processing (IP, UDP) with lock acquisition
Copy from kernel space to userspace via recvmsg

On a lightly loaded system, this takes roughly 5-20µs. On a loaded system with interrupt coalescing enabled, much longer.

Kernel bypass eliminates all of this. The NIC writes packets directly into userspace memory that you own, and your application polls that memory directly. The kernel is never involved in the data path.

Here is a minimal ef_vi receive loop - the exact pattern we used at Gemini for market data:

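enum { MAX_EVENTS = 32 };  /* events handled per poll call */

/* With RX timestamping enabled, each buffer starts with a metadata prefix */
const int prefix_len = ef_vi_receive_prefix_len(&vi);

while (running) {
    ef_event evs[MAX_EVENTS];
    int n = ef_eventq_poll(&vi, evs, MAX_EVENTS);

    for (int i = 0; i < n; ++i) {
        if (EF_EVENT_TYPE(evs[i]) == EF_EVENT_TYPE_RX) {
            int id = EF_EVENT_RX_RQ_ID(evs[i]);
            int len = EF_EVENT_RX_BYTES(evs[i]) - prefix_len;

            /* Pointer arithmetic directly into DMA buffer - zero copy */
            const char *buf = pkt_bufs + (id * PKT_BUF_SIZE) + RX_DMA_OFF;
            const char *frame = buf + prefix_len;  /* Ethernet header starts here */

            /* Hardware timestamp extracted from packet prefix */
            struct timespec hw_ts;
            ef_vi_receive_get_timestamp(&vi, buf, &hw_ts);

            handle_packet(frame + ETH_HLEN, len - ETH_HLEN, hw_ts);

            /* Repost the buffer so the NIC can use it again.
               pkt_dma_addrs[id] is the buffer's DMA address, saved at setup
               from ef_memreg_dma_addr(). */
            ef_vi_receive_post(&vi, pkt_dma_addrs[id], id);
        }
    }
}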

The critical property here is that handle_packet receives a direct pointer into DMA memory. There is no copy. The packet data is read exactly once, from the location the NIC wrote it. This is why NUMA alignment of the DMA buffer is so important - it is the only memory read on the critical path.

The polling loop itself introduces overhead: ef_eventq_poll reads a ring descriptor to check if a new event is available. When no packet has arrived, this read hits L1 cache (the descriptor was already read on the previous iteration). The overhead is roughly 2-5 CPU cycles per poll iteration, which at 3GHz is under 2ns. This is why poll-mode drivers are used - sleeping and waking up on an interrupt would add microseconds to each packet.

Stage 3: Strategy Evaluation - Where Alpha Lives

After protocol decode and book update, the strategy thread runs. At Akuna, strategy evaluation included:

Book state update (inserting the new quote into our in-memory order book representation)
Fair value computation (weighted mid with depth adjustment)
Signal evaluation (spread model, inventory model, alpha model outputs)
Quote decision (should we update our bid, ask, or both?)
Order construction (fill in FIX/binary protocol fields)

The majority of this runs in 1-5µs depending on model complexity. The book update and fair value computation are the fastest parts - well-implemented book update on a SPSC ring is 200-400ns. The alpha model is the most variable, from 500ns for simple moving-average signals to 3-5µs for models that touch historical data or do floating-point-heavy calculations.
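
The “weighted mid with depth adjustment” step is not spelled out in this post, so here is one common form as an illustration: a size-weighted mid where each side’s best price is weighted by the opposite side’s cumulative size over the top few levels. The struct, the function name, and the exact weighting are assumptions for the sketch, not the production model.

```c
#include <assert.h>
#include <stddef.h>

struct level {
    double price;
    double qty;
};

/* Size-weighted mid over the top `depth` levels of each side.
 * bids[0] and asks[0] are the best bid and best ask. A heavier bid side
 * pulls the result toward the ask, and vice versa. */
static double weighted_mid(const struct level *bids, const struct level *asks,
                           size_t n_bids, size_t n_asks, size_t depth)
{
    assert(bids != NULL && asks != NULL && n_bids > 0 && n_asks > 0);

    double bid_qty = 0.0, ask_qty = 0.0;
    for (size_t i = 0; i < n_bids && i < depth; ++i)
        bid_qty += bids[i].qty;
    for (size_t i = 0; i < n_asks && i < depth; ++i)
        ask_qty += asks[i].qty;

    double total = bid_qty + ask_qty;
    if (total <= 0.0)
        return 0.5 * (bids[0].price + asks[0].price);  /* fall back to plain mid */

    return (bids[0].price * ask_qty + asks[0].price * bid_qty) / total;
}
```

With a best bid of 100 and best ask of 101, 30 units of bid depth against 10 units of ask depth gives 100.75 rather than the plain mid of 100.5. The loop touches at most `depth` levels per side, which is why this step sits at the cheap end of the budget.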

/* Strategy evaluation pseudocode - simplified from Akuna pattern.
   rx_tsc is the TSC value the RX thread captured right after
   ef_eventq_poll() returned this update. */
void on_market_data(const BookUpdate* update, uint64_t rx_tsc) {
    /* Step 1: Update book state - ~200ns */
    book.apply(update);

    /* Step 2: Compute fair value - ~100-200ns */
    double fair = book.weighted_mid(/*depth_levels=*/5);

    /* Step 3: Evaluate signals - ~500ns-2µs */
    double alpha = alpha_model.evaluate(fair, book.imbalance(), inventory);

    /* Step 4: Quote decision - ~100ns */
    if (fabs(alpha - last_quote_mid) > requote_threshold) {
        /* Step 5: Construct order - ~200-500ns */
        Order ord = {
            .side = SIDE_BID,
            .price = fair - half_spread,
            .qty = position_sizer.qty(fair, inventory),
            .clordid = next_clordid(),
        };

        /* Measure strategy latency from market data receipt.
           Both readings are TSC, so the delta stays in one clock domain
           (the NIC hardware timestamp runs on a different clock). */
        uint64_t strategy_ns = tsc_to_ns(rdtsc() - rx_tsc);
        metrics.record_strategy_latency(strategy_ns);

        tx_queue.push(ord);  /* SPSC push to encoder thread */
    }
}

The rdtsc() call for measurement is important: we used RDTSC (read timestamp counter) for intra-host latency measurement because it is the cheapest clock available - a few tens of cycles with no function call, compared with roughly 20ns for clock_gettime(CLOCK_REALTIME) or gettimeofday() even on the vDSO fast path. The TSC is monotonic and ticks at a fixed frequency on modern Intel processors with the constant_tsc and nonstop_tsc CPU flags - verify with grep -m1 flags /proc/cpuinfo.
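
A minimal TSC stopwatch might look like the sketch below. The names are hypothetical, and it calibrates against CLOCK_MONOTONIC by busy-waiting rather than against HPET, purely to keep the example self-contained. On non-x86 builds it falls back to the monotonic clock so the code still compiles. Note that a tick delta is only meaningful against another tick reading from the same host.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Wall-clock reference used only for calibration, never on the hot path. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 builds: one "tick" is one nanosecond. */
static inline uint64_t read_ticks(void) { return monotonic_ns(); }
#endif

/* Estimate the tick frequency by counting ticks across a busy-wait
 * of at least interval_ns nanoseconds. */
static double calibrate_ticks_per_sec(uint64_t interval_ns)
{
    assert(interval_ns > 0);
    uint64_t t0_ns = monotonic_ns();
    uint64_t c0 = read_ticks();
    uint64_t t1_ns;
    do {
        t1_ns = monotonic_ns();
    } while (t1_ns - t0_ns < interval_ns);
    uint64_t c1 = read_ticks();
    return (double)(c1 - c0) * 1e9 / (double)(t1_ns - t0_ns);
}

/* Convert a tick delta (never an absolute reading) to nanoseconds. */
static inline uint64_t ticks_to_ns(uint64_t tick_delta, double ticks_per_sec)
{
    return (uint64_t)((double)tick_delta * 1e9 / ticks_per_sec);
}
```

Calibrate once at startup with an interval of 10ms or more, store the result, and keep only read_ticks() on the hot path; the conversion to nanoseconds can happen later on the metrics thread.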

Stage 4: Wire Encoding and the Bug We Found

Order encoding takes the internal Order struct and serializes it into the wire format for the exchange (binary protocol, typically). At Akuna we used custom binary protocols for most venues - faster to encode than FIX ASCII, with much smaller message sizes.

The encoder is a straightforward function: write header fields, write instrument ID, write price (often in integer ticks), write quantity, compute checksum if required, write length prefix.

The bug was this: our encoder was a shared library loaded by a management process that ran on NUMA node 1. The trading engine’s encoder thread was also calling into the same code. Due to a linking quirk, the encoder’s constant lookup tables (price tick-size tables, instrument ID maps) were loaded into pages on NUMA node 1 - because that was the first process that loaded the library.

Every time the trading engine called into the encoder and touched those lookup tables, it was doing cross-NUMA reads. With 12-15 cache line reads per encode, at 80ns cross-NUMA penalty, that is 960ns to 1.2µs of added latency per order.

# How we detected it - perf stat showing LLC-load-misses on encode hotpath
sudo perf stat -e LLC-loads,LLC-load-misses,cache-references,cache-misses \
    -p <trading_engine_pid> --timeout 5000

# Output showed:
#    1,847,293  LLC-loads
#      923,441  LLC-load-misses    # 50% miss rate - completely wrong
#
# Expected for a hot path hitting L1/L2: < 1% LLC miss rate

# Then use perf c2c (cache-to-cache) to find the actual hot lines
sudo perf c2c record -p <trading_engine_pid>
sudo perf c2c report

The fix was to move the encoder into a standalone shared object with a linker script that placed all static data in a .hot section, and to use mmap with mbind(MPOL_BIND, numa_node_0_mask) to force those pages onto NUMA node 0.

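/* After fix: explicit NUMA binding for encoder data tables */
#define HUGE_PAGE_SIZE (2UL << 20)   /* assumes 2MB huge pages */

void encoder_init(void) {
    /* Size the MAP_HUGETLB mapping in whole huge pages */
    size_t table_size = (sizeof(instrument_table) + HUGE_PAGE_SIZE - 1)
                        & ~(HUGE_PAGE_SIZE - 1);

    instrument_table_ptr = mmap(NULL, table_size,
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (instrument_table_ptr == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* fail fast, no silent 4KB fallback */
        abort();
    }

    /* Force allocation on NUMA node 0, before the first touch */
    unsigned long nodemask = 1UL << 0;  /* node 0 */
    if (mbind(instrument_table_ptr, table_size, MPOL_BIND,
              &nodemask, sizeof(nodemask) * 8, MPOL_MF_STRICT) != 0) {
        perror("mbind");
        abort();
    }

    /* Copy data into NUMA-local memory (pages are faulted in here) */
    memcpy(instrument_table_ptr, &_instrument_table_static,
           sizeof(instrument_table));
}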

After this fix, LLC miss rate on the encoder dropped from 50% to under 2%, and the one-way latency on that exchange dropped from 20µs back to 11µs.

Production Implementation: The Full Tuning Checklist

This is the exact boot configuration we used. Every command here maps directly to a measurable latency reduction.

# /etc/default/grub - kernel command line additions
GRUB_CMDLINE_LINUX="isolcpus=2-15 nohz_full=2-15 rcu_nocbs=2-15 \
    intel_idle.max_cstate=0 processor.max_cstate=0 \
    idle=poll \
    transparent_hugepage=never \
    numa_balancing=disable \
    skew_tick=1 \
    nosoftlockup \
    nmi_watchdog=0 \
    audit=0 \
    tsc=reliable clocksource=tsc"

# After editing, rebuild:
sudo update-grub && sudo reboot

# Verify isolcpus took effect
cat /sys/devices/system/cpu/isolated
# Should show: 2-15

# Verify TSC as clock source
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# Should show: tsc

# NUMA configuration - run at startup
# Bind trading process to NUMA node 0 CPUs and memory
numactl --cpunodebind=0 --membind=0 ./trading_engine

# Or pin to specific cores and set the memory policy in the same launch.
# (numactl cannot change the memory policy of a process that is already running.)
numactl --physcpubind=2-7 --membind=0 ./trading_engine &
# Alternatively, set the policy with libnuma inside the process itself:
# numa_set_localalloc(); or numa_set_membind(node0_mask);

# Huge pages - static allocation at boot
echo 512 > /proc/sys/vm/nr_hugepages
# Verify allocation succeeded
grep HugePages /proc/meminfo
# HugePages_Total:     512
# HugePages_Free:      512

# Disable THP (Transparent Huge Pages) - critical
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Make permanent in /etc/rc.local or systemd unit

# IRQ affinity - keep RX queue IRQs off the trading cores
# Find IRQ numbers for your NIC queues
grep -E 'eth0|sfn8522' /proc/interrupts | awk '{print $1}' | tr -d ':'

# Move each to a non-trading core (core 0 or 1)
for irq in $(grep eth0-rx /proc/interrupts | awk '{print $1}' | tr -d ':'); do
    echo "1" > /proc/irq/$irq/smp_affinity  # core 0 only, hex bitmask
done

# Disable irqbalance on trading hosts (stop it now and keep it off after reboot)
systemctl disable --now irqbalance
# Or configure it to avoid the isolated trading cores (CPUs 2-15 = mask 0xfffc):
# IRQBALANCE_BANNED_CPUS="0000fffc" in /etc/default/irqbalance

# CPU frequency - disable frequency scaling on trading cores
for cpu in $(seq 2 15); do
    echo performance > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
done

# Disable hardware turbo boost (for determinism, not max throughput)
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# Verify frequency is fixed
cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq
# Should match scaling_max_freq
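
Several of the settings above are the same set of cores written two ways: a CPU list ("2-15") for isolcpus and a hex bitmask for smp_affinity and IRQBALANCE_BANNED_CPUS. A small parser makes it possible to check at startup that no IRQ mask overlaps the isolated cores. This is a sketch with a hypothetical function name, limited to CPUs 0-63.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a kernel CPU list such as "2-15" or "0,2-3" into a bitmask
 * (bit N set means CPU N is in the list). Handles CPUs 0-63 only.
 * Returns 0 on success, -1 on malformed input. */
static int parse_cpu_list(const char *s, uint64_t *mask_out)
{
    assert(s != NULL && mask_out != NULL);
    uint64_t mask = 0;

    while (*s != '\0' && *s != '\n') {
        char *end;
        if (*s < '0' || *s > '9')
            return -1;
        unsigned long lo = strtoul(s, &end, 10);
        unsigned long hi = lo;
        if (*end == '-') {                 /* range "lo-hi" */
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long cpu = lo; cpu <= hi; ++cpu)
            mask |= (uint64_t)1 << cpu;
        if (*end == ',')
            ++end;                         /* next element */
        else if (*end != '\0' && *end != '\n')
            return -1;
        s = end;
    }
    *mask_out = mask;
    return 0;
}
```

Feeding it the contents of /sys/devices/system/cpu/isolated and AND-ing the result with each IRQ's smp_affinity value should give zero; anything else means an interrupt can land on a trading core.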

How This Breaks in Production

This section covers failure modes that are invisible in testing but surface in production. These are patterns I have personally debugged.

1. NUMA migration at startup. If your process starts on NUMA node 1 for any reason (e.g., the startup script’s shell happens to be scheduled there), the memory allocator will initially allocate from node 1. Even if you call numa_set_preferred(0) later, already-allocated pages stay on node 1. The fix is to pin the thread to a node 0 core and set the memory policy (numa_set_membind() with a node 0 mask) at the very top of main(), before any significant allocation. Note that numa_set_localalloc() on its own does not help here, because “local” means whichever node the thread is running on at the time. Better: use a wrapper script with numactl --cpunodebind=0 --membind=0 so the OS enforces it at process start.

2. irqbalance restart after kernel update. Many Linux systems run irqbalance and restart it on package updates. After a kernel update that triggers a service restart, irqbalance will cheerfully move your RX queue IRQs to whatever core it thinks is least loaded - which might be your trading core. Mask the irqbalance unit on your trading nodes (systemctl mask irqbalance), or remove it entirely and manage IRQ affinity statically in your startup scripts.

3. TSC desync after CPU sleep. If C-states are not properly disabled and the CPU lacks the nonstop_tsc flag, the TSC stops counting while a core sits in a deep C-state (specifically C6, which power-gates the core). When the core wakes up, its TSC has fallen behind the other cores and wall-clock time. This causes your latency measurements to show impossible negative values or huge spikes. Detection: dmesg | grep -i tsc will show TSC instability warnings. Prevention: intel_idle.max_cstate=0 and processor.max_cstate=0 in the kernel cmdline, plus verify with cpupower idle-info.

4. Huge page allocation failure. nr_hugepages is fulfilled from contiguous physical memory. On a freshly booted system, contiguous memory is available; after hours of uptime with memory fragmentation, the kernel may not be able to fulfill a large allocation request. If your DMA buffer allocation fails, your code likely falls back to normal pages - silently, without error if you don’t check return values carefully. The symptom is a higher-than-expected dTLB miss rate and higher latency. Prevention: allocate huge pages at boot time (before memory fragments), verify the count in /proc/meminfo, and fail fast if the expected count is not present.

5. False sharing between SPSC producer and consumer. If your ring buffer’s read index and write index live on the same 64-byte cache line, the producer and consumer cores will bounce that cache line between their L1 caches on every operation. The measured overhead is 40-100ns per operation instead of 2-5ns. Detection: perf c2c record followed by perf c2c report will show “HitM” (Hit Modified) events on the affected cache line. Fix: pad your indices to separate cache lines with __attribute__((aligned(64))).

6. PCIe link downgrade under load. Some servers will downgrade a PCIe link from Gen4 x8 to Gen3 x8 under thermal pressure, cutting DMA bandwidth in half. This manifests as suddenly higher DMA latency and increased receiver buffer drops. Detection: lspci -vvv | grep -E 'LnkCap:|LnkSta:' and compare LnkSta (current) against LnkCap (maximum). This happened to us at Akuna once during a summer afternoon on a particularly hot trading day - the server’s PCIe controller was throttling itself. Fix: improve airflow or accept the limitation and overprovision bandwidth.
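
For failure mode 5, the layout fix is easiest to see in code. Below is a minimal SPSC ring sketch using C11 atomics, with the two indices and the payload each forced onto their own cache line. It uses the standard alignas spelling instead of the GCC attribute; the struct layout, slot type, and capacity are assumptions for the example, not the production queue.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64
#define RING_SLOTS 1024            /* must be a power of two */

struct spsc_ring {
    /* Written by the producer only. */
    alignas(CACHE_LINE) _Atomic uint64_t head;
    /* Written by the consumer only; alignas gives it its own line. */
    alignas(CACHE_LINE) _Atomic uint64_t tail;
    /* Payload starts on a third line so slot writes leave the indices alone. */
    alignas(CACHE_LINE) uint64_t slots[RING_SLOTS];
};

/* Compile-time proof that the indices cannot share a cache line. */
static_assert(offsetof(struct spsc_ring, tail) -
              offsetof(struct spsc_ring, head) >= CACHE_LINE,
              "head and tail must live on separate cache lines");

/* Producer side. Returns false if the ring is full. */
static bool spsc_push(struct spsc_ring *r, uint64_t value)
{
    uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint64_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slots[head & (RING_SLOTS - 1)] = value;
    /* Release: the slot write becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
static bool spsc_pop(struct spsc_ring *r, uint64_t *out)
{
    uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint64_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_SLOTS - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The static_assert is the useful part in practice: if someone later reorders the struct or drops an alignas, the build fails instead of the P99 drifting.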

Measurement Framework: Knowing Your Numbers

The most important thing you can do is have per-stage latency instrumentation running continuously in production - not just in testing.

/* Per-stage TSC timestamps - production implementation */
struct TradeLatency {
    uint64_t nic_rx_hw_ns;      /* From ef_vi hardware timestamp */
    uint64_t app_rx_tsc;        /* First TSC after ef_eventq_poll returns */
    uint64_t md_parsed_tsc;     /* After protocol decode + book update */
    uint64_t strategy_start_tsc;
    uint64_t strategy_end_tsc;
    uint64_t encode_start_tsc;
    uint64_t encode_end_tsc;
    uint64_t tx_submitted_tsc;  /* After ef_vi_transmit() returns */
    uint64_t tx_ack_hw_ns;      /* Hardware timestamp on ACK packet */
};

/* Convert TSC to nanoseconds using a calibrated ratio */
static inline uint64_t tsc_to_ns(uint64_t tsc_delta) {
    /* tsc_hz measured at boot via calibration against HPET */
    return (uint64_t)((double)tsc_delta * 1e9 / tsc_hz);
}

Every trade generates a TradeLatency struct. These are written to a lock-free SPSC queue and consumed by a low-priority thread that writes them to disk and updates Prometheus histograms. The histogram is what catches regressions - a sudden shift in P50 latency is almost always a configuration drift (kernel update, irqbalance restart, NUMA policy change), while a shift in P99 points to something else entirely.
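
To show what watching P50 and P99 means concretely, here is a nearest-rank percentile over a batch of latency samples. This is an offline sketch with a hypothetical function name; it sorts exact samples, whereas Prometheus histograms estimate percentiles from bucket counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Ascending comparison for qsort, written to avoid overflow on subtraction. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: the smallest sample such that at least
 * pct percent of all samples are less than or equal to it.
 * Sorts `samples` in place. */
static uint64_t percentile_ns(uint64_t *samples, size_t n, unsigned pct)
{
    assert(samples != NULL && n > 0 && pct >= 1 && pct <= 100);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    size_t rank = (n * pct + 99) / 100;   /* ceil(n * pct / 100), 1-based */
    return samples[rank - 1];
}
```

For the ten samples {10, 11, 12, 11, 10, 13, 12, 11, 10, 250}, P50 is 11 and P99 is 250: a single slow trade moves the tail without touching the median, which is why the two are tracked separately.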

Related reading: CPU Pinning, isolcpus, and nohz_full covers keeping cores quiet. NUMA in Production goes deeper on NUMA topology diagnosis. Solarflare ef_vi vs DPDK vs AF_XDP covers the kernel bypass layer in detail. Lock-Free Queues for Market Data covers the SPSC ring buffer implementation.