NUMA in Production: Why Your Trading Bot Slows Down at 3 AM and How to Diagnose It

P99 doubled from 22µs to 41µs overnight with no code changes. A background analytics job pushed the trading engine off NUMA node 0. Full numastat/perf c2c diagnosis workflow.

12 min
#numa #latency #linux #memory #hft #perf

The Akuna trading floor had a strict culture around latency. Every morning, the first thing the infrastructure team checked was the overnight P99 histogram. If your P99 had moved more than 5% without a code deploy, you were expected to explain why by stand-up.

One Tuesday morning, the BTC/USD market-maker’s P99 had gone from 22µs to 41µs - overnight, with no code changes, no configuration changes, no network events. The system was processing orders correctly. P&L was normal. But the latency had nearly doubled.

It took four hours to find the cause. The culprit was a background analytics process that had been added to the same physical server the previous Friday. It ran at 2:30 AM - well outside trading hours, so nobody had noticed it during the weekend. But the analytics process allocated 48GB of memory from NUMA node 0, where the trading engine lived, forcing the Linux kernel’s memory allocator to start pulling pages from NUMA node 1 for the trading engine’s new allocations. By the time markets opened, the trading engine’s hot data structures were scattered across both NUMA nodes. Cross-socket latency was everywhere.

This post is the guide I wrote afterward.

What NUMA Is and Why It Matters in Trading

Modern multi-socket servers do not have a single, uniform pool of memory. They have multiple banks of DRAM, each physically connected to a specific CPU socket via dedicated memory controllers. When a CPU core on socket 0 reads data from a DIMM attached to socket 0, that read goes through the local memory controller and completes in approximately 40-80ns, depending on the platform and DRAM timing parameters.

When that same CPU core reads data from a DIMM attached to socket 1, the request has to cross the inter-socket interconnect - Intel’s QPI (QuickPath Interconnect) on older hardware, UPI (Ultra Path Interconnect) on newer Xeon generations, and AMD’s Infinity Fabric on EPYC. The crossing adds approximately 60-100ns of latency per access on a dual-socket server, and more on quad-socket configurations.

Dual-socket NUMA topology:

Socket 0                          Socket 1
┌─────────────────────┐          ┌─────────────────────┐
│  Core 0-15          │          │  Core 16-31         │
│                     │          │                     │
│  L3 Cache (30MB)    │          │  L3 Cache (30MB)    │
│                     │          │                     │
│  Memory Controller  │◄────────►│  Memory Controller  │
│                     │  QPI/UPI │                     │
│  DIMM 0-7 (128GB)   │          │  DIMM 8-15 (128GB)  │
└─────────────────────┘          └─────────────────────┘
        │                                   │
      PCIe 0                             PCIe 1
   (NIC, NVMe)                       (GPU, other)

Local memory access:  Core 0 → DIMM 0   ~40-80ns
Remote memory access: Core 0 → DIMM 8   ~100-160ns
Ratio: ~2x penalty for cross-socket reads

For a strategy evaluation loop that touches 50-100 cache lines, each missing the on-chip caches, the difference between “all local” and “half remote” is roughly 2-5µs per iteration (25-50 remote accesses at 60-100ns extra each). At thousands of iterations per second, that is easily the difference between a competitive system and one that is consistently picked off by faster market participants.
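The per-iteration figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes every touched cache line misses the on-chip caches and that each remote access costs the 60-100ns extra quoted earlier; `remote_penalty_us` is a hypothetical helper name, not part of any tool.

```shell
# remote_penalty_us LINES REMOTE_FRACTION EXTRA_NS
# Extra latency per loop iteration, in microseconds, when a fraction of the
# touched cache lines is served from the remote node.
remote_penalty_us() {
    awk -v lines="$1" -v frac="$2" -v ns="$3" \
        'BEGIN { printf "%.2f\n", lines * frac * ns / 1000 }'
}

remote_penalty_us 50 0.5 60     # low end:  25 remote lines at 60ns extra
remote_penalty_us 100 0.5 100   # high end: 50 remote lines at 100ns extra
```

The two calls bracket the estimate at about 1.5µs and 5µs per iteration. Real loops land lower when part of the working set stays in cache.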

The 3 AM Pattern: How Remote Memory Creeps In

NUMA locality errors rarely happen at system startup. They accumulate over time, and they accelerate during quiet periods (3-4 AM), when batch jobs run and memory pressure increases.

Here is the sequence that burned us:

  1. Trading engine starts at 8 AM, allocates all structures on NUMA node 0 (libnuma was correctly configured). All memory is local. Latency is 22µs P99.

  2. Analytics process starts at 2:30 AM on the same physical server. It allocates 48GB. The server has 128GB per socket; after a day of operation, NUMA node 0 has maybe 80GB used (OS + trading engine steady-state). The analytics process’s first 48GB goes mostly to NUMA node 0.

  3. The analytics process creates memory pressure on NUMA node 0. The kernel’s kswapd starts reclaiming the least recently used pages, swapping anonymous ones out. Some of these are pages belonging to the trading engine that have not been touched for several hours - the historical order book data structures that are accessed once on startup and then sit dormant.

  4. Those dormant pages, when evicted and later faulted back in by the trading engine on re-access, are allocated wherever memory is available - which is now NUMA node 1.

  5. When markets open at 6 AM, the trading engine’s first accesses to any data structure that was evicted go cross-socket. The L3 cache is cold. Everything is slower.

The key insight is that page eviction is not loud. There is no error, no log line, no alert. The kernel just silently moves your data to a slower tier.
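Eviction is silent in the logs, but it does leave traces in kernel counters. A minimal sketch, assuming the `pswpout` and `pgscan_kswapd` counters are present in `/proc/vmstat` on your kernel (counter names vary across kernel versions): snapshot the file before and after the batch window and diff the two. `vmstat_delta` is a hypothetical helper name.

```shell
# vmstat_delta BEFORE_FILE AFTER_FILE
# Prints how much the swap and kswapd counters grew between two snapshots
# of /proc/vmstat.
vmstat_delta() {
    awk 'NR == FNR { before[$1] = $2; next }
         $1 ~ /^(pswpin|pswpout|pgscan_kswapd|pgsteal_kswapd)$/ {
             print $1, $2 - before[$1]
         }' "$1" "$2" | sort
}

# Demo with two made-up snapshots. On a live box, take them with
#   cat /proc/vmstat > before.txt    (and later)    cat /proc/vmstat > after.txt
demo_dir=$(mktemp -d)
printf 'nr_free_pages 5000\npswpout 10\npgscan_kswapd 100\n' > "$demo_dir/before"
printf 'nr_free_pages 4000\npswpout 25\npgscan_kswapd 400\n' > "$demo_dir/after"
vmstat_delta "$demo_dir/before" "$demo_dir/after"
rm -rf "$demo_dir"
```

Non-zero growth in either counter overnight means pages were scanned or swapped out while nobody was watching.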

Diagnosing NUMA Problems

The diagnostic workflow has three steps: identify the topology, measure the balance, find the hot misses.

Step 1: Understand your topology

# Basic topology summary
numactl --hardware
# Output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# node 0 size: 128946 MB
# node 0 free: 89234 MB
# node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# node 1 size: 128947 MB
# node 1 free: 71120 MB
# node distances:
# node   0   1
#   0:  10  21
#   1:  21  10

# Distance 10 = local, 21 = remote. These are relative values reported by firmware (ACPI SLIT); the 2.1x ratio approximates the NUMA penalty in relative terms.

# More detailed topology including L3 cache affinity (lstopo ships with the hwloc package)
lstopo --of txt
# Text-only equivalent from the same package:
hwloc-ls
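The same distance table is exposed per node under sysfs, which is convenient for scripts. A small sketch, assuming node IDs are contiguous so that column N of a distance row refers to node N; `numa_ratios` is a hypothetical helper name.

```shell
# numa_ratios NODE "DISTANCES"
# DISTANCES is the content of /sys/devices/system/node/node<NODE>/distance,
# e.g. "10 21". Prints each distance relative to the local one.
numa_ratios() {
    echo "$2" | awk -v self="$1" '{
        local_dist = $(self + 1)
        for (i = 1; i <= NF; i++)
            printf "node%d->node%d: %.1fx\n", self, i - 1, $i / local_dist
    }'
}

numa_ratios 0 "10 21"   # sample row matching the numactl output shown earlier

# On a live machine:
# for n in /sys/devices/system/node/node[0-9]*; do
#     numa_ratios "${n##*node}" "$(cat "$n/distance")"
# done
```

The ratios are firmware hints rather than measurements, so treat them as a sanity check on the topology, not as a latency number.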

Step 2: Measure the balance

# Check per-node memory statistics for your process
cat /proc/<pid>/numa_maps | head -50
# Output lines look like:
# 7f8b40000000 default file=/lib/x86_64/libc.so.6 mapped=847 mapmax=256 \
#     N0=623 N1=224 kernelpagesize_kB=4
#
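# N0=623 N1=224 means 623 pages on node 0, 224 pages on node 1.
# For a trading process that should be NUMA-local, N1 should be 0 on its private
# (heap/anon) mappings; shared library pages may sit wherever they were first loaded.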

# Simpler overview - per-node allocation for a process
numastat -p <pid>
# Output:
# Per-node process memory usage (in MBs) for PID <pid>
#                            Node 0          Node 1           Total
#                   --------------- --------------- ---------------
# Huge                        1024.0             0.0          1024.0
# Heap                         847.3           312.1          1159.4   # <-- 312MB on wrong node
# Stack                          0.1             0.0             0.1
# Private                      423.1           198.4           621.5
# -------
# Total                       2294.5           510.5          2805.0

When you see significant heap memory on the non-local node for your trading process, that is the smoking gun.
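If numastat is not installed, the per-node totals can be rebuilt from numa_maps directly. The sketch below sums the `N<node>=<pages>` fields and weights each mapping by its own `kernelpagesize_kB`, since huge-page mappings count in larger units. `numa_maps_summary` is a hypothetical helper name, and the sample lines are made up.

```shell
# numa_maps_summary: reads numa_maps lines on stdin and prints MB per node.
# Each mapping is weighted by its kernelpagesize_kB (4 if the field is absent).
numa_maps_summary() {
    awk '{
        page_kb = 4
        for (i = 1; i <= NF; i++)
            if ($i ~ /^kernelpagesize_kB=/) { split($i, kv, "="); page_kb = kv[2] }
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=/) {
                split($i, kv, "=")
                mb[substr(kv[1], 2)] += kv[2] * page_kb / 1024
            }
    }
    END { for (node in mb) printf "node%s %.1f MB\n", node, mb[node] }' | sort
}

# Sample input; on a live box: numa_maps_summary < /proc/<pid>/numa_maps
numa_maps_summary <<'EOF'
7f8b40000000 default anon=768 dirty=768 N0=512 N1=256 kernelpagesize_kB=4
7f8b80000000 bind:0 anon=2 dirty=2 N0=2 kernelpagesize_kB=2048
EOF
```

For the sample, node 0 holds 2 MB of 4kB pages plus 4 MB of huge pages, and node 1 holds 1 MB.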

Step 3: Find the hot cache lines that are causing misses

# perf c2c: cache-to-cache transfer (HITM) analysis (Linux 4.10+)
# It samples loads that hit a cache line modified in another core's cache,
# and separates hits in the local socket from hits in a remote socket
sudo perf c2c record -u -g --call-graph fp -p <pid> -- sleep 10
sudo perf c2c report --stdio -k vmlinux

# Key section to look at in the report:
# === Shared Data Cache Line Table ===
# Total Sampled Hits    : 14273
#
#  ----- HITM -----  ------- Store Refs ------  --------  Data address  ---------
#  RmtHitm    LclHitm  L1 Data    L1 No  0 Load   Count     Symbol      Shared Object
#  --------  --------  --------  ------  ------  ------  ------        ------
#    8127      0          0       3421   2725   8127  0x7f8b40001080  trading_engine

# 'RmtHitm' = Remote Hit Modified lines - these are the cross-NUMA cache bounces
# A high RmtHitm count with a specific address points directly to the problem structure.

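# To decode the address: it is a data address, so addr2line (which maps code
# addresses to source lines) does not apply. For static data, attach gdb:
# (gdb) info symbol 0x7f8b40001080
# For heap or mmap'd data, find the containing mapping in /proc/<pid>/maps and
# use the call graphs in the c2c report to identify the structure.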

In our case, perf c2c pointed directly at the instrument metadata table in the encoder - matching what we found from the LLC miss analysis described in The Anatomy of a Sub-50µs Trade.
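When the hot address is not a static symbol, the first question is which mapping it falls into. A minimal sketch that scans `/proc/<pid>/maps` lines for the range containing an address; `region_for_addr` is a hypothetical helper name and the sample mappings are made up.

```shell
# region_for_addr ADDRESS: reads /proc/<pid>/maps lines on stdin and prints
# the mapping that contains ADDRESS (hex, with a 0x prefix).
region_for_addr() {
    rfa_addr=$(( $1 ))
    while read -r rfa_range rfa_rest; do
        rfa_start=$(( 0x${rfa_range%-*} ))   # text before the dash
        rfa_end=$(( 0x${rfa_range#*-} ))     # text after the dash
        if [ "$rfa_addr" -ge "$rfa_start" ] && [ "$rfa_addr" -lt "$rfa_end" ]; then
            echo "$rfa_range $rfa_rest"
            return 0
        fi
    done
    return 1
}

# Sample input; on a live box: region_for_addr 0x7f8b40001080 < /proc/<pid>/maps
region_for_addr 0x7f8b40001080 <<'EOF'
00400000-00452000 r-xp 00000000 08:02 173521 /opt/trading/trading_engine
7f8b40000000-7f8b44000000 rw-p 00000000 00:00 0
EOF
```

A match with no path at the end is anonymous memory (malloc arenas, mmap'd buffers), so the structure has to be identified from the allocation site or the c2c call graph rather than from a symbol table.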

The numastat Workflow for Continuous Monitoring

numactl and numastat give you snapshot diagnostics. For production monitoring, you want continuous tracking.

# Quick system-wide NUMA health check
numastat
# Output:
#                            node0           node1
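# numa_hit              4823940723      2107832941   # intended for this node and allocated here
# numa_miss               23847123       387291847   # <-- allocated here although another node was intended
# numa_foreign            387291847       23847123   # intended for this node but allocated elsewhere
# interleave_hit               1234            987
# local_node            4823940723      2107832941
# other_node              23847123       387291847

# numa_miss on node1 (= numa_foreign on node0) is high: allocations meant for node 0 spilled to node 1 = the analytics process hit us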

# Monitor live
watch -n 1 'numastat | grep -E "numa_miss|numa_hit"'

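A healthy trading server should show a numa_foreign count on the trading process’s node (mirrored as numa_miss on the other node) that is orders of magnitude lower than numa_hit. The counters are cumulative since boot, so watch the rate of change: if numa_foreign on the trading node is growing faster than numa_hit, allocations intended for that node are spilling to the other one.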
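One way to put a number on this for the trading node is the share of allocations intended for that node that were served elsewhere. The sketch below assumes the per-node counters in `/sys/devices/system/node/node<N>/numastat`, where numa_foreign counts allocations that were intended for the node but landed on another one (the same pages show up as numa_miss on the node that absorbed them). `spill_pct` is a hypothetical helper name and the sample counters are made up.

```shell
# spill_pct: reads a per-node numastat file on stdin and prints
#   numa_foreign / (numa_hit + numa_foreign) * 100
# i.e. the percentage of allocations meant for this node that went elsewhere.
spill_pct() {
    awk '$1 == "numa_hit"     { hit = $2 }
         $1 == "numa_foreign" { foreign = $2 }
         END {
             if (hit + foreign > 0) printf "%.2f\n", 100 * foreign / (hit + foreign)
             else print "0.00"
         }'
}

# Sample counters; on a live box: spill_pct < /sys/devices/system/node/node0/numastat
spill_pct <<'EOF'
numa_hit 9000
numa_miss 10
numa_foreign 1000
interleave_hit 0
local_node 9000
other_node 10
EOF
```

Because the counters only grow, this is an average since boot. For alerting, sample twice and apply the same ratio to the deltas.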

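# Identify processes that hold significant memory on both nodes (two-node box)
for pid in /proc/[0-9]*/; do
    pid_num=$(basename "$pid")
    cmdline=$(cat "$pid/cmdline" 2>/dev/null | tr '\0' ' ' | head -c 30)
    # Last line of numastat -p: "Total <node0 MB> <node1 MB> <total MB>"
    total_line=$(numastat -p "$pid_num" 2>/dev/null | tail -1)
    # Flag the process if it has at least 100 MB on each node
    if echo "$total_line" | awk '$1 == "Total" && $2 >= 100 && $3 >= 100 { hit = 1 } END { exit !hit }'; then
        echo "$pid_num: $cmdline"
        echo "$total_line"
        echo "---"
    fi
done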

Fixing NUMA Problems

Once you have identified the issue, the fix hierarchy is:

Fix 1: Bind the process at startup (prevents future misallocation)

# Hard binding: process can only use node 0 CPUs and node 0 memory
numactl --cpunodebind=0 --membind=0 ./trading_engine

# Preferred (not strict): use node 0 memory when available, fall back to node 1
numactl --cpunodebind=0 --preferred=0 ./trading_engine

Use --membind (strict) rather than --preferred for trading processes. --preferred is a hint; the kernel will fall back to other nodes under memory pressure. That is exactly the pressure that happens at 3 AM. With strict binding, exhausting node 0 ends in reclaim and ultimately the OOM killer rather than a quiet spill to node 1, which is a loud failure you can detect - much better than a silent latency regression.
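Because strict binding turns node exhaustion into a hard failure, a launch wrapper may want to check headroom on the target node before starting the engine. A sketch under stated assumptions: `node_free_mb` is a hypothetical helper that parses `numactl --hardware` output, and `required_mb` is an illustrative working-set size, not a measured one.

```shell
# node_free_mb NODE: reads `numactl --hardware` output on stdin and prints
# the free memory of NODE in MB.
node_free_mb() {
    awk -v node="$1" '$1 == "node" && $2 == node && $3 == "free:" { print $4 }'
}

# Sample output; on a live box: free_mb=$(numactl --hardware | node_free_mb 0)
sample='available: 2 nodes (0-1)
node 0 size: 128946 MB
node 0 free: 89234 MB
node 1 size: 128947 MB
node 1 free: 71120 MB'

required_mb=4096   # hypothetical steady-state footprint of the engine
free_mb=$(printf '%s\n' "$sample" | node_free_mb 0)

# Require twice the expected footprint before launching with --membind=0
if [ "$free_mb" -lt $((required_mb * 2)) ]; then
    echo "refusing to start: only ${free_mb} MB free on node 0" >&2
else
    echo "node 0 has ${free_mb} MB free, ok to launch with --membind=0"
fi
```

Free memory excludes reclaimable page cache, so this check is conservative. That is usually the right bias for a process that must never spill.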

Fix 2: Bind memory at the application level using libnuma

#define _GNU_SOURCE            /* for sched_getcpu() */
#include <numa.h>              /* link with -lnuma */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define TRADING_NUMA_NODE 0    /* node that hosts the trading engine */

void trading_engine_init(void) {
    /* Fail fast if NUMA is not available or topology is wrong */
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        exit(1);
    }

    /* Verify we're running on the correct node */
    int my_node = numa_node_of_cpu(sched_getcpu());
    if (my_node != TRADING_NUMA_NODE) {
        fprintf(stderr, "Wrong NUMA node: expected %d, got %d\n",
                TRADING_NUMA_NODE, my_node);
        exit(1);
    }

    /* Set strict allocation policy (MPOL_BIND) for this thread;
       threads created afterwards inherit it */
    struct bitmask *node_mask = numa_allocate_nodemask();
    numa_bitmask_setbit(node_mask, TRADING_NUMA_NODE);
    numa_set_membind(node_mask);
    numa_free_nodemask(node_mask);

    /* Pages first touched after this point come from TRADING_NUMA_NODE or the allocation fails */
}

Fix 3: Isolate the analytics process to NUMA node 1

# Move the analytics process to node 1 - it has no latency requirements
numactl --cpunodebind=1 --membind=1 /opt/analytics/run_analytics.sh

This is the fix we actually deployed: the analytics process was moved to NUMA node 1, and trading was strict-bound to node 0. Problem permanently resolved.

Fix 4: Use a dedicated physical server for anything that allocates significant memory

This is the proper solution. Mixing latency-sensitive and batch-workload processes on the same physical server is an operational risk that NUMA binding only partially mitigates. At Gemini, the rule was simple: nothing that is not directly involved in the trade path runs on trading hardware.

How This Breaks in Production

1. --membind causes OOM on node 0 during memory pressure. If you use strict binding and a memory leak or unexpected allocation causes node 0 to fill up, your process will get SIGKILL instead of gracefully allocating from the other node. This is the correct behavior - you want a loud failure, not a silent slowdown - but you must have memory headroom (plan for 2x your expected steady-state usage on node 0).

2. Huge pages allocated before numa policy is set. Pages are placed according to the policy in effect when they are first faulted in, so if you mmap(MAP_HUGETLB) and touch the region (or use MAP_POPULATE) before calling numa_set_membind(), those huge pages may be allocated on whichever node the allocator chooses. Always set the NUMA policy before the first allocation. In C++, this means doing it before any static initializers that allocate memory - for example in a dedicated __attribute__((constructor(N))) function whose priority number N is lower than that of any allocating constructor (lower numbers run first; 0-100 are reserved for the implementation).

3. Kernel memory mapped by the vDSO lives on one node. The vDSO data page (used for fast gettimeofday() and clock_gettime()) is a single kernel page shared by every process’s address space, and it typically sits on node 0 because it is set up at boot. If your trading process is on node 1, a call to clock_gettime(CLOCK_REALTIME) crosses the NUMA boundary whenever that data is not in cache, which happens after each kernel timekeeping update. This is one reason to use RDTSC directly for intra-host timing: the TSC is a per-CPU register, no memory access required.

4. Automatic NUMA balancing. The kernel has an optional NUMA balancing feature (enabled when /proc/sys/kernel/numa_balancing is 1) that periodically scans process memory and migrates pages to the node that accesses them most. This is helpful for general workloads but catastrophic for trading: the scanning and migration cause NUMA hinting faults and TLB flushes, each adding microseconds of latency. Always disable it: echo 0 > /proc/sys/kernel/numa_balancing, and persist the setting as kernel.numa_balancing=0 in your sysctl configuration.

5. NUMA interleaving set by a previous process. If your trading engine is started by a wrapper script that was itself launched under numactl --interleave=all (for example, left over from memory bandwidth testing), the trading process inherits that policy across fork() and exec(), and all your allocations will be interleaved across both nodes. Check the active policy at startup: the second field of each line in /proc/<pid>/numa_maps names the policy (default, bind:0, interleave:0-1, and so on). If you see interleave there, or N0 and N1 counts roughly equal across all regions, you are running with interleave policy.

6. Memory migration during live trading. move_pages() and migrate_pages() can migrate another process’s pages between NUMA nodes, and mbind(..., MPOL_MF_MOVE) does the same from inside the process. If any monitoring agent or profiling tool triggers these on your trading process, the migration causes TLB invalidations and brief latency spikes. This has happened to us with a well-meaning ops engineer running migratepages against a live trading process to pull its memory back to node 0 (numactl itself cannot re-bind a running PID) - the intent was correct but the execution caused a 200-500µs latency hiccup as pages were physically moved.
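The inherited-policy failure in item 5 lends itself to a startup check. A minimal sketch that counts mappings per memory policy using the second field of each numa_maps line; `policy_histogram` is a hypothetical helper name and the sample lines are made up.

```shell
# policy_histogram: reads numa_maps lines on stdin and counts mappings per
# memory policy (the second field of each line).
policy_histogram() {
    awk '{ count[$2]++ } END { for (p in count) print count[p], p }' | sort -k2
}

# Sample input; on a live box: policy_histogram < /proc/<pid>/numa_maps
policy_histogram <<'EOF'
00400000 bind:0 file=/opt/trading/trading_engine mapped=120 N0=120 kernelpagesize_kB=4
7f8b40000000 bind:0 anon=768 dirty=768 N0=768 kernelpagesize_kB=4
7f8b80000000 interleave:0-1 anon=512 dirty=512 N0=256 N1=256 kernelpagesize_kB=4
EOF
```

For a strict-bound engine, anything other than the expected bind policy in this output is worth failing the startup on. The same preflight is a natural place to confirm that /proc/sys/kernel/numa_balancing reads 0.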

The Permanent Fix: Make NUMA Locality Verifiable

The only way to know your NUMA configuration has not drifted is to measure it continuously.

# /etc/cron.d/numa-health-check - runs every minute
* * * * * root /opt/trading/scripts/check_numa_health.sh

# /opt/trading/scripts/check_numa_health.sh
#!/bin/bash
TRADING_PID=$(pgrep -ox trading_engine)   # -o: oldest match, so we get a single PID
if [ -z "$TRADING_PID" ]; then exit 0; fi

# Alert when more than 5% of trading process memory is on the wrong node.
# The last line of numastat -p is "Total <node0 MB> <node1 MB> <total MB>" on a two-node box.
TOTAL_LINE=$(numastat -p "$TRADING_PID" 2>/dev/null | tail -1)
TOTAL=$(echo "$TOTAL_LINE" | awk '{print $NF}')
REMOTE=$(echo "$TOTAL_LINE" | awk '{print $3}')   # node 1 column

if [ -n "$TOTAL" ] && [ -n "$REMOTE" ]; then
    PCT=$(echo "scale=2; $REMOTE * 100 / $TOTAL" | bc)
    if (( $(echo "$PCT > 5" | bc -l) )); then
        echo "NUMA_ALERT: $PCT% of trading_engine memory is remote (PID $TRADING_PID)" \
            | logger -p user.crit -t numa-health
        # Or push to your alerting system
    fi
fi
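The check above hardcodes the node 1 column, which only holds on a two-socket machine with the engine on node 0. A sketch of a more general calculation, assuming the Total line of `numastat -p` has one column per node followed by the grand total; `remote_pct` is a hypothetical helper name.

```shell
# remote_pct LOCAL_NODE: reads `numastat -p <pid>` output on stdin and prints
# the percentage of the process's memory that is NOT on LOCAL_NODE.
remote_pct() {
    awk -v local_node="$1" '$1 == "Total" {
        total = $NF
        local_mb = $(local_node + 2)   # column 1 is the label, node 0 is column 2
        if (total > 0) printf "%.1f\n", 100 * (total - local_mb) / total
    }'
}

# Total line from the numastat -p sample shown earlier in this post
printf 'Total 2294.5 510.5 2805.0\n' | remote_pct 0

# On a live box: numastat -p "$TRADING_PID" | remote_pct 0
```

For the sample line, 510.5 MB of 2805.0 MB is off node 0, which is about 18.2% and well past any sensible alert threshold.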

Related reading: The Anatomy of a Sub-50µs Trade shows where NUMA fits in the full latency budget. CPU Pinning, isolcpus, and nohz_full covers the CPU-affinity side of the same problem. Huge Pages Done Right covers the interaction between huge pages and NUMA policy - they interact in non-obvious ways. Linux Tunable Drift covers how NUMA configuration gets reset after kernel updates.
