CPU Pinning, isolcpus, and nohz_full: Building a Quiet Core for Latency-Critical Code
How to build a genuinely quiet CPU core for HFT using isolcpus, nohz_full, rcu_nocbs, and proper IRQ migration - with the grub cmdline that actually works.
The first time I got a genuinely quiet core at Gemini, I stared at the latency histogram for about ten seconds before trusting it. The P99 and P50 were within 200ns of each other. On a kernel with no special configuration, P99 is typically 10-20x the P50 on a loaded system.
The core was not overclocked. The strategy was not simpler. We had not changed a line of application code. We had changed three kernel boot parameters, moved IRQs off the core, and killed one daemon. That was it.
This post documents exactly what we did, why each piece is necessary, and what happens if you skip any of them.
The Three Sources of Noise on a Linux CPU Core
When Linux runs on a CPU core with default configuration, three things will regularly interrupt your application code regardless of what you are doing:
Noise source 1: The timer tick. The kernel’s hardware APIC timer fires every 1ms (CONFIG_HZ=1000) or 4ms (CONFIG_HZ=250). This interrupt handler updates jiffies, runs the scheduler’s load-balancing logic, and checks whether the current process has exhausted its time slice. The interrupt takes roughly 1-5µs depending on what jiffies maintenance triggers. This happens even if you own the whole core.
Noise source 2: RCU callbacks. Linux uses Read-Copy-Update (RCU) as a low-overhead synchronization mechanism for kernel data structures. When a writer updates an RCU-protected structure, it defers freeing the old version until all current readers finish. Those deferred callbacks run on every CPU, including yours. On a busy kernel, these callbacks fire every few milliseconds and can take 10-50µs.
Noise source 3: Scheduler load balancing. Even if your process runs exclusively on core 4, the scheduler periodically scans all cores looking for imbalances and may decide to migrate other tasks onto your core. isolcpus prevents this, but only this - it does nothing about the timer tick or RCU.
The common mistake is to add isolcpus=4-7 to the kernel command line and think the job is done. You have eliminated noise source 3. Sources 1 and 2 are still fully active.
The Complete Solution: Three Parameters That Work Together
# /etc/default/grub - the kernel command line additions
isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7
isolcpus=4-7 removes cores 4-7 from the scheduler’s load-balancing domains. No user process and no unbound kernel thread gets scheduled there unless you explicitly place it with taskset or sched_setaffinity. Per-CPU kernel threads (ksoftirqd, migration, some kworkers) still exist on each isolated core. This eliminates noise source 3.
nohz_full=4-7 enables tickless operation on cores 4-7 (the kernel must be built with CONFIG_NO_HZ_FULL=y). When a core in nohz_full has exactly one runnable task (yours), the periodic timer tick is stopped. No more 1ms or 4ms timer interrupts. The tick resumes if a second task becomes runnable on that core, which isolcpus makes unlikely. This eliminates noise source 1.
rcu_nocbs=4-7 offloads RCU callback processing from cores 4-7 to dedicated callback threads that run on the non-isolated cores. Your isolated core never processes RCU callbacks. This eliminates noise source 2.
These three parameters must be used together. nohz_full without rcu_nocbs still fires RCU callbacks on your core, which forces brief re-enablement of the timer tick to process them. isolcpus alone gives you scheduler isolation but still hits you with ticks and RCU. All three together produce a core that the kernel largely ignores.
# Full recommended grub configuration
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7 \
intel_idle.max_cstate=0 processor.max_cstate=0 \
idle=poll \
irqaffinity=0-3 \
transparent_hugepage=never"
# Rebuild grub and reboot
sudo update-grub
sudo reboot
# Verify after reboot
cat /sys/devices/system/cpu/isolated
# Expected: 4-7
cat /sys/devices/system/cpu/nohz_full
# Expected: 4-7
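Because the three parameters only pay off when they cover the same cores, it can help to check them together rather than one sysfs file at a time. The following is a minimal sketch: cmdline_value and check_isolation_params are hypothetical helper names, and the comparison is a literal string match, so it assumes the plain list form used above (no isolcpus flags such as managed_irq).

```shell
# Sketch: confirm isolcpus, nohz_full and rcu_nocbs name the same CPU list.
# cmdline_value and check_isolation_params are hypothetical helpers, not standard tools.

# Print the value of KEY=VALUE from a kernel command line string; return 1 if absent.
cmdline_value() {
    local key="$1" cmdline="$2" word
    for word in $cmdline; do
        case "$word" in
            "$key"=*) printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Return 0 only if all three parameters are present and identical (literal comparison).
check_isolation_params() {
    local cmdline="$1" iso nohz rcu
    iso=$(cmdline_value isolcpus "$cmdline")   || { echo "MISSING: isolcpus"; return 1; }
    nohz=$(cmdline_value nohz_full "$cmdline") || { echo "MISSING: nohz_full"; return 1; }
    rcu=$(cmdline_value rcu_nocbs "$cmdline")  || { echo "MISSING: rcu_nocbs"; return 1; }
    if [ "$iso" = "$nohz" ] && [ "$iso" = "$rcu" ]; then
        echo "OK: isolcpus, nohz_full and rcu_nocbs all set to $iso"
    else
        echo "MISMATCH: isolcpus=$iso nohz_full=$nohz rcu_nocbs=$rcu"
        return 1
    fi
}

# Usage on a live system:
#   check_isolation_params "$(cat /proc/cmdline)"
```

Passing the command line in as an argument keeps the check usable against a saved /proc/cmdline from another host.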
The irqaffinity Parameter: What Most Guides Miss
Moving IRQs off your trading core is separate from isolcpus. The kernel parameter irqaffinity=0-3 sets the default affinity mask for all new IRQs to cores 0-3, keeping them away from cores 4-7. However, this is a default - it does not retroactively move IRQs that were assigned before the parameter took effect, and it does not override IRQs that explicitly request specific affinities.
After every boot, you must verify that your NIC’s interrupt handlers are not on the trading cores:
# Check current IRQ distribution
head -1 /proc/interrupts # header row with the CPU columns
# Look for your NIC's RX queues:
grep -E "eth0|sfn8522|ens" /proc/interrupts
# Output format: IRQ#: then one count per CPU (CPU0 ... CPU7), then interrupt type and device name
# You want the counts for CPU4-CPU7 to be 0 for your NIC's RX queues
# Move a specific IRQ off core 4 (example IRQ 120, hex 0x0f = cores 0-3)
echo "0f" > /proc/irq/120/smp_affinity
# Script to move ALL IRQs off isolated cores
HOUSEKEEPING_MASK="0f" # hex bitmask for cores 0-3, the non-isolated cores that should take IRQs
for irq_dir in /proc/irq/[0-9]*/; do
    irq=$(basename "$irq_dir")
    # Skip the 0 directory
    [ "$irq" = "0" ] && continue
    # Skip IRQs that expose no readable affinity file
    [ -r "$irq_dir/smp_affinity" ] || continue
    # Rewrite every IRQ unconditionally; the kernel rejects the write for
    # IRQs whose affinity cannot be changed, and those failures are ignored
    echo "$HOUSEKEEPING_MASK" > "$irq_dir/smp_affinity" 2>/dev/null || true
done
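The smp_affinity file takes a hexadecimal bitmask. Core 0 = bit 0 = 0x01, core 1 = 0x02, core 2 = 0x04, core 3 = 0x08, so cores 0-3 = 0x0f. On systems with more than 32 cores the mask is written as comma-separated 32-bit groups (for example 00000001,00000000 for core 32), which is easy to get wrong. There, smp_affinity_list, which takes a CPU list such as 0-3, is the more convenient interface.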
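Computing the mask by hand gets error-prone once the core list is not a simple 0-3. As a sketch of the bit arithmetic, the hypothetical helper below turns a CPU list into the hex string smp_affinity expects. It relies on 64-bit shell arithmetic and is therefore limited to CPUs 0-62; larger machines are better served by smp_affinity_list.

```shell
# cpulist_to_mask: hypothetical helper that converts a CPU list such as "0-3,8"
# into a hex bitmask. Limited to CPUs 0-62 (64-bit signed shell arithmetic).
cpulist_to_mask() {
    local mask=0 part lo hi cpu
    # Split the list on commas without touching the caller's IFS.
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        lo=${part%-*}   # "4-7" -> 4, "2" -> 2
        hi=${part#*-}   # "4-7" -> 7, "2" -> 2
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            if [ "$cpu" -gt 62 ]; then
                echo "cpulist_to_mask: CPU $cpu out of range (0-62)" >&2
                return 1
            fi
            mask=$((mask | (1 << cpu)))   # set the bit for this CPU
            cpu=$((cpu + 1))
        done
    done
    printf '%x\n' "$mask"
}

cpulist_to_mask 0-3      # prints f
cpulist_to_mask 4-7      # prints f0
cpulist_to_mask 0,2,4-5  # prints 35
```

The first result matches the 0f used in the script above; the second is the mask of the isolated cores, which is the one to keep out of smp_affinity.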
Using taskset and cgroup cpusets to Place Processes
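With isolated cores set up, you place your trading process on them explicitly. Prefer pinning each latency-critical thread to a single core: the scheduler does not load-balance across cores listed in isolcpus, so a multi-threaded process given the whole 4-7 range tends to leave its threads stacked on one core.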
# taskset: bind to cores 4-7
taskset -c 4-7 ./trading_engine
# For even more control - a specific single core per strategy
taskset -c 4 ./strategy_btc
taskset -c 5 ./strategy_eth
taskset -c 6 ./strategy_sol
# Core 7 reserved for the order encoder thread
# Verify the actual affinity (-c prints a CPU list instead of a hex mask)
taskset -cp <pid>
# Output: pid <pid>'s current affinity list: 4
# numactl for combined CPU + NUMA binding (preferred for trading)
numactl --cpunodebind=0 --membind=0 taskset -c 4 ./trading_engine
For managing multiple processes, cgroup cpusets give you a clean administrative boundary:
# Create a cpuset for trading processes (these paths are the cgroup v1 cpuset hierarchy)
mkdir /sys/fs/cgroup/cpuset/trading
echo "4-7" > /sys/fs/cgroup/cpuset/trading/cpuset.cpus
echo "0" > /sys/fs/cgroup/cpuset/trading/cpuset.mems # NUMA node 0
echo "1" > /sys/fs/cgroup/cpuset/trading/cpuset.cpu_exclusive # exclusive
# Move a process into the cgroup
echo $TRADING_PID > /sys/fs/cgroup/cpuset/trading/tasks
# Create a cpuset for everything else
mkdir /sys/fs/cgroup/cpuset/system
echo "0-3" > /sys/fs/cgroup/cpuset/system/cpuset.cpus
echo "0-1" > /sys/fs/cgroup/cpuset/system/cpuset.mems
# Move all non-trading processes here
The cgroup cpuset approach is more maintainable than per-process taskset calls because the boundary is enforced by the kernel: child processes inherit the cpuset across fork(), and sched_setaffinity() cannot select a CPU outside it.
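The paths in the cpuset example belong to the cgroup v1 hierarchy. On hosts that mount the unified (v2) hierarchy the file names differ, so here is a rough equivalent as a sketch. make_trading_cpuset is a hypothetical helper, the group name trading simply mirrors the example above, and the root directory is passed in so the function can be dry-run against a scratch directory. Exclusivity in v2 is expressed through cpuset.cpus.partition, which is left out here.

```shell
# Sketch of the trading cpuset on cgroup v2 (unified hierarchy).
# make_trading_cpuset is a hypothetical helper; "trading" mirrors the v1 example.
# Usage: make_trading_cpuset CGROUP_ROOT CPU_LIST MEM_NODES
make_trading_cpuset() {
    local root="$1" cpus="$2" mems="$3"
    # Enable the cpuset controller for child groups of the root.
    echo "+cpuset" > "$root/cgroup.subtree_control" || return 1
    mkdir -p "$root/trading" || return 1
    echo "$cpus" > "$root/trading/cpuset.cpus" || return 1
    echo "$mems" > "$root/trading/cpuset.mems" || return 1
}

# On a live v2 system, as root:
#   make_trading_cpuset /sys/fs/cgroup 4-7 0
#   echo "$TRADING_PID" > /sys/fs/cgroup/trading/cgroup.procs
```

In v2 a process is moved by writing its PID to cgroup.procs; the v1 tasks file does not exist there.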
The Production Story: What Breaks When You Forget IRQs
At Gemini, we had a runbook for deploying to new hardware. The runbook covered isolcpus, nohz_full, rcu_nocbs, and huge pages. It did not cover IRQ affinity configuration - it was assumed to be handled by the irqaffinity=0-3 kernel parameter.
On one particular server deployment, the NIC driver was loaded by udev before the kernel’s IRQ affinity defaults were fully applied. The result: four of the eight RX queues had their IRQ handlers sitting on core 4 - our primary market data processing core.
The latency profile was not uniformly bad. P50 was fine: 18µs. P99 was 340µs. The variance was enormous. The IRQ handler was firing on core 4 every time a burst of packets arrived, evicting L1 cache contents, stalling the pipeline, and exiting - leaving the strategy thread to refill its cache from L2/L3.
We found it within ten minutes because we had cat /proc/interrupts in our standard boot verification script:
# Boot verification script - excerpt
echo "=== IRQ Distribution Check ==="
echo "Trading cores (4-7) should have 0 counts on all NIC IRQs:"
grep -E "eth0|sfn" /proc/interrupts | awk '{
    # $1 is the IRQ number, $2-$5 are CPU0-CPU3, $6-$9 are CPU4-CPU7
    core4=$6; core5=$7; core6=$8; core7=$9;
    if (core4+core5+core6+core7 > 0) {
        print "FAIL: IRQ "$1" has counts on trading cores: "core4" "core5" "core6" "core7
    }
}'
echo ""
echo "=== Verified isolated cores ==="
cat /sys/devices/system/cpu/isolated
echo ""
echo "=== Tickless cores (should list 4-7) ==="
cat /sys/devices/system/cpu/nohz_full
echo "=== Local timer interrupts (LOC): the CPU4-CPU7 columns should barely advance between two runs ==="
grep "LOC:" /proc/interrupts
The fix was to add explicit IRQ affinity configuration to the startup script, run after udev completes, before the trading engine starts.
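The awk check in the boot script hard-codes the field positions for an eight-core box. A variant that reads the column positions from the header row carries over to other core counts. This is a sketch only: nic_irqs_on_cpus is a hypothetical helper, and like the original it looks at counts accumulated since boot, so an IRQ that was moved late still reports a failure until the next reboot.

```shell
# nic_irqs_on_cpus: hypothetical helper. Reads a /proc/interrupts-style file and
# reports every IRQ line matching PATTERN that has a non-zero count on a listed CPU.
# Usage: nic_irqs_on_cpus FILE PATTERN CPU [CPU...]   (exit status 1 if anything is found)
nic_irqs_on_cpus() {
    local file="$1" pattern="$2"
    shift 2
    awk -v pattern="$pattern" -v cpus="$*" '
        BEGIN { n = split(cpus, want, " ") }
        NR == 1 {
            # Header row has no IRQ column, so CPUn in header field i is data field i+1.
            for (i = 1; i <= NF; i++) col[$i] = i + 1
            next
        }
        $0 ~ pattern {
            irq = $1
            sub(/:$/, "", irq)
            for (k = 1; k <= n; k++) {
                c = col["CPU" want[k]]
                if (c && $c + 0 > 0) {
                    printf "FAIL: IRQ %s has %d interrupts on CPU%s\n", irq, $c, want[k]
                    bad = 1
                }
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$file"
}

# Usage on a live system:
#   nic_irqs_on_cpus /proc/interrupts "eth0|sfn" 4 5 6 7 || echo "fix IRQ affinity before starting"
```

Returning a non-zero status lets the startup script refuse to launch the trading engine until the affinities are corrected.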
Verifying Quiet Cores with a Jitter Benchmark
The only way to confirm your isolation is working is to measure it directly. Here is a simple jitter measurement program:
#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#define ITERATIONS 1000000
static inline uint64_t rdtsc(void) {
uint32_t lo, hi;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ((uint64_t)hi << 32) | lo;
}
int compare_u64(const void *a, const void *b) {
uint64_t x = *(uint64_t*)a, y = *(uint64_t*)b;
return (x > y) - (x < y);
}
int main(void) {
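static uint64_t deltas[ITERATIONS]; /* static: an 8 MB local array would nearly fill the default 8 MiB stack */
memset(deltas, 0xff, sizeof deltas); /* touch every page up front so page faults do not show up as jitter */
uint64_t prev = rdtsc();
for (int i = 0; i < ITERATIONS; i++) {
/* Tight loop - just reading TSC repeatedly */
/* Any deviation from the baseline interval is kernel interruption */
for (volatile int j = 0; j < 100; j++) {} /* small fixed amount of work that sets the baseline interval */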
uint64_t now = rdtsc();
deltas[i] = now - prev;
prev = now;
}
qsort(deltas, ITERATIONS, sizeof(uint64_t), compare_u64);
double tsc_hz = 3.0e9; /* calibrate against your CPU frequency */
printf("P50: %6.1f ns\n", deltas[ITERATIONS / 2] * 1e9 / tsc_hz);
printf("P99: %6.1f ns\n", deltas[(int)(ITERATIONS * 0.99)] * 1e9 / tsc_hz);
printf("P99.9: %6.1f ns\n", deltas[(int)(ITERATIONS * 0.999)] * 1e9 / tsc_hz);
printf("Max: %6.1f ns\n", deltas[ITERATIONS - 1] * 1e9 / tsc_hz);
return 0;
}
# Compile and run on a non-isolated core (baseline)
gcc -O2 -o jitter jitter.c
taskset -c 0 ./jitter
# Expected output (non-isolated):
# P50: 85.3 ns
# P99: 1247.6 ns
# P99.9: 4823.1 ns
# Max: 62341.2 ns
# Run on an isolated + nohz_full + rcu_nocbs core
taskset -c 4 ./jitter
# Expected output (properly isolated):
# P50: 84.8 ns
# P99: 91.2 ns
# P99.9: 104.7 ns
# Max: 387.4 ns ← occasional NMI or hardware interrupt; acceptable
The dramatic compression of P99 and P99.9 - from 1247ns to 91ns, and from 4823ns to 104ns - is the payoff from isolation. The max being ~400ns (rather than 62µs) means even the worst-case interrupt on an isolated core is acceptable.
Additional Daemons to Kill
These services run by default on most Linux distributions and will interrupt your isolated cores if not disabled:
# irqbalance - the main offender; rebalances IRQs between CPUs continuously
systemctl disable irqbalance
systemctl stop irqbalance
# tuned - a performance tuning daemon that periodically resets kernel parameters
# (including IRQ affinities and CPU governors)
systemctl disable tuned
systemctl stop tuned
# cpuspeed / cpufreqd - frequency scaling daemons
systemctl disable cpuspeed 2>/dev/null || true
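# watchdog (NMI watchdog) - fires NMI interrupts to detect hangs
# Can still fire on isolated cores if not disabled
# These writes do not survive a reboot; persist them as kernel.watchdog=0 and
# kernel.nmi_watchdog=0 in a file under /etc/sysctl.d/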
echo 0 > /proc/sys/kernel/watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
# Verify watchdog is off
cat /proc/sys/kernel/watchdog
# Should show: 0
# audit daemon - writes to kernel audit log, causes occasional softirq
systemctl disable auditd
The tuned service is particularly dangerous for trading systems. It has profiles like throughput-performance and latency-performance that will reset IRQ affinities, CPU governors, and kernel parameters to match the active profile - silently undoing your manual configuration. If you cannot remove the package (because another team requires it), at minimum disable the service and set the parameters at boot via a dedicated startup script that systemd orders after tuned (After=tuned.service), so that your settings are applied last.
How This Breaks in Production
1. nohz_full silently disabled because the core has more than one task. The tickless behavior only activates when exactly one runnable task is on the core. If a kernel thread migrates to your isolated core (some kernel threads ignore isolcpus), the tick resumes. Kernel threads that can run on isolated cores include kthreads launched by some device drivers and kworker threads for certain subsystems. Check: ps -eLo psr,comm | awk '$1==4' to see everything running on core 4. You may be surprised.
2. GRUB command line not applied after update. On Ubuntu/Debian, update-grub (a wrapper around grub-mkconfig) regenerates the GRUB configuration from /etc/default/grub and any drop-in files under /etc/default/grub.d. After some OS package updates (particularly grub-common), the configuration is regenerated, and your custom parameters may be dropped, overridden by a drop-in that reassigns GRUB_CMDLINE_LINUX_DEFAULT, or placed after conflicting parameters from the standard template. After any OS update, verify that cat /proc/cmdline shows your isolation parameters.
3. nohz_full interacting badly with POSIX CPU timers. Timers that measure consumed CPU time, such as timer_create() with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID, or setitimer() with ITIMER_VIRTUAL or ITIMER_PROF, depend on the periodic tick to account that time. While one is armed, the kernel keeps the tick running on the core, so nohz_full is defeated without any error message. Trading engines that use such timers for heartbeat checking quietly lose their tickless core. Use timerfd_create() with CLOCK_MONOTONIC instead, which is driven by the high-resolution timer subsystem and works without the tick.
4. rcu_nocbs offload thread pinned to wrong core. The RCU callback offload threads (rcuop/4, rcuog/4, and similar; the exact names vary by kernel version) run on non-isolated cores by default. But if isolcpus is set inconsistently with rcu_nocbs (e.g., you isolate cores 4-7 but only set rcu_nocbs=4-5), callbacks for cores 6-7 are not offloaded and still run on those cores. Verify: ps -eLo psr,comm | grep rcu and confirm the offload threads are on cores 0-3.
5. SMI (System Management Interrupt) bypasses all isolation. SMIs are generated by the hardware and are completely invisible to the OS - they are not reflected in /proc/interrupts, not affected by isolcpus, and cannot be disabled by the OS. An SMI on your trading core takes 100-300µs, during which your trading engine appears to freeze. Sources include: BIOS thermal management, ECC memory scrubbing, power management. Detection: on Intel CPUs, read the SMI counter in MSR 0x34 (MSR_SMI_COUNT), for example with rdmsr -p 4 0x34 from msr-tools; instrument this in your latency measurement infrastructure. If you see occasional unexplained 200-300µs spikes that do not correlate with any software event, SMI is the likely cause.
6. Frequency scaling on isolated cores. isolcpus and nohz_full do not set the CPU frequency governor. If your isolated cores are in powersave mode (the default on many distributions), the CPU will slow down when idle between packets. For a polling-mode application, this is not idle in the traditional sense - but if you use cpu_relax() or brief sleeping, the governor may drop the frequency. Set the governor to performance and verify: cat /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor.
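For failure mode 1, the raw ps listing for a core always includes its per-CPU kernel threads (ksoftirqd/4, migration/4 and so on), which cannot be moved and tend to bury the interesting entries. The sketch below filters those out. unexpected_on_cpu is a hypothetical helper that reads the psr,comm pairs printed by ps on stdin; per-CPU kworkers such as kworker/4:1 are deliberately still reported, since they are one of the noise sources named above.

```shell
# unexpected_on_cpu: hypothetical helper. Reads "psr comm" pairs (as printed by
# `ps -eLo psr,comm`) on stdin and prints the threads last run on CPU, skipping
# per-CPU kernel threads whose name ends in "/CPU" (ksoftirqd/4, migration/4, ...).
unexpected_on_cpu() {
    local cpu="$1"
    awk -v cpu="$cpu" '
        $1 == cpu {
            name = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", name)   # drop the psr column, keep the full comm
            if (name !~ ("/" cpu "$")) print name
        }
    '
}

# Usage on a live system:
#   ps -eLo psr,comm | unexpected_on_cpu 4
```

On a correctly isolated core the output should be your own pinned thread and little else; anything beyond that is a candidate for whatever is restarting the tick.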
Related reading: The Anatomy of a Sub-50µs Trade shows how CPU isolation fits in the full latency budget. NUMA in Production covers the memory-access side of the same isolation problem. Interrupt Affinity, MSI-X, and the Multi-Queue NIC goes deeper on NIC interrupt management. Real-Time Scheduling on Linux covers SCHED_FIFO which is the final piece after isolation is in place.