Interrupt Affinity, MSI-X, and the Multi-Queue NIC: Engineering Determinism into Network IO

How irqbalance moved an RX queue IRQ to the trading core mid-session, what MSI-X actually is, and how to correctly configure per-queue interrupt affinity for HFT.

It was 09:47 on a Tuesday morning at Akuna, 47 minutes into the European equity open, when the trading engine’s P99 latency jumped from 21µs to 78µs and stayed there. The P50 did not move. The system was processing orders correctly. The P&L was tracking its model. But the tail had tripled.

The htop view in CPU bar mode told the story immediately: core 0, which was supposed to be handling all non-trading work, was showing 0% busy. Core 4, our primary market data processing core, was showing brief 100% spikes every few milliseconds. But we had no code running on core 4 that could cause 100% CPU usage - the strategy thread was running at about 12% average load.

cat /proc/interrupts | grep eth0 showed the answer: the eth0 RX queue 0 IRQ had 847,293 events on CPU4 since boot. At 09:47, those events should have been zero. The IRQ had been on CPU0 when we started. Something had moved it.

That something was irqbalance. After a brief period of inactivity during the pre-open auction (trading volume was low, CPU4 was idle), irqbalance decided CPU4 was underutilized and would benefit from owning the busiest NIC interrupt. When the open arrived and volume picked up, the IRQ was already on the trading core.

What MSI-X Is and Why It Matters

Traditional PCI interrupts (INTx) used shared interrupt lines: multiple devices would share a single IRQ line, and the kernel had to query each device to find out which one triggered the interrupt. This was slow and serialized.

MSI (Message Signaled Interrupts) replaced shared lines with memory writes: a device signals an interrupt by writing a specific value to a specific memory address. This eliminated shared lines and allowed dedicated per-device IRQs, with up to 32 vectors per device.

MSI-X extended this: a single device can have up to 2,048 independent vectors, each with its own target address and data. For a multi-queue NIC, this means each RX queue can have its own dedicated IRQ, independent of other queues. The critical implication: you can pin each queue’s interrupt handler to a specific CPU core, with no interaction between queues.

Legacy INTx (shared line):
  NIC → IRQ pin → shared IRQ line → one CPU interrupted → kernel runs every handler registered on that line

MSI (one vector):
  NIC → memory write → one CPU → one device interrupt handler

MSI-X (multiple vectors):
  NIC RX Queue 0 → memory write → CPU 0 → queue 0 handler
  NIC RX Queue 1 → memory write → CPU 2 → queue 1 handler
  NIC RX Queue 2 → memory write → CPU 4 → queue 2 handler (trading core)
  NIC TX Queue 0 → memory write → CPU 1 → TX completion handler

With MSI-X, the hardware provides the interrupt routing. The CPU that handles the NIC’s interrupt is the CPU that receives the MSI-X write - and you control which address each vector targets (within the limits of the IOMMU/MSI-X table configuration).

Verifying MSI-X Is Active

# Check whether your NIC is using MSI-X
lspci -vvv -s <device_pci_address> | grep -A 5 "MSI-X"
# Output should contain:
# Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
#   Vector table: BAR=4 offset=00000000
#   PBA: BAR=4 offset=00008000
# "Enable+" means MSI-X is active.

# Count the MSI-X vectors allocated to the device
ls /sys/class/net/eth0/device/msi_irqs/ 2>/dev/null | wc -l
# Or list them; the entries are Linux IRQ numbers, not queue indices:
ls /sys/class/net/eth0/device/msi_irqs/ 2>/dev/null
# Example output: 120  121  122  123  124  125  126  127
# (8 vectors; many drivers allocate one per queue plus extras for link or admin events)

# See the full IRQ list with CPU distribution
head -1 /proc/interrupts  # header row with the CPU columns
grep "eth0\|sfn8522\|mlx5" /proc/interrupts
# Format: IRQ#:  one counter per CPU (CPU0 ... CPU7)  interrupt type  device+queue
# Example:
#  120:    4823193         0         0         0         0         0         0         0  PCI-MSI  eth0-rx-0
#  121:          0   3920183         0         0         0         0         0         0  PCI-MSI  eth0-rx-1
#  122:          0         0   2847291         0         0         0         0         0  PCI-MSI  eth0-rx-2
# Perfect: each queue's IRQ is handled by exactly one CPU.
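
The "exactly one CPU" check is easy to automate. The following is a minimal sketch, not a definitive tool: irqs_on_multiple_cpus is a hypothetical helper name, and it assumes the input has the /proc/interrupts layout shown above (a header row of CPU columns, then one row per IRQ). Because the counters accumulate since boot, an IRQ that was moved even once shows up here, which is exactly the signature from the 09:47 incident.

```shell
# Reads /proc/interrupts-style text on stdin and prints every IRQ whose line
# matches the pattern in $1 and whose counters are non-zero on more than one CPU.
# Assumption: the first input line is the "CPU0 CPU1 ..." header.
irqs_on_multiple_cpus() {
    awk -v pat="$1" '
        NR == 1 { ncpu = NF; next }            # header: one field per CPU
        $0 ~ pat {
            hit = 0; where = ""
            for (i = 2; i <= ncpu + 1; i++)    # fields 2..ncpu+1 hold the per-CPU counts
                if ($i + 0 > 0) { hit++; where = where " CPU" (i - 2) "=" $i }
            if (hit > 1) { irq = $1; sub(":", "", irq); print irq ":" where }
        }'
}

# Usage on a live system:
#   irqs_on_multiple_cpus 'eth0' < /proc/interrupts
```

No output means every matching IRQ has only ever fired on a single CPU since boot.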

Configuring IRQ Affinity

The smp_affinity file for each IRQ accepts a hexadecimal bitmask where bit 0 = CPU 0, bit 1 = CPU 1, etc. Only one CPU should handle each NIC queue IRQ for a trading engine - the CPU that polls the corresponding queue.

#!/bin/bash
# Complete IRQ affinity setup script - run at boot after irqbalance is disabled

set -euo pipefail

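# Find all IRQ numbers for the trading NIC, in /proc/interrupts order
NIC="eth0"  # or "sfn8522", "enp4s0f0", etc.
NIC_IRQS=$(grep "$NIC" /proc/interrupts | awk '{print $1}' | tr -d ':' || true)
if [ -z "$NIC_IRQS" ]; then
    echo "No IRQs found for $NIC" >&2
    exit 1
fi

echo "NIC IRQs found: $NIC_IRQS"

# Define the mapping: position in the IRQ list → CPU core.
# This assumes the driver lists its IRQs in queue order (rx-0, rx-1, ...);
# check the names in /proc/interrupts before relying on it.
# RX queue 0 → CPU 0 (management core)
# RX queue 1 → CPU 2 (management core)
# RX queue 2 → CPU 4 (trading core - market data)
# RX queue 3 → CPU 5 (trading core - order acks)
# IRQs past the fourth (for example TX queue 0) have no entry here and are left unchanged.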
declare -A QUEUE_TO_CPU
QUEUE_TO_CPU[0]=0
QUEUE_TO_CPU[1]=2
QUEUE_TO_CPU[2]=4
QUEUE_TO_CPU[3]=5

queue_idx=0
for irq in $NIC_IRQS; do
    if [ -n "${QUEUE_TO_CPU[$queue_idx]+x}" ]; then
        cpu=${QUEUE_TO_CPU[$queue_idx]}
        # Convert CPU number to hex bitmask (CPU 4 = bit 4 = 0x10)
        mask=$(printf "%x" $((1 << cpu)))
        echo "$mask" > /proc/irq/$irq/smp_affinity
        echo "IRQ $irq (queue $queue_idx) → CPU $cpu (mask 0x$mask)"
    fi
    queue_idx=$((queue_idx + 1))
done

# Verify the configuration
echo ""
echo "=== Verification ==="
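grep "$NIC" /proc/interrupts | awk '{
    irq = $1; sub(":", "", irq)   # strip the trailing colon from the IRQ column
    printf "IRQ %s: CPU distribution: ", irq
    # Assumes three trailing non-counter fields (controller, hwirq-trigger, name)
    for (i=2; i<=NF-3; i++) printf "CPU%d=%s ", i-2, $i
    print ""
}'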

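For a system with more than 32 CPUs, the hex mask has to be written as comma-separated 32-bit groups, which is easy to get wrong. Use smp_affinity_list instead, which takes CPU numbers and ranges directly: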

# smp_affinity_list: human-readable CPU list
echo "4" > /proc/irq/122/smp_affinity_list  # CPU 4 only
echo "0-3" > /proc/irq/120/smp_affinity_list  # CPUs 0-3 (bitmask would be 0x0f)
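
If you do need the hex form, for example to template it into a config file, the conversion can be sketched as below. cpu_to_affinity_mask is a hypothetical helper, and the sketch assumes the kernel's usual mask syntax of comma-separated 32-bit hex groups with the most significant group first.

```shell
# Prints the smp_affinity mask that selects exactly one CPU.
# CPUs 0-31 fit in a single hex word; every further block of 32 CPUs
# appends one ",00000000" group to the right of it.
cpu_to_affinity_mask() {
    cpu=$1
    word=$((cpu / 32))                 # how many full 32-bit groups sit below this CPU
    bit=$((cpu % 32))                  # bit position inside its own group
    mask=$(printf '%x' $((1 << bit)))
    while [ "$word" -gt 0 ]; do
        mask="$mask,00000000"
        word=$((word - 1))
    done
    printf '%s\n' "$mask"
}

# Usage (IRQ 122 is the example IRQ from above):
#   cpu_to_affinity_mask 4  > /proc/irq/122/smp_affinity    # writes 10
#   cpu_to_affinity_mask 40 > /proc/irq/122/smp_affinity    # writes 100,00000000
```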

Disabling irqbalance Properly

The simplest approach is to disable irqbalance entirely on trading servers:

systemctl disable irqbalance
systemctl stop irqbalance

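If other teams require irqbalance for management traffic (which is a reasonable concern on shared infrastructure), configure it to stay away from both your trading cores and your manually pinned IRQs. IRQBALANCE_BANNED_CPUS keeps irqbalance from placing any IRQ on the listed cores, and --banirq tells it to leave a specific IRQ alone:

# /etc/default/irqbalance
# Hex bitmask of CPUs irqbalance should never assign interrupts to
# Bits 4-7 set = 0xF0 = cores 4-7
IRQBALANCE_BANNED_CPUS="00000000000000F0"

# Exempt the manually pinned NIC IRQs so irqbalance never rewrites their affinity
# (IRQ numbers are examples; take the real ones from /proc/interrupts)
IRQBALANCE_ARGS="--banirq=120 --banirq=121 --banirq=122 --banirq=123"

systemctl restart irqbalance

Do not rely on /proc/irq/<N>/affinity_hint for this. The file is read-only from user space: the driver sets the hint, and whether irqbalance honors it depends on the irqbalance version and its hint policy. For guaranteed protection of trading cores, use IRQBALANCE_BANNED_CPUS together with --banirq for the pinned IRQs. Newer irqbalance releases also accept a CPU list through IRQBALANCE_BANNED_CPULIST; check the man page for the version you run.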
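
A typo in a hex mask such as the IRQBALANCE_BANNED_CPUS value silently protects the wrong cores, so it is worth decoding the mask back into CPU numbers before deploying it. A minimal sketch, assuming a mask without commas that fits in 63 bits (mask_to_cpus is a hypothetical helper name):

```shell
# Prints the CPU numbers whose bits are set in a hex mask (no "0x" prefix, no commas).
# Assumption: the mask covers at most CPUs 0-62, so it fits in shell arithmetic.
mask_to_cpus() {
    value=$((0x$1))
    cpu=0
    list=""
    while [ "$cpu" -lt 63 ]; do
        if [ $(( (value >> cpu) & 1 )) -eq 1 ]; then
            list="$list $cpu"
        fi
        cpu=$((cpu + 1))
    done
    printf '%s\n' "${list# }"          # drop the leading space
}

# Usage:
#   mask_to_cpus 00000000000000F0     # the banned-CPU mask from the config above
```

The same helper works for the values read back from smp_affinity or xps_cpus on systems with up to 32 CPUs.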

RSS, RPS, RFS, and XPS: Choosing the Right Distribution Mechanism

There are four mechanisms for distributing NIC workload across CPUs, and they work at different layers:

RSS (Receive Side Scaling) - hardware-level. The NIC computes a hash (typically Toeplitz) over the packet’s flow fields (source and destination IP, plus source and destination port for TCP and, where enabled, UDP) and uses it to select an RX queue. You control the hash key, the indirection table that maps hash values to queues, and, through IRQ affinity, the queue→CPU mapping. This is the right mechanism for high-volume trading: the NIC sorts packets into queues before the CPU is even involved.

# Set the RSS hash key so flow→queue placement is reproducible across reboots
# The key length is NIC-specific (often 40 bytes; `ethtool -x eth0` shows the current key),
# so extend or replace the illustrative 16-byte value below to match your NIC
ethtool -X eth0 hkey 6d:5a:56:da:25:5b:0e:c2:41:67:25:3d:43:a3:8f:b0
# The hash key should be set once and pinned - changes affect which queue each flow lands in

# Set the indirection table, which maps hash values to queues
# (with an 8-entry table, entries 0-7 map to queues 0-3)
ethtool -X eth0 equal 4  # spread the entries evenly over 4 queues, round-robin

# For explicit flow pinning, use ntuple rules instead of the hash:
# Flow to/from exchange A always → queue 2 (trading core)
# Flow to management → queue 0 (management core)
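
To make the round-robin layout of "equal 4" concrete, the sketch below prints the table it would produce. rss_equal_table is a hypothetical helper that only models the table; it does not query the NIC, and the 8-entry size is the illustrative one from the comment above (real tables are often larger, and the NIC typically indexes them with the low bits of the hash).

```shell
# Prints the queue number stored in each indirection-table entry when the
# table is filled round-robin, the way "ethtool -X <dev> equal N" does it.
# $1 = number of table entries, $2 = number of queues
rss_equal_table() {
    size=$1
    queues=$2
    i=0
    row=""
    while [ "$i" -lt "$size" ]; do
        row="$row $((i % queues))"     # entry i points at queue i mod N
        i=$((i + 1))
    done
    printf '%s\n' "${row# }"
}

# Usage:
#   rss_equal_table 8 4
```

Compare the result with the table that ethtool -x eth0 prints to confirm the NIC holds the layout you expect.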

RPS (Receive Packet Steering) - software-level RSS. Runs in the kernel softirq layer, allows steering packets to CPUs regardless of NIC queue count. Useful when the NIC only has one queue but you want software-level distribution. Higher overhead than RSS - use RSS if available.

RFS (Receive Flow Steering) - extension of RPS that considers which CPU is actually running the application that owns each flow. Routes packets to the CPU where the socket is being read. Not appropriate for trading: you want explicit control, not heuristic-based routing.

XPS (Transmit Packet Steering) - CPU → TX queue mapping. Ensures that each CPU’s transmitted packets go to a specific TX queue, avoiding cross-queue contention on transmit.

# Configure XPS: CPU 4 uses TX queue 2
echo "10" > /sys/class/net/eth0/queues/tx-2/xps_cpus  # 0x10 = CPU 4

# Verify:
cat /sys/class/net/eth0/queues/tx-2/xps_cpus
# 10 (hex)

For a trading engine using kernel bypass (ef_vi, DPDK, AF_XDP), queue selection is still relevant - the NIC decides which RX queue each flow lands in before the bypass layer sees the packet. You must steer your trading flows to the queues you have configured for kernel bypass, either through RSS or, more precisely, through explicit ntuple rules where the driver supports them.

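# Configure ntuple rules for deterministic flow routing
# Enable ntuple filtering first (fails if the driver does not support it)
ethtool -K eth0 ntuple on
# Route market data multicast group to queue 2 (trading core queue)
ethtool -N eth0 flow-type udp4 dst-ip 239.1.1.100 dst-port 4000 action 2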

# Route order management to queue 3
ethtool -N eth0 flow-type tcp4 dst-ip 10.0.1.50 dst-port 7777 action 3

# List configured rules
ethtool -n eth0

Verifying the Configuration Is Holding

After configuration, verify that traffic is flowing through the expected queues and that your trading cores are not receiving unintended interrupts:

# Monitor interrupt distribution in real-time
watch -n 1 "grep eth0 /proc/interrupts"

# Total received packets for the interface
cat /sys/class/net/eth0/statistics/rx_packets
# List the RX queues the kernel has set up
# (sysfs has no standard per-queue packet counters; use ethtool -S for those)
for q in /sys/class/net/eth0/queues/rx-*/; do
    basename "$q"
done

# ethtool per-queue stats (counter names are driver-specific)
ethtool -S eth0 | grep -E "rx.*packets"
# Intel ixgbe-style output example:
# rx_queue_0_packets: 4823193
# rx_queue_1_packets: 3920183
# rx_queue_2_packets: 2847291  ← should be highest for market data queue
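
Absolute counters since boot say little about where traffic is landing right now. A more useful view is the difference between two snapshots, sketched below. queue_deltas is a hypothetical helper; it assumes the "name: value" layout that ethtool -S prints and treats any counter whose name matches rx.*packets as interesting.

```shell
# queue_deltas BEFORE_FILE AFTER_FILE
# Prints "counter delta" for every rx packet counter present in both snapshots.
queue_deltas() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }                              # trim the indentation ethtool adds
        FNR == NR { if ($1 ~ /rx.*packets/) before[$1] = $2; next }   # first file: remember values
        ($1 in before) { print $1, $2 - before[$1] }             # second file: print the difference
    ' "$1" "$2"
}

# Usage on a live system:
#   ethtool -S eth0 > /tmp/before; sleep 10; ethtool -S eth0 > /tmp/after
#   queue_deltas /tmp/before /tmp/after
```

Running it while only one flow is active shows which queue that flow lands on.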

How This Breaks in Production

1. irqbalance restarts on package update. On Ubuntu/Debian, upgrading the irqbalance package restarts the service, which immediately begins rebalancing. Package updates often happen automatically via unattended-upgrades. On trading servers, either disable automatic updates entirely, hold the package (apt-mark hold irqbalance), or manage upgrades through a change control window.

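2. NIC driver reload resets IRQ affinity. After an rmmod/insmod cycle or a device reset (ethtool --reset, and on many drivers a queue-count change with ethtool -L), the driver frees and re-requests its IRQs, and they come back with default affinity (usually all CPUs), sometimes under new IRQ numbers. Note that ethtool -r only restarts link auto-negotiation; whether that triggers a reset is driver-specific. Resets can also happen without operator action, for example when the driver recovers from a TX timeout or a firmware fault. Your IRQ configuration must be re-applied after any driver reload or reset.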

3. Fewer MSI-X vectors granted than expected. The kernel may hand the driver fewer vectors than the NIC advertises. Typical causes are a boot with a reduced CPU count (a kdump or rescue kernel, maxcpus=), a VM or SR-IOV virtual function with a small vector budget, or MSI disabled on the kernel command line (pci=nomsi). If your NIC has 8 RX queues but the driver only gets 4 MSI-X vectors, it will fall back to fewer queues, often with nothing more than a line in dmesg - and ntuple rules that name the missing queues will fail to install or no longer match your core layout. Compare ethtool -l eth0 against the expected queue count after every boot.

4. Wrong assumptions about hash symmetry. RSS only hashes received packets. On a trading host, outbound order traffic never passes through RSS; its TX queue is chosen by XPS (or the driver default). A symmetric hash (where hash(A→B) == hash(B→A)) therefore matters only on machines that receive both directions of a connection, such as a capture or monitoring box fed from a tap. For UDP market data, this does not matter (unidirectional). For TCP order management, check which RX queue the exchange’s replies land on using ethtool -x eth0 (hash key and indirection table) together with the per-queue counters, and align the TX side with XPS rather than with the hash.

5. Interrupt coalescing counteracting your affinity work. Even with perfect IRQ affinity, if interrupt coalescing is set too aggressively (e.g., rx-usecs=100), the NIC can hold back the interrupt for up to 100µs while packets accumulate. Your strategy code then processes a burst of stale data rather than each packet as it arrives. For market data receive, disable coalescing entirely or set a very small value: ethtool -C eth0 adaptive-rx off rx-usecs 0 rx-frames 1 (drop any parameter your driver rejects). This increases interrupt rate (and CPU overhead for non-bypass paths) but eliminates the coalescing delay.

6. RSS flow collision during burst. If two high-rate flows hash to the same RSS queue, they both land on the same CPU core. The outcome is fixed by the hash key, the indirection table, and the flow tuples, so it can be checked in advance. With 4 queues and a busy multicast market data source plus a direct exchange TCP connection, there is a 25% chance they collide. tcpdump does not show which hardware queue a packet arrived on, so check with the per-queue counters instead: snapshot ethtool -S eth0 while only one flow is active, repeat for the other flow, and compare which rx queue counter moves. If they collide, add an ntuple rule to explicitly pin one flow to a different queue, overriding RSS.
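
Several of these failure modes end the same way: the affinity you configured is no longer the affinity in effect. A periodic drift check catches that before the tail latency does. The following is a minimal sketch under stated assumptions: check_irq_affinity is a hypothetical helper, the IRQ numbers come from your own /proc/interrupts, and the optional third argument exists only so the function can be pointed at a test directory instead of /proc.

```shell
# check_irq_affinity IRQ EXPECTED_CPU_LIST [PROC_ROOT]
# Compares /proc/irq/<IRQ>/smp_affinity_list with the expected value.
# Returns 0 if it matches, 1 on drift, 2 if the IRQ no longer exists.
check_irq_affinity() {
    irq=$1
    expected=$2
    root=${3:-/proc}
    file="$root/irq/$irq/smp_affinity_list"
    if [ ! -r "$file" ]; then
        echo "IRQ $irq: missing (driver reloaded or vector count changed?)"
        return 2
    fi
    actual=$(cat "$file")
    if [ "$actual" = "$expected" ]; then
        echo "IRQ $irq: ok ($actual)"
    else
        echo "IRQ $irq: DRIFT expected=$expected actual=$actual"
        return 1
    fi
}

# Usage, e.g. from cron or a monitoring agent:
#   check_irq_affinity 122 4 || logger -t irq-drift "eth0 rx-2 affinity changed"
```

Because IRQ numbers can change after a driver reload, a production version would look the IRQ up by queue name first; this sketch leaves that out.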

Related reading: The Anatomy of a Sub-50µs Trade shows where interrupt handling fits in the full latency path. Solarflare ef_vi vs DPDK vs AF_XDP covers kernel bypass, which removes interrupts from the hot path entirely. CPU Pinning, isolcpus, and nohz_full covers the CPU isolation that IRQ affinity must coordinate with. Linux Tunable Drift covers how IRQ affinity gets reset after system events.
