Linux Tunable Drift: Why Your Carefully Tuned Box Is Slower After a Kernel Update

How a Spectre mitigation patch silently added a 17% latency regression, what resets your tuning without warning, and how to govern a trading server against configuration drift.

#linux-tuning #kernel-updates #configuration-drift #spectre #hft #devops

In January 2018, the Spectre and Meltdown vulnerabilities were published. Within 72 hours, every major Linux distribution had issued kernel updates with mitigations. Within a week, trading infrastructure teams worldwide were debugging mysterious latency regressions.

Ours showed up four days after the update. P50 latency on the BTC/USD market-maker had gone from 18µs to 21µs - a 3µs regression, about 17%. The system had not changed. The code had not changed. We had not deployed. The kernel had auto-updated, silently, via unattended-upgrades.

The Spectre v2 mitigations restrict indirect branch prediction. IBPB (Indirect Branch Predictor Barrier) flushes predictor state on context switches rather than on every syscall return, and our hot path made no syscalls anyway. But the update also changed how the CPU’s branch predictor behaved for indirect calls - and our strategy loop used virtual dispatch. Every virtual function call in the hot path had slightly more branch misprediction overhead.

We would not have caught it in testing. The regression was gradual and proportional to virtual dispatch frequency, not a step function. It only showed up in production latency histograms, correlated precisely with the kernel update timestamp.

This post covers the full taxonomy of events that reset your tuning without warning, and how to govern against drift.

What Resets Your Tuning

Trading server tuning involves dozens of kernel parameters, IRQ affinities, CPU frequency governors, and process priorities. None of these are permanently applied by default - they are runtime state that various agents will happily overwrite.

Category 1: Kernel updates

A kernel update does not just add security patches. It can change:

  • Default values for /proc/sys/ parameters (sysctl defaults live in the kernel source, not in your config files)
  • CPU idle state handling (new C-state drivers, changed C-state latencies)
  • Branch prediction behavior (Spectre/Retpoline mitigations change indirect branch handling)
  • Memory management (THP compaction algorithms, page migration heuristics)
  • Scheduler tuning constants (CFS bandwidth, wakeup latency knobs)

None of these changes announce themselves. A kernel update from 5.15.0-89 to 5.15.0-91 might change 847 lines of kernel source including the scheduler, the memory management subsystem, and three CPU driver files. There is no “latency regression” section in the changelog.

# Track kernel version in your baseline benchmark
uname -r > /var/lib/trading-baseline/kernel-version.txt
md5sum /boot/vmlinuz-$(uname -r) > /var/lib/trading-baseline/kernel-checksum.txt

# Alert if kernel changes between benchmark runs
diff <(uname -r) /var/lib/trading-baseline/kernel-version.txt && \
    echo "KERNEL UNCHANGED" || \
    echo "KERNEL CHANGED - RE-BASELINE REQUIRED"
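
Since sysctl defaults live in the kernel source, one way to see what a kernel update changed is to snapshot the values before and after and compare them. The following is a minimal sketch, not part of the original tooling; the function names and the default prefix list are assumptions chosen for illustration.

```python
from pathlib import Path


def read_sysctl_tree(root="/proc/sys", prefixes=("kernel/", "vm/", "net/core/")):
    """Return {relative_path: value} for readable files under root.

    Only paths starting with one of `prefixes` are kept (hypothetical default list).
    """
    root_path = Path(root)
    snapshot = {}
    for path in sorted(root_path.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root_path).as_posix()
        if not rel.startswith(tuple(prefixes)):
            continue
        try:
            snapshot[rel] = path.read_text().strip()
        except (OSError, UnicodeDecodeError):
            # Some entries are write-only or need root; skip them.
            continue
    return snapshot


def diff_snapshots(before, after):
    """Return (changed, added, removed) between two snapshots."""
    changed = {
        key: (before[key], after[key])
        for key in before.keys() & after.keys()
        if before[key] != after[key]
    }
    added = {key: after[key] for key in after.keys() - before.keys()}
    removed = {key: before[key] for key in before.keys() - after.keys()}
    return changed, added, removed
```

Saving the result of read_sysctl_tree() as JSON next to kernel-version.txt and running diff_snapshots() after the next reboot would show which defaults moved, including parameters that disappeared entirely.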

Category 2: Service restarts and package upgrades

These services will silently reset your tuning when they restart:

| Service | What it resets |
|---|---|
| irqbalance | All IRQ affinities |
| tuned | CPU governors, IRQ affinities, kernel params to match its active profile |
| cpupower / cpufreqd | CPU frequency governors |
| NetworkManager | Ethtool settings, coalescing, ring sizes |
| systemd-udev | NIC driver parameters on link state change |
| chronyd / ntpd | Clock synchronization settings |
| cpuspeed | C-state limits |
| acpid | Power management settings including C-states |

The common thread: these services are designed to “manage” hardware, which means they periodically re-apply their configurations. Your manual overrides from the shell do not persist across their restart.

# Check which services are actively rewriting your config
# Run during trading hours and watch for parameter changes
watch -n 60 '
echo "=== CPU governors ==="
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

echo "=== THP ==="
cat /sys/kernel/mm/transparent_hugepage/enabled

echo "=== NMI watchdog ==="
cat /proc/sys/kernel/nmi_watchdog

echo "=== RT runtime ==="
cat /proc/sys/kernel/sched_rt_runtime_us
'
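
The watch loop shows current values but leaves you to spot the change yourself. A small poller that logs only transitions, with a timestamp, makes it easier to correlate a reset with a service restart in the journal. This is a hedged sketch: the function names and the shape of the `paths` mapping are assumptions, and the sysfs bracket format (`always madvise [never]`) is handled by a small helper.

```python
import re
import time


def parse_bracketed(text):
    """Return the active choice from sysfs text like 'always madvise [never]'.

    Plain values (e.g. '-1') are returned stripped.
    """
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else text.strip()


def read_state(paths):
    """Read each path in {name: path}; unreadable files map to None."""
    state = {}
    for name, path in paths.items():
        try:
            with open(path) as fh:
                state[name] = parse_bracketed(fh.read())
        except OSError:
            state[name] = None
    return state


def changed_keys(previous, current):
    """Names whose value differs between two state dicts."""
    return sorted(key for key in current if previous.get(key) != current[key])


def watch(paths, interval_s=60):
    """Poll forever and print one line per parameter transition."""
    previous = read_state(paths)
    while True:
        time.sleep(interval_s)
        current = read_state(paths)
        for key in changed_keys(previous, current):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {key}: {previous.get(key)!r} -> {current[key]!r}")
        previous = current
```

Calling watch() with a mapping such as {"thp": "/sys/kernel/mm/transparent_hugepage/enabled", "rt_runtime": "/proc/sys/kernel/sched_rt_runtime_us"} would print a line only when one of them changes.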

Category 3: Cloud init and instance metadata agents

If your trading server runs on cloud infrastructure (or was originally provisioned from a cloud image), cloud-init scripts may re-run on boot. These scripts commonly set “recommended” kernel parameters that override your custom configuration. Vendor images and user-data scripts may, for example, set net.core.rmem_default, net.core.wmem_default, and vm.swappiness to values appropriate for general-purpose workloads, not HFT.

# Check what cloud-init configured
cat /var/log/cloud-init-output.log | grep -E "sysctl|net\.|vm\." | tail -30

# Disable cloud-init from running on subsequent boots
# (after initial provisioning is complete)
touch /etc/cloud/cloud-init.disabled

# Or disable specific modules in /etc/cloud/cloud.cfg
# Remove 'runcmd' and 'write_files' from whichever module list they appear in
# (the list they belong to, and hyphen vs underscore spelling, varies by cloud-init version)

Category 4: Spectre/Meltdown mitigations added by microcode updates

Distributions ship CPU microcode updates in packages such as microcode_ctl (RHEL family) or intel-microcode (Debian/Ubuntu), and these can add new hardware mitigations that affect performance. The kernel loads the new microcode early in boot, so the change takes effect on the first reboot after the package upgrade, with no change in kernel version. After a microcode update, the kernel may enable new mitigation paths that were previously stubbed out.

# Check current mitigation status
grep -r "" /sys/devices/system/cpu/vulnerabilities/
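# Output is one "path:status" line per vulnerability file, for example:
# /sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: __user pointer sanitization
# /sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Retpoline; IBPB: conditional; IBRS_FW; RSB filling; PBRSB-eIBRS Not affected
# /sys/devices/system/cpu/vulnerabilities/l1tf:Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT disabled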

# Evaluate the performance cost of each mitigation
# Some can be partially disabled on trusted bare-metal:
# (ONLY appropriate for dedicated, single-tenant hardware with trusted software)
# mitigations=off in kernel cmdline disables all - test the performance delta first
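
To notice when a microcode or kernel update switches a mitigation on, the vulnerability files can be read into a dictionary and compared against a stored copy. The sketch below is an assumption-laden illustration (function names are hypothetical); it relies only on the fact that each file's text starts with "Not affected", "Vulnerable", or "Mitigation".

```python
from pathlib import Path

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"


def classify(status):
    """Map a vulnerability status line to a coarse category."""
    text = status.strip()
    if text.startswith("Not affected"):
        return "not_affected"
    if text.startswith("Mitigation"):
        return "mitigated"
    if text.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def read_vulnerabilities(directory=VULN_DIR):
    """Return {file_name: status_text} for every file in the directory."""
    result = {}
    for path in sorted(Path(directory).iterdir()):
        try:
            result[path.name] = path.read_text().strip()
        except OSError:
            continue
    return result


def mitigation_changes(before, after):
    """Return {name: (old_status, new_status)} for entries that differ.

    A name missing on one side is reported with None for that side.
    """
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes
```

Any non-empty result from mitigation_changes() is a reason to re-run the baseline benchmark before trading.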

Building a Baseline Benchmark

You cannot detect regression without a baseline. The baseline must be:

  1. Machine-specific: run on the actual production hardware, not a representative machine
  2. Automated: run as part of the boot sequence, not manually
  3. Stored: results written to a persistent location with timestamps and system metadata
  4. Alerting: compared against the previous run, with alerting on deviation

#!/bin/bash
# /opt/trading/scripts/run-baseline.sh
# Run at system startup before trading begins

BASELINE_DIR="/var/lib/trading-baseline"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
REPORT_FILE="$BASELINE_DIR/baseline_$TIMESTAMP.json"

mkdir -p "$BASELINE_DIR"

# Capture system state
KERNEL=$(uname -r)
CPU_MODEL=$(grep "model name" /proc/cpuinfo | head -1 | cut -d: -f2 | xargs)
CMDLINE=$(cat /proc/cmdline)
SPECTRE_MITIGATIONS=$(paste -s -d'|' /sys/devices/system/cpu/vulnerabilities/* 2>/dev/null)

# Run latency benchmark - tight loop with TSC measurement
# This is a simplified proxy; in production use your actual strategy loop
cat << 'EOF' > /tmp/latency_bench.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(uint64_t*)a, y = *(uint64_t*)b;
    return (x > y) - (x < y);
}

int main() {
    const int N = 100000;
    uint64_t *samples = malloc(N * sizeof(uint64_t));
    volatile int work = 0;

    for (int i = 0; i < N; i++) {
        uint64_t t0 = rdtsc();
        /* Stand-in workload: 20 volatile read-modify-writes (a crude proxy for order book access, not real cache-line reads) */
        for (int j = 0; j < 20; j++) work += j;
        uint64_t t1 = rdtsc();
        samples[i] = t1 - t0;
    }

    qsort(samples, N, sizeof(uint64_t), cmp_u64);

    /* Output cycles - convert to ns based on TSC frequency */
    printf("{\"p50_cycles\":%lu,\"p99_cycles\":%lu,\"p999_cycles\":%lu,\"max_cycles\":%lu}\n",
        samples[N/2], samples[N*99/100], samples[N*999/1000], samples[N-1]);

    free(samples);
    return 0;
}
EOF

gcc -O2 -o /tmp/latency_bench /tmp/latency_bench.c
# Run pinned to core 4 (isolated trading core)
BENCH_RESULT=$(taskset -c 4 /tmp/latency_bench)

# Write full report
cat > "$REPORT_FILE" << EOF
{
    "timestamp": "$TIMESTAMP",
    "kernel": "$KERNEL",
    "cpu_model": "$CPU_MODEL",
    "cmdline_md5": "$(echo "$CMDLINE" | md5sum | cut -d' ' -f1)",
    "spectre_mitigations": "$SPECTRE_MITIGATIONS",
    "isolated_cpus": "$(cat /sys/devices/system/cpu/isolated 2>/dev/null)",
    "tsc_clocksource": "$(cat /sys/devices/system/clocksource/clocksource0/current_clocksource)",
    "thp_enabled": "$(cat /sys/kernel/mm/transparent_hugepage/enabled | grep -o '\[.*\]' | tr -d '[]')",
    "rt_runtime_us": "$(cat /proc/sys/kernel/sched_rt_runtime_us)",
    "benchmark": $BENCH_RESULT
}
EOF

echo "Baseline written to $REPORT_FILE"

# Compare against previous baseline
PREV=$(ls -t "$BASELINE_DIR"/baseline_*.json 2>/dev/null | sed -n '2p')
if [ -n "$PREV" ]; then
    PREV_P99=$(python3 -c "import json; d=json.load(open('$PREV')); print(d['benchmark']['p99_cycles'])")
    CURR_P99=$(python3 -c "import json; d=json.load(open('$REPORT_FILE')); print(d['benchmark']['p99_cycles'])")
    REGRESSION_PCT=$(python3 -c "print(f'{($CURR_P99 - $PREV_P99) / $PREV_P99 * 100:.1f}')")

    echo "P99 regression vs previous baseline: $REGRESSION_PCT%"

    if python3 -c "import sys; sys.exit(0 if float('$REGRESSION_PCT') > 5.0 else 1)"; then
        echo "ALERT: P99 latency regressed by $REGRESSION_PCT% since last baseline" | \
            logger -p user.crit -t trading-baseline
        # Or send to your alerting system
    fi
fi
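
Comparing only against the previous run has one weakness: several small regressions in a row, each under the 5% threshold, never trigger the alert. One possible variation is to compare against the median of the last few baselines instead. The sketch below assumes the JSON layout written by the script above; the function names and the window of 10 runs are illustrative choices, not part of the original script.

```python
import json
from pathlib import Path
from statistics import median


def load_p99_history(baseline_dir, limit=10):
    """Return p99_cycles from the most recent `limit` baseline files, oldest first.

    File names embed a YYYYmmdd_HHMMSS timestamp, so a lexical sort is chronological.
    """
    files = sorted(Path(baseline_dir).glob("baseline_*.json"))
    values = []
    for path in files[-limit:]:
        try:
            values.append(json.loads(path.read_text())["benchmark"]["p99_cycles"])
        except (OSError, ValueError, KeyError, TypeError):
            # Skip truncated or malformed reports rather than failing the check.
            continue
    return values


def regression_pct(current, history):
    """Percent change of `current` versus the median of `history`.

    Returns None when there is no usable reference.
    """
    if not history:
        return None
    reference = median(history)
    if reference == 0:
        return None
    return (current - reference) / reference * 100.0
```

The caller would pass the history without the run being evaluated, then alert when the returned value exceeds the chosen threshold.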

The Full Audit Script: Checking All Tuning Parameters

This script captures the state of all trading-relevant kernel parameters in a stable text format, so that diffing the output of two runs shows exactly what changed:

#!/bin/bash
# /opt/trading/scripts/audit-tuning.sh

echo "=== AUDIT: $(date) ==="
echo "Kernel: $(uname -r)"
echo ""

echo "--- CPU Isolation ---"
echo "isolated: $(cat /sys/devices/system/cpu/isolated 2>/dev/null || echo 'not set')"
echo "nohz_full: $(cat /sys/devices/system/cpu/nohz_full 2>/dev/null || echo 'not set')"
echo "cmdline isolcpus: $(grep -oP 'isolcpus=\S+' /proc/cmdline || echo 'not set')"
echo ""

echo "--- CPU Frequency ---"
for cpu in 0 2 4 6; do
    gov=$(cat /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor 2>/dev/null || echo 'N/A')
    freq=$(cat /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_cur_freq 2>/dev/null || echo 'N/A')
    echo "CPU$cpu: governor=$gov, freq=${freq}kHz"
done
echo ""

echo "--- C-States ---"
for cpu in 4 5; do
    echo -n "CPU$cpu: "
    for state in /sys/devices/system/cpu/cpu$cpu/cpuidle/state*/name; do
        echo -n "$(cat $state)=$(cat $(dirname $state)/disable) "
    done
    echo ""
done
echo ""

echo "--- THP ---"
echo "enabled: $(cat /sys/kernel/mm/transparent_hugepage/enabled)"
echo "defrag: $(cat /sys/kernel/mm/transparent_hugepage/defrag)"
echo ""

echo "--- NUMA ---"
echo "numa_balancing: $(cat /proc/sys/kernel/numa_balancing)"
echo ""

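echo "--- IRQ Affinity (NIC) ---"
# Report the configured affinity per NIC IRQ (stable between runs),
# not the per-CPU interrupt counters, which change constantly and pollute the diff.
grep -E "eth0|sfn|mlx5|ens" /proc/interrupts | awk '{sub(":", "", $1); print $1, $NF}' | \
while read -r irq name; do
    echo "IRQ $irq ($name): $(cat /proc/irq/$irq/smp_affinity_list 2>/dev/null || echo 'N/A')"
done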
echo ""

echo "--- RT Scheduling ---"
echo "sched_rt_runtime_us: $(cat /proc/sys/kernel/sched_rt_runtime_us)"
echo "RT processes:"
ps -eLo pid,tid,policy,rtprio,comm 2>/dev/null | grep -v "^- " | \
    awk '$3 != "TS" {print}' | head -20
echo ""

echo "--- Kernel Parameters ---"
sysctl -a 2>/dev/null | grep -E "^(net\.core\.(rmem|wmem)|vm\.(swappiness|nr_hugepages)|kernel\.(sched_rt|nmi_watchdog|watchdog)|net\.ipv4\.tcp_)"
echo ""

echo "--- Spectre Mitigations ---"
for f in /sys/devices/system/cpu/vulnerabilities/*; do
    echo "$(basename $f): $(cat $f)"
done
echo ""

echo "--- Huge Pages ---"
grep -E "HugePages|Hugepagesize" /proc/meminfo
echo ""

echo "--- Clock Source ---"
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
echo ""

# Usage: run twice, diff the outputs
./audit-tuning.sh > /tmp/state_before.txt
# ... perform system operation (kernel update, package upgrade, etc.) ...
./audit-tuning.sh > /tmp/state_after.txt
diff /tmp/state_before.txt /tmp/state_after.txt
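
A plain diff works, but on a long audit it helps to know which section each change belongs to. The following sketch groups the audit output by its "--- Section ---" headers and reports only the sections that differ. It assumes the output format of audit-tuning.sh above; the function names are hypothetical.

```python
def split_sections(text):
    """Group audit output lines by their '--- Name ---' section header.

    Lines before the first header go under 'header'; the timestamped
    '=== AUDIT: ... ===' banner is ignored because it always differs.
    """
    sections = {}
    current = "header"
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("=== AUDIT:"):
            continue
        if stripped.startswith("--- ") and stripped.endswith(" ---"):
            current = stripped[4:-4]
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(stripped)
    return sections


def changed_sections(before_text, after_text):
    """Return {section: {'removed': [...], 'added': [...]}} for differing sections."""
    before = split_sections(before_text)
    after = split_sections(after_text)
    report = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name, []), after.get(name, [])
        if old != new:
            report[name] = {
                "removed": [line for line in old if line not in new],
                "added": [line for line in new if line not in old],
            }
    return report
```

Feeding it the contents of /tmp/state_before.txt and /tmp/state_after.txt yields a dictionary that is easy to log or forward to an alerting system.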

Governing Against Drift: The Runbook

The production governance model we implemented after the Spectre incident:

Rule 1: Pin the kernel version. Do not run automatic kernel updates on trading servers.

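# Ubuntu/Debian: hold the running kernel AND the metapackages that pull in new kernels
# (metapackage names vary by distribution and kernel flavour; adjust to what is installed)
sudo apt-mark hold linux-image-$(uname -r) linux-headers-$(uname -r) \
    linux-generic linux-image-generic linux-headers-generic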

# Verify holds
apt-mark showhold | grep linux

# For new kernel versions: test on a staging server first, run baseline benchmark,
# compare P99 regression, then approve the update for production in a change window
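
The approval step in the comment above can be enforced mechanically: keep a list of kernel releases that passed the staging benchmark and refuse to start trading on anything else. A minimal sketch, assuming a plain-text approval file with one `uname -r` string per line (the file format and function names are hypothetical):

```python
import platform


def load_approved_kernels(path):
    """Read approved kernel releases, one per line; '#' starts a comment."""
    approved = set()
    with open(path) as fh:
        for line in fh:
            entry = line.split("#", 1)[0].strip()
            if entry:
                approved.add(entry)
    return approved


def check_kernel(approved, running=None):
    """Return (is_approved, running_release).

    `running` defaults to the current kernel release, the same string as `uname -r`.
    """
    running = running or platform.release()
    return running in approved, running
```

A pre-start hook could call check_kernel() and exit non-zero when the first element is False, which also catches a staging and production mismatch.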

Rule 2: Disable all configuration-managing daemons.

# Permanently disable the offenders on trading servers
for svc in irqbalance tuned cpupower acpid; do
    systemctl disable "$svc" 2>/dev/null
    systemctl stop "$svc" 2>/dev/null
    # is-enabled exits non-zero for "disabled", so capture its output instead of using ||
    state=$(systemctl is-enabled "$svc" 2>/dev/null)
    echo "$svc: ${state:-not installed}"
done

Rule 3: Apply all tuning via a single idempotent script, ordered before the trading engine by systemd.

# /etc/systemd/system/trading-tuning.service
[Unit]
Description=Trading Server Kernel Tuning
After=network.target
Before=trading-engine.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/opt/trading/scripts/apply-trading-tuning.sh

[Install]
WantedBy=multi-user.target

The apply-trading-tuning.sh script re-applies every parameter: CPU governors, THP, NUMA balancing, IRQ affinities, RT limits, huge pages. It is idempotent (running it twice is safe). It runs after every boot, before the trading engine starts.
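
Idempotence here means reading the current value first and writing only when it differs, so a second run changes nothing and reports nothing. The sketch below illustrates that pattern for a single /proc or /sys file. It is not the apply-trading-tuning.sh script itself, and the function names and return values are assumptions made for the example.

```python
def _normalise(text):
    """Collapse whitespace and extract the active choice from '[bracketed]' sysfs files."""
    text = " ".join(text.split())
    start, end = text.find("["), text.find("]")
    if 0 <= start < end:
        return text[start + 1:end]
    return text


def apply_setting(path, desired, dry_run=False):
    """Set a tunable only if needed.

    Returns 'unchanged', 'changed', or 'error' (unreadable or unwritable path).
    With dry_run=True a needed change is reported as 'changed' but not written.
    """
    try:
        with open(path) as fh:
            current = _normalise(fh.read())
    except OSError:
        return "error"
    if current == _normalise(desired):
        return "unchanged"
    if dry_run:
        return "changed"
    try:
        with open(path, "w") as fh:
            fh.write(desired)
    except OSError:
        return "error"
    return "changed"
```

Logging every 'changed' result at boot doubles as a drift report: on a healthy server the second and later runs should print only 'unchanged'.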

Rule 4: Checksum /proc/cmdline and alert on change.

# In the baseline script:
CMDLINE_FILE=/var/lib/trading-baseline/cmdline.txt
CMDLINE_CHECKSUM=$(md5sum /proc/cmdline | cut -d' ' -f1)
STORED_CHECKSUM=$(cat /var/lib/trading-baseline/cmdline-checksum.txt 2>/dev/null)

if [ -n "$STORED_CHECKSUM" ] && [ "$CMDLINE_CHECKSUM" != "$STORED_CHECKSUM" ]; then
    echo "ALERT: kernel command line has changed" | logger -p user.crit -t trading-drift
    # Diff the stored command line text (not the checksums) to see what changed
    [ -f "$CMDLINE_FILE" ] && diff "$CMDLINE_FILE" /proc/cmdline
fi

echo "$CMDLINE_CHECKSUM" > /var/lib/trading-baseline/cmdline-checksum.txt
cat /proc/cmdline > "$CMDLINE_FILE"
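
A checksum says that the command line changed, not which parameter went missing. If the previous command line text is stored as well, a token-level comparison names the exact parameters that were added or removed. This is a rough sketch under the assumption that parameters are whitespace-separated (quoted values containing spaces are not handled); the function name is hypothetical.

```python
def cmdline_diff(stored, current):
    """Return (removed, added) kernel parameters between two command lines.

    Both results are sorted lists of whole 'key=value' or flag tokens.
    """
    old, new = set(stored.split()), set(current.split())
    return sorted(old - new), sorted(new - old)
```

For example, a lost isolcpus= or nohz_full= parameter after a bootloader regeneration shows up directly in the removed list.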

Rule 5: Continuous parameter monitoring with Prometheus.

#!/usr/bin/env python3
"""trading_config_exporter.py - Prometheus exporter for trading tuning parameters"""
from prometheus_client import Gauge, start_http_server
import time
import subprocess

GAUGES = {
    'trading_thp_enabled': Gauge('trading_thp_enabled', '1=always,0=never,-1=madvise'),
    'trading_numa_balancing': Gauge('trading_numa_balancing', 'NUMA auto-balancing state'),
    'trading_rt_runtime_us': Gauge('trading_rt_runtime_us', 'RT scheduler bandwidth'),
    'trading_hugepages_free': Gauge('trading_hugepages_free', 'Free 2MB huge pages'),
    'trading_nmi_watchdog': Gauge('trading_nmi_watchdog', 'NMI watchdog state'),
}

def collect():
    # THP
    thp = open('/sys/kernel/mm/transparent_hugepage/enabled').read()
    if '[never]' in thp: GAUGES['trading_thp_enabled'].set(0)
    elif '[always]' in thp: GAUGES['trading_thp_enabled'].set(1)
    else: GAUGES['trading_thp_enabled'].set(-1)

    # NUMA balancing
    nb = open('/proc/sys/kernel/numa_balancing').read().strip()
    GAUGES['trading_numa_balancing'].set(int(nb))

    # RT runtime
    rt = open('/proc/sys/kernel/sched_rt_runtime_us').read().strip()
    GAUGES['trading_rt_runtime_us'].set(int(rt))

    # Huge pages
    for line in open('/proc/meminfo'):
        if 'HugePages_Free' in line:
            GAUGES['trading_hugepages_free'].set(int(line.split()[1]))
            break

    # NMI watchdog
    nmi = open('/proc/sys/kernel/nmi_watchdog').read().strip()
    GAUGES['trading_nmi_watchdog'].set(int(nmi))

start_http_server(9102)
while True:
    collect()
    time.sleep(15)

Configure Prometheus alerts:

# prometheus/rules/trading-config.yml
groups:
  - name: trading-config-drift
    rules:
      - alert: THPEnabled
        expr: trading_thp_enabled != 0
        labels:
          severity: critical
        annotations:
          summary: "THP is not disabled on trading server"

      - alert: NUMABalancingEnabled
        expr: trading_numa_balancing != 0
        labels:
          severity: warning

      - alert: RTThrottlingEnabled
        expr: trading_rt_runtime_us != -1
        labels:
          severity: warning
        annotations:
          summary: "RT scheduler throttling active (not -1)"

      - alert: HugePagesInsufficient
        expr: trading_hugepages_free < 100
        labels:
          severity: critical

How This Breaks in Production

1. Staging kernel does not match production. You run your baseline on staging, which is on kernel 5.15.0-91 because the auto-update took effect at the staging server’s most recent reboot. Production still runs 5.15.0-89 because it has not rebooted since. The benchmarks do not match because they are measuring different kernels. Enforce kernel version parity between staging and production as a pre-deployment gate.

2. The tuning script runs but is not idempotent. If apply-trading-tuning.sh appends to /etc/sysctl.conf rather than using sysctl -w with explicit values, running it twice creates duplicate entries. Duplicates make the effective value depend on ordering: entries are applied in sequence, so a stale later line silently overrides the one you meant. Use sysctl -w param=value, which sets the runtime value directly and is always idempotent.

3. Microcode update changes branch predictor behavior. Intel CPU microcode updates ship in a distribution package (microcode_ctl on the RHEL family, intel-microcode on Debian/Ubuntu) and are loaded by the kernel at boot. After a microcode update, branch prediction behavior can change significantly. This is not reflected in the kernel version (same kernel, different microcode). Add microcode version to your baseline metadata: grep "microcode" /proc/cpuinfo | head -1.

4. CPU frequency not fixed due to BIOS override. Even with scaling_governor=performance and no_turbo=1, some BIOS configurations override the OS settings. The BIOS may enforce its own power management state that overrides scaling_governor writes. Verify the actual CPU frequency under load, not while idle: a sleeping process consumes almost no cycles, so measure a CPU-bound workload pinned to the trading core, for example perf stat -e cycles,task-clock -- taskset -c 4 <cpu-bound workload>, then divide cycles by the task-clock time. If the result does not match your expected clock speed, the BIOS is interfering.

5. Kernel security update that interacts badly with PCIe ASPM. ASPM (Active State Power Management) for PCIe links allows the link to enter low-power states when idle. Some Spectre-variant mitigations add memory barriers around PCIe MMIO reads that interact badly with ASPM transitions. A kernel update enabling these barriers caused one of our servers to take 2-5µs to access NIC registers that previously took 200ns, because the PCIe link had to wake from L1 state. Fix: disable ASPM in BIOS or via kernel cmdline pcie_aspm=off.
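
For the microcode case in item 3, the version can be captured programmatically and stored next to the kernel version in the baseline report. A minimal sketch, assuming the x86 /proc/cpuinfo layout where each logical CPU has a "microcode : 0x..." line (the function name is hypothetical):

```python
def microcode_versions(cpuinfo_text):
    """Return the set of distinct microcode revisions found in /proc/cpuinfo text.

    More than one element means cores are running different revisions.
    """
    versions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            versions.add(value.strip())
    return versions
```

Reading /proc/cpuinfo once at baseline time and recording sorted(microcode_versions(text)) lets the comparison step flag a microcode change even when uname -r is identical.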

Related reading: CPU Pinning, isolcpus, and nohz_full covers the parameters most commonly lost to drift. Huge Pages Done Right covers THP configuration that needs drift protection. Interrupt Affinity, MSI-X, and the Multi-Queue NIC covers IRQ affinity which irqbalance restarts will reset. NUMA in Production covers NUMA balancing, another common drift target.
