Huge Pages Done Right: Static, Transparent, and Why Most HFT Firms Disable THP
How a THP compaction stall caused a 400µs latency spike mid-session, plus the correct way to configure static huge pages for trading systems in production.
The 400µs spike showed up on a Thursday afternoon at 14:23 UTC, during a period of normal but elevated market activity. The order book for ETH/USD had been updating at about 8,000 messages per second - not our busiest period - and then for exactly one trading cycle, the strategy thread took 400µs instead of its normal 22µs. One cycle. Then normal operation resumed.
We had 6ms of trace data around the event because we recorded every cycle’s latency to a ring buffer. The spike correlated with nothing in the order flow, nothing in the network, nothing in the CPU metrics. The CPU was at 14% utilization. There was no GC (we were running C++). No memory allocations in the trading loop. No locks.
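A per-cycle latency ring buffer of the kind mentioned above can be very small. The sketch below is illustrative only: the names `lat_ring`, `lat_ring_push`, and `lat_ring_max` and the slot count are assumptions, not the code that produced the trace in this post.

```c
#include <stdint.h>
#include <stddef.h>

/* Power of two, so the write index wraps with a mask instead of a modulo */
#define LAT_RING_SLOTS 4096u

typedef struct {
    uint64_t ns[LAT_RING_SLOTS]; /* one latency sample per trading cycle */
    uint64_t head;               /* total number of samples ever written */
} lat_ring;

/* Record one cycle's latency; overwrites the oldest sample once full */
static inline void lat_ring_push(lat_ring *r, uint64_t cycle_ns) {
    r->ns[r->head & (LAT_RING_SLOTS - 1)] = cycle_ns;
    r->head++;
}

/* Largest sample still retained in the ring */
static uint64_t lat_ring_max(const lat_ring *r) {
    uint64_t n = r->head < LAT_RING_SLOTS ? r->head : LAT_RING_SLOTS;
    uint64_t max = 0;
    for (uint64_t i = 0; i < n; i++) {
        if (r->ns[i] > max) max = r->ns[i];
    }
    return max;
}
```

The push is a store and an increment, with no allocation and no branch, which is what makes it cheap enough to leave on in the hot loop.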
The perf stat output, which I fortunately already had running at low overhead, showed one event type spiking at that exact timestamp: minor page faults (perf's minor-faults event). Three hundred and forty-two minor page faults in one cycle.
The culprit was Linux’s Transparent Huge Pages (THP) compaction. The kernel had decided, during that 14:23 window, to attempt to defragment memory by collapsing 4KB pages belonging to our order book into a single 2MB huge page. This required briefly unmapping and remapping a 2MB region of memory, triggering 342 minor faults as the pages were faulted back in. Each fault was cheap - but 342 of them, plus the TLB flush for the 2MB region, added up to ~400µs.
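If perf is not already running, the same signal can be sampled from inside the process. This is a minimal sketch, assuming Linux and glibc: `getrusage` exposes the minor fault count as `ru_minflt`, and `RUSAGE_THREAD` (a Linux extension) scopes it to the calling thread. The helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/resource.h>
#include <sys/mman.h>
#include <stddef.h>

/* Prefer per-thread accounting; fall back to process-wide if unavailable */
#ifdef RUSAGE_THREAD
#define FAULT_SCOPE RUSAGE_THREAD
#else
#define FAULT_SCOPE RUSAGE_SELF
#endif

/* Minor faults taken so far in FAULT_SCOPE, or -1 on error */
static long minor_faults_now(void) {
    struct rusage ru;
    if (getrusage(FAULT_SCOPE, &ru) != 0) return -1;
    return ru.ru_minflt;
}

/* Touch one byte per 4KB of a fresh anonymous mapping and return
 * how many minor faults that first touch caused, or -1 on error. */
static long faults_from_first_touch(size_t bytes) {
    volatile char *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    long before = minor_faults_now();
    for (size_t off = 0; off < bytes; off += 4096) p[off] = 1;
    long after = minor_faults_now();
    munmap((void *)p, bytes);
    if (before < 0 || after < 0) return -1;
    return after - before;
}
```

In a trading loop the useful form is to call `minor_faults_now()` before and after each cycle and log any non-zero delta next to the cycle's latency. Note that `getrusage` is itself a system call, so this belongs in a diagnostic build rather than permanently in the hot path.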
Every serious HFT firm I know of has THP disabled. This post explains why, and what the alternative looks like.
The Physics of TLB Misses and Huge Pages
Every virtual memory access on x86-64 requires translating a virtual address to a physical address. This translation goes through the page table - a multi-level radix tree stored in memory. For a 4-level page table (the default on 64-bit Linux), a full translation requires 4 memory reads: PGD → PUD → PMD → PTE.
The hardware caches recent translations in the TLB (Translation Lookaside Buffer). L1-ITLB has 64 entries on modern Intel, L1-DTLB has 64 entries, L2 TLB has 1,536 entries. When you miss the TLB and must walk the page table:
TLB miss resolution overhead (approximate):
All 4 levels in L1 cache: ~40-60 cycles (~13-20ns at 3GHz)
Mix of L1/L2 cache hits: ~100-200 cycles (~33-67ns)
One level from DRAM: ~400-800 cycles (~130-270ns)
All 4 levels from DRAM: ~1200-2000 cycles (~400-670ns)
The number of TLB entries your working set occupies depends directly on page size. With 4KB pages, a 2MB order book requires 512 TLB entries (2MB / 4KB). You do not have 512 L1-DTLB entries; your 2MB order book constantly evicts TLB entries needed for other data structures. With 2MB huge pages, that same 2MB order book occupies exactly one TLB entry.
The TLB coverage difference is dramatic:
Memory region size      4KB pages needed   2MB pages needed   Coverage per TLB entry
──────────────────────────────────────────────────────────────────────────────────────
Order book (2MB)        512 entries        1 entry            512x difference
DMA buffer (64MB)       16,384 entries     32 entries         512x difference
Strategy state (8MB)    2,048 entries      4 entries          512x difference
Total example           18,944 entries     37 entries
Available L2 TLB        1,536 entries      1,536 entries
With 4KB pages: 18,944 entries needed / 1,536 available = 12.3x oversubscribed (constant TLB misses)
With 2MB pages: 37 entries needed / 1,536 available = 2.4% utilization (essentially no TLB misses)
On a trading system where every cycle of the strategy loop touches the order book, DMA buffers, and strategy state, the difference between “12x oversubscribed TLB” and “2% TLB utilization” is measurable in µs per cycle.
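The table's arithmetic is easy to reproduce for your own working set. The sketch below is a worked example only; the function names are hypothetical and the region sizes are the three example regions from the table.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_4K (4ULL * 1024)
#define PAGE_2M (2ULL * 1024 * 1024)

/* TLB entries needed to map `bytes` at a given page size (rounds up) */
static uint64_t tlb_entries(uint64_t bytes, uint64_t page_size) {
    return (bytes + page_size - 1) / page_size;
}

/* Sum of entries needed across several regions */
static uint64_t total_tlb_entries(const uint64_t *region_bytes, size_t n,
                                  uint64_t page_size) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += tlb_entries(region_bytes[i], page_size);
    }
    return total;
}
```

Feeding it the 2MB order book, the 64MB DMA buffer, and the 8MB strategy state gives 18,944 entries at 4KB and 37 entries at 2MB, matching the totals above. Dividing by the L2 TLB size for your CPU gives the oversubscription factor.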
Two Kinds of Huge Pages: Static vs. Transparent
Linux offers two mechanisms for huge pages, and they have completely different operational properties.
Static huge pages (also called HugeTLB pages) are pre-allocated at boot time from physically contiguous memory. You request N pages, the kernel carves them out of physical RAM, and they remain reserved until you explicitly free them. You use them via mmap(MAP_HUGETLB) or by mounting hugetlbfs. They are never subject to compaction, never swapped, never split by the kernel. Once allocated, they are rock-solid.
Transparent Huge Pages (THP) are the kernel’s attempt to automatically use huge pages for any anonymous region large enough to hold an aligned 2MB page, without requiring application changes. The kernel monitors your memory access patterns and opportunistically “promotes” regions of 512 contiguous 4KB pages into a single 2MB huge page - and splits them back into 4KB pages when memory is needed elsewhere.
The promotion and demotion process is the problem. The kernel’s khugepaged daemon and the direct compaction path (taken inside a page fault) both perform this work. Compaction requires:
1. Finding 512 contiguous 4KB pages that can be merged
2. Temporarily unmapping the target region
3. Copying data into the huge page
4. Remapping the huge page
5. Sending TLB flush IPIs to all other CPUs that have this region mapped
Step 5 - the TLB flush - is a cross-CPU interrupt. Every CPU that has accessed the region recently receives an IPI (inter-processor interrupt) to flush its TLB entry. On an 8-core system, this can be 7 concurrent IPI deliveries. The target CPU (your trading thread) pauses to handle the IPI, flushes its TLB, and resumes. This is the 400µs event we observed.
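One way to see whether TLB flush IPIs are reaching your cores is the `TLB:` row of `/proc/interrupts`, which on x86 Linux carries one "TLB shootdowns" counter per CPU. The sketch below sums that row; the function names are hypothetical, and a production version would more likely read only the column for the pinned trading core.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* Sum the per-CPU counters on a "TLB:" line of /proc/interrupts.
 * Returns -1 if the line is not the TLB row. */
static long long sum_tlb_line(const char *line) {
    while (isspace((unsigned char)*line)) line++;
    if (strncmp(line, "TLB:", 4) != 0) return -1;
    const char *p = line + 4;
    long long total = 0;
    for (;;) {
        char *end;
        unsigned long long v = strtoull(p, &end, 10);
        if (end == p) break; /* hit the trailing "TLB shootdowns" label */
        total += (long long)v;
        p = end;
    }
    return total;
}

/* Total TLB shootdowns across all CPUs, or -1 if unavailable */
static long long read_tlb_shootdowns(void) {
    FILE *f = fopen("/proc/interrupts", "r");
    if (!f) return -1;
    char line[4096];
    long long total = -1;
    while (fgets(line, sizeof(line), f)) {
        long long t = sum_tlb_line(line);
        if (t >= 0) { total = t; break; }
    }
    fclose(f);
    return total;
}
```

Sampling `read_tlb_shootdowns()` once a second from a non-critical thread and alerting on a jump during trading hours gives an early warning that something is remapping memory under you.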
Disabling THP
# Disable THP immediately (no reboot required)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Verify
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: always madvise [never]
# The bracketed option is the current setting.
# Stop khugepaged from compacting memory (khugepaged itself stops once enabled=never)
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
# Make permanent - add to /etc/rc.local or a systemd unit:
# [Unit]
# Description=Disable THP
# DefaultDependencies=no
# After=local-fs.target
# [Service]
# Type=oneshot
# ExecStart=/bin/bash -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag'
# [Install]
# WantedBy=multi-user.target
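# Add to the kernel command line so THP is off from early boot
# In /etc/default/grub:
GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
# Regenerate the GRUB config (Debian/Ubuntu; RHEL-family uses grub2-mkconfig -o /boot/grub2/grub.cfg)
sudo update-grub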
Disabling THP means you are responsible for allocating huge pages explicitly wherever they matter. The performance benefit of huge pages does not go away - you just control when and where they are used, without the kernel’s compaction running under you.
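Because the setting can be changed behind your back, it is worth having the trading engine check it at startup. The sketch below parses the bracketed token from the sysfs file shown above; the function names are hypothetical, and refusing to start when THP is enabled is a policy choice, not something the kernel requires.

```c
#include <stdio.h>
#include <string.h>

/* Copy the bracketed token from e.g. "always madvise [never]" into out.
 * Returns 0 on success, -1 if no bracketed token fits in out. */
static int thp_active_setting(const char *text, char *out, size_t out_len) {
    const char *open = strchr(text, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (!open || !close) return -1;
    size_t len = (size_t)(close - open - 1);
    if (len + 1 > out_len) return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* 1 if THP is set to "never", 0 if not, -1 if it cannot be determined */
static int thp_is_disabled(void) {
    char buf[128], setting[32];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f) return -1;
    char *ok = fgets(buf, sizeof(buf), f);
    fclose(f);
    if (!ok || thp_active_setting(buf, setting, sizeof(setting)) != 0) return -1;
    return strcmp(setting, "never") == 0;
}
```

A startup routine can call `thp_is_disabled()` and abort with a clear message on anything other than 1, which turns a silent configuration regression into a failed deploy.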
Allocating Static Huge Pages
# Allocate at boot (most reliable - physically contiguous memory is still available)
# Add to /etc/sysctl.d/99-hugepages.conf
# (keep comments on their own lines; sysctl.conf has no trailing-comment syntax):
# 512 × 2MB = 1GB
vm.nr_hugepages = 512
# no overcommit
vm.nr_overcommit_hugepages = 0
# Apply immediately:
sysctl -p /etc/sysctl.d/99-hugepages.conf
# Verify allocation
grep -E "HugePages|Hugepagesize" /proc/meminfo
# HugePages_Total: 512
# HugePages_Free: 512
# HugePages_Rsvd: 0
# HugePages_Surp: 0
# Hugepagesize: 2048 kB
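# 1GB pages (requires CPU support - check with 'grep pdpe1gb /proc/cpuinfo')
# Runtime allocation often fails once memory is fragmented; reserving them on the
# kernel command line (hugepagesz=1G hugepages=4) is more reliable.
# Treat them as permanent - use for long-lived DMA buffers only
echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages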
Using Huge Pages in Code
#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)     /* 2MB */
#define ORDER_BOOK_SIZE (64 * HUGE_PAGE_SIZE)  /* 128MB order book */

/* Allocate anonymous memory from the static huge page pool - preferred for trading engines */
void *alloc_hugepage(size_t size) {
    void *ptr = mmap(NULL, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (ptr == MAP_FAILED) {
        /* Diagnostic: report the error and the pool state */
        perror("mmap(MAP_HUGETLB)");
        FILE *f = fopen("/proc/meminfo", "r");
        if (f) {
            char line[256];
            while (fgets(line, sizeof(line), f)) {
                if (strstr(line, "HugePages")) fprintf(stderr, "%s", line);
            }
            fclose(f);
        }
        return NULL;
    }
    /* Fault in all pages now to avoid page fault latency during trading. */
    /* This does not prove huge page backing; verify_hugepage_backing() checks that. */
    memset(ptr, 0, size);
    return ptr;
}

/* Verify the allocation is actually backed by huge pages.
 * MAP_HUGETLB mappings report their page size in KernelPageSize;
 * AnonHugePages only counts THP, so it stays 0 for them.
 * Returns 1 if the mapping containing ptr uses pages larger than 4 kB. */
int verify_hugepage_backing(void *ptr, size_t size) {
    (void)size;  /* only the start address is looked up */
    FILE *smaps = fopen("/proc/self/smaps", "r");
    if (!smaps) return 0;
    char line[256];
    unsigned long target = (unsigned long)(uintptr_t)ptr;
    int found = 0, hugepage_backed = 0;
    while (fgets(line, sizeof(line), smaps)) {
        unsigned long start, end, page_kb;
        /* Mapping header lines look like "start-end perms offset dev inode path" */
        if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
            if (found) break;  /* walked past our mapping without a page size */
            found = (start <= target && target < end);
            continue;
        }
        if (found && sscanf(line, "KernelPageSize: %lu kB", &page_kb) == 1) {
            hugepage_backed = (page_kb > 4);  /* assumes a 4 kB base page (x86-64) */
            break;
        }
    }
    fclose(smaps);
    return hugepage_backed;
}
For NUMA-aware huge page allocation (combine with the NUMA policy from the previous post):
#include <numaif.h>    /* mbind(); link with -lnuma */
#include <numa.h>
#include <sys/mman.h>
#include <string.h>

void *alloc_hugepage_numa(size_t size, int numa_node) {
    /* The mask below is a single word; reject node numbers it cannot represent safely */
    enum { MAX_NODE_BITS = sizeof(unsigned long) * 8 };
    if (numa_node < 0 || numa_node >= MAX_NODE_BITS - 1) return NULL;

    void *ptr = mmap(NULL, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (ptr == MAP_FAILED) return NULL;

    /* Bind to specific NUMA node before faulting in */
    unsigned long nodemask = 1UL << numa_node;
    if (mbind(ptr, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
              MPOL_MF_STRICT | MPOL_MF_MOVE) != 0) {
        munmap(ptr, size);
        return NULL;
    }

    /* Fault in all pages - now they are allocated on numa_node */
    memset(ptr, 0, size);
    return ptr;
}
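After faulting the pages in, it is worth confirming where they actually landed. The sketch below asks the kernel which node backs a given address using the `get_mempolicy` system call with `MPOL_F_NODE | MPOL_F_ADDR`. It goes through `syscall()` only so the example builds without libnuma; with libnuma installed, the `get_mempolicy()` wrapper from `<numaif.h>` takes the same arguments. The helper name `numa_node_of` is hypothetical.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>

/* NUMA node currently backing the page at addr, or -1 on error
 * (for example when the call is blocked by a container's seccomp profile). */
static int numa_node_of(void *addr) {
    int node = -1;
    long rc = syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
                      (unsigned long)(MPOL_F_NODE | MPOL_F_ADDR));
    return rc == 0 ? node : -1;
}
```

Checking `numa_node_of(ptr) == numa_node` for the first byte of each huge page right after `alloc_hugepage_numa()` returns catches a misplacement before the market opens rather than in a latency trace afterwards.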
Verifying Huge Page Usage at Runtime
# Check which allocations in your trading process are backed by huge pages
grep -E "^(Size|AnonHugePages|KernelPageSize|MMUPageSize):" /proc/<pid>/smaps
# The MMUPageSize field tells you the actual page size for each mapping:
# MMUPageSize: 4 kB → normal page
# MMUPageSize: 2048 kB → 2MB huge page
# MMUPageSize: 1048576 kB → 1GB huge page
# Quick summary: total huge page usage by your process
# (AnonHugePages counts THP; Shared_Hugetlb and Private_Hugetlb count static huge pages)
grep -E "^(AnonHugePages|Shared_Hugetlb|Private_Hugetlb):" /proc/<pid>/smaps | \
awk '{sum+=$2} END {print sum/1024 " MB in huge pages"}'
# hugeadm utility (from libhugetlbfs-bin package)
hugeadm --pool-list
# Output:
# Size Minimum Current Maximum Default
# 2097152 512 512 512 *
# 1073741824 0 0 0
How This Breaks in Production
1. Allocation succeeds but pages are not the size you expect. mmap(MAP_HUGETLB) itself does not fall back to 4KB pages: if the pool cannot satisfy the request, it returns MAP_FAILED with ENOMEM. The silent cases sit around it: an allocator wrapper that retries without MAP_HUGETLB on failure, memory that relies on THP via madvise(MADV_HUGEPAGE) (best effort only), or a bare MAP_HUGETLB that picks up the system default huge page size rather than the 2MB you assumed. Use MAP_HUGETLB | MAP_HUGE_2MB together to pin the size, and verify via /proc/<pid>/smaps. A wrong allocation will show MMUPageSize: 4 kB instead of 2048 kB.
2. Huge pages allocated but not pre-faulted. mmap() returns a virtual address range but no physical pages are assigned until you write to them. If your trading loop touches a huge page region for the first time during active trading (e.g., on the first message of the day), it will trigger a page fault - expensive even for huge pages. Always memset your huge page allocations to zero at startup (before the market opens), which forces all physical page assignments to happen upfront.
3. Huge page pool exhausted by another process. Huge pages are a system-wide pool. If another process allocates MAP_HUGETLB before your trading engine starts, your allocation may fail. Add a startup check:
# Pre-flight check in trading engine startup script
REQUIRED_HUGEPAGES=256
AVAILABLE=$(grep HugePages_Free /proc/meminfo | awk '{print $2}')
if [ "$AVAILABLE" -lt "$REQUIRED_HUGEPAGES" ]; then
echo "FATAL: Not enough huge pages. Need $REQUIRED_HUGEPAGES, have $AVAILABLE" >&2
exit 1
fi
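The same pre-flight check can live inside the engine itself, so it still runs when someone bypasses the startup script. This is a sketch under the assumption that `/proc/meminfo` fits in an 8KB buffer; `meminfo_value` and `hugepages_free` are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

/* Extract the numeric value for `key` (e.g. "HugePages_Free") from
 * meminfo-formatted text. Returns -1 if the key is not present. */
static long meminfo_value(const char *text, const char *key) {
    size_t klen = strlen(key);
    const char *line = text;
    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long v;
            return sscanf(line + klen + 1, "%ld", &v) == 1 ? v : -1;
        }
        line = strchr(line, '\n');
        if (line) line++; /* step past the newline to the next line */
    }
    return -1;
}

/* Free huge pages in the default-size pool, or -1 if unavailable */
static long hugepages_free(void) {
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_value(buf, "HugePages_Free");
}
```

HugePages_Free still includes pages that other mappings have reserved but not yet touched (HugePages_Rsvd), so a stricter check compares the requirement against free minus reserved.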
4. NUMA-misaligned huge pages. If you allocate huge pages before setting the NUMA memory policy, they may be allocated on the wrong NUMA node. Unlike normal pages (which mbind + MPOL_MF_MOVE can relocate cheaply), huge pages are awkward to migrate after allocation: the move needs a free huge page on the destination node and copies 2MB at a time, so do not count on it. Set the NUMA policy first, then allocate and fault in. If you discover a misallocation, the dependable fix is to free the huge page and reallocate with the correct policy.
5. Compaction under memory pressure even with THP disabled. Disabling THP stops khugepaged and THP promotion, but it does not switch off memory compaction in general: kcompactd and direct compaction still run for other high-order allocations. Static huge pages already in the pool (pre-allocated via vm.nr_hugepages) are reserved, and the kernel does not split or reclaim them. The exposure is huge pages that have to be assembled at runtime, either by growing the pool after boot or through surplus pages (vm.nr_overcommit_hugepages > 0): under fragmentation or memory pressure, that allocation can trigger compaction or fail outright. The safe path: always use statically allocated huge pages (vm.nr_hugepages at boot).
6. Kernel security patches replacing 2MB with 4KB for sensitive kernel structures. Retpoline, L1TF, MDS, and other speculative execution mitigations have periodically required the kernel to use 4KB pages for data structures that were previously using 2MB pages, to enable fine-grained permissions. This does not affect your userspace huge page allocations, but it does increase kernel TLB pressure, which can indirectly increase interrupt handling latency. After any kernel security update, re-run your baseline latency benchmarks.
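For failure modes 1 and 2 above, a compact allocation helper can request 2MB pages explicitly and ask the kernel to prefault them in the same call, returning NULL rather than retrying with 4KB pages. This is a sketch, not a drop-in replacement for alloc_hugepage(): the helper names are hypothetical, and the fallback `#define`s are only there for older libc headers that lack MAP_HUGE_2MB.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Older libc headers may not define these; values match <linux/mman.h> */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#endif

#define HUGE_2MB (2UL * 1024 * 1024)

/* Round a byte count up to a whole number of 2MB pages */
static size_t round_up_2mb(size_t bytes) {
    return (bytes + HUGE_2MB - 1) & ~(HUGE_2MB - 1);
}

/* Map `bytes` (rounded up) from the 2MB huge page pool and prefault it.
 * Returns NULL with errno set on failure; never falls back to 4KB pages. */
static void *alloc_2mb_prefaulted(size_t bytes) {
    size_t len = round_up_2mb(bytes);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                   MAP_HUGE_2MB | MAP_POPULATE,
                   -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

MAP_POPULATE is best effort, so keeping the memset from the earlier allocation code as a second pass costs little at startup and removes any doubt. Remember to pass the rounded-up length to munmap.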
Related reading: NUMA in Production covers the interaction between NUMA memory policy and huge page allocation. CPU Pinning, isolcpus, and nohz_full covers the CPU-side tuning that works alongside correct memory configuration. Solarflare ef_vi vs DPDK vs AF_XDP covers DMA buffer management, which must use huge pages for minimum latency. Linux Tunable Drift covers how THP gets re-enabled after system updates.