Determinism Under Load: Tail Latency Engineering for 24/7 Crypto Trading Systems

At ZeroCopy, we guarantee sub-100µs signing latency as a product SLA. The guarantee covers P99.9 - not just the median. Anyone can hit a 42µs P50. Holding P99.9 under 100µs on a 24/7 system that never has a maintenance window is an engineering problem of a different character.

The median is easy to optimize. The tail is where the real work is. Tail latency is adversarial: every background process, every kernel interrupt, every GC pause, every thermal event attacks your P99.9 from an unexpected direction. This post covers the systematic approach I use to find and eliminate each source of tail latency degradation.

Why Tail Latency Matters More Than Median

At Akuna Capital, our market-making models ran on a 50µs round-trip assumption. The system delivered 47µs median. It also delivered 2ms P99 (which we did not know because of coordinated omission - see the coordinated omission post). That 2ms P99 translated into $4,000-$6,000 per day in adverse selection.

The relationship is simple: tail latency events cause missed cancellations and stale quotes. A 2ms delay on a cancel request in a fast market means the cancel arrives after the fill - you pay a fee for a fill you were trying to avoid. At $0.10 per fill and 50,000 fills per day, even a 0.1% miss rate is 50 unwanted fills, or $5/day in fees alone. At scale, it is significant.

For ZeroCopy’s signing service: if our signing P99.9 is 100µs and a trading firm is signing 10,000 transactions per second, 10 transactions per second experience >100µs signing latency. If the firm’s overall latency budget is 200µs end-to-end, those 10 transactions per second are at risk of missing the budget. At 86,400 seconds per day, that is up to 864,000 budget misses per day. Our customer measures their revenue impact from missed-budget events; we need to eliminate them.
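The same arithmetic applies to any request rate and percentile. A minimal sketch follows; the function name and the parts-per-million parameter are illustrative choices, not part of ZeroCopy's tooling, and integer math is used to sidestep floating-point rounding.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def tail_events_per_day(requests_per_second: int, tail_per_million: int) -> int:
    """Requests per day that land beyond a given percentile.

    tail_per_million is the tail fraction in parts per million:
    P99.9 leaves 0.1% of requests in the tail, i.e. 1,000 per million.
    """
    per_second_ppm = requests_per_second * tail_per_million
    return per_second_ppm * SECONDS_PER_DAY // 1_000_000

if __name__ == "__main__":
    # 10,000 signatures per second, measured at P99.9
    print(tail_events_per_day(10_000, 1_000))
```

Running the same function with P99 (10,000 per million) shows how quickly the count grows when the SLA is stated at a weaker percentile.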

The Taxonomy of Tail Latency Sources

Every tail latency spike I have encountered in trading systems traces to one of six root causes. I will address each with a specific mitigation:

Root Cause                      Typical Spike    Frequency         Mitigation
───────────────────────────────────────────────────────────────────────────────────────
GC pause (JVM/Go)               1-100ms          Every 1-30s       Language choice / GC tuning
Kernel scheduler jitter         50-500µs         Every 10-100ms    CPU isolation, SCHED_FIFO
Lock contention                 100µs-10ms       Under load        Lock-free data structures
Thermal throttling              50-500µs         Every 30-300s     Cooling, power profiles
NUMA cross-socket access        50-200ns/miss    Per cache miss    Thread/memory pinning
Interrupt coalescing            10-200µs         Every N packets   IRQ affinity, coalescing config

CPU Isolation: The Single Most Impactful Change

The Linux scheduler by default treats all CPUs as a pool. Background tasks, kernel threads, softirqs, and your trading process all compete for the same cores. When a kernel thread preempts your strategy thread, your trade takes 50-500µs longer than it should.

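CPU isolation removes a set of cores from the general scheduler pool entirely. No background processes and no load-balanced tasks land there - only the threads you explicitly pin to those cores. Hardware interrupts are not covered by isolcpus and need their affinity set separately (see Failure 1 below).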

# In /etc/default/grub - add to GRUB_CMDLINE_LINUX
GRUB_CMDLINE_LINUX="isolcpus=2,3,4,5 nohz_full=2,3,4,5 rcu_nocbs=2,3,4,5"
# isolcpus: remove from general scheduler
# nohz_full: stop the periodic scheduler tick while a single task runs on the core (reduces jitter by ~5-10µs)
# rcu_nocbs: offload RCU callbacks to other cores
# Regenerate the GRUB config (update-grub on Debian/Ubuntu) before rebooting

# After reboot, verify
cat /sys/devices/system/cpu/isolated
# Should show: 2-5
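A service can make the same check at startup rather than relying on someone running cat by hand. The sketch below parses the kernel's CPU list format (ranges and comma-separated values); the helper names parse_cpu_list and assert_isolated are hypothetical, and the sysfs path is the one used in the command above.

```python
from pathlib import Path

ISOLATED_PATH = Path("/sys/devices/system/cpu/isolated")

def parse_cpu_list(text: str) -> set[int]:
    """Parse the kernel CPU list format, e.g. "2-5,8" -> {2, 3, 4, 5, 8}."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs are isolated
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def assert_isolated(required: set[int]) -> None:
    """Raise if any required CPU is missing from the kernel's isolated set."""
    isolated = parse_cpu_list(ISOLATED_PATH.read_text())
    missing = required - isolated
    if missing:
        raise RuntimeError(f"CPUs not isolated: {sorted(missing)}")
```

Calling assert_isolated({2, 3, 4, 5}) before pinning turns a silently misconfigured host into a startup failure instead of a latency regression.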

Then pin your critical threads:

import os
import ctypes

def pin_thread_to_cpu(cpu_id: int):
    """Pin the calling thread to a specific CPU (Linux only)."""
    # pid 0 means "the calling thread"; the affinity mask is replaced, not extended.
    # Raises OSError if cpu_id is offline or outside the allowed cpuset.
    os.sched_setaffinity(0, {cpu_id})

    print(f"Pinned to CPU {cpu_id}")

def set_realtime_priority(priority: int = 80):
    """Set SCHED_FIFO scheduling policy for the calling thread."""
    SCHED_FIFO = 1

    class sched_param(ctypes.Structure):
        _fields_ = [('sched_priority', ctypes.c_int)]

    libc = ctypes.CDLL('libc.so.6', use_errno=True)
    param = sched_param(sched_priority=priority)
    result = libc.sched_setscheduler(0, SCHED_FIFO, ctypes.byref(param))
    if result != 0:
        raise OSError(ctypes.get_errno(), "sched_setscheduler failed")

For Rust (which is what ZeroCopy’s signing service uses):

use std::thread;

fn pin_to_isolated_cpu(cpu_id: usize) {
    let mut cpu_set = nix::sched::CpuSet::new();
    cpu_set.set(cpu_id).expect("CPU ID out of range");
    nix::sched::sched_setaffinity(nix::unistd::Pid::from_raw(0), &cpu_set)
        .expect("sched_setaffinity failed - are you root?");
}

The impact on our signing service:

Metric          Before isolation    After isolation
──────────────────────────────────────────────────────────
P50             42µs                41µs
P99             380µs               68µs
P99.9           1.2ms               87µs
Max (1h run)    12ms                310µs

CPU isolation alone reduced our P99.9 by 14x. Nothing else I have tried comes close.
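One rough way to see scheduler jitter on a given core, before and after isolation, is a busy loop that reads the clock back to back and records the gaps. This is a sketch rather than the harness behind the table above: the function name and the 20µs threshold are assumptions, and the Python interpreter adds noise of its own, so the output is useful for comparing two configurations on the same machine, not as an absolute figure.

```python
import time

def sample_jitter(duration_s: float = 1.0, threshold_ns: int = 20_000) -> dict:
    """Spin on the monotonic clock and record gaps between consecutive reads.

    A gap far above the typical loop time means this thread was not running
    (preempted, interrupted, or stalled) for roughly that long.
    """
    deadline = time.perf_counter_ns() + int(duration_s * 1e9)
    prev = time.perf_counter_ns()
    samples = 0
    spikes = 0
    max_gap_ns = 0
    while prev < deadline:
        now = time.perf_counter_ns()
        gap = now - prev
        samples += 1
        if gap > max_gap_ns:
            max_gap_ns = gap
        if gap > threshold_ns:
            spikes += 1  # count gaps above the (assumed) spike threshold
        prev = now
    return {"samples": samples, "spikes": spikes, "max_gap_ns": max_gap_ns}
```

Run it once pinned to an isolated core and once on a shared core; the difference in spikes and max_gap_ns is the part attributable to the scheduler and interrupts.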

Eliminating GC: Language Selection for Latency-Critical Paths

For the signing service, we chose Rust. There is no garbage collector, no stop-the-world pause. Every allocation is deterministic. The memory model is explicit.

If you are building in a GC’d language (Go, Java, Python), the GC is an uncontrollable latency source. You can tune it but not eliminate it.

For Go (which many crypto systems use), the relevant tuning:

// GOGC=1000 lets the heap grow to roughly 11x the live set before a GC cycle starts.
// For a signing service with a small working set, this means GC runs rarely.
import "runtime/debug"

func init() {
    // Only collect when heap grows 1000% above the live heap
    // Tradeoff: higher memory usage, less frequent pauses
    debug.SetGCPercent(1000)

    // Soft cap on total runtime memory (Go 1.19+): the GC runs more
    // aggressively near the limit. This does not pre-allocate anything.
    debug.SetMemoryLimit(512 * 1024 * 1024)  // 512MB limit
}

For Java with G1GC (if you genuinely cannot avoid JVM):

# JVM flags for latency-critical trading components
-XX:+UseG1GC
-XX:MaxGCPauseMillis=10        # Target 10ms max pause
-XX:G1HeapRegionSize=16m       # Larger regions = fewer humongous allocations
-XX:ConcGCThreads=4            # Dedicated GC threads (on non-isolated CPUs)
-XX:+AlwaysPreTouch             # Pre-fault heap pages at startup
-Xms4g -Xmx4g                  # Fixed heap size - no resize pauses
-XX:+DisableExplicitGC          # Prevent System.gc() calls

For latency-critical paths in Java, G1GC with these flags typically achieves 5-50ms P99.9. For sub-100µs P99.9, you need C/C++ or Rust.

Request Hedging: Buy Down P99 at 2x Cost

Request hedging is a technique borrowed from distributed systems: send the same request to two independent replicas simultaneously and take the first response. The latency of a hedged request is min(latency_A, latency_B) - dramatically better at the tail. A cheaper variant, used in the code below, sends the second request only if the first has not returned within a short delay.

The math works out favorably when the tail is driven by independent failure modes. If each replica independently has a 1% chance of a 2ms pause, the probability that both replicas experience a pause on the same request is 0.01% - you have reduced the tail event rate by 100x at 2x the cost.
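The arithmetic can be written down directly. The sketch below assumes slow events are fully independent across replicas, which is the condition stated above; both function names are illustrative. The second function estimates the cost side when the hedge is sent only after the primary exceeds a chosen percentile rather than on every request.

```python
def hedged_tail_probability(p_slow: float, replicas: int = 2) -> float:
    """Probability that every replica is slow on the same request,
    assuming slow events are independent across replicas."""
    return p_slow ** replicas

def expected_requests_per_call(hedge_percentile: float) -> float:
    """Average backend requests per client call when the hedge is sent only
    after the primary exceeds the given latency percentile (e.g. 95.0)."""
    return 1.0 + (100.0 - hedge_percentile) / 100.0

if __name__ == "__main__":
    print(hedged_tail_probability(0.01))     # both replicas slow at once
    print(expected_requests_per_call(95.0))  # load multiplier when hedging at P95
```

Correlated stalls (a shared network path, a shared host) break the independence assumption, and the real tail will sit somewhere between the single-replica and the idealized hedged figure.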

import asyncio
from typing import TypeVar, Callable, Awaitable

T = TypeVar('T')

async def hedge_request(
    primary: Callable[[], Awaitable[T]],
    secondary: Callable[[], Awaitable[T]],
    hedge_delay_ms: float = 5.0,
) -> T:
    """
    Send primary request, and if no response within hedge_delay_ms,
    also send secondary request. Return whichever responds first.

    hedge_delay_ms: how long to wait before sending the secondary
    (set this to your P95 latency to avoid unnecessary double-sends)
    """
    primary_task = asyncio.create_task(primary())

    try:
        # Wait for primary with a short timeout
        result = await asyncio.wait_for(
            asyncio.shield(primary_task),
            timeout=hedge_delay_ms / 1000.0
        )
        return result
    except asyncio.TimeoutError:
        # Primary is slow - send secondary hedge
        secondary_task = asyncio.create_task(secondary())

        done, pending = await asyncio.wait(
            [primary_task, secondary_task],
            return_when=asyncio.FIRST_COMPLETED
        )

        # Cancel the slower one
        for task in pending:
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass

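        # If both finished in the same loop iteration, prefer the primary
        winner = primary_task if primary_task in done else secondary_task
        return winner.result()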

For signing services, hedging requires two independent signing nodes. The request is sent to both; both will produce valid signatures (since the key is in a TEE or HSM, not split between them). You take the first response and discard the second. The overhead is the additional signing capacity - but for a latency SLA product, this is often worth it.

Our ZeroCopy signing service with hedging:

Configuration               P50     P99     P99.9
───────────────────────────────────────────────────
Single node                 42µs    68µs    87µs
Hedged (hedge at P95=55µs)  42µs    53µs    61µs

The P99.9 improvement from 87µs to 61µs comes from eliminating the tail events that exceed 55µs on one node - the other node almost always returns a faster result, unless both nodes stall at the same time.

Thermal Throttling: The Silent Killer

On a trading server running continuously, CPU temperature is a real concern. When core temperature exceeds the TJ_MAX threshold (typically 100°C on Intel Xeon), the CPU throttles its clock frequency, sometimes by 50-80%. A single throttle event lasting 100ms can push thousands of concurrent requests into the P99.9 tail.

# Monitor thermal throttling
watch -n 0.5 'cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq'

# Check for thermal throttle events
grep -c "CPU.*throttled" /var/log/syslog

# Verify CPU is running at maximum non-turbo frequency (no throttling)
# For a 3.5GHz Xeon with turbo up to 4.2GHz:
# Normal: 3500000 (3.5GHz base)
# Throttled: 800000-2000000 (emergency throttle)

# Select the performance governor (no power-saving downclocking)
# Reduces frequency variations that cause latency spikes; turbo is controlled separately
cpupower frequency-set -g performance
# Alternatively, for specific cores:
for cpu in 2 3 4 5; do
    echo performance > /sys/devices/system/cpu/cpu${cpu}/cpufreq/scaling_governor
done

For AWS instances: c5n family instances have a fixed clock frequency with Intel Turbo Boost always on. There is no thermal throttling at typical trading workloads on these instances because the physical cooling is handled by AWS. However, on a bare-metal server in a co-location facility, thermal management is your responsibility.
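On bare metal, the frequency check can run as a small poller next to the service. The sketch below reads the same scaling_cur_freq files as the watch command above and flags cores running below 90% of base frequency; the function names, the core list, and the 3.5GHz base clock are assumptions taken from the example figures, not measured values.

```python
from pathlib import Path

def read_cur_freq_khz(cpu: int) -> int:
    """Read the current frequency of one CPU from sysfs (Linux cpufreq)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq")
    return int(path.read_text().strip())

def throttled_cpus(freqs_khz: dict[int, int], base_khz: int,
                   ratio: float = 0.9) -> list[int]:
    """Return CPUs whose current frequency is below ratio * base frequency."""
    floor_khz = base_khz * ratio
    return sorted(cpu for cpu, khz in freqs_khz.items() if khz < floor_khz)

def check_isolated_cores(cpus=(2, 3, 4, 5), base_khz: int = 3_500_000) -> list[int]:
    """Poll once; cpus and base_khz are assumed values for illustration."""
    return throttled_cpus({c: read_cur_freq_khz(c) for c in cpus}, base_khz)
```

Keeping throttled_cpus free of I/O makes the threshold logic easy to unit test; check_isolated_cores is the part that would be wired into an alerting loop.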

ZeroCopy’s Actual Numbers: Nitro Enclave Attestation Overhead

The AWS Nitro Enclave adds an attestation step to every signing operation. Here is what that costs in practice:

Operation                              P50      P99      P99.9
──────────────────────────────────────────────────────────────────
vsock round-trip (empty payload)       8µs      12µs     18µs
KMS decrypt (cached key)               31µs     47µs     63µs
ECDSA secp256k1 sign (32-byte hash)    3µs      5µs      8µs
Total signing operation                42µs     64µs     89µs
Total with hedging (P95 hedge)         42µs     53µs     61µs

These numbers are on an isolated CPU core with SCHED_FIFO, on a c5n.2xlarge instance with c5n.metal as the Nitro parent instance. The parent instance isolation matters - the Nitro enclave shares the parent instance’s underlying hardware with the rest of its vCPUs.

The vsock latency is relatively stable (8-18µs range) because it is a local socket between two VMs on the same physical machine - there is no network involved. The KMS decrypt dominates the latency because it requires a network call to the KMS endpoint, even though AWS routes this through their internal network to a local KMS node. The 31µs P50 on KMS reflects that routing.

How This Breaks in Production

Failure 1: Isolated CPU used by a kernel thread. The isolcpus boot parameter removes the CPU from the general scheduler's load balancing, but it does not prevent hardware interrupt handlers from running on it. A NIC interrupt assigned to your isolated CPU by irqbalance can add 10-50µs spikes. Fix: stop irqbalance or exclude the isolated CPUs from it, and explicitly set IRQ affinity away from isolated cores. Check with cat /proc/interrupts and grep -H . /proc/irq/*/smp_affinity_list.

Failure 2: SCHED_FIFO thread starving I/O threads. Your signing thread is SCHED_FIFO priority 80. It receives a burst of 1,000 simultaneous signing requests. It processes all 1,000 before yielding. The I/O thread (which reads from the vsock) is SCHED_OTHER and cannot run until the signing thread yields. The vsock buffer fills up and drops requests. Fix: use SCHED_FIFO for the signing core but set cooperative yield points, or use separate CPU cores for I/O and signing.

Failure 3: Memory allocator contention. The signing service allocates request buffers on every inbound connection. Under high load, malloc contention causes latency spikes - multiple threads waiting for the global allocator lock. Fix: use a per-thread memory pool for request buffers. In Rust, use a typed_arena or pre-allocated Vec per thread. In C++, use jemalloc or tcmalloc which are designed for multi-threaded allocation.

Failure 4: Thermal throttling on bare metal without monitoring. A cooling fan fails in the co-location facility. The server throttles from 3.5GHz to 1.2GHz. Signing latency triples. No alert fires because the metric being monitored is signing latency, but the threshold for that metric is set too high (1ms). Fix: monitor CPU frequency directly alongside service latency. An alert threshold of scaling_cur_freq < base_frequency * 0.9 detects throttling before it becomes a latency crisis.

Failure 5: Hedging amplifying backend load. Request hedging sends 1.x requests for every actual request. Under normal conditions, the secondary is cancelled quickly. Under high load, the primary is slow and the secondary also sends - suddenly you have 2x the load on both backends simultaneously. The backends get more loaded, which makes them slower, which triggers more hedging, which increases load further. Fix: implement load shedding before hedging. If the backend queue depth exceeds a threshold, reject requests rather than hedge them. Hedging works when failures are independent; it makes coordinated overload worse.

Failure 6: P99.9 SLA violation going undetected in rolling average. Your SLA monitoring computes a 5-minute rolling average of P99.9. A single 30-second burst of elevated latency pushes P99.9 to 200µs - twice the SLA. But the 30-second burst is averaged with 270 seconds of normal P99.9=80µs, resulting in a reported P99.9 of 92µs. The SLA says <100µs; you breached it; your monitoring did not fire. Fix: use a high-watermark alert that fires if P99.9 exceeds the SLA threshold in any 10-second window, not just in the rolling average.
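A fixed-window check of the kind described in Failure 6 can be sketched in a few lines. This is an illustration under stated assumptions: samples arrive as (timestamp in seconds, latency in µs) pairs, the percentile uses the nearest-rank definition, and the names percentile and breached_windows are hypothetical. A production monitor would more likely use a streaming histogram than sort raw samples.

```python
import math
from collections import defaultdict

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def breached_windows(samples, sla_us: float = 100.0,
                     window_s: int = 10, pct: float = 99.9) -> list[int]:
    """samples: iterable of (timestamp_seconds, latency_us).
    Returns start times of fixed windows whose percentile exceeds the SLA."""
    buckets = defaultdict(list)
    for ts, latency_us in samples:
        buckets[int(ts // window_s) * window_s].append(latency_us)
    return sorted(start for start, vals in buckets.items()
                  if percentile(vals, pct) > sla_us)
```

Alerting on a non-empty result from breached_windows fires on any single bad window, regardless of how many healthy windows surround it.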

Related reading: Coordinated Omission and Why Your P99 Is Probably a Lie covers how to measure these numbers correctly. AWS Nitro Enclaves for Wallet Signing covers the ZeroCopy signing architecture that achieves these latency numbers.
