Why Trading Time Is Different: TSC, CLOCK_MONOTONIC, HPET, and the Lies Each Tells

A practitioner's guide to clock sources in trading systems: TSC drift, CLOCK_MONOTONIC overhead, HPET calibration, and why TAI beats CLOCK_REALTIME for audit logs.

At Gemini, we discovered our timestamps were drifting 2µs/second on AWS instances. The problem surfaced six weeks after deployment, when our compliance team flagged that the fill sequences in our audit trail were occasionally inverted - an acknowledgment appearing before the order it was acknowledging. The timestamps were close: 1.3µs apart in the wrong direction. But in an audit trail, “close” means nothing. Either the record is correct or it is not.

The root cause was that we had been reading CLOCK_REALTIME in user-space without fully accounting for how the Linux vDSO synchronizes that clock, and how AWS’s virtualization layer introduces discontinuities at hypervisor preemption boundaries. We had assumed the kernel would handle it. The kernel does not guarantee what we assumed.

This post is the document I wish had existed before that incident. It covers every clock source available on a Linux trading server, what each one actually does, how fast each one is, and the specific ways each one lies to you.

The Fundamental Problem with Trading Time

Most software treats time as a monotonically increasing number you read from the OS. For a web server or a database, that abstraction is fine. For a trading system, you need to understand five distinct concepts:

  • Read latency: how long does querying the clock cost?
  • Resolution: what is the smallest increment the clock can represent?
  • Accuracy: how close to “real” UTC time is the value?
  • Monotonicity: can the value ever decrease or jump backward?
  • Cross-socket consistency: if two cores on different NUMA nodes read the clock simultaneously, do they get the same value?

No single clock source gives you all five. The selection is a tradeoff, and the wrong tradeoff costs you performance, correctness, or compliance.

TSC: The Fastest Clock and Its Failure Modes

The Time Stamp Counter is a 64-bit register incremented once per CPU cycle. On a 3.5GHz processor, that is one tick approximately every 0.28ns. Reading it costs a single RDTSC instruction - roughly 20-30 CPU cycles, or about 6-8ns on a modern Intel Xeon. Nothing else comes close.
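As a quick check of the arithmetic above, the sketch below converts a TSC frequency into a tick period and a cycle count into nanoseconds. The function names are illustrative, and 3.5 GHz is simply the example frequency from the text, not a measured value.

```python
def tick_period_ns(tsc_hz: float) -> float:
    """Length of one TSC tick in nanoseconds."""
    return 1e9 / tsc_hz


def cycles_to_ns(cycles: int, tsc_hz: float) -> float:
    """Convert a TSC cycle count into nanoseconds at a fixed TSC rate."""
    return cycles * 1e9 / tsc_hz


TSC_HZ = 3.5e9  # assumed example frequency from the text

print(f"tick period: {tick_period_ns(TSC_HZ):.3f} ns")
for cycles in (20, 30):
    print(f"{cycles} cycles = {cycles_to_ns(cycles, TSC_HZ):.1f} ns")
```

At 3.5 GHz, 20 to 30 cycles works out to roughly 5.7 to 8.6 ns, which is where the 6-8ns figure comes from.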

The appeal for HFT is obvious. At Akuna, our critical path measured inter-event intervals using raw TSC reads. You get nanosecond-resolution timestamps at near-zero cost with no kernel involvement.

The problem is that “one tick per CPU cycle” was only true on old CPUs that ran at a fixed frequency. Modern CPUs have frequency scaling (Intel’s SpeedStep, AMD’s Cool’n’Quiet), turbo boost, and power management states that change the core frequency dynamically. If the TSC incremented at the core frequency, it would drift every time the CPU scaled down. Intel fixed this with the invariant TSC, which Linux reports as two flags in /proc/cpuinfo: constant_tsc (the TSC increments at a fixed, advertised rate regardless of CPU frequency changes) and nonstop_tsc (it keeps counting in deep sleep states).

# Verify your CPU has invariant TSC
grep -m1 'flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'constant_tsc|nonstop_tsc'

If you see both flags, your TSC increments at a constant rate. If not, raw RDTSC is unsafe for latency measurement.

The multi-socket problem. Even with invariant TSC, there is a second failure mode: TSC skew across sockets. On a dual-socket system, the two CPUs have independent TSC registers that are reset at power-on. They should be synchronized, but in practice they drift. On AWS instances, I have measured TSC skew of 40-200ns between the two NUMA nodes of an m5.8xlarge. If your strategy thread on socket 0 timestamps an event and your order sender on socket 1 timestamps its transmission, the difference will include that skew.

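# Sanity check: read CLOCK_MONOTONIC while pinned to one CPU on each socket.
# This does NOT measure TSC skew directly: the two reads run in separate,
# sequential processes, so the gap between them is dominated by interpreter
# start-up. It only confirms that pinned reads on both sockets move forward.
# CPUs 0 and 8 are placeholders; pick one CPU per socket from `lscpu -e`.
for cpu in 0 8; do
  taskset -c "$cpu" python3 -c "
import time
print(f'CPU $cpu: {time.clock_gettime_ns(time.CLOCK_MONOTONIC)}')
"
done

# The kernel runs its own TSC sync check at boot; review what it logged (may need root)
dmesg | grep -i tsc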

For single-socket systems with constant_tsc, raw RDTSC is appropriate for intra-process interval measurement. For cross-socket or cross-process timestamps, you need a synchronized source.

CLOCK_MONOTONIC: The Safe Workhorse

CLOCK_MONOTONIC is the kernel’s monotonic clock. It is guaranteed to never go backward, and it is not moved by step changes to the wall clock. It advances at a rate derived from the current clocksource (normally the TSC) and slewed by NTP/PTP frequency adjustments.

The overhead is ~20-50ns when the vDSO path is active. This is dramatically cheaper than a full syscall (which would be ~200-300ns) because Linux maps a small region of kernel memory into every process’s address space (the vDSO - virtual Dynamic Shared Object) containing a fast, lock-free implementation of clock_gettime.

#include <stdint.h>
#include <time.h>

static inline uint64_t monotonic_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

The vDSO implementation reads a kernel-maintained sequence counter and timestamp, performs simple arithmetic, and returns. No syscall, no context switch. You can verify the vDSO is active with:

# Check if vDSO is mapped
cat /proc/self/maps | grep vdso
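To get a rough feel for read cost on your own machine, a sketch like the one below times repeated reads of several clocks. It runs in Python, so every figure includes interpreter overhead and will be well above the native numbers quoted in this post; treat the results as relative only. A clock that is far slower than the others suggests it is taking the syscall path. The helper name `per_call_ns` is illustrative.

```python
import time

# Clock ids exposed by Python's time module on Linux.
CLOCKS = {
    "CLOCK_MONOTONIC": time.CLOCK_MONOTONIC,
    "CLOCK_REALTIME": time.CLOCK_REALTIME,
    "CLOCK_MONOTONIC_RAW": time.CLOCK_MONOTONIC_RAW,
}


def per_call_ns(clock_id: int, calls: int = 200_000) -> float:
    """Average wall time per clock_gettime call, interpreter overhead included."""
    read = time.clock_gettime_ns
    start = time.perf_counter_ns()
    for _ in range(calls):
        read(clock_id)
    return (time.perf_counter_ns() - start) / calls


for name, clock_id in CLOCKS.items():
    print(f"{name:22s} {per_call_ns(clock_id):7.1f} ns/call (includes interpreter overhead)")
```

For absolute numbers comparable to the table later in this post, the same loop would need to be written in C and run on an isolated, pinned core.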

The lie CLOCK_MONOTONIC tells. CLOCK_MONOTONIC is not synchronized across machines. It starts from an arbitrary point at boot and drifts unless actively corrected. If machine A and machine B both emit events with CLOCK_MONOTONIC timestamps, you cannot compare those timestamps without also knowing the clock offset between the two machines - which may be in the hundreds of microseconds range even with NTP.

More subtly, CLOCK_MONOTONIC is not guaranteed to advance at the same rate in all situations. Under adjtimex frequency correction (how PTP/NTP slowly nudge the clock), the rate changes by small amounts. For nanosecond-precision intervals shorter than ~1ms, this is usually acceptable. For absolute timestamps on audit logs, it is not.

HPET: The Calibration Source

The High Precision Event Timer is a hardware timer on the motherboard that runs at a fixed frequency - typically 10-25MHz. At 14.318MHz, one tick is ~70ns. HPET is accessed via memory-mapped I/O, which makes it expensive: roughly 200-800ns per read depending on the chipset and cache state.

You should almost never read HPET directly in application code. Its role in a trading system is as a reference source - the kernel can calibrate the TSC frequency against HPET at boot, and its clocksource watchdog periodically cross-checks the TSC against a second clocksource such as HPET thereafter.

# Check HPET availability
grep -qw hpet /sys/devices/system/clocksource/clocksource0/available_clocksource && echo "HPET available" || echo "HPET not exposed"

# Verify kernel is using TSC as clocksource (preferred on modern systems)
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

On a well-configured trading server, you should see tsc as the current clocksource. If you see hpet, the kernel has decided TSC is unreliable on this system - investigate why before relying on any timestamp.

CLOCK_REALTIME vs CLOCK_TAI: Why Your Audit Log Needs TAI

CLOCK_REALTIME tracks wall-clock UTC time. It is the right clock for displaying “what time is it?” to humans. For audit logs, it has a critical failure mode: leap seconds.

UTC is periodically adjusted by inserting or removing a leap second to keep it synchronized with the Earth’s rotation. When a positive leap second is inserted, UTC goes: 23:59:58 → 23:59:59 → 23:59:60 → 00:00:00. CLOCK_REALTIME cannot represent 23:59:60, so the Linux kernel historically handled the insertion by stepping the clock back one second, which makes 23:59:59 happen twice. Cloud providers (AWS, GCP) instead “smear” the leap second - running the clock slightly slow over a 24-hour window - which means CLOCK_REALTIME is not actually UTC during the smear window. It is off by fractions of a second.

TAI (International Atomic Time) never has leap seconds. TAI is currently 37 seconds ahead of UTC (as of 2026, with the most recent leap second in 2016). If you record timestamps in TAI and need to convert to UTC, you subtract 37 seconds - but you do not have to worry about your log timestamps smearing or jumping.

#include <stdint.h>
#include <time.h>

static inline uint64_t tai_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_TAI, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

At Gemini, after the timestamp inversion incident, we switched audit log timestamps to CLOCK_TAI synchronized via PTP. The fill sequence inversions disappeared. The TAI offset from UTC is known and constant within any deployment window, so regulators can convert if they need UTC.
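One caveat worth checking before relying on CLOCK_TAI: the kernel's TAI offset starts at 0 and is only set once a time daemon (for example ptp4l/phc2sys or chrony, depending on configuration) pushes it in. Until then, CLOCK_TAI returns the same value as CLOCK_REALTIME. The sketch below reads the offset the kernel is currently applying and shows the TAI-to-UTC conversion. The function names are illustrative, and the fallback clock id 11 is the Linux value for CLOCK_TAI, used only if the Python build does not expose the constant.

```python
import time

# time.CLOCK_TAI exists on Python 3.9+; 11 is the Linux clock id (assumption for older builds).
CLOCK_TAI = getattr(time, "CLOCK_TAI", 11)


def kernel_tai_offset_s() -> int:
    """TAI-UTC offset the kernel currently applies, rounded to whole seconds."""
    tai_ns = time.clock_gettime_ns(CLOCK_TAI)
    utc_ns = time.clock_gettime_ns(time.CLOCK_REALTIME)
    return round((tai_ns - utc_ns) / 1e9)


def tai_to_utc_ns(tai_ns: int, offset_s: int = 37) -> int:
    """Convert a TAI timestamp to UTC by subtracting the TAI-UTC offset."""
    return tai_ns - offset_s * 1_000_000_000


offset = kernel_tai_offset_s()
if offset == 0:
    print("TAI offset is 0: CLOCK_TAI currently equals CLOCK_REALTIME")
else:
    print(f"kernel TAI-UTC offset: {offset} s")
```

Running this at service start-up and refusing to write audit records while the offset is 0 is one way to avoid silently logging UTC under a TAI label.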

The vDSO Fast Path: How It Actually Works

The vDSO implementation of clock_gettime(CLOCK_MONOTONIC) works as follows:

  1. Read a sequence counter from shared kernel memory. If odd, a write is in progress - spin.
  2. Read the clock’s base time (seconds and shifted nanoseconds at the kernel’s last update), the TSC value recorded at that update, and the mult/shift scaling pair from shared memory.
  3. Read the TSC via RDTSC.
  4. Compute monotonic_base + (((TSC - last_tsc_base) * mult) >> shift).
  5. Read the sequence counter again. If changed, retry from step 1.

Clock source latency comparison (measured on AWS c5n.2xlarge):

Source                    Read Latency (ns)    Resolution    Monotonic    Network-Synced
──────────────────────────────────────────────────────────────────────────────────────────
Raw RDTSC                 6-8                  ~0.3ns        Yes*         No
CLOCK_MONOTONIC (vDSO)    20-40                1ns           Yes          No
CLOCK_REALTIME (vDSO)     20-40                1ns           No           Yes (NTP)
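CLOCK_TAI (vDSO)          20-40                1ns           Yes**        Yes (NTP/PTP)
CLOCK_MONOTONIC_RAW       150-300              1ns           Yes          No
HPET                      200-800              ~70ns         Yes          No
CLOCK_MONOTONIC (syscall) 200-300              1ns           Yes          No

*TSC is monotonic per-core. It can skew across sockets even with constant_tsc, and its rate varies with core frequency without it.
**CLOCK_TAI has no leap-second steps, but it shares a timekeeper with CLOCK_REALTIME and still jumps if the wall clock is stepped.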
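The five-step read sequence described in this section can be sketched as a seqlock-style reader. This is a simulation, not the kernel's code: `TimePage` is a hypothetical stand-in for the shared time page, `read_tsc` is an injected callable standing in for RDTSC, and the 2 GHz TSC rate is an assumption chosen so the fixed-point math comes out exact.

```python
from dataclasses import dataclass


@dataclass
class TimePage:
    """Hypothetical stand-in for the kernel-maintained shared time page."""
    seq: int         # even = stable, odd = kernel update in progress
    cycle_last: int  # TSC value captured at the kernel's last update
    base_ns: int     # CLOCK_MONOTONIC nanoseconds at cycle_last
    mult: int        # fixed-point cycles-to-ns multiplier
    shift: int       # right shift applied after multiplying


def read_monotonic_ns(page, read_tsc, max_retries=1000):
    """Seqlock-style read: retry until one consistent snapshot is observed."""
    for _ in range(max_retries):
        seq_before = page.seq                        # step 1: read sequence counter
        if seq_before & 1:
            continue                                 # odd: writer active, spin
        cycle_last, base_ns = page.cycle_last, page.base_ns   # step 2: snapshot
        mult, shift = page.mult, page.shift
        delta = read_tsc() - cycle_last              # step 3: read the TSC
        ns = base_ns + ((delta * mult) >> shift)     # step 4: scale and add base
        if page.seq == seq_before:                   # step 5: re-check the counter
            return ns
    raise RuntimeError("time page never became stable")


# Assumed 2 GHz TSC: 0.5 ns per cycle, so mult / 2**shift == 0.5.
page = TimePage(seq=2, cycle_last=10_000, base_ns=1_000_000, mult=2**31, shift=32)
print(read_monotonic_ns(page, read_tsc=lambda: 12_000))  # 2000 cycles later -> 1001000
```

The multiply-then-shift form is what lets the reader avoid a division on the hot path. The retry loop is bounded here only so the simulation cannot hang.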

Verifying TSC Stability and Detecting When to Avoid RDTSC

Before using raw TSC in production, run this verification:

#!/bin/bash
# tsc_stability_check.sh
# Runs for about 10 seconds. Python cannot issue RDTSC directly, so this uses
# CLOCK_MONOTONIC_RAW (the unadjusted hardware rate, TSC-backed when the
# clocksource is tsc) as a proxy and compares it with the NTP/PTP-disciplined
# CLOCK_MONOTONIC.

python3 << 'EOF'
import statistics
import time

def read_pair_ns():
    """Read both clocks back to back; the gap between the reads is small."""
    raw = time.clock_gettime_ns(time.CLOCK_MONOTONIC_RAW)
    mono = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    return raw, mono

def rate_offset_ppm(window_s=1.0):
    """Rate of the raw clock relative to CLOCK_MONOTONIC over one window, in ppm."""
    raw1, mono1 = read_pair_ns()
    time.sleep(window_s)
    raw2, mono2 = read_pair_ns()
    return ((raw2 - raw1) - (mono2 - mono1)) / (mono2 - mono1) * 1e6

samples = [rate_offset_ppm() for _ in range(10)]

# A steady offset only means the nominal TSC rate differs from the disciplined
# rate; it is the spread between windows that signals an unstable counter.
offset_ppm = statistics.mean(samples)
jitter_ppm = statistics.stdev(samples)
print(f"Raw clock rate vs CLOCK_MONOTONIC: {offset_ppm:+.2f} ppm (jitter {jitter_ppm:.2f} ppm)")
print('SAFE for TSC use' if jitter_ppm < 10 else 'WARNING: high jitter, avoid raw TSC')
EOF

For production, the check I use before enabling raw TSC in any service:

# Check all required CPU flags
required_flags="constant_tsc nonstop_tsc rdtscp"
for flag in $required_flags; do
    if grep -qw "$flag" /proc/cpuinfo; then
        echo "OK: $flag present"
    else
        echo "MISSING: $flag - do not use raw RDTSC"
        exit 1
    fi
done

# Ensure we're on single socket or TSC sync is verified
socket_count=$(lscpu | grep "Socket(s):" | awk '{print $2}')
if [ "$socket_count" -gt 1 ]; then
    echo "WARNING: Multi-socket system - verify TSC sync before using RDTSC for cross-core timestamps"
fi

Production Implementation: A Timestamp Strategy

Here is the timestamp strategy I use for trading systems today:

Use Case                            Clock Source             Rationale
──────────────────────────────────────────────────────────────────────────────────────────────
Intra-process interval (same core)  RDTSC                    6-8ns cost, nanosecond resolution
Intra-process interval (any core)   CLOCK_MONOTONIC (vDSO)   20-40ns, safe across cores
Order construction timestamp        CLOCK_TAI (vDSO)         UTC-convertible, leap-safe
Audit log timestamp                 CLOCK_TAI (vDSO)         Regulatory compliance
Network event timestamp             HW timestamp (PTP NIC)   NIC-level, before kernel sees it
Cross-machine correlation           CLOCK_TAI synced by PTP  Sub-100ns accuracy possible

The key principle: measure intervals with the cheapest monotonic source; record absolute timestamps with the most accurate synced source.
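A minimal sketch of that principle, assuming Linux and Python 3.9+: the duration comes from the monotonic clock, and the absolute stamp for the record comes from CLOCK_TAI. The helper name `timed_call` and the record layout are illustrative, not part of any existing API.

```python
import time

# time.CLOCK_TAI exists on Python 3.9+; 11 is the Linux clock id (assumption for older builds).
CLOCK_TAI = getattr(time, "CLOCK_TAI", 11)


def timed_call(fn, *args):
    """Run fn and return (result, record) using two different clocks."""
    tai_ns = time.clock_gettime_ns(CLOCK_TAI)  # absolute stamp for the audit record
    start = time.monotonic_ns()                # interval clock, unaffected by wall-clock steps
    result = fn(*args)
    duration_ns = time.monotonic_ns() - start
    return result, {"tai_ns": tai_ns, "duration_ns": duration_ns}


result, record = timed_call(sum, [1, 2, 3])
print(result, record)
```

The two values are never subtracted from each other: the TAI stamp says when something happened, and the monotonic difference says how long it took.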

How This Breaks in Production

Failure 1: TSC jump after CPU migration. Process gets migrated from CPU 0 to CPU 8 (different socket). The new TSC is 200ns ahead. Your “interval” for the operation that spanned the migration is negative 200ns. The histogram records a negative latency. Your P99 calculation is undefined. Fix: pin latency-critical threads to a single CPU with taskset or pthread_setaffinity_np.
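For services with a Python control plane, the same pinning can be done in-process. This is a sketch assuming Linux; `pin_to_cpu` is an illustrative helper, and pinning to the first allowed CPU is only a placeholder for whichever isolated core you have reserved.

```python
import os


def pin_to_cpu(cpu: int) -> None:
    """Restrict the calling thread to a single CPU (pid 0 means 'the caller')."""
    os.sched_setaffinity(0, {cpu})


allowed = sorted(os.sched_getaffinity(0))  # CPUs this process may currently run on
pin_to_cpu(allowed[0])                     # placeholder choice: first allowed CPU
print(f"pinned to CPU {allowed[0]}; affinity is now {os.sched_getaffinity(0)}")
```

Once pinned, every TSC read in that thread comes from the same counter, so a start/end pair can no longer straddle two sockets.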

Failure 2: stale clock reads after fork. After fork(), the vDSO mappings are preserved, but if the parent has done unusual mmap operations, the child may read stale data. Symptom: clock_gettime returns timestamps from the past. Fix: take a fresh clock_gettime reading in the child after fork rather than reusing any timestamp captured in the parent, or avoid fork in latency-critical paths entirely.

Failure 3: CLOCK_REALTIME going backward. NTP correction causes a step adjustment. Your “duration” for an operation is negative because the end timestamp precedes the start. All benchmarks recording this window are corrupted. Fix: use CLOCK_MONOTONIC for interval measurement, CLOCK_REALTIME only for absolute display. Never compute a duration by subtracting CLOCK_REALTIME values.

Failure 4: AWS hypervisor preemption causing an apparent time jump. The hypervisor deschedules the vCPU for ~100µs to service another tenant. Your code executes nothing during that window, but the guest clock keeps advancing, so the next clock_gettime(CLOCK_MONOTONIC) call sees a large forward jump. Your latency histogram shows a spike that reflects steal time, not your code path. Fix: use hardware timestamps from PTP-capable NICs for events that must be accurately timed; the NIC timestamps before the kernel stack and is not subject to hypervisor interruption artifacts.
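One coarse way to tell whether a spike coincided with preemption is to sample the guest's steal counter around a measurement window. The sketch below parses the aggregate `cpu` line of /proc/stat, where steal is the eighth value after the label. The helper names are illustrative. The counter is reported in scheduler ticks (typically 10ms), so it can flag a window as suspect but cannot attribute an individual 100µs gap.

```python
def steal_ticks(stat_text: str) -> int:
    """Return the aggregate 'steal' counter from the contents of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Layout: cpu user nice system idle iowait irq softirq steal ...
            return int(fields[8]) if len(fields) > 8 else 0
    raise ValueError("no aggregate 'cpu' line found")


def read_steal_ticks(path: str = "/proc/stat") -> int:
    """Read the current steal counter from the running system."""
    with open(path) as f:
        return steal_ticks(f.read())


sample = "cpu  100 0 50 1000 5 0 2 7 0 0\ncpu0 100 0 50 1000 5 0 2 7 0 0\n"
print(steal_ticks(sample))  # 7
```

Sampling `read_steal_ticks()` before and after a benchmark run, and discarding runs where it changed, keeps hypervisor noise out of the histogram.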

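Failure 5: Leap second smear corrupting interval computation. During the 24-hour smear window, CLOCK_REALTIME advances slightly slower than real time: absorbing 1 second over 86,400 seconds is a rate error of about 11.6 ppm. If you compute a duration by subtracting CLOCK_REALTIME values and one endpoint is inside the smear window and the other is outside, you get a systematically wrong duration. Every interval measured during the smear is short by about 11.6µs per second of elapsed time, and absolute timestamps are off from true UTC by fractions of a second. Fix: switch to CLOCK_TAI for all latency-critical timestamping, disciplined by an unsmeared source such as a PTP grandmaster. CLOCK_TAI has no leap-second adjustment of its own, but it does inherit any smear applied by the upstream time source.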
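The size of the smear error for any interval follows directly from the window length. The sketch below assumes a linear smear of one second over 24 hours, as described in this post; the function names are illustrative.

```python
SMEAR_WINDOW_S = 86_400  # assumed 24-hour linear smear window
LEAP_S = 1               # one leap second absorbed over the window


def smear_rate_ppm(window_s: int = SMEAR_WINDOW_S) -> float:
    """Rate error of the smeared clock in parts per million."""
    return LEAP_S / window_s * 1e6


def interval_error_ns(interval_ns: float, window_s: int = SMEAR_WINDOW_S) -> float:
    """How far off a measured interval is when it lies fully inside the smear."""
    return interval_ns * LEAP_S / window_s


print(f"rate error: {smear_rate_ppm():.2f} ppm")
for label, ns in (("1 ms", 1e6), ("1 s", 1e9), ("1 min", 60e9)):
    print(f"{label:>6} interval is off by {interval_error_ns(ns):,.1f} ns")
```

For microsecond-scale latencies the per-interval error is negligible; it becomes material for second-scale durations and for any comparison against an unsmeared clock on another machine.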

Failure 6: HPET as clocksource in production. The kernel fell back to HPET because TSC was deemed unreliable (often happens on VMs with poor TSC emulation). Every clock_gettime(CLOCK_MONOTONIC) call now costs 200-800ns instead of 20-40ns. A strategy that calls clock_gettime 100,000 times per second (not unusual for per-event timestamping) adds 20-80ms of pure timestamp overhead per second. Fix: check /sys/devices/system/clocksource/clocksource0/current_clocksource on startup and alert if it is not tsc. If HPET, investigate the TSC reliability issue before deploying.
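A start-up check along these lines can be sketched as follows. The sysfs path is the one quoted above; the function names and the overhead helper are illustrative. The path is a parameter so the check can be exercised against a test file.

```python
CLOCKSOURCE_PATH = "/sys/devices/system/clocksource/clocksource0/current_clocksource"


def current_clocksource(path: str = CLOCKSOURCE_PATH) -> str:
    """Return the name of the clocksource the kernel is currently using."""
    with open(path) as f:
        return f.read().strip()


def clocksource_ok(path: str = CLOCKSOURCE_PATH, expected: str = "tsc") -> bool:
    """True when the active clocksource matches the expected one."""
    return current_clocksource(path) == expected


def timestamp_overhead_ms_per_s(calls_per_s: int, ns_per_call: float) -> float:
    """Total time spent reading the clock each second, in milliseconds."""
    return calls_per_s * ns_per_call / 1e6


# The overhead arithmetic from the text: 100,000 calls/s at HPET read costs.
for ns in (200, 800):
    print(f"{ns} ns/call -> {timestamp_overhead_ms_per_s(100_000, ns):.0f} ms per second")
```

Calling `clocksource_ok()` during service initialization and raising an alert on `False` catches the fallback before it shows up as unexplained latency.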


Related reading: PTP in Production: Solarflare Hardware Timestamping covers how to get sub-100ns accuracy across machines. The Anatomy of a Sub-50µs Trade shows how these clocks fit into the full order lifecycle.