PTP in Production: Implementing IEEE 1588 on Solarflare for Nanosecond Exchange Fairness

How I implemented PTP hardware timestamping on Solarflare SFN8522 at Gemini to achieve ±30ns clock sync and sub-100µs MiFID II RTS-25 compliance.


At Gemini, I implemented PTP hardware timestamping on Solarflare SFN8522 NICs to achieve sub-100ns clock synchronization between our co-location nodes and the exchange matching engine. This is not an unusual capability - Solarflare’s onload stack has supported hardware PTP timestamps since the SFN6000 series - but the implementation details matter enormously, and the gap between “PTP running” and “PTP working correctly” is measured in nanoseconds that will eventually show up in your audit trail or your P&L.

Our achieved accuracy was ±30ns from the exchange’s grandmaster clock. MiFID II RTS-25 requires timestamps within 100µs of UTC for high-frequency algorithmic trading (the 1ms limit applies to other electronic trading); we were inside 50µs with hardware timestamping. This post covers the full implementation: how PTP works mechanically, why software timestamps fail for trading, the Solarflare-specific configuration, and how to verify you have actually achieved what you think you have.

Why Software Timestamps Are Insufficient for Trading

Before explaining what PTP does, it is worth being precise about what it replaces and why the replacement is necessary.

NTP - the standard time synchronization protocol running on every Linux server - achieves accuracy of roughly ±1ms on a well-connected server with a local stratum-1 source, and ±5-50ms on a server syncing from the public internet. This sounds adequate. It is not.

The deeper problem is not the millisecond accuracy of NTP. It is the latency jitter in software timestamps. When a packet arrives at a Linux machine, the kernel timestamps it during interrupt processing - the sk_buff receive timestamp is recorded in softirq context, after the packet has traversed the DMA buffer, the driver ring, and the kernel network stack. On a busy server, the time between physical arrival at the NIC and the software timestamp being written can vary by 50-500µs depending on CPU load, interrupt coalescing settings, and kernel scheduler behavior.

Physical packet arrival (wire)
NIC hardware timestamp (PTP): +0ns (reference)
DMA to host memory:           +100-200ns
Driver interrupt:             +1-50µs (variable)
softirq processing:           +5-100µs (variable)
SO_TIMESTAMPING kernel:       +50-500µs (variable)
User-space recvmsg():         +1-5µs (syscall)

When you use SO_TIMESTAMPING with SOF_TIMESTAMPING_SOFTWARE, you get the kernel timestamp - potentially 500µs after the packet actually arrived. When you use SOF_TIMESTAMPING_RAW_HARDWARE on a PTP-capable NIC, you get the timestamp the NIC hardware recorded at the physical layer, accurate to ~5ns.

For PTP’s delay measurement to work correctly, these hardware timestamps are essential. Without them, your delay measurements include the jitter of the software stack, and the clock correction you compute will be wrong by hundreds of microseconds.

How PTP Works: The Two-Step Mechanism

PTP (IEEE 1588-2008) synchronizes clocks by measuring the network delay between two endpoints and correcting for it. The fundamental insight is that if you can precisely timestamp when a packet is sent and when it arrives, you can compute the one-way propagation delay and adjust accordingly.

The two-step PTP exchange works as follows:

Master                          Slave
  │                               │
  ├──── Sync (t1) ───────────────►│  Slave records arrival time t2
  ├──── Follow_Up (t1) ──────────►│  Master sends precise t1
  │                               │
  │◄─── Delay_Req (t3) ───────────┤  Slave sends, records t3
  ├──── Delay_Resp (t4) ──────────►│  Master records t4, sends it
  │                               │
  Offset = ((t2 - t1) - (t4 - t3)) / 2
  Delay  = ((t2 - t1) + (t4 - t3)) / 2

The Sync message carries the master’s transmit time t1. In two-step mode, t1 is not embedded in the Sync itself (because hardware timestamping records the actual wire-time, which is not known until after the packet leaves the NIC) - instead, t1 is sent in a Follow_Up message immediately after. One-step mode embeds t1 directly in the Sync via hardware assist, reducing the message count but requiring more complex NIC support.

The delay request is sent from slave to master, which records the arrival time t4 and returns it in the Delay_Resp. The four timestamps allow computing both the offset (how far the slave clock is from the master) and the mean one-way path delay, on the assumption that the path is symmetric. The slave’s PTP software then applies frequency and phase corrections to bring the slave clock into alignment.
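The arithmetic is easier to trust with numbers in hand. The following is a minimal sketch in C, not linuxptp's actual servo code: the names ptp_sample and ptp_compute are made up for illustration, and all four timestamps are assumed to be nanosecond counts read from the clocks shown in the diagram (t1 and t4 on the master, t2 and t3 on the slave).

```c
#include <stdint.h>

/* Result of one Sync / Delay_Req exchange, in nanoseconds. */
struct ptp_sample {
    int64_t offset_ns;          /* slave clock minus master clock */
    int64_t mean_path_delay_ns; /* one-way delay, assuming a symmetric path */
};

/* Hypothetical helper: apply the two formulas from the diagram. */
static struct ptp_sample ptp_compute(int64_t t1, int64_t t2,
                                     int64_t t3, int64_t t4)
{
    int64_t master_to_slave = t2 - t1; /* = delay + offset */
    int64_t slave_to_master = t4 - t3; /* = delay - offset */
    struct ptp_sample s;

    s.offset_ns = (master_to_slave - slave_to_master) / 2;
    s.mean_path_delay_ns = (master_to_slave + slave_to_master) / 2;
    return s;
}
```

With a slave clock running 30ns ahead and a true one-way delay of 380ns, ptp_compute(1000, 1410, 2000, 2350) yields an offset of 30 and a delay of 380. The division by two is only right when both directions have equal delay; any asymmetry shows up as an offset error of half the difference.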

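Transparent clocks and boundary clocks are intermediary devices (switches, routers) that keep queuing delay out of the measurement. A transparent clock adds its residence time (how long the packet was queued) to the correction field in the PTP message. A boundary clock terminates PTP instead: it synchronizes to the upstream master on one port and acts as master to downstream devices on its other ports. Either approach preserves end-to-end accuracy through switches that would otherwise add variable queuing delay.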

For direct connections to an exchange’s PTP grandmaster (typical in co-location), the path is often a single switch. If that switch supports transparent clock or boundary clock mode, you get much better accuracy than if it treats PTP frames as ordinary UDP traffic.

Solarflare SFN8522: Hardware Timestamping Configuration

The SFN8522 timestamps packets in hardware at the MAC, right next to the physical layer: the timestamp is recorded as the start of the frame (the Start Frame Delimiter) leaves or arrives at the line. This is as close to wire time as a host NIC can get.

Prerequisites:

# Verify NIC model and driver
ethtool -i eth0 | grep -E 'driver|version'
# Should show: driver: sfc

# Verify PTP hardware clock support
ethtool -T eth0
# Should show: PTP Hardware Clock: 0
# and "hardware-transmit" and "hardware-receive" in capabilities

The linuxptp package provides ptp4l (the PTP daemon) and phc2sys (which syncs the Linux system clock to the PTP hardware clock on the NIC).

# Install
apt-get install linuxptp

# Verify PTP clock device
ls -la /dev/ptp*
# Should show /dev/ptp0 for the SFN8522

ptp4l configuration for Solarflare on a co-location connection:

# /etc/ptp4l.conf
[global]
# Use hardware timestamping
time_stamping           hardware

# Two-step mode (standard for SFN8522)
twoStepFlag             1

# Clock accuracy (for grandmaster priority)
clockAccuracy           0xFE

# Transport: L2 (Ethernet) for lowest latency, or UDP for IP networks
# For exchange co-location, L2 is preferred
network_transport       L2

# Domain number - must match exchange grandmaster
domainNumber            0

# Logging
logging_level           6
summary_interval        1

# PI servo parameters - tune for your network characteristics
pi_proportional_const   0.0
pi_integral_const       0.0
pi_proportional_scale   0.0
pi_proportional_exponent -0.3
pi_proportional_norm_max 0.7
pi_integral_scale       0.0
pi_integral_exponent    0.4
pi_integral_norm_max    0.3
step_threshold          0.000000100

[eth0]
# Interface-specific settings

# Start ptp4l as slave
ptp4l -f /etc/ptp4l.conf -i eth0 -s

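# Sync the system clock (CLOCK_REALTIME) to the PHC (PTP hardware clock)
# -s: source (master) clock, here the NIC's PHC
# -c: clock to be synchronized, here the system clock
# -w: wait until ptp4l is synchronized before starting
# -n: PTP domain number used by ptp4l (must match domainNumber above)
# -O: offset of the synchronized clock relative to the source, in seconds.
#     The PHC runs on TAI and CLOCK_REALTIME on UTC, which is 37 s behind,
#     so the value is -37 - update when leap seconds change
phc2sys -s /dev/ptp0 -c CLOCK_REALTIME -w -n 0 -O -37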

Verifying Sync Quality

“PTP is running” is not the same as “PTP is accurate.” Verify with pmc (PTP Management Client):

# Query offset from master
pmc -u -b 0 'GET CURRENT_DATA_SET'
# Look for:
# offsetFromMaster   - current offset in ns (target: < 100ns)
# meanPathDelay      - one-way delay to master in ns

# Check grandmaster identity (grandmasterIdentity field)
pmc -u -b 0 'GET PARENT_DATA_SET'

A well-functioning implementation on a dedicated co-location link:

offsetFromMaster: 28 ns
meanPathDelay:   380 ns

A problematic implementation (software timestamps instead of hardware):

offsetFromMaster: 45000 ns
meanPathDelay:   210000 ns

The phc2sys log gives you a continuous stream of offset measurements:

phc2sys[1234.456]: CLOCK_REALTIME phc offset  31 s2 freq +12340 delay   180
phc2sys[1234.556]: CLOCK_REALTIME phc offset -12 s2 freq +12328 delay   180
phc2sys[1234.656]: CLOCK_REALTIME phc offset   8 s2 freq +12334 delay   178

The offset column is the difference between the PHC and CLOCK_REALTIME in nanoseconds. If this is consistently under ±100ns, your system clock is synchronized to the PTP hardware clock at that precision.
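To feed these lines into monitoring, they need parsing. The following is a rough sketch that assumes the log lines look exactly like the three samples above; syslog or journald may prepend their own prefix, in which case the format string needs adjusting. The struct and function names are hypothetical.

```c
#include <stdio.h>

/* One parsed phc2sys log line (field names are illustrative). */
struct phc2sys_sample {
    double uptime_s;   /* value inside the square brackets */
    long offset_ns;    /* offset between the two clocks, nanoseconds */
    int servo_state;   /* digit after 's' (2 in the samples above) */
    long freq_ppb;     /* frequency adjustment, parts per billion */
    long delay_ns;     /* measured clock read delay, nanoseconds */
};

/* Returns 1 if the line matched the expected layout, 0 otherwise. */
static int parse_phc2sys_line(const char *line, struct phc2sys_sample *out)
{
    int matched = sscanf(line,
                         "phc2sys[%lf]: %*s phc offset %ld s%d freq %ld delay %ld",
                         &out->uptime_s, &out->offset_ns, &out->servo_state,
                         &out->freq_ppb, &out->delay_ns);
    return matched == 5;
}
```

Whitespace in the format string matches any run of spaces, so the column padding in the log does not matter, and %ld accepts the explicit + sign on the freq field.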

Critical monitoring alert thresholds I use:

offsetFromMaster > 500ns: WARNING - investigate switch/path issues
offsetFromMaster > 5000ns: CRITICAL - halt trading, investigate
phc2sys offset > 1000ns: WARNING - PHC and system clock diverging
ptp4l not synced (state != SLAVE): CRITICAL - halt trading
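These thresholds translate directly into a small classifier. The sketch below is one way to encode them, assuming the caller has already extracted the two offsets (in nanoseconds) and the port state from pmc and the phc2sys log; the enum and function names are invented for the example.

```c
#include <stdlib.h> /* llabs */

enum ptp_alert { PTP_OK = 0, PTP_WARNING = 1, PTP_CRITICAL = 2 };

/* Hypothetical helper: map current measurements to the alert levels above. */
static enum ptp_alert ptp_classify(long long offset_from_master_ns,
                                   long long phc2sys_offset_ns,
                                   int port_is_slave)
{
    if (!port_is_slave)
        return PTP_CRITICAL;            /* ptp4l not synced: halt trading */
    if (llabs(offset_from_master_ns) > 5000)
        return PTP_CRITICAL;            /* halt trading, investigate */
    if (llabs(offset_from_master_ns) > 500)
        return PTP_WARNING;             /* investigate switch/path issues */
    if (llabs(phc2sys_offset_ns) > 1000)
        return PTP_WARNING;             /* PHC and system clock diverging */
    return PTP_OK;
}
```

Critical conditions are checked first so that a warning on one metric can never mask a halt condition on another.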

Integrating Hardware Timestamps into Your Application

To use the NIC’s hardware timestamps in your application, you need SO_TIMESTAMPING:

#include <string.h>
#include <time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/errqueue.h>    // struct scm_timestamping
#include <linux/net_tstamp.h>  // SOF_TIMESTAMPING_* flags
#include <linux/sockios.h>     // SIOCSHWTSTAMP

// Enable hardware timestamps on socket.
// This sets only the socket-level flags; the NIC must also be timestamping
// the traffic of interest (SIOCSHWTSTAMP ioctl with struct hwtstamp_config).
int enable_hw_timestamps(int sock) {
    int flags = SOF_TIMESTAMPING_RX_HARDWARE |   // Request HW timestamps on receive
                SOF_TIMESTAMPING_RAW_HARDWARE |  // Report the raw HW timestamp
                SOF_TIMESTAMPING_TX_HARDWARE |   // Request HW timestamps on transmit
                SOF_TIMESTAMPING_OPT_CMSG |      // Deliver via ancillary data
                SOF_TIMESTAMPING_OPT_ID;         // Tag each timestamp with ID

    return setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}

// Receive with hardware timestamp.
// *hw_ts is left all-zero if the packet carried no hardware timestamp.
ssize_t recv_with_hw_timestamp(int sock, void *buf, size_t len,
                                struct timespec *hw_ts) {
    char ctrl_buf[CMSG_SPACE(sizeof(struct scm_timestamping))];
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl_buf,
        .msg_controllen = sizeof(ctrl_buf),
    };

    memset(hw_ts, 0, sizeof(*hw_ts));

    ssize_t ret = recvmsg(sock, &msg, 0);
    if (ret < 0)
        return ret;  // control data is not valid after a failed recvmsg()

    for (struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); cmsg;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SO_TIMESTAMPING) {
            struct scm_timestamping *ts = (struct scm_timestamping *)CMSG_DATA(cmsg);
            // ts->ts[2] is the raw hardware timestamp
            *hw_ts = ts->ts[2];
        }
    }
    return ret;
}

With Solarflare’s onload stack, you can go further: onload delivers the hardware timestamp directly to user-space without any kernel involvement at all, using ONLOAD_SOF_TIMESTAMPING_STREAM:

# Using onload for lower-overhead hardware timestamps
EF_RX_TIMESTAMPING=1 EF_TX_TIMESTAMPING=1 onload ./your_trading_app

What We Achieved and Why It Mattered

Our final numbers at Gemini on the SFN8522 hardware:

Metric                             Value
────────────────────────────────────────────────────────
PTP offset from exchange clock     ±30ns (steady state)
PTP offset worst case (24h)        ±85ns
phc2sys correction residual        ±15ns
Application receive timestamp      NIC hardware (before kernel)
Timestamp accuracy vs UTC          < 50µs (via exchange's GPS ref)
MiFID II RTS-25 requirement        100µs for HFT algorithmic trading
Our margin vs requirement          2x

The business implication was two-fold. First, compliance: our audit trail was demonstrably accurate at sub-millisecond precision with verifiable calibration artifacts. When regulators requested our timestamp evidence under RTS-25, we had phc2sys logs, pmc offset measurements, and a documented calibration procedure - not a generic “we use NTP” statement. Second, operational: the accurate hardware timestamps let us measure our own order processing latency (from NIC receive to NIC transmit) at sub-microsecond precision, which was how we found and fixed several latency regressions.

How This Breaks in Production

Failure 1: Switch replacing PTP hardware timestamps with software timestamps. A switch upgrade replaces the transparent-clock-capable switch with a standard L3 switch. PTP frames are now queued like ordinary traffic, or at best timestamped in software by the switch’s management CPU, adding 1-10ms of jitter to the delay measurement. Your pmc output shows offsetFromMaster drifting to ±50µs. The system is still “running PTP” - it is just not doing it accurately. Fix: verify that your switch supports transparent clock (or boundary clock) mode and that it is enabled. Check with the exchange’s co-location team what switch hardware is in your path.

Failure 2: Kernel upgrade changing hardware timestamp behavior. A kernel upgrade changes the SO_TIMESTAMPING behavior subtly - the hardware timestamp path is now disabled by default for this NIC. Your application falls back to software timestamps without any error. The timestamps are now 100-500µs less accurate than expected. Fix: add a startup check that verifies hardware timestamps are active: receive 100 test packets and confirm that every one carries a non-zero raw hardware timestamp (ts[2]); if any are missing, alert and refuse to start.
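One cheap way to make the Failure 2 startup check concrete is to look at whether the raw hardware slot was filled in at all. The sketch below assumes you have collected the ts[2] value from struct scm_timestamping for each test packet and left it zeroed whenever no timestamp control message arrived; hw_timestamps_active is a hypothetical helper name, not part of any library.

```c
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical startup gate: returns 1 only if every collected raw
 * hardware timestamp is non-zero, 0 if any is missing or no samples
 * were provided.
 */
static int hw_timestamps_active(const struct timespec *raw_hw, size_t n)
{
    if (raw_hw == NULL || n == 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        if (raw_hw[i].tv_sec == 0 && raw_hw[i].tv_nsec == 0)
            return 0; /* this packet had no hardware timestamp */
    }
    return 1;
}
```

Treating a single missing timestamp as failure is deliberately strict: a silent fallback tends to affect every packet, so there is little reason to tolerate partial coverage at startup.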

Failure 3: ptp4l restarting without phc2sys restarting. ptp4l restarts after a configuration change, resynchronizes the PHC, but phc2sys (which syncs CLOCK_REALTIME to the PHC) does not restart. CLOCK_REALTIME is now tracking a stale PHC that drifted during the ptp4l restart window. System timestamps are off by up to 10ms. Fix: use systemd PartOf= or BindsTo= to ensure phc2sys always restarts when ptp4l restarts. Monitor the phc2sys offset separately from the ptp4l offset.

Failure 4: Multiple PTP domains on the same network. An exchange operator uses domain 0 for their grandmaster. Your internal PTP time distribution infrastructure uses domain 0 as well. ptp4l syncs to the wrong grandmaster - possibly your own server - creating a circular synchronization dependency. Fix: always confirm the domain number with the exchange. Run separate physical interfaces for exchange PTP and internal time distribution.

Failure 5: TAI offset not updated after a leap second. A leap second is inserted. The TAI offset from UTC increases from 37 to 38 seconds. The offset hard-coded in your phc2sys -O option is now wrong by one second, so CLOCK_REALTIME is 1 second off relative to any system that updated correctly. Fix: automate TAI offset updates from the IERS leap second table, or drop -O and let phc2sys -w take the current UTC offset from ptp4l. Maintain a monitoring check that compares CLOCK_TAI - CLOCK_REALTIME against the expected TAI offset.

Failure 6: NIC firmware bug causing hardware timestamp drift. Solarflare releases a firmware update. Post-update, hardware timestamps drift by ±200ns over a 10-minute window due to a PPS (pulse-per-second) synchronization bug in the new firmware. Everything looks correct from the kernel’s perspective - pmc shows good offset - but application-level timestamps are oscillating. Fix: pin your NIC firmware version in your configuration management system. Test timestamp stability after any NIC driver or firmware update using a 1-hour logging run before deploying to production.


Related reading: MiFID II Clock Sync Requirements for Crypto Compliance covers the regulatory side of what this infrastructure enables. Why Trading Time Is Different covers the clock sources your application should use once PTP is feeding the system clock.
