Real-Time Scheduling on Linux: SCHED_FIFO, SCHED_DEADLINE, and Priority Inversion in Trading Engines

SCHED_FIFO for HFT, priority inversion from the Mars Pathfinder to trading latency, priority inheritance mutexes, and the near-miss kernel lockup from a misconfigured RT process.

#real-time #sched-fifo #linux-scheduler #priority-inversion #hft

In 1997, the Mars Pathfinder lander started resetting itself repeatedly after landing. The reset was triggered by a watchdog timer that fired when a critical task did not complete within its scheduled deadline. The cause was a textbook priority inversion: a low-priority meteorological data collection task held a mutex that the high-priority information bus management task needed. A medium-priority communications task that had no dependency on the mutex preempted the low-priority task, so the low-priority task could not run to release the mutex. The high-priority task starved, the watchdog fired, the lander reset.

The fix was a one-line patch: enable priority inheritance on the mutex.

I think about the Mars Pathfinder regularly when writing trading engine threading code. Priority inversion on a trading system does not reset the hardware - it just loses a trade, incorrectly computes a risk position, or sends an order at a stale price. The consequences are financial rather than spectacular, but the mechanism is identical.

This post covers real-time scheduling on Linux for trading systems: what each scheduling class does, how to configure it correctly, and what failure modes to watch for.

The Linux Scheduling Classes

Linux has four scheduling policies that matter here, and choosing between them requires understanding what each guarantees.

SCHED_OTHER (CFS - Completely Fair Scheduler) is the default for all processes. CFS attempts to give each runnable thread a fair share of CPU time, weighted by nice value. Each nice step changes a thread's weight by roughly 1.25x, so a nice -20 thread gets about 99% of the CPU when competing with a single nice 0 thread. But “fair share” is still a share: when your trading strategy thread is runnable at the same time as a dozen other threads, it competes with all of them, and a very negative nice value only increases its weight - it does not guarantee that the thread runs the moment it becomes runnable. CFS does not provide latency guarantees; it provides fairness guarantees. These are different.
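
To make the weighting concrete, here is a minimal sketch of how nice values translate into CPU share between two competing threads. It assumes the usual approximation that each nice step scales the weight by 1.25x around a nice-0 weight of 1024; the kernel itself uses a precomputed table, so the results are close but not exact. The function names are hypothetical.

```c
/* Approximate CFS load weight: nice 0 is 1024 and each nice step scales
 * the weight by about 1.25x. The kernel uses a precomputed table; this
 * loop is only an approximation for illustration. */
static double approx_weight(int nice) {
    double w = 1024.0;
    for (int i = 0; i < nice; i++)
        w /= 1.25;               /* positive nice: lighter weight */
    for (int i = 0; i > nice; i--)
        w *= 1.25;               /* negative nice: heavier weight */
    return w;
}

/* Share of one CPU that thread A receives when it competes only with
 * thread B and both are always runnable. */
double cpu_share(int nice_a, int nice_b) {
    double wa = approx_weight(nice_a);
    double wb = approx_weight(nice_b);
    return wa / (wa + wb);
}
```

Under this approximation, cpu_share(-20, 0) comes out at roughly 0.99 and cpu_share(-1, 0) at roughly 0.56. Neither number says anything about how quickly the thread gets the CPU after waking up, which is the property a trading loop actually needs.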

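SCHED_FIFO is a fixed-priority real-time policy that always takes precedence over SCHED_OTHER. The maximum priority is 99. A SCHED_FIFO thread preempts any CFS thread immediately and is not preempted except by:

  • A real-time thread at a strictly higher priority (none exists above 99; a thread at the same priority waits its turn rather than preempting)
  • A SCHED_DEADLINE task or a kernel stop-class thread such as migration/N, both of which rank above SCHED_FIFO
  • Kernel interrupt handlers (which are not user threads)
  • NMI (non-maskable interrupt) - hardware only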

A SCHED_FIFO thread runs until it voluntarily yields (by calling sched_yield(), blocking on I/O, or sleeping) or is preempted by a higher-priority real-time thread. There is no time quantum; it does not get preempted when its “time slice” runs out. For a trading loop that polls for market data and processes orders, SCHED_FIFO at priority 99 means the thread owns its CPU core until it chooses to give it up.

SCHED_RR is like SCHED_FIFO but with a time quantum (default: 100ms). After the quantum expires, the thread is moved to the back of its priority queue. For trading, prefer SCHED_FIFO - you want deterministic preemption behavior, and the time quantum in SCHED_RR adds a potential source of latency.

SCHED_DEADLINE is the most sophisticated: the kernel implements Earliest Deadline First (EDF) scheduling. You specify a runtime (how much CPU time you need per period), a deadline (how long after the start of each period you need to be done), and the period itself. The kernel runs an admission test when a task asks for SCHED_DEADLINE and rejects it if the total utilization would exceed the allowed bandwidth (95% by default, the same sched_rt_runtime_us / sched_rt_period_us ratio that governs RT throttling); tasks that are admitted and stay within their runtime will meet their deadlines. This sounds ideal for trading but has a critical gotcha covered in the SCHED_DEADLINE section below.
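
The admission idea can be sketched as a utilization sum. This is a simplified single-CPU version, not the kernel's implementation: struct dl_task and both function names are hypothetical, values are in nanoseconds, and the cap is passed in by the caller (0.95 matches the default bandwidth limit).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task descriptor for illustration; all values in nanoseconds. */
struct dl_task {
    uint64_t runtime_ns;
    uint64_t deadline_ns;
    uint64_t period_ns;
};

/* Sum of runtime/period over all tasks: the fraction of one CPU they need. */
double dl_total_utilization(const struct dl_task *tasks, size_t n) {
    double u = 0.0;
    for (size_t i = 0; i < n; i++)
        u += (double)tasks[i].runtime_ns / (double)tasks[i].period_ns;
    return u;
}

/* Returns 1 if every task satisfies 0 < runtime <= deadline <= period and
 * the total utilization fits under cap, otherwise 0. */
int dl_admissible(const struct dl_task *tasks, size_t n, double cap) {
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].runtime_ns == 0 ||
            tasks[i].runtime_ns > tasks[i].deadline_ns ||
            tasks[i].deadline_ns > tasks[i].period_ns)
            return 0;
    }
    return dl_total_utilization(tasks, n) <= cap;
}
```

Two tasks that need 500µs and 400µs out of every 1ms fit under a 0.95 cap; adding a third that needs 100µs per 1ms pushes the sum to 1.0 and the set is rejected.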

Configuring SCHED_FIFO

# Set a running process to SCHED_FIFO priority 99
# Requires root or CAP_SYS_NICE capability
chrt -f -p 99 <pid>

# Start a new process with SCHED_FIFO priority 80
chrt -f 80 ./trading_engine

# Verify
chrt -p <pid>
# Output: pid <pid>'s current scheduling policy: SCHED_FIFO
#         pid <pid>'s current scheduling priority: 99

# Allow non-root users to set real-time priority up to 90
# Add to /etc/security/limits.conf (the @ prefix means "members of group trading"):
echo "@trading soft rtprio 90" >> /etc/security/limits.conf
echo "@trading hard rtprio 90" >> /etc/security/limits.conf
# Or for a specific user:
echo "nikhil soft rtprio 99" >> /etc/security/limits.conf
echo "nikhil hard rtprio 99" >> /etc/security/limits.conf

# Verify limits are in effect (pam_limits applies them at login, so start a new session)
ulimit -r
# Should show the configured limit, e.g. 99 for user nikhil

Setting priority in code, with error checking:

#include <sched.h>
#include <stdio.h>   /* perror, fprintf */
#include <string.h>

int set_realtime(int priority) {
    struct sched_param param;
    memset(&param, 0, sizeof(param));
    param.sched_priority = priority;

    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return -1;
    }

    /* Verify it took effect */
    int policy = sched_getscheduler(0);
    if (policy != SCHED_FIFO) {
        fprintf(stderr, "Expected SCHED_FIFO, got %d\n", policy);
        return -1;
    }

    return 0;
}

/* In main, after initialization and before the hot loop: */
if (set_realtime(99) != 0) {
    fprintf(stderr, "FATAL: Could not set SCHED_FIFO. Trading without RT priority.\n");
    /* Either exit or continue with degraded performance - your choice */
}

Thread Priorities Within the Trading Engine

Not all threads in a trading engine have the same latency requirements. Using priority 99 for everything defeats the purpose - the scheduler uses priorities to resolve contention, and if everything is priority 99, you are back to FIFO ordering (which is correct within the same priority level, but provides no protection between components).

A practical priority scheme for a trading engine:

Priority 99: Market data receive thread
             (must process packets as soon as they arrive)

Priority 98: Strategy evaluation thread
             (must run immediately after market data is processed)

Priority 97: Order transmit thread
             (must run immediately after strategy decides to order)

Priority 80: Order acknowledgement processing thread
             (important but not on the hot path for alpha)

Priority 70: Risk monitoring thread
             (needs to run frequently but not immediately)

Priority 60: Logging and metrics thread
             (low latency requirement - just needs to keep up)

SCHED_OTHER: Management, config, health check threads
             (no real-time priority; fine to be preempted freely)

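When two of these threads are runnable on the same core, the higher-priority one always gets the CPU. This creates a strict execution order that matches the data dependency graph. If you reserve priority 99 for kernel threads, as recommended in the near-miss section below, shift the top of the scheme down (for example 95/94/93); only the relative order matters.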
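
One way to apply a scheme like this is to give each thread its policy and priority at creation time. The sketch below is an illustration, not the engine's actual code: spawn_fifo_thread and idle_worker are hypothetical names, and the call returns EPERM unless the process has CAP_SYS_NICE or a sufficient rtprio limit.

```c
#include <errno.h>     /* EPERM, for callers inspecting the return value */
#include <pthread.h>
#include <sched.h>

/* Hypothetical placeholder worker; a real engine would run its loop here. */
static void *idle_worker(void *arg) {
    return arg;
}

/* Start fn on a new SCHED_FIFO thread at rt_priority.
 * Returns 0 on success or a pthread error code such as EPERM. */
int spawn_fifo_thread(pthread_t *tid, int rt_priority,
                      void *(*fn)(void *), void *arg) {
    pthread_attr_t attr;
    struct sched_param param = { .sched_priority = rt_priority };

    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* Without PTHREAD_EXPLICIT_SCHED the policy and priority set on the
     * attribute are ignored and the thread inherits its creator's. */
    rc = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0)
        rc = pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    if (rc == 0)
        rc = pthread_attr_setschedparam(&attr, &param);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}
```

A caller would invoke this once per thread in the table, for example spawn_fifo_thread(&tid, 70, idle_worker, NULL) for a risk-monitoring thread, and treat a non-zero return as a startup failure rather than silently running without real-time priority.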

Priority Inversion and Priority Inheritance

Priority inversion occurs when a low-priority thread holds a resource (mutex, semaphore) that a high-priority thread needs. The high-priority thread blocks waiting for the resource. A medium-priority thread that has no dependency on the resource then becomes runnable and preempts the low-priority holder: the scheduler sees only that it outranks the holder, and the high-priority thread is not runnable. The low-priority thread cannot run and cannot release the resource, so the high-priority thread is stuck behind the medium-priority one for as long as that thread keeps running.

Timeline without priority inheritance:

Time →   0ms    1ms    2ms    3ms    4ms    5ms    6ms    7ms    8ms
───────────────────────────────────────────────────────────────────────
LP (P=1) [lock] [held........sleeping/preempted]        [rel]
MP (P=5)                [running......................]
HP (P=99)              [blocked waiting for lock........][RUNS]

HP starved for 6ms by LP/MP interaction.
In trading, this means 6ms of stale quotes.

The solution is priority inheritance: when a high-priority thread blocks on a mutex held by a lower-priority thread, the mutex implementation temporarily boosts the low-priority holder’s priority to match the high-priority waiter. This ensures the mutex holder can preempt the medium-priority thread and release the mutex quickly.

#include <pthread.h>

/* Initialize a mutex with the priority inheritance protocol.
 * Returns 0 on success or the pthread error code. */
int init_pi_mutex(pthread_mutex_t *mutex) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(mutex, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* A mutex initialized this way automatically boosts the priority of the
 * thread that holds it while a higher-priority thread is blocked on it. */

In a trading engine, the best policy is to avoid mutexes entirely on the hot path (using lock-free structures as covered in Lock-Free Queues for Market Data). But for initialization, configuration, and infrequent operations that do require mutex protection, always use PTHREAD_PRIO_INHERIT if the mutex may be accessed by SCHED_FIFO threads at different priorities.

The Near-Miss Kernel Lockup

At Akuna, we had a near-miss with SCHED_FIFO that went into the runbook as a required-reading incident.

A new developer was testing a market simulation tool that ran on the trading server to reproduce exchange behavior. He set it to SCHED_FIFO priority 99 “to ensure it got good scheduling.” The simulation tool had a bug: an infinite loop with no sched_yield() or blocking calls.

An infinite loop in a SCHED_FIFO priority-99 thread starves every SCHED_OTHER thread and every lower-priority RT thread on that CPU core, and it can starve the kernel’s soft-lockup watchdog as well. On kernels where the watchdog is a per-CPU thread (watchdog/N) running at SCHED_FIFO priority 99, it cannot preempt a user thread at the same priority, so it stops updating its timestamp. Once the timestamp is older than the soft-lockup threshold, the kernel reports a soft lockup, which, depending on configuration (kernel.softlockup_panic, or a hardware watchdog that is fed from user space), can escalate to a panic and reboot.

The save was that the simulation tool was on a different CPU than the trading engine (good isolation practice - management processes on separate cores). The trading engine continued running. The simulation CPU went into a soft lockup (visible in dmesg), which triggered an alert, and an operator killed the simulation process before the watchdog timeout caused a panic.

The lesson: never set a user-space process to SCHED_FIFO/99 unless it has been carefully audited for any possible infinite loop or busy-wait path that never yields. The kernel’s own top-priority threads (watchdog, migration) must be able to run.

# A safer priority ceiling for user-space RT threads
# Leave priority 99 for kernel RT threads
# Use 95 as maximum for user-space trading
chrt -f 95 ./trading_engine

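# Monitor for soft lockups
dmesg | grep -i "soft lockup"
# A "BUG: soft lockup - CPU#N stuck for ..." line means that CPU went too long
# without running the watchdog, for example because an RT thread is spinning

# Check currently running RT threads and their priorities
# (cls is the scheduling class: FF = SCHED_FIFO, RR = SCHED_RR; rtprio is "-" for non-RT threads)
ps -eLo pid,tid,cls,rtprio,comm --no-headers | awk '$4 != "-"' | sort -k4,4nr | head -20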
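
The same ceiling can be enforced in code so that no caller can request 99 by accident. This is a minimal sketch; USER_RT_CEILING and clamp_user_rt_priority are hypothetical names, and 95 is simply the figure suggested above.

```c
#include <sched.h>

/* Assumed ceiling for user-space RT threads (95, as suggested above). */
#define USER_RT_CEILING 95

/* Clamp a requested SCHED_FIFO priority into [system minimum, ceiling]. */
int clamp_user_rt_priority(int requested) {
    int lo = sched_get_priority_min(SCHED_FIFO);   /* 1 on Linux */
    int hi = sched_get_priority_max(SCHED_FIFO);   /* 99 on Linux */
    if (hi > USER_RT_CEILING)
        hi = USER_RT_CEILING;
    if (requested < lo)
        return lo;
    if (requested > hi)
        return hi;
    return requested;
}
```

Routing every priority through a helper like this before calling sched_setscheduler() turns "someone typed 99" from a potential lockup into a non-event.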

SCHED_DEADLINE: The Gotcha

SCHED_DEADLINE looks attractive because it provides formal guarantees: “I need X µs of CPU time every Y µs.” But it has a critical interaction with its own bandwidth enforcement, the Constant Bandwidth Server (CBS).

When a SCHED_DEADLINE task uses up its runtime budget for a period, it is throttled until the budget is replenished at the next period - even if its work is not finished, and by default even if the CPU is otherwise idle. This is not a property of EDF itself but of the CBS layered on top of it: the scheduler throttles over-budget tasks so that one misbehaving task cannot break the other tasks’ deadlines.

For a trading engine, being throttled mid-cycle is worse than not having real-time scheduling at all. A strategy evaluation that is in the middle of computing a fair value and submitting an order must not be stopped and forced to wait 1ms for the next period. The consequences are worse than the problem being solved.

Use SCHED_FIFO for trading. Use SCHED_DEADLINE for industrial control systems and media encoding where the periodic-task model fits naturally.

# SCHED_DEADLINE configuration (for non-trading use cases):
# runtime=500µs, deadline=1ms, period=1ms (chrt takes these values in nanoseconds)
# The priority argument must be 0 for SCHED_DEADLINE
chrt --deadline --sched-runtime 500000 --sched-deadline 1000000 \
     --sched-period 1000000 -p 0 <pid>

# WARNING: If your task overruns its runtime budget, it will be
# throttled and miss its own deadline. The scheduler does not
# compensate; it enforces the budget strictly.
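
To see what an overrun costs, here is a deliberately simplified model of budget throttling. It assumes the deadline equals the period, the task is released at the start of a period, and nothing else competes for the CPU; the function name is hypothetical and this is not how the kernel computes anything.

```c
#include <stdint.h>

/* Time (in µs from release) at which a job needing work_us of CPU finishes
 * when it may consume at most runtime_us per period_us and is throttled
 * until the next period once the budget is gone. */
uint64_t dl_completion_time_us(uint64_t work_us, uint64_t runtime_us,
                               uint64_t period_us) {
    if (work_us == 0)
        return 0;
    if (runtime_us == 0)
        return UINT64_MAX;                        /* no budget: never finishes */

    /* Number of budgets fully exhausted before the final chunk runs. */
    uint64_t exhausted = (work_us - 1) / runtime_us;
    uint64_t tail = work_us - exhausted * runtime_us;
    return exhausted * period_us + tail;
}
```

With runtime=500µs and period=1ms, a job that needs 500µs finishes at 500µs, but one that needs 600µs finishes at 1100µs: a 100µs overrun becomes a 500µs stall. That cliff is the reason the periodic model is a poor fit for a strategy loop whose per-event cost varies.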

RT Throttling: The Default Kill Switch

Linux has a default safety mechanism: RT_BANDWIDTH throttling, controlled by:

cat /proc/sys/kernel/sched_rt_period_us
# Default: 1000000 (1 second)

cat /proc/sys/kernel/sched_rt_runtime_us
# Default: 950000 (950ms out of every 1s)

This means SCHED_FIFO and SCHED_RR threads together are limited to 95% of CPU time per second on each CPU. The remaining 5% is reserved for SCHED_OTHER threads, including the login shell and anything needed to recover from an RT runaway.

For a trading server where you own the hardware and have tested your RT threads, you can remove this limit:

# Allow RT threads to use 100% of CPU time
# (removes the 5% reservation - only do this on dedicated trading hardware)
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
# Verify:
cat /proc/sys/kernel/sched_rt_runtime_us
# -1

# Make permanent in /etc/sysctl.d/99-rt-trading.conf:
echo "kernel.sched_rt_runtime_us = -1" >> /etc/sysctl.d/99-rt-trading.conf

Setting this to -1 removes your safety net: a runaway RT thread can now lock the entire system. Only do this if you have tested your RT threads rigorously and have a hardware watchdog (separate from the OS software watchdog) that can reset the server if it goes unresponsive.
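
A trading engine can check at startup which mode it is running in and log it, so nobody has to guess later. The sketch below reads the same procfs file as the commands above; the function names are hypothetical, and the file may be absent in some containers, in which case the result is "unknown".

```c
#include <stdio.h>

/* Fraction of each period available to RT threads. A negative runtime
 * means throttling is disabled (1.0); an invalid period returns -1.0. */
double rt_bandwidth_fraction(long runtime_us, long period_us) {
    if (runtime_us < 0)
        return 1.0;
    if (period_us <= 0)
        return -1.0;
    return (double)runtime_us / (double)period_us;
}

/* Read one integer from a file. Returns 0 on success, -1 on failure. */
int read_long_from(const char *path, long *out) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns 1 if RT throttling is active, 0 if disabled, -1 if unknown. */
int rt_throttling_enabled(void) {
    long runtime_us;
    if (read_long_from("/proc/sys/kernel/sched_rt_runtime_us", &runtime_us) != 0)
        return -1;
    return runtime_us >= 0;
}
```

With the defaults, rt_bandwidth_fraction(950000, 1000000) is 0.95. Logging the result of rt_throttling_enabled() at startup makes it obvious whether a given host still has the safety net.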

How This Breaks in Production

1. SCHED_FIFO with blocking I/O. If a SCHED_FIFO thread calls read() or write() on a regular file, performs a DNS lookup, or makes any other blocking syscall, the hot path stalls for as long as the call takes, and no priority setting can shorten that. When the I/O completes and the thread is made runnable again, it immediately preempts everything else on its core. This is correct behavior, but it means any slow I/O path in a SCHED_FIFO thread shows up directly as latency, and as bursts of preemption for everything below it. Hand log records off to a separate, lower-priority thread and do the file I/O there, never on a hot-path SCHED_FIFO thread.

2. Memory page faults in SCHED_FIFO. A page fault stalls the faulting thread no matter what its priority is: a minor fault costs microseconds in the kernel, and a major fault (a disk read for a paged-out page) blocks the thread for as long as the I/O takes. Real-time priority does nothing to speed this up. Use mlockall() to lock all trading-engine memory into RAM (prevents paging), and pre-fault all memory before entering SCHED_FIFO mode:

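#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Lock all current and future memory into RAM */
if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
    perror("mlockall");  /* usually RLIMIT_MEMLOCK or missing CAP_IPC_LOCK */
/* Then touch every page of the hot-path buffers up front */
/* This prevents page faults during trading */
memset(huge_page_buffer, 0, buffer_size);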
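
For buffers allocated at startup, the pre-fault step can be folded into the allocation. This is a small sketch to go with the mlockall() call above, with a hypothetical function name; it touches one byte per page rather than clearing the whole buffer, which is enough to make each page resident.

```c
#include <stdlib.h>
#include <unistd.h>

/* Allocate size bytes, page-aligned, and write to every page so that the
 * first access on the hot path does not take a page fault.
 * Returns NULL if the allocation fails. The caller frees with free(). */
void *alloc_prefaulted(size_t size) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        page = 4096;                      /* assumed fallback page size */

    void *buf = NULL;
    if (posix_memalign(&buf, (size_t)page, size) != 0)
        return NULL;

    /* volatile keeps the compiler from discarding the writes */
    volatile unsigned char *p = buf;
    for (size_t off = 0; off < size; off += (size_t)page)
        p[off] = 0;
    return buf;
}
```

Calling this during initialization, after mlockall() and before switching to SCHED_FIFO, keeps both the allocation and the faults off the hot path.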

3. CPU affinity + SCHED_FIFO: verify that the pin actually holds. The scheduler treats the affinity mask as a hard constraint, so a thread whose mask contains exactly one CPU is not load-balanced elsewhere. What goes wrong in practice is that the mask is not what you think it is: taskset -p changes only the thread ID you give it (use -a to cover every thread in the process), a cpuset or cgroup change can rewrite the mask underneath you, and taking the pinned CPU offline forces the kernel to move the thread. Verify with periodic sched_getcpu() calls in the strategy loop and alert if the CPU changes.
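
For item 3, pinning from inside the thread and checking the result could look like the sketch below. The function names are hypothetical; sched_setaffinity() with a pid of 0 affects only the calling thread, which is why each hot-path thread should pin itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for sched_getcpu() and the CPU_* macros */
#endif
#include <sched.h>

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Returns 1 if the calling thread is on expected_cpu, 0 if it is on a
 * different CPU, and -1 if the CPU could not be determined. */
int on_expected_cpu(int expected_cpu) {
    int cpu = sched_getcpu();
    if (cpu < 0)
        return -1;
    return cpu == expected_cpu;
}
```

The check is cheap enough to run every few thousand loop iterations; a result of 0 should raise an alert, since it means the affinity mask was changed behind the engine's back.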

4. Forgetting to reset RT priority after fork()/exec(). fork() creates a child process with the same scheduling policy and priority as the parent, and exec() preserves them. If your trading engine spawns a subprocess (e.g., to run a configuration script), that subprocess inherits SCHED_FIFO/95. If the subprocess spins or hangs in a busy loop, it does so at real-time priority. Always reset the scheduling policy in child processes: call sched_setscheduler(0, SCHED_OTHER, &param) with param.sched_priority = 0 immediately in the child after fork(), or set the parent's policy with the SCHED_RESET_ON_FORK flag so that children start at SCHED_OTHER automatically.
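
For item 4, a spawn helper that drops the child to SCHED_OTHER before exec could be sketched as follows. The name run_demoted and the exit codes 126/127 are choices made for this example, not an established convention of any trading codebase.

```c
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, reset the child to SCHED_OTHER, then exec argv[0] via PATH.
 * Returns the child's exit status, or -1 if it could not be run or
 * did not exit normally. */
int run_demoted(char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: give up the inherited real-time policy before exec. */
        struct sched_param param = { .sched_priority = 0 };
        if (sched_setscheduler(0, SCHED_OTHER, &param) != 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);                        /* exec failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Lowering from SCHED_FIFO to SCHED_OTHER needs no privilege, so the demotion should not fail in practice; the check is there so that a failure is visible instead of leaving a helper script running at real-time priority.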

5. SCHED_FIFO starvation of lower-priority threads on shared infrastructure. If your trading server also runs monitoring agents, logging collectors, or infrastructure daemons, they run at SCHED_OTHER and get minimal CPU time on any core where a SCHED_FIFO thread is busy. This means your Prometheus node exporter may stop answering scrapes, your fluentd log shipper may buffer up, and your heartbeat to an ops monitoring system may miss beats - triggering false alerts. Either allocate a dedicated non-isolated CPU for these services or accept that they will have degraded performance during market hours.

Related reading: CPU Pinning, isolcpus, and nohz_full covers CPU isolation which is a prerequisite for meaningful SCHED_FIFO - priority without isolation still competes for a shared CPU. Lock-Free Queues for Market Data covers eliminating the mutexes that cause priority inversion. Profiling with perf, eBPF, and Off-CPU Flame Graphs covers detecting priority inversion with off-CPU analysis. The Anatomy of a Sub-50µs Trade covers where scheduling fits in the end-to-end latency budget.
