Infrastructure
Lock-Free Queues for Market Data: SPSC, MPMC, and the Pitfalls of False Sharing
Replacing a mutex queue with a lock-free SPSC ring buffer dropped P99 from 50µs to 4µs at Akuna. Correct C++17 implementation with cache-line alignment and false-sharing failure mode.
The first version of our market data pipeline at Akuna used std::queue<MarketData> + std::mutex. The architecture was clean: a receive thread pushed parsed messages into the queue, and the strategy thread popped them. The mutex protected concurrent access.
In testing, with synthetic market data at controlled rates, it worked. In production, with real exchange feeds during the European equity open, the mutex became a bottleneck. The receive thread and strategy thread were contending for the same lock up to 50,000 times per second. Each contention was cheap - maybe 100ns - but at 50,000 per second that was 5ms per second of pure lock overhead. More importantly, it was 5ms of variance: when the contention was high, the strategy thread could be blocked for 500µs waiting for the receive thread to release the lock.
The fix was a lock-free SPSC (Single-Producer Single-Consumer) ring buffer. Latency dropped from 50µs P99 to 4µs P99. Not because we did less work - we did the same work - but because the synchronization mechanism went from a kernel futex (which can sleep) to a pair of atomic reads/writes (which never sleep).
This post explains the implementation. The code is C++17 and production-safe.
Why Mutexes Are Wrong for Market Data
A std::mutex is implemented as a futex (fast userspace mutex) on Linux. When the mutex is uncontended, acquiring and releasing it takes about 20-30ns - two atomic operations (compare-and-swap, memory barrier). This is acceptable.
When the mutex is contended, the acquiring thread calls futex_wait(), which is a syscall that tells the kernel “put me to sleep until this mutex is available.” The kernel context-switches the thread out. When the holding thread releases the mutex, it calls futex_wake(), another syscall. The kernel wakes the waiting thread, which must be rescheduled before it can proceed. This round-trip - sleep → wake → reschedule - takes at minimum 1-3µs on a well-tuned Linux system, and can take hundreds of µs if the scheduler is busy.
At 50,000 market data messages per second, with the receive thread and strategy thread both running continuously, contention is nearly constant. The expected time between message arrivals is 20µs. The time to process and release the mutex is 2-5µs for the strategy thread. The receive thread is blocked approximately 10-25% of the time.
The lock-free alternative: make synchronization happen through atomic operations and memory ordering guarantees, never through kernel involvement. When the SPSC ring is empty, the strategy thread reads an atomic variable and sees “nothing new”; it spins, reading again. No syscall, no sleep, no kernel. When a new message arrives, the atomic variable is updated. The strategy thread sees the update on the next poll iteration. Total cost: 2-5 CPU cycles.
The Correct SPSC Ring Buffer
Here is a complete, correct SPSC ring buffer implementation. Every design decision is deliberate.
#include <atomic>
#include <cstdint>
#include <cassert>
#include <new> // for std::hardware_destructive_interference_size
// Cache line size - use hardware value if available (C++17), else 64
#ifdef __cpp_lib_hardware_interference_size
constexpr size_t CACHE_LINE = std::hardware_destructive_interference_size;
#else
constexpr size_t CACHE_LINE = 64;
#endif
/**
* SPSC ring buffer - Single Producer, Single Consumer.
*
* Requirements:
* - Exactly one producer thread calls push()
* - Exactly one consumer thread calls pop()
* - No other threads access this object
* - T must be trivially copyable (no allocations in copy constructor)
*
* Properties:
* - Wait-free (both push and pop complete in bounded time)
* - Zero allocation after construction
* - Cache-line isolated head and tail to prevent false sharing
* - Power-of-2 capacity for index masking with bitwise AND (no modulo)
*/
template<typename T, size_t Capacity>
class SPSCQueue {
static_assert((Capacity & (Capacity - 1)) == 0,
"Capacity must be a power of 2");
static_assert(Capacity >= 2, "Capacity must be at least 2");
public:
SPSCQueue() : head_(0), tail_(0) {}
// Called only by the producer thread.
// Returns true if push succeeded, false if the queue is full.
[[nodiscard]] bool push(const T& item) noexcept {
const size_t tail = tail_.load(std::memory_order_relaxed);
const size_t next_tail = (tail + 1) & MASK;
// Check if full: next slot is where the consumer is reading
// We load head_ with acquire to synchronize with consumer's release store
if (next_tail == head_.load(std::memory_order_acquire)) {
return false; // Queue full
}
// Write the item BEFORE updating tail
// The consumer reads data[head] only after seeing head_ advance
buffer_[tail] = item;
// Release store: makes the buffer write visible to the consumer
// before the tail update is visible
tail_.store(next_tail, std::memory_order_release);
return true;
}
// Called only by the consumer thread.
// Returns true if pop succeeded (item is written to out), false if empty.
[[nodiscard]] bool pop(T& out) noexcept {
const size_t head = head_.load(std::memory_order_relaxed);
// Check if empty: consumer has caught up to producer
// We load tail_ with acquire to synchronize with producer's release store
if (head == tail_.load(std::memory_order_acquire)) {
return false; // Queue empty
}
// Read the item
out = buffer_[head];
// Release store: signals to producer that this slot is now available
head_.store((head + 1) & MASK, std::memory_order_release);
return true;
}
// Approximate size - may be stale. Safe to call from either thread.
size_t size_approx() const noexcept {
const size_t head = head_.load(std::memory_order_relaxed);
const size_t tail = tail_.load(std::memory_order_relaxed);
return (tail - head) & MASK;
}
bool empty() const noexcept {
return head_.load(std::memory_order_relaxed) ==
tail_.load(std::memory_order_relaxed);
}
private:
static constexpr size_t MASK = Capacity - 1;
// Each atomic index on its own cache line.
// Without padding, head_ and tail_ would share a 64-byte cache line.
// The producer writes tail_, the consumer writes head_.
// If they share a cache line, every write by one thread invalidates
// the cache line in the other thread's L1 cache - 8-10x slower.
alignas(CACHE_LINE) std::atomic<size_t> head_;
alignas(CACHE_LINE) std::atomic<size_t> tail_;
// Buffer storage - separate from the hot atomic variables to prevent
// the buffer from evicting the atomic cache lines
alignas(CACHE_LINE) T buffer_[Capacity];
};
The False Sharing Disaster: A Concrete Measurement
The alignas(CACHE_LINE) on head_ and tail_ is not cargo-culting. Here is what happens without it.
Without the alignment, a typical implementation allocates:
offset 0: head_ (8 bytes) ← on cache line 0
offset 8: tail_ (8 bytes) ← on cache line 0 (same line!)
offset 16: buffer_[0] ← on cache line 0
...
The producer thread writes tail_ on every push - invalidating the cache line containing head_. The consumer thread holds head_ in its L1 cache for reading. Every time the producer writes, the consumer’s L1 cache line is invalidated via MESI protocol (Modified-Exclusive-Shared-Invalid). The consumer’s next read of head_ must re-fetch the cache line from L2 or L3.
This is “false sharing”: two threads sharing a cache line not because they share data, but because the hardware granularity is 64 bytes, larger than either variable.
# Measuring false sharing overhead with perf c2c
sudo perf c2c record -u -- ./spsc_benchmark
sudo perf c2c report --stdio
# Look for lines with high RmtHitm or LclHitm counts
# The address column will point to your unpadded atomic variables
# Benchmark numbers from our systems:
# SPSC without padding:
# throughput: ~8M messages/sec (12x slower)
# P99 pop latency: ~100-150ns
#
# SPSC with cache-line padding:
# throughput: ~95M messages/sec
# P99 pop latency: ~8-12ns
The performance difference is approximately 8-12x, consistent across architectures. This is not hypothetical - it is the measurable cost of cache coherence traffic over the processor interconnect.
Memory Ordering: Why Each Barrier Is Where It Is
The memory ordering choices in the SPSC implementation above are deliberately minimal. Here is the reasoning:
tail_.load(std::memory_order_relaxed) in push(): The producer reads its own variable. No other thread writes tail_. A relaxed load is sufficient - we just need the value, no synchronization required.
head_.load(std::memory_order_acquire) in push(): The producer is checking whether the consumer has advanced the head pointer, freeing a slot. The consumer’s store to head_ was a release store. The matching acquire here ensures that if we see the updated head_ value, we also see any stores the consumer made before updating head_. (For a simple ring buffer, there are no such stores, but the pairing is the correct pattern.)
tail_.store(next_tail, std::memory_order_release) in push(): After writing buffer_[tail] = item, we update the tail pointer. The release ordering ensures that the buffer write is visible to any thread that subsequently reads tail_ with an acquire load. Without this, the compiler or hardware could reorder the buffer write to after the tail update, and the consumer might read a not-yet-written buffer slot.
tail_.load(std::memory_order_acquire) in pop(): The consumer checks if there is a new item by reading tail_. The producer’s release store to tail_ guarantees that any buffer writes before the store are visible to us once we see the updated tail. This is the critical guarantee.
A common mistake is to use std::memory_order_seq_cst (sequentially consistent) everywhere “to be safe.” This compiles to MFENCE instructions on x86, which are expensive (50-100 cycles). For a SPSC queue used 50,000 times per second, the difference is 50,000 × 75ns = 3.75ms per second of unnecessary overhead. The acquire/release pairing is sufficient and costs nothing on x86 (x86 already has a strong memory model; acquire/release compile to ordinary loads and stores).
MPMC for Multi-Strategy Environments
If you have multiple producer threads (e.g., receive threads for different exchanges) or multiple consumer threads (e.g., risk thread + strategy thread both consuming market data), SPSC is not sufficient. You need MPMC (Multiple Producer, Multiple Consumer).
MPMC is significantly more complex and has higher overhead. A correct lock-free MPMC requires compare-and-swap (CAS) operations on the head and tail, which can fail and retry:
/* Simplified MPMC sketch - not production-ready without further validation */
template<typename T, size_t Capacity>
class MPMCQueue {
struct Slot {
alignas(CACHE_LINE) std::atomic<size_t> sequence;
T data;
};
static constexpr size_t MASK = Capacity - 1;
alignas(CACHE_LINE) std::atomic<size_t> head_{0};
alignas(CACHE_LINE) std::atomic<size_t> tail_{0};
Slot slots_[Capacity];
public:
MPMCQueue() {
for (size_t i = 0; i < Capacity; i++)
slots_[i].sequence.store(i, std::memory_order_relaxed);
}
bool push(const T& item) {
size_t tail = tail_.load(std::memory_order_relaxed);
for (;;) {
Slot& slot = slots_[tail & MASK];
size_t seq = slot.sequence.load(std::memory_order_acquire);
intptr_t diff = (intptr_t)seq - (intptr_t)tail;
if (diff == 0) {
/* Slot is ready to write */
if (tail_.compare_exchange_weak(tail, tail + 1,
std::memory_order_relaxed))
{
slot.data = item;
slot.sequence.store(tail + 1, std::memory_order_release);
return true;
}
/* CAS failed - another producer claimed this slot, retry */
} else if (diff < 0) {
return false; /* Queue full */
} else {
tail = tail_.load(std::memory_order_relaxed); /* Reload */
}
}
}
/* pop() is symmetric */
};
The CAS retry loop means MPMC push/pop can take variable time if there is contention. In practice, for 2-4 producers/consumers, the CAS succeeds on the first try the vast majority of the time. But for HFT hot paths, prefer SPSC with multiple queues (one per producer-consumer pair) over MPMC. The topology is more complex but the per-queue performance is deterministic.
Practical Usage in a Trading Engine
/* Market data pipeline - single receiver, single strategy thread */
constexpr size_t QUEUE_DEPTH = 1024; /* power of 2, tune based on burst size */
SPSCQueue<MarketDataUpdate, QUEUE_DEPTH> md_queue;
/* Receiver thread (pinned to core 4) */
void receiver_thread() {
while (running) {
ef_event evs[16];
int n = ef_eventq_poll(&vi, evs, 16);
for (int i = 0; i < n; i++) {
MarketDataUpdate update;
parse_market_data(evs[i], &update);
/* Non-blocking push - drop if queue full */
if (!md_queue.push(update)) {
metrics.queue_overflow_count++;
/* Log the drop for post-session analysis, but don't block */
}
}
}
}
/* Strategy thread (pinned to core 5) */
void strategy_thread() {
MarketDataUpdate update;
while (running) {
/* Tight polling loop - no sleep, no yield */
if (md_queue.pop(update)) {
uint64_t queue_latency_ns =
rdtsc_to_ns(rdtsc()) - update.enqueue_tsc;
metrics.queue_latency.record(queue_latency_ns);
strategy.on_market_data(update);
}
/* On empty: spin immediately.
* cpu_relax() here would be wrong for an isolated core -
* it introduces a PAUSE instruction (~10 cycles) that reduces
* polling rate. Only use cpu_relax() if power consumption matters. */
}
}
How This Breaks in Production
1. The producer and consumer on the same CPU. SPSC assumes the producer and consumer run on different CPUs. If they run on the same CPU (e.g., both threads happen to be scheduled to core 4 due to a scheduling error), the consumer’s spin loop prevents the producer from ever being scheduled, causing a live-lock. The consumer spins forever on an empty queue; the producer never runs. Detection: ps -eLo psr,tid,comm should show the producer and consumer on different CPUs. Mitigation: use a spin limit followed by pthread_yield() to force a scheduler cycle if the queue is empty for too long.
2. Queue overflow drops during flash crashes. A flash crash or news event can generate a burst of market data at 10-100x the normal rate. If your queue depth is sized for the average rate, it will overflow during the burst, dropping messages. Dropped market data messages that affect your position’s valuation can cause incorrect risk calculations. Size the queue for the expected burst rate, not the average rate. 1,024 entries for a 50k msg/sec average rate is dangerously small if 1ms bursts can hit 200k msg/sec.
3. Non-power-of-2 capacity causes undefined behavior. The bitmask optimization (tail + 1) & MASK only works correctly when Capacity is a power of 2. If Capacity is 1000 and MASK is 999 (binary 1111100111), the bitmask does not wrap correctly. The static_assert in the implementation catches this at compile time - do not remove it.
4. T contains heap allocations. If T is std::string, std::vector, or any type with a non-trivial copy constructor, copying it into and out of the ring buffer may call malloc(). Recall the Akuna P99.9 incident: a single allocation in an unexpected code path caused 890µs off-CPU latency. Define T as a flat struct with fixed-size arrays. Parse into T without dynamic allocation.
5. Sequence point ordering with non-x86 architectures. On x86, stores are never reordered with other stores (the processor has a Total Store Order memory model). The acquire/release barriers in the implementation compile to no-op instructions on x86 - the ordering is guaranteed by hardware. On ARM64 (increasingly common in trading infrastructure for cloud and Apple silicon deployments), the ordering guarantees are weaker and the barriers do compile to real instructions (dmb ishld, stlr). The implementation above is correct on ARM64, but verify with objdump -d that your compiler is emitting the expected barrier instructions.
6. Saturating the cache with buffer data evicts the atomic indices. If T is large (e.g., a 200-byte full order book snapshot), the buffer_ array at 1,024 entries is 200KB - larger than most L2 caches. When the producer writes to a slot and the consumer reads from it, those cache lines cycle through L2 and L3. The atomic head_ and tail_ variables, which should be hot in L1, may get evicted if they share a cache set with buffer entries. The alignas(CACHE_LINE) declarations help, but on very large T types, also consider placing head_ and tail_ in a separate structure that is allocated far from the buffer.
Related reading: Profiling with perf, eBPF, and Off-CPU Flame Graphs covers how to detect the allocator contention issue that lock-free queues solve. CPU Pinning, isolcpus, and nohz_full covers the CPU isolation that makes the spin loop feasible. The Anatomy of a Sub-50µs Trade shows where queue latency fits in the end-to-end budget. Real-Time Scheduling on Linux covers SCHED_FIFO which prevents the producer-consumer same-CPU failure mode.
Continue Reading
Enjoyed this?
Get one deep infrastructure insight per week.
Free forever. Unsubscribe anytime.
You're in. Check your inbox.