TCP Tuning for Trading: Why You Should Probably Stop Using BBR and What to Use Instead

BBR's bandwidth probing introduces 50-100µs jitter on latency-sensitive trading connections. Here's the congestion control and socket configuration that actually works.

#tcp #networking #latency #linux #hft #congestion-control

We had been running BBR on all our exchange connections at Akuna Capital. One of our quant researchers noticed that our fill latency histogram had a peculiar bimodal distribution - most fills came back in 40-60µs, but roughly 3% came back in 110-150µs. The 99th percentile was blowing our budget.

I spent two weeks with perf, tcpdump, and kernel source code before I found the culprit: BBR’s built-in bandwidth probing cycle. Once per 8-round-trip cycle, BBR probes for more bandwidth by temporarily raising its pacing rate, then backs off to drain any queue it built. On a trading connection sending small messages at irregular intervals, some orders inevitably land inside that probe window. The result is a 50-100µs delay on whatever packet is in flight during the probe.

This post covers what I learned about TCP configuration for latency-sensitive trading connections: congestion control selection, socket buffer sizing, kernel parameter tuning, and the failure modes that will bite you at 3 AM.

Why Congestion Control Matters More Than You Think

Most engineers treat TCP congestion control as infrastructure plumbing - you set it once and forget it. In a bulk-data context (streaming video, file transfer), that’s fine. In a trading context, where you’re sending 50-200 byte messages and care deeply about the latency of each individual packet, the algorithm you choose has a measurable effect on your P99.

BBR: Brilliant for Throughput, Wrong for Trading

BBR (Bottleneck Bandwidth and Round-trip propagation time) was designed by Google to maximize throughput on high-bandwidth, high-latency paths like transoceanic links. It models the path’s bottleneck bandwidth and minimum RTT, and in its steady-state PROBE_BW state it periodically paces at 125% of the estimated bandwidth to check for spare capacity.

For trading, this is a problem. Here’s what the probe looks like from the kernel’s perspective:

BBR state machine (simplified):
  STARTUP → DRAIN → PROBE_BW

  PROBE_BW cycle (8 RTTs):
    - 1 RTT at 125% pacing rate (PROBE_UP)
    - 1 RTT at 75% pacing rate (PROBE_DOWN)
    - 6 RTTs at 100% pacing rate (CRUISE)

When your 48-byte order hits the socket during a probe, it is subject to BBR’s pacing: the kernel rate-limits outbound packets to the current pacing rate, so a packet can wait for its pacing slot instead of leaving immediately. On a gigabit link with ~50µs RTT, that adds roughly 60-80µs of additional delay to the affected packet.
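To make the cycle above concrete, here is a minimal sketch that expands the gain sequence into per-round-trip pacing rates. The function names and the 1 Gbit/s estimate are illustrative assumptions, and the model ignores everything else BBR does (RTT probing, app-limited handling, and so on).

```python
from typing import List

# Pacing gains for one PROBE_BW cycle as listed above: one probing round-trip,
# one draining round-trip, then six round-trips at the estimated rate.
PROBE_BW_GAINS: List[float] = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


def pacing_rates(bandwidth_bps: float, cycles: int = 1) -> List[float]:
    """Return the pacing rate (bits/sec) for each round-trip over `cycles` cycles."""
    return [bandwidth_bps * gain for _ in range(cycles) for gain in PROBE_BW_GAINS]


def off_nominal_fraction() -> float:
    """Fraction of round-trips in which the pacing gain differs from 1.0."""
    off = sum(1 for gain in PROBE_BW_GAINS if gain != 1.0)
    return off / len(PROBE_BW_GAINS)


if __name__ == "__main__":
    # Assumed bandwidth estimate of 1 Gbit/s, purely for illustration
    for rtt_index, rate in enumerate(pacing_rates(1e9)):
        print(f"RTT {rtt_index}: {rate / 1e6:.0f} Mbit/s")
    print(f"round-trips with gain != 1.0: {off_nominal_fraction():.0%}")
```

Two of every eight round-trips run at a non-nominal gain in this model. How many orders actually get delayed depends on message timing relative to the cycle, which is why measurement beats reasoning here.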

Measurement: BBR vs CUBIC on trading workload

I benchmarked this using a synthetic test: 1,000 messages/second, 64-byte payload, to a server in the same datacenter (measured RTT: 45µs).

Algorithm     P50       P95       P99       P99.9
BBR          47µs      52µs      118µs     180µs
CUBIC        46µs      53µs       61µs      89µs
Reno         47µs      55µs       68µs      97µs

CUBIC wins for P99 and P99.9 on this workload. BBR’s median is competitive, but the tail latency is unacceptable for trading.

Why CUBIC Wins for Small-Message Trading Flows

CUBIC’s window growth is based on a cubic function of elapsed time since the last congestion event. For trading flows on low-latency datacenter links, congestion events are rare, which means CUBIC mostly operates in steady-state with a large enough window to not constrain throughput.

More importantly: CUBIC doesn’t have a probing cycle. It doesn’t periodically inflate its sending rate to test bandwidth. It just responds to packet loss and ECN signals when they occur.

To set system-wide congestion control:

# Check current setting
sysctl net.ipv4.tcp_congestion_control

# Set to CUBIC permanently
echo "net.ipv4.tcp_congestion_control = cubic" >> /etc/sysctl.d/99-trading.conf
sysctl -p /etc/sysctl.d/99-trading.conf

To override per-connection in your trading application:

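import socket

def create_trading_socket(host: str, port: int) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Select CUBIC for this socket only. TCP_CONGESTION has been in Linux
    # since 2.6.13; socket.TCP_CONGESTION (value 13) needs Python 3.6+.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b'cubic')

    # Connect after setting options so they apply from the first packet
    sock.connect((host, port))
    return sock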
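It is worth reading the option back rather than trusting that the call took effect. The sketch below assumes Linux and Python 3.6+; the helper names are made up for this example.

```python
import socket


def decode_cc_name(raw: bytes) -> str:
    """Strip the NUL padding the kernel appends to the algorithm name."""
    return raw.split(b"\x00", 1)[0].decode("ascii")


def current_congestion_control(sock: socket.socket) -> str:
    """Read back the congestion control algorithm attached to a TCP socket (Linux)."""
    # 16 bytes matches the kernel's maximum algorithm name length
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    return decode_cc_name(raw)
```

Calling current_congestion_control(sock) right after setsockopt should return 'cubic'. If it returns something else, the module is probably not loaded or not listed in net.ipv4.tcp_allowed_congestion_control.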

TCP_NODELAY and Nagle’s Algorithm

If you’re not setting TCP_NODELAY, you are buffering your orders. Full stop.

Nagle’s algorithm was designed in 1984 to reduce small-packet proliferation on slow links. It works by holding a small outbound packet in the send buffer until either the previous in-flight packet is acknowledged or the buffer grows to MSS (typically 1460 bytes for Ethernet).

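For a 64-byte order message sent while an earlier message is still unacknowledged, Nagle holds it in the buffer until that ACK arrives. That is at least one RTT (45µs here), and far longer if the peer is using delayed ACKs. It’s artificial delay on every order that follows closely behind another.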

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

In Rust:

use std::net::TcpStream;

fn connect_exchange() -> std::io::Result<TcpStream> {
    let stream = TcpStream::connect("exchange.example.com:9000")?;
    // Disable Nagle so small writes go out immediately
    stream.set_nodelay(true)?;
    Ok(stream)
}
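The Nagle rule described at the start of this section fits in a few lines. This is a simplified model of the decision, not the kernel’s implementation; the function and parameter names are invented for illustration.

```python
def nagle_allows_send(segment_len: int, mss: int, unacked_bytes: int,
                      nodelay: bool = False) -> bool:
    """Simplified Nagle decision: may a segment of this size leave right now?"""
    if nodelay:
        return True            # TCP_NODELAY bypasses the rule entirely
    if segment_len >= mss:
        return True            # full-sized segments are always sent
    return unacked_bytes == 0  # small segments wait while data is in flight


if __name__ == "__main__":
    # A 64-byte order behind an unacknowledged 64-byte order is held back
    print(nagle_allows_send(64, 1460, unacked_bytes=64))
    # The same order with TCP_NODELAY set goes out immediately
    print(nagle_allows_send(64, 1460, unacked_bytes=64, nodelay=True))
```

The model shows why isolated orders are unaffected while back-to-back orders are: only the second case has unacknowledged bytes in flight.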

TCP_QUICKACK is the complement to TCP_NODELAY on the receiving end. When TCP_QUICKACK is set, the kernel sends ACKs immediately rather than using delayed ACK (which holds an ACK back for 40-200ms in the hope of piggybacking it on outgoing data). This matters when your exchange connection is bidirectional - you’re sending orders and receiving market data on the same socket.

# socket.TCP_QUICKACK is 12 on Linux
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)

Critical gotcha: TCP_QUICKACK is not sticky. Per tcp(7), the flag only switches the socket into quickack mode; the kernel can fall back to delayed ACKs on its own as the connection progresses. The usual workaround is to re-set it after every receive operation:

def recv_market_data(sock: socket.socket) -> bytes:
    data = sock.recv(4096)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)  # Re-arm QUICKACK
    return data

Socket Buffer Sizing: The BDP Calculation

Getting socket buffer sizes right is the difference between absorbing a burst and dropping a packet. The formula is the Bandwidth-Delay Product:

BDP = bandwidth × RTT

For a 1 Gbps link with 500µs RTT to an exchange:

BDP = 1,000,000,000 bits/sec × 0.0005 sec = 500,000 bits = 62,500 bytes ≈ 62.5 KB

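The generic socket default (net.core.rmem_default) is typically 212,992 bytes (~212 KB), which is already adequate for this. Note that TCP sockets take their initial size from the middle value of net.ipv4.tcp_rmem and are then autotuned, so check both. But for a 10 Gbps link with 1ms RTT to a remote co-location site: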

BDP = 10,000,000,000 × 0.001 = 10,000,000 bits = 1.25 MB

The kernel default will undersell your pipe by 6x. You need larger buffers.
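The two calculations above can be reproduced with a small helper. It is a sketch using integer arithmetic (bits per second and microseconds) to avoid floating-point rounding; the function name and the 212,992-byte default are assumptions you should replace with your own values.

```python
def bdp_bytes(bandwidth_bps: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes, from bits/sec and RTT in microseconds."""
    return bandwidth_bps * rtt_us // 8_000_000


if __name__ == "__main__":
    assumed_default = 212_992  # typical net.core.rmem_default; check your system

    local = bdp_bytes(1_000_000_000, 500)       # 1 Gbps, 500 µs RTT
    remote = bdp_bytes(10_000_000_000, 1_000)   # 10 Gbps, 1 ms RTT

    print(f"1G / 500µs: {local} bytes")
    print(f"10G / 1ms:  {remote} bytes")
    print(f"shortfall vs default: {remote / assumed_default:.1f}x")
```

The second case comes out at 1,250,000 bytes, about 5.9 times the assumed default, which is where the "6x" figure comes from.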

# For a 10G link to remote exchange
import socket

BUFFER_SIZE = 4 * 1024 * 1024  # 4 MB - leave headroom

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUFFER_SIZE)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUFFER_SIZE)

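Note: Linux doubles the value you pass to leave room for bookkeeping overhead, and getsockopt() reports the doubled figure. Requesting 4 MB therefore reads back as 8 MB, of which roughly half is usable for payload. Two further caveats: setting SO_RCVBUF explicitly turns off receive-buffer autotuning for that socket, and the sizes should be set before connect() for them to take full effect.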

System-level limits:

# Check current maximums
sysctl net.core.rmem_max
sysctl net.core.wmem_max

# If your per-socket request exceeds rmem_max, the kernel silently caps it
# Raise the ceiling to allow large per-socket buffers
echo "net.core.rmem_max = 16777216" >> /etc/sysctl.d/99-trading.conf
echo "net.core.wmem_max = 16777216" >> /etc/sysctl.d/99-trading.conf

Important: do not blindly set huge buffers on every socket. Large send buffers increase the time between when you call send() and when the kernel actually sends the data - the data sits in the buffer longer. For latency-sensitive trading, you want buffers large enough to not block, but not so large that you’re hiding queueing delay.
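A rough way to see how much delay a send buffer can hide is to compute how long a full buffer takes to drain. This is a worst-case sketch that assumes the buffer is completely full and drains at link rate; in practice the congestion window usually limits how much is queued. The function name is illustrative.

```python
def drain_time_us(buffered_bytes: int, link_bps: int) -> float:
    """Time in microseconds to transmit `buffered_bytes` at `link_bps`."""
    return buffered_bytes * 8 * 1_000_000 / link_bps


if __name__ == "__main__":
    ten_gig = 10_000_000_000
    for size in (64 * 1024, 4 * 1024 * 1024):
        print(f"{size:>8} bytes queued -> {drain_time_us(size, ten_gig):8.1f} µs")
```

At 10 Gbps, 64 KiB of queued data is about 52µs of potential delay, while a full 4 MiB buffer is over 3ms. That is the trade-off between never blocking in send() and keeping queueing delay visible.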

TIME_WAIT Accumulation and SO_REUSEPORT

Trading systems create and destroy TCP connections more than most applications - reconnecting after exchange downtime, refreshing sessions, etc. The side that closes a connection first leaves it in TIME_WAIT state, nominally for 2×MSL (Maximum Segment Lifetime); on Linux the TIME_WAIT period is fixed at 60 seconds. At a typical trading firm that cycles connections frequently, you can accumulate thousands of TIME_WAIT sockets, consuming kernel memory and potentially exhausting ephemeral port space.

# Check TIME_WAIT accumulation (ss -s reports it as "timewait" on the TCP line)
ss -s | grep -i timewait
# Or: netstat -tan | grep TIME_WAIT | wc -l

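SO_REUSEPORT allows multiple sockets to bind to the same local address and port; when a new connection arrives, the kernel distributes it across the bound listeners using a hash. That helps listening services, not outbound exchange sessions. The option that lets you bind() to a local address and port that still has connections in TIME_WAIT is SO_REUSEADDR, which matters if you bind your client sockets to a fixed source port:

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allows bind() while older connections on this address/port sit in TIME_WAIT
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Only needed when several sockets share one listening port
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)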

The kernel parameter net.ipv4.tcp_tw_reuse allows TIME_WAIT sockets to be reused for new outbound connections when the timestamp option is enabled. This is generally safe for client-side connections:

echo "net.ipv4.tcp_tw_reuse = 1" >> /etc/sysctl.d/99-trading.conf

Do not set net.ipv4.tcp_tw_recycle - it was removed in Linux 4.12 because it caused connection resets when NAT was involved, and it’s dangerous in any environment with load balancers or cloud networking.
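If you want TIME_WAIT counts inside your own monitoring rather than from ss, one option is to parse /proc/net/tcp, where the fourth column is the connection state in hex (06 is TIME_WAIT). The sketch below covers IPv4 only (/proc/net/tcp6 has the same layout), and the function name is an assumption for this example.

```python
from collections import Counter
from typing import Dict

# Subset of the kernel's TCP state codes as they appear in /proc/net/tcp
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_tcp_states(proc_net_tcp: str) -> Dict[str, int]:
    """Count sockets per state from the text of /proc/net/tcp."""
    counts: Counter = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        state_hex = fields[3].upper()
        # Unknown codes are reported under their raw hex value
        counts[TCP_STATES.get(state_hex, state_hex)] += 1
    return dict(counts)


if __name__ == "__main__":
    with open("/proc/net/tcp") as handle:
        states = count_tcp_states(handle.read())
    print("TIME_WAIT sockets:", states.get("TIME_WAIT", 0))
```

Exporting this number as a gauge makes a reconnect storm visible before the ephemeral port range runs out.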

TCP Keepalives for Exchange Connections

Crypto exchanges typically close idle connections after 30 seconds to 5 minutes. Without keepalives, you’ll discover the connection is dead when you try to send your next order - exactly at the moment of a fast market move when you most need it.

TCP keepalive at the kernel level:

sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# After 10s idle, start probing
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)

# Probe every 5s
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)

# Give up after 3 failed probes (total: 10s + 3×5s = 25s detection time)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)

These values mean: if the connection is idle for 10 seconds, send a keepalive probe. If the probe fails, retry every 5 seconds. After 3 failures, declare the connection dead.

For most crypto exchanges, I use idle=10, interval=3, count=3. This gives you detection in under 20 seconds for a dead connection (10s + 3×3s = 19s), which is fast enough to reconnect and get back in the market before the next significant price move.
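The detection-time arithmetic is easy to get wrong when you tweak one value, so it can help to keep it next to the code that applies the settings. A minimal sketch, assuming Linux socket options; both function names are invented for this example.

```python
import socket


def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Worst-case time to declare an idle, dead connection: idle + count × interval."""
    return idle + interval * count


def apply_keepalive(sock: socket.socket, idle: int, interval: int, count: int) -> None:
    """Enable TCP keepalive on `sock` with the given timings (Linux option names)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


if __name__ == "__main__":
    print(keepalive_detection_seconds(10, 5, 3))  # settings from the example above
    print(keepalive_detection_seconds(10, 3, 3))  # the tighter settings
```

Keep the result below the exchange’s own idle timeout, otherwise the exchange drops the session before your keepalives notice anything.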

Note: most exchange WebSocket APIs also have their own application-level ping/pong mechanism (Binance sends a ping every 3 minutes, Bybit disconnects after 20 seconds without a ping). Application-level pings are more reliable for detecting zombie connections because they test the full application stack, not just network connectivity. See WebSocket at HFT Scale for the full reconnection state machine.

The Complete sysctl Configuration

Here is the full /etc/sysctl.d/99-trading.conf I run on trading infrastructure. Each parameter has a rationale.

# /etc/sysctl.d/99-trading.conf
# TCP tuning for latency-sensitive trading applications

# --- Congestion control ---
# CUBIC: no probing cycles, good steady-state behavior for datacenter links
net.ipv4.tcp_congestion_control = cubic

# --- Socket buffer sizes ---
# Default buffers (per socket)
net.core.rmem_default = 262144
net.core.wmem_default = 262144
# Maximum buffers (ceiling for setsockopt)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# TCP-specific buffers (min, default, max): tcp_rmem for receive, tcp_wmem for send
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216

# --- TIME_WAIT handling ---
# Allow reuse of TIME_WAIT sockets for new outbound connections
net.ipv4.tcp_tw_reuse = 1
# Max TIME_WAIT sockets before forcible recycling (safety valve)
net.ipv4.tcp_max_tw_buckets = 262144

# --- Syn backlog and connection queuing ---
# Max pending connections per socket (set listen() backlog to match)
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535

# --- Fast retransmit and recovery ---
# Initial reordering tolerance: how many packets may arrive out of order
# before the sender treats the gap as loss (kernel default, keep at 3)
net.ipv4.tcp_reordering = 3
# Don't shrink cwnd and re-run slow start after an idle period
net.ipv4.tcp_slow_start_after_idle = 0

# --- Keepalive defaults (can be overridden per-socket) ---
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3

# --- Timestamps ---
# Required for tcp_tw_reuse to work; also provides RTT measurement
net.ipv4.tcp_timestamps = 1

# --- Network device queue ---
# Per-CPU backlog of received packets waiting for the kernel to process them
net.core.netdev_max_backlog = 65536
# Max packets handled per NAPI softirq poll cycle (default 300)
net.core.netdev_budget = 600

Apply with:

sysctl -p /etc/sysctl.d/99-trading.conf
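After applying, it is worth confirming that the live values match the file, since a later drop-in or a typo can silently override a setting. The following is a sketch, not a hardened tool: the function names are hypothetical, and the key-to-path mapping assumes keys without dots in a component (true for every key in the file above).

```python
from pathlib import Path
from typing import Dict, Tuple


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blanks and comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Collapse whitespace so multi-value entries compare cleanly
            settings[key.strip()] = " ".join(value.split())
    return settings


def read_live_value(key: str, proc_root: str = "/proc/sys") -> str:
    """Read the running kernel's value for a sysctl key."""
    path = Path(proc_root) / key.replace(".", "/")
    return " ".join(path.read_text().split())


def find_mismatches(expected: Dict[str, str],
                    proc_root: str = "/proc/sys") -> Dict[str, Tuple[str, str]]:
    """Return {key: (wanted, actual)} for every setting that differs."""
    mismatches: Dict[str, Tuple[str, str]] = {}
    for key, want in expected.items():
        try:
            have = read_live_value(key, proc_root)
        except OSError:
            have = "<unreadable>"
        if have != want:
            mismatches[key] = (want, have)
    return mismatches


if __name__ == "__main__":
    conf = Path("/etc/sysctl.d/99-trading.conf").read_text()
    for key, (want, have) in find_mismatches(parse_sysctl_conf(conf)).items():
        print(f"{key}: wanted {want!r}, kernel has {have!r}")
```

An empty result means every parameter in the file is live. Running this at application startup catches hosts that were provisioned without the tuning.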

Measuring Your Baseline Before Tuning

Do not apply all of these parameters at once. Measure first, then change one variable at a time. The only way to know if your tuning worked is to have a before-and-after benchmark.

My preferred measurement approach for exchange connections:

import socket
import statistics
import time

def measure_tcp_rtt(host: str, port: int, n_samples: int = 10000) -> dict:
    """
    Send a minimal ping-pong over TCP and measure round-trip latency.
    Assumes a server that echoes whatever it receives.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    sock.connect((host, port))

    latencies = []
    payload = b'\x00' * 64  # Simulate a small order message

    try:
        for _ in range(n_samples):
            t0 = time.perf_counter_ns()
            sock.sendall(payload)
            # MSG_WAITALL blocks until the full echo arrives; a plain recv()
            # may return a partial read and skew the next sample
            echo = sock.recv(len(payload), socket.MSG_WAITALL)
            t1 = time.perf_counter_ns()
            if len(echo) != len(payload):
                raise ConnectionError("peer closed the connection mid-test")
            latencies.append((t1 - t0) / 1000)  # Convert ns to µs
    finally:
        sock.close()

    latencies.sort()
    n = len(latencies)
    return {
        'p50': statistics.median(latencies),
        'p95': latencies[int(0.95 * n)],
        'p99': latencies[int(0.99 * n)],
        'p999': latencies[int(0.999 * n)],
        'max': latencies[-1],
    }

Run this before and after each sysctl change. If P99 doesn’t improve, revert the change. Sysctl parameters interact in non-obvious ways and the correct settings depend on your specific hardware, NIC driver, and network path.
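The measurement function assumes an echo server on the other end. If you do not have one handy, something like the following is enough for a loopback or lab test. It is a single-connection sketch with hypothetical function names, and a Python echo loop adds its own latency, so treat the numbers as relative rather than absolute.

```python
import socket
import threading
from typing import Tuple


def serve_echo(listener: socket.socket) -> None:
    """Accept one connection and echo bytes back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def start_echo_server(host: str = "127.0.0.1", port: int = 0) -> Tuple[socket.socket, int]:
    """Start the echo loop in a background thread; returns (listener, bound port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))  # port 0 lets the kernel pick a free port
    listener.listen(1)
    thread = threading.Thread(target=serve_echo, args=(listener,), daemon=True)
    thread.start()
    return listener, listener.getsockname()[1]
```

Start it with listener, port = start_echo_server(), point the measurement function at 127.0.0.1 and that port, and close the listener when done.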

For deeper TCP-level visibility, use ss -ti (part of iproute2) to inspect per-socket congestion control state:

# See congestion control state for all established TCP connections
ss -ti dst exchange.example.com

This shows you the actual cwnd (congestion window), RTT estimates, and which algorithm is active - ground truth rather than what you think you configured.

How This Breaks in Production

1. BBR probe fires during order burst

Symptom: P99 fill latency has a bimodal distribution. The slow cluster sits consistently at 2-3x your P50 (118µs versus 47µs in the benchmark above) and correlates with periods of high order rate.

Root cause: BBR’s PROBE_BW gain cycle overlaps your burst, and packets sent during the probe window wait on the kernel’s pacing timer.

Fix: Switch to CUBIC. Verify with ss -ti that the socket is using CUBIC, not BBR.

2. Nagle delay on reconnect

Symptom: First order after reconnecting is consistently slow. Market data starts arriving immediately, but the first outbound order is delayed by 40-200µs.

Root cause: TCP_NODELAY was set on the original socket, but reconnecting created a new socket object and the option was never set on it. New sockets have Nagle enabled by default.

Fix: Encapsulate socket creation in a factory function that always sets TCP_NODELAY, TCP_QUICKACK, and other options.

3. QUICKACK silently disarms

Symptom: Your P95 and P99 for market data receipt gradually degrade over a session. Fresh connections are fast; existing connections slow down.

Root cause: TCP_QUICKACK is not permanent. Your initial setup sets it once, the kernel later leaves quickack mode on its own, and you start getting delayed ACKs.

Fix: Re-enable TCP_QUICKACK after every recv() call in your hot path.

4. Buffer size exceeds rmem_max - silently capped

Symptom: Your application sets SO_RCVBUF to 16 MB, but under high market data load you’re dropping packets. You assume 16 MB is in use, but you have ~200 KB.

Root cause: If SO_RCVBUF > net.core.rmem_max, the kernel silently caps the buffer at rmem_max without returning an error. If rmem_max is the default 212 KB, your 16 MB request is silently truncated.

Fix: After calling setsockopt, use getsockopt to verify the actual buffer size was applied. Linux reports double the granted size, so compare against twice the request:

actual = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
# An uncapped request reads back as 2x; anything less means rmem_max clipped it
assert actual >= 2 * BUFFER_SIZE, f"Buffer capped at {actual}, expected {2 * BUFFER_SIZE}"

5. TIME_WAIT exhaustion during exchange reconnect storm

Symptom: During an exchange outage, all trading instances reconnect simultaneously. The reconnect succeeds for some instances and fails for others with “Cannot assign requested address” (EADDRNOTAVAIL) on connect(), or “Address already in use” if you bind to a fixed source port. Recovery takes minutes instead of seconds.

Root cause: Rapid connect/disconnect cycles accumulate TIME_WAIT sockets. The ephemeral port range (default 32768-60999 = 28,232 ports) runs out for a given destination when that many connections to it are sitting in TIME_WAIT.

Fix: Set net.ipv4.tcp_tw_reuse = 1, widen the ephemeral port range (net.ipv4.ip_local_port_range = 1024 65535), and implement connection pooling so you’re not creating fresh sockets for every session.

6. tcp_slow_start_after_idle re-applies slow start on idle exchanges

Symptom: After a quiet market period (no fills for 30+ seconds), your first burst of orders is slow. The first few packets have high latency, then it recovers.

Root cause: With tcp_slow_start_after_idle = 1 (default), once the connection has been idle for longer than the retransmission timeout the kernel shrinks the congestion window back toward the initial window (typically 10 segments) and re-runs slow start. The first 10 packets are fine, but packet #11 onward is rate-limited until cwnd expands.

Fix: Set net.ipv4.tcp_slow_start_after_idle = 0. On a low-latency datacenter link where you’re not competing with other flows, this is safe. On a congested WAN path, be more careful.
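Failure mode 2 is the easiest one to design away. A factory along these lines keeps every connect and reconnect path on the same options. It is a sketch: the function names are hypothetical, the Linux-only options are skipped where Python does not expose them, and the choice to fall back silently when CUBIC is unavailable is an assumption you may want to turn into a hard failure.

```python
import socket


def make_trading_socket(congestion_control: bytes = b"cubic") -> socket.socket:
    """Create a TCP socket with the latency options applied before connect()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        # Linux-only options: skip them where the constant is missing
        quickack = getattr(socket, "TCP_QUICKACK", None)
        if quickack is not None:
            sock.setsockopt(socket.IPPROTO_TCP, quickack, 1)
        congestion = getattr(socket, "TCP_CONGESTION", None)
        if congestion is not None:
            try:
                sock.setsockopt(socket.IPPROTO_TCP, congestion, congestion_control)
            except OSError:
                pass  # algorithm unavailable: the system default stays in effect
    except OSError:
        sock.close()
        raise
    return sock


def reconnect(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Every reconnect goes through the factory, so no option is forgotten."""
    sock = make_trading_socket()
    try:
        sock.settimeout(timeout)
        sock.connect((host, port))
    except OSError:
        sock.close()
        raise
    return sock
```

Because the options are applied in one place, a code review only has to check that nothing in the codebase calls socket.socket() directly for exchange connections.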


The core principle: TCP was designed for reliability and throughput. For latency-sensitive trading, you need to selectively disable or tune the behaviors that optimize for throughput at the expense of consistent per-packet latency. Disable Nagle, arm QUICKACK, pick CUBIC over BBR, and measure everything before and after. The parameters that matter most are the ones your specific hardware and network path expose - the only way to find them is to instrument and measure.

For the memory tuning that complements these network changes, see the Linux memory tuning series. For the time synchronization that makes your latency measurements meaningful, see PTP in Production with Solarflare.
