Why Your Matching Engine Doesn't Belong on Kubernetes (and What Does)

At Akuna Capital, the matching engines ran on bare metal servers with systemd units and CPU affinity pinning. There was no container runtime, no kubelet, no iptables rules inserted by kube-proxy. The only processes on those hosts were the market data handler, the matching engine itself, and a minimal observability agent - each pinned to specific CPU cores with taskset and isolated from each other via Linux cgroups at the resource level, not the scheduling level.

Meanwhile, our order management system, risk engine, operator console, and all of the tooling that surrounded the matching engine ran on Kubernetes. The K8s cluster managed hundreds of pods, did rolling deployments, handled auto-scaling, ran sidecars and init containers. The two worlds coexisted cleanly because we had been deliberate about which workloads belonged in which environment.

Most teams building trading systems today reach for Kubernetes first because it is the default for production infrastructure in 2026. Sometimes that is fine. When it is not fine, the problems do not announce themselves loudly - they show up as unexplained variance in latency histograms, P99.9 spikes that “only happen sometimes,” and a debugging session that eventually traces back to a cgroup throttle event or an iptables traversal.

This post explains where K8s fails for trading and where it is excellent, so you can make the split deliberately rather than after a painful postmortem.

The Three Ways Kubernetes Hurts Trading Latency

1. cgroup CPU Scheduling: No Hard Latency Guarantees

Kubernetes manages CPU resources through cgroups (control groups). When you specify resources.requests.cpu: "4" and resources.limits.cpu: "4" in your Pod spec, you are telling the kernel’s Completely Fair Scheduler (CFS) to give your container 4 CPU cores worth of time. What you are not getting is dedicated CPU time - you are getting a share of a time-sliced system.

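CFS enforces CPU limits over a bandwidth-control period (100ms by default; cpu.cfs_period_us in cgroup v1, the second field of cpu.max in cgroup v2). Within each period, your container can run until it has used its quota, which is the limit multiplied by the period: a 4-core limit allows 400ms of CPU time per 100ms period. If the container exhausts its quota early, it is throttled until the next period begins. If your container is the only one requesting CPU, it will run continuously. If the node is heavily loaded, your container shares CPU with others.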
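One way to check whether the quota is actually biting is to read the container's cpu.stat file, which counts enforcement periods and throttle events. The following is a minimal sketch, assuming the cgroup v2 field names (nr_periods, nr_throttled, throttled_usec); the helper names and the sample text are illustrative, not measurements from a real host.

```python
def quota_us(cpu_limit: float, period_us: int = 100_000) -> int:
    """CPU time (in µs) a cgroup may consume per period for a given CPU limit."""
    return int(cpu_limit * period_us)


def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats: dict[str, int]) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Illustrative sample only; on a real host, read the pod's cpu.stat file instead.
SAMPLE = """usage_usec 8123456
user_usec 7000000
system_usec 1123456
nr_periods 1000
nr_throttled 25
throttled_usec 180000
"""

if __name__ == "__main__":
    stats = parse_cpu_stat(SAMPLE)
    print("quota per period (us):", quota_us(4))
    print("throttled fraction:", throttle_ratio(stats))
```

A non-zero nr_throttled on a latency-sensitive pod is worth alerting on: each throttle event can stall the container until the next period starts.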

For a trading strategy thread that needs to process a market data event in under 50µs, the CFS scheduler presents a problem. The scheduler may preempt your thread to service another cgroup. That preemption can last anywhere from a few microseconds to tens of milliseconds depending on the scheduling queue. You will see this as P99+ latency spikes that are not caused by anything in your application code.

The Kubernetes Guaranteed QoS class (requests == limits for both CPU and memory) reduces this problem: the pod's CPU weight reflects its full request, and it is the last to be evicted under node pressure. But Guaranteed pods do not get dedicated cores by default - they share the CPU time-slice pool with other pods and with system processes.

The Linux solution for dedicated CPU cores is CPU isolation via the isolcpus kernel parameter plus taskset affinity binding. Kubernetes does not offer this model out of the box. You can approximate it with the CPU Manager static policy and the Topology Manager, which let the kubelet pin Guaranteed pods with integer CPU requests to specific CPUs on a node - but this is off by default, has to be configured in the kubelet on every node, and adds another layer to diagnose when things go wrong.
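To make the affinity half of that model concrete, here is a small sketch of what taskset does, written with Python's os.sched_setaffinity. It assumes Linux (the call does not exist on macOS or Windows), and the function names are illustrative. It handles plain CPU lists such as "4-7,9" only, not the optional flag prefixes that isolcpus accepts.

```python
import os


def parse_cpu_list(spec: str) -> set[int]:
    """Expand a CPU list such as '4-7,9' into {4, 5, 6, 7, 9}."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_process(cpus: set[int]) -> set[int]:
    """Pin this process to the requested CPUs that it is allowed to use.

    Returns the resulting affinity set. Linux-only.
    """
    allowed = os.sched_getaffinity(0)
    target = cpus & allowed
    if not target:
        raise ValueError(f"none of {sorted(cpus)} are in the allowed set")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)
```

Pinning alone does not keep other tasks off those cores; that is the job of isolcpus (or an equivalent isolation mechanism) on the kernel command line.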

Our measurement: a hot path strategy thread running as a Guaranteed pod on Kubernetes versus the same thread running as a bare metal process with isolcpus + taskset:

K8s Guaranteed pod: P50: 14µs, P99: 67µs, P99.9: 4.2ms (CFS preemption events)
Bare metal + CPU isolation: P50: 11µs, P99: 29µs, P99.9: 38µs

The P99.9 difference - 4.2ms vs 38µs - is what makes bare metal non-negotiable for sub-100µs workloads. A 4ms spike during a volatile market event can mean trading against a stale quote for thousands of microseconds.
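The reason this only shows up at P99.9 is arithmetic: a spike that hits 0.2% of events is invisible at P99. A short sketch using the nearest-rank percentile definition on synthetic data (the sample values are made up for illustration and are not the measurements quoted above):

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples_us: list[float]) -> dict[str, float]:
    """Latency summary in the same P50 / P99 / P99.9 shape used in this post."""
    return {
        "p50": percentile(samples_us, 50),
        "p99": percentile(samples_us, 99),
        "p99.9": percentile(samples_us, 99.9),
    }


if __name__ == "__main__":
    # Synthetic: 9,980 events at 12µs plus 20 preemption-like stalls at 4,000µs.
    samples = [12.0] * 9_980 + [4_000.0] * 20
    print(summarize(samples))
```

With this data, P50 and P99 both report 12µs while P99.9 reports 4,000µs, which is why a dashboard that stops at P99 can look healthy while the tail is not.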

2. kube-proxy iptables: 50-200µs Per Connection

Every Kubernetes Service creates iptables rules via kube-proxy. When your strategy pod sends an order to your order router Service, that packet traverses the iptables chain before reaching the destination. The overhead is not zero.

iptables rules are evaluated linearly. A cluster with hundreds of Services has hundreds to thousands of rules, and the first packet of every new connection walks them until one matches (later packets in the same flow are handled by connection tracking). The actual overhead depends on where your rule falls in the chain, but it is consistently in the 50-200µs range per new connection for real clusters, and it grows with the number of Services.
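A quick way to see how long those chains are on a given node is to count the kube-proxy rules in iptables-save output. The sketch below assumes the usual KUBE-SERVICES / KUBE-SVC-* / KUBE-SEP-* chain naming; the sample text, chain suffixes, and IP addresses are invented for illustration.

```python
from collections import Counter


def count_kube_rules(iptables_save_output: str) -> Counter:
    """Count appended rules ('-A' lines) per kube-proxy chain family."""
    counts: Counter = Counter()
    for line in iptables_save_output.splitlines():
        if not line.startswith("-A KUBE-"):
            continue
        chain = line.split()[1]                   # e.g. KUBE-SVC-EXAMPLE1
        family = "-".join(chain.split("-")[:2])   # e.g. KUBE-SVC
        counts[family] += 1
    return counts


# Illustrative input; on a node you would feed in `iptables-save -t nat` output.
SAMPLE = """*nat
:KUBE-SERVICES - [0:0]
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m tcp --dport 53 -j KUBE-SVC-EXAMPLE1
-A KUBE-SERVICES -d 10.96.0.20/32 -p tcp -m tcp --dport 443 -j KUBE-SVC-EXAMPLE2
-A KUBE-SVC-EXAMPLE1 -j KUBE-SEP-EXAMPLEA
-A KUBE-SEP-EXAMPLEA -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:53
COMMIT
"""

if __name__ == "__main__":
    for family, n in sorted(count_kube_rules(SAMPLE).items()):
        print(family, n)
```

Tracking the KUBE-SERVICES count over time gives a rough proxy for how the worst-case traversal grows as Services are added.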

The alternative - kube-proxy in IPVS mode - reduces the Service lookup overhead to sub-microsecond (IPVS uses hash tables, not linear rule chains). Many production clusters still run iptables mode because it is the default and changing modes requires draining the cluster. If you are running K8s for trading, be explicit about which kube-proxy mode you are using and measure the overhead.

Even with IPVS, the Kubernetes networking layer adds at least one iptables traversal for NAT (source NAT for external traffic, DNAT for Service VIP resolution). For hot-path trading, this is overhead that adds nothing - you know exactly which IP you want to reach, and you do not need service discovery for your matching engine.

3. Container Start Time Is Measured in Seconds

When a trading node fails, you want to recover in milliseconds (warm standby switch) or at worst a few seconds (restart from a known-good checkpoint). Container start time in Kubernetes introduces a lower bound on recovery that you cannot easily shrink.

A typical trading container start sequence:

Image pull (if not cached): 30-120 seconds for a 500MB trading image
Container runtime initialization: 1-3 seconds
Pod network setup (CNI plugin): 0.5-2 seconds
Application startup (exchange connections, state restoration): 5-30 seconds

Even with the image pre-pulled and a fast CNI path, you are looking at roughly 7-35 seconds for a full container restart. A systemd-managed trading process on bare metal pays the same application startup cost but skips the container runtime and pod network phases: crash detection plus the systemd restart takes about 200ms, before any RestartSec delay you configure. The platform overhead is therefore roughly 1.5-5 seconds on Kubernetes versus a fraction of a second under systemd.

For latency-tolerant trading systems (>1ms acceptable, which covers the majority of algorithmic trading), this difference does not matter. For market-making at competitive venues, those extra seconds of recovery time are a material difference in exposure.
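The recovery windows quoted in this section are just sums over the start sequence. A small sketch that reproduces the arithmetic from the phase ranges listed above (the phase keys are illustrative labels; the 6.5-second lower bound is what the text rounds to 7):

```python
# (min_seconds, max_seconds) per phase, taken from the start sequence above.
PHASES = {
    "image_pull": (30.0, 120.0),
    "runtime_init": (1.0, 3.0),
    "cni_setup": (0.5, 2.0),
    "app_startup": (5.0, 30.0),
}


def restart_window(phases: dict[str, tuple[float, float]],
                   skip: frozenset[str] = frozenset()) -> tuple[float, float]:
    """Sum the best-case and worst-case durations of the phases not skipped."""
    kept = [rng for name, rng in phases.items() if name not in skip]
    return sum(lo for lo, _ in kept), sum(hi for _, hi in kept)


if __name__ == "__main__":
    print("cold start:", restart_window(PHASES))
    print("image cached:", restart_window(PHASES, frozenset({"image_pull"})))
    print("platform overhead only:",
          restart_window(PHASES, frozenset({"image_pull", "app_startup"})))
```

Separating application startup from platform overhead makes it clear which part of the window a deployment model can change and which part only a warm standby can remove.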

What Kubernetes Handles Beautifully in Trading

The case against K8s for the matching engine hot path does not mean K8s is wrong for trading infrastructure. The opposite is true: Kubernetes is excellent for everything that surrounds the matching engine.

The Control Plane Layer

+-------------------------------------------+
|         TRADING CONTROL PLANE (K8s)       |
|                                           |
|  +-----------+  +----------+  +--------+  |
|  |    OMS    |  |   Risk   |  | Admin  |  |
|  | (order    |  |  Engine  |  |  API   |  |
|  | mgmt sys) |  |          |  |        |  |
|  +-----------+  +----------+  +--------+  |
|                                           |
|  +--------+  +---------+  +----------+    |
|  |  Feed  |  |  Market |  |  Audit   |    |
|  |Handler |  |  Data   |  |   Log    |    |
|  | (slow) |  | Storage |  |  Ingest  |    |
|  +--------+  +---------+  +----------+    |
+-------------------+--+--------------------+
                    |  |
             low-lat|  |state
                    |  |
+-------------------+--+--------------------+
|          TRADING HOT PATH (bare metal)    |
|                                           |
|  +-------------------+  +-----------+     |
|  |  Market Data      |  |  Strategy |     |
|  |  Handler (pinned) |  |  Engine   |     |
|  |  CPU cores 0-3    |  |  CPU 4-7  |     |
|  +-------------------+  +-----------+     |
|                                           |
|  +-----------------------------------+    |
|  |  Order Router                     |    |
|  |  CPU cores 8-11                   |    |
|  +-----------------------------------+    |
+-------------------------------------------+

The control plane components - order management system, risk engine (at the aggregate level, not the per-order level), admin API, feed handlers for slower data - do not have microsecond latency requirements. They need reliability, observability, easy deployment, and the ability to scale. These are exactly what Kubernetes provides.

Order management system: Manages open orders, tracks fills, reconciles positions. Does not need sub-millisecond latency. Needs reliable state (database-backed), easy restarts after deployment, and good observability. K8s with a StatefulSet and a PersistentVolumeClaim is fine.

Risk engine (coarse-grained): The real-time per-order risk check runs on the hot path (bare metal). The aggregate risk engine that monitors drawdown, VaR, position limits against the full portfolio runs on K8s. It queries position state from the OMS and runs checks on a 100ms-1s cycle. Perfect K8s workload.

Feed handlers for reference data: Market data for reference purposes - end-of-day prices, corporate actions, reference rates - comes in slowly and does not drive live trading decisions. K8s, easily.

Operator tooling: Admin dashboards, deployment pipelines, configuration management, alerting. K8s with Helm charts and ArgoCD.

The Deployment Advantage

Rolling deployments, rollbacks, blue-green, canary - Kubernetes handles all of these natively for stateless and lightly-stateful services. Updating the risk engine configuration, deploying a new version of the operator dashboard, rotating secrets - all can happen without manual SSH sessions and systemctl restarts.

For bare metal trading nodes, you are doing deployments the “old way”: SSH into the host, stop the service, replace the binary, restart with the new version. This is not terrible - it is how most serious trading infrastructure has always worked - but it requires discipline and runbooks. There is no kubectl rollout undo for a bad binary on a bare metal host.

The Full Architecture Split

This is the architecture we run at ZeroCopy and recommend to clients:

Bare metal / dedicated VM (systemd-managed):

Market data handler: pinned to CPU cores 0-3, NUMA node 0
Strategy execution: pinned to CPU cores 4-7, NUMA node 0
Order router: pinned to CPU cores 8-11, NUMA node 0
Connectivity: direct NIC queue assignment, kernel bypass where possible
Recovery: warm standby instance, systemd auto-restart on crash

Kubernetes cluster (DOKS or EKS):

Order management system (StatefulSet, PVC for position state)
Coarse-grained risk engine (Deployment, HPA for scaling)
Market data storage ingest (writes to ClickHouse)
Admin API (Deployment, Ingress with mTLS)
Observability stack (Prometheus, Grafana, AlertManager)
Feed handlers for reference data (Deployment)

The interface between the two worlds: the hot path emits events to NATS (running outside both the bare metal cluster and K8s, on a dedicated NATS cluster for reliability). The K8s control plane subscribes to NATS for fills, position updates, and risk events. The hot path subscribes to NATS for configuration updates from the risk engine.
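What crosses that boundary is worth pinning down early, because both sides deploy independently. The sketch below shows one possible event envelope and subject layout for fills. Everything in it is hypothetical (the FillEvent fields, the hotpath.fills.<symbol> subject scheme, the JSON encoding), and the actual publish call through a NATS client is deliberately left out.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FillEvent:
    """Hypothetical fill event emitted by the hot path."""
    order_id: str
    symbol: str
    side: str            # "buy" or "sell"
    quantity: int
    price: str           # decimal string, avoids float rounding in transit
    exchange_ts_ns: int


def subject_for(event: FillEvent) -> str:
    """NATS subjects are dot-separated tokens, so one token per symbol lets
    consumers subscribe to 'hotpath.fills.*' or to a single instrument.
    Symbols that contain dots would need escaping first."""
    return f"hotpath.fills.{event.symbol}"


def encode(event: FillEvent) -> bytes:
    """Serialize to compact JSON bytes for the message payload."""
    return json.dumps(asdict(event), separators=(",", ":")).encode("utf-8")


def decode(payload: bytes) -> FillEvent:
    """Inverse of encode(); raises TypeError on unexpected fields."""
    return FillEvent(**json.loads(payload))
```

JSON is used here for readability; a hot path that cares about serialization cost would more likely use a fixed binary layout, with the same subject scheme.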

The systemd unit for the strategy engine:

# /etc/systemd/system/strategy-engine.service
[Unit]
Description=ZeroCopy Strategy Engine
After=network.target nats-client-ready.target
Wants=nats-client-ready.target

# Restart rate limit: max 3 starts in 60s before giving up (alert on this).
# In current systemd these are [Unit] settings, not [Service] settings.
StartLimitIntervalSec=60s
StartLimitBurst=3

[Service]
Type=exec
User=trading
Group=trading

# CPU affinity: cores 4-7 (NUMA node 0)
CPUAffinity=4-7

# NUMA memory binding: force allocations to NUMA node 0.
# Only --membind is passed to numactl: adding --cpunodebind=0 can reset the
# affinity to every core on node 0 and undo CPUAffinity=4-7.
Environment="NUMA_MEMORY_NODE=0"
# Preflight: fail the start early if the memory binding cannot be applied
ExecStartPre=/usr/bin/numactl --membind=0 -- true
ExecStart=/usr/bin/numactl --membind=0 -- \
    /opt/zerocopy/bin/strategy-engine \
    --config /etc/zerocopy/strategy.toml \
    --exchange-config /etc/zerocopy/exchanges.toml

# Graceful shutdown: signal SIGTERM, wait 30s, then SIGKILL
# The engine handles SIGTERM by canceling open orders before exiting
KillMode=process
KillSignal=SIGTERM
TimeoutStopSec=30

# Restart policy: restart on failure, but not if it exits cleanly
Restart=on-failure
RestartSec=2s

# Monitoring: accept status notifications from the main process.
# With Type=exec systemd does not wait for READY=1; switch to Type=notify
# if the engine calls sd_notify() once it is ready to trade.
NotifyAccess=main

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/var/lib/zerocopy /var/log/zerocopy

[Install]
WantedBy=multi-user.target

The Latency-Tolerance Threshold

To be direct about when this argument does not apply: if your acceptable hot-path latency is above 1 millisecond, Kubernetes is probably fine for everything, including your strategy execution.

The CFS scheduling jitter and kube-proxy overhead are sub-millisecond in most cases. If you are building a mean-reversion strategy that operates on 5-second windows, or an execution algorithm that targets VWAP over 30 minutes, the difference between 50µs and 4ms on a P99.9 event is not material to your edge.

The bare metal + systemd approach adds operational complexity. You lose rolling deployments, self-healing, declarative configuration management, and all the other things K8s gives you. You pay that cost in operational burden. The question is whether your latency requirement justifies that cost.

My threshold: if your hot path requires P99 under 200µs, bare metal is worth it. If P99 of 1ms is acceptable, run everything on Kubernetes and enjoy the operational simplicity.

How This Breaks in Production

CPU affinity fighting the kernel. If you pin your strategy thread to CPU 4 but irqbalance also assigns network interrupts to CPU 4, your strategy thread shares a core with interrupt processing. This causes exactly the kind of unpredictable latency degradation that affinity pinning was supposed to prevent. Always pin IRQ affinity explicitly (/proc/irq/*/smp_affinity) as part of your bare metal setup, and verify with the per-CPU counters in /proc/interrupts that interrupts are not landing on your strategy cores.
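The smp_affinity files hold hexadecimal CPU bitmasks, which are easy to misread by eye. A small sketch that decodes a mask and flags IRQs overlapping the strategy cores; the IRQ numbers and masks in the example are made up, and reading the real files from /proc/irq is left to the caller.

```python
def mask_to_cpus(mask: str) -> set[int]:
    """Decode an smp_affinity hex mask into CPU numbers.

    Masks wider than 32 bits are written as comma-separated 32-bit groups,
    most significant group first, e.g. '00000001,000000f0'.
    """
    value = int(mask.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def conflicting_irqs(irq_masks: dict[int, str],
                     strategy_cpus: set[int]) -> dict[int, set[int]]:
    """Return {irq: overlapping CPUs} for IRQs allowed on any strategy core."""
    conflicts: dict[int, set[int]] = {}
    for irq, mask in irq_masks.items():
        overlap = mask_to_cpus(mask) & strategy_cpus
        if overlap:
            conflicts[irq] = overlap
    return conflicts


if __name__ == "__main__":
    # Hypothetical IRQ numbers and masks, for illustration only.
    masks = {24: "10", 25: "0f", 26: "f00"}
    print(conflicting_irqs(masks, strategy_cpus={4, 5, 6, 7}))
```

In this example only IRQ 24 is flagged, because mask 0x10 is CPU 4. The affinity mask says where an interrupt may be delivered; /proc/interrupts shows where it actually has been.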

systemd restart loop masking crashes. If your strategy engine has a startup bug (bad config file, unreachable exchange, certificate error), systemd will restart it up to StartLimitBurst times before giving up. During those restarts, your trading system looks healthy to external monitors (the process is running) but is not actually trading. Add explicit readiness probes - a small HTTP endpoint that returns 200 only after the engine is connected to exchanges and has received its first market data - and monitor that, not just process health.
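A readiness endpoint of that kind can be very small. The following sketch uses only the Python standard library and assumes the engine flips two flags as it comes up; the Readiness class, the /ready path, and the probe helper are illustrative names, not part of any existing tool.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Readiness:
    """Flags the engine sets as it comes up (threading.Event is thread-safe)."""

    def __init__(self) -> None:
        self.exchange_connected = threading.Event()
        self.first_market_data = threading.Event()

    def ready(self) -> bool:
        return self.exchange_connected.is_set() and self.first_market_data.is_set()


def make_server(state: Readiness, host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Build an HTTP server whose /ready returns 200 only when state.ready()."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/ready":
                status, body = 404, b"not found\n"
            elif state.ready():
                status, body = 200, b"ready\n"
            else:
                status, body = 503, b"not ready\n"
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args) -> None:
            pass  # keep per-request logging out of the engine's log

    return HTTPServer((host, port), Handler)


def probe(port: int, host: str = "127.0.0.1") -> int:
    """What an external monitor would do: GET /ready and return the status."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/ready")
        return conn.getresponse().status
    finally:
        conn.close()
```

The server would run in a background thread (threading.Thread(target=server.serve_forever, daemon=True)) on a core outside the strategy set, so that monitoring traffic never competes with the hot path.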

K8s kube-proxy iptables rules consuming CPU on the trading host. If you are running trading nodes on VMs within a Kubernetes node pool (a common cost-saving setup), the kubelet and kube-proxy processes on the same host consume CPU, interfere with IRQ assignment, and add iptables overhead that affects all traffic from that host. Trading nodes should be dedicated hosts - either bare metal or VMs in a node pool where the only workloads are trading workloads.

NATS as the single interface point becoming a bottleneck. Using NATS as the boundary between bare metal and K8s is clean architecturally, but if your NATS cluster is under-sized or misconfigured, it becomes a latency bottleneck at exactly the wrong moment - when market volatility is high and message rates spike. Run NATS on dedicated hardware, not as a K8s pod, and over-provision it by at least 3x your average load.

Missing graceful shutdown on bare metal once the exchange session is open. A SIGTERM to your strategy engine should cancel open orders and close positions before exiting. If it does not, and your engine is restarted during a live session, those open orders sit at the exchange until they fill or expire - while your new instance starts up without knowledge of them. A hard crash gives the handler no chance to run at all, so the new instance must also reconcile against exchange state on startup. Otherwise you get a position state split: your internal state says flat, your exchange state says long, and the restarted engine can double up on positions. Implement and test both the SIGTERM handler and the startup reconciliation before going live.
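The shape of such a handler is simple enough to sketch. In the example below the signal handler only sets a flag, and the main loop performs the cancellation on its way out. FakeGateway and Engine are hypothetical stand-ins for a real exchange connection and strategy loop.

```python
import signal
import threading


class FakeGateway:
    """Stand-in for an exchange connection (hypothetical interface)."""

    def __init__(self, open_orders):
        self.open_orders = set(open_orders)
        self.cancelled = []

    def cancel(self, order_id: str) -> None:
        self.open_orders.discard(order_id)
        self.cancelled.append(order_id)


class Engine:
    def __init__(self, gateway: FakeGateway) -> None:
        self.gateway = gateway
        self.stop = threading.Event()

    def handle_sigterm(self, signum, frame) -> None:
        # Keep the handler tiny: request shutdown and return immediately.
        self.stop.set()

    def shutdown(self) -> None:
        # Cancel every open order; sorted() only makes the order deterministic.
        for order_id in sorted(self.gateway.open_orders):
            self.gateway.cancel(order_id)

    def run(self, poll_s: float = 0.01) -> None:
        while not self.stop.wait(poll_s):
            pass  # the trading loop would go here
        self.shutdown()


def install(engine: Engine) -> None:
    """Register the handler; signal.signal must be called from the main thread."""
    signal.signal(signal.SIGTERM, engine.handle_sigterm)
```

Whatever shutdown() does has to finish inside the service manager's stop timeout (30 seconds in the unit above); after that the process is killed regardless of what is still open at the exchange.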

Container runtime cgroup version mismatch. If your K8s cluster uses cgroup v2 and you configure CPU affinity hints through Kubernetes Topology Manager, the behavior differs between cgroup v1 and v2 in subtle ways - particularly around how CPU quota is enforced on pods with exclusive CPU assignments. If you are mixing bare metal nodes (which may have different kernel versions) with K8s nodes in the same cluster, verify cgroup version consistency and test Topology Manager behavior explicitly.