Solarflare ef_vi vs DPDK vs AF_XDP: A Decision Framework for Kernel Bypass in 2026
ef_vi vs DPDK vs AF_XDP on identical hardware: ef_vi P99 was 31ns above median; DPDK was 87ns. When the variance gap, not the median, should drive the kernel bypass decision.
At Gemini, the infrastructure team had an ongoing argument about whether to keep ef_vi or migrate to DPDK. The DPDK camp had good arguments: better community, more hardware support, a cleaner API, no vendor lock-in. The ef_vi camp had one argument: numbers.
When we ran the side-by-side benchmark on identical hardware - Solarflare SFN8522 with ef_vi versus the same NIC with DPDK’s sfc PMD - the median latency difference was 18ns. That sounds like nothing. But on a NIC that was receiving market data at 50,000 packets per second, 18ns per packet was 900µs per second of cumulative overhead that the ef_vi path simply did not pay.
More importantly, the variance was different. ef_vi’s P99 was 31ns above P50. DPDK’s P99 was 87ns above P50. The variance matters more than the median in market-making - you pay the tail, not the average.
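The two figures above come down to simple arithmetic, sketched below: the per-second cost of a per-packet delta, and the P99-minus-P50 spread computed from raw latency samples. The function names and the nearest-rank percentile method are assumptions made for illustration; the tooling behind the original benchmark is not described in this post.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort() over uint64_t latency samples. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile. Sorts `samples` in place.
 * `pct` is an integer percentage in 1..100; returns 0 for an empty set. */
uint64_t percentile_ns(uint64_t *samples, size_t n, unsigned pct)
{
    if (n == 0) return 0;
    qsort(samples, n, sizeof samples[0], cmp_u64);
    size_t rank = ((size_t)pct * n + 99) / 100;   /* ceil(pct * n / 100) */
    if (rank == 0) rank = 1;
    if (rank > n)  rank = n;
    return samples[rank - 1];
}

/* The tail figure quoted in the text: how far P99 sits above P50. */
uint64_t tail_spread_ns(uint64_t *samples, size_t n)
{
    return percentile_ns(samples, n, 99) - percentile_ns(samples, n, 50);
}

/* Cumulative cost per second, in microseconds, of a per-packet delta. */
double cumulative_overhead_us(double delta_ns, double packets_per_sec)
{
    return delta_ns * packets_per_sec / 1000.0;
}
```

With the numbers from the benchmark, cumulative_overhead_us(18.0, 50000.0) gives the 900µs per second quoted above. Comparing tail_spread_ns() across two sample sets gathered on the same host is the comparison that separates the 31ns and 87ns results.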
This post covers what makes each approach different, when to choose each one, and the benchmarks you should run before making the decision for your environment.
The Kernel Network Stack Problem
Standard Linux network I/O has the NIC DMA each packet into a kernel buffer, which the driver wraps in a socket buffer (sk_buff) and passes through the IP/UDP stack in softirq context. The packet is then queued on the socket under its lock, and recvmsg() finally copies the data into your userspace buffer. Each of these steps has CPU cost and introduces latency variance because interrupts, softirq processing, and the scheduler can all intervene between them.
On an untuned system, the kernel path adds 5-20µs per packet. Even with tuning (interrupt coalescing disabled, IRQ pinned, NUMA-local socket), the kernel path is typically 1-5µs - and more importantly, it has high variance because softirq processing competes with your application for CPU time.
Kernel bypass removes the kernel from the data path entirely. The NIC writes packets directly into memory you control (DMA), and your application reads them directly. No copy, no locks, no softirq, no kernel involvement.
Standard kernel path:
NIC → DMA → sk_buff (kernel) → sock → recvmsg() → app buffer
Latency: 1-5µs, high variance
Kernel bypass path:
NIC → DMA → app buffer (direct)
Latency: 100-500ns, low variance
The three bypass approaches differ in where they implement this bypass and what abstraction they expose to your application.
ef_vi: The Lowest Floor
ef_vi is Solarflare’s (now AMD’s) vendor-specific user-space library for its network adapters, distributed as part of Onload. It exposes an interface at the level of individual NIC descriptor rings - you manage RX and TX descriptor rings directly, with a very thin C library providing just enough abstraction to be portable across Solarflare generations.
The overhead is genuinely minimal: ef_eventq_poll() reads the next event queue entry to check for new packets. On a cache-warm poll that finds no packets, this costs 2-3 ns (a handful of CPU cycles). On a poll that finds a packet, you get a pointer directly into DMA memory - zero copy.
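/* Minimal ef_vi receive loop (sketch; error handling omitted) */
#include <etherfabric/vi.h>
#include <etherfabric/pd.h>
#include <etherfabric/memreg.h>
#include <net/if.h>              /* if_nametoindex() */

#define MAX_EVENTS 32

ef_vi vi;
ef_driver_handle dh;
ef_pd pd;
ef_memreg memreg;

/* Initialization (do once at startup, on a thread pinned to the NIC's NUMA node) */
ef_driver_open(&dh);
ef_pd_alloc(&pd, dh, if_nametoindex("eth0"), EF_PD_DEFAULT);
/* -1 selects default ring sizes; EF_VI_RX_TIMESTAMPS enables the timestamp prefix */
ef_vi_alloc_from_pd(&vi, dh, &pd, dh, -1, -1, -1, NULL, -1, EF_VI_RX_TIMESTAMPS);
/* Allocate huge-page DMA buffer on correct NUMA node.
   mmap_hugepage_numa() is an application helper, not part of ef_vi. */
void *dma_buf = mmap_hugepage_numa(DMA_BUF_SIZE, /*node=*/0);
ef_memreg_alloc(&memreg, dh, &pd, dh, dma_buf, DMA_BUF_SIZE);
const int prefix_len = ef_vi_receive_prefix_len(&vi);
/* A filter (ef_vi_filter_add) must also be installed to steer traffic to this VI */

/* Post initial RX descriptors: one BUF_SIZE buffer per ring slot */
for (int i = 0; i < RX_RING_SIZE; i++) {
    ef_vi_receive_post(&vi, ef_memreg_dma_addr(&memreg, (size_t)i * BUF_SIZE), i);
}

/* Hot path poll loop - this is the trading engine receive path */
while (1) {
    ef_event events[MAX_EVENTS];
    int n = ef_eventq_poll(&vi, events, MAX_EVENTS);
    for (int i = 0; i < n; i++) {
        if (EF_EVENT_TYPE(events[i]) != EF_EVENT_TYPE_RX) continue;
        int buf_id = EF_EVENT_RX_RQ_ID(events[i]);
        /* EF_EVENT_RX_BYTES() counts the prefix as well as the frame */
        int frame_len = EF_EVENT_RX_BYTES(events[i]) - prefix_len;
        /* Zero-copy: direct pointer into DMA buffer */
        const uint8_t *buf = (const uint8_t *)dma_buf + (size_t)buf_id * BUF_SIZE;
        const uint8_t *frame = buf + prefix_len;
        /* Hardware timestamp taken by the NIC at arrival, read from the prefix */
        ef_precisetime hw_ts;
        ef_vi_receive_get_precise_timestamp(&vi, buf, &hw_ts);
        /* Pointer and length both skip the Ethernet header */
        process_packet(frame + ETH_HLEN, frame_len - ETH_HLEN, hw_ts);
        /* Repost the same buffer for reuse */
        ef_vi_receive_post(&vi, ef_memreg_dma_addr(&memreg, (size_t)buf_id * BUF_SIZE), buf_id);
    }
}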
The critical property is the hardware timestamp. Solarflare NICs timestamp packets in hardware at arrival, before the host CPU is even notified. This timestamp is embedded in the packet prefix and retrieved via ef_vi_receive_get_precise_timestamp(). The resolution is sub-nanosecond on SFN8522 and newer NICs; the accuracy against UTC is only as good as the NIC clock's IEEE 1588 PTP sync, so the timestamps are meaningful as absolute times only while that sync is active. This timestamp is what you use for latency attribution, regulatory compliance, and best-execution analysis.
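For latency attribution, the basic operation is subtracting the NIC arrival timestamp from a later host-side timestamp. The sketch below assumes both values have already been converted to struct timespec, and that the NIC clock and the host clock are disciplined to the same time source; otherwise the result includes the offset between the two clocks. timespec_delta_ns() and within_budget() are hypothetical helpers, not part of ef_vi.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from `earlier` to `later`.
 * Negative if the arguments are in the wrong order, which in practice
 * usually points at a clock synchronization problem. */
int64_t timespec_delta_ns(struct timespec earlier, struct timespec later)
{
    int64_t sec  = (int64_t)later.tv_sec  - (int64_t)earlier.tv_sec;
    int64_t nsec = (int64_t)later.tv_nsec - (int64_t)earlier.tv_nsec;
    return sec * 1000000000LL + nsec;
}

/* True when a wire-to-application delta is non-negative and within budget. */
bool within_budget(int64_t delta_ns, int64_t budget_ns)
{
    return delta_ns >= 0 && delta_ns <= budget_ns;
}
```

A negative delta is worth counting separately rather than discarding: it is an early sign that one of the two clocks has drifted.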
ef_vi overhead profile on SFN8522:
Benchmark                             Measured
─────────────────────────────────────────────
ef_eventq_poll() no packet            2-3 ns
ef_eventq_poll() with 1 packet        8-12 ns
Full RX path (poll → process ptr)     15-22 ns
Full TX path (submit → doorbell)      18-28 ns
Hardware timestamp resolution         < 1 ns
The limitation is vendor lock-in. ef_vi only works on Solarflare NICs. You cannot test on a laptop, cannot use it in a cloud environment, cannot fall back to a different NIC vendor without rewriting your network layer.
DPDK: Portable Performance
DPDK (Data Plane Development Kit) provides a portable kernel-bypass framework that works across many NIC vendors through a poll-mode driver (PMD) abstraction layer. The Intel i40e, Mellanox mlx5, and Solarflare sfc PMDs are all used in production trading environments.
The abstraction cost is real but measured. DPDK’s rte_eth_rx_burst() does more work than ef_vi’s ef_eventq_poll(): it dispatches to the PMD through a function pointer table, handles multiple packet descriptors per call, and refills the RX ring with rte_mbuf structures allocated from a memory pool. The receive path is still zero-copy - the NIC DMAs straight into each mbuf's data buffer - but every packet carries mbuf metadata that the PMD must populate and that the application must later return to the pool.
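#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mbuf_dyn.h>

#define RX_RING_SIZE   1024
#define TX_RING_SIZE   1024
#define MBUF_POOL_SIZE 8192
#define BURST_SIZE     32
#define ETH_HDR_LEN    14

/* DPDK initialization - more involved than ef_vi (error handling omitted) */
rte_eal_init(argc, argv);
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
    "MBUF_POOL", MBUF_POOL_SIZE,
    /* cache_size */ 256,
    /* private_data_size */ 0,
    RTE_MBUF_DEFAULT_BUF_SIZE,
    rte_socket_id()  /* NUMA-local */
);

/* Configure one RX and one TX queue on port 0; every configured queue must be set up */
struct rte_eth_conf port_conf = {0};
rte_eth_dev_configure(/*port=*/0, /*nb_rx_queues=*/1, /*nb_tx_queues=*/1, &port_conf);
rte_eth_rx_queue_setup(0, /*queue=*/0, RX_RING_SIZE, rte_eth_dev_socket_id(0),
                       NULL, mbuf_pool);
rte_eth_tx_queue_setup(0, /*queue=*/0, TX_RING_SIZE, rte_eth_dev_socket_id(0), NULL);
rte_eth_dev_start(0);

/* Since DPDK 20.11 the RX timestamp is a dynamic mbuf field. It is filled in
   only if the PMD supports the timestamp offload and it has been enabled. */
int ts_off;
uint64_t ts_flag;
rte_mbuf_dyn_rx_timestamp_register(&ts_off, &ts_flag);

/* Hot path - DPDK receive */
struct rte_mbuf *pkts[BURST_SIZE];
while (1) {
    uint16_t nb_rx = rte_eth_rx_burst(/*port=*/0, /*queue=*/0,
                                      pkts, BURST_SIZE);
    for (uint16_t i = 0; i < nb_rx; i++) {
        struct rte_mbuf *m = pkts[i];
        const uint8_t *data = rte_pktmbuf_mtod(m, const uint8_t *) + ETH_HDR_LEN;
        uint32_t len = m->data_len - ETH_HDR_LEN;
        /* 0 means the driver did not timestamp this packet */
        uint64_t hw_ts = (m->ol_flags & ts_flag)
            ? *RTE_MBUF_DYNFIELD(m, ts_off, rte_mbuf_timestamp_t *) : 0;
        process_packet(data, len, hw_ts);
        rte_pktmbuf_free(m);  /* return the mbuf to the pool */
    }
}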
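The rte_pktmbuf_free() call is important - unlike ef_vi, where you manually repost the buffer to the ring, DPDK returns reference-counted mbufs to a pool and the PMD refills the ring from that pool. The free is cheap (for an unshared mbuf, a reference-count check and a push onto the per-lcore mempool cache), but the mbuf round trip through the pool is additional work ef_vi does not do.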
DPDK overhead profile (Intel i40e PMD):
Benchmark                             Measured
─────────────────────────────────────────────
rte_eth_rx_burst() no packets         5-8 ns
rte_eth_rx_burst() with 1 packet      22-35 ns
Full RX path (burst → data ptr)       28-45 ns
rte_pktmbuf_free()                    3-6 ns
Hardware timestamp availability       Depends on NIC/driver
DPDK also requires EAL initialization (rte_eal_init()), which maps huge pages, initializes NUMA-local memory pools, and sets up the PMD. This initialization is more complex than ef_vi but only runs once at startup.
AF_XDP: In-Kernel, No Root Required
AF_XDP is a Linux socket type (added in kernel 4.18, significantly improved in 5.x) that provides kernel bypass without requiring a dedicated userspace driver or root access for all operations. It uses eBPF programs to redirect specific packet flows from the kernel’s XDP layer directly into UMEM (userspace memory) that you pre-register.
The key advantage is that AF_XDP works in environments where DPDK and ef_vi cannot: cloud VMs (where the hypervisor’s VF driver may not have a DPDK PMD), environments with strict security policies (no privileged DMA ring manipulation), and systems where the NIC is shared between trading and other workloads (AF_XDP operates on a subset of queues, leaving others for the OS).
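#include <linux/if_link.h>   /* XDP_FLAGS_DRV_MODE */
#include <linux/if_xdp.h>    /* XDP_ZEROCOPY */
#include <poll.h>
#include <xdp/xsk.h>         /* libxdp; older libbpf releases shipped this as <bpf/xsk.h> */

#define FRAME_SIZE  XSK_UMEM__DEFAULT_FRAME_SIZE
#define NUM_FRAMES  4096
#define BATCH_SIZE  64
#define ETH_HDR_LEN 14

/* AF_XDP socket setup (error handling omitted) */
struct xsk_socket_config xsk_cfg = {
    .rx_size = NUM_FRAMES / 2,
    .tx_size = NUM_FRAMES / 2,
    .libbpf_flags = 0,
    .xdp_flags = XDP_FLAGS_DRV_MODE,  /* XDP_FLAGS_SKB_MODE for non-native */
    .bind_flags = XDP_ZEROCOPY,       /* zero-copy mode */
};

/* Allocate UMEM in hugepages on correct NUMA node.
   mmap_hugepage_numa() is an application helper, not a library call. */
void *umem_area = mmap_hugepage_numa(FRAME_SIZE * NUM_FRAMES, /*node=*/0);
struct xsk_umem *umem;
struct xsk_ring_prod fill_ring;
struct xsk_ring_cons comp_ring;
xsk_umem__create(&umem, umem_area, FRAME_SIZE * NUM_FRAMES,
                 &fill_ring, &comp_ring, NULL);
struct xsk_socket *xsk;
struct xsk_ring_cons rx_ring;
xsk_socket__create(&xsk, "eth0", /*queue=*/0, umem, &rx_ring, NULL, &xsk_cfg);

/* Hand frames to the kernel: with an empty fill ring nothing can be received */
uint32_t idx_fq;
uint32_t n_fill = xsk_ring_prod__reserve(&fill_ring, XSK_RING_PROD__DEFAULT_NUM_DESCS, &idx_fq);
for (uint32_t i = 0; i < n_fill; i++)
    *xsk_ring_prod__fill_addr(&fill_ring, idx_fq + i) = (uint64_t)i * FRAME_SIZE;
xsk_ring_prod__submit(&fill_ring, n_fill);

/* Hot path */
struct pollfd fds = { .fd = xsk_socket__fd(xsk), .events = POLLIN };
while (1) {
    /* poll() with timeout=0 is non-blocking - pure poll mode (still one syscall per pass) */
    int ret = poll(&fds, 1, 0);
    if (ret <= 0) continue;
    uint32_t idx_rx;
    uint32_t rcvd = xsk_ring_cons__peek(&rx_ring, BATCH_SIZE, &idx_rx);
    if (!rcvd) continue;
    /* Reserve fill-ring slots so every received frame can be recycled */
    while (xsk_ring_prod__reserve(&fill_ring, rcvd, &idx_fq) != rcvd)
        ;  /* retry until the kernel has consumed enough entries */
    for (uint32_t i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&rx_ring, idx_rx + i);
        uint8_t *pkt = xsk_umem__get_data(umem_area, desc->addr);
        process_packet(pkt + ETH_HDR_LEN, desc->len - ETH_HDR_LEN, 0 /* no hw ts */);
        /* Refill: in the default aligned mode the kernel rounds this down to the frame start */
        *xsk_ring_prod__fill_addr(&fill_ring, idx_fq + i) = desc->addr;
    }
    xsk_ring_cons__release(&rx_ring, rcvd);
    xsk_ring_prod__submit(&fill_ring, rcvd);
}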
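The refill step at the end of the loop above is easy to overlook, so here is a small model of why it matters. Each UMEM frame is either available to the kernel (on the fill side) or held by the application; a packet that arrives when no frame is available is dropped. The frame_model type and its functions are hypothetical and exist only to illustrate the accounting. They are not part of libxdp, and the real rings track addresses rather than counts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy ownership model for a pool of receive frames (illustrative only). */
typedef struct {
    uint32_t fill_avail;  /* frames the kernel/NIC may write into      */
    uint32_t app_owned;   /* frames delivered and not yet handed back  */
    uint64_t delivered;   /* packets that found a free frame           */
    uint64_t dropped;     /* packets that arrived with no frame free   */
} frame_model;

void model_init(frame_model *m, uint32_t num_frames)
{
    m->fill_avail = num_frames;
    m->app_owned  = 0;
    m->delivered  = 0;
    m->dropped    = 0;
}

/* One packet arrives. Returns true if it was delivered, false if dropped. */
bool model_packet_arrives(frame_model *m)
{
    if (m->fill_avail == 0) {
        m->dropped++;
        return false;
    }
    m->fill_avail--;
    m->app_owned++;
    m->delivered++;
    return true;
}

/* Application hands back up to n processed frames; returns how many moved. */
uint32_t model_refill(frame_model *m, uint32_t n)
{
    if (n > m->app_owned) n = m->app_owned;
    m->app_owned  -= n;
    m->fill_avail += n;
    return n;
}
```

The same accounting applies to the ef_vi receive ring: a buffer that is never reposted is a slot the NIC can no longer use.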
AF_XDP overhead profile:
Benchmark                             Measured
─────────────────────────────────────────────
xsk_ring_cons__peek() no packets      8-15 ns
Full RX path (zero-copy mode)         45-80 ns
Full RX path (copy mode/SKB)          80-150 ns
Hardware timestamp availability       Limited (XDP RX metadata on recent kernels, driver-dependent)
The AF_XDP path has more overhead than ef_vi or DPDK because it still goes through the XDP eBPF program (even a minimal pass-through program has overhead), and the kernel/userspace boundary is thicker than a raw DMA ring. In zero-copy mode with native XDP driver support (not SKB mode), the gap narrows significantly.
Decision Matrix
Criterion                ef_vi            DPDK              AF_XDP
──────────────────────────────────────────────────────────────────────
Median RX overhead       10-20ns          25-50ns           50-80ns
P99 overhead stability   Excellent        Good              Good
Hardware requirement     Solarflare       Many vendors      Any (Linux 4.18+)
Portability              None             High              High
Hardware timestamps      Sub-ns PTP       Vendor-varying    Limited (XDP RX metadata, driver-dependent)
Root requirement         Yes (DMA)        Yes (DMA/huge)    Reduced (CAP_NET_ADMIN)
Works in cloud VMs       No               Limited (SR-IOV)  Yes
Mixed workload sharing   Yes (filters)    No                Yes (per-queue)
Maintenance burden       Low (vendor)     Medium (PMD)      Low (kernel)
Community / docs         Thin             Excellent         Growing
Production maturity      15+ years HFT    10+ years HFT     3-4 years
Choose ef_vi if:
- You are on Solarflare hardware (SFN7000, SFN8000, XtremeScale)
- Every nanosecond of median latency matters (co-location, fastest-gun market-making)
- You have a dedicated trading NIC not shared with anything else
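- You need PTP-synchronized hardware timestamps for latency attribution and regulatory reporting (MiFID II RTS 25; its requirements are on the microsecond scale, so sub-nanosecond resolution is headroom rather than a mandate)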
Choose DPDK if:
- You need hardware portability (multiple NIC vendors, mixed environments)
- You want a mature ecosystem (Intel DPDK, Mellanox, production deployments documented everywhere)
- Median latency in the 25-50ns range is acceptable (most market-making is fine here)
- You are not on Solarflare hardware
Choose AF_XDP if:
- You are in a cloud or VM environment where DMA ring access is restricted
- Security or compliance policy prohibits unbinding the NIC from its kernel driver or mapping device rings directly into userspace
- You need to share a NIC queue between trading and non-trading workloads
- You want to use the full Linux firewall/filtering stack on some queues while bypassing it on others
- The 50-80ns overhead is acceptable (intraday stat-arb, slower execution strategies)
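The three lists above can be read as an ordered set of checks: hard environmental constraints first, then hardware, then the latency budget. The sketch below is one possible encoding of that ordering, not a substitute for benchmarking. The deployment struct, its field names, and the choice to use the upper bounds from the decision matrix as budgets are all assumptions made for illustration.

```c
#include <stdbool.h>

typedef enum { BYPASS_EF_VI, BYPASS_DPDK, BYPASS_AF_XDP } bypass_choice;

/* Hypothetical description of a deployment. */
typedef struct {
    bool solarflare_nic;            /* running on Solarflare hardware        */
    bool restricted_device_access;  /* cloud/VM or policy forbids DMA rings  */
} deployment;

/* Constraints that rule options out are checked before preferences. */
bypass_choice choose_bypass(const deployment *d)
{
    if (d->restricted_device_access) return BYPASS_AF_XDP;
    if (d->solarflare_nic)           return BYPASS_EF_VI;
    return BYPASS_DPDK;
}

/* Upper end of the median RX overhead range from the decision matrix. */
unsigned worst_median_rx_ns(bypass_choice c)
{
    switch (c) {
    case BYPASS_EF_VI:  return 20;
    case BYPASS_DPDK:   return 50;
    case BYPASS_AF_XDP: return 80;
    }
    return 80;  /* unreachable for valid enum values */
}

/* Does the chosen approach fit a given median RX overhead budget? */
bool fits_budget(bypass_choice c, unsigned budget_ns)
{
    return worst_median_rx_ns(c) <= budget_ns;
}
```

If choose_bypass() returns an option that fails fits_budget(), the constraint and the latency target conflict, and that conflict is what the benchmark needs to settle.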
Hybrid Architecture: ef_vi for the Hot Path, Kernel Sockets for Management
At Gemini, we eventually settled on a hybrid: ef_vi for the latency-critical paths (market data receive, order transmit, and timestamped order acknowledgements), with the management plane (control messages, and acknowledgements that did not need sub-20ns processing) going through a regular kernel socket. This gave us the best latency on the hot path without the operational complexity of managing all traffic through ef_vi.
Traffic class      Interface        Latency
─────────────────────────────────────────────────
Market data RX     ef_vi queue 0    15-22ns
Order TX           ef_vi queue 1    18-28ns
Order ACK RX       ef_vi queue 2    15-22ns (hardware timestamped)
Management RX      kernel socket    1-3µs
Heartbeat TX       kernel socket    < 1ms
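The multi-queue NIC lets you dedicate specific queues to ef_vi while leaving other queues for the kernel stack. Steer each traffic class to its designated queue: an ef_vi virtual interface receives only the flows matched by filters installed with ef_vi_filter_add(), and kernel-owned queues can be pinned with ethtool -N ntuple rules.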
How This Breaks in Production
1. PMD not loaded at DPDK EAL init. If the NIC is still bound to its kernel driver when rte_eal_init() runs, the PMD does not probe it, rte_eth_dev_count_avail() returns 0, and there is nothing to configure. (Bifurcated PMDs such as mlx5 are the exception: they keep the kernel driver bound.) The standard flow: modprobe vfio-pci (or uio_pci_generic), then dpdk-devbind.py --bind vfio-pci 0000:41:00.0. If the IOMMU is not enabled in the BIOS and on the kernel command line (common on older servers), binding to vfio-pci fails; the options are vfio's no-IOMMU mode (the enable_unsafe_noiommu_mode module parameter) or falling back to uio_pci_generic, neither of which gives you IOMMU protection.
2. ef_vi buffer exhaustion under sustained load. The ef_vi RX ring has a finite number of slots. If your application cannot process and repost buffers as fast as packets arrive (e.g., the strategy is blocked waiting for a mutex), the ring fills and packets are dropped - silently, with only a counter increment in ef_vi_stats to show for it. As with a full socket receive buffer on the kernel path, nothing in the receive call itself tells you that packets were lost. Monitor ef_vi_stats.rx_ev_lost and alert on any nonzero value.
3. AF_XDP XDP program not loading in native mode. AF_XDP’s zero-copy mode requires XDP_FLAGS_DRV_MODE, which requires the NIC driver to have native XDP support. Not all drivers support native XDP; the fallback is XDP_FLAGS_SKB_MODE, which works but routes through the sk_buff path and removes the zero-copy benefit. Check support: ethtool -i eth0 shows the driver, then check the kernel source for an XDP handler in that driver (ndo_bpf, plus ndo_xsk_wakeup for zero-copy). Mellanox mlx5, Intel i40e/ice, and recent bnxt_en all support native XDP. Many virtual NIC drivers do not.
4. DPDK huge page allocation failing at scale. DPDK pre-allocates huge pages at startup for all its memory pools. On a system where huge pages were not allocated at boot (or were exhausted by another process), rte_eal_init() will fail. The error message is often cryptic: EAL: No free hugepages reported in hugepages-2048kB. Always allocate huge pages at boot, not on-demand, and verify the count before starting DPDK.
5. ef_vi hardware clock drift affecting timestamps. The hardware timestamps from ef_vi are derived from the NIC’s onboard clock, which must be synchronized to PTP to be meaningful for regulatory purposes. If the PTP sync daemon (ptp4l) stops running or the PTP master becomes unreachable, the NIC clock drifts - but ef_vi continues stamping packets with the drifting clock, with no error indication. Detection: monitor the offset reported by ptp4l in its logs; alert if it exceeds 1µs. The timestamp will continue to be self-consistent (good for relative latency measurement) but incorrect for absolute time (bad for regulatory reporting).
6. Interrupt mode fallback on busy systems. Both DPDK and AF_XDP have mechanisms that fall back from pure poll mode to interrupt-driven mode when the CPU is otherwise busy. In DPDK, this is the rte_power infrastructure; in AF_XDP, simply not passing timeout=0 to poll() causes the kernel to sleep and wake on interrupts. In both cases, the fallback can be triggered inadvertently by a misconfigured scheduler or a burst of kernel activity. Always verify your application is in pure poll mode by checking that no sleeps or blocking syscalls occur on the trading thread.
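Two of the failure modes above (items 4 and 5) lend themselves to cheap automated checks: counting free huge pages before starting DPDK, and watching the PTP offset. The sketch below parses text in the format of /proc/meminfo and of a classic ptp4l "master offset" log line. The function names are hypothetical, and the exact ptp4l wording is an assumption that should be checked against the linuxptp version in use. Reading the file or log stream is left to the caller so the parsing stays testable.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Returns the HugePages_Free value from /proc/meminfo-style text,
 * or -1 if the field is not present. */
long parse_hugepages_free(const char *meminfo)
{
    const char *key = "HugePages_Free:";
    const char *p = strstr(meminfo, key);
    if (p == NULL) return -1;
    return strtol(p + strlen(key), NULL, 10);
}

/* Extracts the signed offset (ns) from a line such as
 *   "ptp4l[5374.735]: master offset   -5 s2 freq  -3397 path delay  494"
 * Returns false if the line does not carry an offset. */
bool parse_ptp4l_offset_ns(const char *line, long *offset_ns)
{
    const char *key = "master offset";
    const char *p = strstr(line, key);
    if (p == NULL) return false;
    p += strlen(key);
    char *end;
    long value = strtol(p, &end, 10);
    if (end == p) return false;   /* no digits followed the key */
    *offset_ns = value;
    return true;
}

/* True when the absolute offset is beyond the alert threshold. */
bool ptp_offset_exceeds(long offset_ns, long limit_ns)
{
    return labs(offset_ns) > limit_ns;
}
```

With the 1µs threshold from item 5, the alert condition is ptp_offset_exceeds(offset, 1000). A line that stops parsing at all is worth alerting on too, since a stopped ptp4l produces no offset lines.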
Related reading: The Anatomy of a Sub-50µs Trade shows where kernel bypass fits in the end-to-end latency budget. Interrupt Affinity, MSI-X, and the Multi-Queue NIC covers the NIC queue configuration that works alongside all three bypass approaches. PTP in Production with Solarflare covers the hardware timestamping setup for regulatory compliance. CPU Pinning, isolcpus, and nohz_full covers the CPU-level isolation that makes the poll loop deterministic.