AWS for Trading: Instance Families, Placement Groups, and the Networking Choices That Save Microseconds

The alert came in at 2:47 AM UTC. P99 order acknowledgement latency had jumped from 94µs to 680µs. Not a crash - everything was processing correctly. Just a 7x latency spike that would cost us several basis points per trade until it was resolved.

The root cause, discovered at 4:15 AM after correlating CloudWatch metrics with our internal timestamping, was mundane: a routine AWS maintenance event had migrated one of our strategy nodes to a different physical host. That new host was not in the same cluster placement group. The inter-instance traffic that had been traversing a sub-5µs intra-rack path was now routing through the standard VPC fabric at 300-500µs.

We fixed it by forcing instance replacement back into the placement group. But the incident crystallized exactly why every infrastructure decision for a trading system on AWS - instance family, bare metal vs virtualized, placement group type, network driver configuration - needs to be made with measured numbers, not vendor marketing claims.

This post is the complete decision framework we use at ZeroCopy, with the numbers that drove each choice.

The Instance Family Decision

AWS instance families are not interchangeable. The latency-relevant choices for trading workloads come down to three families, each with distinct tradeoffs.

c6i vs c7i vs c7gn

c6i (Intel Ice Lake, current generation) is our primary choice. The c6i family runs on Intel Ice Lake Xeon processors with all-core turbo frequencies of 3.5 GHz. For trading workloads, what matters is single-threaded performance - your hot path strategy thread runs on one core, and you want that core to sustain its frequency without thermal throttling neighboring cores. Ice Lake’s per-core turbo holds well under the typical trading workload pattern (high single-core utilization, moderate total package load).

The c6i family also has the most mature driver stack on AWS. ENA (Elastic Network Adapter) support is complete, the latest Nitro card generation is present, and the instance type has been deployed long enough that you are unlikely to get scheduled on problematic hardware revisions. Network bandwidth tops out at 50 Gbps for the larger instances.

c7i (Intel Sapphire Rapids) is the natural successor. Sapphire Rapids adds AMX (Advanced Matrix Extensions) instructions, which matter if you have ML inference in your hot path, and slightly better single-core IPC. (AVX-512 VNNI is already present on Ice Lake, so it is not a differentiator here.) The practical latency improvement over c6i is 5-15% for compute-bound paths and essentially zero for I/O-bound paths. The main reason I have not moved ZeroCopy’s primary nodes to c7i yet is fleet maturity - c7i launched more recently and the hardware revision pool is shallower. I have seen higher variance in measured latency on early-access c7i instances than on equivalent c6i, possibly due to fewer hardware revision cycles.

c7gn (Graviton3E, network-optimized) is the most interesting option for specific workloads and deserves careful analysis. The “n” suffix indicates these instances are built around a newer Nitro card generation with up to 200 Gbps of network bandwidth at the largest size - 4x the 50 Gbps ceiling of c6i. They also support ENA Express.

The tradeoff is the ISA. Graviton3E is Arm: AWS’s own silicon built on Arm Neoverse cores, not x86. If your trading system is written in Java, Python, or a managed language, this is fine - JIT compilers and runtimes produce excellent Arm code, and you might see better performance here than on Intel equivalents due to the memory bandwidth improvement. If you have hand-optimized x86-64 SIMD routines - which most serious HFT codebases do - you are looking at a rewrite or a measurable regression. At ZeroCopy our Rust engine uses some x86-64-specific SIMD intrinsics in the order serialization path. Until those are ported to ARM NEON equivalents, c7gn is not on the table for the hot path.

For market data preprocessing (which is bandwidth-intensive but not latency-critical in the same way), c7gn is worth benchmarking. The 200 Gbps network bandwidth means you can handle a much larger feed multiplexing workload without network becoming the bottleneck.

The Decision Matrix

Workload	Recommended Family	Reasoning
Strategy execution (hot path)	c6i.metal or c7i.metal	Mature driver stack, best latency determinism, x86 SIMD support
Market data feed handlers	c7gn.16xlarge	200 Gbps (available at the 16xlarge size) with ENA Express handles multiple exchange feeds
Risk engine (stateful, moderate latency)	c6i.4xlarge or c6i.8xlarge	No need for bare metal; 50 Gbps sufficient
Order management (control plane)	m6i.2xlarge	General-purpose; twice the memory per vCPU of c6i for position state; cost-efficient
Monitoring and observability	t3.xlarge or c6g.large	Latency-insensitive; Graviton for cost
Admin and operational tooling	t3.small	Latency-irrelevant; minimize cost

Bare Metal vs Virtualized: The Hypervisor Tax

The c6i.metal vs c6i.4xlarge decision is one of the most misunderstood in trading infrastructure. The marketing claim is that AWS’s Nitro hypervisor has “near-bare-metal performance.” That claim is largely correct for throughput. It is meaningfully incorrect for latency determinism.

The issue is not average latency - on a lightly loaded c6i.4xlarge, your average latency is within 2-3µs of bare metal. The issue is worst-case latency, specifically the scheduling jitter introduced by the hypervisor.

On a virtualized instance, the Nitro hypervisor must occasionally preempt your VM to perform housekeeping tasks: updating NVRAM state, handling the Nitro card’s management plane, processing inter-VM communication. These preemptions are brief - typically 1-10µs - and infrequent. But “infrequent” in trading terms means they happen multiple times per second under real market conditions.

Our measurement approach at ZeroCopy uses HdrHistogram to capture latency at microsecond resolution across 10-million-sample windows. The results:

c6i.4xlarge: P50: 12µs, P99: 48µs, P99.9: 890µs, P99.99: 3.2ms
c6i.metal: P50: 11µs, P99: 31µs, P99.9: 45µs, P99.99: 68µs

The P50 and P99 difference is meaningful but not dramatic. The P99.9 and P99.99 difference is the reason to pay the bare metal premium. On c6i.metal, the worst-case latency stays bounded. On c6i.4xlarge, the tail goes into the milliseconds.
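To see why the tail rather than the median drives this decision, here is a minimal sketch that computes nearest-rank percentiles over a synthetic sample set. The data is an assumption for illustration only (a 12µs baseline with one 900µs stall in every 500 samples), not our production measurements, and the helper name `percentile` is ours.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Synthetic data (illustrative assumption): 12µs baseline,
# with one 900µs stall in every 500 samples (0.2% of the total).
samples_us = [900 if i % 500 == 0 else 12 for i in range(100_000)]

for pct in (50, 99, 99.9, 99.99):
    print(f"P{pct}: {percentile(samples_us, pct)}µs")
```

With stalls affecting only 0.2% of samples, P50 and P99 both stay at 12µs while P99.9 and P99.99 land on the 900µs stall. A dashboard that stops at P99 would not show the problem at all.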

The cost premium for c6i.metal vs c6i.4xlarge is approximately 8x at on-demand pricing: c6i.metal is the full 128-vCPU physical host, c6i.4xlarge is a 16-vCPU slice of it, and EC2 pricing scales roughly linearly with size. With reserved pricing, the break-even analysis depends on your trading frequency and the cost of tail latency. For any strategy that trades more than a few hundred times per day, the bare metal premium pays for itself through reduced adverse selection.
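The break-even logic can be sketched as simple arithmetic. Every input below is a hypothetical placeholder: the hourly prices are round numbers rather than AWS list prices, and the per-trade cost of tail latency has to come from your own fill analysis.

```python
def break_even_trades_per_day(metal_hourly, virtual_hourly, tail_cost_per_trade):
    """Trades per day at which the bare metal premium equals the tail-latency savings.

    tail_cost_per_trade is the expected adverse-selection cost, per trade,
    that bare metal removes. It must be estimated from your own fills.
    """
    if tail_cost_per_trade <= 0:
        raise ValueError("tail_cost_per_trade must be positive")
    daily_premium = (metal_hourly - virtual_hourly) * 24
    return daily_premium / tail_cost_per_trade

# Hypothetical inputs: $5.00/h vs $1.00/h, and $0.50 of avoidable cost per trade.
print(break_even_trades_per_day(5.00, 1.00, 0.50))  # 192.0
```

Under those assumed inputs the premium is $96 per day, so the instance pays for itself at 192 trades per day. The result is only as good as the per-trade cost estimate, which is the hard part.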

One practical note: bare metal instances have longer launch times (5-8 minutes vs 30-90 seconds for virtualized). This matters for your failover design. If your recovery procedure involves spinning up a new instance, you either need to keep a warm standby running or accept a longer recovery window.

Cluster Placement Groups: The Intra-Rack Guarantee

AWS placement groups are network topology hints that fundamentally change the latency characteristics between instances.

Cluster placement groups place instances on the same physical rack, or as physically close to each other as possible within a single Availability Zone. The result is that traffic between instances in the same cluster group traverses the fewest possible network hops - typically a single top-of-rack switch. Measured RTT between instances in a cluster group: 2-5µs. Without a cluster group, inter-instance RTT within the same AZ is typically 50-300µs depending on physical placement.

This is not a guarantee from AWS. The documentation says “low latency” and “high throughput” without giving precise numbers. In practice, properly configured cluster placement groups consistently deliver sub-5µs cross-instance RTT, but you should measure your specific configuration because there are exceptions (instances near the cluster size limit may be placed less optimally).
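A minimal way to measure your own configuration is a UDP echo round trip. The sketch below is an assumption-laden stand-in for a proper timestamping harness: the function names are ours, the demo runs over loopback, and the Python interpreter adds tens of microseconds of its own overhead. Treat it as a way to compare two placements against each other, not as an absolute measurement.

```python
import socket
import threading
import time

def run_echo_server(sock, count):
    """Echo `count` datagrams back to whoever sent them."""
    for _ in range(count):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)

def measure_rtt_us(host, port, count=1000, payload=b"x" * 64):
    """Send `count` UDP pings to an echo server and return the RTTs in microseconds."""
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(1.0)
        for _ in range(count):
            start = time.perf_counter_ns()
            client.sendto(payload, (host, port))
            client.recvfrom(2048)
            rtts.append((time.perf_counter_ns() - start) / 1000.0)
    return rtts

if __name__ == "__main__":
    # Loopback demo. In a real test the echo server runs on the peer instance.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    port = server.getsockname()[1]
    threading.Thread(target=run_echo_server, args=(server, 1000), daemon=True).start()
    rtts = sorted(measure_rtt_us("127.0.0.1", port))
    print(f"P50 {rtts[len(rtts) // 2]:.1f}µs  P99 {rtts[int(len(rtts) * 0.99)]:.1f}µs")
```

Run the same script between two instances inside the cluster group and between two instances outside it, then compare the distributions rather than the individual numbers.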

resource "aws_placement_group" "trading_hot_path" {
  name     = "trading-hot-path"
  strategy = "cluster"

  # spread_level only applies to spread placement groups
  # For cluster: the placement is automatic within same rack
}

resource "aws_instance" "strategy_node" {
  ami                    = data.aws_ami.trading_base.id
  instance_type          = "c6i.metal"
  placement_group        = aws_placement_group.trading_hot_path.id
  availability_zone      = var.primary_az  # Must pin to single AZ

  # Required: tenancy must be default (dedicated conflicts with placement groups)
  tenancy = "default"

  network_interface {
    network_interface_id = aws_network_interface.strategy_primary.id
    device_index         = 0
  }
}

Spread placement groups are the opposite approach: they place instances on distinct physical hardware, maximizing resilience. Each instance in a spread group is guaranteed to be on a different physical host with independent power and network. The tradeoff is that cross-instance latency reverts to the standard VPC fabric (50-300µs).

Spread groups are right for your control plane: if your risk engine, order management system, and operator interface are on different physical hosts, a single hardware failure cannot take down all three. The control plane does not need sub-5µs cross-instance RTT, so the latency tradeoff is acceptable.

Partition placement groups are designed for distributed storage systems like Cassandra and HDFS where you want rack-level fault isolation with multiple instances per partition. Rarely relevant for trading infrastructure.

Placement Group Limitations

This is where practitioners often get surprised:

Cluster placement groups do not support every instance type. Burstable instances are excluded, so you cannot put a T3 instance in a cluster placement group; check the supported list before assuming an older-generation type qualifies.
All instances in a cluster placement group must be in the same Availability Zone. This is a hard constraint - you cannot build a cross-AZ cluster placement group.
There are capacity limits per group. AWS recommends launching all instances in a cluster group at once in a single request; launching them individually over time may fail with an InsufficientInstanceCapacity error once the rack fills up.
Moving an existing instance into a placement group requires stopping, modifying, and restarting it. This means a maintenance window for any instance you want to add to an existing cluster group.

The single-AZ requirement is the biggest operational constraint. Your primary trading cluster must exist entirely within one AZ. This is fine - you should architect for AZ-level failures at the disaster recovery layer, not the hot path. But it means your “multi-AZ” story for the trading hot path is really “multi-region active-passive” rather than “active-active across AZs.”
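Two of these constraints are cheap to check before a launch is attempted. The sketch below is a hypothetical pre-flight validator: the `PlannedInstance` type and the function name are ours, and the burstable family list is an assumption you should confirm against the current EC2 documentation.

```python
from dataclasses import dataclass

# Burstable families assumed to be ineligible for cluster placement groups.
BURSTABLE_FAMILIES = {"t2", "t3", "t3a", "t4g"}

@dataclass
class PlannedInstance:
    name: str
    instance_type: str      # e.g. "c6i.metal"
    availability_zone: str  # e.g. "us-east-1a"

def cluster_group_violations(instances):
    """Return human-readable problems that would block a cluster placement group launch."""
    problems = []
    zones = sorted({i.availability_zone for i in instances})
    if len(zones) > 1:
        problems.append(f"instances span multiple AZs: {', '.join(zones)}")
    for inst in instances:
        family = inst.instance_type.split(".")[0]
        if family in BURSTABLE_FAMILIES:
            problems.append(f"{inst.name}: burstable type {inst.instance_type} is not supported")
    return problems

plan = [
    PlannedInstance("strategy-1", "c6i.metal", "us-east-1a"),
    PlannedInstance("admin-1", "t3.small", "us-east-1b"),
]
for problem in cluster_group_violations(plan):
    print(problem)
```

This does not check capacity, which only AWS can answer at launch time, but it catches the two mistakes that are fully determined by the plan itself.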

ENA Express: The Sub-100µs Network Driver

Standard ENA (Elastic Network Adapter) is AWS’s default network driver for modern instances. It delivers solid throughput and reasonable latency - but “reasonable” for general-purpose workloads, not for trading.

ENA Express is AWS’s transport upgrade for inter-instance networking, built on the Scalable Reliable Datagram (SRD) protocol. When both endpoints are ENA Express-capable instance types in the same Availability Zone and both have the feature enabled, the Nitro cards carry the traffic over SRD, which spreads packets across multiple network paths and recovers from congestion quickly. The result is a measurable reduction in both average latency and variance, most visibly in the tail.

Our measurements:

Configuration	P50 RTT	P99 RTT	P99.9 RTT
Standard ENA, no placement group	85µs	420µs	1.8ms
Standard ENA, cluster placement group	12µs	48µs	95µs
ENA Express, cluster placement group	8µs	28µs	41µs
ENA Express, cluster placement group, jumbo frames	7µs	24µs	37µs

ENA Express requires explicit enablement - it is not on by default. You enable it per network interface:

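resource "aws_network_interface" "strategy_primary" {
  subnet_id = var.primary_subnet_id

  # No attachment block here: aws_instance.strategy_node already attaches this
  # interface at device_index 0 through its network_interface block. Declaring
  # the attachment on both sides creates a dependency cycle between the two resources.
}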

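# ENA Express can be enabled after interface creation with the AWS CLI or SDK
# (aws ec2 modify-network-interface-attribute with --ena-srd-specification).
# In Terraform, the aws provider (as of 5.x) exposes an ena_srd_specification block.
# Check which resources accept it in your provider version before relying on this.
# The block below continues the aws_instance.strategy_node shown earlier: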
resource "aws_instance" "strategy_node" {
  # ...

  ena_srd_specification {
    ena_srd_enabled = true

    ena_srd_udp_specification {
      ena_srd_udp_enabled = true  # Enable for UDP market data feeds
    }
  }
}

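ENA Express only works when both endpoints are ENA Express-capable instances in the same Availability Zone. A cluster placement group is not itself a requirement, although hot-path instances will be in one anyway. If your strategy node has ENA Express enabled but your market data handler does not, you get standard ENA performance. Both ends must be configured, and support varies by instance type and size, so check the current supported-instance list before sizing down.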

Jumbo Frames: Reducing Segmentation Overhead

AWS VPCs support jumbo frames with an MTU of 9001 bytes, up from the standard 1500. Within the VPC, jumbo frames are supported by default - the underlying Nitro card handles the larger frames natively. Traffic that exits to the internet (through an internet gateway or NAT) is limited to an MTU of 1500: larger packets are fragmented at the boundary, or dropped if the Don't Fragment flag is set.

For trading infrastructure, jumbo frames matter in two places:

Market data feeds: A typical equity market data message is 200-500 bytes. But market data arrives in bursts - a trading halt or volatility spike can generate 50-100 messages in a single millisecond. With standard MTU, that burst requires multiple TCP segments. With jumbo frames and batch-oriented protocols, you can fit more messages per segment, reducing the number of interrupt events on the receiving NIC.

Order book snapshots: Full order book state synchronization between strategy instances can involve multi-megabyte transfers. With jumbo frames, the segmentation overhead is dramatically reduced.

Jumbo frames do not reduce wire latency for single-packet messages. A 200-byte market data message takes the same time to transmit whether your MTU is 1500 or 9001. The benefit is reduced overhead for large or bursty transfers, and reduced CPU overhead from fewer segmentation events.
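The segmentation savings are easy to estimate. This sketch assumes IPv4 and TCP with no options (40 bytes of headers per segment), and the burst and snapshot sizes are illustrative figures in the range described above.

```python
import math

TCP_IPV4_HEADERS = 40  # 20-byte IPv4 header + 20-byte TCP header, no options

def segments_needed(total_bytes, mtu):
    """Number of TCP segments required to carry total_bytes at a given MTU."""
    mss = mtu - TCP_IPV4_HEADERS
    return math.ceil(total_bytes / mss)

burst = 100 * 400            # 100 messages of 400 bytes in one burst
snapshot = 2 * 1024 * 1024   # a 2 MiB order book snapshot (illustrative size)

for label, size in (("burst", burst), ("snapshot", snapshot)):
    print(label, segments_needed(size, 1500), "segments at MTU 1500,",
          segments_needed(size, 9001), "at MTU 9001")
```

Under these assumptions the burst drops from 28 segments to 5 and the snapshot from 1437 to 235, roughly a 6x reduction in per-segment work on the receiver. A single 200-byte message is one segment either way.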

Configure jumbo frames in your EC2 user data:

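# Set MTU on the primary interface (the device may be named ens5 or similar
# rather than eth0, depending on the AMI)
ip link set dev eth0 mtu 9001

# Make it persistent. This path applies to RHEL-style network-scripts
# (for example Amazon Linux 2); systemd-networkd and netplan setups differ.
echo 'MTU=9001' >> /etc/sysconfig/network-scripts/ifcfg-eth0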

# Verify
ip link show eth0 | grep mtu
# Expected: mtu 9001

The Complete Infrastructure Decision Tree

When I’m designing a new trading system on AWS, I work through this sequence:

1. Is this on the hot path? (strategy execution, order routing, market data consumption) If yes, go to step 2. If no, use standard instance types, no placement group, standard ENA.
2. What is the acceptable P99.9 latency? If under 100µs, use c6i.metal or c7i.metal in a cluster placement group with ENA Express. If 100µs-1ms is acceptable, use c6i.4xlarge in a cluster placement group with ENA Express.
3. What is your cross-instance communication pattern? If instances need to communicate on the hot path (e.g., strategy to order router), they must be in the same cluster placement group. If they communicate on the control plane only, standard placement is fine.
4. What is your failure tolerance? If a single-host failure cannot be tolerated even briefly, you need a warm standby (same placement group, running but idle) or accept a 5-8 minute recovery window for bare metal instance replacement.
5. What is your total instance count? Cluster placement groups have practical size limits (around 10-15 bare metal instances per group before you start seeing capacity errors). If your cluster needs to be larger, partition it across multiple placement groups and measure the cross-group latency impact.
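The first four questions in the sequence above can be encoded as a small function, which is handy for reviewing a fleet definition in bulk. This is a sketch only: the function and key names are ours, and it treats any P99.9 budget of 100µs or more as the virtualized case because the sequence does not define a separate branch beyond 1ms.

```python
def recommend(hot_path, p999_budget_us=None, talks_to_hot_path_peers=False,
              tolerates_recovery_window=True):
    """Walk the first four questions and return the resulting choices as a dict."""
    if not hot_path:
        # Question 1: off the hot path, defaults are fine.
        return {"instance": "standard", "placement_group": None, "ena_express": False}
    if p999_budget_us is None:
        raise ValueError("a hot-path workload needs a P99.9 budget in microseconds")
    # Question 2: the tail budget decides bare metal vs virtualized.
    instance = "c6i.metal or c7i.metal" if p999_budget_us < 100 else "c6i.4xlarge"
    return {
        "instance": instance,
        "placement_group": "cluster",
        "ena_express": True,
        # Question 3: hot-path peers must share the same cluster group.
        "share_group_with_peers": talks_to_hot_path_peers,
        # Question 4: no tolerance for a multi-minute rebuild means a warm standby.
        "warm_standby": not tolerates_recovery_window,
    }

print(recommend(hot_path=True, p999_budget_us=80, tolerates_recovery_window=False))
```

The instance-count question is left out deliberately, since the practical group size limit is something you discover by launching, not something a function can know.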

How This Breaks in Production

Placement group capacity errors at scale-out. If you launch your trading cluster in a placement group successfully and then try to add one more instance six months later, AWS may not have capacity on the same rack. The error code is InsufficientInstanceCapacity. The workaround is to launch all instances simultaneously at cluster creation, or to stop and restart the entire group when adding capacity. Neither is pleasant during a production incident.

ENA Express silently degrading to standard ENA. ENA Express requires both endpoints to support it. If you replace one instance in a cluster group with a new launch that has the wrong AMI (an older AMI without ENA Express support), you lose ENA Express performance on that connection without any alert. Add a startup check that validates the ENA Express status of all interfaces.
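A startup check along these lines could look like the sketch below. It assumes a dictionary shaped like the EC2 DescribeNetworkInterfaces response, with the ENA Express state reported under `Attachment.EnaSrdSpecification.EnaSrdEnabled`. Verify that shape against your SDK version before relying on it. The function name and sample interface IDs are hypothetical.

```python
def interfaces_without_ena_express(describe_response):
    """Return IDs of network interfaces whose attachment does not report ENA Express enabled."""
    missing = []
    for eni in describe_response.get("NetworkInterfaces", []):
        srd = eni.get("Attachment", {}).get("EnaSrdSpecification", {})
        if not srd.get("EnaSrdEnabled", False):
            missing.append(eni["NetworkInterfaceId"])
    return missing

# Hand-written sample in the assumed response shape, not real API output.
sample = {
    "NetworkInterfaces": [
        {"NetworkInterfaceId": "eni-aaa",
         "Attachment": {"EnaSrdSpecification": {"EnaSrdEnabled": True}}},
        {"NetworkInterfaceId": "eni-bbb", "Attachment": {}},
    ]
}
print(interfaces_without_ena_express(sample))  # ['eni-bbb']
```

This only confirms configuration on the interfaces you query. Because both endpoints must have the feature enabled, run the same check against every peer on the hot path and alert on any non-empty result.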

MTU mismatch causing silent packet loss. If your application sets jumbo frames but your network path has a router with standard MTU (this can happen with VPN overlays or direct connect configurations), you will see intermittent packet loss that looks like random network errors. Explicitly test MTU end-to-end before going live, not just within the VPC.
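One way to test the path end to end is a ping with the Don't Fragment flag set and a payload sized to fill the MTU exactly. The helper below only builds the command. It assumes Linux iputils `ping` (where `-M do` prohibits fragmentation) and IPv4, and the target address is a placeholder.

```python
def df_ping_command(host, mtu=9001, count=3):
    """Build a Linux iputils ping command that probes `mtu` with Don't Fragment set."""
    payload = mtu - 20 - 8  # subtract the IPv4 header (20 bytes) and ICMP header (8 bytes)
    return ["ping", "-M", "do", "-s", str(payload), "-c", str(count), host]

# 10.0.1.25 is a placeholder for the far end of the path under test.
print(" ".join(df_ping_command("10.0.1.25")))
# ping -M do -s 8973 -c 3 10.0.1.25
```

If any hop on the path has a smaller MTU, the probe fails outright instead of being silently fragmented, which is the behavior you want from a pre-launch check. Pass the list to `subprocess.run` to execute it from a deployment script.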

Cluster placement group and reserved instance mismatch. You can purchase reserved instances for a specific instance type and AZ, but reserved instances do not guarantee placement group membership. Your reserved capacity might launch successfully on a different physical host than your placement group requires. Always launch placement group instances at the same time and verify physical host affinity before purchasing reserved capacity for that configuration.

Hypervisor migration breaking placement group membership. As in the incident at the start of this post: AWS maintenance events can move instances to different physical hosts. When this happens, a virtualized instance in a cluster placement group may land on a host outside the original rack. Bare metal instances are immune to this - they cannot be migrated without explicit stop/start. This is another argument for bare metal on the trading hot path: the latency guarantee is stable across maintenance events.

c7gn ARM binary incompatibility. If you decide to move feed handlers to c7gn for the 200 Gbps network bandwidth and your deployment pipeline builds x86-64 binaries, the instances will start but your binaries will fail with Exec format error. Multi-arch builds via Docker buildx are the fix, but this is a silent failure mode if your container registry does not enforce manifest validation.
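A cheap guard is to read the target architecture out of the binary before it ships. The sketch below parses the ELF header directly. The function names are ours, and only the two machine codes relevant here are mapped.

```python
import struct

# e_machine values from the ELF specification.
ELF_MACHINES = {0x3E: "x86-64", 0xB7: "aarch64"}

def elf_architecture(header):
    """Return the architecture named in an ELF header, given at least its first 20 bytes."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF binary")
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown (0x{machine:x})")

def read_elf_architecture(path):
    """Read just enough of the file at `path` to identify its architecture."""
    with open(path, "rb") as f:
        return elf_architecture(f.read(20))
```

In a deployment pipeline, compare the result of `read_elf_architecture` on the built artifact against the architecture of the target instance family, and fail the deploy on a mismatch instead of discovering it at process start.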