Skip to content

Infrastructure

Multi-Region Trading Infrastructure: Failover, Active-Active, and the CAP Theorem Applied to PnL

Why active-active is almost never correct for trading. ZeroCopy's active-passive failover design, WAN latency budgets, and what 'multi-region' actually means for a real trading desk.

11 min
#multi-region #failover #trading-infrastructure #cap-theorem #disaster-recovery #aws

Every trading infrastructure conversation eventually reaches the question: “But what if your primary data center goes down?” The answer in most industries is active-active multi-region: run everywhere, route traffic to the nearest healthy region, continue automatically if one region fails. This approach works beautifully for stateless web services. For trading, it is almost certainly wrong, and deploying it naively will cost you more than the downtime it was supposed to prevent.

This post explains why, starting with the theorem that makes the answer clear.

The CAP Theorem Applied to a Live Trading Book

The CAP theorem states that a distributed system cannot simultaneously provide all three of:

  • Consistency: All nodes see the same data at the same time
  • Availability: Every request receives a response
  • Partition tolerance: The system continues operating despite network partitions

For trading, the critical question is what happens when your two regions cannot communicate - a network partition. You must choose what to sacrifice.

Option A: Sacrifice Availability (CP). When you cannot confirm the other region’s state, stop trading in both regions. You preserve consistency - both regions agree on positions - but you are unavailable until the partition resolves. You lose PnL during the partition window, but you never have contradictory state.

Option B: Sacrifice Consistency (AP). Both regions continue trading independently during the partition. You remain available and continue making PnL, but both regions are now building position state without knowledge of the other. When the partition heals, you have two incompatible position books that need reconciliation.

For a trading system, the consequences of inconsistency are asymmetric. If you sacrifice availability during a 2-minute partition, you lose 2 minutes of PnL - call it 0.1% of a good day. If you sacrifice consistency, both regions may have taken positions that together exceed your risk limits. Your hedge may be on the wrong side. When the partition heals, your reconciliation may require unwinding positions at unfavorable prices in a market that has already moved against you while you were trading blind.

The math almost never favors sacrificing consistency for trading systems. Take the availability sacrifice, limit your downtime exposure through fast failover, and preserve the integrity of your position book.

Why Active-Active Is Wrong for Trading

Active-active means two (or more) live trading instances simultaneously sending orders to the same exchanges. Even without a network partition, this creates problems:

Doubled position risk. If both regions run the same strategy and both see the same market signal, both will attempt to take the same trade. You now have 2x the intended position size with 2x the risk, but only 1x the alpha - the edge of the strategy was priced for a single unit of position, not two.

Order deduplication failure modes. The standard response is “use idempotency keys to prevent duplicate orders.” This works for idempotent operations (creating the same order twice = one order). It does not work for directional strategies where the intent is to take one unit of position - if both regions think they are the authoritative trading instance, both will generate unique orders with different keys that both succeed.

Hedging state divergence. If region A takes a long position in BTC and region B’s hedge for that position fails because of an exchange connectivity issue, region A has a naked long that region B does not know about. Your aggregate risk position is wrong.

Exchange rate limit exhaustion. Most exchanges enforce per-account order rate limits. If two regions are both actively trading, they share that rate limit. Under market volatility (when you most want to trade aggressively), hitting the rate limit has double the consequence because both regions are contributing to it.

The only scenario where active-active is correct for trading is when your two regions are trading completely orthogonal strategies on completely different accounts with separate risk limits - essentially running two separate trading desks that happen to share infrastructure. In that case, “active-active” is a misleading label; you are really running two independent active-passive systems that happen to be co-located.

Active-Passive with Deterministic Failover

The correct architecture is active-passive: one region actively trades, one region is warm (running, connected, but not trading). Failover is deterministic, automated, and designed around a clear contract of how long it takes and what state is lost.

ZeroCopy’s failover design:

Primary Region (us-east-1)            DR Region (ap-southeast-1)
+-----------------------------+       +-----------------------------+
|  Strategy Engine (ACTIVE)   |       |  Strategy Engine (STANDBY)  |
|  OMS (ACTIVE)               |       |  OMS (WARM, read-only)      |
|  Risk Engine (ACTIVE)       |       |  Risk Engine (STANDBY)      |
+-----------------------------+       +-----------------------------+
           |                                       |
           | NATS replication (position events)    |
           +---------------------------------------+
           |
           v
    +------+------+
    |  Exchange   |
    | Connections |
    +-------------+
        (only primary maintains live sessions)

The position state replication happens via NATS: every fill event and position update that the primary OMS processes is published to a NATS subject that the DR OMS subscribes to. The DR OMS maintains an in-memory position book that mirrors the primary, but writes nothing to exchanges.

The health check is explicit and fast:

class FailoverController:
    def __init__(
        self,
        primary_health_url: str,
        failover_threshold_ms: int = 500,  # 500ms without response = unhealthy
        confirmation_count: int = 3         # 3 consecutive failures before failover
    ):
        self.primary_health_url = primary_health_url
        self.failover_threshold_ms = failover_threshold_ms
        self.confirmation_count = confirmation_count
        self._failure_count = 0
        self._last_healthy = time.monotonic()

    async def check_primary_health(self) -> bool:
        try:
            async with asyncio.timeout(self.failover_threshold_ms / 1000):
                response = await self.http_client.get(
                    self.primary_health_url,
                    headers={"X-Health-Check": "failover-controller"}
                )
                if response.status_code == 200:
                    self._failure_count = 0
                    self._last_healthy = time.monotonic()
                    return True
        except (asyncio.TimeoutError, httpx.ConnectError):
            pass

        self._failure_count += 1

        if self._failure_count >= self.confirmation_count:
            await self.initiate_failover()
            return False

        return False

    async def initiate_failover(self):
        # 1. Verify primary is truly unreachable (not just slow)
        # Check from multiple vantage points if possible

        # 2. Promote DR to active
        await self.dr_oms.set_active(True)

        # 3. Register DR exchange sessions (new sessions - see below)
        await self.dr_strategy.connect_to_exchanges()

        # 4. Alert operators
        await self.alerting.send_critical(
            f"FAILOVER INITIATED: primary unreachable for "
            f"{self._failure_count * self.failover_threshold_ms}ms. "
            f"DR region now active."
        )

        # 5. Update DNS to route operator traffic to DR region
        await self.dns.update_operator_endpoint("dr")

The WAN Latency Budget

The physical distance between your primary and DR regions determines your minimum failover time and your position state replication lag. These numbers are real constraints, not configurable parameters.

us-east-1 (Virginia) to ap-southeast-1 (Singapore):

  • Round-trip latency: 220-250ms
  • Position state replication lag: one-way, so ~120ms
  • Time to establish new exchange sessions from DR: 2-5 seconds (TLS handshake + FIX/WebSocket session establishment)
  • Total failover time (health check + session establishment): 8-15 seconds

us-east-1 to eu-west-1 (Ireland):

  • Round-trip latency: 70-90ms
  • Position state replication lag: ~45ms
  • Time to establish new exchange sessions: 2-5 seconds
  • Total failover time: 5-10 seconds

us-east-1 to us-west-2 (Oregon):

  • Round-trip latency: 60-80ms
  • Position state replication lag: ~35ms
  • Time to establish exchange sessions: 2-5 seconds
  • Total failover time: 5-8 seconds

These numbers define the position state exposure window: the maximum number of events that DR might have missed between the last successful replication and the moment primary became unreachable. At 100 events/second with 120ms replication lag, your DR is at most 12 events behind primary at any given time. Over a 15-second failover window, you might miss 1,500 events. Your startup reconciliation must handle this.

The Exchange Session Transfer Problem

This is the part that gets overlooked until it causes an incident.

FIX protocol sessions (the standard for most institutional exchanges) and WebSocket sessions (common for crypto) are stateful. A session has:

  • A sequence number that increments with every message
  • A session ID known to the exchange
  • A heartbeat timeout (typically 30 seconds for FIX)

When your primary region goes down and your DR region tries to connect, the exchange sees a new connection request. It does not “transfer” the session from the primary - it creates a new one. The sequence number resets. Any orders sent by the primary that the exchange has processed but the primary has not received confirmation for are in an unknown state from the DR perspective.

The exchange session transfer problem means:

  1. DR cannot resume where primary left off - it must start fresh
  2. DR must query the exchange for open orders before sending any new ones
  3. The sequence number reset triggers exchange-side controls (some exchanges alert or throttle on unexpected session resets)

The startup reconciliation flow for DR promotion:

// DR startup reconciliation after failover
async fn reconcile_on_promotion(
    exchange_clients: &ExchangeClients,
    local_position_state: &PositionState,
) -> Result<ReconciliationReport, ReconciliationError> {
    let mut report = ReconciliationReport::new();

    for exchange in exchange_clients.iter() {
        // Step 1: Connect to exchange (new session, sequence starts at 1)
        exchange.connect().await?;

        // Step 2: Query exchange for all open orders
        let exchange_open_orders = exchange.fetch_open_orders().await?;

        // Step 3: Query exchange for recent fills (last 5 minutes)
        let exchange_recent_fills = exchange.fetch_recent_fills(
            Duration::from_secs(300)
        ).await?;

        // Step 4: Compare against our local state (built from NATS replication)
        let local_open_orders = local_position_state.open_orders_for_exchange(exchange.name());

        for exchange_order in &exchange_open_orders {
            if !local_open_orders.contains_key(&exchange_order.client_order_id) {
                // Exchange has an open order we do not know about
                // This is the window between last replication and primary failure
                report.add_discrepancy(Discrepancy::UnknownOpenOrder {
                    exchange: exchange.name().to_string(),
                    order: exchange_order.clone(),
                });

                // Cancel it - we cannot safely continue trading with ghost orders
                exchange.cancel_order(&exchange_order.exchange_order_id).await?;
            }
        }

        // Step 5: Reconcile position from fills
        let reconciled_position = compute_position_from_fills(&exchange_recent_fills)?;
        report.set_reconciled_position(exchange.name(), reconciled_position);
    }

    Ok(report)
}

What “Multi-Region” Actually Means in Practice

When I say ZeroCopy operates across AWS and DigitalOcean with multi-region architecture, here is what that means concretely:

Trading hot path: Active-passive. Primary on AWS us-east-1. DR on DigitalOcean NYC3 (different provider for infrastructure independence). Position state replicated via NATS. Failover time: ~10 seconds. Manual failover confirmation for planned maintenance; automatic failover on health check failure.

Control plane and observability: Genuinely multi-region. ArgoCD, Prometheus, Grafana, and the Command Center backend run across multiple regions with no single point of failure. None of these have the position-state consistency requirements of the trading hot path. An operator dashboard being unavailable for 30 seconds is annoying. Position state being inconsistent for 30 seconds is expensive.

NATS cluster: Three nodes across three AZs within a single region (for low replication latency), with a NATS gateway connection to the DR region for position state streaming. The NATS cluster itself is not multi-region active-active - the DR cluster is a subscriber, not a full JetStream replica. Full NATS JetStream multi-region replication adds too much latency for position state publishing.

Database: Primary PostgreSQL in us-east-1, streaming replication to DR. For a disaster recovery scenario, promote the DR replica. Replication lag is typically under 100ms for the 220ms round-trip to Singapore.

DNS: AWS Route53 with health checks routing the operator API. Normal state: us-east-1. Failover: 60-second TTL + health check = ~90-second DNS failover. This is acceptable for the operator API but not for exchange connectivity (which is direct, not DNS-routed).

How This Breaks in Production

Split-brain during failover. The most dangerous scenario: primary is slow but not dead. Health checks time out and failover initiates. DR starts trading. Then primary recovers. Now both are active. Your failover controller must have a clear protocol for this: when DR promotes, it must signal primary to deactivate (via a shared coordination service - we use a Consul KV entry with a TTL). If primary cannot be signaled to deactivate, it must fail safe by stopping trading on recovery and requiring manual re-promotion.

Replication lag hiding the real-time position. DR’s position book lags primary by 120ms. In normal operation, this is fine - it is close enough. During a high-frequency burst (earnings release, macro event) when primary is processing 500+ events per second, replication lag can spike to 1-2 seconds. If failover happens during this window, DR may be significantly behind. Monitor replication lag as a real-time metric, not just a general health indicator.

Exchange session reconnect throttling. Some exchanges enforce a cooldown after a session reset - they will not accept a new session establishment for 5-60 seconds after a clean disconnect, depending on the venue. If your primary disconnected uncleanly (crash rather than graceful shutdown), the exchange may apply a longer cooldown. Your DR startup sequence must handle reconnect delays per venue and should not assume all exchanges will be reachable immediately on failover.

Position state replay missing non-NATS state. NATS replication covers fill events and position updates. It may not cover out-of-band state: limit orders that were placed but not yet filled (and thus have no “fill event”), risk parameter changes made through the operator API that were stored in the database but not published to NATS. DR’s position book may be accurate for fills but missing pending order state. Always include a startup database sync step in the DR startup sequence, not just NATS event replay.

RTT variance on cross-region health checks. Your health check from DR to primary uses the same WAN path as your position replication. When the WAN path is congested, health checks time out at the same time replication lags increase. This creates a false positive: primary is healthy but appears unhealthy because the health check uses the congested path. Separate health check and replication network paths where possible, and use multiple vantage points for health checking.

Continue Reading

Enjoyed this?

Get one deep infrastructure insight per week.

Free forever. Unsubscribe anytime.

You're in. Check your inbox.