Blue-Green vs Canary for Stateful Trading Services: Why Naive Canaries Fail

The position state split happened on a Tuesday at 9:47 AM EST, seventeen minutes after the NYSE open. We had deployed a canary release of our order management system - the standard 10% canary that DevOps best practices recommend - and for twenty minutes, two versions of our OMS were simultaneously processing orders. Version 1.4.2 was handling 90% of traffic. Version 1.4.3 was handling 10%.

The versions agreed on almost everything. The change was minor: a fix to how we computed realized PnL for partial fills. The bug only appeared when a multi-leg options order filled partially across multiple exchanges. But it meant that for any trade that version 1.4.3 touched, the computed position was slightly different from what version 1.4.2 would have computed.

Our risk engine, which pulled position state from the OMS to perform aggregate portfolio checks, was now getting inconsistent data. Depending on which OMS instance it queried, the same portfolio might show different net exposures. For twenty minutes, our risk manager had a view of the portfolio that was an incoherent blend of two computation models.

We caught it before it led to a limit breach. But only barely, and only because an alert fired on the discrepancy between two risk engine instances that happened to query different OMS versions. The near-miss triggered a complete review of our deployment approach for stateful trading services.

Why Stateless Deployment Patterns Fail for Trading

The canonical argument for canary deployments assumes that your service is stateless. Version 1 and version 2 produce the same outputs for the same inputs because both versions are deriving outputs from shared, external state (a database, a cache). You can send 10% of traffic to version 2, monitor error rates and latency, and roll back cleanly if something goes wrong - the state is consistent because neither version owns it.

Trading services do not work this way. The OMS owns position state. It does not just read positions from a database - it maintains an in-memory model that reflects every fill event in sequence, because latency requirements preclude round-tripping to the database on every order event. The in-memory model is the authoritative state during a live session. The database is a persistence layer for recovery.

Two versions of the OMS maintaining independent in-memory state will diverge within seconds of handling their first fill events differently. Even if the divergence is small - a rounding difference, a different sequence for partial fills - it is still a fork. You now have two coherent versions of reality that disagree.
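
To make the fork concrete, here is a minimal sketch of how two versions diverge on identical input. The rounding difference below is a hypothetical stand-in, not the actual 1.4.2 to 1.4.3 change, and the function names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_EVEN

CENT = Decimal("0.01")

def realized_pnl_v1(fill_pnls):
    """Hypothetical 'old' logic: round each partial fill to cents, then sum."""
    rounded = (p.quantize(CENT, rounding=ROUND_HALF_EVEN) for p in fill_pnls)
    return sum(rounded, Decimal("0"))

def realized_pnl_v2(fill_pnls):
    """Hypothetical 'new' logic: sum exact values, round once at the end."""
    return sum(fill_pnls, Decimal("0")).quantize(CENT, rounding=ROUND_HALF_EVEN)

# Three partial fills of the same order, each worth 0.333 in realized PnL.
fills = [Decimal("0.333")] * 3

pnl_old = realized_pnl_v1(fills)  # 0.33 + 0.33 + 0.33
pnl_new = realized_pnl_v2(fills)  # 0.999 rounded once
print(pnl_old, pnl_new)           # prints: 0.99 1.00
```

Both results are internally consistent, and neither instance has any way of knowing the other disagrees. A downstream reader simply sees whichever number its request happened to land on.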

The specific failure modes:

Position drift. Version 1.4.2 says BTC exposure is +$45,000. Version 1.4.3 says BTC exposure is +$44,800. Both are wrong in different ways. Any downstream system reading position data will see inconsistent values depending on which OMS instance it queries.

Hedge imbalance. If your risk management system detects that a position is too large and sends a hedging order to reduce it, that order is sized against the risk engine’s view of position. If the risk engine’s view was sourced from a version 1.4.3 instance that underestimates exposure, the hedge will be undersized, leaving you with residual exposure that neither version knows about cleanly.

Reconciliation nightmare. At end-of-day, when you reconcile your internal position state against exchange-reported positions, you have a state that is the result of processing orders through two different computation models. The reconciliation diff tells you something is wrong but cannot tell you which version was correct.

Recovery after rollback. If you roll back version 1.4.3 after twenty minutes of processing, the state changes it made to the database are still there. Your version 1.4.2 instances need to reconcile against a database that contains entries created by version 1.4.3’s logic.

The Only Safe Pattern: Blue-Green with Dark Launch

The deployment pattern that actually works for stateful trading services is blue-green with a dark-launch verification phase. It is more conservative than canary - the new version does not process production traffic until you have high confidence it behaves correctly - but that conservatism is the point.

The phases:

Phase 1: Dark Launch (Read-Only Shadow Mode)

The new version (green) is deployed alongside the existing version (blue). Green subscribes to the same market data and order event feeds as blue. It processes every event, updates its in-memory state, and computes every output. But it does not write to the database, does not respond to API requests, and does not send orders to exchanges.

For every event processed, the green instance emits its computed output to a comparison topic. A separate verification service reads from both the blue output topic and the green shadow output topic and compares them. You define the tolerance: for position values, maybe 1 basis point difference is acceptable (floating-point representation differences), but any signed discrepancy in trade direction or order state is an alert.

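# Verification service: subscribes to blue and green shadow outputs
# PositionState, Discrepancy, and VerificationResult are assumed to be
# application types defined elsewhere; they are not shown here.
from decimal import Decimal
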
class BlueGreenVerifier:
    def __init__(self, tolerance: float = 0.0001):
        self.tolerance = tolerance
        self.discrepancies: list[Discrepancy] = []

    async def compare_outputs(
        self,
        event_id: str,
        blue_output: PositionState,
        green_output: PositionState
    ) -> VerificationResult:
        discrepancies = []

        for symbol in set(blue_output.positions) | set(green_output.positions):
            blue_pos = blue_output.positions.get(symbol, Decimal("0"))
            green_pos = green_output.positions.get(symbol, Decimal("0"))

            if blue_pos == 0 and green_pos == 0:
                continue

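            # Relative difference against blue's position. A sign flip always
            # yields rel_diff > 1, so direction changes are flagged CRITICAL.
            if blue_pos != 0:
                rel_diff = abs(green_pos - blue_pos) / abs(blue_pos)
            else:
                # Blue says 0, green says non-zero: no baseline to divide by,
                # so fall back to green's absolute position
                rel_diff = abs(green_pos)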

            if rel_diff > self.tolerance:
                discrepancies.append(Discrepancy(
                    event_id=event_id,
                    symbol=symbol,
                    blue_position=blue_pos,
                    green_position=green_pos,
                    relative_difference=rel_diff,
                    severity="CRITICAL" if rel_diff > 0.01 else "WARNING"
                ))

        return VerificationResult(
            event_id=event_id,
            match=len(discrepancies) == 0,
            discrepancies=discrepancies
        )
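
The verifier above takes both outputs for an event at once, but they arrive on two separate topics and not necessarily in the same order. One way to pair them is sketched below, assuming every output carries the event_id of the event that produced it. The OutputPairer class and its method names are hypothetical, not part of our production code.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class OutputPairer:
    """Buffers outputs from each topic until both sides of an event are present."""
    pending_blue: dict[str, Any] = field(default_factory=dict)
    pending_green: dict[str, Any] = field(default_factory=dict)

    def on_blue(self, event_id: str, output: Any) -> Optional[tuple[Any, Any]]:
        # If green already reported this event, emit the (blue, green) pair
        if event_id in self.pending_green:
            return (output, self.pending_green.pop(event_id))
        self.pending_blue[event_id] = output
        return None

    def on_green(self, event_id: str, output: Any) -> Optional[tuple[Any, Any]]:
        # If blue already reported this event, emit the (blue, green) pair
        if event_id in self.pending_blue:
            return (self.pending_blue.pop(event_id), output)
        self.pending_green[event_id] = output
        return None

    def unmatched(self) -> int:
        """Outputs still waiting for their counterpart; worth exporting as a metric."""
        return len(self.pending_blue) + len(self.pending_green)

pairer = OutputPairer()
first = pairer.on_blue("evt-1", {"AAPL": 100})    # green has not reported yet
second = pairer.on_green("evt-1", {"AAPL": 100})  # pair is now complete
```

A growing unmatched count is itself a signal: either green is lagging or one side is dropping events.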

The dark launch phase runs for a minimum period that covers at least one full trading session - from open to close - across multiple market conditions. In practice, we require three trading days of clean shadow operation before we consider a version ready for traffic.

Phase 2: Atomic Traffic Cutover

After the dark launch verification clears, we perform a single atomic switch: all API traffic moves from blue to green. There is no gradual ramp. Traffic goes from 0% green to 100% green in one step.

The atomicity is implemented at the load balancer level (target group swap in AWS ALB, or a single service selector change in Kubernetes). The blue instance continues running for a short period afterward in case an immediate rollback is needed, but it is no longer receiving new requests.
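
For the Kubernetes variant, the cutover amounts to one patch against the Service's selector. The sketch below only builds that patch body. The label keys and values (app: oms, version: blue or green) are assumptions for illustration, so substitute whatever labels your deployments actually carry.

```python
import json

def selector_patch(color: str) -> dict:
    """Build a Service patch that points all traffic at one color.

    Label names here are hypothetical; only the spec.selector structure
    is standard Kubernetes.
    """
    if color not in ("blue", "green"):
        raise ValueError(f"unknown deployment color: {color!r}")
    return {"spec": {"selector": {"app": "oms", "version": color}}}

patch = selector_patch("green")
print(json.dumps(patch))
```

Because the selector changes in a single API write, there is no intermediate state where the Service matches both colors. The same function with "blue" produces the rollback patch.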

This is the key difference from a canary: at no point are two versions simultaneously handling production traffic. The verification work happens in shadow mode (dark launch), not in a live traffic split.

Phase 3: Blue on Standby

After cutover, the blue instance remains running but idle for a rollback window - typically 30 minutes to 1 hour. During this window, if we detect any anomaly in green’s behavior, we can swap traffic back to blue instantly (seconds, not minutes).

After the rollback window closes, we drain blue’s connections and stop the process. The blue deployment itself (binaries and configuration) stays in place, so it can be restarted for a full session rollback if needed.

The systemd Unit for Graceful Shutdown

Graceful shutdown is critical for trading services. When a SIGTERM is received (whether from a deployment cutover, a system restart, or a crash recovery), the service must:

1. Stop accepting new orders
2. Cancel all open orders at the exchange
3. Wait for fill confirmations or timeout
4. Write final position state to the database
5. Exit cleanly

# /etc/systemd/system/oms.service
[Unit]
Description=ZeroCopy Order Management System
After=network.target postgresql.service nats.service
Requires=postgresql.service

[Service]
Type=notify
User=trading

# OMS_DB_URL must be provided via Environment= or EnvironmentFile= (not shown here)
ExecStart=/opt/zerocopy/bin/oms \
    --config /etc/zerocopy/oms.toml \
    --db-url=${OMS_DB_URL}

# SIGTERM triggers graceful shutdown:
# 1. Stop accepting new order submissions
# 2. Cancel all open orders at exchanges (with 10s timeout per exchange)
# 3. Flush position state to database
# 4. Exit with code 0 on success, non-zero on error
KillSignal=SIGTERM
KillMode=process

# Allow up to 45 seconds for graceful shutdown before SIGKILL
# This covers: 10s exchange cancel + 5s confirmation wait + 10s DB flush + buffer
TimeoutStopSec=45

# DO NOT auto-restart on graceful shutdown (exit code 0)
# DO auto-restart on crash (exit code != 0)
Restart=on-failure
RestartSec=5s

# SD_NOTIFY: service signals READY=1 when connected to all exchanges
# This prevents systemd from marking the service as started before it's ready
NotifyAccess=main

[Install]
WantedBy=multi-user.target

The corresponding SIGTERM handler in the application:

// In the OMS application (Rust)
async fn handle_shutdown_signal(
    exchange_clients: Arc<ExchangeClients>,
    order_state: Arc<RwLock<OrderState>>,
    db: Arc<Database>,
) {
    // Tell systemd that shutdown has begun (STOPPING=1 is meant to be sent
    // at the start of shutdown, not at the end)
    let _ = sd_notify::notify(true, &[sd_notify::NotifyState::Stopping]);

    // Signal received: stop accepting new orders first
    order_state.write().await.set_draining(true);

    tracing::info!("Shutdown initiated: canceling open orders");

    // Snapshot open orders per exchange under one read lock, then release the
    // lock before awaiting the cancels. Assumes open_orders_for_exchange
    // returns an owned collection.
    let cancel_futures: Vec<_> = {
        let state = order_state.read().await;
        exchange_clients
            .iter()
            .map(|client| {
                let orders = state.open_orders_for_exchange(client.name());
                cancel_orders_with_timeout(client.clone(), orders, Duration::from_secs(10))
            })
            .collect()
    };

    let cancel_results = futures::future::join_all(cancel_futures).await;

    for result in &cancel_results {
        if let Err(e) = result {
            tracing::error!("Failed to cancel orders at exchange: {}", e);
            // Continue shutdown even if cancel fails - we will reconcile on restart
        }
    }

    // Wait for outstanding fill acknowledgements (max 5 seconds)
    // NOTE: the read guard is held for the whole wait, so wait_for_pending_fills
    // must not depend on a writer acquiring this lock
    tokio::time::timeout(
        Duration::from_secs(5),
        order_state.read().await.wait_for_pending_fills()
    ).await.ok();

    // Persist final position state
    let final_positions = order_state.read().await.positions().clone();
    if let Err(e) = db.save_session_checkpoint(&final_positions).await {
        tracing::error!("Failed to save session checkpoint: {}", e);
        // Log but do not block shutdown - the exchange is the source of truth
    }

    tracing::info!("Graceful shutdown complete");
}

When to Deploy: Market Close Windows

For stateful trading services, the only safe deployment window is after market close and before the next open. This sounds obvious but is frequently violated in practice.

The reasoning: during a live session, your OMS holds position state that reflects open trades. A deployment during a session requires:

Persisting all in-memory state to the database (risky - the in-memory state is authoritative)
Starting the new version and restoring from the database (slow - potentially missing events that occurred between persist and start)
Or running the dark launch simultaneously during a live session (complex - position state is changing continuously, making verification harder)

Market close windows solve all of this. At market close, all positions are flat (if you are running properly) or there is a well-defined end-of-day state. You can take a clean checkpoint, deploy the new version, verify it restores correctly from the checkpoint, and have it ready for the next session open.

The deployment window for most crypto-trading desks (24/7 markets) is trickier because there is no “close.” We use low-volume periods (Sunday midnight UTC for BTC, which shows minimal volume) and require a full shadow verification against recent trading history before cutover. The principle is the same: cutover during the lowest-activity period to minimize exposure if something goes wrong.

Where Canary Deployment Is Fine in Trading

Not all trading services are stateful in the problematic sense. Several components handle canary deployment without issues:

Market data feed processors (stateless after normalization): Each message is processed independently. Two versions of the feed handler producing slightly different normalized outputs is a discrepancy worth testing, but running a 10% canary does not create the position state split problem. You compare the two outputs and keep whichever version normalizes correctly.

Reference data services: REST APIs that serve static reference data (instrument specs, exchange calendars, settlement rules). Completely stateless in the request-response sense. Canary fine.

Alerting and monitoring services: They observe state but do not modify it. A new version of your alert engine running on 10% of events is safe - if it fires incorrectly, the consequence is a false alert, not an incorrect position.

Admin and operator APIs: Authentication, configuration management, operator dashboards. No trading state - canary is the right pattern here.

The rule of thumb: if the service writes trading state (positions, orders, fills), use blue-green with dark launch. If it reads trading state or is entirely stateless, canary is appropriate.

How This Breaks in Production

The shadow instance falling behind. The dark launch verification only works if the green shadow instance stays in sync with blue. If green is processing events at 95% of blue’s rate due to a performance regression, the shadow starts falling behind. Verification output becomes stale. You think green is fine, but green is actually processing a market state from 30 minutes ago. Always monitor shadow lag explicitly - if it exceeds 5 seconds, pause verification and investigate before proceeding.
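
The arithmetic is worth spelling out, because a small rate deficit compounds quickly. The sketch below assumes lag is measured as the gap between the timestamp of the latest event blue has processed and the latest one green has processed. Function and constant names are hypothetical.

```python
MAX_SHADOW_LAG_SECONDS = 5.0  # threshold from the text above

def lag_after(elapsed_seconds: int, green_rate_pct: int) -> float:
    """Backlog (in seconds of event stream) after green has run at
    green_rate_pct percent of blue's rate for elapsed_seconds."""
    return elapsed_seconds * (100 - green_rate_pct) / 100

def verification_allowed(blue_event_ts: float, green_event_ts: float) -> bool:
    """Only trust comparisons while green is within the lag threshold."""
    return (blue_event_ts - green_event_ts) <= MAX_SHADOW_LAG_SECONDS

one_hour_lag = lag_after(3600, 95)     # 180.0 seconds behind after one hour
ten_hour_lag = lag_after(36_000, 95)   # 1800.0 seconds, i.e. 30 minutes
threshold_hit = lag_after(100, 95)     # 5.0 seconds: threshold reached in 100 s
```

At 95% of blue's rate, green crosses the 5-second threshold in under two minutes, so a lag alert will fire long before the verification output is half an hour stale.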

Exchange cancel failures during shutdown leaving ghost orders. If the exchange is temporarily unreachable when your SIGTERM handler tries to cancel open orders, and your timeout expires before connectivity is restored, those orders remain open at the exchange. Your new instance starts up without knowing about them. Implement an explicit startup reconciliation step: on boot, query the exchange for open orders and reconcile against your last persisted checkpoint. Any discrepancy is an alert, not a silent assumption.
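
A minimal sketch of that reconciliation step, assuming both sides can be reduced to sets of order IDs. How you fetch open orders from each exchange and load the checkpoint is specific to your stack and not shown. The function name and the keys in the result are hypothetical.

```python
def reconcile_open_orders(
    exchange_open: set[str], checkpoint_open: set[str]
) -> dict[str, set[str]]:
    """Compare what the exchange reports as open with what we persisted."""
    return {
        # Open at the exchange but unknown to us: ghost orders from a failed cancel
        "ghost": exchange_open - checkpoint_open,
        # We believe they are open but the exchange does not: filled or canceled
        # while we were down
        "missing": checkpoint_open - exchange_open,
    }

diff = reconcile_open_orders(
    exchange_open={"ord-1", "ord-2", "ord-9"},
    checkpoint_open={"ord-1", "ord-2", "ord-3"},
)
needs_alert = bool(diff["ghost"] or diff["missing"])
```

Either non-empty set should block the instance from accepting orders until someone, or something, has resolved the difference.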

Blue instance state becoming stale before rollback window closes. After cutover to green, blue continues running but receives no new events. If green processes trades for 45 minutes and you need to roll back, blue’s position state is 45 minutes stale. Rolling back to blue gives you correct order routing (blue knows all the valid instruments and order types) but incorrect position state. Always have a procedure for restoring position state from the database on rollback, not just switching traffic back.

Dark launch replay not covering edge cases. Three days of shadow operation covers normal market conditions. It may not cover limit-up/limit-down events, settlement days, or the specific multi-leg partial-fill scenario that triggered the original bug. Build a market condition replay harness that can replay historical edge cases against the green shadow before any deployment, regardless of how many real market days have passed.
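
The core of such a harness is small. The sketch below assumes each version can be wrapped as a callable that takes one event and returns its resulting output. The position trackers and the regression in the green one are invented purely to exercise the harness.

```python
from typing import Any, Callable, Iterable

def replay_and_compare(
    events: Iterable[Any],
    blue_handler: Callable[[Any], Any],
    green_handler: Callable[[Any], Any],
) -> list[tuple[int, Any, Any, Any]]:
    """Feed the same recorded events to both versions; collect every mismatch."""
    mismatches = []
    for index, event in enumerate(events):
        blue_out, green_out = blue_handler(event), green_handler(event)
        if blue_out != green_out:
            mismatches.append((index, event, blue_out, green_out))
    return mismatches

def make_tracker(drop_large_fills: bool) -> Callable[[int], int]:
    """Toy stateful handler: running position from signed fill quantities."""
    position = 0
    def handle(fill_qty: int) -> int:
        nonlocal position
        # Hypothetical regression: fills above 1000 units are silently ignored
        if not (drop_large_fills and abs(fill_qty) > 1000):
            position += fill_qty
        return position
    return handle

recorded_fills = [100, -50, 5000, 10]  # the third event is the edge case
mismatches = replay_and_compare(
    recorded_fills, make_tracker(False), make_tracker(True)
)
```

Note that once state has forked, every later event mismatches too. The first entry in the list is the one that points at the cause.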

Load balancer health checks masking startup failures. After cutover, the load balancer health check passes as soon as the HTTP endpoint responds 200. But “responds 200” does not mean the OMS is connected to all exchanges and has restored full position state from the database. Implement a deep health check endpoint that only returns 200 when the application is fully initialized - all exchange connections live, all position state restored, all recovery steps complete - and use that as your load balancer health check target.
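
The decision logic behind a deep health check can be as simple as the sketch below. The Readiness fields are assumptions about what the application tracks internally, and wiring the result into an HTTP handler is left to whatever framework the service uses.

```python
from dataclasses import dataclass

@dataclass
class Readiness:
    exchanges_expected: set[str]
    exchanges_connected: set[str]
    positions_restored: bool
    recovery_complete: bool

def health_status(r: Readiness) -> int:
    """Return 200 only when every initialization step has finished."""
    ready = (
        r.exchanges_expected <= r.exchanges_connected  # every expected venue is live
        and r.positions_restored
        and r.recovery_complete
    )
    return 200 if ready else 503

booting = Readiness({"NYSE", "CBOE"}, {"NYSE"}, True, True)
ready = Readiness({"NYSE", "CBOE"}, {"NYSE", "CBOE"}, True, True)
```

Here `health_status(booting)` is 503 because one exchange connection is still missing, even though the process is up and would happily answer a shallow check.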

Timezone confusion during market-close deployment windows. You plan to deploy at NYSE close (4:00 PM EST). Your automation runs in UTC. Someone changed a cron job. The deployment fires at 4:00 PM UTC, which is 11:00 AM EST - during live trading. This is a real incident category. All deployment automation should reference the relevant exchange’s local close time explicitly, not a UTC timestamp, and should verify that volume is below threshold before proceeding.
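
One way to enforce this is a guard that converts the current time into the exchange's own timezone before deciding anything. The sketch below covers regular NYSE hours only. It deliberately ignores holidays and early closes, and it omits the volume check, all of which a real guard would need.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

NYSE_TZ = ZoneInfo("America/New_York")
NYSE_OPEN = time(9, 30)
NYSE_CLOSE = time(16, 0)

def in_deploy_window(now_utc: datetime) -> bool:
    """True only outside regular NYSE hours, evaluated in exchange-local time.

    Simplification: no holiday calendar, no half days, no volume threshold.
    """
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(NYSE_TZ)
    if local.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (NYSE_OPEN <= local.time() < NYSE_CLOSE)

# The misconfigured cron from the example: 4:00 PM UTC on a Tuesday in January
bad_cron = in_deploy_window(datetime(2024, 1, 16, 16, 0, tzinfo=timezone.utc))
# The intended time: 4:00 PM New York, which is 21:00 UTC in winter
intended = in_deploy_window(datetime(2024, 1, 16, 21, 0, tzinfo=timezone.utc))
```

Because the conversion goes through the IANA zone and not a fixed offset, the same guard stays correct across daylight saving changes, when the close moves from 21:00 UTC to 20:00 UTC.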