Incident Response for Trading Systems: Why You Can't 'Just Roll Back' a Trade
At Upside we discovered a $200K undisclosed net long. The IR playbook that doesn't work for trading, and the one that does - from halt to flatten to post-incident PnL analysis.
About eight months into my time at Upside, we had an incident that I still think about. A bug in our position reconciliation code caused our system to believe we were flat across all instruments when we actually had a $200K net long in technology equities. The bug had been live for 11 days.
The discovery came not from a monitoring alert but from a trader on the desk who noticed the afternoon mark-to-market didn’t match what he expected. He pulled up the position ledger, ran the reconciliation manually, and called me.
What happened next was unlike any incident I had managed in web infrastructure. There was no rollback. There was no “let’s restore from the last known good state.” The position was real, live on the exchanges, accumulating mark-to-market in real time as the market moved. The reconciliation code was fixed in 20 minutes - that was the easy part. But we had 11 days of trades to review, a live exposure to assess, and a decision to make about whether to flatten immediately or hold through the close.
That incident shaped how I think about IR for trading systems. This is the framework.
Why Standard Software IR Breaks for Trading
Standard software incident response runs on two assumptions that trading systems violate completely.
The first assumption is that the system can be restored to a known good state. In software, this means rollback: redeploy the last working version, and the system is in the state it was before the bad deploy. In trading, transactions are irrevocable. A fill acknowledged by the exchange is a legal obligation. You cannot roll back a trade any more than you can unspend cash. The “state” of the system after a bug includes real financial positions that exist in the world, not just in your database.
The second assumption is that the incident’s impact is bounded to the period when the bug was active. A web bug that caused 503 errors for 10 minutes had 10 minutes of impact. A trading bug that caused incorrect position tracking for 11 days had 11 days of impact: every trade executed during that period may need to be reviewed for correctness, every hedge that was or wasn’t placed needs to be evaluated, every risk limit check that ran against the incorrect position state may have produced wrong answers.
This fundamentally changes the shape of incident response. The question is not “how do we restore the system?” - the system is already restored once the code is fixed. The question is “what is the current state of the world, and what do we need to do about it?”
The Severity Matrix for Trading
Standard severity matrices use availability as the primary axis. P0 = complete outage, P1 = major degradation, P2 = minor degradation. For trading, the primary axis is financial exposure:
P0: Money is moving in the wrong direction. The system is actively executing trades that should not be executed, or not executing trades that should be. Every second of delay increases the financial exposure. Examples: strategy executing on incorrect signal, risk limits not being enforced, hedge trades failing to submit while directional trades go through. Required action: halt new orders within 2 minutes.
P1: System is down and losing PnL. The system is not actively making things worse, but open positions are accumulating PnL in an unmonitored or unmanaged state. Examples: the strategy has stopped trading but still has positions open, or monitoring has failed and you don’t know the current position state. Required action: assess and communicate position state within 15 minutes.
P2: System is degraded but not actively losing. Execution quality is worse than expected, latency is elevated, data is stale. The system is trading, but not optimally. Examples: market data 500ms stale causing worse fills, order routing falling back to secondary exchange with wider spreads. Required action: quantify the quality degradation, decide whether to continue or halt, open a ticket.
The key difference from standard P0/P1/P2: the classification is based on the financial direction of the impact, not the technical severity. A complete database outage that happens when the strategy is flat and markets are closed is a P1. A single-line bug that is causing the strategy to size 2x larger than intended on every trade is a P0 even if the system appears healthy from a technical standpoint.
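The matrix above can be written down as a small classifier. This is a minimal sketch, not Upside's actual tooling: the `IncidentFacts` fields and the `classify` function are hypothetical names, and the three booleans are assumed to be filled in by the on-call engineer from what is known at the time.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P0 = "halt new orders within 2 minutes"
    P1 = "assess and communicate position state within 15 minutes"
    P2 = "quantify the degradation, decide whether to continue or halt, open a ticket"


@dataclass
class IncidentFacts:
    # Hypothetical inputs; none of these names come from a real system.
    trading_incorrectly: bool       # wrong orders going out, required orders not going out, or limits not enforced
    system_down: bool               # trading or monitoring infrastructure is unavailable
    open_positions_unmanaged: bool  # positions are open but nobody can see or manage them


def classify(facts: IncidentFacts) -> Severity:
    """Classify by the financial direction of the impact, not by technical severity."""
    if facts.trading_incorrectly:
        return Severity.P0
    if facts.system_down or facts.open_positions_unmanaged:
        return Severity.P1
    return Severity.P2


# A 2x sizing bug on a technically healthy system is still a P0.
sizing_bug = IncidentFacts(trading_incorrectly=True, system_down=False, open_positions_unmanaged=False)
print(classify(sizing_bug).name, "-", classify(sizing_bug).value)
```

The order of the checks carries the point of the section: the question about whether money is moving in the wrong direction is asked first, before anything about availability.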
The IR Flow for a Trading Incident
Step 1: Halt new orders (if P0 or ambiguous). This is the default action for any trading IR where the cause is unknown. Halting new orders prevents the system from making a bad situation worse. It is reversible. The cost of halting incorrectly - missing some PnL while you investigate - is bounded and known. The cost of continuing to trade incorrectly - accumulating exposure against a bad signal or with broken risk controls - is unbounded.
At Upside, the halt was a single API call to the OMS that toggled a trading_enabled flag. Every strategy checked this flag before submitting orders. The halt took effect within 500ms of being issued. This was the most important piece of infrastructure we built, and it was also the simplest.
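```python
import json
import logging
import time

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
logger = logging.getLogger("oms")
# `db` (asyncpg-style pool) and `nats` (connected NATS client) are assumed
# to be created at application startup.


class HaltRequest(BaseModel):
    reason: str
    operator: str


# OMS halt endpoint - the most important API in the system
@app.post("/trading/halt")
async def halt_trading(req: HaltRequest):
    """
    Halts all new order submission system-wide.
    Takes effect within one strategy cycle (< 1 second).
    Existing open orders are NOT cancelled - that's a separate action.
    """
    await db.execute(
        "UPDATE system_state SET trading_enabled = FALSE, "
        "halt_reason = $1, halt_operator = $2, halt_time = NOW()",
        req.reason, req.operator
    )
    # Broadcast to all strategy processes via NATS (payload must be bytes)
    await nats.publish("system.halt", json.dumps({
        "enabled": False,
        "reason": req.reason,
        "operator": req.operator,
        "timestamp": time.time()
    }).encode())
    logger.critical("TRADING HALTED by %s: %s", req.operator, req.reason)
    return {"status": "halted", "reason": req.reason}
```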
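The other half of the halt is the check each strategy makes before submitting an order. The sketch below is an assumption about how that check could look, not the code that ran at Upside: `TradingGate`, `on_halt_message`, and `submit` are hypothetical names, and the message format is taken from the `system.halt` payload above. Starting in the disabled state until the flag has been loaded is a design choice of this sketch.

```python
import json


class TradingGate:
    """In-process view of the system-wide trading_enabled flag (hypothetical helper)."""

    def __init__(self, enabled: bool = False):
        # Fail closed: a strategy that has not yet learned the flag state does not trade.
        self.enabled = enabled
        self.reason = "state not loaded"

    def on_halt_message(self, payload: bytes) -> None:
        """Handler for messages published on the system.halt subject."""
        msg = json.loads(payload)
        self.enabled = bool(msg["enabled"])
        self.reason = msg.get("reason", "")

    def submit(self, order: dict, send) -> bool:
        """Send the order only if trading is enabled; return whether it was sent."""
        if not self.enabled:
            return False
        send(order)
        return True


# Usage: orders go out until a halt message arrives, then they are dropped.
sent = []
gate = TradingGate(enabled=True)
gate.submit({"symbol": "AAPL", "qty": 10}, sent.append)
gate.on_halt_message(json.dumps({"enabled": False, "reason": "unexpected-orders"}).encode())
gate.submit({"symbol": "AAPL", "qty": 10}, sent.append)
print(len(sent), gate.reason)
```

Putting the check inside `submit` rather than in the strategy logic means no code path can reach the exchange without passing the gate.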
Step 2: Assess current position state. Before deciding whether to flatten or hold, you need an accurate picture of what you actually own. This means running a position reconciliation against the exchange directly, not trusting your internal position tracker (which may be the source of the bug). Exchanges and brokers generally provide an account positions endpoint or report. Use it.
```python
# Direct exchange position query - trust this over internal state during incidents
async def get_ground_truth_positions(exchange_client):
    """
    Query exchange directly for current positions.
    This is the authoritative state during an IR when internal tracking is suspect.
    """
    positions = await exchange_client.fetch_positions()
    return {
        pos['symbol']: {
            'qty': pos['contracts'],
            'side': pos['side'],
            'avg_price': pos['entryPrice'],
            'unrealized_pnl': pos['unrealizedPnl'],
            'market_value': pos['notional']
        }
        for pos in positions
        if pos['contracts'] != 0
    }
```
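Once the exchange view is in hand, the comparison against the internal tracker is a small diff. This is a minimal sketch under stated assumptions: `signed_qty` and `diff_positions` are hypothetical helpers, the internal tracker is assumed to expose `{symbol: signed quantity}`, and the `side` field is assumed to hold `"long"` or `"short"`.

```python
def signed_qty(pos: dict) -> float:
    """Convert a {'qty', 'side'} entry into a signed quantity (long positive, short negative)."""
    return pos["qty"] if pos["side"] == "long" else -pos["qty"]


def diff_positions(internal: dict, exchange: dict, tolerance: float = 1e-9) -> dict:
    """
    Return {symbol: (internal_qty, exchange_qty)} for every symbol where the
    internal tracker and the exchange disagree. A symbol missing on either
    side counts as a zero position on that side.
    """
    breaks = {}
    for symbol in sorted(set(internal) | set(exchange)):
        internal_qty = internal.get(symbol, 0.0)
        exchange_qty = signed_qty(exchange[symbol]) if symbol in exchange else 0.0
        if abs(internal_qty - exchange_qty) > tolerance:
            breaks[symbol] = (internal_qty, exchange_qty)
    return breaks


# The tracker believes it holds only a short in MSFT; the exchange also shows a long in AAPL.
internal = {"MSFT": -100}
exchange = {
    "AAPL": {"qty": 500, "side": "long"},
    "MSFT": {"qty": 100, "side": "short"},
}
print(diff_positions(internal, exchange))
```

Taking the union of symbols matters: the dangerous case is a position the internal tracker does not know about at all, which a loop over internal symbols alone would never surface.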
Step 3: Decide - flatten or hold. This decision cannot be made by the on-call engineer alone. It requires the risk manager (or whoever owns the position risk mandate) and, depending on the size of the position, leadership. The inputs to the decision:
- Current market conditions: are we at a favorable exit point, or would flattening now lock in a large loss?
- Risk limits: does the current position exceed any hard risk limit that requires immediate flattening regardless of PnL?
- Cause of the incident: is the cause fixed? If the strategy is still generating bad orders, you must flatten.
- Time until next market close: if a close is within 30 minutes, holding through the close may be the lowest-risk option.
At Upside, our rule was: if the position exceeds 150% of the intended size, we flatten regardless of market conditions. Below that, it was a judgment call made jointly by engineering and risk.
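The mechanical parts of that decision can be encoded so that only the real judgment call reaches a human. The following is a sketch, not the rule engine used at Upside: `PositionAssessment` and `flatten_decision` are hypothetical names. One assumption is labeled explicitly: when the intended size is zero the 150% ratio is undefined, and this sketch leaves that case to the joint judgment call.

```python
from dataclasses import dataclass


@dataclass
class PositionAssessment:
    # Hypothetical inputs, all in the same unit (for example, dollars of exposure).
    actual_size: float
    intended_size: float
    hard_limit: float
    still_generating_bad_orders: bool


def flatten_decision(a: PositionAssessment) -> str:
    """Apply the non-negotiable rules first; everything else is a judgment call."""
    if abs(a.actual_size) > a.hard_limit:
        return "flatten: hard risk limit exceeded"
    if a.still_generating_bad_orders:
        return "flatten: cause not fixed"
    # Assumption: with intended_size == 0 the ratio is undefined, so the rule is skipped.
    if a.intended_size != 0 and abs(a.actual_size) > 1.5 * abs(a.intended_size):
        return "flatten: position exceeds 150% of intended size"
    return "judgment call: engineering and risk decide jointly"


print(flatten_decision(PositionAssessment(160_000, 100_000, 500_000, False)))
print(flatten_decision(PositionAssessment(120_000, 100_000, 500_000, False)))
```

Market conditions and time to the close are deliberately absent from the function: they belong to the people making the judgment call, not to the rule.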
Step 4: Execute the decision, communicate, and document. If flattening: submit market orders to close positions, watch the fills, and confirm the flat state against the exchange. If holding: schedule a manual check of the position’s PnL every 15 minutes until the next close.
Communicate to all stakeholders simultaneously: traders, risk, compliance, and leadership should all receive the same status update at the same time. Do not deliver news sequentially - the first person to hear it becomes a relay, and information degrades in transmission.
The Exchange Support Line
Every trading firm should have a direct exchange support contact - not the public ticket system, but a phone number or chat handle for a named support representative.
We used this exactly once in my time at Upside, and having it was worth everything. A connectivity issue with a major exchange caused us to lose visibility into our order status. We had orders live on the exchange that we could not see. We called the support line, gave them our participant ID, and within 8 minutes had a confirmed list of our open orders and their current status. With the ticket system, the same information would have taken 2-3 hours.
Establish this relationship before you need it. Most exchanges will assign a dedicated support contact to firms above a certain trading volume threshold. If you’re below that threshold, the public support queue is what you have - which is a reason to hit the volume threshold.
Post-Incident Review for Trading: PnL Impact Analysis
A standard post-incident review asks: what broke, why, and what do we change? A trading PIR adds a fourth question: what did this cost?
The PnL impact analysis for the Upside reconciliation bug required:
- Identifying the affected period. When did the bug introduce incorrect position state? In our case, a git bisect found it 11 days prior.
- Reconstructing what the system believed vs. what was true. We replayed the position tracker logic against the trade history with the bug, and without the bug, to produce two parallel position ledgers: “system believed” and “actual.”
- Calculating the cost of divergence. For each time the two ledgers diverged and the strategy made a decision based on the incorrect ledger (placed or skipped a hedge, sized a position incorrectly), calculate the PnL impact of that decision versus the counterfactual.
- Attributing the total impact. In our case, the total impact was approximately $42,000 in worse execution quality over 11 days - not catastrophic, but real. The system had been consistently undersizing hedges because it believed it was more hedged than it was.
This analysis matters for three reasons. First, it gives you the real cost of the incident, which is the most important input to prioritization of the fix. Second, it may have regulatory implications - if the incorrect position state caused a risk limit breach, that may require disclosure. Third, it improves future root cause analysis by establishing a model for how bugs translate to financial impact.
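The two-ledger replay can be sketched in a few lines. Everything here is illustrative: `replay` and `divergences` are hypothetical helpers, the trade records are made up, and the "bug" (a tracker that ignores fills flagged as partial) is an invented stand-in, not the actual Upside reconciliation bug.

```python
def replay(trades, apply_fill):
    """Replay the trade history through a tracker function; return (ts, position) after each trade."""
    position = 0
    ledger = []
    for trade in trades:
        position = apply_fill(position, trade)
        ledger.append((trade["ts"], position))
    return ledger


def divergences(believed, actual):
    """Return (ts, believed_position, actual_position) wherever the two ledgers disagree."""
    return [(ts, b, a) for (ts, b), (_, a) in zip(believed, actual) if b != a]


def correct_tracker(position, trade):
    return position + trade["qty"]


def buggy_tracker(position, trade):
    # Invented bug for illustration: partial fills are silently dropped.
    return position if trade["partial"] else position + trade["qty"]


trades = [
    {"ts": 1, "qty": 100, "partial": False},
    {"ts": 2, "qty": 50, "partial": True},
    {"ts": 3, "qty": -100, "partial": False},
]

believed = replay(trades, buggy_tracker)
actual = replay(trades, correct_tracker)
for ts, b, a in divergences(believed, actual):
    print(f"ts={ts} system believed {b}, actual {a}")
```

In this toy history the system ends up believing it is flat while actually holding 50. Each row returned by `divergences` marks a point where decisions made on the incorrect ledger need to be priced against the counterfactual.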
Runbook Structure for “Strategy Producing Unexpected Orders”
This is the actual runbook structure I use for a P1 incident involving unexpected order generation:
```markdown
# Runbook: Strategy Producing Unexpected Orders

## Severity: P1 (escalate to P0 if orders are creating net exposure beyond risk limits)

## Immediate Actions (< 2 minutes)
1. Halt new orders: `curl -X POST http://oms:8080/trading/halt -H 'Content-Type: application/json' -d '{"reason":"unexpected-orders","operator":"<your-name>"}'`
2. Check current open orders: `curl http://oms:8080/orders/open`
3. Cancel all open orders if cause is unknown: `curl -X POST http://oms:8080/orders/cancel-all`

## Assess Current State (< 10 minutes)
1. Query exchange directly for positions: `./scripts/exchange-positions.sh`
2. Compare against internal position tracker: `./scripts/reconcile-positions.sh`
3. If they differ: the internal tracker is suspect. Use exchange state as ground truth.
4. Calculate net exposure: `./scripts/calculate-net-exposure.sh`

## Escalate if:
- Net exposure exceeds $50K in any single instrument
- Net exposure exceeds $150K total
- Root cause is not identified within 15 minutes
- Any fill was received that was not in the expected order set

## Escalation contacts:
- Risk manager: [name and number]
- Trading desk lead: [name and number]
- Exchange support: [name and direct contact]

## Common Causes

### Signal generation error
- Check: `grep "ERROR\|WARN" /var/log/strategy/strategy.log | tail -100`
- Check: `curl http://strategy:8081/metrics | grep signal_`
- Fix: identify incorrect signal, correct code, restart strategy

### Position tracker divergence
- Check: `./scripts/reconcile-positions.sh > /tmp/reconcile.txt && diff /tmp/reconcile.txt /tmp/last-known-good.txt`
- Fix: reconcile position state from exchange, restart with correct state

### Risk limits not loaded
- Check: `curl http://risk:8082/limits`
- Fix: reload risk configuration, verify limits are correctly applied

## Post-Incident
1. Capture trade history during incident period
2. Run PnL impact analysis: `./scripts/pnl-impact-analysis.sh --start=<incident_start> --end=<incident_end>`
3. File post-incident review within 48 hours
```
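The escalation criteria in the runbook are simple enough to check mechanically. The sketch below is one possible shape for that check, not the contents of `calculate-net-exposure.sh`: `escalation_reasons` is a hypothetical function, positions are assumed to be signed market values in dollars keyed by symbol, and the thresholds are the ones listed in the runbook.

```python
SINGLE_INSTRUMENT_LIMIT = 50_000   # dollars, from the runbook
TOTAL_LIMIT = 150_000              # dollars, from the runbook
ROOT_CAUSE_DEADLINE_MIN = 15       # minutes, from the runbook


def escalation_reasons(positions: dict, minutes_without_root_cause: float, unexpected_fills: int) -> list:
    """
    positions: {symbol: signed market value in dollars}, taken from the exchange.
    Returns a list of human-readable reasons to escalate; an empty list means no escalation.
    """
    reasons = []
    for symbol, value in sorted(positions.items()):
        if abs(value) > SINGLE_INSTRUMENT_LIMIT:
            reasons.append(f"net exposure in {symbol} is ${abs(value):,.0f}")
    total = sum(positions.values())
    if abs(total) > TOTAL_LIMIT:
        reasons.append(f"total net exposure is ${abs(total):,.0f}")
    if minutes_without_root_cause > ROOT_CAUSE_DEADLINE_MIN:
        reasons.append("root cause not identified within 15 minutes")
    if unexpected_fills > 0:
        reasons.append(f"{unexpected_fills} fill(s) outside the expected order set")
    return reasons


print(escalation_reasons({"AAPL": 60_000, "MSFT": -20_000}, minutes_without_root_cause=5, unexpected_fills=0))
```

Returning the reasons rather than a boolean gives the engineer the exact sentence to read out when calling the risk manager.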
How This Breaks in Production
Failure mode 1: No trading halt mechanism. Symptom: during a P0 incident, the on-call engineer has no way to stop the strategy from continuing to place orders. By the time the code fix is deployed (15-20 minutes), the strategy has placed hundreds of additional orders on an incorrect signal. Root cause: halt mechanism was never built because “we’ll just restart the process.” Process restart takes 30-60 seconds and the strategy resumes where it left off.
Failure mode 2: Trusting internal position state during an incident. Symptom: engineer halts trading and assesses positions using the internal tracker, concludes the system is flat, resumes trading. Strategy immediately generates a large order because the actual position on the exchange is not flat. Root cause: the incident was caused by a position tracking bug, but the IR used the same buggy tracker to assess current state. Always query the exchange directly during any position-related incident.
Failure mode 3: Flattening without checking market conditions. Symptom: engineer calls “flatten everything” on a large position at a moment of low liquidity. Market impact of the flatten is 10x larger than a controlled exit would have been. Root cause: flatten-immediately became a reflex rather than a decision. The correct reflex is “halt first, assess, then decide on exit strategy.”
Failure mode 4: No direct exchange support contact. Symptom: orders are orphaned at the exchange due to connectivity issue; engineer cannot determine their status; spends 2 hours waiting for ticket response while positions are unmonitored. Root cause: firm relies on public support queue. Fix: establish direct support contacts before the first incident.
Failure mode 5: Post-incident review skips PnL impact. Symptom: the incident report describes the technical root cause and fix, but the financial impact is described as “immaterial” without quantification. Root cause: IR process borrowed from web infrastructure playbook, which does not require PnL analysis. Fix: add PnL impact analysis as a mandatory section of every trading PIR, even if the result is zero.
Failure mode 6: Simultaneous flatten and position state query. Symptom: engineer submits flatten orders while the position state query is still running. The position query completes and shows flat, but the flatten orders haven’t filled yet. Engineer marks the incident resolved. Fill confirmations for the flatten orders arrive 2 minutes later, creating a new position in the opposite direction. Root cause: confirming flat state before the flatten orders had filled. Fix: reconcile only after all pending orders show a final state (filled or cancelled) from the exchange.
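The fix for failure mode 6 amounts to a guard that refuses to report "flat" while any flatten order is still live. A minimal sketch, with assumptions: `safe_to_confirm_flat` is a hypothetical function, order records are assumed to carry `id` and `status` fields, the status strings are placeholders for whatever the exchange actually reports, and positions are signed quantities keyed by symbol.

```python
FINAL_STATES = {"filled", "cancelled"}  # placeholder status names


def safe_to_confirm_flat(flatten_orders: list, exchange_positions: dict):
    """
    Return (ok, message). ok is True only when every flatten order has reached
    a final state AND the exchange reports zero quantity for every symbol.
    """
    pending = [o["id"] for o in flatten_orders if o["status"] not in FINAL_STATES]
    if pending:
        return False, f"orders still pending: {pending}; re-query positions after they finish"
    remaining = {s: q for s, q in exchange_positions.items() if q != 0}
    if remaining:
        return False, f"not flat: {remaining}"
    return True, "flat confirmed against exchange"


orders = [{"id": "a1", "status": "filled"}, {"id": "a2", "status": "open"}]
print(safe_to_confirm_flat(orders, {"AAPL": 0}))
```

The order check runs before the position check on purpose: a zero position reported while an order is still working is exactly the misleading reading described above.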