SLOs for Systems That Can't Degrade Gracefully: Error-Budget Math When Downtime = Direct PnL Loss
Why Google's SRE error budget model is dangerously wrong for trading. Session-aware SLOs, PnL-equivalent budgets, and burn rate alerting for live trading systems.
There is a slide in almost every SRE onboarding deck that shows the error budget formula: take 100%, subtract your SLO target, and what remains is the budget you can spend on deployments, experiments, and velocity. Google invented this mental model, and for most software it works well. For trading systems, it is dangerous.
I learned this at Upside, where we ran $500M AUM across automated strategies, 24 hours a day, 7 days a week. The SLOs I inherited were textbook: 99.9% uptime, measured monthly, 43-minute error budget per month. The error budget was managed the same way a web team manages it - we burned it on Thursday afternoon deployments when traffic was low.
Six weeks in, I realized we had the model completely backwards. The budget-burning exercise had implicitly assumed that 43 minutes of downtime was equally costly regardless of when it occurred. For a trading system during an active, high-volatility session, one minute of downtime can cost more than the entire month’s worth of quiet-hour downtime combined. The model was actively encouraging us to take risks at the wrong times.
Why the Error Budget Model Breaks for Trading
The Google SRE error budget model makes one foundational assumption: that availability loss is roughly uniformly costly over time. This assumption holds for consumer products where traffic follows a diurnal curve. It does not hold for trading systems where the cost of unavailability is directly correlated with market conditions.
Consider two scenarios. In the first, your order routing service is down for 10 minutes at 2 AM on a Sunday. Market volumes are low, spreads are wide, your strategy has minimal exposure, and the positions you might have taken are immaterial. The PnL cost is close to zero. In the second scenario, the service is down for 2 minutes immediately after a CPI print when the S&P futures are moving 50 basis points. Your strategy wants to rebalance across seven instruments simultaneously. You cannot touch any of them. The PnL cost is not one-fifth of the first scenario's, as the minute count would suggest - it is 10 to 50 times higher, because the market is moving, your existing positions are accumulating mark-to-market losses, and you cannot hedge.
Error budgets measured in minutes-per-month aggregate these two scenarios identically. When you allow a deployment during low volatility (burning cheap budget) and accidentally extend into an earnings announcement (burning expensive budget), the monthly SLO report shows green while your traders are on the phone asking why you missed the move.
The second failure mode is more subtle. Standard error budgets encourage teams to “freeze” deployments when the budget is nearly exhausted. In trading, a freeze in quiet hours has zero cost - but a freeze that extends across a major macro event costs real money, because the fixes you need in place before the event cannot ship. The budget model gives the same weight to both.
The Correct Model: Session-Aware SLOs
The fix is to make SLOs session-aware. Instead of a single monthly measurement, define SLOs independently for trading sessions versus off-hours, and apply different budget weights to each.
At Upside, we segmented the trading calendar into four zones:
Zone 1: Active session, elevated volatility. Defined by VIX above 20 and within regular market hours. SLO target: 99.99% (about 4.3 minutes of budget per 30-day month). Budget weight: $500/minute of downtime equivalent.
Zone 2: Active session, normal volatility. Regular market hours, VIX below 20. SLO target: 99.95% (22 minutes). Budget weight: $100/minute.
Zone 3: Extended hours. Pre-market and after-hours. SLO target: 99.9%. Budget weight: $20/minute.
Zone 4: Maintenance window. 11 PM to 4 AM ET. SLO target: 98%. No budget weight - this is the deployment window.
This structure immediately changed our behavior. Deployments happened only in Zone 4. Changes that needed validation were staged in Zone 3 first. Zone 1 was strictly protected: no non-emergency changes, period.
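As a rough sketch of the zone table above, the targets and weights can be held in a small structure and turned into allowed-downtime minutes. The targets and dollar weights come from the zone list; the 30-day month basis and the names `Zone`, `ZONES`, and `budget_minutes` are illustrative assumptions, not what we ran in production.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: budgets are quoted against a full 30-day month (43,200 minutes).
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


@dataclass(frozen=True)
class Zone:
    name: str
    slo_target: float                 # e.g. 0.9999 for 99.99%
    cost_per_minute: Optional[float]  # dollars; None for the maintenance window


ZONES = {
    1: Zone("active session, elevated volatility", 0.9999, 500.0),
    2: Zone("active session, normal volatility", 0.9995, 100.0),
    3: Zone("extended hours", 0.999, 20.0),
    4: Zone("maintenance window", 0.98, None),
}


def budget_minutes(slo_target: float,
                   period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Allowed downtime in minutes for an SLO target over the period."""
    return (1.0 - slo_target) * period_minutes


if __name__ == "__main__":
    for zone_id, zone in ZONES.items():
        minutes = budget_minutes(zone.slo_target)
        print(f"Zone {zone_id} ({zone.name}): {minutes:.1f} min/month")
```

On a 30-day basis the four targets work out to roughly 4.3, 21.6, 43.2, and 864 minutes. If you measure each zone only against the minutes actually spent in that zone, pass a smaller `period_minutes` and the budgets shrink accordingly.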
What to Measure: Trading-Specific SLIs
The SLIs that matter for trading are different from the SLIs that matter for a web API. Here are the three I would instrument on any trading system first.
Order-to-ACK latency. The time from when the strategy sends an order to when it receives an exchange acknowledgement. This is a direct proxy for execution quality and also a system health indicator.
SLO: P99 order-to-ACK latency < 500ms during Zone 1 and Zone 2 sessions. Anything above 500ms suggests queuing, network degradation, or exchange-side issues.
Crucially, this SLO is P99, not P50. The P50 is usually fine. The P99 is what kills you - the 1-in-100 order that gets stuck in a queue during a volatile move, fills at a bad price, and creates an outsized loss.
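To make the P50-versus-P99 point concrete, here is a small sketch with made-up latencies. The sample data and the helper `percentile_nearest_rank` are hypothetical; a real system would read the percentile from histogram buckets rather than raw samples.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical order-to-ACK latencies in milliseconds:
# 97 orders ACK quickly, 3 get stuck in a queue during a volatile move.
latencies_ms = [12.0] * 97 + [850.0, 900.0, 1200.0]

p50 = percentile_nearest_rank(latencies_ms, 50)
p99 = percentile_nearest_rank(latencies_ms, 99)
slo_met = p99 < 500.0  # the 500 ms threshold from the SLO above

if __name__ == "__main__":
    print(f"P50={p50} ms, P99={p99} ms, SLO met: {slo_met}")
```

The median looks perfectly healthy here while the P99 is well past the 500ms threshold, which is exactly the situation a P50-based SLO would hide.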
Fill rate. The fraction of orders that receive either a fill or a rejection within 5 seconds of submission. An order that neither fills nor rejects within 5 seconds is a zombie - it might be live on the exchange, or it might have been dropped in transit. Your system does not know, and your position tracking is in an undefined state.
SLO: > 99.5% of submitted orders receive fill or reject within 5 seconds.
This sounds easy. It is not. Exchange connectivity issues, protocol-level acknowledgement drops, and network partitions all create zombie orders. At Akuna, we had a class of bugs for 18 months where a specific exchange would silently stop sending ACKs during connectivity hiccups while still accepting orders. The fill rate SLO would have caught this in minutes. We caught it after a manual position reconciliation three months later.
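A minimal sketch of how the fill rate SLI and zombie detection could be computed from order records. The `OrderRecord` shape and the function names are assumptions for illustration; the 5-second deadline is the one from the SLO above. Orders still inside their deadline are left out of the ratio because their outcome is not yet known.

```python
from dataclasses import dataclass
from typing import Optional

COMPLETION_DEADLINE_S = 5.0  # from the SLO above


@dataclass
class OrderRecord:
    order_id: str
    submitted_at: float                   # epoch seconds
    completed_at: Optional[float] = None  # time of fill or reject; None if nothing came back


def is_zombie(order: OrderRecord, now: float) -> bool:
    """True if the deadline has passed and no fill or reject has arrived."""
    return order.completed_at is None and now - order.submitted_at > COMPLETION_DEADLINE_S


def fill_rate(orders, now: float) -> float:
    """Fraction of decided orders that got a fill or reject within the deadline."""
    decided = 0
    on_time = 0
    for order in orders:
        if order.completed_at is not None:
            decided += 1
            if order.completed_at - order.submitted_at <= COMPLETION_DEADLINE_S:
                on_time += 1
        elif is_zombie(order, now):
            decided += 1  # zombies count against the SLO
    return on_time / decided if decided else 1.0


# Hypothetical orders observed at now = 100.0
orders = [
    OrderRecord("a", submitted_at=0.0, completed_at=0.2),  # on time
    OrderRecord("b", submitted_at=1.0, completed_at=7.5),  # late
    OrderRecord("c", submitted_at=2.0),                    # zombie
    OrderRecord("d", submitted_at=98.0),                   # still within deadline
]
zombies = [o.order_id for o in orders if is_zombie(o, now=100.0)]
rate = fill_rate(orders, now=100.0)
```

In this example only one of the three decided orders completed on time, and order "c" is the one that needs reconciliation against the exchange.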
Market data freshness. The age of the most recent market data tick for each instrument your strategy is trading.
SLO: < 100ms stale during Zone 1 and Zone 2 sessions.
A strategy trading on stale data is operating on a false model of the world. It will place orders that fill at prices the market has already moved away from. Stale data is more insidious than downtime because the system appears to be working: orders route, fills arrive, PnL is recorded. But the fills are systematically worse than they should be.
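The freshness check itself is simple arithmetic. The sketch below assumes you keep a map of the last tick's receive timestamp per instrument; the instrument names and the function `stale_instruments` are hypothetical. It does not distinguish a dead feed from an instrument that is legitimately quiet, which a real check would need to handle.

```python
STALENESS_LIMIT_MS = 100.0  # from the SLO above


def stale_instruments(last_tick_ts, now, limit_ms=STALENESS_LIMIT_MS):
    """Return {instrument: age_ms} for instruments whose latest tick is older than limit_ms."""
    stale = {}
    for instrument, tick_ts in last_tick_ts.items():
        age_ms = (now - tick_ts) * 1000.0
        if age_ms > limit_ms:
            stale[instrument] = age_ms
    return stale


# Hypothetical receive timestamps in epoch seconds.
now = 1_700_000_000.0
last_tick_ts = {
    "INSTR_A": now - 0.020,  # 20 ms old: fresh
    "INSTR_B": now - 0.350,  # 350 ms old: stale
    "INSTR_C": now - 0.090,  # 90 ms old: fresh
}
stale = stale_instruments(last_tick_ts, now)

if __name__ == "__main__":
    for instrument, age_ms in stale.items():
        print(f"{instrument} is stale by {age_ms:.0f} ms")
```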
The PnL-Aware Error Budget
Once you have session-aware SLOs, you can construct a PnL-equivalent error budget that makes the cost of reliability failures legible to business stakeholders.
The mechanics are straightforward. For each zone, you have an estimated dollar cost per minute of downtime. Multiply actual downtime minutes (in each zone) by the cost weight, sum across zones, and track this PnL-equivalent budget monthly.
The critical formula:
pnl_budget_remaining = monthly_budget_dollars - sum(downtime_minutes[zone] * cost_per_minute[zone])
At Upside, we set a monthly PnL budget in dollars. When the remaining budget reached $0, it triggered a change freeze regardless of what the time-based SLO showed. When a Zone 1 event happened (even just 2 minutes), it showed up as $1,000 of budget consumed - immediately visible to everyone, not hidden inside a rolling average.
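The formula above translates into a few lines of code. This is a sketch using the per-minute weights from the zone definitions; the $10,000 monthly budget and the downtime figures are made-up placeholders, and the function name mirrors the formula rather than any real tooling.

```python
# Dollar weights per minute of downtime, from the zone definitions above.
# Zone 4 carries no budget weight, so it is modelled as 0.
COST_PER_MINUTE = {1: 500.0, 2: 100.0, 3: 20.0, 4: 0.0}


def pnl_budget_remaining(monthly_budget_dollars, downtime_minutes, cost_per_minute=COST_PER_MINUTE):
    """monthly budget minus the sum over zones of downtime minutes times cost per minute."""
    consumed = sum(
        minutes * cost_per_minute[zone]
        for zone, minutes in downtime_minutes.items()
    )
    return monthly_budget_dollars - consumed


# Hypothetical month: a placeholder budget and some downtime in every zone.
MONTHLY_BUDGET_DOLLARS = 10_000.0
downtime = {1: 2.0, 2: 5.0, 3: 10.0, 4: 45.0}

remaining = pnl_budget_remaining(MONTHLY_BUDGET_DOLLARS, downtime)
change_freeze = remaining <= 0

if __name__ == "__main__":
    print(f"Remaining PnL budget: ${remaining:,.0f} (freeze: {change_freeze})")
```

Note how the 2 minutes in Zone 1 consume more budget ($1,000) than the 60 minutes spread across Zones 2, 3, and 4 combined ($700).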
This model also makes the business case for reliability investment obvious. If you’re burning $8,000/month in PnL-equivalent budget, that is the return on a dedicated SRE sprint. The conversation stops being about abstract uptime percentages.
Prometheus Recording Rules for These SLIs
Here is a working Prometheus configuration for tracking the three SLIs described above. These use recording rules to pre-compute the expensive aggregations.
# prometheus/rules/trading-slis.yml
groups:
  - name: trading_sli_recording
    interval: 30s
    rules:
      # Order-to-ACK latency P99 - requires histogram metric from your OMS
      - record: trading:order_ack_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(order_ack_latency_seconds_bucket{env="prod"}[5m])) by (le, exchange)
          )
      # Fill rate - orders that completed within 5s vs total
      - record: trading:fill_rate_5s:5m
        expr: |
          sum(rate(orders_completed_within_5s_total{env="prod"}[5m])) by (strategy)
          /
          sum(rate(orders_submitted_total{env="prod"}[5m])) by (strategy)
      # Market data staleness - max age across all instruments per feed
      - record: trading:market_data_staleness_max_ms:1m
        expr: |
          max(
            (time() - market_data_last_tick_timestamp_seconds) * 1000
          ) by (feed, exchange)
  - name: trading_slo_compliance
    rules:
      # SLO burn rate - how fast are we burning the latency SLO?
      # "1.0" = burning exactly at budget rate; ">1.0" = burning faster
      # Requires a histogram bucket boundary at exactly 0.5s.
      - record: trading:order_ack_slo_burn_rate:1h
        expr: |
          (
            1 - (
              sum(rate(order_ack_latency_seconds_bucket{env="prod", le="0.5"}[1h]))
              /
              sum(rate(order_ack_latency_seconds_count{env="prod"}[1h]))
            )
          ) / 0.001  # 0.001 = 1 - 0.999 SLO target (a strict P99 target would use 0.01)
      # Same burn rate over 5m and 6h - the alert rules reference both windows
      - record: trading:order_ack_slo_burn_rate:5m
        expr: |
          (
            1 - (
              sum(rate(order_ack_latency_seconds_bucket{env="prod", le="0.5"}[5m]))
              /
              sum(rate(order_ack_latency_seconds_count{env="prod"}[5m]))
            )
          ) / 0.001
      - record: trading:order_ack_slo_burn_rate:6h
        expr: |
          (
            1 - (
              sum(rate(order_ack_latency_seconds_bucket{env="prod", le="0.5"}[6h]))
              /
              sum(rate(order_ack_latency_seconds_count{env="prod"}[6h]))
            )
          ) / 0.001
      # Fill rate SLO compliance - rolling 1h window
      - record: trading:fill_rate_slo_compliance:1h
        expr: |
          sum(rate(orders_completed_within_5s_total{env="prod"}[1h]))
          /
          sum(rate(orders_submitted_total{env="prod"}[1h]))
Alerting on SLO Burn Rate
The burn rate alert pattern from the Google SRE workbook translates directly. Two windows, two thresholds:
Fast burn alert (page immediately): Burn rate > 14x over the last 1 hour. This means you will exhaust a 30-day SLO budget in about 2 days at the current rate. During an active session, this is almost always a real incident.
Slow burn alert (ticket): Burn rate > 3x over the last 6 hours. This means exhaustion in 10 days. Needs human attention but not an immediate page.
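The exhaustion times quoted for each threshold are just the SLO period divided by the burn rate. A short sketch, assuming a 30-day SLO window (the function names are illustrative):

```python
SLO_PERIOD_DAYS = 30  # assumption: the SLO is evaluated over a 30-day window


def days_to_exhaustion(burn_rate, period_days=SLO_PERIOD_DAYS):
    """Days until the whole budget is gone if the burn rate stays constant."""
    if burn_rate <= 0:
        return float("inf")
    return period_days / burn_rate


def budget_fraction_consumed(burn_rate, window_hours, period_days=SLO_PERIOD_DAYS):
    """Fraction of the period's budget consumed by burning at burn_rate for window_hours."""
    return burn_rate * window_hours / (period_days * 24)


if __name__ == "__main__":
    for rate, window_hours in ((14, 1), (3, 6)):
        days = days_to_exhaustion(rate)
        fraction = budget_fraction_consumed(rate, window_hours)
        print(f"{rate}x over {window_hours}h: exhausted in {days:.1f} days, "
              f"{fraction:.1%} of budget already burned")
```

At 14x the budget lasts a little over two days; at 3x it lasts ten. By the time either alert has seen its full window, roughly 2% of the monthly budget is already gone.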
# prometheus/rules/trading-slo-alerts.yml
groups:
  - name: trading_slo_alerts
    rules:
      - alert: OrderLatencyFastBurn
        expr: |
          trading:order_ack_slo_burn_rate:1h > 14
          and
          trading:order_ack_slo_burn_rate:5m > 14
        for: 2m
        labels:
          severity: page
          zone: active_session
        annotations:
          summary: "Order latency SLO burning at {{ $value }}x rate"
          description: >
            P99 order-to-ACK latency SLO is burning {{ $value }}x faster than budget.
            Current P99: {{ with query "trading:order_ack_latency_p99:5m" }}{{ . | first | value | humanizeDuration }}{{ end }}.
            Runbook: https://wiki.internal/runbooks/order-latency-burn
          # Prometheus templates have no arithmetic functions, so the math is done in PromQL via query
          pnl_impact: 'Estimated ${{ with query "trading:order_ack_slo_burn_rate:1h * 35" }}{{ . | first | value | humanize }}{{ end }}/hour at current burn rate'
      - alert: OrderLatencySlowBurn
        # 6h window on the left so that $value is the 6h burn rate
        expr: |
          trading:order_ack_slo_burn_rate:6h > 3
          and
          trading:order_ack_slo_burn_rate:1h > 3
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Order latency SLO slow burn - {{ $value }}x rate over 6h"
          # 720h * 3600 = seconds in a 30-day budget period; humanizeDuration expects seconds
          description: >
            SLO budget will be exhausted in {{ with query "720 * 3600 / trading:order_ack_slo_burn_rate:6h" }}{{ . | first | value | humanizeDuration }}{{ end }} at current rate.
      - alert: FillRateSLOBreach
        expr: trading:fill_rate_slo_compliance:1h < 0.995
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Fill rate below 99.5% SLO - {{ $value | humanizePercentage }}"
          description: >
            Orders are not completing within 5s. Check for exchange connectivity issues,
            zombie orders, or OMS queue saturation.
            Runbook: https://wiki.internal/runbooks/fill-rate-slo
      - alert: MarketDataStaleness
        expr: trading:market_data_staleness_max_ms:1m > 100
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Market data stale by {{ $value }}ms on {{ $labels.feed }}"
The "and" condition on both windows for the fast burn alert is important. A 1-hour window alone stays above the threshold long after a brief, severe spike has ended, so it keeps paging for an incident that is already over. Requiring the 5-minute window to also show elevated burn rate confirms the signal is real and current.
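The two-window logic can be sketched outside Prometheus as well. In the example below the event counts are invented, the function names are illustrative, and the 0.001 error budget is the value used in the burn rate recording rule.

```python
ERROR_BUDGET = 0.001  # same denominator as the burn rate recording rule


def burn_rate(bad_events, total_events, error_budget=ERROR_BUDGET):
    """Observed error ratio divided by the ratio the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget


def fast_burn_should_page(long_window, short_window, threshold=14.0):
    """Page only if both the long and the short window exceed the threshold.

    Each window is a (bad_events, total_events) tuple.
    """
    long_rate = burn_rate(*long_window)
    short_rate = burn_rate(*short_window)
    return long_rate > threshold and short_rate > threshold


# Hypothetical counts of orders slower than 500 ms versus all orders.
# Ongoing incident: the last hour and the last 5 minutes are both bad.
ongoing = fast_burn_should_page(long_window=(200, 10_000), short_window=(30, 1_000))

# Recovered spike: the hour still looks bad, the last 5 minutes are clean.
recovered = fast_burn_should_page(long_window=(200, 10_000), short_window=(0, 1_000))

if __name__ == "__main__":
    print(f"ongoing incident pages: {ongoing}, recovered spike pages: {recovered}")
```

The recovered case is the one the short window exists for: the 1-hour burn rate is still around 20x, but nothing is wrong any more, so the page is suppressed.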
Session Gating for Alert Routing
One more pattern that changed our on-call experience significantly: route alerts through a session gate. An alert that fires at 2 AM during Zone 4 should not page the on-call engineer - it should create a ticket for morning review.
This requires a metric that encodes the current trading zone. We exported this from a small service that read the market calendar:
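# Page only in Zone 1; everything else becomes a ticket
- alert: ZombieOrderDetected
  # increase() so the alert clears once new zombies stop appearing; a raw counter stays > 0 forever
  expr: increase(zombie_orders_total[5m]) > 0
  for: 1m
  labels:
    # Reads the zone metric at evaluation time; falls back to "page" if the metric is missing
    severity: '{{ with query "trading_session_zone" }}{{ if eq (. | first | value) 1.0 }}page{{ else }}ticket{{ end }}{{ else }}page{{ end }}'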
In practice, we used Alertmanager inhibit rules instead of dynamic severity:
# alertmanager/config.yml
inhibit_rules:
  - source_matchers:
      - alertname = "OffHoursMaintenanceWindow"
    target_matchers:
      - severity = "ticket"
    equal: []
A synthetic alert named OffHoursMaintenanceWindow fired only during Zone 4, and inhibited all ticket-severity alerts during that window. This meant the ticket queue stayed clean and engineers slept through the night when the system was doing expected maintenance-window work.
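For the zone metric itself, the classification logic is small. The sketch below is a simplified stand-in for the calendar service: it assumes US regular hours of 9:30 AM to 4:00 PM ET on weekdays, takes a wall-clock time already expressed in Eastern Time, and ignores holidays and half days, which a real market calendar would handle. The function name `trading_zone` is hypothetical; its return value is what a gauge like `trading_session_zone` would export.

```python
from datetime import datetime, time

# Assumptions: US regular hours in Eastern Time, no holiday calendar.
REGULAR_OPEN = time(9, 30)
REGULAR_CLOSE = time(16, 0)
MAINTENANCE_START = time(23, 0)  # Zone 4 runs 11 PM to 4 AM ET
MAINTENANCE_END = time(4, 0)
VIX_THRESHOLD = 20.0


def trading_zone(now_et: datetime, vix: float) -> int:
    """Return the zone (1-4) for a wall-clock time already expressed in Eastern Time."""
    t = now_et.time()
    # The maintenance window wraps around midnight.
    if t >= MAINTENANCE_START or t < MAINTENANCE_END:
        return 4
    is_weekday = now_et.weekday() < 5
    if is_weekday and REGULAR_OPEN <= t < REGULAR_CLOSE:
        # VIX of exactly 20 is treated as normal volatility here.
        return 1 if vix > VIX_THRESHOLD else 2
    return 3


if __name__ == "__main__":
    samples = [
        (datetime(2024, 3, 13, 10, 0), 25.0),  # Wednesday morning, elevated VIX
        (datetime(2024, 3, 13, 10, 0), 15.0),  # Wednesday morning, normal VIX
        (datetime(2024, 3, 13, 18, 0), 15.0),  # Wednesday after hours
        (datetime(2024, 3, 13, 2, 0), 15.0),   # Wednesday 2 AM
    ]
    for when, vix in samples:
        print(when.isoformat(), vix, "-> zone", trading_zone(when, vix))
```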
How This Breaks in Production
Failure mode 1: SLO target set without session weighting. Symptom: the SLO dashboard shows green while traders are complaining about missed fills during a volatile session. Root cause: the SLO is measuring availability uniformly across all hours, so the small fraction of the month spent in Zone 1 is drowned out by near-perfect availability during quiet hours.
Failure mode 2: Error budget spent on Zone 4 deployments, exhausted before Zone 1 incident. Symptom: the change freeze kicks in during quiet hours, blocking a critical fix from being deployed before the next major macro event. Root cause: budget is not zone-weighted, so cheap Zone 4 burns and expensive Zone 1 burns are counted identically.
Failure mode 3: Fill rate SLO not tracked, zombie orders accumulate. Symptom: position tracking diverges from exchange state; a reconciliation run at end-of-day finds unexplained open positions. Root cause: no metric tracking order completion rate; zombie orders that are live on the exchange but orphaned in the OMS are invisible.
Failure mode 4: Market data freshness SLO missing. Symptom: strategy places orders that fill at prices worse than expected, P&L attribution shows consistent negative execution quality. Root cause: market data feed is silently stale but operational - ticks are arriving, but delayed. Without an explicit freshness SLO, the system appears healthy.
Failure mode 5: Burn rate windows too wide. Symptom: fast burn alert fires 30 minutes into an incident, by which time significant PnL has been lost. Root cause: the burn rate calculation uses only a 1-hour window; a short but severe spike doesn’t show up until it has been running long enough to shift the average.
Failure mode 6: No runbook link on alerts. Symptom: on-call engineer receives a page, spends 10 minutes in Slack asking what the alert means while the incident continues. Root cause: alerts are technically correct but lack the operational context needed to act quickly. Every page should have a single-sentence description of what to do first, a runbook link, and an estimated PnL impact. If you cannot write the one-sentence summary, the alert is not ready to page.