Alerting Hygiene for a 24/7 Trading Desk: The Page-Tax and How to Pay It Down

From 40 pages/week to under 5 actionable pages: the alerting audit, 3-tier model, dead man's switch, and Prometheus alert rules that rebuilt Upside's on-call culture.

#alerting #sre #prometheus #on-call #trading-infrastructure #observability

The 3 AM page that changed how I design alerting wasn’t dramatic. It was a Prometheus alert: StrategyHeartbeatMissed. I woke up, SSH’d into the system, checked the strategy logs, found nothing wrong, confirmed the strategy was processing orders normally, and acknowledged the alert. Total time: 22 minutes.

The next night, same alert. Same outcome.

When I joined Upside as Head of Engineering, the on-call engineer was receiving 38-42 pages per week. Not 38-42 incidents - 38-42 individual alert notifications, each requiring at least a glance and usually 15-30 minutes of investigation. Some were real. Most weren’t. The cumulative effect was an on-call rotation that engineers came to dread: they started ignoring alerts and eventually stopped distinguishing between “this needs attention now” and “this is probably noise.”

When I left Upside, the count was 4 actionable pages per week. Here is everything I changed.

The Page-Tax Concept

Every false page is a tax on human attention. For a consumer web company with a large SRE team, this tax is merely inefficient - you hire more on-call engineers, you spread the load. For a 4-person engineering team running a 24/7 trading operation, human attention is your only non-renewable resource.

Attention has three properties that make the page-tax particularly destructive in trading contexts:

Depletion is permanent within a shift. An engineer paged at 2 AM who spends 20 minutes on a false positive is not at full cognitive capacity for the rest of the night. If a real incident happens at 3 AM, they are impaired.

Desensitization accumulates over weeks. After seeing the same false alert fire 50 times, engineers learn to dismiss it without investigation. When it fires once for a real reason, it gets dismissed too.

Trading incidents have hard time constraints. A web engineer can spend 5 minutes reading the alert before deciding to act. In trading, a real incident has a 2-minute window before it starts costing money. If the on-call engineer is conditioned to assume alerts are noise, those 2 minutes evaporate.

The goal of alerting hygiene is to make every page credible. When a page fires, the on-call engineer should be able to act on it immediately, with confidence that it represents a real problem.

The Alert Audit

The first thing I did at Upside was run an alert audit. I pulled 90 days of Alertmanager history and sorted every configured alert into one of four categories:

Actionable page: Required immediate human action, within 5 minutes. Something was actively wrong and would keep getting worse without intervention. These were the alerts I wanted.

Informational noise: Fired as a notification that something happened, but required no action. Examples: StrategyStarted, DeploymentComplete, DailyPnLSummary.

False positive: Fired, was investigated, and turned out to be fine. This included miscalibrated thresholds, alerts that needed a longer “for” duration, and alerts based on metrics with known transient behavior.

Stale alert: Had not fired in 90 days. Either the condition it was monitoring no longer existed, or it was misconfigured and never triggered.

At Upside, the breakdown was roughly: 12% actionable, 31% informational noise, 42% false positives, 15% stale. Only 12% of alerts were doing useful work.

The action after the audit was straightforward:

  • Delete all stale alerts. If it hasn’t fired in 90 days, it’s not monitoring anything real.
  • Move informational alerts to Slack notifications or weekly digests. Nothing that requires no action should page anyone.
  • Recalibrate false positive alerts. Each one needed either a higher threshold, a longer for duration, or a better query.
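The audit itself is easy to script once the history is exported. Here is a minimal sketch, assuming each alert has already been summarized into a small record; the `AlertRecord` schema, the majority-outcome rule for mixed alerts, and the alert name `LegacyQueueDepth` are all hypothetical choices for illustration, not what the original audit used.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlertRecord:
    """One alert plus what the 90-day history says about it (hypothetical schema)."""
    name: str
    times_fired: int       # total firings in the audit window
    needed_action: int     # firings where a human had to intervene
    investigated_ok: int   # firings that were investigated and found to be fine


def categorize(record: AlertRecord) -> str:
    """Assign an alert to one of the four audit categories."""
    if record.times_fired == 0:
        return "stale"
    if record.needed_action == 0 and record.investigated_ok == 0:
        # Fired, but nobody ever had to look at it.
        return "informational"
    # Assumption: mixed alerts are classified by their majority outcome.
    if record.needed_action >= record.investigated_ok:
        return "actionable"
    return "false_positive"


def breakdown(records: list) -> dict:
    """Return the percentage of alerts in each category."""
    counts = Counter(categorize(r) for r in records)
    return {category: round(100 * n / len(records)) for category, n in counts.items()}


records = [
    AlertRecord("StrategyStarted", times_fired=90, needed_action=0, investigated_ok=0),
    AlertRecord("StrategyHeartbeatMissed", times_fired=40, needed_action=1, investigated_ok=39),
    AlertRecord("DailyLossLimitBreached", times_fired=2, needed_action=2, investigated_ok=0),
    AlertRecord("LegacyQueueDepth", times_fired=0, needed_action=0, investigated_ok=0),
]
print(breakdown(records))
```

The labeling of each firing as "needed action" or "found fine" still has to come from humans; the script only makes the tally repeatable for the next quarterly review.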

The 3-Tier Alert Model

After the audit, I restructured every alert into exactly three tiers with clear contracts:

Page: Requires human action within 5 minutes of receipt. Fires outside business hours only if the problem cannot wait until morning. Every Page alert must have a runbook.

Ticket: Requires human action within 24 hours. Creates a Jira ticket automatically via Alertmanager webhook. Does not wake anyone up. On-call reviews Ticket alerts at the start of their shift.

Notification: Informational. Sent to a Slack channel. No action required. Reviewed weekly at the SRE sync.

The tier assignment for every alert is documented in the alert itself via a label:

labels:
  severity: "page"  # or "ticket" or "notification"
  tier: "1"         # SLO tier: 1=active session, 2=off-hours, 3=maintenance

Alertmanager routes based on these labels:

# alertmanager/config.yml
route:
  receiver: "default"
  routes:
    - matchers:
        - severity = "page"
      receiver: "pagerduty"
      continue: false

    - matchers:
        - severity = "ticket"
      receiver: "jira-webhook"
      continue: false

    - matchers:
        - severity = "notification"
      receiver: "slack-notifications"
      continue: false
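One consequence of label-based routing is that a mistyped severity falls through to the default receiver without any error. A minimal sketch of that first-match behavior, written as plain Python so it can run in a unit test; the `ROUTES` table mirrors the config above, and the `route` function is a simplified stand-in for illustration, not Alertmanager's actual matching code.

```python
# Ordered (matchers, receiver) pairs mirroring the routing tree above.
ROUTES = [
    ({"severity": "page"}, "pagerduty"),
    ({"severity": "ticket"}, "jira-webhook"),
    ({"severity": "notification"}, "slack-notifications"),
]
DEFAULT_RECEIVER = "default"


def route(labels: dict) -> str:
    """Return the receiver for an alert: first matching route wins (continue: false)."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    # No child route matched, so the alert lands on the root receiver.
    return DEFAULT_RECEIVER


print(route({"severity": "page", "tier": "1"}))   # pagerduty
print(route({"severity": "critical"}))            # default
```

The second call is the case worth guarding against: an alert labeled with a severity outside the three tiers is not dropped, it just goes wherever "default" points. That receiver should be somewhere a human will notice.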

Alert Design Rules

Every alert in the system must satisfy all of these criteria before it ships:

Rule 1: Every alert has a runbook link. The runbook link goes in the annotations.runbook_url field. It must point to a page that exists and is up to date. An alert without a runbook link is not allowed to page anyone.

Rule 2: Every page has a one-sentence summary of what to do first. The summary goes in annotations.summary. It is not a description of the problem - it is the first action the on-call engineer should take. “Check exchange connectivity dashboard, then verify position state.” Not “Order fill rate has dropped below SLO.”

Rule 3: Every alert must have fired at least once in the last 90 days. If it hasn’t, it is either misconfigured or monitoring something that never happens. Either way, it should not be in the active alerting config. This is enforced via a quarterly alert review.

Rule 4: Every alert has an estimated PnL impact. For trading systems, “something is wrong” is not enough context. “Fill rate below SLO - estimated $200/minute impact at current volume” gives the on-call engineer the information they need to escalate or handle independently.
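Rules 1 and 2 are mechanical enough to check in CI. A minimal sketch of such a check, assuming the rule files have already been parsed into Python dictionaries by whatever YAML loader the pipeline uses; the function name `lint_rule_groups` and the choice to enforce only `runbook_url` and `summary` on page-severity alerts are assumptions for illustration.

```python
REQUIRED_PAGE_ANNOTATIONS = ("runbook_url", "summary")


def lint_rule_groups(groups: list) -> list:
    """Return one message per page-severity alert that lacks a required annotation."""
    problems = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rule, nothing to check
            if rule.get("labels", {}).get("severity") != "page":
                continue  # only pages are held to the runbook contract here
            annotations = rule.get("annotations", {})
            for key in REQUIRED_PAGE_ANNOTATIONS:
                if not str(annotations.get(key, "")).strip():
                    problems.append(
                        f"{group.get('name', '?')}/{rule['alert']}: missing annotations.{key}"
                    )
    return problems


groups = [
    {
        "name": "heartbeat",
        "rules": [
            {"alert": "StrategyHeartbeatMissed",
             "labels": {"severity": "page"},
             "annotations": {"summary": "Check process health."}},
            {"alert": "FillRateSlowBurn", "labels": {"severity": "ticket"}},
            {"record": "trading:fill_rate_slo_burn_rate:1h", "expr": "1"},
        ],
    }
]
for problem in lint_rule_groups(groups):
    print(problem)
```

A CI job would exit non-zero whenever the returned list is non-empty. Adding "pnl_impact" to the required tuple would extend the same check to Rule 4.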

Multi-Window Alerting

The biggest reduction in false positives came from switching from single-threshold alerting to multi-window alerting.

A single-threshold alert fires when a metric exceeds a threshold continuously for the whole “for” duration. The problem: a brief spike lasting 10 seconds will trigger an alert with for: 0m but be invisible with for: 5m. Neither is right - 0m fires on every transient, and 5m misses real incidents that resolve within 5 minutes while delaying every other page by 5 minutes.

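Multi-window alerting solves this by requiring the metric to be elevated over two different windows at the same time. The long window shows that the problem is big enough to matter; the short window confirms that it is still happening right now, so the alert clears soon after recovery. A transient cannot lift the long window, and a resolved incident cannot hold the short one. Running a fast-burn pair (1 hour and 5 minutes) alongside a slow-burn pair (6 hours and 1 hour) then covers both sudden failures and sustained degradation.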
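The arithmetic is worth walking through once. A minimal sketch, assuming the usual definition of burn rate (observed failure ratio divided by the failure ratio the SLO allows); the 99% target and the order volume of 100 per minute are hypothetical numbers chosen to keep the math readable.

```python
def burn_rate(bad: float, total: float, slo_target: float) -> float:
    """Error-budget burn rate: observed failure ratio over the allowed failure ratio."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.0) -> bool:
    """Page only when both windows are burning faster than the threshold."""
    return long_window_rate > threshold and short_window_rate > threshold


SLO = 0.99  # hypothetical target: 99% of orders fill

# 10-second transient: about 17 failed orders, seen in both windows.
transient = should_page(burn_rate(17, 6000, SLO), burn_rate(17, 500, SLO))

# 20 minutes at a 50% failure rate, still ongoing.
ongoing = should_page(burn_rate(1000, 6000, SLO), burn_rate(250, 500, SLO))

# Same incident, but it ended more than 5 minutes ago.
recovered = should_page(burn_rate(1000, 6000, SLO), burn_rate(0, 500, SLO))

print(transient, ongoing, recovered)
```

The transient never lifts the 1-hour window, the ongoing incident trips both windows, and the recovered incident is released by the 5-minute window even though the 1-hour window is still elevated.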

# prometheus/rules/fill-rate-alerts.yml
groups:
  - name: fill_rate
    rules:
      # Fast burn: SLO burning 14x faster than budget over 1 hour
      # AND confirmed by 5-minute window (prevents transient fires)
      - alert: FillRateFastBurn
        expr: |
          (
            trading:fill_rate_slo_burn_rate:1h > 14
            and
            trading:fill_rate_slo_burn_rate:5m > 14
          )
        labels:
          severity: page
        annotations:
          summary: "Immediate action: check OMS connectivity to all exchanges"
          runbook_url: "https://wiki.internal/runbooks/fill-rate-fast-burn"
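          # Prometheus templates have no arithmetic functions, so the multiplication runs in PromQL via query.
          pnl_impact: '~${{ with query "trading:fill_rate_slo_burn_rate:1h * 50" }}{{ . | first | value | humanize }}{{ end }}/hour at current burn rate'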

      # Slow burn: SLO burning 3x faster than budget sustained over 6 hours
      # AND confirmed by 1-hour window
      - alert: FillRateSlowBurn
        expr: |
          (
            trading:fill_rate_slo_burn_rate:6h > 3
            and
            trading:fill_rate_slo_burn_rate:1h > 3
          )
        labels:
          severity: ticket
        annotations:
          summary: "Review OMS logs for rejected orders and partial fills"
          runbook_url: "https://wiki.internal/runbooks/fill-rate-slow-burn"

The recording rules trading:fill_rate_slo_burn_rate:5m, trading:fill_rate_slo_burn_rate:1h, and trading:fill_rate_slo_burn_rate:6h should be pre-computed - see the SLO post for the recording rule definitions.

The Dead Man’s Switch

The most important alert we added at Upside was also the simplest: the dead man’s switch.

A dead man’s switch (also called a watchdog or heartbeat alert) fires when you do NOT receive an expected signal. It is the inverse of every other alert in the system. All other alerts fire when something is wrong. The dead man’s switch fires when something has gone so wrong that the system can no longer tell you it’s broken.

For a trading system, silent failures are the most dangerous. A strategy that crashes and stops trading does not generate an order error rate spike - it generates zero orders. Zero orders is not an error condition in any individual metric. The dead man’s switch catches this:

# prometheus/rules/heartbeat-alerts.yml
groups:
  - name: heartbeat
    rules:
      # Strategy must emit a heartbeat metric every 60 seconds
      # If we go 5 minutes without seeing it, something is seriously wrong
      - alert: StrategyHeartbeatMissed
        expr: |
          (time() - strategy_last_heartbeat_timestamp_seconds) > 300
        for: 0m  # No grace period - heartbeat missing IS the incident
        labels:
          severity: page
        annotations:
          summary: "IMMEDIATE: Strategy {{ $labels.strategy }} has stopped emitting. Check process health."
          runbook_url: "https://wiki.internal/runbooks/strategy-heartbeat-missed"
          pnl_impact: "Strategy is not trading. All open positions are unhedged."

      # Exchange data feed must publish a tick every 10 seconds during market hours
      - alert: MarketDataFeedDead
        expr: |
          (time() - market_data_last_tick_timestamp_seconds) > 60
          and on() trading_session_active == 1
        for: 0m
        labels:
          severity: page
        annotations:
          summary: "IMMEDIATE: Market data feed {{ $labels.feed }} is stale. Strategy may be trading on old prices."
          runbook_url: "https://wiki.internal/runbooks/market-data-feed-dead"

The strategy’s responsibility is simple: set a gauge to the current Unix timestamp at least once every 60 seconds. The example below does it on every loop iteration:

# Python - strategy heartbeat emission
import time
from prometheus_client import Gauge

HEARTBEAT = Gauge('strategy_last_heartbeat_timestamp_seconds',
                  'Unix timestamp of last heartbeat',
                  ['strategy'])

class StrategyRunner:
    def run_main_loop(self):
        while self.running:
            self.process_market_data()
            self.evaluate_signals()
            self.submit_orders()

            # Heartbeat - must run every iteration
            HEARTBEAT.labels(strategy=self.name).set(time.time())

            time.sleep(0.1)  # 100ms cycle time

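If the strategy crashes, the heartbeat stops, and the Prometheus alert fires once the last heartbeat is more than 5 minutes old. This only works if the last timestamp stays visible to Prometheus after the process dies (for example, when a Pushgateway or a sidecar exports it). If Prometheus scrapes the strategy process directly, the series goes stale as soon as scrapes fail, the expression returns nothing, and the alert stays silent; in that setup, pair the rule with an up == 0 or absent() check. At 3 AM, the on-call engineer gets a page that says “IMMEDIATE: Strategy has stopped emitting. Check process health.” They SSH in, check systemctl status strategy, see it crashed, restart it. Total time: 3 minutes.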

Without the dead man’s switch, the first indication would have been the morning reconciliation showing zero fills overnight. By then, the PnL damage is done.
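The logic of the watchdog fits in a few lines, and writing it out exposes one subtlety: a comparison against the last timestamp can only flag strategies that still have a timestamp. A minimal sketch that also treats a missing heartbeat as stale by checking against an inventory of strategies that are supposed to be running; the function name, the inventory set, and the strategy names are hypothetical.

```python
def stale_strategies(last_heartbeat: dict, expected: set, now: float,
                     max_age_s: float = 300.0) -> set:
    """Return strategies whose heartbeat is older than max_age_s or missing entirely."""
    stale = set()
    for name in expected:
        timestamp = last_heartbeat.get(name)
        # A strategy with no recorded heartbeat is at least as broken as a late one.
        if timestamp is None or now - timestamp > max_age_s:
            stale.add(name)
    return stale


now = 1_000.0
last_heartbeat = {"momentum": 990.0, "mean_reversion": 600.0}
expected = {"momentum", "mean_reversion", "basis_arb"}

print(sorted(stale_strategies(last_heartbeat, expected, now)))
```

Here "mean_reversion" is flagged because its heartbeat is 400 seconds old, and "basis_arb" is flagged because it has never reported at all. The second case is the one a pure age comparison would miss.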

Complete Alert Rule File

This is the structure I use for a production trading alerting config:

# prometheus/rules/trading-alerts.yml
groups:
  - name: trading_slo_alerts
    rules:
      - alert: OrderLatencyFastBurn
        expr: |
          trading:order_ack_slo_burn_rate:1h > 14
          and
          trading:order_ack_slo_burn_rate:5m > 14
        labels:
          severity: page
          component: order_routing
        annotations:
          summary: "Check order router logs and exchange connectivity immediately"
          runbook_url: "https://wiki.internal/runbooks/order-latency"
          description: "P99 order ACK latency SLO burning at {{ $value | humanize }}x rate"

      - alert: FillRateFastBurn
        expr: |
          trading:fill_rate_slo_burn_rate:1h > 14
          and
          trading:fill_rate_slo_burn_rate:5m > 14
        labels:
          severity: page
          component: order_management
        annotations:
          summary: "Check OMS connectivity. Verify no zombie orders via position reconciliation."
          runbook_url: "https://wiki.internal/runbooks/fill-rate"

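      # Note: time() minus a scraped timestamp includes scrape lag, so a 500 ms
      # threshold is only meaningful if the scrape interval is well under 500 ms
      # (or the feed handler exports its own staleness gauge).
      - alert: MarketDataStaleness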
        expr: |
          max by (feed) (
            (time() - market_data_last_tick_timestamp_seconds) * 1000
          ) > 500
          and on() trading_session_active == 1
        for: 1m
        labels:
          severity: page
          component: market_data
        annotations:
          summary: "Reconnect market data feed {{ $labels.feed }}. Strategy may be trading stale."
          runbook_url: "https://wiki.internal/runbooks/market-data-staleness"

  - name: trading_heartbeat_alerts
    rules:
      - alert: StrategyHeartbeatMissed
        expr: (time() - strategy_last_heartbeat_timestamp_seconds) > 300
        for: 0m
        labels:
          severity: page
          component: strategy
        annotations:
          summary: "CRITICAL: Restart strategy {{ $labels.strategy }} - process may be dead"
          runbook_url: "https://wiki.internal/runbooks/strategy-heartbeat"

      - alert: RiskManagerHeartbeatMissed
        expr: (time() - risk_manager_last_heartbeat_timestamp_seconds) > 120
        for: 0m
        labels:
          severity: page
          component: risk
        annotations:
          summary: "CRITICAL: Risk manager down. HALT all new orders immediately."
          runbook_url: "https://wiki.internal/runbooks/risk-manager-down"

  - name: trading_circuit_breaker_alerts
    rules:
      - alert: DailyLossLimitApproaching
        expr: |
          strategy_daily_pnl_usd < -(strategy_daily_loss_limit_usd * 0.8)
        labels:
          severity: page
          component: risk
        annotations:
          summary: "Strategy {{ $labels.strategy }} at 80% daily loss limit - review positions"
          runbook_url: "https://wiki.internal/runbooks/daily-loss-limit"

      - alert: DailyLossLimitBreached
        expr: |
          strategy_daily_pnl_usd < -strategy_daily_loss_limit_usd
        for: 0m
        labels:
          severity: page
          component: risk
        annotations:
          summary: "HALT: Strategy {{ $labels.strategy }} has breached daily loss limit"
          runbook_url: "https://wiki.internal/runbooks/daily-loss-halt"

How This Breaks in Production

Failure mode 1: Alert with no runbook link fires at 3 AM. Symptom: on-call engineer receives a page with an opaque metric name and no context, and spends 15 minutes figuring out what it means while the incident continues. Root cause: the alert was written by an engineer who understood the metric, and it was merged without anyone requiring a runbook. Fix: enforce runbook_url in CI with a custom linter that fails the build when the field is missing; yamllint on its own checks only YAML syntax and style, not required fields.

Failure mode 2: False positive conditioning. Symptom: a real P1 incident goes unnoticed for 20 minutes because the on-call engineer assumed the alert was another false positive. Root cause: the same alert has fired as a false positive 30+ times in the last month. Engineers are conditioned to dismiss it. Fix: quarterly alert review; any alert with a false positive rate above 20% is recalibrated or demoted to ticket.

Failure mode 3: Dead man’s switch not implemented. Symptom: a strategy crashes silently at midnight, sends zero orders for 6 hours, and the first notification is the morning P&L report. Root cause: no heartbeat metric emitted, no watchdog alert. Fix: every strategy process must emit a heartbeat. The heartbeat alert is non-negotiable.

Failure mode 4: Single “for” duration on high-volatility metric. Symptom: during a normal market open with high order volume, 15 spurious “FillRateBelowSLO” pages fire within 2 minutes. Root cause: single-window threshold fires on the brief latency spike at market open before the system warms up. Fix: multi-window alerting requiring both fast and slow windows to confirm the signal.

Failure mode 5: Alertmanager silences not time-boxed. Symptom: an alert was silenced for a maintenance window two months ago with an end time far in the future, and the silence was never removed. A real incident occurs but the alert is silenced. Root cause: Alertmanager requires an end time on every silence but does not cap how far away it can be. Fix: cap silence duration by policy; long-term silences require a Ticket-severity audit.

Failure mode 6: Risk manager heartbeat missing but not paged. Symptom: risk limits are not enforced because the risk manager crashed, but no page was sent because the heartbeat alert was categorized as “ticket” severity. Strategy accumulates positions beyond the risk limit. Root cause: incorrect severity assignment on heartbeat alerts for safety-critical components. Fix: all heartbeat alerts for risk-critical components (risk manager, position tracker, order manager) are always Page severity with no exception.
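Failure mode 5 lends itself to a periodic audit. A minimal sketch, assuming the silences have already been fetched as a list of JSON objects in the shape returned by Alertmanager's GET /api/v2/silences endpoint (id, startsAt, endsAt, status.state); the 24-hour cap and the function names are hypothetical, and the timestamp parser handles only the simple RFC 3339 forms shown here.

```python
from datetime import datetime, timedelta


def parse_rfc3339(value: str) -> datetime:
    """Parse timestamps like 2024-01-01T00:00:00Z (older Pythons reject the Z suffix)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def overlong_silences(silences: list, max_duration: timedelta) -> list:
    """Return ids of active silences whose total duration exceeds max_duration."""
    flagged = []
    for silence in silences:
        if silence.get("status", {}).get("state") != "active":
            continue  # expired and pending silences are not hiding anything right now
        duration = parse_rfc3339(silence["endsAt"]) - parse_rfc3339(silence["startsAt"])
        if duration > max_duration:
            flagged.append(silence["id"])
    return flagged


silences = [
    {"id": "a", "startsAt": "2024-01-01T00:00:00Z", "endsAt": "2024-01-01T02:00:00Z",
     "status": {"state": "active"}},
    {"id": "b", "startsAt": "2024-01-01T00:00:00Z", "endsAt": "2024-06-01T00:00:00Z",
     "status": {"state": "active"}},
    {"id": "c", "startsAt": "2023-01-01T00:00:00Z", "endsAt": "2023-12-01T00:00:00Z",
     "status": {"state": "expired"}},
]
print(overlong_silences(silences, timedelta(hours=24)))
```

Run on a schedule, the flagged ids can feed the Ticket tier so that a forgotten silence gets a human review instead of quietly swallowing a page.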
