Curriculum 7 posts · ~1.3h total

Observability & SLOs for Trading Systems

If your p99 dashboard says 50µs, your p99 is probably 2ms

Instrumentation and alerting for trading systems: SLO design with PnL-equivalent error budgets, Prometheus cardinality traps, HDR histograms vs. t-digest, distributed tracing for hot paths, and incident response.

What you'll master

  • SLO design with error budgets tied to PnL
  • Prometheus cardinality management
  • HDR histograms for accurate latency percentiles
  • Tail-based distributed tracing for hot paths
  • Alerting hygiene: from 40 pages/week to fewer than 5

Why this matters

The most dangerous systems are the ones that look healthy on dashboards while silently degrading. These seven posts document the observability patterns that catch the problems averages miss: HDR histograms over t-digest, SLOs tied to error budgets rather than uptime, and tail-based tracing for hot paths.

The Curriculum - 7 modules