

Distributed Tracing for Trading Hot Paths: Sampling Strategies That Don't Distort the Signal

How a naive OpenTelemetry implementation added 15µs to ZeroCopy's signing hot path, and the head/tail sampling architecture that brought overhead to under 300ns.


When we added distributed tracing to ZeroCopy’s AWS Nitro enclave signing path, the first benchmark after instrumentation came back with a 15µs overhead per signing operation. The signing path itself was running in 8µs. We had more than doubled the latency of the operation by observing it.

This is not a theoretical problem. It is a practical one that every trading system hits the first time someone adds OpenTelemetry to a hot path without thinking through sampling. The solution is not to remove tracing - traces are indispensable for debugging production incidents on distributed systems. The solution is to understand the cost model of tracing and make deliberate decisions about when and how to sample.

OpenTelemetry Basics: What You’re Actually Paying For

A distributed trace is a tree of spans. Each span represents a unit of work: an HTTP request, a database query, a function call. Spans have a start time, end time, attributes, and a parent span ID that links them into the tree.

The work involved in creating and exporting a span breaks down into three phases:

Span creation: allocating the span object, capturing start time, setting attributes. On a modern server, this is 200-500ns if you use a no-op or low-overhead SDK implementation. In naive implementations using heap allocation per span, this can be 1-5µs.

Span recording: the work between span start and end - your actual business logic. Tracing adds no overhead here beyond any attribute or event calls you make while the span is open (the spans don’t change what happens).

Span export: serializing the span to the export format (OTLP protobuf), and either writing to a local buffer for async export, or making a synchronous network call. Synchronous export to a collector is 50-200µs per span. Async buffered export with a background exporter is close to the span creation cost: 200-500ns per span, amortized.

In our case, the 15µs overhead came from three sources: heap allocation per span (we were using a naive Go-style SDK in Rust), synchronous attribute serialization, and a mutex protecting the span buffer. After fixing all three, the overhead dropped to 280ns per span - acceptable even at 50µs target latency.

Why 100% Sampling is Unacceptable on Hot Paths

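For a web API handling 1,000 RPS, 100% trace sampling at 500µs overhead per request adds 500ms of CPU time per second - about half of one core. That is expensive but survivable when request latency is measured in milliseconds and the work spreads across many cores.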

For a trading system handling 100,000 orders per second at 50µs target latency, 100% sampling at 500µs overhead per span means 50 seconds of overhead per second of work. Physically impossible.

Even at 280ns per span (our optimized implementation), 100% sampling on a 100K ops/sec system adds 28ms of overhead per second. On a 24-core machine, that is 1.2ms per core per second - meaningful on a latency-sensitive hot path.
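The arithmetic in this section can be written down as a small budget calculation. This is a minimal sketch: the function names are illustrative, and it assumes sampled-out operations cost nothing, which slightly understates the real cost because the sampler check itself still runs.

```rust
/// Tracing overhead in nanoseconds of CPU time per wall-clock second.
/// `sample_rate` is the fraction of operations that actually create spans.
fn overhead_ns_per_sec(ops_per_sec: u64, spans_per_op: u64, ns_per_span: u64, sample_rate: f64) -> f64 {
    (ops_per_sec * spans_per_op * ns_per_span) as f64 * sample_rate
}

/// Fraction of one core consumed by that overhead (1.0 = a whole core).
fn core_fraction(overhead_ns_per_sec: f64) -> f64 {
    overhead_ns_per_sec / 1_000_000_000.0
}
```

With the figures from the text (100,000 ops/sec, one span per operation, 280ns per span), 100% sampling gives 28ms per second, or 2.8% of one core; at a 0.1% sample rate the same path costs about 28µs per second.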

The correct answer is sampling: trace only a fraction of requests, chosen intelligently so you still capture the information you need.

Head Sampling: Fast but Blind

Head sampling makes the sampling decision at the first span in the trace - typically at the entry point of your system. If a request is sampled (decision is “yes”), all subsequent spans in that request are also captured. If not sampled, no spans are created.

Head sampling is fast because the decision is a single random number check per request:

// Rust head sampling - custom sampler
// Note: import paths differ between opentelemetry crate versions; adjust to the version you pin.
use opentelemetry::trace::{Link, SpanKind, TraceContextExt, TraceId};
use opentelemetry::{Context, KeyValue};
use opentelemetry_sdk::trace::{SamplingDecision, SamplingResult, ShouldSample};

// Custom samplers implement the ShouldSample trait, which requires Debug and Clone
#[derive(Debug, Clone)]
pub struct RateSampler {
    rate: f64,  // e.g., 0.001 = 0.1%
}

impl ShouldSample for RateSampler {
    fn should_sample(
        &self,
        parent_context: Option<&Context>,
        _trace_id: TraceId,
        _name: &str,
        _span_kind: &SpanKind,
        _attributes: &[KeyValue],
        _links: &[Link],
    ) -> SamplingResult {
        // If there's an existing parent decision, respect it
        if let Some(ctx) = parent_context.filter(|cx| cx.has_active_span()) {
            let span = ctx.span();
            let parent = span.span_context();
            let decision = if parent.is_sampled() {
                SamplingDecision::RecordAndSample
            } else {
                SamplingDecision::Drop
            };
            return SamplingResult {
                decision,
                attributes: Vec::new(),
                // Carry the parent's trace state forward unchanged
                trace_state: parent.trace_state().clone(),
            };
        }

        // New root trace: apply rate
        let decision = if rand::random::<f64>() < self.rate {
            SamplingDecision::RecordAndSample
        } else {
            SamplingDecision::Drop
        };
        SamplingResult {
            decision,
            attributes: Vec::new(),
            trace_state: Default::default(),
        }
    }
}

Head sampling’s weakness: it makes the decision before you know anything about how the request will turn out. At 0.1% sampling rate, you will sample 1 in 1000 requests randomly. If an important event - a failed order, a latency spike - happens on one of the 999 unsampled requests, you have no trace. The failure is invisible.

For normal healthy-path requests, this is fine - you don’t need a trace for every successful order. But for debugging production incidents, head sampling is inadequate on its own.
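The random-number check can also be derived from the trace ID itself, so that every service seeing the same trace reaches the same decision without coordination. The following is a minimal sketch of that idea, not the SDK’s implementation: `sample_by_trace_id` is a hypothetical helper, and it assumes trace IDs are uniformly random.

```rust
/// Head-sampling decision derived from the trace ID rather than a fresh random number.
/// Assumes trace IDs are uniformly distributed.
fn sample_by_trace_id(trace_id: u128, rate: f64) -> bool {
    if rate >= 1.0 {
        return true;
    }
    if rate <= 0.0 {
        return false;
    }
    // Treat the low 64 bits of the ID as a uniform value in [0, 2^64).
    let low = trace_id as u64;
    // Roughly `rate` of all possible values fall below this threshold.
    let threshold = (rate * u64::MAX as f64) as u64;
    low < threshold
}
```

Because the decision is a pure function of the trace ID, it costs one multiply and one compare, and it is reproducible when replaying an incident.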

Tail Sampling: Smart but Expensive to Buffer

Tail sampling defers the sampling decision until after the trace is complete. The OTel collector buffers all spans for each trace for a configurable window (typically 30-60 seconds), and at the end of the window, applies a policy: keep this trace if it has any error spans, keep it if the root span duration exceeded a threshold, drop it otherwise.

This gives you exactly the traces you need: every error, every latency outlier, and a statistical sample of the happy path.
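The policy described above reduces to a short decision function evaluated once per completed trace. This sketch is illustrative only: `TraceSummary`, `Keep`, and `tail_decision` are hypothetical names, and the baseline sample is taken from the trace ID on the assumption that IDs are uniformly random.

```rust
#[derive(Debug, PartialEq)]
enum Keep {
    Error,
    LatencyOutlier,
    Baseline,
    Drop,
}

/// What the collector knows about a trace once its buffering window closes.
struct TraceSummary {
    trace_id: u128,
    has_error: bool,
    root_duration_us: u64,
}

/// Keep every error, every latency outlier, and a small share of healthy traces.
fn tail_decision(t: &TraceSummary, latency_threshold_us: u64, baseline_rate: f64) -> Keep {
    if t.has_error {
        return Keep::Error;
    }
    if t.root_duration_us > latency_threshold_us {
        return Keep::LatencyOutlier;
    }
    // Probabilistic baseline: low 64 bits of the trace ID against a rate threshold.
    let threshold = (baseline_rate.clamp(0.0, 1.0) * u64::MAX as f64) as u64;
    if (t.trace_id as u64) < threshold {
        Keep::Baseline
    } else {
        Keep::Drop
    }
}
```

The order of the checks matters: errors and outliers are tested first so that the probabilistic rule only ever applies to healthy traces.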

The cost is the buffer. If you have 100,000 ops/sec and a 30-second tail sampling window, the collector must buffer 3,000,000 span objects simultaneously. At 2KB per span (typical OTLP serialization size), that is 6GB of in-memory buffer. For a single collector instance, this is feasible but requires careful sizing.
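The sizing estimate is a straight multiplication, sketched here so the inputs can be varied. The function name is illustrative, and 2KB is taken as 2,000 bytes to match the figure in the text.

```rust
/// Approximate memory the collector needs to hold every span for the decision window.
fn tail_buffer_bytes(traces_per_sec: u64, spans_per_trace: u64, decision_wait_secs: u64, bytes_per_span: u64) -> u64 {
    traces_per_sec * spans_per_trace * decision_wait_secs * bytes_per_span
}
```

With one span per operation, `tail_buffer_bytes(100_000, 1, 30, 2_000)` gives 6,000,000,000 bytes. If only 0.1% of traces reach the collector (100 traces/sec), the same window needs about 6MB.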

OTel Collector tail sampling configuration:

# otel-collector-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 30s        # Buffer all spans for 30 seconds before deciding
    num_traces: 100000        # Max traces in buffer (tune based on your ops/sec)
    expected_new_traces_per_sec: 1000

    policies:
      # Keep ALL traces with errors
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Keep all traces where root span duration > 10ms (latency outliers)
      - name: latency-outlier-policy
        type: latency
        latency:
          threshold_ms: 10

      # Keep 0.1% of healthy traces for baseline visibility
      - name: happy-path-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 0.1

exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]

The Right Model for Trading: Layered Sampling

The correct architecture combines head and tail sampling for different use cases:

Hot path (order routing, signing, exchange connector): Head sample at 0.1% with a forced-sample flag for unusual conditions. The SDK should support a “force sample” mechanism: if the system detects it is about to handle something unusual (retry on a timeout, fallback to secondary exchange, unusual order size), it can override the head sampling decision and force a trace.

Error path: 100% sampling on any error. When an order submission fails, you want the full trace, and because errors should be rare the cost of capturing all of them is negligible. The outcome is not known when the head decision is made, so this relies on the force-sample mechanism above or on a path where every span reaches the collector.

Latency outliers: Tail sampling in the collector for any trace where root span duration > 2ms. On a system targeting P99 < 500µs, a 2ms trace is four times the P99 target and sits far out in the tail. You want all of them.

Background paths (position reconciliation, end-of-day settlement, monitoring): 10-50% sampling. These paths are not latency-sensitive, and traces here are genuinely useful for debugging data pipeline issues.
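The head-side part of this layering can be sketched as a single function: a per-path rate plus a force flag that bypasses it. The names are hypothetical, the 25% background rate is an assumption picked from the 10-50% range above, and trace IDs are assumed to be uniformly random.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum PathKind {
    HotPath,
    Background,
}

/// Per-path head sampling rate.
fn head_rate(path: PathKind) -> f64 {
    match path {
        PathKind::HotPath => 0.001,   // 0.1% on order routing, signing, connectors
        PathKind::Background => 0.25, // assumed value within the 10-50% range
    }
}

/// `force_sample` is set by the caller when it detects an unusual condition
/// (retry after timeout, fallback exchange, unusual order size).
fn should_trace(path: PathKind, force_sample: bool, trace_id: u128) -> bool {
    if force_sample {
        return true;
    }
    let threshold = (head_rate(path) * u64::MAX as f64) as u64;
    (trace_id as u64) < threshold
}
```

Latency outliers are deliberately absent here: duration is only known after the fact, so that layer belongs in the collector.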

// Rust - configuring layered sampling with opentelemetry_sdk
// Note: this pipeline builder matches older opentelemetry_otlp releases; newer
// releases changed the setup API, so check the version you pin.
use opentelemetry::KeyValue;
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::trace::{config, Sampler, Tracer};
use opentelemetry_sdk::Resource;

fn build_tracer(service_name: &str) -> Tracer {
    // ParentBased: child spans inherit the parent's decision, root spans ask the inner sampler.
    // HotPathSampler is our custom ShouldSample implementation (not shown here).
    let sampler = Sampler::ParentBased(Box::new(HotPathSampler {
        // 0.1% on hot paths - head sampling decision
        hot_path_rate: 0.001,
        // Errors and latency outliers are not visible at head-decision time:
        // they are covered by force-sampling and by tail sampling at the collector
    }));

    opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://otel-collector:4317"),
        )
        .with_trace_config(
            config()
                .with_sampler(sampler)
                .with_resource(Resource::new(vec![
                    KeyValue::new("service.name", service_name.to_string()),
                    KeyValue::new("service.version", env!("CARGO_PKG_VERSION")),
                ])),
        )
        // CRITICAL: use batch exporter, not simple (synchronous) exporter
        // BatchSpanProcessor buffers locally and exports async
        .install_batch(opentelemetry_sdk::runtime::Tokio)
        .expect("Failed to install tracer")
}

The install_batch vs install_simple choice is the single most important configuration decision for hot-path tracing. Simple (synchronous) export adds export latency to every instrumented operation. Batch export buffers locally and ships in the background, paying the export cost only in the background thread.
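The difference is easiest to see in a stripped-down model of a batch processor: the hot path does a non-blocking enqueue, and a background thread pays for the export. This is a sketch using only the standard library, not the SDK’s BatchSpanProcessor; all names are illustrative, and the timer-based flush a real processor has is omitted.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

struct FinishedSpan {
    name: &'static str,
    duration_ns: u64,
}

struct BatchHandle {
    tx: SyncSender<FinishedSpan>,
}

impl BatchHandle {
    /// Called on the hot path. Never blocks: if the queue is full the span is
    /// dropped and `false` is returned, trading completeness for latency.
    fn record(&self, span: FinishedSpan) -> bool {
        self.tx.try_send(span).is_ok()
    }
}

/// Spawns the background exporter. `export` stands in for the network call.
fn spawn_exporter<F>(capacity: usize, batch_size: usize, mut export: F) -> (BatchHandle, JoinHandle<()>)
where
    F: FnMut(Vec<FinishedSpan>) + Send + 'static,
{
    let (tx, rx) = sync_channel::<FinishedSpan>(capacity);
    let worker = thread::spawn(move || {
        let mut batch = Vec::with_capacity(batch_size);
        // recv() fails once every sender is dropped and the queue is drained.
        while let Ok(span) = rx.recv() {
            batch.push(span);
            if batch.len() >= batch_size {
                export(std::mem::replace(&mut batch, Vec::with_capacity(batch_size)));
            }
        }
        // Flush whatever is left on shutdown.
        if !batch.is_empty() {
            export(batch);
        }
    });
    (BatchHandle { tx }, worker)
}
```

Dropping spans when the queue is full is a deliberate choice in this sketch: on a latency-sensitive path, losing a trace is usually cheaper than blocking an order.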

Context Propagation Without Allocation

The standard OTel context propagation model passes a Context object through every function call. In idiomatic Go, this is func foo(ctx context.Context, ...). In Rust with async, it is either a thread-local or an explicit parameter.

Thread-local context is the approach most Rust OTel libraries use:

use opentelemetry::trace::{FutureExt, TraceContextExt, Tracer};
use opentelemetry::{global, Context};

async fn handle_order(order: Order) -> Result<Fill> {
    let tracer = global::tracer("order-router");

    // Create a span - a sampled-out span is a cheap non-recording span
    let span = tracer.start("handle_order");
    let cx = Context::current_with_span(span);

    // Do not hold a cx.attach() guard across an .await: the guard sets a
    // thread-local, and the task may resume on a different thread.
    // with_context() re-attaches the context each time the future is polled,
    // so child spans created inside route_to_exchange parent to this span.
    let fill = route_to_exchange(&order).with_context(cx).await?;

    // The span ends when the context holding it is dropped, after the await completes
    Ok(fill)
}

For the signing hot path in ZeroCopy’s enclave, we use a dedicated no-op tracer in production builds when tracing is disabled, and the real sampled tracer in environments where observability is needed. Because the tracer is a generic parameter, the Rust compiler’s monomorphization generates a separate copy of the hot path for the no-op type, and in release builds the empty calls are inlined away - zero overhead when disabled.
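A minimal sketch of that pattern follows. The trait and type names are hypothetical, and the checksum stands in for the real signing work; the point is that the hot path is generic over the tracer, so the no-op version compiles to a function with no tracing code in it.

```rust
use std::cell::Cell;

trait HotPathTracer {
    type Span;
    fn start(&self, name: &'static str) -> Self::Span;
}

/// Disabled tracing: the span is a zero-sized unit value and start() does nothing.
struct NoopTracer;

impl HotPathTracer for NoopTracer {
    type Span = ();
    #[inline(always)]
    fn start(&self, _name: &'static str) -> Self::Span {}
}

/// Stand-in for a real tracer: it only counts how many spans were started.
struct CountingTracer {
    started: Cell<u64>,
}

impl HotPathTracer for CountingTracer {
    type Span = &'static str;
    fn start(&self, name: &'static str) -> Self::Span {
        self.started.set(self.started.get() + 1);
        name
    }
}

/// Generic hot path: one compiled copy per tracer type.
fn sign<T: HotPathTracer>(tracer: &T, payload: &[u8]) -> u64 {
    let _span = tracer.start("sign");
    // Placeholder for the real signing operation.
    payload.iter().map(|b| *b as u64).sum()
}
```

Choosing between `sign(&NoopTracer, ..)` and a real tracer at the call site (for example behind a Cargo feature) keeps the decision at compile time, with no runtime branch on the hot path.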

How This Breaks in Production

Failure mode 1: Synchronous span export on hot path. Symptom: after adding tracing, P99 latency increases by 50-200µs and the increase is constant (not spiky). Root cause: simple (synchronous) span exporter is making a network call on every hot-path operation. Fix: switch to install_batch and a background exporter.

Failure mode 2: 100% head sampling on high-volume path. Symptom: OTel collector OOMs, trace backend is overwhelmed, Jaeger/Tempo becomes unavailable. Root cause: every order at 100K ops/sec is generating a span. Fix: reduce head sampling rate to 0.1-1% on hot paths; rely on force-sampling and tail sampling for error and outlier capture, keeping in mind that the collector can only keep traces that were head-sampled in the first place.

Failure mode 3: Tail sampling window too short. Symptom: tail sampling collector marks long-running traces as complete before they finish; traces appear truncated with missing child spans. Root cause: a complex order flow that takes 45 seconds (async fill, partial fills, amendments) is being cut off by a 30-second tail sampling window. Fix: increase decision_wait or add a forced-sample flag for flows known to run long.

Failure mode 4: Head sampling drops all errors on low-rate paths. Symptom: exchange connectivity errors are not appearing in traces despite being logged. Root cause: the exchange connector handles 10 ops/sec, head sampling is at 0.1%, and a 0.001 sampling probability misses 99.9% of the 5-10 errors that occur daily. Fix: the collector’s error policy can only keep spans it actually receives, so head-sample this low-volume path at 100% (or force-sample as soon as an error is detected), set span status to ERROR, and configure the collector’s error policy to keep all ERROR-status traces.

Failure mode 5: Span attributes on hot path with string allocation. Symptom: tracing appears low-overhead in benchmarks but causes allocator pressure in production (Rust has no garbage collector, but a heap allocation per span still costs time on the hot path). Root cause: span attributes are built with dynamic string formatting (format!("order_{}", order_id)) on every hot-path span, even sampled-out ones. The caller constructs the attribute key-value pairs before the SDK has a chance to discard them. Fix: check span.is_recording() before creating attributes, or use lazy attribute evaluation where the SDK supports it.
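The fix for failure mode 5 can be sketched with a closure that is only evaluated for recording spans. `SketchSpan` and `set_attribute_with` are hypothetical stand-ins, not SDK types; with a real SDK the same effect comes from guarding the attribute code with the span’s is_recording() check.

```rust
/// Minimal stand-in for a span that knows whether it was sampled.
struct SketchSpan {
    recording: bool,
    attributes: Vec<(&'static str, String)>,
}

impl SketchSpan {
    fn is_recording(&self) -> bool {
        self.recording
    }

    /// The closure runs only when the span is recording, so a sampled-out
    /// span never pays for format!() or the String allocation.
    fn set_attribute_with(&mut self, key: &'static str, value: impl FnOnce() -> String) {
        if self.is_recording() {
            self.attributes.push((key, value()));
        }
    }
}
```

A call such as `span.set_attribute_with("order.id", || format!("order_{}", order_id))` then costs a single branch on the 99.9% of spans that are sampled out.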

Failure mode 6: No trace context on async boundaries. Symptom: traces are fragmented - the order submission span exists, but the exchange ACK span is a separate root trace with no link to the submission. Root cause: async message passing (through a channel or queue) doesn’t automatically propagate the trace context. Fix: serialize the W3C traceparent header into the message, and restore context from it on the receiving side.
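For failure mode 6, the traceparent value is small enough to carry as a field in any message. The sketch below hand-rolls the version-00 format (`00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>`) for illustration; the `TraceParent` type is hypothetical, and in practice the SDK’s propagator would normally do this work.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TraceParent {
    trace_id: u128,
    span_id: u64,
    sampled: bool,
}

fn is_hex(s: &str) -> bool {
    s.bytes().all(|b| b.is_ascii_hexdigit())
}

impl TraceParent {
    /// Serialize on the sending side and attach the string to the message.
    fn to_header(&self) -> String {
        format!("00-{:032x}-{:016x}-{:02x}", self.trace_id, self.span_id, self.sampled as u8)
    }

    /// Restore on the receiving side. Returns None for anything malformed.
    fn parse(header: &str) -> Option<TraceParent> {
        let mut parts = header.split('-');
        let (version, trace, span, flags) = (parts.next()?, parts.next()?, parts.next()?, parts.next()?);
        if version != "00" || parts.next().is_some() {
            return None;
        }
        if trace.len() != 32 || span.len() != 16 || flags.len() != 2 {
            return None;
        }
        if !(is_hex(trace) && is_hex(span) && is_hex(flags)) {
            return None;
        }
        let trace_id = u128::from_str_radix(trace, 16).ok()?;
        let span_id = u64::from_str_radix(span, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;
        // All-zero IDs are invalid in the W3C format.
        if trace_id == 0 || span_id == 0 {
            return None;
        }
        Some(TraceParent { trace_id, span_id, sampled: flags & 0x01 == 0x01 })
    }
}
```

The receiver uses the parsed trace ID and span ID as the remote parent of its first span, which is what links the exchange ACK back to the original submission.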
