Case Study · HFT Exchange Infrastructure

Crypto trading infrastructure across 12+ exchanges at a Tier-1 HFT firm

June 2021–Sep 2022 · 12+ exchanges · AWS · Market-data normalization · Deployment automation · Observability

Anonymization note

The firm is a major proprietary trading firm. I worked there as Founding Cryptocurrency DevOps Engineer from June 2021 to September 2022. The firm is not named here; no P&L, trading strategy, or competitive details are disclosed.

A Tier-1 prop firm expanding into crypto had a deadline problem. The firm needed production trading infrastructure across a large number of exchanges before the next strategy cycle. Each exchange had a different API design, different rate-limit behavior, different reliability characteristics, and different outage patterns. There was no unified playbook for this.

The problem

Prop trading firms moving into crypto in 2021 faced an infrastructure problem that equity or futures firms hadn't encountered: there was no industry-standard connectivity layer. Each crypto exchange was its own protocol zoo. WebSocket implementations varied. Rate limits were inconsistently documented and frequently changed. Reliability ranged from "mostly up" to "degraded for hours during peak volume." Order routing semantics differed in ways that mattered to a trading strategy.

The firm needed infra that could connect to 12+ exchanges simultaneously, normalize market data into a consistent format, route orders reliably, handle exchange outages gracefully, and provide enough observability to diagnose a latency spike or missed fill without anyone on call guessing.

All of it needed to run 24/7. Crypto doesn't close.

The build

The infrastructure ran on AWS. This was a deliberate choice. The alternative was co-location at a specific exchange, which would optimize for one venue at the expense of every other. At the firm's scale in 2021, AWS let us place compute close to exchange matching engines by choosing regions and availability zones near each venue, use managed networking to handle the per-exchange TCP connection pools, and keep deployment and observability consistent across venues. The "get to colo" conversation was acknowledged as a later optimization, not the starting point.

I built a unified exchange connectivity layer that abstracted per-exchange API differences behind a consistent internal interface. Each exchange adapter handled that exchange's authentication, rate-limit management, WebSocket reconnection logic, and order state reconciliation. The strategy layer above didn't need to know which exchange it was talking to.
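As a rough illustration of that abstraction, here is a minimal Python sketch. Every name in it (ExchangeAdapter, OrderRequest, Router, the two fake venues and their symbol formats) is hypothetical and invented for this example; it is not the firm's code, and real adapters would also carry authentication, rate limiting, reconnection, and reconciliation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderRequest:
    """Venue-neutral order as the strategy layer would express it (hypothetical)."""
    symbol: str      # internal symbol format, e.g. "BTC-USD"
    side: str        # "buy" or "sell"
    quantity: float
    price: float


class ExchangeAdapter(ABC):
    """Hypothetical interface that every venue adapter implements."""
    name: str

    @abstractmethod
    def to_venue_symbol(self, symbol: str) -> str:
        """Translate the internal symbol into the venue's own format."""

    @abstractmethod
    def place_order(self, order: OrderRequest) -> str:
        """Send the order and return an order id."""


class _RecordingAdapter(ExchangeAdapter):
    """Test double: records what would be sent instead of calling a venue."""

    def __init__(self) -> None:
        self.sent = []

    def place_order(self, order: OrderRequest) -> str:
        venue_symbol = self.to_venue_symbol(order.symbol)
        self.sent.append((venue_symbol, order.side, order.quantity, order.price))
        return f"{self.name}-{len(self.sent)}"


class FakeVenueA(_RecordingAdapter):
    name = "venue_a"

    def to_venue_symbol(self, symbol: str) -> str:
        return symbol.replace("-", "")          # "BTC-USD" -> "BTCUSD"


class FakeVenueB(_RecordingAdapter):
    name = "venue_b"

    def to_venue_symbol(self, symbol: str) -> str:
        return symbol.replace("-", "_").lower()  # "BTC-USD" -> "btc_usd"


class Router:
    """The strategy layer talks to this; it never sees venue-specific details."""

    def __init__(self, adapters) -> None:
        self._adapters = {adapter.name: adapter for adapter in adapters}

    def submit(self, venue: str, order: OrderRequest) -> str:
        return self._adapters[venue].place_order(order)
```

With this shape, adding a venue means writing one more adapter class; the caller keeps issuing the same OrderRequest to Router.submit.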

Market data from all exchanges was normalized into a common schema: symbol format, timestamp precision, order book depth, and trade event structure. This normalization happened at the adapter boundary. By the time data reached any internal consumer, it looked the same regardless of source.
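A small sketch of normalization at the adapter boundary might look like the following. The two payload shapes, the field names, the "BASE-QUOTE" symbol convention, and the nanosecond timestamp are all assumptions made up for illustration; they do not describe any real exchange's messages or the firm's actual schema.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Trade:
    """Hypothetical common trade schema shared by all internal consumers."""
    venue: str
    symbol: str          # always "BASE-QUOTE", upper case
    price: Decimal       # Decimal avoids float rounding on prices
    quantity: Decimal
    side: str            # always "buy" or "sell"
    timestamp_ns: int    # always nanoseconds since the Unix epoch


def normalize_venue_a(msg: dict) -> Trade:
    """Invented venue A format: "BTC/USD", string numbers, millisecond timestamps."""
    base, quote = msg["pair"].split("/")
    return Trade(
        venue="venue_a",
        symbol=f"{base}-{quote}".upper(),
        price=Decimal(msg["price"]),
        quantity=Decimal(msg["size"]),
        side=msg["side"].lower(),
        timestamp_ns=msg["ts_ms"] * 1_000_000,
    )


def normalize_venue_b(msg: dict) -> Trade:
    """Invented venue B format: "btc_usd", boolean side flag, microsecond timestamps."""
    base, quote = msg["market"].split("_")
    return Trade(
        venue="venue_b",
        symbol=f"{base}-{quote}".upper(),
        price=Decimal(msg["px"]),
        quantity=Decimal(msg["qty"]),
        side="buy" if msg["is_buy"] else "sell",
        timestamp_ns=msg["ts_us"] * 1_000,
    )
```

Two differently shaped messages describing the same trade come out with identical symbol, side, price, and timestamp fields, so downstream code can compare them directly.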

Key engineering decisions

Build vs buy per exchange adapter

The main third-party crypto connectivity libraries in 2021 were general-purpose and optimized for retail usage patterns. They handled the basics but fell apart under HFT rate-limit budgets, had inconsistent error handling, and didn't expose the low-level connection state the ops team needed. We built adapters in-house for the high-volume venues and used library wrappers for lower-priority connections with additional hardening on top.

Rate-limit budget management

Exchange rate limits are per-IP, per-API-key, and in some cases per-endpoint with separate buckets. A naive implementation that fires requests until it receives an HTTP 429 is not acceptable in production. Each adapter tracked its rate-limit budget with conservative headroom, backed off as usage approached the limit, and surfaced current utilization to the monitoring layer, so any budget approaching capacity generated an alert before a request was rejected.
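One way to sketch a budget with headroom is a token bucket that refuses to spend its last slice. The token-bucket choice, the class name RateLimitBudget, and the parameters headroom and alert_at are assumptions for illustration; the source does not say which algorithm the adapters used, and real venues publish their own limit semantics.

```python
import time


class RateLimitBudget:
    """Token bucket that keeps a reserve unspent (hypothetical sketch)."""

    def __init__(self, capacity, refill_per_sec, headroom=0.25,
                 alert_at=0.8, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.reserve = capacity * headroom        # tokens we never spend
        self.usable = capacity - self.reserve     # tokens we allow ourselves
        self.alert_at = alert_at
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def _refill(self):
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now

    def try_acquire(self, weight=1):
        """Spend `weight` tokens, or return False so the caller can back off."""
        self._refill()
        if self._tokens - weight < self.reserve:
            return False
        self._tokens -= weight
        return True

    def utilization(self):
        """Share of the usable budget currently spent, for the monitoring layer."""
        self._refill()
        return (self.capacity - self._tokens) / self.usable

    def should_alert(self):
        """True once utilization crosses the alert threshold, before any refusal."""
        return self.utilization() >= self.alert_at
```

An adapter would hold one such budget per limit bucket (for example per API key or per endpoint), call try_acquire before each request, and export utilization as a gauge. The injectable clock is only there to make the behavior testable without sleeping.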

Exchange outage handling

Exchanges go down during peak volume. The connectivity layer needed to distinguish among three conditions: "the exchange is temporarily unavailable," "our connection is broken," and "the exchange is in a degraded state returning bad data." Each of these required a different response. Circuit breakers were set per-exchange and per-connection type. Strategy behavior when a venue was unavailable was configurable and tested before any live trading started.
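The distinction between those failure conditions, plus a breaker keyed per venue and connection type, could be sketched as below. The health signals (a status check, socket state, a crossed book, message staleness), the thresholds, and all names are assumptions chosen for the example; a real classifier would need venue-specific signals and an independent network path for the status check.

```python
import time
from dataclasses import dataclass
from enum import Enum


class VenueState(Enum):
    HEALTHY = "healthy"
    EXCHANGE_UNAVAILABLE = "exchange_unavailable"
    CONNECTION_BROKEN = "connection_broken"
    DEGRADED_DATA = "degraded_data"


@dataclass
class HealthSample:
    socket_connected: bool
    status_endpoint_ok: bool            # hypothetical: venue reports itself as up
    best_bid: float
    best_ask: float
    seconds_since_last_message: float


def classify(sample: HealthSample, stale_after: float = 5.0) -> VenueState:
    if not sample.status_endpoint_ok:
        return VenueState.EXCHANGE_UNAVAILABLE
    if not sample.socket_connected:
        # The venue says it is up, so the fault is assumed to be on our side.
        return VenueState.CONNECTION_BROKEN
    crossed_book = sample.best_bid >= sample.best_ask
    stale = sample.seconds_since_last_message > stale_after
    if crossed_book or stale:
        return VenueState.DEGRADED_DATA
    return VenueState.HEALTHY


class CircuitBreaker:
    """Opens after N consecutive bad samples; allows a probe after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def record(self, state: VenueState) -> None:
        if state is VenueState.HEALTHY:
            self._failures = 0
            self._opened_at = None
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()   # open, or re-open after a failed probe

    def allow_trading(self) -> bool:
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown_s


# One breaker per (venue, connection type), matching the per-exchange and
# per-connection-type granularity described above. Keys are illustrative.
breakers = {
    ("venue_a", "market_data"): CircuitBreaker(),
    ("venue_a", "orders"): CircuitBreaker(),
}
```

Keeping classification separate from the breaker lets each state map to its own response, such as reconnecting on CONNECTION_BROKEN while only pausing quoting on DEGRADED_DATA.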

AWS over colo for initial deployment

The latency argument for colo is real but context-dependent. For a firm just entering crypto with 12+ venues to connect simultaneously, the operational overhead of managing colo in multiple exchange data centers was not justified. AWS gave consistent, measured latency across venues, a deployment model we already understood, and managed networking that reduced the operational surface area. The latency difference at this stage was in the millisecond range, acceptable until the strategy was validated and the volume justified the colo investment.

Observability for the latency path

Signing off on every deploy meant being able to answer three questions: What is the order-to-fill latency for each venue? Where is the 99th percentile, and what is driving it? When did it change? To answer them, Prometheus metrics covered every stage of the order path, from strategy signal to exchange acknowledgment, with per-venue p50/p95/p99 breakdowns. Any latency regression showed up in alerting before it affected trading results.
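To make the per-venue percentile breakdown concrete, here is a standard-library sketch that computes p50/p95/p99 from raw samples using the nearest-rank method. This is a stand-in for illustration only: a Prometheus setup would normally record histograms and compute quantiles at query time, and the class name, stage label, and venue names here are hypothetical.

```python
import math
from collections import defaultdict


class LatencyTracker:
    """Keeps raw latency samples per (venue, stage) and reports percentiles."""

    def __init__(self):
        self._samples = defaultdict(list)

    def observe(self, venue: str, stage: str, latency_ms: float) -> None:
        self._samples[(venue, stage)].append(latency_ms)

    def percentile(self, venue: str, stage: str, pct: int) -> float:
        data = sorted(self._samples.get((venue, stage), []))
        if not data:
            raise ValueError(f"no samples for {venue}/{stage}")
        # Nearest-rank: the smallest value with at least pct% of samples at or below it.
        rank = max(1, math.ceil(pct * len(data) / 100))
        return data[rank - 1]

    def summary(self, venue: str, stage: str) -> dict:
        return {f"p{p}": self.percentile(venue, stage, p) for p in (50, 95, 99)}


tracker = LatencyTracker()
for i in range(1, 101):
    # Synthetic data: 1 ms through 100 ms, purely to exercise the math.
    tracker.observe("venue_a", "order_to_ack", float(i))

print(tracker.summary("venue_a", "order_to_ack"))
# {'p50': 50.0, 'p95': 95.0, 'p99': 99.0}
```

Comparing summaries taken before and after a deploy, per venue and per stage, is the kind of check that answers "when did it change?" without guesswork.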

Outcomes

12+ · Centralized exchanges in production simultaneously
24/7 · Continuous operation with exchange-aware circuit breakers
1 · Unified connectivity interface across all venue adapters
AWS · Cloud infrastructure with per-venue latency observability

What this means for a client

Multi-exchange crypto trading infrastructure is not primarily a cryptography or algorithm problem. It's an operational problem. Rate limits, outage handling, data normalization, deployment consistency, observability: those are the pieces that determine whether a strategy can be tested and operated without the infra team firefighting every day.

I've built this from scratch at a firm that required production-grade reliability from day one. The failure modes are documented in working code, not in a design doc. What I bring to an engagement is direct experience: I've already been through the build-vs-buy decisions, the rate-limit incident at 3am, and the exchange-API change that broke normalization silently.

The productized version of this work is the HFT infrastructure audit: a forensic review of your trading infrastructure's latency path, failure modes, and operational gaps.

If your multi-exchange infra has gaps

Need an engineer who's built production trading infra at a Tier-1 prop shop?