
Fixed-scope audit · 2 weeks · $10,000

Accepting 1 new Fractional · June 2026
30-40 hrs/wk contract · Q2-Q3 2026

HFT infrastructure audit

A forensic review of your trading infrastructure's latency path, failure modes, and operational gaps. Written findings with severity rankings. 2 weeks, fixed scope.

Most HFT teams instrument the happy path. They don't measure what happens to signing latency at 5x burst, whether their NUMA pinning is actually holding, or whether their p99 dashboards measure what they think they measure. When fills slow down or a failover misfires in production, they end up debugging under pressure without baseline data. This audit gives you that data before that moment.

A $2,000 deposit reserves your slot; the remainder is due on report delivery. The deposit is non-refundable after Day 1 kickoff. If the report is not delivered by Day 12, you owe nothing on the balance.

42µs signing latency · 15 years in production · 40+ Rust crates shipped · Ex-Gemini · Ex-Akuna

Why this reviewer

Founding crypto DevOps engineer at Akuna Capital (Jun 2021–Sep 2022). Built crypto trading infrastructure on AWS across 12+ exchanges. Dealt with signing latency, failover choreography, and cross-exchange connectivity at scale.

Senior SRE at Gemini Exchange (Nov 2020–May 2021). Shipped PTP on Solarflare hardware. $2B+/day throughput at 99.99% uptime. Owned the infrastructure reliability posture for a regulated exchange at that scale.

Lead Linux Infrastructure Engineer at FlexTrade (6 years, 2012–2018). 1,500+ Linux systems. SOC-2 and MiFID II compliance. Built and operated the infrastructure backbone for a multi-venue OMS/EMS platform.

Founder of ZeroCopy Systems. Rust signing pipeline in AWS Nitro Enclaves. Measured 42µs deterministic signing latency on the benchmark suite. Designed for production validation.

What you get.

Signing-path latency measurement under realistic load

I instrument your signing path with OpenTelemetry spans, replay your stated TPS profile, and measure p50/p90/p99 at normal load and at 2x and 5x burst. The numbers in the report reflect what your system does under pressure, not what a benchmark script achieves in isolation.
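As a rough illustration of how per-load percentiles can be summarized once span durations are exported, here is a minimal Python sketch. It assumes latencies arrive as plain lists of microsecond values keyed by load multiple; the `percentile` and `summarize` helpers and the sample data are hypothetical, and the nearest-rank method is one of several valid percentile definitions.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_us_by_load: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute p50/p90/p99 for each load multiple (for example '1x', '2x', '5x')."""
    return {
        load: {f"p{p}": percentile(samples, p) for p in (50, 90, 99)}
        for load, samples in latencies_us_by_load.items()
    }


# Hypothetical sample data in microseconds; real input would come from exported span durations.
runs = {
    "1x": [40.0] * 98 + [55.0, 60.0],
    "5x": [45.0] * 90 + [400.0] * 10,
}
report = summarize(runs)
```

In this made-up data the p50 and p90 barely move between 1x and 5x while the p99 changes by an order of magnitude, which is why the tail is reported per load level rather than as a single number.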

NUMA and CPU-isolation review

I verify whether your NUMA pinning is configured correctly and holding under load, whether your critical threads are actually isolated from OS noise, and whether your interrupt-coalescing and IRQ affinity settings are consistent with your latency targets. Misconfigured NUMA topology is one of the most common sources of unexplained p99 spikes.
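One way to sanity-check that pinning is holding is to compare a thread's allowed CPUs with the kernel's isolated set. The sketch below is a simplified illustration, not the audit tooling itself: it assumes a Linux host, reads `/sys/devices/system/cpu/isolated`, and handles only the plain `a-b,c` cpulist form. The helper names are hypothetical.

```python
from typing import Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse the kernel's basic cpulist format, e.g. '2-5,8' -> {2, 3, 4, 5, 8}."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_outside_isolation(allowed: Set[int], isolated: Set[int]) -> Set[int]:
    """CPUs a thread may run on that are NOT in the isolated set (ideally empty)."""
    return allowed - isolated


def check_thread(tid: int) -> Set[int]:
    """Linux-only: compare a thread's affinity mask with the kernel's isolated CPU list."""
    import os

    with open("/sys/devices/system/cpu/isolated") as fh:
        isolated = parse_cpulist(fh.read())
    return cpus_outside_isolation(os.sched_getaffinity(tid), isolated)
```

A non-empty result from `check_thread` for a latency-critical thread suggests the scheduler is free to place it on a core that also runs general OS work. Sampling the check repeatedly during a load run shows whether the affinity holds under pressure or only at startup.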

Kernel-configuration audit

Review of kernel parameters relevant to trading latency: scheduler settings, transparent huge pages, NOHZ_FULL configuration, network stack tuning (socket buffer sizes, TSO/GSO, Nagle). Not a generic Linux hardening checklist: focused on the settings that move p99 in trading workloads.
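To make the idea concrete, the following sketch checks a kernel command line for a small set of isolation parameters and reads the active transparent huge pages mode. The `EXPECTED` baseline and the sample command line are assumptions chosen for illustration; which parameters belong in a baseline depends on the workload.

```python
from typing import Dict, List


def parse_cmdline(cmdline: str) -> Dict[str, str]:
    """Split a kernel command line into key -> value (flags without '=' map to '')."""
    params: Dict[str, str] = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def missing_latency_params(cmdline: str, expected: List[str]) -> List[str]:
    """Return the expected parameters that are absent from the command line."""
    params = parse_cmdline(cmdline)
    return [name for name in expected if name not in params]


def active_thp_mode(sysfs_text: str) -> str:
    """Extract the bracketed active mode from /sys/kernel/mm/transparent_hugepage/enabled."""
    start = sysfs_text.index("[") + 1
    return sysfs_text[start:sysfs_text.index("]")]


# Hypothetical baseline and sample input; on a real host the text would come from /proc/cmdline.
EXPECTED = ["isolcpus", "nohz_full", "rcu_nocbs"]
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro isolcpus=2-7 nohz_full=2-7 quiet"
gaps = missing_latency_params(sample, EXPECTED)  # rcu_nocbs is not set in this sample
```

A presence check like this only catches missing parameters. Whether the CPU ranges in `isolcpus` and `nohz_full` match the cores the trading threads actually run on still has to be compared against the topology.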

Failover and circuit-breaker review

I trace your failover path end-to-end: what triggers it, how long it takes, what state is preserved or lost, whether your circuit breakers fire at the right thresholds, and whether a failover in production would behave the same way as a failover in staging. I'll also check whether failover events are observable: whether your on-call team would know it happened and whether the reason is logged.
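For readers who want a mental model of what "fires at the right threshold" and "the reason is logged" mean, here is a deliberately small sketch of a consecutive-failure circuit breaker. It is a hypothetical design for illustration, not a description of any client's failover logic; the class, field names, and threshold are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CircuitBreaker:
    """Opens after `threshold` consecutive failures and records why (hypothetical design)."""

    threshold: int
    consecutive_failures: int = 0
    is_open: bool = False
    events: List[Tuple[float, str]] = field(default_factory=list)

    def record(self, ok: bool, now: float, reason: str = "") -> None:
        """Feed one call result into the breaker."""
        if ok:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if not self.is_open and self.consecutive_failures >= self.threshold:
            self.is_open = True
            # Recording the trigger is what makes the event visible to on-call.
            self.events.append(
                (now, f"opened after {self.consecutive_failures} failures: {reason}")
            )


breaker = CircuitBreaker(threshold=3)
for t, ok in enumerate([True, False, False, True, False, False, False]):
    breaker.record(ok, now=float(t), reason="signing timeout")
```

In this run the two failures at t=1 and t=2 are reset by the success at t=3, so the breaker only opens at t=6. Threshold review is about exactly this kind of behavior: whether intermittent successes can keep a degraded path alive longer than intended.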

Observability gap analysis

Do your p99 dashboards reflect reality? I check whether your metrics are measuring at the right granularity, whether aggregation is hiding tail latency, whether your alerting thresholds have drift, and whether there are critical paths in your stack that have no instrumentation at all. "Our dashboards look fine" is not the same as "our dashboards are measuring the right things."

Prioritized findings report with severity rankings

Written report delivered as PDF and markdown. Each finding gets a severity level (critical / high / medium / low), an estimated latency or reliability impact, and a specific remediation step. You can hand the markdown directly to an engineer and they know what to build.

What the report looks like.

Sample table of contents. Actual depth depends on stack complexity.

HFT Infrastructure Audit | [Client] | [Date]

1. Executive summary

1.1 Critical findings (must fix before next trading day)

1.2 High findings (fix within 2 weeks)

1.3 Medium / low findings (backlog)

2. Signing-path latency

2.1 Instrumented measurements (p50/p90/p99 at 1x / 2x / 5x load)

2.2 Comparison vs TEE, CloudHSM, KMS baselines

2.3 Bottleneck analysis (where time is actually going)

3. CPU and NUMA topology

3.1 Current configuration

3.2 Deviations from latency-optimal settings

4. Kernel configuration

5. Failover and circuit-breaker review

5.1 Failover path trace

5.2 State preservation analysis

5.3 Circuit-breaker threshold review

6. Observability gaps

7. Remediation roadmap (severity-ordered)

8. Appendix: raw measurement data

Example finding (anonymized)

Finding #3 - SEVERITY: HIGH

Observability · Signing path

Description: The signing p99 metric in Grafana is computed from a 1-minute histogram bucket aggregation. At the firm's peak TPS, this aggregation window masks latency spikes shorter than 30 seconds. Three incidents in the trailing 6 months showed degraded fill rates with no corresponding alert. The p99 dashboard showed green throughout.

Remediation: Switch signing latency instrumentation to a sliding-window histogram with 5-second resolution. Instrument at the call site, not at the API boundary. Add an alert on the raw p99 sample series, not the aggregated average.
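The masking effect described in this finding can be reproduced with synthetic numbers. The sketch below is an illustration under stated assumptions: the data is made up (100 samples per second, one half-second spike), the helper names are hypothetical, and it uses fixed non-overlapping windows rather than a true sliding window to keep the code short.

```python
import math
from typing import Dict, List, Tuple


def p99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    return ordered[max(1, math.ceil(99 * len(ordered) / 100)) - 1]


def windowed_p99(points: List[Tuple[float, float]], window_s: float) -> Dict[int, float]:
    """Group (timestamp_s, latency_us) points into fixed windows and compute p99 per window."""
    buckets: Dict[int, List[float]] = {}
    for ts, latency in points:
        buckets.setdefault(int(ts // window_s), []).append(latency)
    return {index: p99(values) for index, values in sorted(buckets.items())}


# Synthetic data: 100 samples/s for 60 s at 50 us, with a 0.5 s spike to 5000 us at t=20 s.
points: List[Tuple[float, float]] = []
for i in range(6000):
    ts = i / 100.0
    latency = 5000.0 if 20.0 <= ts < 20.5 else 50.0
    points.append((ts, latency))

one_minute = windowed_p99(points, 60.0)  # one window: the spike is under 1% of its samples
five_second = windowed_p99(points, 5.0)  # twelve windows: the spike dominates window 4
```

With these inputs the single 1-minute window reports a p99 of 50 µs, while the 5-second window covering t=20 s reports 5000 µs. The same half second of degraded signing is invisible at one resolution and obvious at the other.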

What this is not.

Not an implementation engagement. The audit tells you what to fix and why. Executing the remediations is a separate engagement. If you want hands-on implementation after the report, that's a scoped project we can discuss.

Not a full security audit or penetration test. I review the threat model for your signing path and flag obvious security gaps, but this is primarily a latency and operational-reliability review. For a comprehensive security audit, engage a dedicated pen-test firm.

Not a full fit if you don't control the infrastructure layer. If you're fully managed (cloud-native with no OS access, shared exchange colocation with no tuning rights), there's nothing to instrument at the kernel level, so only the signing-path and failover review apply. The free 20-minute diagnostic will clarify quickly.

Not a trading strategy or alpha review. This is pure infrastructure: hardware, OS, network, signing, failover, observability. P&L, strategy, and model performance are out of scope.

2-week process.

Day 1

Kickoff

You share stack access (read-only preferred: SSH with limited sudo, Grafana read, architecture docs). I review your existing instrumentation, alert configs, and any incident post-mortems. We align on scope: which exchanges, which signing paths, which failover scenarios.

Day 2–3

Instrument and measure

I add OpenTelemetry spans to your signing path and run load replay at your stated TPS profile plus burst multiples. I collect p50/p90/p99 per signing operation and per exchange leg. I also pull your current kernel configuration and CPU topology.

Day 4–7

Analyze

I trace the NUMA topology and CPU isolation configuration against the measurement data. I review your failover logic end-to-end, including state machine, timeout values, and circuit breaker thresholds. I audit your kernel parameters against a latency-optimized baseline. I map observability gaps against the critical paths.

Day 8–12

Write findings and walkthrough

I write the findings report: each issue documented with evidence, severity, impact estimate, and remediation step. PDF and markdown delivered by end of Day 12. 60-minute walkthrough call with your engineering team scheduled on Day 12 or 13. Balance due on report delivery.

Questions.

What access do you need?

Read-only SSH access to a staging or shadow production environment, read access to Grafana/Prometheus, and your architecture documentation. I don't need access to live order flow, trading accounts, or exchange credentials. If staging doesn't exist, we work with a production shadow. I'll specify the exact access scope before kickoff so your security team can approve it cleanly.

We use a managed colocation provider. Does this still apply?

It depends on what you control. If you have dedicated hardware with OS-level access: yes, and that is where most of the tuning opportunities are. If you're fully managed (vendor-controlled OS, no tuning rights), the kernel and NUMA work won't apply, but signing-path instrumentation and failover review still do. We can settle this in the free 20-minute diagnostic, and I'll scope accordingly.

What's the refund policy on the $2,000 deposit?

Non-refundable after Day 1 kickoff. Before kickoff, I'll refund in full if you cancel at least 48 hours before the scheduled start. If I fail to deliver the report by Day 12, you owe nothing on the balance.

Who owns the report and instrumentation code?

You do. Full IP transfer on final payment. I retain the right to reference the engagement type in aggregate ("audited N HFT stacks"), but will not disclose your firm name, findings, or benchmark numbers without written permission. Any instrumentation code I write is yours to keep regardless.

Can I add scope mid-audit?

No. The 2-week window is fixed. Adding scope mid-audit risks the depth of what's already in flight. If you have additional exchanges, signing paths, or systems you want reviewed, we schedule a second engagement. Returning-client discount of 15% applies.

Know your actual numbers.

"Our stack is fast" is not a p99. A signed report with measured latency, failure mode analysis, and a severity-ranked remediation roadmap is something you can act on.

One slot per 2-week period. Reserve with the deposit.