On-Premise GPU vs Cloud for Trading AI: When the Math Tips

I have done this analysis twice for ZeroCopy and once for Upside (a systematic equities fund I advised in 2024). Each time the answer surprised the person who asked. The tipping point is not at the utilization level most firms assume, and the latency argument for on-premise is different from the cost argument. This post walks through both.

The short answer, before the math: if you are running GPU inference continuously for more than 8 hours per day, or if you are running GPU training on a continuous loop rather than scheduled jobs, on-premise is almost certainly cheaper within 18 months. But “cheaper” is not always the right frame. The latency argument is often more compelling than the cost argument, and it tips even earlier.

What Trading AI Actually Runs on GPUs

Before comparing costs, we need to be precise about what actually needs a GPU in a systematic trading operation.

Signal generation from high-frequency data. This is the most common use case. A mean-reversion strategy that operates on 5,000 instruments simultaneously needs to score each instrument against a model - usually LightGBM or XGBoost for tabular features derived from order book data and price history. LightGBM is CPU-bound for inference on tabular data. A well-tuned LightGBM inference serving 5,000 instruments at 100ms cadence runs comfortably on 8 CPU cores. No GPU needed.

Transformer models on time-series. Some strategies use sequence models - Transformers, LSTMs, S4 models - on OHLCV time series to generate alpha. These are GPU-bound for inference at the batch sizes needed for market-wide signal generation. A model scoring all S&P 500 components at minute cadence needs to process 500 sequences every 60 seconds. On a single A100 GPU, this takes roughly 200ms with a well-optimized inference stack. On a 16-core CPU, it takes closer to 4 seconds.

Reinforcement learning for execution decisions. RL-based execution agents (trained to minimize market impact by optimizing order placement timing) require GPU inference at relatively high frequency - each decision is a forward pass through the trained policy network. On a smaller GPU (RTX 4090 class), this is fast enough for executions at second-level cadence.

Training on continuous data streams. If you retrain your signal model every day (common for high-turnover strategies where the signal decays quickly), a GPU shortens the retraining window. Daily retraining of a gradient-boosted tree model on 12 months of minute-level data takes about 15 minutes on a 32-core CPU vs about 3 minutes with GPU acceleration via the XGBoost CUDA backend.

LLM inference for workflow agents. If you are running AI workflow agents (see the previous post in this series), you need LLM inference. For ZeroCopy’s agent workload - 30-minute cadence synthesis, producing perhaps 200-400 tokens per run - a hosted LLM API (Anthropic, OpenRouter) is dramatically cheaper than running your own GPU for LLM inference. The economics of LLM inference favor large providers running at massive scale. Do not run your own LLaMA for workflow agents unless you have a specific data confidentiality requirement.

The GPU-relevant workloads for most systematic trading firms are: transformer-based signal models and RL execution agents. The others either run on CPU (LightGBM, XGBoost tabular) or should run on hosted APIs (LLM workflow agents).

Cloud GPU Cost Reality

Cloud GPU pricing changes constantly, but here are the numbers that matter as of early 2026:

NVIDIA A100 80GB (SXM variant):

AWS p4d.24xlarge: 8x A100, $32.77/hr on-demand, ~$12-14/hr spot (highly variable)
Per-GPU on-demand: ~$4.10/hr
Per-GPU spot: ~$1.50-1.75/hr (with interruption risk)

NVIDIA H100 80GB (SXM variant):

AWS p5.48xlarge: 8x H100, available in limited regions
CoreWeave H100 SXM5: ~$4.76/hr per GPU
Lambda Labs H100 SXM5: ~$4.99/hr per GPU
Lambda spot-equivalent (H100 PCIe): ~$2.49/hr

NVIDIA H100 PCIe (commonly available):

RunPod: ~$2.29/hr spot
Vast.ai: ~$1.90-2.50/hr (varies by provider)
Lambda Labs: ~$2.49/hr

For a single A100 running 24/7:

On-demand: $4.10 × 24 × 30 = $2,952/month
Spot (assuming 70% availability due to interruptions): ~$900/month but you need a fallback

For a single H100 PCIe running 24/7 on CoreWeave reserved:

$2.49/hr reserved × 24 × 30 = $1,793/month

These are compute-only costs. Add storage, egress, and managed service overhead and the effective cost for a production-grade single-GPU inference cluster on cloud is closer to $2,500-4,000/month for an A100 and $2,200-3,500/month for an H100 PCIe (on third-party clouds).
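The monthly figures above are plain rate × hours arithmetic, so a small helper makes it easy to rerun them when prices move. The sketch below is illustrative only: the function names are hypothetical, the rates are the ones quoted in this section, and the fallback model (hours lost to spot interruptions are re-run at the on-demand rate) is an assumption, not a measurement.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # same 30-day month used in the figures above


def monthly_cost(rate_per_hour: float,
                 hours_per_day: float = HOURS_PER_DAY,
                 days: int = DAYS_PER_MONTH) -> float:
    """Compute-only monthly cost of one GPU billed by the hour."""
    return rate_per_hour * hours_per_day * days


def spot_with_fallback(spot_rate: float, fallback_rate: float,
                       availability: float) -> float:
    """Monthly 24/7 cost when spot covers `availability` of the hours and
    the remainder falls back to a pricier rate (assumed: on-demand)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    blended_rate = spot_rate * availability + fallback_rate * (1.0 - availability)
    return monthly_cost(blended_rate)


# Rates quoted in this section (USD per GPU-hour).
a100_on_demand = monthly_cost(4.10)
h100_pcie = monthly_cost(2.49)
a100_spot_only = monthly_cost(1.75) * 0.70  # billed only for available hours
a100_spot_plus_fallback = spot_with_fallback(1.75, 4.10, availability=0.70)

print(f"A100 on-demand, 24/7:        ${a100_on_demand:,.0f}/month")
print(f"H100 PCIe at $2.49/hr, 24/7: ${h100_pcie:,.0f}/month")
print(f"A100 spot, 70% available:    ${a100_spot_only:,.0f}/month")
print(f"A100 spot + fallback:        ${a100_spot_plus_fallback:,.0f}/month")
```

Under that fallback assumption, the 70%-available spot A100 lands near $1,770/month rather than ~$900, which is why the fallback matters when comparing against on-demand.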

On-Premise GPU Economics

The comparison point for on-premise is the purchase price plus operational overhead.

NVIDIA H100 SXM5 80GB (the current benchmark): ~$35,000-40,000 new, as of early 2026. Used H100s (refurbished from hyperscaler overbuy) are appearing at $22,000-28,000 but carry some reliability uncertainty.

NVIDIA A100 80GB SXM: Used market ~$12,000-16,000. New is largely discontinued in favor of H100/H200.

NVIDIA RTX 6000 Ada (professional workstation GPU): ~$7,000 new. 48GB VRAM. Adequate for transformer inference on 500-1,000 instruments at moderate cadence.

For colocation at a professional data center:

1U server with 1x H100: ~$1,200-1,800/month for a full 1U at a quality colocation provider (power, cooling, physical security, network)
2U server with 2x A100: ~$1,800-2,400/month for the server slot
Network bandwidth: typically 1Gbps included, 10Gbps for an additional ~$200-400/month

Hardware depreciation model (conservative 3-year straight-line for GPU assets in a volatile technology market):

H100 SXM5 purchase: $38,000
3-year depreciation: $38,000 / 36 = $1,056/month hardware cost
Colocation (1U, 10Gbps): $1,500/month
Power: ~$90/month budgeted (at $0.12/kWh, a GPU averaging ~500W with ~700W peak accounts for about $43 of this; the host server draws on top of that)
Network/uplink: included in colocation
Maintenance/insurance: $100/month estimate

Total on-premise H100 per month: ~$2,746/month

vs cloud H100 PCIe at $2,200-3,500/month.
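The depreciation model above can be written down as a small data structure so each input is explicit and easy to change. This is a minimal sketch: the `OnPremGpu` class and its field names are hypothetical, and the values are the ones listed in the model above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnPremGpu:
    purchase_price: float      # USD, up-front hardware cost
    depreciation_months: int   # straight-line schedule length
    colocation: float          # USD/month
    power: float               # USD/month
    maintenance: float         # USD/month (maintenance/insurance estimate)

    @property
    def depreciation(self) -> float:
        """Monthly hardware cost under straight-line depreciation."""
        return self.purchase_price / self.depreciation_months

    @property
    def operating(self) -> float:
        """Monthly cost that remains once the hardware is written off."""
        return self.colocation + self.power + self.maintenance

    @property
    def monthly_total(self) -> float:
        return self.depreciation + self.operating


h100 = OnPremGpu(purchase_price=38_000, depreciation_months=36,
                 colocation=1_500, power=90, maintenance=100)

print(f"Depreciation: ${h100.depreciation:,.0f}/month")
print(f"Operating:    ${h100.operating:,.0f}/month")
print(f"Total:        ${h100.monthly_total:,.0f}/month")
```

Keeping depreciation separate from operating cost matters later: only the operating part persists once the hardware is fully depreciated.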

At face value, the on-premise total is in the range of cloud costs. So where does on-premise win?

When the Math Tips

The break-even analysis has two distinct dimensions: utilization and amortization.

Utilization dimension. Cloud costs are proportional to hours used. On-premise costs are fixed (depreciation + colocation runs regardless of whether the GPU is used). The crossover for a single H100 at $2,746/month total:

Cloud H100 (CoreWeave reserved, ~$91/day): break-even is roughly 30 days per month of 24/7 usage - i.e., if you use the GPU for a full continuous month. This sounds like “you always pay the same” but it is not: CoreWeave reserved pricing is only available with 1-year or 3-year commitments, which start looking like hardware purchases.

Without reserved pricing (on-demand at $4.76/hr): if your H100 is only used 8 hours per day, you pay $4.76 × 8 × 30 = $1,142/month. On-premise at $2,746/month is more expensive for this utilization profile.

The crossover for utilization with on-demand pricing: $2,746 / ($4.76 × 24) = 24.0 days per month, or about 80% utilization. If your GPU workload is running more than 80% of the time, on-premise is cheaper at on-demand cloud rates. At the ~$2.49/hr spot-equivalent rate there is no crossover on compute cost alone: a full month costs $2.49 × 24 × 30 = $1,793, below the on-premise total, but it comes with interruption risk and the need for a fallback.

The 24/7 inference case. Signal generation for a live trading strategy runs continuously during market hours, with preprocessing runs overnight. For a firm trading 16 hours per day across Asian and US sessions: 16/24 = 67% utilization. This is below the on-demand crossover, so on-demand cloud is still cheaper ($4.76 × 16 × 30 = $2,285/month), but the economics are close and tip toward on-premise if you factor in amortization.
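The utilization crossover is a one-line division, and it is worth having it as a function so it can be rerun for any hourly rate. The sketch below uses hypothetical function names and the post's $2,746/month on-premise total and $4.76/hr on-demand rate; a result above 1.0 would mean the cloud rate never catches up with the fixed on-premise cost within a month.

```python
def breakeven_utilization(on_prem_monthly: float, cloud_rate_per_hour: float,
                          days: int = 30) -> float:
    """Fraction of the month a cloud GPU must be billed before it costs as
    much as the fixed on-premise total. Values above 1.0 mean no crossover."""
    return on_prem_monthly / (cloud_rate_per_hour * 24 * days)


def cheaper_option(on_prem_monthly: float, cloud_rate_per_hour: float,
                   hours_per_day: float, days: int = 30) -> str:
    """Compare the fixed on-premise cost with usage-billed cloud cost."""
    cloud_monthly = cloud_rate_per_hour * hours_per_day * days
    return "cloud" if cloud_monthly < on_prem_monthly else "on-premise"


ON_PREM_MONTHLY = 2746.0  # USD/month, from the depreciation model
ON_DEMAND_RATE = 4.76     # USD/hr, H100 on-demand

crossover = breakeven_utilization(ON_PREM_MONTHLY, ON_DEMAND_RATE)
print(f"On-demand crossover: {crossover:.0%} utilization "
      f"({crossover * 30:.1f} days per month)")

for hours in (8, 16, 20, 24):
    winner = cheaper_option(ON_PREM_MONTHLY, ON_DEMAND_RATE, hours)
    print(f"{hours:>2}h/day -> {winner}")
```

The 16-hour two-session profile comes out on the cloud side of the on-demand crossover, and the 20-hour and 24-hour profiles on the on-premise side.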

Amortization dimension. After 3 years, your on-premise H100 is fully depreciated. The hardware still runs. Your operational cost drops to about $1,700/month (colocation, power, and maintenance). Cloud costs have no equivalent step down: you keep paying the going rate. A firm that buys on-premise in year 1 and operates for 5 years pays:

On-premise 5-year total:
Year 1-3: $2,746/month × 36 = $98,856
Year 4-5: $1,700/month × 24 = $40,800
Total: $139,656

Cloud (CoreWeave H100 reserved, 1-year commitments):
Year 1: $2,200/month × 12 = $26,400
Year 2 (assume 10% price decrease): $1,980/month × 12 = $23,760
Year 3: $1,782/month × 12 = $21,384
Year 4: $1,603/month × 12 = $19,236
Year 5: $1,443/month × 12 = $17,316
Total: $108,096

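On this model, cloud wins at 5 years, and the assumed 10% annual price decreases are not what decides it: at a flat $2,200/month the cloud total is $132,000, still below on-premise. However, there is a counter-argument: the on-premise hardware can be resold at end of life. An H100 that is 3 years old in 2029 will have a market value - probably $8,000-15,000 if the H200/B200 generation has pushed H100s down market. The residual value materially changes the TCO: it brings the on-premise 5-year total down to roughly $125,000-132,000.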
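The 5-year comparison is sensitive to two assumptions, the annual cloud price decline and the resale value, so it helps to make both parameters. The sketch below uses hypothetical function names and the monthly figures from the tables above. The cloud total comes out a few dollars above the hand-computed figure because the tables round each year's monthly rate down to whole dollars.

```python
def on_prem_total(months: int, monthly_depreciating: float = 2746.0,
                  monthly_after: float = 1700.0, depreciation_months: int = 36,
                  resale_value: float = 0.0) -> float:
    """Cumulative on-premise cost, net of any resale value at the end."""
    depreciating = min(months, depreciation_months)
    written_off = max(months - depreciation_months, 0)
    return (depreciating * monthly_depreciating
            + written_off * monthly_after
            - resale_value)


def cloud_total(years: int, first_year_monthly: float = 2200.0,
                annual_decline: float = 0.10) -> float:
    """Cumulative reserved-cloud cost with a fixed yearly price decline."""
    return sum(first_year_monthly * (1.0 - annual_decline) ** year * 12
               for year in range(years))


print(f"On-premise, 5 years:             ${on_prem_total(60):,.0f}")
print(f"On-premise, $15k resale:         ${on_prem_total(60, resale_value=15_000):,.0f}")
print(f"Cloud, 10% yearly decline:       ${cloud_total(5):,.0f}")
print(f"Cloud, flat pricing:             ${cloud_total(5, annual_decline=0.0):,.0f}")
```

Varying `annual_decline` and `resale_value` together shows how much of the verdict rests on those two guesses rather than on the hardware price itself.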

The honest answer on pure economics: if you are running a single GPU continuously, cloud and on-premise are within 15-25% of each other over a 3-year horizon. Cloud wins on flexibility; on-premise only pulls ahead on total cost if cloud prices stay roughly flat and you hold the hardware past year 5.

The Latency Argument: Different from the Cost Argument

Here is where the analysis changes decisively for trading firms.

If your signal generation model runs on a cloud GPU at CoreWeave’s Atlanta data center, and your execution engine runs on an EC2 instance in us-east-1, every signal generation result pays the round-trip between those two points. CoreWeave Atlanta to AWS us-east-1: roughly 8-12ms on a good day.

For a strategy operating at 1-minute bars, 8ms is negligible. For a strategy operating at 1-second bars, 8ms is just under 1% of the bar - potentially meaningful for signals that decay quickly. For a strategy operating at 100ms bars, 8ms is 8% of the bar: a structural disadvantage.

The on-premise solution: GPU, execution engine, and market data feed all co-located in the same data center, connected by 10Gbps LAN. The latency from GPU inference output to order submission is bounded by the LAN round-trip (~50µs) plus the execution engine’s processing time, not by a wide-area network path between providers.
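One way to make the latency comparison concrete is to express each network hop as a share of the bar interval. The sketch below is illustrative: the function name is hypothetical, the 8ms figure is the low end of the cross-provider estimate above, and 50µs is the LAN round-trip quoted for co-location.

```python
def latency_share(latency_ms: float, bar_interval_ms: float) -> float:
    """Network latency as a fraction of the strategy's bar interval."""
    if bar_interval_ms <= 0:
        raise ValueError("bar interval must be positive")
    return latency_ms / bar_interval_ms


CLOUD_HOP_MS = 8.0   # cross-provider round-trip, low end of the 8-12ms estimate
LAN_HOP_MS = 0.05    # ~50 microseconds inside one data center

for label, bar_ms in (("1-minute", 60_000), ("1-second", 1_000), ("100ms", 100)):
    cloud = latency_share(CLOUD_HOP_MS, bar_ms)
    lan = latency_share(LAN_HOP_MS, bar_ms)
    print(f"{label:>8} bars: cloud hop {cloud:.3%} of bar, LAN hop {lan:.3%} of bar")
```

At 100ms bars the cloud hop consumes 8% of the interval while the LAN hop consumes 0.05%, a 160x difference that no pricing change addresses.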

Lynx Trading Technologies made this decision publicly in November 2025, moving their AI signal generation to an NVIDIA HGX B200 cluster co-located in Equinix NY4 (adjacent to their execution infrastructure). Their CTO described the key benefit as “eliminating the geography problem” - the signal and the execution decision live in the same building.

For any strategy with a signal decay time under 500ms, the latency case for on-premise GPU co-location is strong independent of the cost analysis.

The Cost Comparison Spreadsheet

Here is the analysis in tabular form for three operational scenarios:

Scenario 1: Moderate utilization (8h/day, 5 days/week = 24% utilization)
────────────────────────────────────────────────────────────────────────
Cloud H100 (on-demand):     $4.76 × 8 × 22 = $838/month
Cloud H100 (spot):          $2.49 × 8 × 22 = $438/month (with interruption risk)
On-premise H100:            $2,746/month

Verdict: Cloud wins decisively. On-premise is 3-6x more expensive.

Scenario 2: High utilization (20h/day, 7 days/week = 83% utilization)
────────────────────────────────────────────────────────────────────────
Cloud H100 (on-demand):     $4.76 × 20 × 30 = $2,856/month
Cloud H100 (reserved 1yr):  $2,200/month
On-premise H100:            $2,746/month

Verdict: Cloud reserved and on-premise are close (reserved is about
$550/month cheaper). On-premise wins after Year 3 when hardware is depreciated.

Scenario 3: Continuous inference (24/7, latency-sensitive)
────────────────────────────────────────────────────────────────────────
Cloud H100 (on-demand):     $4.76 × 24 × 30 = $3,427/month
Cloud H100 (reserved 1yr):  $2,200/month + latency overhead
On-premise H100:            $2,746/month, co-located with execution

Verdict: On-premise wins on economics against on-demand (marginal) and
on latency (decisive). Against reserved pricing, the co-location benefit
offsets the remaining cost gap.

The analysis that surprises most people is Scenario 2: at 83% utilization, reserved cloud pricing and on-premise are within roughly 20-25% of each other in the first three years, with reserved cloud the cheaper of the two. The common assumption is that cloud is always more expensive at high utilization; that is not true once you factor in reserved pricing. The on-premise advantage only becomes decisive after hardware is depreciated.

When to Buy vs. When to Rent

Cloud wins when:

Inference workloads are below 8 hours per day (Scenario 1)
The training job is scheduled and bursty, not continuous (a 4-hour daily training job is cheap on spot)
The firm is early-stage and capital constraints outweigh operational efficiency
The strategy is experimental and might be retired within 12 months
Latency requirements are above 500ms (cross-provider network overhead is acceptable)

On-premise wins when:

GPU inference is continuous (24/7 or near-24/7)
The strategy has sub-500ms signal decay (latency argument)
The firm has a 3+ year horizon and will fully depreciate hardware
Data confidentiality requires keeping inference data off third-party infrastructure
Multiple GPU workloads can be consolidated on one server (consolidation economics)

The consolidation point is important. A firm running three strategies that each need 8 hours of GPU inference per day can run all three on a single on-premise H100 with careful scheduling. The effective utilization of the hardware is 24/7 while each individual strategy workload would look like Scenario 1 in isolation.
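The consolidation argument depends on the workloads actually fitting into non-overlapping windows on one GPU. A minimal check might look like the sketch below; the `Window` type, the strategy names, and the three 8-hour windows are all hypothetical, and it assumes each window falls within a single calendar day.

```python
from typing import NamedTuple


class Window(NamedTuple):
    name: str
    start_hour: float  # inclusive, 0-24
    end_hour: float    # exclusive, 0-24


def daily_utilization(windows: list[Window]) -> float:
    """Return the fraction of the day the GPU is busy, or raise if any two
    windows overlap (they would contend for the same GPU)."""
    ordered = sorted(windows, key=lambda w: w.start_hour)
    for w in ordered:
        if not 0 <= w.start_hour < w.end_hour <= 24:
            raise ValueError(f"{w.name}: window must lie within one day")
    for earlier, later in zip(ordered, ordered[1:]):
        if later.start_hour < earlier.end_hour:
            raise ValueError(f"{earlier.name} overlaps {later.name}")
    return sum(w.end_hour - w.start_hour for w in ordered) / 24


schedule = [
    Window("strategy_a", 0, 8),
    Window("strategy_b", 8, 16),
    Window("strategy_c", 16, 24),
]
utilization = daily_utilization(schedule)

# Same three workloads rented separately at the on-demand H100 rate.
separate_cloud_monthly = len(schedule) * 4.76 * 8 * 30

print(f"Shared GPU utilization:       {utilization:.0%}")
print(f"Three on-demand rentals:      ${separate_cloud_monthly:,.0f}/month")
print("One shared on-premise H100:   $2,746/month")
```

On these assumptions, three separate on-demand rentals at 8 hours per day cost about $3,427/month, versus $2,746 for the shared on-premise H100, even though each workload alone looks like Scenario 1.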

A Note on Confidential Compute for GPU Workloads

There is an emerging intersection between the TEE themes from earlier in this series and on-premise GPU infrastructure. NVIDIA’s Confidential Computing technology (H100 and later) allows GPU workloads to run in a trusted execution environment, with memory encryption and attestation support. This means a signal generation model running on an on-premise H100 in Confidential Computing mode has the same IP protection properties as the enclave-based signing infrastructure described in Post 1 and Post 2.

For a firm whose primary IP is the signal generation model, this is significant: the model weights are encrypted in GPU memory, the computation is attested, and the owner can prove to an auditor that the model running today is the same model they registered. The convergence of on-premise GPU, confidential compute, and attestation-based IP protection is not theoretical - NVIDIA’s H100 and H200 support this today.

ZeroCopy’s roadmap includes extending attestation coverage from the signing enclave to the signal generation layer. The full chain will be: market data input → attested signal generation (on-premise H100 with CC) → attested risk check (Nitro Enclave) → attested signing (Nitro Enclave) → verified settlement. Every step in the chain will produce an attestation artifact that can be independently verified.

The next and final post in this series is the manifesto: why sovereign trading infrastructure is not just a ZeroCopy thesis but a structural shift in how the next generation of institutional trading will be built.
