Cost Engineering for 24/7 Crypto Trading: Where the Money Actually Goes and How to Halve It

A line-by-line breakdown of crypto trading infrastructure costs: NAT gateway egress, cross-AZ transfers, RDS vs ClickHouse, reserved instances, and how to cut the bill by roughly two-thirds.

14 min
#cost-engineering #aws #trading-infrastructure #finops #clickhouse #reserved-instances

The bill arrived on the first of the month, as always. $3,247. That number had been approximately the same for six months. I had built the infrastructure for correct operation - not for cost. Every service was right-sized for the workload, the architecture was sound, the monitoring was comprehensive. And we were paying $3,200 a month for infrastructure that a well-optimized setup could run for $1,100.

The optimization took two months of incremental work. The result was a 66% reduction in monthly spend while maintaining the same trading performance. More importantly, I understood exactly where each dollar was going and why.

This post is a precise account of where crypto trading infrastructure money actually goes, and the optimizations that are real versus the ones that look good in a blog post but do not actually move your AWS bill.

The Actual Cost Breakdown

Before optimization, ZeroCopy’s $3,247/month broke down as follows:

| Service | Monthly Cost | Percentage | Notes |
| --- | --- | --- | --- |
| EC2 instances | $974 | 30% | 2x c6i.metal (reserved), 4x c6i.2xlarge (on-demand), monitoring |
| Data egress (AWS → internet) | $811 | 25% | Market data feed to strategy + order confirmations to ops dashboard |
| RDS PostgreSQL | $487 | 15% | db.r6g.xlarge, multi-AZ, 500GB storage |
| NAT Gateway | $487 | 15% | Data processing charges + hourly fees |
| CloudWatch + observability storage | $325 | 10% | Log storage, custom metrics, 90-day retention |
| Other (Route53, ACM, misc) | $163 | 5% | DNS queries, certificates, API calls |

Four of these six categories had significant optimization opportunities. EC2 and “other” were already reasonably optimized.
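The percentages follow directly from the line items. A minimal sketch, using only the figures quoted above (the dictionary keys are shorthand labels chosen for this example, not AWS billing identifiers):

```python
# Monthly line items from the pre-optimization bill, in USD.
line_items = {
    "EC2 instances": 974,
    "Data egress": 811,
    "RDS PostgreSQL": 487,
    "NAT Gateway": 487,
    "CloudWatch + observability storage": 325,
    "Other": 163,
}


def share_of_bill(items):
    """Return each category's share of the total as a whole-number percentage."""
    bill = sum(items.values())
    return {name: round(100 * cost / bill) for name, cost in items.items()}


total = sum(line_items.values())
shares = share_of_bill(line_items)

if __name__ == "__main__":
    # Print the categories largest first, the order in which they are worth attacking.
    for name, cost in sorted(line_items.items(), key=lambda kv: -kv[1]):
        print(f"{name:<36} ${cost:>5}  {shares[name]:>3}%")
    print(f"{'Total':<36} ${total:>5}")
```

Sorting by cost rather than by service name is the useful part: it makes clear that egress and NAT together outweigh compute.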

The NAT Gateway Shock ($487 → $47)

This is the optimization I explain to every team that comes to me with an AWS cost problem, because it is consistently the most surprising and most actionable.

AWS charges for NAT Gateway in two ways:

  1. Hourly fee: $0.045/hour per NAT Gateway = ~$32/month for one NAT Gateway in one AZ
  2. Data processing fee: $0.045/GB for all data that passes through the NAT Gateway

The hourly fee is visible and expected. The data processing fee is where teams get surprised.

For a crypto trading desk, the data flow that matters:

  • Inbound market data: Exchange WebSocket → NIC → application. No NAT involved - this is inbound. Free.
  • Outbound order submission: Application → Exchange API. Requires internet egress. If your strategy instances are in a private subnet (they should be), this traffic goes through NAT.
  • Outbound operational traffic: Prometheus metrics → operator dashboard, ArgoCD sync → GitHub, container pulls → ECR/DockerHub.

Our pre-optimization setup: all outbound traffic, including the exchange API calls and GitHub sync, routed through a single NAT Gateway. At our order volume, the exchange API traffic was 2-3 GB/day. The GitHub sync and container pulls were another 1-2 GB/day. Total: ~5 GB/day × $0.045/GB × 30 days = $6.75/month in data processing alone. That is actually small - the large number came from something else.

What was actually expensive: our operational dashboards were pulling Prometheus metrics from within the VPC and pushing them to an external Grafana Cloud instance. 20 GB/day of metric data × $0.045/GB × 30 days = $27/month. Plus the 25% data egress charge on top of NAT processing for the internet-bound portion.
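The two-part NAT pricing is easy to model per flow. A rough sketch using only the rates and volumes quoted above; the function and flow names are made up for illustration, and 730 hours per month is an assumed average:

```python
HOURS_PER_MONTH = 730      # assumed average number of hours in a month
NAT_HOURLY_USD = 0.045     # hourly fee per NAT Gateway, as quoted above
NAT_PER_GB_USD = 0.045     # data processing fee per GB, as quoted above


def nat_monthly_cost(gb_per_day, gateways=1, days=30):
    """Split a NAT Gateway bill into its fixed and traffic-driven parts."""
    hourly = NAT_HOURLY_USD * HOURS_PER_MONTH * gateways
    processing = gb_per_day * days * NAT_PER_GB_USD
    return {"hourly": hourly, "processing": processing, "total": hourly + processing}


# Daily volumes for the two flows described in the text.
flows_gb_per_day = {
    "exchange API + GitHub sync + container pulls": 5,
    "Prometheus push to Grafana Cloud": 20,
}

if __name__ == "__main__":
    for name, gb in flows_gb_per_day.items():
        cost = nat_monthly_cost(gb)
        print(f"{name}: ${cost['processing']:.2f}/month in data processing")
```

Running each flow through the same function separately is what exposes which one is worth rerouting; the hourly fee is identical regardless of traffic.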

The fix had three parts:

1. VPC Endpoints for AWS services. Services that live within AWS - S3, Secrets Manager, ECR, CloudWatch - do not need to go through the internet. VPC Interface Endpoints allow private connectivity to these services without egress through NAT or internet gateway. The endpoints cost $0.01/hour each ($7.20/month per endpoint), but eliminate all NAT data processing charges for that traffic.

# VPC Endpoints that eliminated our largest NAT flows
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.trading.id
  service_name = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"  # Free for S3 and DynamoDB
  route_table_ids = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.trading.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"  # $0.01/hour
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true  # Critical: allows existing DNS names to resolve to endpoint
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.trading.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "secrets_manager" {
  vpc_id              = aws_vpc.trading.id
  service_name        = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

2. Self-hosted Grafana instead of Grafana Cloud. We moved the Grafana instance inside the VPC on a t3.medium ($29/month). Prometheus scrapes targets within the VPC (no NAT), Grafana reads from Prometheus within the VPC (no NAT), operators connect via a bastion or VPN (one VPN connection = negligible cost). The 20 GB/day metric push to Grafana Cloud went to zero.

3. DigitalOcean for the operator dashboard. This is specific to ZeroCopy’s architecture: the Command Center (our Tauri operator interface) connects to a DigitalOcean droplet that hosts the non-latency-sensitive control plane services. DigitalOcean’s networking model is simpler and has no NAT Gateway equivalent - outbound traffic is included in the droplet bandwidth allocation (1-2 TB/month for typical droplet sizes). Moving operational services to DO eliminated a class of AWS egress charges entirely.

Result: NAT Gateway spend dropped from $487 to $47/month (one NAT for the remaining legitimate use cases - exchange API calls from private subnets that cannot go through VPC endpoints).
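Whether an interface endpoint pays for itself depends on how much traffic it takes off the NAT Gateway. The sketch below estimates the break-even volume from the rates quoted earlier ($7.20/month per endpoint, $0.045/GB NAT processing). AWS also bills interface endpoints per GB processed and per AZ; those rates are not quoted in this post, so they appear here as parameters, and the $0.01/GB used in the second call is an assumption for illustration only.

```python
def breakeven_gb_per_month(endpoint_monthly_usd=7.20,
                           nat_per_gb_usd=0.045,
                           endpoint_per_gb_usd=0.0,
                           azs=1):
    """GB/month above which an interface endpoint is cheaper than NAT processing."""
    saving_per_gb = nat_per_gb_usd - endpoint_per_gb_usd
    if saving_per_gb <= 0:
        raise ValueError("endpoint never pays for itself at these rates")
    # The fixed endpoint fee is paid once per AZ the endpoint is deployed in.
    return endpoint_monthly_usd * azs / saving_per_gb


if __name__ == "__main__":
    # Using only the rates quoted in this post.
    print(f"{breakeven_gb_per_month():.0f} GB/month")
    # Assumed per-GB endpoint fee, deployed in two AZs: a more conservative estimate.
    print(f"{breakeven_gb_per_month(endpoint_per_gb_usd=0.01, azs=2):.0f} GB/month")
```

On the post's rates alone, an endpoint breaks even at about 160 GB/month, roughly 5 GB/day for a single service. Gateway endpoints (S3, DynamoDB) have no hourly fee, so there is no break-even to compute for them.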

Data Egress: The Hidden Tax ($811 → $280)

AWS charges $0.09/GB for data transfer from EC2 to the internet. Within the same AZ, data transfer is free. Cross-AZ data transfer costs $0.01/GB.

For trading infrastructure, the most controllable egress is the choice of where your consumers live relative to your producers.

Market data distribution: If your strategy instance (in us-east-1a) consumes market data from your feed handler (also in us-east-1a), the traffic is free (same AZ). If your strategy instance is in us-east-1a and your feed handler is in us-east-1b (perhaps due to placement group constraints), you pay $0.01/GB cross-AZ. At 50 GB/day of tick data, that is $15/month in avoidable cross-AZ transfer.

Order book snapshots for recovery: Large position state transfers during failover. If your primary is in us-east-1a and your DR is in ap-southeast-1, this transfer costs $0.09/GB. A 500 MB position snapshot every failover test × 12 tests/year = 6 GB × $0.09 = $0.54 - negligible. But if you are doing frequent failover drills or continuous replication for DR, this adds up.

The egress that is genuinely unavoidable: Exchange API calls leave your VPC. Order submission, fill queries, account balance checks - all cross the internet gateway. At ZeroCopy’s order volume, this is 3-5 GB/month of unavoidable egress. At $0.09/GB, that is $0.27-$0.45/month. Not meaningful.
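The three cases above all reduce to volume times a per-path rate. A small sketch with the rates as quoted in this section; the path labels are invented for the example, and the `cross_region` entry simply reuses the $0.09/GB the post applies to the DR transfer:

```python
# USD per GB by network path, as quoted in this section.
RATE_USD_PER_GB = {
    "same_az": 0.00,
    "cross_az": 0.01,
    "internet": 0.09,
    "cross_region": 0.09,  # rate the post applies to the us-east-1 -> ap-southeast-1 example
}


def transfer_cost(gb, path):
    """Cost in USD of moving `gb` gigabytes over the given path."""
    return gb * RATE_USD_PER_GB[path]


tick_data_cross_az = transfer_cost(50 * 30, "cross_az")       # 50 GB/day for 30 days
failover_snapshots = transfer_cost(0.5 * 12, "cross_region")  # 500 MB x 12 drills per year
exchange_api_low = transfer_cost(3, "internet")               # 3 GB/month
exchange_api_high = transfer_cost(5, "internet")              # 5 GB/month

if __name__ == "__main__":
    print(f"cross-AZ tick data:   ${tick_data_cross_az:.2f}/month")
    print(f"failover snapshots:   ${failover_snapshots:.2f}/year")
    print(f"exchange API egress:  ${exchange_api_low:.2f}-${exchange_api_high:.2f}/month")
```

Keeping the rates in one table makes it cheap to re-run the estimate when a consumer moves to a different AZ or region.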

What was expensive before our optimization: our Command Center clients (desktop apps running on developer and operator machines) were connecting directly to our AWS services via the internet, pulling Prometheus metrics, log streams, and position data continuously. At 4 clients × 8 hours/day × 30 days × ~50 MB/hour per client = 48 GB/month of egress across all four clients; at $0.09/GB, that is $4.32/month in total ($1.08 per client).

The fix: route all Command Center traffic through the DigitalOcean droplet, which has included bandwidth. The DO droplet acts as a proxy/aggregator for the Command Center - it pulls relevant data from AWS (within AWS egress budget) and serves it to Command Center clients over the internet from DO. The AWS → DO transfer cost is $0.09/GB but the volume is much lower (aggregated, not raw streams). The DO → internet transfer is free within DO’s bandwidth allocation.

After all egress optimizations: $280/month - primarily the unavoidable exchange API and the reduced AWS → DO aggregation transfer.

Storage: ClickHouse vs TimescaleDB ($325 → $45)

Our original observability stack used TimescaleDB (PostgreSQL extension for time-series) to store trading metrics: order events, fill events, position snapshots, latency histograms. TimescaleDB is excellent - well-supported, standard SQL, good compression. We had 90 days of data stored at 400 GB uncompressed, compressed to 180 GB with TimescaleDB’s native compression.

The problem: 180 GB of PostgreSQL storage in RDS is expensive. RDS gp3 storage costs $0.115/GB/month; for 180 GB that is $20.70/month just for storage, plus the instance cost ($150-200/month for a db.r6g.large with enough memory for time-series query workloads).

We migrated trading metrics to ClickHouse running on a c6g.xlarge instance ($0.13/hour = $94/month). ClickHouse’s columnar storage compresses time-series trading data at ratios of 10:1 to 20:1 versus uncompressed. Our 400 GB of data that took 180 GB in TimescaleDB takes 22 GB in ClickHouse - 87% storage reduction. Query performance for time-range aggregations is 10-50x faster than equivalent TimescaleDB queries.
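The compression figures can be checked against each other. A short sketch using the three sizes quoted above; the helper names are made up for this example:

```python
UNCOMPRESSED_GB = 400
TIMESCALEDB_GB = 180
CLICKHOUSE_GB = 22


def compression_ratio(raw_gb, stored_gb):
    """How many GB of raw data fit into one GB on disk."""
    return raw_gb / stored_gb


def reduction(before_gb, after_gb):
    """Fraction of storage saved when moving from `before_gb` to `after_gb`."""
    return 1 - after_gb / before_gb


if __name__ == "__main__":
    print(f"TimescaleDB: {compression_ratio(UNCOMPRESSED_GB, TIMESCALEDB_GB):.1f}:1")
    print(f"ClickHouse:  {compression_ratio(UNCOMPRESSED_GB, CLICKHOUSE_GB):.1f}:1")
    print(f"ClickHouse vs TimescaleDB: {reduction(TIMESCALEDB_GB, CLICKHOUSE_GB):.1%} smaller")
```

The 87% figure is measured against the TimescaleDB footprint, not against the uncompressed data; against the raw 400 GB the ClickHouse ratio is about 18:1, inside the 10:1 to 20:1 range cited.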

The migration process:

-- ClickHouse schema for order events
-- Columnar storage with codec compression:
-- LZ4 for most columns, DoubleDelta for timestamps (excellent compression for sequential timestamps),
-- Gorilla for prices (designed for floating-point time series)
CREATE TABLE order_events
(
    event_time      DateTime64(6, 'UTC')         CODEC(DoubleDelta, LZ4),
    order_id        String                        CODEC(LZ4),
    instrument      LowCardinality(String),       -- LowCardinality = dictionary encoding for repeated values
    order_type      LowCardinality(String),
    side            LowCardinality(String),
    quantity        Float64                       CODEC(Gorilla),
    price           Float64                       CODEC(Gorilla),
    status          LowCardinality(String),
    exchange        LowCardinality(String),
    strategy_id     LowCardinality(String),
    latency_us      UInt32                        CODEC(DoubleDelta, LZ4),
    fill_price      Nullable(Float64)             CODEC(Gorilla),
    fill_quantity   Nullable(Float64)             CODEC(Gorilla)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (strategy_id, instrument, event_time)   -- Primary sort: matches 90% of queries
TTL toDateTime(event_time) + INTERVAL 90 DAY;   -- Auto-expire old data; toDateTime() keeps the TTL valid on versions that reject DateTime64 here

Total observability storage after migration: ClickHouse 22 GB on c6g.xlarge with a local 50 GB gp3 volume = $94 (instance) + $5.75 (storage) = ~$100/month. The TimescaleDB RDS instance was eliminated entirely, saving $170-200/month. Net change: $325 → $100 for the full observability tier.

(The $45 figure in the headline is after also eliminating CloudWatch custom metrics by routing all metrics to Prometheus → ClickHouse, which removed $55/month in CloudWatch charges.)

EC2 Reserved Instances: 40% Discount for Persistent Workloads ($974 → $584)

The trading hot path (2x c6i.metal) was already on 1-year convertible reserved instances at ZeroCopy before the cost optimization sprint - this is standard practice for infrastructure that runs continuously. The savings here came from applying the same discipline to infrastructure we had been leaving on on-demand pricing.

Infrastructure that never turns off should never be on on-demand pricing. The optimization we had missed:

Feed multiplexers: Two c6i.xlarge instances running 24/7 to distribute market data. On-demand: $0.204/hr × 2 × 730 hr = $298/month. 1-year convertible RI: $0.136/hr × 2 × 730 hr = $198/month. Saving: $100/month.

NATS cluster: Three t3.medium instances, always running. On-demand: $0.0416/hr × 3 × 730 = $91/month. 1-year RI: $0.0264/hr × 3 × 730 = $58/month. Saving: $33/month.

Monitoring node (Prometheus + Grafana): One c6g.large, always running. On-demand: $0.068/hr × 730 = $50/month. 1-year RI: $0.043/hr × 730 = $31/month. Saving: $19/month.

Total EC2 savings from RI coverage: $152/month without changing a single line of code or architecture.
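The same arithmetic generalizes to any always-on fleet. A minimal sketch using the hourly rates quoted above (they are this post's figures, not a current price list) and an assumed 730 hours per month:

```python
HOURS_PER_MONTH = 730  # assumed average number of hours in a month

# (workload, instance count, on-demand $/hr, 1-year RI $/hr) as quoted above
workloads = [
    ("feed multiplexers (c6i.xlarge)", 2, 0.204, 0.136),
    ("NATS cluster (t3.medium)", 3, 0.0416, 0.0264),
    ("monitoring node (c6g.large)", 1, 0.068, 0.043),
]


def monthly_saving(count, on_demand, reserved, hours=HOURS_PER_MONTH):
    """USD saved per month by covering `count` always-on instances with an RI."""
    return count * (on_demand - reserved) * hours


savings = {name: monthly_saving(count, od, ri) for name, count, od, ri in workloads}
total_saving = sum(savings.values())

if __name__ == "__main__":
    for name, usd in savings.items():
        print(f"{name:<32} ${usd:7.2f}/month")
    print(f"{'total':<32} ${total_saving:7.2f}/month")
```

The unrounded total comes out at roughly $151/month; the $152 above results from rounding each line to whole dollars before summing.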

The convertible RI caveat: convertible RIs can be exchanged for a different instance family, generation, or size, as long as the new reservation is of equal or greater value. If AWS releases a c7i.metal that offers better performance per dollar, you can convert your c6i.metal RI rather than losing the reservation. This matters because the right instance type for trading evolves faster than a 1-year commitment period.

Graviton3 for Non-Latency-Sensitive Workloads

The c7g family (Graviton3) is consistently 20-40% cheaper than equivalent Intel x86-64 instances for the same memory and compute allocation. For workloads where ARM binary compatibility is not an issue, this is a straightforward cost optimization.

At ZeroCopy, we run these workloads on Graviton:

  • NATS cluster: t4g.medium (Graviton2) - 20% cheaper than t3.medium, fully compatible
  • Monitoring (Prometheus + Grafana): c6g.large - 25% cheaper than c6i.large
  • ArgoCD + Kubernetes control plane: m6g.large nodes - 20% cheaper than m6i.large
  • Database (non-trading): db.r6g.large - 20% cheaper than db.r6i.large

The workloads we do not run on Graviton:

  • Strategy execution (c6i.metal): x86-64 SIMD intrinsics in the Rust engine
  • Feed multiplexers (c6i.xlarge): same reason - hand-optimized order parsing routines

For any new service built in Go, Rust (without x86 intrinsics), or Python, Graviton should be the default choice. The ARM ecosystem in 2026 is mature - Docker images, all major open-source tools, and most language runtimes publish ARM builds.

The Complete Before/After

| Category | Before | After | Change |
| --- | --- | --- | --- |
| EC2 (reserved coverage) | $974 | $584 | -$390 |
| Data egress | $811 | $280 | -$531 |
| RDS PostgreSQL | $487 | $0 (migrated to CH) | -$487 |
| NAT Gateway | $487 | $47 | -$440 |
| Observability storage | $325 | $100 (ClickHouse) | -$225 |
| Other | $163 | $89 | -$74 |
| Total | $3,247 | $1,100 | -$2,147 (66%) |
| DigitalOcean (new) | $0 | $52 | +$52 |
| Net total | $3,247 | $1,152 | -$2,095 (65%) |
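A quick way to keep a table like this honest is to recompute the totals from the rows. A sketch with the figures above; the dictionary keys are shorthand for the table's categories:

```python
before = {
    "EC2": 974, "Data egress": 811, "RDS PostgreSQL": 487,
    "NAT Gateway": 487, "Observability storage": 325, "Other": 163,
}
after = {
    "EC2": 584, "Data egress": 280, "RDS PostgreSQL": 0,
    "NAT Gateway": 47, "Observability storage": 100, "Other": 89,
}
DIGITALOCEAN_USD = 52  # new monthly spend outside AWS

aws_before = sum(before.values())
aws_after = sum(after.values())
net_after = aws_after + DIGITALOCEAN_USD


def pct_reduction(old, new):
    """Percentage reduction from `old` to `new`, rounded to a whole number."""
    return round(100 * (old - new) / old)


if __name__ == "__main__":
    print(f"AWS only: ${aws_before} -> ${aws_after} ({pct_reduction(aws_before, aws_after)}%)")
    print(f"Net:      ${aws_before} -> ${net_after} ({pct_reduction(aws_before, net_after)}%)")
```

Reporting the net figure alongside the AWS-only figure matters here, because part of the saving was achieved by moving spend to another provider.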

The work required to achieve this was approximately 6 engineer-weeks spread over 2 months. The net saving of $2,095/month comes to roughly $25,000 per year; how quickly that repays the engineering time depends on your loaded labor cost.

How This Breaks in Production

VPC Endpoint routing and split-horizon DNS. After adding VPC endpoints with private_dns_enabled = true, some services that were previously routing to AWS APIs via the internet now route through the VPC endpoint. This changes the source IP they see on the AWS side - it is now your ENI IP rather than your public IP. If you have IAM conditions or security group rules that check for specific source IPs (common in tightly-controlled environments), the endpoint routing change will break those. Audit your security group rules and IAM policies for aws:SourceIp conditions before enabling private DNS on VPC endpoints.
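The audit can be partly automated. The sketch below scans an IAM policy document (already loaded as JSON) for `aws:SourceIp` conditions. The example policy, bucket name, and CIDR are made up for illustration; in practice you would feed it policies exported from your own account.

```python
import json


def find_source_ip_conditions(policy):
    """Return (statement index, operator, value) for each aws:SourceIp condition."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement is not always wrapped in a list
        statements = [statements]
    hits = []
    for index, statement in enumerate(statements):
        for operator, keys in statement.get("Condition", {}).items():
            for key, value in keys.items():
                if key.lower() == "aws:sourceip":  # condition key names are case-insensitive
                    hits.append((index, operator, value))
    return hits


# Hypothetical policy used only to exercise the function.
example_policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example-bucket/*",
     "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}}},
    {"Effect": "Allow", "Action": "secretsmanager:GetSecretValue", "Resource": "*"}
  ]
}
""")

if __name__ == "__main__":
    for index, operator, value in find_source_ip_conditions(example_policy):
        print(f"statement {index}: {operator} aws:SourceIp = {value}")
```

Statements it flags typically need a VPC-aware condition key such as `aws:SourceVpce` or `aws:VpcSourceIp` before private DNS is switched on.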

ClickHouse storage growth without TTL enforcement. The ClickHouse TTL clause in your table definition marks data for deletion, but the actual deletion happens during background merge operations - not immediately when data expires. If your disk fills up before the next merge applies the TTL, ClickHouse will stop accepting inserts. Set disk utilization alerts at 70% and explicitly trigger merges (OPTIMIZE TABLE order_events FINAL) during low-activity windows, rather than relying only on background merges.
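If you do not yet have disk alerts wired into your monitoring, a standalone check is a reasonable stopgap. A minimal sketch using only the Python standard library; `/var/lib/clickhouse` is the default data path for packaged ClickHouse installs and is an assumption about your layout, as is running this from cron or a similar scheduler:

```python
import os
import shutil

ALERT_THRESHOLD = 0.70             # the 70% alert level suggested above
DATA_PATH = "/var/lib/clickhouse"  # assumed data directory; adjust to your install


def disk_utilization(path):
    """Fraction of the filesystem holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_alert(utilization, threshold=ALERT_THRESHOLD):
    """True once utilization reaches the alert threshold."""
    return utilization >= threshold


if __name__ == "__main__":
    # Fall back to the current directory so the sketch runs on machines without ClickHouse.
    path = DATA_PATH if os.path.isdir(DATA_PATH) else "."
    used = disk_utilization(path)
    status = "ALERT" if needs_alert(used) else "ok"
    print(f"{path}: {used:.0%} used [{status}]")
```

The threshold is deliberately well below full: a forced merge needs free space to rewrite parts, so alerting at 90% leaves too little room to recover.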

Reserved Instance coverage gaps during scaling events. If your trading volume grows and you need to add a third c6i.metal instance, that instance runs at on-demand pricing (roughly 1.5x the RI rate at the discounts quoted above) until you either purchase a new RI or convert an existing one. Budget for the on-demand cost of any capacity you add during scaling events - do not assume your RI coverage applies to new instances.

Cross-AZ traffic from Graviton migration. When you migrate a service from t3.medium (launched in us-east-1a) to t4g.medium, the new instance may launch in a different AZ if availability is constrained. Any traffic between this service and services in us-east-1a now incurs $0.01/GB cross-AZ transfer. For high-throughput services (NATS cluster, feed multiplexers), pin new instances to the same AZ as their primary consumers via subnet ID specification in your Terraform configuration.

Spot interruption during RI migration. If you run any dev or staging capacity on spot instances to reduce cost, a spot interruption during a period when on-demand is unavailable (rare but possible) will leave you without that capacity. Do not put any component of the trading system’s critical path on spot instances - not even the monitoring infrastructure, because losing your observability during a market event is nearly as bad as losing the trading engine itself.

Graviton binary incompatibility discovered in production. The standard Docker multi-arch build process (docker buildx with --platform linux/amd64,linux/arm64) works for most images, but some images contain platform-specific native extensions (Python packages with C extensions, Java with JNI libraries). Before migrating a service to Graviton, verify the full container builds and runs correctly on ARM64 in staging. The failure mode - Exec format error or a missing .so file at runtime - is silent until the container crashes on startup.
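One cheap staging check for Python services is to import every native-extension package inside the ARM64 container before it takes traffic. A sketch using only the standard library; the module list is a placeholder you would replace with the C-extension packages your service actually loads:

```python
import importlib
import platform


def smoke_test_imports(module_names):
    """Try to import each module; return {name: error} for the ones that fail."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # ImportError, or OSError from a missing .so
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


if __name__ == "__main__":
    # Placeholder list for illustration only.
    modules = ["json", "zlib"]
    print("architecture:", platform.machine())
    failed = smoke_test_imports(modules)
    for name, error in failed.items():
        print(f"FAILED {name}: {error}")
    print("all imports ok" if not failed else f"{len(failed)} import(s) failed")
```

This catches the missing-.so class of failure. An image built for the wrong architecture fails earlier, with Exec format error, before the interpreter starts, so it still needs a plain container start test.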
