Building a Self-Service Developer Platform for a 30-Person HFT Team
At Akuna with 500+ test environments, manual provisioning was killing iteration speed. How we built PR-triggered ephemeral environments at $50 each with auto-destroy after 48 hours.
In my third year at Akuna, we had a problem that was hard to see until you stepped back and counted the days. A strategy developer wanted to test a new market-making variant against real (but sandbox) exchange connections. The process was: file a ticket to the infrastructure team, wait 2-3 days for a test environment to be provisioned, discover it was misconfigured (wrong exchange version, missing market data feed, wrong risk limits), file another ticket, wait another day. If you were lucky and persistent, you had a working test environment in a week. Most developers just tested against production with reduced position limits, which is exactly as risky as it sounds.
We had 30 quant developers and infrastructure analysts. At any given time, there were 15-20 active strategy development threads. The infrastructure team of 3 was spending 60% of its time on environment provisioning requests. The developers were spending the other 60% waiting.
The solution - self-service ephemeral environments triggered by PR creation - took three months to build properly and eliminated the provisioning bottleneck completely. After launch, environment provisioning time went from 2-3 days to 8-12 minutes. The infrastructure team reallocated the recovered capacity to actual infrastructure work. This is how we built it.
The Core Insight: A Trading Environment Is Not a General-Purpose Environment
The instinct when building developer environments is to create a smaller version of production: same services, same topology, same everything, just cheaper instances. This works for web applications because production is a horizontally-scaled fleet of stateless services. It does not work for trading because a trading environment’s uniqueness is not in its topology - it is in its connections.
What makes a trading environment useful to a strategy developer:
- Exchange connectivity: Live but sandboxed exchange connections that behave exactly like production (same latency characteristics, same order types, same fill simulation)
- Market data subscription: Real market data feed (tick-by-tick), not replayed historical data
- Risk controls: Configured risk limits appropriate for testing (small position sizes, strict drawdown limits)
- Isolated position state: No sharing position state with other test environments - each environment has its own clean position book
What a trading environment does not need:
- Production-grade latency (sub-100µs)
- Bare metal instances
- High availability
- Persistent storage (ephemeral is fine - the exchange is the source of truth)
- Separate VPC (cost savings: can share a “dev” VPC across all environments)
This distinction matters because production-grade infrastructure is expensive. Each environment built as a production replica would cost $40-80/day - and with 48-hour auto-destroy, the actual cost per environment lifecycle is $50-80 total.
The Platform Architecture
+-------------------------------+
| Developer opens PR on GitHub |
+-------------------------------+
|
v
+-------------------------------+
| CI: parse PR for infra config |
| (reads .dev-env.yaml in PR) |
+-------------------------------+
|
v
+-------------------------------+
| Terraform: provision env |
| - 2x c6i.2xlarge (strategy |
| + oms on same node) |
| - Exchange API keys from vault|
| - Market data subscription |
| - Risk limits from defaults |
+-------------------------------+
|
v
+-------------------------------+
| CI: post environment URL to PR|
| comment with access details |
+-------------------------------+
|
v
+-------------------------------+
| Developer uses environment |
| (direct URL + API key) |
+-------------------------------+
|
v
+-------------------------------+
| 48h timer: auto-destroy |
| (or manual: PR close/merge) |
+-------------------------------+
The configuration file that developers include in their PRs:
# .dev-env.yaml (checked into the PR branch)
version: 1
environment:
  name: "mean-reversion-v3-test"  # Optional: defaults to PR number
  exchanges:
    - name: binance
      type: spot
      sandbox: true
      instruments:
        - BTC/USDT
        - ETH/USDT
    - name: bybit
      type: perpetual
      sandbox: true
      instruments:
        - BTCUSDT
  market_data:
    mode: live  # live | replay | both
    # If replay: specify historical date range for backtesting
    # replay_start: "2025-01-01"
    # replay_end: "2025-01-31"
  risk_config:
    max_position_usd: 1000  # Per instrument
    max_drawdown_pct: 5     # 5% drawdown = halt
    max_order_rate: 10      # Orders per second
  strategy:
    binary: strategies/mean-reversion-v3  # Relative path to compiled strategy
    config: strategies/mean-reversion-v3.toml
  ttl_hours: 48  # Default: 48. Max: 96. Requires manager approval > 96h.
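A CI step can reject a bad config before any infrastructure is created. The following is a minimal sketch of such a check, not the platform's actual validator. It assumes CI has already parsed the YAML and passes in a plain dict holding the exchanges, market_data, risk_config, and ttl_hours sections; the function name validate_env_spec and the error wording are hypothetical.

```python
MAX_TTL_HOURS = 96  # ceiling taken from the ttl_hours comment in the config
VALID_MODES = {"live", "replay", "both"}
RISK_KEYS = ("max_position_usd", "max_drawdown_pct", "max_order_rate")


def validate_env_spec(spec: dict) -> list[str]:
    """Return a list of problems; an empty list means the spec is acceptable."""
    errors = []

    exchanges = spec.get("exchanges", [])
    if not exchanges:
        errors.append("at least one exchange is required")
    for ex in exchanges:
        name = ex.get("name", "<unnamed>")
        # Dev environments must only ever talk to sandbox venues.
        if ex.get("sandbox") is not True:
            errors.append(f"exchange {name}: sandbox must be true")
        if not ex.get("instruments"):
            errors.append(f"exchange {name}: no instruments listed")

    mode = spec.get("market_data", {}).get("mode", "live")
    if mode not in VALID_MODES:
        errors.append(f"market_data.mode: unknown mode {mode!r}")

    ttl = spec.get("ttl_hours", 48)
    if not isinstance(ttl, int) or ttl <= 0:
        errors.append("ttl_hours must be a positive integer")
    elif ttl > MAX_TTL_HOURS:
        errors.append(f"ttl_hours {ttl} exceeds {MAX_TTL_HOURS}: needs manager approval")

    risk = spec.get("risk_config", {})
    for key in RISK_KEYS:
        value = risk.get(key)
        if not isinstance(value, (int, float)) or value <= 0:
            errors.append(f"risk_config.{key} must be a positive number")

    return errors
```

Returning every problem at once, rather than failing on the first, lets the CI job post a single PR comment listing everything the developer needs to fix.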
The Terraform Module
The dev environment Terraform module is a simplified version of the production trading node module, optimized for fast creation and minimal cost:
# modules/dev-trading-env/main.tf
variable "env_name" {
  description = "Unique identifier for this environment (usually PR number)"
  type        = string
}

variable "strategy_binary_s3_uri" {
  description = "S3 URI of the strategy binary to deploy"
  type        = string
}

variable "exchange_configs" {
  description = "List of exchange configurations (name, type, instruments)"
  type = list(object({
    name        = string
    type        = string
    sandbox     = bool
    instruments = list(string)
  }))
}

variable "risk_config" {
  description = "Risk configuration for the environment"
  type = object({
    max_position_usd = number
    max_drawdown_pct = number
    max_order_rate   = number
  })
}

variable "ttl_hours" {
  description = "Environment lifetime in hours before auto-destroy"
  type        = number
  default     = 48
}

# Referenced by the security group's ingress rule
variable "developer_vpn_cidr" {
  description = "CIDR range of the developer VPN allowed to reach the admin API"
  type        = string
}
# Single instance: strategy engine + OMS on the same host
# (not production-grade, but sufficient for strategy testing)
resource "aws_instance" "dev_env" {
  ami           = data.aws_ami.trading_dev.id
  instance_type = "c6i.2xlarge" # 8 vCPU, 16 GB - fine for dev
  subnet_id     = data.aws_subnet.dev_shared.id # Shared dev subnet

  vpc_security_group_ids = [
    aws_security_group.dev_env.id,
    data.aws_security_group.dev_shared_egress.id
  ]

  iam_instance_profile = aws_iam_instance_profile.dev_env.name

  user_data = templatefile("${path.module}/userdata.sh", {
    env_name               = var.env_name
    strategy_binary_s3_uri = var.strategy_binary_s3_uri
    exchange_configs       = jsonencode(var.exchange_configs)
    risk_config            = jsonencode(var.risk_config)
    secrets_path           = "/dev-envs/${var.env_name}" # Secrets Manager path
    ttl_hours              = var.ttl_hours
  })

  # Dev instances use GP3 30GB - no need for 100GB production volumes
  root_block_device {
    volume_type = "gp3"
    volume_size = 30
    iops        = 3000
  }

  tags = {
    Name        = "dev-env-${var.env_name}"
    Environment = "development"
    TTLHours    = tostring(var.ttl_hours)
    CreatedAt   = timestamp()
    PRNumber    = var.env_name
    AutoDestroy = "true" # Picked up by the reaper Lambda
  }

  # timestamp() is re-evaluated on every plan. Without this, any re-apply
  # would rewrite CreatedAt and silently reset the TTL clock.
  lifecycle {
    ignore_changes = [tags["CreatedAt"]]
  }
}
# Security group: allow developer access by IP, not public
resource "aws_security_group" "dev_env" {
  name   = "dev-env-${var.env_name}"
  vpc_id = data.aws_vpc.dev.id

  # Allow admin API access from the developer VPN range only
  ingress {
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = [var.developer_vpn_cidr]
  }

  # No inbound SSH: operators use AWS SSM Session Manager instead
  # This avoids managing SSH keys and doesn't require a bastion
}
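The glue between the parsed .dev-env.yaml and this module is a set of Terraform input variables. The sketch below shows one way CI might build them; it is illustrative only. The function name build_tfvars, the shape of the spec dict, and the example S3 URI are assumptions, while the output keys mirror the variable blocks declared in the module.

```python
import json


def build_tfvars(pr_number: int, spec: dict, binary_s3_uri: str) -> dict:
    """Map a parsed dev-env spec onto the module's input variables."""
    return {
        # The config comment says the name defaults to the PR number.
        "env_name": spec.get("name") or str(pr_number),
        "strategy_binary_s3_uri": binary_s3_uri,
        "exchange_configs": [
            {
                "name": ex["name"],
                "type": ex["type"],
                "sandbox": bool(ex["sandbox"]),
                "instruments": list(ex["instruments"]),
            }
            for ex in spec["exchanges"]
        ],
        "risk_config": {
            key: spec["risk_config"][key]
            for key in ("max_position_usd", "max_drawdown_pct", "max_order_rate")
        },
        "ttl_hours": spec.get("ttl_hours", 48),
    }


if __name__ == "__main__":
    spec = {
        "exchanges": [
            {"name": "binance", "type": "spot", "sandbox": True,
             "instruments": ["BTC/USDT", "ETH/USDT"]},
        ],
        "risk_config": {"max_position_usd": 1000, "max_drawdown_pct": 5,
                        "max_order_rate": 10},
    }
    # Hypothetical bucket and key, for illustration only.
    tfvars = build_tfvars(1234, spec, "s3://example-bucket/strategies/mean-reversion-v3")
    print(json.dumps(tfvars, indent=2))
```

Written out as a JSON file, the result can be passed to Terraform with -var-file, which keeps the values out of shell quoting and leaves a record of what each environment was created with.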
The Security Boundary: Sandbox vs Production Credentials
This is the single most important design decision in a trading developer platform, and getting it wrong can result in sandbox tests executing real trades.
The rule: dev environments can never access production exchange API keys. This sounds obvious, but the mechanisms that enforce it require deliberate design.
The credential architecture:
AWS Secrets Manager
├── /prod/exchanges/binance/api-key (access: prod IAM role only)
├── /prod/exchanges/binance/api-secret (access: prod IAM role only)
│
├── /sandbox/exchanges/binance/api-key (access: dev IAM role)
├── /sandbox/exchanges/binance/api-secret (access: dev IAM role)
│
└── /dev-envs/<env-name>/ (created at env provisioning)
    ├── env-api-key (generated per-environment)
    └── exchange-config (points to sandbox keys only)
The IAM role for dev environment instances explicitly denies access to the /prod/ path:
# IAM policy for dev environment instances
resource "aws_iam_policy" "dev_env_secrets" {
  name = "dev-env-secrets-${var.env_name}"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["secretsmanager:GetSecretValue"]
        Resource = [
          "arn:aws:secretsmanager:*:*:secret:/sandbox/*",
          "arn:aws:secretsmanager:*:*:secret:/dev-envs/${var.env_name}/*"
        ]
      },
      {
        # Explicit deny for production secrets, even if a broader Allow exists elsewhere
        Effect   = "Deny"
        Action   = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:*:*:secret:/prod/*"
      }
    ]
  })
}
The explicit Deny is redundant if the Allow is correctly scoped (and it is), but explicit Deny provides defense-in-depth: even if someone adds a broad Allow elsewhere, the Deny wins.
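The "Deny wins" behavior is easy to demonstrate with a toy evaluator. This is a deliberately simplified model of IAM policy evaluation for illustration (exact action match, glob-style resource match, no conditions or principals); it is not how AWS implements it. The account ID, region, and environment name pr-1234 are made up.

```python
from fnmatch import fnmatchcase

DEV_ENV_POLICY = [
    {
        "Effect": "Allow",
        "Action": ["secretsmanager:GetSecretValue"],
        "Resource": [
            "arn:aws:secretsmanager:*:*:secret:/sandbox/*",
            "arn:aws:secretsmanager:*:*:secret:/dev-envs/pr-1234/*",
        ],
    },
    {
        "Effect": "Deny",
        "Action": ["secretsmanager:GetSecretValue"],
        "Resource": "arn:aws:secretsmanager:*:*:secret:/prod/*",
    },
]


def is_allowed(statements: list[dict], action: str, resource: str) -> bool:
    """Explicit Deny beats any Allow; no matching Allow means implicit deny."""
    allowed = False
    for statement in statements:
        patterns = statement["Resource"]
        if isinstance(patterns, str):
            patterns = [patterns]
        if action not in statement["Action"]:
            continue
        if not any(fnmatchcase(resource, pattern) for pattern in patterns):
            continue
        if statement["Effect"] == "Deny":
            return False  # a single matching Deny ends the evaluation
        allowed = True
    return allowed


ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:"
ACTION = "secretsmanager:GetSecretValue"

# Someone later attaches an over-broad Allow to the same role.
BROAD_ALLOW = {"Effect": "Allow", "Action": [ACTION], "Resource": "*"}

print(is_allowed(DEV_ENV_POLICY, ACTION, ARN + "/sandbox/exchanges/binance/api-key"))
print(is_allowed(DEV_ENV_POLICY + [BROAD_ALLOW], ACTION, ARN + "/prod/exchanges/binance/api-key"))
```

The first call prints True and the second prints False: even with the broad Allow in place, the production path stays unreachable from a dev environment role.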
Market Data Sharing: One Feed, Many Environments
At Akuna, each dev environment consuming its own independent market data subscription from Binance would have meant:
- A separate WebSocket connection to Binance for each environment
- A separate rate limit allocation (Binance limits connections per API key)
- Multiplied data transfer costs
The solution is a shared market data distribution layer: one canonical feed subscriber per exchange, which fans out to all active dev environments via internal NATS subjects.
Binance WebSocket (1 connection)
|
v
Feed Multiplexer (singleton, always running)
|
v
NATS: market.data.binance.btcusdt.trades
NATS: market.data.binance.btcusdt.orderbook
NATS: market.data.binance.ethusdt.trades
|
+-----------+-----------+
| | |
Dev Env 1 Dev Env 2 Dev Env 3
(subscribes to relevant subjects)
The feed multiplexer is a persistent service running in the dev VPC - not ephemeral. Dev environments subscribe to NATS subjects rather than maintaining their own exchange connections for market data. This reduces the dev environment’s exchange connection footprint to order routing only (which genuinely needs its own connection per environment for isolation).
# Feed multiplexer: subscribes to exchange, publishes to internal NATS
import json

import nats
import websockets


class FeedMultiplexer:
    def __init__(self, exchange: str, instruments: list[str]):
        self.exchange = exchange
        self.instruments = instruments

    # exchange_ws_url() and normalize() are exchange-specific and not shown here.

    async def run(self, nats_client: nats.NATS):
        async with websockets.connect(
            self.exchange_ws_url(self.exchange, self.instruments)
        ) as ws:
            async for message in ws:
                data = json.loads(message)
                # Parse to normalized internal format
                normalized = self.normalize(data)
                # Publish to NATS with exchange + instrument + event type as subject
                subject = (
                    "market.data"
                    f".{self.exchange}"
                    f".{normalized.instrument.lower()}"
                    f".{normalized.event_type}"
                )
                await nats_client.publish(
                    subject,
                    normalized.to_bytes()
                )
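On the consuming side, a dev environment picks its feeds by subject. NATS subjects are dot-separated tokens, and subscriptions may use two wildcards: * matches exactly one token, and > matches one or more trailing tokens. The sketch below reimplements that matching in plain Python so the subject layout can be reasoned about without a running NATS server. The helper names are hypothetical, and flattening "BTC/USDT" to "btcusdt" is an assumption based on the subjects shown in the diagram.

```python
def subject_for(exchange: str, instrument: str, event_type: str) -> str:
    """Build the subject the multiplexer would publish on."""
    # Assumption: symbols such as "BTC/USDT" are flattened to a single token.
    token = instrument.replace("/", "").lower()
    return f"market.data.{exchange}.{token}.{event_type}"


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subscription pattern matches a subject."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for index, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' is only valid as the last token and needs at least one token to consume.
            return index == len(pattern_tokens) - 1 and len(subject_tokens) > index
        if index >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[index]:
            return False
    return len(pattern_tokens) == len(subject_tokens)


trades = subject_for("binance", "BTC/USDT", "trades")
print(trades)
print(subject_matches("market.data.binance.*.trades", trades))
print(subject_matches("market.data.binance.>", trades))
print(subject_matches("market.data.bybit.>", trades))
```

A mean-reversion strategy that only needs Binance trades would subscribe to market.data.binance.*.trades, while a tool that records everything from one venue would use market.data.binance.> instead.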
Auto-Destroy: The Lambda Reaper
Without auto-destroy, ephemeral environments accumulate. Developers forget to clean up. A 48-hour environment becomes a 2-week environment. At roughly $25 per day, that is $350. Multiply by 30 developers and you have a $10,000+ waste problem.
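The arithmetic behind that figure is worth writing down. The daily rate below is an assumption derived from the $50 per 48-hour lifecycle quoted in the introduction; the scenario is one forgotten environment per developer.

```python
DAILY_COST_USD = 25.0  # assumption: $50 per 48-hour lifecycle
DEVELOPERS = 30


def environment_cost(days: float, daily_cost: float = DAILY_COST_USD) -> float:
    """Cost of keeping one dev environment alive for the given number of days."""
    return days * daily_cost


intended = environment_cost(2)     # the planned 48-hour lifecycle
forgotten = environment_cost(14)   # the same environment left running for two weeks
team_wide = forgotten * DEVELOPERS

print(f"intended lifecycle:    ${intended:,.0f}")   # $50
print(f"forgotten environment: ${forgotten:,.0f}")  # $350
print(f"across the team:       ${team_wide:,.0f}")  # $10,500
```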
The auto-destroy mechanism is a Lambda function that runs hourly and terminates instances whose TTL has expired:
# lambda/environment_reaper.py
import json
import os
from datetime import datetime, timezone

import boto3

ec2 = boto3.client('ec2')
sfn = boto3.client('stepfunctions')
tf_state_bucket = 'zerocopy-tfstate-dev-envs'


def handler(event, context):
    # Find all instances tagged as auto-destroy.
    # Paginate: a single describe_instances call can truncate large result sets.
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[
            {'Name': 'tag:AutoDestroy', 'Values': ['true']},
            {'Name': 'instance-state-name', 'Values': ['running', 'stopped']}
        ]
    )
    now = datetime.now(timezone.utc)

    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
                created_at_str = tags.get('CreatedAt')
                ttl_hours = int(tags.get('TTLHours', '48'))
                env_name = tags.get('PRNumber', 'unknown')

                if not created_at_str:
                    continue

                # Terraform's timestamp() emits a trailing "Z"; normalize it for fromisoformat
                created_at = datetime.fromisoformat(created_at_str.replace('Z', '+00:00'))
                age_hours = (now - created_at).total_seconds() / 3600

                if age_hours >= ttl_hours:
                    print(f"Destroying expired environment: {env_name} (age: {age_hours:.1f}h, TTL: {ttl_hours}h)")
                    # Trigger Terraform destroy via Step Functions
                    # (not directly from Lambda - TF destroy can take 2-3 minutes)
                    sfn.start_execution(
                        stateMachineArn=os.environ['DESTROY_STATE_MACHINE_ARN'],
                        input=json.dumps({
                            'env_name': env_name,
                            'instance_id': instance['InstanceId'],
                            'reason': f'TTL expired after {age_hours:.1f}h'
                        })
                    )
The Step Functions state machine handles the Terraform destroy workflow asynchronously, sends a notification to the developer’s Slack DM, and updates the PR comment to show the environment has been cleaned up.
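The expiry decision itself is pure date arithmetic, so it can be pulled out of the handler and tested without touching AWS. The sketch below is one way to do that. The function names and status strings are hypothetical; the 4-hour and 1-hour warning thresholds and the 24-hour extension come from the failure-mode discussion later in this article.

```python
from datetime import datetime, timedelta, timezone


def parse_created_at(value: str) -> datetime:
    """Parse the CreatedAt tag written by Terraform's timestamp() (RFC 3339, UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def ttl_status(created_at: datetime, ttl_hours: int, now: datetime) -> str:
    """Classify an environment as active, warn-4h, warn-1h, or expired."""
    remaining = created_at + timedelta(hours=ttl_hours) - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(hours=1):
        return "warn-1h"
    if remaining <= timedelta(hours=4):
        return "warn-4h"
    return "active"


def extend_ttl(ttl_hours: int, extra_hours: int = 24) -> int:
    """New TTL after a self-service extension; the caller rewrites the TTLHours tag."""
    return ttl_hours + extra_hours


created = parse_created_at("2025-01-01T00:00:00Z")
check_time = datetime(2025, 1, 2, 21, 0, tzinfo=timezone.utc)  # 45 hours in

print(ttl_status(created, 48, check_time))              # warn-4h
print(ttl_status(created, extend_ttl(48), check_time))  # active
```

Because the reaper runs hourly, a status-based design like this one needs to record which warnings have already been sent (for example in another tag) to avoid repeating the same Slack message.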
Commitment Pricing for Persistent Dev Infrastructure
Not everything in the dev platform is ephemeral. The feed multiplexer, the NATS cluster, the shared dev VPC, and the CI runners are persistent infrastructure. For these, 1-year convertible reserved instances provide significant cost savings.
At ZeroCopy, our persistent dev infrastructure runs on:
- 2x c6i.xlarge reserved (1-year convertible): feed multiplexers - $0.204/hr on-demand, 33% savings
- 3x t3.medium reserved: NATS nodes - significant savings vs on-demand for always-on workloads
- t3.small NAT instances: replaced with VPC endpoints where possible to eliminate per-GB charges
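The dev VPC uses VPC endpoints for S3 and Secrets Manager access - the two most frequently accessed services from dev environments. Without endpoints, all S3 and Secrets Manager traffic routes through NAT, and a NAT gateway charges $0.045/GB for data processing. With a gateway endpoint, S3 traffic stays inside the VPC at no additional charge. Secrets Manager needs an interface endpoint, which carries its own hourly and per-GB charges, but the per-GB rate is lower than NAT processing.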
How This Breaks in Production
Sandbox exchange connections behaving differently from production. Binance’s sandbox environment has known quirks: order fill latency is higher, market impact simulation is simplified, some order types are not available. A strategy that performs well in sandbox may behave differently on production - not because of a bug in the strategy, but because the sandbox does not accurately simulate the production exchange. Document the sandbox limitations explicitly and build a “paper trading on production feed” mode for final validation.
Rate limits shared across dev environments. The shared feed multiplexer uses a single API key for market data subscriptions. If Binance changes its rate limit calculation to include WebSocket subscription count per API key, all dev environments could be affected by one environment’s misbehavior. Monitor the multiplexer’s connection health independently from dev environment health.
TTL expiry during active testing. A developer is running a long backtest on an environment that has reached its TTL. The reaper destroys the environment mid-test. The developer loses the test run. Solution: send a Slack warning at TTL-4h and TTL-1h, and allow self-service extension via a Slack slash command that adds 24 hours (with logging).
Terraform state for dev environments accumulating in S3. Each ephemeral environment creates a Terraform state file. If the destroy process fails (Lambda timeout, network issue), the state file persists but the instance is gone - creating orphaned state. Add a weekly cleanup Lambda that removes state files for environments where the corresponding EC2 instance no longer exists.
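Secrets Manager GetSecretValue pricing at scale. AWS bills Secrets Manager per API call as well as per stored secret. At small scale the API charges come to about $1.50/month - fine. At 200 environments following the same pattern, call volume multiplies, and the bill can reach $100s per day before anyone notices. Set a CloudWatch alarm on the GetSecretValue API call rate.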
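For the orphaned Terraform state case above, the core of the weekly cleanup is a set difference between the state files in S3 and the environments that still have an instance. The sketch below keeps that logic separate from the AWS calls so it can be tested on its own. The key layout envs/<env-name>/terraform.tfstate and the function name are assumptions; the real bucket layout may differ.

```python
STATE_PREFIX = "envs/"                # hypothetical key layout
STATE_SUFFIX = "/terraform.tfstate"


def find_orphaned_state(state_keys: list[str], live_env_names: set[str]) -> list[str]:
    """Return state object keys whose environment no longer has a live instance."""
    orphans = []
    for key in state_keys:
        if not (key.startswith(STATE_PREFIX) and key.endswith(STATE_SUFFIX)):
            continue  # leave anything that is not a per-environment state file alone
        env_name = key[len(STATE_PREFIX):-len(STATE_SUFFIX)]
        if env_name and env_name not in live_env_names:
            orphans.append(key)
    return sorted(orphans)


keys = [
    "envs/1201/terraform.tfstate",
    "envs/1207/terraform.tfstate",
    "envs/mean-reversion-v3-test/terraform.tfstate",
    "shared/nats/terraform.tfstate",
]
live = {"1207"}  # in practice: PRNumber tags from instances with AutoDestroy=true

for key in find_orphaned_state(keys, live):
    print("orphaned:", key)
```

In a real cleanup job the key list would come from listing the state bucket and the live set from the EC2 tags; logging the orphans for a cycle before deleting them is a cheap safeguard against removing state for an environment that is still mid-provisioning.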