
Infrastructure

Building a Self-Service Developer Platform for a 30-Person HFT Team

At Akuna with 500+ test environments, manual provisioning was killing iteration speed. How we built PR-triggered ephemeral environments at $50 each with auto-destroy after 48 hours.

11 min
#developer-experience #hft #terraform #ephemeral-environments #trading-infrastructure #platform-engineering

In my third year at Akuna, we had a problem that was hard to see until you stepped back and counted the days. A strategy developer wanted to test a new market-making variant against real (but sandbox) exchange connections. The process was: file a ticket to the infrastructure team, wait 2-3 days for a test environment to be provisioned, discover it was misconfigured (wrong exchange version, missing market data feed, wrong risk limits), file another ticket, wait another day. If you were lucky and persistent, you had a working test environment in a week. Most developers just tested against production with reduced position limits, which is exactly as risky as it sounds.

We had 30 quant developers and infrastructure analysts. At any given time, there were 15-20 active strategy development threads. The infrastructure team of 3 was spending 60% of its time on environment provisioning requests. The developers, meanwhile, were spending much of their own time waiting.

The solution - self-service ephemeral environments triggered by PR creation - took three months to build properly and eliminated the provisioning bottleneck completely. After launch, environment provisioning time went from 2-3 days to 8-12 minutes. The infrastructure team reallocated the recovered capacity to actual infrastructure work. This is how we built it.

The Core Insight: A Trading Environment Is Not a General-Purpose Environment

The instinct when building developer environments is to create a smaller version of production: same services, same topology, same everything, just cheaper instances. This works for web applications because production is a horizontally-scaled fleet of stateless services. It does not work for trading because a trading environment’s uniqueness is not in its topology - it is in its connections.

What makes a trading environment useful to a strategy developer:

  1. Exchange connectivity: Live but sandboxed exchange connections that behave exactly like production (same latency characteristics, same order types, same fill simulation)
  2. Market data subscription: Real market data feed (tick-by-tick), not replayed historical data
  3. Risk controls: Configured risk limits appropriate for testing (small position sizes, strict drawdown limits)
  4. Isolated position state: No sharing position state with other test environments - each environment has its own clean position book

What a trading environment does not need:

  • Production-grade latency (sub-100µs)
  • Bare metal instances
  • High availability
  • Persistent storage (ephemeral is fine - the exchange is the source of truth)
  • Separate VPC (cost savings: can share a “dev” VPC across all environments)

This distinction matters because production-grade infrastructure is expensive. Each environment built as a production replica would cost $300-500/day. Each environment built as a "development-grade" environment with shared infrastructure where possible costs $40-80/day - and with 48-hour auto-destroy, the actual cost per environment lifecycle is $50-80 total.

The Platform Architecture

+-------------------------------+
| Developer opens PR on GitHub  |
+-------------------------------+
             |
             v
+-------------------------------+
| CI: parse PR for infra config |
| (reads .dev-env.yaml in PR)   |
+-------------------------------+
             |
             v
+-------------------------------+
| Terraform: provision env      |
| - 1x c6i.2xlarge (strategy    |
|   + oms on same node)         |
| - Exchange API keys from vault|
| - Market data subscription    |
| - Risk limits from defaults   |
+-------------------------------+
             |
             v
+-------------------------------+
| CI: post environment URL to PR|
| comment with access details   |
+-------------------------------+
             |
             v
+-------------------------------+
| Developer uses environment    |
| (direct URL + API key)        |
+-------------------------------+
             |
             v
+-------------------------------+
| 48h timer: auto-destroy       |
| (or manual: PR close/merge)   |
+-------------------------------+

The configuration file that developers include in their PRs:

# .dev-env.yaml (checked into the PR branch)
version: 1

environment:
  name: "mean-reversion-v3-test"  # Optional: defaults to PR number

exchanges:
  - name: binance
    type: spot
    sandbox: true
    instruments:
      - BTC/USDT
      - ETH/USDT
  - name: bybit
    type: perpetual
    sandbox: true
    instruments:
      - BTCUSDT

market_data:
  mode: live        # live | replay | both
  # If replay: specify historical date range for backtesting
  # replay_start: "2025-01-01"
  # replay_end:   "2025-01-31"

risk_config:
  max_position_usd: 1000    # Per instrument
  max_drawdown_pct: 5       # 5% drawdown = halt
  max_order_rate: 10        # Orders per second

strategy:
  binary: strategies/mean-reversion-v3  # Relative path to compiled strategy
  config: strategies/mean-reversion-v3.toml

ttl_hours: 48   # Default: 48. Max: 96 without approval; longer TTLs require manager approval.
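The CI step that parses this file is a natural place to enforce the platform's guardrails before Terraform runs. The following is a minimal sketch, assuming the YAML has already been loaded into a dict. The function name `validate_dev_env_config` and the specific checks are illustrative assumptions, not the actual CI code:

```python
ALLOWED_MODES = {"live", "replay", "both"}
MAX_TTL_HOURS = 96  # ceiling quoted in the config comment


def validate_dev_env_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []

    exchanges = config.get("exchanges", [])
    if not exchanges:
        errors.append("at least one exchange is required")
    for exchange in exchanges:
        name = exchange.get("name")
        # Dev environments must never point at a live exchange account.
        if exchange.get("sandbox") is not True:
            errors.append(f"exchange {name!r} must set sandbox: true")
        if not exchange.get("instruments"):
            errors.append(f"exchange {name!r} lists no instruments")

    mode = config.get("market_data", {}).get("mode", "live")
    if mode not in ALLOWED_MODES:
        errors.append(f"market_data.mode must be one of {sorted(ALLOWED_MODES)}")

    ttl = config.get("ttl_hours", 48)
    if not isinstance(ttl, int) or not 0 < ttl <= MAX_TTL_HOURS:
        errors.append(f"ttl_hours must be an integer between 1 and {MAX_TTL_HOURS}")

    return errors
```

Returning every problem at once, rather than failing on the first, lets the CI job post a single PR comment listing everything the developer needs to fix.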

The Terraform Module

The dev environment Terraform module is a simplified version of the production trading node module, optimized for fast creation and minimal cost:

# modules/dev-trading-env/main.tf
variable "env_name" {
  description = "Unique identifier for this environment (usually PR number)"
  type        = string
}

variable "strategy_binary_s3_uri" {
  description = "S3 URI of the strategy binary to deploy"
  type        = string
}

variable "exchange_configs" {
  description = "List of exchange configurations (name, type, instruments)"
  type = list(object({
    name        = string
    type        = string
    sandbox     = bool
    instruments = list(string)
  }))
}

variable "risk_config" {
  description = "Risk configuration for the environment"
  type = object({
    max_position_usd = number
    max_drawdown_pct = number
    max_order_rate   = number
  })
}

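variable "ttl_hours" {
  description = "Environment lifetime in hours before auto-destroy"
  type        = number
  default     = 48
}

# Referenced by the security group below
variable "developer_vpn_cidr" {
  description = "CIDR range of the developer VPN allowed to reach the admin API"
  type        = string
}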

# Single instance: strategy engine + OMS on the same host
# (not production-grade, but sufficient for strategy testing)
resource "aws_instance" "dev_env" {
  ami           = data.aws_ami.trading_dev.id
  instance_type = "c6i.2xlarge"  # 8 vCPU, 16 GB - fine for dev
  subnet_id     = data.aws_subnet.dev_shared.id  # Shared dev subnet

  vpc_security_group_ids = [
    aws_security_group.dev_env.id,
    data.aws_security_group.dev_shared_egress.id
  ]

  iam_instance_profile = aws_iam_instance_profile.dev_env.name

  user_data = templatefile("${path.module}/userdata.sh", {
    env_name               = var.env_name
    strategy_binary_s3_uri = var.strategy_binary_s3_uri
    exchange_configs       = jsonencode(var.exchange_configs)
    risk_config            = jsonencode(var.risk_config)
    secrets_path           = "/dev-envs/${var.env_name}"  # Secrets Manager path
    ttl_hours              = var.ttl_hours
  })

  # Dev instances use GP3 30GB - no need for 100GB production volumes
  root_block_device {
    volume_type = "gp3"
    volume_size = 30
    iops        = 3000
  }

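  tags = {
    Name        = "dev-env-${var.env_name}"
    Environment = "development"
    TTLHours    = tostring(var.ttl_hours)
    CreatedAt   = timestamp()
    PRNumber    = var.env_name
    AutoDestroy = "true"  # Picked up by the reaper Lambda
  }

  # timestamp() is re-evaluated on every plan. Ignore later changes so that a
  # re-apply does not reset CreatedAt and silently extend the TTL.
  lifecycle {
    ignore_changes = [tags["CreatedAt"]]
  }
}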

# Security group: allow developer access by IP, not public
resource "aws_security_group" "dev_env" {
  name   = "dev-env-${var.env_name}"
  vpc_id = data.aws_vpc.dev.id

  # Allow admin API access from the developer VPN range only
  ingress {
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = [var.developer_vpn_cidr]
  }

  # No inbound SSH: operators use AWS SSM Session Manager instead
  # This avoids managing SSH keys and doesn't require a bastion
}
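To connect the `.dev-env.yaml` file to this module, the CI job has to turn the parsed config into values for the module's input variables. One way to do that, sketched here under assumptions, is to render a `.tfvars.json` file. The function name `build_tfvars` and the example S3 URI in the usage note are hypothetical. The variable names are the ones declared in the module above:

```python
import json


def build_tfvars(config: dict, pr_number: int, binary_s3_uri: str) -> str:
    """Render a .tfvars.json document for the dev-trading-env module."""
    tfvars = {
        # environment.name is optional and defaults to the PR number
        "env_name": config.get("environment", {}).get("name") or str(pr_number),
        "strategy_binary_s3_uri": binary_s3_uri,
        "exchange_configs": [
            {
                "name": ex["name"],
                "type": ex["type"],
                "sandbox": ex["sandbox"],
                "instruments": ex["instruments"],
            }
            for ex in config["exchanges"]
        ],
        "risk_config": config["risk_config"],
        "ttl_hours": config.get("ttl_hours", 48),
    }
    return json.dumps(tfvars, indent=2)
```

The output can be written to a file such as `env.tfvars.json` and passed to `terraform apply -var-file=env.tfvars.json`. Copying only the known exchange fields keeps stray YAML keys from reaching Terraform, which rejects unexpected object attributes.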

The Security Boundary: Sandbox vs Production Credentials

This is the single most important design decision in a trading developer platform, and getting it wrong can result in sandbox tests executing real trades.

The rule: dev environments can never access production exchange API keys. This sounds obvious, but the mechanisms that enforce it require deliberate design.

The credential architecture:

AWS Secrets Manager
├── /prod/exchanges/binance/api-key      (access: prod IAM role only)
├── /prod/exchanges/binance/api-secret   (access: prod IAM role only)

├── /sandbox/exchanges/binance/api-key   (access: dev IAM role)
├── /sandbox/exchanges/binance/api-secret (access: dev IAM role)

└── /dev-envs/<env-name>/                (created at env provisioning)
    ├── env-api-key                      (generated per-environment)
    └── exchange-config                  (points to sandbox keys only)

The IAM role for dev environment instances explicitly denies access to the /prod/ path:

# IAM policy for dev environment instances
resource "aws_iam_policy" "dev_env_secrets" {
  name = "dev-env-secrets-${var.env_name}"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["secretsmanager:GetSecretValue"]
        Resource = [
          "arn:aws:secretsmanager:*:*:secret:/sandbox/*",
          "arn:aws:secretsmanager:*:*:secret:/dev-envs/${var.env_name}/*"
        ]
      },
      {
        # Explicit deny for production secrets, even if a broader Allow exists elsewhere
        Effect = "Deny"
        Action = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:*:*:secret:/prod/*"
      }
    ]
  })
}

The explicit Deny is redundant if the Allow is correctly scoped (and it is), but explicit Deny provides defense-in-depth: even if someone adds a broad Allow elsewhere, the Deny wins.
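The "Deny wins" behavior can be illustrated with a toy evaluator. This is a deliberately simplified model, not AWS's actual policy engine. It uses `fnmatch` wildcards as a stand-in for IAM resource matching and ignores conditions, principals, and action wildcards. The third statement is a hypothetical overly broad Allow of the kind the explicit Deny guards against:

```python
from fnmatch import fnmatchcase

GET_SECRET = "secretsmanager:GetSecretValue"

# The sandbox Allow and prod Deny from the policy above, plus a hypothetical
# broad Allow that someone might attach elsewhere by mistake.
STATEMENTS = [
    {"Effect": "Allow", "Action": [GET_SECRET],
     "Resource": ["arn:aws:secretsmanager:*:*:secret:/sandbox/*"]},
    {"Effect": "Deny", "Action": [GET_SECRET],
     "Resource": ["arn:aws:secretsmanager:*:*:secret:/prod/*"]},
    {"Effect": "Allow", "Action": [GET_SECRET], "Resource": ["*"]},
]


def is_allowed(statements: list[dict], action: str, resource_arn: str) -> bool:
    """A matching Deny always wins; otherwise at least one Allow is required."""
    matching = [
        s for s in statements
        if action in s["Action"]
        and any(fnmatchcase(resource_arn, pattern) for pattern in s["Resource"])
    ]
    if any(s["Effect"] == "Deny" for s in matching):
        return False
    return any(s["Effect"] == "Allow" for s in matching)
```

With the broad Allow present, a request for a `/prod/` secret matches both an Allow and the Deny and is still refused, while `/sandbox/` secrets remain readable.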

Market Data Sharing: One Feed, Many Environments

At Akuna, each dev environment consuming its own independent market data subscription from Binance would have meant:

  • A separate WebSocket connection to Binance for each environment
  • A separate rate limit allocation (Binance limits connections per API key)
  • Multiplied data transfer costs

The solution is a shared market data distribution layer: one canonical feed subscriber per exchange, which fans out to all active dev environments via internal NATS subjects.

Binance WebSocket (1 connection)
          |
          v
    Feed Multiplexer (singleton, always running)
          |
          v
    NATS: market.data.binance.btcusdt.trades
    NATS: market.data.binance.btcusdt.orderbook
    NATS: market.data.binance.ethusdt.trades
          |
    +-----------+-----------+
    |           |           |
Dev Env 1   Dev Env 2   Dev Env 3
(subscribes to relevant subjects)

The feed multiplexer is a persistent service running in the dev VPC - not ephemeral. Dev environments subscribe to NATS subjects rather than maintaining their own exchange connections for market data. This reduces the dev environment’s exchange connection footprint to order routing only (which genuinely needs its own connection per environment for isolation).

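# Feed multiplexer: subscribes to exchange, publishes to internal NATS
import json

import websockets
from nats.aio.client import Client as NATS

class FeedMultiplexer:
    # exchange_ws_url() and normalize() are exchange-specific helpers, omitted here
    def __init__(self, exchange: str, instruments: list[str]):
        self.exchange = exchange
        self.instruments = instruments

    async def run(self, nats_client: NATS):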
        async with websockets.connect(
            self.exchange_ws_url(self.exchange, self.instruments)
        ) as ws:
            async for message in ws:
                data = json.loads(message)

                # Parse to normalized internal format
                normalized = self.normalize(data)

                # Publish to NATS with exchange + instrument + event type as subject
                subject = (
                    f"market.data"
                    f".{self.exchange}"
                    f".{normalized.instrument.lower()}"
                    f".{normalized.event_type}"
                )

                await nats_client.publish(
                    subject,
                    normalized.to_bytes()
                )
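On the consuming side, each dev environment needs to turn the instruments in its `.dev-env.yaml` into the subjects the multiplexer publishes on. A minimal sketch follows. The helper name `subjects_for` is hypothetical, and it assumes the multiplexer's normalized instrument names drop the `/` separator, as the `btcusdt` subjects in the diagram suggest:

```python
def subjects_for(
    exchanges: list[dict],
    event_types: tuple[str, ...] = ("trades", "orderbook"),
) -> list[str]:
    """Build the NATS subjects a dev environment should subscribe to."""
    subjects = []
    for exchange in exchanges:
        for instrument in exchange["instruments"]:
            # "BTC/USDT" and "BTCUSDT" both become the subject token "btcusdt"
            token = instrument.replace("/", "").lower()
            for event_type in event_types:
                subjects.append(
                    f"market.data.{exchange['name']}.{token}.{event_type}"
                )
    return subjects
```

An environment that wants every event type for an instrument could instead subscribe once with the NATS single-token wildcard, for example `market.data.binance.btcusdt.*`.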

Auto-Destroy: The Lambda Reaper

Without auto-destroy, ephemeral environments accumulate. Developers forget to clean up. A 48-hour environment becomes a 2-week environment. At $50/environment/48 hours, a 2-week environment costs $350. Multiply by 30 developers and you have a $10,000+ waste problem.
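The arithmetic behind those figures, written out as a small sketch. The constant and function names are illustrative, and the $50 per 48-hour lifecycle is the figure quoted in the text:

```python
COST_PER_LIFECYCLE_USD = 50  # one 48-hour environment
LIFECYCLE_HOURS = 48


def forgotten_env_cost(days_alive: int, environments: int = 1) -> float:
    """Cost of environments left running for days_alive days."""
    lifecycles = days_alive * 24 / LIFECYCLE_HOURS
    return lifecycles * COST_PER_LIFECYCLE_USD * environments


one_env = forgotten_env_cost(14)        # 7 lifecycles x $50 = $350
whole_team = forgotten_env_cost(14, 30)  # $350 x 30 developers = $10,500
```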

The auto-destroy mechanism is a Lambda function that runs hourly and terminates instances whose TTL has expired:

# lambda/environment_reaper.py
import json
import os
from datetime import datetime, timezone

import boto3

ec2 = boto3.client('ec2')
tf_state_bucket = 'zerocopy-tfstate-dev-envs'

def handler(event, context):
    # Find all instances tagged as auto-destroy. describe_instances is
    # paginated, so walk every page rather than only the first response.
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[
            {'Name': 'tag:AutoDestroy', 'Values': ['true']},
            {'Name': 'instance-state-name', 'Values': ['running', 'stopped']}
        ]
    )
    reservations = [r for page in pages for r in page['Reservations']]

    for reservation in reservations:
        for instance in reservation['Instances']:
            tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}

            created_at_str = tags.get('CreatedAt')
            ttl_hours = int(tags.get('TTLHours', '48'))
            env_name = tags.get('PRNumber', 'unknown')

            if not created_at_str:
                continue

            created_at = datetime.fromisoformat(created_at_str.replace('Z', '+00:00'))
            now = datetime.now(timezone.utc)
            age_hours = (now - created_at).total_seconds() / 3600

            if age_hours >= ttl_hours:
                print(f"Destroying expired environment: {env_name} (age: {age_hours:.1f}h, TTL: {ttl_hours}h)")

                # Trigger Terraform destroy via Step Functions
                # (not directly from Lambda - TF destroy can take 2-3 minutes)
                sfn = boto3.client('stepfunctions')
                sfn.start_execution(
                    stateMachineArn=os.environ['DESTROY_STATE_MACHINE_ARN'],
                    input=json.dumps({
                        'env_name': env_name,
                        'instance_id': instance['InstanceId'],
                        'reason': f'TTL expired after {age_hours:.1f}h'
                    })
                )

The Step Functions state machine handles the Terraform destroy workflow asynchronously, sends a notification to the developer’s Slack DM, and updates the PR comment to show the environment has been cleaned up.

Commitment Pricing for Persistent Dev Infrastructure

Not everything in the dev platform is ephemeral. The feed multiplexer, the NATS cluster, the shared dev VPC, and the CI runners are persistent infrastructure. For these, 1-year convertible reserved instances provide significant cost savings.

At ZeroCopy, our persistent dev infrastructure runs on:

  • 2x c6i.xlarge reserved (1-year convertible): feed multiplexers - $0.136/hr vs $0.204/hr on-demand, 33% savings
  • 3x t3.medium reserved: NATS nodes - significant savings vs on-demand for always-on workloads
  • t3.small NAT instances: replaced with VPC endpoints where possible to eliminate per-GB charges
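For the feed multiplexers, the savings percentage and its annual dollar value follow directly from the two hourly rates quoted above. A small sketch, where the function name is illustrative and the instances are assumed to run all 8,760 hours of the year:

```python
HOURS_PER_YEAR = 8760


def reserved_savings(on_demand_hr: float, reserved_hr: float, count: int = 1):
    """Return (percent saved, dollars saved per year) for always-on instances."""
    percent = (1 - reserved_hr / on_demand_hr) * 100
    annual_usd = (on_demand_hr - reserved_hr) * HOURS_PER_YEAR * count
    return percent, annual_usd


# Two feed multiplexers at $0.204/hr on-demand vs $0.136/hr reserved:
# about 33% saved, roughly $1,191 per year.
percent, annual_usd = reserved_savings(0.204, 0.136, count=2)
```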

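The dev VPC uses VPC endpoints for S3 and Secrets Manager access - the two most frequently accessed services from dev environments. Without endpoints, all S3 and Secrets Manager traffic routes through NAT, which on a managed NAT gateway incurs a $0.045/GB data processing charge. With endpoints, S3 traffic through a gateway endpoint carries no data charge, and Secrets Manager traffic through an interface endpoint is billed at a lower per-GB rate plus an hourly fee per endpoint.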

How This Breaks in Production

Sandbox exchange connections behaving differently from production. Binance’s sandbox environment has known quirks: order fill latency is higher, market impact simulation is simplified, some order types are not available. A strategy that performs well in sandbox may behave differently on production - not because of a bug in the strategy, but because the sandbox does not accurately simulate the production exchange. Document the sandbox limitations explicitly and build a “paper trading on production feed” mode for final validation.

Rate limits shared across dev environments. The shared feed multiplexer uses a single API key for market data subscriptions. If Binance changes its rate limit calculation to include WebSocket subscription count per API key, all dev environments could be affected by one environment’s misbehavior. Monitor the multiplexer’s connection health independently from dev environment health.

TTL expiry during active testing. A developer is running a long backtest on an environment that has reached its TTL. The reaper destroys the environment mid-test. The developer loses the test run. Solution: send a Slack warning at TTL-4h and TTL-1h, and allow self-service extension via a Slack slash command that adds 24 hours (with logging).
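A sketch of the warning schedule, assuming it runs from the same hourly job as the reaper. The function names and the "fire once when the trigger time falls between two consecutive checks" approach are assumptions, not the original implementation:

```python
from datetime import datetime, timedelta, timezone

WARNING_OFFSETS_HOURS = (4, 1)  # warn at TTL-4h and TTL-1h
EXTENSION_HOURS = 24            # added by the self-service extension


def warnings_due(
    created_at: datetime, ttl_hours: int, last_check: datetime, now: datetime
) -> list[int]:
    """Return the warning offsets whose trigger time falls in (last_check, now]."""
    expires_at = created_at + timedelta(hours=ttl_hours)
    due = []
    for offset in WARNING_OFFSETS_HOURS:
        trigger = expires_at - timedelta(hours=offset)
        if last_check < trigger <= now:
            due.append(offset)
    return due


def extend_ttl(ttl_hours: int) -> int:
    """New TTL after one extension; the caller would update the TTLHours tag."""
    return ttl_hours + EXTENSION_HOURS
```

Comparing against the previous check time, rather than testing "less than 4 hours left", keeps each warning from being re-sent on every subsequent hourly run.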

Terraform state for dev environments accumulating in S3. Each ephemeral environment creates a Terraform state file. If the destroy process fails (Lambda timeout, network issue), the state file persists but the instance is gone - creating orphaned state. Add a weekly cleanup Lambda that removes state files for environments where the corresponding EC2 instance no longer exists.
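The core of that cleanup job is a set difference between state files and live environments. A minimal sketch of that step, assuming a hypothetical key layout of `dev-envs/<env-name>/terraform.tfstate`. Listing the bucket and the EC2 instances is left to the caller:

```python
def orphaned_state_keys(state_keys: list[str], live_env_names: set[str]) -> list[str]:
    """Return state file keys whose environment has no remaining instance."""
    orphans = []
    for key in state_keys:
        # Hypothetical layout: dev-envs/<env-name>/terraform.tfstate
        parts = key.split("/")
        is_state_file = (
            len(parts) == 3
            and parts[0] == "dev-envs"
            and parts[2] == "terraform.tfstate"
        )
        if is_state_file and parts[1] not in live_env_names:
            orphans.append(key)
    return orphans
```

Keys that do not match the expected layout are skipped rather than deleted. A fuller version would probably also confirm that no other resources recorded in the state (security groups, IAM policies) still exist before removing the file.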

Secrets Manager GetSecretValue pricing at scale. AWS charges $0.05 per 10,000 API calls to Secrets Manager. Dev environments that poll for secrets on every startup (or worse, on every request) at 30 environments × 20 startups per day × 5 secrets per startup = 3,000 GetSecretValue calls per day, or roughly 90,000 calls and $0.45 per month - fine. At 200 environments × same pattern = about $3/month - still fine. But if an application bug causes a tight startup retry loop calling GetSecretValue continuously, costs can spike to hundreds of dollars per day before anyone notices. Set a CloudWatch alarm on the GetSecretValue API call rate.
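To choose an alarm threshold, it helps to invert the pricing and ask what sustained call rate corresponds to a given daily spend. A rough sketch using the $0.05 per 10,000 calls rate, with illustrative function names:

```python
PRICE_PER_10K_CALLS_USD = 0.05
SECONDS_PER_DAY = 86_400


def daily_cost(calls_per_second: float) -> float:
    """Daily GetSecretValue spend at a sustained fleet-wide call rate."""
    calls_per_day = calls_per_second * SECONDS_PER_DAY
    return calls_per_day / 10_000 * PRICE_PER_10K_CALLS_USD


def calls_per_second_for(daily_budget_usd: float) -> float:
    """Sustained call rate that would consume the given daily budget."""
    calls_per_day = daily_budget_usd / PRICE_PER_10K_CALLS_USD * 10_000
    return calls_per_day / SECONDS_PER_DAY


# $100/day corresponds to roughly 230 calls per second across the fleet
runaway_rate = calls_per_second_for(100)
```

Because normal startup traffic is a small fraction of one call per second, an alarm set far below the runaway rate would still sit comfortably above the baseline.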
