Terraform at Trading-Firm Scale: Module Patterns, State Isolation, and Multi-Account Hygiene
How ZeroCopy manages 4 AWS accounts and DigitalOcean with isolated Terraform state, Atlantis PR workflows, and module patterns that prevent production drift.
The incident happened at Upside two weeks after I joined. A senior engineer ran terraform apply from their laptop in the production AWS account to push a “minor security group update.” The apply succeeded, but it had been run against a stale state file - the engineer had pulled state before someone else had made three unrelated changes that day. Two of those changes got overwritten. One of them was a firewall rule protecting the order management system’s admin port.
Nobody noticed for six days. No breach occurred, but the exposure window was real, the audit trail was muddy, and the postmortem was not pleasant. The fix was not technical - it was procedural. We moved to a model where terraform apply could only happen through CI after a PR review, never from a developer’s laptop. Within three months, the number of unreviewed infrastructure changes dropped from several per week to zero.
That experience shaped how I build Terraform setups at every company since, including ZeroCopy. What follows is the complete pattern.
Module Architecture: The Three-Layer Model
Terraform modules can be organized in many ways, and most organizations get it wrong by starting too abstract. The pattern that has worked consistently for me across three companies is a three-layer model:
Layer 1: Root modules (one per environment). These are what you actually terraform apply. They compose child modules, reference data sources, and produce outputs. The root module for prod-aws knows nothing about staging-aws - they are entirely separate Terraform workspaces with separate state.
Layer 2: Service modules (reusable units of infrastructure). Each service module encapsulates one deployable service: a trading node cluster, a database instance, a NATS cluster. Service modules take structured inputs and produce structured outputs. They contain no hard-coded environment-specific values.
Layer 3: Data modules (shared lookups). These are pure data blocks: AMI lookups, Route53 zone references, VPC data sources. They exist to avoid duplicating lookup logic across multiple service modules.
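A minimal sketch of what one of these lookup files might contain, assuming a bake pipeline that names AMIs with a common prefix. The name pattern and the output name are hypothetical, not taken from the actual ZeroCopy configuration:

```hcl
# data/amis.tf (illustrative sketch)
# Finds the most recent AMI owned by this account whose name matches a
# hypothetical bake-pipeline naming convention.
data "aws_ami" "trading_base" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["trading-base-*"] # hypothetical naming convention
  }

  filter {
    name   = "state"
    values = ["available"]
  }
}

# Expose only the ID so callers do not depend on how the lookup is done.
output "trading_base_ami_id" {
  description = "ID of the most recent trading base AMI"
  value       = data.aws_ami.trading_base.id
}
```

A root module could load the directory as a module and pass the output into any service module that needs an AMI, so the filter logic lives in exactly one place.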
The directory structure:
infra/
├── environments/
│   ├── prod-aws/
│   │   ├── main.tf          # Root module: composes services
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── backend.tf       # Remote state config (S3 + DynamoDB)
│   ├── staging-aws/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── backend.tf
│   └── prod-do/             # DigitalOcean production
│       ├── main.tf
│       └── backend.tf
├── modules/
│   ├── trading-node/        # Service module: EC2 cluster + EBS + SGs
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── nats-cluster/        # Service module: NATS cluster + persistence
│   ├── vpc-baseline/        # Service module: VPC + subnets + route tables
│   └── trading-database/    # Service module: RDS + parameter groups
└── data/
    ├── amis.tf              # Data sources: AMI lookups
    └── zones.tf             # Data sources: Route53 zones
The key discipline: service modules contain zero environment-specific knowledge. The trading-node module does not know it is in production. It receives an environment name, an instance_type, and a list of subnet_ids as input variables, all of which the root module provides. This means the same module definition runs identically in staging and production, with only the inputs differing.
A minimal trading node module:
# modules/trading-node/variables.tf
variable "environment" {
  description = "Environment name (prod, staging, dev)"
  type        = string
}

variable "instance_type" {
  description = "EC2 instance type - c6i.metal for prod, c6i.2xlarge for staging"
  type        = string
}

# Referenced by count in main.tf, so it has to be declared here
variable "instance_count" {
  description = "Number of trading nodes to launch"
  type        = number
}

variable "placement_group_id" {
  description = "Cluster placement group ID - required for prod, optional for staging"
  type        = string
  default     = null
}

variable "subnet_ids" {
  description = "Subnet IDs to launch instances into - must be single-AZ for cluster placement"
  type        = list(string)
}

variable "ami_id" {
  description = "Base AMI ID - pulled from data source in root module"
  type        = string
}

# modules/trading-node/main.tf
# (aws_security_group.trading_node lives in the same module; omitted here for brevity)
resource "aws_instance" "node" {
  count         = var.instance_count
  ami           = var.ami_id
  instance_type = var.instance_type
  # The placement_group argument takes the group name, which is what the
  # id attribute of an aws_placement_group resource returns
  placement_group        = var.placement_group_id
  subnet_id              = var.subnet_ids[0] # Single AZ enforced
  vpc_security_group_ids = [aws_security_group.trading_node.id]

  # trading nodes use GP3 root volume: 3000 IOPS baseline, no extra cost
  root_block_device {
    volume_type           = "gp3"
    volume_size           = 100
    iops                  = 3000
    throughput            = 125
    delete_on_termination = true
    encrypted             = true
  }

  # Critical: never let Terraform restart trading nodes due to user_data changes
  lifecycle {
    ignore_changes = [
      user_data,
      user_data_base64,
      ami, # AMI updates go through bake + replace, not in-place
    ]
  }

  tags = {
    Name        = "${var.environment}-trading-node-${count.index}"
    Environment = var.environment
    Service     = "trading-hot-path"
    ManagedBy   = "terraform"
  }
}
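For completeness, here is a sketch of how a root module might call this service module. It assumes the module declares instance_count (main.tf references var.instance_count); the placement group name, the node count, and the two root-level variables are hypothetical:

```hcl
# environments/prod-aws/main.tf (illustrative sketch)

# Hypothetical root-level inputs. In practice these would come from the
# VPC module outputs and the shared AMI lookup.
variable "trading_subnet_ids" {
  type = list(string)
}

variable "trading_ami_id" {
  type = string
}

resource "aws_placement_group" "trading" {
  name     = "prod-trading-cluster" # hypothetical name
  strategy = "cluster"
}

module "trading_node" {
  source = "../../modules/trading-node"

  # Every environment-specific value is supplied here, not inside the module.
  environment        = "prod"
  instance_type      = "c6i.metal"
  instance_count     = 3 # hypothetical count
  placement_group_id = aws_placement_group.trading.id
  subnet_ids         = var.trading_subnet_ids
  ami_id             = var.trading_ami_id
}
```

The staging root module would contain the same module block with a cheaper instance_type and no placement group, which is the whole point of keeping environment knowledge out of the module.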
State Isolation: One Bucket Per Environment, Always
The Upside incident would not have been possible with remote, locked state, and per-environment isolation limits the damage when something still goes wrong. The rule is simple and non-negotiable: each environment gets its own S3 bucket and its own DynamoDB table for state locking.
This is not about S3 costs (negligible). It is about blast radius. If staging state gets corrupted, it cannot affect production state. If a developer runs an errant apply against staging, the production state file is physically in a different bucket with different IAM permissions.
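The per-environment bucket and lock table have to exist before any backend can point at them. A minimal bootstrap sketch, assuming it is applied once from inside each environment's own account; the variable names, the versioning, and the public access block are my additions rather than something the setup above is known to include:

```hcl
# Bootstrap resources for one environment's state (illustrative sketch)
variable "environment" {
  type = string # e.g. "prod-aws"
}

variable "bucket_suffix" {
  type = string # hypothetical: random suffix that keeps the bucket name globally unique
}

resource "aws_s3_bucket" "tfstate" {
  bucket = "zerocopy-tfstate-${var.environment}-${var.bucket_suffix}"

  lifecycle {
    prevent_destroy = true
  }
}

# Versioning makes it possible to roll back to an earlier state file.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "tfstate_lock" {
  name         = "zerocopy-tfstate-lock-${var.environment}"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend expects exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```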
The state backend configuration:
# environments/prod-aws/backend.tf
terraform {
  backend "s3" {
    bucket  = "zerocopy-tfstate-prod-aws-55b6bb7a"
    key     = "prod-aws/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # DynamoDB table for state locking - prevents concurrent applies
    dynamodb_table = "zerocopy-tfstate-lock-prod-aws"

    # IAM role in the prod account (not the deployer's personal credentials)
    # Note: Terraform 1.6+ prefers the nested assume_role setting; the
    # top-level role_arn still works but is reported as deprecated
    role_arn = "arn:aws:iam::PROD_ACCOUNT_ID:role/terraform-deployer"
  }
}

# environments/staging-aws/backend.tf
terraform {
  backend "s3" {
    bucket         = "zerocopy-tfstate-staging-aws-55b6bb7a" # Different bucket
    key            = "staging-aws/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "zerocopy-tfstate-lock-staging-aws" # Different table
    role_arn       = "arn:aws:iam::STAGING_ACCOUNT_ID:role/terraform-deployer"
  }
}
The role_arn is the other critical piece. The state bucket in the production account is only accessible to a specific IAM role, terraform-deployer, not to individual developer credentials. The CI system (Atlantis, GitHub Actions) assumes this role on a developer's behalf; developers cannot reach the production state bucket directly with their personal AWS keys. This is enforced through bucket policy.
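One way such a bucket policy could look, as a sketch rather than the policy actually deployed: a blanket deny on the state bucket for every principal except the deployer role. The two variable names are hypothetical.

```hcl
# State bucket policy restricting access to the deployer role (illustrative sketch)
variable "state_bucket_name" {
  type = string # e.g. the prod-aws state bucket
}

variable "deployer_role_arn" {
  type = string # ARN of the terraform-deployer role in the same account
}

data "aws_iam_policy_document" "tfstate_bucket" {
  statement {
    sid     = "DenyAllExceptDeployer"
    effect  = "Deny"
    actions = ["s3:*"]

    resources = [
      "arn:aws:s3:::${var.state_bucket_name}",
      "arn:aws:s3:::${var.state_bucket_name}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    # For an assumed-role session, aws:PrincipalArn is the role's ARN,
    # so any session of the deployer role is exempt from the deny.
    # Caution: a deny this broad also locks out administrators. Real
    # policies usually list a break-glass role here as well.
    condition {
      test     = "ArnNotEquals"
      variable = "aws:PrincipalArn"
      values   = [var.deployer_role_arn]
    }
  }
}

resource "aws_s3_bucket_policy" "tfstate" {
  bucket = var.state_bucket_name
  policy = data.aws_iam_policy_document.tfstate_bucket.json
}
```

An explicit deny is used instead of relying on the absence of an allow because it also overrides over-broad IAM policies attached to users in the same account.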
Atlantis: The Apply-From-PR Model
Atlantis is the piece that ties this together operationally. It is a small Go service that you host yourself (in a Kubernetes cluster, for example); it listens for GitHub pull request webhooks and executes Terraform on your behalf. The workflow:
- Developer opens a PR with an infrastructure change
- Atlantis automatically runs terraform plan and posts the plan output as a PR comment
- Team reviews the plan as part of the PR review
- After approval, a reviewer comments atlantis apply to trigger the apply
- Atlantis applies the plan and posts the result
The value is not just auditability (though that matters). It is that terraform plan output is visible before anyone applies. Reviewers can see exactly what AWS resources will be created, modified, or destroyed. For trading infrastructure, this is the difference between “I think this security group rule change is safe” and “I can see the diff: one ingress rule is being removed from sg-0abc123 on port 9876.”
Our Atlantis configuration:
# atlantis.yaml (at repo root)
version: 3
automerge: false
delete_source_branch_on_merge: false
projects:
  - name: prod-aws
    dir: infra/environments/prod-aws
    workspace: prod-aws
    autoplan:
      enabled: true
      when_modified:
        - "**/*.tf"
        - "../../modules/**/*.tf" # Also plan when modules change
    apply_requirements:
      - approved # At least one approval required
      - mergeable # PR must be mergeable (no conflicts, checks passing)
  - name: staging-aws
    dir: infra/environments/staging-aws
    workspace: staging-aws
    autoplan:
      enabled: true
      when_modified:
        - "**/*.tf"
        - "../../modules/**/*.tf"
    apply_requirements:
      - approved
      # No mergeable requirement for staging - faster iteration
  - name: prod-do
    dir: infra/environments/prod-do
    workspace: prod-do
    autoplan:
      enabled: true
      when_modified:
        - "**/*.tf"
    apply_requirements:
      - approved
      - mergeable
One operational decision worth discussing: automerge: false. Atlantis can be configured to automatically merge the PR after a successful apply. We do not use this because an infrastructure change being applied correctly does not mean it should be merged - there may be follow-on testing or coordination steps. The merge is a human decision.
Multi-Account Structure
At trading-firm scale, a single AWS account is an operational liability. The structure that has served us well is four accounts via AWS Organizations:
prod-trading - contains all production trading infrastructure. Restricted: only the CI system can deploy here. No human IAM users with programmatic access to production resources. Access is via temporary role assumption from the CI system only.
staging-trading - mirrors production topology but with cheaper instance types. Developers can have broader (but still role-based) access here. This is where you test infrastructure changes before promoting to prod.
shared-services - contains cross-account resources: artifact repositories (ECR), secrets management (Secrets Manager or Parameter Store for non-sensitive config), centralized logging (CloudWatch cross-account), and the Atlantis deployment itself. Shared services has no trading infrastructure.
security - contains the IAM Identity Center (SSO) configuration, CloudTrail storage, and Security Hub aggregator. Completely locked down except for security tooling.
# AWS Organizations structure
resource "aws_organizations_account" "prod_trading" {
  name      = "zerocopy-prod-trading"
  email     = "aws-prod@zerocopy.systems"
  parent_id = aws_organizations_organizational_unit.trading.id
  # SCPs attached to the OU control what accounts in this OU can do
}

# Service Control Policy: prod account cannot disable CloudTrail
resource "aws_organizations_policy" "require_cloudtrail" {
  name = "RequireCloudTrail"
  type = "SERVICE_CONTROL_POLICY"
  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Deny"
        Action = [
          "cloudtrail:StopLogging",
          "cloudtrail:DeleteTrail",
          "cloudtrail:UpdateTrail"
        ]
        Resource = "*"
        Condition = {
          ArnNotLike = {
            "aws:PrincipalArn" = [
              "arn:aws:iam::*:role/security-audit-*"
            ]
          }
        }
      }
    ]
  })
}
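The snippet above references an organizational unit and defines a policy, but a policy has no effect until it is attached to a target. A sketch of the two missing pieces, assuming they are managed from the Organizations management account and that the OU is simply named "trading" (a hypothetical name):

```hcl
# OU and SCP attachment (illustrative sketch)
data "aws_organizations_organization" "current" {}

resource "aws_organizations_organizational_unit" "trading" {
  name      = "trading" # hypothetical OU name
  parent_id = data.aws_organizations_organization.current.roots[0].id
}

# Attaching at the OU level applies the SCP to every account inside the OU,
# including accounts added later.
resource "aws_organizations_policy_attachment" "trading_require_cloudtrail" {
  policy_id = aws_organizations_policy.require_cloudtrail.id
  target_id = aws_organizations_organizational_unit.trading.id
}
```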
Drift Detection: The Cron Plan
Even with Atlantis enforcing PR-based applies, drift happens. Engineers log into the AWS console and click “just this once.” A third-party integration (Datadog, PagerDuty) creates resources during onboarding. An AWS service creates default resources you did not ask for.
The detection mechanism is a scheduled terraform plan -refresh-only run that alerts on any unexpected diff:
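# .github/workflows/drift-detection.yml
name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 6 * * *' # 6 AM UTC daily
  workflow_dispatch: # Allow manual trigger
jobs:
  detect-drift:
    runs-on: self-hosted
    strategy:
      matrix:
        # GitHub secret names cannot contain hyphens, so each environment
        # maps to an explicit secret name (names here are illustrative)
        include:
          - environment: prod-aws
            role_secret: TF_DEPLOY_ROLE_PROD_AWS
          - environment: staging-aws
            role_secret: TF_DEPLOY_ROLE_STAGING_AWS
          - environment: prod-do
            role_secret: TF_DEPLOY_ROLE_PROD_DO
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets[matrix.role_secret] }}
          aws-region: us-east-1
      - name: Terraform Init
        working-directory: infra/environments/${{ matrix.environment }}
        run: terraform init
      - name: Drift Check
        id: drift
        working-directory: infra/environments/${{ matrix.environment }}
        shell: bash
        run: |
          # -refresh-only: compares real infrastructure against Terraform state, proposes no changes
          set +e # exit code 2 means drift, not failure
          terraform plan -refresh-only -detailed-exitcode 2>&1 | tee drift-output.txt
          # PIPESTATUS[0] is terraform's exit code; $? would be tee's
          code=${PIPESTATUS[0]}
          echo "exit_code=${code}" >> "$GITHUB_OUTPUT"
          # 0 = no drift, 2 = drift (handled below), anything else = real failure
          if [ "$code" -ne 0 ] && [ "$code" -ne 2 ]; then exit "$code"; fi
      - name: Alert on Drift
        if: steps.drift.outputs.exit_code == '2' # exit code 2 = drift detected
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Terraform drift detected in ${{ matrix.environment }}",
              "attachments": [{
                "color": "danger",
                "text": "Run `atlantis plan` on a PR or investigate manually."
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.INFRA_SLACK_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK # v1 of the action needs this for incoming webhooks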
The lifecycle.ignore_changes Discipline
This deserves special treatment because getting it wrong will cause outages.
Several AWS resource properties change through mechanisms outside Terraform’s control. If Terraform does not know to ignore these changes, a routine terraform plan will show a diff, and an atlantis apply will either modify a running trading node or, worse, destroy and recreate it.
The patterns I always include for trading infrastructure:
resource "aws_instance" "trading_node" {
  # ... other config ...
  lifecycle {
    ignore_changes = [
      # AMI updates: new AMIs go through a bake + replace process,
      # not Terraform in-place. If we don't ignore this, any AMI update
      # to the data source will mark ALL trading nodes for replacement.
      ami,
      # user_data: initialization script. After first boot, changes here
      # should NOT restart the trading node. New config goes through
      # the application config management layer, not re-provisioning.
      user_data,
      user_data_base64,
      # EBS optimized flag: AWS may toggle this based on instance type capabilities
      ebs_optimized,
      # Volume attributes that AWS auto-adjusts
      root_block_device[0].iops, # AWS may adjust IOPS within constraints
      root_block_device[0].throughput,
    ]
    # Prevent accidental destruction of trading nodes
    # This means Terraform will error rather than destroy - requires manual override
    prevent_destroy = true
  }
}

resource "aws_security_group" "trading_node" {
  # ...
  lifecycle {
    # If you manage security group rules in a separate resource (aws_security_group_rule),
    # Terraform's view of inline rules vs external rules can conflict.
    # Either manage rules entirely inline OR entirely externally, then ignore the other.
    ignore_changes = [ingress, egress] # Only if using external rule resources
  }
}
The prevent_destroy = true on trading nodes deserves a note. It adds friction: you have to remove it from the configuration, through a reviewed PR, before Terraform will plan a destroy or replacement of those instances. That friction is intentional. An accidental terraform destroy on a production trading node cluster during a market session would be a very expensive mistake.
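On the security group comment in the block above: when rules are managed externally, each rule becomes its own resource and therefore its own line in a plan diff. A minimal sketch of one such rule, reusing port 9876 from the earlier example; the CIDR range and description are hypothetical:

```hcl
# One externally managed ingress rule (illustrative sketch).
# Assumes aws_security_group.trading_node is defined without inline
# ingress/egress blocks, as described in the comment above.
resource "aws_security_group_rule" "admin_port_ingress" {
  type              = "ingress"
  description       = "Admin port from the internal management range"
  from_port         = 9876
  to_port           = 9876
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/16"] # hypothetical CIDR
  security_group_id = aws_security_group.trading_node.id
}
```

Removing this resource in a PR shows up as exactly one rule being destroyed, which is the kind of diff a reviewer can reason about quickly.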
How This Breaks in Production
Stale state lock after a network partition. The DynamoDB lock is a plain item with no lease or expiry; Terraform deletes it when the run finishes. If your CI runner loses network connectivity mid-apply and the lock is not released cleanly, every later plan or apply fails with a state lock error until someone clears it. The fix is terraform force-unlock <LOCK_ID>, using the lock ID printed in that error message (which also shows who took the lock and when). Confirm the original run is really dead before forcing the unlock, and have a runbook for this before you need it.
Atlantis plan/apply desync. Atlantis generates a plan, then applies it later. If the Terraform state changes in between (another apply lands first), Terraform rejects the saved plan as stale and it has to be re-planned. If the real AWS resources change without touching state (a console edit, an auto-scaling event), the apply can fail midway or act on a diff that no longer matches reality. This is not unique to Atlantis - it is a fundamental property of Terraform’s plan-then-apply model. Mitigate by keeping the gap between plan and apply short and not applying during active trading hours.
Module version pin drift. If your modules directory is a local path (as in our setup), module changes affect all environments that reference them simultaneously. A change to modules/trading-node/main.tf affects both staging and prod plans at once. This is intentional for a small team (you want to know immediately if a module change would affect prod) but can be surprising. Teams that want environment-independent module evolution need versioned module releases via a Terraform registry.
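For teams that do want environments to move independently, a pinned module source might look like the sketch below. The repository URL and tag are hypothetical, and a private registry with a version constraint would serve the same purpose:

```hcl
# environments/prod-aws/main.tf with a pinned module version (illustrative sketch)
variable "trading_subnet_ids" {
  type = list(string)
}

variable "trading_ami_id" {
  type = string
}

module "trading_node" {
  # Hypothetical repository and tag. Prod stays on v1.4.0 until this line
  # is changed in a reviewed PR, even if staging already runs a newer tag.
  source = "git::https://github.com/example-org/infra-modules.git//trading-node?ref=v1.4.0"

  environment    = "prod"
  instance_type  = "c6i.metal"
  instance_count = 3 # hypothetical count
  subnet_ids     = var.trading_subnet_ids
  ami_id         = var.trading_ami_id
}
```

The trade-off is the one described above: pinning removes the surprise, but it also removes the early warning that a module change would alter the prod plan.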
SCPs blocking legitimate CI operations. An explicit deny in a Service Control Policy cannot be overridden by any IAM policy in the member account. If you add an SCP that denies an action in the prod account, and your CI system needs that action, your deploys will fail with AccessDenied errors that look like IAM issues. Always test SCP changes against your CI role before attaching them to production OUs.
DigitalOcean provider schema drift. DigitalOcean’s Terraform provider has shipped schema-breaking changes where fields added to the provider schema do not match existing resource state. The public_networking field on droplets is one example - adding it to the provider schema caused plans to show unexpected modifications on resources that had not changed. The fix is lifecycle.ignore_changes on the affected field, preceded by a terraform state show to understand what the state actually contains before applying.
Sensitive values in state files. Terraform state files record the attributes of every resource, including sensitive ones, in plain text. If your state is in S3 and a developer has read access to that bucket (perhaps through an over-broad IAM policy), they can read database passwords, API keys, or private keys that Terraform creates. Always encrypt state at rest, restrict bucket access to the CI role only, and consider using Vault or AWS Secrets Manager for sensitive values rather than generating them in Terraform.
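One concrete way to keep a database password out of state, sketched under the assumption that the trading database runs on RDS: let RDS create and store the master password in Secrets Manager, so Terraform never sees the value. The identifier, engine, sizing, and username below are hypothetical.

```hcl
# RDS instance whose master password never enters Terraform state (illustrative sketch)
resource "aws_db_instance" "trading" {
  identifier        = "trading-db" # hypothetical
  engine            = "postgres"
  instance_class    = "db.r6g.large"
  allocated_storage = 100
  username          = "trading_admin"

  # RDS generates the password and keeps it in Secrets Manager.
  # No password argument is set, so no password is written to state.
  manage_master_user_password = true

  storage_encrypted         = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "trading-db-final"
}

# State holds only the ARN of the secret, not its contents.
output "db_master_secret_arn" {
  value = aws_db_instance.trading.master_user_secret[0].secret_arn
}
```

Applications then read the secret at runtime through their own IAM permissions. Reading it back through a Terraform data source would defeat the purpose, because data source results are also stored in state.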