Secrets Management at a Trading Firm: Vault, KMS, and the Secret-Zero Problem
ZeroCopy uses Infisical and AWS KMS. The secret-zero problem was the hardest design challenge. Here is the full secrets management architecture for a trading firm.
Every secrets manager has the same bootstrapping problem: the manager requires a secret to authenticate, and you need to store that secret somewhere. If you store it in the container image, you have baked a credential into your artifact. If you store it in an environment variable, someone with shell access to the host can read it. If you store it in a configuration file, that file needs to be protected, which requires another credential.
This is the secret-zero problem. It is not theoretical. I have seen trading firms with exchange API keys committed in docker-compose.yml files in their primary repository - keys that were rotated once, after the SOC 2 auditor asked about them, but that exist in the git history forever.
The secret-zero problem has real solutions. This is a guide to the architecture that works.
The Secret-Zero Problem in Detail
Consider a trading service that needs to authenticate to a secrets manager (Vault, AWS Secrets Manager, Infisical) to retrieve its exchange API keys. The service needs to provide something to the secrets manager to prove it is the authorized service and not an attacker. That something is the initial credential - the secret-zero.
The naive solutions and why they fail:
Bake a credential into the container image. Anyone with access to pull your container images (Docker Hub, ECR, GHCR) can extract the credential. Your CI/CD system, every developer who pulls the image, and any system that runs the image all have the credential.
Pass the credential as an environment variable at container start. Better - the credential is not in the artifact. But it lives in your container orchestration system’s configuration. In Kubernetes, this means it is in a Secret object that any user or service account with sufficient RBAC permissions in the namespace can read. It can also end up in shell history on every machine where anyone ran docker run -e SECRET=....
Store it in a config file mounted at runtime. Similar issues - the file has to live somewhere, and that somewhere requires protection.
The actual solutions rely on the cloud platform proving identity without a user-provided credential.
Solution 1: IRSA (IAM Roles for Service Accounts)
IRSA is the AWS solution for workloads running in Kubernetes on EKS. It binds a Kubernetes service account to an IAM role. When a pod uses that service account, a short-lived, signed service account token is projected into the pod, and the AWS SDK exchanges that token with STS (AssumeRoleWithWebIdentity) for temporary credentials for the associated IAM role. The instance metadata endpoint is not involved.
The secret-zero problem is dissolved: the pod never has a static credential. The Kubernetes control plane and the IAM system jointly assert the pod’s identity based on the service account token.
The setup:
# Create an IAM role with a trust policy that allows the k8s service account to assume it
aws iam create-role \
  --role-name trading-engine-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E:sub": "system:serviceaccount:trading:trading-engine"
        }
      }
    }]
  }'

# The Kubernetes service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: trading-engine
  namespace: trading
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/trading-engine-role
With this configuration, the trading engine pod automatically gets temporary IAM credentials scoped to its role. It can then call AWS Secrets Manager to retrieve its exchange API keys without ever holding a static credential.
The temporary credentials expire after a short session (one hour by default), and the AWS SDK refreshes them automatically before they expire.
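A minimal sketch of that last step in Python, assuming boto3 is installed and the secret is stored as a JSON string. The function name and the secret name used in the usage note are hypothetical. Under IRSA, boto3's default credential chain picks up the projected token on its own, so no credential appears in the code.

```python
import json


def load_exchange_keys(secret_id, client=None):
    """Fetch a JSON secret from AWS Secrets Manager.

    `client` is injectable for testing. By default a boto3 client is created,
    which resolves credentials through the default chain (IRSA inside the pod).
    """
    if client is None:
        import boto3  # imported lazily so the module loads without boto3

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```

Inside the pod, a call such as load_exchange_keys("trading/exchange-keys/strategy-alpha") would return the parsed key material, provided the IAM role is allowed secretsmanager:GetSecretValue on that secret.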
Solution 2: Instance Identity (EC2 Instance Metadata Service)
For services running on EC2 instances (rather than Kubernetes), the Instance Metadata Service (IMDS) provides a similar primitive. An IAM instance profile attached to the EC2 instance grants the instance a role. The instance’s applications call http://169.254.169.254/latest/meta-data/iam/security-credentials/{role-name} to retrieve temporary credentials.
The important hardening step: enable IMDSv2, which requires a session-oriented token request before retrieving credentials. IMDSv1 allowed any process on the instance - including a compromised container - to call the metadata endpoint with a simple curl. IMDSv2 requires a PUT request to obtain the session token, which provides defense-in-depth against server-side request forgery (SSRF) attacks.
# Force IMDSv2 on an existing instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-endpoint enabled \
  --http-tokens required
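To make the two-step IMDSv2 exchange concrete, here is a sketch in Python using only the standard library. In practice the AWS SDKs do this for you; the function names are hypothetical, and fetch_role_credentials only works when run on an EC2 instance with an instance profile attached.

```python
import json
import urllib.request

IMDS = "http://169.254.169.254"


def build_token_request(ttl_seconds=21600):
    """IMDSv2 step 1: a PUT request that asks for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_credentials_request(token, role_name):
    """IMDSv2 step 2: a GET request that presents the session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_role_credentials(role_name, timeout=2):
    """Run both steps and return the temporary credentials as a dict."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    request = build_credentials_request(token, role_name)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())
```

The PUT in step 1 is what blunts many SSRF bugs: a vulnerability that only lets an attacker trigger GET requests cannot obtain the session token.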
Solution 3: Vault AppRole with Response Wrapping
For on-premises deployments or multi-cloud environments where cloud IAM is not available, Vault’s AppRole authentication method with response wrapping provides a secure bootstrap mechanism.
The sequence:
- The machine provisioning system (Terraform, Ansible, your cloud-init script) calls Vault to generate a one-time wrapping token. This token can only be used once and expires in 60 seconds.
- The wrapping token is delivered to the machine through your provisioning channel (cloud-init user data, a separate secure delivery mechanism).
- The machine uses the wrapping token to unwrap a secret_id (a component of the AppRole credential). The wrapping token is consumed and cannot be reused.
- The machine now has its role_id (a static identifier for its AppRole) and the secret_id it unwrapped. Together these are used to authenticate to Vault and obtain a renewable token.
- The renewable token is used for all subsequent credential lookups. The machine renews it before it expires.
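The wrapping step limits the damage of interception: the wrapping token is single-use and short-lived. If the machine unwraps it first, an intercepted copy is useless. If an attacker unwraps it first, the machine’s own unwrap fails, which is a clear signal that the secret_id was exposed and must be revoked.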
# Provisioning system creates a wrapped secret_id (TTL: 60s)
# -f is required because this write sends no data
WRAPPING_TOKEN=$(vault write -wrap-ttl=60s -f -field=wrapping_token \
  auth/approle/role/trading-engine/secret-id)

# Machine uses the wrapping token to unwrap the secret_id
SECRET_ID=$(vault unwrap -field=secret_id "${WRAPPING_TOKEN}")

# Machine authenticates with role_id and secret_id
VAULT_TOKEN=$(vault write -field=token auth/approle/login \
  role_id="${ROLE_ID}" \
  secret_id="${SECRET_ID}")
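The machine-side half of this flow can also be sketched against Vault's HTTP API in Python. This is an illustration rather than a hardened client: the function names are hypothetical, the `post` parameter exists only so the flow can be exercised without a Vault server, and TLS configuration is left out.

```python
import json
import urllib.request


def _post_json(url, body, headers):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        method="POST",
        headers={"Content-Type": "application/json", **headers},
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return json.loads(resp.read())


def approle_login(vault_addr, role_id, wrapping_token, post=_post_json):
    """Unwrap the secret_id, then log in with role_id + secret_id.

    If the unwrap call fails, the wrapping token has expired or was already
    used by someone else. Treat that as a security event, not a retry case.
    """
    unwrapped = post(
        f"{vault_addr}/v1/sys/wrapping/unwrap",
        {},
        {"X-Vault-Token": wrapping_token},  # the wrapping token authenticates the unwrap
    )
    secret_id = unwrapped["data"]["secret_id"]
    login = post(
        f"{vault_addr}/v1/auth/approle/login",
        {"role_id": role_id, "secret_id": secret_id},
        {},
    )
    return login["auth"]["client_token"]
```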
The How-Not-To-Solve-It List
Config files in repositories. This includes example files whose “example” values are actually the production values. It also includes local config files that were meant to be excluded by .gitignore but were committed from a developer machine set up without the proper ignore rules. Git history is persistent: rewriting history and force-pushing does not remove the secret from any machine that already cloned or fetched the repository, so a committed credential must be treated as compromised and rotated.
Environment variables baked into container images. The ENV instruction in a Dockerfile records the value in the image’s configuration metadata. It is visible in docker inspect and docker history, in CI build logs, and to anyone who can pull the image from a container registry.
Hardcoded in config files that are committed. Config files in YAML, TOML, JSON, or any other format that contain actual credentials and are committed to source control. Even in a private repository, every developer and CI system has access.
Passed via CLI arguments. Command-line arguments are visible in /proc/<pid>/cmdline (and therefore in ps output) to other users on the system by default, and they appear in shell history. For trading systems, this means exchange API keys or signing credentials passed as CLI arguments are readable by anyone who can see the process table.
Credential Rotation: The Three States
When you rotate a credential, there is a window during which both the old and new versions are valid. This is necessary to avoid downtime - if you invalidate the old credential before all consumers have the new one, requests using the old credential fail.
The three-state rotation model:
- Previous: The prior version, still accepted during the transition window
- Current: The active version, used by consumers after rotation
- Next: Being provisioned, not yet deployed
For exchange API keys, the rotation sequence:
- Create a new API key on the exchange (this requires the exchange to allow at least two active keys per account; many do, but check the limit)
- Store the new key as the “Next” version in your credentials manager
- Deploy a configuration change that promotes “Next” to “Current” across all services; the old “Current” becomes “Previous” and stays valid
- After all services are confirmed to be using the new key (via monitoring of key usage), revoke the old key on the exchange
The monitoring step is critical. If you revoke the old key before confirming all services have rotated, you will generate authentication errors on any service that was not yet updated. For a trading system mid-market, this is a potentially costly outage.
At ZeroCopy, we implement this with a rotation daemon that:
- Checks the current rotation state in Infisical
- Compares the key fingerprint in use by each service against the “Current” version
- Blocks the revocation of “Previous” until all services report using “Current”
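The blocking rule can be sketched as a small state object in Python. This is an illustration of the three-state model, not ZeroCopy's daemon: the class, its field names, and the idea that services report a key fingerprint are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RotationState:
    """Fingerprints of the three key versions plus per-service reports."""

    current_fp: str
    previous_fp: Optional[str] = None
    next_fp: Optional[str] = None
    in_use: Dict[str, str] = field(default_factory=dict)  # service -> fingerprint

    def stage(self, fingerprint):
        """A new key was created on the exchange and stored as Next."""
        self.next_fp = fingerprint

    def promote(self):
        """Next becomes Current; the old Current stays valid as Previous."""
        if self.next_fp is None:
            raise ValueError("no Next key staged")
        self.previous_fp, self.current_fp, self.next_fp = (
            self.current_fp,
            self.next_fp,
            None,
        )

    def report(self, service, fingerprint):
        """A service reports which key fingerprint it is using."""
        self.in_use[service] = fingerprint

    def can_revoke_previous(self, services):
        """True only when every listed service reports the Current key."""
        return bool(services) and all(
            self.in_use.get(name) == self.current_fp for name in services
        )
```

A service that has never reported counts as not yet rotated, so a consumer that is missing from the monitoring data blocks revocation instead of being silently skipped.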
Vault Policy for a Trading Service
A Vault policy should follow the principle of least privilege: a trading service that only needs its exchange API keys should not have access to the database credentials for the reporting service.
# vault-policy-trading-engine.hcl

# Read access to own exchange credentials - no write, no list on parent path
path "secret/data/trading/exchange-keys/strategy-alpha" {
  capabilities = ["read"]
}

# Read own TLS certificates
path "secret/data/trading/tls/trading-engine" {
  capabilities = ["read"]
}

# Allow renewing own Vault token
path "auth/token/renew-self" {
  capabilities = ["update"]
}

# Allow looking up own token metadata
path "auth/token/lookup-self" {
  capabilities = ["read"]
}

# Explicitly deny access to other services' credentials
path "secret/data/trading/exchange-keys/+" {
  capabilities = ["deny"]
}

path "secret/data/reporting/*" {
  capabilities = ["deny"]
}
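The explicit deny blocks are important. Without them, a broader wildcard grant in another policy attached to the same token (for example on secret/data/trading/*) could give unintended access. Vault applies the most specific matching path, and deny overrides every other capability on that path, so the exact-path read on strategy-alpha still works while sibling keys stay blocked. Defense in depth: the trading engine’s policy should deny access to credentials it should not have, not merely fail to grant it.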
Audit Logging for Credential Access
Your credentials manager should log every access. At ZeroCopy, we treat the Infisical audit log as a first-class security artifact:
- Every credential read is logged with the identity, timestamp, and path
- Anomalous patterns alert (a service suddenly reading 50 credential paths it has never accessed before)
- The audit log is streamed to immutable storage (CloudWatch Logs with a 90-day retention policy and a resource policy preventing deletion)
The access log feeds into the SOC 2 evidence package. Auditors can see that the trading engine read its API keys on schedule, without unusual access patterns, and that no other service accessed those paths.
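The "never accessed before" alert described above reduces to a set difference per identity. The sketch below assumes the audit log has already been parsed into (identity, path) pairs and that a baseline of previously seen paths is kept somewhere. Both data shapes, the function name, and the threshold are assumptions for illustration, not Infisical's log format.

```python
from collections import defaultdict


def find_new_path_bursts(baseline, events, threshold=10):
    """Return {identity: new_paths} for identities reading many unseen paths.

    baseline: dict mapping identity -> set of paths it has read before.
    events:   iterable of (identity, path) tuples from the audit log.
    """
    unseen = defaultdict(set)
    for identity, path in events:
        if path not in baseline.get(identity, set()):
            unseen[identity].add(path)
    return {who: paths for who, paths in unseen.items() if len(paths) >= threshold}
```

Repeated reads of the same new path count once, so a service that legitimately starts polling one new secret does not trip the alert.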
Credential Lifecycle Management: Beyond Rotation
Most credential management discussions focus on rotation. But the credential lifecycle has two other phases that are equally important: provisioning and revocation.
Provisioning. New credentials should be provisioned through an automated, audited process - not manually created and emailed to an engineer. Every credential provisioning event should create an audit record: who requested it, what the credential is for, when it was created, and who approved it. At ZeroCopy, exchange API key provisioning creates a ticket in our issue tracker automatically, with the key’s scope and expiry date recorded.
Revocation. When an engineer departs, when a service is decommissioned, or when a credential is suspected to be compromised, revocation must be fast and complete. Fast means the credential is invalidated within minutes, not days. Complete means every downstream cache that may have the credential must be invalidated, not just the primary store.
For exchange API keys specifically, revocation requires calling the exchange’s API to invalidate the key. This is a side-effectful operation that can fail. Build a revocation tracking system: every credential that has been marked for revocation should have a status (pending, revoked, failed) and a retry mechanism for failed revocations. Do not rely on a single fire-and-forget API call.
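A minimal sketch of such a tracker in Python. The revoke_on_exchange callable stands in for whatever exchange client you use and is hypothetical; the three statuses are the ones named above, and the attempt cap is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    REVOKED = "revoked"
    FAILED = "failed"


@dataclass
class Revocation:
    key_id: str
    status: Status = Status.PENDING
    attempts: int = 0
    last_error: str = ""


def process_revocations(queue, revoke_on_exchange, max_attempts=5):
    """Try every non-revoked entry once; return the entries still outstanding.

    revoke_on_exchange(key_id) is expected to raise on any failure.
    """
    outstanding = []
    for item in queue:
        if item.status is Status.REVOKED:
            continue
        if item.attempts >= max_attempts:
            outstanding.append(item)  # retries exhausted; needs a human
            continue
        item.attempts += 1
        try:
            revoke_on_exchange(item.key_id)
            item.status = Status.REVOKED
        except Exception as exc:  # network error, 5xx, rate limit, ...
            item.status = Status.FAILED
            item.last_error = str(exc)
            outstanding.append(item)
    return outstanding
```

Running process_revocations on a schedule and alerting whenever the returned list is non-empty gives the retry behavior without relying on a single call succeeding.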
Detecting Credential Misuse
Credential misuse detection is the runtime complement to credential management. Even with excellent management practices, a credential can be compromised. The detection layer asks: is this credential being used in a way that is consistent with its intended use?
For exchange API keys, the signals:
- Geographic anomaly. If the credential has always been used from your data center’s IP ranges and a request comes from a residential ISP in a different country, this is a strong signal.
- Request rate anomaly. If a strategy API key suddenly generates 100x its normal order volume, either the strategy has malfunctioned or the key is being used by an unauthorized party.
- Off-hours access. If a service account that only operates during market hours suddenly authenticates at 3am local time, investigate.
- Unusual endpoint access. If a key that has only ever been used for spot trading suddenly attempts to access the derivatives API, this may indicate a compromised key attempting to maximize damage.
Many exchanges provide webhook or email alerts for unusual API key activity. Enable these where they exist and route them to your monitoring system. For exchange keys where the exchange does not provide anomaly detection, implement it yourself: log every API call with the credentials used, and run hourly aggregations that look for the patterns above.
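As a sketch of what one hourly aggregation could look like in Python, under some simplifying assumptions: each logged call is a dict with source_ip and endpoint fields, the geographic check is approximated by an allow-list of your own network ranges (no GeoIP lookup), and the function name and thresholds are hypothetical.

```python
from ipaddress import ip_address, ip_network


def flag_hour(calls, allowed_networks, known_endpoints, baseline_per_hour, rate_factor=100):
    """Return a sorted list of signal names raised by one hour of API calls.

    calls: list of dicts with "source_ip" and "endpoint" for a single key.
    """
    networks = [ip_network(n) for n in allowed_networks]
    signals = set()
    for call in calls:
        source = ip_address(call["source_ip"])
        if not any(source in net for net in networks):
            signals.add("unexpected_source_ip")
        if call["endpoint"] not in known_endpoints:
            signals.add("unusual_endpoint")
    # Rate check: this hour's volume against a multiple of the normal volume
    if len(calls) >= rate_factor * baseline_per_hour:
        signals.add("request_rate")
    return sorted(signals)
```

An off-hours check would follow the same shape: compare each call's timestamp against the key's expected operating window and add another signal name.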
How This Breaks in Production
The failure mode that has cost firms real money: rotation that does not coordinate across all consumers. A firm runs a daily rotation job that revokes the old key and creates a new one in sequence. The rotation job does not check whether all services have been updated. Some services cache the API key in memory for 24 hours to avoid hitting the credentials manager on every request. When the key is rotated, the caching services continue to use the old key - which has been revoked. Authentication errors cascade. In a live market, this means missed executions.
The fix: when designing credential rotation, model every consumer of every credential and its caching behavior. The rotation job must account for the longest cache TTL across all consumers. If the trading engine caches API keys for an hour, the rotation window (the period during which both old and new versions are valid) must be at least one hour.
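The rule is simple arithmetic, and writing it down keeps it from being skipped. A sketch in Python, where the consumer names, the safety margin, and the function names are assumptions for illustration:

```python
from datetime import timedelta


def min_rotation_window(cache_ttls, safety_margin=timedelta(minutes=10)):
    """Shortest safe overlap of old and new key: longest cache TTL plus a margin.

    cache_ttls: dict mapping consumer name -> how long it caches the key.
    """
    if not cache_ttls:
        raise ValueError("list every consumer before rotating")
    return max(cache_ttls.values()) + safety_margin


def safe_to_revoke(promoted_at, now, cache_ttls):
    """True once the old key has outlived every consumer's cache."""
    return now - promoted_at >= min_rotation_window(cache_ttls)
```

With a trading engine that caches for one hour and another consumer that caches for five minutes, the window is one hour plus the margin. The shorter TTL never matters; only the longest one does.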
The second failure mode is over-broad credential access. A service granted access to all credentials in the trading/* path because “it needs to be able to fetch its own credentials” can now also fetch every other trading service’s credentials. If that service is compromised, the blast radius is the entire trading system rather than just that service.
The third failure mode is treating credential management as a solved problem after the initial implementation. The system that was correctly configured six months ago may have drifted: new services added without their own credentials (using a shared service account instead), rotation schedules slipped because someone overrode the automation during an incident and never re-enabled it, access paths not revoked after a service was decommissioned.
Run a credential hygiene audit quarterly. Pull the full list of credentials from your manager. For each: is the owning service still running? Is the rotation schedule current? Does the access scope still match what the service actually needs? This audit takes a day and should be part of your recurring security operations calendar.
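The three audit questions translate directly into checks over an export of your credentials. The record layout below is hypothetical; in practice the fields would be filled from your credentials manager, your deployment inventory, and the audit log.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CredentialRecord:
    path: str
    owner_running: bool          # is the owning service still deployed?
    last_rotated: date
    rotation_days: int           # agreed rotation interval
    granted_readers: frozenset   # identities on the ACL
    actual_readers: frozenset    # identities seen reading it in the audit log


def audit(record, today):
    """Return the list of hygiene findings for one credential."""
    findings = []
    if not record.owner_running:
        findings.append("owner decommissioned")
    if today - record.last_rotated > timedelta(days=record.rotation_days):
        findings.append("rotation overdue")
    unused = record.granted_readers - record.actual_readers
    if unused:
        findings.append("unused grants: " + ", ".join(sorted(unused)))
    return findings
```

An empty list means the credential passed. Anything else becomes a work item for the quarter.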
Every credential path should have exactly the set of services that need it on its ACL. Nothing more.
The fourth failure mode applies specifically to trading firms that onboard new exchange integrations rapidly: the credential management discipline that applies to existing exchange connections is not extended to new ones. A new exchange integration is added with a broad-permission API key that was generated manually during testing and never replaced with a properly scoped production key. Six months later, that test key is still in production because the rotation process was never applied to that connection.
Treat every new exchange integration as an opportunity to verify that the credential management process was followed from day one: a dedicated key per integration, scoped to the minimum required permissions, stored in the credentials manager, and enrolled in the rotation schedule. The cost of this discipline at connection time is a few hours. The cost of discovering a compromised, over-permissioned, unrotated key during an incident is measured in missed trades and potential account bans.