Zero-Trust Networking for a Crypto Trading Firm: BeyondCorp Patterns Applied to Exchange Connectivity
At ZeroCopy we run BeyondCorp-style access - no VPN, device certificate verification on every connection. Here is the full architecture.
Two years ago I got a call from a friend who ran security at a mid-size crypto trading firm. A contractor had left three months earlier. The contractor’s VPN account had been deactivated, but no one had revoked their access to the internal wiki, the Grafana dashboards, or the staging environment. The staging environment shared a subnet with production. The contractor - now working at a competitor - had been reading the firm’s internal monitoring dashboards for three months via a bookmark they had saved before departure.
Nothing was stolen. No trades were front-run. But the access was real, and the discovery only happened because someone at the firm noticed a login from an unexpected IP in a quarterly access review. The firm was lucky.
That story is the best argument I have for why VPN-based network security is the wrong model for 2026.
Why VPN Is the Wrong Model
The VPN model grants network-level access first and then relies on application-level authentication to enforce what that user can do. The mental model is a castle and moat: once you are inside the moat (authenticated to the VPN), you can reach anything on the network that is not explicitly blocked by a firewall rule.
The problem is that this model grants implicit trust based on network position. If an attacker obtains valid VPN credentials - through phishing, credential stuffing, malware on the employee’s machine, or the contractor scenario above - they inherit the network trust position of that account. They can reach internal services, probe internal endpoints, and pivot to adjacent systems.
The VPN model has two additional failure modes specific to trading infrastructure:
Latency. VPN tunnels add latency. For a trading firm where market data needs to reach the strategy engine with minimal delay, adding a VPN hop on the critical path is a non-starter. Zero-trust architectures keep the data path clean while applying authentication to the control plane.
The insider threat. For an internal attacker with valid VPN credentials, the network perimeter provides zero protection. Zero-trust architectures apply the same authentication requirements regardless of whether the request comes from inside or outside the corporate network.
Zero-Trust Principles
Zero-trust is not a product. It is an architectural principle with three components:
- Authenticate every connection. Every request to every service must present a valid identity credential. There is no implicit trust based on network location. A request from the same machine that made an authenticated request two minutes ago must re-authenticate.
- Authorize every request. Having a valid identity does not mean you have access to any particular resource. Authorization is checked at the resource level, against a policy that specifies which identities can perform which operations on which resources.
- No implicit network trust. A connection from 10.0.0.0/8 is not inherently more trusted than a connection from the public internet. The source IP is not an authorization signal.
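The three principles can be sketched as a single default-deny check. This is a minimal illustration, not our production policy engine: the `Request` type, the `POLICY` set, and the resource names are hypothetical, and the identity is assumed to have already been verified from the presented credential.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    identity: str   # verified identity taken from the presented credential
    operation: str  # e.g. "read" or "write"
    resource: str   # e.g. "positions-db"
    source_ip: str  # recorded for audit only, never used for authorization


# Hypothetical policy: explicit (identity, operation, resource) grants.
POLICY = frozenset({
    ("trading-engine", "write", "order-management"),
    ("nikhil@zerocopy.systems", "read", "positions-db"),
})


def is_authorized(req: Request, policy: frozenset = POLICY) -> bool:
    """Default deny: allow only when an explicit grant exists.

    The source IP is deliberately ignored, so a request from 10.0.0.0/8
    is treated exactly like one from the public internet.
    """
    return (req.identity, req.operation, req.resource) in policy
```

The same identity gets the same answer from any network location, which is the point: only the grant matters.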
The BeyondCorp implementation at Google added a fourth component: device trustworthiness. A valid user account on an unmanaged device that failed an MDM check gets a different (lower) trust level than the same user on a corporate-managed device with current patches. Device certificates encode this trust level.
The ZeroCopy Architecture
Here is how we implemented this at ZeroCopy. The architecture has four layers.
Layer 1: The Overlay Network (Tailscale)
We use Tailscale as the network overlay. Tailscale builds a WireGuard mesh between every node - servers, developer laptops, CI runners - and assigns each node a stable IP in the 100.64.0.0/10 Tailscale address space.
Why Tailscale instead of raw WireGuard or a traditional VPN:
- Every device on the network has a certificate, not just a pre-shared key. Certificates are issued by Tailscale’s coordination server and are short-lived (renewed automatically).
- ACL policies are centrally defined and pushed to every node. A new rule takes effect everywhere within seconds.
- The coordination server is only used for key exchange. The actual traffic flows peer-to-peer between nodes, which keeps the data path latency minimal.
The ACL policy for a trading firm looks like this:
{
  "acls": [
    {
      "action": "accept",
      "src": ["group:engineers"],
      "dst": ["tag:dev-servers:*"]
    },
    {
      "action": "accept",
      "src": ["tag:trading-engine"],
      "dst": ["tag:market-data-feed:9090"]
    },
    {
      "action": "accept",
      "src": ["tag:trading-engine"],
      "dst": ["tag:order-management:8080"]
    },
    {
      "action": "accept",
      "src": ["tag:monitoring"],
      "dst": ["*:9100"]
    }
  ],
  "tagOwners": {
    "tag:trading-engine": ["group:platform-team"],
    "tag:production": ["group:platform-team"]
  }
}
Note that there is no deny rule at the end. Tailscale ACLs are deny-by-default and "accept" is the only action a rule can carry, so anything not granted above, including everything tagged tag:production, is unreachable. No production access by default - every grant is explicit.
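A policy file like this is worth linting before it is pushed. The sketch below flags tags that appear in ACL rules but are missing from tagOwners, since Tailscale expects tags to be declared there before use. The function name `undeclared_tags` is hypothetical and the check is a simplification: it covers only the `acls` and `tagOwners` keys used in the policy above.

```python
def undeclared_tags(policy: dict) -> set:
    """Return tags referenced in ACL rules that have no tagOwners entry."""
    declared = set(policy.get("tagOwners", {}))
    used = set()
    for rule in policy.get("acls", []):
        for src in rule.get("src", []):
            if src.startswith("tag:"):
                used.add(src)
        for dst in rule.get("dst", []):
            if dst.startswith("tag:"):
                # Destinations look like "tag:name:ports"; drop the port part.
                used.add(dst.rsplit(":", 1)[0])
    return used - declared


# Hypothetical example policy with one undeclared tag.
example = {
    "acls": [
        {
            "action": "accept",
            "src": ["tag:monitoring"],
            "dst": ["tag:dev-servers:9100"],
        }
    ],
    "tagOwners": {"tag:monitoring": ["group:platform-team"]},
}
missing = undeclared_tags(example)
```

Running a check like this in CI catches a mistyped tag before the policy reaches the coordination server.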
The tailscale up invocation on a new server:
tailscale up \
  --authkey="${TAILSCALE_AUTH_KEY}" \
  --advertise-tags="tag:trading-engine" \
  --accept-routes
The auth key is a one-time provisioning key that registers the node and assigns its tags. After registration, the node’s certificate handles ongoing authentication.
Layer 2: Service-to-Service mTLS
Tailscale handles network-level connectivity between nodes. Within that overlay, we use mTLS for service-to-service communication.
Every service presents a certificate when connecting to another service. The receiving service validates that certificate against its trust store before processing the request. Services only accept connections from services whose certificates are in their trust store.
The practical implementation uses SPIFFE/SPIRE. SPIRE runs as a daemon on each node and issues short-lived (1-hour) SVIDs (SPIFFE Verifiable Identity Documents) - X.509 certificates with the service’s SPIFFE ID embedded as a SAN. Services request their SVID from the local SPIRE agent on startup and use it for all outbound connections.
The trust policy for the order management system:
entries:
  - spiffe_id: "spiffe://zerocopy.systems/trading-engine"
    parent_id: "spiffe://zerocopy.systems/spire-agent"
    selectors:
      - type: "k8s"
        value: "ns:trading"
      - type: "k8s"
        value: "sa:trading-engine"
The order management service’s trust store only contains the SPIFFE ID for the trading engine. A request presenting any other certificate is rejected at the connection level, before the request is even parsed.
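The identity check on the receiving side can be sketched as follows. This is an illustration under assumptions: the TLS library is assumed to have already validated the certificate chain and handed over the peer certificate's URI SANs as strings, and `ALLOWED_CALLERS` and `peer_is_allowed` are hypothetical names rather than part of SPIRE.

```python
from urllib.parse import urlparse

# Hypothetical allowlist for the order management service.
ALLOWED_CALLERS = frozenset({"spiffe://zerocopy.systems/trading-engine"})


def peer_is_allowed(san_uris, allowed=ALLOWED_CALLERS,
                    trust_domain="zerocopy.systems") -> bool:
    """Return True only if the peer presents one allowed SPIFFE ID."""
    # An X.509-SVID carries exactly one URI SAN; anything else is rejected.
    if len(san_uris) != 1:
        return False
    spiffe_id = san_uris[0]
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or parsed.netloc != trust_domain:
        return False
    return spiffe_id in allowed
```

A certificate from the right trust domain but for a different workload still fails the check, which is what keeps a compromised sidecar from talking to the OMS.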
Layer 3: Just-in-Time Human Access (Teleport)
For human access to production systems, we use Teleport. Teleport provides certificate-based, audited access to SSH, Kubernetes, databases, and web applications. Critically: access is just-in-time, time-bounded, and fully audited.
An engineer requesting production database access:
tsh db connect production-postgres --db-user=readonly
This triggers an approval workflow. The engineer gets a notification that their request is pending. A second engineer approves it. Teleport issues a short-lived database certificate (valid for 1 hour by default). The engineer gets a connection. Every query they execute is logged.
After the certificate expires, access ends. No credential to revoke, no session to terminate - the certificate is simply not renewed.
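The reason expiry replaces revocation is easy to see in a sketch. This is not Teleport's implementation, just the validity-window logic it relies on; `access_is_valid` is a hypothetical name and the 1-hour default mirrors the TTL mentioned above.

```python
from datetime import datetime, timedelta, timezone


def access_is_valid(issued_at: datetime, now: datetime,
                    ttl: timedelta = timedelta(hours=1)) -> bool:
    """A time-bounded grant is valid only inside [issued_at, issued_at + ttl)."""
    return issued_at <= now < issued_at + ttl


issued = datetime(2026, 3, 23, 14, 0, tzinfo=timezone.utc)
still_valid = access_is_valid(issued, issued + timedelta(minutes=30))
expired = access_is_valid(issued, issued + timedelta(hours=1))
```

Nothing has to happen for access to end. The check simply starts returning False once the window closes.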
The audit log entry looks like:
{
  "event": "db.session.query",
  "user": "nikhil@zerocopy.systems",
  "db_service": "production-postgres",
  "db_name": "trading",
  "db_user": "readonly",
  "query": "SELECT * FROM positions WHERE strategy_id = $1",
  "time": "2026-03-23T14:22:31.441Z",
  "session_id": "3a8f2d1c-..."
}
Every access is attributable to a specific person, at a specific time, for a specific operation. This is the evidence SOC 2 auditors want to see.
Layer 4: The Exchange Connectivity Exception
Exchange connectivity is the most important carve-out in a zero-trust trading architecture.
Market data and order entry connections to exchanges are outbound from your infrastructure. They originate from your trading engine and connect to the exchange’s endpoints. Zero-trust applies to the control plane (how humans and services access your infrastructure), not to the exchange data path.
The exchange connections should be treated as:
- Outbound only from a dedicated subnet, with no inbound connections permitted from exchange IPs
- Authenticated at the application layer using exchange-issued API keys, not network-level controls
- Rate-limited and monitored - any anomaly in order volume or order rate should trigger an alert
The exchange data path benefits from zero-trust principles without the overhead: you do not trust the exchange to be the only entity that can send you data on those ports, so you authenticate at the message level. Every order acknowledgment includes a signature or HMAC that your system verifies before treating the response as authoritative.
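Message-level verification can be sketched with the standard library. Treat this as an assumption-heavy illustration: whether an exchange signs its responses, which algorithm it uses, and how the signature is encoded all vary by venue, so HMAC-SHA256 with a hex digest and the name `verify_message` are placeholders.

```python
import hashlib
import hmac


def verify_message(secret: bytes, payload: bytes, signature_hex: str) -> bool:
    """Recompute the HMAC-SHA256 of the payload and compare in constant time."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


# Hypothetical shared secret and acknowledgment payload.
secret = b"example-shared-secret"
payload = b'{"order_id":"42","status":"ack"}'
signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()

accepted = verify_message(secret, payload, signature)
tampered = verify_message(secret, payload + b" ", signature)
```

`hmac.compare_digest` matters here: a plain `==` comparison leaks timing information about how many leading characters matched.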
The Device Trust Layer
For developer laptops, we layer device certificates on top of Tailscale node certificates. Every laptop that can access the Tailscale network must be enrolled in our MDM solution and pass a device health check:
- Operating system version is current (within 30 days of the latest release)
- Disk encryption is enabled
- Screen lock is configured
- Endpoint detection and response agent is installed and reporting
The MDM system issues device certificates. The Tailscale ACL policy gates access to sensitive resources on the presence of a valid device certificate with a passing device health status.
An engineer on an unmanaged or unhealthy device gets access to the lower-trust segment of the network (internal tools, documentation) but not to production database access or production Kubernetes clusters.
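The health check and the resulting trust tier can be sketched as a single function. The `DevicePosture` fields and tier names are hypothetical stand-ins for whatever your MDM reports, and "within 30 days of the latest release" is modelled here as a simple day count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DevicePosture:
    enrolled_in_mdm: bool
    days_behind_latest_os: int  # days since the latest OS release the device lacks
    disk_encrypted: bool
    screen_lock_enabled: bool
    edr_reporting: bool


def trust_tier(p: DevicePosture) -> str:
    """Map a device's posture to a trust tier; any failed check drops the tier."""
    healthy = (
        p.enrolled_in_mdm
        and p.days_behind_latest_os <= 30
        and p.disk_encrypted
        and p.screen_lock_enabled
        and p.edr_reporting
    )
    return "production" if healthy else "internal-tools"
```

The checks are combined with a plain `and` on purpose: there is no partial credit, and a single failing signal is enough to lose production access.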
Migration: From VPN to Zero Trust
If you are migrating from a VPN model, the path is not “turn off the VPN and turn on Tailscale simultaneously.” The practical migration:
- Deploy Tailscale alongside the existing VPN. Every server and every developer laptop gets a Tailscale node. At this point Tailscale is a parallel network - nothing routes through it yet.
- Migrate non-production environments first. Update the dev and staging environment ACLs to route through Tailscale. Run this in parallel with the VPN for two weeks to identify issues.
- Migrate monitoring and tooling. Grafana, alerting, oncall tooling - shift these to Tailscale addresses. Verify that alerts still fire.
- Replace production database access. Migrate database access to Teleport just-in-time credentials. This is the highest-value change and the one that most reduces your access risk.
- Remove VPN access for production. Once every production access path has been migrated to the zero-trust model, disable VPN access to production subnets. Leave the VPN running for legacy systems temporarily.
- Decommission the VPN. When no traffic is routing through the VPN, shut it down. The certificate ceremony is complete.
Identity for Non-Human Workloads
One of the most frequently neglected areas in a zero-trust migration is identity for non-human workloads: cron jobs, batch processors, data pipeline workers, and CI/CD runners. These workloads do not have a human to authenticate on their behalf. They need their own identity.
The failure pattern: a shared service account token that all batch jobs use. When the token is compromised (and eventually, it will be), every batch job that uses it is compromised simultaneously.
The correct approach: every distinct workload has its own identity, with access scoped to what that workload needs. For Kubernetes workloads on AWS, a per-workload service account combined with IRSA (IAM Roles for Service Accounts) handles this. For workloads that are not on Kubernetes, Vault’s AppRole authentication or platform instance identity provides the equivalent.
For CI/CD runners specifically, the GitHub Actions OIDC token is the cleanest solution. A workflow step can obtain a short-lived OIDC token that asserts “this action is running in repository X, on branch Y, triggered by event Z.” The receiving service (your deployment endpoint, your secrets manager) can verify this token against GitHub’s OIDC endpoint and grant access scoped to exactly those conditions - without any long-lived credential stored in the repository.
# GitHub Actions workflow with OIDC identity
on:
  push:
    branches: [main]   # Assumed trigger; match it to the IAM trust condition
permissions:
  id-token: write      # Required to request the OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials from OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: us-east-1
        # No access keys or passwords - pure OIDC identity
The trust relationship on the IAM role limits it to specific repositories and branches, preventing a compromised repository from assuming the production deployment role.
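The same scoping applies if the receiving service is your own deployment endpoint rather than AWS. The sketch below checks the claims of a token whose signature is assumed to have already been verified against GitHub's published keys. The repository name and audience are hypothetical, `claims_allow_deploy` is an illustrative name, and `iss`, `aud`, `repository`, and `ref` are claims GitHub includes in its Actions OIDC tokens.

```python
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def claims_allow_deploy(claims: dict,
                        repository: str = "zerocopy/trading-infra",
                        ref: str = "refs/heads/main",
                        audience: str = "deploy.zerocopy.systems") -> bool:
    """Allow a deploy only for the expected issuer, audience, repo, and branch.

    Assumes the token signature and expiry were verified before this call.
    """
    return (
        claims.get("iss") == GITHUB_ISSUER
        and claims.get("aud") == audience
        and claims.get("repository") == repository
        and claims.get("ref") == ref
    )
```

A token minted by a fork, or by the right repository on a feature branch, fails the check even though its signature is perfectly valid.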
Monitoring Zero-Trust Infrastructure
Zero-trust generates significantly more authentication events than a VPN-based model. Every service-to-service call involves a certificate check. Every human access involves a Teleport session log. This is a feature, not a bug - but you need to be prepared to ingest and analyze the volume.
The metrics to monitor:
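Certificate health: Track the expiry of every certificate in your mTLS mesh. For longer-lived certificates such as your intermediate CA, alert at 7 days remaining and page at 24 hours. Those thresholds mean nothing for 1-hour SVIDs, so for short-lived certificates alert when one passes its expected renewal point without being renewed. This matters most for certificates issued by your own intermediate CA, where a renewal daemon failure is not caught by any external monitoring service.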
Authentication failures: Count mTLS handshake failures per service pair per time window. A sudden spike in failures between the trading engine and the OMS could indicate a certificate rotation issue - or it could indicate an unauthorized service attempting to connect. Both deserve investigation.
Access scope drift: Periodically reconcile the Tailscale ACL policy against the actual access patterns observed in logs. If the trading engine has never actually connected to the database read replica in six months of logs, the ACL rule that allows it should be removed. This is the zero-trust equivalent of the access review.
JIT access usage: Track Teleport session creation and duration. An engineer who typically accesses production for 15-minute sessions every two weeks and suddenly has a 6-hour session accessing the trading database at 2am is a signal worth investigating.
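The access scope drift check in particular reduces to a set difference. A minimal sketch, assuming you can export the granted (source, destination) pairs from the ACL and the observed pairs from connection logs; the pair names below are hypothetical.

```python
def unused_rules(allowed_pairs, observed_pairs):
    """Return granted (source, destination) pairs never seen in the logs."""
    return sorted(set(allowed_pairs) - set(observed_pairs))


# Hypothetical grants from the ACL and connections seen over six months.
allowed = [
    ("trading-engine", "market-data-feed:9090"),
    ("trading-engine", "order-management:8080"),
    ("trading-engine", "db-read-replica:5432"),
]
observed = [
    ("trading-engine", "market-data-feed:9090"),
    ("trading-engine", "order-management:8080"),
    ("trading-engine", "market-data-feed:9090"),
]
candidates_for_removal = unused_rules(allowed, observed)
```

Treat the output as candidates rather than an automatic delete list: a rule used only during failover or quarter-end will look unused in a short log window.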
How This Breaks in Production
The failure mode I see most often in zero-trust deployments is the service-to-service mTLS implementation not covering all traffic. Engineering teams implement mTLS for the critical path - trading engine to order management - but leave internal tooling (the metrics exporter, the log shipper, the backup agent) running without certificates.
An attacker who compromises the backup agent gets network access but cannot authenticate to the trading engine via mTLS. But they can reach the Prometheus metrics endpoint, the log aggregation service, and any other service that the backup agent has network access to and that was not included in the mTLS scope.
The fix: mTLS is not a feature you bolt onto important services. It is a property of the network: all service-to-service traffic within the overlay uses mTLS, or you accept that the overlay provides connectivity without service identity. The SPIFFE/SPIRE model helps here because the certificate issuance is automated - you do not have to manually provision certificates for each service, and the overhead of including all services is low.
The second failure mode is certificate rotation. Short-lived certificates require automated renewal. If your renewal daemon fails on any service, that service stops being able to initiate mTLS connections. In a trading environment, this is a potential outage. Monitor certificate expiry and certificate renewal success as first-class infrastructure metrics.
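For short-lived certificates, a useful renewal signal is how much of the certificate's lifetime has elapsed. The sketch below is one way to express that. The 50% and 90% thresholds and the name `renewal_status` are assumptions chosen for illustration, not SPIRE defaults.

```python
from datetime import datetime, timedelta, timezone


def renewal_status(not_before: datetime, not_after: datetime, now: datetime,
                   alert_at: float = 0.5, page_at: float = 0.9) -> str:
    """Classify a certificate by the fraction of its lifetime already used."""
    lifetime = (not_after - not_before).total_seconds()
    if lifetime <= 0:
        return "page"  # malformed validity window, treat as an emergency
    fraction = (now - not_before).total_seconds() / lifetime
    if fraction >= page_at:
        return "page"   # renewal is badly overdue or the cert has expired
    if fraction >= alert_at:
        return "alert"  # renewal should have happened by now
    return "ok"


issued = datetime(2026, 3, 23, 14, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=1)
status_early = renewal_status(issued, expires, issued + timedelta(minutes=10))
status_late = renewal_status(issued, expires, issued + timedelta(minutes=40))
status_critical = renewal_status(issued, expires, issued + timedelta(minutes=58))
```

Using a fraction rather than a fixed duration means the same check works for a 1-hour SVID and a 90-day intermediate certificate.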
The third failure mode specific to trading firms is treating exchange connectivity as an exception to zero-trust principles and then progressively expanding the exception. The exchange data path is a legitimate carve-out. But “the exchange feed needs to reach the strategy engine directly” becomes “the strategy engine needs network access to the exchange subnet” becomes “the exchange subnet is generally accessible from the production network” - and you have recreated the implicit trust zone you were trying to eliminate.
Define the exchange connectivity exception precisely and enforce it precisely. The trading engine can initiate outbound connections from the dedicated subnet to the exchange’s published endpoints, on the specific ports the exchange requires. Nothing at an exchange address can initiate a connection into your network, and no other production host can reach the exchange subnet. The ACL and firewall rules that implement this exception should be as narrow as the actual technical requirement.
The contractor scenario I opened with would not be possible in a zero-trust architecture. There is no “inside the network” that grants implicit access to internal services. The contractor’s account would have had no certificates, and every internal service would have rejected their connections at the TLS handshake. The quarterly access review that caught the issue would instead have been the off-boarding process that ensured no certificates were issued to the contractor’s identity from day one of their departure.