The Insider-Threat Model: Engineering Controls for Collusion Resistance in Custody and Trading

When I joined Upside as Head of Engineering, the first thing the CISO told me was: assume any one of us could be compromised. Not malicious - compromised. A phishing attack, a social engineering call, a coerced relative in a jurisdiction we did not operate in. The threat model was not the external hacker trying to breach the perimeter. The threat model was an insider who already had legitimate access.

This assumption changes the entire architecture. Perimeter security stops being sufficient on its own, because the attacker is already inside it. A valid authentication token no longer proves that the legitimate user is the one acting. The security posture shifts from “keep attackers out” to “ensure that no individual’s compromise can produce a material outcome.”

This is collusion resistance. And it is the hardest security property to build correctly.

The Insider Threat Model

For a crypto custody or trading firm, the insider threat categories are:

The compromised credential. An engineer’s laptop has malware. The attacker controls their authentication tokens, their session cookies, their SSH keys. The insider is not acting maliciously - they are unaware. Every action available to that engineer is now available to the attacker.

The coerced insider. An engineer is under external pressure - financial debt, a family member’s safety in a different jurisdiction, a blackmail situation. They take a deliberate action to enable an attack in exchange for relief. This is distinct from the malicious employee because the motivation is external and the insider may actively attempt to minimize the damage.

The malicious insider. An employee who decides to extract value for personal gain. This is statistically less common than the compromised credential case but gets more attention because the narrative is more compelling.

The collusion case. Two or more insiders coordinating. This is the scenario that requires the most sophisticated controls because any M-of-N requirement can be defeated as soon as M of the N members collude.

The implications for the engineering controls: assume that any single person’s access can be compromised at any time. Design systems so that a single person’s compromise cannot produce a catastrophic outcome.

Engineering Controls for Collusion Resistance

Dual Control on Production Changes

No single engineer can both author and approve a production change. This is the most basic collusion resistance control, and it is one that many engineering teams implement with a loophole: the same engineer can approve their own change if no one else is available.

The control must be enforced at the tooling level, not as a policy. GitHub’s branch protection rules can require at least one approving review from a different account. The enforcement matters because under pressure (a production incident, a tight deadline), engineers will ask for exceptions, and managers will grant them.

At Upside, we went further: for any change that touched the signing infrastructure or custody-related code, a second reviewer from the security team was required in addition to the standard engineering review. The security team reviewer was a named group, and the approval requirement was enforced by a GitHub app that checked the review history against the CODEOWNERS file before allowing a merge.
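A minimal sketch of the merge gate described above, assuming review data has already been fetched from the code host. The names `Review`, `can_merge`, `security_team`, and `sensitive_prefixes` are illustrative and not part of any GitHub API; a real check would read its path rules from CODEOWNERS rather than a hard-coded prefix list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    reviewer: str
    approved: bool


def can_merge(author, reviews, touched_paths, security_team, sensitive_prefixes):
    """Return (allowed, reason) for a proposed merge."""
    # Self-approvals never count, even during an incident.
    approvers = {r.reviewer for r in reviews if r.approved and r.reviewer != author}
    if not approvers:
        return False, "needs an approval from someone other than the author"

    sensitive = any(
        path.startswith(prefix) for path in touched_paths for prefix in sensitive_prefixes
    )
    if sensitive:
        security_approvers = approvers & set(security_team)
        if not security_approvers:
            return False, "sensitive change needs a security-team approval"
        # The security review is in addition to the engineering review, not a substitute.
        if not (approvers - security_approvers):
            return False, "sensitive change also needs an engineering approval"
    return True, "ok"
```

The important property is that there is no code path where the author's own approval, or a missing reviewer, results in an allowed merge.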

The 4-Eyes Principle on Production Deploys

A deploy to production requires two engineers to authorize it, using separate authentication. The second engineer cannot be the person who authored the code being deployed.

For trading infrastructure, this extends to parameter changes: no single engineer can change a strategy’s position limit, risk threshold, or venue routing without a second authorization. This prevents a compromised engineer from manipulating trading behavior without leaving evidence.

The implementation: your deployment pipeline requires two separate JWTs, each issued to a different user identity, before proceeding. Each token is checked against a revocation list, and should be short-lived and single-use, so that a previously valid token cannot be replayed.
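A sketch of the pipeline-side check, assuming token signatures have already been verified upstream and the claims extracted. `ApprovalClaims`, `authorize_deploy`, and the single-use `used_ids` set are hypothetical names for illustration; binding each token to one `deploy_id` is an added assumption, not something described above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalClaims:
    subject: str       # identity the token was issued to
    token_id: str      # unique token id (the "jti" claim in a JWT)
    deploy_id: str     # the deploy this approval is bound to (assumption)
    expires_at: float  # Unix timestamp


def authorize_deploy(deploy_id, code_authors, approvals, revoked_ids, used_ids, now=None):
    """Return the sorted approver identities, or raise PermissionError."""
    now = time.time() if now is None else now
    valid = {}
    for claims in approvals:
        if claims.deploy_id != deploy_id:
            continue  # approval was issued for a different deploy
        if claims.expires_at <= now:
            continue  # expired
        if claims.token_id in revoked_ids or claims.token_id in used_ids:
            continue  # revoked, or a replay of an already-consumed token
        valid.setdefault(claims.subject, claims)  # one vote per identity

    if len(valid) < 2:
        raise PermissionError("need approvals from two distinct identities")
    if all(subject in code_authors for subject in valid):
        raise PermissionError("at least one approver must not be an author of the change")

    # Consume the tokens so they cannot authorize a second deploy.
    used_ids.update(c.token_id for c in valid.values())
    return sorted(valid)
```

Passing `now` explicitly keeps the function testable; in the pipeline it would default to the current time.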

Geographic and Organizational Segregation for High-Value Operations

For operations above a specific value threshold (at Upside this was set at $1M), the two required authorizers had to be in different geographic regions. This controls against physical coercion scenarios - an attacker physically present at one location cannot simultaneously coerce both required authorizers.

The segregation requirement: no two members of the approval quorum can share the same employer, the same time zone band (UTC offsets within 3 hours of each other), or a physical location within 100 km of each other. This forces a physical coercion attempt to become a coordinated multi-site operation.
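The pairwise rule can be checked mechanically before a quorum is accepted. The sketch below reads the time zone rule as "UTC offsets within 3 hours of each other", which is an interpretation, and uses the haversine formula for distance. `Authorizer` and `segregation_violations` are illustrative names.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Authorizer:
    name: str
    employer: str
    utc_offset_hours: float
    lat: float
    lon: float


def distance_km(a, b):
    """Great-circle distance between two authorizers (haversine formula)."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def segregation_violations(quorum, min_km=100.0, min_offset_gap_hours=3.0):
    """Return (name, name, reason) for every pair that breaks a segregation rule."""
    violations = []
    for a, b in combinations(quorum, 2):
        if a.employer == b.employer:
            violations.append((a.name, b.name, "same employer"))
        gap = abs(a.utc_offset_hours - b.utc_offset_hours) % 24
        gap = min(gap, 24 - gap)  # offsets wrap around the date line
        if gap <= min_offset_gap_hours:
            violations.append((a.name, b.name, "time zones too close"))
        if distance_km(a, b) < min_km:
            violations.append((a.name, b.name, "locations too close"))
    return violations
```

An empty result means the proposed quorum satisfies all three rules; anything else should block the operation rather than just log a warning.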

For a Shamir’s Secret Sharing or MPC-based custody system, the shares should be held by parties that meet these segregation requirements. A 2-of-3 scheme where all three keyholders are in the same office is not meaningfully more secure than a 1-of-1 scheme under a physical threat.

Automated Anomaly Detection on Trading Activity

A rogue insider with valid credentials can authorize trades. The correct authorization chain is: strategy generates signal → risk engine validates → order management routes → execution.

Anomaly detection sits orthogonally to this chain. If any strategy suddenly trades 10x its normal size, routes to a venue it has not used in the last 90 days, or generates a series of orders that would benefit a specific external party, an alert fires regardless of whether the authorization chain was correctly followed.

The anomaly detection rules at Upside:

Order size > 3 standard deviations from the strategy’s 30-day rolling average: alert and suspend
Order routing to a venue not in the strategy’s historical venue list: flag for review before execution
Identical-price orders on both sides of the market (wash trading pattern): block and alert
Execution pace more than 5x the strategy’s historical rate: throttle and alert
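Three of these rules reduce to simple comparisons against a per-strategy baseline. The sketch below is illustrative only: `Order`, `check_order`, and the action strings are hypothetical, the caller is assumed to supply the 30-day size history and historical venue list, and the wash-trading rule is left out because it needs order-book state.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class Order:
    strategy: str
    venue: str
    size: float


def check_order(order, recent_sizes, known_venues, orders_last_minute, typical_orders_per_minute):
    """Return the actions triggered by an order; an empty list means no rule fired."""
    actions = []

    # Rule 1: size more than 3 standard deviations from the rolling average.
    if len(recent_sizes) >= 2:
        mu, sigma = mean(recent_sizes), stdev(recent_sizes)
        if sigma > 0 and abs(order.size - mu) > 3 * sigma:
            actions.append("alert_and_suspend")

    # Rule 2: venue the strategy has not used before.
    if order.venue not in known_venues:
        actions.append("hold_for_review")

    # Rule 3: execution pace more than 5x the historical rate.
    if typical_orders_per_minute > 0 and orders_last_minute > 5 * typical_orders_per_minute:
        actions.append("throttle_and_alert")

    return actions
```

Note that the function never consults who authorized the order. That is deliberate: the check runs the same way whether or not the authorization chain was followed.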

The key design principle: the anomaly detection system should not be accessible to the same engineer who can authorize trades. Segregation of duties means the person who can approve a trade cannot also modify the anomaly detection rules.

Key Controls for Custody

For custody systems, control of key material is the most important control in the system. No single person should hold more than one of the following capabilities:

Access to the signing key material (even in encrypted form)
Access to the decryption key that unlocks the signing key material
Access to the production server on which signing occurs
Approval of the transaction that triggers a signing operation

These four capabilities must be held by different people, with no overlap. Every additional capability that a single person accumulates removes one of the barriers between them and an unauthorized signing operation.

In practice, the implementation uses hardware security modules (HSMs) or trusted execution environments (TEEs) to ensure that even a person with root access to the production server cannot extract key material from the signing service. TEE attestation lets an auditor verify that the signing service ran the code they approved. That approved code is, in turn, what enforces that key material comes only from an authorized source and that each signature answers a transaction authorized through the policy engine.

The “God Mode” Quarterly Audit

Every quarter, run a full access review: who has access to what, and does that access align with their current role? The audit should specifically flag:

Anyone with access to more than one of the four custody capabilities listed above
Anyone whose job function has changed but whose access has not been updated
Any service account that has broader access than the services that use it
Any emergency access (break-glass) grants that were used in the quarter, with a review of whether the use was justified
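The first two flags are straightforward set operations once grants and role entitlements are exported. A rough sketch, with hypothetical capability labels and data shapes (`grants`, `roles`, and `role_entitlements` are plain dicts here, not any particular IAM system's format):

```python
# Hypothetical labels for the four custody capabilities.
CUSTODY_CAPABILITIES = {"key_material", "decryption_key", "signing_host", "tx_approval"}


def audit_access(grants, roles, role_entitlements):
    """Flag people holding several custody capabilities or access beyond their role.

    grants:            person -> set of access labels currently held
    roles:             person -> current role name
    role_entitlements: role name -> set of access labels that role should have
    """
    findings = []
    for person in sorted(grants):
        access = grants[person]

        held = access & CUSTODY_CAPABILITIES
        if len(held) > 1:
            findings.append((person, "multiple custody capabilities", sorted(held)))

        allowed = role_entitlements.get(roles.get(person), set())
        excess = access - allowed
        if excess:
            findings.append((person, "access beyond current role", sorted(excess)))
    return findings
```

Someone whose role changed without an access update shows up under the second finding, because their grants are compared against the entitlements of the role they hold now.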

This audit is not just a compliance exercise. It is a detection mechanism: a compromised insider who has been quietly accumulating access over months is likely to show up in the access review as holding more access than their role explains.

A Production Near-Miss: The DeFi Extraction Pattern

In 2022, an engineer at a DeFi protocol nearly extracted $40M before being detected. The technical sequence:

The engineer had access to the protocol’s multi-sig wallet interface, which required 3-of-5 signers.
The engineer initiated a “routine maintenance transaction” - a parameter update that had been done before - and obtained their own signature plus signatures from two colleagues who trusted the engineer’s description of the transaction.
The transaction was not what the engineer described. It transferred $40M to an external address.
The transaction was caught because the protocol had a 48-hour time-lock on large transfers. During the time-lock window, a fourth signer reviewed the transaction details directly on the blockchain and noticed the discrepancy.

What controls failed:

The signers did not independently verify the transaction contents before signing. They trusted the initiating engineer’s description.
The multi-sig interface showed a human-readable description provided by the initiator, not derived from the transaction itself.
Both co-signers worked in the same office as the initiating engineer and had daily direct contact with them.

What saved the protocol:

The time-lock. Without the 48-hour delay, the funds would have been transferred before anyone noticed.
A fourth signer who reviewed transaction details independently.
The fourth signer’s decision to verify on-chain data rather than rely on the interface.

The engineering lessons:

Never show signers a description provided by the initiator. Show them the decoded transaction data directly. The signing interface should derive its human-readable description from the transaction payload itself (the target address, the function being called, and its arguments), not accept one from the initiator.
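To make "derive the description from the transaction" concrete, here is a small sketch that decodes one well-known call type, an ERC-20 `transfer(address,uint256)`, straight from raw calldata. The function name `describe_call` and the `decimals` and `symbol` parameters are assumptions for illustration; a production interface would decode against the full contract ABI and refuse to render anything it cannot decode.

```python
# Function selector for the ERC-20 call transfer(address,uint256).
TRANSFER_SELECTOR = bytes.fromhex("a9059cbb")


def describe_call(calldata_hex, decimals=18, symbol="TOKEN"):
    """Build a description from the calldata itself, never from the initiator's text."""
    if calldata_hex.startswith("0x"):
        calldata_hex = calldata_hex[2:]
    data = bytes.fromhex(calldata_hex)
    selector, args = data[:4], data[4:]

    if selector == TRANSFER_SELECTOR and len(args) == 64:
        # Each argument is a 32-byte word; an address occupies the last 20 bytes.
        recipient = "0x" + args[12:32].hex()
        amount = int.from_bytes(args[32:64], "big")
        whole, frac = divmod(amount, 10 ** decimals)
        return f"transfer {whole}.{frac:0{decimals}d} {symbol} to {recipient}"

    # Fail loudly: an undecodable call must never be shown as "routine maintenance".
    return f"UNRECOGNIZED call (selector 0x{selector.hex()}): decode manually before signing"
```

With an interface like this, a "parameter update" that is really a token transfer is displayed as a transfer, whatever label the initiator attached to it.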

Time-locks are a compensating control for social engineering. A transaction that takes 48 hours to execute gives time for independent verification. For large-value operations, this is worth the operational overhead.
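The mechanics of a time-lock are simple enough to sketch in a few lines. This is an off-chain illustration with hypothetical names (`TimeLock`, `QueuedTransfer`), not a smart-contract implementation; the veto mechanism is an added assumption about how a reviewing signer would stop a queued transfer.

```python
from dataclasses import dataclass, field


@dataclass
class QueuedTransfer:
    queued_at: float                      # Unix timestamp when the transfer was queued
    vetoes: set = field(default_factory=set)


class TimeLock:
    def __init__(self, delay_seconds=48 * 3600):
        self.delay_seconds = delay_seconds
        self._queue = {}

    def queue(self, transfer_id, now):
        if transfer_id in self._queue:
            raise ValueError("transfer already queued")
        self._queue[transfer_id] = QueuedTransfer(queued_at=now)

    def veto(self, transfer_id, signer):
        # Any reviewer who spots a discrepancy during the window can block execution.
        self._queue[transfer_id].vetoes.add(signer)

    def can_execute(self, transfer_id, now):
        item = self._queue.get(transfer_id)
        if item is None or item.vetoes:
            return False
        return now - item.queued_at >= self.delay_seconds
```

The delay only helps if someone actually reviews queued transfers during the window, so pair it with a notification to every signer who did not take part in the original approval.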

Geographic segregation of the quorum is not optional. If the signers know each other personally and are co-located, getting a signature takes little more than a plausible story from a trusted colleague.

Behavioral Analytics: Detecting the Low-and-Slow Insider

Most insider threat controls focus on preventing a single catastrophic action. But sophisticated insider attacks often operate incrementally: small privilege escalations, gradual access accumulation, recon over weeks before the extraction event.

Behavioral analytics addresses this by establishing baseline patterns for each user and alerting on deviations:

Access time patterns. An engineer who always accesses production between 9am and 7pm UTC and suddenly starts a session at 2am is worth a review, even if the access itself is authorized.

Resource access breadth. An engineer who suddenly starts accessing credential paths, config files, and database tables they have never accessed before represents a change in behavior. Either their role has changed (and that should be documented) or something unusual is happening.

Query patterns for database access. An engineer doing routine debugging runs SELECT queries scoped to recent data. An engineer extracting data runs queries with large date ranges, no WHERE clauses, or ORDER BY with LIMIT and a steadily advancing OFFSET - the pattern of someone paging through and copying a whole table, not investigating a specific issue.

Export events. Any copy of data to an external destination - a personal S3 bucket, an external SFTP server, a personal email - should be an alert, not a routine event. At Upside, our DLP (Data Loss Prevention) tooling monitored for large file uploads from corporate devices to non-corporate destinations.

The implementation: aggregate logs from Tailscale (network access), Teleport (database and server sessions), AWS CloudTrail (API calls), and your secrets manager (credential access) into a SIEM. Write rules that fire when access patterns deviate significantly from the prior 90-day baseline for that user. The SIEM analysis does not need to be sophisticated - even a simple “this user has accessed 5x their normal number of distinct resources in the last 24 hours” alert adds meaningful coverage.
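The "5x normal breadth" rule can be prototyped outside the SIEM before committing to a rule language. A minimal sketch, assuming events have been normalized to `(user, resource, timestamp)` tuples; `breadth_alerts` and the `floor` parameter (which stops users with a near-empty baseline from alerting on a handful of resources) are illustrative choices, not something any of the tools above provide.

```python
from collections import defaultdict
from datetime import timedelta


def breadth_alerts(events, now, baseline_days=90, factor=5.0, floor=1.0):
    """Flag users whose distinct-resource count in the last 24h exceeds factor x baseline.

    events: iterable of (user, resource, timestamp) with datetime timestamps.
    Returns a list of (user, recent_count, baseline_per_active_day).
    """
    window_start = now - timedelta(days=1)
    baseline_start = window_start - timedelta(days=baseline_days)

    recent = defaultdict(set)                      # user -> resources in last 24h
    daily = defaultdict(lambda: defaultdict(set))  # user -> date -> resources

    for user, resource, ts in events:
        if window_start < ts <= now:
            recent[user].add(resource)
        elif baseline_start < ts <= window_start:
            daily[user][ts.date()].add(resource)

    alerts = []
    for user in sorted(recent):
        days = daily[user]
        # Baseline: average distinct resources per day on which the user was active.
        normal = sum(len(r) for r in days.values()) / len(days) if days else 0.0
        normal = max(normal, floor)
        if len(recent[user]) > factor * normal:
            alerts.append((user, len(recent[user]), normal))
    return alerts
```

Running this over a few months of historical logs is also a cheap way to see how many alerts the rule would have produced before it goes live.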

How This Breaks in Production

The most common failure mode for collusion resistance controls is the exception process. The dual-control requirement has an emergency bypass. The bypass requires approval from the CTO. Under a production incident at 3am, the CTO approves the bypass so the on-call engineer can deploy the fix without waiting for a second reviewer.

The bypass is the right decision in the immediate emergency. The problem is when the bypass is used routinely. Once engineers know that the CTO will approve bypass requests without scrutiny, the dual-control requirement effectively becomes optional.

The fix: every bypass must generate an immutable audit record that is reviewed at the next business day’s security standup. The CTO’s bypass approvals should appear in the quarterly access review. If the number of bypass approvals exceeds a threshold (we set it at 3 per month), that automatically triggers a review of whether the process needs to be redesigned.
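One way to approximate an immutable record in application code is a hash chain, where each entry commits to the one before it. This makes the log tamper-evident rather than truly immutable; real immutability needs write-once storage behind it. `BypassLog` and its field names are hypothetical.

```python
import hashlib
import json
from collections import Counter


def _digest(body):
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class BypassLog:
    """Append-only, hash-chained log of dual-control bypass approvals."""

    GENESIS = "0" * 64

    def __init__(self):
        self._records = []

    def append(self, approver, engineer, reason, timestamp_iso):
        prev = self._records[-1]["hash"] if self._records else self.GENESIS
        body = {
            "approver": approver,
            "engineer": engineer,
            "reason": reason,
            "timestamp": timestamp_iso,  # e.g. "2024-03-05T03:12:00Z"
            "prev": prev,
        }
        self._records.append({**body, "hash": _digest(body)})

    def verify(self):
        """Return True only if no record has been altered, removed, or reordered."""
        prev = self.GENESIS
        for rec in self._records:
            body = {k: v for k, v in rec.items() if k != "hash"}
            if rec["prev"] != prev or _digest(body) != rec["hash"]:
                return False
            prev = rec["hash"]
        return True

    def months_over_threshold(self, threshold=3):
        """Months ("YYYY-MM") in which bypass approvals exceeded the threshold."""
        counts = Counter(rec["timestamp"][:7] for rec in self._records)
        return sorted(month for month, n in counts.items() if n > threshold)
```

Any month returned by `months_over_threshold` is the trigger for the process review, independent of whether each individual bypass looked justified at the time.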

The second failure mode is anomaly detection that is too noisy. A system that generates 50 alerts per day is worse than no system - alert fatigue means the one real anomaly is ignored. Calibrate your anomaly detection rules against historical trading data. A threshold that would have generated fewer than 5 alerts per month on the past year of data is a reasonable starting point.
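Calibration against history can be as simple as replaying past anomaly scores at several candidate thresholds and picking the tightest one that stays inside the alert budget. The sketch below assumes each historical order has already been reduced to a single score (for example, its deviation in standard deviations) and treats "fewer than 5 per month" as an average over the whole period; `calibrate_threshold` and its default candidate grid are illustrative.

```python
def calibrate_threshold(historical_scores, months_of_history, max_alerts_per_month=5,
                        candidates=None):
    """Return the lowest candidate threshold that stays under the alert budget.

    Returns None if even the loosest candidate would have been too noisy,
    which suggests the rule itself needs rethinking rather than retuning.
    """
    if candidates is None:
        candidates = [x / 2 for x in range(4, 21)]  # 2.0, 2.5, ..., 10.0

    budget = max_alerts_per_month * months_of_history
    for threshold in sorted(candidates):
        alerts = sum(1 for score in historical_scores if score > threshold)
        if alerts < budget:
            return threshold
    return None
```

Choosing the lowest threshold that fits the budget keeps the rule as sensitive as the team can realistically handle, instead of defaulting to a round number.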

The third failure mode is the “we trust each other” culture that treats the insider threat model as an insult. I have had conversations with founding teams of trading startups who view dual-control requirements as expressing distrust of their colleagues. The correct framing is the opposite: these controls protect everyone. If any member of the team is ever compromised, coerced, or accused of unauthorized activity, the controls provide proof of what actually happened. They protect the innocent as much as they deter the malicious.

The adversary model for an insider threat is someone who understands your controls. They will probe the edges of your anomaly detection, learn the bypass procedures, and exploit the social trust in your team. The only reliable defense is to design systems where the outcome of a successful insider attack is limited regardless of the attacker’s knowledge of the controls.

The Personnel Lifecycle as a Control Surface

The highest-risk moments in the insider threat lifecycle are not random points in an employee’s tenure. They are specific transitions: the first 90 days (before full background checks complete and before behavioral baselines are established), the period after a denied promotion or negative performance review, and the notice period after resignation.

For trading firms, the notice period is particularly sensitive. An engineer who has resigned has full knowledge of your systems, and their access has not yet been revoked. They may be leaving for a direct competitor. Common practice at many financial firms: employees in sensitive roles are placed on immediate “garden leave” when they resign. They are paid for their notice period but have no system access from the moment of resignation announcement.

This is not punitive. It is the same logic as the dual-control and time-lock controls: limit the window during which an adverse outcome is possible. An engineer on garden leave with no system access cannot exfiltrate data or manipulate trading parameters, regardless of their intent.

From an engineering standpoint, garden leave requires that access revocation be fast and complete. The trigger is the resignation conversation, not the last day of work. Your offboarding automation should be executable in minutes: disable the SSO account, revoke the VPN certificate, remove the Tailscale node, invalidate active Teleport sessions. Build and test this procedure before you need it.
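A sketch of the orchestration layer for that procedure. The individual revocation functions are deliberately left as caller-supplied callables, since the real calls depend on your SSO, VPN, Tailscale, and Teleport setup; `offboard` and `fully_revoked` are hypothetical names. The one design point worth copying is that a failure in one step must not stop the remaining steps from running.

```python
def offboard(user, steps):
    """Run every revocation step for a user, even if an earlier step fails.

    steps: ordered list of (step_name, callable) pairs; each callable takes the user.
    Returns {step_name: None on success, or an error string on failure}.
    """
    results = {}
    for name, revoke in steps:
        try:
            revoke(user)
            results[name] = None
        except Exception as exc:  # keep going: partial revocation beats none
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


def fully_revoked(results):
    """True only if every step succeeded; anything else needs a human immediately."""
    return all(error is None for error in results.values())
```

Testing this with stub callables on a schedule, including a stub that raises, is a reasonable way to rehearse the procedure before the first real resignation.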