ArgoCD GitOps for a Trading Platform: Progressive Delivery Without Breaking Production
ZeroCopy runs 11 ArgoCD applications on DOKS. How we use app-of-apps, sync waves, resource hooks, and manual sync policies to deploy trading services safely.
The first time I tried to use ArgoCD’s automated sync on our order management system, it restarted the service in the middle of a trading session. ArgoCD had detected that the live deployment differed from the Git state (a human had applied a configmap change to fix a typo in an alert label) and helpfully synced the cluster back to Git’s version of truth - which triggered a rolling restart of the OMS pods.
The positions were fine - the restart was fast enough that no trades were lost - but the experience was instructive. ArgoCD’s default behavior assumes that continuous reconciliation toward Git state is always desirable. For a trading platform, that assumption is wrong in specific, important ways.
This post is about how we structured ArgoCD at ZeroCopy to get the benefits of GitOps (auditability, reproducibility, drift detection) without the risk of automated syncs touching services during live trading hours.
The App-of-Apps Pattern
ZeroCopy runs 11 ArgoCD applications across a DOKS (DigitalOcean Kubernetes) cluster. Managing 11 application manifests independently would mean 11 places to update the repo URL, the target revision, and the cluster credentials when things change. The app-of-apps pattern solves this: one parent application manages all child application definitions.
The parent application points to a directory of ArgoCD Application manifests:
# argocd/parent-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: zerocopy-root
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/padalan/zerocopy
    targetRevision: main
    path: argocd/apps  # Directory containing all child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true     # Remove applications that are deleted from Git
      selfHeal: true  # Re-sync if someone manually modifies the parent app
The parent app’s syncPolicy.automated is enabled because the application definitions themselves are not trading infrastructure - changing which applications exist does not restart trading services. The parent app manages the ArgoCD control plane state, not the trading workloads.
The child applications are defined in argocd/apps/:
argocd/apps/
├── oms.yaml # Order management system
├── risk-engine.yaml # Risk engine
├── nats-cluster.yaml # NATS messaging
├── prometheus.yaml # Monitoring
├── grafana.yaml # Dashboards
├── alertmanager.yaml # Alerts
├── harbor.yaml # Container registry
├── cert-manager.yaml # TLS certificates
├── external-secrets.yaml # Secret sync from Infisical
├── kyverno.yaml # Policy engine
└── argo-rollouts.yaml # Progressive delivery controller
Each child application is its own ArgoCD Application resource, and critically, each has its own syncPolicy - different services get different sync behavior.
Sync Waves: Ordering Your Deployment Sequence
Sync waves solve a class of ordering problems that naive Kubernetes deployments fail on: you cannot deploy a service before the namespace that contains it, or a secret before the deployment that references it, or an application before the CRD that defines its custom resources.
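ArgoCD implements sync waves via annotation. Resources are applied in ascending wave order: the lowest wave syncs first (the default is wave 0, and negative values are allowed), then the next, and so on. Within a wave, resources are applied together, and ArgoCD waits for them to become healthy before starting the next wave.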
Our sync wave structure for a trading service deployment:
# Wave 0: Namespaces and CRDs (must exist before everything else)
# argocd/manifests/namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: trading
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
apiVersion: v1
kind: Namespace
metadata:
  name: observability
  annotations:
    argocd.argoproj.io/sync-wave: "0"
# Wave 1: External Secrets operator (must be running before wave 2 secrets)
# argocd/manifests/external-secrets-operator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-secrets
  namespace: external-secrets
  annotations:
    argocd.argoproj.io/sync-wave: "1"
# Wave 2: Secrets (populated by External Secrets from Infisical vault)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: oms-credentials
  namespace: trading
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  refreshInterval: 5m
  secretStoreRef:
    name: infisical-backend
    kind: ClusterSecretStore
  target:
    name: oms-credentials
  data:
    - secretKey: db-password
      remoteRef:
        key: OMS_DB_PASSWORD
    - secretKey: exchange-api-key
      remoteRef:
        key: EXCHANGE_API_KEY
# Wave 3: Core infrastructure services (NATS, databases)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nats
  namespace: infrastructure
  annotations:
    argocd.argoproj.io/sync-wave: "3"
# Wave 5: Trading application services (depend on secrets + infrastructure)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oms
  namespace: trading
  annotations:
    argocd.argoproj.io/sync-wave: "5"
The gap between wave 3 and wave 5 is intentional - it leaves room to insert wave 4 resources (like a database migration job) without renumbering everything.
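As a rough illustration of what could occupy that reserved slot, here is a minimal sketch of a migration Job pinned to wave 4. The Job name and image are hypothetical, and running it as a Sync hook (rather than a plain resource) is an assumption made here so that ArgoCD recreates the Job on each sync instead of trying to patch an immutable one.

```yaml
# Hypothetical wave 4 resource: runs after NATS (wave 3), before the OMS (wave 5)
apiVersion: batch/v1
kind: Job
metadata:
  name: oms-db-migrate            # hypothetical name
  namespace: trading
  annotations:
    argocd.argoproj.io/hook: Sync                               # runs during the Sync phase, ordered by wave
    argocd.argoproj.io/sync-wave: "4"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation   # replace the previous run's Job
spec:
  backoffLimit: 0                 # a failed migration should fail the sync, not retry
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/zerocopy/oms-migrations:1.0.0  # hypothetical image
          envFrom:
            - secretRef:
                name: oms-credentials   # created in wave 2 by the ExternalSecret
```

Because the wave 5 Deployment only starts after wave 4 completes, a failed migration stops the sync before the OMS is touched.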
Resource Hooks: Pre- and Post-Sync Operations
Resource hooks are the mechanism for running operations at specific points in the sync lifecycle. The trading-specific use cases:
PreSync: Set trading engine to standby before deployment.
For stateful trading services that cannot restart cleanly mid-session, the PreSync hook sends a “go to standby” command before ArgoCD modifies any resources. The service stops accepting new orders, drains in-flight operations, and acknowledges the standby state. Only then does ArgoCD proceed with the sync.
# argocd/hooks/oms-presync.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: oms-presync-standby
  namespace: trading
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded  # Clean up after success
spec:
  template:
    spec:
      containers:
        - name: standby-notifier
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              # Send standby signal to OMS; abort the hook if the request fails
              curl -f -X POST \
                -H "Authorization: Bearer ${OMS_ADMIN_TOKEN}" \
                -H "Content-Type: application/json" \
                -d '{"mode": "standby", "reason": "argocd-sync"}' \
                http://oms.trading.svc.cluster.local:8080/admin/mode || exit 1
              # Wait for OMS to confirm standby mode.
              # curlimages/curl does not ship jq, so match the JSON field with grep.
              for i in $(seq 1 30); do
                if curl -sf http://oms.trading.svc.cluster.local:8080/health/mode \
                    | grep -Eq '"mode"[[:space:]]*:[[:space:]]*"standby"'; then
                  echo "OMS confirmed standby mode"
                  exit 0
                fi
                sleep 2
              done
              echo "OMS did not acknowledge standby within 60s"
              exit 1
          envFrom:
            - secretRef:
                name: oms-admin-credentials
      restartPolicy: Never
  backoffLimit: 0  # Fail fast - do not retry if standby signal fails
PostSync: Verify health before declaring success.
The PostSync hook runs after all resources have been updated. It verifies that the new version is healthy before ArgoCD marks the sync as successful.
# argocd/hooks/oms-postsync.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: oms-postsync-verify
  namespace: trading
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: health-verifier
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              # Wait for OMS to complete startup.
              # curlimages/curl does not ship jq, so match the JSON fields with grep
              # (assumes both fields are top-level string values).
              for i in $(seq 1 60); do
                HEALTH=$(curl -sf http://oms.trading.svc.cluster.local:8080/health/deep || true)
                if echo "$HEALTH" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"' &&
                   echo "$HEALTH" | grep -Eq '"exchange_connections"[[:space:]]*:[[:space:]]*"all_connected"'; then
                  echo "OMS health check passed: all exchange connections live"
                  exit 0
                fi
                echo "Waiting for OMS health... attempt $i/60"
                sleep 5
              done
              echo "OMS failed health check after 300s"
              exit 1
      restartPolicy: Never
  backoffLimit: 0
SyncFail: Page on-call when sync fails.
If a sync fails - whether due to a failing hook, a pod that fails to start, or a resource conflict - the SyncFail hook fires. For production trading services, a sync failure should wake someone up.
apiVersion: batch/v1
kind: Job
metadata:
  name: oms-syncfail-alert
  namespace: trading
  annotations:
    argocd.argoproj.io/hook: SyncFail
    argocd.argoproj.io/hook-delete-policy: HookFailed
spec:
  template:
    spec:
      containers:
        - name: alerter
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              curl -X POST "$PAGERDUTY_EVENTS_URL" \
                -H "Content-Type: application/json" \
                -d "{
                  \"routing_key\": \"$PAGERDUTY_ROUTING_KEY\",
                  \"event_action\": \"trigger\",
                  \"payload\": {
                    \"summary\": \"ArgoCD sync failed for OMS in production\",
                    \"severity\": \"critical\",
                    \"source\": \"argocd\"
                  }
                }"
          envFrom:
            - secretRef:
                name: pagerduty-credentials
      restartPolicy: Never
Manual Sync Policy for Hot-Path Services
This is the most important configuration decision in our ArgoCD setup. Trading services that hold live position state - the OMS, the risk engine, the strategy engine - must have syncPolicy: {} (manual sync) rather than syncPolicy.automated.
# argocd/apps/oms.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: oms
  namespace: argocd
spec:
  project: trading-prod
  source:
    repoURL: https://github.com/padalan/zerocopy
    targetRevision: main
    path: k8s/trading/oms
  destination:
    server: https://kubernetes.default.svc
    namespace: trading
  # CRITICAL: Manual sync only for trading services
  # ArgoCD will detect drift and show it in the UI, but will NOT auto-apply
  syncPolicy: {}  # Empty = manual
  # These options affect the drift detection display, not sync behavior
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # HPA manages replicas - ignore the replica count in Git
For control plane services with no live trading state - Prometheus, Grafana, cert-manager, Harbor - automated sync is appropriate:
# argocd/apps/prometheus.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: observability
  source:
    repoURL: https://github.com/padalan/zerocopy
    targetRevision: main
    path: k8s/observability/prometheus
  destination:
    server: https://kubernetes.default.svc
    namespace: observability
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
DOKS-Specific Gotchas
Running ArgoCD on DigitalOcean Kubernetes has a few specific pain points that are not documented prominently.
ArgoCD revision cache requires full namespace delete to bust. When ArgoCD’s git revision cache gets corrupted or stale (most commonly after a repo is restructured or a large force-push), ArgoCD will appear to sync correctly but may be applying an old revision’s state. The only reliable fix is:
kubectl delete ns argocd
# Wait for namespace to be fully terminated
kubectl create ns argocd
helm upgrade --install argocd argo/argo-cd -n argocd -f values/argocd.yaml
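Pod restarts, cache invalidation API calls, and ArgoCD’s own hard-refresh button do not reliably fix a corrupted revision cache. This is not DOKS-specific - it applies to any ArgoCD installation - but it is surprising the first time you encounter it. Note that deleting the namespace also deletes every Application resource in it. Applications without the resources finalizer leave their workloads running, but the root application has to be re-applied afterwards to recreate the child applications, so do this outside trading hours.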
Helm chart support requires kustomize build options. If your ArgoCD application uses Kustomize and your Kustomize configuration references Helm charts (via helmCharts in kustomization.yaml), ArgoCD requires an explicit option to enable this. The option is set globally in the argocd-cm ConfigMap, not on the individual Application (the Application's source.kustomize block has no buildOptions field):
# argocd-cm ConfigMap in the argocd namespace
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  kustomize.buildOptions: --enable-helm  # Required for Helm charts in Kustomize
Without this flag, ArgoCD will fail to build the application with a cryptic “helmCharts not enabled” error.
DOKS node pool autoscaling and ArgoCD drift. DOKS can add or remove nodes based on cluster autoscaler activity. When a node is removed, pods are rescheduled, which may change the nodeName field on some pod specs. ArgoCD can detect this as drift and flag the application as out-of-sync. Add nodeName to your ignoreDifferences configuration for any StatefulSets where this matters.
A Real ArgoCD Application Manifest for a Trading Service
This is a representative manifest for the OMS that incorporates all the patterns above:
# argocd/apps/oms.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: oms
  namespace: argocd
  labels:
    team: trading
    criticality: high
  annotations:
    # Only regenerate manifests when commits touch this app's own path
    # (relative paths are resolved against spec.source.path)
    argocd.argoproj.io/manifest-generate-paths: .
spec:
  project: trading-prod  # Restricts which repos/clusters this app can deploy to
  source:
    repoURL: https://github.com/padalan/zerocopy
    targetRevision: main
    path: k8s/trading/oms
    # --enable-helm is configured globally via kustomize.buildOptions in argocd-cm
  destination:
    server: https://kubernetes.default.svc
    namespace: trading
  # Manual sync: no automated reconciliation for trading hot path
  syncPolicy: {}
  # Ignore fields that change outside Git
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas                          # HPA-managed
        - /spec/template/spec/containers/0/image  # Updated by CI, not ArgoCD
    - group: ""
      kind: Secret
      jsonPointers:
        - /data  # Managed by External Secrets operator, not directly committed
  # Links for operator context
  info:
    - name: Runbook
      value: https://internal-docs.zerocopy.systems/runbooks/oms-deployment
    - name: Owner
      value: trading-infra
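The trading-prod project referenced above is what actually enforces the repo and cluster restrictions. A minimal sketch of such an AppProject follows. The project definition is not shown in this post, so everything here is an assumption: the description, the allowed destination, and especially the sync window, whose schedule, duration, and time zone are placeholder values for a hypothetical trading session.

```yaml
# Hypothetical AppProject for trading services (sketch, not the production definition)
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: trading-prod
  namespace: argocd
spec:
  description: Production trading services
  sourceRepos:
    - https://github.com/padalan/zerocopy   # only this repo may be deployed
  destinations:
    - server: https://kubernetes.default.svc
      namespace: trading                    # only this namespace may be targeted
  syncWindows:
    - kind: deny                  # block syncs while the window is active
      schedule: "30 9 * * 1-5"    # placeholder: weekday session open
      duration: 7h                # placeholder: session length
      timeZone: America/New_York  # placeholder time zone
      applications:
        - oms
      manualSync: false           # also block manual syncs during the window
```

With manualSync set to false, even an operator clicking Sync during the window is refused, which adds a second guard on top of syncPolicy: {}. Setting it to true would keep the window as a guard against automated syncs only.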
How This Breaks in Production
Sync wave failures leaving partial state. If wave 3 (NATS) deploys successfully but wave 5 (OMS) fails, your cluster is in a partial state: new NATS version running, old OMS version running. ArgoCD marks the application as out-of-sync but the OMS is still functional. Be careful with rollback strategy here - rolling back the NATS version while the OMS is connected to it can cause connection disruption.
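PreSync hook timeout leaving trading service in standby. If the PreSync hook sends the standby signal but then fails or times out waiting for the acknowledgment, ArgoCD aborts the sync before modifying any resources, and nothing returns the OMS to active. The OMS is then stuck in standby mode (not trading) with the old code still deployed, because ArgoCD does not roll anything back automatically. You now have a trading service that is neither the old version (trading) nor the new version (verified) - it is the old version in standby. A related case: if the sync applies the new version and the PostSync check then fails, the new code is running but unverified. Your on-call runbook must handle the “abort deployment, return to active” procedure for these failure modes.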
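One way to make the "return to active" step repeatable is to keep a one-off Job manifest next to the runbook and create it by hand with kubectl create -f. The sketch below reuses the admin endpoint and credentials from the PreSync hook. The "active" mode value, the reason string, and the Job name prefix are assumptions, since the OMS admin API is not documented here. It is deliberately not an ArgoCD hook, so a human decides when trading resumes.

```yaml
# Hypothetical runbook Job: apply manually, outside ArgoCD's managed path
apiVersion: batch/v1
kind: Job
metadata:
  generateName: oms-return-to-active-   # unique name per run; requires kubectl create
  namespace: trading
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600         # clean up finished Jobs after an hour
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: activate
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              # "active" is an assumed mode value mirroring the standby request
              curl -f -X POST \
                -H "Authorization: Bearer ${OMS_ADMIN_TOKEN}" \
                -H "Content-Type: application/json" \
                -d '{"mode": "active", "reason": "sync-aborted"}' \
                http://oms.trading.svc.cluster.local:8080/admin/mode
          envFrom:
            - secretRef:
                name: oms-admin-credentials
```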
ArgoCD App-of-Apps race condition on initial deploy. When you first deploy the app-of-apps to a fresh cluster, the parent app creates child application manifests before the child application destinations may have their namespaces. If the child app’s destination namespace does not exist, ArgoCD marks the child as degraded. Use CreateNamespace=true in the child app’s syncOptions, or ensure namespace creation is in wave 0 of the parent app.
ignoreDifferences hiding genuine drift. If your ignoreDifferences list is too broad, ArgoCD will suppress warnings about real drift. A common pattern: ignoring the image field (because CI updates it) means ArgoCD will never alert if someone accidentally rolls back to an old image through a non-CI path. Audit your ignoreDifferences configuration periodically - every suppressed field is a potential blind spot.
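One way to keep the blind spot small is to scope each rule as tightly as possible. The fragment below is a sketch of a narrower image rule for the OMS Application: it names the specific Deployment and selects the container by name with a jq path expression instead of ignoring index 0 on every Deployment. The container name "oms" is an assumption.

```yaml
# Sketch: narrower ignoreDifferences for spec.ignoreDifferences of the OMS Application
ignoreDifferences:
  - group: apps
    kind: Deployment
    name: oms              # only this Deployment, not every Deployment in the app
    namespace: trading
    jqPathExpressions:
      # assumed container name; sidecar images are still diffed
      - .spec.template.spec.containers[] | select(.name == "oms") | .image
```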
DOKS cluster version upgrades triggering API version conflicts. DOKS minor version upgrades can remove deprecated built-in API versions (for example, policy/v1beta1 PodDisruptionBudgets, replaced by policy/v1). If your ArgoCD applications manage resources with the old API version, ArgoCD will show drift and attempting to sync may fail with API version conflicts. Run kubectl convert (a separately installed kubectl plugin) on your manifests before a Kubernetes version upgrade and test on staging first.
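For the PodDisruptionBudget case, the migrated manifest is short. The sketch below shows one using the policy/v1 API. The resource name, minAvailable value, and app: oms label are assumptions, since the actual PDBs are not shown in this post.

```yaml
# Sketch: PodDisruptionBudget on the current API version
apiVersion: policy/v1        # was policy/v1beta1 before the migration
kind: PodDisruptionBudget
metadata:
  name: oms                  # hypothetical name
  namespace: trading
spec:
  minAvailable: 1            # hypothetical value
  selector:
    matchLabels:
      app: oms               # hypothetical label
```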