ArgoCD GitOps for a Trading Platform: Progressive Delivery Without Breaking Production

ZeroCopy runs 11 ArgoCD applications on DOKS. How we use app-of-apps, sync waves, resource hooks, and manual sync policies to deploy trading services safely.

11 min
#argocd #gitops #kubernetes #trading-infrastructure #progressive-delivery #doks

The first time I tried to use ArgoCD’s automated sync on our order management system, it restarted the service in the middle of a trading session. ArgoCD had detected that the live deployment differed from the Git state (a human had applied a configmap change to fix a typo in an alert label) and helpfully synced the cluster back to Git’s version of truth - which triggered a rolling restart of the OMS pods.

The positions were fine - the restart was fast enough that no trades were lost - but the experience was instructive. ArgoCD’s default behavior assumes that continuous reconciliation toward Git state is always desirable. For a trading platform, that assumption is wrong in specific, important ways.

This post is about how we structured ArgoCD at ZeroCopy to get the benefits of GitOps (auditability, reproducibility, drift detection) without the risk of automated syncs touching services during live trading hours.

The App-of-Apps Pattern

ZeroCopy runs 11 ArgoCD applications across a DOKS (DigitalOcean Kubernetes) cluster. Managing 11 application manifests independently would mean 11 places to update the repo URL, the target revision, and the cluster credentials when things change. The app-of-apps pattern solves this: one parent application manages all child application definitions.

The parent application points to a directory of ArgoCD Application manifests:

# argocd/parent-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: zerocopy-root
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/padalan/zerocopy
    targetRevision: main
    path: argocd/apps  # Directory containing all child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # Remove applications that are deleted from Git
      selfHeal: true   # Revert manual edits to the child Application resources

The parent app’s syncPolicy.automated is enabled because the application definitions themselves are not trading infrastructure - changing which applications exist does not restart trading services. The parent app manages the ArgoCD control plane state, not the trading workloads.

The child applications are defined in argocd/apps/:

argocd/apps/
├── oms.yaml               # Order management system
├── risk-engine.yaml       # Risk engine
├── nats-cluster.yaml      # NATS messaging
├── prometheus.yaml        # Monitoring
├── grafana.yaml           # Dashboards
├── alertmanager.yaml      # Alerts
├── harbor.yaml            # Container registry
├── cert-manager.yaml      # TLS certificates
├── external-secrets.yaml  # Secret sync from Infisical
├── kyverno.yaml           # Policy engine
└── argo-rollouts.yaml     # Progressive delivery controller

Each child application is its own ArgoCD Application resource, and critically, each has its own syncPolicy - different services get different sync behavior.

Sync Waves: Ordering Your Deployment Sequence

Sync waves solve a class of ordering problems that naive Kubernetes deployments fail on: you cannot deploy a service before the namespace that contains it, or a deployment before the secret it references, or an application before the CRD that defines its custom resources.

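ArgoCD implements sync waves via annotation. Resources in the lowest-numbered wave sync first (the default wave is 0, and negative values are allowed), then the next wave, and so on. ArgoCD waits for the resources in a wave to become healthy before starting the next one. Within a wave, resources are applied together, ordered by kind and then by name.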

Our sync wave structure for a trading service deployment:

# Wave 0: Namespaces and CRDs (must exist before everything else)
# argocd/manifests/namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: trading
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
apiVersion: v1
kind: Namespace
metadata:
  name: observability
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# Wave 1: External Secrets operator (must be running before wave 2 secrets)
# argocd/manifests/external-secrets-operator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-secrets
  namespace: external-secrets
  annotations:
    argocd.argoproj.io/sync-wave: "1"
---
# Wave 2: Secrets (populated by External Secrets from Infisical vault)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: oms-credentials
  namespace: trading
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  refreshInterval: 5m
  secretStoreRef:
    name: infisical-backend
    kind: ClusterSecretStore
  target:
    name: oms-credentials
  data:
    - secretKey: db-password
      remoteRef:
        key: OMS_DB_PASSWORD
    - secretKey: exchange-api-key
      remoteRef:
        key: EXCHANGE_API_KEY
---
# Wave 3: Core infrastructure services (NATS, databases)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nats
  namespace: infrastructure
  annotations:
    argocd.argoproj.io/sync-wave: "3"
---
# Wave 5: Trading application services (depend on secrets + infrastructure)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oms
  namespace: trading
  annotations:
    argocd.argoproj.io/sync-wave: "5"

The gap between wave 3 and wave 5 is intentional - it leaves room to insert wave 4 resources (like a database migration job) without renumbering everything.
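As an illustration of what could occupy wave 4, here is a minimal sketch of a database migration Job run as a Sync-phase hook. The Job name, image, and arguments are hypothetical placeholders, not ZeroCopy's actual setup; only the oms-credentials secret and its db-password key come from the ExternalSecret above.

```yaml
# Hypothetical wave-4 migration Job (name, image, and args are placeholders)
apiVersion: batch/v1
kind: Job
metadata:
  name: oms-db-migrate
  namespace: trading
  annotations:
    argocd.argoproj.io/hook: Sync                              # Runs during the Sync phase, alongside the manifests
    argocd.argoproj.io/sync-wave: "4"                          # After wave 3 (NATS), before wave 5 (OMS)
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation  # Delete the previous run's Job before creating a new one
spec:
  backoffLimit: 0          # A failed migration should stop the sync, not retry
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/oms-migrations:1.0.0   # placeholder image
          args: ["migrate", "up"]                            # placeholder command
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: oms-credentials   # Created by the wave 2 ExternalSecret
                  key: db-password
```

Running the migration as a hook rather than a plain resource sidesteps Job immutability: ArgoCD recreates the Job on each sync instead of trying to patch a completed one.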

Resource Hooks: Pre- and Post-Sync Operations

Resource hooks are the mechanism for running operations at specific points in the sync lifecycle. The trading-specific use cases:

PreSync: Set trading engine to standby before deployment.

For stateful trading services that cannot restart cleanly mid-session, the PreSync hook sends a “go to standby” command before ArgoCD modifies any resources. The service stops accepting new orders, drains in-flight operations, and acknowledges the standby state. Only then does ArgoCD proceed with the sync.

# argocd/hooks/oms-presync.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: oms-presync-standby
  namespace: trading
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded  # Clean up after success
spec:
  template:
    spec:
      containers:
        - name: standby-notifier
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              # Send standby signal to OMS
              curl -f -X POST \
                -H "Authorization: Bearer ${OMS_ADMIN_TOKEN}" \
                -H "Content-Type: application/json" \
                -d '{"mode": "standby", "reason": "argocd-sync"}' \
                http://oms.trading.svc.cluster.local:8080/admin/mode

              # Wait for OMS to confirm standby mode
              for i in $(seq 1 30); do
                # curlimages/curl ships without jq, so match the JSON field with grep
                if curl -sf http://oms.trading.svc.cluster.local:8080/health/mode \
                    | grep -Eq '"mode"[[:space:]]*:[[:space:]]*"standby"'; then
                  echo "OMS confirmed standby mode"
                  exit 0
                fi
                sleep 2
              done

              echo "OMS did not acknowledge standby within 60s"
              exit 1
          envFrom:
            - secretRef:
                name: oms-admin-credentials
      restartPolicy: Never
  backoffLimit: 0  # Fail fast - do not retry if standby signal fails

PostSync: Verify health before declaring success.

The PostSync hook runs after all resources have been updated. It verifies that the new version is healthy before ArgoCD marks the sync as successful.

# argocd/hooks/oms-postsync.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: oms-postsync-verify
  namespace: trading
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: health-verifier
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              # Wait for OMS to complete startup
              for i in $(seq 1 60); do
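                # curlimages/curl ships without jq, so check each JSON field with grep
                HEALTH=$(curl -sf http://oms.trading.svc.cluster.local:8080/health/deep)
                if echo "$HEALTH" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"' \
                    && echo "$HEALTH" | grep -Eq '"exchange_connections"[[:space:]]*:[[:space:]]*"all_connected"'; then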
                  echo "OMS health check passed: all exchange connections live"
                  exit 0
                fi
                echo "Waiting for OMS health... attempt $i/60"
                sleep 5
              done

              echo "OMS failed health check after 300s"
              exit 1
      restartPolicy: Never
  backoffLimit: 0

SyncFail: Page on-call when sync fails.

If a sync fails - whether due to a failing hook, a pod that fails to start, or a resource conflict - the SyncFail hook fires. For production trading services, a sync failure should wake someone up.

apiVersion: batch/v1
kind: Job
metadata:
  name: oms-syncfail-alert
  namespace: trading
  annotations:
    argocd.argoproj.io/hook: SyncFail
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation  # Recreate the Job on every failed sync
spec:
  template:
    spec:
      containers:
        - name: alerter
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              curl -X POST "$PAGERDUTY_EVENTS_URL" \
                -H "Content-Type: application/json" \
                -d "{
                  \"routing_key\": \"$PAGERDUTY_ROUTING_KEY\",
                  \"event_action\": \"trigger\",
                  \"payload\": {
                    \"summary\": \"ArgoCD sync failed for OMS in production\",
                    \"severity\": \"critical\",
                    \"source\": \"argocd\"
                  }
                }"
          envFrom:
            - secretRef:
                name: pagerduty-credentials
      restartPolicy: Never

Manual Sync Policy for Hot-Path Services

This is the most important configuration decision in our ArgoCD setup. Trading services that hold live position state - the OMS, the risk engine, the strategy engine - must have syncPolicy: {} (manual sync) rather than syncPolicy.automated.

# argocd/apps/oms.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: oms
  namespace: argocd
spec:
  project: trading-prod
  source:
    repoURL: https://github.com/padalan/zerocopy
    targetRevision: main
    path: k8s/trading/oms
  destination:
    server: https://kubernetes.default.svc
    namespace: trading

  # CRITICAL: Manual sync only for trading services
  # ArgoCD will detect drift and show it in the UI, but will NOT auto-apply
  syncPolicy: {}   # Empty = manual

  # These options affect the drift detection display, not sync behavior
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas   # HPA manages replicas - ignore the replica count in Git

For control plane services with no live trading state - Prometheus, Grafana, cert-manager, Harbor - automated sync is appropriate:

# argocd/apps/prometheus.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: observability
  source:
    repoURL: https://github.com/padalan/zerocopy
    targetRevision: main
    path: k8s/observability/prometheus
  destination:
    server: https://kubernetes.default.svc
    namespace: observability

  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

DOKS-Specific Gotchas

Running ArgoCD on DigitalOcean Kubernetes has a few specific pain points that are not documented prominently.

ArgoCD revision cache requires full namespace delete to bust. When ArgoCD’s git revision cache gets corrupted or stale (most commonly after a repo is restructured or a large force-push), ArgoCD will appear to sync correctly but may be applying an old revision’s state. The only reliable fix is:

kubectl delete ns argocd
# Wait for namespace to be fully terminated
kubectl create ns argocd
helm upgrade --install argocd argo/argo-cd -n argocd -f values/argocd.yaml

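In our experience, pod restarts, cache invalidation API calls, and ArgoCD’s own hard-refresh button do not reliably fix a corrupted revision cache. This is not DOKS-specific - it applies to any ArgoCD installation - but it is surprising the first time you encounter it. Be aware that deleting the namespace also deletes every Application resource in it, and any Application carrying the resources-finalizer.argocd.argoproj.io finalizer (like the parent app above) cascades that deletion to the resources it manages. Check your finalizers first and do this outside trading hours.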

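Helm chart support requires Kustomize build options. If your ArgoCD application uses Kustomize and your Kustomize configuration references Helm charts (via helmCharts in kustomization.yaml), ArgoCD must pass the --enable-helm flag to kustomize build. The flag is set through the kustomize.buildOptions key in the argocd-cm ConfigMap, so it applies to every Kustomize application on the instance rather than to a single Application:

# argocd-cm ConfigMap (with the Helm chart, set this under configs.cm in values/argocd.yaml)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  kustomize.buildOptions: --enable-helm  # Required for Helm charts in Kustomize

Without this flag, ArgoCD will fail to build the application with a cryptic “helmCharts not enabled” error.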

DOKS node pool autoscaling and ArgoCD drift. DOKS can add or remove nodes based on cluster autoscaler activity. When a node is removed, its pods are rescheduled, which changes the nodeName field on the replacement pods. ArgoCD only diffs resources that are defined in Git, so pods created by a Deployment or StatefulSet are not compared; the field matters for Pod manifests you manage directly and for pod templates that pin nodeName. If ArgoCD flags those as out-of-sync, add nodeName to your ignoreDifferences configuration for the affected resources.
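A minimal sketch of such an ignoreDifferences entry, assuming the drift appears on the nodeName field. Which resource kinds are affected in your cluster is an assumption here; nodeName lives at /spec/nodeName on a Pod and at /spec/template/spec/nodeName on a workload's pod template.

```yaml
# Excerpt from an Application spec; keep only the entries that match what ArgoCD actually flags
spec:
  ignoreDifferences:
    - group: ""
      kind: Pod
      jsonPointers:
        - /spec/nodeName                 # Set by the scheduler on directly managed Pods
    - group: apps
      kind: StatefulSet
      jsonPointers:
        - /spec/template/spec/nodeName   # Only relevant if the pod template pins a node
```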

A Real ArgoCD Application Manifest for a Trading Service

This is a representative manifest for the OMS that incorporates all the patterns above:

# argocd/apps/oms.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: oms
  namespace: argocd
  labels:
    team: trading
    criticality: high
  annotations:
    # Only regenerate manifests when files under this path change
    # (the leading "/" makes the path relative to the repo root)
    argocd.argoproj.io/manifest-generate-paths: /k8s/trading/oms
spec:
  project: trading-prod  # Restricts which repos/clusters this app can deploy to

  source:
    repoURL: https://github.com/padalan/zerocopy
    targetRevision: main
    path: k8s/trading/oms
    # Kustomize's --enable-helm flag is set globally via kustomize.buildOptions in argocd-cm

  destination:
    server: https://kubernetes.default.svc
    namespace: trading

  # Manual sync: no automated reconciliation for trading hot path
  syncPolicy: {}

  # Ignore fields that are changed outside Git
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas   # HPA-managed
        - /spec/template/spec/containers/0/image  # Updated by CI, not ArgoCD
    - group: ""
      kind: Secret
      jsonPointers:
        - /data   # Managed by External Secrets operator, not directly committed

  # Links and ownership info for operators (displayed in the ArgoCD UI)
  info:
    - name: Runbook
      value: https://internal-docs.zerocopy.systems/runbooks/oms-deployment
    - name: Owner
      value: trading-infra

How This Breaks in Production

Sync wave failures leaving partial state. If wave 3 (NATS) deploys successfully but wave 5 (OMS) fails, your cluster is in a partial state: new NATS version running, old OMS version running. ArgoCD marks the application as out-of-sync but the OMS is still functional. Be careful with rollback strategy here - rolling back the NATS version while the OMS is connected to it can cause connection disruption.

PreSync hook timeout leaving trading service in standby. If the PreSync hook sends the standby signal but then times out waiting for the acknowledgment, or the sync fails before the new resources are applied, your OMS is stuck in standby mode (not trading) with the old code still deployed, because ArgoCD does not roll back automatically. You now have a trading service that is neither the old version (trading) nor the new version (verified) - it is the old version in standby. Your on-call runbook must handle the “abort deployment, return to active” procedure for this specific failure mode.
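One way the "return to active" step could be packaged is as a Job that the on-call engineer creates by hand. This is a sketch only: it assumes the /admin/mode endpoint accepts an "active" mode the same way it accepts "standby", which this post does not confirm, and the Job name and reason string are hypothetical.

```yaml
# Hypothetical recovery Job, run manually: kubectl create -f oms-return-to-active.yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: oms-return-to-active-   # generateName needs "kubectl create", not "apply"
  namespace: trading
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: activate
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              # ASSUMPTION: "active" is a valid mode for the admin endpoint
              curl -f -X POST \
                -H "Authorization: Bearer ${OMS_ADMIN_TOKEN}" \
                -H "Content-Type: application/json" \
                -d '{"mode": "active", "reason": "sync-aborted"}' \
                http://oms.trading.svc.cluster.local:8080/admin/mode
          envFrom:
            - secretRef:
                name: oms-admin-credentials
```

Keeping this as a manual step rather than a SyncFail hook is deliberate in this sketch: whether it is safe to resume trading depends on how far the failed sync got.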

ArgoCD App-of-Apps race condition on initial deploy. When you first deploy the app-of-apps to a fresh cluster, the parent app can create the child Application resources before their destination namespaces exist. If a child app’s destination namespace does not exist, the child’s sync fails. Use CreateNamespace=true in the child app’s syncOptions, or ensure namespace creation is in wave 0 of the parent app.
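For manually synced trading apps, CreateNamespace=true can be set without turning on automated sync, since syncOptions and automated are independent fields. A minimal excerpt, using the OMS app from earlier:

```yaml
# argocd/apps/oms.yaml (excerpt)
spec:
  destination:
    server: https://kubernetes.default.svc
    namespace: trading
  syncPolicy:
    # No "automated" block, so syncs are still triggered manually
    syncOptions:
      - CreateNamespace=true   # Create the trading namespace if it is missing
```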

ignoreDifferences hiding genuine drift. If your ignoreDifferences list is too broad, ArgoCD will suppress warnings about real drift. A common pattern: ignoring the image field (because CI updates it) means ArgoCD will never alert if someone accidentally rolls back to an old image through a non-CI path. Audit your ignoreDifferences configuration periodically - every suppressed field is a potential blind spot.
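One way to shrink the blind spot is to scope each entry as tightly as possible. The sketch below limits the image exception to a single named Deployment and a single named container using jqPathExpressions; the container name oms is an assumption, not something taken from the actual manifests.

```yaml
# Excerpt from an Application spec: a narrowly scoped image exception
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      name: oms              # Only this Deployment, not every Deployment in the app
      namespace: trading
      jqPathExpressions:
        # Only the container named "oms" (assumed name); sidecar images still show drift
        - .spec.template.spec.containers[] | select(.name == "oms") | .image
```

Compared with a positional pointer such as /spec/template/spec/containers/0/image, this keeps working if containers are reordered and does not silently cover a sidecar that lands at index 0.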

DOKS cluster version upgrades triggering API version conflicts. DOKS minor version upgrades can remove deprecated built-in API versions (for example, policy/v1beta1 PodDisruptionBudgets were removed in Kubernetes 1.25 in favor of policy/v1). If your ArgoCD applications manage resources with the old API version, ArgoCD will show drift and attempting to sync may fail with API version conflicts. Migrate your manifests to the new API versions before a Kubernetes version upgrade (the kubectl convert plugin can help) and test on staging first.
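For the PodDisruptionBudget case, the migrated manifest only needs the new apiVersion. A minimal sketch, in which the resource name, label selector, and minAvailable value are illustrative assumptions:

```yaml
# PodDisruptionBudget on the current API version (was: apiVersion: policy/v1beta1)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: oms
  namespace: trading
spec:
  minAvailable: 1        # illustrative value
  selector:
    matchLabels:
      app: oms           # assumed pod label
```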
