This guide walks you through enabling data plane failover on an existing Astro Private Cloud (APC) installation. You configure the control plane and each participating data plane cluster separately.
For a conceptual overview of the feature and its components, see Data plane failover.
Before you begin, make sure you have the following:
- A control plane cluster (global.plane.mode: control) and at least two data plane clusters (global.plane.mode: data).
- A ClusterSecretStore custom resource configured in each data plane cluster that points to the same external secrets store. The name of this resource is the value you provide for global.dataPlaneFailover.externalSecretManagerName.
- A ClusterRole for the identity running helm install or helm upgrade on each data plane cluster. The External Secrets Operator (ESO) installs cluster-scoped CRDs, which require cluster-level permissions.
Before applying the Helm values, create a ClusterSecretStore and its backing credentials secret in each data plane cluster. APC uses the ClusterSecretStore to push and pull Airflow secrets between data planes during failover.
The following example uses AWS Secrets Manager. Substitute the values for your environment and secrets store provider.
Create a ClusterSecretStore in each data plane cluster. The value you use for metadata.name is the value you provide for global.dataPlaneFailover.externalSecretManagerName in your Helm values.
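As a rough sketch, a credentials secret and a ClusterSecretStore for AWS Secrets Manager could look like the following. The secret name, namespace, and key names are hypothetical placeholders. The apiVersion assumes an ESO release that serves external-secrets.io/v1beta1, so check which version your ESO installation serves. Static access keys are only one of the authentication options that ESO supports for AWS.

```yaml
# Hypothetical backing credentials secret. Name, namespace, and keys are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: aws-secrets-manager-credentials
  namespace: external-secrets
type: Opaque
stringData:
  access-key-id: "<your-aws-access-key-id>"
  secret-access-key: "<your-aws-secret-access-key>"
---
# ClusterSecretStore pointing at AWS Secrets Manager.
# apiVersion is an assumption; newer ESO releases may serve a different version.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  # Must match global.dataPlaneFailover.externalSecretManagerName.
  name: "<your-cluster-secret-store-name>"
spec:
  provider:
    aws:
      service: SecretsManager
      region: "<your-aws-region>"
      auth:
        secretRef:
          accessKeyIDSecretRef:
            name: aws-secrets-manager-credentials
            namespace: external-secrets
            key: access-key-id
          secretAccessKeySecretRef:
            name: aws-secrets-manager-credentials
            namespace: external-secrets
            key: secret-access-key
```

Apply the same manifest in every data plane cluster so that each ClusterSecretStore points to the same external secrets store.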
Add the following values to your control plane values.yaml. Setting global.dataPlaneFailover.enabled: true activates Navigator, DP-Link, and the APC API dispatcher when global.plane.mode is control.
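A minimal sketch of these values, using only the keys named in this guide. Your existing values.yaml likely contains other settings, which should stay as they are.

```yaml
global:
  plane:
    mode: control
  dataPlaneFailover:
    # Activates Navigator, DP-Link, and the APC API dispatcher on the control plane.
    enabled: true
    externalSecretManagerName: "<your-cluster-secret-store-name>"
```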
Replace <your-cluster-secret-store-name> with the name of the ClusterSecretStore custom resource in your data plane clusters.
ESO isn’t required on the control plane. Don’t set external-secrets.enabled: true in your control plane values.
Add the following values to each data plane values.yaml. Setting global.dataPlaneFailover.enabled: true activates Pilot and the Flightdeck database bootstrap when global.plane.mode is data.
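A minimal sketch of these values, again limited to the keys named in this guide. It assumes you want the bundled ESO subchart. Merge it into your existing data plane values.yaml rather than replacing the file.

```yaml
global:
  plane:
    mode: data
  dataPlaneFailover:
    # Activates Pilot and the Flightdeck database bootstrap on the data plane.
    enabled: true
    externalSecretManagerName: "<your-cluster-secret-store-name>"

# Enables the bundled ESO subchart, which installs cluster-scoped CRDs.
external-secrets:
  enabled: true
```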
Use the same value for global.dataPlaneFailover.externalSecretManagerName as on the control plane. The control plane and every data plane cluster must reference the same ClusterSecretStore name.
The external-secrets key enables the bundled ESO subchart, which installs cluster-scoped CRDs. The identity running helm upgrade must have a ClusterRole on the data plane cluster. If you already run ESO separately, set external-secrets.enabled: false and ensure your existing ESO installation recognizes the ClusterSecretStore that APC expects.
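If you manage ESO yourself, the data plane values might instead look like the following sketch. It differs from the bundled setup only in the external-secrets key, and it assumes the ClusterSecretStore named here already exists in your own ESO installation.

```yaml
global:
  plane:
    mode: data
  dataPlaneFailover:
    enabled: true
    externalSecretManagerName: "<your-cluster-secret-store-name>"

# Skip the bundled ESO subchart because ESO is already installed separately.
external-secrets:
  enabled: false
```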
The deployment orchestrator bootstraps the Flightdeck database as an init container during startup. If the bootstrap fails, the deployment orchestrator Pod doesn’t start. Check the flightdeck-bootstrapper and flightdeck-db-migrations init container logs if the deployment orchestrator fails to come up after enabling this feature.
Apply the updated values to each cluster using helm upgrade. Upgrade the control plane first.
Run the same command for each data plane cluster, substituting the appropriate release name, namespace, and values file.
After the upgrade completes, confirm that the new components are running on each cluster.
On the control plane, verify that the following Pods are running:
On each data plane, verify that the deployment orchestrator started successfully and Pilot is running:
Check deployment orchestrator logs to confirm Flightdeck initialized correctly:
Changing any of the values in this section can meaningfully affect resource usage on your Kubernetes clusters and may adversely affect failover functionality. Change and test these values in a non-production environment before applying them to production.
Pilot’s claim, retry, and circuit breaker behavior is configurable via environment variables. Set these under astronomer.pilot.env in your data plane values.yaml.
Consider raising PILOT_MAX_INFLIGHT_PER_WORKER above the default of 5 in two cases: data planes with a large number of Airflow Deployments (roughly 50 or more), and cross-region failovers, where each Deployment takes longer to come up because the deployment orchestrator has to pull container images from a remote-region registry endpoint or fetch secrets from a remote-region secrets backend. A higher value lets Pilot bring more Deployments up on the destination cluster in parallel, which reduces overall failover time and helps amortize cross-region latency. Each in-flight flight runs additional work on the data plane cluster (secret syncs, Helm installs, and database operations) and consumes additional bandwidth to the registry and secrets store. Only raise this value if your data plane cluster has spare CPU, memory, and API server headroom and your registry and secrets backends can handle the extra concurrent traffic. Validate the new value in a non-production environment first.
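As an illustration, raising the limit in your data plane values.yaml might look like the sketch below. It assumes that astronomer.pilot.env takes a list of name and value entries, so confirm the expected shape against your chart's values reference. The value 10 is only an example and not a recommendation.

```yaml
astronomer:
  pilot:
    env:
      # Assumed list-of-name/value format. Default for this variable is 5.
      - name: PILOT_MAX_INFLIGHT_PER_WORKER
        value: "10"
```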
Navigator’s reconcile loop timing is configurable via environment variables. Set these under astronomer.navigator.env in your control plane values.yaml.
DP-Link determines cluster health based on heartbeat age. Adjust these thresholds under astronomer.dpLink.env in your control plane values.yaml.
The APC API dispatcher dispatches flights from the control plane to the deployment orchestrator on each data plane. Its loop timing, concurrency, retry, and circuit breaker behavior are configurable through environment variables. Set these under astronomer.houston.env in your control plane values.yaml.