Data plane failover is a resiliency feature that moves all Apache Airflow Deployments from a source data plane cluster to a destination data plane cluster. When you trigger a failover, Astro Private Cloud (APC) applies each Deployment’s configuration and secrets to the destination cluster, gets it up and running there, and cleans up the source side with minimal manual intervention.
Failover is a full-cluster operation — every Deployment on the source cluster is included. It is asynchronous: after you submit a request, the platform drives execution through a state machine until every Deployment is running on the destination cluster or has failed with an error that requires operator attention.
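As an illustration of the asynchronous behavior described above, the following minimal Python sketch derives an overall request status from per-Deployment statuses. The `Status` enum reuses the `IN_PROGRESS`, `SUCCEEDED`, and `FAILED` names of the FailoverRequest record. Applying those names to individual Deployments, and treating any single failure as failing the whole request, are assumptions made for illustration and not documented APC behavior.

```python
from enum import Enum


class Status(str, Enum):
    IN_PROGRESS = "IN_PROGRESS"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


def request_status(deployment_statuses):
    """Derive an overall request status from per-Deployment statuses."""
    statuses = list(deployment_statuses)
    # The request stays in progress while any Deployment is still moving.
    if any(s is Status.IN_PROGRESS for s in statuses):
        return Status.IN_PROGRESS
    # Assumption: one failed Deployment marks the whole request as FAILED.
    if any(s is Status.FAILED for s in statuses):
        return Status.FAILED
    return Status.SUCCEEDED
```

A caller polling the request would keep waiting while `request_status` returns `Status.IN_PROGRESS` and inspect individual Deployments once it returns `Status.FAILED`.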
A failover request moves through the following stages:
FailoverRequest record and transitions it to IN_PROGRESS.
FailoverRequest as SUCCEEDED or FAILED.
The Trigger Failover button in the UI is active only when all of the following are true: failoverEnabled is true on the source cluster, external-secrets.enabled is true in both the source and destination clusters, global.dataPlaneFailover.externalSecretManagerName is set, and a valid, authenticated ClusterSecretStore exists.
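The conditions for the Trigger Failover button can be expressed as a single pre-flight check. The sketch below is a hypothetical helper, not part of APC. It assumes that each cluster's settings are available as nested dictionaries shaped like Helm values, that `global.dataPlaneFailover.externalSecretManagerName` lives in a separate `platform_values` dictionary, and that the caller has already verified the ClusterSecretStore and passes the result as `store_is_valid`.

```python
def lookup(values, dotted_key):
    """Walk a nested dict using a dotted key; return None if any part is missing."""
    node = values
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def trigger_failover_enabled(source_cluster, source_values, dest_values,
                             platform_values, store_is_valid):
    """Return True only when every documented precondition holds."""
    checks = [
        source_cluster.get("failoverEnabled") is True,
        lookup(source_values, "external-secrets.enabled") is True,
        lookup(dest_values, "external-secrets.enabled") is True,
        bool(lookup(platform_values,
                    "global.dataPlaneFailover.externalSecretManagerName")),
        bool(store_is_valid),
    ]
    return all(checks)
```

Running a check like this before opening the UI can help identify which setting is missing when the button stays inactive.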
The destination cluster dropdown shows only clusters that APC considers schedulable targets for the selected source. A cluster appears as a valid target when it is registered, healthy, and has no pending failover operations targeting it as a destination.
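The target-selection rule above amounts to a filter over the known clusters. The following sketch uses a hypothetical `Cluster` record whose field names are illustrative and do not come from the APC API. Excluding the source cluster from its own target list is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    registered: bool
    healthy: bool
    pending_failovers_as_destination: int = 0


def valid_targets(source, clusters):
    """Return names of clusters that satisfy the documented target criteria."""
    return [
        c.name
        for c in clusters
        if c.name != source.name  # assumption: a cluster cannot target itself
        and c.registered
        and c.healthy
        and c.pending_failovers_as_destination == 0
    ]
```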
APC doesn’t compare APC versions between the source and destination data planes before a failover. You are responsible for keeping the source and destination clusters on compatible APC versions.
Data plane failover adds several components that aren’t deployed in a standard APC installation. Each component runs on either the control plane or the data plane, as described in the following table.
APC uses the External Secrets Operator (ESO) to replicate Airflow secrets between data planes. When failover is enabled:
PushSecret custom resources that write Airflow secrets (fernet key, environment variables, and database credentials) into an external secrets store through a ClusterSecretStore.
ExternalSecret custom resources that pull those secrets from the same store into the destination namespace.
Both the source and destination data plane clusters must be able to reach the same external secrets store.
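To make the push and pull relationship concrete, the sketch below builds a matching PushSecret and ExternalSecret as Python dictionaries that could be serialized to YAML. The field layout follows the upstream External Secrets Operator schemas. The API versions shown vary by ESO release, and every name passed in is hypothetical. The resources that APC actually generates may differ.

```python
def push_secret(name, namespace, store, secret_name, key, remote_key):
    """PushSecret: copy one key of a Kubernetes Secret into the external store."""
    return {
        "apiVersion": "external-secrets.io/v1alpha1",  # may differ by ESO release
        "kind": "PushSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "secretStoreRefs": [{"name": store, "kind": "ClusterSecretStore"}],
            "selector": {"secret": {"name": secret_name}},
            "data": [
                {"match": {"secretKey": key,
                           "remoteRef": {"remoteKey": remote_key}}}
            ],
        },
    }


def external_secret(name, namespace, store, target_secret, key, remote_key):
    """ExternalSecret: pull the same remote key into a Secret on the destination."""
    return {
        "apiVersion": "external-secrets.io/v1beta1",  # may differ by ESO release
        "kind": "ExternalSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "secretStoreRef": {"name": store, "kind": "ClusterSecretStore"},
            "target": {"name": target_secret},
            "data": [{"secretKey": key, "remoteRef": {"key": remote_key}}],
        },
    }
```

The two resources meet only at the store name and the remote key, which is why both clusters need access to the same external secrets store.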
APC provisions logical databases and database users automatically when you create a Deployment, but it doesn’t provision database servers. You must provide a database server hostname that is network-accessible from both the source and destination data plane clusters.
Two database server topologies are supported:
In either topology, you are responsible for setting up network access from your data plane clusters to the database server, and for managing replication, primary promotion, and endpoint cutover if you use the synchronized topology.
When performing regional failover with synchronized database servers, the expected sequence is:
When a Deployment is created with failover enabled, APC provisions two sets of logical databases and credentials on the database server: an active set used by the source cluster and an inactive set used by the destination cluster. APC immediately blocks the inactive credentials from connecting after provisioning.
During failover, the deployment orchestrator performs database fencing for each Deployment: it revokes connect access from the active credentials and grants it to the inactive credentials. This ensures only one cluster writes to a given Deployment’s logical databases at a time. Fencing is per Deployment, so individual Deployments can be migrated at different times during a single failover request.
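The fencing swap can be pictured as a pair of SQL statements per logical database. The sketch below assumes a PostgreSQL server and models connect access with `REVOKE CONNECT` and `GRANT CONNECT`. The statements APC actually issues are not specified here, and the function and argument names are hypothetical.

```python
def quote_ident(name):
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def fencing_statements(database, active_user, inactive_user):
    """Build the statements that move connect access between the two users."""
    db = quote_ident(database)
    return [
        # Revoke first so both users are never allowed to connect at once.
        f"REVOKE CONNECT ON DATABASE {db} FROM {quote_ident(active_user)};",
        f"GRANT CONNECT ON DATABASE {db} TO {quote_ident(inactive_user)};",
    ]
```

In PostgreSQL, revoking `CONNECT` blocks new connections but does not end sessions that are already open, so a complete fencing procedure would also need to deal with existing sessions.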
To prevent split-brain writes during failover, APC fences each Airflow Deployment at the database level using two database users per Deployment. APC supports two ownership models:
ALTER ROLE), and APC handles the rest. You don’t take any further action.
APC doesn’t replicate container images between regions. You configure a single container registry endpoint per APC installation on the control plane, and every data plane in the installation pulls Airflow Deployment images from that endpoint. APC currently doesn’t support different registry endpoints per region.
registry.example.com/my-org/airflow:1.2.3 must resolve to the same image regardless of which data plane pulls it.
If you back the registry endpoint with cross-region replication, replication latency determines when a Deployment is eligible to run on a destination data plane. If a Deployment’s image hasn’t yet replicated to the region serving the destination data plane, APC can’t start that Deployment there, and Pilot’s upsert step fails until the image becomes available. Size your registry replication SLA to be faster than your expected failover window for the Deployments that must be able to fail over.
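One way to reason about image eligibility before a failover is to compare the images your Deployments use against the images already replicated to the destination region. The sketch below is a hypothetical pre-flight helper. How you obtain the two inputs depends on your registry and is outside its scope.

```python
def blocked_deployments(deployment_images, replicated_images):
    """Return names of Deployments whose image is not yet in the destination region.

    deployment_images: dict mapping Deployment name to image reference.
    replicated_images: iterable of image references available to the destination.
    """
    available = set(replicated_images)
    return sorted(
        name
        for name, image in deployment_images.items()
        if image not in available
    )
```

An empty result means every listed Deployment's image is already available to the destination data plane.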
APC ships Airflow task logs from each data plane to an external Elasticsearch sink. After a Deployment moves between data planes, you still need to be able to read its task logs through the same UI, so the log sink topology matters for failover.
For details on configuring Vector and the Elasticsearch sink itself, see Configure task log collection and exporting to ElasticSearch.
APC supports two Elasticsearch topologies for failover:
In either topology, you are responsible for sizing, securing, and operating the Elasticsearch infrastructure, and (in the active-active topology) for managing replication between regions.
Before enabling data plane failover, confirm the following:
global.plane.mode: control on the control plane, global.plane.mode: data on each data plane). Failover isn’t supported in unified mode.
ClusterSecretStore custom resource in your data plane clusters that points to that secrets store.
The initial APC 2.0 release of data plane failover has the following limitations: