
Apache Airflow®, Airflow, and the Airflow logo are trademarks of the Apache Software Foundation. Copyright © Astronomer 2026. All rights reserved.


Remote Execution Agent failure and recovery scenarios


When the heartbeat between the API server and a Remote Execution Agent is disrupted, the Astro executor prevents task duplication by marking queued tasks from that agent as failed. This makes tasks eligible for reassignment to healthy agents. To ensure safe task execution, an agent must receive explicit confirmation from the API server before starting any task. If an agent loses connectivity with the API server, the agent continues executing any tasks that the API server already confirmed and marked as running, but the agent will not start new tasks until heartbeat communication is restored.

Agent failure

The API server marks an agent as failed when it receives no heartbeat from that agent for three consecutive heartbeat intervals. When that happens, the API server checks whether the agent has any “queued” tasks, or tasks that the agent already picked up and started running but has not yet reported as complete. If a worker agent fails, the API server marks those tasks as failed and makes them available for reassignment. If a triggerer agent fails, the API server immediately reassigns its tasks, since triggerer tasks are short-lived and idempotent.
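The failure rule above can be sketched in a few lines of Python. This is an illustrative model only, not Astro's implementation: the `Agent` class, the function names, the action labels, and the 5-second heartbeat interval used in the usage note are all hypothetical, and treating "three missed intervals" as "at least three full intervals without a heartbeat" is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum

# From the text above: three consecutive missed heartbeat intervals.
MISSED_INTERVALS_BEFORE_FAILURE = 3


class AgentKind(Enum):
    WORKER = "worker"
    TRIGGERER = "triggerer"


@dataclass
class Agent:
    """Hypothetical view of an agent as the API server might track it."""
    kind: AgentKind
    last_heartbeat: float  # timestamp in seconds of the last heartbeat received
    task_ids: list[str] = field(default_factory=list)


def is_failed(agent: Agent, now: float, heartbeat_interval: float) -> bool:
    """Assumed rule: failed once three full intervals pass with no heartbeat."""
    silence = now - agent.last_heartbeat
    return silence >= MISSED_INTERVALS_BEFORE_FAILURE * heartbeat_interval


def handle_failed_agent(agent: Agent) -> dict[str, str]:
    """Return the action applied to each task held by a failed agent."""
    if agent.kind is AgentKind.WORKER:
        # Worker tasks are marked failed, then offered for reassignment.
        action = "mark_failed_then_make_available"
    else:
        # Triggerer tasks are short-lived and idempotent, so they move at once.
        action = "reassign_immediately"
    return {task_id: action for task_id in agent.task_ids}
```

With a hypothetical 5-second interval, `is_failed(agent, now, 5.0)` becomes true 15 seconds after the last heartbeat, at which point `handle_failed_agent(agent)` lists what happens to each task the agent held.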

Dag scheduling and retention during Agent disconnection

The Airflow scheduler retains all dags that were most recently parsed and sent by the dag processor agent. If the dag processor agent or any Remote Execution Agent disconnects or fails, the scheduler continues to use these previously parsed dags. The scheduler will keep creating dag runs on schedule or in response to events, such as dataset updates, for all retained dags.

  • New or updated dags are not detected until a healthy dag processor agent reconnects and provides an updated set of dags.
  • All tasks and dag runs remain pending until a healthy Remote Execution Agent, worker or triggerer, is available for execution.

If no healthy Remote Execution Agents are connected, the scheduler continues to create dag runs for known dags, but their tasks remain in the queued state and do not execute until an agent becomes available. A task that stays in the queued state longer than the queued-task timeout is marked as failed. The timeout is 600 seconds by default, or the value you set with the AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT environment variable on your Astro Deployment.
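As a rough illustration of the timeout rule, the sketch below resolves the effective timeout from the environment variable named above and applies it to a task's time in the queue. The helper names are hypothetical and the scheduler's real logic is more involved. Only the variable name and the 600-second default come from this page.

```python
import os

DEFAULT_TASK_QUEUED_TIMEOUT = 600.0  # seconds, the default stated above
ENV_VAR = "AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT"


def task_queued_timeout(environ=os.environ) -> float:
    """Return the configured timeout in seconds, or the default if unset."""
    return float(environ.get(ENV_VAR, DEFAULT_TASK_QUEUED_TIMEOUT))


def queued_task_state(seconds_in_queue: float, timeout: float) -> str:
    """A task queued for longer than the timeout is marked as failed."""
    return "failed" if seconds_in_queue > timeout else "queued"
```

For example, with the variable unset, a task that has been queued for 601 seconds resolves to `"failed"`. If the variable is set to `900`, the same task stays `"queued"`.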

API Server failure

If an agent’s heartbeats can’t reach the API server, the agent assumes that the API server and other agents remain healthy. In this case:

  • A worker continues running any tasks that the API server already marked as running, but the worker doesn’t start new tasks until it reconnects with the API server. This prevents two agents from running the same task.
  • A triggerer stops processing tasks entirely until it restores connectivity. Because triggerer workloads are reassigned immediately when a triggerer disconnects, the disconnected triggerer must not keep executing triggers during the partition.

This behavior preserves task safety and prevents duplication for both workers and triggerers, even during partial failures or network partitions.
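The agent-side behavior described above can be modeled as a small sketch. The classes and method names here are hypothetical and exist only to contrast the two cases. Clearing the triggerer's active set is a simplification standing in for "stops processing tasks entirely".

```python
from dataclasses import dataclass, field


@dataclass
class WorkerAgent:
    """Hypothetical worker: keeps confirmed tasks, starts nothing new offline."""
    connected: bool = True
    running: set[str] = field(default_factory=set)

    def try_start(self, task_id: str, confirmed_by_api_server: bool) -> bool:
        # A task starts only with explicit confirmation over a live connection.
        if not self.connected or not confirmed_by_api_server:
            return False
        self.running.add(task_id)
        return True

    def on_disconnect(self) -> None:
        # Tasks already confirmed as running continue; nothing is dropped.
        self.connected = False


@dataclass
class TriggererAgent:
    """Hypothetical triggerer: stops all trigger processing when offline."""
    connected: bool = True
    active_triggers: set[str] = field(default_factory=set)

    def on_disconnect(self) -> None:
        # The API server reassigns these workloads, so stop processing them.
        self.connected = False
        self.active_triggers.clear()
```

In this model, a disconnected worker keeps whatever is in `running`, and `try_start` returns `False` for every new task until the connection is restored. A disconnected triggerer ends up with an empty `active_triggers` set, so no trigger is processed by two triggerers at once.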