Apache Airflow®, Airflow, and the Airflow logo are trademarks of the Apache Software Foundation. Copyright © Astronomer 2026. All rights reserved.

Platform and deployment alerts


Astro Private Cloud (APC) includes two built-in alerting systems for monitoring health:

  • Deployment-level alerts: Notify you when an Airflow Deployment is unhealthy or components are underperforming.
  • Platform-level alerts: Notify you when APC platform components are unhealthy (Elasticsearch, Houston API, Registry, Commander).

Alerts fire based on metrics collected by Prometheus. When alert conditions are met, Prometheus Alertmanager sends notifications to your configured channels.

Alertmanager is enabled by default as part of the APC monitoring stack (tags.monitoring: true). To disable it individually, set global.alertmanagerEnabled: false in your values.yaml. See Apply platform configuration for details.
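
As a minimal sketch, the dotted setting names above map to nested keys in values.yaml, assuming the usual Helm convention for dotted paths. The tags block is shown only for context, since the text above describes it as the default:

```yaml
# values.yaml (sketch): keep the monitoring stack, but turn off Alertmanager only
tags:
  monitoring: true            # default, per the description above; shown for context

global:
  alertmanagerEnabled: false  # disables Alertmanager individually
```

With Alertmanager disabled, Prometheus still evaluates alert rules, but no notifications are sent to receivers.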

Alert architecture

Anatomy of an alert

Alerts are defined in YAML using PromQL queries:

- alert: ManyUnhealthySchedulers
  expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 5
  for: 5m
  labels:
    tier: platform
    severity: critical
  annotations:
    summary: "{{ $value }} airflow schedulers are not heartbeating"
    description: "More than 5 Airflow schedulers have not emitted a heartbeat for over 5 minutes."

| Field | Description |
| --- | --- |
| expr | PromQL expression that determines when to fire |
| for | Duration the condition must be true (for example, 5m, 1h) |
| labels.tier | Alert level: airflow (Deployment) or platform |
| labels.severity | Severity: info, warning, high, critical |
| annotations.summary | Alert message text |
| annotations.description | Human-readable description |

Subscribe to alerts

Configure alert receivers

Alertmanager uses receivers to integrate with notification platforms. Define receivers in your values.yaml:

Email alerts

alertmanager:
  receivers:
    platform:
      email_configs:
        - smarthost: smtp.example.com:587
          from: alerts@example.com
          to: ops-team@example.com
          auth_username: alerts@example.com
          auth_password: ${SMTP_PASSWORD}
          send_resolved: true

Slack alerts

alertmanager:
  receivers:
    platformCritical:
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz
          channel: '#platform-alerts'
          title: '{{ .CommonAnnotations.summary }}'
          text: |-
            {{ range .Alerts }}{{ .Annotations.description }}
            {{ end }}

PagerDuty alerts

alertmanager:
  receivers:
    platformCritical:
      pagerduty_configs:
        - service_key: ${PAGERDUTY_SERVICE_KEY}
          severity: '{{ .CommonLabels.severity }}'
          description: '{{ .CommonAnnotations.summary }}'

OpsGenie alerts

alertmanager:
  receivers:
    platformCritical:
      opsgenie_configs:
        - api_key: ${OPSGENIE_API_KEY}
          message: '{{ .CommonAnnotations.summary }}'
          priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'

Default receiver groups

APC includes default receiver groups based on tier and severity:

| Receiver | Tier | Severity |
| --- | --- | --- |
| platform | platform | all |
| platformCritical | platform | critical |
| airflow | airflow | all |
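
The receiver examples above target the platform and platformCritical groups. As a sketch, a receiver for the airflow group would presumably follow the same shape, reusing the email_configs keys from the email example. The addresses are placeholders, not values from APC:

```yaml
# values.yaml (sketch): send all Deployment-tier alerts to a data engineering mailbox
alertmanager:
  receivers:
    airflow:
      email_configs:
        - smarthost: smtp.example.com:587
          from: alerts@example.com
          to: data-eng@example.com        # hypothetical recipient
          auth_username: alerts@example.com
          auth_password: ${SMTP_PASSWORD}
          send_resolved: true
```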

Custom routes

If you define a platform, platformCritical, or airflow receiver, you don't need a customRoutes entry to reach it. Alerts are routed to these receivers automatically based on their tier label (and, for platformCritical, their severity). Use customRoutes only for non-default routing (for example, high-severity Deployment alerts):

alertmanager:
  customRoutes:
    - receiver: deployment-high-receiver
      match_re:
        tier: airflow
        severity: high
    - receiver: deployment-warning-receiver
      match_re:
        tier: airflow
        severity: warning

Custom receivers

Use alertmanager.customReceiver to define receivers for notification services not covered by the built-in receiver keys. Custom receivers work alongside customRoutes to route alerts to those services:

alertmanager:
  customReceiver:
    - name: sns-receiver
      sns_configs:
        - api_url: <SNS_ENDPOINT>
          topic_arn: <SNS_TOPIC_ARN>
          subject: '[Alert: {{ .GroupLabels.alertname }}]'
          sigv4:
            region: <AWS_REGION>
            role_arn: <SNS_ROLE_ARN>
  customRoutes:
    - receiver: sns-receiver
      match_re:
        tier: platform
        severity: critical
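
The Custom routes example refers to a receiver named deployment-high-receiver without defining it. The sketch below shows one way that receiver might be declared, assuming customReceiver accepts any standard Alertmanager receiver block (here slack_configs). The webhook URL and channel name are placeholders:

```yaml
# values.yaml (sketch): define the receiver that a custom route points at
alertmanager:
  customReceiver:
    # The name must match the `receiver` value used in customRoutes.
    - name: deployment-high-receiver
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz  # placeholder webhook
          channel: '#deployment-alerts'                          # hypothetical channel
          send_resolved: true
  customRoutes:
    - receiver: deployment-high-receiver
      match_re:
        tier: airflow
        severity: high
```

Keeping the receiver and its route in the same values.yaml makes it easier to spot a route that points at a name no receiver defines.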

Apply configuration

Push receiver configuration to your installation:

helm upgrade astronomer astronomer/astronomer \
  -f values.yaml \
  --namespace astronomer

Create custom alerts

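Add custom alerts through the Prometheus Helm chart by setting prometheus.additionalAlerts in your values.yaml. Use the platform key for platform-level alerts and the airflow key for Deployment-level alerts: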

Platform alert example

Alert when multiple schedulers are unhealthy:

prometheus:
  additionalAlerts:
    platform: |
      - alert: MultipleSchedulersUnhealthy
        expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 2
        for: 5m
        labels:
          tier: platform
          severity: critical
        annotations:
          summary: "{{ $value }} schedulers are not heartbeating"
          description: "More than 2 Airflow schedulers are unhealthy for over 5 minutes."

Deployment alert example

Alert on high task failure rate:

prometheus:
  additionalAlerts:
    airflow: |
      - alert: HighTaskFailureRate
        expr: |
          (
            sum(increase(airflow_ti_failures{}[1h])) by (deployment)
            /
            sum(increase(airflow_ti_successes{}[1h]) + increase(airflow_ti_failures{}[1h])) by (deployment)
          ) > 0.1
        for: 15m
        labels:
          tier: airflow
          severity: warning
        annotations:
          summary: "High task failure rate in {{ $labels.deployment }}"
          description: "Task failure rate exceeds 10% for the past 15 minutes."

Built-in deployment alerts

For a complete list of built-in alerts, see the Prometheus alerts configmap.

| Alert | Description | Action |
| --- | --- | --- |
| AirflowDeploymentUnhealthy | Deployment is unhealthy or unavailable for 15+ minutes | Check pod status, review logs |
| AirflowPodQuota | Using more than 95% pod quota for 10+ minutes | Increase Extra Capacity or optimize Dags |
| AirflowSchedulerUnhealthy | Scheduler not heartbeating for 6+ minutes | Check scheduler logs, restart if needed |
| AirflowTasksPendingIncreasing | Tasks pending faster than clearing for 30+ minutes | Increase concurrency or worker resources |

Built-in platform alerts

| Alert | Description | Action |
| --- | --- | --- |
| CriticalComponentPodCrashLooping | A core platform component pod (Houston, Commander, Grafana, Prometheus, Registry) is repeatedly restarting for 15+ minutes | Check pod logs in the APC namespace, investigate the crash cause |
| CriticalComponentPodNotReady | A pod in the APC platform namespace has been in a non-ready state for 15+ minutes | Check pod events and logs in the APC namespace |
| TargetDown | More than 10% of Prometheus scrape targets for a job are unreachable for 10+ minutes | Check the failing service’s pods and endpoints |
| ElasticSeachUnassignedShards | Elasticsearch cluster has unassigned shards for 10+ minutes | Check Elasticsearch cluster health and logs |
| ElasticDiskHighWatermarkReached | Elasticsearch node disk usage exceeds 90% for 5+ minutes | Increase Elasticsearch storage or clean up old indices |
| ElasticDiskFloodWatermarkReached | Elasticsearch node disk usage exceeds 95% for 5+ minutes; Elasticsearch enforces a read-only index block at this threshold | Immediately increase storage or delete old indices |
| IngessCertificateExpiration | A TLS certificate for a platform hostname expires in less than one week | Renew the TLS certificate |

The ElasticSeachUnassignedShards and IngessCertificateExpiration alert names contain typos in their current implementation. Use the exact names shown when creating silences or custom routes.
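
As a sketch of why the exact spelling matters, the custom route below matches both misspelled alert names through the standard alertname label. The receiver name is hypothetical and would need to be defined separately. Correcting the spelling in the pattern would cause the route to match nothing:

```yaml
# values.yaml (sketch): route two built-in platform alerts by their exact names
alertmanager:
  customRoutes:
    - receiver: infra-oncall-receiver   # hypothetical receiver, defined elsewhere
      match_re:
        # match_re values are regular expressions; the names keep their
        # original spelling ("ElasticSeach", "Ingess") on purpose.
        alertname: ElasticSeachUnassignedShards|IngessCertificateExpiration
```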

Viewing active alerts

Alertmanager UI

Access Alertmanager to view active alerts:

https://alertmanager.<base-domain>

Prometheus UI

Query alerts in Prometheus:

https://prometheus.<base-domain>/alerts

CLI

# View firing alerts
kubectl exec -n astronomer prometheus-0 -- \
  wget -qO- localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

Silencing alerts

Temporarily silence alerts during maintenance:

Via Alertmanager UI

  1. Go to https://alertmanager.<base-domain>
  2. Click Silences > New Silence
  3. Add matchers (for example, alertname=AirflowSchedulerUnhealthy)
  4. Set duration and comment
  5. Click Create

Via API

curl -X POST https://alertmanager.<base-domain>/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [{"name": "alertname", "value": "AirflowSchedulerUnhealthy", "isRegex": false}],
    "startsAt": "2026-02-05T00:00:00Z",
    "endsAt": "2026-02-05T06:00:00Z",
    "createdBy": "admin",
    "comment": "Maintenance window"
  }'

Best practices

  1. Start with built-in alerts before creating custom ones.
  2. Set appropriate thresholds to avoid alert fatigue.
  3. Use severity levels, and reserve critical for alerts that should page someone.
  4. Include runbook links in alert descriptions.
  5. Test alerts in non-production environments first.
  6. Document escalation paths for each severity level.
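
To illustrate the runbook recommendation, the sketch below reuses the platform alert example from earlier on this page and appends a runbook link to its description. The URL is a placeholder, not a real Astronomer resource:

```yaml
# values.yaml (sketch): custom platform alert whose description links to a runbook
prometheus:
  additionalAlerts:
    platform: |
      - alert: MultipleSchedulersUnhealthy
        expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 2
        for: 5m
        labels:
          tier: platform
          severity: critical   # reserved for conditions that should page someone
        annotations:
          summary: "{{ $value }} schedulers are not heartbeating"
          # Hypothetical runbook URL; replace it with your own documentation link.
          description: "More than 2 Airflow schedulers are unhealthy for over 5 minutes. Runbook: https://runbooks.example.com/apc/schedulers-unhealthy"
```

Because the receiver templates shown earlier render annotations.description, the link should appear in the notification itself.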

Related documentation

  • Apply platform configuration
  • Prometheus Alertmanager documentation