Platform and deployment alerts | Astronomer Documentation

APC includes two built-in alerting systems for monitoring health:

Deployment-level alerts: Notify you when an Airflow Deployment is unhealthy or components are underperforming.
Platform-level alerts: Notify you when APC platform components are unhealthy (Elasticsearch, Houston API, Registry, Commander).

Alerts fire based on metrics collected by Prometheus. When alert conditions are met, Prometheus Alertmanager sends notifications to your configured channels.

Alertmanager is enabled by default as part of the APC monitoring stack (tags.monitoring: true). To disable it individually, set global.alertmanagerEnabled: false in your values.yaml. See Apply platform configuration for details.

Alert architecture

Anatomy of an alert

Alerts are defined in YAML using PromQL queries:

- alert: ManyUnhealthySchedulers
  expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 5
  for: 5m
  labels:
    tier: platform
    severity: critical
  annotations:
    summary: "{{ $value }} airflow schedulers are not heartbeating"
    description: "More than 5 Airflow schedulers have not emitted a heartbeat for over 5 minutes."

Field	Description
`expr`	PromQL expression that determines when to fire
`for`	Duration the condition must be true (for example, `5m`, `1h`)
`labels.tier`	Alert level: `airflow` (Deployment) or `platform`
`labels.severity`	Severity: `info`, `warning`, `high`, `critical`
`annotations.summary`	Alert message text
`annotations.description`	Human-readable description
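
To illustrate how `for` interacts with `expr`, the following Python sketch simulates rule evaluation: an alert stays pending while the condition holds and fires only once it has held continuously for the full duration. The function name `alert_states` and the one-minute evaluation interval are assumptions made for illustration, not APC or Prometheus settings.

```python
from datetime import timedelta


def alert_states(samples, hold_for, interval):
    """Return the alert state (inactive, pending, firing) at each evaluation."""
    states = []
    held = None  # how long the condition has been continuously true
    for condition_true in samples:
        if not condition_true:
            held = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        held = timedelta(0) if held is None else held + interval
        states.append("firing" if held >= hold_for else "pending")
    return states


# Hypothetical evaluations, one per minute, against a rule with `for: 5m`.
samples = [True] * 3 + [False] + [True] * 7
states = alert_states(
    samples,
    hold_for=timedelta(minutes=5),
    interval=timedelta(minutes=1),
)
print(states)
```

In this run the first three true evaluations never fire because the fourth evaluation resets the timer; only the last two evaluations reach the five-minute mark.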

Configure alert receivers

Alertmanager uses receivers to integrate with notification platforms. Define receivers in your values.yaml:

Email alerts

alertmanager:
  receivers:
    platform:
      email_configs:
        - smarthost: smtp.example.com:587
          from: alerts@example.com
          to: ops-team@example.com
          auth_username: alerts@example.com
          auth_password: ${SMTP_PASSWORD}
          send_resolved: true

Slack alerts

alertmanager:
  receivers:
    platformCritical:
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz
          channel: '#platform-alerts'
          title: '{{ .CommonAnnotations.summary }}'
          text: |-
            {{ range .Alerts }}{{ .Annotations.description }}
            {{ end }}

PagerDuty alerts

alertmanager:
  receivers:
    platformCritical:
      pagerduty_configs:
        - service_key: ${PAGERDUTY_SERVICE_KEY}
          severity: '{{ .CommonLabels.severity }}'
          description: '{{ .CommonAnnotations.summary }}'

OpsGenie alerts

alertmanager:
  receivers:
    platformCritical:
      opsgenie_configs:
        - api_key: ${OPSGENIE_API_KEY}
          message: '{{ .CommonAnnotations.summary }}'
          priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'

Default receiver groups

APC includes default receiver groups based on tier and severity:

Receiver	Tier	Severity
`platform`	platform	all
`platformCritical`	platform	critical
`airflow`	airflow	all

Custom routes

If you define a platform, platformCritical, or airflow receiver, you don’t need a customRoutes entry to route to it. Alerts are routed to these receivers automatically based on the tier label (and, for platformCritical, the severity label). Use customRoutes only for non-default routing (for example, high-severity Deployment alerts):

alertmanager:
  customRoutes:
    - receiver: deployment-high-receiver
      match_re:
        tier: airflow
        severity: high
    - receiver: deployment-warning-receiver
      match_re:
        tier: airflow
        severity: warning
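
Alertmanager treats each `match_re` value as a regular expression anchored to the whole label value, and a route matches only when every listed label matches. The sketch below approximates that behavior with Python's `re.fullmatch`; `route_matches` is a hypothetical helper, and Python's regex syntax is assumed to be close enough to Alertmanager's RE2 syntax for simple patterns like these.

```python
import re


def route_matches(match_re, labels):
    """True when every pattern fully matches the corresponding label value.

    A missing label is compared as an empty string.
    """
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in match_re.items()
    )


route = {"tier": "airflow", "severity": "high"}
print(route_matches(route, {"tier": "airflow", "severity": "high"}))
print(route_matches(route, {"tier": "airflow", "severity": "warning"}))

# One route can cover several severities with an alternation.
combined = {"tier": "airflow", "severity": "high|critical"}
print(route_matches(combined, {"tier": "airflow", "severity": "critical"}))
```

Because the match is anchored, a pattern of `high` does not match a label value such as `higher`.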

Custom receivers

Use alertmanager.customReceiver to define receivers for notification services not covered by the built-in receiver keys. Custom receivers work alongside customRoutes to route alerts to those services:

alertmanager:
  customReceiver:
    - name: sns-receiver
      sns_configs:
        - api_url: <SNS_ENDPOINT>
          topic_arn: <SNS_TOPIC_ARN>
          subject: '[Alert: {{ .GroupLabels.alertname }}]'
          sigv4:
            region: <AWS_REGION>
            role_arn: <SNS_ROLE_ARN>
  customRoutes:
    - receiver: sns-receiver
      match_re:
        tier: platform
        severity: critical

Apply configuration

Push receiver configuration to your installation:

$ helm upgrade astronomer astronomer/astronomer \
>   -f values.yaml \
>   --namespace astronomer

Create custom alerts

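Add custom alerts under prometheus.additionalAlerts in your values.yaml. Use the platform key for platform-level rules and the airflow key for Deployment-level rules: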

Platform alert example

Alert when multiple schedulers are unhealthy:

prometheus:
  additionalAlerts:
    platform: |
      - alert: MultipleSchedulersUnhealthy
        expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 2
        for: 5m
        labels:
          tier: platform
          severity: critical
        annotations:
          summary: "{{ $value }} schedulers are not heartbeating"
          description: "More than 2 Airflow schedulers are unhealthy for over 5 minutes."

Deployment alert example

Alert on high task failure rate:

prometheus:
  additionalAlerts:
    airflow: |
      - alert: HighTaskFailureRate
        expr: |
          (
            sum(increase(airflow_ti_failures{}[1h])) by (deployment)
            /
            sum(increase(airflow_ti_successes{}[1h]) + increase(airflow_ti_failures{}[1h])) by (deployment)
          ) > 0.1
        for: 15m
        labels:
          tier: airflow
          severity: warning
        annotations:
          summary: "High task failure rate in {{ $labels.deployment }}"
          description: "Task failure rate over the trailing hour has exceeded 10% for 15 minutes."
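
As a worked example of the ratio in that expression, the Python sketch below computes failures divided by total task instances and compares the result with the 0.1 threshold. The counts are made-up numbers, and `failure_ratio` is a hypothetical helper that mirrors the arithmetic only, not PromQL's label matching.

```python
THRESHOLD = 0.1  # same 10% threshold as the alert expression


def failure_ratio(failures, successes):
    """Failures as a fraction of all finished task instances in the window."""
    total = failures + successes
    if total == 0:
        # PromQL would yield NaN for 0/0, which never satisfies `> 0.1`.
        return None
    return failures / total


# Hypothetical one-hour counts: (failures, successes)
for failures, successes in [(12, 88), (5, 95), (0, 0)]:
    ratio = failure_ratio(failures, successes)
    breaches = ratio is not None and ratio > THRESHOLD
    print(failures, successes, ratio, breaches)
```

With 12 failures out of 100 task instances the ratio is above the threshold; the alert would still fire only if the condition held for the full `for: 15m` period.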

Built-in deployment alerts

For a complete list of built-in alerts, see the Prometheus alerts configmap.

Alert	Description	Action
`AirflowDeploymentUnhealthy`	Deployment is unhealthy or unavailable for 15+ minutes	Check pod status, review logs
`AirflowPodQuota`	Deployment has used more than 95% of its pod quota for 10+ minutes	Increase Extra Capacity or optimize Dags
`AirflowSchedulerUnhealthy`	Scheduler not heartbeating for 6+ minutes	Check scheduler logs, restart if needed
`AirflowTasksPendingIncreasing`	Pending tasks have accumulated faster than they clear for 30+ minutes	Increase concurrency or worker resources

Built-in platform alerts

Alert	Description	Action
`CriticalComponentPodCrashLooping`	A core platform component pod (Houston, Commander, Grafana, Prometheus, Registry) is repeatedly restarting for 15+ minutes	Check pod logs in the APC namespace, investigate the crash cause
`CriticalComponentPodNotReady`	A pod in the APC platform namespace has been in a non-ready state for 15+ minutes	Check pod events and logs in the APC namespace
`TargetDown`	More than 10% of Prometheus scrape targets for a job are unreachable for 10+ minutes	Check the failing service’s pods and endpoints
`ElasticSeachUnassignedShards`	Elasticsearch cluster has unassigned shards for 10+ minutes	Check Elasticsearch cluster health and logs
`ElasticDiskHighWatermarkReached`	Elasticsearch node disk usage exceeds 90% for 5+ minutes	Increase Elasticsearch storage or clean up old indices
`ElasticDiskFloodWatermarkReached`	Elasticsearch node disk usage exceeds 95% for 5+ minutes; Elasticsearch enforces a read-only index block at this threshold	Immediately increase storage or delete old indices
`IngessCertificateExpiration`	A TLS certificate for a platform hostname expires in less than one week	Renew the TLS certificate

The ElasticSeachUnassignedShards and IngessCertificateExpiration alert names contain typos in their current implementation. Use the exact names shown when creating silences or custom routes.

Viewing active alerts

Alertmanager UI

Access Alertmanager to view active alerts:

https://alertmanager.<base-domain>

Prometheus UI

Query alerts in Prometheus:

https://prometheus.<base-domain>/alerts

CLI

$ # View firing alerts
$ kubectl exec -n astronomer prometheus-0 -- \
>   wget -qO- localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
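
If `jq` is not available, the same filter can be applied in Python. This sketch assumes the response follows the Prometheus `/api/v1/alerts` shape (`data.alerts[]` with `labels` and `state`); the sample payload is hand-written for illustration, and `firing_alerts` is a hypothetical helper.

```python
import json


def firing_alerts(response_text):
    """Return (alertname, tier) pairs for alerts in the firing state."""
    payload = json.loads(response_text)
    return [
        (alert["labels"].get("alertname"), alert["labels"].get("tier"))
        for alert in payload["data"]["alerts"]
        if alert["state"] == "firing"
    ]


# Hand-written sample in the shape of the /api/v1/alerts response.
sample = json.dumps({
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "AirflowSchedulerUnhealthy", "tier": "airflow"},
         "state": "firing"},
        {"labels": {"alertname": "AirflowPodQuota", "tier": "airflow"},
         "state": "pending"},
    ]},
})
print(firing_alerts(sample))
```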

Silencing alerts

Temporarily silence alerts during maintenance:

Via Alertmanager UI

Go to https://alertmanager.<base-domain>
Click Silences > New Silence
Add matchers (for example, alertname=AirflowSchedulerUnhealthy)
Set duration and comment
Click Create

Via API

$ curl -X POST https://alertmanager.<base-domain>/api/v2/silences \
>   -H "Content-Type: application/json" \
>   -d '{
>     "matchers": [{"name": "alertname", "value": "AirflowSchedulerUnhealthy", "isRegex": false}],
>     "startsAt": "2026-02-05T00:00:00Z",
>     "endsAt": "2026-02-05T06:00:00Z",
>     "createdBy": "admin",
>     "comment": "Maintenance window"
>   }'
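
Hard-coded timestamps are easy to get wrong, so it can help to generate the request body. The Python sketch below builds the same JSON payload from a start time and a duration in hours; `build_silence` is a hypothetical helper, and the field names are taken from the request above.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, starts_at, hours, created_by, comment):
    """Build a silence payload with UTC timestamps in RFC 3339 format."""
    ends_at = starts_at + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False}
        ],
        "startsAt": starts_at.astimezone(timezone.utc).strftime(fmt),
        "endsAt": ends_at.astimezone(timezone.utc).strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }


start = datetime(2026, 2, 5, 0, 0, tzinfo=timezone.utc)
silence = build_silence(
    "AirflowSchedulerUnhealthy", start, 6, "admin", "Maintenance window"
)
print(json.dumps(silence, indent=2))
```

The printed JSON can be passed as the `-d` body of the `curl` request.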

Best practices

Start with built-in alerts before creating custom ones
Set appropriate thresholds - avoid alert fatigue
Use severity levels - reserve critical for pages
Include runbook links in alert descriptions
Test alerts in non-production environments first
Document escalation paths for each severity level
