APC includes two built-in alerting systems for monitoring health:
Alerts fire based on metrics collected by Prometheus. When alert conditions are met, Prometheus Alertmanager sends notifications to your configured channels.
Alertmanager is enabled by default as part of the APC monitoring stack (tags.monitoring: true). To disable it individually, set global.alertmanagerEnabled: false in your values.yaml. See Apply platform configuration for details.
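As a small sketch of the override described above, the fragment below keeps the monitoring stack enabled and turns off only Alertmanager. Both keys are the ones named in this section; their nesting is assumed from the dotted paths.

```yaml
# values.yaml (fragment)
tags:
  monitoring: true            # monitoring stack stays enabled (the default)
global:
  alertmanagerEnabled: false  # disables only Alertmanager
```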
Alerts are defined in YAML using PromQL queries:
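For illustration, a rule in the standard Prometheus alerting-rule format might look like the following. The alert name, metric, and threshold are hypothetical and are not built-in APC values; the tier and severity labels are the ones this section uses for routing.

```yaml
# Illustrative alerting rule; metric name and threshold are hypothetical.
- alert: ExampleSchedulerHeartbeatMissing
  # PromQL condition: no heartbeat increments over the last 5 minutes
  expr: rate(example_scheduler_heartbeat_total[5m]) == 0
  for: 5m                 # condition must hold this long before the alert fires
  labels:
    tier: airflow         # tier label, used for default routing
    severity: critical
  annotations:
    summary: "Scheduler heartbeat has stopped"
    description: "No scheduler heartbeat has been recorded for 5 minutes."
```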
Alertmanager uses receivers to integrate with notification platforms. Define receivers in your values.yaml:
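A minimal sketch of a receiver definition follows. It assumes the built-in receiver keys sit under alertmanager.receivers, which this section does not confirm, so check the key path against your chart version. The slack_configs and pagerduty_configs fields are standard Alertmanager receiver settings; the webhook URL, channel, and routing key are placeholders.

```yaml
# values.yaml (fragment); the alertmanager.receivers path is an assumption
alertmanager:
  receivers:
    platform:                      # receiver key named in this section
      slack_configs:               # standard Alertmanager Slack settings
        - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK  # placeholder
          channel: "#platform-alerts"   # hypothetical channel
          send_resolved: true      # also notify when the alert resolves
    platformCritical:              # receiver key named in this section
      pagerduty_configs:           # standard Alertmanager PagerDuty settings
        - routing_key: REPLACE_WITH_PAGERDUTY_KEY   # placeholder
```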
APC includes default receiver groups based on tier and severity:
If you define a platform, platformCritical, or airflow receiver, you don't need a customRoute to route to it, because alerts are routed automatically based on the tier label. Use customRoutes only for non-default routing (for example, high-severity Deployment alerts):
Use alertmanager.customReceiver to define receivers for notification services not covered by the built-in receiver keys. Custom receivers work alongside customRoutes to route alerts to those services:
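The sketch below pairs a custom receiver with a custom route. It assumes that customReceiver takes a list of standard Alertmanager receiver definitions and that customRoutes takes a list of standard Alertmanager route entries; neither shape is confirmed by this section. The receiver name, webhook URL, and severity value are hypothetical.

```yaml
# values.yaml (fragment); list shapes under both keys are assumptions
alertmanager:
  customReceiver:
    - name: deployment-high-severity            # hypothetical receiver name
      webhook_configs:                          # standard Alertmanager webhook settings
        - url: https://example.com/alert-hook   # placeholder endpoint
          send_resolved: true
  customRoutes:
    - receiver: deployment-high-severity        # must match the receiver name above
      matchers:                                 # standard Alertmanager matcher syntax
        - tier = "airflow"
        - severity = "high"                     # hypothetical severity value
      continue: true                            # keep evaluating later routes as well
```

The matchers field requires a reasonably recent Alertmanager release; older releases use match and match_re instead.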
Push receiver configuration to your installation:
Add custom alerts using the Prometheus Helm chart:
Alert when multiple schedulers are unhealthy:
Alert on high task failure rate:
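One way a task-failure-rate alert could be written is sketched below. It assumes custom rules are supplied under prometheus.additionalAlerts as a block of rule YAML per alert group, and that the airflow_ti_failures and airflow_ti_successes metrics are available in your Prometheus; verify both against your installation. The alert name and the 10% threshold are illustrative.

```yaml
# values.yaml (fragment); key path and metric names are assumptions
prometheus:
  additionalAlerts:
    # Extra rules appended to the 'airflow' alert group, passed as a string
    airflow: |
      - alert: ExampleHighTaskFailureRate
        # Percentage of task instances that failed over the last 30 minutes
        expr: 100 * sum(increase(airflow_ti_failures[30m])) / (sum(increase(airflow_ti_failures[30m])) + sum(increase(airflow_ti_successes[30m]))) > 10
        for: 15m
        labels:
          tier: airflow
          severity: warning
        annotations:
          summary: "More than 10% of tasks failed in the last 30 minutes"
```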
For a complete list of built-in alerts, see the Prometheus alerts configmap.
The ElasticSeachUnassignedShards and IngessCertificateExpiration alert names contain typos in their current implementation. Use the exact names shown when creating silences or custom routes.
Access Alertmanager to view active alerts:
Query alerts in Prometheus:
Temporarily silence alerts during maintenance:
Alertmanager URL: https://alertmanager.<base-domain>
Example silence matcher: alertname=AirflowSchedulerUnhealthy
Severity guidance: use critical for alerts that should page.