Prometheus alert rules¶
Prometheus alert rules are conditions, written as PromQL expressions, that trigger an alert when the evaluated result matches the rule's criteria.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md] The Prometheus server evaluates these rules periodically; if a condition remains true for the specified duration, the alert fires and is sent to the [[alertmanager]] for routing and notification.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
Rule Structure¶
An alert rule typically consists of the following components:
- alert: The unique name of the alert (e.g., hostCpuUsageAlert).^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- expr: A PromQL expression that represents the condition to monitor.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- for: The duration the condition must be true before triggering the alert (e.g., 5m).^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- labels: Key-value pairs used to categorize or add metadata to the alert (e.g., severity: warning).^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- annotations: Longer explanatory strings, such as a summary or description, which can use template variables like {{ $labels.instance }} or {{ $value }}.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
Rule Files¶
Alert rules are stored in rule files, typically formatted in YAML.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
Prometheus is configured to load these files via the rule_files section of the main configuration file (e.g., prometheus.yml).^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
# Example snippet from prometheus.yml
rule_files:
  - "/data/etc/rules.yml"
The rules file is organized into groups, where each group contains a list of rules sharing the same interval.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
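# Example structure of a rules.yml
groups:
  - name: hostStatsAlert
    rules:
      - alert: hostCpuUsageAlert
        # node_cpu is the metric name used by node-exporter before 0.16;
        # newer releases expose node_cpu_seconds_total instead.
        expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          # $value is a ratio between 0 and 1, not a percentage.
          summary: "{{ $labels.instance }} CPU usage above 85% (current ratio: {{ $value }})"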
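Because every rule in a group shares one evaluation interval, rules that need different cadences are usually split into separate groups. The fragment below is a minimal sketch of that idea, not taken from the source: the group names, alert names, thresholds, and interval values are hypothetical, and it relies only on the built-in `up` metric. When `interval` is omitted, a group falls back to the global `evaluation_interval` from prometheus.yml.

```yaml
groups:
  # Hypothetical group evaluated every 15 seconds.
  - name: fastChecks
    interval: 15s
    rules:
      - alert: targetDown
        expr: up == 0          # `up` is 0 when a scrape of the target fails
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} of job {{ $labels.job }} is down"

  # Hypothetical group evaluated every 5 minutes.
  - name: slowChecks
    interval: 5m
    rules:
      - alert: manyTargetsDown
        expr: count(up == 0) > 5   # illustrative threshold
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} scrape targets are currently down"
```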
Common Examples¶
Host Resource Alerts¶
These rules monitor the physical or virtual machine resources, often using metrics exported by node-exporter.
- High CPU Usage: Triggers when the average non-idle CPU usage exceeds 85% for 5 minutes.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
- Memory Usage: Triggers when memory usage (Total minus Available) exceeds 85% of total memory.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- Disk Space: Triggers when available disk space for a specific filesystem (e.g., overlay) drops below 10%.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
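The memory and disk rules are described above without their expressions. The following is one possible sketch of them, not the source's own rules. It assumes the metric names exposed by node-exporter 0.16 and later (with the `_bytes` suffix); older exporters, such as the one providing `node_cpu`, use the same names without the suffix. The alert names and `for` durations are hypothetical.

```yaml
groups:
  - name: hostStatsAlert
    rules:
      # Memory in use (total minus available) as a fraction of total memory.
      - alert: hostMemUsageAlert          # hypothetical name
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
        for: 5m                           # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} memory usage above 85% (current ratio: {{ $value }})"

      # Available space as a fraction of filesystem size, overlay filesystems only.
      - alert: hostDiskSpaceAlert         # hypothetical name
        expr: node_filesystem_avail_bytes{fstype="overlay"} / node_filesystem_size_bytes{fstype="overlay"} < 0.10
        for: 5m                           # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} has less than 10% space available (current ratio: {{ $value }})"
```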
HTTP Blackbox Probes¶
These rules use metrics from blackbox-exporter to monitor endpoint availability and performance.
- Probe Failure: Triggers if probe_success is 0 for 1 minute.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- HTTP Status Code: Triggers if the status code is not in the 200-399 range for 1 minute.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- SSL Expiry: Triggers if an SSL certificate will expire within 30 days.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
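The three probe conditions could be expressed roughly as follows. This is a sketch built on the standard blackbox-exporter metrics `probe_success`, `probe_http_status_code`, and `probe_ssl_earliest_cert_expiry`; the group name, alert names, and severity labels are hypothetical and not taken from the source.

```yaml
groups:
  - name: blackboxProbeAlert              # hypothetical name
    rules:
      - alert: probeFailed
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical              # assumed severity
        annotations:
          summary: "Probe of {{ $labels.instance }} has been failing for 1 minute"

      # Anything outside 200-399 counts as unexpected.
      - alert: httpStatusCodeUnexpected
        expr: probe_http_status_code < 200 or probe_http_status_code >= 400
        for: 1m
        labels:
          severity: critical              # assumed severity
        annotations:
          summary: "{{ $labels.instance }} returned HTTP status {{ $value }}"

      # The metric holds the expiry time as a Unix timestamp, so subtracting
      # time() gives the remaining lifetime in seconds.
      - alert: sslCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        labels:
          severity: warning               # assumed severity
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires in under 30 days"
```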
Pod Resource Alerts¶
These rules monitor resource consumption of Kubernetes pods.
- Pod CPU: Triggers if container CPU usage exceeds 80% of the defined limit for 5 minutes.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- Pod Memory: Triggers if container memory usage exceeds 80% of the defined limit.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
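One way to write the two pod rules is sketched below using cAdvisor metrics scraped from the kubelet. Several things here are assumptions rather than the source's configuration: the label names `namespace`, `pod`, and `container` (older Kubernetes releases use `pod_name` and `container_name`), the alert names, and the `for` duration of the memory rule. The CPU limit is derived as quota divided by period, which yields the limit in cores.

```yaml
groups:
  - name: podResourceAlert                # hypothetical name
    rules:
      # CPU cores used over the last 5 minutes, divided by the CPU limit in cores.
      - alert: podCpuUsageAlert
        expr: |
          sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            /
          sum by (namespace, pod, container) (container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""})
            > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) CPU above 80% of limit (current ratio: {{ $value }})"

      # Memory in use divided by the memory limit; a limit of 0 means "no limit"
      # and is filtered out to avoid dividing by zero.
      - alert: podMemoryUsageAlert
        expr: |
          sum by (namespace, pod, container) (container_memory_usage_bytes{container!=""})
            /
          sum by (namespace, pod, container) (container_spec_memory_limit_bytes{container!=""} > 0)
            > 0.8
        for: 5m                           # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) memory above 80% of limit (current ratio: {{ $value }})"
```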
Configuration Workflow¶
- Edit/Create Rule File: Add the rules to a YAML file (e.g., /data/nfs-volume/prometheus/etc/rules.yml).^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- Update Prometheus Config: Reference the rule file in prometheus.yml under the rule_files key.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
- Reload Configuration: Instead of restarting the entire Prometheus pod (which can be resource-intensive in production), a smooth reload can be triggered by sending a SIGHUP signal to the Prometheus process.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  kill -SIGHUP <prometheus_pid>
- Verification: Check the Prometheus UI to ensure the new rules are active and no syntax errors occurred.