Prometheus alerting rules and PromQL¶
Prometheus alerting rules are configurations that allow Prometheus to evaluate defined conditions against time-series data and trigger alerts when specific thresholds are met^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]. These rules are defined using PromQL (Prometheus Query Language), a flexible query language designed for the multi-dimensional data model used by Prometheus^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md].
PromQL (Prometheus Query Language)¶
PromQL is the functional language used to query and manipulate metrics within Prometheus^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]. It supports label-based filtering, aggregation, and mathematical functions tailored for time-series data.
Key Functions and Operators¶
- `irate`: Calculates the per-second instant rate of increase for a counter vector, using only the last two samples in the selected range^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1399]. Because it reacts to every short spike, it is best suited to graphing fast-moving counters; for alerting, `rate` is usually the steadier choice, since a brief dip in an `irate` expression can reset the `for` timer.
    - Example: `sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance)`^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1399]
- `rate`: Calculates the per-second average rate of increase over a time window^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1449].
- Aggregation Operators: Operators like `sum`, `avg`, and `max` are used to aggregate data across dimensions^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1399].
- Label Modifiers: Clauses like `without` or `by` allow users to exclude or group by specific labels (e.g., `by (instance)`) during aggregation^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1399].
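As a rough sketch of how these pieces combine, the fragment below writes the CPU expression with `rate` rather than `irate`, so the result is averaged over the whole window. The group name and alert name are illustrative assumptions, not values from the source; `node_cpu` is the metric name the source uses (newer node_exporter releases expose it as `node_cpu_seconds_total`).

```yaml
groups:
  - name: example-cpu-rules            # hypothetical group name
    rules:
      - alert: HostCpuUsageHigh         # hypothetical alert name
        # rate() averages the per-second increase over the full 5m window.
        # avg without (cpu) collapses the per-core series for each mode;
        # sum by (instance) then adds the non-idle modes for each host.
        expr: sum by (instance) (avg without (cpu) (rate(node_cpu{mode!='idle'}[5m]))) > 0.85
        for: 5m
```

The result is a ratio between 0 and 1 per instance, which is why the threshold is written as `0.85` rather than `85`.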
Alerting Rules Structure¶
Alerting rules are typically stored in a file referenced in the Prometheus configuration (e.g., rules.yml)^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1484]. They are grouped into logical blocks for easier management.
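A minimal sketch of such a file, assuming the Prometheus 2.x rule format, is shown below. The group name and the `InstanceDown` rule are placeholders chosen for illustration; only the `groups` / `name` / `rules` nesting is the point here.

```yaml
# rules.yml
groups:
  - name: example-availability         # hypothetical group name
    rules:
      - alert: InstanceDown             # hypothetical alert name
        expr: up == 0                   # `up` is set by Prometheus for every scrape target
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```

Each group is evaluated as a unit, so related rules are usually kept in the same group.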
Rule Syntax¶
A single alert rule consists of the following components^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1380]:
* alert: The name of the alert (e.g., hostCpuUsageAlert).
* expr: The PromQL expression to evaluate.
* for: The duration the condition must be true before triggering the alert (e.g., 5m). This helps to prevent alert flapping for transient issues^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1381].
* labels: Key-value pairs to attach to the alert (e.g., severity: warning).
* annotations: Informational text, often used to store the alert message or summary, which can use templating to reference label values (e.g., {{ $labels.instance }}).
Example Rule: CPU Usage¶
The following rule triggers a warning if the CPU usage of an instance exceeds 85% for 5 minutes^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1380-L1385]:
- alert: hostCpuUsageAlert
  expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
Rule Management and Evaluation¶
- Evaluation Interval: Defined globally in `prometheus.yml` (e.g., `evaluation_interval: 15s`), this determines how often Prometheus checks the `expr` in alert rules^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1284].
- Reloading Rules: After modifying rule files, Prometheus can reload its configuration without restarting by sending a `SIGHUP` signal to the process^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1476-L1478].
- Configuration: The rule file path is specified in the main configuration file under the `rule_files` section^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1484].
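The three settings above might fit together as in the following fragment. This is a sketch rather than the source's actual configuration: the rule file path is an assumption, and other required sections such as `scrape_configs` are omitted.

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 15s    # how often rule expressions are evaluated

rule_files:
  - "rules.yml"               # assumed path to the alerting rule file

# After editing either file, a reload can be requested without a restart
# by sending SIGHUP to the Prometheus process, for example: kill -HUP <pid>
```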
Common Use Cases¶
Alerting rules are categorized based on the type of resource or metric being monitored^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1376-L1425].
Resource Saturation¶
- Memory: Alerts when memory usage exceeds a threshold (e.g., `(node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85`)^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1387-L1390].
- Disk: Alerts on disk space or inode exhaustion (e.g., `node_filesystem_free{...} / node_filesystem_size{...} * 100 < 10`)^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1392-L1400].
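Wrapped in a full rule, the memory expression above could look like the sketch below. The group name, alert name, `for` duration, and severity are assumptions added for illustration. The disk rule is not reproduced because its label selectors are elided in the source.

```yaml
groups:
  - name: example-saturation-rules     # hypothetical group name
    rules:
      - alert: HostMemoryUsageHigh      # hypothetical alert name
        # Fraction of total memory that is not available to new workloads.
        expr: (node_memory_MemTotal - node_memory_MemAvailable) / node_memory_MemTotal > 0.85
        for: 5m                         # assumed duration
        labels:
          severity: warning             # assumed severity
        annotations:
          summary: "{{ $labels.instance }} memory usage above 85%"
```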
Blackbox Monitoring¶
- Probes: Alerts based on the success or status of network probes. For example, `probe_success == 0` triggers if a probe fails^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1426-L1428].
- HTTP Status: Checks if the HTTP status code returned by a probe is outside the 200-399 range^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1431-L1434].
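Both checks can be expressed as rules over blackbox_exporter metrics, roughly as sketched here. The alert names, durations, and severities are assumptions, and the status-code expression is one plausible way to write the "outside 200-399" condition using `probe_http_status_code`; the source's exact expression may differ.

```yaml
groups:
  - name: example-blackbox-rules       # hypothetical group name
    rules:
      - alert: ProbeFailed              # hypothetical alert name
        expr: probe_success == 0
        for: 1m                         # assumed duration
        labels:
          severity: critical            # assumed severity
        annotations:
          summary: "Probe failed for {{ $labels.instance }}"
      - alert: ProbeHttpStatusUnexpected   # hypothetical alert name
        # Fires when the returned status code falls outside 200-399.
        expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
        for: 1m                         # assumed duration
        labels:
          severity: warning             # assumed severity
        annotations:
          summary: "{{ $labels.instance }} returned HTTP status {{ $value }}"
```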
Application Performance¶
- Pod Usage: Monitors CPU and memory usage at the container/pod level relative to defined limits^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1463-L1473].
Sources¶
^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]