Prometheus alerting rules and PromQL

Prometheus alerting rules define conditions that Prometheus evaluates against time-series data, firing an alert when a condition holds^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]. The condition in each rule is written in PromQL (Prometheus Query Language), a flexible query language designed for the multi-dimensional data model used by Prometheus^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md].

PromQL (Prometheus Query Language)

PromQL is the functional query language used to select and transform metrics in Prometheus^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]. It supports filtering by label, aggregation across dimensions, and mathematical functions tailored for time-series data.

Key Functions and Operators

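  • irate: Calculates the per-second instant rate of increase of a counter, using only the last two samples in the range. This makes it responsive to short spikes but also volatile. The Prometheus documentation recommends irate for graphing fast-moving counters and rate for alerting, because a brief dip in an instant rate can reset the for timer; the source material nevertheless uses irate in its alert expressions^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1399].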
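    • Example: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance)^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1399]. Note that node_cpu is the metric name exposed by node_exporter releases before 0.16; later releases expose node_cpu_seconds_total instead.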
  • rate: Calculates the per-second average rate of increase of a counter over the whole time window, which smooths out short spikes^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1449].
  • Aggregation Operators: Operators such as sum, avg, and max combine series across label dimensions^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1399].
  • Label Modifiers: The by and without clauses control which labels survive an aggregation. by keeps only the listed labels (e.g., by (instance)), while without drops the listed labels and keeps the rest^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1399].
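
As a hedged illustration of how these pieces combine, the sketch below rewrites the source's CPU expression with rate in place of irate and annotates each operator. The group name and alert name are hypothetical, and node_cpu is the metric name used in the source; substitute whatever name your node_exporter version exposes.

```yaml
groups:
  - name: promql-operators-example          # hypothetical group name
    rules:
      - alert: hostCpuUsageRateExample       # hypothetical alert name
        # rate(...[5m])          per-second average increase over the 5m window
        # avg without (cpu)      average across cores, dropping only the cpu label
        # sum ... by (instance)  add up the non-idle modes, keeping only instance
        expr: sum(avg without (cpu)(rate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
        for: 5m
```

Because rate averages over the full window, the result moves more slowly than the irate version, which tends to suit a rule that must stay true for the whole for duration.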

Alerting Rules Structure

Alerting rules are typically stored in a file referenced in the Prometheus configuration (e.g., rules.yml)^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1484]. They are grouped into logical blocks for easier management.
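
A minimal sketch of how such a file might be laid out is shown below, assuming the rules.yml file name from the example above. The group name, alert name, threshold, and severity are illustrative assumptions rather than values from the source; up is the metric Prometheus itself records for every scrape target.

```yaml
# rules.yml
groups:
  - name: example-alerts           # hypothetical group name
    # interval: 30s                # optional per-group override of the global evaluation interval
    rules:
      - alert: InstanceDown         # hypothetical alert name
        expr: up == 0               # 0 means the last scrape of the target failed
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```

Each entry under rules is one alert; related alerts usually share a group so they are evaluated together at the same interval.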

Rule Syntax

A single alert rule consists of the following components^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1380]:

  • alert: The name of the alert (e.g., hostCpuUsageAlert).
  • expr: The PromQL expression to evaluate.
  • for: The duration the condition must be true before triggering the alert (e.g., 5m). This helps to prevent alert flapping for transient issues^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1381].
  • labels: Key-value pairs to attach to the alert (e.g., severity: warning).
  • annotations: Informational text, often used to store the alert message or summary, which can use templating to reference label values (e.g., {{ $labels.instance }}).

Example Rule: CPU Usage

The following rule triggers a warning if the CPU usage of an instance exceeds 85% for 5 minutes^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1380-L1385]:

- alert: hostCpuUsageAlert
  expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} CPU usage above 85% (current ratio: {{ $value }})"

Rule Management and Evaluation

  • Evaluation Interval: Defined globally in prometheus.yml (e.g., evaluation_interval: 15s), this determines how often Prometheus checks the expr in alert rules^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1284].
  • Reloading Rules: After modifying rule files, Prometheus can reload its configuration without restarting by sending a SIGHUP signal to the process^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1476-L1478].
  • Configuration: The rule file path is specified in the main configuration file under the rule_files section^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1484].
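
The settings above could be combined in prometheus.yml roughly as follows. This is a fragment under stated assumptions: the 15s interval and the rules.yml name come from the examples in this article, and the relative path is assumed to sit next to the main configuration file.

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 15s    # how often rule expressions are evaluated

rule_files:
  - "rules.yml"               # assumed to sit beside this file; globs such as "rules/*.yml" are also accepted

# After editing rules.yml, send SIGHUP to the Prometheus process to reload it.
```

Running promtool check rules rules.yml before reloading is a cheap way to catch syntax errors in the rule file.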

Common Use Cases

Alerting rules are categorized based on the type of resource or metric being monitored^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1376-L1425].

Resource Saturation

  • Memory: Alerts when memory usage exceeds a threshold (e.g., (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85)^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1387-L1390].
  • Disk: Alerts on disk space or inode exhaustion (e.g., node_filesystem_free{...} / node_filesystem_size{...} * 100 < 10)^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1392-L1400].
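
The memory expression above might be wrapped into a complete rule along these lines. Only the expr comes from the text; the group name, alert name, for duration, and severity are assumptions chosen to mirror the CPU example. The metric names follow the source and match older node_exporter releases (newer ones append a _bytes suffix).

```yaml
groups:
  - name: host-saturation-example      # hypothetical group name
    rules:
      - alert: hostMemUsageAlert        # hypothetical alert name
        # Fraction of total memory that is not available to applications
        expr: (node_memory_MemTotal - node_memory_MemAvailable) / node_memory_MemTotal > 0.85
        for: 5m                         # assumed; the text does not state a duration
        labels:
          severity: warning             # assumed
        annotations:
          summary: "{{ $labels.instance }} memory usage above 85% (current ratio: {{ $value }})"
```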

Blackbox Monitoring

  • Probes: Alerts based on the success or status of network probes. For example, probe_success == 0 triggers if a probe fails^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1426-L1428].
  • HTTP Status: Checks if the HTTP status code returned by a probe is outside the 200-399 range^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1431-L1434].
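
A sketch of both probe checks as rules follows. probe_success and probe_http_status_code are metrics exposed by blackbox_exporter; the status-code expression is a reconstruction of "outside the 200-399 range" rather than the source's exact text, and the names, durations, and severities are assumptions.

```yaml
groups:
  - name: blackbox-example              # hypothetical group name
    rules:
      - alert: ProbeFailed               # hypothetical alert name
        expr: probe_success == 0
        for: 1m                          # assumed
        labels:
          severity: critical             # assumed
        annotations:
          summary: "Probe of {{ $labels.instance }} is failing"
      - alert: ProbeHttpStatusUnexpected # hypothetical alert name
        # Fires for any status code below 200 or above 399
        expr: probe_http_status_code < 200 or probe_http_status_code > 399
        for: 1m                          # assumed
        labels:
          severity: warning              # assumed
        annotations:
          summary: "{{ $labels.instance }} returned HTTP status {{ $value }}"
```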

Application Performance

  • Pod Usage: Monitors CPU and memory usage at the container/pod level relative to defined limits^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md#L1463-L1473].
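
The source's pod-level expressions are not reproduced here, so the following is only one plausible way to compare container memory usage with its limit. It assumes cAdvisor metrics (container_memory_working_set_bytes and container_spec_memory_limit_bytes) carrying namespace, pod, and container labels, as on recent Kubernetes versions; the 0.9 threshold and all names are hypothetical.

```yaml
groups:
  - name: pod-usage-example                  # hypothetical group name
    rules:
      - alert: containerMemoryNearLimit       # hypothetical alert name
        # The "> 0" filter drops containers without a memory limit (reported as 0),
        # which would otherwise cause a division by zero.
        expr: |
          container_memory_working_set_bytes{container!=""}
            / (container_spec_memory_limit_bytes{container!=""} > 0)
            > 0.9
        for: 5m                               # assumed
        labels:
          severity: warning                   # assumed
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is using {{ $value }} of its memory limit"
```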

Sources

^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]