Prometheus alert rules

A Prometheus alert rule is a condition, expressed in PromQL, that raises an alert whenever the expression returns a result.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md] The Prometheus server evaluates these rules periodically; once a condition has held for the configured duration, the alert fires and is sent to the [[alertmanager]] for routing and notification.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]

Rule Structure

An alert rule typically consists of the following components:

  • alert: The name of the alert (e.g., hostCpuUsageAlert). Prometheus does not require names to be unique, but distinct names make routing and debugging easier.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  • expr: A PromQL expression that represents the condition to monitor.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  • for: How long the condition must hold continuously before the alert fires (e.g., 5m).^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  • labels: Key-value pairs used to categorize or add metadata to the alert (e.g., severity: warning).^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  • annotations: Longer explanatory text, such as a summary or description, which can use template variables like {{ $labels.instance }} or {{ $value }}.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]

Rule Files

Alert rules are stored in rule files written in YAML.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]

Prometheus is configured to load these files via the rule_files section of the main configuration file (e.g., prometheus.yml).^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]

# Example snippet from prometheus.yml
rule_files:
 - "/data/etc/rules.yml"
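
For fired alerts to reach Alertmanager, prometheus.yml also needs an alerting section alongside rule_files. The fragment below is a minimal sketch: 9093 is Alertmanager's default port, but the host name alertmanager is an assumption and is not taken from the source.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "/data/etc/rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"   # hypothetical address; adjust to your deployment
```

Entries under rule_files may also be glob patterns (for example, a directory of *.yml files), which helps once rules are split across several files.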

A rule file is organized into groups; the rules within a group are evaluated sequentially at the same interval.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]

# Example structure of a rules.yml
groups:
- name: hostStatsAlert
  rules:
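  - alert: hostCpuUsageAlert
    # node_cpu is the pre-0.16 node-exporter metric name; newer releases expose node_cpu_seconds_total.
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      # $value is a ratio between 0 and 1, not a percentage.
      summary: "{{ $labels.instance }} CPU usage above 85% (current ratio: {{ $value }})"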
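
A group can also override the global evaluation_interval with its own interval field. The sketch below is illustrative only: the group name, the 30s value, and the instanceDown rule are assumptions, not taken from the source.

```yaml
groups:
- name: exampleFastGroup        # hypothetical group name
  interval: 30s                 # evaluate this group's rules every 30 seconds
  rules:
  - alert: instanceDown         # hypothetical rule for illustration
    expr: up == 0               # Prometheus sets 'up' to 0 when a scrape fails
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} has been unreachable for more than 1 minute"
```

When interval is omitted, the group falls back to the evaluation_interval set in the global section of prometheus.yml.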

Common Examples

Host Resource Alerts

These rules monitor physical or virtual machine resources, typically using metrics exported by node-exporter.

  • High CPU Usage: Triggers when the average non-idle CPU usage exceeds 85% for 5 minutes.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    
  • Memory Usage: Triggers when memory usage (Total minus Available) exceeds 85% of total memory.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  • Disk Space: Triggers when the available space on filesystems of a given type (e.g., overlay) drops below 10% of their size.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
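
The memory and disk conditions could be written roughly as follows. This is a sketch, not the source's exact rules: the alert names and the for durations are assumptions, and the metric names assume the same older node-exporter naming as node_cpu (releases from 0.16 on append _bytes, e.g. node_memory_MemTotal_bytes).

```yaml
groups:
- name: hostStatsAlert
  rules:
  - alert: hostMemUsageAlert          # hypothetical name
    # Used memory (total minus available) as a fraction of total memory.
    expr: (node_memory_MemTotal - node_memory_MemAvailable) / node_memory_MemTotal > 0.85
    for: 5m                           # assumed duration
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} memory usage above 85% (current ratio: {{ $value }})"
  - alert: hostDiskSpaceAlert         # hypothetical name
    # Available space as a percentage of filesystem size, overlay filesystems only.
    expr: node_filesystem_avail{fstype="overlay"} / node_filesystem_size{fstype="overlay"} * 100 < 10
    for: 5m                           # assumed duration
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} {{ $labels.mountpoint }} has less than 10% space left (current value: {{ $value }}%)"
```

The memory rule compares a ratio against 0.85, while the disk rule multiplies by 100 and compares against 10, so the two summaries report $value in different units.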

HTTP Blackbox Probes

These rules use metrics from blackbox-exporter to monitor endpoint availability and performance.

  • Probe Failure: Triggers if probe_success is 0 for 1 minute.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  • HTTP Status Code: Triggers if the status code is not in the 200-399 range for 1 minute.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  • SSL Expiry: Triggers if an SSL certificate will expire within 30 days.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
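
A sketch of the three probe rules, using the standard blackbox-exporter metrics probe_success, probe_http_status_code, and probe_ssl_earliest_cert_expiry. The group name, alert names, severities, and the for value on the certificate rule are assumptions for illustration.

```yaml
groups:
- name: blackboxProbeAlert            # hypothetical group name
  rules:
  - alert: blackboxProbeFailed        # hypothetical name
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical              # assumed severity
    annotations:
      summary: "Probe failed for {{ $labels.instance }}"
  - alert: blackboxHttpStatusCode     # hypothetical name
    # Any status code outside the 200-399 range counts as a failure.
    expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
    for: 1m
    labels:
      severity: critical              # assumed severity
    annotations:
      summary: "{{ $labels.instance }} returned HTTP status {{ $value }}"
  - alert: blackboxSslCertExpiringSoon  # hypothetical name
    # The metric is a Unix timestamp; 86400 * 30 is 30 days in seconds.
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m                           # assumed duration
    labels:
      severity: warning               # assumed severity
    annotations:
      summary: "SSL certificate for {{ $labels.instance }} expires in less than 30 days"
```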

Pod Resource Alerts

These rules monitor resource consumption of Kubernetes pods.

  • Pod CPU: Triggers if container CPU usage exceeds 80% of the defined limit for 5 minutes.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  • Pod Memory: Triggers if container memory usage exceeds 80% of the defined limit.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
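
One way to express these two conditions uses only cAdvisor metrics scraped from the kubelet. This is a sketch under several assumptions: the alert names are hypothetical, the label names namespace, pod, and container depend on the Kubernetes version and relabeling in use (older setups expose pod_name and container_name instead), and the 5m duration on the memory rule is not stated in the source.

```yaml
groups:
- name: podStatsAlert                 # hypothetical group name
  rules:
  - alert: podCpuUsageAlert           # hypothetical name
    # CPU cores used, divided by the CPU limit (quota / period, in cores).
    # Containers without a CPU limit produce no result and never alert.
    expr: |
      sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
        /
      sum by (namespace, pod, container) (container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""})
        > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} CPU usage above 80% of its limit (current ratio: {{ $value }})"
  - alert: podMemoryUsageAlert        # hypothetical name
    # Working-set memory divided by the memory limit.
    # The "> 0" filter drops containers with no limit, which report a limit of 0.
    expr: |
      sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
        /
      sum by (namespace, pod, container) (container_spec_memory_limit_bytes{container!=""} > 0)
        > 0.8
    for: 5m                           # assumed duration
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} memory usage above 80% of its limit (current ratio: {{ $value }})"
```

Limits can also be taken from kube-state-metrics instead of cAdvisor; that variant needs a join on the pod and container labels and is not shown here.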

Configuration Workflow

  1. Edit/Create Rule File: Add the rules to a YAML file (e.g., /data/nfs-volume/prometheus/etc/rules.yml).^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  2. Update Prometheus Config: Reference the rule file in prometheus.yml under the rule_files key.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
  3. Reload Configuration: Rather than restarting the Prometheus pod, which can be resource-intensive in production, send a SIGHUP signal to the Prometheus process so that it reloads its configuration and rule files in place.^[400-devops__06-Kubernetes__k8s-paas__07.Promtheus监控k8s企业级应用.md]
    kill -SIGHUP <prometheus_pid>
    
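  4. Verification: Check the Alerts and Rules pages of the Prometheus UI to confirm the new rules are loaded. If the reload failed because of a syntax error, Prometheus keeps running with its previous configuration and logs the error.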
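
As an alternative to SIGHUP, Prometheus can reload over HTTP when started with the --web.enable-lifecycle flag; a POST request to /-/reload then has the same effect as the signal. The fragment below sketches how that flag might be added to the Prometheus container in a Kubernetes Deployment. The container name, image reference, and config path are assumptions, not taken from the source.

```yaml
# Deployment fragment (under spec.template.spec)
containers:
- name: prometheus                  # hypothetical container name
  image: prom/prometheus            # pin to the version you actually run
  args:
  - --config.file=/data/etc/prometheus.yml   # assumed config path
  - --web.enable-lifecycle                   # enables POST /-/reload (and /-/quit)
```

The same flag also enables the /-/quit endpoint, so access to the Prometheus port should be restricted when it is turned on.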

Sources