Cutting Alert Noise by 70% with Prometheus Recording Rules (Part 2)

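5. Troubleshooting and Monitoring

5.1 Troubleshooting

◆ 5.1.1 Viewing Logs

# Find rule evaluation errors in the Prometheus logs
journalctl -u prometheus | grep -i "rule" | tail -50

# List rules whose health is not "ok" (each entry includes lastError and evaluationTime)
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.health != "ok")'

# Kubernetes environments
kubectl logs -n monitoring prometheus-server-0 | grep -i "recording"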

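◆ 5.1.2 Common Problems

Problem 1: A Recording Rule shows "no data"

# Diagnostic commands
# 1. Confirm the source metric exists
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=node_cpu_seconds_total' | jq '.data.result | length'

# 2. Test the expression (--data-urlencode stops curl from treating [5m] as a URL glob)
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(node_cpu_seconds_total[5m])' | jq '.data.result | length'

Solutions:

1. Confirm the source metric is being scraped.
2. Check that the label selectors match.
3. Verify that the time window is reasonable: rate() needs at least two samples inside the window.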

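Problem 2: A Recording Rule returns unexpected values

# Diagnostic commands
# Compare the Recording Rule result with the equivalent ad-hoc query
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=instance:node_cpu_utilization:avg5m{instance="node1:9100"}'
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=1-avg(rate(node_cpu_seconds_total{instance="node1:9100",mode="idle"}[5m]))'

Solution: check that the by clause lists every label that needs to be preserved.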
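
As a rough illustration of the by-clause check, the sketch below writes a rule file whose aggregation keeps the instance label. The rule name is the one queried above. The expression, the group name node_cpu_example, and the helper write_cpu_rule are assumptions made for this example, not necessarily the definition used in Part 1.

```shell
#!/bin/sh
# write_cpu_rule FILE: write an illustrative rule file (hypothetical helper).
write_cpu_rule() {
  cat > "$1" <<'EOF'
groups:
  - name: node_cpu_example        # assumed group name
    rules:
      - record: instance:node_cpu_utilization:avg5m
        # "by (instance)" keeps one series per host. Dropping it would
        # average every host into a single series with no instance label.
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
EOF
}

# Usage: write the file, then validate it if promtool is installed.
#   write_cpu_rule /tmp/node_cpu_example.yml
#   promtool check rules /tmp/node_cpu_example.yml
```

If the recorded series has fewer labels than the ad-hoc query expects, the by clause is usually the first place to look.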

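Problem 3: Rule evaluation times out

• Symptom: the Prometheus log shows "rule group took longer than interval"
• Diagnosis: check rule complexity and data volume
• Fix: split the rule group, increase the interval, or optimize the expressions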
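
One way to spot groups that are close to timing out is to compare each group's last evaluation time with its interval. The sketch below is a minimal filter for that, assuming group names contain no spaces. The function name flag_slow_groups and the 0.5 default ratio are illustrative choices, not Prometheus settings.

```shell
#!/bin/sh
# flag_slow_groups [RATIO]: read "<group> <evaluationTime> <interval>" lines
# (all times in seconds) on stdin and print every group whose last evaluation
# used more than RATIO of its interval, with the percentage used.
# RATIO defaults to 0.5.
flag_slow_groups() {
  awk -v ratio="${1:-0.5}" 'NF == 3 && $3 > 0 && ($2 / $3) > ratio {
    printf "%s %.0f%%\n", $1, 100 * $2 / $3
  }'
}

# Feeding it from a live server (needs curl, jq, and a reachable Prometheus):
#   curl -s http://localhost:9090/api/v1/rules \
#     | jq -r '.data.groups[] | "\(.name) \(.evaluationTime) \(.interval)"' \
#     | flag_slow_groups 0.5
```

Groups that show up here are candidates for splitting or for a longer interval before they start missing evaluations.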

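◆ 5.1.3 Debug Mode

# Enable the Prometheus query log
# Set query_log_file in the global section of prometheus.yml, then reload:
#   global:
#     query_log_file: /var/log/prometheus/query.log

# Find slow queries (each log line is a JSON object; the 1s cutoff is only an example)
jq -c 'select(.stats.timings.evalTotalTime > 1) | {query: .params.query, seconds: .stats.timings.evalTotalTime}' /var/log/prometheus/query.log

# Analyze rule evaluation with Prometheus's built-in metrics
prometheus_rule_evaluation_duration_seconds
prometheus_rule_group_last_duration_seconds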

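5.2 Performance Monitoring

◆ 5.2.1 Key Metrics

# Rule evaluation time, p99 (this metric is a summary, so select the quantile label directly)
prometheus_rule_evaluation_duration_seconds{quantile="0.99"}

# Rule evaluation failures per second
rate(prometheus_rule_evaluation_failures_total[5m])

# Total series in the TSDB head (every Recording Rule output adds to this count)
prometheus_tsdb_head_series

# Memory usage
process_resident_memory_bytes{job="prometheus"}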

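◆ 5.2.2 Metric Reference

Metric	Normal range	Alert threshold	Notes
prometheus_rule_evaluation_duration_seconds	< 1s	> 5s	Rule evaluation time; high values point to complex rules or large data volumes
prometheus_rule_evaluation_failures_total	No increase	Any increase	Rule evaluation failures; check the expression
prometheus_tsdb_head_series	Depends on the environment	Growth > 10%/h	Number of series; Recording Rules add series
process_resident_memory_bytes	< 80% of allocated memory	> 90%	Memory usage; more series consume more memory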

◆ 5.2.3 Monitoring the Recording Rules Themselves

# Monitor the health of Recording Rules
groups:
  - name: recording_rules_health
    rules:
      - alert: RecordingRuleEvaluationSlow
        expr: |
          prometheus_rule_group_last_duration_seconds >
          prometheus_rule_group_interval_seconds
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Recording Rule group {{ $labels.rule_group }} takes longer than its interval to evaluate"

      - alert: RecordingRuleEvaluationFailure
        expr: rate(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Recording Rule evaluation is failing"

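5.3 Backup and Restore

◆ 5.3.1 Configuration Backup

#!/bin/bash
# Backup script for the Recording Rules configuration
# File name: backup_recording_rules.sh
set -euo pipefail  # stop on the first error so a failed backup is never pruned or "verified"

BACKUP_DIR="/backup/prometheus/rules"
PROMETHEUS_RULES_DIR="/etc/prometheus/rules"
DATE=$(date +%Y%m%d_%H%M%S)

mkdir -p "${BACKUP_DIR}"

# Back up all rule files
tar -czvf "${BACKUP_DIR}/rules_${DATE}.tar.gz" -C "${PROMETHEUS_RULES_DIR}" .

# Keep the last 30 days of backups
find "${BACKUP_DIR}" -name "rules_*.tar.gz" -mtime +30 -delete

# Verify the backup by listing its contents
tar -tzf "${BACKUP_DIR}/rules_${DATE}.tar.gz"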

◆ 5.3.2 Restore Procedure

1. Stop Prometheus: systemctl stop prometheus
2. Restore the configuration: tar -xzvf /backup/prometheus/rules/rules_xxx.tar.gz -C /etc/prometheus/rules/
3. Validate the configuration: promtool check rules /etc/prometheus/rules/*.yml
4. Start the service: systemctl start prometheus
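
The restore steps can also be scripted so that validation happens before the live directory is touched. The following is a minimal sketch under that assumption: latest_backup and restore_rules are hypothetical helper names, the archive naming follows the backup script, and stopping and starting the service is left to the operator.

```shell
#!/bin/sh
# latest_backup DIR: print the newest rules_*.tar.gz in DIR.
# The YYYYmmdd_HHMMSS stamp in the file name sorts lexicographically.
latest_backup() {
  find "$1" -maxdepth 1 -name 'rules_*.tar.gz' | sort | tail -n 1
}

# restore_rules ARCHIVE TARGET: unpack into a staging directory, validate,
# and only then copy into TARGET. Validation is skipped when promtool is
# not installed. Uses the plain variables "staging" and "status".
restore_rules() {
  staging=$(mktemp -d) || return 1
  tar -xzf "$1" -C "$staging" || { rm -rf "$staging"; return 1; }
  if command -v promtool >/dev/null 2>&1; then
    promtool check rules "$staging"/*.yml || { rm -rf "$staging"; return 1; }
  fi
  mkdir -p "$2" && cp -Rp "$staging"/. "$2"/
  status=$?
  rm -rf "$staging"
  return "$status"
}

# Usage (paths taken from the backup script):
#   restore_rules "$(latest_backup /backup/prometheus/rules)" /etc/prometheus/rules
```

Staging the archive first means a corrupt or invalid backup fails early and leaves the current rules directory unchanged.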

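6. Summary

6.1 Key Takeaways

• Point 1: The core value of Recording Rules is precomputation and aggregation. Time-window smoothing removes momentary jitter, and hierarchical aggregation consolidates alerts
• Point 2: Naming conventions matter. The level:metric_name:aggregation format keeps rules easy to understand and maintain
• Point 3: Layered design is the best practice for large environments: abstract step by step from raw metrics to infrastructure metrics to business metrics
• Point 4: Recording Rules need continuous monitoring and tuning. Watch evaluation time, failure counts, and series growth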
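
The naming convention from Point 2 can be checked mechanically before rules are deployed. The sketch below is one possible check. The helper name is_valid_rule_name and the exact pattern (three colon-separated parts made of letters, digits, and underscores) are assumptions for illustration and are stricter than what Prometheus itself accepts.

```shell
#!/bin/sh
# is_valid_rule_name NAME: succeed if NAME has exactly three colon-separated
# parts (level:metric_name:aggregation), each made of letters, digits, and
# underscores. This is a team-style lint, not a Prometheus requirement.
is_valid_rule_name() {
  printf '%s\n' "$1" |
    grep -Eq '^[A-Za-z_][A-Za-z0-9_]*:[A-Za-z_][A-Za-z0-9_]*:[A-Za-z0-9_]+$'
}

# Usage:
#   is_valid_rule_name 'instance:node_cpu_utilization:avg5m' && echo ok
#   is_valid_rule_name 'node_cpu_utilization' || echo "missing level and aggregation"
```

A check like this fits naturally next to promtool check rules in a CI step.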

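6.2 Where to Go Next

1. Alert tiering and intelligent noise reduction: use machine-learning-based anomaly detection in place of fixed thresholds

• Resources: anomaly detection proposals from the Prometheus community
• Practice: start with a simple dynamic baseline and introduce more complex algorithms gradually

2. Large-scale Prometheus architecture: federation, remote storage, Thanos/Cortex

• Resources: the official Thanos documentation and related CNCF talks
• Practice: build a federated setup in a test environment to understand how Recording Rules divide the work across layers

3. Advanced PromQL: subqueries, offsets, complex aggregations

• Resources: the PromLabs blog and the official Prometheus documentation
• Practice: experiment in the Prometheus UI to understand how each function computes its result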

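6.3 References

• Prometheus Recording Rules official documentation - the authoritative reference
• Robust Perception blog - the blog of Brian Brazil, a core Prometheus developer
• PromLabs blog - in-depth PromQL tutorials
• Awesome Prometheus Alerts - a community-maintained collection of alerting rules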

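Appendix

A. Recording Rules Command Cheat Sheet

# Check rule syntax
promtool check rules /etc/prometheus/rules/*.yml

# Hot-reload the configuration (requires the --web.enable-lifecycle flag)
curl -X POST http://localhost:9090/-/reload

# Show rule status (the API reports the rule name in the "name" field)
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name, health}'

# Test an expression (--data-urlencode handles brackets, braces, and spaces)
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=YOUR_EXPRESSION' | jq .

# Show rule evaluation metrics
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_rule_evaluation_duration_seconds'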

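B. Common Aggregation Functions

Function	Purpose	Example
avg_over_time	Average over the time window; smooths jitter	avg_over_time(cpu[5m])
max_over_time	Maximum over the time window; captures peaks	max_over_time(cpu[15m])
min_over_time	Minimum over the time window	min_over_time(available[5m])
rate	Per-second rate of increase of a counter	rate(requests_total[5m])
increase	Increase over the time window	increase(restarts[1h])
histogram_quantile	Quantile from a histogram	histogram_quantile(0.99, rate(bucket[5m]))
clamp_max/clamp_min	Clamp values to an upper or lower bound	clamp_max(ratio, 1)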
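
To make the smoothing effect of avg_over_time concrete, the sketch below averages the last N samples of a series with awk. It is only an approximation: avg_over_time averages every sample inside a time range, while this helper (avg_window, a hypothetical name) uses a fixed sample count and so assumes evenly spaced scrapes. The sample values are made up.

```shell
#!/bin/sh
# avg_window N: read one sample per line on stdin and, after each sample,
# print the mean of the most recent N samples (fewer while warming up).
avg_window() {
  awk -v n="$1" '{
    buf[NR % n] = $1                      # ring buffer of the last n samples
    count = (NR < n) ? NR : n
    sum = 0
    for (i = 0; i < count; i++) sum += buf[(NR - i) % n]
    printf "%.1f\n", sum / count
  }'
}

# Five CPU samples with a single spike. The raw value 95 would cross an
# 80% threshold, while the 5-sample average stays well below it:
#   printf '10\n10\n95\n10\n10\n' | avg_window 5 | tail -n 1   # prints 27.0
```

This is why an alert on a windowed average fires far less often on short spikes than an alert on the raw metric.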

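C. Glossary

Term	Definition
Recording Rules	Prometheus precomputation rules that store the result of a complex query as a new time series
Alert Noise	Non-critical, repetitive, or false-positive alerts that lead to alert fatigue
Time Window	The time range given in square brackets in PromQL, such as [5m]
Cardinality	The number of time series, determined by the combinations of metric name and labels
Alert Aggregation	Merging multiple related alerts into one to reduce the alert count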
