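[Environment] Production
[OB or other component]
OCP Community Edition
Version: 4.3.5-20250319105844
Release date: March 19, 2025
[Problem description]
OCP alerting is configured to send a notification when an alert recovers.
Current problem:
Every time an alert recovers, the recovery notification is sent twice, and we receive two identical messages.
How should I troubleshoot this?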
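I'll look into this issue and get back to you as soon as there is progress.
Is there any progress on this?
Please share the alert channel configuration.
Set "repeat send interval for the same alert" to 3600s and observe.
Repeat send interval for the same alert: if the same alert has already been sent within this interval, the message is not sent again. "The same alert" means the alert ID is the same. Once an alert recovers, a newly generated alert gets an incremented ID, so it does not count as the same alert.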
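To make the rule above concrete, here is a minimal sketch of how such an interval check could behave. It is a toy model written for illustration only, not OCP's implementation; the class name `RepeatSuppressor` and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RepeatSuppressor:
    """Toy model of the repeat-interval rule described above (not OCP code)."""

    repeat_interval_seconds: int
    # alert id -> timestamp (seconds) of the last notification that was sent
    last_sent_at: dict = field(default_factory=dict)

    def should_send(self, alarm_id: int, now: float) -> bool:
        last = self.last_sent_at.get(alarm_id)
        if last is not None and now - last < self.repeat_interval_seconds:
            return False  # same alert id was already notified inside the interval
        self.last_sent_at[alarm_id] = now
        return True


suppressor = RepeatSuppressor(repeat_interval_seconds=3600)
first = suppressor.should_send(alarm_id=6305, now=0)     # first notification for this id
repeat = suppressor.should_send(alarm_id=6305, now=600)  # same id, still inside the interval
new_id = suppressor.should_send(alarm_id=6306, now=600)  # a new id is not "the same alert"
print(first, repeat, new_id)
```

Under this model the interval only suppresses repeats that share an alert ID. Whether recovery notifications go through the same check is not stated in this thread.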
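I changed it, but it made no difference. The behavior is the same.
Good catch. I checked, and ours behaves the same way.
I'll ask the colleague responsible for this area to take a look.
Please share the ocp-server.log entries from the time the alert recovery notification was sent.
The alert recovery notification was indeed triggered twice. We'll look into it.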
2025-04-22 10:38:53.959 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.a.s.OcpAlarmNotificationService : create alarm notification, id=13148
2025-04-22 10:38:53.960 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.alarm.core.process.AlarmProcessor : message distribute done, channelId=1, notificationCount=1, createdCount=1, message=【恢复通知】OCP 告警 SQL inspection, SQL performance degradation - 告警对象:alarm_template_id=0:ob_cluster=HFC01-1733376512:tenant_name=HF05:db_name=ssdi5:sql_id=AC5BB1475D4815E805AEE6BFFE8D9C75 - 告警详情:During the time period 2025-04-22T10:17:01 - 2025-04-22T10:17:31, tenant HF05, database ssdi5, found anomaly SQL AC5BB1475D4815E805AEE6BFFE8D9C75, performance degradation. Average CPU time 30.63 ms, average response time 30.65 ms, baseline time period 2025-04-22T09:55:01 - 2025-04-22T10:12:07, average CPU time 1.87 ms, average response time 1.9 ms, it is recommended to [refresh PlanCache, current limit], whether the PlanCache has been automatically refreshed: false. - 恢复时间:2025-04-22T02:38:53Z
2025-04-22 10:38:53.960 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.a.c.d.UserSubscribedJudger : calc subscribed users, channelId=1, alarmIds=[6305], alarmType=oas_anomaly_sql_from_sql_inspection_perf_degradation, subscribedUserCount=1
2025-04-22 10:38:53.961 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.a.c.d.CommonMessageDistributor : distribute notification done, channelId=1, alarmType=oas_anomaly_sql_from_sql_inspection_perf_degradation, recipientCount=1
2025-04-22 10:38:53.961 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.alarm.core.process.AlarmProcessor : aggregate result distribute, channelId=1, groupKey=6305, alarmIds=[6305], distributedCount=1
2025-04-22 10:38:53.963 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.a.s.OcpAlarmNotificationService : create alarm notification, id=13149
2025-04-22 10:38:53.964 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.alarm.core.process.AlarmProcessor : message distribute done, channelId=1, notificationCount=1, createdCount=1, message=【恢复通知】OCP 告警 SQL inspection, SQL performance degradation - 告警对象:alarm_template_id=0:ob_cluster=HFC01-1733376512:tenant_name=HF05:db_name=ssdi5:sql_id=AC5BB1475D4815E805AEE6BFFE8D9C75 - 告警详情:During the time period 2025-04-22T10:17:01 - 2025-04-22T10:17:31, tenant HF05, database ssdi5, found anomaly SQL AC5BB1475D4815E805AEE6BFFE8D9C75, performance degradation. Average CPU time 30.63 ms, average response time 30.65 ms, baseline time period 2025-04-22T09:55:01 - 2025-04-22T10:12:07, average CPU time 1.87 ms, average response time 1.9 ms, it is recommended to [refresh PlanCache, current limit], whether the PlanCache has been automatically refreshed: false. - 恢复时间:2025-04-22T02:38:53Z
2025-04-22 10:38:53.964 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.alarm.core.process.AlarmProcessor : process alarm, channelId=1, updatedAlarmsCount=2, sendCount=2
2025-04-22 10:38:53.964 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.alarm.core.process.AlarmProcessor : aggregate done, channelId=2, alarmsCount=2, resultsCount=2
2025-04-22 10:38:53.964 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.alarm.core.process.AlarmProcessor : processRecovers, channelsCount=2, recoverDistributorCount=2, sentRecoverMessageCount=2
2025-04-22 10:38:53.966 INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91] c.o.o.alarm.core.process.AlarmProcessor : process done, currentTimestamp=2025-04-22T10:38:53.842291+08:00
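When the log is long, duplicates like the two "message distribute done" entries above can be found mechanically. The following is a rough sketch that assumes the line layout seen in this excerpt (timestamp first, then `message distribute done, channelId=N, ..., message=<text>` on a single line). The function name and the shortened sample messages are made up for illustration.

```python
import re
from collections import defaultdict

# Layout assumed from the excerpt above; other OCP versions may log differently.
DISTRIBUTE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"message distribute done, channelId=(?P<channel>\d+), .*?message=(?P<message>.*)$"
)


def find_duplicate_messages(lines):
    """Group distribute entries by (channel, message); keep groups seen more than once."""
    seen = defaultdict(list)
    for line in lines:
        match = DISTRIBUTE_RE.match(line.rstrip("\n"))
        if match:
            key = (int(match["channel"]), match["message"])
            seen[key].append(match["ts"])
    return {key: stamps for key, stamps in seen.items() if len(stamps) > 1}


# Sample lines modelled on the excerpt; the message text is shortened here.
prefix = "INFO 23259 --- [alarm-task-1,ac8c815e516493b1,0a8b771376cd9f91]"
sample = [
    f"2025-04-22 10:38:53.959 {prefix} c.o.o.a.s.OcpAlarmNotificationService : create alarm notification, id=13148",
    f"2025-04-22 10:38:53.960 {prefix} c.o.o.alarm.core.process.AlarmProcessor : message distribute done, channelId=1, notificationCount=1, createdCount=1, message=recovery, sql_id=AC5BB1475D4815E805AEE6BFFE8D9C75",
    f"2025-04-22 10:38:53.964 {prefix} c.o.o.alarm.core.process.AlarmProcessor : message distribute done, channelId=1, notificationCount=1, createdCount=1, message=recovery, sql_id=AC5BB1475D4815E805AEE6BFFE8D9C75",
]
duplicates = find_duplicate_messages(sample)
for (channel, message), stamps in duplicates.items():
    print(f"channel {channel}: sent {len(stamps)} times at {stamps}: {message}")
```

To run it against a real file, pass an open file handle instead of `sample`, for example `find_duplicate_messages(open("ocp-server.log", encoding="utf-8"))`.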
Log in to the ocp_meta tenant and query the ocp2_alarm_channel table in meta_database:
obclient -hxxx -P2881 -uroot@ocp_meta -p'xxx' -Dmeta_database -A
select * from ocp2_alarm_channel;
Did the query return any results?
obclient(root@ocp_meta)[meta_database]> select * from ocp2_alarm_channel \G
*************************** 1. row ***************************
id: 1
gmt_create: 2024-12-12 15:37:13
gmt_modified: 2025-04-22 16:58:25
update_time: 2025-04-21 17:10:00
created_by: admin
modified_by: admin
name: NULL
is_default: 0
last_sent_at: 2025-04-22 16:58:26
channel_type: Email
recipient_addr_source: uid
is_group_channel: 0
message_template1: NULL
message_template2: NULL
channel_settings_json: {"emailChannel":{"fromEmail":"XXX@XXX.com","password":"XXX","smtpHost":"mail.XXX.com","smtpPort":587,"toEMailList":["it@XXX.com"],"username":"XXX@XXX.com"}}
is_aggregation_enabled: 1
aggregation_message_template1: NULL
aggregation_message_template2: NULL
aggregation_rule_json: {"aggregateIntervalSeconds":300,"aggregateWaitSeconds":30,"repeatIntervalSeconds":3600}
name2: NULL
name3: NULL
message_template3: NULL
aggregation_message_template3: NULL
message_language: zh_CN
response_validation: success
name_i18n: {"name_zh_tw":"IT_MAIL","name_zh_cn":"IT_MAIL"}
message_template_i18n: {"message_template_en_us":"OCP Alert Notifications ${alarm_name}\n - Level: ${alarm_level}\n - Alert target: ${alarm_target}\n - Summary: ${alarm_summary}\n - Generation time: ${alarm_active_at}\n - Description: ${alarm_description}\n - OCP URL: ${alarm_url}\n ","message_template_zh_cn":"OCP 告警通知 ${alarm_name}\n - 级别:${alarm_level}\n - 告警对象:${alarm_target}\n - 概述:${alarm_summary}\n - 生成时间:${alarm_active_at}\n - 详情:${alarm_description}\n - OCP 链接:${alarm_url}\n ","message_template_zh_tw":"OCP 告警通知 ${alarm_name}\n - 級別:${alarm_level}\n - 告警對象:${alarm_target}\n - 概述:${alarm_summary}\n - 生成時間:${alarm_active_at}\n - 詳情:${alarm_description}\n - OCP 鏈接:${alarm_url}\n "}
aggregation_message_template_i18n: {"aggregation_message_template_zh_tw":"OCP 告警通知 ${alarm_name}\n - 級別:${alarm_level}\n - 告警數量:${alarm_count}\n - 告警分类:${alarm_group}\n - 聚合分組:${alarm_group_by}\n - 告警對象:${alarm_target}\n - 生成時間:${alarm_active_at}\n ","aggregation_message_template_en_us":"OCP Alert Notifications ${alarm_name}\n - Level: ${alarm_level}\n - Alerts: ${alarm_count}\n - Alerts group: ${alarm_group}\n - Aggregation group: ${alarm_group_by}\n - Alert object: ${alarm_target}\n - Generation time: ${alarm_active_at} \n ","aggregation_message_template_zh_cn":"OCP 告警通知 ${alarm_name}\n - 级别:${alarm_level}\n - 告警数量:${alarm_count}\n - 告警分类:${alarm_group}\n - 聚合分组:${alarm_group_by}\n - 告警对象:${alarm_target}\n - 生成时间:${alarm_active_at}\n "}
recover_message_template_i18n: {"recover_message_template_en_us":"OCP Alarm Recover Notification ${alarm_name}\n - Target: ${alarm_target}\n - Details: ${alarm_description}\n - Recovered at: ${alarm_resolved_at} \n ","recover_message_template_zh_tw":"OCP 警示恢復通知 ${alarm_name}\n - 警示對象:${alarm_target}\n - 警示詳情:${alarm_description}\n - 恢復時間:${alarm_resolved_at}\n ","recover_message_template_zh_cn":"【恢复通知】OCP 告警 ${alarm_name}\n - 告警对象:${alarm_target}\n - 告警详情:${alarm_description}\n - 恢复时间:${alarm_resolved_at}\n "}
is_enabled: 1
*************************** 2. row ***************************
id: 2
gmt_create: 2024-12-12 15:39:47
gmt_modified: 2025-04-21 11:28:46
update_time: 2025-04-10 16:28:42
created_by: admin
modified_by: admin
name: NULL
is_default: 0
last_sent_at: 2025-04-21 11:28:45
channel_type: Script
recipient_addr_source: uid
is_group_channel: 0
message_template1: NULL
message_template2: NULL
channel_settings_json: {“scriptChannel”:{“scriptContent”:”#!/usr/bin/env bash\nfunction send() {\n token=‘XXX’\n id=‘XXX’\n URL=“https://XXX/XXX”\n\n # if message is json format, use “’”${message}”’” or “${message}”, do not wrapper a new json body\n # if message is not json format, use “${message}”\n # do not use ‘${message}’\n # print the response to stderr or stdout, which will be validated if success, validate stderr firstly.\n #curl -s -X POST ${URL} -H ‘Content-Type: application/json’ -d '{“msgtype”:“text”,“text”:{“content”:”’”${message}”’”}}’\n curl -s -X POST ${URL} -d id=”${id}” -d text=”${message}”\n return $?\n}\n\n# invoke function to\nsend”,“scriptContentEnabled”:true}}
is_aggregation_enabled: 1
aggregation_message_template1: NULL
aggregation_message_template2: NULL
aggregation_rule_json: {"aggregateIntervalSeconds":300,"aggregateWaitSeconds":30,"repeatIntervalSeconds":600}
name2: NULL
name3: NULL
message_template3: NULL
aggregation_message_template3: NULL
message_language: zh_CN
response_validation: {"ok":true}
name_i18n: {"name_zh_tw":"it_ob_tg","name_zh_cn":"it_ob_m2"}
message_template_i18n: {"message_template_en_us":"OCP Alert Notifications ${alarm_name}\n - Level: ${alarm_level}\n - Alert target: ${alarm_target}\n - Summary: ${alarm_summary}\n - Generation time: ${alarm_active_at}\n - Description: ${alarm_description}\n - OCP URL: ${alarm_url}\n ","message_template_zh_cn":"OCP 告警通知 ${alarm_name}\n - 级别:${alarm_level}\n - 告警对象:${alarm_target}\n - 概述:${alarm_summary}\n - 生成时间:${alarm_active_at}\n - 详情:${alarm_description}\n - OCP 链接:${alarm_url}\n ","message_template_zh_tw":"OCP 告警通知 ${alarm_name}\n - 級別:${alarm_level}\n - 告警對象:${alarm_target}\n - 概述:${alarm_summary}\n - 生成時間:${alarm_active_at}\n - 詳情:${alarm_description}\n - OCP 鏈接:${alarm_url}\n "}
aggregation_message_template_i18n: {"aggregation_message_template_zh_tw":"OCP 告警通知 ${alarm_name}\n - 級別:${alarm_level}\n - 告警數量:${alarm_count}\n - 告警分类:${alarm_group}\n - 聚合分組:${alarm_group_by}\n - 告警對象:${alarm_target}\n - 生成時間:${alarm_active_at}\n ","aggregation_message_template_en_us":"OCP Alert Notifications ${alarm_name}\n - Level: ${alarm_level}\n - Alerts: ${alarm_count}\n - Alerts group: ${alarm_group}\n - Aggregation group: ${alarm_group_by}\n - Alert object: ${alarm_target}\n - Generation time: ${alarm_active_at} \n ","aggregation_message_template_zh_cn":"OCP 告警通知 ${alarm_name}\n - 级别:${alarm_level}\n - 告警数量:${alarm_count}\n - 告警分类:${alarm_group}\n - 聚合分组:${alarm_group_by}\n - 告警对象:${alarm_target}\n - 生成时间:${alarm_active_at}\n "}
recover_message_template_i18n: {"recover_message_template_en_us":"OCP Alarm Recover Notification ${alarm_name}\n - Target: ${alarm_target}\n - Details: ${alarm_description}\n - Recovered at: ${alarm_resolved_at} \n ","recover_message_template_zh_tw":"OCP 警示恢復通知 ${alarm_name}\n - 警示對象:${alarm_target}\n - 警示詳情:${alarm_description}\n - 恢復時間:${alarm_resolved_at}\n ","recover_message_template_zh_cn":"【恢复通知】OCP 告警 ${alarm_name}\n - 告警对象:${alarm_target}\n - 告警详情:${alarm_description}\n - 恢复时间:${alarm_resolved_at}\n "}
is_enabled: 1
2 rows in set (0.000 sec)
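For comparing channels at a glance, the wide `\G` output can be reduced to the few fields relevant here. This is only a sketch: it assumes each row is available as a dictionary (as a database driver's dictionary cursor might return it), the helper name `summarize_channels` is made up, and the sample values are copied from the two rows above, written as valid JSON.

```python
import json


def summarize_channels(rows):
    """Return one (id, channel_type, enabled, repeat_interval_seconds) tuple per channel row."""
    summary = []
    for row in rows:
        # aggregation_rule_json may be NULL/empty; treat that as "no rule configured"
        rule = json.loads(row["aggregation_rule_json"] or "{}")
        summary.append(
            (row["id"], row["channel_type"], bool(row["is_enabled"]), rule.get("repeatIntervalSeconds"))
        )
    return summary


rows = [
    {"id": 1, "channel_type": "Email", "is_enabled": 1,
     "aggregation_rule_json": '{"aggregateIntervalSeconds":300,"aggregateWaitSeconds":30,"repeatIntervalSeconds":3600}'},
    {"id": 2, "channel_type": "Script", "is_enabled": 1,
     "aggregation_rule_json": '{"aggregateIntervalSeconds":300,"aggregateWaitSeconds":30,"repeatIntervalSeconds":600}'},
]
channels = summarize_channels(rows)
for channel in channels:
    print(channel)
```

With the two rows above, this shows both channels enabled, with repeat intervals of 3600 and 600 seconds respectively.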
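Nothing looks abnormal in this output, and all of it is visible in the OCP console. There are two channel configurations: one for email, and one that uses a shell script to call a webhook.
Try configuring only one channel and test again.
OK, we'll keep looking into it.
Please provide the ocp-server.log, monagent.log, and mgragent.log files that cover this time period.
I'm seeing the same problem on my side: alert recovery notifications are sent twice.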