OCP monitoring agent restarts

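【 Environment 】Production
【 OB or other component 】OCP
【 Version 】4.4.0
【 Problem description 】ocp_monagent raised a restart alert. According to the logs, it exited with an error and was then brought back up by the daemon. The logs are below:
agentd.log shows monagent exiting: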
2026-05-17T12:35:59.42749+08:00 WARN [41630,] caller=agentd/service.go:182:guard: service exited with code 2. service state: running fields: service=ocp_monagent

There are two WARN entries around the time monagent exited:
2026-05-17T12:35:59.30212+08:00 WARN [27432,c64b8094a78b8006] caller=host/custom.go:712:doCollectCoredumpTime: get observer coredump time failed, err: open : no such file or directory fields: coredump-path=
2026-05-17T12:35:59.44712+08:00 WARN [14285,] caller=config/yaml.go:99:validateNode: configs may not be replaced: [ocp.agent.http.ip ocp.agent.monitor.http.port]

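ocp_monagent.error.log contains the following. The entries themselves carry no timestamp; I am inferring from the file's modification time on Linux that this error is related to the restart: May 17 12:35 ocp_monagent.error.log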
goroutine 1196 [running]:
internal/sync.(*HashTrieMap[…]).Load(0x1b43de0, {0x15ab280, 0xc009e82aa8})
/home/admin/go/src/internal/sync/hashtriemap.go:73 +0x8a
sync.(*Map).Load(…)
/home/admin/go/src/sync/hashtriemap.go:50
github.com/oceanbase/obagent/monitor/plugins/inputs/oceanbase.(*SqlAuditInput).getSampleSqlType(0xc000414500, 0xc009e836f8, 0x2963b40)
/workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/oceanbase/sql_audit.go:1550 +0x14d
github.com/oceanbase/obagent/monitor/plugins/inputs/oceanbase.(*SqlAuditInput).parseSampleSqlData(0xc000414500, {0x1b1b300, 0xc021c26f50}, {0x187e7c0?, 0xc0220a3188?, 0x2965020?}, 0xc009e836f8, 0x3f2?, 0xc0025ace20, 0xc036767e00)
/workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/oceanbase/sql_audit.go:1436 +0x5c
github.com/oceanbase/obagent/monitor/plugins/inputs/oceanbase.(*ObSqlAudit).parseRawSqlResults(0xc000d3b600, {0x1b1b300, 0xc021c26f50}, 0x2d8cb22ff, 0xc000789180, {0xc009e87ba0?, 0x4d8733?, 0x2965020?}, 0xc0025acdf0, 0xc0025ace20, …)
/workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/oceanbase/sql_audit_merge.go:837 +0x1516
github.com/oceanbase/obagent/monitor/plugins/inputs/oceanbase.(*ObSqlAudit).collectRawMsgsByTenant(0xc000d3b600, {0x1b1b300, 0xc021c26f50}, 0x3f2, 0x2d8cb22ff, 0xc019e9b2c0, 0x0, 0xc0025acdf0, 0xc0025ace20, 0xc0025ace50, …)
/workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/oceanbase/sql_audit_merge.go:493 +0x63a
github.com/oceanbase/obagent/monitor/plugins/inputs/oceanbase.(*ObSqlAudit).runTaskRound(0xc000d3b600, 0x2d8cb22ff, 0xc0025acdd0, 0xc0009632d0)
/workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/oceanbase/sql_audit_merge.go:447 +0x137
github.com/oceanbase/obagent/monitor/plugins/inputs/oceanbase.(*ObSqlAudit).runTask(0xc000d3b600, 0xc0025acdd0, 0xc0009632d0)
/workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/oceanbase/sql_audit_merge.go:433 +0x4ca
created by github.com/oceanbase/obagent/monitor/plugins/inputs/oceanbase.(*ObSqlAudit).addTask in goroutine 1185
/workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/oceanbase/sql_audit_merge.go:374 +0x24b

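I asked an AI to analyze this, and it said the final stack trace indicates that monagent hit a nil pointer? Was the restart caused by a bug?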

The OCP agent is a client that collects server status. It restarted because of some incidental factor, so you can ignore it!

Both WARN entries can be ignored. Please provide a complete ocp_monagent.log that covers the restart time so we can take a look.

The log is too large to upload… monagent-2026-05-18T01-36-49.583.log is about 200 MB per file. Could we set up a group chat to transfer the file?

Try archiving and compressing it. Around 50 MB is usually fine.

monagent-2026-05-18T01-36-49.583.7z (12.1 MB)
OK, uploaded. But I looked around the exit time 2026-05-17T12:35:59 and there doesn't seem to be any useful clue. Could you take a look?

2026-05-17T12:35:59.33787+08:00 INFO [27432,] caller=runtime/panic.go:792:gopanic: collectTask rthub_1010 exited
2026-05-17T12:35:59.44712+08:00 WARN [14285,] caller=config/yaml.go:99:validateNode: configs may not be replaced: [ocp.agent.http.ip ocp.agent.monitor.http.port]
2026-05-17T12:35:59.44724+08:00 INFO [14285,310d0b67ff21bf42] caller=sdk/sdk_init.go:47:InitSDK: init sdk conf {ConfigPropertiesDir:/home/admin/ocp_agent/conf/config_properties ModuleConfigDir:/home/admin/ocp_agent/conf/module_config CryptoPath:/home/admin/ocp_agent/conf/.config_secret.key CryptoMethod:aes}
monagent exited through a deliberate panic and was automatically restarted about 110 ms later.

Right. What I want to know now is the cause of the panic. It isn't the most common one, OOM, and I don't know where to look for clues.

Please also provide mgragent.log so we can take a look.

mgragent-2026-05-18T08-51-43.328.log.gz (15.9 MB)

Check whether ocp_monagent.error.log in the log directory contains anything like 【fatal error: runtime: out of memory】.

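From this log, only ocp_monagent restarted; neither agentd nor mgragent restarted.
Before the crash, roughly 1500–2200 audit records per second were being collected for tenant 1010, which is a fairly high load.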

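So it's an unsolved mystery, then. The problem is intermittent, so it isn't practical to set up heap profiling to capture information. And if the explanation is simply that the agent dies when cluster load gets high, that's hard to accept. Is the stack trace from ocp_monagent.error.log at the top of the thread of any use?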

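Analysis based on the stack trace above:
ocp_monagent (PID 27432) panicked in the SQL Audit collection goroutine for tenant rthub (tenant_id=1010, 0x3f2). After the goroutine exited, a process-level watchdog panic was triggered, and ocp_agentd then started a new instance with PID 14285.
High SQL Audit collection volume for rthub (1010)
→ runTask parses Raw SQL / Sample SQL → sync.Map.Load panics inside getSampleSqlType (goroutine 1196) → the collectTask rthub_1010 goroutine exits → monagent deliberately panics the whole process (27432) → ocp_agentd starts a new monagent (14285)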
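
As background for the chain above: in Go, a panic that is not recovered terminates the whole process with exit status 2, which is consistent with `service exited with code 2` in agentd.log. The sketch below shows the general technique for containing a panic inside a single task. It is not obagent's code and not the actual fix; `runGuarded` is a hypothetical name used only for illustration.

```go
package main

import "fmt"

// runGuarded runs task and converts a panic into an ordinary error, so that
// one failing collection round cannot take the whole process down.
// Hypothetical helper for illustration only.
func runGuarded(name string, task func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("task %s panicked: %v", name, r)
		}
	}()
	task()
	return nil
}
```

A caller that starts `go func() { _ = runGuarded("rthub_1010", collect) }()`, where `collect` is a placeholder for the per-tenant work, would get an error to log instead of a process exit. Whether recovering is appropriate depends on whether shared state can still be trusted after the panic, which is presumably why some programs choose to crash and be restarted instead.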

Judging by the stack, this looks like a known bug that shows up after audit persistence is enabled. I recommend upgrading to 4.4.2.

:+1: Thanks a lot. We'll upgrade to 4.4.2 and keep observing.

Learned something new.