obagent on the third node fails to start when the cluster starts

【Environment】Production or test environment
【OB or other component】
【Version】
4.2.1.3
【Problem description】Describe the problem clearly


The detailed error is shown in the screenshot below.

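【Steps to reproduce】Operations performed before and after the problem appeared
【Attachments and logs】We recommend collecting diagnostic information with obdiag, the OceanBase agile diagnostic tool. For details, see the link (right-click to open):

【SOP Series 22】: Troubleshooting Step 1 (Self-Service Diagnosis and Diagnostic Information Collection)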


Can you restart only the obagent on node 12 and check whether it comes up normally?

obd cluster restart myoceanbase -c obagent -s 192.168.36.12

If it does not, please provide the obagent service log from node 12.
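When sharing the log, the ERROR entries and their cause lines are usually the part that matters. Below is a minimal sketch for filtering them out, assuming each entry starts with a timestamp, a level, and a bracketed pid/trace field, as in the agentd log lines quoted in this thread. The function name `error_entries` is hypothetical, and the sample lines are copied from the thread rather than read from a file.

```python
import re

# Start of an entry as seen in the thread's log lines:
# "<timestamp> <LEVEL> [<pid>,<trace id>] ..."
ENTRY_START = re.compile(r"^(\d{4}-\d{2}-\d{2}T\S+)\s+([A-Z]+)\s+\[")


def error_entries(lines, level="ERROR"):
    """Group lines into entries and return those at the given level.

    Lines without a timestamp prefix (such as "cause:") are treated as
    continuations of the entry before them.
    """
    entries = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_START.match(line)
        if match:
            current = {"level": match.group(2), "lines": [line]}
            entries.append(current)
        elif current is not None and line.strip():
            current["lines"].append(line)
    return ["\n".join(e["lines"]) for e in entries if e["level"] == level]


if __name__ == "__main__":
    SAMPLE = [
        "2024-03-12T09:54:01.65041+08:00 INFO [261475,] caller=agentd/watchdog.go:116:Start: starting service ‘ob_monagent’",
        "2024-03-12T09:54:01.65049+08:00 ERROR [261475,] caller=agentd/watchdog.go:119:Start: start service ‘ob_monagent’ failed: Module=agentd, kind=INTERNAL, code=write_pid_failed; [/usr/local/zgiot/myoceanbase/obagent/run/ob_monagent.pid]",
        "cause:",
        "open /usr/local/zgiot/myoceanbase/obagent/run/ob_monagent.pid: file exists",
        "2024-03-12T09:54:01.65054+08:00 INFO [261475,] caller=agentd/watchdog.go:122:Start: agentd started",
    ]
    for entry in error_entries(SAMPLE):
        print(entry)
        print("---")
```

To run it against a real log, pass the open file object instead of `SAMPLE`; the log file location depends on the deployment and is not assumed here.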

I ran obd cluster restart myoceanbase -c obagent -s 192.168.36.12 to restart only the obagent on node 12, and the error is still there:

2024-03-12T09:54:01.65049+08:00 ERROR [261475,] caller=agentd/watchdog.go:119:Start: start service ‘ob_monagent’ failed: Module=agentd, kind=INTERNAL, code=write_pid_failed; [/usr/local/zgiot/myoceanbase/obagent/run/ob_monagent.pid]
cause:
open /usr/local/zgiot/myoceanbase/obagent/run/ob_monagent.pid: file exists

I moved that file out of the way and started the agent on node 12 again. It works now.
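The manual workaround can be sketched as a small check to run before restarting: treat the pid file as stale if it is empty, unparsable, or names a process that no longer exists, and only then move it aside. This is a minimal sketch and not part of obagent or obd. The function names and the `.stale` suffix are hypothetical, and the path is the one from the error message.

```python
import os
import shutil


def pid_file_is_stale(path):
    """Return True if the pid file exists but does not name a live process."""
    try:
        with open(path) as f:
            text = f.read().strip()
    except FileNotFoundError:
        return False  # no file, nothing to clean up
    try:
        pid = int(text)
    except ValueError:
        return True  # empty or garbled content
    if pid <= 0:
        return True
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # process exists but belongs to another user
    return False


def move_stale_pid_file(path):
    """Move a stale pid file aside; return the new path, or None if untouched."""
    if not pid_file_is_stale(path):
        return None
    target = path + ".stale"  # hypothetical suffix, chosen for this sketch
    shutil.move(path, target)
    return target


if __name__ == "__main__":
    pid_file = "/usr/local/zgiot/myoceanbase/obagent/run/ob_monagent.pid"
    moved = move_stale_pid_file(pid_file)
    print("moved to", moved if moved else "nothing (file missing or process alive)")
```

Moving the file rather than deleting it keeps the evidence available if the maintainers ask for it later.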

What version does obd --version report? Could you also provide the complete log for the error above, so we can check whether this is a defect?

[root@ziguang ~]# obd --version
OceanBase Deploy: 2.5.0
REVISION: 582dec0e9bece2d738ab1d65b59bd6a599271281
BUILD_BRANCH: HEAD
BUILD_TIME: Dec 29 2023 11:45:51
Copyright (C) 2021 OceanBase
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

The complete log from the agent startup on node 12 is as follows:

2024-03-12T09:54:01.64687+08:00 INFO [261475,] caller=agentd/main.go:64:run: starting agentd with config /usr/local/zgiot/myoceanbase/obagent/conf/agentd.yaml
2024-03-12T09:54:01.64748+08:00 INFO [261475,] caller=agentd/limit_linux.go:32:newLimiter: create service ob_mgragent resource limit skipped, no limit in config
2024-03-12T09:54:01.64831+08:00 INFO [261475,] caller=agentd/limit_linux.go:43:newLimiter: create service ob_monagent resource limit done, cpu: 2, memory: 2GiB
2024-03-12T09:54:01.64863+08:00 ERROR [261475,] caller=agentd/watchdog.go:266:cleanupPidPattern: cleanup pid file /usr/local/zgiot/myoceanbase/obagent/run/ob_monagent.pid got error fields: error=“Module=agent, kind=INTERNAL, code=read_pid_failed; [/usr/local/zgiot/myoceanbase/obagent/run/ob_monagent.pid]\ncause:\nstrconv.Atoi: parsing “”: invalid syntax”
2024-03-12T09:54:01.64872+08:00 INFO [261475,] caller=agentd/watchdog.go:93:Start: starting agentd
2024-03-12T09:54:01.64959+08:00 INFO [261475,] caller=agentd/watchdog.go:106:Start: starting socket listener on ‘/usr/local/zgiot/myoceanbase/obagent/run/ob_agentd.261475.sock’
2024-03-12T09:54:01.64976+08:00 INFO [261475,] caller=agentd/watchdog.go:116:Start: starting service ‘ob_mgragent’
2024-03-12T09:54:01.6499+08:00 INFO [261475,] caller=agentd/service.go:144:startProc: starting service fields: service=ob_mgragent
2024-03-12T09:54:01.65034+08:00 INFO [261475,] caller=agentd/service.go:151:startProc: service process started. pid: 261481 fields: service=ob_mgragent
2024-03-12T09:54:01.65041+08:00 INFO [261475,] caller=agentd/watchdog.go:116:Start: starting service ‘ob_monagent’
2024-03-12T09:54:01.65049+08:00 ERROR [261475,] caller=agentd/watchdog.go:119:Start: start service ‘ob_monagent’ failed: Module=agentd, kind=INTERNAL, code=write_pid_failed; [/usr/local/zgiot/myoceanbase/obagent/run/ob_monagent.pid]
cause:
open /usr/local/zgiot/myoceanbase/obagent/run/ob_monagent.pid: file exists
2024-03-12T09:54:01.65054+08:00 INFO [261475,] caller=agentd/watchdog.go:122:Start: agentd started
2024-03-12T09:54:02.63785+08:00 INFO [261475,02f3ef277fca493c] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:02.68098+08:00 INFO [261475,ba7b659741b9de40] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
2024-03-12T09:54:03.68249+08:00 INFO [261475,519910baddbc8cc5] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:03.71746+08:00 INFO [261475,5a1ccfd029a783f5] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
2024-03-12T09:54:04.71879+08:00 INFO [261475,111f8b3820d69a41] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:04.75262+08:00 INFO [261475,179a754e71bc11ac] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
2024-03-12T09:54:05.75398+08:00 INFO [261475,909bebb4cd65a920] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:05.78803+08:00 INFO [261475,76f87ac524682485] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
2024-03-12T09:54:06.78928+08:00 INFO [261475,8e5e4cbecb3d5dbe] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:06.82289+08:00 INFO [261475,30b72ee95978051c] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
2024-03-12T09:54:07.82457+08:00 INFO [261475,6bf583ffcbd294e9] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:07.85812+08:00 INFO [261475,1699c846b840984d] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
2024-03-12T09:54:08.86003+08:00 INFO [261475,7b2d00c6479f5ce3] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:08.89293+08:00 INFO [261475,93c64c5111102ec0] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
2024-03-12T09:54:09.8952+08:00 INFO [261475,40501dc22fdf2abb] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:09.92738+08:00 INFO [261475,28532e303b4ef089] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
2024-03-12T09:54:10.92876+08:00 INFO [261475,2becfb7d580d26d4] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:10.96446+08:00 INFO [261475,d6758e874edd1cf9] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
2024-03-12T09:54:11.96615+08:00 INFO [261475,d07cd851ff2ce837] caller=http/http_command.go:59:func1: handling command request /api/v1/status
2024-03-12T09:54:11.99918+08:00 INFO [261475,12c08be8cf870391] caller=http/http_command.go:102:writeOk: command request /api/v1/status succeed
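Read together, the two ERROR entries in this log suggest a sequence: the pid file exists but is empty, so parsing its content fails (strconv.Atoi on an empty string) and the cleanup leaves the file in place; the later attempt to create the file then fails with "file exists". The sketch below reproduces that sequence in Python. It assumes the pid file is created exclusively (O_CREAT with O_EXCL), which is inferred from the error text and not confirmed against the obagent source. All function names are hypothetical.

```python
import os
import tempfile


def cleanup_pid_file(path):
    """Remove the pid file only if its content parses as a pid."""
    with open(path) as f:
        text = f.read().strip()
    try:
        int(text)
    except ValueError as exc:
        return f"read_pid_failed: {exc}"  # the file is left in place
    os.remove(path)
    return None


def write_pid_file(path, pid):
    """Create the pid file exclusively, failing if it already exists."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return "write_pid_failed: file exists"
    with os.fdopen(fd, "w") as f:
        f.write(str(pid))
    return None


def simulate_start(path, pid):
    """Run cleanup, then write, and return the errors in order."""
    errors = []
    if os.path.exists(path):
        err = cleanup_pid_file(path)
        if err:
            errors.append(err)
    err = write_pid_file(path, pid)
    if err:
        errors.append(err)
    return errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        pid_path = os.path.join(run_dir, "ob_monagent.pid")
        open(pid_path, "w").close()  # empty pid file left by an earlier run
        for line in simulate_start(pid_path, 12345):
            print(line)
        os.remove(pid_path)  # the manual workaround from the thread
        print(simulate_start(pid_path, 12345))
```

With the empty file present, the simulated start reports a read failure followed by a write failure, matching the order in the log. After the file is removed, the start reports no errors, which is consistent with the workaround described earlier in the thread.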

We recommend upgrading to the latest OBD, 2.6.1, and checking whether the pid file still fails to be removed.