How to locate the cause of an OceanBase failure

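[Environment] Production
[OB or other component] OB
[Version] OceanBase_CE 4.3.5.0
[Problem description] A brief power interruption in the server room shut down all of the servers. After the OceanBase servers were powered back on, the service would not start. I am not very familiar with OceanBase, so I don't know which logs to start with when troubleshooting.
Steps taken:
Started the service from the management node: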
[root@db4 ~]# obd cluster start myoceanbase
Get local repositories ok
Load cluster param plugin ok
Open ssh connection ok
Check before start ob-configserver ok
[WARN] OBD-4521: The config observer_sys_password in obproxy-ce did not take effect, please config it in oceanbase-ce
Check before start obagent ok
Check before start ocp-express ok
Start ob-configserver ok
ob-configserver program health check ok
cluster scenario: olap
Start observer ok
observer program health check ok
Connect to observer 192.168.5.86:2881 ok
obshell start ok
obshell program health check ok
start obproxy ok
obproxy program health check ok
Connect to obproxy ok
Start obagent x
[ERROR] failed to start 192.168.5.85 obagent.
[WARN] obagent-py_script_start-1.3.0 has animation not been closed

See https://www.oceanbase.com/product/ob-deployer/error-codes .
Trace ID: 04767684-091c-11f0-943c-0cc47ac49e64
If you want to view detailed obd logs, please run: obd display-trace 04767684-091c-11f0-943c-0cc47ac49e64

Then I checked the agentctl log on 192.168.5.85:

tail -200f /soft/myoceanbase/obagent/log/agentctl.log

2025-03-25T09:54:31.18372+08:00 INFO [4903,] caller=agent/admin.go:351:AgentStatus: AgentStatus
2025-03-25T09:54:31.20541+08:00 INFO [4903,] caller=agent/admin.go:363:AgentStatus: check agentd status got: {State:running Ready:false Version:4.2.2-20240108 Socket:/soft/myoceanbase/obagent/run/ob_agentd.4912.sock Services:map[ob_mgragent:{Status:{State:stopped Version: Pid:0 StartAt:-6795364578871345152 Ports:[]} Socket:/soft/myoceanbase/obagent/run/ob_mgragent.0.sock EndAt:-6795364578871345152} ob_monagent:{Status:{State:running Version:4.2.2-20240108 Pid:4919 StartAt:1742867661451529962 Ports:[8088]} Socket:/soft/myoceanbase/obagent/run/ob_monagent.4919.sock EndAt:-6795364578871345152}] Dangling:[] StartAt:1742867661072849671}
2025-03-25T09:54:31.20546+08:00 ERROR [4903,] caller=agent/admin.go:253:startAgent: wait for agent ready timeout
2025-03-25T09:54:31.20549+08:00 INFO [4903,] caller=agent/admin.go:613:progressEnd: startAgent end
2025-03-25T09:54:31.20551+08:00 WARN [4903,] caller=agent/admin.go:615:progressEnd: updateProgress: missing storedStatus fields:, task_token=
2025-03-25T09:54:31.20554+08:00 ERROR [4903,] caller=agent/admin.go:208:StartAgent: start agent failed: Module=agent, kind=DEADLINE_EXCEEDED, code=wait_for_ready_timeout;
2025-03-25T09:54:31.20563+08:00 INFO [4903,] caller=agent/admin.go:188:unlock: process 4903 release admin lock
2025-03-25T09:54:31.20586+08:00 INFO [4903,] caller=agentctl/main.go:275:func2: agentctl error fields: response="{"successful":false,"message":null,"error":"Module=agent, kind=DEADLINE_EXCEEDED, code=wait_for_ready_timeout; "}"
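The status line in that log already says which part is not coming up: agentd is running but Ready:false, ob_monagent is running, and ob_mgragent is stopped with Pid:0. A rough sketch for pulling just that out of the log, assuming the "Services:map[...]" layout shown above (the function name service_states is made up for illustration):

```shell
# Print "<service> <state>" for each obagent sub-service mentioned on stdin.
# Assumes the "ob_xxxagent:{Status:{State:..." layout seen in agentctl.log;
# socket paths such as ob_mgragent.0.sock are not matched.
service_states() {
  grep -o 'ob_[a-z]*agent:{Status:{State:[a-z]*' \
    | sed -E 's/^(ob_[a-z]+):\{Status:\{State:([a-z]+)$/\1 \2/' \
    | sort -u
}

# Example (path taken from the post above):
#   grep 'check agentd status got' /soft/myoceanbase/obagent/log/agentctl.log \
#     | tail -n 1 | service_states
```

For the line pasted above this would list ob_mgragent as stopped and ob_monagent as running. The StartAt:-6795364578871345152 value looks like what Go reports for an unset timestamp, which would fit ob_mgragent never having started in this attempt.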

The other log:

tail -200f /soft/myoceanbase/obagent/log/monagent.log

2025-03-25T10:02:54.29317+08:00 WARN [4919,] caller=mysql/table_input.go:274:collectWithConfig: slow sql, name: ob_query_response_time_seconds, duration: 2.593550267s (over 100ms), sql: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ __all_tenant.tenant_name as tenant_name, v_acc_response_time.response_time as response_time_seconds, v_acc_response_time.count as bucket, case when v_acc_response_time.response_time = '+inf' then v_acc_response_time.count else null end as count, case when v_acc_response_time.response_time = '+inf' then cast(v_acc_response_time.sum / 1000000 as float) else null end as sum from (select b.tenant_id, b.response_time / 1000000 as response_time, sum(a.count) as count, sum(a.total) as sum from __all_virtual_query_response_time a, __all_virtual_query_response_time b where a.response_time <= b.response_time and a.svr_ip = b.svr_ip and a.svr_port = b.svr_port and b.svr_ip = ? and b.svr_port = ? group by b.tenant_id, b.response_time union select tenant_id, '+inf', sum(count), sum(total) from __all_virtual_query_response_time) v_acc_response_time, __all_tenant where v_acc_response_time.tenant_id = __all_tenant.tenant_id;
2025-03-25T10:02:54.2933+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of count %!s() to float64
2025-03-25T10:02:54.29333+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of sum %!s() to float64
2025-03-25T10:02:54.29339+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of count %!s() to float64
2025-03-25T10:02:54.29344+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of sum %!s() to float64
2025-03-25T10:02:54.29348+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of count %!s() to float64
2025-03-25T10:02:54.29352+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of sum %!s() to float64
2025-03-25T10:02:54.29356+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of count %!s() to float64
2025-03-25T10:02:54.29359+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of sum %!s() to float64
2025-03-25T10:02:54.29365+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of count %!s() to float64
2025-03-25T10:02:54.29368+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of sum %!s() to float64
2025-03-25T10:02:54.29372+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of count %!s() to float64
2025-03-25T10:02:54.29375+08:00 WARN [4919,] caller=mysql/table_input.go:321:collectWithConfig: can not convert value of sum %!s() to float64
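Since monagent.log repeats the same warning many times, it can help to collapse it into counts per distinct message before reading it. A minimal sketch, assuming every line starts with "timestamp LEVEL [pid,]" as in the excerpt above (summarize_levels is a hypothetical helper name):

```shell
# Count distinct WARN/ERROR messages read from stdin, most frequent first.
# The timestamp and the "[pid,]" field are dropped so repeats collapse;
# messages are cut to 100 characters to keep long SQL text manageable.
summarize_levels() {
  grep -E '^[^ ]+ (WARN|ERROR) ' \
    | sed -E 's/^[^ ]+ ((WARN|ERROR) )\[[^]]*\] /\1/' \
    | cut -c1-100 \
    | sort | uniq -c | sort -rn
}

# Example (path taken from the post above):
#   tail -n 2000 /soft/myoceanbase/obagent/log/monagent.log | summarize_levels
```

On the lines pasted here this would leave three entries: the slow sql warning and the two "can not convert value" warnings for count and sum.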

Could someone help me work out why it won't start?

Does the production environment have a backup?

Connect to obproxy ok
Start obagent x
[ERROR] failed to start 192.168.5.85 obagent.
[WARN] obagent-py_script_start-1.3.0 has animation not been closed
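This looks like the agent service failed to start, and there seem to be leftover agent script processes. Check whether the observer process is up on each machine.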
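One way to do that check on a host is to look for the process names in the ps output. This is only a sketch: the obagent process names (ob_agentd, ob_mgragent, ob_monagent) are guessed from the socket file names in agentctl.log, and report_procs is a made-up helper.

```shell
# Read a `ps -e -o comm=` listing from stdin and report, for each process
# name given as an argument, whether it appears in that listing.
report_procs() {
  listing="$(cat)"
  for name in "$@"; do
    if printf '%s\n' "$listing" | grep -qxF -- "$name"; then
      echo "$name running"
    else
      echo "$name missing"
    fi
  done
}

# Local host:
#   ps -e -o comm= | report_procs observer ob_agentd ob_mgragent ob_monagent
# Remote host (assumes root ssh access, as used by obd in this deployment):
#   ssh root@192.168.5.85 'ps -e -o comm=' | report_procs observer ob_agentd
```

If an obagent process shows as running before obd cluster start is invoked, that would be consistent with a leftover process from the earlier attempt.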

In any fault diagnosis, keeping the data safe has to come first.

1. Run this:
obd display-trace 04767684-091c-11f0-943c-0cc47ac49e64

2. obd logs: by default they are saved under the home directory of the user who installed obd: cd ~/.obd/log/

3. obd cluster edit-config myoceanbase (save the contents to a text file and share it)
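If the obd display-trace output is too long to read through, the files under ~/.obd/log/ can also be searched for the trace ID directly. A minimal sketch, assuming the obd log lines contain the trace ID and a level tag such as ERROR (the helper names trace_lines and trace_errors are hypothetical):

```shell
# Print every line under LOG_DIR that mentions TRACE_ID (fixed-string match).
# LOG_DIR defaults to ~/.obd/log, the location given in step 2.
trace_lines() {
  trace_id="$1"
  log_dir="${2:-$HOME/.obd/log}"
  grep -rhF -- "$trace_id" "$log_dir"
}

# Narrow the same lines down to the ones tagged ERROR.
trace_errors() {
  trace_lines "$@" | grep -F 'ERROR'
}

# Example, using the trace ID from the failed start:
#   trace_errors 04767684-091c-11f0-943c-0cc47ac49e64
```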

Oceanbase.txt (171.3 KB)
Hello, here is the output.

The observer process is up on every machine.

Please also send the configuration file from the edit-config step above.

Could you provide the obagent configuration and logs? The logs are under /obagent/log in the deployment directory.

配置.txt (2.7 KB)
ob-agent.tar.gz (3.7 MB)
Hello, everything you asked for is in the attachments.