After restarting observer, it keeps reporting ERROR 8001 (08004): Server is initializing, and the observer log shows failed to get master root server(ret=-4638)

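【Environment】 Production
【OB or other component】 OB
【Version】 3.1.2
【Problem description】
Three-node environment. At about 11:40 one observer node reached 100% memory usage, so the OS killed the observer process; after a restart the process was running but the node was in an abnormal state. At around 13:50 another node also triggered the OOM killer and its observer process was killed.
After that the whole observer cluster could no longer serve reads or writes. OCP kept showing the cluster as under maintenance and would not accept any operation, so I restarted the observer processes from the operating system: killed them, then started /home/admin/oceanbase/bin/observer again. Since then, two nodes return: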
ERROR 8001 (08004): Server is initializing

The other node returns: ERROR 2013 (HY000): Lost connection to MySQL server at 'reading authorization packet', system error: 0

observer.log keeps printing the following:

[2024-11-28 15:32:56.502795] WARN  [STORAGE.TRANS] do_cluster_heartbeat_ (ob_tenant_weak_read_service.cpp:591) [34502][3291][Y0-0000000000000000] [lt=4] [dc=0] tenant weak read service do cluster heartbeat fail(ret=-5019, ret="OB_TABLE_NOT_EXIST", tenant_id_=1009, last_post_cluster_heartbeat_tstamp_=1732779176442541, cluster_heartbeat_interval_=1000000, cluster_service_pkey={tid:1109407232426210, partition_id:0, part_cnt:0}, cluster_service_master="0.0.0.0")
[2024-11-28 15:32:56.521005] WARN  [SERVER] get_master_root_server (ob_service.cpp:3592) [33462][1468][YB42C0A8AA0D-000627F3C322BBE9] [lt=4] [dc=0] not master rootserver(ret=-4638, master_rs="192.168.170.8:2882")
[2024-11-28 15:32:56.521013] WARN  [SERVER] process (ob_rpc_processor_simple.cpp:1957) [33462][1468][YB42C0A8AA0D-000627F3C322BBE9] [lt=8] [dc=0] failed to get master root server(ret=-4638)
[2024-11-28 15:32:56.521302] WARN  [SERVER] fill_partition_replica (ob_service.cpp:638) [33462][1468][YB42C0A8AA0D-000627F3C322BBE9] [lt=2] [dc=0] invalid partition(ret=-4251, part_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.521310] WARN  [SERVER] fill_partition_replica (ob_service.cpp:609) [33462][1468][YB42C0A8AA0D-000627F3C322BBE9] [lt=7] [dc=0] failed to fill_partition_replica(ret=-4251, pg_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.521313] WARN  [SERVER] get_root_server_status (ob_service.cpp:3544) [33462][1468][YB42C0A8AA0D-000627F3C322BBE9] [lt=2] [dc=0] fail to fill partition replica(ret=-4251, partition_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.521663] WARN  [SERVER] fill_partition_replica (ob_service.cpp:638) [33462][1468][YB42C0A8AA0D-000627F3C322BBE9] [lt=3] [dc=0] invalid partition(ret=-4251, part_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.521669] WARN  [SERVER] fill_partition_replica (ob_service.cpp:609) [33462][1468][YB42C0A8AA0D-000627F3C322BBE9] [lt=5] [dc=0] failed to fill_partition_replica(ret=-4251, pg_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.521672] WARN  [SERVER] get_root_server_status (ob_service.cpp:3544) [33462][1468][YB42C0A8AA0D-000627F3C322BBE9] [lt=2] [dc=0] fail to fill partition replica(ret=-4251, partition_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.522272] WARN  [SERVER] get_master_root_server (ob_service.cpp:3592) [33462][1468][YB42C0A8AA0D-000627F3C322BBEB] [lt=3] [dc=0] not master rootserver(ret=-4638, master_rs="192.168.170.8:2882")
[2024-11-28 15:32:56.522278] WARN  [SERVER] process (ob_rpc_processor_simple.cpp:1957) [33462][1468][YB42C0A8AA0D-000627F3C322BBEB] [lt=6] [dc=0] failed to get master root server(ret=-4638)
[2024-11-28 15:32:56.522594] WARN  [SERVER] fill_partition_replica (ob_service.cpp:638) [33462][1468][YB42C0A8AA0D-000627F3C322BBEB] [lt=2] [dc=0] invalid partition(ret=-4251, part_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.522602] WARN  [SERVER] fill_partition_replica (ob_service.cpp:609) [33462][1468][YB42C0A8AA0D-000627F3C322BBEB] [lt=7] [dc=0] failed to fill_partition_replica(ret=-4251, pg_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.522605] WARN  [SERVER] get_root_server_status (ob_service.cpp:3544) [33462][1468][YB42C0A8AA0D-000627F3C322BBEB] [lt=2] [dc=0] fail to fill partition replica(ret=-4251, partition_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.523016] WARN  [SERVER] fill_partition_replica (ob_service.cpp:638) [33462][1468][YB42C0A8AA0D-000627F3C322BBEB] [lt=3] [dc=0] invalid partition(ret=-4251, part_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.523022] WARN  [SERVER] fill_partition_replica (ob_service.cpp:609) [33462][1468][YB42C0A8AA0D-000627F3C322BBEB] [lt=5] [dc=0] failed to fill_partition_replica(ret=-4251, pg_key={tid:1099511627777, partition_id:0, part_cnt:1})
[2024-11-28 15:32:56.523025] WARN  [SERVER] get_root_server_status (ob_service.cpp:3544) [33462][1468][YB42C0A8AA0D-000627F3C322BBEB] [lt=2] [dc=0] fail to fill partition replica(ret=-4251, partition_key={tid:1099511627777, partition_id:0, part_cnt:1})

rootserver.log.wf keeps printing the following:

[2024-11-28 15:33:37.569986] WARN  [RS] follower_process (ob_rs_rpc_processor.h:253) [33559][1654][YB42C0A8AA0A-000627F3A0FD4B4A] [lt=5] [dc=0] not master rootserver
[2024-11-28 15:33:37.569995] WARN  [RS] process_ (ob_rs_rpc_processor.h:233) [33559][1654][YB42C0A8AA0A-000627F3A0FD4B4A] [lt=8] [dc=0] follower process failed(ret=-4638, pcode=1030)
[2024-11-28 15:33:37.591584] WARN  [RS] follower_process (ob_rs_rpc_processor.h:253) [33559][1654][YB42C0A8AA08-000627F39A0B0632] [lt=2] [dc=0] not master rootserver
[2024-11-28 15:33:37.591592] WARN  [RS] process_ (ob_rs_rpc_processor.h:233) [33559][1654][YB42C0A8AA08-000627F39A0B0632] [lt=8] [dc=0] follower process failed(ret=-4638, pcode=1030)
[2024-11-28 15:33:37.592323] WARN  [RS] follower_process (ob_rs_rpc_processor.h:253) [33559][1654][YB42C0A8AA08-000627F39A0B0634] [lt=3] [dc=0] not master rootserver
[2024-11-28 15:33:37.592333] WARN  [RS] process_ (ob_rs_rpc_processor.h:233) [33559][1654][YB42C0A8AA08-000627F39A0B0634] [lt=9] [dc=0] follower process failed(ret=-4638, pcode=1030)
[2024-11-28 15:33:37.638208] WARN  [RS] follower_process (ob_rs_rpc_processor.h:253) [33559][1654][YB42C0A8AA0D-000627F3C322C14F] [lt=3] [dc=0] not master rootserver
[2024-11-28 15:33:37.638215] WARN  [RS] process_ (ob_rs_rpc_processor.h:233) [33559][1654][YB42C0A8AA0D-000627F3C322C14F] [lt=6] [dc=0] follower process failed(ret=-4638, pcode=1030)
[2024-11-28 15:33:37.639541] WARN  [RS] follower_process (ob_rs_rpc_processor.h:253) [33559][1654][YB42C0A8AA0D-000627F3C322C151] [lt=4] [dc=0] not master rootserver
[2024-11-28 15:33:37.639549] WARN  [RS] process_ (ob_rs_rpc_processor.h:233) [33559][1654][YB42C0A8AA0D-000627F3C322C151] [lt=7] [dc=0] follower process failed(ret=-4638, pcode=1030)
[2024-11-28 15:33:37.669241] WARN  [RS] follower_process (ob_rs_rpc_processor.h:253) [33559][1654][YB42C0A8AA0A-000627F3A0FD4B4C] [lt=4] [dc=0] not master rootserver
[2024-11-28 15:33:37.669250] WARN  [RS] process_ (ob_rs_rpc_processor.h:233) [33559][1654][YB42C0A8AA0A-000627F3A0FD4B4C] [lt=8] [dc=0] follower process failed(ret=-4638, pcode=1030)
[2024-11-28 15:33:37.670190] WARN  [RS] follower_process (ob_rs_rpc_processor.h:253) [33559][1654][YB42C0A8AA0A-000627F3A0FD4B4E] [lt=2] [dc=0] not master rootserver
[2024-11-28 15:33:37.670199] WARN  [RS] process_ (ob_rs_rpc_processor.h:233) [33559][1654][YB42C0A8AA0A-000627F3A0FD4B4E] [lt=8] [dc=0] follower process failed(ret=-4638, pcode=1030)
[2024-11-28 15:33:37.691966] WARN  [RS] follower_process (ob_rs_rpc_processor.h:253) [33559][1654][YB42C0A8AA08-000627F39A0B0636] [lt=2] [dc=0] not master rootserver
[2024-11-28 15:33:37.691977] WARN  [RS] process_ (ob_rs_rpc_processor.h:233) [33559][1654][YB42C0A8AA08-000627F39A0B0636] [lt=9] [dc=0] follower process failed(ret=-4638, pcode=1030)
[2024-11-28 15:33:37.692786] WARN  [RS] follower_process (ob_rs_rpc_processor.h:253) [33559][1654][YB42C0A8AA08-000627F39A0B0638] [lt=3] [dc=0] not master rootserver
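
Because the same few warnings repeat many times per second, it can help to tally them by module, function, and return code before reading the logs line by line. The following is a minimal sketch that assumes the line layout seen in the excerpts above (`[timestamp] LEVEL [MODULE] function (file:line) ... ret=<code>`); `summarize` is a hypothetical helper written for this thread, not part of any OceanBase tooling.

```python
import re
from collections import Counter

# Leading fields of a log line as they appear in the excerpts above:
# [timestamp] LEVEL [MODULE] function (file:line)
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>\w+)\s+\[(?P<module>[^\]]+)\]\s+"
    r"(?P<func>\S+)\s+\((?P<src>[^)]+)\)"
)
# First numeric return code on the line, e.g. ret=-4638
RET_RE = re.compile(r"ret=(-?\d+)")


def summarize(lines):
    """Count (module, function, ret) triples over an iterable of log lines.

    Lines that do not match the expected layout are skipped. If a line
    carries no numeric ret= field, the ret part of the key is None.
    """
    counts = Counter()
    for line in lines:
        head = LINE_RE.match(line)
        if head is None:
            continue
        ret = RET_RE.search(line)
        code = int(ret.group(1)) if ret else None
        counts[(head["module"], head["func"], code)] += 1
    return counts
```

Passing an open log file to `summarize` and printing `most_common(10)` on the result gives a quick view of which errors dominate, for example how often -4638 appears compared with -4251.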


observer.rar (3.7 MB)
rootservice.rar (11.2 MB)

Was this deployed with obd? What does the architecture look like? If it was deployed with obd, please post obd.log.

It probably was not deployed with obd.
The architecture is as follows:


The whole cluster is unusable now. This is urgent.

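It looks like the root server has gone down.

Reboot the operating system on all three machines, then start the observer process on each one and see what happens.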


The tenants are unavailable too.

Can the rootserver be restarted on its own?

Yes. In a 1-1-1 deployment, once two zones are down the cluster is unavailable, so the tenants are naturally unavailable as well.
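
The reasoning behind the 1-1-1 case is the majority rule used by Paxos-style replication: a partition can elect a leader only while more than half of its replicas are reachable. A minimal sketch of that arithmetic follows; `has_majority` is a hypothetical helper for illustration, not an OceanBase API.

```python
def has_majority(total_replicas: int, alive_replicas: int) -> bool:
    """Return True if the surviving replicas form a strict majority."""
    return alive_replicas > total_replicas // 2


# 1-1-1 layout: three zones with one replica each.
print(has_majority(3, 2))  # one zone down  -> True, a leader can still be elected
print(has_majority(3, 1))  # two zones down -> False, no leader, cluster unavailable
```

With two of three zones down, only one replica remains, which is not more than half of three, so neither the root service nor any tenant partition can elect a leader.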

So can I restart only the rootserver? I am worried that rebooting the servers will cause bigger problems.

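Are there other applications on the servers? There may be leftover state on them, so I suggest rebooting the servers and then starting observer.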

No other applications, only OB services such as obproxy.

Will obproxy start automatically after the server reboots?

Normally it does not start automatically. To start it manually, see:

https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000001429592


I have just saved the original process parameters. I will check after the reboot, thanks.
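
For readers who want to do the same, one way to capture a running process's startup arguments on Linux is to read `/proc/<pid>/cmdline`, where the arguments are stored separated by NUL bytes. This is a minimal sketch, not necessarily how it was done in this thread; `parse_cmdline` and `read_cmdline` are hypothetical helper names, and the PID has to be looked up separately (for example with `ps`).

```python
from pathlib import Path
from typing import List


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    if not raw:
        return []  # e.g. kernel threads expose an empty cmdline
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminator after the last argument
    return [arg.decode(errors="replace") for arg in raw.split(b"\0")]


def read_cmdline(pid: int) -> List[str]:
    """Read the argument vector of a running process (Linux only)."""
    return parse_cmdline(Path(f"/proc/{pid}/cmdline").read_bytes())
```

Writing the returned list to a file before killing a process keeps a record of how it was originally launched.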

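I have just rebooted all three servers and the cluster still has not recovered. Logging in returns ERROR 2013 (HY000): Lost connection to MySQL server at 'reading authorization packet', system error: 0

The 170.13 node keeps failing to restart. Could you take a look at the logs?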


db13rootservice -log(2).rar (8.4 MB)
observer (2)db13-log.rar (3.9 MB)

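I just clicked Restart Cluster in the OCP topology view; let's see whether that helps.

The start-cluster step keeps reporting that the password is incorrect, although I have never changed the password.
Do I need to reset the password, or is there somewhere I can change it? Thanks.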

@旭辉

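Try following the document below. Also, OCP 312 is the Enterprise Edition, so I recommend contacting Enterprise Edition support to resolve this.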

https://www.oceanbase.com/docs/enterprise-oceanbase-ocp-cn-10000000000380359