【 使用环境 】测试环境
【 OB or 其他组件 】observer
【 使用版本 】4.3.1
【问题描述】1-1-1 架构,一个zone 硬盘故障后,登录sys租户,通过2881端口,发现一间巡检相关命令都是超时,理论上一个节点down后,应该不影响服务,这个如何能恢复下?
【复现路径】排查步骤参考:https://www.oceanbase.com/knowledge-base/oceanbase-database-1000000001230231?back=kb
应该是一直在无主选举,但是排查网络通的,现在是能登录数据库服务,但是任何表的查询都是超时,包括系统表,表的数据量很小,可忽略大数据影响超时
【附件及日志】
注意:A.A.A.A:2882为坏的节点
grep 'T1002_.*{id:1002}' ./election.log | grep 'dump message count' | vim -
[2024-10-30 14:09:26.813293] INFO [ELECT] operator() (election_proposer.cpp:252) [7 50][T1002_Occam][T1002][Y0-0000000000000000-0-0] [lt=56] dump message count(ls_id="{ id:1002}", self_addr="X.X.X.X:2882", state=
match:[Prepare Request, Prepare Response, Accept Request, Accept Response, C hange Leader]
"X.X.X.X:2882":send:[(5191), 2, 265291, 134, 1], rec:[(5189), 392, 134, 2 64466, 2], last_send:2024-10-30 14:09:16.896292, last_rec:2024-10-30 14:09:16.895077
"X.X.X.X:2882":send:[(5191), (5069), 265291, 265291, 0], rec:[(5190), (51 88), 265291, 265291, 0], last_send:2024-10-30 14:09:18.907356, last_rec:2024-10-30 1 4:09:18.907356
"A.A.A.A:2882":send:[8, 0, 172755, 0, 0], rec:[0, 0, 0, 0, 0], last_send: 2024-10-29 11:22:00.316997, last_rec:1970-01-01 08:00:00.0)
[2024-10-30 14:09:36.816412] INFO [ELECT] operator() (election_proposer.cpp:252) [7 50][T1002_Occam][T1002][Y0-0000000000000000-0-0] [lt=57] dump message count(ls_id="{ id:1002}", self_addr="X.X.X.X:2882", state=
match:[Prepare Request, Prepare Response, Accept Request, Accept Response, C hange Leader]
"X.X.X.X:2882":send:[(5192), 2, 265291, 134, 1], rec:[(5190), 392, 134, 2 64466, 2], last_send:2024-10-30 14:09:26.906711, last_rec:2024-10-30 14:09:26.911132
"X.X.X.X:2882":send:[(5192), (5070), 265291, 265291, 0], rec:[(5191), (51 89), 265291, 265291, 0], last_send:2024-10-30 14:09:28.918764, last_rec:2024-10-30 1 4:09:28.918764
"A.A.A.A:2882":send:[8, 0, 172755, 0, 0], rec:[0, 0, 0, 0, 0], last_send: 2024-10-29 11:22:00.316997, last_rec:1970-01-01 08:00:00.0)
7716 [2024-10-30 14:09:47.314463] INFO [ELECT] operator() (election_proposer.cpp:252) [7 50][T1002_Occam][T1002][Y0-0000000000000000-0-0] [lt=83] dump message count(ls_id="{ id:1002}", self_addr="X.X.X.X:2882", state=
match:[Prepare Request, Prepare Response, Accept Request, Accept Response, C hange Leader]
"X.X.X.X:2882":send:[(5194), 2, 265291, 134, 1], rec:[(5192), 392, 134, 2 64466, 2], last_send:2024-10-30 14:09:46.901561, last_rec:2024-10-30 14:09:46.900145
"X.X.X.X:2882":send:[(5194), (5071), 265291, 265291, 0], rec:[(5193), (51 90), 265291, 265291, 0], last_send:2024-10-30 14:09:46.901561, last_rec:2024-10-30 1 4:09:46.901561
"A.A.A.A:2882":send:[8, 0, 172755, 0, 0], rec:[0, 0, 0, 0, 0], last_send: 2024-10-29 11:22:00.316997, last_rec:1970-01-01 08:00:00.0)
7721 [2024-10-30 14:09:57.314108] INFO [ELECT] operator() (election_proposer.cpp:252) [7 50][T1002_Occam][T1002][Y0-0000000000000000-0-0] [lt=30] dump message count(ls_id="{ id:1002}", self_addr="X.X.X.X:2882", state=
match:[Prepare Request, Prepare Response, Accept Request, Accept Response, C hange Leader]
"X.X.X.X:2882":send:[(5195), 2, 265291, 134, 1], rec:[(5193), 392, 134, 2 64466, 2], last_send:2024-10-30 14:09:56.919197, last_rec:2024-10-30 14:09:57.19282
"X.X.X.X:2882":send:[(5195), (5072), 265291, 265291, 0], rec:[(5194), (51 91), 265291, 265291, 0], last_send:2024-10-30 14:09:56.919197, last_rec:2024-10-30 1 4:09:56.919197
"A.A.A.A:2882":send:[8, 0, 172755, 0, 0], rec:[0, 0, 0, 0, 0], last_send: 2024-10-29 11:22:00.316997, last_rec:1970-01-01 08:00:00.0)
[**centos@localhost log]$ grep 'pcode.*538[4567]' ./observer.log**
[2024-10-30 14:05:17.100259] WDIAG [RPC] create (ob_poc_rpc_server.cpp:141) [93][pnio1][T0][XXXXXXXXXXXXXXXX] [lt=8][errcode=0] PNIO packet wait too much time between proxy and server_cb(pcode=5384, fly_ts=207372, send_timestamp=1730268316892886, peer="XXXXXXXX:38002", sz=828)