Deploying an OceanBase cluster through OCP fails at the final step, "create resource manager for default user"

【 Environment 】 Test environment
【 OB or other component 】
【 Version 】 4.34 ocp
【 Problem description 】 Describe the problem clearly and precisely

After OCP itself was deployed, I used it to deploy an OceanBase cluster. The task fails at the last step, "create resource manager for default user". Could someone help me find the cause? Thanks!

Logs
log_task.zip (111.1 KB)

Could you please post observer.log?

observer.log from the OCP host
observer_ocp.log (5.2 MB)

observer.log from the observer node (it is fairly large)
observer_observer.zip (12.1 MB)

The observer.log from the observer node is enough:

[2025-02-12 09:47:35.090347] INFO  [SERVER] init (ob_server.cpp:297) [20888][observer][T0][Y0-0000000000000001-0-0] [lt=11] [OBSERVER_NOTICE] start to init observer
[2025-02-12 09:47:35.098372] WDIAG [COMMON] init_from_os (ob_cpu_topology.cpp:97) [20888][observer][T0][Y0-0000000000000001-0-0] [lt=19][errcode=0] cpu flag is not found(CPU_FLAG_CMDS[i]="grep -E ' sse4_2( |$)' /proc/cpuinfo")
[2025-02-12 09:47:35.105638] WDIAG [COMMON] init_from_os (ob_cpu_topology.cpp:97) [20888][observer][T0][Y0-0000000000000001-0-0] [lt=123][errcode=0] cpu flag is not found(CPU_FLAG_CMDS[i]="grep -E ' avx( |$)' /proc/cpuinfo")
[2025-02-12 09:47:35.119033] WDIAG [COMMON] init_from_os (ob_cpu_topology.cpp:97) [20888][observer][T0][Y0-0000000000000001-0-0] [lt=88][errcode=0] cpu flag is not found(CPU_FLAG_CMDS[i]="grep -E ' avx2( |$)' /proc/cpuinfo")
[2025-02-12 09:47:35.127395] WDIAG [COMMON] init_from_os (ob_cpu_topology.cpp:97) [20888][observer][T0][Y0-0000000000000001-0-0] [lt=90][errcode=0] cpu flag is not found(CPU_FLAG_CMDS[i]="grep -E ' avx512bw( |$)' /proc/cpuinfo")
[2025-02-12 09:47:35.127491] WDIAG [COMMON] CpuFlagSet (ob_cpu_topology.cpp:63) [20888][observer][T0][Y0-0000000000000001-0-0] [lt=95][errcode=0] #flag is not supported
[2025-02-12 09:47:35.127528] WDIAG [COMMON] CpuFlagSet (ob_cpu_topology.cpp:64) [20888][observer][T0][Y0-0000000000000001-0-0] [lt=35][errcode=0] #flag is not supported
[2025-02-12 09:47:35.127547] WDIAG [COMMON] CpuFlagSet (ob_cpu_topology.cpp:65) [20888][observer][T0][Y0-0000000000000001-0-0] [lt=18][errcode=0] #flag is not supported
[2025-02-12 09:47:35.127571] WDIAG [COMMON] CpuFlagSet (ob_cpu_topology.cpp:66) [20888][observer][T0][Y0-0000000000000001-0-0] [lt=17][errcode=0] #flag is not supported
……
ormal_raise_block:                               ; preds = %ob_fail
  %get_exception_class = call i64 @eh_classify_exception(i8* %load_sql_state)
  %get_exception_class.off = add i64 %get_exception_class, -3
  %switch = icmp ult i64 %get_exception_class.off, 2
  br i1 %switch, label %reset_ret_block, label %raise_exception

reset_ret_block:                                  ; preds = %normal_raise_block
  store i32 0, i32* %int_alloca, align 4
  br label %ob_success
}
")
[2025-02-12 09:52:31.059319] WDIAG pkts_sk_consume (handle_io.t.h:57) [22160][pnio1][T0][Y0-0000000000000000-0-0] [lt=37][errcode=0] PNIO do_decode fail: 61
[2025-02-12 09:52:31.059352] INFO  eloop_handle_sock_event (eloop.c:112) [22160][pnio1][T0][Y0-0000000000000000-0-0] [lt=26] PNIO sock destroy: sock=0x7fd6f1e05258, connection=fd:127:local:"0.0.0.0:0":remote:"0.0.0.0:0", err=61
[2025-02-12 09:52:31.059362] WDIAG sock_destroy (eloop.c:80) [22160][pnio1][T0][Y0-0000000000000000-0-0] [lt=8][errcode=0] PNIO epoll_ctl delete fd faild, s=0x7fd6f1e05258, s->fd=127, errno=9
[2025-02-12 09:52:31.059375] WDIAG sock_destroy (eloop.c:86) [22160][pnio1][T0][Y0-0000000000000000-0-0] [lt=6][errcode=0] PNIO close sock fd faild, s=0x7fd6f1e05258, s->fd=127, errno=9
[2025-02-12 09:52:31.059365] INFO  [RPC.OBRPC] do_server_loop (ob_net_keepalive.cpp:498) [22266][KeepAliveServer][T0][Y0-0000000000000000-0-0] [lt=22] socket need_disconn(n=-1, errno=9)
[2025-02-12 09:52:31.059386] INFO  pkts_sk_delete (pkts_sk_factory.h:57) [22160][pnio1][T0][Y0-0000000000000000-0-0] [lt=10] PNIO sk_destroy: s=0x7fd6f1e05258 io=0x7fd6f0c04448
[2025-02-12 09:52:31.059391] INFO  [RPC.OBRPC] do_server_loop (ob_net_keepalive.cpp:528) [22266][KeepAliveServer][T0][Y0-0000000000000000-0-0] [lt=23] server connection closed, fd: 129, addr: "20.46.61.160:23294"
[2025-02-12 09:52:31.059400] WDIAG pkts_sk_consume (handle_io.t.h:57) [22161][pnio1][T0][Y0-0000000000000000-0-0] [lt=12][errcode=0] PNIO do_decode fail: 61
[2025-02-12 09:52:31.059429] INFO  eloop_handle_sock_event (eloop.c:112) [22161][pnio1][T0][Y0-0000000000000000-0-0] [lt=23] PNIO sock destroy: sock=0x7fd6bde04048, connection=fd:132:local:"0.0.0.0:0":remote:"0.0.0.0:0", err=61
[2025-02-12 09:52:31.059447] WDIAG sock_destroy (eloop.c:80) [22161][pnio1][T0][Y0-0000000000000000-0-0] [lt=16][errcode=0] PNIO epoll_ctl delete fd faild, s=0x7fd6bde04048, s->fd=132, errno=9
[2025-02-12 09:52:31.059464] WDIAG sock_destroy (eloop.c:86) [22161][pnio1][T0][Y0-0000000000000000-0-0] [lt=15][errcode=0] PNIO close sock fd faild, s=0x7fd6bde04048, s->fd=132, errno=9
[2025-02-12 09:52:31.059473] INFO  pkts_sk_delete (pkts_sk_factory.h:57) [22161][pnio1][T0][Y0-0000000000000000-0-0] [lt=8] PNIO sk_destroy: s=0x7fd6bde04048 io=0x7fd6f0804448
CRASH ERROR!!! IP=555cd7f4c3a0, RBP=7fd6c2ac9320, sig=4, sig_code=2, sig_addr=0x555cd7f4c3a0, RLIMIT_CORE=unlimited, timestamp=1739325151060333, tid=22533, tname=T1_L0_G0, trace_id=YB42142E3DA0-00062DE8334065B5-0-0, lbt=0x1f96b218 0x1f1b698d 0x7fd711b5f67f 0x8bb63a0 0x9be8a9c 0x9c0812c 0x9c08505 0x9be51fd 0x9a466c5 0xa5f92d9 0xa5fa810 0xa5f85ef 0x924c3cf 0x924cafc 0x9253edc 0x9253edc 0x924ed8d 0x92176d1 0x92177b1 0x9217ad3 0x9224c5c 0x9226bd3 0x9226edb 0x9237427 0x1ef0327a 0x1eee447d 0xebf34b2 0xebf0fc5 0xebe4b4b 0xec2577e 0xec5339b 0xec473c9 0xeaa14cd 0xea7362e 0x14a77faa 0x11bba464 0x7c4fe26 0x7923e9c 0x792151d 0x7cc475e 0x78c853e 0x78b88e3 0x78b1adf 0x78aefec 0x789e118 0xfc77118 0x1f95655d 0x7fd711b57dd4 0x7fd711881b3, SQL_ID=, SQL_STRING=call DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP (CONSUMER_GROUP => 'ocp_monitor_group', COMMENT => 'reserve thread for default user');
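
A note on reading the crash line above: the `sig=4` field is the signal number that killed the process. On Linux, signal 4 is SIGILL (illegal instruction). The following is a minimal sketch for pulling that field out of a crash line and naming it. The helper name `crash_signal` is hypothetical, and the field layout is assumed from the single line shown here, not from any documented log format.

```python
import re
import signal


def crash_signal(line):
    """Return (number, name) for the first `sig=N` field in a crash line, or None."""
    # `sig=` does not match `sig_code=` or `sig_addr=`, so only the signal number is captured.
    match = re.search(r"\bsig=(\d+)", line)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        name = signal.Signals(number).name
    except ValueError:
        # Number not known to this platform's signal table.
        name = "UNKNOWN"
    return number, name


# Shortened copy of the crash line from the log above.
line = ("CRASH ERROR!!! IP=555cd7f4c3a0, RBP=7fd6c2ac9320, sig=4, sig_code=2, "
        "sig_addr=0x555cd7f4c3a0")
print(crash_signal(line))
```

Signal names come from the platform running the script, so this is best run on the same kind of Linux host that produced the log.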

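Most likely the CPU in use does not support the AVX instructions, which the OB kernel relies on. You can run the lscpu command and check the CPU's instruction set flags to confirm.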

Solution:
Switch to a CPU model that supports the AVX instruction set.
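
The CPU flag check can also be scripted. This is a minimal sketch, assuming an x86 Linux host where `/proc/cpuinfo` has a `flags` line. The four flag names are the ones the observer log above probes for with `grep`. The function names are hypothetical, and not every flag in the list is necessarily mandatory: the reply above singles out AVX.

```python
# Flags that the observer startup log above looks for in /proc/cpuinfo.
REQUIRED_FLAGS = ("sse4_2", "avx", "avx2", "avx512bw")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags that are absent, in the order they were listed."""
    present = parse_cpu_flags(cpuinfo_text)
    return [flag for flag in required if flag not in present]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        # Not a Linux host, or /proc is unavailable: every flag is reported missing.
        text = ""
    print("missing flags:", missing_flags(text))
```

If `avx` appears in the output on a target host, the crash described in this thread is to be expected there.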

An obdiag inspection can also detect this.

You can refer to this post.


Thanks! I can see this error now.

A suggestion: it would be best to include this check in the pre-deployment self-check.

Starting with the release after OB 4.3.5, machines without the AVX instruction set will be refused at startup.
