obd cluster start xxx fails with "failed to connect meta db"

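【Environment】Test environment
【OB or other component】OB
【Version】Community Edition 4.1
【Problem description】Running obd cluster start xxx reports "failed to connect meta db"
【Steps to reproduce】The error appears when restarting the cluster
【Symptoms and impact】
Hello, we plan to adjust cluster parameters through OCP. Everything worked after installation, but today's start produced the error below and OCP will not come up.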
Start ocp-express x
[ERROR] 10.10.10.61: failed to connect meta db

[ERROR] ocp-express start failed

The log files are attached.
【Attachments】
observer.rar (5.5 MB)

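Several users have reported this problem. The main cause is that OB initializes slowly on restart, so OCP cannot connect to the OB meta tenant and reports the error. This has been fixed internally, and a new release is expected on June 12.
For now, you can retry a few times, or restart step by step: start the cluster first, then start ocp-express.
See the commands at the link below.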
https://www.oceanbase.com/docs/community-obd-cn-10000000002049482#e8807cfc-b381-437d-bdd8-59ebf802121d

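Thanks for the reply! How do I start only the cluster first and then start ocp-express once the cluster is up? An example script would be appreciated.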

You can use the following, where xxx is the deployment name you used when you deployed the cluster:
obd cluster start xxx -c oceanbase-ce
obd cluster start xxx -c obproxy-ce
obd cluster start xxx -c obagent
obd cluster start xxx -s ip -c ocp-express
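
As a rough sketch of the step-by-step start described above, the script below starts the database components in order and then retries ocp-express a few times, since the earlier reply suggests the failure is a timing issue. The component list, the retry count, and the delay are assumptions for illustration, and the -s option is left out; adjust them to match your own deployment config.

```shell
#!/bin/sh
# Sketch only: start an obd deployment one component at a time.
# OBD, COMPONENTS, RETRIES and RETRY_DELAY are illustrative settings, not obd options.
OBD=${OBD:-obd}
COMPONENTS=${COMPONENTS:-"oceanbase-ce obproxy-ce obagent"}

# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or ATTEMPTS is used up.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# start_in_order DEPLOY_NAME: start each component, then ocp-express with retries.
start_in_order() {
  name=$1
  for comp in $COMPONENTS; do
    "$OBD" cluster start "$name" -c "$comp" || return 1
  done
  retry "${RETRIES:-5}" "${RETRY_DELAY:-30}" "$OBD" cluster start "$name" -c ocp-express
}
```

To use it, call start_in_order xxx with your own deployment name. Setting OBD=echo prints the commands instead of running them, which is a cheap way to check the order first.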

That worked :+1: :handshake:

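I ran into the same problem, and following these steps still does not work.
The check reports the error below; could someone please take a look?
[image]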

Take a look at observer.log. This looks like an initialization problem.
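
A quick way to get an overview of observer.log is to count how often each error code appears. This is a minimal sketch, assuming the log carries the [errcode=-NNNN] field seen in the observer log lines quoted in this thread; count_errcodes is a hypothetical helper name and the log path is whatever your deployment uses.

```shell
#!/bin/sh
# Sketch only: count occurrences of each bracketed [errcode=-NNNN] field in a log file.
# The most frequent code is listed first.
count_errcodes() {
  grep -o '\[errcode=-[0-9]*\]' "$1" | sort | uniq -c | sort -rn
}
```

For example, count_errcodes /path/to/observer.log lists the codes that dominate the log, which helps decide which one to look up first.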

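This cluster had been running for a while. I wanted to restart it, and now it will not come up.
On one of the nodes the observer process did start, but it keeps reporting errors: 4383, 4024, and 5157.
Of these, 4024 is insufficient space (the log reports it as OB_BUF_NOT_ENOUGH).
I wonder whether obd is still starting with the configuration from the first start (see the image below),
even though I changed the resource parameters through OCP Express while the cluster was running, increasing datafile_size, log_disk_size, and memory_limit.
If this start used the original initial configuration, that would explain why it fails.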

The observer error log is as follows:

[2023-11-23 14:33:46.185518] EDIAG [OCCAM] runTimerTask (ob_occam_timer.h:224) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=17][errcode=-4024] fail to register next timer task(temp_ret=0, ret=-4024, ret="OB_BUF_NOT_ENOUGH", *this={this:0x7f1195f8d2b0, caller:election_event_recorder.cpp:report_event_:98, function_type:ZN9oceanbase4palf8election13EventRecorder13report_event_ENS1_17ElectionEventTypeERKNS_6common8ObStringEE5$_163, timer_running_flag:1, total_running_count:34918, func_shared_ptr_.ptr:0x7f1195f8d510, time_wheel:0x7f12ea1fc090, thread_pool:0x7f12ec7ff750, is_running:1, time_interval:15.00s, expected_run_time:14:11:46.175, task_priority:1, with_handle_protected:0}) BACKTRACE:0xd6ac1c9 0x59d39ec 0x59d34ff 0x59d32f6 0x59cafd8 0x5a15c83 0x3c0ea1c 0x3c0dc09 0xdd19610 0xdd1641a 0x7f1332c7de25 0x7f13329ab34d
[2023-11-23 14:33:46.185589] ERROR issue_dba_error (ob_log.cpp:1792) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=0][errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4024, file="ob_occam_timer.h", line_no=224, info="fail to register next timer task")
[2023-11-23 14:33:46.185594] EDIAG [OCCAM] runTimerTask (ob_occam_timer.h:224) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=5][errcode=-4024] fail to register next timer task(temp_ret=0, ret=-4024, ret="OB_BUF_NOT_ENOUGH", *this={this:0x7f118afc0790, caller:election_event_recorder.cpp:report_event_:98, function_type:ZN9oceanbase4palf8election13EventRecorder13report_event_ENS1_17ElectionEventTypeERKNS_6common8ObStringEE5$_163, timer_running_flag:1, total_running_count:34918, func_shared_ptr_.ptr:0x7f118afc09f0, time_wheel:0x7f12ea1fc090, thread_pool:0x7f12ec7ff750, is_running:1, time_interval:15.00s, expected_run_time:14:11:16.176, task_priority:1, with_handle_protected:0}) BACKTRACE:0xd6ac1c9 0x59d39ec 0x59d34ff 0x59d32f6 0x59cafd8 0x5a15c83 0x3c0ea1c 0x3c0dc09 0xdd19610 0xdd1641a 0x7f1332c7de25 0x7f13329ab34d
[2023-11-23 14:33:46.185615] ERROR issue_dba_error (ob_log.cpp:1792) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=0][errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4024, file="ob_occam_timer.h", line_no=224, info="fail to register next timer task")
[2023-11-23 14:33:46.185621] EDIAG [OCCAM] runTimerTask (ob_occam_timer.h:224) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=5][errcode=-4024] fail to register next timer task(temp_ret=0, ret=-4024, ret="OB_BUF_NOT_ENOUGH", *this={this:0x7f1195f8c240, caller:election_event_recorder.cpp:report_event_:98, function_type:ZN9oceanbase4palf8election13EventRecorder13report_event_ENS1_17ElectionEventTypeERKNS_6common8ObStringEE5$_163, timer_running_flag:1, total_running_count:34918, func_shared_ptr_.ptr:0x7f1195f8c4a0, time_wheel:0x7f12ea1fc090, thread_pool:0x7f12ec7ff750, is_running:1, time_interval:15.00s, expected_run_time:14:10:46.176, task_priority:1, with_handle_protected:0}) BACKTRACE:0xd6ac1c9 0x59d39ec 0x59d34ff 0x59d32f6 0x59cafd8 0x5a15c83 0x3c0ea1c 0x3c0dc09 0xdd19610 0xdd1641a 0x7f1332c7de25 0x7f13329ab34d
[2023-11-23 14:33:46.185641] ERROR issue_dba_error (ob_log.cpp:1792) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=0][errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4024, file="ob_occam_timer.h", line_no=224, info="fail to register next timer task")
[2023-11-23 14:33:46.185733] WDIAG [STORAGE.TRANS] process_cluster_heartbeat_rpc (ob_tenant_weak_read_service.cpp:460) [153376][T1001_L0_G0][T1001][YB420A3F2389-00060AC710E99E87-0-0] [lt=43][errcode=-4341] process cluster heartbeat rpc: self is not in cluster service(ret=-4341, ret="OB_NOT_IN_SERVICE", tenant_id_=1001, svr="10.63.35.137:2882", version={val:1700714051590340675}, valid_part_count=1, total_part_count=1, generate_timestamp=1700721226184806)
[2023-11-23 14:33:46.185646] EDIAG [OCCAM] runTimerTask (ob_occam_timer.h:224) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=5][errcode=-4024] fail to register next timer task(temp_ret=0, ret=-4024, ret="OB_BUF_NOT_ENOUGH", *this={this:0x7f1195ffa820, caller:election_event_recorder.cpp:report_event_:98, function_type:ZN9oceanbase4palf8election13EventRecorder13report_event_ENS1_17ElectionEventTypeERKNS_6common8ObStringEE5$_163, timer_running_flag:1, total_running_count:34918, func_shared_ptr_.ptr:0x7f1195ffaa80, time_wheel:0x7f12ea1fc090, thread_pool:0x7f12ec7ff750, is_running:1, time_interval:15.00s, expected_run_time:14:10:16.175, task_priority:1, with_handle_protected:0}) BACKTRACE:0xd6ac1c9 0x59d39ec 0x59d34ff 0x59d32f6 0x59cafd8 0x5a15c83 0x3c0ea1c 0x3c0dc09 0xdd19610 0xdd1641a 0x7f1332c7de25 0x7f13329ab34d
[2023-11-23 14:33:46.185857] WDIAG [STORAGE.TRANS] process_cluster_heartbeat_rpc (ob_weak_read_service.cpp:203) [153376][T1001_L0_G0][T1001][YB420A3F2389-00060AC710E99E87-0-0] [lt=77][errcode=-4341] tenant weak read service process cluster heartbeat RPC fail(ret=-4341, ret="OB_NOT_IN_SERVICE", tenant_id=1001, req={req_server:"10.63.35.137:2882", version:{val:1700714051590340675}, valid_part_count:1, total_part_count:1, generate_timestamp:1700721226184806}, twrs={inited:true, tenant_id:1001, self:"10.63.35.137:2882", svr_version_mgr:{server_version:{version:{val:1700714051590340675}, total_part_count:1, valid_inner_part_count:1, valid_user_part_count:0, epoch_tstamp:1700721226184911}, server_version_for_stat:{version:{val:1700714051590340675}, total_part_count:1, valid_inner_part_count:1, valid_user_part_count:0, epoch_tstamp:1700721226184911}}, cluster_service:{current_version:{val:0}, min_version:{val:0}, max_version:{val:0}}})
[2023-11-23 14:33:46.185920] ERROR issue_dba_error (ob_log.cpp:1792) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=0][errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4024, file="ob_occam_timer.h", line_no=224, info="fail to register next timer task")
[2023-11-23 14:33:46.185928] EDIAG [OCCAM] runTimerTask (ob_occam_timer.h:224) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=8][errcode=-4024] fail to register next timer task(temp_ret=0, ret=-4024, ret="OB_BUF_NOT_ENOUGH", *this={this:0x7f118760bb90, caller:election_event_recorder.cpp:report_event_:98, function_type:ZN9oceanbase4palf8election13EventRecorder13report_event_ENS1_17ElectionEventTypeERKNS_6common8ObStringEE5$_163, timer_running_flag:1, total_running_count:34918, func_shared_ptr_.ptr:0x7f118760bdf0, time_wheel:0x7f12ea1fc090, thread_pool:0x7f12ec7ff750, is_running:1, time_interval:15.00s, expected_run_time:14:09:46.176, task_priority:1, with_handle_protected:0}) BACKTRACE:0xd6ac1c9 0x59d39ec 0x59d34ff 0x59d32f6 0x59cafd8 0x5a15c83 0x3c0ea1c 0x3c0dc09 0xdd19610 0xdd1641a 0x7f1332c7de25 0x7f13329ab34d
[2023-11-23 14:33:46.185954] ERROR issue_dba_error (ob_log.cpp:1792) [152722][GEleTimer][T0][Y0-0000000000000000-0-0] [lt=0][errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4024, file="ob_occam_timer.h", line_no=224, info="fail to register next timer task")

That should be exactly the cause.

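  1. At the initial installation, memory_limit was 6G, and that value was written to the obd configuration
  2. While OB was running, memory_limit was increased through the OCP Express page
  3. This restart then failed
    It only started successfully after memory_limit in the configuration was increased
    Guess: the start used obd's local configuration and requested too little memory, so the running observer could not satisfy the memory its tenants needed, and startup failed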
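
To check the guess above, one option is to print the resource parameters that obd has stored locally and compare them with what was set at runtime. The sketch below only greps a config file. The path used in the usage note is an assumption about where obd keeps the deployment config, and show_resource_params is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print the resource parameters found in an obd deployment config file.
# Matches lines such as "    memory_limit: 6G" at any indentation.
show_resource_params() {
  config_file=$1
  grep -E '^[[:space:]]*(memory_limit|datafile_size|log_disk_size)[[:space:]]*:' "$config_file"
}
```

For example, show_resource_params ~/.obd/cluster/xxx/config.yaml (path assumed). If the values are lower than what was set through OCP Express, obd cluster edit-config xxx can be used to raise them before starting again.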

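I first deployed with the obd web UI. After obproxy-ce and ocp-express failed to deploy, I started them individually with your method and that succeeded. However, obd cluster list still shows the cluster status as deployed, and I cannot use obd cluster display clusterName to view the details. How can I fix this?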

Start the whole deployment once: obd cluster start myoceanbase