How to start an OB cluster from the command line

shiyh · May 17, 2024, 18:26

Current situation:
An OB cluster (1+1+1) was deployed with OCP, and it was stopped successfully using "Stop Cluster" in OCP.

After that, the machine hosting OCP failed, and OCP can no longer be used.

I now want to start the OB cluster from the command line and tried the following commands:
node1:
(admin)$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.41 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10002 -n obcluster1
node2:
(admin)$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.42 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10002 -n obcluster1
node3:
(admin)$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.43 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10002 -n obcluster1

However, startup failed.

How do I start an OCP-deployed OB cluster from the command line?

王利博 · May 17, 2024, 18:34

Restart OCP (OceanBase Cloud Platform)

shiyh · May 17, 2024, 18:35

The machine hosting OCP has failed, and the OCP environment has been destroyed.

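dingfeng · May 17, 2024, 22:42

Generally, once observer has been started for the first time through OCP, its parameter configuration is saved, so you can simply run observer with no arguments.
If you do start it with command-line arguments, check carefully that every argument is correct.
For example, if this environment is a 1-1-1 cluster, the -z arguments of the three nodes should be zone1, zone2, and zone3 respectively; they cannot all be zone1.
Finally, I suggest collecting logs with obdiag and posting them here; otherwise there is too little information to diagnose the problem.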
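
To make the -z point concrete, here is a minimal sketch that only prints one start command per node, each with its own zone. The helper name observer_cmd is hypothetical; the IPs, ports, paths, cluster name, and cluster ID are copied from the commands in the question, and the IP-to-zone mapping is an assumption that has to match what the cluster was originally bootstrapped with.

```shell
#!/usr/bin/env bash
# Sketch only: prints the start command for one node; it does not run observer.
# Values below are copied from the question; adjust them to the real deployment.
RS_LIST='192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881'
OB_HOME=/home/admin/oceanbase

# observer_cmd IP ZONE CLUSTER_ID  (hypothetical helper name)
observer_cmd() {
  local ip="$1" zone="$2" cluster_id="$3"
  printf "cd %s && bin/observer -I %s -P 2882 -p 2881 -z %s -d %s/store/obcluster1 -r '%s' -c %s -n obcluster1\n" \
    "$OB_HOME" "$ip" "$zone" "$OB_HOME" "$RS_LIST" "$cluster_id"
}

# Assumed mapping for a 1-1-1 cluster: one node per zone.
observer_cmd 192.168.10.41 zone1 10002
observer_cmd 192.168.10.42 zone2 10002
observer_cmd 192.168.10.43 zone3 10002
```

Printing the commands first makes it easy to compare them side by side before running anything on the nodes.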

shiyh · May 18, 2024, 09:00

observer_node1.log (1.0 MB)
observer_node2.log (1.0 MB)
observer_node3.log (1.0 MB)
1. With the following commands, the OB cluster still fails to start:
node1:
cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.41 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
node2:
cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.42 -P 2882 -p 2881 -z zone2 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
node3:
cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.43 -P 2882 -p 2881 -z zone3 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1

Because the OB cluster currently cannot start, collecting logs with obdiag fails with errors.
[admin@ob1 oceanbase]$ obdiag gather scene run --scene=observer.unknown
gather_scenes_run start …
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
gather from_time: 2024-05-18 07:30:43, to_time: 2024-05-18 08:01:43
execute tasks: observer.unknown
run scene excute yaml mode in node: 192.168.10.41 start
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
[ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
I collected observer.log from all three nodes manually.

Could you please help me find the cause? Thanks!

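dingfeng · May 18, 2024, 09:27

Can you confirm that the observer processes have all been killed? The logs contain the following errors:
1) Failure to bind a port: [errcode=-4004] observer start fail(ret=-4004)
2) Unstopped threads were detected: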
============= [BEFORE_DESTROY] begin to show unstopped thread =============
[BEFORE_DESTROY] detect unstopped thread, tid: 113564, name: observer
[BEFORE_DESTROY] detect unstopped thread, tid: 113574, name: IO_TUNING0
[BEFORE_DESTROY] detect unstopped thread, tid: 113575, name: IO_GETEVENT0
[BEFORE_DESTROY] detect unstopped thread, tid: 113576, name: IO_GETEVENT0
……
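
The two checks suggested above (leftover observer processes, and ports that cannot be bound) can be sketched as follows. The function names port_listening and check_node are hypothetical; ports 2881 and 2882 are taken from the start commands in this thread, and the sketch assumes a Linux host with ss and pgrep available.

```shell
#!/usr/bin/env bash
# port_listening PORT: reads `ss -ltn` output on stdin and succeeds if any
# listening socket uses PORT. Column 4 of that output is "Local Address:Port".
port_listening() {
  awk -v port="$1" '
    NR > 1 { n = split($4, a, ":"); if (a[n] == port) found = 1 }
    END    { exit found ? 0 : 1 }'
}

# check_node: report leftover observer processes and listeners on the OB ports.
check_node() {
  pgrep -a -x observer || echo "no observer process found"
  for port in 2881 2882; do
    if ss -ltn | port_listening "$port"; then
      echo "port $port is in use"
    else
      echo "port $port is free"
    fi
  done
}
```

Running check_node on each node before a start attempt shows whether anything still holds the ports. To see which process owns a listener, `ss -ltnp` run as root adds a process column.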

shiyh · May 18, 2024, 13:20

The observer processes have definitely all been killed. I just tried starting again and it still fails. The newly generated observer.log still contains "detect unstopped thread ..."

[admin@ob1 log]$ ps -ef |grep ob
admin 158798 158369 0 13:12 pts/0 00:00:00 grep ob
[admin@ob1 log]$
[admin@ob1 log]$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.41 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
/home/admin/oceanbase/bin/observer -I 192.168.10.41 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881 -c 10010 -n obcluster1
local_ip: 192.168.10.41
rpc port: 2882
mysql port: 2881
zone: zone1
data_dir: /home/admin/oceanbase/store/obcluster1
rs list: 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881
cluster id: 10010
appname: obcluster1
[admin@ob1 oceanbase]$

[admin@ob2 log]$
[admin@ob2 log]$
[admin@ob2 log]$ ps -ef |grep ob
admin 148617 148042 0 13:15 pts/0 00:00:00 grep ob
[admin@ob2 log]$
[admin@ob2 log]$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.42 -P 2882 -p 2881 -z zone2 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
/home/admin/oceanbase/bin/observer -I 192.168.10.42 -P 2882 -p 2881 -z zone2 -d /home/admin/oceanbase/store/obcluster1 -r 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881 -c 10010 -n obcluster1
local_ip: 192.168.10.42
rpc port: 2882
mysql port: 2881
zone: zone2
data_dir: /home/admin/oceanbase/store/obcluster1
rs list: 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881
cluster id: 10010
appname: obcluster1
[admin@ob2 oceanbase]$

[admin@ob3 log]$
[admin@ob3 log]$ ps -ef |grep ob
admin 148932 148345 0 13:16 pts/0 00:00:00 grep ob
[admin@ob3 log]$
[admin@ob3 log]$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.43 -P 2882 -p 2881 -z zone3 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
/home/admin/oceanbase/bin/observer -I 192.168.10.43 -P 2882 -p 2881 -z zone3 -d /home/admin/oceanbase/store/obcluster1 -r 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881 -c 10010 -n obcluster1
local_ip: 192.168.10.43
rpc port: 2882
mysql port: 2881
zone: zone3
data_dir: /home/admin/oceanbase/store/obcluster1
rs list: 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881
cluster id: 10010
appname: obcluster1
[admin@ob3 oceanbase]$
[admin@ob3 oceanbase]$

shiyh · May 18, 2024, 13:25

observer_log_node1.tar.gz (108.1 KB)
observer_log_node2.tar.gz (108.0 KB)
observer_log_node3.tar.gz (108.5 KB)

I just emptied the /home/admin/oceanbase/log directory on every node and then tried starting observer again. The attachments contain all the newly generated logs.

王利博 · May 18, 2024, 16:25

So the OCP environment can no longer be recovered, is that right?

shiyh · May 18, 2024, 16:28

Yes. The OCP machine failed, and the entire OCP environment has been deleted.

王利博 · May 18, 2024, 16:51

Was the OB cluster deployed by OCP?
You could try setting up a new OCP to take over this OB cluster.

shiyh · May 18, 2024, 17:07

The OB cluster was originally deployed with OCP. It currently cannot start; can a new OCP still take it over?

王利博 · May 18, 2024, 17:11

OCP is just a deployment and management tool. You can give it a try.

dingfeng · May 19, 2024, 21:37

The errors are essentially unchanged:
[errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4015, file="ob_sql_nio.cpp", line_no=654, info="sql nio bind listen fd failed")
[errcode=-4004] listen create fail(ret=-4004, port=2881, errno=98, errmsg="Address already in use")

If nothing else works, try rebooting the server.
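
On Linux, errno=98 is EADDRINUSE, meaning another socket was already bound to port 2881 when observer tried to listen on it. As a small sketch, the failure lines quoted in this thread can be pulled out of a log with grep. The helper name startup_errors is hypothetical, and the search patterns are simply the message fragments quoted in this thread, not an exhaustive list of startup errors.

```shell
#!/usr/bin/env bash
# startup_errors LOGFILE: print lines matching the failures quoted in this thread.
startup_errors() {
  grep -E 'observer start fail|listen create fail|Address already in use|detect unstopped thread' "$1"
}
```

For example, `startup_errors /home/admin/oceanbase/log/observer.log | tail -n 20` shows the most recent matches on a node after a failed start attempt.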

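chris-sun · May 22, 2024, 17:11

If the cluster has already been bootstrapped, do not pass any startup parameters; just start it with cd /home/admin/oceanbase && bin/observer. For "Address already in use", check whether OB's ports are occupied; after killing the process that holds them, start observer with the command above.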
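
A minimal sketch of that sequence, to be run on each node: refuse to start if an observer process is still around, otherwise start observer from its install directory with no arguments. The function name start_observer is hypothetical, and the default path is the one used throughout this thread.

```shell
#!/usr/bin/env bash
# start_observer [OB_HOME]: start an already-bootstrapped observer with no arguments.
start_observer() {
  local home="${1:-/home/admin/oceanbase}"
  if pgrep -x observer >/dev/null 2>&1; then
    echo "an observer process is still running; stop it first" >&2
    return 1
  fi
  # Run from the install directory, as in the command above; the subshell
  # keeps the caller's working directory unchanged.
  ( cd "$home" && bin/observer )
}
```

After calling start_observer, `pgrep -a -x observer` and the latest lines of log/observer.log indicate whether the process stayed up.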