How to start an OB cluster from the command line

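Current situation:
An OB cluster (1+1+1) was deployed with OCP, and it was stopped successfully with "Stop Cluster" in OCP.

Afterwards, the machine hosting OCP was damaged, and OCP is no longer usable.

I now want to start the OB cluster from the command line and tried the following commands: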
node1:
(admin)$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.41 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10002 -n obcluster1
node2:
(admin)$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.42 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10002 -n obcluster1
node3:
(admin)$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.43 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10002 -n obcluster1

However, startup failed.

How can an OB cluster that was deployed with OCP be started from the command line?

Restart OCP-OceanBase Cloud Platform

The machine hosting OCP has failed, and the OCP environment has been destroyed.

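Generally, once observer has been started through OCP for the first time, its parameter configuration is saved, so you can start it by running observer directly, with no parameters.
If you do start it with command-line parameters, check carefully that every parameter is correct.
For example, if this environment is a 1-1-1 cluster, the -z parameter on the three nodes should be zone1, zone2, and zone3 respectively; they cannot all be zone1.
Finally, please collect logs with obdiag and post them here; otherwise there is too little information to diagnose the problem.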
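
To make the -z point concrete, here is a minimal sketch that only prints one start command per node, so the -I and -z values can be compared before anything is run. The paths, ports, and rs list are the ones posted in this thread. The helper name observer_cmd, the node-to-zone mapping (node N in zoneN), and the CLUSTER_ID placeholder are assumptions for illustration and need to be checked against the real deployment.

```shell
#!/bin/sh
# Sketch: print (do not run) the observer start command for each node.
RS_LIST='192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881'

# observer_cmd IP ZONE CLUSTER_ID  (hypothetical helper)
observer_cmd() {
  printf "cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I %s -P 2882 -p 2881 -z %s -d /home/admin/oceanbase/store/obcluster1 -r '%s' -c %s -n obcluster1\n" \
    "$1" "$2" "$RS_LIST" "$3"
}

# Assumption: node N belongs to zoneN. CLUSTER_ID is a placeholder and must
# be set to the cluster's real ID before the printed command is used.
CLUSTER_ID="${CLUSTER_ID:-CLUSTER_ID_HERE}"
n=1
for ip in 192.168.10.41 192.168.10.42 192.168.10.43; do
  observer_cmd "$ip" "zone$n" "$CLUSTER_ID"
  n=$((n + 1))
done
```

Each printed line is meant to be reviewed and then run on the matching node as the admin user; the script itself starts nothing.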


observer_node1.log (1.0 MB)
observer_node2.log (1.0 MB)
observer_node3.log (1.0 MB)
1. With the following commands, the OB cluster still fails to start:
node1:
cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.41 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
node2:
cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.42 -P 2882 -p 2881 -z zone2 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
node3:
cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.43 -P 2882 -p 2881 -z zone3 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1

  1. Because the OB cluster currently cannot start, collecting logs with obdiag fails with errors.
    [admin@ob1 oceanbase]$ obdiag gather scene run --scene=observer.unknown
    gather_scenes_run start …
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    gather from_time: 2024-05-18 07:30:43, to_time: 2024-05-18 08:01:43
    execute tasks: observer.unknown
    run scene excute yaml mode in node: 192.168.10.41 start
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] StepSQLHandler execute Exception: 'NoneType' object has no attribute 'cursor'
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')
    [ERROR] connect OB: 192.168.10.41:2883 with user root@sys#obcluster1 failed, error:(2013, 'Lost connection to MySQL server during query')

  2. I manually collected observer.log from all three nodes.

Could you please help check what the cause is? Thanks!

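Can you confirm that the observer processes were killed completely? The logs contain the following errors:
1) Failed to bind the port: errcode=-4004] observer start fail(ret=-4004)
2) Unstopped threads were detected: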
============= [BEFORE_DESTROY] begin to show unstopped thread =============
[BEFORE_DESTROY] detect unstopped thread, tid: 113564, name: observer
[BEFORE_DESTROY] detect unstopped thread, tid: 113574, name: IO_TUNING0
[BEFORE_DESTROY] detect unstopped thread, tid: 113575, name: IO_GETEVENT0
[BEFORE_DESTROY] detect unstopped thread, tid: 113576, name: IO_GETEVENT0
……
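
A minimal sketch of one way to check this on each node: list processes by exact command name rather than by substring. The helper name find_observer_pids is hypothetical, and the sketch assumes a Linux ps that accepts -eo pid=,comm= (as procps does).

```shell
#!/bin/sh
# find_observer_pids  (hypothetical helper)
# Reads "PID COMMAND" lines on stdin and prints the PIDs whose command name
# is exactly "observer", so the grep process itself and other names that
# merely contain "ob" are not matched.
find_observer_pids() {
  awk '$2 == "observer" { print $1 }'
}

# pid= and comm= suppress the header line; comm is the bare executable name.
ps -eo pid=,comm= | find_observer_pids
```

No output means no process named observer is running on that node; any PID printed is a leftover to investigate before starting again.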

The observer processes have definitely been killed. I just tried starting again, and it still fails. The newly generated observer.log still contains "detect unstopped thread" …

[admin@ob1 log]$ ps -ef |grep ob
admin 158798 158369 0 13:12 pts/0 00:00:00 grep ob
[admin@ob1 log]$
[admin@ob1 log]$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.41 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
/home/admin/oceanbase/bin/observer -I 192.168.10.41 -P 2882 -p 2881 -z zone1 -d /home/admin/oceanbase/store/obcluster1 -r 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881 -c 10010 -n obcluster1
local_ip: 192.168.10.41
rpc port: 2882
mysql port: 2881
zone: zone1
data_dir: /home/admin/oceanbase/store/obcluster1
rs list: 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881
cluster id: 10010
appname: obcluster1
[admin@ob1 oceanbase]$

[admin@ob2 log]$
[admin@ob2 log]$
[admin@ob2 log]$ ps -ef |grep ob
admin 148617 148042 0 13:15 pts/0 00:00:00 grep ob
[admin@ob2 log]$
[admin@ob2 log]$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.42 -P 2882 -p 2881 -z zone2 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
/home/admin/oceanbase/bin/observer -I 192.168.10.42 -P 2882 -p 2881 -z zone2 -d /home/admin/oceanbase/store/obcluster1 -r 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881 -c 10010 -n obcluster1
local_ip: 192.168.10.42
rpc port: 2882
mysql port: 2881
zone: zone2
data_dir: /home/admin/oceanbase/store/obcluster1
rs list: 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881
cluster id: 10010
appname: obcluster1
[admin@ob2 oceanbase]$

[admin@ob3 log]$
[admin@ob3 log]$ ps -ef |grep ob
admin 148932 148345 0 13:16 pts/0 00:00:00 grep ob
[admin@ob3 log]$
[admin@ob3 log]$ cd /home/admin/oceanbase && /home/admin/oceanbase/bin/observer -I 192.168.10.43 -P 2882 -p 2881 -z zone3 -d /home/admin/oceanbase/store/obcluster1 -r '192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881' -c 10010 -n obcluster1
/home/admin/oceanbase/bin/observer -I 192.168.10.43 -P 2882 -p 2881 -z zone3 -d /home/admin/oceanbase/store/obcluster1 -r 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881 -c 10010 -n obcluster1
local_ip: 192.168.10.43
rpc port: 2882
mysql port: 2881
zone: zone3
data_dir: /home/admin/oceanbase/store/obcluster1
rs list: 192.168.10.41:2882:2881;192.168.10.42:2882:2881;192.168.10.43:2882:2881
cluster id: 10010
appname: obcluster1
[admin@ob3 oceanbase]$
[admin@ob3 oceanbase]$

observer_log_node1.tar.gz (108.1 KB)
observer_log_node2.tar.gz (108.0 KB)
observer_log_node3.tar.gz (108.5 KB)

I just emptied the /home/admin/oceanbase/log directory on each node and then tried to start observer again. The attachments contain all of the newly generated logs.
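
The start command in the transcripts above returns to the prompt without printing an error, so the outcome has to be read from the log. A minimal sketch for pulling out the lines that carry errcode -4004, the code quoted earlier in this thread. The helper name start_failures is hypothetical, and the log path is an assumption built from the log directory and the observer.log file name mentioned here.

```shell
#!/bin/sh
# start_failures [FILE...]  (hypothetical helper)
# Prints log lines containing errcode -4004; reads stdin when no file is given.
# "|| true" keeps the exit status at 0 when nothing matches.
start_failures() {
  grep -F 'errcode=-4004' "$@" || true
}

# Assumed log location on each node.
LOG=/home/admin/oceanbase/log/observer.log
if [ -f "$LOG" ]; then
  # Show only the most recent matches.
  start_failures "$LOG" | tail -n 5
fi
```

Running it right after a start attempt on a freshly emptied log directory shows whether that attempt hit the same error.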

So the OCP environment can no longer be recovered, is that right?

Yes. The OCP machine failed, and the entire OCP environment has been deleted.

Was the OB cluster deployed with OCP?
You could try setting up a new OCP to take over this OB cluster.

The OB cluster was originally deployed with OCP. Since the cluster currently cannot start, can a new OCP still take it over?

OCP is only a deployment and management tool. You can give it a try.

The errors are essentially unchanged:
[errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4015, file="ob_sql_nio.cpp", line_no=654, info="sql nio bind listen fd failed")
[errcode=-4004] listen create fail(ret=-4004, port=2881, errno=98, errmsg="Address already in use")

If nothing else works, try rebooting the servers.

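If the cluster has already been bootstrapped, do not pass startup parameters again; start it directly with cd /home/admin/oceanbase && bin/observer. For "Address already in use", check whether the OB ports are occupied, kill the process that occupies them, and then start with the command above.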
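
A minimal sketch of the port check, assuming the ss utility from iproute2 is available on the nodes. The helper name listeners_on_port is hypothetical; ports 2881 and 2882 are the ones used in the commands in this thread.

```shell
#!/bin/sh
# listeners_on_port PORT  (hypothetical helper)
# Reads `ss -ltn` output on stdin (header line included) and prints every
# local address that is listening on PORT.
listeners_on_port() {
  awk -v port="$1" 'NR > 1 { n = split($4, a, ":"); if (a[n] == port) print $4 }'
}

for port in 2881 2882; do
  # -l listening sockets, -t TCP, -n numeric ports.
  ss -ltn | listeners_on_port "$port"
done

# Adding -p (ss -ltnp) also shows the owning process; seeing processes of
# other users usually requires root. Once nothing is listening any more,
# start observer without parameters, as described above:
#   cd /home/admin/oceanbase && bin/observer
```

If the loop prints nothing, neither port has a listener and the bind error should not recur for that reason. If it prints an address, identify the owner with ss -ltnp before killing anything.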