OCP cluster failed to start

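【 Environment 】 Test environment
【 OB or other components 】 obd, ocp
【 Version 】 ob4.3.3
【 Problem description 】
After stopping the OCP cluster with obd and starting it again, it no longer comes up. Could someone take a look and tell me whether it can still be recovered?
【 Steps to reproduce 】
【 Attachments and logs 】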

[2024-11-13 19:15:32.318] [DEBUG] -- connect 192.168.10.100 -P2881 -uroot -p******
[2024-11-13 19:15:32.322] [ERROR] Traceback (most recent call last):
[2024-11-13 19:15:32.322] [ERROR]   File "pymysql/connections.py", line 613, in connect
[2024-11-13 19:15:32.322] [ERROR]   File "socket.py", line 808, in create_connection
[2024-11-13 19:15:32.322] [ERROR]   File "socket.py", line 796, in create_connection
[2024-11-13 19:15:32.322] [ERROR] ConnectionRefusedError: [Errno 111] Connection refused
[2024-11-13 19:15:32.322] [ERROR]
[2024-11-13 19:15:32.322] [ERROR] During handling of the above exception, another exception occurred:
[2024-11-13 19:15:32.322] [ERROR]
[2024-11-13 19:15:32.322] [ERROR] Traceback (most recent call last):
[2024-11-13 19:15:32.322] [ERROR]   File "core.py", line 2104, in start_cluster
[2024-11-13 19:15:32.322] [ERROR]   File "core.py", line 2204, in _start_cluster
[2024-11-13 19:15:32.322] [ERROR]   File "core.py", line 198, in call_plugin
[2024-11-13 19:15:32.322] [ERROR]   File "_plugin.py", line 348, in __call__
[2024-11-13 19:15:32.322] [ERROR]   File "_plugin.py", line 305, in _new_func
[2024-11-13 19:15:32.323] [ERROR]   File "/root/.obd/plugins/oceanbase-ce/4.2.1.4/connect.py", line 636, in connect
[2024-11-13 19:15:32.323] [ERROR]     cursor = Cursor(ip=server.ip, port=server_config['mysql_port'], tenant='', password=password if password is not None else '', stdio=stdio)
[2024-11-13 19:15:32.323] [ERROR]   File "_stdio.py", line 969, in wrapper
[2024-11-13 19:15:32.323] [ERROR]   File "_stdio.py", line 956, in func_wrapper
[2024-11-13 19:15:32.323] [ERROR]   File "/root/.obd/plugins/oceanbase-ce/4.2.1.4/connect.py", line 528, in __init__
[2024-11-13 19:15:32.323] [ERROR]     self._connect()
[2024-11-13 19:15:32.323] [ERROR]   File "/root/.obd/plugins/oceanbase-ce/4.2.1.4/connect.py", line 558, in _connect
[2024-11-13 19:15:32.323] [ERROR]     self.db = mysql.connect(host=self.ip, user=self.user, port=int(self.port), password=str(self.password),
[2024-11-13 19:15:32.323] [ERROR]   File "pymysql/connections.py", line 353, in __init__
[2024-11-13 19:15:32.323] [ERROR]   File "pymysql/connections.py", line 664, in connect
[2024-11-13 19:15:32.323] [ERROR] pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '192.168.10.100' ([Errno 111] Connection refused)")
[2024-11-13 19:15:32.323] [ERROR]
[2024-11-13 19:15:35.400] [ERROR] OBD-1006: Failed to connect to oceanbase-ce
[2024-11-13 19:15:35.400] [DEBUG] - sub connect ref count to 0
[2024-11-13 19:15:35.400] [DEBUG] - export connect
[2024-11-13 19:15:35.401] [DEBUG] - plugin oceanbase-ce-py_script_connect-4.2.1.4 result: False
[2024-11-13 19:15:35.408] [INFO] See https://www.oceanbase.com/product/ob-deployer/error-codes .
[2024-11-13 19:15:35.409] [INFO] Trace ID: be805b46-a1af-11ef-9bfd-000c290236d4
[2024-11-13 19:15:35.409] [INFO] If you want to view detailed obd logs, please run: obd display-trace be805b46-a1af-11ef-9bfd-000c290236d4
[2024-11-13 19:15:35.409] [DEBUG] - unlock /root/.obd/lock/global
[2024-11-13 19:15:35.409] [DEBUG] - unlock /root/.obd/lock/deploy_ocp_cluster
[2024-11-13 19:15:35.410] [DEBUG] - unlock /root/.obd/lock/mirror_and_repo

日志.log (37.9 KB)
ocp-server.log (16 KB)
observer.rar (6.7 MB)


Has the root password been changed? Does the root password match the root_password configured in the yaml?
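
Worth noting: Errno 111 (Connection refused) in the obd log above is raised at the TCP level, before any login is attempted, so it usually points to nothing listening on port 2881 rather than to a wrong password. A minimal Python sketch to confirm that; the helper name port_is_open is made up for illustration:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake only; no MySQL login happens here.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (Errno 111) and timeouts are both OSError subclasses.
        return False
```

For example, if port_is_open("192.168.10.100", 2881) returns False on the OCP host, the observer process is most likely not running or not yet listening, and the password is not the first thing to check.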


The password has not been changed.
The yaml file:

[root@ocp ocp-server]# cd /root/.obd/cluster/ocp_cluster
[root@ocp ocp_cluster]# ll
total 8
-rw-------. 1 root root 2176 Nov  7 21:31 config.yaml
-rw-r--r--. 1 root root  130 Nov  7 21:31 inner_config.yaml
[root@ocp ocp_cluster]# more config.yaml
user:
  username: admin
  password: 1qaz@wsx3edC
  port: 22
oceanbase-ce:
  version: 4.2.1.8
  release: 108000022024072217.el7
  package_hash: 499b676f2ede5a16e0c07b2b15991d1160d972e8
  192.168.10.100:
    zone: zone1
  servers:
  - 192.168.10.100
  global:
    appname: ocp_cluster
    root_password: 1qaz@wsx3edC
    mysql_port: 2881
    rpc_port: 2882
    home_path: /home/admin/oceanbase
    data_dir: /data/1
    redo_dir: /data/log1
    datafile_size: 100GB
    datafile_maxsize: 190GB
    datafile_next: 2GB
    log_disk_size: 50GB
    max_syslog_file_count: '10'
    memory_limit: 8GB
    system_memory: 2GB
    cpu_count: 16
    ocp_meta_tenant:
      tenant_name: ocp_meta
      max_cpu: 2.0
      memory_size: 2G
    ocp_meta_username: root
    ocp_meta_password: 1qaz@wsx3edC
    ocp_meta_db: meta_database
    ocp_monitor_tenant:
      tenant_name: ocp_monitor
      max_cpu: 2.0
      memory_size: 2G
    ocp_monitor_username: root
    ocp_monitor_password: 1qaz@wsx3edC
    ocp_monitor_db: monitor_database
    cluster_id: 1730985893
    proxyro_password: sTswllqktk
    ocp_root_password: QlECJQsqry
    ocp_meta_tenant_log_disk_size: 6G
    enable_syslog_wf: false
    production_mode: false
obproxy-ce:
  version: 4.3.2.0
  package_hash: fd779e401be448715254165b1a4f7205c4c1bda5
  release: 26.el7
  servers:
  - 192.168.10.100
  global:
    home_path: /home/admin/obproxy
    prometheus_listen_port: 2884
    listen_port: 2883
    enable_obproxy_rpc_service: false
    obproxy_sys_password: 1qaz@wsx3edC
    skip_proxy_sys_private_check: true
    enable_strict_kernel_release: false
    enable_cluster_checkout: false
  depends:
  - oceanbase-ce
  192.168.10.100:
    proxy_id: 4874
    client_session_id_version: 2
ocp-server-ce:
  version: 4.3.2
  package_hash: 610610e2daf63f6df08af686f9a88b6d8cefcc52
  release: 20241012145836.el7
  servers:
  - 192.168.10.100
  global:
    home_path: /home/admin/ocp
    soft_dir: /home/admin/software
    log_dir: /home/admin/logs
    ocp_site_url: http://192.168.10.100:8080
    port: 8080
    admin_password: dhiQDR{x
    memory_size: 2G
    manage_info:
      machine: 10
  depends:
  - oceanbase-ce
  - obproxy-ce

observer.config.bin.txt (906 bytes)

observer.config.bin.history.txt (906 bytes)


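I also ran into OCP failing to start after a reboot, although my error looks different from yours. I ended up reinstalling. :joy:
OCP inaccessible after rebooting the host - Community Q&A - OceanBase Community - Distributed Database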


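The log contains the error observer start fail(ret=-4052).

  • Cause: an invalid log entry, usually because a hardware or disk error caused incorrect log data to be read.

Check whether the disk is damaged or data files are missing.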
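
To see where and how often the error shows up, the observer log can be filtered by return code. A rough Python sketch; the function name find_error_lines is hypothetical, and the log location /home/admin/oceanbase/log/observer.log is an assumption derived from home_path in the yaml plus the usual log/ subdirectory:

```python
import re
from typing import Iterable, List

# Matches negative return codes such as "ret=-4052".
ERROR_PATTERN = re.compile(r"ret=(-\d+)")


def find_error_lines(lines: Iterable[str], code: int = -4052) -> List[str]:
    """Return the lines that contain 'ret=<code>' for the given error code."""
    hits = []
    for line in lines:
        for match in ERROR_PATTERN.finditer(line):
            if int(match.group(1)) == code:
                hits.append(line.rstrip("\n"))
                break  # one hit per line is enough
    return hits


# Usage (path is an assumption, adjust to your deployment):
# with open("/home/admin/oceanbase/log/observer.log", errors="replace") as f:
#     for hit in find_error_lines(f):
#         print(hit)
```

The lines just before the first hit usually say which file or log entry was being read when the error occurred.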


How can I tell whether data files are missing?
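
One rough first check, as a sketch only: look at whether the directories configured as data_dir (/data/1) and redo_dir (/data/log1) in the yaml still exist and still contain files. The helper summarize_dir below is made up for illustration and says nothing about whether the file contents are intact:

```python
import os
from typing import Dict


def summarize_dir(path: str) -> Dict[str, int]:
    """Count regular files under path and sum their sizes in bytes."""
    files = 0
    total = 0
    # os.walk yields nothing for a missing directory, so the counts stay at 0.
    for root, _dirs, names in os.walk(path):
        for name in names:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                files += 1
                total += os.path.getsize(full)
    return {"exists": int(os.path.isdir(path)), "files": files, "bytes": total}


# Usage with the paths from config.yaml:
# for p in ("/data/1", "/data/log1"):
#     print(p, summarize_dir(p))
```

A missing or empty directory would suggest that files were lost. A non-empty one does not prove the data is healthy, so it is still worth checking the kernel messages (for example with dmesg) for disk I/O errors.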



When I start the cluster with the obd cluster start ocp_cluster command, the observer process disappears on its own after about one minute.


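The log contains REACH SYSLOG RATE LIMIT, which means log output is being rate-limited. Most likely some parameters in the yaml file are misconfigured.
I recommend deploying through the graphical (web UI) installer to avoid parameters you do not fully understand.

Increase memory_limit. 8G is too small for deploying OCP.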
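
A back-of-the-envelope budget shows why 8G is tight. The sketch below assumes, as a simplification, that the memory available to tenants is roughly memory_limit minus system_memory; the function names are made up, and the numbers come from the config.yaml posted above:

```python
import re
from typing import Iterable

_UNITS = {"M": 1024**2, "MB": 1024**2, "G": 1024**3, "GB": 1024**3}


def parse_size(text: str) -> int:
    """Convert strings such as '8GB' or '2G' into bytes."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", text.strip())
    if not match or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unsupported size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def tenant_memory_left(memory_limit: str, system_memory: str,
                       tenant_sizes: Iterable[str]) -> int:
    """Bytes left after subtracting system_memory and the listed tenants.

    Assumption: tenant-usable memory is about memory_limit - system_memory.
    """
    usable = parse_size(memory_limit) - parse_size(system_memory)
    return usable - sum(parse_size(s) for s in tenant_sizes)


# memory_limit=8GB, system_memory=2GB, ocp_meta 2G, ocp_monitor 2G
left = tenant_memory_left("8GB", "2GB", ["2G", "2G"])
print(left // 1024**3, "GiB left for the sys tenant and everything else")
```

Under that assumption only about 2 GiB remains once the ocp_meta and ocp_monitor tenants are created, which leaves little headroom.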


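Hmm, I changed this parameter to 0M through the web UI, but after a restart it went back to 8G. That value was the default at deployment time. It probably needs to be changed in the yaml as well.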

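I reinstalled OCP.
After the reinstall, the processes started by obd cluster start ocp_cluster are different from before.

After the reinstall, stopping the cluster with the normal obd command, rebooting the server, and then starting the cluster again works fine.
But there is a new problem: the server time was set incorrectly when I deployed, 8 hours ahead of the actual time.
I cannot correct it: after changing the system time with date -s, ocp_cluster will not start, and attempts to connect to the OCP database time out.
If I change the time back, everything works again.
Is there a way to set the time to the correct value?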

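Also, can the cluster that my previous OCP deployed still be used? I stopped that cluster through the old OCP, and now it will not start. I don't know what to do next.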


How to adjust the operating system time of an OBServer

https://www.oceanbase.com/knowledge-base/oceanbase-database-20000000070

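You only redeployed OCP, so the OB cluster that OCP deployed can still be used, and the new OCP can take it over later. It will not start? Are you starting it from the command line?

About the time being 8 hours ahead: are both the host time and the cluster time ahead? You can follow this step in the document to fix it.

For the cluster deployed by the old OCP: the new OCP deployment also redeployed the meta cluster observer, right?
You can take the cluster over in the new OCP, using the take-over function in the OCP console.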

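The redeployed OCP is a single node, while the observer cluster has three nodes. To take it over with OCP, don't I first need to start the observer service on all three machines? I tried, and the observer cluster will not start :smirk:

My situation is also a bit different from the one in the document: it is the single-node OCP server whose time is 8 hours ahead, and the meta database is on this same server.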

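With an 8-hour difference the cluster would need to stay stopped for 8 hours, so I suggest simply redeploying, which is faster.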
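
The reasoning behind the 8-hour stop is presumably that timestamps the cluster has already written must not lie in the future once the clock is corrected. A small Python sketch under that assumption; required_downtime is a made-up helper and the two timestamps are hypothetical examples:

```python
from datetime import datetime, timedelta


def required_downtime(last_cluster_time: datetime,
                      corrected_now: datetime) -> timedelta:
    """Time the cluster must stay stopped until the corrected clock
    has passed the last timestamp recorded under the fast clock."""
    gap = last_cluster_time - corrected_now
    # If the corrected clock is already past that point, no wait is needed.
    return max(gap, timedelta(0))


# Hypothetical example: the cluster was stopped at 18:00 according to the
# fast clock, while the real time was 10:00.
wait = required_downtime(datetime(2024, 11, 14, 18, 0),
                         datetime(2024, 11, 14, 10, 0))
print(wait)
```

With a clock that ran 8 hours fast the result is 8 hours, which matches the estimate above.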
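For the old cluster that will not start, have you tried starting it with command-line parameters?
Here is an example of starting with parameters:
./observer -p 30200 -P 30201 -z zone1 -c 1 -d 【store directory】 -i lo -r 127.0.0.1:30201 -o log_disk_size=5G,datafile_size=8G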
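
The values in that example have to be replaced with the ones from your own deployment on each node. As a sketch, the command line can be assembled from a few variables so nothing gets mistyped. The function name is made up, the flag meanings in the comments are my reading of the example (-p MySQL port, -P RPC port, -z zone, -c cluster ID, -d data directory, -i network device, -r root service list, -o extra options), and /path/to/store is a placeholder:

```python
import shlex
from typing import Dict, List


def build_observer_command(mysql_port: int, rpc_port: int, zone: str,
                           cluster_id: int, data_dir: str, devname: str,
                           rs_list: str, options: Dict[str, str]) -> List[str]:
    """Assemble the observer argument list in the same order as the example."""
    opt_str = ",".join(f"{key}={value}" for key, value in options.items())
    return [
        "./observer",
        "-p", str(mysql_port),   # MySQL protocol port
        "-P", str(rpc_port),     # RPC port
        "-z", zone,
        "-c", str(cluster_id),
        "-d", data_dir,
        "-i", devname,
        "-r", rs_list,
        "-o", opt_str,
    ]


cmd = build_observer_command(30200, 30201, "zone1", 1, "/path/to/store", "lo",
                             "127.0.0.1:30201",
                             {"log_disk_size": "5G", "datafile_size": "8G"})
print(shlex.join(cmd))
```

This only prints the command. Run it from the observer bin directory as the user that owns the deployment, with the real ports, zone, cluster ID and store directory of each node.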
