OceanBase fails to start

【 Environment 】Production
【 OB or other component 】
【 Version 】 4.0
【 Problem description 】
OceanBase startup keeps hanging at the last step shown below:
Get local repositories ok
Search plugins ok
Load cluster param plugin ok
Open ssh connection ok
Check before start observer ok
Check before start obproxy ok
[WARN] OBD-4521: The config observer_sys_password in obproxy-ce did not take effect, please config it in oceanbase-ce

Check before start obagent ok
Check before start ocp-express ok
Start observer ok
observer program health check ok
obshell program health check ok
Connect to observer 10.106.12.18:2881 ok
Start obproxy ok
obproxy program health check ok
Connect to obproxy ok
Initialize obproxy-ce ok
Start obagent ok
obagent program health check ok
Connect to Obagent ok
Start ocp-express |
Start ocp-express /
Start ocp-express /
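It hangs at this point. According to the OCP log, this usually means the server nodes have not finished initializing. After waiting about half an hour it recovers briefly,

but once the recovery has run, it crashes again.

How should I handle this? I have tried the methods suggested in the community, and none of them had any effect on this error.

This also causes one of the nodes to restart repeatedly, along with some concurrency problems.

Here are the logs that appear afterwards: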
Start ocp-express x
[ERROR] XXXX. failed to connect meta db
[ERROR] ocp-express start failed
observer need bootstarp x
+-------------------------------------------------+
|                     obproxy                     |
+---------------+------+-----------------+--------+
| ip            | port | prometheus_port | status |
+---------------+------+-----------------+--------+
| XXXX.         | 2883 | 2884            | active |
| XXXX.         | 2883 | 2884            | active |
+---------------+------+-----------------+--------+
obclient -hXXXX.18 -P2883 -uroot -p'ljgsmrlab' -Doceanbase -A
+--------------------------------------------------------------------+
|                              obagent                               |
+---------------+--------------------+--------------------+----------+
| ip            | mgragent_http_port | monagent_http_port | status   |
+---------------+--------------------+--------------------+----------+
| XXXX.         | 8089               | 8088               | active   |
| XXXX.         | 8089               | 8088               | inactive |
| XXXX.         | 8089               | 8088               | active   |
+---------------+--------------------+--------------------+----------+
See https://www.oceanbase.com/product/ob-deployer/error-codes .
Trace ID: f48313d2-6cf2-11f0-8fec-fa163e596018
If you want to view detailed obd logs, please run: obd display-trace f48313d2-6cf2-11f0-8fec-fa163e596018

[2025-07-30 11:42:59.754] [ERROR] XXXX.18: failed to connect meta db
[2025-07-30 11:42:59.852] [INFO] [ERROR] XXXX.18: failed to connect meta db
[2025-07-30 11:42:59.852] [INFO]
[2025-07-30 11:42:59.852] [DEBUG] - sub start ref count to 0
[2025-07-30 11:42:59.852] [DEBUG] - export start
[2025-07-30 11:42:59.852] [ERROR] ocp-express start failed
[2025-07-30 11:42:59.853] [DEBUG] - Call oceanbase-ce-py_script_display-3.1.0 for oceanbase-ce-4.2.2.0-100000192024011915.el7-aa3053da7370a6685a2ef457cd202d50e5ab75d3
[2025-07-30 11:42:59.853] [DEBUG] - import display
[2025-07-30 11:42:59.854] [DEBUG] - add display ref count to 1
[2025-07-30 11:42:59.854] [INFO] Wait for observer init
[2025-07-30 11:42:59.855] [DEBUG] -- execute sql: select * from oceanbase.__all_server. args: None
[2025-07-30 11:42:59.855] [DEBUG] -- OBD-5000: select * from oceanbase.__all_server execute failed
[2025-07-30 11:42:59.857] [ERROR] Traceback (most recent call last):
[2025-07-30 11:42:59.857] [ERROR]   File "core.py", line 2018, in start_cluster
[2025-07-30 11:42:59.857] [ERROR]   File "core.py", line 2142, in _start_cluster
[2025-07-30 11:42:59.857] [ERROR]   File "core.py", line 186, in call_plugin
[2025-07-30 11:42:59.857] [ERROR]   File "_plugin.py", line 346, in call
[2025-07-30 11:42:59.857] [ERROR]   File "_plugin.py", line 304, in _new_func
[2025-07-30 11:42:59.857] [ERROR]   File "/root/.obd/plugins/oceanbase-ce/3.1.0/display.py", line 37, in display
[2025-07-30 11:42:59.857] [ERROR]     servers = cursor.fetchall('select * from oceanbase.__all_server', raise_exception=True, exc_level='verbose')
[2025-07-30 11:42:59.858] [ERROR]   File "_stdio.py", line 886, in func_wrapper
[2025-07-30 11:42:59.858] [ERROR]   File "/root/.obd/plugins/oceanbase-ce/4.2.2.0/connect.py", line 511, in fetchall
[2025-07-30 11:42:59.858] [ERROR]     return self.execute(sql, args=args, execute_func='fetchall', raise_exception=raise_exception, exc_level=exc_level, stdio=stdio)
[2025-07-30 11:42:59.858] [ERROR]   File "_stdio.py", line 886, in func_wrapper
[2025-07-30 11:42:59.858] [ERROR]   File "/root/.obd/plugins/oceanbase-ce/4.2.2.0/connect.py", line 490, in execute
[2025-07-30 11:42:59.858] [ERROR]     self.cursor.execute(sql, args)
[2025-07-30 11:42:59.858] [ERROR]   File "pymysql/cursors.py", line 148, in execute
[2025-07-30 11:42:59.858] [ERROR]   File "pymysql/cursors.py", line 310, in _query
[2025-07-30 11:42:59.858] [ERROR]   File "pymysql/connections.py", line 548, in query
[2025-07-30 11:42:59.858] [ERROR]   File "pymysql/connections.py", line 775, in _read_query_result
[2025-07-30 11:42:59.858] [ERROR]   File "pymysql/connections.py", line 1156, in read
[2025-07-30 11:42:59.858] [ERROR]   File "pymysql/connections.py", line 692, in _read_packet
[2025-07-30 11:42:59.858] [ERROR]   File "pymysql/connections.py", line 748, in _read_bytes
[2025-07-30 11:42:59.858] [ERROR] pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
[2025-07-30 11:42:59.858] [ERROR]

[2025-07-30 11:42:59.985] [DEBUG] - sub display ref count to 0
[2025-07-30 11:42:59.985] [DEBUG] - export display
[2025-07-30 11:42:59.985] [DEBUG] - Call obproxy-ce-py_script_display-3.1.0 for obproxy-ce-4.2.1.0-11.el7-0aed4b782120e4248b749f67be3d2cc82cdcb70d
[2025-07-30 11:42:59.985] [DEBUG] - import display
[2025-07-30 11:42:59.987] [DEBUG] - add display ref count to 1
[2025-07-30 11:42:59.987] [DEBUG] -- execute sql: show proxyconfig like "%port". args: None
[2025-07-30 11:42:59.999] [DEBUG] -- execute sql: show proxyconfig like "%port". args: None
[2025-07-30 11:43:00.000] [DEBUG] -- OBD-5000: show proxyconfig like "%port" execute failed
[2025-07-30 11:43:00.000] [DEBUG] -- execute sql: show proxyconfig like "%port". args: None
[2025-07-30 11:43:00.011] [INFO] +-------------------------------------------------+
[2025-07-30 11:43:00.011] [INFO] |                     obproxy                     |
[2025-07-30 11:43:00.011] [INFO] +---------------+------+-----------------+--------+
[2025-07-30 11:43:00.011] [INFO] | ip            | port | prometheus_port | status |
[2025-07-30 11:43:00.011] [INFO] +---------------+------+-----------------+--------+
[2025-07-30 11:43:00.011] [INFO] | XXXX.18       | 2883 | 2884            | active |
[2025-07-30 11:43:00.011] [INFO] | XXXX.249      | 2883 | 2884            | active |
[2025-07-30 11:43:00.012] [INFO] +---------------+------+-----------------+--------+
[2025-07-30 11:43:00.013] [INFO] obclient -hXXXX.18 -P2883 -uroot -p'ljgsmrlab' -Doceanbase -A
[2025-07-30 11:43:00.013] [INFO]
[2025-07-30 11:43:00.013] [DEBUG] - sub display ref count to 0
[2025-07-30 11:43:00.013] [DEBUG] - export display

[2025-07-30 11:43:00.013] [DEBUG] - Call obagent-py_script_display-1.3.0 for obagent-4.2.2-100000042024011120.el7-19739a07a12eab736aff86ecf357b1ae660b554e
[2025-07-30 11:43:00.013] [DEBUG] - import display
[2025-07-30 11:43:00.014] [DEBUG] - add display ref count to 1
[2025-07-30 11:43:00.014] [DEBUG] -- send http request method: GET, url: http://XXXX.18:8089/api/v1/agent/status, data: None
[2025-07-30 11:43:00.113] [DEBUG] -- send http request method: GET, url: http://XXXX.71:8089/api/v1/agent/status, data: None
[2025-07-30 11:43:00.117] [ERROR] Traceback (most recent call last):
[2025-07-30 11:43:00.117] [ERROR]   File "urllib3/connection.py", line 174, in _new_conn
[2025-07-30 11:43:00.117] [ERROR]   File "urllib3/util/connection.py", line 95, in create_connection
[2025-07-30 11:43:00.117] [ERROR]   File "urllib3/util/connection.py", line 85, in create_connection
[2025-07-30 11:43:00.117] [ERROR] ConnectionRefusedError: [Errno 111] Connection refused
[2025-07-30 11:43:00.117] [ERROR]
[2025-07-30 11:43:00.117] [ERROR] During handling of the above exception, another exception occurred:
[2025-07-30 11:43:00.117] [ERROR]
[2025-07-30 11:43:00.117] [ERROR] Traceback (most recent call last):
[2025-07-30 11:43:00.117] [ERROR]   File "urllib3/connectionpool.py", line 715, in urlopen
[2025-07-30 11:43:00.117] [ERROR]   File "urllib3/connectionpool.py", line 416, in _make_request
[2025-07-30 11:43:00.117] [ERROR]   File "urllib3/connection.py", line 244, in request
[2025-07-30 11:43:00.117] [ERROR]   File "http/client.py", line 1256, in request
[2025-07-30 11:43:00.117] [ERROR]   File "http/client.py", line 1302, in _send_request
[2025-07-30 11:43:00.117] [ERROR]   File "http/client.py", line 1251, in endheaders
[2025-07-30 11:43:00.117] [ERROR]   File "http/client.py", line 1011, in _send_output
[2025-07-30 11:43:00.118] [ERROR]   File "http/client.py", line 951, in send
[2025-07-30 11:43:00.118] [ERROR]   File "urllib3/connection.py", line 205, in connect
[2025-07-30 11:43:00.118] [ERROR]   File "urllib3/connection.py", line 186, in _new_conn
[2025-07-30 11:43:00.118] [ERROR] urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fc446742310>: Failed to establish a new connection: [Errno 111] Connection refused
[2025-07-30 11:43:00.118] [ERROR]
[2025-07-30 11:43:00.118] [ERROR] During handling of the above exception, another exception occurred:
[2025-07-30 11:43:00.118] [ERROR]
[2025-07-30 11:43:00.118] [ERROR] Traceback (most recent call last):
[2025-07-30 11:43:00.118] [ERROR]   File "requests/adapters.py", line 439, in send
[2025-07-30 11:43:00.118] [ERROR]   File "urllib3/connectionpool.py", line 799, in urlopen
[2025-07-30 11:43:00.118] [ERROR]   File "urllib3/util/retry.py", line 592, in increment
[2025-07-30 11:43:00.118] [ERROR] urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='XXXX.71', port=8089): Max retries exceeded with url: /api/v1/agent/status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc446742310>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2025-07-30 11:43:00.118] [ERROR]
[2025-07-30 11:43:00.118] [ERROR] During handling of the above exception, another exception occurred:
[2025-07-30 11:43:00.118] [ERROR]
[2025-07-30 11:43:00.118] [ERROR] Traceback (most recent call last):
[2025-07-30 11:43:00.118] [ERROR]   File "core.py", line 2018, in start_cluster
[2025-07-30 11:43:00.118] [ERROR]   File "core.py", line 2142, in _start_cluster
[2025-07-30 11:43:00.118] [ERROR]   File "core.py", line 186, in call_plugin
[2025-07-30 11:43:00.118] [ERROR]   File "_plugin.py", line 346, in call
[2025-07-30 11:43:00.118] [ERROR]   File "_plugin.py", line 304, in _new_func
[2025-07-30 11:43:00.119] [ERROR]   File "/root/.obd/plugins/obagent/1.3.0/display.py", line 39, in display
[2025-07-30 11:43:00.119] [ERROR]     'status': 'active' if api_cursor and api_cursor.connect(stdio) else 'inactive',
[2025-07-30 11:43:00.119] [ERROR]   File "/root/.obd/plugins/obagent/1.3.0/connect.py", line 47, in connect
[2025-07-30 11:43:00.119] [ERROR]     return self._request('GET', 'api/v1/agent/status', stdio=stdio)
[2025-07-30 11:43:00.119] [ERROR]   File "/root/.obd/plugins/obagent/1.3.0/connect.py", line 58, in _request
[2025-07-30 11:43:00.119] [ERROR]     resp = requests.request(method, url, auth=self.auth, data=data, verify=False)
[2025-07-30 11:43:00.119] [ERROR]   File "requests/api.py", line 61, in request
[2025-07-30 11:43:00.119] [ERROR]   File "requests/sessions.py", line 542, in request
[2025-07-30 11:43:00.119] [ERROR]   File "requests/sessions.py", line 655, in send
[2025-07-30 11:43:00.119] [ERROR]   File "requests/adapters.py", line 516, in send
[2025-07-30 11:43:00.119] [ERROR] requests.exceptions.ConnectionError: HTTPConnectionPool(host='XXXX.71', port=8089): Max retries exceeded with url: /api/v1/agent/status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc446742310>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2025-07-30 11:43:00.119] [ERROR]
[2025-07-30 11:43:00.119] [DEBUG] -- request obagent failed: HTTPConnectionPool(host='XXXX.71', port=8089): Max retries exceeded with url: /api/v1/agent/status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc446742310>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2025-07-30 11:43:00.119] [DEBUG] -- send http request method: GET, url: http://XXXX.249:8089/api/v1/agent/status, data: None
[2025-07-30 11:43:00.222] [INFO] +--------------------------------------------------------------------+
[2025-07-30 11:43:00.222] [INFO] |                              obagent                               |
[2025-07-30 11:43:00.222] [INFO] +---------------+--------------------+--------------------+----------+
[2025-07-30 11:43:00.223] [INFO] | ip            | mgragent_http_port | monagent_http_port | status   |
[2025-07-30 11:43:00.223] [INFO] +---------------+--------------------+--------------------+----------+
[2025-07-30 11:43:00.223] [INFO] | XXXX.18       | 8089               | 8088               | active   |
[2025-07-30 11:43:00.223] [INFO] | XXXX.71       | 8089               | 8088               | inactive |
[2025-07-30 11:43:00.223] [INFO] | XXXX.249      | 8089               | 8088               | active   |
[2025-07-30 11:43:00.223] [INFO] +---------------+--------------------+--------------------+----------+
[2025-07-30 11:43:00.223] [DEBUG] - sub display ref count to 0
[2025-07-30 11:43:00.223] [DEBUG] - export display
[2025-07-30 11:43:00.233] [INFO] See https://www.oceanbase.com/product/ob-deployer/error-codes .
[2025-07-30 11:43:00.234] [INFO] Trace ID: f48313d2-6cf2-11f0-8fec-fa163e596018
[2025-07-30 11:43:00.234] [INFO] If you want to view detailed obd logs, please run: obd display-trace f48313d2-6cf2-11f0-8fec-fa163e596018
[2025-07-30 11:43:00.235] [DEBUG] - unlock /root/.obd/lock/global
[2025-07-30 11:43:00.235] [DEBUG] - unlock /root/.obd/lock/deploy_lj5grim
[2025-07-30 11:43:00.235] [DEBUG] - unlock /root/.obd/lock/mirror_and_repo
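
The trace above shows a refused connection on port 8089 of one node and a dropped SQL connection on another, so a quick way to see which node is actually down is to probe each service port directly. The following is only a minimal sketch using Python's standard library; the function names and the host list are hypothetical, and the port numbers are simply the ones that appear in the output above (2881 observer, 2883 obproxy, 8088/8089 obagent).

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

def check_nodes(hosts, ports):
    """Map every (host, port) pair to whether it accepts TCP connections."""
    return {(h, p): port_is_open(h, p) for h in hosts for p in ports}

if __name__ == "__main__":
    # Hypothetical placeholder: replace with the real node addresses.
    hosts = ["127.0.0.1"]
    for target, ok in check_nodes(hosts, [2881, 2883, 8088, 8089]).items():
        print(target, "open" if ok else "refused or unreachable")
```

An open port only proves that something is listening. It does not prove the observer has finished bootstrapping, so the SQL check against oceanbase.__all_server is still needed afterwards.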

I also see a clog-related error:
[2025-07-30 15:02:15.890997] ERROR try_recycle_blocks (palf_env_impl.cpp:784) [6691][T1001_PalfGC][T1001][Y0-0000000000000000-0-0] [lt=30][errcode=-4264] Log out of disk space(msg="log disk space is almost full", ret=-4264, total_size(MB)=614, used_size(MB)=583, used_percent(%)=95, warn_size(MB)=491, warn_percent(%)=80, limit_size(MB)=583, limit_percent(%)=95, total_unrecyclable_size_byte(MB)=519, maximum_used_size(MB)=583, maximum_log_stream=1, oldest_log_stream=1, oldest_scn={val:1750291890447470001, v:0}, in_shrinking=false)
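
To make the numbers in that error easier to read, here is a small sketch that pulls the size fields out of the log line and recomputes the thresholds. It is an illustration only: the helper names are made up, the log string is an abridged copy of the line above, and rounding down to whole MB is an assumption that happens to reproduce the logged warn_size and limit_size.

```python
import re

# Abridged copy of the PALF error line quoted above.
LOG_LINE = (
    'Log out of disk space(msg="log disk space is almost full", ret=-4264, '
    'total_size(MB)=614, used_size(MB)=583, used_percent(%)=95, '
    'warn_size(MB)=491, warn_percent(%)=80, limit_size(MB)=583, '
    'limit_percent(%)=95, total_unrecyclable_size_byte(MB)=519)'
)

def parse_palf_fields(line):
    """Return the integer fields written as name(MB)=N or name(%)=N."""
    pattern = re.compile(r'(\w+)\((?:MB|%)\)=(\d+)')
    return {key: int(value) for key, value in pattern.findall(line)}

def threshold_size_mb(total_mb, percent):
    """Size in MB at which a percentage threshold is hit (rounded down)."""
    return total_mb * percent // 100

fields = parse_palf_fields(LOG_LINE)
current_limit = threshold_size_mb(fields["total_size"], fields["limit_percent"])
raised_limit = threshold_size_mb(fields["total_size"], 98)  # hypothetical 98% limit
print(current_limit, raised_limit, raised_limit - fields["used_size"])
# prints: 583 601 18
```

With a 614 MB log disk, used space is already at the 95% limit (583 MB), and 519 MB of it is reported as unrecyclable. Under this rounding, a 98% limit would free up only about 18 MB, so raising the threshold looks like a way to get the node started rather than a substitute for a larger log disk.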


The ocp-express component is no longer maintained. I suggest removing it and deploying the OCP product instead.

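As for the clog problem, I tried changing the settings and still cannot start successfully. It looks like clog is triggering the whole chain of problems. Should I deal with that first?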

Try starting the cluster with -c oceanbase-ce so that only OB is started.
If it still fails, start observer with parameters to enlarge the log disk:
cd /home/oceanbase/oceanbase-ce && bin/observer -o "log_disk_size=xxxG"

That does not work. I have already tried every method I could find on the forum.

Stop all application traffic, then start the cluster with the log_disk_utilization_limit_threshold parameter raised to 98%.
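
For anyone following along, this is a rough sketch of how such a start command could be assembled. It assumes the threshold is passed through the same `bin/observer -o` mechanism as the `log_disk_size` example earlier in the thread, that the value is written as a bare number (98) with no percent sign, and that several overrides can be joined with commas. None of that is confirmed in this thread, and the helper name is hypothetical.

```python
import shlex

def observer_start_command(home, params):
    """Build a shell command that starts observer with -o overrides."""
    if not params:
        raise ValueError("at least one parameter is required")
    # Assumption: overrides go in one comma-separated key=value string.
    option = ",".join(f"{key}={value}" for key, value in params.items())
    return f"cd {shlex.quote(home)} && bin/observer -o {shlex.quote(option)}"

print(observer_start_command(
    "/home/oceanbase/oceanbase-ce",  # install path taken from the earlier reply
    {"log_disk_utilization_limit_threshold": 98},
))
# prints: cd /home/oceanbase/oceanbase-ce && bin/observer -o log_disk_utilization_limit_threshold=98
```

The command would have to be run on each observer node as the user that owns the installation. Check the printed command before running it.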