obproxy keeps going down

[Environment] Production
[OB or other components]
[Versions]
oceanbase-ce:4.3.5.4
obproxy-ce:4.3.5.0
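[Problem description]

Fault description

A three-server cluster was deployed with obd. Each of the three servers runs both observer and obproxy.
obproxy goes down from time to time. I wrote a script to record when the failure happens (it checks every ten minutes). It sent an alert at 23:00 on Friday night (2026-01-23). At that point all three machines were reachable through the observer port, while connectivity through obproxy was as follows: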

Server (obproxy)  obproxy connectivity
xx.xx.xx.11       connectable
xx.xx.xx.12       process exists, not connectable
xx.xx.xx.13       process does not exist, not connectable

Connecting through obproxy on the second machine (xx.xx.xx.12) gives the following error:

➜  ~ mysql -hxx.xx.xx.12 -P18083 -uroot@thtf -p'xxxxxx'
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading authorization packet', system error: 2
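
The three states in the table above can be derived from two observations per host: whether the obproxy process is running, and whether a login through the obproxy port succeeds. The following is a minimal Python sketch of that classification, not the original monitoring script. The names `tcp_port_open` and `classify` are hypothetical. Note that a plain TCP probe is only a first-level check: the ERROR 2013 above occurs while reading the authorization packet, so the port can accept a connection while the login still fails. The `login_ok` value is therefore assumed to come from a real client login attempt.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    This only proves that something is listening. It does not prove
    that a MySQL-protocol login through obproxy would succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(process_running: bool, login_ok: bool) -> str:
    """Map the two observations onto the three states used in the table."""
    if login_ok:
        return "connectable"
    if process_running:
        return "process exists, not connectable"
    return "process does not exist, not connectable"
```

For example, `classify(True, False)` gives the state reported for xx.xx.xx.12, and `classify(False, False)` the state reported for xx.xx.xx.13.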

The configuration is as follows:

user:
  username: root
  port: 22
  password:
oceanbase-ce:
  version: 4.3.5.4
  release: 104000042025090916.el7
  package_hash: 502277ef3b10b1bcf3fea757b309addf14ad8edb1a8a98d986883045c539603c
  xx.xx.xx.11:
    zone: zone1
  xx.xx.xx.12:
    zone: zone2
  xx.xx.xx.13:
    zone: zone3
  servers:
  - xx.xx.xx.11
  - xx.xx.xx.12
  - xx.xx.xx.13
  global:
    appname: sessob
    root_password: password3
    mysql_port: 18081
    rpc_port: 18082
    data_dir: /data
    redo_dir: /redo
    obshell_port: 18086
    home_path: /root/sessob/oceanbase
    scenario: htap
    datafile_size: 2GB
    datafile_maxsize: 900GB
    datafile_next: 2GB
    log_disk_size: 170GB
    max_syslog_file_count: '10'
    memory_limit: 50GB
    system_memory: 5GB
    devname: eth0
    enable_auto_start: 'True'
    cluster_id: 1761301228
    ocp_agent_monitor_password: password5
    proxyro_password: password2
    enable_syslog_wf: false
    cpu_count: 14
  depends:
  - ob-configserver
obproxy-ce:
  version: 4.3.5.0
  package_hash: dff2846bc2fae2852480bdb0a2c5ddeee3309071f82257b921a687b3c9274345
  release: 3.el7
  servers:
  - xx.xx.xx.11
  - xx.xx.xx.12
  - xx.xx.xx.13
  global:
    prometheus_listen_port: 18084
    listen_port: 18083
    rpc_listen_port: 18085
    home_path: /root/sessob/obproxy
    proxy_mem_limited: 4GB
    obproxy_sys_password: password1
    skip_proxy_sys_private_check: true
    enable_strict_kernel_release: false
    enable_cluster_checkout: false
    rs_list: xx.xx.xx.11:18081;xx.xx.xx.12:18081;xx.xx.xx.13:18081
    cluster_name: sessob
    observer_root_password: password3
  xx.xx.xx.11:
    proxy_id: 6410
    client_session_id_version: 2
  xx.xx.xx.12:
    proxy_id: 6411
    client_session_id_version: 2
  xx.xx.xx.13:
    proxy_id: 6412
    client_session_id_version: 2
  depends:
  - oceanbase-ce
  - ob-configserver
obagent:
  version: 4.2.3
  package_hash: 5f44d62b09e2fc6fb3107ebce02a2a68185f03e81956f98d8578c55ba3c8238f
  release: 200000032025071420.el7
  servers:
  - xx.xx.xx.11
  - xx.xx.xx.12
  - xx.xx.xx.13
  global:
    monagent_http_port: 18088
    mgragent_http_port: 8089
    home_path: /root/sessob/obagent
    http_basic_auth_password: password4
    ob_monitor_status: active
  depends:
  - oceanbase-ce
ob-configserver:
  version: 1.0.1
  release: 1.0.1.el7
  servers:
  - xx.xx.xx.11
  global:
    listen_port: 8080
    home_path: /root/sessob/obconfigserver

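Logs

On server xx.xx.xx.13, the last-modified time of the logs is stuck at the time of the failure (around 2026-01-23 23:00). Below are the last 1000 lines of each log: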
13-obproxy.log.tail1000.txt (544.1 KB)
13-obproxy_stat.log.tail1000.txt (110.0 KB)
obproxy_error.log.tail1000.txt (172.1 KB)

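On server xx.xx.xx.12, the logs keep being updated, but connections fail. Below are the last 1000 lines of each log at the time of capture: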
12-obproxy.log.tail1000.txt (441.8 KB)
12-obproxy_diagnosis.log.tail1000.txt (587.5 KB)
12-obproxy_error.log.tail1000.txt (168.9 KB)

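Suspected cause

Based on related forum posts (found by searching for "reading authorization packet"), I noticed that the following two passwords in my configuration are different:

proxyro_password: password2
obproxy_sys_password: password1

Should these two passwords be set to the same value?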

Has a whitelist been configured?

Do you mean the tenant whitelist? The whitelist of every tenant is %.

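I suggest switching to obproxy 4.3.4 first; 4.3.5 has quite a few problems. Version 4.3.6 will be released soon, and you can upgrade to it then.

If the problem still occurs after switching to 4.3.4, please post the complete obproxy logs from the affected node.

The problem still occurred after switching to 4.3.4. Below is a record of the incidents. In every case the fix was to restart the services.

February 17: the ob-configserver, obagent, and obproxy-ce services went down on one machine. The following may be the obproxy-ce log: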
{4A78C5EE-A412-A34E-8610-E8AEB0130E41}.txt (15.2 KB)


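February 20: the obagent and obproxy-ce services went down on one machine. No logs were captured at the time.

March 1: all OB-related services on xx.xx.xx.13 went down.
The observer log is as follows: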

ailure_black_list_interval_:60000000, data_storage_warning_tolerance_time_:5000000, data_storage_error_tolerance_time_:300000000, disk_io_thread_count_:8, sync_io_thread_count_:0, data_storage_io_timeout_ms_:120000})
[2026-03-01 11:50:21.166944] WDIAG [STORAGE.TRANS] getClock (ob_clock_generator.h:70) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=34][errcode=-4006] clock generator not inited
[2026-03-01 11:50:21.168844] WDIAG [STORAGE.TRANS] getClock (ob_clock_generator.h:70) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=28][errcode=-4006] clock generator not inited
[2026-03-01 11:50:21.169100] WDIAG [STORAGE.TRANS] getClock (ob_clock_generator.h:70) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=33][errcode=-4006] clock generator not inited
[2026-03-01 11:50:21.169172] WDIAG [STORAGE.TRANS] getClock (ob_clock_generator.h:70) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=23][errcode=-4006] clock generator not inited
[2026-03-01 11:50:21.169323] WDIAG [STORAGE.TRANS] getClock (ob_clock_generator.h:70) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=25][errcode=-4006] clock generator not inited
[2026-03-01 11:50:21.169344] WDIAG [STORAGE.TRANS] getClock (ob_clock_generator.h:70) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=19][errcode=-4006] clock generator not inited
[2026-03-01 11:50:21.169366] WDIAG [SHARE.SCHEMA] check_inner_stat (ob_server_schema_service.cpp:266) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=8][errcode=0] inner stat error(schema_service_=NULL, sql_proxy_=NULL, config_=NULL)
[2026-03-01 11:50:21.169445] WDIAG [SHARE.SCHEMA] check_inner_stat (ob_multi_version_schema_service.cpp:1887) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=77][errcode=0] inner stat error(init_=false)
[2026-03-01 11:50:21.169476] WDIAG [SHARE.SCHEMA] check_if_tenant_has_been_dropped (ob_multi_version_schema_service.cpp:2164) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=14][errcode=-4014] inner stat error(ret=-4014)
[2026-03-01 11:50:21.169494] WDIAG [SERVER] nonblock_get_leader (ob_inner_sql_connection.cpp:1980) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=16][errcode=0] user tenant has been dropped(ret=0, ret="OB_SUCCESS", tenant_id=1)
[2026-03-01 11:50:21.169528] WDIAG [SHARE.LOCATION] get_leader_with_retry_until_timeout (ob_location_service.cpp:107) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=29][errcode=-4006] not init(ret=-4006, ret="OB_NOT_INIT")
[2026-03-01 11:50:21.169559] WDIAG [SERVER] nonblock_get_leader (ob_inner_sql_connection.cpp:1989) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=28][errcode=-4006] get leader with retry until timeout failed(ret=-4006, ret="OB_NOT_INIT", tenant_id=1, ls_id={id:1}, leader="0.0.0.0:0", cluster_id=1761301228, tmp_abs_timeout_us=1772337022169343, retry_interval_us=200000)
[2026-03-01 11:50:21.169625] WDIAG [SERVER] execute_read_inner (ob_inner_sql_connection.cpp:1884) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=63][errcode=-4006] nonblock get leader failed(ret=-4006, tenant_id=1, ls_id={id:1}, cluster_id=1761301228)
[2026-03-01 11:50:21.169672] WDIAG [SERVER] retry_while_no_tenant_resource (ob_inner_sql_connection.cpp:1087) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=18][errcode=-4006] retry_while_no_tenant_resource failed(ret=-4006, tenant_id=1)
[2026-03-01 11:50:21.169693] WDIAG [SERVER] execute_read (ob_inner_sql_connection.cpp:1791) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=19][errcode=-4006] execute_read failed(ret=-4006, cluster_id=1761301228, tenant_id=1)
[2026-03-01 11:50:21.169719] WDIAG [COMMON.MYSQLP] read (ob_mysql_proxy.cpp:140) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=10][errcode=-4006] query failed(ret=-4006, conn=0x7f56c5fd4060, start=1772337021169095, sql=select mode, size, latency, iops from __all_disk_io_calibration where svr_ip = "xx.xx.xx.13" and svr_port = 18082 and storage_name = "DATA")
[2026-03-01 11:50:21.169768] WDIAG [COMMON.MYSQLP] read (ob_mysql_proxy.cpp:66) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=48][errcode=-4006] read failed(ret=-4006)
[2026-03-01 11:50:21.169781] WDIAG [COMMON] parse_calibration_table (ob_io_calibration.cpp:908) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=10][errcode=-4006] query failed(ret=-4006, sql_string=select mode, size, latency, iops from __all_disk_io_calibration where svr_ip = "xx.xx.xx.13" and svr_port = 18082 and storage_name = "DATA")
[2026-03-01 11:50:21.169958] WDIAG [STORAGE.TRANS] getClock (ob_clock_generator.h:70) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=21][errcode=-4006] clock generator not inited
[2026-03-01 11:50:21.196613] WDIAG [COMMON] read_from_table (ob_io_calibration.cpp:773) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=18][errcode=-4006] parse calibration data failed(ret=-4006)
[2026-03-01 11:50:21.196781] EDIAG [CLOG] do_init_ (ob_server_log_block_mgr.cpp:503) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=83][errcode=-9100] prepare_dir_and_create_meta_ failed(ret=-9100, log_pool_path="/data/clog/log_pool", log_pool_tmp_path="/data/clog/log_pool.tmp") BACKTRACE:0xa78af88 0xa7938a5 0xa884109 0xa883b36 0xa883a6c 0xa883982 0x105f34a6 0x105e543f 0x1437bccc 0x1019a118 0x1019f392 0x2768dcb0 0x1019bdfd 0x7f56d223a7e5 0xad76f46
[2026-03-01 11:50:21.196971] EDIAG [CLOG] init (ob_server_log_block_mgr.cpp:92) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=188][errcode=-9100] do_init_ failed(ret=-9100, this={dir::"", dir_fd:-1, meta_fd:-1, log_pool_meta:{curr_total_size:0, next_total_size:0, status:0}, min_block_id:0, max_block_id:0, min_log_disk_size_for_all_tenants_:0, is_inited:false}, log_disk_base_path="/data/clog") BACKTRACE:0xa78af88 0xa7938a5 0xa884109 0xa883b36 0xa883a6c 0xa883982 0x105e5866 0x105e5392 0x1437bccc 0x1019a118 0x1019f392 0x2768dcb0 0x1019bdfd 0x7f56d223a7e5 0xad76f46
[2026-03-01 11:50:21.197060] WDIAG [CLOG] destroy (ob_server_log_block_mgr.cpp:111) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=87][errcode=0] ObServerLogBlockMgr  destroy(this={dir::"", dir_fd:-1, meta_fd:-1, log_pool_meta:{curr_total_size:0, next_total_size:0, status:0}, min_block_id:0, max_block_id:0, min_log_disk_size_for_all_tenants_:0, is_inited:false})
[2026-03-01 11:50:21.197179] EDIAG [SERVER] init_io (ob_server.cpp:2527) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=111][errcode=-9100] log block mgr init failed(ret=-9100, ret="OB_NO_SUCH_FILE_OR_DIRECTORY") BACKTRACE:0xa78af88 0xa7938a5 0xa841d77 0xa841776 0xa8416b0 0xa8414d7 0x143bb4aa 0x1437d1d7 0x1019a118 0x1019f392 0x2768dcb0 0x1019bdfd 0x7f56d223a7e5 0xad76f46
[2026-03-01 11:50:21.197245] EDIAG [SERVER] init (ob_server.cpp:338) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=58][errcode=-9100] init io failed(ret=-9100, ret="OB_NO_SUCH_FILE_OR_DIRECTORY") BACKTRACE:0xa78af88 0xa7938a5 0xa841d77 0xa841776 0xa8416b0 0xa8414d7 0x1438a47c 0x1437d063 0x1019a118 0x1019f392 0x2768dcb0 0x1019bdfd 0x7f56d223a7e5 0xad76f46
[2026-03-01 11:50:21.197296] EDIAG [SERVER] init (ob_server.cpp:549) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=39][errcode=-9100] [OBSERVER_NOTICE] fail to init observer(ret=-9100, ret="OB_NO_SUCH_FILE_OR_DIRECTORY") BACKTRACE:0xa78af88 0xa7938a5 0xa841d77 0xa841776 0xa8416b0 0xa8414d7 0x1439032e 0x1437f920 0x1019a118 0x1019f392 0x2768dcb0 0x1019bdfd 0x7f56d223a7e5 0xad76f46
[2026-03-01 11:50:21.197366] ERROR [SERVER] init (ob_server.cpp:553) [1488][observer][T0][Y0-0000000000000001-0-0] [lt=24][errcode=-9100] [server_start 4/18] observer init fail. you may find solutions in previous error logs or seek help from official technicians.
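
Most of the excerpt above is WDIAG noise, and the entries that explain the failed start are the EDIAG and ERROR lines. The following Python sketch filters a log down to those entries. The regular expression is derived only from the lines shown in this post, not from any documented log format, and the name `severe_entries` is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Pattern inferred from the excerpt:
# [timestamp] LEVEL [MODULE] func (file:line) ... [errcode=N] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<module>[^\]]+)\]\s+"
    r"(?P<func>\w+)\s+"
    r"\((?P<src>[^)]+)\)"
    r".*?\[errcode=(?P<errcode>-?\d+)\]\s*"
    r"(?P<msg>.*)$"
)


def severe_entries(
    lines: Iterable[str], levels=("EDIAG", "ERROR")
) -> Iterator[dict]:
    """Yield parsed entries whose level is in `levels`.

    Lines that do not match the pattern (for example a line that was
    cut off at the start of the excerpt) are skipped silently.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m["level"] not in levels:
            continue
        entry = m.groupdict()
        entry["errcode"] = int(entry["errcode"])
        yield entry
```

Applied to the excerpt above, this keeps the entries with errcode -9100 (OB_NO_SUCH_FILE_OR_DIRECTORY), which refer to /data/clog/log_pool.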

No further problems have occurred so far.

Is there a relatively stable version at the moment? Could you recommend one?

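The observer errors are most likely caused by a problem with the clog disk mount. I suggest checking the messages log or the mount status; a hardware failure cannot be ruled out.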
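
As a quick first check of the mount status, the directories named in the observer log can be inspected from Python. This is a minimal sketch: the paths come from `log_disk_base_path` and `log_pool_path` in the log excerpt, the helper names `check_dir` and `first_missing` are hypothetical, and the result only reflects the current state, not the state at the time of the failure.

```python
import os
from typing import Iterable, Optional


def check_dir(path: str) -> dict:
    """Collect basic facts about a directory."""
    return {
        "path": path,
        "is_dir": os.path.isdir(path),      # exists and is a directory
        "is_mount": os.path.ismount(path),  # is itself a mount point
        "writable": os.access(path, os.W_OK),
    }


def first_missing(paths: Iterable[str]) -> Optional[str]:
    """Return the first path that is not an existing directory, else None."""
    for path in paths:
        if not os.path.isdir(path):
            return path
    return None


if __name__ == "__main__":
    # Paths taken from the observer log in this thread.
    for p in ("/data", "/data/clog", "/data/clog/log_pool"):
        print(check_dir(p))
```

If `/data` reports `is_mount` as false, the data volume is not mounted at that path and the directories underneath belong to the parent filesystem.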

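Was a core file generated when obproxy went down?
Please confirm that a core file path is configured and that ulimit -c is unlimited; otherwise a core file may not be generated.
cat /etc/sysctl.conf | grep kernel.core_pattern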
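
The same two checks can be scripted. The sketch below reads the core size limit of the current process and extracts `kernel.core_pattern` from the text of a sysctl configuration file. The function names are hypothetical. It assumes a Linux host, and it only sees the limit of the process running the script, which may differ from the limit the obproxy process was started with.

```python
import resource
from typing import Optional


def core_limit_unlimited() -> bool:
    """Return True if the soft core-size limit of this process is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft == resource.RLIM_INFINITY


def parse_core_pattern(sysctl_text: str) -> Optional[str]:
    """Return the kernel.core_pattern value from sysctl.conf-style text.

    Comment lines are ignored. If the key appears more than once,
    the last occurrence is returned.
    """
    result = None
    for raw in sysctl_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "kernel.core_pattern":
            result = value.strip()
    return result


if __name__ == "__main__":
    print("core limit unlimited:", core_limit_unlimited())
    with open("/etc/sysctl.conf", encoding="utf-8") as f:
        print("configured core_pattern:", parse_core_pattern(f.read()))
```

The file only shows the configured value. The value currently in effect can be read from /proc/sys/kernel/core_pattern.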

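The data directory is under /data/, and there does not seem to be any core file there, although the settings are in place. Here is the output of the commands: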

[root@prod-13 ~]# ulimit -c
unlimited
[root@prod-13 ~]# cat /etc/sysctl.conf |grep kernel.core_pattern
kernel.core_pattern = /data/core-%e-%p-%t
[root@prod-13 ~]# ls -a /data/
.  ..  clog  slog  sstable
[root@prod-13 ~]# lsblk 
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
vda                          253:0    0   500G  0 disk 
├─vda1                       253:1    0     2M  0 part 
├─vda2                       253:2    0   200M  0 part /boot/efi
└─vda3                       253:3    0 499.8G  0 part /
vdb                          253:16   0   200G  0 disk 
└─vdb1                       253:17   0   200G  0 part 
  └─obredo--group-obredo--lv 252:1    0   200G  0 lvm  /redo
vdc                          253:32   0  1000G  0 disk 
└─vdc1                       253:33   0  1000G  0 part 
  └─obdata--group-obdata--lv 252:0    0  1000G  0 lvm  /data
vdd                          253:48   0     1M  0 disk 
[root@prod-13 ~]# df -Th
Filesystem                           Type      Size  Used Avail Use% Mounted on
devtmpfs                             devtmpfs   32G     0   32G    0% /dev
tmpfs                                tmpfs      32G     0   32G    0% /dev/shm
tmpfs                                tmpfs      32G  748K   32G    1% /run
tmpfs                                tmpfs      32G     0   32G    0% /sys/fs/cgroup
/dev/vda3                            ext4      492G   73G  400G   16% /
/dev/vda2                            vfat      200M  5.9M  194M    3% /boot/efi
/dev/mapper/obredo--group-obredo--lv ext4      196G  171G   16G   92% /redo
/dev/mapper/obdata--group-obdata--lv ext4      984G   13G  922G    2% /data
tmpfs                                tmpfs     6.3G     0  6.3G    0% /run/user/0