Major compaction stuck on the cluster's sys tenant, leaving the sys tenant and the test tenant unavailable

【Environment】 Production
【OB or other component】
【Version】
【Problem description】 1. The cluster's sys tenant and the test tenant are unavailable.

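2. When I arrived this morning, both the cluster and the .115 server were unavailable. The customer said the firewall had been turned on yesterday, so as a temporary measure I asked them to turn it off, after which the cluster returned to normal. The customer does not know why the .115 server stopped; I restarted it from the web console.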




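【Steps to reproduce】
【Attachments and logs】 We recommend collecting diagnostic information with obdiag, the OceanBase agile diagnostic tool. For details, see the link (right-click to open):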

【SOP Series 22】: The First Step of Troubleshooting (Self-Service Diagnosis and Diagnostic Information Collection)

A stuck major compaction can be root-cause analyzed directly with obdiag (link: OceanBase分布式数据库-海量数据 笔笔算数).

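The metadb was deployed with obd web, OCP was then deployed through the obd web GUI, and the cluster was deployed from OCP. If I install obdiag on the OCP server, can it diagnose the cluster? The problem is that obd cluster list shows only the metadb cluster, not the cluster created by OCP.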


I cannot collect from the cluster deployed by OCP; obd cluster list shows only the metadb cluster.


Please run this command and provide the observer.log file.

收集信息.txt (57.1 KB)
Here it is, please take a look.

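For a cluster deployed by OCP, obdiag can be used on its own, that is, deployed standalone:
【SOP Series 22】: The First Step of Troubleshooting (Self-Service Diagnosis and Diagnostic Information Collection)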

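The root cause analysis command for a stuck major compaction is shown below. When it finishes it produces a root cause analysis report; please post that report here.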
obdiag rca run --scene=major_hold

Diagnostics collected by obdiag:
[root@CloudOBSvr1 ~]# obdiag rca run --scene=major_hold
[WARN] MajorHoldScene execute CDB_OB_MAJOR_COMPACTION panic: (4016, 'Internal error')
[ERROR] rca_scene.execute err: MajorHoldScene execute CDB_OB_MAJOR_COMPACTION panic: (4016, 'Internal error')
Trace ID: 6032ee18-2474-11ef-8e4d-00163e000cd1
If you want to view detailed obdiag logs, please run: obdiag display-trace 6032ee18-2474-11ef-8e4d-00163e000cd1
[root@CloudOBSvr1 ~]# obdiag display-trace 6032ee18-2474-11ef-8e4d-00163e000cd1
[2024-06-07 10:19:32.053] [DEBUG] - cmd: []
[2024-06-07 10:19:32.054] [DEBUG] - opts: {'scene': 'major_hold', 'store_dir': './rca/', 'input_parameters': None, 'c': '/root/.obdiag/config.yml'}
[2024-06-07 10:19:32.054] [DEBUG] - mkdir /usr/local/oceanbase-diagnostic-tool/conf/inner_config.yml
[2024-06-07 10:19:32.059] [DEBUG] - mkdir /root/.obdiag/config.yml
[2024-06-07 10:19:43.589] [DEBUG] - connect databse …
[2024-06-07 10:19:43.594] [DEBUG] - RCAHandler.init store dir: ./rca/
[2024-06-07 10:19:43.594] [DEBUG] - rca result save_path is :./rca/
[2024-06-07 10:19:43.595] [DEBUG] - start get_observer_version_by_sql . input: {‘ob_cluster_name’: ‘JGPTOBDB_CLUSTER’, ‘db_host’: ‘172.16.253.113’, ‘db_port’: 2881, ‘tenant_sys’: {‘user’: ‘root@sys’, ‘password’: ‘aaAA11__’}, ‘servers’: [{‘ip’: ‘172.16.253.113’, ‘ssh_username’: ‘root’, ‘ssh_password’: ‘1Zqmq^TXTMzSwKM7’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b64370>}, {‘ip’: ‘172.16.253.114’, ‘ssh_username’: ‘root’, ‘ssh_password’: ‘1Zqmq^TXTMzSwKM7’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b64550>}, {‘ip’: ‘172.16.253.115’, ‘ssh_username’: ‘root’, ‘ssh_password’: ‘1Zqmq^TXTMzSwKM7’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b21880>}, {‘ip’: ‘172.16.253.116’, ‘ssh_username’: ‘root’, ‘ssh_password’: ‘1Zqmq^TXTMzSwKM7’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b260d0>}, {‘ip’: ‘172.16.253.117’, ‘ssh_username’: ‘root’, ‘ssh_password’: ‘1Zqmq^TXTMzSwKM7’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: 
<common.ssh.SshHelper object at 0x151ed9b21fd0>}, {‘ip’: ‘172.16.253.118’, ‘ssh_username’: ‘root’, ‘ssh_password’: ‘1Zqmq^TXTMzSwKM7’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b2fbb0>}]}
[2024-06-07 10:19:43.598] [DEBUG] - connect databse …
[2024-06-07 10:19:43.612] [DEBUG] - get_observer_version_by_sql ob_version_info is ('5.7.25-OceanBase_CE-v4.2.1.4',)
[2024-06-07 10:19:43.612] [DEBUG] - RCAHandler.init get observer version: 4.2.1.4
[2024-06-07 10:19:43.613] [DEBUG] - RCAHandler.init get observer version: 4.2.1.4
[2024-06-07 10:19:43.613] [DEBUG] - [remote host 172.16.253.113] run cmd = [/home/admin/obproxy/bin/obproxy --version] start …
[2024-06-07 10:19:43.741] [DEBUG] - get obproxy version, run cmd = [/home/admin/obproxy/bin/obproxy --version]
[2024-06-07 10:19:43.743] [DEBUG] - [remote host 172.16.253.113] run cmd = [export LD_LIBRARY_PATH=/home/admin/obproxy/lib && /home/admin/obproxy/bin/obproxy --version] start …
[2024-06-07 10:19:43.913] [DEBUG] - get obproxy version with LD_LIBRARY_PATH,cmd:export LD_LIBRARY_PATH=/home/admin/obproxy/lib && /home/admin/obproxy/bin/obproxy --version, result:/home/admin/obproxy/bin/obproxy --version
[2024-06-07 10:19:43.914] [DEBUG] obproxy (OceanBase 4.2.1.0 11)
[2024-06-07 10:19:43.914] [DEBUG] REVISION: 1-local-6599462fc897a4a46734d64585906ea80975a656
[2024-06-07 10:19:43.914] [DEBUG] BUILD_TIME: Oct 12 2023 21:03:30
[2024-06-07 10:19:43.914] [DEBUG] BUILD_FLAGS: -g -O2 -D_OB_VERSION=1000 -D_NO_EXCEPTION -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -DNDEBUG -D__USE_LARGEFILE64 -D_FILE_OFFSET_BITS=64 -D_LARGE_FILE -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -Wall -Wextra -Wno-unused-parameter -Wformat -Wno-conversion -Wno-deprecated -Wno-invalid-offsetof -Wno-unused-result -Wno-format-security -finline-functions -fno-strict-aliasing -mtune=core2 -Wno-psabi -Wno-sign-compare -Wno-class-memaccess -Wno-deprecated-copy -Wno-ignored-qualifiers -Wno-aligned-new -Wno-format-truncation -Wno-literal-suffix -Wno-format-overflow -Wno-stringop-truncation -Wno-memset-elt-size -Wno-cast-function-type -Wno-address-of-packed-member -fno-omit-frame-pointer -Wl,-z,noexecstack,-z,relro,-z,now,-z,notext -fPIC -DGCC_52 -D_GLIBCXX_USE_CXX11_ABI=0 -DSUPPORT_SSE4_2 -DHAVE_SCHED_GETCPU -DHAVE_REALTIME_COARSE -DOB_HAVE_EVENTFD -DHAVE_FALLOCATE -I/home/jenkins/agent/workspace/rpm-obproxy-ce-4.2.1.0-1.1.18/ob_source_code_dir/deps/3rd/usr/local/oceanbase/deps/devel/include -I/home/jenkins/agent/workspace/rpm-obproxy-ce-4.2.1.0-1.1.18/ob_source_code_dir/deps/3rd/usr/local/oceanbase/deps/devel/include/mariadb -I/home/jenkins/agent/workspace/rpm-obproxy-ce-4.2.1.0-1.1.18/ob_source_code_dir/deps/3rd/usr/include -Werror
[2024-06-07 10:19:43.915] [DEBUG]
[2024-06-07 10:19:43.916] [DEBUG] Copyright (c) 2021 OceanBase
[2024-06-07 10:19:43.916] [DEBUG] OceanBase Database Proxy(ODP) is licensed under Mulan PubL v2.
[2024-06-07 10:19:43.916] [DEBUG] You can use this software according to the terms and conditions of the Mulan PubL v2.
[2024-06-07 10:19:43.916] [DEBUG] You may obtain a copy of Mulan PubL v2 at:
[2024-06-07 10:19:43.916] [DEBUG] http://license.coscl.org.cn/MulanPubL-2.0
[2024-06-07 10:19:43.916] [DEBUG] THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
[2024-06-07 10:19:43.916] [DEBUG] EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
[2024-06-07 10:19:43.916] [DEBUG] MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
[2024-06-07 10:19:43.916] [DEBUG] See the Mulan PubL v2 for more details.
[2024-06-07 10:19:43.917] [DEBUG]
[2024-06-07 10:19:43.917] [DEBUG] - RCAHandler.init get obproxy version: 4.2.1.0
[2024-06-07 10:19:43.917] [DEBUG] - RCAHandler.init get obproxy version: 4.2.1.0
[2024-06-07 10:19:43.935] [DEBUG] - RCAHandler init.cluster:JGPTOBDB_CLUSTER, init.nodes:[{‘ip’: ‘172.16.253.113’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b64370>}, {‘ip’: ‘172.16.253.114’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b64550>}, {‘ip’: ‘172.16.253.115’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b21880>}, {‘ip’: ‘172.16.253.116’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b260d0>}, {‘ip’: ‘172.16.253.117’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b21fd0>}, {‘ip’: ‘172.16.253.118’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/oceanbase’, ‘data_dir’: ‘/root/observer/store’, ‘redo_dir’: ‘/root/observer/store’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘remote’, ‘container_name’: None, ‘host_type’: ‘OBSERVER’, ‘ssher’: 
<common.ssh.SshHelper object at 0x151ed9b2fbb0>}], init.obproxy_nodes:[{‘ip’: ‘172.16.253.113’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/obproxy’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘ssh’, ‘container_name’: None, ‘host_type’: ‘OBPROXY’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b2b910>}, {‘ip’: ‘172.16.253.114’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/obproxy’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘ssh’, ‘container_name’: None, ‘host_type’: ‘OBPROXY’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed9b37fd0>}, {‘ip’: ‘172.16.253.115’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/obproxy’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘ssh’, ‘container_name’: None, ‘host_type’: ‘OBPROXY’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed8c5fd30>}, {‘ip’: ‘172.16.253.116’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/obproxy’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘ssh’, ‘container_name’: None, ‘host_type’: ‘OBPROXY’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed8c5f790>}, {‘ip’: ‘172.16.253.117’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/obproxy’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘ssh’, ‘container_name’: None, ‘host_type’: ‘OBPROXY’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed8c67520>}, {‘ip’: ‘172.16.253.118’, ‘ssh_username’: ‘root’, ‘ssh_port’: 22, ‘home_path’: ‘/home/admin/obproxy’, ‘ssh_key_file’: ‘’, ‘ssh_type’: ‘ssh’, ‘container_name’: None, ‘host_type’: ‘OBPROXY’, ‘ssher’: <common.ssh.SshHelper object at 0x151ed8c677c0>}], init.store_dir:./rca/
[2024-06-07 10:19:43.936] [DEBUG] - major_hold store_dir:./rca//major_hold_20240607101943
[2024-06-07 10:19:43.936] [DEBUG] - major_hold init success
[2024-06-07 10:19:43.957] [WARNING] MajorHoldScene execute CDB_OB_MAJOR_COMPACTION panic: (4016, 'Internal error')
[2024-06-07 10:19:43.958] [ERROR] rca_scene.execute err: MajorHoldScene execute CDB_OB_MAJOR_COMPACTION panic: (4016, 'Internal error')
[2024-06-07 10:19:43.958] [INFO] Trace ID: 6032ee18-2474-11ef-8e4d-00163e000cd1
[2024-06-07 10:19:43.958] [INFO] If you want to view detailed obdiag logs, please run: obdiag display-trace 6032ee18-2474-11ef-8e4d-00163e000cd1

It failed with error 4016. Could the tenant_sys login credentials in ~/.obdiag/config.yml be wrong?

Logging in to the sys tenant reports this internal error.

Then I suspect the sys tenant entry is wrong and you filled in the proxysys tenant instead.

obcluster:
  ob_cluster_name: JGPTOBDB_CLUSTER
  db_host: 172.16.253.113
  db_port: 2881
  tenant_sys:
    user: root@sys
    password: "aaAA11__"
  servers:
    nodes:
      - ip: 172.16.253.113
      - ip: 172.16.253.114
      - ip: 172.16.253.115
      - ip: 172.16.253.116
      - ip: 172.16.253.117
      - ip: 172.16.253.118
    global:
      ssh_username: 'root'
      ssh_password: '1Zqmq^TXTMzSwKM7'
      home_path: /home/admin/oceanbase
obproxy:
  obproxy_cluster_name: JGPTOBDB_OBPROXY
  servers:
    nodes:
      - ip: 172.16.253.113
      - ip: 172.16.253.114
      - ip: 172.16.253.115
      - ip: 172.16.253.116
      - ip: 172.16.253.117
      - ip: 172.16.253.118
    global:
      ssh_username: 'root'
      ssh_password: '1Zqmq^TXTMzSwKM7'
      home_path: /home/admin/obproxy
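One way to check the tenant_sys credentials independently of obdiag is to query by hand the view named in the failing RCA step. The following is a minimal sketch, assuming a MySQL-compatible client (mysql or obclient) is installed on the machine; the function name and the SQL_CLIENT variable are hypothetical, while the host, port, user, and view name are taken from the thread.

```shell
# Hypothetical helper: query CDB_OB_MAJOR_COMPACTION directly as root@sys.
# SQL_CLIENT is an assumed override; it defaults to the mysql client.
check_major_compaction() {
    host="$1"
    port="${2:-2881}"
    # -p with no value makes the client prompt for the password, so the
    # password does not end up in the shell history.
    "${SQL_CLIENT:-mysql}" -h"$host" -P"$port" -uroot@sys -p \
        -e 'SELECT * FROM oceanbase.CDB_OB_MAJOR_COMPACTION;'
}
```

For example, check_major_compaction 172.16.253.113 2881 connects to the observer on the port given in the config. If the same 4016 error comes back here, that would suggest the problem is in the sys tenant itself rather than in the obdiag configuration.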

Run obdiag analyze log to analyze the logs, and post the result.

[root@CloudOBSvr1 log]# obdiag analyze log --from "2024-06-07 00:00:00" --to "2024-06-07 11:00:00"
analyze_log start …
[WARN] 172.16.253.113 The number of log files is 73, out of range (0,50]
[WARN] 172.16.253.114 The number of log files is 571, out of range (0,50]
[WARN] 172.16.253.115 The number of log files is 597, out of range (0,50]
[WARN] 172.16.253.116 The number of log files is 479, out of range (0,50]
[WARN] 172.16.253.117 The number of log files is 592, out of range (0,50]
[WARN] 172.16.253.118 The number of log files is 112, out of range (0,50]

Analyze OceanBase Online Log Summary:

+----------------+---------------------------------------------------------------------+------------+-------------+-----------+---------+
| Node           | Status                                                              | FileName   | ErrorCode   | Message   | Count   |
+================+=====================================================================+============+=============+===========+=========+
| 172.16.253.113 | Error:Too many files 73 > 50, Please adjust the analyze time range  |            |             |           |         |
+----------------+---------------------------------------------------------------------+------------+-------------+-----------+---------+
| 172.16.253.114 | Error:Too many files 571 > 50, Please adjust the analyze time range |            |             |           |         |
+----------------+---------------------------------------------------------------------+------------+-------------+-----------+---------+
| 172.16.253.115 | Error:Too many files 597 > 50, Please adjust the analyze time range |            |             |           |         |
+----------------+---------------------------------------------------------------------+------------+-------------+-----------+---------+
| 172.16.253.116 | Error:Too many files 479 > 50, Please adjust the analyze time range |            |             |           |         |
+----------------+---------------------------------------------------------------------+------------+-------------+-----------+---------+
| 172.16.253.117 | Error:Too many files 592 > 50, Please adjust the analyze time range |            |             |           |         |
+----------------+---------------------------------------------------------------------+------------+-------------+-----------+---------+
| 172.16.253.118 | Error:Too many files 112 > 50, Please adjust the analyze time range |            |             |           |         |
+----------------+---------------------------------------------------------------------+------------+-------------+-----------+---------+
For more details, please run cmd ' cat /home/admin/oceanbase/log/analyze_pack_20240607110447/result_details.txt '

Trace ID: acfb068a-247a-11ef-b87b-00163e000cd1
If you want to view detailed obdiag logs, please run: obdiag display-trace acfb068a-247a-11ef-b87b-00163e000cd1
[root@CloudOBSvr1 log]#

It looks like a very large volume of logs is being written; the logs are flooded.
obdiag analyze log --since 3m (analyzes the last three minutes of logs)
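The "Too many files" errors above mean the requested window covered more than 50 log files on every node. Besides --since, a short window can also be passed with --from and --to, both of which appear earlier in the thread. The following is a minimal sketch, assuming GNU date (needed for -d); the function name is hypothetical, and it only prints the command so that the window can be reviewed before anything is run.

```shell
# Hypothetical helper: print an "obdiag analyze log" command that covers
# the last N minutes (default 10). Assumes GNU date, which supports -d.
analyze_window_cmd() {
    minutes="${1:-10}"
    from=$(date -d "$minutes minutes ago" '+%Y-%m-%d %H:%M:%S')
    to=$(date '+%Y-%m-%d %H:%M:%S')
    # Timestamps use the same format as the --from/--to values in the thread.
    printf 'obdiag analyze log --from "%s" --to "%s"\n' "$from" "$to"
}
```

Calling analyze_window_cmd 5 prints a command for the last five minutes. If a node still reports too many files, the window can be shortened further.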

obdiag 扫描3分钟分析日志.txt (42.9 KB)

From the log analysis result, serialization failed in the cluster, and errors 4034 and 4013 were reported.

For the 4013 error, start by working through this document: https://mp.weixin.qq.com/s?__biz=MzkwMDY2ODg5MQ==&mid=2247483794&idx=1&sn=328a9ed9a6f1eba5967acde3936ff8fe&chksm=c041cd26f736443092c09336da3cd54aa2af34568e32a33090d444cac3839748499184b712e7&token=608507686&lang=zh_CN#rd

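Judging from the errors, this is probably caused by insufficient memory (4013 is the memory allocation failure code). You can first search the observer log for the literal string '[OOPS]', for example with grep -F '[OOPS]', to determine which tenant has the problem, and then work out which module is using the most memory.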
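The search suggested above can be sketched as a small shell function. One detail matters: as a regular expression, [OOPS] is a bracket expression that matches any single O, P, or S, so a plain grep '[OOPS]' would match far more lines than intended. The sketch below uses -F to search for the literal tag. The function name is hypothetical, and the default directory is an assumption based on the home_path shown in the thread.

```shell
# Hypothetical helper: print every line that contains the literal tag
# [OOPS] from the observer logs in a directory. The default directory is
# assumed from the thread's home_path; adjust it to your deployment.
find_oops() {
    log_dir="${1:-/home/admin/oceanbase/log}"
    # -F treats the pattern as a fixed string instead of a bracket
    # expression; -h omits file names so only the log lines are printed.
    grep -hF '[OOPS]' "$log_dir"/observer.log*
}
```

Running find_oops on each observer node, or find_oops /path/to/log for a non-default location, lists the matching lines, which can then be read to see which tenant they refer to.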