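OCP meta cluster load is too high

[Environment] Production
[OB]
[Version] 4.3.5.2, 4.3.5.3
[Problem description] High other_wait among the wait events drives up tenant CPU usage, which in turn drives up server load
[Reproduction path] Always present
[Attachments and logs]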

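Hi all, the load on our OCP meta clusters is too high. From my analysis, high other_wait among the wait events is driving up tenant CPU usage, which in turn drives up server load. How can I dig further to find the root cause?


This OCP is on 4.3.5.3. High other_wait leads to high server load, and it has been like this for a long time.


This OCP is on 4.3.5.2. Here too, high other_wait among the wait events leads to high server load, and the pattern is periodic, but I see no fluctuation in QPS.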
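
One possible starting point for breaking other_wait down by event is to query the wait-event view from inside the ocp_monitor tenant. The following is a minimal sketch, not a confirmed procedure: it assumes the OCP other_wait metric corresponds to rows with WAIT_CLASS = 'OTHER' in oceanbase.GV$SYSTEM_EVENT, and build_wait_sql is a hypothetical helper name that only builds the query text.

```shell
#!/bin/sh
# Hypothetical helper: print a query listing the top wait events of one wait class.
# Assumption: the OCP "other_wait" metric maps to WAIT_CLASS = 'OTHER'.
build_wait_sql() {
    wait_class="${1:-OTHER}"   # default to the class discussed in this thread
    printf "SELECT SVR_IP, EVENT, TOTAL_WAITS, TIME_WAITED FROM oceanbase.GV\$SYSTEM_EVENT WHERE WAIT_CLASS = '%s' ORDER BY TIME_WAITED DESC LIMIT 10;\n" "$wait_class"
}

# Print the query so it can be reviewed before running it against the tenant.
build_wait_sql OTHER
```

The printed statement can then be passed to a MySQL-compatible client connected to the ocp_monitor tenant, for example with obclient -e "$(build_wait_sql OTHER)" plus your own connection options. The counters are cumulative, so comparing two runs a few minutes apart shows which events are actually growing.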

@AntTech_VGXKVP

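These are the ocp_monitor tenants of the two OCP meta clusters. So you mean the CPU usage of the ocp_monitor tenant is high, right?

gcpwe_ocp cluster, ocp_monitor tenant: CPU max 128%, avg 87%
eu_ocp cluster, ocp_monitor tenant: CPU max 108%, avg 60%

What is the resource configuration of each ocp_monitor tenant? And what scale do they manage (number of clusters, hosts, and tenants)?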

  1. Yes, the CPU usage of the ocp_monitor tenant is high.
  2. Both ocp_monitor tenants are configured with 1c2g, and each manages only one OB cluster (3 nodes, 1 tenant).

That configuration is on the low side. See the recommendations in the official documentation:

https://www.oceanbase.com/docs/common-ocp-1000000000826976

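I have another OCP whose ocp_monitor tenant is also configured with 1c2g and also manages only one OB cluster (3 nodes, 1 tenant). On that OCP, though, other_wait is low and CPU usage is much lower. Why is that?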

It is on version 4.3.5.2.

Please run an obdiag inspection on each of the two clusters and analyze their logs:

obdiag check run
obdiag analyze log --since 1h

https://www.oceanbase.com/docs/common-obdiag-cn-1000000003892414

https://www.oceanbase.com/docs/common-obdiag-cn-1000000003892416

Logs for gcpwe_ocp: a152e912
Logs for eu_ocp: 294e084a

Download from: 即时传 (online file transfer)

obd obdiag check run gcpwe_ocp failed with: [ERROR] check Exception: case_package_file /home/admin/.obdiag/check/observer_check_package.yaml is not exist

I still haven't been able to download the files. Please post the inspection error first.
Also, which OCP version are you using?

I tried it myself and the download speed was quite fast.

The check error is below. OCP is ocp-server-ce: version: 4.3.6

[admin@gcpwe-prod-db-oceanbase-ocp-a-0 ~]$ obd obdiag check run gcpwe_ocp
Open ssh connection ok
obdiag version: 3.4.0
check start …
[ERROR] check Exception: case_package_file /home/admin/.obdiag/check/observer_check_package.yaml is not exist
Trace ID: 88fee68e-a5b6-11f0-a7b2-42010afff03d
If you want to view detailed obdiag logs, please run: /home/admin/.obd/repository/oceanbase-diagnostic-tool/3.4.0/c1a7d5038425acc9324b1716da1abca94b7d68d7/obdiag display-trace 88fee68e-a5b6-11f0-a7b2-42010afff03d

See https://www.oceanbase.com/product/ob-deployer/error-codes .
Trace ID: 87a377c8-a5b6-11f0-8173-42010afff03d
If you want to view detailed obd logs, please run: obd display-trace 87a377c8-a5b6-11f0-8173-42010afff03d

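We probably have rate limiting on our side. As for this error: do you have a yaml config file set up? If not, you can use the approach below.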


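I deployed with obd, and as I understand it no yaml file needs to be configured in that case. eu_ocp is the name of the OCP meta cluster, so I ran obd obdiag check run eu_ocp directly.

Which obd version is it? I'll ask the obdiag maintainers to take a look.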

obd --version

obdiag version: 3.6.0

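The output here shows obdiag 3.4.0. Upgrade to the latest version and try again.

I have a newer version as well, 3.6.0, and it reports the same error.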

[admin@eu-prod-db-oceanbase-ocp-a-0 ~]$ obd obdiag check run eu_ocp
Open ssh connection ok
obdiag version: 3.6.0
check start …
[ERROR] check Exception: case_package_file /home/admin/.obdiag/check/observer_check_package.yaml is not exist
Trace ID: 9edd2674-a682-11f0-89df-027c71ac6bbb
If you want to view detailed obdiag logs, please run: /home/admin/.obd/repository/oceanbase-diagnostic-tool/3.6.0/ce431826e5571a6e00d81b9849e32eb53991169f/obdiag display-trace 9edd2674-a682-11f0-89df-027c71ac6bbb

See https://www.oceanbase.com/product/ob-deployer/error-codes .
Trace ID: 9dbe1af0-a682-11f0-afc1-027c71ac6bbb
If you want to view detailed obd logs, please run: obd display-trace 9dbe1af0-a682-11f0-afc1-027c71ac6bbb

Run this:

source ~/oceanbase-diagnostic-tool/init.sh

That fixed it. Please update the official documentation as well.
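
For anyone hitting the same error, the fix can be wrapped in a small pre-flight check. This is a minimal sketch, assuming the same paths as in this thread (the package file under ~/.obdiag/check/ and the init script at ~/oceanbase-diagnostic-tool/init.sh); check_package_present and next_step are hypothetical helper names, not part of obd or obdiag.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the obdiag check package file exists.
# The default path is the one named in the error message in this thread.
check_package_present() {
    pkg_file="${1:-$HOME/.obdiag/check/observer_check_package.yaml}"
    [ -f "$pkg_file" ]
}

# Hypothetical helper: print which step to take next.
next_step() {
    if check_package_present "${1:-}"; then
        echo "ready: run obd obdiag check run <cluster_name>"
    else
        echo "missing: run 'source ~/oceanbase-diagnostic-tool/init.sh' first"
    fi
}

next_step
```

Calling next_step with no argument checks the default path. Pass a different path as the first argument if your obdiag home directory is elsewhere.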

[admin@eu-xxx ~]$ obd obdiag check run eu_ocp
Open ssh connection ok
obdiag version: 3.6.0
check start …
[WARN] ./check_report/ not exists. mkdir it!
[WARN] step_base ResultFalseException:ip:xxx ,data_dir and log_dir_disk are on the same disk.
[WARN] Task execute StepResultFailException: ip:xxx ,data_dir and log_dir_disk are on the same disk.
[WARN] step_base ResultFalseException:tsar is not installed. we can not check tcp retransmission.
[WARN] Task execute StepResultFailException: tsar is not installed. we can not check tcp retransmission.
[WARN] step_base ResultFalseException:network_speed is null , can not get real speed
[WARN] Task execute StepResultFailException: network_speed is null , can not get real speed
[WARN] ethtool output invalid for eth0: Cannot get wake-on-lan settings: Operation not permitted
[WARN] step_base ResultFalseException:net.core.netdev_max_backlog: 16384. recommended: 500 ≤ value ≤ 10000.
[WARN] Task execute StepResultFalseException: net.core.netdev_max_backlog: 16384. recommended: 500 ≤ value ≤ 10000. .
[WARN] step_base ResultFalseException:net.ipv4.ip_forward : 1. recommended: 0.
[WARN] Task execute StepResultFalseException: net.ipv4.ip_forward : 1. recommended: 0. .
[WARN] net.ipv4.tcp_tw_recycle is not exist
[WARN] step_base ResultFalseException:net.ipv4.conf.default.rp_filter : 0. recommended: 1.
[WARN] Task execute StepResultFalseException: net.ipv4.conf.default.rp_filter : 0. recommended: 1. .
[WARN] step_base ResultFalseException:net.ipv4.tcp_slow_start_after_idle: 1. recommended: 0.
[WARN] Task execute StepResultFalseException: net.ipv4.tcp_slow_start_after_idle: 1. recommended: 0. .
[WARN] step_base ResultFalseException:fs.pipe-user-pages-soft : 16384. recommended: 0.
[WARN] Task execute StepResultFalseException: fs.pipe-user-pages-soft : 16384. recommended: 0. .
[WARN] step_base ResultFalseException:ip_local_port_range_min : 10000. recommended: 3500
[WARN] Task execute StepResultFalseException: ip_local_port_range_min : 10000. recommended: 3500 .
[WARN] step_base ResultFalseException:net.ipv4.tcp_rmem_default : 12582912. net.ipv4.tcp_rmem_default from net.ipv4.tcp_rmem. recommended: is 65536 ≤ default≤ 131072
[WARN] Task execute StepResultFalseException: net.ipv4.tcp_rmem_default : 12582912. net.ipv4.tcp_rmem_default from net.ipv4.tcp_rmem. recommended: is 65536 ≤ default≤ 131072 .
[WARN] step_base ResultFalseException:net.ipv4.tcp_wmem_default : 12582912. recommended: is 65536 ≤ default≤ 131072
[WARN] Task execute StepResultFalseException: net.ipv4.tcp_wmem_default : 12582912. recommended: is 65536 ≤ default≤ 131072 .
[WARN] step_base ResultFalseException:On ip : xxx, ulimit -c as “core file size” is 0 . recommended: unlimited.
[WARN] Task execute StepResultFalseException: On ip : xxx, ulimit -c as “core file size” is 0 . recommended: unlimited. .
[WARN] step_base ResultFalseException:On ip : xxx, ulimit -u as “max user processes” is unlimited . recommended: 655360.
[WARN] Task execute StepResultFalseException: On ip : xxx, ulimit -u as “max user processes” is unlimited . recommended: 655360. .
[WARN] step_base ResultFalseException:On ip : xxx, ulimit -s as “stack size” is 8192 . recommended: unlimited.
[WARN] Task execute StepResultFalseException: On ip : xxx, ulimit -s as “stack size” is 8192 . recommended: unlimited. .
[WARN] step_base ResultFalseException:On ip : xxx, ulimit -n as “open files” is 1048576 . recommended: unlimited.
[WARN] Task execute StepResultFalseException: On ip : xxx, ulimit -n as “open files” is 1048576 . recommended: unlimited. .
[WARN] step_base ResultFalseException:there tenant resource pool configuration is less than 2C4G, please check it. tenant_id: 1,1002,1004
[WARN] Task execute StepResultFailException: there tenant resource pool configuration is less than 2C4G, please check it. tenant_id: 1,1002,1004
Check observer finished. For more details, please run cmd’ cat ./check_report/obdiag_check_report_observer_2025-10-13-09-44-55.table ’
Trace ID: 360aa130-a7d6-11f0-8aa9-027c71ac6bbb
If you want to view detailed obdiag logs, please run: /home/admin/.obd/repository/oceanbase-diagnostic-tool/3.6.0/ce431826e5571a6e00d81b9849e32eb53991169f/obdiag display-trace 360aa130-a7d6-11f0-8aa9-027c71ac6bbb
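
Each finding in the output above is printed twice, once as a step_base line and once as a Task execute line. The following is a minimal sketch for condensing it into one line per finding. It assumes the step_base ResultFalseException: prefix appears exactly as in this 3.6.0 run and that the output was saved to a file (check_output.log is a hypothetical name); summarize_warns is an illustrative helper, not an obdiag feature.

```shell
#!/bin/sh
# Hypothetical helper: read obdiag check output on stdin and print each
# finding once. Only "step_base" lines are kept, because every finding is
# repeated on a following "Task execute" line.
summarize_warns() {
    grep 'step_base ResultFalseException:' \
        | sed 's/^.*step_base ResultFalseException://' \
        | sort -u
}

# Example with two lines copied from the output in this thread.
printf '%s\n' \
    '[WARN] step_base ResultFalseException:net.ipv4.ip_forward : 1. recommended: 0.' \
    '[WARN] Task execute StepResultFalseException: net.ipv4.ip_forward : 1. recommended: 0. .' \
    | summarize_warns
```

With a saved run, the call would be summarize_warns < check_output.log. For the question in this thread, the entry that matters most in the resulting list is the resource pool warning (less than 2C4G for tenant_id 1, 1002, 1004), which matches the earlier remark that the 1c2g configuration is on the low side.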

Following along to learn.