obdiag one-click inspection does not work as expected

OceanBase version: 4.2.5.3
obdiag version: 3.6.0

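The cluster has one observer node and one obproxy node, and obdiag is installed on a third machine. From the obdiag node, passwordless SSH login to both the observer and the obproxy works.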

The script is as follows:

ob_host="192.168.1.100"
ob_port="2883"
ob_pass="xxxx"
ob_home_dir="/home/dumbo/"
obp_home_dir="/home/dumbo/obproxy"
ob_data_dir="/home/dumbo/oceanbase/store"
ob_redo_dir="/home/dumbo/oceanbase/store"
obp="192.168.1.200"
obs="192.168.1.100"
date
obdiag check run \
    --config db_host=${ob_host} \
    --config db_port=${ob_port} \
    --config tenant_sys.user=root@sys#test1 \
    --config tenant_sys.password=${ob_pass} \
    --config obcluster.servers.global.home_path=${ob_home_dir} \
    --config obcluster.servers.nodes[0].ip=${obs} \
    --config obcluster.servers.nodes[0].data_dir=${ob_data_dir} \
    --config obcluster.servers.nodes[0].redo_dir=${ob_redo_dir} \
    --config obcluster.servers.global.ssh_port=2200 \
    --config obcluster.servers.global.ssh_username=dumbo \
    --config obcluster.servers.global.ssh_key_file=/home/dumbo/.ssh/id_rsa \
    --config obproxy.servers.nodes[0].ip=${obp} \
    --config obproxy.servers.global.home_path=${obp_home_dir} \
    --config obproxy.servers.global.ssh_port=2200 \
    --config obproxy.servers.global.ssh_username=dumbo \
    --config obproxy.servers.global.ssh_key_file=/home/dumbo/.ssh/id_rsa
obdiag version: 3.6.0
check start ...
[WARN] step_base ResultFalseException:xfs need repair. Please check disk. xfs_repair_log: dmesg: read kernel buffer failed: Operation not permitted
[WARN] Task execute StepResultFailException: xfs need repair. Please check disk. xfs_repair_log: dmesg: read kernel buffer failed: Operation not permitted
[WARN] step_base ResultFalseException:tsar is not installed. we can not check tcp retransmission.
[WARN] Task execute StepResultFailException: tsar is not installed. we can not check tcp retransmission.
[WARN] step_base ResultFalseException:network: bond0  RX error is not 0, please check by ip -s link show bond0
[WARN] Task execute StepResultFailException: network: bond0  RX error is not 0, please check by ip -s link show bond0
[WARN] step_base ResultFalseException:network_speed is null , can not get real speed
[WARN] Task execute StepResultFailException: network_speed is null , can not get real speed
[WARN] Network write condition wakeup count exceeds threshold on remote_192.168.1.100: 4
[WARN] step_base ResultFalseException:crond is not running.It is recommended to enable it, mainly for setting up scheduled tasks and providing related operation and maintenance capabilities.
[WARN] Task execute StepResultFalseException: crond is not running.It is recommended to enable it, mainly for setting up scheduled tasks and providing related operation and maintenance capabilities.   .
[WARN] step_base ResultFalseException:net.core.somaxconn : 32768. recommended: 2048 ≤ value ≤ 16384.
[WARN] Task execute StepResultFalseException: net.core.somaxconn : 32768. recommended: 2048 ≤ value ≤ 16384.   .
[WARN] step_base ResultFalseException:net.ipv4.ip_forward : 1. recommended: 0.
[WARN] Task execute StepResultFalseException: net.ipv4.ip_forward : 1. recommended: 0.   .
[WARN] net.ipv4.tcp_tw_recycle is not exist
[WARN] step_base ResultFalseException:net.ipv4.conf.default.accept_source_route: 1. recommended: 0.
[WARN] Task execute StepResultFalseException: net.ipv4.conf.default.accept_source_route: 1. recommended: 0.   .
[WARN] step_base ResultFalseException:net.ipv4.tcp_syncookies: 0. recommended: 1.
[WARN] Task execute StepResultFalseException: net.ipv4.tcp_syncookies: 0. recommended: 1.   .
[WARN] step_base ResultFalseException:net.ipv4.tcp_slow_start_after_idle: 1. recommended: 0.
[WARN] Task execute StepResultFalseException: net.ipv4.tcp_slow_start_after_idle: 1. recommended: 0.   .
[WARN] step_base ResultFalseException:vm.max_map_count : 131072. recommended:327680 ≤ value ≤ 1000000.
[WARN] Task execute StepResultFalseException: vm.max_map_count : 131072. recommended:327680 ≤ value ≤ 1000000.   .
[WARN] step_base ResultFalseException:kernel.numa_balancing : 1. recommended: 0.
[WARN] Task execute StepResultFalseException: kernel.numa_balancing : 1. recommended: 0.   .
[WARN] step_base ResultFalseException:ip_local_port_range_min : 3001. recommended: 3500
[WARN] Task execute StepResultFalseException: ip_local_port_range_min : 3001. recommended: 3500   .
[WARN] step_base ResultFalseException:On ip : 192.168.1.100, ulimit -u as "max user processes" is 512289 . recommended: 655360.
[WARN] Task execute StepResultFalseException: On ip : 192.168.1.100, ulimit -u as "max user processes" is 512289 . recommended: 655360.   .
[WARN] step_base ResultFalseException:On ip : 192.168.1.100, ulimit -s as "stack size" is 8192 . recommended: unlimited.
[WARN] Task execute StepResultFalseException: On ip : 192.168.1.100, ulimit -s as "stack size" is 8192 . recommended: unlimited.   .
[WARN] step_base ResultFalseException:On ip : 192.168.1.100, ulimit -n as "open files" is 1048576 . recommended: unlimited.
[WARN] Task execute StepResultFalseException: On ip : 192.168.1.100, ulimit -n as "open files" is 1048576 . recommended: unlimited.   .
[WARN] step_base ResultFalseException:ip:192.168.1.100 ,data_dir and log_dir_disk are on the same disk.
[WARN] Task execute StepResultFailException: ip:192.168.1.100 ,data_dir and log_dir_disk are on the same disk.

That is all the output. After that the run just hangs, and it has been stuck for more than a day.

How should I troubleshoot this?


Could you share the ~/.obdiag/log/obdiag.log file so we can investigate?
We suspect that one of the check items is stuck.


The log is too large, so here is only the last part of it. The run is currently hung.

[2025-08-25 17:09:43.945] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - get observer version, by sql
[2025-08-25 17:09:43.946] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - start get_observer_version_by_sql . input: test-dns.com:2883
[2025-08-25 17:09:43.993] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - connect databse ...
[2025-08-25 17:09:43.999] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - get_observer_version_by_sql ob_version_info is ('5.7.25-OceanBase_CE-v4.2.5.3',)
[2025-08-25 17:09:44.046] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - connect databse ...
[2025-08-25 17:09:44.057] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - ob_query_timeout value: 10000000
[2025-08-25 17:09:44.057] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - ob_query_timeout is within acceptable range: 10000000 microseconds (10.0 seconds)
[2025-08-25 17:09:44.057] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - execute tasks end : cluster.ob_query_timeout
[2025-08-25 17:09:44.057] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - execute tasks is cluster.observer_not_active
[2025-08-25 17:09:44.057] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - cluster.version is 4.2.5.3
[2025-08-25 17:09:44.058] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - cluster.observer_not_active execute!
[2025-08-25 17:09:44.058] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - task_base execute
[2025-08-25 17:09:44.058] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - version_int is 4.2.5.3 steps_versions is [4.0.0.0,*]
[2025-08-25 17:09:44.058] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - minVersion is 4.0.0.0, maxVersion is 999
[2025-08-25 17:09:44.058] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - filter_by_version is return 0
[2025-08-25 17:09:44.060] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - run task in node: {'ip': '192.168.1.100', 'ssh_username': 'dumbo', 'ssh_port': '2200', 'home_path': '/home/oceanbase', 'data_dir': '/home/oceanbase/store', 'redo_dir': '/home/oceanbase/store', 'ssh_key_file': '/home/dumbo/.ssh/id_rsa', 'ssh_type': 'remote', 'container_name': '', 'namespace': '', 'pod_name': '', 'kubernetes_config_file': '', 'host_type': 'OBSERVER', 'ssher': <src.common.ssh_client.ssh.SshClient object at 0x7f4a95676d30>}
[2025-08-25 17:09:44.061] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - step nu: 1
[2025-08-25 17:09:44.061] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - step nu: 1 initted, to execute
[2025-08-25 17:09:44.061] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - task execute and result
[2025-08-25 17:09:44.061] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - StepSQLHandler execute: select  GROUP_CONCAT(DISTINCT SVR_IP) from oceanbase.DBA_OB_SERVERS where STATUS <> "ACTIVE" or START_SERVICE_TIME = null or START_SERVICE_TIME = 0 or STOP_TIME is not null;
[2025-08-25 17:09:44.063] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - execute_sql result:((None,),)
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - sql result:
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - sql execute update task_variable_dict: not_ACTIVE_OBSERVER =
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - self.task_variable_dict: {'remote_ip': '192.168.1.100', 'remote_ssh_username': 'dumbo', 'remote_ssh_port': '2200', 'remote_home_path': '/home/oceanbase', 'remote_data_dir': '/home/oceanbase/store', 'remote_redo_dir': '/home/oceanbase/store', 'remote_ssh_key_file': '/home/dumbo/.ssh/id_rsa', 'remote_ssh_type': 'remote', 'remote_container_name': '', 'remote_namespace': '', 'remote_pod_name': '', 'remote_kubernetes_config_file': '', 'remote_host_type': 'OBSERVER', 'remote_ssher': <src.common.ssh_client.ssh.SshClient object at 0x7f4a95676d30>, 'not_ACTIVE_OBSERVER': ''}
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - result execute
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - verify_result execute
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - verify_type input is None
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - verify_type input is base, to set base
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - the result verify is [ -z "$not_ACTIVE_OBSERVER" ]
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_ip ,the value:192.168.1.100 , the type:<class 'str'>
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_ssh_username ,the value:dumbo , the type:<class 'str'>
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_ssh_port ,the value:2200 , the type:<class 'str'>
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_home_path ,the value:/home/oceanbase , the type:<class 'str'>
[2025-08-25 17:09:44.064] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_data_dir ,the value:/home/oceanbase/store , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_redo_dir ,the value:/home/oceanbase/store , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_ssh_key_file ,the value:/home/dumbo/.ssh/id_rsa , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_ssh_type ,the value:remote , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_container_name ,the value: , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_namespace ,the value: , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_pod_name ,the value: , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_kubernetes_config_file ,the value: , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_host_type ,the value:OBSERVER , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: remote_ssher ,the value:<src.common.ssh_client.ssh.SshClient object at 0x7f4a95676d30> , the type:<class 'src.common.ssh_client.ssh.SshClient'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - add env: not_ACTIVE_OBSERVER ,the value: , the type:<class 'str'>
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - real_shell: not_ACTIVE_OBSERVER=""
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_ssher="<src.common.ssh_client.ssh.SshClient object at 0x7f4a95676d30>"
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_host_type="OBSERVER"
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_kubernetes_config_file=""
[2025-08-25 17:09:44.065] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_pod_name=""
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_namespace=""
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_container_name=""
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_ssh_type="remote"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_ssh_key_file="/home/dumbo/.ssh/id_rsa"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_redo_dir="/home/oceanbase/store"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_data_dir="/home/oceanbase/store"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_home_path="/home/oceanbase"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_ssh_port="2200"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_ssh_username="dumbo"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] remote_ip="192.168.1.100"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG]
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] if [ -z "$not_ACTIVE_OBSERVER" ]; then
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG]     echo "true"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] else
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG]     echo "false"
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] fi
[2025-08-25 17:09:44.066] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG]
[2025-08-25 17:09:44.084] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - _verify_base result: true
[2025-08-25 17:09:44.085] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - verify.execute end. and result is True
[2025-08-25 17:09:44.085] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - step nu: 1 execute end
[2025-08-25 17:09:44.085] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - task execute end
[2025-08-25 17:09:44.085] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - execute tasks end : cluster.observer_not_active
[2025-08-25 17:09:44.085] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - execute tasks is cluster.observer_port
[2025-08-25 17:09:44.085] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - cluster.version is 4.2.5.3
[2025-08-25 17:09:44.086] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - cluster.observer_port execute!
[2025-08-25 17:09:44.086] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - task_base execute
[2025-08-25 17:10:23.849] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - get observer version, by sql
[2025-08-25 17:10:23.852] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - start get_observer_version_by_sql . input: test-dns.com:2883
[2025-08-25 17:10:23.891] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - connect databse ...
[2025-08-25 17:10:23.897] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - get_observer_version_by_sql ob_version_info is ('5.7.25-OceanBase_CE-v4.2.5.3',)
[2025-08-25 17:10:23.939] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - connect databse ...
[2025-08-25 17:10:23.946] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - Execute Shell command on server 192.168.1.100:command -v nc
[2025-08-25 17:10:23.952] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - Execute Shell command on server 192.168.1.100:echo | nc -v 192.168.1.100 2882
[2025-08-25 17:10:24.001] [95e6eb22-818e-11f0-988d-3a8805eaf744] [DEBUG] - Execute Shell command on server 192.168.1.100:echo | nc -v 192.168.1.100 2881
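
One way to narrow down a hang from a large obdiag.log is to list the tasks that logged a start line but never an end line. The sketch below assumes the two DEBUG patterns visible in the excerpt above ("execute tasks is <name>" and "execute tasks end : <name>"). The function name `unfinished_tasks` is made up for illustration and is not part of obdiag.

```shell
#!/usr/bin/env bash
# List check tasks that logged a start line but no matching end line.
# Assumed log patterns (taken from the excerpt, not from obdiag documentation):
#   "... execute tasks is <name>"      -> task started
#   "... execute tasks end : <name>"   -> task finished
unfinished_tasks() {
    local log_file="$1"
    awk '
        /execute tasks end : / {
            sub(/.*execute tasks end : /, "")   # keep only the task name
            delete started[$0]
            next
        }
        /execute tasks is / {
            sub(/.*execute tasks is /, "")      # keep only the task name
            started[$0] = NR
        }
        END { for (name in started) print name }
    ' "$log_file" | sort
}
```

For example, `unfinished_tasks ~/.obdiag/log/obdiag.log` prints one task name per line. If the log holds several runs, a task that hung in an earlier run and finished in a later one will not be listed.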

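Over the past few days I tested obdiag on four machines and observed the following:
1. Two machines run Debian 9. On them obdiag is very slow at the start (the time from launching it to the first result is long), and it still hangs in the end.
2. One machine runs Debian 10. obdiag is fairly fast at the start there, but it also ends up hanging and never returns a result.
3. Another machine runs Debian 10. obdiag is fairly fast at the start there and does return a result in the end.

These behaviors reproduce consistently.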


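Judging from the log alone, the cluster.observer_port check is the one that got stuck. We will evaluate whether this check item needs to be optimized.

It probably hung while running echo | nc -v 192.168.1.100 2881 on node 192.168.1.100.

You can skip this check by deleting the file ~/.obdiag/check/tasks/observer/cluster/observer_port.py.

In the next release we will add a feature that makes command execution exit on timeout, to avoid this situation.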
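
Until such a feature is available, a manual probe can be given a time limit with the coreutils `timeout` command. This is only a sketch of the idea, not obdiag's implementation. The helper name `run_with_timeout` is hypothetical.

```shell
#!/usr/bin/env bash
# Run a command, but give up after a fixed number of seconds.
# Relies on coreutils `timeout`, which exits with status 124 when the limit is hit.
run_with_timeout() {
    local seconds="$1"
    shift
    local status=0
    timeout "$seconds" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${seconds}s: $*" >&2
    fi
    return "$status"
}
```

A call such as `run_with_timeout 10 bash -c 'echo | nc -v 192.168.1.100 2881'` would then return within about ten seconds even if `nc` never exits on its own.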

192.168.1.101 [192.168.1.100] 2881 (?) open
J
5.7.25�B�D4sh0U=n��.�07bhc;9LQ5Qz3mysql_native_password

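Running this command by hand on each machine is fast, but it appears to prompt for a password.

After deleting the observer_port.py script, the machine mentioned in point 3 does produce a result. However, the two machines from point 1 behave the same as before: it takes a long time from the start of the run until the first log line appears, and every subsequent check log line also takes a long time to show up. How can I troubleshoot this?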
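
One possible way to see where the time goes on the slow machines is to look for long silent periods between consecutive lines of obdiag.log. The sketch below assumes every relevant line starts with a "[YYYY-MM-DD HH:MM:SS.mmm]" timestamp, as in the excerpt posted earlier. The function name `log_gaps` is invented for this example.

```shell
#!/usr/bin/env bash
# Print each place where the log was silent for at least MIN_GAP seconds
# (default 5). Only the time of day is compared, so a silence longer than
# 24 hours is under-reported.
log_gaps() {
    local log_file="$1" min_gap="${2:-5}"
    awk -v min_gap="$min_gap" '
        /^\[[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:/ {
            # Characters 13..24 hold "HH:MM:SS.mmm"
            split(substr($0, 13, 12), t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (seen) {
                gap = now - prev
                if (gap < 0) gap += 86400       # crossed midnight
                if (gap >= min_gap)
                    printf "%.1fs gap after: %s\n", gap, prev_line
            }
            prev = now; prev_line = $0; seen = 1
        }
    ' "$log_file"
}
```

Running `log_gaps ~/.obdiag/log/obdiag.log 5` shows the last message written before each pause, which hints at the step that was running while nothing was logged.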


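  1. About the slow log output
    We try to avoid printing logs to the terminal and only print errors or information that needs attention.
    If you need detailed log output, append
    -v
    to the command. For example, change obdiag check run to obdiag check run -v.

  2. About running the command manually
    Because obdiag is positioned as an agile tool, it executes every command in the form ssh {user}@{ip} {remote command}.
    This can reduce compatibility on some systems, and the behavior may also differ from running the command after logging in to the target machine.

You can try running ssh {user}@{ip} "echo | nc -v 192.168.1.100 2881" on the machine where obdiag is installed.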
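
The suggestion above can be sketched as a small helper that fills in the placeholders, adds a time limit, and reports how long the call took. The SSH port 2200 and the key path come from the script in the question. The function name `probe_port_via_ssh` and the `DRY_RUN` switch are made up for illustration.

```shell
#!/usr/bin/env bash
# Run the port probe as one non-interactive `ssh user@host "command"` call.
# -p, -i, BatchMode and ConnectTimeout are standard OpenSSH client options.
probe_port_via_ssh() {
    local user="$1" host="$2" port="$3"
    local cmd=(ssh -p 2200 -i /home/dumbo/.ssh/id_rsa
               -o BatchMode=yes -o ConnectTimeout=10
               "${user}@${host}" "echo | nc -v ${host} ${port}")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be run (the quotes around the remote command are not printed).
        printf '%s\n' "${cmd[*]}"
        return 0
    fi
    # coreutils `timeout` ends the probe after 30 seconds; `time` reports the duration.
    time timeout 30 "${cmd[@]}"
}
```

Calling `probe_port_via_ssh dumbo 192.168.1.100 2881` from the obdiag machine and comparing the elapsed time with the same `nc` command run after logging in may show whether the two execution paths really behave differently.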


Learned something, thanks.

Learned something, thanks!!