Subtask Uninstall legacy ocp agent times out when adding a host through OCP

Environment

  • Test environment
  • Components
    • OceanBase
    • OCP
  • Versions
    • CentOS 7.9
    • OceanBase 4.2.1.2
    • OCP 4.3.0 (ocp-agent-ce-4.3.0)

Problem description

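When adding a host through OCP, the subtask Uninstall legacy ocp agent times out. Retrying the task, or rolling back and adding the host again, does not help; the subtask still fails.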

PS: The system is a fresh installation, and all software has been upgraded to the latest versions.

Steps to reproduce

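The new host has been configured according to the OB deployment documentation, with the YUM repository switched to the Aliyun mirror. In the OCP console, add the host via ssh-key (root), then check the Prepare host task: the subtask Uninstall legacy ocp agent reports a Timeout expired exception.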

Attachments and logs

The key information is below; the complete task log is in the attachment.

log_task_116.zip (7.4 KB)

2024-08-02 13:11:44.791 ERROR 15300 --- [pool-manual-subtask-executor16,5fe881a91a024ef9,34c6d67da271] c.o.o.e.internal.template.SshTemplate    : SSH execute failed: #!/bin/bash ... find_package 't-oceanbase-ocp-agent' on 172.31.8.88. Root cause net.schmizz.sshj.connection.ConnectionException: Timeout expired and error message SSH executeCommand failed: #!/bin/bash ... find_package 't-oceanbase-ocp-agent' on 172.31.8.88..

2024-08-02 13:11:44.793 ERROR 15300 --- [pool-manual-subtask-executor16,5fe881a91a024ef9,34c6d67da271] c.o.ocp.executor.executor.SshExecutor    : failed to execute ssh command, errMsg:SSH execute failed: #!/bin/bash ... find_package 't-oceanbase-ocp-agent' on 172.31.8.88., cause:{}

java.lang.RuntimeException: SSH executeCommand failed: #!/bin/bash ... find_package 't-oceanbase-ocp-agent' on 172.31.8.88.
	at com.oceanbase.ocp.common.ssh.SshUtils.executeCommand(SshUtils.java:79)
	at com.oceanbase.ocp.executor.internal.template.SshTemplate.execute(SshTemplate.java:73)
	at com.oceanbase.ocp.executor.internal.template.SshTemplate.execute(SshTemplate.java:49)
	at com.oceanbase.ocp.executor.executor.SshExecutor.execute(SshExecutor.java:393)
	at com.oceanbase.ocp.executor.executor.SshExecutor.findPackage(SshExecutor.java:171)
	at com.oceanbase.ocp.executor.executor.SshExecutor.uninstallPackageIfExists(SshExecutor.java:147)
	at com.oceanbase.ocp.service.compute.AgentInstallationTaskService.uninstallLegacyOcpAgent(AgentInstallationTaskService.java:134)
	at com.oceanbase.ocp.service.compute.AgentInstallationTaskService$$FastClassBySpringCGLIB$$f7a6037f.invoke(<generated>)
	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
	at org.springframework.aop.framework.CglibAopProxy.invokeMethod(CglibAopProxy.java:386)
	at org.springframework.aop.framework.CglibAopProxy.access$000(CglibAopProxy.java:85)
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:704)
	at com.oceanbase.ocp.service.compute.AgentInstallationTaskService$$EnhancerBySpringCGLIB$$98201618.uninstallLegacyOcpAgent(<generated>)
	at com.oceanbase.ocp.service.task.business.host.UninstallLegacyOcpAgentTask.run(UninstallLegacyOcpAgentTask.java:44)
	at com.oceanbase.ocp.core.task.engine.runner.JavaSubtaskRunner.execute(JavaSubtaskRunner.java:64)
	at com.oceanbase.ocp.core.task.engine.runner.JavaSubtaskRunner.doRun(JavaSubtaskRunner.java:32)
	at com.oceanbase.ocp.core.task.engine.runner.JavaSubtaskRunner.run(JavaSubtaskRunner.java:26)
	at com.oceanbase.ocp.core.task.engine.runner.RunnerFactory.doRun(RunnerFactory.java:76)
	at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.doRun(SubtaskExecutor.java:203)
	at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.redirectConsoleOutput(SubtaskExecutor.java:197)
	at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.lambda$submit$2(SubtaskExecutor.java:134)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: net.schmizz.sshj.connection.ConnectionException: Timeout expired
	at net.schmizz.sshj.connection.ConnectionException$1.chain(ConnectionException.java:32)
	at net.schmizz.sshj.connection.ConnectionException$1.chain(ConnectionException.java:26)
	at net.schmizz.concurrent.Promise.retrieve(Promise.java:139)
	at net.schmizz.concurrent.Event.await(Event.java:105)
	at net.schmizz.sshj.connection.channel.AbstractChannel.join(AbstractChannel.java:280)
	at com.oceanbase.ocp.common.ssh.SshUtils.executeCommand(SshUtils.java:61)
	... 24 common frames omitted
	Suppressed: net.schmizz.sshj.connection.ConnectionException: Timeout expired
		at net.schmizz.sshj.connection.ConnectionException$1.chain(ConnectionException.java:32)
		at net.schmizz.sshj.connection.ConnectionException$1.chain(ConnectionException.java:26)
		at net.schmizz.concurrent.Promise.retrieve(Promise.java:139)
		at net.schmizz.concurrent.Event.await(Event.java:105)
		at net.schmizz.sshj.connection.channel.AbstractChannel.close(AbstractChannel.java:266)
		at com.oceanbase.ocp.common.ssh.SshUtils.executeCommand(SshUtils.java:76)
		... 24 common frames omitted
	Caused by: java.util.concurrent.TimeoutException: Timeout expired
		... 28 common frames omitted
Caused by: java.util.concurrent.TimeoutException: Timeout expired
	... 28 common frames omitted


2024-08-02 13:11:44.799 ERROR 15300 --- [pool-manual-subtask-executor16,5fe881a91a024ef9,34c6d67da271] c.o.o.c.t.e.c.w.subtask.SubtaskExecutor  : Timeout expired

java.util.concurrent.TimeoutException: Timeout expired
	at net.schmizz.concurrent.Promise.retrieve(Promise.java:139)
	at net.schmizz.concurrent.Event.await(Event.java:105)
	at net.schmizz.sshj.connection.channel.AbstractChannel.join(AbstractChannel.java:280)
	at com.oceanbase.ocp.common.ssh.SshUtils.executeCommand(SshUtils.java:61)
	at com.oceanbase.ocp.executor.internal.template.SshTemplate.execute(SshTemplate.java:73)
	at com.oceanbase.ocp.executor.internal.template.SshTemplate.execute(SshTemplate.java:49)
	at com.oceanbase.ocp.executor.executor.SshExecutor.execute(SshExecutor.java:393)
	at com.oceanbase.ocp.executor.executor.SshExecutor.findPackage(SshExecutor.java:171)
	at com.oceanbase.ocp.executor.executor.SshExecutor.uninstallPackageIfExists(SshExecutor.java:147)
	at com.oceanbase.ocp.service.compute.AgentInstallationTaskService.uninstallLegacyOcpAgent(AgentInstallationTaskService.java:134)
	at com.oceanbase.ocp.service.compute.AgentInstallationTaskService$$FastClassBySpringCGLIB$$f7a6037f.invoke(<generated>)
	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
	at org.springframework.aop.framework.CglibAopProxy.invokeMethod(CglibAopProxy.java:386)
	at org.springframework.aop.framework.CglibAopProxy.access$000(CglibAopProxy.java:85)
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:704)
	at com.oceanbase.ocp.service.compute.AgentInstallationTaskService$$EnhancerBySpringCGLIB$$98201618.uninstallLegacyOcpAgent(<generated>)
	at com.oceanbase.ocp.service.task.business.host.UninstallLegacyOcpAgentTask.run(UninstallLegacyOcpAgentTask.java:44)
	at com.oceanbase.ocp.core.task.engine.runner.JavaSubtaskRunner.execute(JavaSubtaskRunner.java:64)
	at com.oceanbase.ocp.core.task.engine.runner.JavaSubtaskRunner.doRun(JavaSubtaskRunner.java:32)
	at com.oceanbase.ocp.core.task.engine.runner.JavaSubtaskRunner.run(JavaSubtaskRunner.java:26)
	at com.oceanbase.ocp.core.task.engine.runner.RunnerFactory.doRun(RunnerFactory.java:76)
	at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.doRun(SubtaskExecutor.java:203)
	at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.redirectConsoleOutput(SubtaskExecutor.java:197)
	at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.lambda$submit$2(SubtaskExecutor.java:134)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)


Set state for subtask: 158, operation:EXECUTE, state: FAILED

Is there connectivity between the nodes? The error here shows a timeout.

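The nodes can reach each other. I tried both ping and ssh connections, and neither had any problem. I could not find the contents of that bash script. Could you provide it so that I can troubleshoot faster?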
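A successful ping or interactive ssh login does not by itself show that a non-interactive remote command returns within a time limit, which is what the failing subtask depends on. One way to check this is to time a remote package query from the OCP host. The following is only a rough sketch: the contents of the find_package script are not shown in this thread, so `rpm -q t-oceanbase-ocp-agent` is an assumed stand-in for it, and the helper names `run_with_timeout` and `ssh_probe_command` are hypothetical.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments) and return (outcome, elapsed_seconds).

    outcome is "timeout" if the command did not finish within timeout_s,
    otherwise "exit N", where N is the command's exit code.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # never wait for keyboard input
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
        outcome = f"exit {proc.returncode}"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this
        outcome = "timeout"
    return outcome, time.monotonic() - start


def ssh_probe_command(user, host, remote_cmd):
    """Build a non-interactive ssh command line.

    BatchMode=yes makes ssh fail instead of prompting for a password.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", "ConnectTimeout=10",
        f"{user}@{host}",
        remote_cmd,
    ]


# Example (assumed stand-in command, not the real find_package script):
#   cmd = ssh_probe_command("root", "172.31.8.88", "rpm -q t-oceanbase-ocp-agent")
#   print(run_with_timeout(cmd, 60))
```

Running the same probe once as root and once as another user would show whether the delay depends on the login user. A "timeout" outcome suggests the stall is on the target host rather than in OCP itself.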

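The error occurs because the find_package 't-oceanbase-ocp-agent' command, run over SSH on the target host, timed out.
A normal log looks like the screenshot below. If no agent is running on the target host, you can skip this step for now.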

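Thanks. First, skipping the task does allow the host to be added successfully.
Also, in my testing, following your screenshot, the problem does not occur when the host is added with a non-root user.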