OCP deployment hangs at the final ocp-server program health check and fails

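【Environment】 Test environment
【OB or other component】 OCP
【Version】 4.2.1
【Problem description】 OCP deployment through the web UI fails
【Steps to reproduce】 The deployment hangs at the ocp-server program health check
【Attachments and logs】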


2024-01-29 15:38:29,523 INFO get_install_task_info (ocp_handler.py:770) [d4e2e0878ebd4342b4aba2c911c7554e] get ocp install task info
2024-01-29 15:38:29,526 INFO get_install_task_info (ocp_handler.py:770) [0b74ffe514704fdc8bd0f0a99ba9e226] get ocp install task info
2024-01-29 15:38:29,528 INFO dispatch (request_response_log.py:43) [d4e2e0878ebd4342b4aba2c911c7554e] app send response, code: 200
2024-01-29 15:38:29,528 INFO dispatch (request_response_log.py:43) [0b74ffe514704fdc8bd0f0a99ba9e226] app send response, code: 200
2024-01-29 15:38:30,214 WARNING _do_install (ocp_handler.py:739) [0d78fd86fe3e4b30adcfa16bc67eaefa] failed to start component: ocp-server-ce
2024-01-29 15:38:30,214 INFO _do_install (ocp_handler.py:741) [0d78fd86fe3e4b30adcfa16bc67eaefa] end start ocp-server-ce
2024-01-29 15:38:32,525 INFO dispatch (request_response_log.py:40) [aff9c50c3cb14c579a966e7bbc041eb5] app receive request, method: GET, url: http://10.203.62.13:18680/api/v1/ocp/deployments/1/install/2?id=1&task_id=2, query_params: id=1&task_id=2, body: , from: 10.182.63.200:57516
2024-01-29 15:38:32,525 INFO dispatch (request_response_log.py:40) [5fff9eb230e74fc69ffabcd7a4e774ca] app receive request, method: GET, url: http://10.203.62.13:18680/api/v1/ocp/deployments/1/install/2/log, query_params: , body: , from: 10.182.63.200:57515
2024-01-29 15:38:32,527 INFO get_install_task_info (ocp_handler.py:770) [aff9c50c3cb14c579a966e7bbc041eb5] get ocp install task info
2024-01-29 15:38:32,529 INFO get_install_task_info (ocp_handler.py:770) [5fff9eb230e74fc69ffabcd7a4e774ca] get ocp install task info
2024-01-29 15:38:32,531 INFO dispatch (request_response_log.py:43) [aff9c50c3cb14c579a966e7bbc041eb5] app send response, code: 200
2024-01-29 15:38:32,532 INFO dispatch (request_response_log.py:43) [5fff9eb230e74fc69ffabcd7a4e774ca] app send response, code: 200
2024-01-29 15:38:35,528 INFO dispatch (request_response_log.py:40) [5278435bb4cf44cc8d655229953ccc5c] app receive request, method: GET, url: http://10.203.62.13:18680/api/v1/ocp/deployments/1/install/2/log, query_params: , body: , from: 10.182.63.200:57516
2024-01-29 15:38:35,529 INFO get_install_task_info (ocp_handler.py:770) [5278435bb4cf44cc8d655229953ccc5c] get ocp install task info
2024-01-29 15:38:35,530 INFO dispatch (request_response_log.py:43) [5278435bb4cf44cc8d655229953ccc5c] app send response, code: 200
2024-01-29 15:38:35,605 INFO dispatch (request_response_log.py:40) [d38aa7c79c44458d942964af23fba357] app receive request, method: GET, url: http://10.203.62.13:18680/assets/install/failed.png, query_params: , body: , from: 10.182.63.200:57516
2024-01-29 15:38:35,610 INFO dispatch (request_response_log.py:43) [d38aa7c79c44458d942964af23fba357] app send response, code: 200
2024-01-29 15:38:37,804 ERROR wrapper (task.py:125) [0d78fd86fe3e4b30adcfa16bc67eaefa] task 1 got exception
Traceback (most recent call last):
  File "service/common/task.py", line 119, in wrapper
  File "service/handler/ocp_handler.py", line 756, in _do_install
Exception: task cosmoocpdb start failed
2024-01-29 15:38:37,805 INFO wrapper (task.py:128) [0d78fd86fe3e4b30adcfa16bc67eaefa] task 1 finished failed

You could restart the cluster from the command line and see what error it reports.
Then please post the obd log (the obd log under ~/.obd/log).

[2024-01-29 16:15:21.835] [56abe4c8-be7e-11ee-b596-525400b5e23c] [DEBUG] -- failed to start 10.203.62.13 ocp-server, remaining retries: 174
[2024-01-29 16:15:24.838] [56abe4c8-be7e-11ee-b596-525400b5e23c] [DEBUG] -- 10.203.62.13 program health check
[2024-01-29 16:15:24.841] [56abe4c8-be7e-11ee-b596-525400b5e23c] [DEBUG] -- local execute: ls /proc/19961
[2024-01-29 16:15:24.846] [56abe4c8-be7e-11ee-b596-525400b5e23c] [DEBUG] -- exited code 0
[2024-01-29 16:15:24.850] [56abe4c8-be7e-11ee-b596-525400b5e23c] [DEBUG] -- local execute: bash -c 'cat /proc/net/{tcp*,udp*}' | awk -F' ' '{print $2,$10}' | grep '00000000:46A0' | awk -F' ' '{print $2}' | uniq
[2024-01-29 16:15:24.858] [56abe4c8-be7e-11ee-b596-525400b5e23c] [DEBUG] -- exited code 0
[2024-01-29 16:15:24.858] [56abe4c8-be7e-11ee-b596-525400b5e23c] [DEBUG] -- failed to start 10.203.62.13 ocp-server, remaining retries: 173
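
For anyone reading the obd log above: each health check round first confirms that PID 19961 still exists (ls /proc/19961), then looks in /proc/net for a socket bound to 00000000:46A0, which is 0.0.0.0 on port 18080 (hex 46A0). The "exited code 0" after the pipeline is only the exit status of the last command (uniq), so it does not show whether a listener was found. A rough Python equivalent of that port lookup, as a sketch only: the function names are made up for illustration, and the log does not show whether obd checks anything beyond this.

```python
def hex_port(port: int) -> str:
    """Format a port the way /proc/net/tcp prints it (4 uppercase hex digits)."""
    return format(port, "04X")


def listening_inodes(proc_net_text: str, port: int) -> list:
    """Return the socket inodes bound to 0.0.0.0:<port>.

    Mirrors the awk/grep pipeline in the log: column 2 is the local
    address, column 10 is the socket inode. An empty result means
    nothing is bound to the port yet.
    """
    suffix = "00000000:" + hex_port(port)
    inodes = []
    for line in proc_net_text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # blank or truncated line
        local_address, inode = fields[1], fields[9]
        # endswith also covers IPv6 wildcard addresses, as the grep does
        if local_address.endswith(suffix) and inode not in inodes:
            inodes.append(inode)
    return inodes


if __name__ == "__main__":
    # Assumption: run on the OCP host itself; tcp6 may not exist everywhere.
    text = ""
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(name) as handle:
                text += handle.read()
        except OSError:
            pass
    print(listening_inodes(text, 18080))
```

If this prints an empty list while the process is still alive, ocp-server is running but has not opened its port yet, which would match the retry counter in the log going down (174, then 173).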

This looks like a firewall problem. The client and the server are on two different network segments, aren't they?

10.203.62.13 and 10.182.63.200 can reach each other in both directions, right?

10.182.63.200 can reach 10.203.62.13.
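
A ping alone does not prove that a firewall lets a specific TCP port through, so reachability is better checked per port. A minimal sketch of such a probe follows; the helper name is illustrative, and the ports in the usage note are read off the logs in this thread rather than taken from any OCP documentation.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 18680 appears in the request URLs above; 18080 is hex 46A0 from the obd log.
    for port in (18680, 18080):
        print(port, can_connect("10.203.62.13", port))
```

Run from 10.182.63.200, port 18680 should already succeed, since the first log shows requests from that address answered with code 200. Port 18080 can only succeed once ocp-server has actually started listening.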

Please provide the two logs in the log directory under the ocp-server installation directory.

They need to reach each other in both directions. If they cannot, the deployment will probably not work.

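The cause turned out to be disk performance. The program health check step is when many system tables are created; slow I/O dragged down table creation, so the task timed out and the deployment failed.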
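
For anyone hitting the same symptom, a quick way to get a first impression of the disk before deploying is to time small synchronous writes on the data directory. The sketch below is only a rough probe, not a substitute for a proper benchmark tool. The function name, block size and write count are arbitrary choices for illustration, and the thread gives no threshold for what counts as too slow.

```python
import os
import tempfile
import time


def fsync_latencies(directory: str, writes: int = 20, block_size: int = 16 * 1024) -> list:
    """Time `writes` write+fsync cycles in `directory`; return seconds per cycle."""
    payload = os.urandom(block_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to the device, not just the page cache
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


if __name__ == "__main__":
    # Assumption: replace the directory with the disk that holds the database data.
    samples = fsync_latencies(tempfile.gettempdir())
    print("max ms:", round(max(samples) * 1000, 2))
    print("avg ms:", round(sum(samples) / len(samples) * 1000, 2))
```

Comparing the numbers against a machine where the deployment succeeds gives a relative picture of whether the disk is the bottleneck.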