Error upgrading OCP from 4.3.2 BP1 to 4.3.3

bbq · December 16, 2024, 07:52

Upgrading OCP 4.3.2 BP1 to 4.3.3 fails with an error.
The OCP deployment consists of:
3 ocp-server-ce nodes
A separate underlying OB cluster, also 3 nodes

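What I did:
1. I started the upgrade with the ocp4.3.3_all_in_one package, running obd web upgrade to open the web page and launch the upgrade flow. After nearly 20 minutes the web page showed: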
Generate ocp server configuration ok
Start ocp-server-ce ok
ocp-server-ce program health check ok
Connect to ocp-server-ce ok
[ERROR] ocp-server-ce-py_script_display-4.2.1 RuntimeError: 'NoneType' object has no attribute 'port'
[ERROR] use obd upgrade failed, reason: call upgrade plugin error
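2. At this point I checked the state with obd on the machine running the upgrade. obd cluster display xx reported that the OCP on 2 hosts could not be reached. The one healthy node out of the 3 ocp-server-ce nodes was confirmed to be running ocp-server-ce 4.3.3, and the lib packages of the other 2 ocp-server nodes were also 4.3.3.
3. I restarted all 3 ocp-server-ce services with obd cluster restart xxx -c ocp-server-ce, and restarted only ip2 with obd cluster restart xxx -c ocp-server-ce -s ip2. The ocp-server-ce on ip1, which had started normally during the upgrade, restarted successfully every time, while on ip2 and ip3 the process suddenly disappeared right after bootstrap completed.
4. In obd's cluster configuration directory ~/.obd/cluster/xxx/ I found a .upgrade file in addition to .data and config.yaml, which indicates leftovers from the failed upgrade.
5. Since ip1 was fine, I decided to reinstall ip2 and ip3 through obd's scale-out flow.
a. After backing up the relevant files, I edited the configuration: removed the .upgrade file and changed the ocp-server-ce version and hash in .data to the values for 4.3.3.
b. In config.yaml, I removed the IPs of the 2 faulty ocp-server-ce nodes and adjusted config.yaml to 4.3.2.
c. I prepared the scale-out configuration file and scaled out each node with obd cluster scale_out xxxocp -c scale_out.yaml
6. Scaling out did re-add the 2 machines whose OCP failed to start during the upgrade, but restarting the ocp-server-ce component with obd cluster restart xxx -c ocp-server-ce reproduced the behavior seen during the upgrade: ocp-server on ip2 and ip3 again failed to start.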
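
Step 5a mentions backing up the relevant files before editing them by hand. A minimal sketch of such a backup in Python follows; the function name backup_cluster_dir and the backup location are hypothetical, and only the ~/.obd/cluster/xxx/ directory with its .data, config.yaml and .upgrade files comes from this post.

```python
import shutil
import time
from pathlib import Path


def backup_cluster_dir(cluster_dir, backup_root):
    """Copy an obd cluster config directory to a timestamped backup folder.

    Hypothetical helper: the name and the backup layout are illustrative only.
    """
    src = Path(cluster_dir).expanduser()
    if not src.is_dir():
        raise FileNotFoundError(f"cluster directory not found: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root).expanduser() / f"{src.name}-{stamp}"
    # copytree also copies hidden files such as .data and .upgrade, and it
    # raises if dest already exists, so an older backup is never overwritten.
    shutil.copytree(src, dest)
    return dest
```

For example, backup_cluster_dir("~/.obd/cluster/xxx", "~/obd_backup") would leave the original directory untouched and return the path of the copy, so the manual edits can be rolled back.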

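Analyzing the logs from both the upgrade and the standalone ocp-server-ce restart shows that the faulty ocp-server-ce process always disappears suddenly shortly after it starts.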
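
To pin down when the process disappears, so the moment can be matched against the logs, one option is to poll the PID after a restart. This is only a sketch under assumptions: it is POSIX-only, the helper names are made up for illustration, and the PID of the ocp-server process has to be looked up separately (for example with ps).

```python
import os
import time


def process_alive(pid):
    """Return True if a process with this PID currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def watch_until_exit(pid, timeout=300.0, interval=1.0):
    """Poll pid; return seconds elapsed when it vanishes, or None on timeout."""
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if not process_alive(pid):
            return elapsed
        if elapsed >= timeout:
            return None
        time.sleep(interval)
```

A return value of None means the process was still running when the timeout expired; a number gives the approximate lifetime measured from the start of the watch.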

辞霜 · December 16, 2024, 10:34

Run an inspection with the diagnostic tool obdiag: https://www.oceanbase.com/docs/common-obdiag-cn-1000000001768218

bbq · December 16, 2024, 12:15

obdiag 2.3 through 2.6 all fail to deploy on Debian; the system is Debian 11.

sudo alien --scripts oceanbase-diagnostic-tool-2.3.0-42024072417.el7.x86_64.rpm
Package build failed. Here’s the log:
dh binary
dh_update_autotools_config
dh_autoreconf
create-stamp debian/debhelper-build-stamp
dh_testroot
dh_prep
debian/rules override_dh_auto_install
make[1]: Entering directory '/home/dba/ocp_soft/ocp-all-in-one_old/oceanbase-diagnostic-tool-2.3.0'
mkdir -p debian/oceanbase-diagnostic-tool
# Copy the packages's files.
find . -maxdepth 1 -mindepth 1 -not -name debian -print0 |
sed -e s#'./'##g |
xargs -0 -r -i cp -a ./{} debian/oceanbase-diagnostic-tool/{}
make[1]: Leaving directory '/home/dba/ocp_soft/ocp-all-in-one_old/oceanbase-diagnostic-tool-2.3.0'
dh_installdocs
dh_installchangelogs
dh_perl
dh_usrlocal
dh_usrlocal: error: debian/oceanbase-diagnostic-tool/usr/local/oceanbase-diagnostic-tool/init.sh is not a directory
make: *** [debian/rules:7: binary] Error 25

bbq · December 16, 2024, 13:04

I deployed obdiag with obd instead.
I ran a check; the critical findings are:

| network.TCP-retransmission | [critical] [remote_ip1] tsar is not installed. we can not check tcp retransmission. |
| | [critical] [local] tsar is not installed. we can not check tcp retransmission. |
| | [critical] [remote_ip2] tsar is not installed. we can not check tcp retransmission. |
| disk.xfs_repair | [critical] [local] xfs need repair. Please check disk. xfs_repair_log: dmesg: read kernel buffer failed: Operation not permitted |
| | [critical] [remote_ip2] xfs need repair. Please check disk. xfs_repair_log: dmesg: read kernel buffer failed: Operation not permitted |
| | [critical] [remote_ip1] xfs need repair. Please check disk. xfs_repair_log: dmesg: read kernel buffer failed: Operation not permitted |
| disk.sstable_abnormal_file | [critical] [remote_ip2] sstable_dir_path is null . Please check your nodes.data_dir need absolute Path |
| | [critical] [remote_ip1] sstable_dir_path is null . Please check your nodes.data_dir need absolute Path |
| | [critical] [local] sstable_dir_path is null . Please check your nodes.data_dir need absolute Path |
| cluster.data_path_settings | [critical] [remote_ip2] data_dir_path is null . Please check your nodes.data_dir need absolute Path |
| | [critical] [remote_ip1] data_dir_path is null . Please check your nodes.data_dir need absolute Path |
| | [critical] [local] data_dir_path is null . Please check your nodes.data_dir need absolute Path |
| cluster.task_opt_stat | [critical] [cluster:obcluster] The collection of statistical information related to tenants has issues… Please check the tenant_ids: 1,1001,1002,1003,1004,1005,1006,1007,1008 |

The xfs finding can be ignored: it comes from old log entries left over from an earlier LVM setup.
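
With the same check repeated across nodes, it can help to group the rows by check name before deciding which findings matter. The sketch below parses pipe-delimited rows in the shape pasted above; the function name group_findings and the regular expression are my own illustration, not part of obdiag.

```python
import re
from collections import defaultdict

# One row looks like: | check.name | [level] [node] message |
# Continuation rows leave the first cell empty and reuse the previous check.
ROW = re.compile(
    r"^\|\s*(?P<check>[^|]*?)\s*\|\s*\[(?P<level>\w+)\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*(?P<msg>.*?)\s*\|\s*$"
)


def group_findings(table_text):
    """Map each check name to a list of (node, message) pairs."""
    findings = defaultdict(list)
    current = None
    for line in table_text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a table row
        current = match.group("check") or current
        if current is None:
            continue  # continuation row with no preceding check name
        findings[current].append((match.group("node"), match.group("msg")))
    return dict(findings)
```

Calling group_findings on the pasted table and then dropping the disk.xfs_repair key leaves only the checks that still need attention.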

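兹拉坦 · December 18, 2024, 20:33

The OCP upgrade problem hasn't actually been solved, has it? It shouldn't simply be marked as officially accepted, right? @辞霜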

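秃蛙 · December 19, 2024, 10:11

This is a defect in obd 3.0: when upgrading a multi-node OCP deployment, once one node has finished upgrading, the remaining OCP nodes can no longer start and upgrade normally. A hotfix (hf) release is being prepared internally and is expected this Friday at the latest.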

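谐云 · December 19, 2024, 15:17

obd 3.0.1 has now been released. Download it and upgrade obd.
Then change the OCP cluster's status, which is still left in the upgrading state, to STOPPED,
and run obd cluster start xxxx once.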

bbq · December 20, 2024, 08:20

With the new obd version, the multi-node OCP restarts normally. Thanks to the community for resolving this so quickly.