Online upgrade of OB from 4.2.0 to 4.2.2 with obd fails

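【Environment】Test environment
【OB or other components】OBD, OB
【Version】4.2.0
【Problem description】Describe the problem clearly and precisely
To upgrade online from 4.2.0 to 4.2.2, I ran: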

obd cluster upgrade obcluster -c oceanbase-ce -V 4.2.2.0 --usable=d687aabed34f610040c70cd8aa4f256f9a909564bcdb12e1bcbf83224c865fab

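Judging from the logs and from querying the OB version with SQL, the cluster was automatically upgraded to 4.2.1. The next step, to 4.2.2, then stayed stuck at "Rotation upgrade" for more than 15 hours. After I terminated it manually, the cluster is now in the upgrading state and cannot be stopped, restarted, or otherwise operated.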

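Running the upgrade command again reports an error; see the attachments.
【Steps to reproduce】Operations performed before and after the problem occurred
【Attachments and logs】We recommend collecting diagnostic information with obdiag, the OceanBase agile diagnostic tool. For details, see the link (right-click to open):

【SOP Series 22】First step of fault diagnosis (self-service diagnosis and diagnostic information collection)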
upgrade.log (18.9 KB)
upgrade_post.log (16.8 KB)

The log shows this error: main.MyError: 'upgrade checker failed with 1 reasons: [servers build_version not match]'
Please provide this information from the obd node: cat ~/.obd/cluster/xxx/.upgrade

[admin@ob01 bin]$ cat ~/.obd/cluster/obcluster/.upgrade
component: oceanbase-ce
uprade_ctx:
  route:
  - version: 4.2.0.0
    release:
    direct_upgrade: false
    require_from_binary: false
  - version: 4.2.1.2
    release:
    direct_upgrade: false
    require_from_binary: true
  - version: 4.2.2.0
    release:
    direct_upgrade: false
    require_from_binary: false
  upgrade_repositories:
  - version: 4.2.0.0
    hash: 176ae96ab6ea606c860e0a9db09f8046eec0ceba
  - version: 4.2.1.2
    hash: 1873bbe80cbbe5d00d5f276a4f7302cfca677fb6
  - version: 4.2.2.0
    hash: aa3053da7370a6685a2ef457cd202d50e5ab75d3
  index: 2
  process_index: 2
  process_route_index: 2
  backup_param:
    172.30.3.150:2882:
      enable_rereplication: 'True'
      server_permanent_offline_time: 72h
      enable_rebalance: 'True'
    172.30.3.149:2882:
      enable_rereplication: 'True'
      server_permanent_offline_time: 72h
      enable_rebalance: 'True'
    172.30.3.151:2882:
      enable_rereplication: 'True'
      server_permanent_offline_time: 72h
      enable_rebalance: 'True'
[admin@ob01 bin]$
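
To read the progress counters in a dump like the one above without scanning it by eye, a small script can pull them out. This is only a sketch: the helper name `read_progress` is hypothetical, the sample text reuses the three counters shown in this thread, and it uses a plain regular expression rather than a YAML parser so that it does not depend on how the file is indented.

```python
import re

def read_progress(text):
    """Return the integer progress counters found in the text of a .upgrade file."""
    fields = {}
    for key in ("index", "process_index", "process_route_index"):
        # Anchor at the start of a line so "index" does not match "process_index".
        match = re.search(rf"^[ \t]*{key}:[ \t]*(\d+)[ \t]*$", text, flags=re.MULTILINE)
        if match:
            fields[key] = int(match.group(1))
    return fields

# Sample text copied from the counters shown in this thread.
sample = """\
index: 2
process_index: 2
process_route_index: 2
"""

progress = read_progress(sample)
print(progress)
```

On a real node, the text could instead come from reading ~/.obd/cluster/obcluster/.upgrade, assuming the file keeps one counter per line as in the dump.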


Please provide the upgrade_checker.py file.
Please also provide the output of obd --version.

Please also provide the output of the following:
show parameters like '%enable_ddl%';
show parameters like '%mode%';

upgrade_checker.zip (7.1 KB)


[admin@ob01 aa3053da7370a6685a2ef457cd202d50e5ab75d3]$ obd --version
OceanBase Deploy: 2.6.1
REVISION: 6aad22bedf20b041b23ff58c203d94dc165c717a
BUILD_BRANCH: HEAD
BUILD_TIME: Feb 05 2024 17:12:05OURCE
Copyright (C) 2021 OceanBase
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

cd ~/.obd/cluster/obcluster/
# Back up the .upgrade file
cp .upgrade /opt/upgrade_bak
# Adjust the upgrade progress: change process_index to 5
vim .upgrade

# Re-run the obd upgrade
obd cluster upgrade obcluster -c oceanbase-ce -V 4.2.2.0 --usable=d687aabed34f610040c70cd8aa4f256f9a909564bcdb12e1bcbf83224c865fab
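
The backup-and-edit step above can also be scripted, so that the file is never changed without a copy being taken first. The following is a sketch under assumptions: `set_process_index` is a hypothetical helper, the demo runs against a temporary file rather than the real ~/.obd/cluster/obcluster/.upgrade, and the value 5 is simply the one suggested in this thread, which may not be right for another cluster.

```python
import re
import shutil
import tempfile
from pathlib import Path

def set_process_index(upgrade_file, new_value, backup_file):
    """Copy upgrade_file to backup_file, then rewrite its process_index line."""
    path = Path(upgrade_file)
    shutil.copy2(path, backup_file)  # take the backup before touching the file
    text = path.read_text()
    new_text, count = re.subn(
        r"^([ \t]*process_index:[ \t]*)\d+[ \t]*$",
        rf"\g<1>{int(new_value)}",
        text,
        flags=re.MULTILINE,
    )
    if count != 1:
        # Refuse to write if the file does not look the way we expect.
        raise ValueError(f"expected exactly one process_index line, found {count}")
    path.write_text(new_text)

# Demo on a temporary copy; the counters mirror the ones shown in this thread.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / ".upgrade"
    demo.write_text("index: 2\nprocess_index: 2\nprocess_route_index: 2\n")
    backup = Path(tmp) / "upgrade_bak"
    set_process_index(demo, 5, backup)
    edited = demo.read_text()
    saved = backup.read_text()

print(edited)
```

Raising an error when the pattern does not match exactly once is a deliberate choice here: a silent no-op or a double replacement would be harder to notice than a failed script.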

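It has been stuck here for a very long time again: Rotation upgrade

Please check the obd log (cd ~/.obd/log/obd) and the upgrade_post.log log.

There is only the obd log; I could not find upgrade_post.log.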
obd-log.zip (414.5 KB)

Is the upgrade task currently interrupted?
Please run this SQL query:
select * from oceanbase.DBA_OB_SERVERS where STATUS != 'ACTIVE' or STOP_TIME is not NULL or START_SERVICE_TIME is NULL;

Also check: cat .upgrade |grep process_index

The upgrade task has been stopped manually.
The SQL query returns no rows.
[admin@ob01 obcluster]$ cat .upgrade |grep process_index
process_index: 5

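An empty result is expected. Please run the upgrade command again and then check the obd log. If it keeps printing "select * from oceanbase.DBA_OB_SERVERS where STATUS != 'ACTIVE' or STOP_TIME is not NULL or START_SERVICE_TIME is NULL execute failed", connect to the cluster being upgraded, run that SQL there, and check the result.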

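The obd log keeps showing that the SQL execution failed, but it does not say why. Is there a log somewhere that shows the reason for the failure?
OBD-5000: select * from oceanbase.DBA_OB_SERVERS where STATUS != 'ACTIVE' or STOP_TIME is not NULL or START_SERVICE_TIME is NULL execute failed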
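
To confirm that the same statement is failing over and over, rather than several different ones, the obd log can be summarized with a short script. This is a sketch: `count_failed_statements` is a hypothetical helper, and it assumes only that each failure line contains the "OBD-5000: ... execute failed" text quoted above. The rest of the real log line format is not shown in this thread, so the sample lines below are illustrative.

```python
import re
from collections import Counter

def count_failed_statements(log_text):
    """Count how often each SQL statement appears in an OBD-5000 'execute failed' line."""
    counts = Counter()
    for line in log_text.splitlines():
        match = re.search(r"OBD-5000:\s*(.+?)\s+execute failed", line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Illustrative sample; only the OBD-5000 message shape is taken from the thread.
sample_log = "\n".join([
    "OBD-5000: select 1 execute failed",
    "some unrelated log line",
    "OBD-5000: select 1 execute failed",
])

counts = count_failed_statements(sample_log)
for statement, times in counts.most_common():
    print(times, statement)
```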

select * from gv$ob_sql_audit where query_sql like '%select * from oceanbase.DBA_OB_SERVERS where STATUS%'; shows the records of the executed SQL.

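Thanks for the support. I will redeploy the cluster and migrate the business data over.

OK. The issue where an abnormal interruption during an upgrade leaves the recorded upgrade progress lagging behind will be fixed in obd 2.7.0.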