Upgrading a taken-over OB cluster from 4.1.0 to 4.2.0 with OCP does not finish after a long time (more than 16 hours)

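[Environment] Test environment
[OB or other components] ocp, observer
[Versions] OB cluster version: 4.1.0; OCP version: 4.0.3-20230301
[Problem description] Upgrading a taken-over OB cluster from 4.1.0 to 4.2.0 with OCP does not finish after a long time (more than 16 hours)
[Steps to reproduce] Operations performed before and after the problem appeared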

First, upload the new OceanBase installation package to OCP.

In Cluster Management, select the cluster to upgrade.

Select the version to upgrade to.

The task stays at step 4, wait dag success, for a long time.

[Symptoms and impact]

[Attachments]
OCP log details:
log_task_707751.zip (16.8 KB)

One moment, let me look at the logs.

Check the OCP task list. There should be another upgrade-related task; has it already failed with an error?

Yes, the step Execute upgrade pre script failed.

Error message:
[2023-09-01 14:35:35] INFO upgrade_health_checker.py:350 value is 3, expected value is 0, not matched
[2023-09-01 14:35:45] INFO upgrade_health_checker.py:46 succeed to execute query: select /*+ query_timeout(1000000000) */ count(1) from __all_virtual_tablet_compaction_info where max_received_scn > finished_scn and max_received_scn > 0, rowcount = 1
[2023-09-01 14:35:45] INFO upgrade_health_checker.py:350 value is 3, expected value is 0, not matched
[2023-09-01 14:35:45] ERROR upgrade_health_checker.py:379 run error
Traceback (most recent call last):
  File "/tmp/rpms/extract/oceanbase-ce-4.2.0.0-100010022023081817.el7.x86_64.rpm/home/admin/oceanbase/etc/upgrade_pre_extract_files_2023_09_01_14_25_43_189418_UEFLqicM/upgrade_health_checker.py", line 377, in do_check
    check_major_merge(query_cur, timeout)
  File "/tmp/rpms/extract/oceanbase-ce-4.2.0.0-100010022023081817.el7.x86_64.rpm/home/admin/oceanbase/etc/upgrade_pre_extract_files_2023_09_01_14_25_43_189418_UEFLqicM/upgrade_health_checker.py", line 337, in check_major_merge
    check_until_timeout(query_cur, sql2, 0, timeout)
  File "/tmp/rpms/extract/oceanbase-ce-4.2.0.0-100010022023081817.el7.x86_64.rpm/home/admin/oceanbase/etc/upgrade_pre_extract_files_2023_09_01_14_25_43_189418_UEFLqicM/upgrade_health_checker.py", line 354, in check_until_timeout
    logging.warn("""check {0} job timeout""".format(job_name))
NameError: global name 'job_name' is not defined
[2023-09-01 14:35:45] ERROR upgrade_health_checker.py:388 normal error
Traceback (most recent call last):
  File "/tmp/rpms/extract/oceanbase-ce-4.2.0.0-100010022023081817.el7.x86_64.rpm/home/admin/oceanbase/etc/upgrade_pre_extract_files_2023_09_01_14_25_43_189418_UEFLqicM/upgrade_health_checker.py", line 380, in do_check
    raise e
NameError: global name 'job_name' is not defined
[2023-09-01 14:35:45] ERROR do_upgrade_pre.py:94 run error
Traceback (most recent call last):
  File "/tmp/rpms/extract/oceanbase-ce-4.2.0.0-100010022023081817.el7.x86_64.rpm/home/admin/oceanbase/etc/upgrade_pre_extract_files_2023_09_01_14_25_43_189418_UEFLqicM/do_upgrade_pre.py", line 90, in do_upgrade
    upgrade_health_checker.do_check(my_host, my_port, my_user, my_passwd, upgrade_params, timeout)
  File "/tmp/rpms/extract/oceanbase-ce-4.2.0.0-100010022023081817.el7.x86_64.rpm/home/admin/oceanbase/etc/upgrade_pre_extract_files_2023_09_01_14_25_43_189418_UEFLqicM/upgrade_health_checker.py", line 389, in do_check
    raise e
NameError: global name 'job_name' is not defined
[2023-09-01 14:35:45] INFO do_upgrade_pre.py:42

Full subtask log:
log_task_707752.zip (34.5 KB)
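
The traceback above shows two things at once: the merge check kept returning 3 instead of the expected 0 until the timeout, and the timeout branch of check_until_timeout then raised NameError because job_name was not defined in that scope, which hides the real cause. A minimal sketch of such a polling helper, with the name passed in explicitly, might look like the following. This is not the OceanBase script; the signature, the run_query callable, and the interval_s and sleep parameters are assumptions made for illustration.

```python
import time


def check_until_timeout(run_query, expected, timeout_s, job_name,
                        interval_s=10, sleep=time.sleep):
    """Poll run_query() until it returns `expected` or `timeout_s` elapses.

    All names are hypothetical; this only illustrates the pattern seen
    in the traceback, not the actual upgrade_health_checker.py code.
    """
    waited = 0
    while True:
        value = run_query()
        if value == expected:
            return waited  # seconds spent waiting before the check passed
        if waited >= timeout_s:
            # job_name is a parameter here, so the timeout is reported
            # as a timeout instead of failing with NameError.
            raise TimeoutError(
                "check {0} job timeout: value is {1}, expected {2}".format(
                    job_name, value, expected))
        sleep(interval_s)
        waited += interval_s
```

With a helper shaped like this, the log would end with a timeout message naming the job, which points directly at the unfinished compaction rather than at a script bug.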

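Log in to the sys tenant as root@sys and run select /*+ query_timeout(1000000000) */ count(1) from __all_virtual_tablet_compaction_info where max_received_scn > finished_scn and max_received_scn > 0; to see what it returns.

Also check in the OCP web console whether the cluster status is normal, whether any zone or observer is faulty, and whether recent merges completed normally.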

mysql> select /*+ query_timeout(1000000000) */ * from oceanbase.__all_virtual_tablet_compaction_info where max_received_scn > finished_scn and max_received_scn > 0;
+-----------------+----------+-----------+-------+-----------+---------------------+---------------------+---------------------+
| svr_ip          | svr_port | tenant_id | ls_id | tablet_id | finished_scn        | wait_check_scn      | max_received_scn    |
+-----------------+----------+-----------+-------+-----------+---------------------+---------------------+---------------------+
| 192.168.140.174 |     2982 |      1001 |     1 |       370 | 1693476747012845786 |                   0 | 1693480544981708870 |
| 192.168.140.175 |     2982 |      1001 |     1 |       370 | 1693476747012845786 | 1693476747012845786 | 1693480544981708870 |
| 192.168.140.177 |     2982 |      1001 |     1 |       370 | 1693476747012845786 | 1693476747012845786 | 1693480544981708870 |
+-----------------+----------+-----------+-------+-----------+---------------------+---------------------+---------------------+

serialize_scn_list (last column) for each row:
192.168.140.174: {size:1, last_compaction_type:0, wait_check_flag:0, last_medium_scn:1693476747012845786, info_list:{[0]:compaction_type:"MEDIUM_COMPACTION", medium_snapshot:1693480544981708870, parallel_merge_info:{list_size:0, }}}
192.168.140.175: {size:1, last_compaction_type:0, wait_check_flag:1, last_medium_scn:1693476747012845786, info_list:{[0]:compaction_type:"MEDIUM_COMPACTION", medium_snapshot:1693480544981708870, parallel_merge_info:{list_size:0, }}}
192.168.140.177: {size:1, last_compaction_type:0, wait_check_flag:1, last_medium_scn:1693476747012845786, info_list:{[0]:compaction_type:"MEDIUM_COMPACTION", medium_snapshot_:1693480544981708870, parallel_merge_info:{list_size:0, }}}

mysql> select * from oceanbase.__all_table where tablet_id=370;
gmt_create: 2023-07-20 18:44:51.409466
gmt_modified: 2023-07-20 18:44:51.409466
tenant_id: 0
table_id: 370
table_name: __all_ls_recovery_stat
database_id: 201001
table_type: 0
load_type: 0
def_type: 0
rowkey_column_num: 2
index_column_num: 0
max_used_column_id: 23
autoinc_column_id: 0
auto_increment: 1
read_only: 0
rowkey_split_pos: 0
compress_func_name: none
expire_condition:
is_use_bloomfilter: 0
comment:
block_size: 16384
collation_type: 45
data_table_id: 0
index_status: 1
tablegroup_id: 202001
progressive_merge_num: 0
index_type: 0
part_level: 0
part_func_type: 0
part_func_expr:
part_num: 1
sub_part_func_type: 0
sub_part_func_expr:
sub_part_num: 0
schema_version: 1689849891410088
view_definition:
view_check_option: 0
view_is_updatable: 0
index_using_type: 0
parser_name: NULL
index_attributes_set: 0
tablet_size: 134217728
pctfree: 10
partition_status: 0
partition_schema_version: 0
session_id: 0
pk_comment:
sess_active_time: 0
row_store_type: encoding_row_store
store_format: DYNAMIC
duplicate_scope: 0
progressive_merge_round: 1
storage_format_version: 3
table_mode: 0
encryption:
tablespace_id: -1
sub_part_template_flags: 0
dop: 1
character_set_client: 0
collation_connection: 0
auto_part_size: 0
auto_part: 0
association_table_id: -1
tablet_id: 370
max_dependency_version: -1
define_user_id: -1
transition_point: NULL
b_transition_point: NULL
interval_range: NULL
b_interval_range: NULL
object_status: 1
table_flags: 0
truncate_version: -1
1 row in set (0.05 sec)

The cluster's zones and the merge status are normal.
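
The three rows returned for tablet 370 can be read as "a medium compaction was scheduled at max_received_scn but the tablet has only finished up to finished_scn". The sketch below reapplies the WHERE condition of the health-check query to those rows and prints the gap. It assumes that SCN values are nanosecond timestamps since the Unix epoch, which is an interpretation not stated in this thread; the helper names is_pending and scn_to_utc are hypothetical.

```python
from datetime import datetime, timezone

# (svr_ip, finished_scn, max_received_scn) copied from the query result above
ROWS = [
    ("192.168.140.174", 1693476747012845786, 1693480544981708870),
    ("192.168.140.175", 1693476747012845786, 1693480544981708870),
    ("192.168.140.177", 1693476747012845786, 1693480544981708870),
]


def is_pending(finished_scn, max_received_scn):
    """Same condition as the WHERE clause of the health-check query."""
    return max_received_scn > finished_scn and max_received_scn > 0


def scn_to_utc(scn):
    """ASSUMPTION: an SCN is nanoseconds since the Unix epoch."""
    seconds = scn // 1_000_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


pending = [row for row in ROWS if is_pending(row[1], row[2])]

for svr_ip, finished, received in pending:
    gap_seconds = (received - finished) / 1e9
    print(svr_ip,
          "finished:", scn_to_utc(finished).isoformat(),
          "received:", scn_to_utc(received).isoformat(),
          "gap_s:", round(gap_seconds, 3))
```

All three replicas match the condition, which is consistent with the "value is 3, expected value is 0" line in the log.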

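Check whether there are any long-running or hanging transactions.

I checked the active sessions of the sys tenant and the business tenant. Apart from the administrator's own login sessions, there are no other active sessions or transactions.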


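Can this upgrade be rolled back? I could not find such a function in the OCP task page, and application access is already affected.

Once preCheck has passed, the upgrade cannot be rolled back.

It cannot be rolled back, but it can be re-run. Which exact 4.1.0 build of OceanBase are you running?

Community Edition 4.1.0.0_100000202023040520

I have already retried the failed step; it gets stuck on the same check statement.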

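One moment, let me verify this version.

This statement fails because DDL is disabled during the upgrade.

OK, please help me work out how to handle this.

The merge check timed out. Change this parameter and retry the task:
alter system set internal_sql_execute_timeout='10m'

I changed the parameter, but the retry still fails, and the cause looks the same as before.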

Execute_upgrade_pre_script_715612_FAILED.log (359.8 KB)

Please look at the three records returned by select /*+ query_timeout(1000000000) */ * from oceanbase.__all_virtual_tablet_compaction_info where max_received_scn > finished_scn and max_received_scn > 0;

Please query the DBA_OB_MAJOR_COMPACTION view under tenant 1001:
select * from oceanbase.DBA_OB_MAJOR_COMPACTION

SELECT * FROM oceanbase.GV$OB_TABLET_COMPACTION_PROGRESS where TABLET_ID='1693476747012845786'
Please run this SQL as well.

查询结果1.txt (4.3 KB)

查询结果1.txt (10.7 KB)
