OCP alert: failed to merge partition, error code=4016

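[Environment] Production
[OB or other component] OB 4.3.3
[Version]
[Problem description]
On the evening of 2024-10-10, OB was upgraded from 4.3.2.1 to 4.3.3 through OCP. Starting at 02:00 on 2024-10-11, all three observers began raising the following alert event: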

Alert event details
Alert summary: alarm_template_id=0:ob_cluster=s7-1725644966:host=192.168.51.21 OBServer merge failed
Alert details: [OBServer merge failed] Cluster: s7, host: 192.168.51.21, log type: observer, log file: /home/admin/oceanbase/log/observer.log, log level: WDIAG, keyword=failed to merge partition, error code=4016, log details=[2024-10-13 16:20:01.018031] WDIAG [STORAGE] process (ob_tablet_merge_task.cpp:1154) [35137][T1001_MAJOR_MER][T1001][YB42C0A83315-0006241F8799C68E-0-0] [lt=17][errcode=-4016] failed to merge partition(ret=-4016).

The observer log shows one matching entry every two minutes:

[root@OB-1 ~]# grep 'failed to merge partition' /home/admin/oceanbase/log/observer.log*
/home/admin/oceanbase/log/observer.log:[2024-10-13 16:32:01.374005] WDIAG [STORAGE] process (ob_tablet_merge_task.cpp:1154) [45648][T1001_MAJOR_MER][T1001][YB42C0A83315-0006241F8799C694-0-0] [lt=9][errcode=-4016] failed to merge partition(ret=-4016)
/home/admin/oceanbase/log/observer.log:[2024-10-13 16:34:01.456340] WDIAG [STORAGE] process (ob_tablet_merge_task.cpp:1154) [47376][T1001_MAJOR_MER][T1001][YB42C0A83315-0006241F8799C695-0-0] [lt=9][errcode=-4016] failed to merge partition(ret=-4016)
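To confirm the "every two minutes" pattern rather than eyeballing it, the timestamps can be pulled out of the grep output and differenced. This is only a sketch: the two sample lines are the ones shown above with the middle of each line left out, and `failure_intervals` is a helper name made up for illustration.

```python
import re
from datetime import datetime

# The two matching lines from the grep output, shortened to the timestamp
# and the message; the thread and trace fields in between are omitted.
LINES = [
    "/home/admin/oceanbase/log/observer.log:[2024-10-13 16:32:01.374005] WDIAG [STORAGE] failed to merge partition(ret=-4016)",
    "/home/admin/oceanbase/log/observer.log:[2024-10-13 16:34:01.456340] WDIAG [STORAGE] failed to merge partition(ret=-4016)",
]

# observer.log timestamps look like [YYYY-MM-DD HH:MM:SS.ffffff]
TS = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\]")


def failure_intervals(lines):
    """Return the gaps in seconds between consecutive matching log lines."""
    stamps = []
    for line in lines:
        match = TS.search(line)
        if match and "failed to merge partition" in line:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


intervals = failure_intervals(LINES)
print(intervals)
```

A steady gap of roughly 120 seconds suggests the same merge task is being retried on a fixed schedule rather than new tablets failing.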

[Reproduction path] Operations performed before and after the problem appeared
The obdiag result is as follows:

[root@OB-1 ~]# obdiag rca run --scene=major_hold
[WARN] #2&3 MajorHoldScene execute exception: 'NoneType' object has no attribute 'group'
rca finished. For more details, the result on './rca//obdiag_major_hold_20241013161709' 
You can get the suggest by 'cat ./rca//obdiag_major_hold_20241013161709/record'
Trace ID: 89ccf4d6-893b-11ef-9742-246e96ba4428
If you want to view detailed obdiag logs, please run: obdiag display-trace 89ccf4d6-893b-11ef-9742-246e96ba4428

[root@OB-1 ~]# cat ./rca//obdiag_major_hold_20241013161709/record
+-------------------------------------------------------------------------------------------------------------+
|                                                    record                                                   |
+------+------------------------------------------------------------------------------------------------------+
| step | info                                                                                                 |
+------+------------------------------------------------------------------------------------------------------+
|  1   | CDB_OB_MAJOR_COMPACTION is not exist IS_ERROR='YES'                                                  |
|  2   | __all_virtual_compaction_diagnose_info have status='FAILED',the tenant is []                         |
|  3   | merge tasks that have not ended beyond the expected time,the tenant_id is ['1001']                   |
|  4   | tenant_id is 1001                                                                                    |
|  5   | on CDB_OB_MAJOR_COMPACTION where status='COMPACTING';result:[{'TENANT_ID': 1001, 'FROZEN_SCN':       |
|      | 1728583200208048000, 'FROZEN_TIME': datetime.datetime(2024, 10, 11, 2, 0, 0, 208048),                |
|      | 'GLOBAL_BROADCAST_SCN': 1728583200208048000, 'LAST_SCN': 1728496802259675000, 'LAST_FINISH_TIME':    |
|      | datetime.datetime(2024, 10, 10, 2, 3, 37, 164154), 'START_TIME': datetime.datetime(2024, 10, 11, 2,  |
|      | 0, 1, 54284), 'STATUS': 'COMPACTING', 'IS_ERROR': 'NO', 'IS_SUSPENDED': 'NO', 'INFO': ''}]           |
|  6   | on __all_virtual_compaction_diagnose_info;result:(('192.168.51.21', 2882, 1001, 'MEDIUM_MERGE', 1,   |
|      | 455, 'FAILED', datetime.datetime(2024, 10, 11, 2, 1, 25, 941892), 'error_no=-4016,last_error_time=17 |
|      | 28807360766885,error_trace=YB42C0A83315-0006241F8799C68C-0-0,warning="MAJOR_MERGE/MEDIUM_MERGE;ls_id |
|      | =1;tablet_id=455;compaction_scn=1728583200208048000;exec_mode="EXEC_MODE_LOCAL",concurrent_cnt=1"'), |
|      | ('192.168.51.23', 2882, 1001, 'MEDIUM_MERGE', 1, 455, 'FAILED', datetime.datetime(2024, 10, 11, 2,   |
|      | 1, 53, 220442), 'error_no=-4016,last_error_time=1728807422210633,error_trace=YB42C0A83317-0006241F76 |
|      | ED21AC-0-0,warning="MAJOR_MERGE/MEDIUM_MERGE;ls_id=1;tablet_id=455;compaction_scn=172858320020804800 |
|      | 0;exec_mode="EXEC_MODE_LOCAL",concurrent_cnt=1"'), ('192.168.51.22', 2882, 1001, 'MEDIUM_MERGE', 1,  |
|      | 455, 'FAILED', datetime.datetime(2024, 10, 11, 2, 3, 7, 839729), 'error_no=-4016,last_error_time=172 |
|      | 8807389670127,error_trace=YB42C0A83316-0006241F7D0F3DF4-0-0,warning="MAJOR_MERGE/MEDIUM_MERGE;ls_id= |
|      | 1;tablet_id=455;compaction_scn=1728583200208048000;exec_mode="EXEC_MODE_LOCAL",concurrent_cnt=1"'))  |
|  7   | #2&3 on __all_virtual_compaction_diagnose_info get data failed                                       |
|  8   | global_broadcast_scn is 1728583200208048000                                                          |
|  9   | last_scn is (1728496802259675000,)                                                                   |
|  10  | tenant_id:1001 OB_COMPACTION_PROGRESS_data_global_broadcast_scn save on ./rca//obdiag_major_hold_202 |
|      | 41013161709/rca_major_hold_1001_OB_COMPACTION_PROGRESS_data_global_broadcast_scn                     |
|  11  | tenant_id:1001 OB_COMPACTION_PROGRESS_data_last_scn save on                                          |
|      | ./rca//obdiag_major_hold_20241013161709/rca_major_hold_1001_OB_COMPACTION_PROGRESS_data_last_scn     |
|  12  | tenant_id:1001 OB_COMPACTION_PROGRESS_data_last_scn save on                                          |
|      | ./rca//obdiag_major_hold_20241013161709/rca_major_hold_1001_OB_COMPACTION_SUGGESTIONS_info           |
+------+------------------------------------------------------------------------------------------------------+
The suggest: send the ./rca//obdiag_major_hold_20241013161709 to the oceanbase community
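The SCN and last_error_time values in the record are easier to read as wall-clock times. The sketch below assumes that an SCN is nanoseconds since the Unix epoch, that last_error_time is microseconds since the epoch, and that the cluster runs at UTC+8. All three assumptions are inferred from the FROZEN_SCN / FROZEN_TIME pair in step 5 of the record, not from documentation. The function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: cluster local time is UTC+8, inferred from FROZEN_TIME in the record.
TZ = timezone(timedelta(hours=8))


def scn_to_datetime(scn_ns):
    """Interpret an SCN as nanoseconds since the Unix epoch (assumption)."""
    seconds, nanos = divmod(scn_ns, 1_000_000_000)
    return datetime.fromtimestamp(seconds, TZ) + timedelta(microseconds=nanos // 1000)


def micros_to_datetime(ts_us):
    """Interpret a value such as last_error_time as microseconds since the epoch."""
    seconds, micros = divmod(ts_us, 1_000_000)
    return datetime.fromtimestamp(seconds, TZ) + timedelta(microseconds=micros)


# Values copied from steps 5 and 6 of the obdiag record.
frozen = scn_to_datetime(1728583200208048000)
last_error = micros_to_datetime(1728807360766885)
print(frozen.isoformat())
print(last_error.isoformat())
```

Under these assumptions the frozen SCN maps to the 02:00 merge on 2024-10-11, and the last error on 192.168.51.21 falls about a minute before obdiag was run, so the task was still failing at diagnosis time.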

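Additional notes:
Apart from sys, this OB cluster has three tenants. In OCP, under Cluster > Merge Management, the most recent merge for each of them is also shown in green as completed normally.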

This may have hit a bug.

1. Please check the upgrade path and version history:
select distinct(value2) from dba_ob_cluster_event_history;

2. Please provide observer.log.

obclient [(none)]> use oceanbase;
Database changed
obclient [oceanbase]> select distinct(value2) from dba_ob_cluster_event_history;
+-------------------------------------------------------------------------------------------+
| value2                                                                                    |
+-------------------------------------------------------------------------------------------+
| 4.3.2.1_100000102024081217-9b4a349ba1c06965b20a3f2397bd57eb0e52a667(Aug 12 2024 18:10:52) |
| 4.3.2.1                                                                                   |
| 4.3.3.0_100000132024100711-0eba6bc1c209771831934b11cfe728780ecb7a85(Oct  7 2024 12:01:19) |
+-------------------------------------------------------------------------------------------+
3 rows in set (0.010 sec)

observer.log is fairly large and log rotation is configured, so only the last hour is kept. Do you need all of it packaged and uploaded?

One observer.log file that covers the alert period is enough.

observer.zip (14.6 MB)
observer.log has been uploaded.

Confirmed as a known issue.

[2024-10-15 12:52:05.792861] WDIAG [STORAGE] calc_micro_column_checksum (ob_macro_block_writer.cpp:1881) [15194][T1005_MAJOR_MER][T1005][YB42C0A83315-0006241F6A89E9F1-0-0] [lt=32][errcode=-4016] error unexpected, row column count is invalid(ret=-4016, datum_row_={row_flag:{flag:"INSERT", flag_type:0}, trans_id:{txid:0}, scan_index:0, mvcc_row_flag:{first:0, uncommitted:0, shadow:0, compact:0, ghost:0, last:0, reserved:0, flag:0}, snapshot_version:0, fast_filter_skipped:false, have_uncommited_row:false, group_idx:0, count:44, datum_buffer:{capacity:45, datums:0x7f97f4880090, local_datums:0x7f97f487dc60}, datums:[col_id=0:{len: 8, flag: 0, null: 0, ptr: 0x7f97f48800a0, hex: EE03000000000000, int: 1006}, col_id=1:{len: 8, flag: 0, null: 0, ptr: 0x7f97f48800d8, hex: A640DB6600000000, int: 1725644966}, col_id=2:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880110, hex: E802000000000000, int: 744}, col_id=3:{len: 13, flag: 0, null: 0, ptr: 0x7f98c4e0920e, hex: 3139322E3136382E35312E3231, cstr: 192.168.51.21}, col_id=4:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880180, hex: 420B000000000000, int: 2882}, col_id=5:{len: 8, flag: 0, null: 0, ptr: 0x7f97f48801b8, hex: 600B990000000000, int: 10029920}, col_id=6:{len: 8, flag: 0, null: 0, ptr: 0x7f97f48801f0, hex: AD370C0000000000, int: 800685}, col_id=7:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880228, hex: F84DDEC869A403E8, int: -1728357057731605000}, col_id=8:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880260, hex: 0000000000000000, int: 0}, col_id=9:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880298, hex: B6588ED5ED230600, int: 1728354257361078}, col_id=10:{len: 8, flag: 0, null: 0, ptr: 0x7f97f48802d0, hex: 22A1070000000000, int: 500002}, col_id=11:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880308, hex: 0000000000000000, int: 0}, col_id=12:{len: 32, flag: 0, null: 0, ptr: 0x7f98c4e0a609, hex: 4634353335434446324642433636313938413934443142463239413634443235, cstr: F4535CDF2FBC66198A94D1BF29A64D25}, col_id=13:{len: 33, flag: 0, null: 0, ptr: 0x7fa411cba060, hex: 5942343243304138333331352D303030363232414133453042324239342D302D30, cstr: YB42C0A83315-000622AA3E0B2B94-0-0}, col_id=14:{len: 8, flag: 0, null: 0, ptr: 0x7f97f48803b0, hex: 0000000000000000, int: 0}, col_id=15:{len: 8, flag: 0, null: 0, ptr: 0x7f97f48803e8, hex: 0000000000000000, int: 0}, col_id=16:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880420, hex: 0000000000000000, int: 0}, col_id=17:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880458, hex: 0000000000000000, int: 0}, col_id=18:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880490, hex: 0000000000000000, int: 0}, col_id=19:{len: 8, flag: 0, null: 0, ptr: 0x7f97f48804c8, hex: 0000000000000000, int: 0}, col_id=20:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880500, hex: 1002000000000000, int: 528}, col_id=21:null, col_id=22:null, col_id=23:null, col_id=24:null, col_id=25:{len: 8, flag: 0, null: 0, ptr: 0x7f97f4880618, hex: 9F0C000000000000, int: 3231}, col_id=26:null, col_id=27:null, col_id=28:null, col_id=29:null, col_id=30:null, col_id=31:null, col_id=32:null, col_id=33:null, col_id=34:null, col_id=35:null, col_id=36:null, col_id=37:null, col_id=38:null, col_id=39:null, col_id=40:null, col_id=41:null, col_id=42:null, col_id=43:null]}, column_cnt=45)
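The key detail in that log line is the mismatch between the row's own count (44, with col_id running from 0 to 43) and the column_cnt=45 the macro block writer expects. A rough way to pull both numbers out of such a line is sketched below. The excerpt keeps only the relevant fields, and `column_counts` is a name chosen here for illustration.

```python
import re

# Only the fields needed for the check, copied from the log line;
# everything else in the line is left out of this excerpt.
LOG_EXCERPT = (
    "error unexpected, row column count is invalid(ret=-4016, "
    "datum_row_={group_idx:0, count:44, datum_buffer:{capacity:45}}, column_cnt=45)"
)


def column_counts(log_line):
    """Return (columns in the row, columns the writer expects)."""
    row = re.search(r"\bcount:(\d+)", log_line)
    expected = re.search(r"\bcolumn_cnt=(\d+)", log_line)
    if not row or not expected:
        raise ValueError("log line does not contain both counts")
    return int(row.group(1)), int(expected.group(1))


row_cnt, expected_cnt = column_counts(LOG_EXCERPT)
print(f"row has {row_cnt} columns, writer expects {expected_cnt}")
```

The same regexes should also work on the full log line, since `count:` with a colon first appears in the datum_row_ header.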

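Fix: download the latest 4.3.3 build.

  1. Disable partition merge
    alter system _enable_adaptive_compaction=false tenant all;
    alter system _enable_adaptive_compaction=false tenant all_user;
    alter system _enable_adaptive_compaction=false tenant all_meta;

  2. Check whether any partition merge tasks are still unfinished
    select * from __all_virtual_tablet_compaction_info where max_received_scn > finished_scn;
    Compare the result with the tablet_ids that have already reported 4016 to see whether any other partition merge tasks are unfinished. If there are, wait for them to finish.
    select * from __all_virtual_dag_warning_history;

  3. Once the check in step 2 is done, replace the binary

  4. After the binary has been replaced in every zone, the partitions that reported 4016 should all merge successfully. Then set _enable_adaptive_compaction back to true everywhere.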
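The comparison in step 2 is a set difference: any tablet that is still unfinished and is not one of the tablets already failing with 4016 has to complete before the binary is replaced. A minimal sketch, assuming the rows from the two queries have been reduced to (tenant_id, ls_id, tablet_id) tuples. The function names and the second tablet in the example are hypothetical.

```python
def tablets_to_wait_for(unfinished, failed_4016):
    """Unfinished tablets that are not already known to fail with 4016."""
    return sorted(set(unfinished) - set(failed_4016))


def ready_to_replace_binary(unfinished, failed_4016):
    """True when the only unfinished merges are the ones stuck on 4016."""
    return not tablets_to_wait_for(unfinished, failed_4016)


# (tenant_id, ls_id, tablet_id); the first tuple comes from the obdiag record,
# the second is a made-up tablet used only to show the "still waiting" case.
failed = [(1001, 1, 455)]
unfinished = [(1001, 1, 455), (1002, 1001, 200001)]

print(tablets_to_wait_for(unfinished, failed))
print(ready_to_replace_binary(failed, failed))
```

The tablets stuck on 4016 are deliberately excluded from the wait list, because they are not expected to finish until the new binary is in place.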

Can this be handled directly as an upgrade through OCP?

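No. An OCP upgrade checks whether a merge is in progress and refuses to proceed while one is running, and this cluster is stuck in a merge.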

After doing steps 1 and 2 manually, could step 3 be done by upgrading the version through OCP? Would that not work either?

If the binary is replaced manually, is it enough to install the rpm package and then restart the observer?

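1. The upgrade will not go through: OCP blocks upgrades while a merge is stuck. Step 2 only checks whether any other merges are in progress. The merge stuck on 4016 stays stuck.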

2. Not a direct install. Extract the rpm package, replace the observer binary by hand, and restart.
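For anyone following along, "extract the rpm and replace the binary by hand" could look roughly like the plan below. Treat this as a dry-run sketch only: it prints commands and runs nothing. The rpm path and scratch directory are placeholders, and the location of the binary inside the package (home/admin/oceanbase/bin/observer) is an assumption to check against the extracted tree. The install home is taken from the log path in the alert. How the observer is stopped and started is left to the usual site procedure.

```python
import shlex


def binary_swap_plan(rpm_path, ob_home, scratch_dir):
    """Build (not run) the shell commands for a manual observer binary swap."""
    q = shlex.quote
    # Assumed layout inside the rpm; verify after extraction.
    new_binary = f"{scratch_dir}/home/admin/oceanbase/bin/observer"
    current = f"{ob_home}/bin/observer"
    return [
        f"mkdir -p {q(scratch_dir)}",
        # rpm2cpio + cpio unpacks the package without installing it.
        f"cd {q(scratch_dir)} && rpm2cpio {q(rpm_path)} | cpio -idm",
        "# stop the observer process on this node before touching the binary",
        f"cp -p {q(current)} {q(current + '.bak')}",
        f"cp {q(new_binary)} {q(current)}",
        "# start the observer again, then move on to the next node",
    ]


# Placeholder paths for illustration; rpm_path should be absolute.
plan = binary_swap_plan("/tmp/new.rpm", "/home/admin/oceanbase", "/tmp/ob_unpack")
print("\n".join(plan))
```

Keeping a .bak copy of the old binary gives a way back if the new one does not start.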
