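【 Environment 】Production
【 OB or other component 】OB 4.3.3
【 Version 】
【Problem description】
On the evening of 2024-10-10, OB was upgraded from 4.3.2.1 to 4.3.3 through OCP. Starting at 02:00 on 2024-10-11, all three observers began raising the following alert event: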
Alert event details
Alert summary: alarm_template_id=0:ob_cluster=s7-1725644966:host=192.168.51.21 OBServer merge failed
Alert details: [OBServer merge failed] cluster: s7, host: 192.168.51.21, log type: observer, log file: /home/admin/oceanbase/log/observer.log, log level: WDIAG, keyword=failed to merge partition, error code=4016, log details=[2024-10-13 16:20:01.018031] WDIAG [STORAGE] process (ob_tablet_merge_task.cpp:1154) [35137][T1001_MAJOR_MER][T1001][YB42C0A83315-0006241F8799C68E-0-0] [lt=17][errcode=-4016] failed to merge partition(ret=-4016).
The observer log shows a matching entry every two minutes:
[root@OB-1 ~]# grep 'failed to merge partition' /home/admin/oceanbase/log/observer.log*
/home/admin/oceanbase/log/observer.log:[2024-10-13 16:32:01.374005] WDIAG [STORAGE] process (ob_tablet_merge_task.cpp:1154) [45648][T1001_MAJOR_MER][T1001][YB42C0A83315-0006241F8799C694-0-0] [lt=9][errcode=-4016] failed to merge partition(ret=-4016)
/home/admin/oceanbase/log/observer.log:[2024-10-13 16:34:01.456340] WDIAG [STORAGE] process (ob_tablet_merge_task.cpp:1154) [47376][T1001_MAJOR_MER][T1001][YB42C0A83315-0006241F8799C695-0-0] [lt=9][errcode=-4016] failed to merge partition(ret=-4016)
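To confirm the two-minute cadence over a longer window than a couple of grep hits, the timestamps can be extracted and differenced. The following is a minimal sketch, not an OceanBase tool: `failure_intervals` is a hypothetical helper name, and it assumes every matching line carries the bracketed `YYYY-MM-DD HH:MM:SS.ffffff` timestamp seen in the output above.

```python
import re
from datetime import datetime

# Matches the bracketed "[YYYY-MM-DD HH:MM:SS.ffffff]" timestamp of an observer log entry.
TS_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\]")


def failure_intervals(lines, keyword="failed to merge partition"):
    """Return the gaps, in seconds, between consecutive lines containing keyword.

    Hypothetical helper for illustration; works on raw log lines or on
    grep output where each line is prefixed with "<file>:".
    """
    stamps = []
    for line in lines:
        if keyword not in line:
            continue
        match = TS_RE.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    stamps.sort()
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


# The two lines returned by the grep above.
sample = [
    "/home/admin/oceanbase/log/observer.log:[2024-10-13 16:32:01.374005] WDIAG [STORAGE] process (ob_tablet_merge_task.cpp:1154) [45648][T1001_MAJOR_MER][T1001][YB42C0A83315-0006241F8799C694-0-0] [lt=9][errcode=-4016] failed to merge partition(ret=-4016)",
    "/home/admin/oceanbase/log/observer.log:[2024-10-13 16:34:01.456340] WDIAG [STORAGE] process (ob_tablet_merge_task.cpp:1154) [47376][T1001_MAJOR_MER][T1001][YB42C0A83315-0006241F8799C695-0-0] [lt=9][errcode=-4016] failed to merge partition(ret=-4016)",
]

print(failure_intervals(sample))  # one gap of about 120 seconds
```

Feeding it the full grep output (for example via `open(path)` on each observer.log file) would show whether the interval stays near 120 seconds or drifts, which helps tell a fixed retry schedule from sporadic failures.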
【Reproduction path】Operations performed before and after the problem appeared
The obdiag result is as follows:
[root@OB-1 ~]# obdiag rca run --scene=major_hold
[WARN] #2&3 MajorHoldScene execute exception: 'NoneType' object has no attribute 'group'
rca finished. For more details, the result on './rca//obdiag_major_hold_20241013161709'
You can get the suggest by 'cat ./rca//obdiag_major_hold_20241013161709/record'
Trace ID: 89ccf4d6-893b-11ef-9742-246e96ba4428
If you want to view detailed obdiag logs, please run: obdiag display-trace 89ccf4d6-893b-11ef-9742-246e96ba4428
[root@OB-1 ~]# cat ./rca//obdiag_major_hold_20241013161709/record
+-------------------------------------------------------------------------------------------------------------+
| record |
+------+------------------------------------------------------------------------------------------------------+
| step | info |
+------+------------------------------------------------------------------------------------------------------+
| 1 | CDB_OB_MAJOR_COMPACTION is not exist IS_ERROR='YES' |
| 2 | __all_virtual_compaction_diagnose_info have status='FAILED',the tenant is [] |
| 3 | merge tasks that have not ended beyond the expected time,the tenant_id is ['1001'] |
| 4 | tenant_id is 1001 |
| 5 | on CDB_OB_MAJOR_COMPACTION where status='COMPACTING';result:[{'TENANT_ID': 1001, 'FROZEN_SCN': |
| | 1728583200208048000, 'FROZEN_TIME': datetime.datetime(2024, 10, 11, 2, 0, 0, 208048), |
| | 'GLOBAL_BROADCAST_SCN': 1728583200208048000, 'LAST_SCN': 1728496802259675000, 'LAST_FINISH_TIME': |
| | datetime.datetime(2024, 10, 10, 2, 3, 37, 164154), 'START_TIME': datetime.datetime(2024, 10, 11, 2, |
| | 0, 1, 54284), 'STATUS': 'COMPACTING', 'IS_ERROR': 'NO', 'IS_SUSPENDED': 'NO', 'INFO': ''}] |
| 6 | on __all_virtual_compaction_diagnose_info;result:(('192.168.51.21', 2882, 1001, 'MEDIUM_MERGE', 1, |
| | 455, 'FAILED', datetime.datetime(2024, 10, 11, 2, 1, 25, 941892), 'error_no=-4016,last_error_time=17 |
| | 28807360766885,error_trace=YB42C0A83315-0006241F8799C68C-0-0,warning="MAJOR_MERGE/MEDIUM_MERGE;ls_id |
| | =1;tablet_id=455;compaction_scn=1728583200208048000;exec_mode="EXEC_MODE_LOCAL",concurrent_cnt=1"'), |
| | ('192.168.51.23', 2882, 1001, 'MEDIUM_MERGE', 1, 455, 'FAILED', datetime.datetime(2024, 10, 11, 2, |
| | 1, 53, 220442), 'error_no=-4016,last_error_time=1728807422210633,error_trace=YB42C0A83317-0006241F76 |
| | ED21AC-0-0,warning="MAJOR_MERGE/MEDIUM_MERGE;ls_id=1;tablet_id=455;compaction_scn=172858320020804800 |
| | 0;exec_mode="EXEC_MODE_LOCAL",concurrent_cnt=1"'), ('192.168.51.22', 2882, 1001, 'MEDIUM_MERGE', 1, |
| | 455, 'FAILED', datetime.datetime(2024, 10, 11, 2, 3, 7, 839729), 'error_no=-4016,last_error_time=172 |
| | 8807389670127,error_trace=YB42C0A83316-0006241F7D0F3DF4-0-0,warning="MAJOR_MERGE/MEDIUM_MERGE;ls_id= |
| | 1;tablet_id=455;compaction_scn=1728583200208048000;exec_mode="EXEC_MODE_LOCAL",concurrent_cnt=1"')) |
| 7 | #2&3 on __all_virtual_compaction_diagnose_info get data failed |
| 8 | global_broadcast_scn is 1728583200208048000 |
| 9 | last_scn is (1728496802259675000,) |
| 10 | tenant_id:1001 OB_COMPACTION_PROGRESS_data_global_broadcast_scn save on ./rca//obdiag_major_hold_202 |
| | 41013161709/rca_major_hold_1001_OB_COMPACTION_PROGRESS_data_global_broadcast_scn |
| 11 | tenant_id:1001 OB_COMPACTION_PROGRESS_data_last_scn save on |
| | ./rca//obdiag_major_hold_20241013161709/rca_major_hold_1001_OB_COMPACTION_PROGRESS_data_last_scn |
| 12 | tenant_id:1001 OB_COMPACTION_PROGRESS_data_last_scn save on |
| | ./rca//obdiag_major_hold_20241013161709/rca_major_hold_1001_OB_COMPACTION_SUGGESTIONS_info |
+------+------------------------------------------------------------------------------------------------------+
The suggest: send the ./rca//obdiag_major_hold_20241013161709 to the oceanbase community
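The raw numbers in the record are easier to compare once converted to wall-clock time. The sketch below assumes that an SCN such as FROZEN_SCN is a Unix timestamp in nanoseconds, that `last_error_time` is a Unix timestamp in microseconds, and that the hosts run on UTC+8; none of this is stated in the output, and the function names are hypothetical. Under these assumptions the FROZEN_SCN from step 5 reproduces the FROZEN_TIME printed next to it, which supports the reading.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the cluster hosts use UTC+8; adjust if the servers run in another zone.
LOCAL_TZ = timezone(timedelta(hours=8))


def scn_to_datetime(scn_ns, tz=LOCAL_TZ):
    """Interpret an SCN as nanoseconds since the Unix epoch (assumption)."""
    seconds, nanos = divmod(scn_ns, 1_000_000_000)
    # datetime only resolves microseconds, so sub-microsecond digits are dropped.
    return datetime.fromtimestamp(seconds, tz) + timedelta(microseconds=nanos // 1000)


def usec_to_datetime(usec, tz=LOCAL_TZ):
    """Interpret last_error_time as microseconds since the Unix epoch (assumption)."""
    seconds, micros = divmod(usec, 1_000_000)
    return datetime.fromtimestamp(seconds, tz) + timedelta(microseconds=micros)


# Values copied from steps 5 and 6 of the record above.
print(scn_to_datetime(1728583200208048000))   # FROZEN_SCN / GLOBAL_BROADCAST_SCN
print(scn_to_datetime(1728496802259675000))   # LAST_SCN
print(usec_to_datetime(1728807360766885))     # last_error_time on 192.168.51.21
```

With these inputs, the broadcast SCN maps to 2024-10-11 02:00:00.208048 and the last error on 192.168.51.21 to 2024-10-13 16:16:00.766885, about a minute before the obdiag run. That is consistent with the merge for tablet_id=455 having been retried and failing continuously since the 02:00 freeze on 2024-10-11.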
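Additional notes:
Besides sys, the OB cluster currently has three tenants. In OCP, under Cluster > Merge Management, the most recent merge is also shown in green as completed normally.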