Captured with obdiag rca run --scene=major_hold:
obclient [oceanbase]> SELECT * FROM oceanbase.CDB_OB_MAJOR_COMPACTION\G
*************************** 1. row ***************************
TENANT_ID: 1
FROZEN_SCN: 1733662804789187373
FROZEN_TIME: 2024-12-08 21:00:04.789187
GLOBAL_BROADCAST_SCN: 1733662804789187373
LAST_SCN: 1733662804789187373
LAST_FINISH_TIME: 2024-12-08 21:04:00.952471
START_TIME: 2024-12-08 21:00:04.927890
STATUS: IDLE
IS_ERROR: NO
IS_SUSPENDED: NO
INFO:
*************************** 2. row ***************************
TENANT_ID: 1001
FROZEN_SCN: 1733680804861169439
FROZEN_TIME: 2024-12-09 02:00:04.861169
GLOBAL_BROADCAST_SCN: 1733680804861169439
LAST_SCN: 1733680804861169439
LAST_FINISH_TIME: 2024-12-09 02:02:32.539844
START_TIME: 2024-12-09 02:00:05.001461
STATUS: IDLE
IS_ERROR: NO
IS_SUSPENDED: NO
INFO:
*************************** 3. row ***************************
TENANT_ID: 1002
FROZEN_SCN: 1732971602166717024
FROZEN_TIME: 2024-11-30 21:00:02.166717
GLOBAL_BROADCAST_SCN: 1732971602166717024
LAST_SCN: 1732885201440173241
LAST_FINISH_TIME: 2024-11-30 12:02:32.934553
START_TIME: 2024-11-30 21:00:04.192632
STATUS: COMPACTING
IS_ERROR: NO
IS_SUSPENDED: NO
INFO:
3 rows in set (0.014 sec)
obclient [oceanbase]> select * from __all_virtual_compaction_diagnose_info;
+-------------+----------+-----------+--------------+-------+-----------+--------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| svr_ip | svr_port | tenant_id | type | ls_id | tablet_id | status | create_time | diagnose_info |
+-------------+----------+-----------+--------------+-------+-----------+--------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 81.16.49.44 | 2882 | 1002 | MINI_MERGE | 1 | 330 | NOT_SCHEDULE | 2024-12-09 09:25:11.027250 | schedule_suspect_info="info="memtable can not create dag successfully";error_code=-4023;has been ready for flush time=8923518467;ready for flush time=1733698587508771;" |
| 81.16.49.44 | 2882 | 1002 | MEDIUM_MERGE | 1 | 329 | FAILED | 2024-11-30 21:17:36.633805 | error_no=-4019,last_error_time=1733706355752898,error_trace=YB425110312C-000615145E7A25E9-0-0,warning="info="DAG_MAJOR_MERGE";ls_id=1;tablet_id=329;compaction_scn=1732971602166717024;extra_info="merge_type="MEDIUM_MERGE"";" |
| 81.16.49.44 | 2882 | 1002 | MINI_MERGE | 1 | 329 | NOT_SCHEDULE | 2024-12-09 09:25:16.029205 | schedule_suspect_info="info="memtable can not create dag successfully";error_code=-4023;has been ready for flush time=8928520426;ready for flush time=1733698587508766;" |
+-------------+----------+-----------+--------------+-------+-----------+--------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (4.961 sec)
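The diagnose_info column packs several fields into one string. A minimal Python sketch for pulling out the parts that matter when searching logs (error code, trace id, last error time) is shown below. The function name parse_diagnose_info is made up for illustration, and treating last_error_time as microseconds since the Unix epoch is an assumption based on the magnitude of the value, not something confirmed in this thread.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_diagnose_info(text):
    """Extract error code, trace id and last error time from a diagnose_info string.

    Hypothetical helper: the field names are taken from the output pasted above.
    """
    out = {}
    # The rows above use either error_no=... or error_code=...
    m = re.search(r"error_(?:no|code)=(-?\d+)", text)
    if m:
        out["error"] = int(m.group(1))
    m = re.search(r"error_trace=([^,;]+)", text)
    if m:
        out["trace"] = m.group(1)
    m = re.search(r"last_error_time=(\d+)", text)
    if m:
        # Assumption: the value is microseconds since the Unix epoch.
        us = int(m.group(1))
        out["last_error_utc"] = datetime.fromtimestamp(
            us // 1_000_000, tz=timezone.utc
        ) + timedelta(microseconds=us % 1_000_000)
    return out


# The FAILED row for tablet 329, copied from the output above.
failed_row = (
    'error_no=-4019,last_error_time=1733706355752898,'
    'error_trace=YB425110312C-000615145E7A25E9-0-0,'
    'warning="info="DAG_MAJOR_MERGE";ls_id=1;tablet_id=329;'
    'compaction_scn=1732971602166717024;extra_info="merge_type="MEDIUM_MERGE"";"'
)
for key, value in parse_diagnose_info(failed_row).items():
    print(key, value)
```

The extracted trace id is the string to search for in observer.log on the node listed in svr_ip.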
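Based on the information you provided, a compaction on OceanBase 4.2.1 has been stuck for more than 9 days and reports error 4019. You have also supplied additional diagnostics: data from the CDB_OB_MAJOR_COMPACTION and __all_virtual_compaction_diagnose_info tables, and the diagnostic report generated by the obdiag tool. Below is an analysis of this information and some possible solutions: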
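Analysis
- Compaction status:
  - In the CDB_OB_MAJOR_COMPACTION table, tenant 1002 has status COMPACTING, while the other tenants are IDLE.
  - This indicates that the compaction for tenant 1002 is still in progress and has been running for a long time.
- Diagnostic information:
  - In the __all_virtual_compaction_diagnose_info table, the MEDIUM_MERGE operation for tenant 1002 failed with error code -4019.
  - The error information names DAG_MAJOR_MERGE, and this failure may be one of the reasons the compaction is stuck.
- Resource usage:
  - You have already generated a diagnostic report with the obdiag tool, which helps with further analysis of resource usage and potential problems.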
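The status check described in the analysis can be sketched in Python as follows. This is only an illustration: the rows are copied from the CDB_OB_MAJOR_COMPACTION output earlier in the thread, the function name stuck_tenants and the one-day threshold are assumptions, and "now" is set to a create_time from the diagnose output so that the result is reproducible.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Values copied from the CDB_OB_MAJOR_COMPACTION output in this thread.
rows = [
    {"TENANT_ID": 1, "STATUS": "IDLE", "START_TIME": "2024-12-08 21:00:04.927890"},
    {"TENANT_ID": 1001, "STATUS": "IDLE", "START_TIME": "2024-12-09 02:00:05.001461"},
    {"TENANT_ID": 1002, "STATUS": "COMPACTING", "START_TIME": "2024-11-30 21:00:04.192632"},
]


def stuck_tenants(rows, now, threshold=timedelta(days=1)):
    """Return (tenant_id, elapsed) for tenants that are not IDLE and whose
    current compaction round started more than `threshold` ago.

    The threshold is a hypothetical cut-off, not an OceanBase default.
    """
    result = []
    for row in rows:
        if row["STATUS"] == "IDLE":
            continue
        elapsed = now - datetime.strptime(row["START_TIME"], FMT)
        if elapsed > threshold:
            result.append((row["TENANT_ID"], elapsed))
    return result


# "now" is taken from a create_time in the diagnose output, for reproducibility.
now = datetime.strptime("2024-12-09 09:25:11.027250", FMT)
for tenant_id, elapsed in stuck_tenants(rows, now):
    print(tenant_id, elapsed)
```

Only tenant 1002 is reported, which matches the COMPACTING row in the query output.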
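Possible causes and solutions
- Resource bottleneck:
  - Cause: compaction needs a large amount of compute and storage resources. If system resources are insufficient, the compaction can get stuck.
  - Solution:
    - Check system resource usage and make sure there is enough CPU, memory and disk I/O.
    - Use SHOW PROCESSLIST to list the running sessions, find any that consume a lot of resources, and optimize or terminate them.
    - Analyze the diagnostic report generated by obdiag, especially the resource usage section.
- Excessive data volume:
  - Cause: if the amount of data to compact is very large, the compaction can be very slow.
  - Solution:
    - Analyze the data distribution and make sure data is spread evenly across partitions.
    - Consider adjusting the compaction strategy, for example incremental or batched compaction.
- Network problems:
  - Cause: network latency or instability can cause the compaction to get stuck.
  - Solution:
    - Check the network and make sure the nodes can communicate with each other normally.
    - Use ping or traceroute to check network latency.
- Improper configuration parameters:
  - Cause: some parameter settings can cause the compaction to get stuck.
  - Solution:
    - Check the major_freeze_duty_time parameter and make sure the compaction time window is reasonable.
    - Adjust the minor_freeze_times and major_freeze_times parameters to tune the compaction frequency.
- Error code -4019:
  - Cause: error code -4019 usually means the compaction hit an error, possibly because of insufficient resources, data corruption or another internal problem.
  - Solution:
    - Check observer.log for compaction-related records, especially the details around error code -4019.
    - If the log has no clear error message, contact the OceanBase technical support team and provide the detailed diagnostics and log files for further analysis.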
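Recommended steps
- Analyze the obdiag diagnostic report:
  - Read the report generated by obdiag carefully, especially the sections on resource usage and the compaction process.
  - Pay attention to any anomalies or warnings mentioned in the report.
- Check the log files:
  - Use obdiag to collect the log files, and analyze the errors and warnings in them.
  - Focus on observer.log and look for compaction-related records.
- Adjust configuration parameters:
  - View the current values:
    SHOW PARAMETERS LIKE 'major_freeze_duty_time';
    SHOW PARAMETERS LIKE 'minor_freeze_times';
    SHOW PARAMETERS LIKE 'major_freeze_times';
  - Adjust these parameters as appropriate to tune the compaction frequency and time window.
- Contact technical support:
  - If the steps above do not resolve the problem, contact the OceanBase technical support team and provide the detailed diagnostics and log files for further analysis.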
I hope this information helps. If you have more details or log files, please share them so that we can help you further.
(The assistant's reply ends here. If it did not solve your problem, please keep asking and wait for replies from other members. Thank you!)
Follow this document to find out which tables have a compaction problem:
How to troubleshoot a stuck compaction
https://www.oceanbase.com/knowledge-base/oceanbase-database-1000000000685150?back=kb
Please check whether the logs on node 81.16.49.44 contain
tablet multi version start not advance for a long time
I suspect that a stuck backup is what makes the compaction hang and report the error.
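One way to run that check, sketched in Python under assumptions: the helper names find_matches and scan_file are made up for illustration, and the path of the log file on the node has to be supplied by whoever runs it (it is not given in this thread).

```python
NEEDLE = "tablet multi version start not advance for a long time"


def find_matches(lines, needle=NEEDLE):
    """Return (line_number, line) pairs for the lines that contain `needle`.

    `lines` can be any iterable of strings, for example an open file handle.
    """
    return [
        (lineno, line.rstrip("\n"))
        for lineno, line in enumerate(lines, start=1)
        if needle in line
    ]


def scan_file(path, needle=NEEDLE):
    """Scan one log file; undecodable bytes are replaced instead of raising."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return find_matches(fh, needle)


# Usage (the path is whatever observer log file exists on the node):
#   for lineno, line in scan_file("observer.log"):
#       print(lineno, line)
```

An empty result only shows that the message is absent from the files that were scanned; logs that have already been rotated away cannot be checked this way.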
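If this is a three-replica deployment, you can restart the OB nodes one at a time. That will most likely recover it.
The stuck table is a system table? Could this have triggered a BUG?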
obclient [oceanbase]> select table_name from __all_virtual_table where tablet_id=329;
+-------------------+
| table_name        |
+-------------------+
| __all_column_stat |
+-------------------+
1 row in set (0.114 sec)
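Is your OB version 4.2.1 BP3 HF1?
It is the very first 4.2.1 release. So this issue was only fixed in BP3?
Keep following that document and troubleshoot further to see exactly what caused the minor compaction (dump) to fail. Please post as much of the troubleshooting output as you can; at the moment the cause cannot be determined.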
obclient [oceanbase]> select *
    -> from __all_virtual_server_compaction_event_history
    -> where tenant_id = 1002
    -> and compaction_scn = 1732971602166717024
    -> and event like '%FINISHED%';
+-------------+----------+-------+-----------+-------------+---------------------+----------------------------+------------------------------------------------------------------------+
| svr_ip | svr_port | zone | tenant_id | type | compaction_scn | event_timestamp | event |
+-------------+----------+-------+-----------+-------------+---------------------+----------------------------+------------------------------------------------------------------------+
| 81.16.49.46 | 2882 | zone3 | 1002 | MAJOR_MERGE | 1732971602166717024 | 2024-12-01 12:13:18.687964 | cost_time:3405.72s | TABLET_COMPACTION_FINISHED:cost_time=54786295940, |
| 81.16.49.45 | 2882 | zone2 | 1002 | MAJOR_MERGE | 1732971602166717024 | 2024-12-01 14:15:00.267005 | cost_time:2611.88s | TABLET_COMPACTION_FINISHED:cost_time=62090443016, |
+-------------+----------+-------+-----------+-------------+---------------------+----------------------------+------------------------------------------------------------------------+
select * from __all_virtual_dag_warning_history;
+-------------+----------+-----------+-----------------------------------+------------+----------------------+--------------------------------------+---------+----------------------------+----------------------------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------+
| svr_ip | svr_port | tenant_id | task_id | module | type | ret | status | gmt_create | gmt_modified | retry_cnt | warning_info |
+-------------+----------+-----------+-----------------------------------+------------+----------------------+--------------------------------------+---------+----------------------------+----------------------------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------+
| 81.16.49.46 | 2882 | 1002 | YB425110312E-000609789B86EBA6-0-0 | COMPACTION | MAJOR_MERGE | OB_TIMEOUT | WARNING | 2023-12-07 12:41:16.994083 | 2023-12-07 12:41:16.994083 | 0 | info=“DAG_MAJOR_MERGE”;ls_id=1002;tablet_id=1152921504609076455;compaction_scn=1701885602321989480;extra_info=“merge_type=“MEDIUM_MERGE””; |
| 81.16.49.46 | 2882 | 1002 | YB425110312E-000609789B86EBAF-0-0 | COMPACTION | MAJOR_MERGE | OB_TIMEOUT | WARNING | 2023-12-07 12:41:29.292828 | 2023-12-07 12:41:29.292828 | 0 | info=“DAG_MAJOR_MERGE”;ls_id=1002;tablet_id=1152921504607078196;compaction_scn=1701885602321989480;extra_info=“merge_type=“MEDIUM_MERGE””; |
| 81.16.49.46 | 2882 | 1002 | YB425110312E-000609789B86EBBD-0-0 | COMPACTION | MAJOR_MERGE | OB_TIMEOUT | WARNING | 2023-12-07 12:41:49.295295 | 2023-12-07 12:41:49.295295 | 0 | info=“DAG_MAJOR_MERGE”;ls_id=1002;tablet_id=1152921504609085858;compaction_scn=1701885602321989480;extra_info=“merge_type=“MEDIUM_MERGE””; |
| 81.16.49.46 | 2882 | 1002 | YB425110312E-000609789B86EBC6-0-0 | COMPACTION | MAJOR_MERGE | OB_TIMEOUT | WARNING | 2023-12-07 12:42:08.719468 | 2023-12-07 12:42:08.719468 | 0 | info=“DAG_MAJOR_MERGE”;ls_id=1002;tablet_id=1152921504607073689;compaction_scn=1701885602321989480;extra_info=“merge_type=“MEDIUM_MERGE””; |
| 81.16.49.46 | 2882 | 1002 | YB425110312E-0006097AF5B3767D-0-0 | BACKUP | BACKUP_PREPARE | OB_TIMEOUT | WARNING | 2024-04-02 16:15:51.898716 | 2024-04-02 16:15:51.898716 | 0 | info=“DAG_BACKUP_PREPAER”;tenant_id=1002;backup_set_id=8;ls_id=1001;turn_id=1;retry_id=0; |
| 81.16.49.46 | 2882 | 1002 | YB425110312E-0006097B44E7BD76-0-0 | BACKUP | BACKUP_DATA | OB_BACKUP_DEVICE_OUT_OF_SPACE | WARNING | 2024-04-20 17:00:20.437523 | 2024-04-20 17:00:20.437523 | 0 | info=“DAG_BACKUP_DATA”;tenant_id=1002;backup_set_id=16;backup_data_type=2;ls_id=1001;turn_id=1;retry_id=0;task_id=36; |
| 81.16.49.46 | 2882 | 1002 | YB425110312E-0006097B44E7BD79-0-0 | BACKUP | BACKUP_DATA | OB_BACKUP_DEVICE_OUT_OF_SPACE | WARNING | 2024-04-20 17:00:20.475745 | 2024-04-20 17:00:20.475745 | 0 | info=“DAG_BACKUP_DATA”;tenant_id=1002;backup_set_id=16;backup_data_type=2;ls_id=1001;turn_id=1;retry_id=0;task_id=37; |
| 81.16.49.46 | 2882 | 1002 | YB425110312E-0006097C514B292D-0-0 | BACKUP | BACKUP_PREPARE | OB_BACKUP_ADVANCE_CHECKPOINT_TIMEOUT | WARNING | 2024-07-10 10:40:52.496160 | 2024-07-10 10:40:52.496160 | 0 | info=“DAG_BACKUP_PREPAER”;tenant_id=1002;backup_set_id=44;ls_id=1001;turn_id=1;retry_id=0; |
| 81.16.49.46 | 2882 | 1002 | YB425110312E-0006097EB042FC07-0-0 | BACKUP | BACKUP_PREPARE | OB_TIMEOUT | WARNING | 2024-10-29 16:02:30.689259 | 2024-10-29 16:02:30.689259 | 0 | info=“DAG_BACKUP_PREPAER”;tenant_id=1002;backup_set_id=92;ls_id=1001;turn_id=1;retry_id=0; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-00060B91007EE397-0-0 | BACKUP | BACKUP_META | OB_TIMEOUT | WARNING | 2024-04-02 16:15:13.897556 | 2024-04-02 16:15:13.897556 | 0 | info=“DAG_BACKUP_META”;ls_id=1002; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-00060B91007EE47C-0-0 | BACKUP | BACKUP_PREPARE | OB_TIMEOUT | WARNING | 2024-04-02 16:15:13.899922 | 2024-04-02 16:15:13.899922 | 0 | info=“DAG_BACKUP_PREPAER”;tenant_id=1002;backup_set_id=8;ls_id=1002;turn_id=1;retry_id=0; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-00060BB856599BBA-0-0 | BACKUP | PREFETCH_BACKUP_INFO | OB_REPLICA_CANNOT_BACKUP | WARNING | 2024-04-06 16:45:00.088232 | 2024-04-06 16:45:00.088232 | 0 | info=“DAG_PREFETCH_BACKUP_INFO”;tenant_id=1002;backup_set_id=10;backup_data_type=1;ls_id=1002;turn_id=1;retry_id=0;task_id=58; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-00060C28C3FCE1C2-0-0 | BACKUP | PREFETCH_BACKUP_INFO | OB_REPLICA_CANNOT_BACKUP | WARNING | 2024-04-18 16:05:06.397421 | 2024-04-18 16:05:06.397421 | 0 | info=“DAG_PREFETCH_BACKUP_INFO”;tenant_id=1002;backup_set_id=15;backup_data_type=1;ls_id=1002;turn_id=1;retry_id=0;task_id=0; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-00060C3FF6AAB750-0-0 | BACKUP | BACKUP_DATA | OB_BACKUP_DEVICE_OUT_OF_SPACE | WARNING | 2024-04-20 17:00:20.468430 | 2024-04-20 17:00:20.468430 | 0 | info=“DAG_BACKUP_DATA”;tenant_id=1002;backup_set_id=16;backup_data_type=2;ls_id=1002;turn_id=1;retry_id=0;task_id=51; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-00060C3FF6AAB74E-0-0 | BACKUP | BACKUP_DATA | OB_BACKUP_DEVICE_OUT_OF_SPACE | WARNING | 2024-04-20 17:00:20.630305 | 2024-04-20 17:00:20.630305 | 0 | info=“DAG_BACKUP_DATA”;tenant_id=1002;backup_set_id=16;backup_data_type=2;ls_id=1002;turn_id=1;retry_id=0;task_id=50; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-000610A2279F1B90-0-0 | COMPACTION | MDS_TABLE_MERGE | OB_TIMEOUT | WARNING | 2024-08-14 01:37:40.363119 | 2024-08-14 01:37:40.363119 | 0 | info=“DAG_TYPE_MDS_TABLE_MERGE”;ls_id=1002;tablet_id=1152921504622036030;flush_scn=1723569896763204874; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-000610F37F87DF33-0-0 | BACKUP | PREFETCH_BACKUP_INFO | OB_REPLICA_CANNOT_BACKUP | WARNING | 2024-08-22 16:37:02.186903 | 2024-08-22 16:37:02.186903 | 0 | info=“DAG_PREFETCH_BACKUP_INFO”;tenant_id=1002;backup_set_id=63;backup_data_type=1;ls_id=1002;turn_id=1;retry_id=0;task_id=13; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-000611064E6B5F54-0-0 | BACKUP | PREFETCH_BACKUP_INFO | OB_REPLICA_CANNOT_BACKUP | WARNING | 2024-08-24 16:41:28.840891 | 2024-08-24 16:41:28.840891 | 0 | info=“DAG_PREFETCH_BACKUP_INFO”;tenant_id=1002;backup_set_id=64;backup_data_type=1;ls_id=1002;turn_id=1;retry_id=0;task_id=2; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-000611AB6CEC621F-0-0 | BACKUP | PREFETCH_BACKUP_INFO | OB_REPLICA_CANNOT_BACKUP | WARNING | 2024-09-10 16:48:46.611018 | 2024-09-10 16:48:46.611018 | 0 | info=“DAG_PREFETCH_BACKUP_INFO”;tenant_id=1002;backup_set_id=71;backup_data_type=1;ls_id=1002;turn_id=1;retry_id=0;task_id=85; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-000612CEF71A9D82-0-0 | BACKUP | PREFETCH_BACKUP_INFO | OB_REPLICA_CANNOT_BACKUP | WARNING | 2024-10-10 16:16:42.382201 | 2024-10-10 16:16:42.382201 | 0 | info=“DAG_PREFETCH_BACKUP_INFO”;tenant_id=1002;backup_set_id=84;backup_data_type=1;ls_id=1002;turn_id=1;retry_id=0;task_id=11; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-000615589C3F2D51-0-0 | COMPACTION | MAJOR_MERGE | OB_SIZE_OVERFLOW | RETRYED | 2024-11-30 21:17:36.633805 | 2024-12-16 16:06:16.964482 | 538 | info=“DAG_MAJOR_MERGE”;ls_id=1;tablet_id=329;compaction_scn=1732971602166717024;extra_info=“merge_type=“MEDIUM_MERGE””; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-000615589C3F2D77-0-0 | COMPACTION | MINI_MERGE | OB_SIZE_OVERFLOW | RETRYED | 2024-12-01 00:53:57.292093 | 2024-12-16 16:11:05.761313 | 29141 | info=“DAG_MINI_MERGE”;ls_id=1;tablet_id=330;compaction_scn=0;extra_info=“merge_type=“MINI_MERGE””; |
| 81.16.49.44 | 2882 | 1002 | YB425110312C-000615589C3F2D85-0-0 | COMPACTION | MINI_MERGE | OB_SIZE_OVERFLOW | RETRYED | 2024-12-01 00:53:57.154590 | 2024-12-16 16:11:34.253882 | 39139 | info=“DAG_MINI_MERGE”;ls_id=1;tablet_id=329;compaction_scn=0;extra_info=“merge_type=“MINI_MERGE””; |
+-------------+----------+-----------+-----------------------------------+------------+----------------------+--------------------------------------+---------+----------------------------+----------------------------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------+
23 rows in set (0.030 sec)
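Most of the 23 rows are old backup warnings with retry_cnt 0. The rows that matter for the current hang are the ones still being retried. A small Python sketch of that filter follows; the DagWarning tuple and the function name still_retrying are made up for illustration, and the sample rows are copied from the output above (all three RETRYED rows plus one WARNING row for contrast).

```python
from collections import namedtuple

# Hypothetical container; field names follow the column names in the output above.
DagWarning = namedtuple("DagWarning", "svr_ip type ret status retry_cnt tablet_id")

rows = [
    DagWarning("81.16.49.44", "MDS_TABLE_MERGE", "OB_TIMEOUT", "WARNING", 0, 1152921504622036030),
    DagWarning("81.16.49.44", "MAJOR_MERGE", "OB_SIZE_OVERFLOW", "RETRYED", 538, 329),
    DagWarning("81.16.49.44", "MINI_MERGE", "OB_SIZE_OVERFLOW", "RETRYED", 29141, 330),
    DagWarning("81.16.49.44", "MINI_MERGE", "OB_SIZE_OVERFLOW", "RETRYED", 39139, 329),
]


def still_retrying(rows):
    """Return the rows with status RETRYED, highest retry_cnt first."""
    retried = [r for r in rows if r.status == "RETRYED"]
    return sorted(retried, key=lambda r: r.retry_cnt, reverse=True)


for r in still_retrying(rows):
    print(r.svr_ip, r.type, r.tablet_id, r.ret, r.retry_cnt)
```

All three retried dags are on 81.16.49.44, belong to ls_id 1, and fail with OB_SIZE_OVERFLOW. The MAJOR_MERGE row has the same tablet_id (329), compaction_scn and create time as the FAILED row with error_no=-4019 in __all_virtual_compaction_diagnose_info, so the two outputs appear to describe the same failure.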
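Please provide rootservice.log and observer.log.
observer.log has already been overwritten.
In that case it is hard to find out what caused the compaction to hang. If you have a three-replica deployment, restart the nodes one by one.
Restart. Most of the time, restarting the OB cluster nodes solves the problem. Most databases also recover after a restart; the advantage of OB's three replicas is that the cluster nodes can be restarted one at a time, so the business barely notices (unless a very large transaction gets interrupted and reports an error).
The cause of a compaction problem is hard to pin down. Common ones are:
- Insufficient disk space or OB internal space.
- A large transaction, or some operation on a large table, triggering an unknown BUG.
4.2.1 is now at BP10. You can upgrade to BP8 or BP9; if this is a BUG, the upgrade may fix it.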