在某租户长时间为合并成功,重启observer后,仍然合并卡住

【 使用环境 】生产环境
【 OB or 其他组件 】OB
【 使用版本 】5.7.25-OceanBase_CE-v4.3.1.0
【问题描述】在某租户长时间为合并成功,重启observer后,仍然合并卡住
【附件及日志】
obdiag version: 4.3.0
observer Version: 4.3.1.0
±-----------------------------------------------------------------------------------------+
| record |
±-----±----------------------------------------------------------------------------------+
| step | info |
±-----±----------------------------------------------------------------------------------+
| 1 | observer version: 4.3.1.0 |
| 2 | Starting major compaction hold diagnosis… |
| 3 | Step 1: Checking CDB_OB_MAJOR_COMPACTION for errors |
| 4 | No compaction errors found (IS_ERROR=‘YES’) |
| 5 | Step 2: Checking for suspended compactions |
| 6 | No suspended compactions found |
| 7 | Step 3: Checking __all_virtual_compaction_diagnose_info for failures |
| 8 | No failed compaction tasks in diagnose info |
| 9 | Step 4: Checking for long-running compaction tasks (>20 minutes) |ccccc
| 10 | Found long-running compaction tasks in tenants: [‘1004’] |
| 11 | Step 5: Analyzing compaction speed |
| 12 | Tenant 1004 on ccc.2.11:2882: Progress 51.4%, Speed 0.20 MB/s, Elapsed 35000s |
| 13 | Tenant 1004 on xx.xx.xx.12:2882: Progress 51.5%, Speed 0.20 MB/s, Elapsed 35000s |
| 14 | Tenant 1004 on xx.xx.xx.13:2882: Progress 51.6%, Speed 0.20 MB/s, Elapsed 34999s |
| 15 | Tenants requiring detailed diagnosis: [‘1004’] |
±-----±----------------------------------------------------------------------------------+
The suggest: Very slow compaction detected for tenant 1004 (0.20 MB/s). Check disk I/O, memory pressure, or DAG scheduler status.
Very slow compaction detected for tenant 1004 (0.20 MB/s). Check disk I/O, memory pressure, or DAG scheduler status.
Very slow compaction detected for tenant 1004 (0.20 MB/s). Check disk I/O, memory pressure, or DAG scheduler status.

±------------------------------------------------------------------------------------------------------------+
| record |
±-----±-----------------------------------------------------------------------------------------------------+
| step | info |
±-----±-----------------------------------------------------------------------------------------------------+
| 1 | observer version: 4.3.1.0 |
| 2 | Starting major compaction hold diagnosis… |
| 3 | Step 1: Checking CDB_OB_MAJOR_COMPACTION for errors |
| 4 | No compaction errors found (IS_ERROR=‘YES’) |
| 5 | Step 2: Checking for suspended compactions |
| 6 | No suspended compactions found |
| 7 | Step 3: Checking __all_virtual_compaction_diagnose_info for failures |
| 8 | No failed compaction tasks in diagnose info |
| 9 | Step 4: Checking for long-running compaction tasks (>20 minutes) |
| 10 | Found long-running compaction tasks in tenants: [‘1004’] |
| 11 | Step 5: Analyzing compaction speed |
| 12 | Tenant 1004 on xx.xx.xx.11:2882: Progress 51.4%, Speed 0.20 MB/s, Elapsed 35000s |
| 13 | Tenant 1004 on xx.xx.xx.12:2882: Progress 51.5%, Speed 0.20 MB/s, Elapsed 35000s |
| 14 | Tenant 1004 on xx.xx.xx.13:2882: Progress 51.6%, Speed 0.20 MB/s, Elapsed 34999s |
| 15 | Tenants requiring detailed diagnosis: [‘1004’] |
| 16 | === Detailed diagnosis for tenant_id: 1004 === |
| 17 | Step 1: Getting CDB_OB_MAJOR_COMPACTION status |
| 18 | Compaction status: [{“TENANT_ID”: 1004, “FROZEN_SCN”: 1781200801154149000, “FROZEN_TIME”: |
| | “2026-06-12 02:00:01”, “GLOBAL_BROADCAST_SCN”: 1781200801154149000, “LAST_SCN”: 1779732001889884001, |
| | “LAST_FINISH_TIME”: “2026-06-11 23:31:45”, “START_TIME”: “2026-06-12 02:00:01”, “STATUS”: |
| | “COMPACTING”, “IS_ERROR”: “NO”, “IS_SUSPENDED”: “NO”, “INFO”: “”}] |
| 19 | Step 2: Analyzing diagnose info |
| 20 | Step 3: Getting compaction progress details |
| 21 | GLOBAL_BROADCAST_SCN: 1781200801154149000, LAST_SCN: 1779732001889884001 |
| 22 | Step 4: Getting compaction history |
| 23 | Saved 100 compaction history records |
| 24 | Step 5: Getting compaction suggestions |
| 25 | Step 6: Getting DAG scheduler status |
| 26 | DAG scheduler info collected |
| 27 | Step 7: Getting DAG task status |
| 28 | Step 8: Collecting dmesg logs |
| 29 | Collected dmesg from local |
| 30 | Collected dmesg from remote_xx_xx_xx_12 |
| 31 | Collected dmesg from remote_xx_xx_xx_13 |
| 32 | Step 9: Collecting relevant observer logs |
| 33 | Step 10: Checking for duplicate index names |
| 34 | Checking for duplicate index names that may cause compaction hold… |
| 35 | No duplicate index names found across different tables |
| 36 | Step 11: Checking DDL task status |
| 37 | Checking DDL task status… |
| 38 | No DDL tasks found |
| 39 | Found 20 DDL error records |
| 40 | DDL error: task_id=None, ret_code=None, message= |
| 41 | DDL error: task_id=None, ret_code=None, message= |
| 42 | DDL error: task_id=None, ret_code=None, message= |
| 43 | DDL error: task_id=None, ret_code=None, message= |
| 44 | DDL error: task_id=None, ret_code=None, message= |
| 45 | Step 12: Checking memory throttling issues |
| 46 | Checking memory throttling issues (Issue #1107)… |
| 47 | Tenant memory usage: |
| 48 | Server xx.xx.xx.11:2882: hold=40.86GB, limit=50.0GB, usage=81.72% |
| 49 | Server xx.xx.xx.13:2882: hold=40.69GB, limit=50.0GB, usage=81.38% |
| 50 | Server xx.xx.xx.12:2882: hold=31.5GB, limit=50.0GB, usage=63.0% |
| 51 | Memstore info: |
| 52 | Server xx.xx.xx.11:2882: memstore_usage=8.95%, writing_throttling_trigger=60.0% |
| 53 | Server xx.xx.xx.12:2882: memstore_usage=8.9%, writing_throttling_trigger=60.0% |
| 54 | Server xx.xx.xx.13:2882: memstore_usage=7.65%, writing_throttling_trigger=60.0% |
| 55 | Checking TxShare throttling limit vs meta tenant memory limit… |
| 56 | Meta tenant (tenant_id=1) memory limits: |
| 57 | Server xx.xx.xx.11:2882: limit=20.0GB, hold=4.58GB |
| 58 | Server xx.xx.xx.12:2882: limit=20.0GB, hold=2.45GB |
| 59 | Server xx.xx.xx.13:2882: limit=20.0GB, hold=2.42GB |
| 60 | Throttling-related parameters: |
| 61 | Server xx.xx.xx.11: log_disk_throttling_maximum_duration = 2h |
| 62 | Server xx.xx.xx.11: log_disk_throttling_maximum_duration = 2h |
| 63 | Server xx.xx.xx.11: log_disk_throttling_maximum_duration = 2h |
| 64 | Server xx.xx.xx.11: log_disk_throttling_maximum_duration = 2h |
| 65 | Server xx.xx.xx.11: log_disk_throttling_maximum_duration = 2h |
| 66 | Server xx.xx.xx.12: log_disk_throttling_maximum_duration = 2h |
| 67 | Server xx.xx.xx.12: log_disk_throttling_maximum_duration = 2h |
| 68 | Server xx.xx.xx.12: log_disk_throttling_maximum_duration = 2h |
| 69 | Server xx.xx.xx.12: log_disk_throttling_maximum_duration = 2h |
| 70 | Server xx.xx.xx.12: log_disk_throttling_maximum_duration = 2h |
| 71 | Server xx.xx.xx.13: log_disk_throttling_maximum_duration = 2h |
| 72 | Server xx.xx.xx.13: log_disk_throttling_maximum_duration = 2h |
| 73 | Server xx.xx.xx.13: log_disk_throttling_maximum_duration = 2h |
| 74 | Server xx.xx.xx.13: log_disk_throttling_maximum_duration = 2h |
| 75 | Server xx.xx.xx.13: log_disk_throttling_maximum_duration = 2h |
| 76 | Server xx.xx.xx.11: log_disk_throttling_percentage = 60 |
| 77 | Server xx.xx.xx.11: log_disk_throttling_percentage = 60 |
| 78 | Server xx.xx.xx.11: log_disk_throttling_percentage = 60 |
| 79 | Server xx.xx.xx.11: log_disk_throttling_percentage = 60 |
| 80 | Server xx.xx.xx.11: log_disk_throttling_percentage = 60 |
| 81 | Server xx.xx.xx.12: log_disk_throttling_percentage = 60 |
| 82 | Server xx.xx.xx.12: log_disk_throttling_percentage = 60 |
| 83 | Server xx.xx.xx.12: log_disk_throttling_percentage = 60 |
| 84 | Server xx.xx.xx.12: log_disk_throttling_percentage = 60 |
| 85 | Server xx.xx.xx.12: log_disk_throttling_percentage = 60 |
| 86 | Server xx.xx.xx.13: log_disk_throttling_percentage = 60 |
| 87 | Server xx.xx.xx.13: log_disk_throttling_percentage = 60 |
| 88 | Server xx.xx.xx.13: log_disk_throttling_percentage = 60 |
| 89 | Server xx.xx.xx.13: log_disk_throttling_percentage = 60 |
| 90 | Server xx.xx.xx.13: log_disk_throttling_percentage = 60 |
| 91 | Server xx.xx.xx.11: writing_throttling_maximum_duration = 2h |
| 92 | Server xx.xx.xx.11: writing_throttling_maximum_duration = 2h |
| 93 | Server xx.xx.xx.11: writing_throttling_maximum_duration = 2h |
| 94 | Server xx.xx.xx.11: writing_throttling_maximum_duration = 2h |
| 95 | Server xx.xx.xx.11: writing_throttling_maximum_duration = 2h |
| 96 | Server xx.xx.xx.12: writing_throttling_maximum_duration = 2h |
| 97 | Server xx.xx.xx.12: writing_throttling_maximum_duration = 2h |
| 98 | Server xx.xx.xx.12: writing_throttling_maximum_duration = 2h |
| 99 | Server xx.xx.xx.12: writing_throttling_maximum_duration = 2h |
| 100 | Server xx.xx.xx.12: writing_throttling_maximum_duration = 2h |
| 101 | Server xx.xx.xx.13: writing_throttling_maximum_duration = 2h |
| 102 | Server xx.xx.xx.13: writing_throttling_maximum_duration = 2h |
| 103 | Server xx.xx.xx.13: writing_throttling_maximum_duration = 2h |
| 104 | Server xx.xx.xx.13: writing_throttling_maximum_duration = 2h |
| 105 | Server xx.xx.xx.13: writing_throttling_maximum_duration = 2h |
| 106 | Server xx.xx.xx.11: writing_throttling_trigger_percentage = 60 |
| 107 | Server xx.xx.xx.11: writing_throttling_trigger_percentage = 60 |
| 108 | Server xx.xx.xx.11: writing_throttling_trigger_percentage = 60 |
| 109 | Server xx.xx.xx.11: writing_throttling_trigger_percentage = 60 |
| 110 | Server xx.xx.xx.11: writing_throttling_trigger_percentage = 60 |
| 111 | Server xx.xx.xx.12: writing_throttling_trigger_percentage = 60 |
| 112 | Server xx.xx.xx.12: writing_throttling_trigger_percentage = 60 |
| 113 | Server xx.xx.xx.12: writing_throttling_trigger_percentage = 60 |
| 114 | Server xx.xx.xx.12: writing_throttling_trigger_percentage = 60 |
| 115 | Server xx.xx.xx.12: writing_throttling_trigger_percentage = 60 |
| 116 | Server xx.xx.xx.13: writing_throttling_trigger_percentage = 60 |
| 117 | Server xx.xx.xx.13: writing_throttling_trigger_percentage = 60 |
| 118 | Server xx.xx.xx.13: writing_throttling_trigger_percentage = 60 |
| 119 | Server xx.xx.xx.13: writing_throttling_trigger_percentage = 60 |
| 120 | Server xx.xx.xx.13: writing_throttling_trigger_percentage = 60 |
| 121 | Server xx.xx.xx.11: _tx_share_memory_limit_percentage = 0 |
| 122 | Server xx.xx.xx.11: _tx_share_memory_limit_percentage = 0 |
| 123 | Server xx.xx.xx.11: _tx_share_memory_limit_percentage = 0 |
| 124 | Server xx.xx.xx.11: _tx_share_memory_limit_percentage = 0 |
| 125 | Server xx.xx.xx.11: _tx_share_memory_limit_percentage = 0 |
| 126 | Server xx.xx.xx.12: _tx_share_memory_limit_percentage = 0 |
| 127 | Server xx.xx.xx.12: _tx_share_memory_limit_percentage = 0 |
| 128 | Server xx.xx.xx.12: _tx_share_memory_limit_percentage = 0 |
| 129 | Server xx.xx.xx.12: _tx_share_memory_limit_percentage = 0 |
| 130 | Server xx.xx.xx.12: _tx_share_memory_limit_percentage = 0 |
| 131 | Server xx.xx.xx.13: _tx_share_memory_limit_percentage = 0 |
| 132 | Server xx.xx.xx.13: _tx_share_memory_limit_percentage = 0 |
| 133 | Server xx.xx.xx.13: _tx_share_memory_limit_percentage = 0 |
| 134 | Server xx.xx.xx.13: _tx_share_memory_limit_percentage = 0 |
| 135 | Server xx.xx.xx.13: _tx_share_memory_limit_percentage = 0 |
| 136 | Average meta tenant memory limit: 20.0GB |
| 137 | Checking for memory expansion bug scenario… |
| 138 | Tenant unit configuration: |
| 139 | Server xx.xx.xx.11: memory_size=50.0GB |
| 140 | Server xx.xx.xx.12: memory_size=50.0GB |
| 141 | Server xx.xx.xx.13: memory_size=50.0GB |
| 142 | Checking for throttling-related transaction timeouts… |
| 143 | Step 13: Checking column group schema issue (heterogenous row-to-column) |
| 144 | No SSTABLE_MAJOR_MERGE tasks found (view may not exist in this version) |
±-----±-----------------------------------------------------------------------------------------------------+
The suggest: Please review diagnostic files in ./obdiag_rca/obdiag_major_hold_20260612114322 for detailed analysis

帮忙查询下相关信息
select * from GV$OB_COMPACTION_PROGRESS where STATUS=“NODE_RUNNING”;
select * from GV$OB_COMPACTION_PROGRESS where tenant_id = xx and compaction_scn = xxx and STATUS != “FINISH”;
select * from GV$OB_COMPACTION_DIAGNOSE_INFO;

1 个赞




是否与此有关?大佬

修改了集群内部sql执行的超时时间,解决了

这么快确认原因了么 分享下方便后续社区同学遇到了进行处理
:smiley:

1 个赞

我是 使用 obdiag rca run --scene=major_hold 工具后,让AI去分析的其中的内容。关键点就是 集群内有个节点的schema版本落后。以下是AI总结的

  1. 表象:三个 zone 各剩 125~138 个 tablet 未合并,RS 侧 113 张表停在 INITIAL 状态,compaction_scn
    没有推进到本轮冻结点。
  2. 第一层:三台机器 DAG 调度器完全空闲(TOTAL_DAG_CNT=0)——合并任务根本没被调度出来;诊断信息显示 schedule medium
    failed; error_code=-5627,113 个 tablet 同症。
  3. 第二层:observer.log 证实 decide_medium_snapshot 持续返回 -5627(OB_SCHEMA_EAGAIN,“Schema try again”)——medium
    合并调度时拿不到冻结点(schema_version=1781200654160944)对应的表 schema。
  4. 第三层:__all_virtual_server_schema_info 显示 xx.xx.xx.12 的本地 schema 版本停在 1781191506384304(6-11
    23:25),落后于冻结点;received 版本却是最新的——即"收到了新版本通知,但本地刷新一直失败"。LS 1002 的 leader 在 .12
    上,它定不出 medium snapshot,三副本一起卡死。
  5. 病根:.12 的增量 schema 刷新每轮都在恰好 30 秒处超时(internal_sql_execute_timeout 默认值)。原因是:
    - 业务在对一批约 46 张表持续高频做 DDL,schema 版本不断前进;
    - .12 自 6-11 23:25 一次刷新失败后(时间与两条被取消的 create index DDL 吻合)开始积压,每轮要补的表越来越多;
    - 增量刷新是原子的——读 __all_table_history/__all_column_history 的内部 SQL(实际发往 .11 执行)30
    秒跑不完就整轮作废、下轮从头再来;
    - 积压随新 DDL 持续增长 > 30 秒预算永远不够 → 死循环。.11/.13 因一直小步跟刷而无事。