After an OB cluster restart, a user tenant's compaction is stuck with error code 5627. Has anyone run into this?

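[Environment] Test environment
[OB or other component]
[Version] 4.3.0.1
[Problem description] After the (single-node) cluster was restarted, connecting to the MySQL-mode database returns "Server is initializing", and this has persisted for more than 36 hours. diagnose_info reports "schedule medium failed".
[Reproduction path] Error code: 5627
[Attachments and logs]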
[2024-09-19 16:47:29.974526] WDIAG [STORAGE.COMPACTION] decide_medium_snapshot (ob_medium_compaction_func.cpp:565) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=6][errcode=-5627] failed to choose medium scn for major(ret=-5627, this={ls_id:{id:1001}, tablet_id:{id:237664}})
[2024-09-19 16:47:29.979563] INFO [STORAGE] try_sync_reserved_snapshot (ob_ls_reserved_snapshot_mgr.cpp:213) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=14] submit reserved snapshot log success(ls_id={id:1001}, new_reserved_snapshot=1726733836881976507)
[2024-09-19 16:47:29.979616] INFO [STORAGE] ADD_SUSPECT_INFO (ob_compaction_diagnose.h:830) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=28] success to add suspect info(ret=0, info={tenant_id:1004, merge_type:“MEDIUM_MERGE”, ls_id:{id:1001}, tablet_id:{id:237664}, add_time:1726735649979595, hash:-8620949337907882730}, info_type=3, info_type_str=“schedule medium failed”, diagnose_type=3)
[2024-09-19 16:47:29.979652] WDIAG [STORAGE.COMPACTION] schedule_tablet_medium (ob_tenant_tablet_scheduler.cpp:1716) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=29][errcode=0] failed to schedule next medium(tmp_ret=-5627, ls_id={id:1001}, tablet_id={id:237664})
[2024-09-19 16:47:29.979693] WDIAG ~ObOccamTimeGuard (ob_occam_time_guard.h:269) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=14][errcode=0] cost too much time:(null):(null), (*this=|threshold=30.00s|start at 17:43:47.873|0=106.50ms|1=74.42ms|2=32.02ms|4=83022.68s|total=83022.89s)
[2024-09-19 16:47:29.980852] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=44][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:30.980953] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=16][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:31.981062] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=21][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:32.981218] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=71][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:33.981328] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=29][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:34.981488] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=24][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:35.981679] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=101][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:36.981781] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=24][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:37.981889] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=21][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:38.982032] WDIAG [SHARE.SCHEMA] get_tenant_schema_guard (ob_multi_version_schema_service.cpp:1185) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=49][errcode=-5627] specified schema version larger than latest schema version, need retry(ret=-5627, ret=“OB_SCHEMA_EAGAIN”, tenant_schema_version=1715939734621976, tenant_latest_local_version=1715939734621975, sys_schema_version=-1, sys_latest_local_version=1726526957234032)
[2024-09-19 16:47:39.982185] WDIAG [SHARE.SCHEMA] retry_get_schema_guard (ob_multi_version_schema_service.cpp:1465) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=22][errcode=-5627] fail to get tenant schema guard(ret=-5627, table_id=640697, schema_version=1715939734621976)
[2024-09-19 16:47:39.982230] WDIAG [SHARE.SCHEMA] retry_get_schema_guard (ob_multi_version_schema_service.cpp:1470) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=44][errcode=-5627] fail to get schema guard(ret=-5627, schema_version=1715939734621976)
[2024-09-19 16:47:39.982241] WDIAG [STORAGE.COMPACTION] get_table_schema_to_merge (ob_medium_compaction_func.cpp:915) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=8][errcode=-5627] Fail to get schema(ret=-5627, tenant_id=1004, schema_version=1715939734621976, table_id=640697)
[2024-09-19 16:47:39.982253] WDIAG [STORAGE.COMPACTION] find_valid_freeze_info (ob_medium_compaction_func.cpp:163) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=10][errcode=-5627] failed to get table schema(ret=-5627, medium_info={compaction_type:“invalid_type”, medium_merge_reason:“NONE”, medium_snapshot:0, last_medium_snapshot:0, tenant_id:0, cluster_id:0, medium_compat_version:4, data_version:17180065793, is_schema_changed:0, storage_schema:{this:0x14965f854c10, storage_schema_version:0, version:0, is_use_bloomfilter:0, column_info_simplified:0, compat_mode:0, table_type:16, index_type:0, row_store_type:1, schema_version:-1, column_cnt:0, store_column_cnt:0, tablet_size:134217728, pctfree:10, block_size:0, progressive_merge_round:0, master_key_id:4294967295, compressor_type:1, encryption:"", encrypt_key:"", rowkey_cnt:0, rowkey_array:[cnt:0], column_cnt:0, column_array:[cnt:0], skip_index_cnt:0, skip_idx_attr_array:[cnt:0], column_group_cnt:0, column_group_array:[cnt:0], has_all_column_group:false}, contain_parallel_range:0, parallel_merge_info:{list_size:0, compat:1, }})
[2024-09-19 16:47:39.982306] WDIAG [STORAGE.COMPACTION] choose_major_snapshot (ob_medium_compaction_func.cpp:188) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=51][errcode=-5627] failed to find valid freeze info(ret=-5627, ret=“OB_SCHEMA_EAGAIN”)
[2024-09-19 16:47:39.982316] WDIAG [STORAGE.COMPACTION] decide_medium_snapshot (ob_medium_compaction_func.cpp:565) [925861][T1004_MediumLoo][T1004][YB420A8F51B1-0006226030622162-0-0] [lt=9][errcode=-5627] failed to choose medium scn for major(ret=-5627, this={ls_id:{id:1001}, tablet_id:{id:1152921504606886770}})
@论坛小助手
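
The repeated OB_SCHEMA_EAGAIN lines above all carry the same pair of schema versions, so it can help to pull them out and compare them. The following is a minimal sketch, not part of any OceanBase tooling: `schema_gaps` is a hypothetical helper, and the sample line is a shortened copy of the `get_tenant_schema_guard` messages quoted above.

```python
import re

# Shortened copy of one get_tenant_schema_guard line from the log above.
LOG_LINES = [
    "... need retry(ret=-5627, tenant_schema_version=1715939734621976, "
    "tenant_latest_local_version=1715939734621975, sys_schema_version=-1, ...)",
]

# The two fields appear next to each other in the original log lines.
PATTERN = re.compile(
    r"tenant_schema_version=(\d+), tenant_latest_local_version=(\d+)"
)


def schema_gaps(lines):
    """Return the distinct (requested, local, gap) triples found in the lines."""
    gaps = set()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            requested = int(match.group(1))
            local = int(match.group(2))
            gaps.add((requested, local, requested - local))
    return sorted(gaps)


for requested, local, gap in schema_gaps(LOG_LINES):
    print(f"requested={requested} local={local} gap={gap}")
```

For the lines quoted in this post, the requested version is exactly one higher than the latest locally refreshed version, and the pair never changes across the retries.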


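Summary of the user's question: After an OceanBase cluster restart, the user tenant hits error code 5627 during compaction, and the user is looking for a fix.

Based on your description, error code 5627 does not appear directly in the available documentation excerpts. Since the error relates to cluster operation or tenant management, it may involve internal system limits or configuration issues. To pinpoint the problem and offer an effective solution, we recommend using the obdiag tool to collect the cluster's current state.

Log collection scenario: compaction issues
Log collection command: obdiag gather scene run --scene=observer.compaction

This command collects the logs and configuration related to cluster compaction, which are essential for understanding why compaction stalls after a restart.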

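Please provide the following information so that we can help you better:

  1. Was a compaction or any other large-scale data operation in progress before the cluster was restarted?
  2. Were any configuration changes made to the cluster, such as adjustments to the compaction strategy or parameters?
  3. What was the cluster load when the problem occurred? Were heavy writes or reads in progress?
  4. Can you provide the specific context in which error code 5627 occurred, such as the command issued and the execution environment?
  5. Have you tried any troubleshooting steps? If so, please share what you tried and the results.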

For more information, see:

  1. 0001 ~ 3999
  2. 4000 ~ 4499

Attached is the help link for the agile diagnostic tool obdiag.


I followed the document "OceanBase Diagnosis and Tuning (14): Troubleshooting Stuck Compaction in OceanBase with obdiag". The log above is the 20 lines before and after the keyword decide_medium_snapshot.
Identify the compaction record:
obclient [oceanbase]> select * from CDB_OB_MAJOR_COMPACTION where status='COMPACTING';
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+------------+----------+--------------+------+
| TENANT_ID | FROZEN_SCN          | FROZEN_TIME                | GLOBAL_BROADCAST_SCN | LAST_SCN            | LAST_FINISH_TIME           | START_TIME                 | STATUS     | IS_ERROR | IS_SUSPENDED | INFO |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+------------+----------+--------------+------+
|      1004 | 1726336804517991694 | 2024-09-15 02:00:04.517992 |  1726336804517991694 | 1719943200208502037 | 2024-07-03 02:06:48.996818 | 2024-09-15 02:00:04.833801 | COMPACTING | NO       | NO           |      |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+------------+----------+--------------+------+
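
As a cross-check on the row above, the SCN columns can be read as timestamps. The sketch below assumes that an SCN is a count of nanoseconds since the Unix epoch and that the cluster displays times in UTC+8. Both assumptions are inferred from the FROZEN_SCN and FROZEN_TIME values in this row, and `scn_to_datetime` is a hypothetical helper, not an OceanBase API.

```python
from datetime import datetime, timedelta, timezone


def scn_to_datetime(scn_ns, tz_hours=8):
    """Convert an SCN (assumed: nanoseconds since the Unix epoch) to a datetime.

    tz_hours is the display offset from UTC (assumed to be +8 here).
    """
    tz = timezone(timedelta(hours=tz_hours))
    seconds, nanos = divmod(scn_ns, 1_000_000_000)
    # Round the sub-second part to microseconds, the precision shown in the view.
    return datetime.fromtimestamp(seconds, tz) + timedelta(
        microseconds=round(nanos / 1000)
    )


FROZEN_SCN = 1726336804517991694  # from the row above
LAST_SCN = 1719943200208502037    # from the row above

print(scn_to_datetime(FROZEN_SCN).strftime("%Y-%m-%d %H:%M:%S.%f"))
print(scn_to_datetime(LAST_SCN).strftime("%Y-%m-%d %H:%M:%S.%f"))
```

Under these assumptions the first value reproduces the FROZEN_TIME column exactly, and LAST_SCN falls on 2024-07-03, which suggests that the last major compaction snapshot to complete for tenant 1004 dates from early July rather than from shortly before the restart.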


As the assistant suggested, please collect the diagnostics with obdiag first so that we can continue troubleshooting.


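Based on the information you provided and the guidance in the document "OceanBase Diagnosis and Tuning (14): Troubleshooting Stuck Compaction in OceanBase with obdiag", we can analyze the problem further. You have confirmed that there is a compaction record still in progress, which shows that the compaction has not finished and that its status remains COMPACTING.

From the log, error code 5627 is associated with decide_medium_snapshot; specifically, the problem occurs while the medium snapshot is being chosen. The message "failed to choose medium scn for major" means that choosing the medium SCN for the major compaction failed.

You can try the following steps to investigate further:

  1. Check the cluster status: make sure that all nodes in the cluster are running normally and that nothing else is interfering with the compaction. You can use obd cluster show or obd cluster check to check the overall health of the cluster.

  2. Review the compaction strategy: confirm that the current compaction settings are appropriate. You can query the __all_sys_parameter table through obclient to view the compaction-related parameters, such as minor_merge_time_interval, major_merge_time_interval, and so on.

  3. Adjust the compaction parameters: if the settings turn out to be unreasonable, adjust them and then trigger the compaction again. After changing parameters, it is best to wait for a while and observe the effect.

  4. Collect more information with obdiag: as mentioned earlier, run obdiag gather scene run --scene=observer.compaction to collect detailed compaction-related information, which helps to locate the problem.

  5. Contact technical support: if none of the steps above resolves the problem, contact the OceanBase technical support team and provide detailed logs and the cluster configuration so that they can analyze the cause in more depth and offer a solution.

We hope these suggestions help. If you have other questions or need further assistance, please let us know.

(The assistant's reply ends here. If it did not solve your problem, please continue asking and wait for replies from other community members. Thank you!)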


Command: obd obdiag gather scene run myoceanbase --scene=observer.compaction
ob_log_local_20240924111157_20240924114257.zip (43.9 MB)
These are the collected logs. Could the experts please take a look? Thanks!


Query the internal table __all_tablet_meta_table for all tablets whose snapshot_version has not reached the target version.
Try running the SQL below:
select * from __all_tablet_meta_table where tenant_id=1004 and snapshot_version < 1726336804517991694;



The query returns 0 rows for this tenant.



Using the commands in the screenshot, check the logs on the observer that hosts the RS leader.

__all_tablet_meta_table is stored under the tenant's meta tenant.
You first need to run: alter system change tenant META$1004;

After switching, the query reports that this column does not exist.

After changing the compared column to compaction_scn, the query returns 9688 rows.
This matches the server-level compaction progress I queried earlier.

Because this is a test setup, there is only one zone, zone1.
I did not find the keyword "replica not merged" in the rootservice.log* files, but "T1004_MergeSche" does appear.
20240924001.log (148.7 KB)

select * from __all_virtual_compaction_diagnose_info where tenant_id=1004 limit 20;
Check exactly which errors are being reported.

select max(schema_version) from __all_virtual_ddl_operation where tenant_Id = 1004;
select * from oceanbase.__all_virtual_server_schema_info where tenant_id=1004;
Please run these SQL statements and share the results.

The schema refresh is probably lagging. Check the DDL operations to see which DDL corresponds to that schema version:
select * from __all_virtual_ddl_operation where tenant_Id = xxx order by schema_version desc limit 10;

Is this it?