OCP standby tenant creation gets stuck partway through

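[Environment] Production
[OB or other component] OB, OCP
[Version] OB 4.2.5.1, OCP 4.3.5
[Problem description] Creating a standby tenant in OCP gets stuck partway through. How should this be handled?
[Steps to reproduce] Create a standby tenant. Primary/standby synchronization method: network-based. Backup set storage type: COS.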



[Logs]
[2026-01-21 21:50:09.780084] INFO New syslog file info: [address: "10.6.0.2:2882", observer version: OceanBase_CE 4.2.5.1, revision: 101000092024120918-6388bc0561faecba5f75a662c6e11c3dd0598de9, sysname: Linux, os release: 6.6.110-42.4.tl4.x86_64, machine: x86_64, tz GMT offset: 08:00]
[2026-01-21 21:50:09.779878] INFO [RS] do_work (ob_common_ls_service.cpp:140) [3018370][T1017_COMMONLSS][T1017][YB420A060002-000648BF81552B02-0-0] [lt=19] [COMMON_LS_SERVICE] finish one round(ret=0, ret="OB_SUCCESS", tmp_ret=0, tmp_ret="OB_SUCCESS", idle_time_us=1000000)
[2026-01-21 21:50:09.779957] INFO [RS] report_sys_ls_recovery_stat_in_trans_ (ob_recovery_ls_service.cpp:1074) [2411241][T1012_RecLSSer][T1012][YB420A060002-000648C3D19519AB-0-0] [lt=2] report sys ls recovery stat(ret=0, ret="OB_SUCCESS", sync_scn={val:1769003409492243000, v:0}, only_update_readable_scn=true, tenant_info={tenant_id:1012, tenant_role:"STANDBY", switchover_status:"NORMAL", switchover_epoch:0, sync_scn:1769003409487988000, replayable_scn:1769003409487988000, readable_scn:1769003408986021000, recovery_until_scn:4611686018427387903, log_mode:"NOARCHIVELOG", max_ls_id:1006}, ls_recovery_stat={tenant_id:1012, ls_id:{id:1}, sync_scn:{val:1769003409492243000, v:0}, readable_scn:{val:1769003408986021000, v:0}, create_scn:{val:1, v:0}, drop_scn:{val:1, v:0}, config_version_:{proposal_id:24, config_seq:2}}, comment="iterate no log")
[2026-01-21 21:50:09.780680] INFO [RS] report_sys_ls_recovery_stat_in_trans_ (ob_recovery_ls_service.cpp:1074) [1421889][T1004_RecLSSer][T1004][YB420A060002-000648C1BDA6E90C-0-0] [lt=2] report sys ls recovery stat(ret=0, ret="OB_SUCCESS", sync_scn={val:1769003409590455000, v:0}, only_update_readable_scn=false, tenant_info={tenant_id:1004, tenant_role:"STANDBY", switchover_status:"NORMAL", switchover_epoch:0, sync_scn:1769003409371461000, replayable_scn:1769003409371461000, readable_scn:1769003409213363000, recovery_until_scn:4611686018427387903, log_mode:"NOARCHIVELOG", max_ls_id:1003}, ls_recovery_stat={tenant_id:1004, ls_id:{id:1}, sync_scn:{val:1769003409590455000, v:0}, readable_scn:{val:1769003409288119000, v:0}, create_scn:{val:1, v:0}, drop_scn:{val:1, v:0}, config_version_:{proposal_id:28, config_seq:2}}, comment="iterate log end")
[2026-01-21 21:50:09.781130] INFO [RS] do_work (ob_recovery_ls_service.cpp:169) [2411241][T1012_RecLSSer][T1012][YB420A060002-000648C3D19519AB-0-0] [lt=9] REACH SYSLOG RATE LIMIT [bandwidth]
[2026-01-21 21:50:09.782298] INFO [RS] report_sys_ls_recovery_stat_in_trans_ (ob_recovery_ls_service.cpp:1074) [3018191][T1018_RecLSSer][T1018][YB420A060002-000648BF81A926DB-0-0] [lt=1] report sys ls recovery stat(ret=0, ret="OB_SUCCESS", sync_scn={val:1769003409690864000, v:0}, only_update_readable_scn=false, tenant_info={tenant_id:1018, tenant_role:"STANDBY", switchover_status:"NORMAL", switchover_epoch:0, sync_scn:1769003409388568000, replayable_scn:1769003409388568000, readable_scn:1769003409165387000, recovery_until_scn:4611686018427387903, log_mode:"NOARCHIVELOG", max_ls_id:1003}, ls_recovery_stat={tenant_id:1018, ls_id:{id:1}, sync_scn:{val:1769003409690864000, v:0}, readable_scn:{val:1769003409266802000, v:0}, create_scn:{val:1, v:0}, drop_scn:{val:1, v:0}, config_version_:{proposal_id:23, config_seq:2}}, comment="iterate log end")
[2026-01-21 21:50:09.789107] INFO [RS] do_work (ob_common_ls_service.cpp:140) [1239828][T1_COMMONLSSe][T1][YB420A060002-000648B951F7A507-0-0] [lt=14] REACH SYSLOG RATE LIMIT [bandwidth]
[2026-01-21 21:50:09.791233] WDIAG [RS] check_can_do_recovery_ (ob_tenant_thread_helper.cpp:321) [3372452][T1022_RecLSSer][T1022][YB420A060002-000648C2C754E90F-0-0] [lt=8][errcode=-4018] fail to get restore job(ret=-4018, tenant_id=1022)
[2026-01-21 21:50:09.791238] WDIAG [RS] do_work (ob_recovery_ls_service.cpp:156) [3372452][T1022_RecLSSer][T1022][YB420A060002-000648C2C754E90F-0-0] [lt=4][errcode=-4018] can not do recovery now(ret=-4018, ret="OB_ENTRY_NOT_EXIST", tenant_id_=1022)
[2026-01-21 21:50:09.827779] WDIAG [RS] check_can_do_recovery_ (ob_tenant_thread_helper.cpp:321) [3372453][T1022_RecLSSer][T1022][YB420A060002-000648C2C745BAB4-0-0] [lt=9][errcode=-4018] fail to get restore job(ret=-4018, tenant_id=1022)
[2026-01-21 21:50:09.827783] WDIAG [RS] do_work (ob_recovery_ls_service.cpp:156) [3372453][T1022_RecLSSer][T1022][YB420A060002-000648C2C745BAB4-0-0] [lt=3][errcode=-4018] can not do recovery now(ret=-4018, ret="OB_ENTRY_NOT_EXIST", tenant_id_=1022)
[2026-01-21 21:50:09.851481] INFO [RS] manage_heartbeat_ (ob_heartbeat_service.cpp:408) [1239904][T1_HBService][T1][YB420A060002-000648B951B62CA0-0-0] [lt=15] manage_heartbeat_ has finished one round(ret=0, ret="OB_SUCCESS")
[2026-01-21 21:50:09.852805] INFO [RS] do_work (ob_standby_schema_refresh_trigger.cpp:70) [2857398][T1016_StandbySc][T1016][YB420A060002-000648BE27055ACE-0-0] [lt=13] REACH SYSLOG RATE LIMIT [bandwidth]
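
For anyone reading the SCN values in these log lines: they look like nanosecond Unix timestamps, since the sync_scn values decode to the same second as the log lines' own timestamps. A minimal Python sketch under that assumption, using the UTC+8 offset from the log header; the helper names are made up for illustration and are not part of any OceanBase tool:

```python
from datetime import datetime, timedelta, timezone

# Assumption: an SCN is the number of nanoseconds since the Unix epoch.
# The UTC+8 offset is taken from "tz GMT offset: 08:00" in the log header.
TZ = timezone(timedelta(hours=8))


def scn_to_datetime(scn: int) -> datetime:
    """Convert an SCN (assumed nanoseconds since epoch) to a local datetime."""
    seconds, nanos = divmod(scn, 1_000_000_000)
    return datetime.fromtimestamp(seconds, TZ) + timedelta(microseconds=nanos // 1000)


def scn_lag_ms(ahead: int, behind: int) -> float:
    """Return the gap between two SCNs in milliseconds."""
    return (ahead - behind) / 1_000_000


# Values copied from the tenant 1012 line in the log.
sync_scn = 1769003409492243000
readable_scn = 1769003408986021000

print(scn_to_datetime(sync_scn).isoformat())
print(scn_lag_ms(sync_scn, readable_scn))
```

Read this way, the gap between sync_scn and readable_scn for tenant 1012 is about half a second, which suggests that tenant's SYS log stream was keeping up at the time of the excerpt.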

Please provide the complete OCP log files.

log_task_2203.zip (215.1 KB)

I haven't run into this one before.

Could it be a resource shortage?

Resources are sufficient.

Have you tried rolling the task back and retrying?

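Yes, I rolled back and retried. The first attempt stopped at a little over 100 GB. This is the second attempt, and it stopped at a little over 600 GB.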

I retried once more, and this time it failed at a little over 200 GB.
log_task_3015.zip (84.5 KB)

As I recall, a resource shortage is reported right at creation time, and you cannot proceed to the next step.

Following this thread to learn.

Run this query and see what it returns:

SELECT *
FROM (
  SELECT job_id, backup_cluster_name, backup_tenant_name, backup_tenant_id, backup_dest,
         restore_tenant_name, restore_tenant_id, restore_option,
         restore_scn_display AS restore_finish_timestamp,
         IF(recover_scn_display != '', recover_scn_display, NULL) AS restore_current_timestamp,
         start_timestamp AS start_time,
         finish_timestamp AS completion_time,
         restore_progress AS data_restore_progress,
         recover_progress AS log_restore_progress,
         status, comment AS error_msg, description
  FROM (
    SELECT job_id, tenant_id, backup_cluster_name, backup_tenant_name, backup_tenant_id, backup_dest,
           restore_tenant_name, restore_tenant_id, restore_option, restore_scn_display, recover_scn_display,
           start_timestamp, NULL AS finish_timestamp, status, restore_progress, recover_progress,
           NULL AS comment, description
    FROM CDB_OB_RESTORE_PROGRESS
    UNION
    SELECT job_id, tenant_id, backup_cluster_name, backup_tenant_name, backup_tenant_id, backup_dest,
           restore_tenant_name, restore_tenant_id, restore_option, restore_scn_display, NULL AS recover_scn_display,
           start_timestamp, finish_timestamp, status, NULL AS restore_progress, NULL AS recover_progress,
           comment, description
    FROM CDB_OB_RESTORE_HISTORY
  )
  RIGHT JOIN (
    SELECT job_id AS _job_id, max(tenant_id) AS _tenant_id
    FROM (
      SELECT job_id, tenant_id FROM CDB_OB_RESTORE_PROGRESS
      UNION
      SELECT job_id, tenant_id FROM CDB_OB_RESTORE_HISTORY
    )
    GROUP BY _job_id
  ) AS t ON job_id = t._job_id AND tenant_id = t._tenant_id
  WHERE description = 1769085771556
  ORDER BY start_time DESC
)
WHERE restore_tenant_name = 'payment';

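job_id: 13
backup_cluster_name: szsctek
backup_tenant_name: payment
backup_tenant_id: 1016
backup_dest: cos://xx,cos://xx
restore_tenant_name: payment
restore_tenant_id: 1024
restore_option: pool_list=pool_payment_zone1_hor,pool_payment_zone2_ece,pool_payment_zone3_rbd&primary_zone=RANDOM&locality=FULL@zone1,FULL@zone2,FULL@zone3
restore_finish_timestamp: 2026-01-22 20:41:34.222231
restore_current_timestamp: (empty)
start_time: 2026-01-22 20:42:51.590809
completion_time: 2026-01-22 21:12:09.168821
data_restore_progress: (empty)
log_restore_progress: (empty)
status: FAIL
error_msg: (SERVER)ls_id: 1006, addr: 10.6.0.2:2882, module: RESTORE_DATA, result: -4016(Internal error), trace_id: YB420A060002-000648F938C2F10F-0-0;
description: 1769085771556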

trace_id: YB420A060002-000648F938C2F10F-0-0
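
The error_msg value in the query result packs several fields into one string, and the trace_id is the part needed for the next step. A minimal Python sketch of splitting it apart, assuming the `key: value` layout seen in this one row (the general format is not confirmed here, and an error text containing ", " would break the split); `parse_restore_comment` is a hypothetical helper:

```python
import re


def parse_restore_comment(comment: str) -> dict:
    """Split a restore-job comment string into its labelled fields."""
    fields = {}
    # A leading tag such as "(SERVER)" is kept under the key "source".
    tag = re.match(r"\((\w+)\)", comment)
    if tag:
        fields["source"] = tag.group(1)
        comment = comment[tag.end():]
    # Assumption: fields are separated by ", " and end with an optional ";".
    for part in comment.strip().rstrip(";").split(", "):
        key, sep, value = part.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# The error_msg string copied from the failed restore job in this thread.
row_comment = (
    "(SERVER)ls_id: 1006, addr: 10.6.0.2:2882, module: RESTORE_DATA, "
    "result: -4016(Internal error), trace_id: YB420A060002-000648F938C2F10F-0-0;"
)
info = parse_restore_comment(row_comment)
print(info["module"], info["result"], info["trace_id"])
```

For this row the fields say that log stream 1006 on 10.6.0.2:2882 failed in the RESTORE_DATA module with result -4016.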

Filter observer.log and rootservice.log by this trace_id and post the matching lines.

If the logs have already been recycled, reproduce the issue and collect them again.

You can use obdiag to collect logs filtered by trace_id.

https://www.oceanbase.com/docs/common-obdiag-cn-1000000005021691
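
If obdiag is not at hand, the same filtering can be approximated with a short script. A minimal Python sketch, assuming the logs are uncompressed text files and that every relevant line contains the trace_id verbatim; the function name and the log directory passed to it are placeholders, not OceanBase defaults:

```python
from pathlib import Path


def filter_by_trace_id(log_dir, trace_id,
                       patterns=("observer.log*", "rootservice.log*")):
    """Yield (file name, line) for every log line that contains trace_id."""
    for pattern in patterns:
        # Rotated files (for example observer.log.<timestamp>) match the glob too.
        for path in sorted(Path(log_dir).glob(pattern)):
            if not path.is_file():
                continue
            # errors="replace" keeps the scan going past undecodable bytes.
            with path.open("r", encoding="utf-8", errors="replace") as fh:
                for line in fh:
                    if trace_id in line:
                        yield path.name, line.rstrip("\n")


# Example call (the directory is a placeholder for the observer log path):
# for name, line in filter_by_trace_id("/path/to/observer/log",
#                                      "YB420A060002-000648F938C2F10F-0-0"):
#     print(f"{name}: {line}")
```

This only scans one node's log directory. The restore error in this thread names 10.6.0.2:2882, so that node is the natural place to start.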

observer.log.zip (377.5 KB)
These are the logs collected from a reproduction. This is all there is; the other files are empty.

Check whether the log level is WDIAG:

show parameters like '%syslog_level%';


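At the INFO level some important information is not recorded, so the WDIAG level is needed. Production environments are generally set to WDIAG, which logs in more detail and makes troubleshooting easier. With WDIAG the logs grow larger and are generated faster, so adjust the number of retained log files and keep an eye on disk space.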

After changing the level, reproduce the issue again and post the logs filtered by trace_id.

observer.log.zip (16.3 KB)
There don't seem to be many log lines. Can you find the cause from them?