OCP reports an error when running a backup

[Version] v3.1.4
[Problem description] The OCP backup task fails at the "Start log backup if necessary" subtask.
[Symptoms and impact]
2023-04-28 10:05:04.849 ERROR 344 --- [pool-manual-subtask-executor6,6df97dcbba174ed6,f208dbfefbfb] c.o.o.b.i.o.PhysicalBackupObOpsService : SQL [alter system archivelog]; SQL state [0A000]; error code [1235]; message [(conn=84582) start log archive backup when not STOP is not supported]
2023-04-28 10:05:04.850 INFO 344 --- [pool-manual-subtask-executor6,6df97dcbba174ed6,f208dbfefbfb] c.a.o.c.m.t.model.SubtaskInstanceEntity : Set state for subtask: 5695833, current state: RUNNING, new state: FAILED
2023-04-28 10:05:04.852 WARN 344 --- [pool-manual-subtask-executor6,6df97dcbba174ed6,f208dbfefbfb] c.a.o.c.t.engine.runner.RunnerFactory : Execute task failed, subtask=SubtaskInstanceEntity{id=5695833, name=Start log backup if necessary, state=FAILED, operation=EXECUTE, className=com.oceanbase.ocp.backup.internal.task.schedule.StartLogBackupTask, seriesId=1, startTime=2023-04-28T10:05:04.755+08:00, endTime=2023-04-28T10:05:04.851+08:00}, failedMessage=Failed to start log backup. Error message: (conn=84582) start log archive backup when not STOP is not supported.
com.alipay.ocp.core.exception.UnexpectedException: [OCP UnexpectedException]: status=500 INTERNAL_SERVER_ERROR, errorCode=BACKUP_START_LOG_BACKUP_FAILED, args=(conn=84582) start log archive backup when not STOP is not supported
    at com.oceanbase.ocp.backup.internal.operation.PhysicalBackupObOpsService.startLogBackup(PhysicalBackupObOpsService.java:400) ~[ocp-backup-service-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.oceanbase.ocp.backup.internal.operation.PhysicalBackupObOpsService.startLogBackup(PhysicalBackupObOpsService.java:381) ~[ocp-backup-service-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.oceanbase.ocp.backup.internal.operation.PhysicalBackupTaskService.triggerLogBackup(PhysicalBackupTaskService.java:73) ~[ocp-backup-service-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.oceanbase.ocp.backup.internal.task.schedule.StartLogBackupTask.run(StartLogBackupTask.java:65) ~[ocp-backup-service-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.alipay.ocp.core.metadb.task.model.SubtaskInstanceEntity.run(SubtaskInstanceEntity.java:221) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.alipay.ocp.core.task.engine.runner.JavaTaskRunner.doExecute(JavaTaskRunner.java:26) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.alipay.ocp.core.task.engine.runner.JavaTaskRunner.run(JavaTaskRunner.java:20) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.alipay.ocp.core.task.engine.runner.RunnerFactory.doRun(RunnerFactory.java:113) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.alipay.ocp.core.task.engine.runner.RunnerFactory.redirectOutputIfNotSysSchedule(RunnerFactory.java:185) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.alipay.ocp.core.task.engine.runner.RunnerFactory.run(RunnerFactory.java:102) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
    at com.alipay.ocp.core.task.engine.coordinator.worker.subtask.ReadySubtaskWorker.lambda$submitTask$3(ReadySubtaskWorker.java:123) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_312]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_312]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_312]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_312]

The error indicates that log archiving is still running (its state is not STOP):
start log archive backup when not STOP is not supported

# Check the log backup status
select * from __all_backup_log_archive_status_v2;

# Force-stop all backup tasks
ALTER SYSTEM CANCEL ALL BACKUP FORCE;

# Check the backup path
SHOW PARAMETERS LIKE '%backup_dest%';

# Change the backup directory
ALTER SYSTEM SET backup_dest='file:///data/obbackup';

Then try starting the backup again.

OK, I'll give it a try. Thanks!

The error message still seems to be the same.

Final result after running SHOW PARAMETERS LIKE '%backup_dest%' (ours is file:///obbackup)

Please provide the rootserver.log file. It is in the log directory under the OceanBase installation directory.

This one?

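The log archive stream was interrupted, which is usually related to the NFS mount.
You can refer to this document to troubleshoot it.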

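I checked and did not find anything that had changed. :sob:
Also, the NFS we use is the Huawei Cloud NFS service. Our backups were working normally before and succeeded every day, and nothing has been changed on the database service or the NFS mount. The error shown above was taken from the most recent failed task, which is where we first noticed it today.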

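For the interrupted log archive stream: search the RS log for the keyword INTERRUPTED to find the error, then use the trace_id of that entry to search all the OceanBase logs for any abnormal messages, and post what you find.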
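The two-step search described above (first the keyword, then the trace_id) can be sketched in Python. This is a minimal sketch that only does plain substring matching; the helper names `grep_lines` and `grep_files` are hypothetical, and the trace_id has to be copied by hand from the matching INTERRUPTED line because no log line format is assumed here.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def grep_lines(lines: Iterable[str], needle: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line that contains `needle`."""
    for number, line in enumerate(lines, start=1):
        if needle in line:
            yield number, line.rstrip("\n")


def grep_files(paths: Iterable[Path], needle: str) -> Iterator[Tuple[str, int, str]]:
    """Apply grep_lines to several log files and report the file name too."""
    for path in paths:
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for number, line in grep_lines(handle, needle):
                yield str(path), number, line
```

A typical use would be to call `grep_files` on the rootserver.log files with the needle `"INTERRUPTED"`, copy the trace_id from one of the hits, and then call `grep_files` again on every log file in the log directory of each server with that trace_id as the needle.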

Is this the right way to find the RS log? Also, how do I find the trace_id?

The part in the red box is the trace_id. You need to search for it on the servers so that the results are more complete.

These are the log entries found by searching for the trace_id.

Enable trace output.
Run the following in the sys tenant:
alter system set enable_rich_error_msg=true;

Clean up the backup information and the backup strategy the same way as before.

Then run the backup commands manually from the backend (command line).

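If an error occurs, trace information will be printed. Please provide that trace information, together with rootserver.log and observer.log from the RS leader node.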

After running alter system set enable_rich_error_msg=true, searching by trace_id returns the same results.

Has this been resolved?

This issue is being handled offline. Any conclusion will be posted to this thread.

# Turn off log backup
ALTER SYSTEM NOARCHIVELOG;

# Force-stop all backup tasks
ALTER SYSTEM CANCEL ALL BACKUP FORCE;

# Check the status of the log backup task
SELECT * FROM CDB_OB_BACKUP_ARCHIVELOG\G

# Check the progress of the data backup
SELECT * FROM CDB_OB_BACKUP_PROGRESS;

# Check the backup path
SHOW PARAMETERS LIKE '%backup_dest%';

# Change the backup directory
ALTER SYSTEM SET backup_dest='file:///data/obbackup';

Then start the backup again.
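If this cleanup has to be repeated, the state-changing statements above can be issued from a script. The following is a minimal sketch, assuming a Python DB-API 2.0 cursor that is already connected to the sys tenant; the function names `reset_backup_statements` and `run_statements` are hypothetical, and the status checks against CDB_OB_BACKUP_ARCHIVELOG and CDB_OB_BACKUP_PROGRESS are deliberately left as manual steps between runs.

```python
from typing import List, Optional


def reset_backup_statements(new_backup_dest: Optional[str] = None) -> List[str]:
    """Return the cleanup statements quoted in this thread, in order."""
    statements = [
        "ALTER SYSTEM NOARCHIVELOG",             # turn off log backup
        "ALTER SYSTEM CANCEL ALL BACKUP FORCE",  # force-stop all backup tasks
    ]
    if new_backup_dest is not None:
        # The path is pasted into the SQL text, so reject embedded quotes.
        if "'" in new_backup_dest:
            raise ValueError("backup_dest must not contain a single quote")
        statements.append(f"ALTER SYSTEM SET backup_dest='{new_backup_dest}'")
    return statements


def run_statements(cursor, statements: List[str]) -> None:
    """Execute each statement on a DB-API cursor (assumed: sys tenant)."""
    for statement in statements:
        cursor.execute(statement)
```

For example, `run_statements(cursor, reset_backup_statements("file:///data/obbackup"))` would issue the three statements in the same order as the manual procedure. Leaving `new_backup_dest` unset keeps the current backup directory.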

The backup now succeeds after setting a new backup directory path.
Original backup path: file:///backup/nfsds. New backup path: file:///backup/abc.

You can also stop all backups, restart the cluster, and then run the backup again.