OB Enterprise Edition V3.2.4: a backup/restore troubleshooting case

【Versions】

OB EE 3.2.4.5, OCP 4.2.1

【Symptom】

Restoring a recent backup of one tenant in the cluster to a standalone server. The servers can reach each other over the network, the NFS share is accessible from both sides, and they share the same OCP. Yet the restore task fails with an error.

【Related information】

1. The OCP error log:
2024-06-18 15:35:23.819  INFO 8 --- [pool-manual-subtask-executor16,96c4adc21eaa4a48,1ad654c1f154] c.o.ocp.obsdk.connector.ConnectTemplate  : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2024-06-18 15:35:23.827  INFO 8 --- [pool-manual-subtask-executor16,96c4adc21eaa4a48,1ad654c1f154] c.o.ocp.obsdk.connector.ConnectTemplate  : [obsdk] sql: SELECT * FROM (SELECT  job_id, external_job_id, tenant_id AS restore_tenant_id, tenant_name AS restore_tenant_name, status, start_time, completion_time, restore_start_timestamp, restore_finish_timestamp, restore_current_timestamp, backup_dest, restore_option, info AS error_msg, partition_count, finish_partition_count, white_list, backup_cluster_name, backup_cluster_id, backup_tenant_name, backup_tenant_id FROM (     SELECT  job_id, external_job_id, tenant_id, tenant_name, status, start_time, completion_time, restore_start_timestamp, restore_finish_timestamp, restore_current_timestamp, NULL AS restore_option, info, partition_count, finish_partition_count, white_list, (SELECT value FROM __all_restore_info AS restore_info WHERE restore_info.job_id = arp.job_id AND restore_info.name = 'backup_dest') AS backup_dest,     backup_cluster_name, backup_cluster_id, backup_tenant_name, backup_tenant_id     FROM __all_restore_progress AS arp     UNION     SELECT  job_id, external_job_id, tenant_id, tenant_name, status, start_time, completion_time, restore_start_timestamp, restore_finish_timestamp, restore_current_timestamp, restore_option, info, partition_count, finish_partition_count, white_list, backup_dest, backup_cluster_name, backup_cluster_id, backup_tenant_name, backup_tenant_id     FROM __all_restore_history ))  where start_time >= '2024-06-18 15:32:52'

2024-06-18 15:35:23.833  INFO 8 --- [pool-manual-subtask-executor16,96c4adc21eaa4a48,1ad654c1f154] c.o.o.b.i.t.s.WaitRestoreSuccessTask     : check restore result: [RestoreHistoryTask(id=3, ocpJobId=-1, backupMode=PHYSICAL_BACKUP, triggerType=NORMAL, createTime=2024-06-18T07:32:52.466720Z, finishTime=2024-06-18T07:34:55.254903Z, backupClusterId=null, backupClusterName=obdemo, backupObClusterId=1682561918, backupTenantId=null, backupTenantName=oboracle, backupObTenantId=1005, restoreClusterId=null, restoreClusterName=null, restoreObClusterId=null, restoreTenantName=oboracle_res, restorePoolListStr=null, restoreUri=file:///obbackup, majorVersion=null, restoreEndTime=2024-06-18T06:29:29Z, restoreServer=null, status=RESTORE_FAIL, progress=null, errorMsg=CLOG : OB_CHECKSUM_ERROR(-4103) on "192.168.0.13:2882" with traceid YB420A1E410D-00061B245E63849E-0-0)]

2024-06-18 15:35:23.840 ERROR 8 --- [pool-manual-subtask-executor16,96c4adc21eaa4a48,1ad654c1f154] c.o.o.c.t.e.c.w.subtask.SubtaskExecutor  : Restore task failed.

com.oceanbase.ocp.core.exception.UnexpectedException: [OCP UnexpectedException]: status=500 INTERNAL_SERVER_ERROR, errorCode=BACKUP_RESTORE_TASK_FAILED, args=
	at com.oceanbase.ocp.backup.internal.task.schedule.WaitRestoreSuccessTask.checkRestoreTaskResult(WaitRestoreSuccessTask.java:124)
	at com.oceanbase.ocp.backup.internal.task.schedule.WaitRestoreSuccessTask.run(WaitRestoreSuccessTask.java:98)
	at com.oceanbase.ocp.core.task.engine.runner.JavaSubtaskRunner.execute(JavaSubtaskRunner.java:59)
	at com.oceanbase.ocp.core.task.engine.runner.JavaSubtaskRunner.doRun(JavaSubtaskRunner.java:31)
	at com.oceanbase.ocp.core.task.engine.runner.JavaSubtaskRunner.run(JavaSubtaskRunner.java:25)
	at com.oceanbase.ocp.core.task.engine.runner.RunnerFactory.doRun(RunnerFactory.java:76)
	at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.doRun(SubtaskExecutor.java:193)
	at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.redirectConsoleOutput(SubtaskExecutor.java:187)
	at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.lambda$submit$2(SubtaskExecutor.java:124)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Set state for subtask: 166860272, operation:EXECUTE, state: FAILED

OCP's logging keeps getting better. It hands us the clue for further investigation: errorMsg=CLOG : OB_CHECKSUM_ERROR(-4103) on "192.168.0.13:2882" with traceid YB420A1E410D-00061B245E63849E-0-0

2. Log in to the target cluster's OBServer and search its log directory for this TRACE_ID (e.g. `grep YB420A1E410D-00061B245E63849E-0-0 observer.log*`). This Enterprise Edition deployment does not have obdiag installed; with it, collecting logs by trace ID would be much easier.
[2024-06-18 15:34:51.905851] INFO  ob_storage_file.cpp:410 [269922][0][YB420A1E410D-00061B245E63849E-0-0] [lt=10] [dc=0] list files count(dir_path="/obbackup/obdemo/1682561918/incarnation_1/1005/clog/10/data/1105009185966893/1", 
file_names.count()=1, ret=0)
[2024-06-18 15:34:51.905870] INFO  [ARCHIVE] ob_archive_log_file_store.cpp:162 [269922][0][YB420A1E410D-00061B245E63849E-0-0] [lt=11] [dc=0] no valid index info, return max data file id(ret=0, ret="OB_SUCCESS", pg_key={tid:1105009185966
893, partition_id:1, part_cnt:0}, log_id=42240, file_id=1)
[2024-06-18 15:34:51.906926] INFO  [PHYSICAL_RESTORE_ARCHIVE] ob_archive_restore_engine.cpp:1291 [269922][0][YB420A1E410D-00061B245E63849E-0-0] [lt=9] [dc=0] locate_log cost(cost=3366, restore_pg_key={tid:1103909674339117, partition_id:
1, part_cnt:0}, archive_pg_key={tid:1105009185966893, partition_id:1, part_cnt:0}, start_log_id=42240)
[2024-06-18 15:34:51.919993] WDIAG [ARCHIVE] ob_archive_entry_iterator.cpp:548 [269922][0][YB420A1E410D-00061B245E63849E-0-0] [lt=12] [dc=0][errcode=-4103] archive block meta is not valid(ret=-4103, ret="OB_CHECKSUM_ERROR", meta={magic_
:0, total_len_:0, timestamp_:0, clog_epoch_id_:0, accum_checksum_:0, min_log_id_in_file_:0, min_log_ts_in_file_:0, max_log_id_:0, max_checkpoint_ts_:0, max_log_submit_ts_:0, input_bytes_:0, output_bytes_:0, data_checksum_:0, meta_checks
um_:0}, this={is_inited_:true, need_limit_bandwidth_:true, file_store_:0x7f7799150e80, pg_key_:{tid:1105009185966893, partition_id:1, part_cnt:0}, real_tenant_id_:1004, file_id_:1, cur_offset_:959895, cur_block_start_offset_:0, cur_bloc
k_end_offset_:0, buf_cur_:0x7f6ea48f7647, buf_end_:0x7f6ea49ed240, block_end_:0x7f6ea48f7647, rbuf_:{buf_:0x7f6ea480d0b0, buf_len_:1966480}, dd_buf_:{is_decrypted_:false, cur_:null, end_:null, rbuf_:{buf_:0x7f6ef5c0d0b0, buf_len_:0}}, o
rigin_buf_:{is_decrypted_:false, cur_:null, end_:null, rbuf_:{buf_:0x7f6ea480d118, buf_len_:129}}, timeout_:3000000, io_cost_:1056, io_count_:1, limit_bandwidth_cost_:6312, has_load_entire_file_:false})
[2024-06-18 15:34:51.920011] WDIAG [ARCHIVE] ob_archive_entry_iterator.cpp:506 [269922][0][YB420A1E410D-00061B245E63849E-0-0] [lt=14] [dc=0][errcode=-4103] extract_block_meta_ fail(ret=-4103, ret="OB_CHECKSUM_ERROR", pg_key_={tid:110500
9185966893, partition_id:1, part_cnt:0})
[2024-06-18 15:34:51.920022] WDIAG [ARCHIVE] ob_archive_entry_iterator.cpp:404 [269922][0][YB420A1E410D-00061B245E63849E-0-0] [lt=7] [dc=0][errcode=-4103] try_construct_log_buf_ fail(ret=-4103, ret="OB_CHECKSUM_ERROR", pg_key_={tid:1105
009185966893, partition_id:1, part_cnt:0})
[2024-06-18 15:34:51.920118] WDIAG [ARCHIVE] ob_archive_entry_iterator.cpp:225 [269922][0][YB420A1E410D-00061B245E63849E-0-0] [lt=81] [dc=0][errcode=-4103] get next entry fail(ret=-4103, ret="OB_CHECKSUM_ERROR", this={is_inited_:true, n
eed_limit_bandwidth_:true, file_store_:0x7f7799150e80, pg_key_:{tid:1105009185966893, partition_id:1, part_cnt:0}, real_tenant_id_:1004, file_id_:1, cur_offset_:959895, cur_block_start_offset_:0, cur_block_end_offset_:0, buf_cur_:0x7f6e
a48f7647, buf_end_:0x7f6ea49ed240, block_end_:0x7f6ea48f7647, rbuf_:{buf_:0x7f6ea480d0b0, buf_len_:1966480}, dd_buf_:{is_decrypted_:false, cur_:null, end_:null, rbuf_:{buf_:0x7f6ef5c0d0b0, buf_len_:0}}, origin_buf_:{is_decrypted_:false,
 cur_:null, end_:null, rbuf_:{buf_:0x7f6ea480d118, buf_len_:129}}, timeout_:3000000, io_cost_:1056, io_count_:1, limit_bandwidth_cost_:6312, has_load_entire_file_:false})
[2024-06-18 15:34:51.920160] WDIAG [PHYSICAL_RESTORE_ARCHIVE] ob_archive_restore_engine.cpp:793 [269922][0][YB420A1E410D-00061B245E63849E-0-0] [lt=11] [dc=0][errcode=-4103] failed to fetch_and_submit_archive_log_(ret=-4103, ret="OB_CHEC
KSUM_ERROR", task={is_inited_:true, is_expired_:false, has_located_log_:true, start_log_id_:42240, start_log_ts_:1718653814132823, end_snapshot_version_:1718692169000000, leader_takeover_ts_:1718696059472218, last_fetched_log_id_:42239,
 last_checkpoint_ts_:-1, last_fetched_log_submit_ts_:1718653814132822, total_piece_cnt_:1, start_piece_idx_:0, cur_piece_idx_:0, end_piece_idx_:0, start_file_id_in_cur_piece_:1, end_file_id_in_cur_piece_:1, cur_file_id_in_cur_piece_:1, 
cur_file_offset_:0, cur_file_store_:0x7f7799150e80, tenant_restore_meta_:0x7f662f7258a0, retry_cnt_:0, io_fail_cnt_:0, restore_pg_key_:{tid:1103909674339117, partition_id:1, part_cnt:0}, archive_pg_key_:{tid:1105009185966893, partition_
id:1, part_cnt:0}, fetch_log_result_:0})
[2024-06-18 15:34:51.920176] ERROR issue_dba_error (ob_log.cpp:2322) [269922][0][YB420A1E410D-00061B245E63849E-0-0] [lt=11] [dc=0][errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4103, file
="ob_archive_restore_engine.cpp", line_no=624, info="failed to handle_single_task_")

【Troubleshooting process】

  1. Checked whether the backup directory named in the log was abnormal. Nothing looked wrong at the OS level, access permissions were fine, and backups of the primary (V3) cluster were also working normally.
  2. Picked another tenant's backup at random and tried to restore it; that restore failed in the same way.
  3. Checked the data backup and log backup histories; both looked completely normal. Issued a cluster-level backup validation on the primary cluster, and it succeeded:
ALTER SYSTEM VALIDATE BACKUPSET 289;
  4. Opened an online support ticket, but the conversation was still stuck on confirming the symptoms one by one.
  5. Re-reading the observer log, it does appear to say a backup file is corrupted: `archive block meta is not valid`.
  6. One more piece of context: a few months earlier the customer had replaced the NAS storage behind the NFS path (the path itself stayed the same), and they did stop the cluster's log backup before making the switch.
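The history checks in step 3 can be sketched as sys-tenant queries. The view names below are as I recall them from the OB 3.x docs (verify them against your exact build); the `VALIDATE BACKUPSET` statement is the one used above:

```sql
-- Data backup history: one row per backup set, with status and timestamps
SELECT * FROM oceanbase.CDB_OB_BACKUP_SET_DETAILS ORDER BY START_TIME DESC;

-- Log archive status per tenant (STATUS should be DOING, checkpoint advancing)
SELECT * FROM oceanbase.CDB_OB_BACKUP_ARCHIVELOG_SUMMARY;

-- Cluster-level validation of one backup set; note that in this case it
-- reported success even though the archived clog later failed its checksum
ALTER SYSTEM VALIDATE BACKUPSET 289;
```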

【Final resolution】

Since the support ticket was progressing far too slowly, and pinning down the root cause would probably require analysis by OceanBase kernel developers (too costly in effort), we agreed with the customer to skip the root-cause hunt and go straight for a fix:
stop the cluster's log backup, trigger a cluster major compaction (merge), then re-enable the cluster's log backup.
After that, launch a new full data backup and run the tenant restore again. This time every step succeeded.
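The fix sequence above, sketched as sys-tenant SQL (standard OB 3.x backup commands per the docs; confirm the syntax on your build):

```sql
ALTER SYSTEM NOARCHIVELOG;     -- stop cluster log backup (ends the current round)
ALTER SYSTEM MAJOR FREEZE;     -- trigger a cluster major compaction (merge)
-- wait for the merge to finish, then:
ALTER SYSTEM ARCHIVELOG;       -- re-enable log backup; the new round writes to a fresh directory
ALTER SYSTEM BACKUP DATABASE;  -- launch a new full data backup
```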

【Lessons learned】

We never did find the root cause. My guess is that it is related to the log backup threads not having stopped completely when the storage behind the NFS path was swapped; another possibility is that the backup hit some unknown issue. A further detail: by default, log backup is not split into pieces (PIECE), so the directory of a log backup ROUND keeps being written to until log backup is stopped and started again.
From operating several V3 clusters, my impression is that some versions are unsatisfactory at backup cleanup, and I suspect that is very likely related to this no-piece default. So when re-enabling log backup, turn on piece switching at the same time:

ALTER SYSTEM SET backup_dest_option='log_archive_piece_switch_interval=7d';

Another suspicion is that backup validation may only check the data backup and not the log (clog) backup, which would explain why it did not catch the corrupted clog archive. To be truly confident that your backup chain works, run restore drills regularly; OCP's restore spot-check (恢复抽检) feature is handy for this.
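For a manual drill outside OCP, a tenant restore in OB 3.x looks roughly like this (syntax per the 3.x docs; the destination tenant name, timestamp, and resource pool below are placeholders, not the exact values of this case):

```sql
-- Restore tenant 'oboracle' from the backup at file:///obbackup into a new
-- tenant 'oboracle_drill', up to the given point in time. 'restore_pool'
-- must be a resource pool created beforehand on the destination cluster.
ALTER SYSTEM RESTORE oboracle_drill FROM oboracle
    AT 'file:///obbackup'
    UNTIL '2024-06-18 14:29:29'
    WITH 'backup_cluster_name=obdemo&backup_cluster_id=1682561918&pool_list=restore_pool';
```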

For more notes on OceanBase backup, see 《OB 企业版 3.2 备份恢复实践》 (qq.com).
