the last block is empty but its' prev block is not full

【 Environment 】Test environment
【 OB or other component 】 OB
【 Version 】observer (OceanBase_CE 4.2.4.0)
【 Problem description 】
[root@zhqdfs log]# /home/admin/oceanbase/bin/observer --version
/home/admin/oceanbase/bin/observer --version
observer (OceanBase_CE 4.2.4.0)

REVISION: 100000082024070810-556a8f594436d32a23ee92289717913cf503184b
BUILD_BRANCH: HEAD
BUILD_TIME: Jul 8 2024 11:07:07
BUILD_FLAGS: RelWithDebInfo
BUILD_INFO:

Copyright (c) 2011-present OceanBase Inc.

I deployed a single-node observer. After stopping the observer, I pulled the disk out and plugged it back in. The observer then failed to restart. Log:

[2024-09-30 16:04:11.472165] WDIAG [PALF] next (palf_iterator.h:160) [717690][observer][T1001][Y0-0000000000000000-0-0] [lt=8][errcode=-4070] PalfIterator next failed(ret=-4070, this={iterator_impl:{buf_:0x7f0e8ae05000, next_round_pread_size:2121728, curr_read_pos:0, curr_read_buf_start_pos:0, curr_read_buf_end_pos:0, log_storage_:{IteratorStorage:{start_lsn:{lsn:67085781}, end_lsn:{lsn:67085781}, read_buf:{buf_len_:2125824, buf_:0x7f0e8ae05000}, block_size:67104768, log_storage_:0x7f0e8e7c5bf0}, IteratorStorageType::"DiskIteratorStorage"}, curr_entry_is_raw_write:false, curr_entry_size:0, prev_entry_scn:{val:1727683393551800438, v:0}, curr_entry:{LogGroupEntryHeader:{magic:18258, version:1, group_size:2559, proposal_id:3, committed_lsn:{lsn:67085781}, max_scn:{val:1727683393551814940, v:0}, accumulated_checksum:3241546764, log_id:97520, flag:1}}, init_mode_version:0, accumulate_checksum:1334087156, curr_entry_is_padding:0, padding_entry_size:440, padding_entry_scn:{val:1727683393551800438, v:0}}})
[2024-09-30 16:04:11.472243] INFO  [PALF] check_is_the_last_entry (log_iterator_impl.h:907) [717690][observer][T1001][Y0-0000000000000000-0-0] [lt=7] there is no enough data, has iterate end(ret=-4008, this={buf_:0x7f0e8ae05000, next_round_pread_size:2121728, curr_read_pos:0, curr_read_buf_start_pos:0, curr_read_buf_end_pos:1, log_storage_:{IteratorStorage:{start_lsn:{lsn:67104767}, end_lsn:{lsn:67104768}, read_buf:{buf_len_:2125824, buf_:0x7f0e8ae05000}, block_size:67104768, log_storage_:0x7f0e8e7c5bf0}, IteratorStorageType::"DiskIteratorStorage"}, curr_entry_is_raw_write:false, curr_entry_size:0, prev_entry_scn:{val:1727683393551800438, v:0}, curr_entry:{LogGroupEntryHeader:{magic:18258, version:1, group_size:2559, proposal_id:3, committed_lsn:{lsn:67085781}, max_scn:{val:1727683393551814940, v:0}, accumulated_checksum:3241546764, log_id:97520, flag:1}}, init_mode_version:0, accumulate_checksum:1334087156, curr_entry_is_padding:0, padding_entry_size:440, padding_entry_scn:{val:1727683393551800438, v:0}}, header_size=56)
[2024-09-30 16:04:11.472254] INFO  [PALF] check_is_the_last_entry (log_iterator_impl.h:934) [717690][observer][T1001][Y0-0000000000000000-0-0] [lt=11] the entry is the last entry(ret=-4008, this={buf_:0x7f0e8ae05000, next_round_pread_size:2121728, curr_read_pos:0, curr_read_buf_start_pos:0, curr_read_buf_end_pos:1, log_storage_:{IteratorStorage:{start_lsn:{lsn:67104767}, end_lsn:{lsn:67104768}, read_buf:{buf_len_:2125824, buf_:0x7f0e8ae05000}, block_size:67104768, log_storage_:0x7f0e8e7c5bf0}, IteratorStorageType::"DiskIteratorStorage"}, curr_entry_is_raw_write:false, curr_entry_size:0, prev_entry_scn:{val:1727683393551800438, v:0}, curr_entry:{LogGroupEntryHeader:{magic:18258, version:1, group_size:2559, proposal_id:3, committed_lsn:{lsn:67085781}, max_scn:{val:1727683393551814940, v:0}, accumulated_checksum:3241546764, log_id:97520, flag:1}}, init_mode_version:0, accumulate_checksum:1334087156, curr_entry_is_padding:0, padding_entry_size:440, padding_entry_scn:{val:1727683393551800438, v:0}}, cost_ts=82)
[2024-09-30 16:04:11.472318] ERROR issue_dba_error (ob_log.cpp:1875) [717690][observer][T1001][Y0-0000000000000000-0-0] [lt=8][errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4016, file="log_storage.h", line_no=353, info="unexpected error, the last block is empty but its' prev block is not full")
[2024-09-30 16:04:11.472324] EDIAG [PALF] locate_log_tail_and_last_valid_entry_header_ (log_storage.h:353) [717690][observer][T1001][Y0-0000000000000000-0-0] [lt=6][errcode=-4016] unexpected error, the last block is empty but its' prev block is not full(ret=-4016, iterate_block_id=0, max_block_id=1) BACKTRACE:0x136a22fc 0x586cb75 0x59b4f48 0x59b4a5b 0x59b499e 0x59b47c6 0x7d81968 0x7d08cb6 0x7e49f5b 0x7cbde3b 0x7e36c17 0x801af94 0x115b7e89 0xa7ec9e5 0xef15533 0xb1e4d62 0x7c1fb7b 0x7f0f2198c577 0x5ae934e
[2024-09-30 16:04:11.472371] WDIAG [PALF] load (log_storage.h:273) [717690][observer][T1001][Y0-0000000000000000-0-0] [lt=46][errcode=-4016] locate_log_tail_and_last_valid_entry_header_ failed(ret=-4016, ret="OB_ERR_UNEXPECTED", this={log_tail:{lsn:67085781}, readable_log_tail:{lsn:67085781}, log_block_header:{magic:18754, version:1, min_lsn:{lsn:18446744073709551615}, min_scn:{val:18446744073709551615, v:3}, curr_block_id:0, palf_id:1}, block_mgr:{log_dir:"/home/admin/observer1/store/obdemo/clog/tenant_1001/1/log", dir_fd:512, min_block_id:0, max_block_id:2, curr_writable_block_id:274894685184}, logical_block_size_:67104768, curr_block_writable_size_:0, block_header_serialize_buf_:0x7f0e8e7c6838, flashback_version:0})
[2024-09-30 16:04:11.472380] INFO  [PALF] load (log_storage.h:278) [717690][observer][T1001][Y0-0000000000000000-0-0] [lt=9] LogStorage load finish(ret=-4016, ret="OB_ERR_UNEXPECTED", this={log_tail:{lsn:67085781}, readable_log_tail:{lsn:67085781}, log_block_header:{magic:18754, version:1, min_lsn:{lsn:18446744073709551615}, min_scn:{val:18446744073709551615, v:3}, curr_block_id:0, palf_id:1}, block_mgr:{log_dir:"/home/admin/observer1/store/obdemo/clog/tenant_1001/1/log", dir_fd:512, min_block_id:0, max_block_id:2, curr_writable_block_id:274894685184}, logical_block_size_:67104768, curr_block_writable_size_:0, block_header_serialize_buf_:0x7f0e8e7c6838, flashback_version:0}, min_block_id=0, max_block_id=1)

Could someone help me figure out why this error occurs?

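According to the log, the previous block was not full, yet a new block had already been created. Why was the new block created on its own, and why does it then fail validation?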
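
To make the error message concrete, the figures in the log can be checked by hand. The sketch below is only an illustration: it assumes PALF blocks are fixed-size and laid end to end starting at LSN 0, so that block N covers [N * block_size, (N + 1) * block_size). `block_range` is a hypothetical helper, not an OceanBase function. All numeric values are copied from the log above.

```python
# Values copied from the log lines above.
BLOCK_SIZE = 67104768     # block_size / logical_block_size_
LOG_TAIL_LSN = 67085781   # log_tail.lsn reported by "LogStorage load finish"
ITERATE_BLOCK_ID = 0      # iterate_block_id in the EDIAG line
MAX_BLOCK_ID = 1          # max_block_id in the EDIAG line


def block_range(block_id, block_size=BLOCK_SIZE):
    """Return the LSN range [start, end) of a block.

    Assumption (not confirmed by the thread): blocks are fixed-size
    and contiguous, starting at LSN 0.
    """
    return block_id * block_size, (block_id + 1) * block_size


start, end = block_range(ITERATE_BLOCK_ID)
unwritten = end - LOG_TAIL_LSN            # bytes still free in block 0
prev_block_full = LOG_TAIL_LSN == end     # True only if block 0 were full

print(f"block {ITERATE_BLOCK_ID} covers LSN [{start}, {end})")
print(f"log tail is at {LOG_TAIL_LSN}, leaving {unwritten} bytes unwritten")
print(f"block {MAX_BLOCK_ID} exists although the previous block is full: {prev_block_full}")
```

Under that assumption, block 0 still has 18987 bytes of room, yet block 1 already exists and is empty. That is the state the check at log_storage.h:353 rejects.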

How did you stop the observer? Was it deployed with obd? Please post the entire observer.log so we can analyze exactly what went wrong.

Please post the complete observer.log; it will help with further troubleshooting.

Is there a problem with the disk? Check /var/log/messages.

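It was deployed by hand. The environment has since been wiped and reinstalled. We will reproduce the problem and then post observer.log together with the confirmed deployment steps. One moment.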

OK. When a problem occurs, please don't wipe the environment right away, so that it can be analyzed first.

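This time it reports a different error; it looks like a bad block in the clog.

Steps to reproduce:

kill -9 $(pidof observer)
Pull out the NVMe disk that holds the clog
Plug the NVMe disk back in
Start the observer again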

observer.log (2.5 MB)

Yes, the clog is indeed corrupted.

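Right. I found the place where the recorded checksum is compared with the freshly computed one, and the two values differ. This clog is really fragile :sweat_smile: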

https://github.com/oceanbase/oceanbase/blob/v4.2.4_CE/src/logservice/palf/log_checksum.cpp
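
As a rough illustration of why a single damaged entry makes this comparison fail, here is a minimal sketch of an accumulated (chained) checksum. It is not OceanBase's implementation: CRC32 from Python's `zlib` stands in for whatever algorithm log_checksum.cpp actually uses, and `accumulate`, `write_entries` and `first_mismatch` are hypothetical names chosen for this example.

```python
import zlib


def accumulate(prev_accum, payload):
    """Fold one entry's payload into the running checksum (CRC32 stand-in)."""
    return zlib.crc32(payload, prev_accum) & 0xFFFFFFFF


def write_entries(payloads):
    """Return (payload, recorded_checksum) pairs, as a writer would persist them."""
    accum = 0
    entries = []
    for payload in payloads:
        accum = accumulate(accum, payload)
        entries.append((payload, accum))
    return entries


def first_mismatch(entries):
    """Replay the entries and return the index of the first one whose
    recorded checksum differs from the recomputed value, or None."""
    accum = 0
    for index, (payload, recorded) in enumerate(entries):
        accum = accumulate(accum, payload)
        if accum != recorded:
            return index
    return None


log = write_entries([b"entry-0", b"entry-1", b"entry-2"])
intact_result = first_mismatch(log)

# Simulate on-disk damage: the payload changes, the recorded checksum does not.
damaged = list(log)
damaged[1] = (b"entry-X", log[1][1])
damaged_result = first_mismatch(damaged)

print(intact_result, damaged_result)
```

Because each recorded value depends on everything written before it, replaying the log after a restart detects the damage at the first affected entry, and nothing after that point can be trusted without repair.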

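I'd like to ask about one scenario. Suppose we find that some of the disks OB uses on a physical machine are approaching end of life. Can we shut the node down, copy the data from the old disk to a new one, then remove the old disk and use the new one in its place?

The scenario we are simulating here is meant to verify exactly this kind of end-of-life disk replacement, but it ends up corrupting the clog.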

That won't work. You can migrate with OMS, or export the data offline and import it again.

Within the same cluster, it seems the only option is to rebuild the node.

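For disk replacement, it is advisable to set up RAID. With RAID 5, for example, replacing a single disk has no impact, and most setups already support hot swapping.

I checked with the engineer responsible for this area: OB does not support this operation. To replace a disk, use RAID or migrate the database.

Understood. So RAID has to be set up at initial deployment; without it, the only ways to recover later are a node rebuild or a migration. Many thanks.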