observer cluster initialization fails

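【Environment】Test environment
1 central node running obd and obproxy, 16 GB memory; 3 observer virtual machines, 32 GB memory each, with agent also deployed on the observer hosts. /data: 200G, 76% used. /redo: 100G, 91% used. IO: 200 MB/s.
observer Community Edition 4.2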

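【OB or other components】
【Version】
【Problem description】Describe the problem clearly and precisely
Creating a tenant is very slow. After stopping the cluster with obd cluster stop, obd cluster start hangs at the "wait for observer init" stage.
The observer log shows: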
header 6: address=0x2b4303513b40
[2023-09-19 15:24:31.177527] ERROR detect_data_disk_io_failure_ (ob_failure_detector.cpp:385) [3005][T1_Occam][T1][Y0-0000000000000000-0-0] [lt=0][errcode=-4392] disk is hung(msg=“data disk may be hung, add failure event”, data_disk_io_hang_event={type:PROCESS HANG, module:STORAGE, info:data disk io hang event, level:FATAL}, data_disk_error_start_ts=1695108271037974)
[2023-09-19 15:24:31.791380] ERROR inner_aio (ob_io_manager.cpp:770) [2883][SvrStartupHandl][T1][Y0-0000000000000000-0-0] [lt=0][errcode=-4392] disk is hung(msg=“data disk has fatal error”)
[2023-09-19 15:24:31.909988] ERROR inner_aio (ob_io_manager.cpp:770) [2882][SvrStartupHandl][T1][Y0-0000000000000000-0-0] [lt=0][errcode=-4392] disk is hung(msg=“data disk has fatal error”)
[2023-09-19 15:24:31.926457] ERROR inner_aio (ob_io_manager.cpp:770) [2884][SvrStartupHandl][T1][Y0-0000000000000000-0-0] [lt=0][errcode=-4392] disk is hung(msg=“data disk has fatal error”)
[2023-09-19 15:24:33.329114] ERROR inner_aio (ob_io_manager.cpp:770) [2762][observer][T1][Y0-0000000000000000-0-0] [lt=0][errcode=-4392] disk is hung(msg=“data disk has fatal error”)
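【Reproduction steps】Operations performed before and after the problem appeared
【Symptoms and impact】
The cluster cannot start normally. I checked /var/log/message and found no disk errors; a dd check shows disk IO at 260 MB/s.
【Attachments】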

When you deployed through the obd web page, what did you enter for the disk directory? /home?

/home/admin

/data and /redo are on separate partitions, one 300G and one 100G.

If you entered /home/admin, then everything was placed under that directory. The separate /data and /redo partitions you created are not involved.
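To see which filesystem a directory really sits on, something like the following rough Python sketch could be run on each observer host. The three paths are the ones mentioned in this thread; the helper names `mount_point` and `describe` are made up for illustration, and the script only reports the mount point and usage, nothing OceanBase-specific.

```python
import os
import shutil


def mount_point(path):
    """Walk up from path until a mount point is reached."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def describe(path):
    """Return (mount point, percent used) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    return mount_point(path), percent_used


if __name__ == "__main__":
    # Paths taken from this thread; adjust to the actual deployment.
    for p in ("/home/admin", "/data", "/redo"):
        if os.path.exists(p):
            mnt, pct = describe(p)
            print(f"{p}: mounted at {mnt}, {pct:.0f}% used")
        else:
            print(f"{p}: does not exist on this host")
```

If /home/admin reports a mount point other than /data or /redo, the observer files are not on the dedicated partitions.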

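It is probably still an IO problem. I have decided to move the clog to an SSD, but my organization cannot buy the SSD until next week. The cluster did come up after the initial deployment. The resource requirements are this high even for a test, and the key point is that I had not done anything yet, only created one tenant. Creating the tenant was already very slow, and after shutting the cluster down once it would not start again...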
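The dd check mentioned earlier reports throughput, which does not say much about how long a single small synchronous write takes. As a rough way to look at that, a probe along these lines could be run against the directory that holds the observer files. This is only a sketch under my own assumptions: the function name, the scratch file name `io_probe.tmp`, and the 4 KiB block size are arbitrary, and the thread does not establish what latency the observer failure detector actually treats as a hang.

```python
import os
import sys
import time


def fsync_write_latencies(directory, count=20, block_size=4096):
    """Write `count` blocks to a scratch file in `directory`, fsync after
    each one, and return the per-write latencies in seconds."""
    scratch = os.path.join(directory, "io_probe.tmp")  # hypothetical name
    block = b"\0" * block_size
    latencies = []
    fd = os.open(scratch, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # force the block to the device before timing stops
            latencies.append(time.perf_counter() - start)
    finally:
        # Always clean up the scratch file, even if a write fails.
        os.close(fd)
        os.remove(scratch)
    return latencies


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    lat = sorted(fsync_write_latencies(target))
    median_ms = lat[len(lat) // 2] * 1000
    max_ms = lat[-1] * 1000
    print(f"{target}: median {median_ms:.2f} ms, max {max_ms:.2f} ms")
```

Running it once against /home/admin and once against /redo would at least show whether the two locations behave very differently for small synced writes.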

Virtual machines?