Please help check whether this is an IO problem

桃纭 · March 11, 2025, 09:25

[Environment] Test environment
OceanBase Database 4.2.1.7
[Problem description]
A service we deployed is reporting errors:
[ERROR] 2025/03/11 08:37:18 scheduler.go:55: Error 4009 (58030): IO error
[ERROR] 2025/03/11 08:37:11 query.go:534: Error 4009 (58030): IO error
Troubleshooting shows that the error is raised when this SQL statement is executed:
select v.id,
v.ts,
a.ip,
a.host_ip,
a.ipv6,
a.host_ipv6,
a.agent_uuid,
v.app_name,
v.version,
v.app_type,
v.pid_file_path,
v.cfg_port,
v.install_path,
v.listener_port,
v.exe_path,
v.parent_process_name,
v.running_process_user,
cast(crc32(concat_ws(
'@',
a.ip,
a.host_ip,
a.ipv6,
a.host_ipv6,
app_name,
version,
app_type,
pid_file_path,
cfg_port,
install_path,
listener_port,
exe_path,
parent_process_name,
running_process_user)) as char) as hash
from (select ts,
agent_uuid,
agent_ip as ip,
agent_localhost_ip as host_ip,
agent_business_ipv6 as ipv6,
agent_localhost_ipv6 as host_ipv6
from hhit_agent
where is_delete = 0
and online_status = 1
and monitor_status = 1) a
inner join hhit_asset_agent_app_service v on v.ts = a.ts and v.host_uuid = a.agent_uuid
left join hhit_asset_agent_data_hash h on h.data_id = v.id and h.data_type = 'app_service' and h.ts = v.ts
where (h.data_hash is null or
h.data_hash != crc32(concat_ws(
'@',
a.ip,
a.host_ip,
a.ipv6,
a.host_ipv6,
app_name,
version,
app_type,
pid_file_path,
cfg_port,
install_path,
listener_port,
exe_path,
parent_process_name,
running_process_user))) ORDER BY v.id, v.ts LIMIT '20' OFFSET '0';

observer.log (20.7 MB)

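淇铭 · March 11, 2025, 09:45

I don't see the message Error 4009 (58030): IO error anywhere in the log. Where is this error being reported? Does the statement itself fail when you execute it?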

桃纭 · March 11, 2025, 09:47

Executed directly, it reports a SQL syntax error. Executed from the service, it reports the IO error.

靖顺 · March 11, 2025, 09:49

Run obdiag analyze log --from xxx --to xxx to analyze the logs and take a look at the result.

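try_again · March 11, 2025, 09:50

Hi, it is possible that a single SQL execution failed because of a disk anomaly. Has this statement been retried?
You can also check the disk space and whether the file system is healthy: dmesg | grep -i error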
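The disk-space half of that check can be scripted. Below is a minimal sketch, assuming a POSIX `df` and a bash shell. The function names `disk_usage_pct` and `check_disk` and the 90% default threshold are illustrative choices, not anything obdiag or OceanBase provides.

```shell
#!/usr/bin/env bash

# Print the used-space percentage (integer, no % sign) of the
# file system that holds the path given as $1.
disk_usage_pct() {
    # -P forces the one-line-per-filesystem POSIX layout; column 5 is "Capacity".
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Print OK or WARN depending on whether usage has reached the
# threshold in $2 (hypothetical default: 90).
check_disk() {
    local path="$1" threshold="${2:-90}" pct
    pct="$(disk_usage_pct "$path")"
    if [ -z "$pct" ]; then
        echo "ERROR: could not read usage for $path"
        return 2
    fi
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARN: $path is ${pct}% full"
        return 1
    fi
    echo "OK: $path is ${pct}% full"
}
```

For example, `check_disk /path/to/observer/store 90` would be run against the observer's data and log directories (the path shown is a placeholder). This only covers free space; kernel-level disk errors still need the `dmesg` check.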

桃纭 · March 11, 2025, 09:51

I checked with dmesg | grep -i error and everything looks normal.

桃纭 · March 11, 2025, 09:52

Could executing a SQL statement with a syntax error end up being reported as an IO error?

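try_again · March 11, 2025, 09:59

Generally, when a SQL statement that is syntactically correct and supported fails to execute, it returns the corresponding error code, such as the 4009 seen here.
A SQL statement with a syntax error fails with a syntax error instead: You have an error in your SQL syntax; check the manual for the right syntax to use near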
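To see which error codes the service actually logged, its log can be tallied by code. This is a rough sketch, assuming the log lines follow the `Error <code> (<sqlstate>): <message>` shape shown in the original post. The function names are made up for illustration. In MySQL-compatible mode a syntax error normally appears as code 1064, so a log containing only 4009 would suggest the service never hit the syntax error.

```shell
#!/usr/bin/env bash

# Extract the numeric error code from one log line, e.g.
# "... Error 4009 (58030): IO error" gives 4009. Prints nothing if absent.
sql_error_code() {
    printf '%s\n' "$1" | sed -nE 's/.*Error ([0-9]+) \([0-9A-Z]{5}\).*/\1/p'
}

# Count occurrences of each "Error <code> (<sqlstate>)" pair across the
# given log files, most frequent first.
count_sql_errors() {
    grep -ohE 'Error [0-9]+ \([0-9A-Z]{5}\)' "$@" | sort | uniq -c | sort -rn
}
```

Usage would look like `count_sql_errors service.log`, where `service.log` stands in for wherever the service writes its `[ERROR]` lines.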

淇铭 · March 11, 2025, 10:16

1. Run obdiag analyze log --from xxx --to xxx to analyze the logs and take a look at the result.
2. Follow the steps below and provide the log information:
alter system set enable_rich_error_msg=true;
obclient [test]> select count(*) from t2;
ERROR 1146 (42S02): Table 'test.t2' doesn't exist
[xx.xx.xx.1:2882] [2024-04-13 20:10:20.292087] [YB420BA1CC68-000615A0A8EA5E38-0-0]
[root@x.x.x.1 ~]$ grep "YB420BA1CC68-000615A0A8EA5E38-0-0" rootservice.log
[root@x.x.x.1 ~]$ grep "YB420BA1CC68-000615A0A8EA5E38-0-0" observer.log
alter system set enable_rich_error_msg=false;
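The trace-ID lookup in step 2 can be wrapped in two small helpers. This is a sketch only: the regular expression assumes trace IDs look like the sample above (a `Y`-prefixed hex group, a second hex group, then two numeric fields), and the function names are hypothetical.

```shell
#!/usr/bin/env bash

# Pull the first trace ID out of a rich error message line.
# Assumes the format seen in the sample: YB420BA1CC68-000615A0A8EA5E38-0-0
extract_trace_id() {
    printf '%s\n' "$1" | grep -oE 'Y[0-9A-F]+-[0-9A-F]+-[0-9]+-[0-9]+' | head -n 1
}

# Search the given log files for a trace ID as a fixed string.
grep_trace() {
    local trace_id="$1"
    shift
    grep -F -- "$trace_id" "$@"
}
```

With the error line saved in a variable `line`, `grep_trace "$(extract_trace_id "$line")" observer.log rootservice.log` would collect the matching entries from both logs on the node named in the message.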

桃纭 · March 11, 2025, 10:16

Results:
election.7z (7.2 MB)
remote_192.168.70.2.7z (36.4 MB)
remote_192.168.70.2.7z (7.9 MB)
result_details.txt (1.4 KB)

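淇铭 · March 11, 2025, 10:38

Judging from the error messages in the logs, the IO failure is most likely caused by a disk fault or by the permissions on a file or directory.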

桃纭 · March 11, 2025, 11:03

The database write volume is quite high, isn't it?

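淇铭 · March 11, 2025, 11:17

It doesn't look high to me, only a bit over 1M. You can use tsar to monitor and record the disk's wawait and svctm metrics in real time, and use iotop to see which processes have high IO wait.
Run a quick performance test and see: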
fio -filename=/data/nfs/fio_test -direct=1 -rw=randwrite -bs=2048K -size=100G -runtime=300 -group_reporting -name=mytest -ioengine=libaio -numjobs=1 -iodepth=64 -iodepth_batch=8 -iodepth_low=8 -iodepth_batch_complete=8

桃纭 · March 12, 2025, 09:46

Can the database tell me which user executes the most SQL?

淇铭 · March 12, 2025, 09:57

You can check with this query:
select user_id, user_name,
count(*) cnt
from v$OB_SQL_AUDIT
group by 1, 2
order by cnt desc;

桃纭 · March 12, 2025, 10:30

Is it normal for the root user to be doing so many inserts?

淇铭 · March 12, 2025, 10:36

That's normal. Many connections are made as root. You can monitor the disk with tsar or vsar, and use iotop to look at disk performance.

嗨森滴 · March 12, 2025, 11:18

It's a disk performance problem.

桃纭 · March 12, 2025, 11:18

It now looks like the database is simply writing too much. Is there any way to optimize this?

桃纭 · March 12, 2025, 11:18

The disk isn't very good.