几乎没有负载和数据的情况下,简单查询经常报错 ob_query_timeout

【 使用环境 】生产环境
【 OB or 其他组件 】
OceanBase_CE 4.3.1.0
3台服务器,每台服务器的应用都直连本机的observer

【问题描述】清晰明确描述问题

最近我调试的时候,发现线上的网站,各种查询的时候都经常会报错 Timeout(没有负载,只有我一个人访问)

这种情况我感觉经常会发生在好久没访问,然后首次访问的时候

ERROR mysql_async::query{mysql_async.connection.id=3222172272 mysql_async.query.sql=“SELECT payStripeRm(10001000,12)”}: mysql_async::conn::routines::query: /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/mysql_async-0.34.1/src/conn/routines/query.rs:57: error=Server error: `ERROR HY000 (4012): Timeout, query has reached the maximum query timeout: 10000000(us), maybe you can adjust the session variable ob_query_timeout or query_timeout hint, and try again.’

❯ gist < ./check_report//obdiag_check_report_observer_2024-07-03-12-22-19.table
https://gist.github.com/i18n-now/95241ca6321d2ac8c343ced3c295a3fb

❯ gist < /root/obdiag_analyze_pack_20240703115202/result_details.txt
https://gist.github.com/i18n-now/64d9685dc88e413adcf9ab41364c36c4

这是服务器的日志

❯ pwd
/opt/ob/oceanbase/log

❯ ls
election.log observer.log observer.log.20240615164646718 observer.log.20240702123358982 rootservice.log trace.log
election.log.wf observer.log.20240615161053239 observer.log.20240619211847565 observer.log.wf rootservice.log.w

❯ rg ob_query_timeout | gist
这是日志中包含ob_query_timeout的记录

https://gist.github.com/i18n-now/87e8079fceaa140996faafe0748275b1

1、应用程序都是直连访问数据库2881端口的方式?前面没有使用obproxy(如果有要排查是不是这里的网络链路时延问题)?

2、ob_query_timeout 默认是10000000微秒,要查一下应用sql执行的耗时是不是超过这个值?超过的话,可以将其调大,对应的业务sql看看能否优化,可以结合sql_audit表进行分析。

3、另外还要查一下应用程序获取数据库连接的一些超时设置,如果是使用了连接池,要看一下连接池是不是有做连接失效的探测。

每台observer上慢日志里有相关sql不。这种3应用+3数据库分布式部署本地读写模式还是挺6的,性能最大化,适合小系统应用跑

OceanBase 社区已接收您的帖子,正在跟进中。

obdaig check巡检的时候有什么异常吗,可以发一下巡检结果

  1. 直连访问数据库

  2. 我之前把审计关了,刚刚又用

ALTER SYSTEM SET enable_sql_audit = true;
打开了,等下次遇到错误我再看看慢查询的日志

  1. 我给连接池加上了参数 闲置120秒超时

我再观察一下,还有没有这个错误

❯ gist < ./check_report//obdiag_check_report_observer_2024-07-03-12-22-19.table
https://gist.github.com/i18n-now/95241ca6321d2ac8c343ced3c295a3fb
这里

由于审计没有打开 看报告 只有一些服务器sysctl参数配置建议 看到好多参数 可以优化
可以按照楼上说的 进行检查一下

OB数据库的ob_query_timeout参数默认时10s,超过10s的SQL会被自动终止,如果需要运行更长的时间的SQL,修改这个参数即可
https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000821230

多谢各位指点

obdiag check 新的检测如下

https://gist.githubusercontent.com/i18n-now/1d0eefe9e359177598f206444a8c5dd6/raw/ad9314ee01b6233474f09820ab9fec6dec6be0f0/gistfile1.txt

大部分参数都已经按推荐配置设置
内存 vm.swappiness : 200. recommended: 0 因为我启用了zram,所以没改
一些进程数限制,不知道为什么,我死活改不掉,不过感觉影响不大,先不管

线上程序的线程数闲置超时也加上了

现在还没能复现问题,等复现了我再回来跟进

我还有一个问题,我想先看看现在大于1秒慢查询,为啥会有sql为空的情况(也不一定都为空,有的有sql)

obclient [oceanbase]> select request_id,usec_to_time(request_time),ELAPSED_TIME,QUEUE_TIME,EXECUTE_TIME,query_sql from v$OB_SQL_AUDIT where ELAPSED_TIME > 1000000 limit 10;
±-----------±---------------------------±-------------±-----------±-------------±----------+
| request_id | usec_to_time(request_time) | ELAPSED_TIME | QUEUE_TIME | EXECUTE_TIME | query_sql |
±-----------±---------------------------±-------------±-----------±-------------±----------+
| 214769 | 2024-07-03 15:01:19.845520 | 1012061 | 101 | 1011413 | |
±-----------±---------------------------±-------------±-----------±-------------±----------+

那你在时间调小点 看看是否有诊断的sql语句

我刚刚又遇到一个错误

ERROR mysql_async::query{mysql_async.connection.id=3222013046 mysql_async.query.sql=“SELECT payIndex(10001000,19905,19936)”}: mysql_async::conn::routines::query: /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/mysql_async-0.34.1/src/conn/routines/query.rs:57: error=Server error: `ERROR HY000 (4012): Timeout, query has reached the maximum query timeout: 10000000(us), maybe you can adjust the session variable ob_query_timeout or query_timeout hint, and try again.’
ERROR re::err: /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/re-0.1.7/src/err.rs:28: 0: anyhow::error::<impl core::convert::From for anyhow::Error>::from
1: pay::post::{{closure}}
2: <re::msg::FnAny as axum::handler::Handler<(M,T1),S>>::call::{{closure}}
3: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
4: <axum::routing::route::RouteFuture as core::future::future::Future>::poll
5: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
6: <axum::middleware::from_fn::FromFn<F,S,I,(T1,)> as tower_service::Service<http::request::Request<axum_core::body::Body>>>::call::{{closure}}
7: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
8: <axum::routing::route::RouteFuture as core::future::future::Future>::poll
9: <axum::error_handling::HandleError<S,F,()> as tower_service::Service<http::request::Request>>::call::{{closure}}
10: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
11: <axum::routing::route::RouteFuture as core::future::future::Future>::poll
12: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
13: <axum::routing::route::RouteFuture as core::future::future::Future>::poll
14: <axum::serve::TowerToHyperServiceFuture<S,R> as core::future::future::Future>::poll
15: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_loop
16: <axum::serve::Serve<M,S> as core::future::into_future::IntoFuture>::into_future::{{closure}}::{{closure}}
17: tokio::runtime::task::raw::poll
18: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
19: tokio::runtime::task::raw::poll
20: std::sys::backtrace::__rust_begin_short_backtrace
21: core::ops::function::FnOnce::call_once{{vtable.shim}}
22: std::sys::pal::unix::thread::thread::new::thread_start
23:
24:

但是慢查询里面也没有啊,怎么排查?从rust报错看是这个sql mysql_async.query.sql=“SELECT payIndex(10001000,19905,19936)”}:

obclient [oceanbase]> select request_id,usec_to_time(request_time),ELAPSED_TIME,QUEUE_TIME,EXECUTE_TIME,query_sql from v$OB_SQL_AUDIT where ELAPSED_TIME > 1000000 limit 10;
±-----------±---------------------------±-------------±-----------±-------------±---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| request_id | usec_to_time(request_time) | ELAPSED_TIME | QUEUE_TIME | EXECUTE_TIME | query_sql |
±-----------±---------------------------±-------------±-----------±-------------±---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1639003 | 2024-07-03 19:28:11.545749 | 1190747 | 0 | 1190604 | SELECT unit_id, cast(sum(data_disk_in_use) as signed) required_size, status in (‘MIGRATE IN’, ‘MIGRATE OUT’) is_migrating from __all_virtual_unit where status in (‘NORMAL’, ‘MIGRATE IN’, ‘MIGRATE OUT’) group by unit_id |
| 1642825 | 2024-07-03 19:29:08.253129 | 1181852 | 0 | 1181308 | SELECT unit_id, cast(sum(data_disk_in_use) as signed) required_size, status in (‘MIGRATE IN’, ‘MIGRATE OUT’) is_migrating from __all_virtual_unit where status in (‘NORMAL’, ‘MIGRATE IN’, ‘MIGRATE OUT’) group by unit_id |
| 1642880 | 2024-07-03 19:29:09.436349 | 1074012 | 0 | 1073874 | SELECT unit_id, cast(sum(data_disk_in_use) as signed) required_size, status in (‘MIGRATE IN’, ‘MIGRATE OUT’) is_migrating from __all_virtual_unit where status in (‘NORMAL’, ‘MIGRATE IN’, ‘MIGRATE OUT’) group by unit_id |
| 1374569 | 2024-07-03 19:29:11.789941 | 1027881 | 326 | 1026980 | |
| 774895 | 2024-07-03 19:25:06.003823 | 1548216 | 143 | 1547483 | |
±-----------±---------------------------±-------------±-----------±-------------±---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
5 rows in set (0.091 sec)

❯ source ~/i18n/conf/env/db.sys.sh && obclient -h127.0.0.1 -P2881 -u$MYSQL_USER -p$MYSQL_PWD -Doceanbase -A -e “select ELAPSED_TIME,EXECUTE_TIME,ELAPSED_TIME,query_sql from v$OB_SQL_AUDIT order by request_time desc” >/tmp/dump.sql

❯ gist < /tmp/dump.sql
https://gist.github.com/i18n-now/885b76dfa02af1593f365e50491589c7

超时的查询是 SELECT payIndex(10001000,19905,19936)

ERROR HY000 (4012): Timeout, query has reached the maximum query timeout: 10000000(us), maybe you can adjust the session variable ob_query_timeout or query_timeout hint, and try again.

但是我导出的所有OB_SQL_AUDIT,搜了下,没有这个查询,很奇怪

我切换到数据库用户
source ~/i18n/conf/env/db.sh && obclient -h127.0.0.1 -P2881 -u$MYSQL_USER -p$MYSQL_PWD -Doceanbase -A -e ‘select request_id,usec_to_time(request_time),ELAPSED_TIME,QUEUE_TIME,EXECUTE_TIME,query_sql from v$OB_SQL_AUDIT’

好奇怪,我运行这个查询,之前出现过一次
803348 2024-07-03 19:38:30.859821 199055 64 169714 SELECT payIndex(10001000,19905,19936)

现在再运行,就没有了

我看了一下,审计日志只有十多分钟,这个调整什么参数控制日志的存储数?
❯ head -2 /tmp/dump.sql
request_id usec_to_time(request_time) ELAPSED_TIME QUEUE_TIME EXECUTE_TIME query_sql
1035000 2024-07-03 21:28:37.095355 187 0 133 SELECT trigger_id, is_deleted FROM __all_tenant_trigger_history WHERE tenant_id = 0 AND base_object_id = 21540 AND schema_version <= 1719387637149504 ORDER BY trigger_id desc, schema_version desc
❯ tail -1 /tmp/dump.sql
1068939 2024-07-03 21:45:05.223068 236 50 64 select @@version_comment, @@version limit 1

这个查询的参数和rust的报错的参数是一样的,应该就是这个sql,但是这里显示查询耗时ELAPSED_TIME只有199055(这是199毫秒吧?),为啥最后会报错ob_query_timeout呢?

1、在sys租户下:
alter system set enable_rich_error_msg=‘true’;

2、业务报错的文本日志。

3、业务报错后,根据提示的svr_ip 将对应时间的完整(未过滤)的日志observer.log 日志附件。

我又找老师具体分析查询报ob_query_timeout超时的问题 查询的时候由于sql_audit诊断信息数据量大 导致的查询时间超时 你把ob_query_timeout调大试一下