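【Environment】Production or test environment
【OB or other component】
【Version】ocp 4.3.5 ob-ce 4.3.5
【Problem description】
OCP monitoring shows the tenant's QPS holding at around 20,000, but TPS never exceeds 3, and the number of rollbacks is often higher than the number of commits.
In OCP SQL Diagnostics, BEGIN and COMMIT are each executed about 300 times per second.
Is the TPS figure shown in monitoring normal? How should I troubleshoot this?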
Also, searching the logs for the keyword rollback turns up only the following entries, which keep repeating:
[2025-03-12 14:32:49.835615] WDIAG [STORAGE.TRANS] sync_rollback_savepoint__ (ob_tx_api.cpp:1878) [153089][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D3DFDF6-0-0] [lt=151][errcode=-4012] tx rpc condition wakeup(ret=-4012, tx.tx_id_={txid:61850779}, waittime=10000, rpc_ret=0, expire_ts=1741761467746549, remain=[{ls_id:{id:1001}, exec_epoch:556963872692272, transfer_epoch:-1}], remain_cnt=1, retries=0, tx.state=4)
[2025-03-12 14:32:49.870020] WDIAG [STORAGE.TRANS] rollback_to_savepoint (ob_trans_part_ctx.cpp:8726) [153090][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D3DFE50-0-0] [lt=0][errcode=-4036] rollback_to need retry because of logging(ret=-4036, trans_id_={txid:61850824}, ls_id_={id:1001}, busy_cbs_.get_size()=1)
[2025-03-12 14:32:49.870043] WDIAG [STORAGE.TRANS] ls_sync_rollback_savepoint__ (ob_tx_api.cpp:1354) [153090][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D3DFE50-0-0] [lt=53][errcode=-4036] rollback to savepoint sync fail(ret=-4036, part_ctx->get_trans_id()={txid:61850824}, part_ctx->get_ls_id()={id:1001}, retry_cnt=0, op_sn=57, savepoint={branch:0, seq:19969}, expire_ts=-1)
[2025-03-12 14:32:49.870061] WDIAG [STORAGE.TRANS] ls_rollback_to_savepoint_ (ob_tx_api.cpp:1702) [153090][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D3DFE50-0-0] [lt=44][errcode=-4036] LS rollback to savepoint fail(ret=-4036, tx_id={txid:61850824}, ls={id:1001}, op_sn=57, savepoint={branch:0, seq:19969}, ctx={this:0x7f76460db2d0, ref:3, trans_id:{txid:61850824}, tenant_id:1012, is_exiting:false, trans_expired_time:1741847569848747, cluster_version:17180067072, trans_need_wait_wrap:{receive_gts_ts_:[mts=0], need_wait_interval_us:0}, stc:[mts=0], ctx_create_time:1741761169848747})
[2025-03-12 14:32:49.870220] WDIAG [STORAGE.TRANS] rollback_to_savepoint (ob_trans_part_ctx.cpp:8726) [153097][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D3DFE50-0-0] [lt=2][errcode=-4036] rollback_to need retry because of logging(ret=-4036, trans_id_={txid:61850824}, ls_id_={id:1001}, busy_cbs_.get_size()=1)
[2025-03-12 14:32:49.870236] WDIAG [STORAGE.TRANS] ls_sync_rollback_savepoint__ (ob_tx_api.cpp:1354) [153097][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D3DFE50-0-0] [lt=37][errcode=-4036] rollback to savepoint sync fail(ret=-4036, part_ctx->get_trans_id()={txid:61850824}, part_ctx->get_ls_id()={id:1001}, retry_cnt=0, op_sn=57, savepoint={branch:0, seq:19969}, expire_ts=-1)
[2025-03-12 14:32:49.870247] WDIAG [STORAGE.TRANS] ls_rollback_to_savepoint_ (ob_tx_api.cpp:1702) [153097][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D3DFE50-0-0] [lt=28][errcode=-4036] LS rollback to savepoint fail(ret=-4036, tx_id={txid:61850824}, ls={id:1001}, op_sn=57, savepoint={branch:0, seq:19969}, ctx={this:0x7f76460db2d0, ref:3, trans_id:{txid:61850824}, tenant_id:1012, is_exiting:false, trans_expired_time:1741847569848747, cluster_version:17180067072, trans_need_wait_wrap:{receive_gts_ts_:[mts=0], need_wait_interval_us:0}, stc:[mts=0], ctx_create_time:1741761169848747})
[2025-03-12 14:32:49.871802] WDIAG [STORAGE.TRANS] rollback_to_savepoint (ob_trans_part_ctx.cpp:8726) [153088][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D1E2244-0-0] [lt=0][errcode=-4036] rollback_to need retry because of logging(ret=-4036, trans_id_={txid:61850821}, ls_id_={id:1001}, busy_cbs_.get_size()=1)
[2025-03-12 14:32:49.871824] WDIAG [STORAGE.TRANS] ls_sync_rollback_savepoint__ (ob_tx_api.cpp:1354) [153088][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D1E2244-0-0] [lt=51][errcode=-4036] rollback to savepoint sync fail(ret=-4036, part_ctx->get_trans_id()={txid:61850821}, part_ctx->get_ls_id()={id:1001}, retry_cnt=0, op_sn=57, savepoint={branch:0, seq:23519}, expire_ts=-1)
[2025-03-12 14:32:49.871838] WDIAG [STORAGE.TRANS] ls_rollback_to_savepoint_ (ob_tx_api.cpp:1702) [153088][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D1E2244-0-0] [lt=34][errcode=-4036] LS rollback to savepoint fail(ret=-4036, tx_id={txid:61850821}, ls={id:1001}, op_sn=57, savepoint={branch:0, seq:23519}, ctx={this:0x7f7648a2b950, ref:3, trans_id:{txid:61850821}, tenant_id:1012, is_exiting:false, trans_expired_time:1741847569847690, cluster_version:17180067072, trans_need_wait_wrap:{receive_gts_ts_:[mts=0], need_wait_interval_us:0}, stc:[mts=0], ctx_create_time:1741761169847690})
[2025-03-12 14:32:49.871973] WDIAG [STORAGE.TRANS] rollback_to_savepoint (ob_trans_part_ctx.cpp:8726) [153089][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D1E2244-0-0] [lt=38][errcode=-4036] rollback_to need retry because of logging(ret=-4036, trans_id_={txid:61850821}, ls_id_={id:1001}, busy_cbs_.get_size()=1)
[2025-03-12 14:32:49.871993] WDIAG [STORAGE.TRANS] ls_sync_rollback_savepoint__ (ob_tx_api.cpp:1354) [153089][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D1E2244-0-0] [lt=47][errcode=-4036] rollback to savepoint sync fail(ret=-4036, part_ctx->get_trans_id()={txid:61850821}, part_ctx->get_ls_id()={id:1001}, retry_cnt=0, op_sn=57, savepoint={branch:0, seq:23519}, expire_ts=-1)
[2025-03-12 14:32:49.872007] WDIAG [STORAGE.TRANS] ls_rollback_to_savepoint_ (ob_tx_api.cpp:1702) [153089][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D1E2244-0-0] [lt=33][errcode=-4036] LS rollback to savepoint fail(ret=-4036, tx_id={txid:61850821}, ls={id:1001}, op_sn=57, savepoint={branch:0, seq:23519}, ctx={this:0x7f7648a2b950, ref:3, trans_id:{txid:61850821}, tenant_id:1012, is_exiting:false, trans_expired_time:1741847569847690, cluster_version:17180067072, trans_need_wait_wrap:{receive_gts_ts_:[mts=0], need_wait_interval_us:0}, stc:[mts=0], ctx_create_time:1741761169847690})
[2025-03-12 14:32:49.880796] WDIAG [STORAGE.TRANS] sync_rollback_savepoint__ (ob_tx_api.cpp:1878) [153090][T1012_L0_G0][T1012][YB420A0A11CC-00062E869D3DFE50-0-0] [lt=114][errcode=-4012] tx rpc condition wakeup(ret=-4012, tx.tx_id_={txid:61850824}, waittime=10000, rpc_ret=0, expire_ts=1741761467792870, remain=[{ls_id:{id:1001}, exec_epoch:556963917339846, transfer_epoch:-1}], remain_cnt=1, retries=0, tx.state=4)
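Since only a handful of messages repeat in the excerpt above, a quick tally by function name and errcode can show which one dominates. The following is a minimal sketch that assumes every line follows the same layout as the WDIAG lines quoted above; the regular expression and the `tally_errcodes` helper are illustrative and not part of obdiag or OceanBase.

```python
import re
from collections import Counter

# Captures the function name that follows the [MODULE] tag and the value of
# the [errcode=...] field, based on the layout of the lines quoted above.
LINE_RE = re.compile(r"\]\s+(\w+)\s+\([^)]*\).*?\[errcode=(-?\d+)\]")


def tally_errcodes(lines):
    """Count (function, errcode) pairs across observer log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def print_tally(path):
    """Print the tally for a log file, most frequent pair first."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for (func, code), n in tally_errcodes(handle).most_common():
            print(f"{n:6d}  errcode={code}  {func}")
```

Calling `print_tally` on a file that holds the output of the rollback keyword search would list each function and errcode pair with its count; the file path is whatever the search was saved to.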
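Please first use obdiag to analyze the logs and collect the current cluster information.
Analyze the last hour of logs online to find any errors that have occurred: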
obdiag analyze log --since 1h
obdiag gather scene run --scene=observer.base
Check whether there are any abnormal uncommitted transactions:
SELECT * FROM oceanbase.gv$ob_trans_stat WHERE state != 'COMMITTED';
Check the current values of the tenant's timeout-related global variables:
ob_query_timeout
ob_trx_timeout
The table oceanbase.gv$ob_trans_stat does not exist. I tried both the sys tenant and the business tenant.
MySQL [oceanbase]> SELECT * FROM oceanbase.gv$ob_trans_stat WHERE state != 'COMMITTED';
ERROR 1146 (42S02): Table 'oceanbase.gv$ob_trans_stat' doesn't exist
ob_query_timeout 298000000 # about 5 minutes
ob_trx_timeout 86400000000 # 1 day
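Both variables are expressed in microseconds, so the raw values are easy to misread. The sketch below converts them to a readable duration; `describe_timeout` is a hypothetical helper name used only for illustration.

```python
from datetime import timedelta


def describe_timeout(microseconds: int) -> str:
    """Render a microsecond timeout value as a human-readable duration."""
    return str(timedelta(microseconds=microseconds))


print(describe_timeout(298000000))    # 0:04:58
print(describe_timeout(86400000000))  # 1 day, 0:00:00
```

The query timeout is therefore 4 minutes 58 seconds, slightly under 5 minutes, and the transaction timeout is exactly one day.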
Use this table instead: __all_virtual_trans_stat;
Please send the obdiag analysis attachment, along with a complete observer log.
It reports that there are too many files:
[root@OB01 ~]# obdiag analyze log --since 1h
analyze_log start …
analyze log from_time: 2025-03-13 12:25:54, to_time: 2025-03-13 13:26:54
analyze nodes’s log start. Please wait a moment…
analyze start ok
[WARN] 10.10.17.204 The number of log files is 51, out of range (0,50]
Please log in to this tenant, paste the following SQL statements into the client all at once, press Enter, and post the results.
select now();
select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
select sleep(5),now();
select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
select sleep(5),now();
select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
MySQL [(none)]> select now();
+---------------------+
| now() |
+---------------------+
| 2025-03-13 13:51:02 |
+---------------------+
1 row in set (0.00 sec)
MySQL [(none)]> select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
+-----------+---------+----------+
| tenant_id | stat_id | value |
+-----------+---------+----------+
| 1012 | 30005 | 89546636 |
+-----------+---------+----------+
1 row in set (0.02 sec)
MySQL [(none)]> select sleep(5),now();
+----------+---------------------+
| sleep(5) | now() |
+----------+---------------------+
| 0 | 2025-03-13 13:51:02 |
+----------+---------------------+
1 row in set (5.00 sec)
MySQL [(none)]> select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
+-----------+---------+----------+
| tenant_id | stat_id | value |
+-----------+---------+----------+
| 1012 | 30005 | 89550030 |
+-----------+---------+----------+
1 row in set (0.02 sec)
MySQL [(none)]> select sleep(5),now();
+----------+---------------------+
| sleep(5) | now() |
+----------+---------------------+
| 0 | 2025-03-13 13:51:07 |
+----------+---------------------+
1 row in set (5.00 sec)
MySQL [(none)]> select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
+-----------+---------+----------+
| tenant_id | stat_id | value |
+-----------+---------+----------+
| 1012 | 30005 | 89561423 |
+-----------+---------+----------+
1 row in set (0.02 sec)
Sorry, please run it this way instead: paste everything in at once and press Enter.
select now();
select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
select sleep(5);
select now();
select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
select sleep(5);
select now();
select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
select now();
MySQL [oceanbase]> select now();
+---------------------+
| now() |
+---------------------+
| 2025-03-13 14:12:59 |
+---------------------+
1 row in set (0.00 sec)
MySQL [oceanbase]> select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
+-----------+---------+----------+
| tenant_id | stat_id | value |
+-----------+---------+----------+
| 1012 | 30005 | 90122106 |
+-----------+---------+----------+
1 row in set (0.02 sec)
MySQL [oceanbase]> select sleep(5);
+----------+
| sleep(5) |
+----------+
| 0 |
+----------+
1 row in set (5.00 sec)
MySQL [oceanbase]> select now();
+---------------------+
| now() |
+---------------------+
| 2025-03-13 14:13:04 |
+---------------------+
1 row in set (0.00 sec)
MySQL [oceanbase]> select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
+-----------+---------+----------+
| tenant_id | stat_id | value |
+-----------+---------+----------+
| 1012 | 30005 | 90125546 |
+-----------+---------+----------+
1 row in set (0.01 sec)
MySQL [oceanbase]> select sleep(5);
+----------+
| sleep(5) |
+----------+
| 0 |
+----------+
1 row in set (5.00 sec)
MySQL [oceanbase]> select now();
+---------------------+
| now() |
+---------------------+
| 2025-03-13 14:13:09 |
+---------------------+
1 row in set (0.00 sec)
MySQL [oceanbase]> select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30005) and (con_id > 1000 or con_id = 1) and class < 1000;
+-----------+---------+----------+
| tenant_id | stat_id | value |
+-----------+---------+----------+
| 1012 | 30005 | 90128262 |
+-----------+---------+----------+
1 row in set (0.02 sec)
MySQL [oceanbase]>
MySQL [oceanbase]> select now();
+---------------------+
| now() |
+---------------------+
| 2025-03-13 14:13:11 |
+---------------------+
1 row in set (0.00 sec)
MySQL [oceanbase]>
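Supplementary material:
1. obdiag analyze log --since 1h currently reports too many files and returns no result.
2. obdiag gather scene run --scene=observer.base: please see the attachment.
3. SELECT * FROM oceanbase.gv$ob_trans_stat WHERE state != 'COMMITTED'; (run against __all_virtual_trans_stat, as shown below) is sometimes empty. The largest result I got is as follows: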
sql_result.rar (35.2 KB)
MySQL [oceanbase]> SELECT * FROM __all_virtual_trans_stat WHERE state != 'COMMITTED' ;

| tenant_id | svr_ip | svr_port | trans_type | trans_id | session_id | scheduler_addr | is_decided | ls_id | participants | ctx_create_time | expired_time | ref_cnt | last_op_sn | pending_write | state | part_trans_action | trans_ctx_addr | mem_ctx_id | pending_log_size | flushed_log_size | role | is_exiting | coordinator | last_request_time | gtrid | bqual | format_id | start_scn | end_scn | rec_scn | transfer_blocking | busy_cbs | replay_complete | serial_log_final_scn | callback_list_stats |

| 1012 | 10.10.17.204 | 2882 | 0 | 90856067 | 3221684929 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.554653 | 2025-03-14 14:10:58.554653 | 3 | 6 | 0 | 10 | 2 | 0x7f764686fcd0 | -1 | 245 | 0 | 0 | 0 | -1 | 2025-03-13 14:10:58.556763 | NULL | NULL | -1 | 18446744073709551615 | 18446744073709551615 | 4611686018427387903 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,5,0,1,0,0]] |
| 1012 | 10.10.17.204 | 2882 | 0 | 90846150 | 3221551189 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:23.807890 | 2025-03-14 14:10:23.807890 | 2 | 3 | 0 | 10 | 2 | 0x7f7646325950 | -1 | 0 | 45 | 0 | 0 | -1 | 2025-03-13 14:10:23.807890 | NULL | NULL | -1 | 1741846252416052002 | 18446744073709551615 | 1741846252416052002 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,2,2,0,0,1741846252416052002]] |
| 1012 | 10.10.17.204 | 2882 | 0 | 90856058 | 3221743548 | "10.10.17.204:2882" | 0 | 1001 | [{id:-1}] | 2025-03-13 14:10:58.532506 | 2025-03-14 14:10:58.532506 | 4 | 57 | 1 | 10 | 2 | 0x7f76463080d0 | -1 | 18104 | 16428 | 0 | 0 | -1 | 2025-03-13 14:10:58.556763 | NULL | NULL | -1 | 1741846258548345001 | 18446744073709551615 | 1741846258548345001 | 0 | -1 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,356,237,0,0,1741846258548345001]] |
| 1012 | 10.10.17.204 | 2882 | 0 | 90856068 | 3221549937 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.554653 | 2025-03-14 14:10:58.554653 | 2 | 5 | 0 | 10 | 2 | 0x7f764880f950 | -1 | 39 | 0 | 0 | 0 | -1 | 2025-03-13 14:10:58.556763 | NULL | NULL | -1 | 18446744073709551615 | 18446744073709551615 | 4611686018427387903 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,2,0,3,0,0]] |
| 1012 | 10.10.17.204 | 2882 | 0 | 90856064 | 3221552112 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.549370 | 2025-03-14 14:10:58.549370 | 2 | 4 | 0 | 10 | 2 | 0x7f76462f72d0 | -1 | 2789 | 0 | 0 | 0 | -1 | 2025-03-13 14:10:58.550428 | NULL | NULL | -1 | 18446744073709551615 | 18446744073709551615 | 4611686018427387903 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,53,0,0,0,0]] |
| 1012 | 10.10.17.204 | 2882 | 0 | 90856066 | 3221617648 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.552542 | 2025-03-14 14:10:58.552542 | 3 | 8 | 1 | 10 | 2 | 0x7f76460df650 | -1 | 2884 | 0 | 0 | 0 | -1 | 2025-03-13 14:10:58.556763 | NULL | NULL | -1 | 18446744073709551615 | 18446744073709551615 | 4611686018427387903 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,47,0,1,0,0]] |
| 1012 | 10.10.17.204 | 2882 | 0 | 90856038 | 3221490388 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.506240 | 2025-03-14 14:10:58.506240 | 2 | 13 | 0 | 10 | 2 | 0x7f76463107d0 | -1 | 2789 | 0 | 0 | 0 | -1 | 2025-03-13 14:10:58.549370 | NULL | NULL | -1 | 18446744073709551615 | 18446744073709551615 | 4611686018427387903 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,53,0,3,0,0]] |
| 1012 | 10.10.17.204 | 2882 | 0 | 90856061 | 3221580474 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.542008 | 2025-03-14 14:10:58.542008 | 4 | 30 | 1 | 10 | 2 | 0x7f76487fa7d0 | -1 | 8321 | 0 | 0 | 0 | -1 | 2025-03-13 14:10:58.556763 | NULL | NULL | -1 | 18446744073709551615 | 18446744073709551615 | 4611686018427387903 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,126,0,0,0,0]] |
| 1012 | 10.10.17.204 | 2882 | 0 | 90856049 | 3221746276 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.524088 | 2025-03-14 14:10:58.524088 | 2 | 57 | 0 | 10 | 2 | 0x7f7648a2b950 | -1 | 5850 | 34651 | 0 | 0 | -1 | 2025-03-13 14:10:58.554653 | NULL | NULL | -1 | 1741846258538842005 | 18446744073709551615 | 1741846258538842005 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,456,356,0,0,1741846258554653001]] |
| 1012 | 10.10.17.206 | 2882 | 0 | 90846150 | 0 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:52.417534 | NULL | 2 | 0 | 0 | 10 | 1 | 0x7f37a91d0450 | -1 | 0 | 45 | 1 | 0 | -1 | 2025-03-13 14:10:52.417534 | NULL | NULL | -1 | 1741846252416052002 | 18446744073709551615 | 1741846252416052002 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,2,2,0,0,0]] |
| 1012 | 10.10.17.206 | 2882 | 0 | 90856058 | 0 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.550385 | NULL | 2 | 0 | 0 | 10 | 1 | 0x7f37aa3c6b50 | -1 | 0 | 16428 | 1 | 0 | -1 | 2025-03-13 14:10:58.550385 | NULL | NULL | -1 | 1741846258548345001 | 18446744073709551615 | 1741846258548345001 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,237,237,0,0,0]] |
| 1012 | 10.10.17.206 | 2882 | 0 | 90856054 | 0 | "10.10.17.204:2882" | 1 | 1001 | [{id:-1}] | 2025-03-13 14:10:58.544026 | NULL | 3 | 0 | 0 | 50 | 1 | 0x7f37a8d7ef50 | -1 | 0 | 44528 | 1 | 0 | -1 | 2025-03-13 14:10:58.544026 | NULL | NULL | -1 | 1741846258538842004 | 1741846258555709000 | 1741846258538842004 | 0 | -1 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,463,463,108,0,0]] |
| 1012 | 10.10.17.206 | 2882 | 0 | 90856049 | 0 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.546147 | NULL | 2 | 0 | 0 | 10 | 1 | 0x7f37a9824450 | -1 | 0 | 34651 | 1 | 0 | -1 | 2025-03-13 14:10:58.546147 | NULL | NULL | -1 | 1741846258538842005 | 18446744073709551615 | 1741846258538842005 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,356,356,0,0,0]] |
| 1012 | 10.10.17.205 | 2882 | 0 | 90846150 | 0 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:52.417532 | NULL | 2 | 0 | 0 | 10 | 1 | 0x7fab219b5d50 | -1 | 0 | 45 | 1 | 0 | -1 | 2025-03-13 14:10:52.417532 | NULL | NULL | -1 | 1741846252416052002 | 18446744073709551615 | 1741846252416052002 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,2,2,0,0,0]] |
| 1012 | 10.10.17.205 | 2882 | 0 | 90856058 | 0 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.550374 | NULL | 2 | 0 | 0 | 10 | 1 | 0x7fab21e02ed0 | -1 | 0 | 16428 | 1 | 0 | -1 | 2025-03-13 14:10:58.550374 | NULL | NULL | -1 | 1741846258548345001 | 18446744073709551615 | 1741846258548345001 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,237,237,0,0,0]] |
| 1012 | 10.10.17.205 | 2882 | 0 | 90856054 | 0 | "10.10.17.204:2882" | 1 | 1001 | [{id:-1}] | 2025-03-13 14:10:58.544042 | NULL | 3 | 0 | 0 | 50 | 1 | 0x7fab1fe6fcd0 | -1 | 0 | 44528 | 1 | 0 | -1 | 2025-03-13 14:10:58.544042 | NULL | NULL | -1 | 1741846258538842004 | 1741846258555709000 | 1741846258538842004 | 0 | -1 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,463,463,0,0,0]] |
| 1012 | 10.10.17.205 | 2882 | 0 | 90856049 | 0 | "10.10.17.204:2882" | 0 | 1001 | NULL | 2025-03-13 14:10:58.545099 | NULL | 2 | 0 | 0 | 10 | 1 | 0x7fab21ac80d0 | -1 | 0 | 34651 | 1 | 0 | -1 | 2025-03-13 14:10:58.545099 | NULL | NULL | -1 | 1741846258538842005 | 18446744073709551615 | 1741846258538842005 | 0 | 0 | 1 | -1 | ["id, length, logged, removed, branch_removed, sync_scn", [0,356,356,0,0,0]] |

17 rows in set (0.01 sec)
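From these numbers, the TPS at the two sample points works out to 688 and 543. If monitoring shows single digits, something may be wrong. I will ask the colleagues responsible for OCP to take a look.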
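The 688 and 543 figures come from dividing the change in the stat_id 30005 counter by the 5 seconds between samples. A minimal sketch of that arithmetic, using the timestamps and values from the second set of query results above (the `per_second_rates` helper name is illustrative):

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def per_second_rates(samples):
    """Return the per-second rate between consecutive (timestamp, value) samples."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = (
            datetime.strptime(t1, TIME_FORMAT) - datetime.strptime(t0, TIME_FORMAT)
        ).total_seconds()
        if elapsed <= 0:
            raise ValueError("samples must be in strictly increasing time order")
        rates.append((v1 - v0) / elapsed)
    return rates


# now() and the stat_id 30005 value from each of the three samples.
samples = [
    ("2025-03-13 14:12:59", 90122106),
    ("2025-03-13 14:13:04", 90125546),
    ("2025-03-13 14:13:09", 90128262),
]
print(per_second_rates(samples))  # [688.0, 543.2]
```

Because v$sysstat values are cumulative counters, only the difference between two samples is meaningful, which is why each sample is paired with its own now() timestamp.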
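Right, the TPS shown in monitoring has never gone above 5.
Please log in to this tenant again and run the following: paste the SQL statements below into the client all at once, press Enter, and post the results.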
select now();
select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30007,30009,30011) and (con_id > 1000 or con_id = 1) and class < 1000;
select sleep(5);
select now();
select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30007,30009,30011) and (con_id > 1000 or con_id = 1) and class < 1000;
select sleep(5);
select now();
select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30007,30009,30011) and (con_id > 1000 or con_id = 1) and class < 1000;
select now();
The rate computed from 30009 (transaction rollbacks) does match what monitoring shows.
MySQL [oceanbase]> select now();
+---------------------+
| now() |
+---------------------+
| 2025-03-13 15:06:06 |
+---------------------+
1 row in set (0.00 sec)
MySQL [oceanbase]> select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30007,30009,30011) and (con_id > 1000 or con_id = 1) and class < 1000;
+-----------+---------+---------+
| tenant_id | stat_id | value |
+-----------+---------+---------+
| 1012 | 30007 | 1478789 |
| 1012 | 30009 | 282236 |
| 1012 | 30011 | 0 |
+-----------+---------+---------+
3 rows in set (0.02 sec)
MySQL [oceanbase]> select sleep(5);
+----------+
| sleep(5) |
+----------+
| 0 |
+----------+
1 row in set (5.00 sec)
MySQL [oceanbase]> select now();
+---------------------+
| now() |
+---------------------+
| 2025-03-13 15:06:11 |
+---------------------+
1 row in set (0.00 sec)
MySQL [oceanbase]> select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30007,30009,30011) and (con_id > 1000 or con_id = 1) and class < 1000;
+-----------+---------+---------+
| tenant_id | stat_id | value |
+-----------+---------+---------+
| 1012 | 30007 | 1478804 |
| 1012 | 30009 | 282260 |
| 1012 | 30011 | 0 |
+-----------+---------+---------+
3 rows in set (0.02 sec)
MySQL [oceanbase]> select sleep(5);
+----------+
| sleep(5) |
+----------+
| 0 |
+----------+
1 row in set (5.00 sec)
MySQL [oceanbase]> select now();
+---------------------+
| now() |
+---------------------+
| 2025-03-13 15:06:16 |
+---------------------+
1 row in set (0.00 sec)
MySQL [oceanbase]> select /* MONITOR_AGENT */ con_id tenant_id, stat_id, value from oceanbase.v$sysstat where stat_id IN (30007,30009,30011) and (con_id > 1000 or con_id = 1) and class < 1000;
+-----------+---------+---------+
| tenant_id | stat_id | value |
+-----------+---------+---------+
| 1012 | 30007 | 1478814 |
| 1012 | 30009 | 282280 |
| 1012 | 30011 | 0 |
+-----------+---------+---------+
3 rows in set (0.02 sec)
MySQL [oceanbase]> select now();
+---------------------+
| now() |
+---------------------+
| 2025-03-13 15:06:18 |
+---------------------+
1 row in set (0.00 sec)
MySQL [oceanbase]>
Yes, they match. OCP calculates TPS from 30007, 30009, and 30011, so the monitoring figure itself is correct.
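To see how the three counters compare against the monitoring figures, the same delta-over-interval arithmetic can be applied per stat_id. This is a sketch only: the thread identifies 30009 as transaction rollbacks, while reading 30007 as transaction commits is an assumption here, and 30011 stayed at 0 throughout. The `rates_by_stat` helper name is illustrative.

```python
def rates_by_stat(before, after, elapsed_seconds):
    """Per-second rate for each stat_id between two snapshots of v$sysstat."""
    return {
        stat_id: (after[stat_id] - before[stat_id]) / elapsed_seconds
        for stat_id in before
    }


# Snapshots taken at 15:06:06, 15:06:11, and 15:06:16 (5 seconds apart).
first = {30007: 1478789, 30009: 282236, 30011: 0}
second = {30007: 1478804, 30009: 282260, 30011: 0}
third = {30007: 1478814, 30009: 282280, 30011: 0}

print(rates_by_stat(first, second, 5))  # {30007: 3.0, 30009: 4.8, 30011: 0.0}
print(rates_by_stat(second, third, 5))  # {30007: 2.0, 30009: 4.0, 30011: 0.0}
```

Under that assumption, 30007 advances by 2 to 3 per second and 30009 by 4 to 4.8 per second, which is consistent with both observations in the original post: TPS not exceeding 3, and rollbacks outnumbering commits.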
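With QPS at 20,000, the real TPS must be higher than this, so something is still wrong. With the same workload, our earlier stress test ran at QPS 2000+ and TPS held steady at around 400.
See this post for details:
Transaction rollbacks several times higher than commits for a tenant in OCP monitoring - Community Q&A - OceanBase Community - Distributed Database
In OCP SQL Diagnostics, BEGIN and COMMIT are also each executed more than 300 times per second.
Also, if this TPS figure is correct, then this is once again a case of more rolled-back transactions than committed ones. How should I investigate that further?
Update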
obdiag analyze log --since 1h
The result is normal:
Analyze OceanBase Online Log Summary:
+--------------+----------+------------+--------------------+-------------+-----------+---------+
| Node | Status | FileName | First Found Time | ErrorCode | Message | Count |
+==============+==========+============+====================+=============+===========+=========+
| 10.10.17.204 | PASS | | | | | |
+--------------+----------+------------+--------------------+-------------+-----------+---------+
| 10.10.17.205 | PASS | | | | | |
+--------------+----------+------------+--------------------+-------------+-----------+---------+
| 10.10.17.206 | PASS | | | | | |
+--------------+----------+------------+--------------------+-------------+-----------+---------+
For more details, please run cmd ' cat /root/obdiag_analyze_pack_20250313152820/result_details.txt '
Trace ID: bc8693ba-ffdc-11ef-9201-f8f21e597991
If you want to view detailed obdiag logs, please run: obdiag display-trace bc8693ba-ffdc-11ef-9201-f8f21e597991
[root@OB01 ~]# cat /root/obdiag_analyze_pack_20250313152820/result_details.txt
Analyze OceanBase Online Log Summary:
+--------------+----------+------------+--------------------+-------------+-----------+---------+
| Node | Status | FileName | First Found Time | ErrorCode | Message | Count |
+==============+==========+============+====================+=============+===========+=========+
| 10.10.17.204 | PASS | | | | | |
+--------------+----------+------------+--------------------+-------------+-----------+---------+
| 10.10.17.205 | PASS | | | | | |
+--------------+----------+------------+--------------------+-------------+-----------+---------+
| 10.10.17.206 | PASS | | | | | |
+--------------+----------+------------+--------------------+-------------+-----------+---------+
Details:
Node: 10.10.17.204
Status: PASS
FileName: None
First Found Time: None
ErrorCode: None
Message: None
Count: None
Last Found Time: None
Cause: None
Solution: None
Trace_IDS: None
Node: 10.10.17.205
Status: PASS
FileName: None
First Found Time: None
ErrorCode: None
Message: None
Count: None
Last Found Time: None
Cause: None
Solution: None
Trace_IDS: None
Node: 10.10.17.206
Status: PASS
FileName: None
First Found Time: None
ErrorCode: None
Message: None
Count: None
Last Found Time: None
Cause: None
Solution: None
Trace_IDS: None
[root@OB01 ~]#
In addition, here are the results of the obdiag check run inspection:
[root@OB01 ~]# obdiag check run
check start ...
[WARN] step_base ResultFalseException:mod max memory over 10G,Please check on oceanbase.__all_virtual_memory_info to find some large mod
[WARN] step_base ResultFalseException:mod max memory over 10G,Please check on oceanbase.__all_virtual_memory_info to find some large mod
[WARN] TaskBase execute StepResultFalseException: mod max memory over 10G,Please check on oceanbase.__all_virtual_memory_info to find some large mod .
[WARN] TaskBase execute StepResultFalseException: mod max memory over 10G,Please check on oceanbase.__all_virtual_memory_info to find some large mod .
[WARN] step_base ResultFalseException:mod max memory over 10G,Please check on oceanbase.__all_virtual_memory_info to find some large mod
[WARN] TaskBase execute StepResultFalseException: mod max memory over 10G,Please check on oceanbase.__all_virtual_memory_info to find some large mod .
[WARN] step_base ResultFalseException:number of sql_error_4012 is 154
[WARN] step_base ResultFalseException:number of sql_error_4012 is 154
[WARN] TaskBase execute StepResultFalseException: number of sql_error_4012 is 154 .
[WARN] TaskBase execute StepResultFalseException: number of sql_error_4012 is 154 .
[WARN] step_base ResultFalseException:number of sql_error_4012 is 154
[WARN] TaskBase execute StepResultFalseException: number of sql_error_4012 is 154 .
[WARN] step_base ResultFalseException:tsar is not installed. we can not check tcp retransmission.
[WARN] TaskBase execute StepResultFailException: tsar is not installed. we can not check tcp retransmission.
[WARN] step_base ResultFalseException:tsar is not installed. we can not check tcp retransmission.
[WARN] step_base ResultFalseException:tsar is not installed. we can not check tcp retransmission.
[WARN] TaskBase execute StepResultFailException: tsar is not installed. we can not check tcp retransmission.
[WARN] TaskBase execute StepResultFailException: tsar is not installed. we can not check tcp retransmission.
[WARN] network_speed is and the type is <class 'str'>, not int or float or decimal ! set it to 0.
[WARN] step_base ResultFalseException:network_speed is , less than
[WARN] network_speed is and the type is <class 'str'>, not int or float or decimal ! set it to 0.
[WARN] step_base ResultFalseException:network_speed is , less than
[WARN] TaskBase execute StepResultFailException: network_speed is , less than
[WARN] TaskBase execute StepResultFailException: network_speed is , less than
[WARN] step_base ResultFalseException:net.ipv4.tcp_tw_recycle : 0. recommended: 1.
[WARN] TaskBase execute StepResultFalseException: net.ipv4.tcp_tw_recycle : 0. recommended: 1. .
[WARN] step_base ResultFalseException:net.ipv4.tcp_tw_recycle : 0. recommended: 1.
[WARN] step_base ResultFalseException:net.ipv4.tcp_tw_recycle : 0. recommended: 1.
[WARN] TaskBase execute StepResultFalseException: net.ipv4.tcp_tw_recycle : 0. recommended: 1. .
[WARN] TaskBase execute StepResultFalseException: net.ipv4.tcp_tw_recycle : 0. recommended: 1. .
Check observer finished. For more details, please run cmd' cat ./check_report/obdiag_check_report_observer_2025-03-13-15-48-53.table '
Trace ID: 9b29f77c-ffdf-11ef-b508-f8f21e597991
If you want to view detailed obdiag logs, please run: obdiag display-trace 9b29f77c-ffdf-11ef-b508-f8f21e597991
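Does the message above, TaskBase execute StepResultFalseException: number of sql_error_4012 is 154, need attention?
"In OCP SQL Diagnostics, BEGIN and COMMIT are also each executed more than 300 times per second": please post a screenshot of this.
"this is once again a case of more rolled-back transactions than committed ones": by my calculation, rollbacks are indeed slightly higher than commits. I will look into this further.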