OCP监控中发现租户里 事物回滚 比提交多几倍

【 使用环境 】生产环境 or 测试环境
【 OB or 其他组件 】
ob: 4.3.3.1 OceanBase 构建版本号:4.3.3.1-101000012024102216
obp:4.3.2.0-26
ocp:4.3.2 版本号: 4.3.2-20240925174740

【问题描述】
如以下监控图所示,事物回滚 比提示多了好几倍,请问这种情况正常吗?

使用obdiag检查也没有异常

[root@OBDB-01 log]# obdiag rca run --scene=transaction_rollback
No time option provided, default processing is based on the last 30 minutes
gather log from_time: 2024-10-30 09:42:37, to_time: 2024-10-30 10:13:37

Gather Ob Log Summary:
+---------------+------------------+--------+--------+------------+
| Node          | Status           | Size   | Time   | PackPath   |
+===============+==================+========+========+============+
| 192.168.51.11 | Error:Empty file | 0.000B | 1 s    |            |
+---------------+------------------+--------+--------+------------+
| 192.168.51.12 | Error:Empty file | 0.000B | 1 s    |            |
+---------------+------------------+--------+--------+------------+
| 192.168.51.13 | Error:Empty file | 0.000B | 2 s    |            |
+---------------+------------------+--------+--------+------------+
For result details, please run cmd ' cat /home/admin/obproxy/log/obdiag_rca/obdiag_transaction_rollback_20241030101237/trans_is_killed/obdiag_gather_pack_20241030101237/result_summary.txt '

rca finished. For more details, the result on './obdiag_rca//obdiag_transaction_rollback_20241030101237' 
You can get the suggest by 'cat ./obdiag_rca//obdiag_transaction_rollback_20241030101237/record'
Trace ID: 6e2f0a8a-9664-11ef-a448-90e2baa8770a
If you want to view detailed obdiag logs, please run: obdiag display-trace 6e2f0a8a-9664-11ef-a448-90e2baa8770a
[root@OBDB-01 log]# cat ./obdiag_rca//obdiag_transaction_rollback_20241030101237/record
+---------------------------------------+
|                 record                |
+------+--------------------------------+
| step | info                           |
+------+--------------------------------+
|  1   | syslog_level data is WDIAG     |
|  2   | no log about 'trans is killed' |
+------+--------------------------------+
The suggest:

observer.log里大量 can not do batch insert opt日志:

[2024-10-30 10:18:01.166768] INFO  [SQL.PC] check_can_do_insert_opt (ob_plan_cache.cpp:996) [175078][T1004_L0_G0][T1004][YB42C0A8330B-000625A485D63512-0-0] [lt=1] can not do batch insert opt(ret=0, can_do_batch=false, upd_params_count=0, ins_params_count=0, batch_count=0, pc_ctx.raw_sql_=INSERT INTO set_hold_member_20241030^M
                    (^M
                        period_no, set_m, member_id, set_no, dict_no_type_id,^M
                        hold_m, max_loss, pay_water, set_type_no^M
                    )^M
                    VALUES^M
                    (20241030124,1.00000,729, '7331', 11,^M
1.00000, 0, 0, '117331'),(20241030124,1.00000,729, '7337', 11,^M
1.00000, 0, 0, '117337')^M
                    ON DUPLICATE KEY UPDATE^M
                    set_m=set_m-VALUES(set_m),^M
                    hold_m = hold_m + VALUES(hold_m)^M
                    ;)
[2024-10-30 10:18:01.167304] INFO  [STORAGE.TRANS] rollback (ob_memtable_context.cpp:844) [174289][T1002_L0_G0][T1002][YB42C0A8330B-000625A3DE39F431-0-0] [lt=0] memtable handle rollback to successfuly(from_seq_no={branch:0, seq:15792}, to_seq_no={branch:0, seq:15786}, remove_cnt=1, elapsed=0, this={ObIMvccCtx={alloc_type=0 ctx_descriptor=0 min_table_version=1729562416268256 max_table_version=1730064641355736 trans_version={val:4611686018427387903, v:0} commit_version={val:0, v:0} lock_wait_start_ts=0 replay_compact_version={val:0, v:0}} end_code=0 tx_status=0 is_readonly=false ref=0 trans_id={txid:319403343} ls_id=1001 row_callback[alloc:8, free:0, unsubmit:8] redo[fill:0,sync_succ:0, sync_fail:0] main_list_len=11 pending_log_size=502 callback_list:{cnt=1 stat:[tx_end=0, rollback_to=2, fast_commit=0, remove_memtable=0, ext_info_log=0] detail:[(log_epoch,length,logged,synced,appended,removed,unlog_removed,branch_removed)|0:(1,11,0,0,13,2,2,0)|]}})
[2024-10-30 10:18:01.167794] INFO  [STORAGE.TRANS] rollback (ob_memtable_context.cpp:844) [175078][T1004_L0_G0][T1004][YB42C0A8330B-000625A485D63512-0-0] [lt=0] memtable handle rollback to successfuly(from_seq_no={branch:0, seq:13698}, to_seq_no={branch:0, seq:13696}, remove_cnt=1, elapsed=0, this={ObIMvccCtx={alloc_type=0 ctx_descriptor=0 min_table_version=1730094182157976 max_table_version=1730094672019448 trans_version={val:4611686018427387903, v:0} commit_version={val:0, v:0} lock_wait_start_ts=0 replay_compact_version={val:0, v:0}} end_code=0 tx_status=0 is_readonly=false ref=0 trans_id={txid:27779777} ls_id=1001 row_callback[alloc:14, free:0, unsubmit:14] redo[fill:0,sync_succ:0, sync_fail:0] main_list_len=19 pending_log_size=981 callback_list:{cnt=1 stat:[tx_end=0, rollback_to=3, fast_commit=0, remove_memtable=0, ext_info_log=0] detail:[(log_epoch,length,logged,synced,appended,removed,unlog_removed,branch_removed)|0:(1,19,0,0,22,3,3,0)|]}})
1 个赞

obdiag这个是分析回滚异常的,这里事务是回滚成功了,回滚量较正常事物高出3~4倍是需要关注的,可能和数据有关系,这个sql有试过手工执行吗

INSERT INTO set_hold_member_20241030^M
                    (^M
                        period_no, set_m, member_id, set_no, dict_no_type_id,^M
                        hold_m, max_loss, pay_water, set_type_no^M
                    )^M
                    VALUES^M
                    (20241030124,1.00000,729, '7331', 11,^M
1.00000, 0, 0, '117331'),(20241030124,1.00000,729, '7337', 11,^M
1.00000, 0, 0, '117337')^M
                    ON DUPLICATE KEY UPDATE^M
                    set_m=set_m-VALUES(set_m),^M
                    hold_m = hold_m + VALUES(hold_m)^M
                    ;)
1 个赞

这个sql 是最常见的一个业务SQL,在ocp的 SQL诊断–TOP SQL 里能进前10, 一小时有执行7万左右的执行记录,都是成功的。
can not do batch insert opt 回滚的日志里为什么会出现这个呢?无法进行批量插入,然后是单行插入的么?

1 个赞

这个需要怎么排查呀?

1 个赞

使用obdiag分析下锁冲突情况

obdiag rca run --scene=lock_conflict
https://www.oceanbase.com/docs/common-obdiag-cn-1000000001491160

指令后加上租户名obdiag rca run --scene=lock_conflict --env tenant_name={租户名}

不好意思,使用这个命令
obdiag rca run --scene=lock_conflict --env tenant_name={租户名}

每隔一分钟运行一次,这些时间里,OCP的监控里回滚事物数量都是提交的4倍左右,不过检查的结果都 No block lock found.

[root@OBDB-01 log]# obdiag rca run --scene=lock_conflict --env tenant_name=OB2
[WARN] rca_scene.execute not need execute: No block lock found. 
Trace ID: be9440ae-9780-11ef-bc73-90e2baa8770a
If you want to view detailed obdiag logs, please run: obdiag display-trace be9440ae-9780-11ef-bc73-90e2baa8770a

[root@OBDB-01 log]# obdiag rca run --scene=lock_conflict --env tenant_name=OB2
[WARN] rca_scene.execute not need execute: No block lock found. 
Trace ID: c60e6f12-9780-11ef-a72d-90e2baa8770a
If you want to view detailed obdiag logs, please run: obdiag display-trace c60e6f12-9780-11ef-a72d-90e2baa8770a

[root@OBDB-01 log]# obdiag rca run --scene=lock_conflict --env tenant_name=OB2
[WARN] rca_scene.execute not need execute: No block lock found. 
Trace ID: d5267f4e-9780-11ef-a332-90e2baa8770a
If you want to view detailed obdiag logs, please run: obdiag display-trace d5267f4e-9780-11ef-a332-90e2baa8770a

[root@OBDB-01 log]# obdiag rca run --scene=lock_conflict --env tenant_name=OB2
[WARN] rca_scene.execute not need execute: No block lock found. 
Trace ID: 0ad36378-9781-11ef-8a02-90e2baa8770a
If you want to view detailed obdiag logs, please run: obdiag display-trace 0ad36378-9781-11ef-8a02-90e2baa8770a

[root@OBDB-01 log]# obdiag rca run --scene=lock_conflict --env tenant_name=OB2
[WARN] rca_scene.execute not need execute: No block lock found. 
Trace ID: 1779182a-9781-11ef-848a-90e2baa8770a
If you want to view detailed obdiag logs, please run: obdiag display-trace 1779182a-9781-11ef-848a-90e2baa8770a

偶尔也会以下之类的结果:

 cat ./obdiag_rca//obdiag_lock_conflict_20241031203912/record
+----------------------------------------------------------------------------------------------------+
|                                               record                                               |
+------+---------------------------------------------------------------------------------------------+
| step | info                                                                                        |
+------+---------------------------------------------------------------------------------------------+
|  1   | by locks_sql; the len is 1                                                                  |
|  2   | get holding_lock trans_id:55113179                                                          |
|  3   | get holding_lock_session_id:3221725185                                                      |
|  4   | wait_lock_trans_id is 55113732                                                              |
|  5   | get wait_lock_session_id:3221725248                                                         |
|  6   | exec sql: SELECT * FROM oceanbase.gv$OB_SQL_AUDIT where SID="3221725185"; to get            |
|      | holding_lock_sql_info.                                                                      |
|  7   | holding_lock_session_id: 3221725185. holding_lock_sql_info save on                          |
|      | ./obdiag_rca//obdiag_lock_conflict_20241031203912/rca_holding_lock_sql_info_3221725185.json |
+------+---------------------------------------------------------------------------------------------+
The suggest: holding_lock_session_id: 3221725185; wait_lock_session_id : 3221725248, sql_info: save on ./obdiag_rca//obdiag_lock_conflict_20241031203912/rca_holding_lock_sql_info_3221725185.json. Lock conflicts can be ended by killing holding_lock_session_id or wait_lock_session_id

./obdiag_rca//obdiag_lock_conflict_20241031203912/rca_holding_lock_sql_info_3221725185.json

这个文件里面有看到SQL信息吗?

有的,7000多条SQL,下面是头两条:

  {
    "SVR_IP": "192.168.51.11",
    "SVR_PORT": 2882,
    "REQUEST_ID": 416818941,
    "SQL_EXEC_ID": 793354735,
    "TRACE_ID": "YB42C0A8330B-000625B4547BAD9A-0-0",
    "SID": 3221725185,
    "CLIENT_IP": "192.168.51.11",
    "CLIENT_PORT": 10346,
    "TENANT_ID": 1004,
    "TENANT_NAME": "OB2",
    "EFFECTIVE_TENANT_ID": 1004,
    "USER_ID": 500002,
    "USER_NAME": "s7",
    "USER_GROUP": 0,
    "USER_CLIENT_IP": "192.168.51.232",
    "DB_ID": 500003,
    "DB_NAME": "test_db2",
    "SQL_ID": "CD0C7C7BB2E48DCF6B94C9A353BE8623",
    "QUERY_SQL": "start transaction;",
    "PLAN_ID": 0,
    "AFFECTED_ROWS": 0,
    "RETURN_ROWS": 0,
    "PARTITION_CNT": 0,
    "RET_CODE": 0,
    "QC_ID": 0,
    "DFO_ID": 0,
    "SQC_ID": 0,
    "WORKER_ID": 0,
    "EVENT": "",
    "P1TEXT": "",
    "P1": 0,
    "P2TEXT": "",
    "P2": 0,
    "P3TEXT": "",
    "P3": 0,
    "LEVEL": 0,
    "WAIT_CLASS_ID": 100,
    "WAIT_CLASS#": 0,
    "WAIT_CLASS": "OTHER",
    "STATE": "MAX_WAIT TIME ZERO",
    "WAIT_TIME_MICRO": 0,
    "TOTAL_WAIT_TIME_MICRO": 0,
    "TOTAL_WAITS": 0,
    "RPC_COUNT": 0,
    "PLAN_TYPE": 0,
    "IS_INNER_SQL": 0,
    "IS_EXECUTOR_RPC": 0,
    "IS_HIT_PLAN": 0,
    "REQUEST_TIME": 1730378229662982,
    "ELAPSED_TIME": 68,
    "NET_TIME": 0,
    "NET_WAIT_TIME": 1,
    "QUEUE_TIME": 11,
    "DECODE_TIME": 1,
    "GET_PLAN_TIME": 37,
    "EXECUTE_TIME": 11,
    "APPLICATION_WAIT_TIME": 0,
    "CONCURRENCY_WAIT_TIME": 0,
    "USER_IO_WAIT_TIME": 0,
    "SCHEDULE_TIME": 0,
    "ROW_CACHE_HIT": 0,
    "BLOOM_FILTER_CACHE_HIT": 0,
    "BLOCK_CACHE_HIT": 0,
    "DISK_READS": 0,
    "RETRY_CNT": 0,
    "TABLE_SCAN": 0,
    "CONSISTENCY_LEVEL": -1,
    "MEMSTORE_READ_ROW_COUNT": 0,
    "SSSTORE_READ_ROW_COUNT": 0,
    "DATA_BLOCK_READ_CNT": 0,
    "DATA_BLOCK_CACHE_HIT": 0,
    "INDEX_BLOCK_READ_CNT": 0,
    "INDEX_BLOCK_CACHE_HIT": 0,
    "BLOCKSCAN_BLOCK_CNT": 0,
    "BLOCKSCAN_ROW_CNT": 0,
    "PUSHDOWN_STORAGE_FILTER_ROW_CNT": 0,
    "REQUEST_MEMORY_USED": 245808,
    "EXPECTED_WORKER_COUNT": 0,
    "USED_WORKER_COUNT": 0,
    "SCHED_INFO": null,
    "FUSE_ROW_CACHE_HIT": 0,
    "PS_CLIENT_STMT_ID": -1,
    "PS_INNER_STMT_ID": -1,
    "TX_ID": 55075126,
    "SNAPSHOT_VERSION": 0,
    "REQUEST_TYPE": 2,
    "IS_BATCHED_MULTI_STMT": 0,
    "OB_TRACE_INFO": null,
    "PLAN_HASH": 0,
    "LOCK_FOR_READ_TIME": 0,
    "PARAMS_VALUE": "",
    "RULE_NAME": "",
    "PARTITION_HIT": 1,
    "TX_INTERNAL_ROUTING": 0,
    "TX_STATE_VERSION": 0,
    "FLT_TRACE_ID": "000625c5-13b4-c149-ec73-8cff4e04d662",
    "PL_TRACE_ID": null,
    "PLSQL_EXEC_TIME": 0,
    "STMT_TYPE": "START_TRANS",
    "TOTAL_MEMSTORE_READ_ROW_COUNT": 0,
    "TOTAL_SSSTORE_READ_ROW_COUNT": 0,
    "PROXY_USER": ""
  },
  {
    "SVR_IP": "192.168.51.11",
    "SVR_PORT": 2882,
    "REQUEST_ID": 416818944,
    "SQL_EXEC_ID": 793354738,
    "TRACE_ID": "YB42C0A8330B-000625B788546DF8-0-0",
    "SID": 3221725185,
    "CLIENT_IP": "192.168.51.11",
    "CLIENT_PORT": 10346,
    "TENANT_ID": 1004,
    "TENANT_NAME": "OB2",
    "EFFECTIVE_TENANT_ID": 1004,
    "USER_ID": 500002,
    "USER_NAME": "s7",
    "USER_GROUP": 0,
    "USER_CLIENT_IP": "192.168.51.232",
    "DB_ID": 500003,
    "DB_NAME": "test_db2",
    "SQL_ID": "AD53A5BA62E7868FAD999ECE1B403327",
    "QUERY_SQL": "\r\nSELECT\r\n    is_async\r\nFROM\r\n    set_async_normal_20241031\r\nWHERE\r\n    period_no = 20241031248\r\n    AND async_type = 1\r\n    AND async_id IN (3037054000018400051,3037054000018400052,3037060300009600030)\r\n    AND is_async = 0\r\nFOR UPDATE;",
    "PLAN_ID": 36480261,
    "AFFECTED_ROWS": 0,
    "RETURN_ROWS": 3,
    "PARTITION_CNT": 1,
    "RET_CODE": 0,
    "QC_ID": 0,
    "DFO_ID": 0,
    "SQC_ID": 0,
    "WORKER_ID": 0,
    "EVENT": "",
    "P1TEXT": "",
    "P1": 0,
    "P2TEXT": "",
    "P2": 0,
    "P3TEXT": "",
    "P3": 0,
    "LEVEL": 0,
    "WAIT_CLASS_ID": 100,
    "WAIT_CLASS#": 0,
    "WAIT_CLASS": "OTHER",
    "STATE": "MAX_WAIT TIME ZERO",
    "WAIT_TIME_MICRO": 0,
    "TOTAL_WAIT_TIME_MICRO": 0,
    "TOTAL_WAITS": 0,
    "RPC_COUNT": 0,
    "PLAN_TYPE": 1,
    "IS_INNER_SQL": 0,
    "IS_EXECUTOR_RPC": 0,
    "IS_HIT_PLAN": 1,
    "REQUEST_TIME": 1730378229663109,
    "ELAPSED_TIME": 400,
    "NET_TIME": 0,
    "NET_WAIT_TIME": 0,
    "QUEUE_TIME": 27,
    "DECODE_TIME": 0,
    "GET_PLAN_TIME": 39,
    "EXECUTE_TIME": 322,
    "APPLICATION_WAIT_TIME": 0,
    "CONCURRENCY_WAIT_TIME": 0,
    "USER_IO_WAIT_TIME": 0,
    "SCHEDULE_TIME": 0,
    "ROW_CACHE_HIT": 0,
    "BLOOM_FILTER_CACHE_HIT": 0,
    "BLOCK_CACHE_HIT": 0,
    "DISK_READS": 0,
    "RETRY_CNT": 0,
    "TABLE_SCAN": 0,
    "CONSISTENCY_LEVEL": 3,
    "MEMSTORE_READ_ROW_COUNT": 3,
    "SSSTORE_READ_ROW_COUNT": 3,
    "DATA_BLOCK_READ_CNT": 0,
    "DATA_BLOCK_CACHE_HIT": 0,
    "INDEX_BLOCK_READ_CNT": 0,
    "INDEX_BLOCK_CACHE_HIT": 0,
    "BLOCKSCAN_BLOCK_CNT": 0,
    "BLOCKSCAN_ROW_CNT": 0,
    "PUSHDOWN_STORAGE_FILTER_ROW_CNT": 0,
    "REQUEST_MEMORY_USED": 418483,
    "EXPECTED_WORKER_COUNT": 0,
    "USED_WORKER_COUNT": 0,
    "SCHED_INFO": null,
    "FUSE_ROW_CACHE_HIT": 0,
    "PS_CLIENT_STMT_ID": -1,
    "PS_INNER_STMT_ID": -1,
    "TX_ID": 55075126,
    "SNAPSHOT_VERSION": 1730378229662025000,
    "REQUEST_TYPE": 2,
    "IS_BATCHED_MULTI_STMT": 0,
    "OB_TRACE_INFO": null,
    "PLAN_HASH": 15821241959359090000,
    "LOCK_FOR_READ_TIME": 0,
    "PARAMS_VALUE": "",
    "RULE_NAME": "",
    "PARTITION_HIT": 1,
    "TX_INTERNAL_ROUTING": 0,
    "TX_STATE_VERSION": 0,
    "FLT_TRACE_ID": "",
    "PL_TRACE_ID": null,
    "PLSQL_EXEC_TIME": 0,
    "STMT_TYPE": "SELECT",
    "TOTAL_MEMSTORE_READ_ROW_COUNT": 3,
    "TOTAL_SSSTORE_READ_ROW_COUNT": 3,
    "PROXY_USER": ""
  },

有了这个数据后,要怎么看? 如何分析定位?

holding_lock_session_id 是3221725185看下是哪个客户端的,是这个会话一直在持锁

对应的客户端出现了持锁,得从这里结合业务排查出持锁的语句

也就是此文中的3.2章节,有应急处理和非应急处理两个方案

https://open.oceanbase.com/blog/9733194352

可以直接查下死锁情况

select * from oceanbase.DBA_OB_DEADLOCK_EVENT_HISTORY;

select * from oceanbase.CDB_OB_DEADLOCK_EVENT_HISTORY;

https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000001049938

死锁直接在OCP里 会话管理->查看死锁记录 就可以看的,是没有死锁记录的。

麻烦发下包含这种报错的完整的 observer.log,可以压缩上传下

obproxy 报错failed,1045,Access denied for user - 社区问答- OceanBase社区-分布式数据库

如这个贴子一样, 换了驱动后,就正常了