4.2版本SQL执行缓慢

【 使用环境 】测试环境
【 使用版本 】4.2
【问题描述】
一、背景描述
在测试进行数据库迁移到OB 4.2 后,部分sql执行缓慢,长时间无执行结果。此前曾做过OceanBase 3 版本的迁移测试,相同的SQL执行效率很快,sql查询结果进行inserte仅20秒左右。
涉及版本:5.7.25-OceanBase_CE-v4.2.0.0

二、排查问题
1.初步分析
初步分析怀疑为迁移后统计信息问题,于是对表进行统计信息收集,但问题仍然存在。
##call dbms_stats.gather_table_stats(‘DEV_TMP’,‘RTE_TEMP_2’,granularity=>‘GLOBAL’,method_opt=>‘FOR ALL COLUMNS SIZE AUTO’);

表结构:
表结构及索引情况.txt (10.2 KB)

慢sql相关信息:
慢SQL.txt (151.3 KB)

OB 3.X版本执行计划:
执行计划.txt (73.6 KB)

2.执行计划检查
后对执行计划进行分析情况如下:

执行计划来看SQL的效率还是不错的,但是实际执行很慢,对问题SQL进行进一步排查:
SQL在实际执行情况下的执行计划为下图所示,可见与预计的执行计划偏差不大,但是为什么耗时很久呢?

OceanBase 3版本执行计划如下:

3.sql_audit检查
根据SQL_AUDIT的wait_class字段发现sql在执行时处于并发等待状态

于是进行并发相关参数进行检查,并尝试调到并发线程,但是问题仍未解决


由于SQL在使用并行hint的情况下长时间等待无法运行,尝试减少去掉hint进行执行
正在上传:image.png(1)…

SQL可以完成执行,但是执行时间仍然很长。
问题SQL Trace日志如下:

[2023-10-11 13:18:39.841540] WDIAG [SQL.ENG] check_status (ob_px_data_ch_provider.cpp:424) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=23][errcode=-4662] received a interrupt(code={code:-4662, info:{msg:"tid:112273,from:"10.160.34.36:2882",PX ABORT DFO"}}, ret=-4662)
[2023-10-11 13:18:39.841570] WDIAG [SQL.ENG] get_transmit_data_ch (ob_px_sqc_proxy.cpp:267) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=34][errcode=-4662] fail peek data channel from ch_provider(ret=-4662)
[2023-10-11 13:18:39.841578] WDIAG [SQL.ENG] get_data_ch (ob_px_transmit_op.cpp:99) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=8][errcode=-4662] fail get data ch sets from provider(ret=-4662)
[2023-10-11 13:18:39.841584] WDIAG [SQL.ENG] init_channel (ob_px_transmit_op.cpp:317) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=6][errcode=-4662] Fail to get data dtl channel(ret=-4662)
[2023-10-11 13:18:39.841591] WDIAG [SQL.ENG] inner_open (ob_px_transmit_op.cpp:201) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=6][errcode=-4662] fail to exec init_channel(*trans_input)(ret=-4662)
[2023-10-11 13:18:39.841596] WDIAG [SQL.ENG] inner_open (ob_px_reduce_transmit_op.cpp:37) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=5][errcode=-4662] initialize operator context failed(ret=-4662)
[2023-10-11 13:18:39.841602] WDIAG [SQL.ENG] open (ob_operator.cpp:714) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=5][errcode=-4662] Open this operator failed(ret=-4662, op_type="PHY_PX_REDUCE_TRANSMIT")
[2023-10-11 13:18:39.841612] WDIAG [SQL.ENG] execute (ob_px_task_process.cpp:288) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=8][errcode=-4662] fail open dfo op(ret=-4662)
[2023-10-11 13:18:39.841632] WDIAG [SQL.ENG] do_process (ob_px_task_process.cpp:451) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=6][errcode=-4662] failed to execute plan(ret=-4662)
[2023-10-11 13:18:39.841641] WDIAG [SQL.ENG] process (ob_px_task_process.cpp:178) [11494][T1002_PX_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=6][errcode=-4662] failed to process(get_tenant_id()=1002, ret=-4662)
[2023-10-11 13:18:39.842265] INFO  [SQL.ENG] all_task_success (ob_px_sqc_handler.cpp:304) [11494][T1002_PX_G0][T1002][Y0-0000000000000000-0-0] [lt=7] px task not succ(task.get_result()=-4662, i=2)
[2023-10-11 13:18:39.842481] WDIAG update_qc_error_code (ob_px_util.h:589) [112273][T1002_L0_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=5][errcode=-4662]
[2023-10-11 13:18:39.842496] WDIAG update_qc_error_code (ob_px_util.h:589) [112273][T1002_L0_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=10][errcode=-4662]
[2023-10-11 13:18:39.845149] INFO  [STORAGE.BLKMGR] alloc_block (ob_block_manager.cpp:329) [112785][T1002_PX_G0][T1002][YB420AA02225-00060757FA21A99C-0-0] [lt=16] successfully alloc block(macro_id=[1747507](ver=0,mode=0,seq=13793574))
[2023-10-11 13:18:39.845477] INFO  [STORAGE.BLKMGR] async_write (ob_macro_block_handle.cpp:178) [112785][T1002_PX_G0][T1002][YB420AA02225-00060757FA21A99C-0-0] [lt=24] Async write macro block(macro_id_=[1747507](ver=0,mode=0,seq=13793574))
[2023-10-11 13:18:39.845614] INFO  [STORAGE.BLKMGR] alloc_block (ob_block_manager.cpp:329) [2501][T1002_PX_G0][T1002][YB420AA02225-00060757BF51AA7F-0-0] [lt=11] successfully alloc block(macro_id=[1747508](ver=0,mode=0,seq=13793575))
[2023-10-11 13:18:39.846593] INFO  [SQL.ENG] broadcast_dfo (ob_px_interruption.cpp:74) [112273][T1002_L0_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=4] success to send interrupt message(int_code={code:-4662, info:{msg:"tid:112273,from:"10.160.34.36:2882",PX ABORT DFO"}}, addr="10.160.34.37:2882", interrupt_id=task(1696922810486189,4303357292))
[2023-10-11 13:18:39.846619] INFO  [SQL.ENG] broadcast_dfo (ob_px_interruption.cpp:74) [112273][T1002_L0_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=18] success to send interrupt message(int_code={code:-4662, info:{msg:"tid:112273,from:"10.160.34.36:2882",PX ABORT DFO"}}, addr="10.160.34.36:2882", interrupt_id=task(1696922810486189,4303357292))
[2023-10-11 13:18:39.846643] INFO  [SQL.ENG] broadcast_dfo (ob_px_interruption.cpp:74) [112273][T1002_L0_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=6] success to send interrupt message(int_code={code:-4662, info:{msg:"tid:112273,from:"10.160.34.36:2882",PX ABORT DFO"}}, addr="10.160.34.37:2882", interrupt_id=task(1696922810486189,4303357292))
[2023-10-11 13:18:39.846653] INFO  [SQL.ENG] broadcast_dfo (ob_px_interruption.cpp:74) [112273][T1002_L0_G0][T1002][YB420AA02224-0006075830CFA032-0-0] [lt=6] success to send interrupt message(int_code={code:-4662, info:{msg:"tid:112273,from:"10.160.34.36:2882",PX ABORT DFO"}}, addr="10.160.34.36:2882", interrupt_id=task(1696922810486189,4303357292))

【附件】

3 个赞

看了一下你给出的3.x和4.x版本执行成功的时长都在 1min40s左右,所以问题是 加了hint之后查询卡住了?

2 个赞

4版本在不加hint的情况下执行时间1min40s左右,加了hint 反而执行不出来;
3版本时,同样的SQL 加hint的执行时间在1秒以内。
现在的问题是好多的sql都是这种类似的情况 ,查询同3版本相比很慢,而且在开并行的情况下,效果似乎更差

2 个赞

了解了,我们跟进一下加hint查询失败问题

3 个赞

好的,麻烦您了

1 个赞

日志不全,去这个地址上10.160.34.36:2882 继续捞
或直接把几个节点的日志都发下

observer.tar.gz (4.0 MB)

这个日志是 10.160.34.36:2882 的吗

是的

你好 ,请问这个后面有分析什么原因嘛?

:thinking::thinking::thinking:

插眼

日志看上去好像hang住了吧