ob性能调优的问题


我一个建表语句,执行了16分钟还超时退出了。想问一下,这个从哪些方面查?
我客户端超时是设置的15个小时:SET GLOBAL ob_query_timeout = 9000000000;
事务超时ob_trx_timeout是:86400000000

GV$OB_SQL_AUDIT表信息如下:







第一个图 提示了 报错代码 -4012 超时错误。这个是 OB 很常见的超时错误,可能是语句超时或事务超时。估计你的集群本来就有点问题。先排查集群稳定性情况。

  1. 列举集群当前资源。
select zone,concat(SVR_IP,':',SVR_PORT) observer,
	cpu_capacity_max cpu_total,cpu_assigned_max cpu_assigned,
	cpu_capacity-cpu_assigned_max as cpu_free,
	round(memory_limit/1024/1024/1024,2) as memory_total,
	round((memory_limit-mem_capacity)/1024/1024/1024,2) as system_memory,
	round(mem_assigned/1024/1024/1024,2) as mem_assigned,
	round((mem_capacity-mem_assigned)/1024/1024/1024,2) as memory_free,
	round(log_disk_capacity/1024/1024/1024,2) as log_disk_capacity,
	round(log_disk_assigned/1024/1024/1024,2) as log_disk_assigned,
	round((log_disk_capacity-log_disk_assigned)/1024/1024/1024,2) as log_disk_free,
	round((data_disk_capacity/1024/1024/1024),2) as data_disk,
	round((data_disk_in_use/1024/1024/1024),2) as data_disk_used,
	round((data_disk_capacity-data_disk_in_use)/1024/1024/1024,2) as data_disk_free
from gv$ob_servers;

select t1.name resource_pool_name, t2.`name` unit_config_name, 
	t2.max_cpu, t2.min_cpu, 
	round(t2.memory_size/1024/1024/1024,2) mem_size_gb,
	round(t2.log_disk_size/1024/1024/1024,2) log_disk_size_gb, t2.max_iops, 
	t2.min_iops, t3.unit_id, t3.zone, concat(t3.svr_ip,':',t3.`svr_port`) observer,
	t4.tenant_id, t4.tenant_name
from __all_resource_pool t1
	join __all_unit_config t2 on (t1.unit_config_id=t2.unit_config_id)
	join __all_unit t3 on (t1.`resource_pool_id` = t3.`resource_pool_id`)
	left join __all_tenant t4 on (t1.tenant_id=t4.tenant_id)
order by t1.`resource_pool_id`, t2.`unit_config_id`, t3.unit_id;
    1. 查看集群近期的事件。
SELECT gmt_create ,module,event ,name1 ,value1 ,name2,value2,name3,value3,name4,value4, name5,value5 
FROM `__all_rootservice_event_history` 
WHERE 1=1 
--  AND module IN ('ddl scheduler' )
ORDER BY gmt_create DESC LIMIT 100;

不断过滤掉无关的事件,找跟这个有可能相关的异常事件。

我在日志中找到如下日志:
[2024-04-27 16:32:13.886063] INFO [RPC.OBMYSQL] print_session_info (ob_sql_nio.cpp:1023) [2907][sql_nio0][T0][Y0-0000000000000000-0-0] [lt=15] [sql nio session](*s={this:0x7ff3dc064030, session_id:3221608825, trace_id:YB42C0A8160B-0006170895F3F5F6-0-0, sql_handling_stage:11, sql_initiative_shutdown:false, fd:149, err:0, last_decode_time:1714205772541425, last_write_time:1714205770120820, pending_write_task:{buf:null, sz:0}, need_epoll_trigger_write:false, consume_size:13115, pending_flag:1, may_handling_flag:true, handler_close_flag:false})
[2024-04-27 16:32:28.969036] INFO [RPC.OBMYSQL] print_session_info (ob_sql_nio.cpp:1023) [2907][sql_nio0][T0][Y0-0000000000000000-0-0] [lt=16] [sql nio session](*s={this:0x7ff3dc064030, session_id:3221608825, trace_id:YB42C0A8160B-0006170895F3F5F6-0-0, sql_handling_stage:11, sql_initiative_shutdown:false, fd:149, err:0, last_decode_time:1714205772541425, last_write_time:1714205770120820, pending_write_task:{buf:null, sz:0}, need_epoll_trigger_write:false, consume_size:13115, pending_flag:1, may_handling_flag:true, handler_close_flag:false})
[2024-04-27 16:32:43.991141] INFO [RPC.OBMYSQL] print_session_info (ob_sql_nio.cpp:1023) [2907][sql_nio0][T0][Y0-0000000000000000-0-0] [lt=18] [sql nio session](*s={this:0x7ff3dc064030, session_id:3221608825, trace_id:YB42C0A8160B-0006170895F3F5F6-0-0, sql_handling_stage:11, sql_initiative_shutdown:false, fd:149, err:0, last_decode_time:1714205772541425, last_write_time:1714205770120820, pending_write_task:{buf:null, sz:0}, need_epoll_trigger_write:false, consume_size:13115, pending_flag:1, may_handling_flag:true, handler_close_flag:false})
[2024-04-27 16:32:52.541456] WDIAG [RPC] send (ob_poc_rpc_proxy.h:156) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=16][errcode=-4012] sync rpc execute fail(ret=-4012, addr=“xx.xx.xx.27:2882”, pcode=630)
[2024-04-27 16:32:52.541485] WDIAG [SQL.ENG] execute (ob_table_executor.cpp:569) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=21][errcode=-4012] rpc proxy create table failed(ret=-4012, ret=“OB_TIMEOUT”, dst=“xx.xx.xx.27:2882”)
504152544954494F4E207038312056414C55455320494E2028383129202C0D0A20504152544954494F4E207038322056414C55455320494E2028383229202C0D0A20504152544954494F4E207038332056414C55455320494E2028383329202C0D0A20504152544954494F4E207038342056414C55455320494E2028383429202C0D0A20504152544954494F4E207038352056414C55455320494E2028383529202C0D0A20504152544954494F4E207038362056414C55455320494E2028383629202C0D0A20504152544954494F4E207038372056414C55455320494E2028383729202C0D0A20504152544954494F4E207038382056414C55455320494E2028383829202C0D0A20504152544954494F4E207038392056414C55455320494E2028383929202C0D0A20504152544954494F4E2070392056414C55455320494E20283929202C0D0A20504152544954494F4E207039302056414C55455320494E2028393029202C0D0A20504152544954494F4E207039312056414C55455320494E2028393129202C0D0A20504152544954494F4E207039322056414C55455320494E2028393229202C0D0A20504152544954494F4E207039332056414C55455320494E2028393329202C0D0A20504152544954494F4E207039342056414C55455320494E2028393429202C0D0A20504152544954494F4E207039352056414C55455320494E2028393529202C0D0A20504152544954494F4E207039362056414C55455320494E2028393629202C0D0A20504152544954494F4E207039372056414C55455320494E2028393729202C0D0A20504152544954494F4E207039382056414C55455320494E2028393829202C0D0A20504152544954494F4E207039392056414C55455320494E20283939292029’, ‘return_code’, -4012, ‘xx.xx.xx.11’, 2882))
[2024-04-27 16:32:52.543646] WDIAG [SQL] open_cmd (ob_result_set.cpp:98) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=31][errcode=-4012] execute cmd failed(ret=-4012)
[2024-04-27 16:32:52.543663] WDIAG [SQL] open (ob_result_set.cpp:157) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=6][errcode=-4012] execute plan failed(ret=-4012)
[2024-04-27 16:32:52.543671] WDIAG [SERVER] response_result (ob_sync_cmd_driver.cpp:145) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=4][errcode=-4012] close result set fail(cret=-4012)
[2024-04-27 16:32:52.543676] WDIAG [SERVER] test (ob_query_retry_ctrl.cpp:238) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=4][errcode=0] this worker is timeout, do not need retry(v={force_local_retry:false, stmt_retry_times:0, local_retry_times:0, err_:-4012, err_:“OB_TIMEOUT”, retry_type:0, client_ret:-4012}, THIS_WORKER.get_timeout_ts()=1714206772541425, v.result_.get_stmt_type()=20, v.session_.get_retry_info().get_last_query_retry_err()=0)
[2024-04-27 16:32:52.543687] WDIAG [SERVER] after_func (ob_query_retry_ctrl.cpp:947) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=10][errcode=-4012] [RETRY] check if need retry(v={force_local_retry:false, stmt_retry_times:0, local_retry_times:0, err_:-4012, err_:“OB_TIMEOUT”, retry_type:0, client_ret:-4012}, need_retry=false, THIS_WORKER.can_retry()=true, v.ctx_.multi_stmt_item_={is_part_of_multi_stmt:false, seq_num:0, sql:"
[2024-04-27 16:32:52.543705] WDIAG [SERVER] response_result (ob_sync_cmd_driver.cpp:151) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=17][errcode=-4012] result set open failed, check if need retry(ret=-4012, cli_ret=-4012, retry_ctrl_.need_retry()=0)
[2024-04-27 16:32:52.543768] INFO [SERVER] send_error_packet (obmp_packet_sender.cpp:315) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=4] sending error packet(ob_error=-4012, client error=-4012, extra_err_info=NULL, lbt()=“0x1178639c 0x96a8562 0x9654faf 0x4c0d0e0 0x4997e89 0x499127d 0x498788a 0x4982d1b 0x9570ea4 0x11a51fbf 0x7ff470f0bea5 0x7ff470c348dd”)
[2024-04-27 16:32:52.543824] INFO [SERVER] send_error_packet (obmp_packet_sender.cpp:498) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=12] dump txn free route audit_record(value=5, session->get_sessid()=3221608825, session->get_proxy_sessid()=13882370182548226214)
[2024-04-27 16:32:52.543838] WDIAG [SERVER] do_process (obmp_query.cpp:808) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=12][errcode=-4012] execute query fail(ret=-4012, timeout_timestamp=1714206772541425)
[2024-04-27 16:32:52.544013] TRACE [TRACE] after_process (obmp_base.cpp:145) [26232][T1002_L0_G0][T1002][YB42C0A8160B-0006170895F3F5F6-0-0] [lt=5] [slow query](TRACE=begin_ts=1714205772541445 2024-04-27 08:16:12.541445|[process_begin] u=0 in_queue_time:19, receive_ts:1714205772541425, enqueue_ts:1714205772541429|[start_sql] u=1 addr:{ip:“xx.xx.xx.33”, port:32566}|[query_begin] u=0 trace_id:YB42C0A8160B-0006170895F3F5F6-0-0|[before_processor_run] u=5 |[session] u=2 sid:3221608825, tenant_id:1002|[parse_begin] u=40 stmt:"

这个是网络原因导致的吗?

另外,我在日志中找到如下异常信息:
[2024-04-27 15:45:48.210959] WDIAG [LIB] create_directory (file_directory_utils.cpp:122) [3103][T1001_L0_G0][T1001][Y0-0000000000000000-0-0] [lt=0][errcode=-4009] create directory failed.(directory_path=cgroup/tenant_1002/tenant_1001/OBCG_DEFAULT, errno=2, errmsg=“No such file or directory”, ret=-4009)
[2024-04-27 15:45:48.210983] WDIAG [SERVER.OMT] init_cgroup_dir_ (ob_cgroup_ctrl.cpp:852) [3103][T1001_L0_G0][T1001][Y0-0000000000000000-0-0] [lt=24][errcode=-4009] create tenant cgroup dir failed(ret=-4009, cgroup_path=“cgroup/tenant_1002/tenant_1001/OBCG_DEFAULT”)
不知道跟这个有关系没。

OB 的这些日志不是很好懂,可以先发一下我前面说的那些基础信息。分析日志是最后的手段。

SELECT *
FROM __all_rootservice_event_history
WHERE 1=1
AND module IN (‘ddl’ )
ORDER BY gmt_create DESC LIMIT 100; 如下:

select t1.name resource_pool_name, t2.name unit_config_name,
t2.max_cpu, t2.min_cpu,
round(t2.memory_size/1024/1024/1024,2) mem_size_gb,
round(t2.log_disk_size/1024/1024/1024,2) log_disk_size_gb, t2.max_iops,
t2.min_iops, t3.unit_id, t3.zone, concat(t3.svr_ip,’:’,t3.svr_port) observer,
t4.tenant_id, t4.tenant_name
from __all_resource_pool t1
join __all_unit_config t2 on (t1.unit_config_id=t2.unit_config_id)
join __all_unit t3 on (t1.resource_pool_id = t3.resource_pool_id)
left join __all_tenant t4 on (t1.tenant_id=t4.tenant_id)
order by t1.resource_pool_id, t2.unit_config_id, t3.unit_id;信息如下:

select zone,concat(SVR_IP,’:’,SVR_PORT) observer,
cpu_capacity_max cpu_total,cpu_assigned_max cpu_assigned,
cpu_capacity-cpu_assigned_max as cpu_free,
round(memory_limit/1024/1024/1024,2) as memory_total,
round((memory_limit-mem_capacity)/1024/1024/1024,2) as system_memory,
round(mem_assigned/1024/1024/1024,2) as mem_assigned,
round((mem_capacity-mem_assigned)/1024/1024/1024,2) as memory_free,
round(log_disk_capacity/1024/1024/1024,2) as log_disk_capacity,
round(log_disk_assigned/1024/1024/1024,2) as log_disk_assigned,
round((log_disk_capacity-log_disk_assigned)/1024/1024/1024,2) as log_disk_free,
round((data_disk_capacity/1024/1024/1024),2) as data_disk,
round((data_disk_in_use/1024/1024/1024),2) as data_disk_used,
round((data_disk_capacity-data_disk_in_use)/1024/1024/1024,2) as data_disk_free
from gv$ob_servers;信息如下:

直接三步走:

  1. obdiag check 巡检一下
  2. obdiag analyze log 日志分析一把
  3. obdiag gather scene run --scene=observer.unknown 未知问题信息采集回来

obdiag文档:OceanBase分布式数据库-海量数据 笔笔算数

把巡检结果和日志分析结果先发出来,定位不了再把3收集的信息发出来

找到问题了,分配的内存太少了