obclient fails to connect while debugging the observer with gdb. Does the process being debugged with gdb need to be started manually, rather than by obd?

【Product Name】

obd cluster display test
+---------------------------------------------+
|                  observer                   |
+-----------+---------+------+-------+--------+
| ip        | version | port | zone  | status |
+-----------+---------+------+-------+--------+
| 127.0.0.1 | 3.1.1   | 2881 | zone1 | active |
+-----------+---------+------+-------+--------+

【Problem Description】

**Step 1:** gdb attach to the single observer process started by obd (single-node deployment).

Set a breakpoint on handle_physical_plan: thread apply all break ob_sql.cpp:3160
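For reference, a plain `break` in gdb already applies to every thread, so `thread apply all break` only creates one duplicate breakpoint per thread at the same location. A minimal attach session might look like the sketch below; the PID is just a placeholder for the observer process started by obd:

```
# find the observer process started by obd (PID below is an example)
$ pgrep -f observer
49000

# attach and set a single breakpoint; a breakpoint is global to all threads
$ gdb -p 49000
(gdb) break ob_sql.cpp:3160
(gdb) continue
```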

**Step 2:** Connect with obclient.

Result: every thread reports that handle_physical_plan failed to execute, and then hundreds of threads repeat the same behavior. Everything hangs before the query even gets started.

(gdb) c
Continuing.
[Switching to Thread 0x7f34c5c40700 (LWP 49067)]
(gdb) c
Continuing.
[Switching to Thread 0x7f34ac7ec700 (LWP 49120)]
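This hang is what gdb's default all-stop mode looks like: whenever any of the hundreds of worker threads reaches the breakpoint, gdb stops every thread, and each `c` only runs until the next thread reaches the same line. If the breakpoint must be kept, one option is non-stop mode, where only the thread that hits the breakpoint stops. A sketch, assuming the same placeholder PID; note that non-stop mode has to be enabled before attaching:

```
$ gdb
(gdb) set pagination off
(gdb) set non-stop on        # must be set before attaching
(gdb) attach 49000
(gdb) break ob_sql.cpp:3160
(gdb) continue -a            # resume all threads; only the thread that hits the breakpoint stops
```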

**Step 3:** Check the logs; there are errors:

  1. Leader election fails.
  2. Internal SQL queries fail or time out.

Does the process being debugged with gdb need to be started manually, rather than by obd?

[2022-03-03 17:11:43.395627] WARN [CLOG] runTimerTask (ob_log_event_task_V2.cpp:79) [918][1123][Y0-0000000000000000] [lt=4] [dc=0] run time out of range(partition_key_={tid:1099511627944, partition_id:3, part_cnt:0}, current_ts=1646298703391970, delta=274795)
[2022-03-03 17:11:43.395629] WARN [CLOG] runTimerTask (ob_log_event_task_V2.cpp:79) [907][1101][Y0-0000000000000000] [lt=8] [dc=0] run time out of range(partition_key_={tid:1101710651031687, partition_id:0, part_cnt:0}, current_ts=1646298703391970, delta=274795)
failed to process record(executor={ObIExecutor:, sql:" SELECT * FROM __all_root_table WHERE (tenant_id, table_id, partition_
[2022-03-03 17:11:43.109337] WARN [STORAGE.TRANS] process_cluster_heartbeat_rpc (ob_weak_read_service.cpp:485) [708][1069][YB427F000001-0005D94CBFAF220D] [lt=15] [dc=0] tenant weak read service process cluster heartbeat RPC fail(ret=-4341, ret="OB_NOT_IN_SERVICE", tenant_id=1002, req={req_server:"127.0.0.1:2882", version:1646298698069432, valid_part_count:15, total_part_count:161, generate_timestamp:1646298703109151}, twrs={inited:true, tenant_id:1002, self:"127.0.0.1:2882", svr_version_mgr:{server_version:{version:1646298698069432, total_part_count:161, valid_inner_part_count:15, valid_user_part_count:0, epoch_tstamp:1646298702258652}, server_version_for_stat:{version:1646298698069579, total_part_count:54, valid_inner_part_count:6, valid_user_part_count:0, epoch_tstamp:1646298702793259}}})
[2022-03-03 17:11:43.109488] WARN [CLOG] need_update_leader_ (ob_log_state_mgr.cpp:2499) [913][1113][Y0-0000000000000000] [lt=7] [dc=0] get_elect_leader_ failed, leader_ is valid, need update(ret=-7006, partition_key={tid:1099511627923, partition_id:1, part_cnt:0}, self="127.0.0.1:2882", leader_={server:"127.0.0.1:2882", cluster_id:1}, bool_ret=true)
[2022-03-03 17:11:43.109515] WARN [CLOG] need_update_leader_ (ob_log_state_mgr.cpp:2499) [913][1113][Y0-0000000000000000] [lt=12] [dc=0] get_elect_leader_ failed, leader_ is valid, need update(ret=-7006, partition_key={tid:1099511627923, partition_id:1, part_cnt:0}, self="127.0.0.1:2882", leader_={server:"127.0.0.1:2882", cluster_id:1}, bool_ret=true)
[2022-03-03 17:11:43.109538] WARN [CLOG] need_update_leader_ (ob_log_state_mgr.cpp:2499) [917][1121][Y0-0000000000000000] [lt=4] [dc=0] get_elect_leader_ failed, leader_ is valid, need update(ret=-7006, partition_key={tid:1099511627957, partition_id:9, part_cnt:0}, self="127.0.0.1:2882", leader_={server:"127.0.0.1:2882", cluster_id:1}, bool_ret=true)
[2022-03-03 17:11:43.109566] WARN [CLOG] need_update_leader_ (ob_log_state_mgr.cpp:2499) [917][1121][Y0-0000000000000000] [lt=13] [dc=0] get_elect_leader_ failed, leader_ is valid, need update(ret=-7006, partition_key={tid:1099511627957, partition_id:9, part_cnt:0}, self="127.0.0.1:2882", leader_={server:"127.0.0.1:2882", cluster_id:1}, bool_ret=true)
[2022-03-03 17:11:43.109615] WARN [CLOG] need_update_leader_ (ob_log_state_mgr.cpp:2499) [917][1121][Y0-0000000000000000] [lt=12] [dc=0] get_elect_leader_ failed, leader_ is valid, need update(ret=-7006, partition_key={tid:1099511627992, partition_id:11, part_cnt:0}, self="127.0.0.1:2882", leader_={server:"127.0.0.1:2882", cluster_id:1}, bool_ret=true)
[2022-03-03 17:11:43.109634] WARN [CLOG] need_update_leader_ (ob_log_state_mgr.cpp:2499) [917][1121][Y0-0000000000000000] [lt=7] [dc=0] get_elect_leader_ failed, leader_ is valid, need update(ret=-7006, partition_key={tid:1099511627992, partition_id:11, part_cnt:0}, self="127.0.0.1:2882", leader_={server:"127.0.0.1:2882", cluster_id:1}, bool_ret=true)
[2022-03-03 17:11:43.109678] WARN [RPC.OBRPC] rpc_call (ob_rpc_proxy.ipp:567) [945][1175][Y0-0000000000000000] [lt=5] [dc=0] execute rpc fail(ret=-4638, dst="127.0.0.1:2882")
[2022-03-03 17:11:43.109688] WARN log_user_error_and_warn (ob_rpc_proxy.cpp:300) [945][1175][Y0-0000000000000000] [lt=9] [dc=0]
[2022-03-03 17:11:43.109703] WARN [SHARE.PT] get (ob_rpc_partition_table.cpp:83) [945][1175][Y0-0000000000000000] [lt=5] [dc=0] fetch root partition through rpc failed(rs_addr="127.0.0.1:2882", ret=-4638)
[2022-03-03 17:11:43.109920] WARN [SERVER] get_master_root_server (ob_service.cpp:3373) [397][482][YB427F000001-0005D94CC2AF1E3E] [lt=6] [dc=0] not master rootserver(ret=-4638, master_rs="127.0.0.1:2882")
[2022-03-03 17:11:43.109930] WARN [SERVER] process (ob_rpc_processor_simple.cpp:1897) [397][482][YB427F000001-0005D94CC2AF1E3E] [lt=9] [dc=0] failed to get master root server(ret=-4638)
[2022-03-03 17:11:43.109928] WARN [SQL.OPT] all_select_leader (ob_phy_table_location_info.cpp:536) [49112][224][YB427F000001-0005D94CC20F1EAF] [lt=5] [dc=0] fail to get leader(ret=-4654, phy_part_loc_info.get_partition_location()={table_id:{value:1099511627778, first:1, second:2}, partition_id:0, partition_cnt:0, pg_key:{tid:1099511627778, partition_id:0, part_cnt:0}, replica_locations:[{server_:"127.0.0.1:2882", role_:2, sql_port_:2881, replica_type_:0, attr_:{pos_type_:4, merge_status_:2, zone_type_:2, zone_status_:3, start_service_time_:0, server_stop_time_:0, server_status_:4}, is_filter_:false, replica_idx_:-1}], renew_time:1646298703108201, is_mark_fail:false})
[2022-03-03 17:11:43.109953] WARN [SQL.OPT] strong_select_replicas (ob_log_plan.cpp:2009) [49112][224][YB427F000001-0005D94CC20F1EAF] [lt=10] [dc=0] fail to all select leader(ret=-4654, *phy_tbl_loc_info={table_location_key:1099511627778, ref_table_id:1099511627778, phy_part_loc_info_list:[{partition_location:{table_id:{value:1099511627778, first:1, second:2}, partition_id:0, partition_cnt:0, pg_key:{tid:1099511627778, partition_id:0, part_cnt:0}, replica_locations:[{server_:"127.0.0.1:2882", role_:2, sql_port_:2881, replica_type_:0, attr_:{pos_type_:4, merge_status_:2, zone_type_:2, zone_statu

I started the observer with obd, not directly as described in https://github.com/oceanbase/oceanbase/wiki/how_to_debug.

This is most likely caused by where the breakpoint was set: by default the observer is constantly executing a large number of internal queries, and they all pass through this code path.

// thread apply all break ob_sql.cpp:3160   (this breakpoint must not be set here)
int ObSql::handle_physical_plan
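If a breakpoint on ObSql::handle_physical_plan is really needed, one way to keep the internal queries from constantly triggering it is to restrict the breakpoint to the single worker thread that serves your obclient session. A sketch using standard gdb syntax; the thread number is only an example:

```
(gdb) info threads                       # identify the thread handling your obclient connection
(gdb) break ob_sql.cpp:3160 thread 224   # the breakpoint only triggers for this thread
```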

Once the system is up and you attach GDB to debug the code, try not to add breakpoints on critical paths. Otherwise the related internal logic gets blocked, which easily leads to unexpected failures.
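In line with this advice, a less invasive alternative is to avoid stopping breakpoints entirely: either log and continue with `dprintf`, or just grab a one-shot snapshot of all stacks and detach right away. A sketch using standard gdb commands; the source line is the one from the original post:

```
# print-and-continue instead of stopping at the hot path
(gdb) dprintf ob_sql.cpp:3160, "handle_physical_plan hit in thread %d\n", $_thread
(gdb) continue

# or: take a snapshot of every thread's stack, then detach immediately
(gdb) thread apply all bt
(gdb) detach
```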

A unit test is a better fit for this kind of debugging.

You are welcome to join the OceanBase developer technical discussion group: 44665211.