OceanBase cluster does not respond to SQL queries

Symptoms:
1. The OCP service fails to start normally. Its log reports: Exception in thread "pool-ocp_metric_dump-thread-5" org.springframework.jdbc.UncategorizedSQLException: PreparedStatementCallback; uncategorized SQLException for SQL [replace into metric_hour_data (series_id, timestamp, value) values (?, ?, ?)]; SQL state [HY000]; error code [4012]; Statement is timeout; nested exception is java.sql.BatchUpdateException: Statement is timeout
2. Logging in through obproxy to check the cluster status shows that SQL execution times out and no operations can be performed. All three observer nodes keep flooding their logs as follows:
observer1:
[2022-06-02 10:25:05.796011] WARN [CLOG] notify_log_missing (ob_partition_log_packet_handler.cpp:675) [21368][33373][Y0-0000000000000000] [lt=6] [dc=0] notify_log_missing failed(ret=-4268, ctx={type:9, server:"172.16.56.62:2882", pkey:{tid:1103909674337666, partition_id:0, part_cnt:0}, req_id:0, size:10, cluster_id:1}, start_log_id=144, is_in_member_list=true)
[2022-06-02 10:25:05.796045] WARN [STORAGE] schedule (ob_partition_migrator.cpp:1594) [13029][546][Y0-0000000000000000] [lt=7] [dc=0] failed to inner schedule task(ret=-4268, arg={key:{tid:1103909674337584, partition_id:0, part_cnt:0}, dst:{server:“172.16.56.51:2882”, timestamp:-1, flag:0, replica_type:0, region:“default_region”, memstore_percent:100}, src:{server:“172.16.56.51:2882”, timestamp:-1, flag:0, replica_type:0, region:“default_region”, memstore_percent:100}, data_src:{server:“172.16.56.51:2882”, timestamp:-1, flag:0, replica_type:0, region:“default_region”, memstore_percent:100}, quorum:0, type:“REBUILD_REPLICA_OP”, base_version:“0-0-0”, restore_arg:{tenant_id_:0, locality:"", primary_zone:"", pool_list:"", tcp_invited_nodes:"", base_data_version:-1, curr_data_version:-1, tenant_id:0, restore_timeu:0, backup_schema_id:18446744073709551615, partition_id:-1, schema_id_list:[], sql_info:{backup_table_id:18446744073709551615, backup_index_id:18446744073709551615, backup_tablegroup_id:18446744073709551615, backup_table_name:"", backup_index_name:"", backup_database_name:"", backup_tablegroup_name:""}, schema_version:0, uri_header:"", cluster_name:"", storage_info:"", schema_id_list:[]}, validate_arg:{backup_dest:"", cluster_name:"", uri_header:"", storage_info:"", job_id:-1, task_id:-1, trace_id:Y0-0000000000000000, server:“0.0.0.0”, cluster_id:-1, pg_key:{tid:18446744073709551615, partition_id:-1, part_idx:268435455, subpart_idx:268435455}, table_id:-1, partition_id:-1, tenant_id:18446744073709551615, incarnation:-1, archive_round:-1, backup_set_id:-1, total_partition_count:-1, total_macro_block_count:-1, clog_end_timestamp:-1, start_log_id:-1, end_log_id:-1, log_size:-1, is_dropped_tenant:false, need_validate_clog:true, full_backup_set_id:-1, inc_backup_set_id:-1, cluster_version:0, backup_date:0, compatible:0}, backup_arg:{uri_header:"", storage_info:"", incarnation:0, tenant_id:-1, backup_set_id:0, backup_data_version:0, backup_schema_version:0, 
prev_full_backup_set_id:0, prev_inc_backup_set_id:0, prev_data_version:0, task_id:0, backup_type:3, backup_date:0, prev_backup_date:0, compatible:0}, phy_restore_arg:{restore_info:{backup_dest:"", cluster_name:"", cluster_id:0, incarnation:0, tenant_id:0, full_backup_set_id:0, inc_backup_set_id:0, log_archive_round:0, restore_snapshot_version:-1, restore_start_ts:0, compatible:0, cluster_version:0, restore_schema_version:-1, backup_date:0, multi_restore_path_list:{backup_set_path_list:[], backup_piece_path_list:[]}}, pg_key:{tid:18446744073709551615, partition_id:-1, part_idx:268435455, subpart_idx:268435455}, restore_data_version:0}, backup_backupset_arg:{src_uri_header:"", src_storage_info:"", dst_uri_header:"", dst_storage_info:"", cluster_name:"", server:“0.0.0.0”, copy_id:-1, job_id:-1, tenant_id:18446744073709551615, cluster_id:-1, incarnation:-1, pg_key:{tid:18446744073709551615, partition_id:-1, part_idx:268435455, subpart_idx:268435455}, backup_set_id:-1, prev_full_backup_set_id:-1, prev_inc_backup_set_id:-1, delete_input:false, backup_type:3, cluster_version:0, backup_date:0, compatible:0}, index_id:18446744073709551615, priority:0, cluster_id:-1, restore_version:0, switch_epoch:0}, task_id=YB42AC103833-0005DFABDE9B3FC3)
[2022-06-02 10:25:05.796077] INFO [STORAGE] ob_partition_service.cpp:5039 [21368][33373][Y0-0000000000000000] [lt=6] [dc=0] begin rebuild_replica(arg={key:{tid:1103909674337736, partition_id:0, part_cnt:0}, dst:{server:“172.16.56.51:2882”, timestamp:-1, flag:0, replica_type:0, region:“default_region”, memstore_percent:100}, src:{server:“172.16.56.51:2882”, timestamp:-1, flag:0, replica_type:0, region:“default_region”, memstore_percent:100}, data_src:{server:“172.16.56.51:2882”, timestamp:-1, flag:0, replica_type:0, region:“default_region”, memstore_percent:100}, quorum:0, type:“REBUILD_REPLICA_OP”, base_version:“0-0-0”, restore_arg:{tenant_id_:0, locality:"", primary_zone:"", pool_list:"", tcp_invited_nodes:"", base_data_version:-1, curr_data_version:-1, tenant_id:0, restore_timeu:0, backup_schema_id:18446744073709551615, partition_id:-1, schema_id_list:[], sql_info:{backup_table_id:18446744073709551615, backup_index_id:18446744073709551615, backup_tablegroup_id:18446744073709551615, backup_table_name:"", backup_index_name:"", backup_database_name:"", backup_tablegroup_name:""}, schema_version:0, uri_header:"", cluster_name:"", storage_info:"", schema_id_list:[]}, validate_arg:{backup_dest:"", cluster_name:"", uri_header:"", storage_info:"", job_id:-1, task_id:-1, trace_id:Y0-0000000000000000, server:“0.0.0.0”, cluster_id:-1, pg_key:{tid:18446744073709551615, partition_id:-1, part_idx:268435455, subpart_idx:268435455}, table_id:-1, partition_id:-1, tenant_id:18446744073709551615, incarnation:-1, archive_round:-1, backup_set_id:-1, total_partition_count:-1, total_macro_block_count:-1, clog_end_timestamp:-1, start_log_id:-1, end_log_id:-1, log_size:-1, is_dropped_tenant:false, need_validate_clog:true, full_backup_set_id:-1, inc_backup_set_id:-1, cluster_version:0, backup_date:0, compatible:0}, backup_arg:{uri_header:"", storage_info:"", incarnation:0, tenant_id:-1, backup_set_id:0, backup_data_version:0, backup_schema_version:0, prev_full_backup_set_id:0, 
prev_inc_backup_set_id:0, prev_data_version:0, task_id:0, backup_type:3, backup_date:0, prev_backup_date:0, compatible:0}, phy_restore_arg:{restore_info:{backup_dest:"", cluster_name:"", cluster_id:0, incarnation:0, tenant_id:0, full_backup_set_id:0, inc_backup_set_id:0, log_archive_round:0, restore_snapshot_version:-1, restore_start_ts:0, compatible:0, cluster_version:0, restore_schema_version:-1, backup_date:0, multi_restore_path_list:{backup_set_path_list:[], backup_piece_path_list:[]}}, pg_key:{tid:18446744073709551615, partition_id:-1, part_idx:268435455, subpart_idx:268435455}, restore_data_version:0}, backup_backupset_arg:{src_uri_header:"", src_storage_info:"", dst_uri_header:"", dst_storage_info:"", cluster_name:"", server:“0.0.0.0”, copy_id:-1, job_id:-1, tenant_id:18446744073709551615, cluster_id:-1, incarnation:-1, pg_key:{tid:18446744073709551615, partition_id:-1, part_idx:268435455, subpart_idx:268435455}, backup_set_id:-1, prev_full_backup_set_id:-1, prev_inc_backup_set_id:-1, delete_input:false, backup_type:3, cluster_version:0, backup_date:0, compatible:0}, index_id:18446744073709551615, priority:0, cluster_id:-1, restore_version:0, switch_epoch:0})
[2022-06-02 10:25:05.796100] INFO [STORAGE] ob_partition_service.cpp:5055 [21368][33373][Y0-0000000000000000] [lt=14] [dc=0] local partition exists(arg.key_={tid:1103909674337736, partition_id:0, part_cnt:0})
[2022-06-02 10:25:05.796180] WARN [STORAGE] rebuild_replica (ob_partition_service.cpp:5077) [13029][546][Y0-0000000000000000] [lt=29] [dc=0] fail to schedule migrate task(arg={key:{tid:1103909674337584, partition_id:0, part_cnt:0}, dst:{server:"172.16.56.^

observer2:
[2022-06-02 10:25:43.116221] ERROR [CLOG] submit_log (ob_partition_log_service.cpp:524) [21435][276][Y0-0000000000000000] [lt=12] [dc=0] submit_log failed(ret=-4264, partition_key={tid:1099511627916, partition_id:0, part_cnt:0}, buff=0x7f3f2b3ede80, size=462, base_timestamp=1653762310590836, is_trans_log=true, role=1, state=4) BACKTRACE:0x99be10e 0x9793491 0x2914032 0x2913c3b 0x2913840 0x77afb38 0x76df9aa 0x8a909d8 0x8a8ee16 0x994f377 0x3385bef 0x2c70eb2 0x9746965 0x9745692 0x974214f
[2022-06-02 10:25:43.116237] ERROR [CLOG] submit_log (ob_partition_log_service.cpp:524) [21435][276][Y0-0000000000000000] [lt=12] [dc=0] submit_log failed(ret=-4264, partition_key={tid:1099511627916, partition_id:0, part_cnt:0}, buff=0x7f4045e4d8e0, size=521, base_timestamp=1653885443056118, is_trans_log=true, role=1, state=4) BACKTRACE:0x99be10e 0x9793491 0x2914032 0x2913c3b 0x2913840 0x77afb38 0x76df9aa 0x8a909d8 0x8a8ee16 0x994f377 0x3385bef 0x2c70eb2 0x9746965 0x9745692 0x974214f
[2022-06-02 10:25:43.116253] ERROR [CLOG] submit_log (ob_partition_log_service.cpp:524) [21435][276][Y0-0000000000000000] [lt=12] [dc=0] submit_log failed(ret=-4264, partition_key={tid:1099511627930, partition_id:6, part_cnt:0}, buff=0x7f3fd83d35f0, size=497, base_timestamp=1653591303014780, is_trans_log=true, role=1, state=4) BACKTRACE:0x99be10e 0x9793491 0x2914032 0x2913c3b 0x2913840 0x77afb38 0x76df9aa 0x8a909d8 0x8a8ee16 0x994f377 0x3385bef 0x2c70eb2 0x9746965 0x9745692 0x974214f
[2022-06-02 10:25:43.116269] ERROR [CLOG] submit_log (ob_partition_log_service.cpp:524) [21435][276][Y0-0000000000000000] [lt=12] [dc=0] submit_log failed(ret=-4264, partition_key={tid:1099511627930, partition_id:11, part_cnt:0}, buff=0x7f403c7f64c0, size=514, base_timestamp=1654116886877989, is_trans_log=true, role=1, state=4) BACKTRACE:0x99be10e 0x9793491 0x2914032 0x2913c3b 0x2913840 0x77afb38 0x76df9aa 0x8a909d8 0x8a8ee16 0x994f377 0x3385bef 0x2c70eb2 0x9746965 0x9745692 0x974214f
[2022-06-02 10:25:43.116280] WARN [CLOG] get_cursor (ob_log_engine.cpp:2363) [21575][528][Y0-0000000000000000] [lt=5] [dc=0] get_cursor failed(ret=-6309)
[2022-06-02 10:25:43.116285] ERROR [CLOG] submit_log (ob_partition_log_service.cpp:524) [21435][276][Y0-0000000000000000] [lt=12] [dc=0] submit_log failed(ret=-4264, partition_key={tid:1099511627916, partition_id:0, part_cnt:0}, buff=0x7f3f7100c600, size=462, base_timestamp=1654005299984961, is_trans_log=true, role=1, state=4) BACKTRACE:0x99be10e 0x9793491 0x2914032 0x2913c3b 0x2913840 0x77afb38 0x76df9aa 0x8a909d8 0x8a8ee16 0x994f377 0x3385bef 0x2c70eb2 0x9746965 0x9745692 0x974214f
[2022-06-02 10:25:43.116301] ERROR [CLOG] submit_log (ob_partition_log_service.cpp:524) [21435][276][Y0-0000000000000000] [lt=11] [dc=0] submit_log failed(ret=-4264, partition_key={tid:1099511627930, partition_id:6, part_cnt:0}, buff=0x7f404ae702b0, size=4^

observer3:
[2022-06-02 10:26:16.452893] WARN [CLOG] need_update_leader_ (ob_log_state_mgr.cpp:2499) [15333][1171][Y0-0000000000000000] [lt=18] [dc=0] get_elect_leader_ failed, leader_ is valid, need update(ret=-7007, partition_key={tid:1107208209170640, partition_id:0, part_cnt:0}, self="172.16.56.59:2882", leader_={server:"172.16.56.62:2882", cluster_id:1}, bool_ret=true)
[2022-06-02 10:26:16.452986] WARN [CLOG] need_update_leader_ (ob_log_state_mgr.cpp:2488) [15333][1171][Y0-0000000000000000] [lt=5] [dc=0] get_elect_leader_ failed(partition_key={tid:1103909674337697, partition_id:10, part_cnt:0}, ret=-7007, replica_type=0)
[2022-06-02 10:26:16.452994] WARN [CLOG] need_update_leader_ (ob_log_state_mgr.cpp:2499) [15333][1171][Y0-0000000000000000] [lt=4] [dc=0] get_elect_leader_ failed, le
[2022-06-02 10:26:16.481897] INFO [STORAGE] ob_pg_sstable_garbage_collector.cpp:188 [14785][258][Y0-0000000000000000] [lt=18] [dc=0] do one gc free sstable by queue(ret=0, free sstable cnt=0)
[2022-06-02 10:26:16.489651] INFO [ARCHIVE] ob_archive_mgr.cpp:877 [15332][1169][Y0-0000000000000000] [lt=11] [dc=0] need_start(incarnation=-1, round=-1, backup_info={status:{tenant_id:1, copy_id:0, start_ts:1652342177576411, checkpoint_ts:1653292261406941, status:3, incarnation:1, round:1, status_str:“DOING”, is_mark_deleted:false, is_mount_file_created:true, compatible:1, backup_piece_id:11, start_piece_id:1}, backup_dest:“file:///tidb-data/backup”}, in_archive=false, is_in_backup=true)
[2022-06-02 10:26:16.489716] INFO [ARCHIVE] ob_archive_path.cpp:552 [15332][1169][Y0-0000000000000000] [lt=40] [dc=0] build_server_start_archive_path succ(incarnation=1, round=1, dest_path=“file:///tidb-data/backup/obcluster/1/incarnation_1/clog_info/2_172.16.56.59_2882”, info="")
[2022-06-02 10:26:16.489774] WARN [STORAGE] is_exist (ob_storage_file.cpp:125) [15332][1169][Y0-0000000000000000] [lt=10] [dc=0] failed to stat file(ret=-4009, path="/tidb-data/backup/obcluster/1/incarnation_1/clog_info/2_172.16.56.59_2882", errno=13, errno="Permission denied")
[2022-06-02 10:26:16.489787] WARN [STORAGE] is_exist (ob_storage.cpp:151) [15332][1169][Y0-0000000000000000] [lt=10] [dc=0] failed to check is exist(ret=-4009, uri=file:///tidb-data/backup/obcluster/1/incarnation_1/clog_info/2_172.16.56.59_2882)
[2022-06-02 10:26:16.489795] WARN [ARCHIVE] check_file_exist (ob_archive_file_utils.cpp:216) [15332][1169][Y0-0000000000000000] [lt=6] [dc=0] check file is_exist fail(ret=-4009, uri=file:///tidb-data/backup/obcluster/1/incarnation_1/clog_info/2_172.16.56.59_2882)
[2022-06-02 10:26:16.489805] WARN [ARCHIVE] check_server_start_ts_exist (ob_archive_sender.cpp:1821) [15332][1169][Y0-0000000000000000] [lt=6] [dc=0] check file is_exist fail(ret=-4009, ret="OB_IO_ERROR", uri=file:///tidb-data/backup/obcluster/1/incarnation_1/clog_info/2_172.16.56.59_2882, storage_info=)
[2022-06-02 10:26:16.489815] WARN [ARCHIVE] check_and_set_start_archive_ts_ (ob_archive_mgr.cpp:914) [15332][1169][Y0-0000000000000000] [lt=8] [dc=0] check_server_start_archive_ts_record_exist_ fail(ret=-4009, ret="OB_IO_ERROR", round=1)
[2022-06-02 10:26:16.489826] WARN [ARCHIVE] start_archive_ (ob_archive_mgr.cpp:429) [15332][1169][Y0-0000000000000000] [lt=7] [dc=0] check_and_set_start_archive_ts_ fail(ret=-4009, ret="OB_IO_ERROR", start_archive_ts=1654136776489693, archive_round=1)
[2022-06-02 10:26:16.489860] WARN [ARCHIVE] do_check_switch_archive_ (ob_archive_mgr.cpp:691) [15332][1169][Y0-0000000000000000] [lt=6] [dc=0] start_archive_ fail(ret=-4009, ret=“OB_IO_ERROR”, non_frozen_piece_info={has_prev_piece_info:false, cur_piece_info:{key:{incarnation:1, tenant_id:1, round_id:1, backup_piece_id:11, copy_id:0}, create_date:20220522, status:0, file_status:0, start_ts:1653207147009190, checkpoint_ts:1653292261406941, max_ts:9223372036854775807, backup_dest:“file:///tidb-data/backup”, status_str:“ACTIVE”, file_status_str:“AVAILABLE”, compatible:1, start_piece_id:1}}, backup_info={status:{tenant_id:1, copy_id:0, start_ts:1652342177576411, checkpoint_ts:1653292261406941, status:3, incarnation:1, round:1, status_str:“DOING”, is_mark_deleted:false, is_mount_file_created:true, compatible:1, backup_piece_id:11, start_piece_id:1}, backup_dest:“file:///tidb-data/backup”}, cur_piece_id=11, cur_piece_create_date=20220522)

After restarting the cluster, the observers keep logging:
[2022-06-02 17:40:18.767281] ERROR [CLOG] is_reconfirm_role_change_or_sync_timeout_ (ob_log_state_mgr.cpp:2050) [22999][1149][Y0-0000000000000000] [lt=12] [dc=0] is_reconfirm_role_change_or_sync_timeout_(partition_key={tid:1103909674287385, partition_id:0, part_cnt:0}, role=1, now=1654162818767262, last_check_start_id_time_=1654162808765608, max_log_id=149, start_id=150, is_wait_replay=true) BACKTRACE:0x99be10e 0x9793491 0x2275200 0x2274e1b 0x2274a64 0x779311e 0x76b61a0 0x76ba639 0x76fadb1 0x751b024 0x8ae2be7 0x2c6e584 0x2c70eb2 0x9746965 0x9745692 0x974214f
[2022-06-02 17:40:18.767342] ERROR [CLOG] is_reconfirm_role_change_or_sync_timeout_ (ob_log_state_mgr.cpp:2050) [22999][1149][Y0-0000000000000000] [lt=44] [dc=0] is_reconfirm_role_change_or_sync_timeout_(partition_key={tid:1103909674287385, partition_id:0, part_cnt:0}, role=1, now=1654162818767332, last_check_start_id_time_=1654162808765608, max_log_id=149, start_id=150, is_wait_replay=true) BACKTRACE:0x99be10e 0x9793491 0x2275200 0x2274e1b 0x2274a64 0x779311e 0x76b61a0 0x76b4205 0x76fadf4 0x751b024 0x8ae2be7 0x2c6e584 0x2c70eb2 0x9746965 0x9745692 0x974214f

Check the disk space first; the symptoms are most likely caused by the disk filling up.
Also check whether the major merges are all completing normally.
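The two checks above can be sketched as shell commands. This is a minimal, hedged sketch: the mount point, connection address, port, and user are assumptions for a typical 3.x deployment (adjust to yours), and the virtual tables queried (`__all_virtual_disk_stat`, `__all_zone`) exist in OceanBase 3.x-era releases like the one these logs come from.

```shell
#!/bin/sh
# 1) OS-level view: free space on the filesystem holding the OB data
#    and clog directories (path is an assumption -- use your own).
df -h /data 2>/dev/null || df -h

# 2) Cluster-level view, if a mysql client can reach the sys tenant
#    (host/port/user below are hypothetical defaults):
if command -v mysql >/dev/null 2>&1; then
  # How much data-file space each observer believes it has left.
  mysql -h127.0.0.1 -P2881 -uroot@sys -e \
    "SELECT svr_ip, total_size, used_size, free_size
       FROM oceanbase.__all_virtual_disk_stat;" || true

  # Whether the daily major merge is progressing or stuck per zone.
  mysql -h127.0.0.1 -P2881 -uroot@sys -e \
    "SELECT zone, name, value, info
       FROM oceanbase.__all_zone
      WHERE name IN ('merge_status', 'is_merge_error', 'last_merged_version');" || true
fi
```

If `free_size` is near zero, or `is_merge_error` is nonzero / `merge_status` has not left MERGING for a long time, that would match the statement timeouts and submit_log failures seen in the logs.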