Cluster becomes unavailable after the observer process on the RS node is killed

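【 Environment 】Test environment
【 OB or other components 】OBServer and OBP are installed on each of three hosts
【 Version 】oceanbase-all-in-one-4.2.1.2-102000042023120514.el7.x86_64.tar.gz
【 Problem description 】I wanted to test how the cluster behaves when a node fails. Using OBD on host A, I deployed OBServer and OBP to hosts A, B and C, with the RS on A and each host in its own zone. I first killed the observer process on B and then on C, restarting each in turn, and the cluster stayed healthy. Then I killed the observer process on A. After that, every connection attempt failed with "RPC post error", whether it went to the observer on B or C or through obproxy.
【 Attachments and logs 】
OBD configuration file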

## Only need to configure when remote login is required
user:
  username: oceanbase
  password: SWXA1234@DAR
#   key_file: your ssh-key file path if need
#   port: your ssh port, default 22
#   timeout: ssh connection timeout (second), default 30
oceanbase-ce:
  servers:
    # Please don't use hostname, only IP can be supported
  - name: z1
    ip: 10.0.101.155
  - name: z2
    ip: 10.0.101.156
  - name: z3
    ip: 10.0.101.159
  global:
    #  The working directory for OceanBase Database. OceanBase Database is started under this directory. This is a required field.
    home_path: /opt/oceanbase/observer
    # The directory for data storage. The default value is $home_path/store.
    # data_dir: /data
    # The directory for clog, ilog, and slog. The default value is the same as the data_dir value.
    # redo_dir: /redo
    # Starting from observer version 4.2, the network selection for the observer is based on the 'local_ip' parameter, and the 'devname' parameter is no longer mandatory.
    # If the 'local_ip' parameter is set, the observer will first use this parameter for the configuration, regardless of the 'devname' parameter.
    # If only the 'devname' parameter is set, the observer will use the 'devname' parameter for the configuration.
    # If neither the 'devname' nor the 'local_ip' parameters are set, the 'local_ip' parameter will be automatically assigned the IP address configured above.
    # devname: eth0
    mysql_port: 2881 # External port for OceanBase Database. The default value is 2881. DO NOT change this value after the cluster is started.
    rpc_port: 2882 # Internal port for OceanBase Database. The default value is 2882. DO NOT change this value after the cluster is started.
    # if current hardware's memory capacity is smaller than 50G, please use the setting of "mini-single-example.yaml" and do a small adjustment.
    memory_limit: 6G # The maximum running memory for an observer
    # The reserved system memory. system_memory is reserved for general tenants. The default value is 30G.
    system_memory: 1G
    production_mode: false
    datafile_size: 2G # Size of the data file.
    datafile_next: 2G # the auto extend step. Please enter a capacity, such as 2G
    datafile_maxsize: 20G # the auto extend max size. Please enter a capacity, such as 20G
    log_disk_size: 13G # The size of disk space used by the clog files.
    enable_syslog_wf: false # Print system logs whose levels are higher than WARNING to a separate log file. The default value is true.
    enable_syslog_recycle: true # Enable auto system log recycling or not. The default value is false.
    max_syslog_file_count: 4 # The maximum number of reserved log files before enabling auto recycling. The default value is 0.
    root_password: SWXA1234@DAR # root user password, can be empty
    proxyro_password: SWXA1234@DAR
    cluster_id: 1702431481
  z1:
    zone: zone1
  z2:
    zone: zone2
  z3:
    zone: zone3
obproxy-ce:
  depends:
  - oceanbase-ce
  servers:
  - 10.0.101.155
  - 10.0.101.156
  - 10.0.101.159
  global:
    listen_port: 2883 # External port. The default value is 2883.
    prometheus_listen_port: 2884 # The Prometheus port. The default value is 2884.
    home_path: /opt/oceanbase/obproxy
    enable_cluster_checkout: false
    skip_proxy_sys_private_check: true
    enable_strict_kernel_release: false
    obproxy_sys_password: SWXA1234@DAR # obproxy sys user password, can be empty. When a depends exists, OBD gets this value from the oceanbase-ce of the depends.
    observer_sys_password: SWXA1234@DAR # proxyro user password, consistent with oceanbase-ce's proxyro_password, can be empty. When a depends exists, OBD gets this value from the oceanbase-ce of the depends.

Some of the most recent logs

[2023-12-18 16:45:03.691882] INFO  [STORAGE] runTimerTask (ob_tablet_gc_service.cpp:230) [5348][T1_ObTimer][T1][Y0-0000000000000000-0-0] [lt=22] [tabletchange] no logstream(ret=0, ret="OB_SUCCESS", ls_cnt=0, times=640)
[2023-12-18 16:45:03.691883] INFO  [STORAGE] runTimerTask (ob_empty_shell_task.cpp:38) [5349][T1_TabletShellT][T1][Y0-0000000000000000-0-0] [lt=5] ====== [emptytablet] empty shell timer task ======(GC_EMPTY_TABLET_SHELL_INTERVAL=5000000)
[2023-12-18 16:45:03.691898] INFO  [STORAGE] runTimerTask (ob_checkpoint_service.cpp:130) [5345][T1_TxCkpt][T1][Y0-0000000000000000-0-0] [lt=4] ====== checkpoint timer task ======
[2023-12-18 16:45:03.691898] INFO  [STORAGE] runTimerTask (ob_empty_shell_task.cpp:101) [5349][T1_TabletShellT][T1][Y0-0000000000000000-0-0] [lt=10] [emptytablet] no logstream(ret=0, ret="OB_SUCCESS", ls_cnt=0, times=640)
[2023-12-18 16:45:03.691906] INFO  [STORAGE] runTimerTask (ob_checkpoint_service.cpp:193) [5345][T1_TxCkpt][T1][Y0-0000000000000000-0-0] [lt=5] no logstream(ret=0, ls_cnt=0)
[2023-12-18 16:45:03.699191] WDIAG [SERVER] fill_ls_replica (ob_service.cpp:2570) [5410][T1_L0_G9][T1][YB420A00659F-00060CC4058DCE84-0-0] [lt=7][errcode=-4719] get ls handle failed(ret=-4719, ret="OB_LS_NOT_EXIST")
[2023-12-18 16:45:03.707603] INFO  [COMMON] replace_map (ob_kv_storecache.cpp:743) [5080][KVCacheRep][T0][Y0-0000000000000000-0-0] [lt=17] replace map num details(ret=0, replace_node_count=0, map_once_replace_num_=15728, map_replace_skip_count_=2)
[2023-12-18 16:45:03.708236] INFO  [STORAGE] runTimerTask (ob_locality_manager.cpp:708) [5150][LocalityReload][T0][Y0-0000000000000000-0-0] [lt=12] runTimer to refresh locality_info(ret=0)
[2023-12-18 16:45:03.708788] WDIAG [RPC] send (ob_poc_rpc_proxy.h:140) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=11][errcode=-4122] check_blacklist failed(ret=-4122)
[2023-12-18 16:45:03.708804] WDIAG [SQL.EXE] task_execute_v2 (ob_executor_rpc_impl.cpp:143) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=10][errcode=-4122] rpc task_execute fail(ret=-4122, tenant_id=1, svr="10.0.101.155:2882", timeout=29999889, timeout_timestamp=1702889133708635)
[2023-12-18 16:45:03.708814] WDIAG [SQL.EXE] execute_with_sql (ob_remote_scheduler.cpp:250) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=9][errcode=-4122] task execute failed(ret=-4122)
[2023-12-18 16:45:03.708820] WDIAG [SQL.EXE] schedule (ob_remote_scheduler.cpp:53) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=5][errcode=-4122] execute with sql failed(ret=-4122)
[2023-12-18 16:45:03.708826] WDIAG [SQL] do_open_plan (ob_result_set.cpp:545) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=4][errcode=-4122] fail execute plan(ret=-4122)
[2023-12-18 16:45:03.708831] WDIAG [SQL] open (ob_result_set.cpp:157) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=4][errcode=-4122] execute plan failed(ret=-4122)
[2023-12-18 16:45:03.708929] WDIAG [RPC] post (ob_poc_rpc_proxy.h:217) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=4][errcode=-4122] check_blacklist failed(addr="10.0.101.155:2882")
[2023-12-18 16:45:03.708949] WDIAG [RPC] call_rpc (ob_async_rpc_proxy.h:356) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=17][errcode=-4122] call rpc func failed(server="10.0.101.155:2882", timeout=2000000, arg={addr:"10.0.101.155:2882", cluster_id:1702431481}, cluster_id=1702431481, tenant_id=1, group_id=9, ret=-4122, ret="OB_RPC_POST_ERROR")
[2023-12-18 16:45:03.708961] WDIAG [RPC] call (ob_async_rpc_proxy.h:290) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=11][errcode=-4122] call rpc func failed(server="10.0.101.155:2882", timeout=2000000, cluster_id=1702431481, tenant_id=1, arg={addr:"10.0.101.155:2882", cluster_id:1702431481}, group_id=9, ret=-4122, ret="OB_RPC_POST_ERROR")
[2023-12-18 16:45:03.708968] WDIAG [RPC] on_timeout (ob_async_rpc_proxy.h:83) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=6][errcode=0] some error in rcode and enter on_timeout(AsyncCB::rcode_.rcode_=0)
[2023-12-18 16:45:03.708974] WDIAG [SHARE.PT] do_detect_master_rs_ls_ (ob_rpc_ls_table.cpp:296) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=5][errcode=0] fail to send rpc(tmp_ret=-4122, tmp_ret="OB_RPC_POST_ERROR", cluster_id=1702431481, addr="10.0.101.155:2882", timeout=2000000, arg={addr:"10.0.101.155:2882", cluster_id:1702431481})
[2023-12-18 16:45:03.709039] INFO  [SHARE.LOCATION] batch_renew_tablet_locations (ob_location_service.cpp:441) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=4] [TABLET_LOCATION] batch renew tablet locations finished(ret=0, ret="OB_SUCCESS", tenant_id=1, renew_type=0, is_nonblock=true, tablet_list=[{id:116}{id:117}], ls_ids=[], error_code=-4122)
[2023-12-18 16:45:03.709060] WDIAG [SERVER] open (ob_inner_sql_result.cpp:153) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=17][errcode=-4122] open result set failed(ret=-4122)
[2023-12-18 16:45:03.709094] WDIAG [SERVER] fill_ls_replica (ob_service.cpp:2570) [5410][T1_L0_G9][T1][YB420A00659C-00060CC4058AF118-0-0] [lt=12][errcode=-4719] get ls handle failed(ret=-4719, ret="OB_LS_NOT_EXIST")
[2023-12-18 16:45:03.709166] WDIAG [SERVER] do_query (ob_inner_sql_connection.cpp:697) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=104][errcode=-4122] result set open failed(ret=-4122, executor={ObIExecutor:, sql:"select svr_ip, svr_port, a.zone, info, value, b.name, a.status, a.start_service_time, a.stop_time from __all_server a LEFT JOIN __all_zone b ON a.zone = b.zone WHERE (b.name = 'region' or b.name = 'idc' or b.name = 'status' or b.name = 'zone_type') and a.zone != '' order by svr_ip, svr_port, b.name"})
[2023-12-18 16:45:03.709185] WDIAG [SERVER] query (ob_inner_sql_connection.cpp:832) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=18][errcode=-4122] execute failed(ret=-4122, tenant_id=1, executor={ObIExecutor:, sql:"select svr_ip, svr_port, a.zone, info, value, b.name, a.status, a.start_service_time, a.stop_time from __all_server a LEFT JOIN __all_zone b ON a.zone = b.zone WHERE (b.name = 'region' or b.name = 'idc' or b.name = 'status' or b.name = 'zone_type') and a.zone != '' order by svr_ip, svr_port, b.name"}, retry_cnt=0, local_sys_schema_version=1702885661831272, local_tenant_schema_version=1702885661831272)
[2023-12-18 16:45:03.709201] WDIAG [SERVER] after_func (ob_query_retry_ctrl.cpp:947) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=13][errcode=-4122] [RETRY] check if need retry(v={force_local_retry:true, stmt_retry_times:0, local_retry_times:0, err_:-4122, err_:"OB_RPC_POST_ERROR", retry_type:0, client_ret:-4122}, need_retry=false, THIS_WORKER.can_retry()=false, v.ctx_.multi_stmt_item_={is_part_of_multi_stmt:false, seq_num:0, sql:"", batched_queries:NULL, is_ps_mode:false, ab_cnt:0})
[2023-12-18 16:45:03.709227] WDIAG [SERVER] inner_close (ob_inner_sql_result.cpp:218) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=18][errcode=-4122] result set close failed(ret=-4122)
[2023-12-18 16:45:03.709239] WDIAG [SERVER] force_close (ob_inner_sql_result.cpp:198) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=11][errcode=-4122] result set close failed(ret=-4122)
[2023-12-18 16:45:03.709249] WDIAG [SERVER] query (ob_inner_sql_connection.cpp:837) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=10][errcode=-4122] failed to close result(close_ret=-4122, ret=-4122)
[2023-12-18 16:45:03.709268] WDIAG [SERVER] query (ob_inner_sql_connection.cpp:867) [5149][LocltyRefTask][T1][YB420A00659C-00060CC4076A77E3-0-0] [lt=11][errcode=-4122] failed to process record(executor={ObIExecutor:, sql:"select svr_ip, svr_port, a.zone, info, value, b.name, a.status, a.start_service_time, a.stop_time from __all_server a LEFT JOIN __all_zone b ON a.zone = b.zone WHERE (b.name = 'region' or b.name = 'idc' or b.name = 'status' or b.name = 'zone_type') and a.zone != '' order by svr_ip, svr_port, b.name"}, record_ret=-4122, ret=-4122)
[2023-12-18 16:45:03.709291] WDIAG [RPC] wait (ob_async_rpc_proxy.h:422) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=7][errcode=0] execute rpc failed(rc=-4012, server="0.0.0.0:0", timeout=0, packet code=330, arg={addr:"10.0.101.155:2882", cluster_id:1702431481})
[2023-12-18 16:45:03.709305] WDIAG [SHARE.PT] do_detect_master_rs_ls_ (ob_rpc_ls_table.cpp:315) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=9][errcode=0] fail to get result by rpc, just ignore(tmp_ret=-4012, addr="10.0.101.155:2882")
[2023-12-18 16:45:03.709315] WDIAG [SHARE.PT] find_leader (ob_ls_info.cpp:847) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=5][errcode=-4018] fail to get leader replica(ret=-4018, ret="OB_ENTRY_NOT_EXIST", *this={tenant_id:1, ls_id:{id:1}, replicas:[]}, replica count=0)
[2023-12-18 16:45:03.709323] WDIAG [SHARE.PT] find_leader (ob_ls_info.cpp:847) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=7][errcode=-4018] fail to get leader replica(ret=-4018, ret="OB_ENTRY_NOT_EXIST", *this={tenant_id:1, ls_id:{id:1}, replicas:[]}, replica count=0)
[2023-12-18 16:45:03.709330] INFO  [SHARE.PT] get_ls_info_ (ob_rpc_ls_table.cpp:140) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=6] leader doesn't exist, try use all_server_list(tmp_ret=-4018, tmp_ret="OB_ENTRY_NOT_EXIST", ls_info={tenant_id:1, ls_id:{id:1}, replicas:[]})
[2023-12-18 16:45:03.709339] INFO  [SHARE.PT] get_ls_info_ (ob_rpc_ls_table.cpp:151) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=5] server_list is empty, do nothing(ret=0, ret="OB_SUCCESS", server_list=[])
[2023-12-18 16:45:03.709350] INFO  [SHARE.LOCATION] batch_update_caches_ (ob_ls_location_service.cpp:944) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058AF118-0-0] [lt=5] [LS_LOCATION]ls location cache has changed(ret=0, ret="OB_SUCCESS", old_location={cache_key:{tenant_id:1, ls_id:{id:1}, cluster_id:1702431481}, renew_time:1702886013695352, replica_locations:[{server:"10.0.101.155:2882", role:1, sql_port:2881, replica_type:0, property:{memstore_percent_:100}, restore_status:{status:0}, proposal_id:4}]}, new_location={cache_key:{tenant_id:1, ls_id:{id:1}, cluster_id:1702431481}, renew_time:1702889103709348, replica_locations:[]})
[2023-12-18 16:45:03.709282] WDIAG [SERVER] query (ob_inner_sql_connection.cpp:893) [5149][LocltyRefTask][T0][YB420A00659C-00060CC4076A77E3-0-0] [lt=13][errcode=-4122] failed to process final(executor={ObIExecutor:, sql:"select svr_ip, svr_port, a.zone, info, value, b.name, a.status, a.start_service_time, a.stop_time from __all_server a LEFT JOIN __all_zone b ON a.zone = b.zone WHERE (b.name = 'region' or b.name = 'idc' or b.name = 'status' or b.name = 'zone_type') and a.zone != '' order by svr_ip, svr_port, b.name"}, aret=-4122, ret=-4122)
[2023-12-18 16:45:03.709392] WDIAG [SERVER] execute_read_inner (ob_inner_sql_connection.cpp:1652) [5149][LocltyRefTask][T0][Y0-0000000000000000-0-0] [lt=108][errcode=-4122] execute sql failed(ret=-4122, tenant_id=1, sql=select svr_ip, svr_port, a.zone, info, value, b.name, a.status, a.start_service_time, a.stop_time from __all_server a LEFT JOIN __all_zone b ON a.zone = b.zone WHERE (b.name = 'region' or b.name = 'idc' or b.name = 'status' or b.name = 'zone_type') and a.zone != '' order by svr_ip, svr_port, b.name)
[2023-12-18 16:45:03.709405] WDIAG [SERVER] retry_while_no_tenant_resource (ob_inner_sql_connection.cpp:950) [5149][LocltyRefTask][T0][Y0-0000000000000000-0-0] [lt=12][errcode=-4122] retry_while_no_tenant_resource failed(ret=-4122, tenant_id=1)
[2023-12-18 16:45:03.709430] WDIAG [SERVER] execute_read (ob_inner_sql_connection.cpp:1592) [5149][LocltyRefTask][T0][Y0-0000000000000000-0-0] [lt=24][errcode=-4122] execute_read failed(ret=-4122, cluster_id=1702431481, tenant_id=1)
[2023-12-18 16:45:03.709443] WDIAG [COMMON.MYSQLP] read (ob_mysql_proxy.cpp:131) [5149][LocltyRefTask][T0][Y0-0000000000000000-0-0] [lt=12][errcode=-4122] query failed(ret=-4122, conn=0x152c651ae050, start=1702889103708635, sql=select svr_ip, svr_port, a.zone, info, value, b.name, a.status, a.start_service_time, a.stop_time from __all_server a LEFT JOIN __all_zone b ON a.zone = b.zone WHERE (b.name = 'region' or b.name = 'idc' or b.name = 'status' or b.name = 'zone_type') and a.zone != '' order by svr_ip, svr_port, b.name)
[2023-12-18 16:45:03.709457] WDIAG [COMMON.MYSQLP] read (ob_mysql_proxy.cpp:66) [5149][LocltyRefTask][T0][Y0-0000000000000000-0-0] [lt=13][errcode=-4122] read failed(ret=-4122)
[2023-12-18 16:45:03.709490] WDIAG [SHARE] load_region (ob_locality_table_operator.cpp:158) [5149][LocltyRefTask][T0][Y0-0000000000000000-0-0] [lt=11][errcode=-4122] execute sql failed(ret=-4122, sql=select svr_ip, svr_port, a.zone, info, value, b.name, a.status, a.start_service_time, a.stop_time from __all_server a LEFT JOIN __all_zone b ON a.zone = b.zone WHERE (b.name = 'region' or b.name = 'idc' or b.name = 'status' or b.name = 'zone_type') and a.zone != '' order by svr_ip, svr_port, b.name)
[2023-12-18 16:45:03.709504] INFO  [SHARE] load_region (ob_locality_table_operator.cpp:373) [5149][LocltyRefTask][T0][Y0-0000000000000000-0-0] [lt=13] load region(ret=-4122, locality_info={version:0, local_region:"", local_zone:"", local_idc:"", local_zone_type:3, local_zone_status:3, locality_region_array:[], locality_zone_array:[]})
[2023-12-18 16:45:03.709559] WDIAG [STORAGE] load_region (ob_locality_manager.cpp:236) [5149][LocltyRefTask][T0][Y0-0000000000000000-0-0] [lt=15][errcode=-4122] localitity operator load region error(ret=-4122)
[2023-12-18 16:45:03.709572] WDIAG [STORAGE] process (ob_locality_manager.cpp:762) [5149][LocltyRefTask][T0][Y0-0000000000000000-0-0] [lt=11][errcode=-4122] process refresh locality task fail(ret=-4122)
[2023-12-18 16:45:03.727495] INFO  [SHARE.LOCATION] dump_cache (ob_ls_location_service.cpp:1206) [5140][DumpLSLoc][T0][YB420A00659C-00060CC4077A77E3-0-0] [lt=20] [LS_LOCATION]dump tenant ls location caches(tenant_id=1, tenant_ls_locations=[{cache_key:{tenant_id:1, ls_id:{id:1}, cluster_id:1702431481}, renew_time:1702886013695352, replica_locations:[{server:"10.0.101.155:2882", role:1, sql_port:2881, replica_type:0, property:{memstore_percent_:100}, restore_status:{status:0}, proposal_id:4}]}])
[2023-12-18 16:45:03.729836] INFO  [STORAGE] operator() (ob_tenant_freezer.cpp:131) [5253][T1_Occam][T1][Y0-0000000000000000-0-0] [lt=13] ====== tenant freeze timer task ======
[2023-12-18 16:45:03.729859] WDIAG [STORAGE] get_tenant_tx_data_mem_used_ (ob_tenant_freezer.cpp:597) [5253][T1_Occam][T1][Y0-0000000000000000-0-0] [lt=13][errcode=0] [TenantFreezer] no logstream(ret=0, ret="OB_SUCCESS", ls_cnt=0, tenant_info_={slow_freeze:false, slow_freeze_timestamp:0, freeze_interval:0, last_freeze_timestamp:0, slow_tablet:{id:0}})
[2023-12-18 16:45:03.770755] INFO  [COMMON] compute_tenant_wash_size (ob_kvcache_store.cpp:1156) [5079][KVCacheWash][T0][Y0-0000000000000000-0-0] [lt=29] Wash compute wash size(is_wash_valid=false, sys_total_wash_size=-878530560, global_cache_size=16646144, tenant_max_wash_size=0, tenant_min_wash_size=0, tenant_ids_=[500, 508, 509, 1])
[2023-12-18 16:45:03.802389] INFO  [STORAGE] runTimerTask (ob_checkpoint_service.cpp:346) [5347][T1_CKClogDisk][T1][Y0-0000000000000000-0-0] [lt=6] ====== check clog disk timer task ======
[2023-12-18 16:45:03.802406] INFO  [PALF] get_disk_usage (palf_env_impl.cpp:882) [5347][T1_CKClogDisk][T1][Y0-0000000000000000-0-0] [lt=14] get_disk_usage(ret=0, capacity(MB):=0, used(MB):=0)
[2023-12-18 16:45:03.802418] INFO  [STORAGE] cannot_recycle_log_over_threshold_ (ob_checkpoint_service.cpp:260) [5347][T1_CKClogDisk][T1][Y0-0000000000000000-0-0] [lt=8] cannot_recycle_log_size statistics(cannot_recycle_log_size=0, threshold=0, need_update_checkpoint_scn=false)
[2023-12-18 16:45:03.802818] WDIAG [SERVER] fill_ls_replica (ob_service.cpp:2570) [5410][T1_L0_G9][T1][YB420A00659F-00060CC4058DCE85-0-0] [lt=7][errcode=-4719] get ls handle failed(ret=-4719, ret="OB_LS_NOT_EXIST")
[2023-12-18 16:45:03.833330] WDIAG [SERVER] fill_ls_replica (ob_service.cpp:2570) [5410][T1_L0_G9][T1][YB420A00659F-00060CC4069D603E-0-0] [lt=16][errcode=-4719] get ls handle failed(ret=-4719, ret="OB_LS_NOT_EXIST")
[2023-12-18 16:45:03.834451] WDIAG [SERVER] fill_ls_replica (ob_service.cpp:2570) [5410][T1_L0_G9][T1][YB420A00659F-00060CC4058DCE86-0-0] [lt=26][errcode=-4719] get ls handle failed(ret=-4719, ret="OB_LS_NOT_EXIST")
[2023-12-18 16:45:03.835093] INFO  [CLOG.EXTLOG] resize_log_ext_handler_ (ob_cdc_service.cpp:372) [5323][T1_CdcSrv][T1][Y0-0000000000000000-0-0] [lt=8] finish to resize log external storage handler(current_ts=1702889103835089, tenant_max_cpu=1, valid_ls_count=0, other_ls_count=0, new_concurrency=0)
[2023-12-18 16:45:03.844799] INFO  [ARCHIVE] gc_stale_ls_task_ (ob_ls_mgr.cpp:537) [5350][T1_LSArchiveMgr][T1][YB420A00659C-00060CC4059A76A4-0-0] [lt=12] gc stale ls task succ

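Can you access the internal tables? For example: select * from __all_server;

This is not the expected behavior.

Let me restate your test procedure.

  1. Install a 1-1-1 three-node cluster and confirm that the RS node is on A
  2. Kill B: the cluster keeps serving. Restart B: the cluster is healthy
  3. Kill C: the cluster keeps serving. Restart C: the cluster is healthy
  4. Kill A: the cluster becomes unavailable, and A cannot be restarted

Please collect the logs with obdiag (https://www.oceanbase.com/product/obdiag-rn/releaseNote#V1.5.0). You can upload them to an Alibaba Cloud drive, and we will download and analyze them ourselves.

No. Connecting to port 2881 or 2883 on either B or C returns RPC post error.

@acmwwh Please collect the logs as described at that link. It looks like RPC is failing.

Or search the logs first for this keyword: check_blacklist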
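As a rough illustration of the keyword search suggested above, here is a minimal Python sketch that scans observer log lines for `check_blacklist` and counts the peer addresses they mention. The function name `blacklisted_targets` and the `SAMPLE` lines are assumptions made for this example (the sample lines are abbreviated from the excerpts in this thread); it only does text matching and does not rely on any OceanBase tool.

```python
import re
from collections import Counter

# Matches addr="ip:port"; curly quotes are accepted as well because forum
# software often rewrites straight quotes in pasted logs.
ADDR_RE = re.compile(r'addr=["\u201c]([0-9.]+:[0-9]+)["\u201d]')

def blacklisted_targets(lines):
    """Count peer addresses on log lines that mention check_blacklist."""
    counts = Counter()
    for line in lines:
        if "check_blacklist" not in line:
            continue  # unrelated log line
        match = ADDR_RE.search(line)
        if match:  # some check_blacklist lines carry no address
            counts[match.group(1)] += 1
    return counts

# Abbreviated sample lines (thread IDs and trace IDs removed).
SAMPLE = [
    'WDIAG [RPC] post (ob_poc_rpc_proxy.h:217) [errcode=-4122] '
    'check_blacklist failed(addr="10.0.101.155:2882")',
    'WDIAG [RPC] check_blacklist (ob_poc_rpc_proxy.cpp:254) [errcode=-4122] '
    'address in blacklist(ret=-4122, addr=\u201c10.0.101.156:2882\u201d)',
    'WDIAG [RPC] send (ob_poc_rpc_proxy.h:140) [errcode=-4122] '
    'check_blacklist failed(ret=-4122)',
    'INFO  [STORAGE] runTimerTask (ob_checkpoint_service.cpp:130) '
    '====== checkpoint timer task ======',
]

for addr, hits in blacklisted_targets(SAMPLE).items():
    print(addr, hits)
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the log path depends on the deployment and is not assumed here.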

In the log on host A (10.0.101.155):
[2023-12-18 15:44:27.948354] WDIAG [RPC] check_blacklist (ob_poc_rpc_proxy.cpp:254) [290089][T1_HBService][T1][YB420A00659B-00060CC352EA43F2-0-0] [lt=6][errcode=-4122] address in blacklist(ret=-4122, addr="10.0.101.156:2882")
[2023-12-18 15:44:27.948383] WDIAG [RPC] post (ob_poc_rpc_proxy.h:217) [290089][T1_HBService][T1][YB420A00659B-00060CC352EA43F2-0-0] [lt=21][errcode=-4122] check_blacklist failed(addr="10.0.101.156:2882")

In the log on host B (10.0.101.156):
[2023-12-18 17:49:57.838482] WDIAG [RPC] send (ob_poc_rpc_proxy.h:140) [17751][T1_L0_G100][T1][YB420A00659C-00060CC407CA8132-0-0] [lt=40][errcode=-4122] check_blacklist failed(ret=-4122)
[2023-12-18 17:49:57.838506] WDIAG [SQL.EXE] task_execute (ob_executor_rpc_impl.cpp:81) [17751][T1_L0_G100][T1][YB420A00659C-00060CC407CA8132-0-0] [lt=18][errcode=-4122] rpc task_execute fail(ret=-4122, tenant_id=1, svr="10.0.101.155:2882", timeout=5522747, timeout_timestamp=1702893003361077)

In the log on host C (10.0.101.159):
[2023-12-18 17:45:41.826266] WDIAG [RPC] post (ob_poc_rpc_proxy.h:217) [1464091][AutoLSLocRpc][T0][YB420A00659F-00060CC406BD6E76-0-0] [lt=7][errcode=-4122] check_blacklist failed(addr="10.0.101.155:2882")
[2023-12-18 17:45:41.826304] WDIAG [RPC] call_rpc (ob_async_rpc_proxy.h:356) [1464091][AutoLSLocRpc][T0][YB420A00659F-00060CC406BD6E76-0-0] [lt=35][errcode=-4122] call rpc func failed(server="10.0.101.155:2882", timeout=2000000, arg={addr:"10.0.101.159:2882"}, cluster_id=1702431481, tenant_id=1, group_id=10, ret=-4122, ret="OB_RPC_POST_ERROR")

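Also, my attempt to install obdiag with OBD failed. Taken literally, the error says that generating the configuration file failed.

Use ps to check whether the process at 10.0.101.156:2882 is up. If it is, use obstack to see whether it is hung. If you don't have obstack, check whether that observer is still writing logs.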
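As a complement to the ps check, the sketch below tests from another host whether a TCP connection to an observer's RPC port can be opened. It is only a hedged illustration: the helper name `rpc_port_open` and its default timeout are assumptions for this example, and an open port shows that something is listening, not that the observer is healthy or free of hangs.

```python
import socket

def rpc_port_open(host, port=2882, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_hosts(hosts, port=2882):
    """Map each host to the result of rpc_port_open for the given port."""
    return {host: rpc_port_open(host, port) for host in hosts}
```

For the cluster in this thread, one would call `check_hosts(["10.0.101.155", "10.0.101.156", "10.0.101.159"])` from a machine that can reach those addresses; 2882 is the `rpc_port` from the posted configuration.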

It is still logging continuously. These are the latest entries:
[2023-12-18 19:25:49.069593] WDIAG [SHARE.PT] find_leader (ob_ls_info.cpp:847) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058C6ABE-0-0] [lt=10][errcode=-4018] fail to get leader replica(ret=-4018, ret="OB_ENTRY_NOT_EXIST", *this={tenant_id:1, ls_id:{id:1}, replicas:[]}, replica count=0)
[2023-12-18 19:25:49.069599] INFO  [SHARE.PT] get_ls_info_ (ob_rpc_ls_table.cpp:140) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058C6ABE-0-0] [lt=5] leader doesn't exist, try use all_server_list(tmp_ret=-4018, tmp_ret="OB_ENTRY_NOT_EXIST", ls_info={tenant_id:1, ls_id:{id:1}, replicas:[]})
[2023-12-18 19:25:49.069611] INFO  [SHARE.PT] get_ls_info_ (ob_rpc_ls_table.cpp:151) [5132][SysLocAsyncUp0][T0][YB420A00659C-00060CC4058C6ABE-0-0] [lt=7] server_list is empty, do nothing(ret=0, ret="OB_SUCCESS", server_list=[])

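Please check the other two servers as well, especially 10.0.101.155:2882.

The entries above are from machine B (10.0.101.156). Machine A (10.0.101.155) has produced no logs since I killed its observer process. The log on machine C (10.0.101.159) is as follows: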
[2023-12-18 19:23:26.689542] WDIAG [SERVER] batch_process_tasks (ob_uniq_task_queue.h:507) [1464081][SysLocAsyncUp0][T0][YB420A00659F-00060CC4058F501A-0-0] [lt=11][errcode=-4721] fail to batch process task(ret=-4721)
[2023-12-18 19:23:26.689546] WDIAG [SERVER] run1 (ob_uniq_task_queue.h:458) [1464081][SysLocAsyncUp0][T0][YB420A00659F-00060CC4058F501A-0-0] [lt=4][errcode=-4721] fail to batch execute task(ret=-4721, tasks.count()=1)
[2023-12-18 19:23:26.704520] WDIAG [SERVER] fill_ls_replica (ob_service.cpp:2570) [1464493][T1_L0_G9][T1][YB420A00659C-00060CC4058C653F-0-0] [lt=11][errcode=-4719] get ls handle failed(ret=-4719, ret="OB_LS_NOT_EXIST")
[2023-12-18 19:23:26.728546] WDIAG [RPC] send (ob_poc_rpc_proxy.h:140) [1479661][T1_L0_G100][T1][YB420A00659F-00060CC407AD6734-0-0] [lt=68][errcode=-4122] check_blacklist failed(ret=-4122)
[2023-12-18 19:23:26.728570] WDIAG [SQL.EXE] task_execute (ob_executor_rpc_impl.cpp:81) [1479661][T1_L0_G100][T1][YB420A00659F-00060CC407AD6734-0-0] [lt=17][errcode=-4122] rpc task_execute fail(ret=-4122, tenant_id=1, svr="10.0.101.155:2882", timeout=5578564, timeout_timestamp=1702898612306895)
[2023-12-18 19:23:26.728584] WDIAG [SQL.EXE] execute (ob_remote_task_executor.cpp:99) [1479661][T1_L0_G100][T1][YB420A00659F-00060CC407AD6734-0-0] [lt=10][errcode=-4122] fail post task(ret=-4122)
[2023-12-18 19:23:26.728592] WDIAG [SQL.EXE] execute (ob_remote_job_executor.cpp:55) [1479661][T1_L0_G100][T1][YB420A00659F-00060CC407AD6734-0-0] [lt=6][errcode=-4122] fail execute task(ret=-4122, *task_info={task_loc:{server:"10.0.101.155:2882", ob_task_id:{ob_job_id:{ob_execution_id:{server:"10.0.101.159:2882", execution_id:18446744073709551615, task_type:0, hash:4219408603045787357}, job_id:73439}, task_id:0, task_cnt:0}}, range_location:{part_locs:[], server:"0.0.0.0:0"}, location_idx:0, location_idx_list:[0], state:5, slice_count_pos:[], background:false, retry_times:0, location_idx_list:[0]})
[2023-12-18 19:23:26.728616] WDIAG [SQL.EXE] execute_with_plan (ob_remote_scheduler.cpp:124) [1479661][T1_L0_G100][T1][YB420A00659F-00060CC407AD6734-0-0] [lt=23][errcode=-4122] fail execute remote job(ret=-4122)

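I later started the observer process on machine A manually, and everything is working normally now. I will try the other steps tomorrow.

So machine A had not started successfully.

Was there really not a single log line?

@夏进 Sorry, I only just saw this. Here is a brief summary of my test procedure, numbered in chronological order:

For B and C, the test was shutting the machine down directly. For A, it was killing the observer process.

My question is this: shutting down B or C leaves the cluster healthy, but as soon as A goes down the database becomes inaccessible. Is that expected?

This is not expected: the cluster can keep working as long as two servers are healthy. But judging from the logs you posted yesterday, these machines had all been put on the blacklist. Most likely an RPC failure prevented an RS leader from being elected. Please package all the logs from the day of the incident and send them to us for analysis.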
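The "two healthy servers are enough" statement follows from ordinary majority arithmetic for a three-replica group. The sketch below shows only that arithmetic; the function names are made up for this example, and it does not model OceanBase's actual election or blacklist logic.

```python
def majority(replica_count):
    """Smallest number of replicas that forms a strict majority."""
    return replica_count // 2 + 1

def has_quorum(replica_count, alive):
    """True if the surviving replicas can still form a majority."""
    return alive >= majority(replica_count)

# Three zones with one replica each, as in the 1-1-1 deployment above.
print(majority(3))        # replicas needed for a majority
print(has_quorum(3, 2))   # one node down
print(has_quorum(3, 1))   # two nodes down
```

With one of three nodes down, the remaining two still form a majority, which is why losing A alone should not by itself make the cluster unavailable.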

mark

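Hello, the current OBD does not support the latest obdiag release, 1.5.0, which is why the installation failed. You can try installing it manually: https://www.oceanbase.com/docs/common-obdiag-cn-1000000000441301

Hello, thanks for the reply. I have sent the link to the log files by email. Please take a look.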