observer service fails to start

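[Environment] Test environment
[OB or other component]
[Version] 3.1.5 (single-node, IP: 172.17.151.120)
[Problem description]

After I run the start command, the observer service fails a few minutes later. The obproxy service starts normally. The last log entries from several failed starts are at the bottom of this post. This is how I start the services: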

[root@dev020 ~]# more start-ob.sh
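#!/bin/bash

# Start obproxy first, then observer, then list the listening ports.
start_observer(){
    # Start obproxy from its install directory.
    cd /data/oceanbase/obproxy/
    ./bin/obproxy
    sleep 10

    # Start observer from its home directory.
    cd /usr/local/oceanbase/
    ./bin/observer
    sleep 5

    # Show listening TCP ports matching 288 (e.g. 2881 and 2882).
    netstat -tnpl|grep 288
}

start_observer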

observer.log entries from when observer exited:

[2024-09-12 11:42:59.134867] ERROR [CLOG] notify_scan_finished_ (ob_log_scan_runnable.cpp:660) [11641][1203][Y0-0000000000000000] [lt=10] [dc=0] invalid scan_confirmed_log_cnt(ret=-4016, ret=“OB_ERR_UNEXPECTED”, scan_confirmed_log_cnt=586, next_ilog_id=8355, last_replay_log_id=7474, pkey={tid:1099511627983, partition_id:15, part_cnt:0}) BACKTRACE:0x9a98e9e 0x986d141 0x233b8f6 0x233b51b 0x233b1b7 0x781a30e 0x773dc1e 0x773b7b5 0x2ca95d4 0x2cabf02 0x9820da5 0x981f792 0x981c24f

[2024-09-12 11:42:59.134902] ERROR [CLOG] do_scan_log_ (ob_log_scan_runnable.cpp:216) [11641][1203][Y0-0000000000000000] [lt=32] [dc=0] notify_scan_finished_ failed(ret=-4016) BACKTRACE:0x9a98e9e 0x986d141 0x22a7774 0x22a725b 0x22a6fc1 0x22a5c58 0x773bca7 0x2ca95d4 0x2cabf02 0x9820da5 0x981f792 0x981c24f

[2024-09-12 11:42:59.134912] ERROR [CLOG] do_scan_log_ (ob_log_scan_runnable.cpp:223) [11641][1203][Y0-0000000000000000] [lt=9] [dc=0] log scan runnable exit error(ret=-4016) BACKTRACE:0x9a98e9e 0x986d141 0x22a7774 0x22a725b 0x22a6fc1 0x22a5c58 0x773b958 0x2ca95d4 0x2cabf02 0x9820da5 0x981f792 0x981c24f

[2024-09-12 11:42:59.134925] ERROR on_fatal_error (ob_log_define.h:684) [11641][1203][Y0-0000000000000000] [lt=7] [dc=0] ret = -4016 BACKTRACE:0x9a98e9e 0x74ee390 0x773b97a 0x2ca95d4 0x2cabf02 0x9820da5 0x981f792 0x981c24f BACKTRACE:0x9a98e9e 0x986d141 0x986dda6 0x74ee3d3 0x773b97a 0x2ca95d4 0x2cabf02 0x9820da5 0x981f792 0x981c24f

[2024-09-12 11:45:21.321456] WARN [CLOG] handle (ob_clog_history_reporter.cpp:315) [13298][370][Y0-0000000000000000] [lt=3] [dc=0] exec partition task fail(ret=-4023, ret=“OB_EAGAIN”, partition_task={pkey:{tid:1101710651081588, partition_id:388, part_cnt:0}, head:{partition op str:“ONLINE”, svr:“172.17.151.120:2882”, start_log_id:37, start_log_timestamp_:1708450868572323, end_log_id:18446744073709551615, end_log_timestamp:-1, next:null}, tail:{partition op str:“ONLINE”, svr:“172.17.151.120:2882”, start_log_id:37, start_log_timestamp_:1708450868572323, end_log_id:18446744073709551615, end_log_timestamp:-1, next:null}})

[2024-09-12 11:45:21.321733] INFO [SHARE.SCHEMA] ob_schema_getter_guard.cpp:901 [13298][370][YB42AC119778-000621E3E8F9B042] [lt=8] [dc=0] table not exist(fetch_tenant_id=1, tenant_id=1002, database_id=1101710651031553, session_id=18446744073709551615, table_name=__all_clog_history_info_v2, is_index=false, snapshot_version=1, is_schema_split=0, schema_version=1, schema_mgr_tenant_id=0)

[2024-09-12 11:45:21.321780] WARN [CLOG] handle_online_op_ (ob_clog_history_reporter.cpp:1251) [13298][370][Y0-0000000000000000] [lt=10] [dc=0] fail to insert a clog history info record(ret=-5019, ret=“OB_TABLE_NOT_EXIST”, pkey={tid:1101710651081605, partition_id:0, part_cnt:0}, sql="REPLACE INTO oceanbase._all_clog_history_info_v2 (table_id, partition_idx, partition_cnt, start_log_id, start_log_timestamp, svr_ip, svr_port, end_log_id, end_log_timestamp) VALUES(1101710651081605, 0, 0, 7473, 1708450582334832, ‘172.17.151.120’, 2882, 18446744073709551615, -1) ", online_op={partition op str:“ONLINE”, svr:“172.17.151.120:2882”, start_log_id:7473, start_log_timestamp:1708450582334832, end_log_id:18446744073709551615, end_log_timestamp:-1, next:null})

[2024-09-12 11:47:53.042725] INFO [SHARE.SCHEMA] ob_schema_getter_guard.cpp:901 [15817][370][YB42AC119778-000621E3F27DB528] [lt=7] [dc=0] table not exist(fetch_tenant_id=1, tenant_id=1001, database_id=1100611139403777, session_id=18446744073709551615, table_name=__all_clog_history_info_v2, is_index=false, snapshot_version=1, is_schema_split=0, schema_version=1, schema_mgr_tenant_id=0)

[2024-09-12 11:47:53.042733] INFO [SQL.RESV] ob_dml_resolver.cpp:6674 [15817][370][YB42AC119778-000621E3F27DB528] [lt=7] [dc=0] table not exist(tenant_id=1001, database_id=1100611139403777, table_name=__all_clog_history_info_v2, ret=-5019)

[2024-09-12 11:47:53.042738] INFO [SHARE.SCHEMA] ob_synonym_mgr.cpp:462 [15817][370][YB42AC119778-000621E3F27DB528] [lt=3] [dc=0] synonym is not exist(tenant_id=1001, database_id=1100611139403777, name=__all_clog_history_info_v2)

[2024-09-12 11:47:53.042768] WARN [SERVER] query (ob_inner_sql_connection.cpp:861) [15817][370][YB42AC119778-000621E3F27DB528] [lt=3] [dc=0] failed to process record(executor={ObIExecutor:, sql:"REPLACE INTO oceanbase.__all_clog_history_info_v2 (table_id, partition_idx, partition_cnt, start_log_id, start_log_timestamp, svr_ip, svr_port, end_log_id, end_log_timestamp) VALUES(1100611139404027, 0, 0, 7474, 1708450585983966, ‘172.17.151.120’, 2882, 18446744073709551615, -1) "}, record_ret=-5019, ret=-5019)

[2024-09-12 11:47:53.043195] INFO [SHARE.SCHEMA] ob_schema_getter_guard.cpp:901 [15817][370][YB42AC119778-000621E3F27DB52B] [lt=3] [dc=0] table not exist(fetch_tenant_id=1, tenant_id=1, database_id=1099511627777, session_id=18446744073709551615, table_name=__all_clog_history_info_v2, is_index=false, snapshot_version=1, is_schema_split=0, schema_version=1, schema_mgr_tenant_id=0)

[2024-09-12 11:47:53.043201] INFO [SQL.RESV] ob_dml_resolver.cpp:6674 [15817][370][YB42AC119778-000621E3F27DB52B] [lt=5] [dc=0] table not exist(tenant_id=1, database_id=1099511627777, table_name=__all_clog_history_info_v2, ret=-5019)

[2024-09-12 11:47:53.043602] INFO [SHARE.SCHEMA] ob_schema_getter_guard.cpp:901 [15817][370][YB42AC119778-000621E3F27DB52E] [lt=2] [dc=1] table not exist(fetch_tenant_id=1, tenant_id=1002, database_id=1101710651031553, session_id=18446744073709551615, table_name=__all_clog_history_info_v2, is_index=false, snapshot_version=1, is_schema_split=0, schema_version=1, schema_mgr_tenant_id=0)

[2024-09-12 11:47:53.043608] INFO [SQL.RESV] ob_dml_resolver.cpp:6674 [15817][370][YB42AC119778-000621E3F27DB52E] [lt=4] [dc=0] table not exist(tenant_id=1002, database_id=1101710651031553, table_name=__all_clog_history_info_v2, ret=-5019)

[2024-09-12 11:47:53.043916] WARN [SERVER] get_master_root_server (ob_service.cpp:3375) [15866][466][YB42AC119778-000621E3F26D2A13] [lt=3] [dc=0] not master rootserver(ret=-4638, master_rs=“172.17.151.120:2882”)

[2024-09-12 11:47:53.044073] WARN [CLOG] handle (ob_clog_history_reporter.cpp:315) [15817][370][Y0-0000000000000000] [lt=2] [dc=0] exec partition task fail(ret=-4023, ret=“OB_EAGAIN”, partition_task={pkey:{tid:1101710651031814, partition_id:0, part_cnt:0}, head:{partition op str:“ONLINE”, svr:“172.17.151.120:2882”, start_log_id:7474, start_log_timestamp_:1708450578878666, end_log_id:18446744073709551615, end_log_timestamp:-1, next:null}, tail:{partition op str:“ONLINE”, svr:“172.17.151.120:2882”, start_log_id:7474, start_log_timestamp_:1708450578878666, end_log_id:18446744073709551615, end_log_timestamp:-1, next:null}})

[2024-09-12 11:47:53.044280] WARN log_user_error_and_warn (ob_rpc_proxy.cpp:300) [15629][0][Y0-0000000000000000] [lt=1] [dc=0]

[2024-09-12 11:47:53.044410] WARN [SERVER] get_master_root_server (ob_service.cpp:3375) [15866][466][YB42AC119778-000621E3F26D2A18] [lt=4] [dc=0] not master rootserver(ret=-4638, master_rs=“172.17.151.120:2882”)

[2024-09-12 11:47:53.044561] INFO [SHARE] ob_inner_config_root_addr.cpp:172 [15629][0][Y0-0000000000000000] [lt=4] [dc=0] fetch addr_list &readonly_addr_list(ret=0, addr_list=[{server:“172.17.151.120:2882”, role:2, sql_port:2881, replica_type:0, reserved:0, property:{memstore_percent_:100}}], readonly_addr_list=[], cluster_type=1)

[2024-09-12 11:47:53.044722] WARN [SERVER] inner_close (ob_inner_sql_result.cpp:152) [15817][370][YB42AC119778-000621E3F27DB537] [lt=5] [dc=0] result set close failed(ret=-5019, need_retry=false)

[2024-09-12 11:47:53.044929] WARN log_user_error_and_warn (ob_rpc_proxy.cpp:300) [15629][0][Y0-0000000000000000] [lt=4] [dc=0]

[2024-09-12 11:47:53.045064] WARN [SERVER] get_master_root_server (ob_service.cpp:3375) [15866][466][YB42AC119778-000621E3F26D2A1F] [lt=3] [dc=0] not master rootserver(ret=-4638, master_rs=“172.17.151.120:2882”)

[2024-09-12 11:47:53.045230] WARN [RPC.OBRPC] rpc_call (ob_rpc_proxy.ipp:567) [15629][0][Y0-0000000000000000] [lt=3] [dc=0] execute rpc fail(ret=-4638, dst=“172.17.151.120:2882”)

[2024-09-12 11:47:53.045601] WARN log_user_error_and_warn (ob_rpc_proxy.cpp:300) [15629][0][Y0-0000000000000000] [lt=5] [dc=0]

[2024-09-12 11:47:53.045723] WARN [RPC.OBRPC] rpc_call (ob_rpc_proxy.ipp:567) [15629][0][YB42AC119778-000621E3F26D2A26] [lt=3] [dc=0] execute rpc fail(ret=-4638, dst=“172.17.151.120:2882”)

[2024-09-12 11:47:53.045886] WARN [SERVER] get_master_root_server (ob_service.cpp:3375) [15866][466][YB42AC119778-000621E3F26D2A28] [lt=3] [dc=0] not master rootserver(ret=-4638, master_rs=“172.17.151.120:2882”)

[2024-09-12 11:47:53.046052] WARN [SERVER] get_master_root_server (ob_service.cpp:3375) [15866][466][YB42AC119778-000621E3F26D2A2A] [lt=3] [dc=0] not master rootserver(ret=-4638, master_rs=“172.17.151.120:2882”)

[2024-09-12 11:47:53.046184] WARN log_user_error_and_warn (ob_rpc_proxy.cpp:300) [15629][0][Y0-0000000000000000] [lt=4] [dc=0]

[2024-09-12 11:47:53.046389] INFO [STORAGE] ob_pg_sstable_garbage_collector.cpp:188 [15761][262][Y0-0000000000000000] [lt=8] [dc=0] do one gc free sstable by queue(ret=0, free sstable cnt=0)

[2024-09-12 11:47:53.046513] WARN [SQL.RESV] resolve_table_relation_recursively (ob_dml_resolver.cpp:6639) [15817][370][YB42AC119778-000621E3F27DB53A] [lt=4] [dc=0] synonym not exist(tenant_id=1, database_id=1099511627777, table_name=__all_clog_history_info_v2, ret=-5019)

[2024-09-12 11:47:53.046692] WARN [RPC.OBRPC] rpc_call (ob_rpc_proxy.ipp:567) [15629][0][YB42AC119778-000621E3F26D2A2F] [lt=3] [dc=0] execute rpc fail(ret=-4638, dst=“172.17.151.120:2882”)

[2024-09-12 11:47:53.046840] WARN [STORAGE] get_curr_leader_and_memberlist (ob_partition_service.cpp:4548) [15629][0][Y0-0000000000000000] [lt=4] [dc=0] fail to get leader from election(ret=-7002, pkey={tid:1099511627777, partition_id:0, part_cnt:1})

[2024-09-12 11:47:53.046933] WARN log_user_error_and_warn (ob_rpc_proxy.cpp:300) [15629][0][Y0-0000000000000000] [lt=3] [dc=0]

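[Reproduction steps] Operations performed before and after the problem appeared
[Attachments and logs] We recommend collecting diagnostic information with obdiag, the OceanBase agile diagnostic tool. For details, see the link (right-click to open):

[SOP Series 22] Fault diagnosis, step one (self-service diagnosis and collecting diagnostic information)

[Note] The forum assistant, built on an LLM and RAG over the open-source documentation, is now open for testing. Type [@论坛小助手] when posting to summon it. Feel free to try it!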

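Based on the log you provided, your observer service fails to start and the error log shows error code -4016, which usually indicates an internal error. According to the documentation, error code -4016 denotes an internal error, and the specific cause can vary: configuration problems, insufficient system resources, or anomalies in other internal components [1]

In the log you provided, the error occurs mainly in the call to notify_scan_finished_. The error message is: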

invalid scan_confirmed_log_cnt(ret=-4016, ret="OB_ERR_UNEXPECTED", scan_confirmed_log_cnt=586, next_ilog_id=8355, last_replay_log_id=7474, pkey={tid:1099511627983, partition_id:15, part_cnt:0})

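This shows that something unexpected happened while scanning the logs, so the service could not keep running. The message does not state the exact cause, but from the error code and the message it is reasonable to infer that the log files are in an inconsistent state or that some internal state was not updated correctly.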
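To compare these counters across several failed starts, the numeric fields can be pulled out of the error line with a small shell helper. This is only a minimal sketch: `log_field` is a hypothetical name that is not part of OceanBase, and it handles numeric fields only.

```shell
#!/bin/bash
# Hypothetical helper: print the numeric value of one key=value field
# from an observer.log line. Usage: log_field <key> <line>
log_field() {
  local key="$1" line="$2"
  # Match "<key>=<optional minus><digits>" preceded by "(", "," or a space,
  # and print only the value.
  printf '%s\n' "$line" \
    | sed -n "s/.*[(, ]${key}=\(-\{0,1\}[0-9][0-9]*\).*/\1/p"
}

# The error message quoted above, copied from the log.
line='invalid scan_confirmed_log_cnt(ret=-4016, ret="OB_ERR_UNEXPECTED", scan_confirmed_log_cnt=586, next_ilog_id=8355, last_replay_log_id=7474, pkey={tid:1099511627983, partition_id:15, part_cnt:0})'

for key in ret scan_confirmed_log_cnt next_ilog_id last_replay_log_id; do
  echo "$key=$(log_field "$key" "$line")"
done
```

Running the same extraction on the matching line from each failed start shows whether the same partition and log IDs are reported every time.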

The log also contains some other warnings, for example:

not master rootserver(ret=-4638, master_rs="172.17.151.120:2882")

This suggests that observer could not identify the master rootserver during startup, which may be caused by a cluster configuration problem or a network problem.

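Taking the above together, I suggest the following steps to troubleshoot the problem:

  1. Check the cluster configuration: make sure it is correct, especially the parts related to communication between nodes, such as IP addresses and ports.

  2. Check the log files: confirm that the log files are complete and not corrupted. If any log files are corrupted, they may need to be repaired or rebuilt.

  3. Network connectivity: make sure the observer node can communicate with the other nodes, in particular the master rootserver.

  4. Contact technical support: because error code -4016 usually indicates an internal error, deeper technical analysis may be needed to find the exact cause. Consider contacting the OceanBase technical support team for further help.

If you have already tried these steps and the problem persists, use the obdiag tool to collect detailed diagnostic information and contact the OceanBase technical support team for further analysis.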
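Before collecting full diagnostics, a quick tally of return codes in observer.log can show which errors dominate. The sketch below is an illustration only: `summarize_ret_codes` is a hypothetical helper, and the log path is an assumption based on the /usr/local/oceanbase/ directory used in the start script.

```shell
#!/bin/bash
# Hypothetical helper: count ERROR and WARN lines per return code.
# Only the first "ret=-NNNN" on each line is counted.
summarize_ret_codes() {
  local logfile="$1"
  awk '/\] (ERROR|WARN) / && match($0, /ret=-[0-9]+/) {
         print substr($0, RSTART, RLENGTH)
       }' "$logfile" \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'   # normalize to "<count> <code>"
}

# Assumed location of the log; adjust to the actual install.
LOG="${LOG:-/usr/local/oceanbase/log/observer.log}"
if [ -f "$LOG" ]; then
  summarize_ret_codes "$LOG"
else
  echo "log file not found: $LOG"
fi
```

INFO lines are skipped on purpose, so the repeated "table not exist" messages do not drown out the ERROR entries.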

For more information, see:

  1. 4000 ~ 4499
  2. 0001 ~ 3999

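This assistant is actually fairly useful. After reading the log, it looks to me like the clog was deleted, which matches the second point it raised :face_with_peeking_eye:
I suggest checking whether the files in the CLOG directory are still there.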
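A rough way to do that check from the shell is sketched below. It rests on two assumptions that should be verified against the actual deployment: that the clog files live under store/clog in the observer home directory, and that they are named with consecutive integers. `find_id_gaps` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: report missing numeric file IDs between the
# smallest and largest purely numeric file name in a directory.
find_id_gaps() {
  local dir="$1" prev="" id next
  for id in $(ls "$dir" 2>/dev/null | grep -E '^[0-9]+$' | sort -n); do
    if [ -n "$prev" ]; then
      next=$((prev + 1))
      # Print every ID skipped between two neighboring files.
      while [ "$next" -lt "$id" ]; do
        echo "missing: $next"
        next=$((next + 1))
      done
    fi
    prev="$id"
  done
}

# Assumed clog location; adjust if data_dir points elsewhere.
CLOG_DIR="${CLOG_DIR:-/usr/local/oceanbase/store/clog}"
if [ -d "$CLOG_DIR" ]; then
  ls -l "$CLOG_DIR" | head
  find_id_gaps "$CLOG_DIR"
else
  echo "clog directory not found: $CLOG_DIR"
fi
```

An empty directory or a reported gap would support the guess that clog files were removed. No output from `find_id_gaps` only means the numbering is contiguous, not that the files are intact.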

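This is probably an error while iterating the Clog files, which usually means something is wrong with the Clog files themselves. Check the log directory and post the observer.log for this time period, as complete as possible.
The 3.x Community Edition is no longer maintained. I recommend moving to OB 4.2.1.x, which is the long-term maintained version.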