OCP告警:OBServer 非预期的内部错误

【 使用环境 】生产环境
【 OB or 其他组件 】OCP
【 使用版本 】4.3.0-20240617185317
【问题描述】OCP接管集群后告警

observer.log:
[2024-08-15 16:49:19.517497] INFO [STORAGE.TRANS] replay (ob_id_service.cpp:210) [636658][T1004_ReplaySrv][T1004][Y0-0000000000000000-0-0] [lt=11] ObIDService replay(service_type_=0, ret=0, ls_log={last_id_:1723711768702618829, limit_id_:1723711768702618829}, base_header={version:1, log_type:7, flag:0, replay_hint:0})
[2024-08-15 16:49:19.517501] WDIAG [LIB] ~ObTimeGuard (utility.h:855) [636322][T1002_ReplaySrv][T1002][Y0-0000000000000000-0-0] [lt=33][errcode=-4389] destruct(*this=time guard ‘ls_meta’ cost too much time, used=4081409, time_dist: after lock=4081363, (null)=3)
[2024-08-15 16:49:19.517507] ERROR issue_dba_error (ob_log.cpp:1868) [636658][T1004_ReplaySrv][T1004][Y0-0000000000000000-0-0] [lt=7][errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4389, file=“ob_ls_adapter.cpp”, line_no=106, info=“single replay task cost too much time. replay may be delayed”)
[2024-08-15 16:49:19.517511] INFO [STORAGE.TRANS] handle_replay_result (ob_id_service.cpp:237) [636322][T1002_ReplaySrv][T1002][Y0-0000000000000000-0-0] [lt=8] handle reply result(ret=0, service_type_=0, last_id=1723711767138154163, limited_id=1723711767138154163, log_ts={val:1723711750472305210, v:0})
[2024-08-15 16:49:19.517515] WDIAG [LIB] ~ObTimeGuard (utility.h:855) [636309][T1002_TxLoopWor][T1002][Y0-0000000000000000-0-0] [lt=28][errcode=-4389] destruct(*this=time guard ‘ls_meta’ cost too much time, used=1046556, time_dist: after lock=1046548)
[2024-08-15 16:49:19.517526] WDIAG [STORAGE.TRANS] for_each_remove_ (ob_tx_retain_ctx_mgr.cpp:386) [636309][T1002_TxLoopWor][T1002][Y0-0000000000000000-0-0] [lt=10][errcode=-4389] [RetainCtxMgr] remove retain ctx use too much time(use_ts=1046562, remove_count=0)
[2024-08-15 16:49:19.517523] INFO [STORAGE.TRANS] replay (ob_id_service.cpp:210) [636322][T1002_ReplaySrv][T1002][Y0-0000000000000000-0-0] [lt=11] ObIDService replay(service_type_=0, ret=0, ls_log={last_id_:1723711767138154163, limit_id_:1723711767138154163}, base_header={version:1, log_type:7, flag:0, replay_hint:0})
[2024-08-15 16:49:19.517508] INFO [STORAGE.TRANS] handle_tx_commit_result_ (ob_trans_service_v4.cpp:550) [636657][T1004_ApplySrv0][T1004][YB42C0A837A2-00061FB4891BE0DC-0-0] [lt=42] handle tx commit result(ret=0, ref_cnt_0=3, tx={this:0x7f35ecd2f960, tx_id:{txid:1041191}, state:10, addr:“192.168.55.162:2882”, tenant_id:1004, session_id:3221828347, assoc_session_id:3221828347, xid:NULL, xa_mode:"", xa_start_addr:“0.0.0.0:0”, access_mode:0, tx_consistency_type:0, isolation:1, snapshot_version:{val:18446744073709551615, v:3}, snapshot_scn:0, active_scn:1723711750854298, op_sn:4, alloc_ts:1723711720865542, active_ts:1723711750851488, commit_ts:1723711750851488, finish_ts:1723711759516551, timeout_us:86400000000, lock_timeout_us:-1, expire_ts:1723798150851488, coord_id:{id:1002}, parts:[{id:{id:1002}, addr:“192.168.55.162:2882”, epoch:1084893382459217, first_scn:1723711750854296, last_scn:1723711750854299, last_touch_ts:1723711750854299}], exec_info_reap_ts:1723711750854298, commit_version:{val:1723711750851302499, v:0}, commit_times:0, commit_cb:0x7f3613af5788, cluster_id:1723708822, cluster_version:17180000514, flags_.SHADOW:false, flags_.INTERRUPTED:false, flags_.BLOCK:false, flags_.REPLICA:false, can_elr:true, cflict_txs:[], abort_cause:0, commit_expire_ts:1723711760780639, commit_task_.is_registered():false, ref:3}, commit_fin=true, result=0)
[2024-08-15 16:49:19.517530] INFO [COMMON] advance_clock (ob_fifo_arena.cpp:374) [636321][T1002_ReplaySrv][T1002][Y0-0000000000000000-0-0] [lt=4] current clock is (clock_=0, max_seq_=0, mem_limit=888772996, cur_mem_hold=77594624, attr_.tenant_id_=1002)
[2024-08-15 16:49:19.517537] WDIAG [LIB] ~ObTimeGuard (utility.h:855) [636309][T1002_TxLoopWor][T1002][Y0-0000000000000000-0-0] [lt=10][errcode=-4389] destruct(*this=time guard ‘try_gc_retain_ctx’ cost too much time, used=1046576, time_dist: (null)=0)
[2024-08-15 16:49:19.517538] ERROR issue_dba_error (ob_log.cpp:1868) [636322][T1002_ReplaySrv][T1002][Y0-0000000000000000-0-0] [lt=12][errcode=-4388] Unexpected internal error happen, please checkout the internal errcode(errcode=-4389, file=“ob_ls_adapter.cpp”, line_no=106, info=“single replay task cost too much time. replay may be delayed”)
[2024-08-15 16:49:19.517552] INFO [STORAGE] ~ObStorageTableGuard (ob_storage_table_guard.cpp:153) [636321][T1002_ReplaySrv][T1002][Y0-0000000000000000-0-0] [lt=12] throttle statics(need_speed_limit=false, last_throttle_status=false, last_print_log_time=1723711744087363, stat={total_throttle_time_us:0, total_skip_throttle_time_us:0, last_log_timestamp:1723711759516551, last_throttle_status:false, 0=0, 1=0, 2=0, 3=0})
[2024-08-15 16:49:19.517545] INFO [STORAGE.TRANS] print_retain_ctx_info (ob_tx_retain_ctx_mgr.cpp:273) [636309][T1002_TxLoopWor][T1002][Y0-0000000000000000-0-0] [lt=6] [RetainCtxMgr] print retain ctx(ls_id={id:1}, this={retain_ctx_list_.get_size():274, max_wait_ckpt_ts_:{val:1723711692771255671, v:0}, last_push_gc_task_ts_:1723711098094878, skip_remove_cnt_:1}, retain_ctx_list_.get_first()={cause_:0, tx_id_:{txid:1006578}, ls_id_:{id:1}, tx_ctx_:0x7f36a629bdd0}, retain_ctx_list_.get_last()={cause_:0, tx_id_:{txid:1006874}, ls_id_:{id:1}, tx_ctx_:0x7f3637d2c050})
[2024-08-15 16:49:19.517542] TRACE [TRACE] reset (ob_trans_define_v4.cpp:461) [636657][T1004_ApplySrv0][T1004][YB42C0A837A2-00061FB4891BE0DC-0-0] [lt=28] [tx slow commit][tx desc trace](TRACE=begin_ts=1723711720865615 2024-08-15 08:48:40.865615|[reuse] u=0 ret:0, addr:0x7f35ecd2f960, txid:{txid:0}, tag1:0, tag2:1, ref:1, thread_id:636736|[create_global_implicit_savepoint] u=29985271 ret:0, txid:{txid:1041191}, savepoint:1723711750854288, release:true, opid:2, ref:2, thread_id:636736|[get_read_snapshot] u=601 ret:0, expire_ts:1723711760780639, txid:{txid:1041191}, isolation_level:1, snapshot_source:1, snapshot_version:{scn_val:1723711750850453995}, snapshot_txid:1041191, snapshot_scn:1723711750854293, trace_id:“YB42C0A837A2-00061FB4891BE0DF-0-0”, ref:2, thread_id:636736|[create_global_implicit_savepoint] u=46 ret:0, txid:{txid:1041191}, savepoint:1723711750854295, release:false, opid:3, ref:2, thread_id:636736|[submit_commit_tx] u=421 ret:0, expire_ts:1723711760780639, tag1:false, tag2:"", ref:2, thread_id:636736|[handle_tx_commit_result] u=8665582 ret:0, arg:0, is_finish:true, result:0, state:10, tag1:3, ref:3, commit_version:{scn_val:1723711750851302499}, thread_id:636657|[exec_commit_cb] u=2 arg:0x7f3613af5788, ref:3, thread_id:636657|total_timeu=38651923)
[2024-08-15 16:49:19.517606] INFO [LIB] stat (utility.h:1157) [636321][T1002_ReplaySrv][T1002][Y0-0000000000000000-0-0] [lt=13] [REPLAY STAT REPLAY TASK IN QUEUE TIME](cur_stat_count=1, stat_interval=1000000, avg (count/cost)=101, this=0x7f36a5a048f8, extra_info="")
[2024-08-15 16:49:19.517513] EDIAG [CLOG] replay (ob_ls_adapter.cpp:106) [636658][T1004_ReplaySrv][T1004][Y0-0000000000000000-0-0] [lt=6][errcode=-4389] single replay task cost too much time. replay may be delayed(replay_task={ls_id_:{id:1}, log_type:7, log_type:“TIMESTAMP”, lsn_:{lsn:33958539}, scn_:{val:1723711752147585641, v:0}, is_pre_barrier_:false, is_post_barrier_:false, log_size_:40, replay_hint_:0, is_raw_write_:false, first_handle_ts_:1723711756155303, replay_cost_:3362201, retry_cost_:0, log_buf_:0x7f3610a65f00}) BACKTRACE:0x113b8115 0x6ab8be9 0x4d6acd4 0x6ab8954 0x6ab0bf1 0x8757ba0 0x875702c 0x884dd18 0x4b13ab8 0x11a55a31 0x11a51fbf 0x7f37669281ca 0x7f37665838d3
[2024-08-15 16:49:19.517615] INFO [LIB] stat (utility.h:1157) [636321][T1002_ReplaySrv][T1002][Y0-0000000000000000-0-0] [lt=8] [REPLAY STAT REPLAY TASK EXECUTE COST TIME](cur_stat_count=1, stat_interval=1000000, avg (count/cost)=153, this=0x7f36a5a04b28, extra_info="")
[2024-08-15 16:49:19.517618] INFO [LIB] stat (utility.h:1157) [636658][T1004_ReplaySrv][T1004][Y0-0000000000000000-0-0] [lt=104] [REPLAY STAT REPLAY TASK IN QUEUE TIME](cur_stat_count=2, stat_interval=1000000, avg (count/cost)=8, this=0x7f36702048f8, extra_info="")
[2024-08-15 16:49:19.517625] INFO [LIB] stat (utility.h:1157) [636658][T1004_ReplaySrv][T1004][Y0-0000000000000000-0-0] [lt=6] [REPLAY STAT REPLAY TASK EXECUTE COST TIME](cur_stat_count=2, stat_interval=1000000, avg (count/cost)=1681292, this=0x7f3670204b28, extra_info="")
[2024-08-15 16:49:19.517549] EDIAG [CLOG] replay (ob_ls_adapter.cpp:106) [636322][T1002_ReplaySrv][T1002][Y0-0000000000000000-0-0] [lt=11][errcode=-4389] single replay task cost too much time. replay may be delayed(replay_task={ls_id_:{id:1}, log_type:7, log_type:“TIMESTAMP”, lsn_:{lsn:25254194}, scn_:{val:1723711750472305210, v:0}, is_pre_barrier_:false, is_post_barrier_:false, log_size_:40, replay_hint_:0, is_raw_write_:false, first_handle_ts_:1723711755436086, replay_cost_:4081449, retry_cost_:0, log_buf_:0x7f360a2a2450}) BACKTRACE:0x113b8115 0x6ab8be9 0x4d6acd4 0x6ab8954 0x6ab0bf1 0x8757ba0 0x875702c 0x884dd18 0x4b13ab8 0x11a55a31 0x11a51fbf 0x7f37669281ca 0x7f37665838d3
[2024-08-15 16:49:19.517610] TRACE [TRACE] print_trace_log_if_necessary_ (ob_trans_ctx.cpp:137) [636657][T1004_ApplySrv0][T1004][YB42C0A837A2-00061FB4891BE0DF-0-0] [lt=26] [long trans] (TRACE=begin_ts=1723711750851690 2024-08-15 08:49:10.851690|[init] u=0 addr:0x7f366ffc0dd0, id:1002, trans_id:{txid:1041191}, ref:0, opid:1055|[start_trans] u=2 ret:0, left_time:86400000000, ref:2, opid:1056|[start_access] u=1 ret:0, trace_id:“YB42C0A837A2-00061FB4891BE0DF-0-0”, opid:3, data_seq:1723711750854296, pending:1, ref:2, tid:1619|[end_access] u=157 opid:3, pending:0, ref:2, tid:1619|[start_access] u=4 ret:0, trace_id:“YB42C0A837A2-00061FB4891BE0DF-0-0”, opid:3, data_seq:1723711750854299, pending:1, ref:2, tid:1619|[end_access] u=35 opid:3, pending:0, ref:2, tid:1619|[set_stc] u=42 stc:1723711750851928, opid:1061|[commit] u=22 ret:0, tid:636736, ref:3, opid:1061|[get_gts_callback] u=6 ret:-4023, srr:1723711750851584, gts:{scn_val:1723711750851018706}, ref:2, opid:1062|[after_submit_log] u=231 ret:0, log_no:0, base_ts:{scn_val:1723711750851302499}, t:{scn_val:1723711750851302499}, lsn:{lsn:6786307}|[get_gts_callback] u=7 ret:0, srr:1723711750851936, gts:{scn_val:1723711750851302499}, ref:3, opid:1063|[log_sync_succ_cb] u=8665237 ret:0, log_type:0x1, t:{scn_val:1723711750851302499}, offset:{lsn:6786307}, ref:2|[log_sync_succ_cb] u=1 ret:0, log_type:0x40, t:{scn_val:1723711750851302499}, offset:{lsn:6786307}, ref:2|[response_scheduler] u=24 ret:0, tag1:false, tag2:false, status:0, commit_version:{scn_val:1723711750851302499}|[exiting] u=45 ref:2, arg1:3221828347, opid:1065|[log_sync_succ_cb] u=0 ret:0, log_type:0x100, t:{scn_val:1723711750851302499}, offset:{lsn:6786307}, ref:1|[end_trans_cb] u=103 ret:0, arg1:0, arg2:{scn_val:1723711750851302499}, async:false, opid:1066|[destroy] u=2 opid:1066|total_timeu=8665919)[2024-08-15 16:49:19.517622] INFO [STORAGE] ~ObStorageTableGuard (ob_storage_table_guard.cpp:153) [636719][T1004_ReplaySrv][T1004][Y0-0000000000000000-0-0] [lt=5] throttle statics(need_speed_limit=false, last_throttle_status=false, last_print_log_time=1723711713536291, stat={total_throttle_time_us:0, total_skip_throttle_time_us:0, last_log_timestamp:1723711759517608, last_throttle_status:false, 0=0, 1=0, 2=0, 3=0})
[2024-08-15 16:49:19.517642] INFO [LIB] stat (utility.h:1157) [636657][T1004_ApplySrv0][T1004][YB42C0A837A2-00061FB4891BE0DF-0-0] [lt=29] [APPLY STAT CB APPEND COST TIME](cur_stat_count=2, stat_interval=5000000, avg (count/cost)=13, this=0x7f364f2ca280, extra_info="")[2024-08-15 16:49:19.517647] INFO [LIB] stat (utility.h:1157) [636657][T1004_ApplySrv0][T1004][YB42C0A837A2-00061FB4891BE0DF-0-0] [lt=5] [APPLY STAT CB IN QUEUE TIME](cur_stat_count=2, stat_interval=5000000, avg (count/cost)=6667694, this=0x7f364f2ca4b0, extra_info="")
[2024-08-15 16:49:19.517651] INFO [LIB] stat (utility.h:1157) [636657][T1004_ApplySrv0][T1004][YB42C0A837A2-00061FB4891BE0DF-0-0] [lt=3] [APPLY STAT CB WAIT COMMIT TIME](cur_stat_count=2, stat_interval=5000000, avg (count/cost)=0, this=0x7f364f2ca6e0, extra_info="")[2024-08-15 16:49:19.517654] INFO [LIB] stat (utility.h:1157) [636657][T1004_ApplySrv0][T1004][YB42C0A837A2-00061FB4891BE0DF-0-0] [lt=3] [APPLY STAT CB EXECUTE TIME](cur_stat_count=2, stat_interval=5000000, avg (count/cost)=-71, this=0x7f364f2ca910, extra_info="")[2024-08-15 16:49:19.517657] INFO [LIB] stat (utility.h:1157) [636657][T1004_ApplySrv0][T1004][YB42C0A837A2-00061FB4891BE0DF-0-0] [lt=3] [APPLY STAT CB TOTAL TIME](cur_stat_count=2, stat_interval=5000000, avg (count/cost)=6667636, this=0x7f364f2cab40, extra_info="")[2024-08-15 16:49:19.517661] WDIAG [CLOG] statistics_cb_cost_ (ob_log_apply_service.cpp:933) [636657][T1004_ApplySrv0][T1004][YB42C0A837A2-00061FB4891BE0DF-0-0] [lt=3][errcode=-4389] cb cost too much time(lsn={lsn:6786307}, scn={val:1723711750851302499, v:0}, idx=3, total_cost_time=8665437, append_cost_time=9, cb_wait_thread_time=8665210, cb_wait_commit_time=0, cb_cost_time=218, append_start_time=1723711750852171, append_finish_time=1723711750852180, cb_first_handle_time=1723711759517390, cb_finish_time=1723711759517608)

持续在报错吗?

如果是 可以过滤下 grep “EDIAG” observer.log

打印下过滤出日志的堆栈信息

addr2line -pCfe bin/observer xxx (xxx 为 上述过滤日志中 ‘lbt=’ 后面的信息)

例如:addr2line -pCfe /home/admin/oceanbase/bin/observer 0x58b9f0 0x58c00f 0x7fcb153fe61f 0x4a1342 0x9fc8fb 0x9f66f5 0x9dc80e 0x9dd097 0x9dd727 0x9e3fdf 0x9fdcd9 0x9f74eb 0x9e2c1b 0x4abcc0 0x4a801b 0x7fcb144b6444 0x4a1028

是在告警状态中,但是最后一次告警时间是2024年8月15日 16:50:20,log里面已经没有了

当前信息暂时无法判断问题原因,如果是偶现的后续可以继续观察,如果有复现可以按上面的方式提供信息。

你好!刚才又报错了,这是按照上面的方式得到的信息

oceanbase::common::ObLogger::backtrace_if_needed(oceanbase::common::ObPLogItem&, bool) 于 ??:?
void oceanbase::common::ObLogger::do_log_message<oceanbase::common::ObLogger::log_message_kvoceanbase::common::ObILogKV(char const*, int, char const*, int, char const*, unsigned long, int, char const*, oceanbase::common::ObILogKV const&&)::{lambda(char*, long, long&)#1}>(bool, char const*, int, char const*, int, char const*, bool, unsigned long, int, oceanbase::common::ObLogger::log_message_kvoceanbase::common::ObILogKV(char const*, int, char const*, int, char const*, unsigned long, int, char const*, oceanbase::common::ObILogKV const&&)::{lambda(char*, long, long&)#1}&) 于 main.cpp:?
void oceanbase::common::ObLogger::log_it<oceanbase::common::ObLogger::log_message_kvoceanbase::common::ObILogKV(char const*, int, char const*, int, char const*, unsigned long, int, char const*, oceanbase::common::ObILogKV const&&)::{lambda(char*, long, long&)#1}&>(char const*, int, char const*, int, char const*, unsigned long, int, oceanbase::common::ObLogger::log_message_kvoceanbase::common::ObILogKV(char const*, int, char const*, int, char const*, unsigned long, int, char const*, oceanbase::common::ObILogKV const&&)::{lambda(char*, long, long&)#1}&) 于 main.cpp:?
void oceanbase::common::ObLogger::log_message_kvoceanbase::common::ObILogKV(char const*, int, char const*, int, char const*, unsigned long, int, char const*, oceanbase::common::ObILogKV const&&) 于 main.cpp:?
void oceanbase::common::OB_PRINToceanbase::common::ObILogKV(char const*, int, char const*, int, char const*, unsigned long, int, char const*, char const*, oceanbase::common::ObILogKV const&&) 于 ??:?
oceanbase::logservice::ObLSAdapter::replay(oceanbase::logservice::ObLogReplayTask*)::$_498::operator()(char const*) const [clone .llvm.5890169007296046080] 于 0_cxx.cxx:?
oceanbase::logservice::ObLSAdapter::replay(oceanbase::logservice::ObLogReplayTask*) 于 ??:?
oceanbase::logservice::ObLogReplayService::handle(void*) 于 ??:?
oceanbase::common::ObSimpleThreadPool::run1() 于 ??:?
oceanbase::lib::Threads::run(long) 于 ??:?
oceanbase::lib::thread::__th_start(void*) 于 0_cxx.cxx:?
?? ??:0
?? ??:0

报错时间observer.log日志麻烦也提供下。

不好意思,又覆盖了 :melting_face: 下次再出现的时候我把两个都提供一下。

ob日志默认应该是保留12小时呢吧

但我这只有半个小时左右的log,请问在哪里可以修改保留时间吗

image

observer日志保留相关咨询 - #4,来自 11小透明r7vfmm_gaMjAxODUwMDg2Ni4xNzIwNTA0NDU0_ga_T35KTM57DZ*MTcyNDIxOTQ5My45NS4xLjE3MjQyMjU4ODAuNy4wLjA.

可以参考下这个帖子

告警频率怎样呢?后续还持续告警吗

后续没有告警了,谢谢老师