observer crash error

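[Product name]

OceanBase

[Product version]

OceanBase CE 3.1.1

[Problem description]

In a three-zone, three-replica cluster, the observer service on one node suddenly became unavailable.

observer.log reported: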

CRASH ERROR!!! sig=6, sig_code=-6, sig_addr=3ed0002804e, timestamp=1648708206340743, tid=164927, tname=TNT_L0_1001, trace_id=12378264125728-1645876233425640, extra_info=((null)), lbt=0xb1f5a99 0xb1f6b50 0xb1f5d39 0xb1f6a5f 0x7f37340815cf 0x7f37338cf207 0x7f37338d08f7 0x7f37338c8025 0x7f37338c80d1 0x438662d 0x3bd8b1c 0x3b39e0c 0x4811d93 0x4870123 0x48106cf 0x3982dcb 0x3c4c513 0x3c4c89b 0x3c4c89b 0x4190be3 0x41e00d3 0x41901af

A stack file was also generated: stack.163918.20223316305

The logs may contain sensitive information; I can provide them separately.

  1. Run ps -ef | grep observer to confirm that the observer process is actually down.
  2. Please use the addr2line tool to resolve the following stack addresses and post the result: 0xb1f5a99 0xb1f6b50 0xb1f5d39 0xb1f6a5f 0x7f37340815cf 0x7f37338cf207 0x7f37338d08f7 0x7f37338c8025 0x7f37338c80d1 0x438662d 0x3bd8b1c 0x3b39e0c 0x4811d93 0x4870123 0x48106cf 0x3982dcb 0x3c4c513 0x3c4c89b 0x3c4c89b 0x4190be3 0x41e00d3 0x41901af. For example: addr2line -pCfe $observer $symbol_addr
  3. Search observer.log.* for ERROR/WARN-level log entries.
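
Step 2 above can be scripted. The following is a minimal sketch, not an official OceanBase tool: it pulls the addresses after lbt= out of a CRASH ERROR line and assembles the addr2line command with the same flags suggested in this thread. The crash line is abbreviated to its first five addresses, and the binary path /home/admin/oceanbase/bin/observer is only an assumed location; substitute the path of the observer binary that actually crashed.

```python
import re
import shlex

# Abbreviated copy of the crash line from observer.log (first five frames only).
CRASH_LINE = (
    "CRASH ERROR!!! sig=6, sig_code=-6, sig_addr=3ed0002804e, "
    "timestamp=1648708206340743, tid=164927, tname=TNT_L0_1001, "
    "trace_id=12378264125728-1645876233425640, extra_info=((null)), "
    "lbt=0xb1f5a99 0xb1f6b50 0xb1f5d39 0xb1f6a5f 0x7f37340815cf"
)


def extract_lbt(line):
    """Return the hex addresses that follow 'lbt=' in a crash line."""
    match = re.search(r"lbt=((?:0x[0-9a-fA-F]+\s*)+)", line)
    return match.group(1).split() if match else []


def addr2line_command(binary, addresses):
    """Build the addr2line argument list (flags as used in this thread)."""
    return ["addr2line", "-pCfe", binary, *addresses]


addrs = extract_lbt(CRASH_LINE)
# Hypothetical binary path; it must be the exact build that produced the crash.
cmd = addr2line_command("/home/admin/oceanbase/bin/observer", addrs)
print(shlex.join(cmd))
```

The addresses only resolve correctly against the same binary that was running when the crash happened; a different build will yield wrong or missing symbols.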

The observer process was already gone at that point, and __all_server already showed this machine as unavailable.



addr2line output:

oceanbase::common::safe_backtrace(char*, long, long&) at ./build_debug/deps/oblib/src/lib/./deps/oblib/src/lib/signal/ob_signal_utils.cpp:87

oceanbase::common::coredump_cb(int, siginfo_t*) at ./build_debug/deps/oblib/src/lib/./deps/oblib/src/lib/signal/ob_signal_handlers.cpp:83

oceanbase::common::ob_signal_handler(int, siginfo_t*, void*) at ./build_debug/deps/oblib/src/lib/./deps/oblib/src/lib/signal/ob_signal_handlers.cpp:61

oceanbase::common::handler(int, siginfo_t*, void*) at ./build_debug/deps/oblib/src/lib/./deps/oblib/src/lib/signal/ob_signal_handlers.cpp:31

?? ??:0

?? ??:0

?? ??:0

?? ??:0

?? ??:0

oceanbase::sql::ObQueryRange::serialize(char*, long, long&) const at ./build_debug/src/sql/./src/sql/rewrite/ob_query_range.cpp:4519

oceanbase::common::serialization::EnumEncoder<false, oceanbase::sql::ObQueryRange>::encode(char*, long, long&, oceanbase::sql::ObQueryRange const&) at ./build_debug/src/sql/./deps/oblib/src/lib/utility/serialization.h:1631

int oceanbase::common::serialization::encode<oceanbase::sql::ObQueryRange>(char*, long, long&, oceanbase::sql::ObQueryRange const&) at ./build_debug/src/sql/./deps/oblib/src/lib/utility/serialization.h:1649

oceanbase::sql::ObTableScanSpec::serialize_(char*, long, long&) const at ./build_debug/src/sql/./src/sql/engine/table/ob_table_scan_op.cpp:231

oceanbase::sql::ObTableScanSpec::serialize_dispatch_(char*, long, long&, std::integral_constant<bool, false>) const at ./build_debug/src/sql/./src/sql/engine/table/ob_table_scan_op.h:80

oceanbase::sql::ObTableScanSpec::serialize(char*, long, long&) const at ./build_debug/src/sql/./src/sql/engine/table/ob_table_scan_op.cpp:231

oceanbase::sql::ObOpSpec::serialize(char*, long, long&, oceanbase::sql::ObPhyOpSeriCtx&) const at ./build_debug/src/sql/./src/sql/engine/ob_operator.h:202

oceanbase::sql::ObPxTreeSerializer::serialize_tree(char*, long, long&, oceanbase::sql::ObOpSpec&, bool, oceanbase::sql::ObPhyOpSeriCtx*) at ./build_debug/src/sql/./src/sql/engine/px/ob_px_util.cpp:1453

oceanbase::sql::ObPxTreeSerializer::serialize_tree(char*, long, long&, oceanbase::sql::ObOpSpec&, bool, oceanbase::sql::ObPhyOpSeriCtx*) at ./build_debug/src/sql/./src/sql/engine/px/ob_px_util.cpp:1466

oceanbase::sql::ObPxTreeSerializer::serialize_tree(char*, long, long&, oceanbase::sql::ObOpSpec&, bool, oceanbase::sql::ObPhyOpSeriCtx*) at ./build_debug/src/sql/./src/sql/engine/px/ob_px_util.cpp:1466

oceanbase::sql::ObPxRpcInitSqcArgs::serialize_(char*, long, long&) const at ./build_debug/src/sql/./src/sql/engine/px/ob_dfo.cpp:502

oceanbase::sql::ObPxRpcInitSqcArgs::serialize_dispatch_(char*, long, long&, std::integral_constant<bool, false>) const at ./build_debug/src/sql/./src/sql/engine/px/ob_dfo.h:955

oceanbase::sql::ObPxRpcInitSqcArgs::serialize(char*, long, long&) const at ./build_debug/src/sql/./src/sql/engine/px/ob_dfo.cpp:479

Please run observer -V and post the output.

This stack looks a bit odd: the paths are under build_debug, whereas a normally built package should be under build_release.

This version is a patched build we made ourselves to fix the readonly replica bug:

https://github.com/oceanbase/oceanbase/pull/786/commits/7a0d3fc9c7db0412570b82b12e61856a16ceec13

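The stack matches neither 3.1.1 CE nor master. Please look at your local code around ob_query_range.cpp:4519, 30 lines above and below.

Also, we recommend running a binary built in release mode; debug mode incurs a significant performance penalty.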

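Confirmed: this is indeed a bug in version 3.1.1.

Root cause: extracting the query range consumed a large amount of memory, and the algorithm did not check for query timeout while iterating. As a result, the query kept consuming memory and only exited once the SQL ARENA memory was exhausted, which disrupted the normal execution of other queries.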
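
To illustrate the class of problem described above, here is a minimal sketch of an iterative expansion loop that checks a deadline as it goes. It is not the OceanBase fix and does not mirror ObQueryRange; the names expand_ranges, QueryTimeout, and check_every are hypothetical and exist only for this example.

```python
import time


class QueryTimeout(Exception):
    """Raised when the (hypothetical) query deadline has passed."""


def expand_ranges(initial, expand, deadline, check_every=64, clock=time.monotonic):
    """Iteratively expand work items, checking the deadline periodically.

    expand(item) returns a list of follow-up items (possibly empty).
    """
    pending = list(initial)
    done = []
    steps = 0
    while pending:
        steps += 1
        # Without this check, a non-converging expansion grows until memory runs out.
        if steps % check_every == 0 and clock() >= deadline:
            raise QueryTimeout(f"gave up after {steps} steps")
        item = pending.pop()
        done.append(item)
        pending.extend(expand(item))
    return done


# Bounded work finishes normally.
result = expand_ranges(
    [3], lambda n: [n - 1] if n > 0 else [], deadline=time.monotonic() + 1.0
)

# Work that never converges is cut off by the deadline instead of growing forever.
try:
    expand_ranges([0], lambda n: [n + 1, n + 1], deadline=time.monotonic() + 0.05)
    timed_out = False
except QueryTimeout:
    timed_out = True
```

Checking the clock only every check_every steps keeps the overhead low while still bounding how long, and therefore how much memory, a runaway iteration can consume.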

It was fixed in this commit: https://github.com/oceanbase/oceanbase/commit/9ff1baa323e98e00290e2766db7ade67350061c9. Upgrading to 3.1.2 or later resolves the issue.
