OceanBase single-node service failure

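【Environment】 Test environment
【OB or other component】
【Version】 3.1.4-OceanBase CE
【Problem description】 The cluster status is running and obproxy is normal, but oceanbase-ce is stopped.
【Steps to reproduce】 Relevant operations before and after the problem appeared
【Symptoms and impact】
Investigating the cause of the failure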

  1. Errors in observer.log
    [root@dev018 log]# tail -2 observer.log
    [2022-10-11 21:37:01.634695] INFO [STORAGE] ob_pg_sstable_garbage_collector.cpp:188 [15179][262][Y0-0000000000000000] [lt=7] [dc=0] do one gc free sstable by queue(ret=0, free sstable cnt=0)
    CRASH ERROR!!! sig=11, sig_code=128, sig_addr=0, timestamp=1665495421634940, tid=15255, tname=ILOGFlush, trace_id=0-0, extra_info=((null)), lbt=0x9ab6278 0x9aa6b18 0x7fe4d607a62f 0x7514558 0x74ebedd 0x744f5d0 0x744ce07 0x7449f3a 0x9a223b6 0x340b9ae 0x2cabf01 0x9820da4 0x981f791 0x981c24e

  2. Found stack* files
    [root@dev018 observer]# ls
    admin etc etc3 log stack.10853.2022101151714 stack.15046.2022101113371 store
    bin etc2 lib run stack.13191.202210117413 stack.17981.2022101062111
    Inspecting the error content:
    [root@dev018 observer]# tail -3 stack.15046.2022101113371
    tid: 15889, tname: RestoreReporter, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x373b726 0x2ca95d3 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
    tid: 15955, tname: TTLScheduler, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d6076de2 0x7adff48 0x9a22088 0x340b9ae 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
    tid: 3504, tname: TNT_L0_1001, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d5989e29 0x98217ba 0x931bc82 0x92e53fc 0x92eff18 0x92f05e6 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
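
The CRASH ERROR line from step 1 packs the signal, the crashing thread, and the raw backtrace into one record. As a minimal sketch, the fields can be pulled apart as below. The function name `parse_crash_line` is hypothetical, the field layout is inferred only from the single line shown above, and `timestamp` is assumed to be microseconds since the Unix epoch.

```python
import re
import signal
from datetime import datetime, timedelta, timezone

# The crash record copied from observer.log above.
CRASH_LINE = (
    "CRASH ERROR!!! sig=11, sig_code=128, sig_addr=0, "
    "timestamp=1665495421634940, tid=15255, tname=ILOGFlush, trace_id=0-0, "
    "extra_info=((null)), lbt=0x9ab6278 0x9aa6b18 0x7fe4d607a62f 0x7514558 "
    "0x74ebedd 0x744f5d0 0x744ce07 0x7449f3a 0x9a223b6 0x340b9ae 0x2cabf01 "
    "0x9820da4 0x981f791 0x981c24e"
)


def parse_crash_line(line):
    """Split a 'CRASH ERROR!!!' record into its main fields (layout assumed)."""
    body = line.split("CRASH ERROR!!!", 1)[1]
    head, lbt = body.split("lbt=", 1)
    # key=value pairs are comma separated; the backtrace was split off above.
    fields = dict(re.findall(r"(\w+)=([^,]*)", head))
    # Assumption: the timestamp is in microseconds since the Unix epoch.
    seconds, micros = divmod(int(fields["timestamp"]), 1_000_000)
    when = datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(microseconds=micros)
    return {
        "signal": signal.Signals(int(fields["sig"])).name,
        "tid": int(fields["tid"]),
        "tname": fields["tname"],
        "time_utc": when,
        "frames": lbt.split(),
    }


info = parse_crash_line(CRASH_LINE)
print(info["signal"], info["tid"], info["tname"], info["time_utc"].isoformat())
print(len(info["frames"]), "frames in the backtrace")
```

For the line above, signal 11 is SIGSEGV on Linux and the timestamp decodes to 2022-10-11 13:37:01.634940 UTC. That is 21:37:01 at UTC+8, which matches the last INFO line printed before the crash.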

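It is unclear why the oceanbase-ce service failed.

【Attachments】

  1. Cluster configuration: dev018_ob_config.yaml
  2. observer.log
  3. stack file

New users are not allowed to upload logs or configuration files, which is frustrating.

OB configuration file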

    ## Only need to configure when remote login is required
    # user:
    #   username: your username
    #   password: your password if need
    #   key_file: your ssh-key file path if need
    #   port: your ssh port, default 22
    #   timeout: ssh connection timeout (second), default 30
    oceanbase-ce:
      servers:
        # Please don't use hostname, only IP can be supported
        - 172.17.151.118
      global:
        # The working directory for OceanBase Database. OceanBase Database is started under this directory. This is a required field.
        home_path: /root/observer
        # The directory for data storage. The default value is $home_path/store.
        # data_dir: /data
        # The directory for clog, ilog, and slog. The default value is the same as the data_dir value.
        # redo_dir: /redo
        # Please set devname as the network adaptor's name whose ip is in the setting of servers.
        # if set servers as "127.0.0.1", please set devname as "lo"
        # if current ip is 192.168.1.10, and the ip's network adaptor's name is "eth0", please use "eth0"
        devname: ens192
        mysql_port: 2881 # External port for OceanBase Database. The default value is 2881. DO NOT change this value after the cluster is started.
        rpc_port: 2882 # Internal port for OceanBase Database. The default value is 2882. DO NOT change this value after the cluster is started.
        zone: zone1
        cluster_id: 1
        # please set memory limit to a suitable value which is matching resource.
        memory_limit: 10G # The maximum running memory for an observer
        system_memory: 2G # The reserved system memory. system_memory is reserved for general tenants. The default value is 30G.
        stack_size: 512K
        cpu_count: 16
        cache_wash_threshold: 1G
        __min_full_resource_pool_memory: 268435456
        workers_per_cpu_quota: 10
        schema_history_expire_time: 1d
        # The value of net_thread_count had better be same as cpu's core number.
        net_thread_count: 4
        major_freeze_duty_time: Disable
        minor_freeze_times: 10
        enable_separate_sys_clog: 0
        enable_merge_by_turn: FALSE
        datafile_disk_percentage: 20 # The percentage of the data_dir space to the total disk space. This value takes effect only when datafile_size is 0. The default value is 90.
        syslog_level: INFO # System log level. The default value is INFO.
        enable_syslog_wf: false # Print system logs whose levels are higher than WARNING to a separate log file. The default value is true.
        enable_syslog_recycle: true # Enable auto system log recycling or not. The default value is false.
        max_syslog_file_count: 4 # The maximum number of reserved log files before enabling auto recycling. The default value is 0.
        # observer cluster name, consistent with obproxy's cluster_name
        appname: obcluster
    obproxy-ce:
      # Set dependent components for the component.
      # When the associated configurations are not done, OBD will automatically get these configurations from the dependent components.
      depends:
        - oceanbase-ce
      servers:
        - 172.17.151.118
      global:
        listen_port: 2883 # External port. The default value is 2883.
        prometheus_listen_port: 2884 # The Prometheus port. The default value is 2884.
        home_path: /root/obproxy
        # oceanbase root server list
        # format: ip:mysql_port;ip:mysql_port. When a depends exists, OBD gets this value from the oceanbase-ce of the depends.
        # rs_list: 192.168.1.2:2881
        enable_cluster_checkout: false
        # observer cluster name, consistent with oceanbase-ce's appname. When a depends exists, OBD gets this value from the oceanbase-ce of the depends.
        # cluster_name: obcluster
        skip_proxy_sys_private_check: true
        enable_strict_kernel_release: false
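
The memory-related values in the oceanbase-ce global section are easier to compare once they are in the same unit. The following is only a rough sketch: it assumes the K/M/G suffixes are binary multiples (1K = 1024 bytes), and both the helper `parse_size` and the `config` dict are hypothetical, holding values copied by hand from the file above.

```python
# Assumption: size suffixes are binary multiples (1K = 1024 bytes).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value):
    """Convert a value such as '10G', '512K' or 268435456 to a byte count."""
    text = str(value).strip().upper()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


# Values copied by hand from the configuration file above.
config = {
    "memory_limit": "10G",
    "system_memory": "2G",
    "cache_wash_threshold": "1G",
    "stack_size": "512K",
    "__min_full_resource_pool_memory": 268435456,
    "cpu_count": 16,
    "net_thread_count": 4,
}

# Memory left once the system_memory reserve is subtracted from memory_limit.
remaining = parse_size(config["memory_limit"]) - parse_size(config["system_memory"])
print("memory_limit - system_memory =", remaining // 1024**3, "GiB")
print("__min_full_resource_pool_memory =",
      parse_size(config["__min_full_resource_pool_memory"]) // 1024**2, "MiB")
print("net_thread_count / cpu_count =",
      config["net_thread_count"], "/", config["cpu_count"])
```

Under those assumptions this leaves 8 GiB after the 2G reserve, and `__min_full_resource_pool_memory` works out to 256 MiB. Whether these values are related to the crash cannot be told from the configuration alone.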

observer.log contents

[2022-10-11 21:37:01.627233] INFO [STORAGE.TRANS] ob_trans_part_ctx.cpp:731 [15595][1063][YB42AC119776-0005EABCF36F8772] [lt=9] [dc=0] update gts cache success(updated=true, context={ObDistTransCtx:{ObTransCtx:{this:0x7fe368ab7b50, ctx_type:2, trans_id:{hash:15421428008438187002, inc:1697675, addr:"172.17.151.118:2882", t:1665495421627165}, tenant_id:1001, is_exiting:false, trans_type:0, is_readonly:false, trans_expired_time:1665495451627096, self:{tid:1100611139404002, partition_id:0, part_cnt:0}, state:{prepare_version:-1, state:0}, cluster_version:12884967428, trans_need_wait_wrap:{receive_gts_ts:{mts:0}, need_wait_interval_us:0}, trans_param:[access_mode=1, type=2, isolation=1, magic=17361641481138401520, autocommit=1, consistency_type=0(CURRENT_READ), read_snapshot_type=2(PARTICIPANT_SNAPSHOT), cluster_version=12884967428, is_inner_trans=1], can_elr:false, uref:1073741825, ctx_create_time:1665495421626741}, scheduler:"172.17.151.118:2882", coordinator:{tid:18446744073709551615, partition_id:-1, part_idx:268435455, subpart_idx:268435455}, participants:[{tid:1100611139404002, partition_id:0, part_cnt:0}], request_id:-1, timeout_task:{is_inited:true, is_registered:true, is_running:false, delay:30186911, ctx:0x7fe368ab7b50, bucket_idx:9280, run_ticket:333099090363, is_scheduled:true, prev:0x7fe450c7a140, next:0x7fe450c7a140}, xid:{gtrid_str:"", bqual_str:"", format_id:1, gtrid_str_.ptr():"data_size:0, data:", bqual_str_.ptr():"data_size:0, data:"}}, snapshot_version:1665495421551836, local_trans_version:-1, submit_log_pending_count:0, submit_log_count:0, stmt_info:{sql_no:0, start_task_cnt:0, end_task_cnt:0, need_rollback:false, task_info:{tasks_:[]}}, global_trans_version:-1, redo_log_no:0, mutator_log_no:0, session_id:1, is_gts_waiting:false, part_trans_action:-1, timeout_task:{is_inited:true, is_registered:true, is_running:false, delay:30186911, ctx:0x7fe368ab7b50, bucket_idx:9280, run_ticket:333099090363, is_scheduled:true, prev:0x7fe450c7a140, next:0x7fe450c7a140}, batch_commit_trans:false, batch_commit_state:0, is_dup_table_trans:false, last_ask_scheduler_status_ts:1665495421627186, last_ask_scheduler_status_response_ts:1665495421627186, ctx_dependency_wrap:{prev_trans_arr:[], next_trans_arr:[], prev_trans_commit_count:0}, preassigned_log_meta:{log_id_:18446744073709551615, submit_timestamp_:-1, proposal_id_:{time_to_usec:-1, server:"0.0.0.0"}}, is_dup_table_prepare:false, dup_table_syncing_log_id:18446744073709551615, is_prepare_leader_revoke:false, is_local_trans:true, forbidden_sql_no:-1, is_dirty_:false, undo_status:{undo_action_arr:[]}, max_durable_sql_no:0, max_durable_log_ts:0, mt_ctx_.get_checksum_log_ts():0, is_changing_leader:false, has_trans_state_log:false, is_trans_state_sync_finished:false, status:0, same_leader_batch_partitions_count:0, is_hazardous_ctx:false, mt_ctx_.get_callback_count():0, in_xa_prepare_state:false, is_listener:false, last_replayed_redo_log_id:0, status:0, is_xa_trans_prepared:false, redo_log_id_serialize_size:1, participants_serialize_size:1, undo_serialize_size:7})
[2022-10-11 21:37:01.634695] INFO [STORAGE] ob_pg_sstable_garbage_collector.cpp:188 [15179][262][Y0-0000000000000000] [lt=7] [dc=0] do one gc free sstable by queue(ret=0, free sstable cnt=0)
CRASH ERROR!!! sig=11, sig_code=128, sig_addr=0, timestamp=1665495421634940, tid=15255, tname=ILOGFlush, trace_id=0-0, extra_info=((null)), lbt=0x9ab6278 0x9aa6b18 0x7fe4d607a62f 0x7514558 0x74ebedd 0x744f5d0 0x744ce07 0x7449f3a 0x9a223b6 0x340b9ae 0x2cabf01 0x9820da4 0x981f791 0x981c24e

stack.15046.2022101113371 contents

tid: 15658, tname: ObSSTableGC, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x87d2c3d 0x340b9ae 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15659, tname: AutoPartSche, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x8a1b3a2 0x2ca95d3 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15660, tname: WeakReadSvr, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x8e9eba3 0x2ca95d3 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15661, tname: GCPartAdpt, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x8b78ee9 0x2ca95d3 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15662, tname: RSMonitor, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x9372ef2 0x2ca95d3 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15663, tname: LeaseHB, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d6076de2 0x7adff48 0x9a22088 0x340b9ae 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15885, tname: IdxBuild, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d5989e29 0x98217ba 0x5700972 0x9a26d18 0x9a263aa 0x9a2aabc 0x9a2a4ed 0x340b9ae 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15886, tname: CacheCalculator, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x932baa9 0x2ca95d3 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15887, tname: BGThreadMonitor, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x33d91eb 0x2ca95d3 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15888, tname: BackupDestDetec, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d6076de2 0x998df6d 0x373d79e 0x2ca95d3 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15889, tname: RestoreReporter, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x373b726 0x2ca95d3 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 15955, tname: TTLScheduler, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d6076de2 0x7adff48 0x9a22088 0x340b9ae 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
tid: 3504, tname: TNT_L0_1001, lbt: 0x9ab6278 0x9aab260 0x9aaa704 0x7fe4d607a62f 0x7fe4d5989e29 0x98217ba 0x931bc82 0x92e53fc 0x92eff18 0x92f05e6 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO
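
Each line of the stack file follows the same `tid / tname / lbt / sql` pattern, so it can be loaded into a structure and compared across threads. The sketch below uses only the last three lines of the dump as sample input. The names `parse_stack_line` and `common_prefix` are hypothetical, and the line format is assumed from the excerpt rather than from any documented specification.

```python
import re

# Last three lines of stack.15046.2022101113371, copied from above.
STACK_LINES = [
    "tid: 15889, tname: RestoreReporter, lbt: 0x9ab6278 0x9aab260 0x9aaa704 "
    "0x7fe4d607a62f 0x7fe4d59569fd 0x7fe4d59872d3 0x373b726 0x2ca95d3 "
    "0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO",
    "tid: 15955, tname: TTLScheduler, lbt: 0x9ab6278 0x9aab260 0x9aaa704 "
    "0x7fe4d607a62f 0x7fe4d6076de2 0x7adff48 0x9a22088 0x340b9ae 0x2cabf01 "
    "0x9820da4 0x981f791 0x981c24e, sql: TODO",
    "tid: 3504, tname: TNT_L0_1001, lbt: 0x9ab6278 0x9aab260 0x9aaa704 "
    "0x7fe4d607a62f 0x7fe4d5989e29 0x98217ba 0x931bc82 0x92e53fc 0x92eff18 "
    "0x92f05e6 0x2cabf01 0x9820da4 0x981f791 0x981c24e, sql: TODO",
]

LINE_RE = re.compile(r"tid: (\d+), tname: ([^,]+), lbt: ([^,]+), sql: (.*)")


def parse_stack_line(line):
    """Return one thread record, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tid, tname, lbt, sql = match.groups()
    return {"tid": int(tid), "tname": tname, "frames": lbt.split(), "sql": sql}


def common_prefix(frame_lists):
    """Leading frames that are identical in every backtrace."""
    prefix = []
    for column in zip(*frame_lists):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


threads = [t for t in map(parse_stack_line, STACK_LINES) if t]
shared = common_prefix([t["frames"] for t in threads])
print("frames shared by every thread:", shared)
for t in threads:
    # Show the first few frames that are specific to this thread.
    print(t["tid"], t["tname"], t["frames"][len(shared):len(shared) + 3])
```

In this sample every backtrace starts with the same four frames. They presumably belong to the code that wrote the dump rather than to what each thread was doing, although that is a guess until the addresses are symbolized.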

addr2line.log (1.1 KB)

Posted on the user's behalf: the call stack information of the crashing thread.
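
A call stack like this is usually obtained by feeding the `lbt` addresses of the crash line to GNU `addr2line` together with the observer binary. Here is a minimal sketch that only builds the command and does not run it. The binary path `/root/observer/bin/observer` is an assumption based on `home_path` and the `bin` directory in the listing above, the function name `addr2line_command` is hypothetical, and the cut-off used to skip shared-library addresses is a heuristic.

```python
import shlex

# lbt addresses of the crashing ILOGFlush thread, copied from observer.log.
FRAMES = (
    "0x9ab6278 0x9aa6b18 0x7fe4d607a62f 0x7514558 0x74ebedd 0x744f5d0 "
    "0x744ce07 0x7449f3a 0x9a223b6 0x340b9ae 0x2cabf01 0x9820da4 "
    "0x981f791 0x981c24e"
).split()

# Heuristic assumption: addresses this high belong to shared libraries and
# cannot be resolved against the observer binary itself.
SHARED_LIB_FLOOR = 0x7F0000000000


def addr2line_command(frames, binary="/root/observer/bin/observer"):
    """Build an addr2line argv for the frames that fall inside the binary."""
    in_binary = [f for f in frames if int(f, 16) < SHARED_LIB_FLOOR]
    # -p: one line per frame, -C: demangle C++ names,
    # -f: print function names, -e: the executable to look the addresses up in.
    return ["addr2line", "-pCfe", binary, *in_binary]


cmd = addr2line_command(FRAMES)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell on the server. The result is only meaningful if the binary is exactly the build that crashed.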