observer fails to restart

【Environment】Test environment
【OB or other component】observer
【Version】OceanBase_CE 4.2.1.3
【Problem description】rootservice reports scheduler task not inited(ret=-4006, inited_=false)
【Reproduction path】The tenant was 1C2G across 3 zones. Adding an index on a table with tens of millions of rows was too slow, so the zone resources were raised to 4C4G; the task got stuck. Raised again to 8C8G; then two tasks were stuck. Stopped the cluster with obd cluster stop and started it again: one observer failed to start, two started successfully.
【Attachments and logs】
rootservice:
[2024-02-29 11:37:12.411264] INFO [RS] destroy (ob_root_service.cpp:945) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=8] [ROOTSERVICE_NOTICE] start to destroy rootservice
[2024-02-29 11:37:12.411382] INFO [RS] destroy (ob_root_service.cpp:957) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=63] lost replica checker destroy
[2024-02-29 11:37:12.411396] INFO [RS] destroy (ob_rs_reentrant_thread.cpp:115) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=11] rs_monitor_check : reentrant thread check unregister success(thread_name="", last_run_timestamp=0)
[2024-02-29 11:37:12.411421] INFO [RS] destroy (ob_root_service.cpp:965) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=19] root balance destroy
[2024-02-29 11:37:12.411430] INFO [RS] destroy (ob_root_service.cpp:972) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=8] empty server checker destroy
[2024-02-29 11:37:12.411438] INFO [RS] destroy (ob_root_service.cpp:979) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=8] rs_monitor_check : thread checker destroy
[2024-02-29 11:37:12.411446] INFO [RS] destroy (ob_root_service.cpp:985) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=7] schema history recycler destroy
[2024-02-29 11:37:12.411465] INFO [RS] destroy (ob_root_service.cpp:989) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=13] inner queue destroy
[2024-02-29 11:37:12.411474] INFO [RS] destroy (ob_root_service.cpp:991) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=9] inspect queue destroy
[2024-02-29 11:37:12.411518] INFO [RS] destroy (ob_root_service.cpp:993) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=8] ddl builder destroy
[2024-02-29 11:37:12.411533] INFO [RS] destroy (ob_rs_reentrant_thread.cpp:115) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=14] rs_monitor_check : reentrant thread check unregister success(thread_name="", last_run_timestamp=0)
[2024-02-29 11:37:12.411542] INFO [RS] destroy (ob_root_service.cpp:998) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=9] heartbeat checker destroy
[2024-02-29 11:37:12.411559] INFO [RS] destroy (ob_root_service.cpp:1002) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=7] event table operator destroy
[2024-02-29 11:37:12.411620] WDIAG [RS] destroy (ob_dbms_job_master.cpp:96) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=13][errcode=-4006] scheduler task not inited(ret=-4006, inited_=false)
[2024-02-29 11:37:12.411669] INFO [RS] destroy (ob_root_service.cpp:1005) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=42] ObDBMSJobMaster destroy
[2024-02-29 11:37:12.411683] INFO [RS] destroy (ob_root_service.cpp:1008) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=8] ddl task scheduler destroy
[2024-02-29 11:37:12.411692] INFO [RS] destroy (ob_rs_reentrant_thread.cpp:115) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=8] rs_monitor_check : reentrant thread check unregister success(thread_name="", last_run_timestamp=0)
[2024-02-29 11:37:12.411705] INFO [RS] destroy (ob_root_service.cpp:1023) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=8] disaster recovery task mgr destroy
[2024-02-29 11:37:12.411746] WDIAG [RS] destroy (ob_dbms_sched_job_master.cpp:95) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=9][errcode=-4006] scheduler task not inited(ret=-4006, inited_=false)
[2024-02-29 11:37:12.411757] INFO [RS] destroy (ob_root_service.cpp:1027) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=10] ObDBMSSchedJobMaster destroy
[2024-02-29 11:37:12.411779] INFO [RS] destroy (ob_root_service.cpp:1029) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=12] global ctx timer destroyed
[2024-02-29 11:37:12.411787] INFO [RS] destroy (ob_root_service.cpp:1038) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=7] [ROOTSERVICE_NOTICE] destroy rootservice end(ret=0, ret="OB_SUCCESS")
[2024-02-29 11:37:12.461311] INFO [RS] stop (ob_disaster_recovery_task_table_updater.cpp:188) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=27] stop ObDRTaskTableUpdater success
[2024-02-29 11:37:12.461335] INFO [RS] wait (ob_disaster_recovery_task_table_updater.cpp:194) [2970763][observer][T0][Y0-0000000000000000-0-0] [lt=24] wait ObDRTaskTableUpdater

obd obdiag check:
+-----------------------------------+-----------------------------------------------------------------------------------------+
| task                              | task_report                                                                             |
+-----------------------------------+-----------------------------------------------------------------------------------------+
| cluster.major                     | [fail] [cluster:myoceanbase] StepSQLHandler execute Exception: (4016, 'Oooooooooooops') |
| cluster.tenant_log_size           | [fail] [cluster:myoceanbase] StepSQLHandler execute Exception: (4016, 'Oooooooooooops') |
| cluster.task_opt_stat_gather_fail | [fail] [cluster:myoceanbase] StepSQLHandler execute Exception: (4016, 'Oooooooooooops') |
...

How can this be avoided?

  1. How was the cluster deployed?
  2. How were the changes made — through OCP or directly with SQL?
  3. Please provide the error messages after the change.
  4. Please provide the complete observer.log and rootservice.log.

A 3-node cluster deployed with obd.
Zone resources were modified via ocp-express.
After modifying the zone resources I went to the task page; no matter how long I waited, the first step never executed and no error was reported. Since OMS was doing the sync — 10-odd indexes, 7 added in half an hour with 8 remaining — I just dropped the database. After the drop completed I restarted the OB cluster. Once it came up, modifying resources worked normally, but one observer failed to start.
Latest logs from the failed start:
rootservice.log (4.1 KB)
observer.log (215.1 KB)

Logs around the failure time 02-28 17:09:45:
https://www.alipan.com/s/w1CHvqeR2mK

Try: obd cluster start name -s ip -c oceanbase-ce

obd cluster start myoceanbase -s 10.1.2.238 -c oceanbase-ce
Get local repositories ok
Search plugins ok
Load cluster param plugin ok
Cluster status check ok
Check before start observer ok
Start observer ok
observer program health check x
[WARN] OBD-2002: Failed to start 10.1.2.238 observer
[ERROR] oceanbase-ce start failed
See the OceanBase error code reference.
Trace ID: 77bf0a16-d6e6-11ee-8c2a-fa163ebe87c8
If you want to view detailed obd logs, please run: obd display-trace 77bf0a16-d6e6-11ee-8c2a-fa163ebe87c8

OBD-2002: failed to start x.x.x.x observer

Error cause: this error has many possible causes; the two most common are:

  • memory_limit is less than 8G.
  • system_memory is too large or too small. Normally memory_limit/3 ≤ system_memory ≤ memory_limit/2.
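The recommended band can be sanity-checked with a quick arithmetic sketch (the 12G/4G values below are placeholders, not this cluster's settings; integer division makes the lower bound slightly lenient):

```shell
# Check that system_memory falls in [memory_limit/3, memory_limit/2].
# Placeholder values — substitute your cluster's actual parameters.
memory_limit_gb=12
system_memory_gb=4

lower=$(( memory_limit_gb / 3 ))
upper=$(( memory_limit_gb / 2 ))

if [ "$system_memory_gb" -ge "$lower" ] && [ "$system_memory_gb" -le "$upper" ]; then
  echo "OK: system_memory ${system_memory_gb}G is within [${lower}G, ${upper}G]"
else
  echo "ADJUST: system_memory ${system_memory_gb}G is outside [${lower}G, ${upper}G]"
fi
```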

It was originally a minimal installation with 5G per observer. I have now changed the cluster parameter memory_limit to 10G and system_memory to 3G, and it still won't start.

Please provide the latest complete obd log from the startup attempt, along with observer.log.

Please also provide the configuration file.

rootservice.log (4.1 KB)
obd.log (94.3 KB)
observer.log (221.2 KB)
Deployed with obd — which configuration file do you mean, and where is it?

This one: ~/.obd/cluster/name/config.yaml, where name is the name you used when deploying the cluster.

config.yaml.txt (2.2 KB)

config.yaml.bak.txt (2.2 KB)
obd.log (94.3 KB)
observer.log (231.4 KB)
rootservice.log (4.1 KB)
These are the logs from starting after I changed the config file to memory_limit: 12G and system_memory: 4G; observer reports:
WDIAG [SHARE.SCHEMA] check_if_tenant_has_been_dropped (ob_multi_version_schema_service.cpp:2067) [4192109][observer][T0][Y0-0000000000000000-0-0] [lt=36][errcode=-4014] inner stat error(ret=-4014)

See whether this resolves it: "OBServer restart fails, error code 4016" — OceanBase knowledge base.

The machines sync against a public NTP source and the clocks are fine. Looking at observer.log, these are the main errors:
grep ret=- observer.log|more
[2024-03-01 15:00:56.891759] WDIAG [OCCAM] init (ob_vtable_event_recycle_buffer.h:58) [4192109][observer][T0][Y0-0000000000000000-0-0] [lt=66][errcode=-4002] invalid argument(ret=-4002, mem_tag="MdsEventCache", recycle_buffer_number=0, recycle_buffer_number=0, hash_idx_bkt_num_each=8192)
[2024-03-01 15:00:56.898099] WDIAG [COMMON] get_time_info (ob_io_define.cpp:1455) [4192120][IO_TUNING0][T0][Y0-0000000000000000-0-0] [lt=5][errcode=-4006] not init yet(ret=-4006, is_inited_=true)
[2024-03-01 15:00:56.898195] WDIAG [STORAGE.BLKMGR] get_all_macro_ids (ob_block_manager.cpp:530) [4192120][IO_TUNING0][T0][Y0-0000000000000000-0-0] [lt=12][errcode=-4006] fail to for each block map(ret=-4006)
[2024-03-01 15:00:56.898240] WDIAG [COMMON] send_detect_task (ob_io_struct.cpp:785) [4192120][IO_TUNING0][T0][Y0-0000000000000000-0-0] [lt=36][errcode=-4006] fail to get macro ids(ret=-4006, macro_ids=[])
[2024-03-01 15:00:56.898311] WDIAG [COMMON] run1 (ob_io_struct.cpp:820) [4192120][IO_TUNING0][T0][Y0-0000000000000000-0-0] [lt=64][errcode=-4006] fail to send detect task(ret=-4006)
[2024-03-01 15:00:56.899801] WDIAG [SHARE.SCHEMA] check_if_tenant_has_been_dropped (ob_multi_version_schema_service.cpp:2067) [4192109][observer][T0][Y0-0000000000000000-0-0] [lt=36][errcode=-4014] inner stat error(ret=-4014)
[2024-03-01 15:00:56.899844] WDIAG [SERVER] nonblock_get_leader (ob_inner_sql_connection.cpp:1772) [4192109][observer][T0][Y0-0000000000000000-0-0] [lt=29][errcode=-4014] user tenant has been dropped(ret=-4014, ret="OB_INNER_STAT_ERROR", tenant_id=1)
[2024-03-01 15:00:56.899902] WDIAG [SERVER] execute_read_inner (ob_inner_sql_connection.cpp:1683) [4192109][observer][T0][Y0-0000000000000000-0-0] [lt=36][errcode=-4014] nonblock get leader failed(ret=-4014, tenant_id=1, ls_id={id:1}, cluster_id=1708669127)

I don't know how to handle this.

Check with obd cluster list and select * from dba_ob_servers\G.

obd cluster list
+-----------------------------------------------------------------+
|                          Cluster List                           |
+-------------+--------------------------------+-----------------+
| Name        | Configuration Path             | Status (Cached) |
+-------------+--------------------------------+-----------------+
| myoceanbase | /root/.obd/cluster/myoceanbase | running         |
+-------------+--------------------------------+-----------------+
Trace ID: 080b3158-d7aa-11ee-821f-fa163ebe87c8
If you want to view detailed obd logs, please run: obd display-trace 080b3158-d7aa-11ee-821f-fa163ebe87c8

*************************** 1. row ***************************
SVR_IP: 10.1.2.238
SVR_PORT: 2882
ID: 1
ZONE: zone1
SQL_PORT: 2881
WITH_ROOTSERVER: NO
STATUS: INACTIVE
START_SERVICE_TIME: NULL
STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
CREATE_TIME: 2024-02-23 14:23:48.361534
MODIFY_TIME: 2024-02-28 17:14:10.618544
BUILD_VERSION: 4.2.1.3_103000032023122818-8fe69c2056b07154bbd1ebd2c26e818ee0d5c56f(Dec 28 2023 19:07:15)
LAST_OFFLINE_TIME: 2024-02-28 17:14:10.616210
*************************** 2. row ***************************
SVR_IP: 10.1.2.120
SVR_PORT: 2882
ID: 2
ZONE: zone2
SQL_PORT: 2881
WITH_ROOTSERVER: YES
STATUS: ACTIVE
START_SERVICE_TIME: 2024-03-01 14:21:41.532358
STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
CREATE_TIME: 2024-02-23 14:23:48.384293
MODIFY_TIME: 2024-03-01 14:21:42.406223
BUILD_VERSION: 4.2.1.3_103000032023122818-8fe69c2056b07154bbd1ebd2c26e818ee0d5c56f(Dec 28 2023 19:07:15)
LAST_OFFLINE_TIME: NULL
*************************** 3. row ***************************
SVR_IP: 10.1.2.121
SVR_PORT: 2882
ID: 3
ZONE: zone3
SQL_PORT: 2881
WITH_ROOTSERVER: NO
STATUS: ACTIVE
START_SERVICE_TIME: 2024-03-01 14:21:48.750040
STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
CREATE_TIME: 2024-02-23 14:23:48.412995
MODIFY_TIME: 2024-03-01 14:21:50.426099
BUILD_VERSION: 4.2.1.3_103000032023122818-8fe69c2056b07154bbd1ebd2c26e818ee0d5c56f(Dec 28 2023 19:07:15)
LAST_OFFLINE_TIME: NULL

alter system stop/start server '<node IP>:2882';
Run the command above to restart the node. If it reports an error, run the following to get the trace id, then look it up in the observer log:
select last_trace_id();
Search the log by that trace id and post the result:
grep '<trace id>' observer.log

alter system start server '10.1.2.238:2882';
Query OK, 0 rows affected (0.01 sec)

select last_trace_id();
+-----------------------------------+
| last_trace_id()                   |
+-----------------------------------+
| YB420A0A5578-00061293622BF306-0-0 |
+-----------------------------------+

grep 'YB420A0A5578-00061293622BF382-0-0' *log produced no output.

SELECT * FROM oceanbase.DBA_OB_SERVERS\G — STATUS is still INACTIVE:
*************************** 1. row ***************************
SVR_IP: 10.1.2.238
SVR_PORT: 2882
ID: 1
ZONE: zone1
SQL_PORT: 2881
WITH_ROOTSERVER: NO
STATUS: INACTIVE
START_SERVICE_TIME: NULL
STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
CREATE_TIME: 2024-02-23 14:23:48.361534
MODIFY_TIME: 2024-03-01 17:38:29.675689
BUILD_VERSION: 4.2.1.3_103000032023122818-8fe69c2056b07154bbd1ebd2c26e818ee0d5c56f(Dec 28 2023 19:07:15)
LAST_OFFLINE_TIME: 2024-02-28 17:14:10.616210

Is there any other way to diagnose or recover this?

Was it executed on 10.1.2.238?
First run an obdiag inspection on the cluster to collect information: 【SOP series 22】— Troubleshooting step one (self-diagnosis and diagnostic info collection), OceanBase community Q&A.

Yes, it was executed on 10.1.2.238. Grepping for ERROR, it looks like data blocks are corrupted, but I never cleaned up any files:
please checkout the internal errcode(errcode=-4016, file="ob_server_log_block_mgr.cpp", line_no=600, info="check_log_pool_whehter_is_integrity_ failed, unexpected error")
please checkout the internal errcode(errcode=-4016, file="ob_server_log_block_mgr.cpp", line_no=108, info="do_load_ failed")
please checkout the internal errcode(errcode=-4016, file="ob_server.cpp", line_no=2144, info="log block mgr init failed")
please checkout the internal errcode(errcode=-4016, file="ob_server.cpp", line_no=328, info="init io failed")
please checkout the internal errcode(errcode=-4016, file="ob_server.cpp", line_no=506, info="[OBSERVER_NOTICE] fail to init observer")
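The check_log_pool_whehter_is_integrity_ failure suggests the files under log_pool no longer match what the log block manager expects. A quick sanity check is to compare block-file sizes and counts against a healthy node — note the log_pool path and the 64 MiB block size below are assumptions for this deployment, so verify both on a node that starts cleanly (e.g. 10.1.2.120) before drawing conclusions:

```shell
# List files under log_pool whose size differs from the expected block size,
# plus the total block count for cross-node comparison.
# LOG_POOL path and the 64 MiB size are ASSUMPTIONS — check a healthy node first.
LOG_POOL=/data/myoceanbase/oceanbase/store/clog/log_pool
EXPECTED=$((64 * 1024 * 1024))
find "$LOG_POOL" -type f ! -size "${EXPECTED}c" -printf '%s\t%p\n'
find "$LOG_POOL" -type f | wc -l
```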

Related post: "observer fails to start after cleaning up some files under log_pool because the directory was too large" — OceanBase community Q&A.

Running obd obdiag gather clog myoceanbase errors out:
[2024-03-04 15:33:54.020] [ERROR] oceanbase-diagnostic-tool-py_script_gather_clog-1.0 RuntimeError: 'data_dir'
[2024-03-04 15:33:54.020] [ERROR] Traceback (most recent call last):
[2024-03-04 15:33:54.020] [ERROR] File "core.py", line 4696, in obdiag_online_func
[2024-03-04 15:33:54.020] [ERROR] File "core.py", line 188, in call_plugin
[2024-03-04 15:33:54.020] [ERROR] File "_plugin.py", line 343, in call
[2024-03-04 15:33:54.020] [ERROR] File "_plugin.py", line 302, in _new_func
[2024-03-04 15:33:54.020] [ERROR] File "/root/.obd/plugins/oceanbase-diagnostic-tool/1.0/gather_clog.py", line 82, in gather_clog
[2024-03-04 15:33:54.020] [ERROR] if run():
[2024-03-04 15:33:54.020] [ERROR] File "/root/.obd/plugins/oceanbase-diagnostic-tool/1.0/gather_clog.py", line 57, in run
[2024-03-04 15:33:54.020] [ERROR] obdiag_cmd = get_obdiag_cmd()
[2024-03-04 15:33:54.020] [ERROR] File "/root/.obd/plugins/oceanbase-diagnostic-tool/1.0/gather_clog.py", line 43, in get_obdiag_cmd
[2024-03-04 15:33:54.021] [ERROR] cmd = r"{base} --clog_dir {data_dir} --from {from_option} --to {to_option} --encrypt {encrypt_option}".format(
[2024-03-04 15:33:54.021] [ERROR] KeyError: 'data_dir'
[2024-03-04 15:33:54.021] [ERROR]
[2024-03-04 15:33:54.021] [DEBUG] - sub gather_clog ref count to 0
[2024-03-04 15:33:54.021] [DEBUG] - export gather_clog

more ~/.obdiag/config.yml
servers:
  nodes:
    - ip: 10.1.2.238
      ssh_port: 22
      ssh_username: root
      ssh_password: root123
      private_key: ''
      home_path: /data/myoceanbase/oceanbase
      data_dir: /data/myoceanbase/oceanbase/store
      redo_dir: /data/myoceanbase/oceanbase/store
I don't know why it can't read data_dir.
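The traceback points at a plain str.format over a node dict: if that dict lacks a data_dir key (for example because YAML indentation put data_dir outside the node entry, or the plugin reads a different config level), format(**node) raises exactly this KeyError. A minimal reproduction of that failure mode — an assumption about the plugin's internals, based only on the traceback above:

```shell
# Reproduce the KeyError from gather_clog.py line 43: formatting a
# command template against a node dict missing the 'data_dir' key.
python3 - <<'EOF'
node = {'home_path': '/data/myoceanbase/oceanbase'}  # no 'data_dir' key
try:
    cmd = "--clog_dir {data_dir}".format(**node)
except KeyError as e:
    print("KeyError:", e)  # prints: KeyError: 'data_dir'
EOF
```

So the first thing to check is whether data_dir is indented at the same level as home_path inside the node entry of ~/.obdiag/config.yml.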