observer restart fails

王利博 · March 1, 2024, 16:51

Please run obd cluster list and select * from dba_ob_servers\G and post the output.

AntTech_OWSC2W · March 1, 2024, 17:10

obd cluster list
+----------------------------------------------------------------+
|                          Cluster List                          |
+-------------+--------------------------------+-----------------+
| Name        | Configuration Path             | Status (Cached) |
+-------------+--------------------------------+-----------------+
| myoceanbase | /root/.obd/cluster/myoceanbase | running         |
+-------------+--------------------------------+-----------------+
Trace ID: 080b3158-d7aa-11ee-821f-fa163ebe87c8
If you want to view detailed obd logs, please run: obd display-trace 080b3158-d7aa-11ee-821f-fa163ebe87c8

*************************** 1. row ***************************
SVR_IP: 10.1.2.238
SVR_PORT: 2882
ID: 1
ZONE: zone1
SQL_PORT: 2881
WITH_ROOTSERVER: NO
STATUS: INACTIVE
START_SERVICE_TIME: NULL
STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
CREATE_TIME: 2024-02-23 14:23:48.361534
MODIFY_TIME: 2024-02-28 17:14:10.618544
BUILD_VERSION: 4.2.1.3_103000032023122818-8fe69c2056b07154bbd1ebd2c26e818ee0d5c56f(Dec 28 2023 19:07:15)
LAST_OFFLINE_TIME: 2024-02-28 17:14:10.616210
*************************** 2. row ***************************
SVR_IP: 10.1.2.120
SVR_PORT: 2882
ID: 2
ZONE: zone2
SQL_PORT: 2881
WITH_ROOTSERVER: YES
STATUS: ACTIVE
START_SERVICE_TIME: 2024-03-01 14:21:41.532358
STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
CREATE_TIME: 2024-02-23 14:23:48.384293
MODIFY_TIME: 2024-03-01 14:21:42.406223
BUILD_VERSION: 4.2.1.3_103000032023122818-8fe69c2056b07154bbd1ebd2c26e818ee0d5c56f(Dec 28 2023 19:07:15)
LAST_OFFLINE_TIME: NULL
*************************** 3. row ***************************
SVR_IP: 10.1.2.121
SVR_PORT: 2882
ID: 3
ZONE: zone3
SQL_PORT: 2881
WITH_ROOTSERVER: NO
STATUS: ACTIVE
START_SERVICE_TIME: 2024-03-01 14:21:48.750040
STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
CREATE_TIME: 2024-02-23 14:23:48.412995
MODIFY_TIME: 2024-03-01 14:21:50.426099
BUILD_VERSION: 4.2.1.3_103000032023122818-8fe69c2056b07154bbd1ebd2c26e818ee0d5c56f(Dec 28 2023 19:07:15)
LAST_OFFLINE_TIME: NULL

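王利博 · March 1, 2024, 17:18

alter system stop/start server 'node_IP:2882';
Run the command above to restart the node. If it reports an error, run the following command to get the trace ID, then look that ID up in the observer log:
select last_trace_id();
Search the log by the trace ID:
grep 'trace id' observer.log
Then post the matching log lines.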
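
The trace ID lookup described above can also be scripted. The following is a minimal sketch, assuming the observer logs are plain text files in one directory; the function name find_trace_lines and the "*.log*" file pattern are illustrative choices, not part of any OceanBase tool.

```python
from pathlib import Path


def find_trace_lines(log_dir, trace_id, pattern="*.log*"):
    """Return (file name, line number, line) for each log line containing trace_id."""
    hits = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log holds undecodable bytes.
        with path.open("r", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if trace_id in line:
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits
```

The match is an exact substring match, so the ID has to be copied character for character from the last_trace_id() output; an ID that differs by even one character returns an empty list.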

AntTech_OWSC2W · March 1, 2024, 17:41

alter system start server '10.1.2.238:2882';
Query OK, 0 rows affected (0.01 sec)

select last_trace_id();
+-----------------------------------+
| last_trace_id()                   |
+-----------------------------------+
| YB420A0A5578-00061293622BF306-0-0 |
+-----------------------------------+

grep 'YB420A0A5578-00061293622BF382-0-0' *log produced no output.

SELECT * FROM oceanbase.DBA_OB_SERVERS\G shows that STATUS is still INACTIVE:
*************************** 1. row ***************************
SVR_IP: 10.1.2.238
SVR_PORT: 2882
ID: 1
ZONE: zone1
SQL_PORT: 2881
WITH_ROOTSERVER: NO
STATUS: INACTIVE
START_SERVICE_TIME: NULL
STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
CREATE_TIME: 2024-02-23 14:23:48.361534
MODIFY_TIME: 2024-03-01 17:38:29.675689
BUILD_VERSION: 4.2.1.3_103000032023122818-8fe69c2056b07154bbd1ebd2c26e818ee0d5c56f(Dec 28 2023 19:07:15)
LAST_OFFLINE_TIME: 2024-02-28 17:14:10.616210

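AntTech_OWSC2W · March 4, 2024, 09:51

Is there any other way to locate the problem or recover the node?

王利博 · March 4, 2024, 10:52

Was that run on 10.1.2.238?
Please first use obdiag to inspect the cluster and collect information. See "[SOP Series 22] The first step of fault diagnosis (self-service diagnosis and diagnostic information collection)" - Community Q&A - OceanBase Community - Distributed Database

AntTech_OWSC2W · March 4, 2024, 15:38

Yes, it was run on 10.1.2.238. Grepping for ERROR suggests that a data block is corrupted, but I have not cleaned up any files.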
please checkout the internal errcode(errcode=-4016, file="ob_server_log_block_mgr.cpp", line_no=600, info="check_log_pool_whehter_is_integrity_ failed, unexpected error")
please checkout the internal errcode(errcode=-4016, file="ob_server_log_block_mgr.cpp", line_no=108, info="do_load_ failed")
please checkout the internal errcode(errcode=-4016, file="ob_server.cpp", line_no=2144, info="log block mgr init failed")
please checkout the internal errcode(errcode=-4016, file="ob_server.cpp", line_no=328, info="init io failed")
please checkout the internal errcode(errcode=-4016, file="ob_server.cpp", line_no=506, info="[OBSERVER_NOTICE] fail to init observer")
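
The five lines above form one failure chain, from the log pool integrity check up to observer initialization. To pull such lines out of a large log and read the chain at a glance, something like the sketch below may help. It assumes the line layout shown above (errcode, file, line_no, info); the names ERR_RE and parse_errcode_lines are hypothetical, and the pattern accepts both straight and curly quotes because pasted logs often have the latter.

```python
import re

# Matches the "internal errcode(...)" layout seen in the pasted lines.
ERR_RE = re.compile(
    r'errcode=(?P<errcode>-?\d+), '
    r'file=["“](?P<file>[^"”]+)["”], '
    r'line_no=(?P<line_no>\d+), '
    r'info=["“](?P<info>.*)["”]\)'
)


def parse_errcode_lines(text):
    """Return one dict per line that matches the errcode layout, in input order."""
    records = []
    for line in text.splitlines():
        m = ERR_RE.search(line)
        if m:
            records.append({
                "errcode": int(m["errcode"]),
                "file": m["file"],
                "line_no": int(m["line_no"]),
                "info": m["info"],
            })
    return records
```

Reading the records in order gives the innermost failure first, which here is the check in ob_server_log_block_mgr.cpp at line 600.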

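"The log_pool directory was too large; after cleaning up some files under log_pool, observer cannot start" - Community Q&A - OceanBase Community - Distributed Database

Running obd obdiag gather clog myoceanbase fails with: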
[2024-03-04 15:33:54.020] [ERROR] oceanbase-diagnostic-tool-py_script_gather_clog-1.0 RuntimeError: 'data_dir'
[2024-03-04 15:33:54.020] [ERROR] Traceback (most recent call last):
[2024-03-04 15:33:54.020] [ERROR] File "core.py", line 4696, in obdiag_online_func
[2024-03-04 15:33:54.020] [ERROR] File "core.py", line 188, in call_plugin
[2024-03-04 15:33:54.020] [ERROR] File "_plugin.py", line 343, in call
[2024-03-04 15:33:54.020] [ERROR] File "_plugin.py", line 302, in _new_func
[2024-03-04 15:33:54.020] [ERROR] File "/root/.obd/plugins/oceanbase-diagnostic-tool/1.0/gather_clog.py", line 82, in gather_clog
[2024-03-04 15:33:54.020] [ERROR] if run():
[2024-03-04 15:33:54.020] [ERROR] File "/root/.obd/plugins/oceanbase-diagnostic-tool/1.0/gather_clog.py", line 57, in run
[2024-03-04 15:33:54.020] [ERROR] obdiag_cmd = get_obdiag_cmd()
[2024-03-04 15:33:54.020] [ERROR] File "/root/.obd/plugins/oceanbase-diagnostic-tool/1.0/gather_clog.py", line 43, in get_obdiag_cmd
[2024-03-04 15:33:54.021] [ERROR] cmd = r"{base} --clog_dir {data_dir} --from {from_option} --to {to_option} --encrypt {encrypt_option}".format(
[2024-03-04 15:33:54.021] [ERROR] KeyError: 'data_dir'
[2024-03-04 15:33:54.021] [ERROR]
[2024-03-04 15:33:54.021] [DEBUG] - sub gather_clog ref count to 0
[2024-03-04 15:33:54.021] [DEBUG] - export gather_clog

more ~/.obdiag/config.yml
servers:
  nodes:
    - ip: 10.1.2.238
      ssh_port: 22
      ssh_username: root
      ssh_password: root123
      private_key: ''
      home_path: /data/myoceanbase/oceanbase
      data_dir: /data/myoceanbase/oceanbase/store
      redo_dir: /data/myoceanbase/oceanbase/store
I don't know why data_dir cannot be read.
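
For context, the KeyError in the traceback is the error Python's str.format raises when a named placeholder has no matching value. The sketch below reproduces only that error class; it does not explain why the obd plugin received no data_dir, which this thread does not establish. The template string is copied from the traceback, while missing_fields and render are hypothetical helper names.

```python
from string import Formatter

# Template copied from the traceback above.
TEMPLATE = ("{base} --clog_dir {data_dir} --from {from_option} "
            "--to {to_option} --encrypt {encrypt_option}")


def missing_fields(template, values):
    """List the placeholder names in template that have no entry in values."""
    names = [name for _, name, _, _ in Formatter().parse(template) if name]
    return [name for name in names if name not in values]


def render(template, values):
    """Format the template, naming every missing value instead of raising KeyError."""
    missing = missing_fields(template, values)
    if missing:
        raise ValueError("missing template values: " + ", ".join(missing))
    return template.format(**values)
```

Calling TEMPLATE.format(**values) with a dict that lacks data_dir raises KeyError: 'data_dir', the same message as in the log, whereas missing_fields reports the gap before formatting.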

AntTech_OWSC2W · March 4, 2024, 16:16

./obdiag gather clog
2024-03-04 16:11:47,395 [INFO] Use gather_pack_20240304161147 as pack dir.
2024-03-04 16:11:47,395 [INFO] Sending Collect Shell Command to node 10.1.2.238 ...
2024-03-04 16:11:47,431 [INFO] Connected (version 2.0, client OpenSSH_8.8)
2024-03-04 16:11:47,526 [INFO] Authentication (password) successful!
2024-03-04 16:11:47,527 [INFO] [remote host 10.1.2.238] excute cmd = [mkdir -p /tmp/clog_10.1.2.238_20240304161147]
2024-03-04 16:11:47,841 [INFO] [remote host 10.1.2.238] run cmd = [/data/myoceanbase/oceanbase/bin/observer --version] start ...
2024-03-04 16:11:48,216 [INFO] get observer version, run cmd = [/data/myoceanbase/oceanbase/bin/observer --version]
Traceback (most recent call last):
File "obdiag.py", line 163, in
File "obdiag.py", line 87, in gather_clog
File "obdiag_client.py", line 367, in handle_gather_clog_command
File "handler/gather/gather_obadmin.py", line 84, in handle
File "handler/gather/gather_obadmin.py", line 72, in handle_from_node
File "handler/gather/gather_obadmin.py", line 140, in __handle_from_node
File "utils/version_utils.py", line 34, in compare_versions_lower
ValueError: invalid literal for int() with base 10: 'CE 4'
[488563] Failed to execute script 'obdiag' due to unhandled exception!

/data/myoceanbase/oceanbase/bin/observer --version
observer (OceanBase_CE 4.2.1.3)

REVISION: 103000032023122818-8fe69c2056b07154bbd1ebd2c26e818ee0d5c56f
BUILD_BRANCH: HEAD
BUILD_TIME: Dec 28 2023 19:07:15
BUILD_FLAGS: RelWithDebInfo
BUILD_INFO:
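
The literal 'CE 4' in the ValueError matches the start of "CE 4.2.1.3" in the observer --version output above. The sketch below shows one way such a literal can arise and one way to parse the version more defensively. It assumes the version is the first dotted number in the output; naive_major, parse_observer_version and version_lower are hypothetical names, and this is not obdiag's actual code.

```python
import re

# First run of digits separated by dots, e.g. "4.2.1.3".
VERSION_RE = re.compile(r"\d+(?:\.\d+)+")


def naive_major(version_text):
    """Hypothetical naive parse: text after the underscore, split on dots."""
    return int(version_text.split("_")[1].split(".")[0])


def parse_observer_version(version_text):
    """Return the first dotted version in version_text as a tuple of ints."""
    m = VERSION_RE.search(version_text)
    if m is None:
        raise ValueError("no dotted version found in: %r" % version_text)
    return tuple(int(part) for part in m.group(0).split("."))


def version_lower(a, b):
    """True if the version found in a is lower than the version found in b."""
    return parse_observer_version(a) < parse_observer_version(b)
```

With the output line "observer (OceanBase_CE 4.2.1.3)", naive_major raises the same ValueError as in the traceback because it ends up calling int("CE 4"), while parse_observer_version ignores the "OceanBase_CE" prefix and compares versions as integer tuples.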

靖顺 · March 4, 2024, 23:15

The obdiag error above hits a version-parsing bug in obdiag. It was fixed today, and obdiag 1.6.1 is expected to be released this Friday.

AntTech_OWSC2W · March 5, 2024, 09:27

I see. When observer starts, it reports file="ob_server_log_block_mgr.cpp", line_no=600, info="check_log_pool_whehter_is_integrity_ failed, unexpected error". Is there any way to track down the cause of this error?

渔舟唱晚 · March 5, 2024, 09:46

Has this problem been solved?

AntTech_OWSC2W · March 5, 2024, 09:49

Not yet.

王利博 · March 5, 2024, 10:35

select a.zone, a.SVR_IP, a.SVR_PORT, b.status,
       cpu_capacity, cpu_assigned_max,
       cpu_capacity - cpu_assigned_max as cpu_free,
       round(memory_limit / 1024 / 1024 / 1024, 2) as memory_total_gb,
       round((memory_limit - mem_capacity) / 1024 / 1024 / 1024, 2) as system_memory_gb,
       round(mem_assigned / 1024 / 1024 / 1024, 2) as mem_assigned_gb,
       round((mem_capacity - mem_assigned) / 1024 / 1024 / 1024, 2) as memory_free_gb,
       round(log_disk_capacity / 1024 / 1024 / 1024, 2) as log_disk_capacity_gb,
       round(log_disk_assigned / 1024 / 1024 / 1024, 2) as log_disk_assigned_gb,
       round((log_disk_capacity - log_disk_assigned) / 1024 / 1024 / 1024, 2) as log_disk_free_gb,
       round((data_disk_capacity / 1024 / 1024 / 1024), 2) as data_disk_gb,
       round((data_disk_in_use / 1024 / 1024 / 1024), 2) as data_disk_used_gb,
       round((data_disk_capacity - data_disk_in_use) / 1024 / 1024 / 1024, 2) as data_disk_free_gb
from gv$ob_servers a
join oceanbase.DBA_OB_SERVERS b on a.zone = b.zone\G;
Please run this and post the output.

AntTech_OWSC2W · March 5, 2024, 10:43

*************************** 1. row ***************************
zone: zone2
SVR_IP: 10.1.2.120
SVR_PORT: 2882
status: ACTIVE
cpu_capacity: 16
cpu_assigned_max: 5
cpu_free: 11
memory_total_gb: 9.00
system_memory_gb: 3.00
mem_assigned_gb: 5.00
memory_free_gb: 1.00
log_disk_capacity_gb: 40.00
log_disk_assigned_gb: 14.00
log_disk_free_gb: 26.00
data_disk_gb: 50.00
data_disk_used_gb: 0.89
data_disk_free_gb: 49.11
*************************** 2. row ***************************
zone: zone3
SVR_IP: 10.1.2.121
SVR_PORT: 2882
status: ACTIVE
cpu_capacity: 16
cpu_assigned_max: 5
cpu_free: 11
memory_total_gb: 9.00
system_memory_gb: 3.00
mem_assigned_gb: 5.00
memory_free_gb: 1.00
log_disk_capacity_gb: 40.00
log_disk_assigned_gb: 14.00
log_disk_free_gb: 26.00
data_disk_gb: 50.00
data_disk_used_gb: 0.86
data_disk_free_gb: 49.14
2 rows in set (0.38 sec)

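The node that failed to start, 10.1.2.238, is not in the result.

AntTech_OWSC2W · March 5, 2024, 11:31

When observer starts, it reports file="ob_server_log_block_mgr.cpp", line_no=600, info="check_log_pool_whehter_is_integrity_ failed, unexpected error". Is there any way to track down the cause of this error?

王利博 · March 5, 2024, 11:32

Please provide the output of free -h and df -h, plus the relevant files from the log directory.

AntTech_OWSC2W · March 5, 2024, 12:42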

log_2.exe
https://www.alipan.com/s/4S52i91BbGe

free -h
               total        used        free      shared  buff/cache   available
Mem:            47Gi        21Gi       5.2Gi       957Mi        20Gi        23Gi
Swap:          5.0Gi       1.3Gi       3.7Gi

df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs            24G     0   24G   0% /dev/shm
tmpfs           9.5G 1019M  8.5G  11% /run
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/vda3       979G  204G  735G  22% /
tmpfs            24G   13M   24G   1% /tmp
/dev/vda1       974M   83M  824M  10% /boot

王利博 · March 5, 2024, 13:56

Please show the log files on node 238: ls -l path/name/oceanbase-ce/log/.
Also check the path/name/oceanbase-ce/store directory to see whether anything is missing.

AntTech_OWSC2W · March 5, 2024, 14:14

tree.txt
https://www.alipan.com/s/meJETkya42g

王利博 · March 5, 2024, 14:18

I can't open the Alibaba Cloud Drive link. Could you upload screenshots instead?