One observer node is INACTIVE and stays INACTIVE after restarting the cluster with obd or starting observer manually

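[Environment] Production
[OB or other component] oceanbase-ce
[Version] OceanBase_CE 4.1.0.0
[Problem description] One node in the cluster is in an abnormal state. obd reports the node as INACTIVE, although the observer process is running and its ports are listening. Restarting the cluster with obd, rebooting the operating system, and starting the observer process manually all failed to restore it. Connecting to port 2881 on the faulty node as root of the sys tenant fails with ERROR 5150 (HY000): Tenant not in this server.
[Steps to reproduce] obd cluster display obcluster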
Attached are observer.log from the faulty node and rootservice.log from the primary node.
rootservice.log.txt (2.0 MB)
observer.log.txt (3.0 MB)

Run obd cluster list and check the output.

obd cluster stop/start obcluster -s ip
For a single-node deployment you can omit -s. For a three-node cluster, use -s to specify the node and -c to specify the component.

The status is running.

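It is a three-node cluster. I have restarted both the whole cluster and the single node (obd cluster restart obcluster -s 10.10.42.60 -c oceanbase-ce), but the node status is still INACTIVE. obproxy on the faulty node is healthy and can serve requests; only the observer status is abnormal.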
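To double-check that the observer process and its ports really are healthy on the faulty node, something like the following could be run there. This is a minimal sketch, not part of obd or OceanBase: the helper names port_listening and check_observer are hypothetical, it assumes pgrep and ss are installed, and it assumes the default ports 2881 (SQL) and 2882 (RPC).

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of obd or OceanBase.

# port_listening PORT
# Reads `ss -lnt` output on stdin; succeeds if a local address ends in :PORT.
port_listening() {
  awk -v p=":$1" 'NR > 1 && $4 ~ (p "$") { found = 1 } END { exit !found }'
}

# check_observer
# Reports whether an observer process exists and whether the assumed
# SQL port (2881) and RPC port (2882) are in LISTEN state.
check_observer() {
  if pgrep -x observer >/dev/null; then
    echo "observer process: running"
  else
    echo "observer process: NOT found"
  fi
  for port in 2881 2882; do
    if ss -lnt | port_listening "$port"; then
      echo "port $port: listening"
    else
      echo "port $port: NOT listening"
    fi
  done
}

# Usage, on the faulty node (10.10.42.60):
#   check_observer
```

If both checks pass while the node is still reported as INACTIVE, the process itself is up and the cause has to be looked for in the logs.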

Use obdiag to run a check and collect the relevant information.

  1. obdiag check: run the inspection
  2. obdiag analyze log: analyze the logs
  3. obdiag gather scene run --scene=observer.unknown: collect information for an unknown problem

obdiag documentation: OceanBase分布式数据库-海量数据 笔笔算数

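Post the check results and the log analysis results first. If the problem still cannot be located, post the information collected in step 3.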

[root@fpgs-nb-nei-ob1 ~]# obd obdiag check obcluster
Get local repositories and plugins ok
obdiag plugin : oceanbase-diagnostic-tool-install-1.0
Get local repositories and plugins ok
Deploy obdiag successful.
Current version : 1.5.2.
Path of obdiag : /root/oceanbase-diagnostic-tool
Open ssh connection ok
Check database connectivity x
[ERROR] cluster :ordereddict([('db_host', '10.10.42.60'), ('db_port', 2881), ('ob_cluster_name', 'obcluster'), ('tenant_sys', ordereddict([('password', 'observer'), ('user', 'root@sys')])), ('servers', ordereddict([('nodes', [ordereddict([('ip', '10.10.42.60'), ('ssh_port', 22), ('ssh_username', 'oceanbase'), ('ssh_password', 'oceanbase@123'), ('private_key', ''), ('home_path', '/data/OceanBase/observer'), ('data_dir', '/data/OceanBase/obdata'), ('redo_dir', '/data/OceanBase/obdata/redo')]), ordereddict([('ip', '10.10.42.208'), ('ssh_port', 22), ('ssh_username', 'oceanbase'), ('ssh_password', 'oceanbase@123'), ('private_key', ''), ('home_path', '/data/OceanBase/observer'), ('data_dir', '/data/OceanBase/obdata'), ('redo_dir', '/data/OceanBase/obdata/redo')]), ordereddict([('ip', '10.10.42.149'), ('ssh_port', 22), ('ssh_username', 'oceanbase'), ('ssh_password', 'oceanbase@123'), ('private_key', ''), ('home_path', '/data/OceanBase/observer'), ('data_dir', '/data/OceanBase/obdata'), ('redo_dir', '/data/OceanBase/obdata/redo')])]), ('global', ordereddict())]))]) . Invalid cluster information. Please check the conf in OBD
See OceanBase分布式数据库-海量数据 笔笔算数 .
Trace ID: 42953498-0dee-11ef-9e8c-fa163ec0a4a4
If you want to view detailed obd logs, please run: obd display-trace 42953498-0dee-11ef-9e8c-fa163ec0a4a4
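The check above stops at "Check database connectivity", and the db_host in the dumped configuration is 10.10.42.60, which appears to be the faulty node. To separate a network-level failure from a database-level one, a sketch like the following could test whether a TCP connection to port 2881 can be opened on each node listed in the error. The helper names tcp_open and check_nodes are hypothetical, and the approach assumes bash (for /dev/tcp) and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of obd or obdiag.

# tcp_open HOST PORT
# Succeeds if a TCP connection to HOST:PORT can be opened within 3 seconds.
tcp_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# check_nodes PORT HOST...
# Prints one line per host saying whether PORT is reachable.
check_nodes() {
  local port=$1
  shift
  local host
  for host in "$@"; do
    if tcp_open "$host" "$port"; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"
    fi
  done
}

# Usage, with the node IPs from the error output above:
#   check_nodes 2881 10.10.42.60 10.10.42.208 10.10.42.149
```

A reachable port combined with a failed login (such as the ERROR 5150 reported earlier) would suggest the problem lies in the observer's state rather than in the network.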

What is the deployment method?
df -h && free -h
Log in as root@sys with -Doceanbase (the oceanbase database) and run the following:
select a.zone, a.SVR_IP, a.SVR_PORT, b.status,
  cpu_capacity, cpu_assigned_max,
  cpu_capacity-cpu_assigned_max as cpu_free,
  round(memory_limit /1024/1024/1024 ,2) as memory_total_gb,
  round((memory_limit-mem_capacity) /1024/1024/1024 ,2) as system_memory_gb,
  round(mem_assigned /1024/1024/1024 ,2) as mem_assigned_gb,
  round((mem_capacity-mem_assigned) /1024/1024/1024 ,2) as memory_free_gb,
  round(log_disk_capacity /1024/1024/1024 ,2) as log_disk_capacity_gb,
  round(log_disk_assigned /1024/1024/1024 ,2) as log_disk_assigned_gb,
  round((log_disk_capacity-log_disk_assigned) /1024/1024/1024 ,2) as log_disk_free_gb,
  round((data_disk_capacity /1024/1024/1024 ),2) as data_disk_gb,
  round((data_disk_in_use /1024/1024/1024 ),2) as data_disk_used_gb,
  round((data_disk_capacity-data_disk_in_use) /1024/1024/1024 ,2) as data_disk_free_gb
from gv$ob_servers a
join oceanbase.DBA_OB_SERVERS b on a.zone=b.zone\G;
Then post the output.

[root@fpgs-nb-nei-ob1 ~]# df -h && free -h
Filesystem                   Size  Used Avail Use% Mounted on
devtmpfs                      16G     0   16G   0% /dev
tmpfs                         16G     0   16G   0% /dev/shm
tmpfs                         16G   58M   16G   1% /run
tmpfs                         16G     0   16G   0% /sys/fs/cgroup
/dev/vda1                     40G  9.7G   31G  25% /
/dev/mapper/vgdata2-lvdata2  750G   50M  750G   1% /backup
/dev/mapper/vgdata-lvdata1   750G  430G  321G  58% /data
tmpfs                        3.2G     0  3.2G   0% /run/user/0
tmpfs                        3.2G     0  3.2G   0% /run/user/1000
              total        used        free      shared  buff/cache   available
Mem:            31G        3.4G         23G         57M        4.3G         27G
Swap:            0B          0B          0B

obclient [oceanbase]> select a.zone, a.SVR_IP,a.SVR_PORT, b.status,cpu_capacity,cpu_assigned_max,cpu_capacity-cpu_assigned_max as cpu_free,round(memory_limit /1024/1024/1024 ,2) as memory_total_gb,round((memory_limit-mem_capacity) /1024/1024/1024 ,2) as system_memory_gb,round(mem_assigned /1024/1024/1024 ,2) as mem_assigned_gb,round((mem_capacity-mem_assigned) /1024/1024/1024 ,2) as memory_free_gb,round(log_disk_capacity /1024/1024/1024 ,2) as log_disk_capacity_gb,round(log_disk_assigned /1024/1024/1024 ,2) as log_disk_assigned_gb,round((log_disk_capacity-log_disk_assigned) /1024/1024/1024 ,2) as log_disk_free_gb,round((data_disk_capacity /1024/1024/1024 ),2) as data_disk_gb,round((data_disk_in_use /1024/1024/1024 ),2) as data_disk_used_gb,round((data_disk_capacity-data_disk_in_use) /1024/1024/1024 ,2) as data_disk_free_gb from gv$ob_servers a join oceanbase.DBA_OB_SERVERS b on a.zone=b.zone\G;
*************************** 1. row ***************************
zone: zone3
SVR_IP: 10.10.42.149
SVR_PORT: 2882
status: ACTIVE
cpu_capacity: 16
cpu_assigned_max: 15
cpu_free: 1
memory_total_gb: 26.00
system_memory_gb: 5.00
mem_assigned_gb: 20.00
memory_free_gb: 1.00
log_disk_capacity_gb: 80.00
log_disk_assigned_gb: 5.25
log_disk_free_gb: 74.75
data_disk_gb: 300.00
data_disk_used_gb: 63.85
data_disk_free_gb: 236.15
*************************** 2. row ***************************
zone: zone2
SVR_IP: 10.10.42.208
SVR_PORT: 2882
status: ACTIVE
cpu_capacity: 16
cpu_assigned_max: 15
cpu_free: 1
memory_total_gb: 26.00
system_memory_gb: 5.00
mem_assigned_gb: 20.00
memory_free_gb: 1.00
log_disk_capacity_gb: 80.00
log_disk_assigned_gb: 5.25
log_disk_free_gb: 74.75
data_disk_gb: 300.00
data_disk_used_gb: 64.11
data_disk_free_gb: 235.89
2 rows in set (0.010 sec)

ERROR:
No query specified
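Two things stand out in the output above. The result has only two rows (zone2 and zone3), presumably because the query inner-joins gv$ob_servers with DBA_OB_SERVERS and the faulty node does not report into gv$ob_servers, so it drops out of the join. The trailing "ERROR: No query specified" is harmless: the ; after \G sends an empty statement. To see the registered status of all three nodes, a simpler query against DBA_OB_SERVERS alone may be more useful. The sketch below is an assumption-laden example, not a verified procedure: the variable name server_status_sql is hypothetical, and it assumes the view exposes svr_ip and svr_port alongside the zone and status columns used above.

```shell
#!/usr/bin/env bash
# Sketch only: lists every registered server straight from DBA_OB_SERVERS,
# without joining gv$ob_servers, so an INACTIVE node should still show up.
# Assumption: the view exposes zone, svr_ip, svr_port and status.
server_status_sql='SELECT zone, svr_ip, svr_port, status
FROM oceanbase.DBA_OB_SERVERS
ORDER BY zone, svr_ip'

# Print the statement for review before running it.
printf '%s\n' "$server_status_sql"

# Usage (assumption: connect through a node reported as ACTIVE, for example
# 10.10.42.208; -p prompts for the root@sys password):
#   obclient -h10.10.42.208 -P2881 -uroot@sys -p -Doceanbase -e "$server_status_sql"
```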

analyze_pack_20240509182819.part001.rar (9 MB)
analyze_pack_20240509182819.part002.rar (9 MB)
analyze_pack_20240509182819.part003.rar (9 MB)
analyze_pack_20240509182819.part004.rar (9 MB)
analyze_pack_20240509182819.part005.rar (7.4 MB)

  1. Try adjusting the memory settings and see whether that helps.
  2. You can use obdiag to inspect the logs:

2.1. obdiag check: run the inspection

2.2. obdiag analyze log: analyze the logs

2.3. obdiag gather scene run --scene=observer.unknown: collect information for an unknown problem

obdiag documentation: OceanBase分布式数据库-海量数据 笔笔算数

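Post the check results and the log analysis results first. If the problem still cannot be located, post the information collected in step 2.3.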