OBServer crashes after running under load for a while

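Leckun · June 3, 2025, 10:18

[Environment] Test environment
[Version] OBServer 5.7.25-OceanBase_CE-v4.3.5.0
[Problem description] The observer crashed after a full day of continuous inserts.
[Steps to reproduce]
[Attachments and logs] The observer logs and obdiag logs are attached. No core dump appears to have been generated.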

observer.log.zip (28.0 MB)
gather_scene_20250603100931.zip (28.7 MB)

Leckun · June 3, 2025, 10:47

obdiag turned up the following line, and I don't know what caused it:
[Sun Jun 1 17:41:43 2025] Out of memory: Killed process 2196804 (observer) total-vm:32695856kB, anon-rss:31240376kB, file-rss:0kB, shmem-rss:0kB
System memory is shown below. memory_limit_percentage for OceanBase is set to 80%.
[root@localhost log]# free -h
       total  used   free   shared  buff/cache  available
Mem:   41Gi   39Gi   269Mi  1.2Gi   2.1Gi       799Mi
Swap:  5.9Gi  5.9Gi  0B

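淇铭 · June 3, 2025, 10:51

How was the OB cluster deployed? Was it deployed with obd? If so, please provide the following:
obd cluster list    # look up the cluster name
obd cluster edit-config {cluster name}    # save the configuration to a text file and post it

It looks like the memory configured for OB exceeds the machine's physical memory.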

Leckun · June 3, 2025, 10:57

Yes, it was deployed with OBD. memory_limit was initially 9G. I later changed it to 0M through ocpexpress and then set memory_limit_percentage to 80%.
enable_syslog_wf: false
max_syslog_file_count: 4
production_mode: false
memory_limit: 9G
datafile_size: 24G
system_memory: 1G
log_disk_size: 24G
cpu_count: 22
datafile_maxsize: 1T
datafile_next: 149G
depends:
  - ob-configserver
obproxy-ce:
  version: 4.3.2.0
  package_hash: 463bda45e43079495653238474515da8dada721c

淇铭 · June 3, 2025, 11:12

Are the logs above from the time OB went down?

Leckun · June 3, 2025, 11:13

Yes. It went down on June 1, two days ago. It has not been restarted since, and the logs were collected as-is.

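Leckun · June 3, 2025, 11:22

Other applications also run on this server. Could it be that, with 80% of memory allotted to it, OB grew to use its 80% while the other applications' memory kept growing, so the total reached the OOM killer's threshold and OB was killed?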
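
A rough way to check this hypothesis against the numbers already posted in the thread: the sketch below converts the kB figures from the OOM-killer line to GiB and compares them with 80% of the 41Gi reported by free -h. It assumes that, with memory_limit at 0M, the limit is simply that percentage of physical memory, and that the 80% setting was in effect at the time of the crash. The variable names are illustrative only.

```python
# Figures copied from the OOM-killer line and the `free -h` output in this thread.
KIB_PER_GIB = 1024 * 1024

anon_rss_kib = 31240376   # anon-rss of the killed observer process
total_vm_kib = 32695856   # total-vm of the killed observer process
physical_gib = 41         # "Mem: total" from free -h (free rounds this value)
limit_pct = 80            # memory_limit_percentage as stated by the poster

anon_rss_gib = anon_rss_kib / KIB_PER_GIB
total_vm_gib = total_vm_kib / KIB_PER_GIB

# Assumption: the observer limit is a plain percentage of physical memory.
observer_limit_gib = physical_gib * limit_pct / 100
left_for_others_gib = physical_gib - observer_limit_gib

print(f"observer anon-rss     : {anon_rss_gib:.2f} GiB")
print(f"observer total-vm     : {total_vm_gib:.2f} GiB")
print(f"assumed observer limit: {observer_limit_gib:.2f} GiB")
print(f"left for everything else: {left_for_others_gib:.2f} GiB")
print("observer under its limit when killed:", anon_rss_gib < observer_limit_gib)
```

With these inputs the observer's resident memory (about 29.8 GiB) was still below the assumed 32.8 GiB limit when it was killed, leaving roughly 8 GiB for the OS and every other application. That is consistent with the host as a whole running out of memory, with swap already full, rather than OB exceeding its own limit.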

Leckun · June 3, 2025, 11:28

Memory usage at the time, as seen in OCP, is shown in the figure.

淇铭 · June 3, 2025, 13:54

Judging by this log message, clog is sized too small: new log_disk_size(24576MB) is not enough to hold all tenants
Increase log_disk_size.

Please also run the following queries:
show parameters where name in ('memory_limit','memory_limit_percentage','system_memory','log_disk_size','log_disk_percentage','datafile_size','datafile_disk_percentage');

select zone,concat(SVR_IP,':',SVR_PORT) observer,
cpu_capacity_max cpu_total,cpu_assigned_max cpu_assigned,
cpu_capacity-cpu_assigned_max as cpu_free,
round(memory_limit/1024/1024/1024,2) as memory_total,
round((memory_limit-mem_capacity)/1024/1024/1024,2) as system_memory,
round(mem_assigned/1024/1024/1024,2) as mem_assigned,
round((mem_capacity-mem_assigned)/1024/1024/1024,2) as memory_free,
round(log_disk_capacity/1024/1024/1024,2) as log_disk_capacity,
round(log_disk_assigned/1024/1024/1024,2) as log_disk_assigned,
round((log_disk_capacity-log_disk_assigned)/1024/1024/1024,2) as log_disk_free,
round((data_disk_capacity/1024/1024/1024),2) as data_disk,
round((data_disk_in_use/1024/1024/1024),2) as data_disk_used,
round((data_disk_capacity-data_disk_in_use)/1024/1024/1024,2) as data_disk_free
from oceanbase.gv$ob_servers;

Leckun · June 3, 2025, 14:40

Query results below (taken after the restart).

zone |observer        |cpu_total|cpu_assigned|cpu_free|memory_total|system_memory|mem_assigned|memory_free|log_disk_capacity|log_disk_assigned|log_disk_free|data_disk|data_disk_used|data_disk_free|
-----+----------------+---------+------------+--------+------------+-------------+------------+-----------+-----------------+-----------------+-------------+---------+--------------+--------------+
zone1|172.16.0.65:2882|     22.0|        20.0|     2.0|       29.34|         1.00|       28.00|       0.34|           100.00|            80.00|        20.00|  1024.00|          9.11|       1014.89|

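The ERROR you quoted says to set log_disk_size greater than 94208MB, i.e. 92 GB, but the current log_disk_size is 100G, the default from the obd deployment. Is that message itself wrong?

淇铭 · June 3, 2025, 18:16

Your configuration file shows it set to 24G. Run the query below to check whether the tenants' clog configuration is the problem: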
select a.zone,a.svr_ip,b.tenant_name,b.tenant_type, a.max_cpu, a.min_cpu,
round(a.memory_size/1024/1024/1024,2) memory_size_gb,
round(a.log_disk_size/1024/1024/1024,2) log_disk_size,
round(a.log_disk_in_use/1024/1024/1024,2) log_disk_in_use,
round(a.data_disk_in_use/1024/1024/1024,2) data_disk_in_use
from oceanbase.gv$ob_units a join oceanbase.dba_ob_tenants b on a.tenant_id=b.tenant_id order by b.tenant_name;

Leckun · June 3, 2025, 18:19

zone |svr_ip     |tenant_name|tenant_type|max_cpu|min_cpu|memory_size_gb|log_disk_size|log_disk_in_use|data_disk_in_use|
-----+-----------+-----------+-----------+-------+-------+--------------+-------------+---------------+----------------+
zone1|172.16.0.65|META$1002  |META       |       |       |          1.00|         0.60|           0.43|            0.28|
zone1|172.16.0.65|META$1004  |META       |       |       |          2.40|         7.20|           5.75|            0.23|
zone1|172.16.0.65|ocp_meta   |USER       |    1.0|    1.0|          1.00|         5.40|           4.29|            0.36|
zone1|172.16.0.65|sys        |SYS        |    3.0|    3.0|          2.00|         2.00|           1.59|            0.23|
zone1|172.16.0.65|wisdom     |USER       |   16.0|   16.0|         21.60|        64.80|          51.78|           10.54|

These are the query results. The wisdom tenant is the one actually in use.
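
As a cross-check, the per-unit figures can be summed and compared with the server-level mem_assigned (28.00) and log_disk_assigned (80.00) reported earlier by gv$ob_servers. This is a minimal sketch using only the values posted in this thread; the dictionary layout and variable names are illustrative.

```python
# Per-unit figures (GB) copied from the gv$ob_units result in this thread:
# tenant_name -> (memory_size_gb, log_disk_size, log_disk_in_use)
units = {
    "META$1002": (1.00, 0.60, 0.43),
    "META$1004": (2.40, 7.20, 5.75),
    "ocp_meta":  (1.00, 5.40, 4.29),
    "sys":       (2.00, 2.00, 1.59),
    "wisdom":    (21.60, 64.80, 51.78),
}

mem_assigned = round(sum(u[0] for u in units.values()), 2)
log_disk_assigned = round(sum(u[1] for u in units.values()), 2)
log_disk_in_use = round(sum(u[2] for u in units.values()), 2)

# Share of its own log disk quota that the wisdom tenant is using.
wisdom_log_pct = round(units["wisdom"][2] / units["wisdom"][1] * 100, 1)

print("memory assigned to units  :", mem_assigned, "GB")
print("log disk assigned to units:", log_disk_assigned, "GB")
print("log disk in use           :", log_disk_in_use, "GB")
print("wisdom log disk usage     :", wisdom_log_pct, "%")
```

The sums match the server-level view: 28.00 GB of memory and 80.00 GB of log disk are assigned to units, so 20 GB of the 100G log disk is unassigned, and the wisdom tenant has used about 80% of its own 64.80 GB log disk quota.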

淇铭 · June 3, 2025, 18:25

Please run this:

show parameters where name in ('memory_limit','memory_limit_percentage','system_memory','log_disk_size','log_disk_percentage','datafile_size','datafile_disk_percentage');

cuijunli123 · June 3, 2025, 21:52

Probably insufficient resources.

AntTech_FYPTIV · June 3, 2025, 21:54

It's a resource problem.

ob青松 · June 3, 2025, 22:23

Leckun:

zone |observer        |cpu_total|cpu_assigned|cpu_free|memory_total|system_memory|mem_assigned|memory_free|log_disk_capacity|log_disk_assigned|log_disk_free|data_disk|data_disk_used|data_disk_free|
-----+----------------+---------+------------+--------+------------+-------------+------------+-----------+-----------------+-----------------+-------------+---------+--------------+--------------+
zone1|172.16.0.65:2882|     22.0|        20.0|     2.0|       29.34|         1.00|       28.00|       0.34|           100.00|            80.00|        20.00|  1024.00|          9.11|       1014.89|

This output indicates insufficient memory.

乐1983 · June 4, 2025, 06:30

Following along to learn.

Leckun · June 4, 2025, 09:34

zone |svr_type|svr_ip     |svr_port|name                    |data_type|value|info                                                                                                                                                                                                                                                           |section   |scope  |source |edit_level       |default_value|isdefault|
-----+--------+-----------+--------+------------------------+---------+-----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-------+-------+-----------------+-------------+---------+
zone1|observer|172.16.0.65|    2882|log_disk_percentage     |INT      |0    |the percentage of disk space used by the log files. Range: [0,99] in integer;only effective when parameter log_disk_size is 0;when log_disk_percentage is 0: a) if the data and the log are on the same disk, means log_disk_percentage = 30 b) if the data and|LOGSERVICE|CLUSTER|DEFAULT|DYNAMIC_EFFECTIVE|0            |        1|
zone1|observer|172.16.0.65|    2882|log_disk_size           |CAPACITY |100G |the size of disk space used by the log files. Range: [0, +∞)                                                                                                                                                                                                   |LOGSERVICE|CLUSTER|DEFAULT|DYNAMIC_EFFECTIVE|0M           |        0|
zone1|observer|172.16.0.65|    2882|memory_limit_percentage |INT      |70   |the size of the memory reserved for internal use(for testing purpose). Range: [10, 95]                                                                                                                                                                         |OBSERVER  |CLUSTER|DEFAULT|DYNAMIC_EFFECTIVE|80           |        0|
zone1|observer|172.16.0.65|    2882|system_memory           |CAPACITY |1G   |the memory reserved for internal use which cannot be allocated to any outer-tenant, and should be determined to guarantee every server functions normally. Range: [0M,)                                                                                        |OBSERVER  |CLUSTER|DEFAULT|DYNAMIC_EFFECTIVE|0M           |        0|
zone1|observer|172.16.0.65|    2882|memory_limit            |CAPACITY |0M   |the size of the memory reserved for internal use(for testing purpose), 0 means follow memory_limit_percentage. Range: 0, [1G,).                                                                                                                                |OBSERVER  |CLUSTER|DEFAULT|DYNAMIC_EFFECTIVE|0M           |        1|
zone1|observer|172.16.0.65|    2882|datafile_disk_percentage|INT      |0    |the percentage of disk space used by the data files. Range: [0,99] in integer                                                                                                                                                                                  |SSTABLE   |CLUSTER|DEFAULT|DYNAMIC_EFFECTIVE|0            |        1|
zone1|observer|172.16.0.65|    2882|datafile_size           |CAPACITY |24G  |size of the data file. Range: [0, +∞)                                                                                                                                                                                                                          |SSTABLE   |CLUSTER|DEFAULT|DYNAMIC_EFFECTIVE|0M           |        0|
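
The parameter output shows memory_limit at 0M and memory_limit_percentage at 70, not the 80% mentioned earlier in the thread. The sketch below works out what that implies, assuming the 29.34 GB memory_total from gv$ob_servers was reported under this 70% setting and that the limit is a plain percentage of physical memory (the info column only says that 0 means "follow memory_limit_percentage"). Variable names are illustrative.

```python
# Values copied from the outputs posted in this thread.
memory_total_gb = 29.34    # memory_total from gv$ob_servers (after restart)
system_memory_gb = 1.00    # system_memory parameter
mem_assigned_gb = 28.00    # mem_assigned from gv$ob_servers
current_pct = 70           # memory_limit_percentage from show parameters

# Assumption: memory_total = physical memory * memory_limit_percentage / 100.
implied_physical_gb = memory_total_gb / (current_pct / 100)
limit_at_80_gb = implied_physical_gb * 0.80

# Memory available to tenants after reserving system_memory.
tenant_capacity_gb = memory_total_gb - system_memory_gb
unassigned_gb = tenant_capacity_gb - mem_assigned_gb

print(f"implied physical memory : {implied_physical_gb:.2f} GB")
print(f"observer limit at 80%   : {limit_at_80_gb:.2f} GB")
print(f"tenant capacity         : {tenant_capacity_gb:.2f} GB")
print(f"unassigned tenant memory: {unassigned_gb:.2f} GB")
```

Under these assumptions the host has about 41.9 GB of physical memory, which agrees with the 41Gi shown by free -h, and the 0.34 GB of unassigned tenant memory matches the memory_free column.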

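Leckun · June 4, 2025, 09:35

Those figures are the allocated amounts, not actual usage.

淇铭 · June 4, 2025, 09:53

Did you build the OB cluster with obd and then change the log_disk_size setting afterwards from the command line? log_disk_size appears to be 100G.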