ocp_monagent on the OCP meta cluster is killed right after startup

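【Environment】Production
【OB or other component】OCP
【Version】4.3.1
【Problem description】On one host of the OCP MetaDB cluster, the ocp_monagent service was OOM-killed. After I tried changing its memory limit, it still gets killed as soon as it starts. I found no obvious error in the error log.
When I start the process manually, it exits almost immediately: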

2025-09-23T15:12:31.1491+08:00 INFO [2315329,3a68702a730dd7ad] caller=http/http_command.go:90:writeOk: command request /api/v1/status succeed
2025-09-23T15:13:01.50088+08:00 INFO [2315329,0727d7e9e5ee9702] caller=http/http_command.go:47:func1: handling command request /api/v1/status <nil>
2025-09-23T15:13:01.50183+08:00 INFO [2315329,20cb64f8a58a57c1] caller=http/http_command.go:90:writeOk: command request /api/v1/status succeed
2025-09-23T15:13:13.1035+08:00 INFO [2315329,2bd28384d1e5a5cd] caller=http/http_command.go:47:func1: handling command request /api/v1/startService {ocp_monagent}
2025-09-23T15:13:13.10368+08:00 INFO [2315329,] caller=agentd/service.go:133:startProc: starting service fields: service=ocp_monagent
2025-09-23T15:13:13.10409+08:00 INFO [2315329,] caller=agentd/service.go:140:startProc: service process started. pid: 2344696 fields:, service=ocp_monagent
2025-09-23T15:13:13.10419+08:00 INFO [2315329,b84ede0cbc6baa0f] caller=http/http_command.go:90:writeOk: command request /api/v1/startService succeed
2025-09-23T15:13:15.60871+08:00 WARN [2315329,] caller=agentd/service.go:182:guard: service exited with code -1. service state: running fields:, service=ocp_monagent
2025-09-23T15:13:15.60885+08:00 WARN [2315329,] caller=agentd/service.go:205:guard: service exited too quickly. live time: 2504560377, MinLiveTime: 3000000000, count:1 fields: service=ocp_monagent
2025-09-23T15:13:15.60887+08:00 INFO [2315329,] caller=runtime/asm_amd64.s:1594:goexit: recovering service fields: service=ocp_monagent
2025-09-23T15:13:15.60897+08:00 INFO [2315329,] caller=agentd/service.go:350:removePid: remove pid file /home/admin/ocp_agent/run/ocp_monagent.pid
2025-09-23T15:13:15.60906+08:00 INFO [2315329,] caller=agentd/service.go:133:startProc: starting service fields:, service=ocp_monagent
2025-09-23T15:13:15.61055+08:00 INFO [2315329,] caller=agentd/service.go:140:startProc: service process started. pid: 2344747 fields:, service=ocp_monagent
2025-09-23T15:13:17.84521+08:00 WARN [2315329,] caller=agentd/service.go:182:guard: service exited with code -1. service state: running fields:, service=ocp_monagent
2025-09-23T15:13:17.84532+08:00 WARN [2315329,] caller=agentd/service.go:205:guard: service exited too quickly. live time: 2234643145, MinLiveTime: 3000000000, count:2 fields:, service=ocp_monagent
2025-09-23T15:13:17.84534+08:00 INFO [2315329,] caller=runtime/asm_amd64.s:1594:goexit: recovering service fields: service=ocp_monagent
2025-09-23T15:13:17.84549+08:00 INFO [2315329,] caller=agentd/service.go:350:removePid: remove pid file /home/admin/ocp_agent/run/ocp_monagent.pid
2025-09-23T15:13:17.84565+08:00 INFO [2315329,] caller=agentd/service.go:133:startProc: starting service fields:, service=ocp_monagent
2025-09-23T15:13:17.84604+08:00 INFO [2315329,] caller=agentd/service.go:140:startProc: service process started. pid: 2344795 fields: service=ocp_monagent
2025-09-23T15:13:20.36003+08:00 WARN [2315329,] caller=agentd/service.go:182:guard: service exited with code -1. service state: running fields: service=ocp_monagent
2025-09-23T15:13:20.36015+08:00 WARN [2315329,] caller=agentd/service.go:205:guard: service exited too quickly. live time: 2513959564, MinLiveTime: 3000000000, count:3 fields:, service=ocp_monagent
2025-09-23T15:13:20.36021+08:00 ERROR [2315329,] caller=agentd/service.go:207:guard: service exited too quickly. live time: 2513959564, MinLiveTime: 3000000000, count: 3 fields:, service=ocp_monagent
2025-09-23T15:13:20.36034+08:00 INFO [2315329,] caller=agentd/service.go:350:removePid: remove pid file /home/admin/ocp_agent/run/ocp_monagent.pid
2025-09-23T15:13:31.54292+08:00 INFO [2315329,bfb197683c077ff8] caller=http/http_command.go:47:func1: handling command request /api/v1/status <nil>
2025-09-23T15:13:31.54465+08:00 INFO [2315329,19f58780e08b2ca3] caller=http/http_command.go:90:writeOk: command request /api/v1/status succeed
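
To make the guard messages above easier to read, here is a minimal sketch that converts the `live time` and `MinLiveTime` values into seconds. It assumes both values are nanoseconds, which is not stated in the log but matches the timestamps (first start at 15:13:13.10409, exit at 15:13:15.60871, about 2.5 s apart). The names `GUARD_LINES` and `parse_guard_line` are made up for this illustration.

```python
import re

# Guard messages copied from the agentd log above (trailing fields trimmed).
GUARD_LINES = [
    "service exited too quickly. live time: 2504560377, MinLiveTime: 3000000000, count:1",
    "service exited too quickly. live time: 2234643145, MinLiveTime: 3000000000, count:2",
    "service exited too quickly. live time: 2513959564, MinLiveTime: 3000000000, count:3",
]

PATTERN = re.compile(r"live time: (\d+), MinLiveTime: (\d+), count:\s*(\d+)")


def parse_guard_line(line):
    """Return the restart count and both durations in seconds, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    live_ns, min_ns, count = (int(group) for group in match.groups())
    return {
        "count": count,
        "live_s": live_ns / 1e9,   # assumption: the raw value is in nanoseconds
        "min_s": min_ns / 1e9,
        "too_quick": live_ns < min_ns,
    }


if __name__ == "__main__":
    for line in GUARD_LINES:
        info = parse_guard_line(line)
        print(f"attempt {info['count']}: lived {info['live_s']:.2f}s, "
              f"minimum {info['min_s']:.2f}s, too quick: {info['too_quick']}")
```

Under that assumption, all three attempts lived well under the 3-second minimum, after which agentd logged an ERROR and stopped restarting the service.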

System log from dmesg:

[Tue Sep 23 14:59:49 2025] ocp_monagent invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=0
[Tue Sep 23 14:59:49 2025] CPU: 0 PID: 2344745 Comm: ocp_monagent Not tainted 4.19.0-240.23.15.el8_2.bclinux.x86_64 #1
[Tue Sep 23 14:59:49 2025] Hardware name: RDO KVM, BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[Tue Sep 23 14:59:49 2025] Call Trace:
[Tue Sep 23 14:59:49 2025]  dump_stack+0x5c/0x80
[Tue Sep 23 14:59:49 2025]  dump_header+0x51/0x314
[Tue Sep 23 14:59:49 2025]  oom_kill_process.cold.28+0xb/0x10
[Tue Sep 23 14:59:49 2025]  out_of_memory+0x1c1/0x4a0
[Tue Sep 23 14:59:49 2025]  mem_cgroup_out_of_memory+0xbe/0xd0
[Tue Sep 23 14:59:49 2025]  try_charge+0x6f4/0x780
[Tue Sep 23 14:59:49 2025]  ? mem_cgroup_commit_charge+0x7a/0x550
[Tue Sep 23 14:59:49 2025]  mem_cgroup_try_charge+0x8b/0x190
[Tue Sep 23 14:59:49 2025]  __add_to_page_cache_locked+0x274/0x350
[Tue Sep 23 14:59:49 2025]  ? __mod_lruvec_state+0x44/0x110
[Tue Sep 23 14:59:49 2025]  ? scan_shadow_nodes+0x30/0x30
[Tue Sep 23 14:59:49 2025]  add_to_page_cache_lru+0x4a/0xc0
[Tue Sep 23 14:59:49 2025]  iomap_readpages_actor+0x103/0x230
[Tue Sep 23 14:59:49 2025]  iomap_apply+0xff/0x310
[Tue Sep 23 14:59:49 2025]  ? iomap_ioend_try_merge+0xe0/0xe0
[Tue Sep 23 14:59:49 2025]  ? __blk_mq_delay_run_hw_queue+0x141/0x160
[Tue Sep 23 14:59:49 2025]  ? iomap_ioend_try_merge+0xe0/0xe0
[Tue Sep 23 14:59:49 2025]  iomap_readpages+0xa8/0x1e0
[Tue Sep 23 14:59:49 2025]  ? iomap_ioend_try_merge+0xe0/0xe0
[Tue Sep 23 14:59:49 2025]  read_pages+0x6b/0x190
[Tue Sep 23 14:59:49 2025]  ? xfrm_timer_handler+0x1e0/0x380
[Tue Sep 23 14:59:49 2025]  __do_page_cache_readahead+0x16f/0x1e0
[Tue Sep 23 14:59:49 2025]  filemap_fault+0x2de/0x840
[Tue Sep 23 14:59:49 2025]  ? do_truncate+0x86/0xc0
[Tue Sep 23 14:59:49 2025]  ? __mod_lruvec_state+0x44/0x110
[Tue Sep 23 14:59:49 2025]  ? page_add_file_rmap+0x103/0x170
[Tue Sep 23 14:59:49 2025]  ? alloc_set_pte+0xbf/0x4a0
[Tue Sep 23 14:59:49 2025]  ? _cond_resched+0x15/0x30
[Tue Sep 23 14:59:49 2025]  __xfs_filemap_fault+0x6d/0x200 [xfs]
[Tue Sep 23 14:59:49 2025]  __do_fault+0x38/0xc0
[Tue Sep 23 14:59:49 2025]  do_fault+0x191/0x3d0
[Tue Sep 23 14:59:49 2025]  __handle_mm_fault+0x3e6/0x7c0
[Tue Sep 23 14:59:49 2025]  handle_mm_fault+0xc2/0x1d0
[Tue Sep 23 14:59:49 2025]  __do_page_fault+0x21b/0x4d0
[Tue Sep 23 14:59:49 2025]  do_page_fault+0x32/0x110
[Tue Sep 23 14:59:49 2025]  ? async_page_fault+0x8/0x30
[Tue Sep 23 14:59:49 2025]  async_page_fault+0x1e/0x30
[Tue Sep 23 14:59:49 2025] RIP: 0033:0xbf0360
[Tue Sep 23 14:59:49 2025] Code: Bad RIP value.
[Tue Sep 23 14:59:49 2025] RSP: 002b:000000c001296900 EFLAGS: 00010202
[Tue Sep 23 14:59:49 2025] RAX: 00000000017d7cc0 RBX: 000000c0000460e0 RCX: 000000c001961b98
[Tue Sep 23 14:59:49 2025] RDX: 0000000000bf3500 RSI: 00000000023a9d40 RDI: 000000c001961b98
[Tue Sep 23 14:59:49 2025] RBP: 000000c001296920 R08: 000000c0008f78c0 R09: 0000000000000001
[Tue Sep 23 14:59:49 2025] R10: 0000000001623aa8 R11: 00007fe92a53efff R12: 0000000000000000
[Tue Sep 23 14:59:49 2025] R13: 0000000000000032 R14: 000000c000603040 R15: 00007fe92a43b787
[Tue Sep 23 14:59:49 2025] memory: usage 2097152kB, limit 2097152kB, failcnt 17058805
[Tue Sep 23 14:59:49 2025] memory+swap: usage 2097152kB, limit 9007199254740988kB, failcnt 0
[Tue Sep 23 14:59:49 2025] kmem: usage 1905836kB, limit 9007199254740988kB, failcnt 0
[Tue Sep 23 14:59:49 2025] Memory cgroup stats for /ocp_agent/ocp_monagent:
[Tue Sep 23 14:59:49 2025] anon 190922752
                           file 4194304
                           kernel_stack 258048
                           slab 1950138368
                           sock 0
                           shmem 0
                           file_mapped 1081344
                           file_dirty 0
                           file_writeback 540672
                           anon_thp 85983232
                           inactive_anon 0
                           active_anon 190988288
                           inactive_file 3022848
                           active_file 610304
                           unevictable 0
                           slab_reclaimable 423550976
                           slab_unreclaimable 1526587392
                           pgfault 11918213208
                           pgmajfault 75306
                           workingset_refault 36474306
                           workingset_activate 7658673
                           workingset_nodereclaim 0
                           pgrefill 8842497
                           pgscan 55700801
                           pgsteal 50964855
                           pgactivate 1083819
                           pgdeactivate 7780478
                           pglazyfree 0
                           pglazyfreed 0
                           thp_fault_alloc 229944
                           thp_collapse_alloc 9900
[Tue Sep 23 14:59:49 2025] Tasks state (memory values in pages):
[Tue Sep 23 14:59:49 2025] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Tue Sep 23 14:59:49 2025] [2344696]     0 2344696 20082860    52429   671744        0             0 ocp_monagent
[Tue Sep 23 14:59:49 2025] [2344718]     0 2344718     5826      629    86016        0             0 chronyc
[Tue Sep 23 14:59:49 2025] [2344733]     0 2344733     5826      674    90112        0             0 chronyc
[Tue Sep 23 14:59:49 2025] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ocp_monagent,mems_allowed=0-1,oom_memcg=/ocp_agent/ocp_monagent,task_memcg=/ocp_agent/ocp_monagent,task=ocp_monagent,pid=2344696,uid=0
[Tue Sep 23 14:59:49 2025] Memory cgroup out of memory: Killed process 2344696 (ocp_monagent) total-vm:80331440kB, anon-rss:189804kB, file-rss:19912kB, shmem-rss:0kB,UID:0
[Tue Sep 23 14:59:49 2025] oom_reaper: reaped process 2344696 (ocp_monagent), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Tue Sep 23 14:59:51 2025] ocp_monagent invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=0
[Tue Sep 23 14:59:51 2025] CPU: 0 PID: 2344753 Comm: ocp_monagent Not tainted 4.19.0-240.23.15.el8_2.bclinux.x86_64 #1
[Tue Sep 23 14:59:51 2025] Hardware name: RDO KVM, BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
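
The cgroup statistics in the dmesg output can be put in proportion with a short calculation. This is only a sketch over the numbers printed above; it assumes the per-cgroup stats are in bytes and the `limit` value is in kB, as the kernel labels them. `STATS`, `LIMIT_KB` and `share_of_limit` are illustrative names.

```python
# Values copied from "Memory cgroup stats for /ocp_agent/ocp_monagent" (bytes).
STATS = {
    "anon": 190922752,
    "file": 4194304,
    "kernel_stack": 258048,
    "slab": 1950138368,
    "slab_reclaimable": 423550976,
    "slab_unreclaimable": 1526587392,
}

# From "memory: usage 2097152kB, limit 2097152kB".
LIMIT_KB = 2097152


def share_of_limit(stats, limit_bytes):
    """Return each counter as a fraction of the cgroup memory limit."""
    return {name: value / limit_bytes for name, value in stats.items()}


if __name__ == "__main__":
    limit_bytes = LIMIT_KB * 1024
    for name, share in share_of_limit(STATS, limit_bytes).items():
        mib = STATS[name] / (1024 * 1024)
        print(f"{name:20s} {mib:10.2f} MiB  {share:6.1%} of limit")
```

On these numbers, slab (kernel memory charged to the cgroup) takes roughly 91% of the 2 GiB limit, while anonymous memory of the process is only about 9%. That suggests the limit is being filled mostly by kernel slab rather than by the ocp_monagent heap, although the log alone does not show what is allocating the slab.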

Please check the memory usage on this machine:

free -m

Is this machine a physical server? If not, what type of virtual machine is it?

Follow this article to increase the memory limit of ocp_monagent, then try again:

https://www.oceanbase.com/knowledge-base/ocp-ee-1000000002829414?back=kb

This server is a KVM virtual machine. I will try adjusting this value from the command line.

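It looks normal now; the process is no longer being killed.
Earlier I followed the official documentation and changed the memoryquota value in agentd.conf, but that did not seem to take effect.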
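
One way to check whether a changed quota actually reached the kernel is to read the cgroup's memory files directly. The sketch below assumes cgroup v1 mounted at `/sys/fs/cgroup/memory` (an assumption; only the group path `/ocp_agent/ocp_monagent` comes from the dmesg output) and uses the standard cgroup v1 file names. `read_cgroup_memory` is a helper name invented for this example.

```python
from pathlib import Path

# Assumed cgroup v1 mount point plus the group path reported by the OOM killer.
DEFAULT_CGROUP_DIR = Path("/sys/fs/cgroup/memory/ocp_agent/ocp_monagent")

# Standard cgroup v1 memory controller files.
FILES = (
    "memory.limit_in_bytes",
    "memory.usage_in_bytes",
    "memory.kmem.usage_in_bytes",
    "memory.failcnt",
)


def read_cgroup_memory(cgroup_dir=DEFAULT_CGROUP_DIR):
    """Read the memory counters of a cgroup; missing or unreadable files map to None."""
    result = {}
    for name in FILES:
        try:
            result[name] = int((Path(cgroup_dir) / name).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_cgroup_memory().items():
        print(f"{name}: {value}")
```

If `memory.limit_in_bytes` still shows the old value after editing the configuration and restarting, the change did not take effect for that cgroup.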

Please post that link. Changing it that way is not correct.

monagent_process_stop ocp_monagent process stopped - V4.3.1 - Documentation - Distributed database documentation

The official documentation describes it this way in many versions.

I will report this so the documentation gets corrected.