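【Environment】Production
【OB or other component】OCP
【Version】4.3.1
【Problem description】On one host of the OCP metadb cluster, the ocp_monagent service was killed by the OOM killer. After I tried changing its memory limit, it is still killed as soon as it starts. I found no obvious error in the error log.
When I start the process manually, it exits almost immediately: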
2025-09-23T15:12:31.1491+08:00 INFO [2315329,3a68702a730dd7ad] caller=http/http_command.go:90:writeOk: command request /api/v1/status succeed
2025-09-23T15:13:01.50088+08:00 INFO [2315329,0727d7e9e5ee9702] caller=http/http_command.go:47:func1: handling command request /api/v1/status <nil>
2025-09-23T15:13:01.50183+08:00 INFO [2315329,20cb64f8a58a57c1] caller=http/http_command.go:90:writeOk: command request /api/v1/status succeed
2025-09-23T15:13:13.1035+08:00 INFO [2315329,2bd28384d1e5a5cd] caller=http/http_command.go:47:func1: handling command request /api/v1/startService {ocp_monagent}
2025-09-23T15:13:13.10368+08:00 INFO [2315329,] caller=agentd/service.go:133:startProc: starting service fields: service=ocp_monagent
2025-09-23T15:13:13.10409+08:00 INFO [2315329,] caller=agentd/service.go:140:startProc: service process started. pid: 2344696 fields:, service=ocp_monagent
2025-09-23T15:13:13.10419+08:00 INFO [2315329,b84ede0cbc6baa0f] caller=http/http_command.go:90:writeOk: command request /api/v1/startService succeed
2025-09-23T15:13:15.60871+08:00 WARN [2315329,] caller=agentd/service.go:182:guard: service exited with code -1. service state: running fields:, service=ocp_monagent
2025-09-23T15:13:15.60885+08:00 WARN [2315329,] caller=agentd/service.go:205:guard: service exited too quickly. live time: 2504560377, MinLiveTime: 3000000000, count:1 fields: service=ocp_monagent
2025-09-23T15:13:15.60887+08:00 INFO [2315329,] caller=runtime/asm_amd64.s:1594:goexit: recovering service fields: service=ocp_monagent
2025-09-23T15:13:15.60897+08:00 INFO [2315329,] caller=agentd/service.go:350:removePid: remove pid file /home/admin/ocp_agent/run/ocp_monagent.pid
2025-09-23T15:13:15.60906+08:00 INFO [2315329,] caller=agentd/service.go:133:startProc: starting service fields:, service=ocp_monagent
2025-09-23T15:13:15.61055+08:00 INFO [2315329,] caller=agentd/service.go:140:startProc: service process started. pid: 2344747 fields:, service=ocp_monagent
2025-09-23T15:13:17.84521+08:00 WARN [2315329,] caller=agentd/service.go:182:guard: service exited with code -1. service state: running fields:, service=ocp_monagent
2025-09-23T15:13:17.84532+08:00 WARN [2315329,] caller=agentd/service.go:205:guard: service exited too quickly. live time: 2234643145, MinLiveTime: 3000000000, count:2 fields:, service=ocp_monagent
2025-09-23T15:13:17.84534+08:00 INFO [2315329,] caller=runtime/asm_amd64.s:1594:goexit: recovering service fields: service=ocp_monagent
2025-09-23T15:13:17.84549+08:00 INFO [2315329,] caller=agentd/service.go:350:removePid: remove pid file /home/admin/ocp_agent/run/ocp_monagent.pid
2025-09-23T15:13:17.84565+08:00 INFO [2315329,] caller=agentd/service.go:133:startProc: starting service fields:, service=ocp_monagent
2025-09-23T15:13:17.84604+08:00 INFO [2315329,] caller=agentd/service.go:140:startProc: service process started. pid: 2344795 fields: service=ocp_monagent
2025-09-23T15:13:20.36003+08:00 WARN [2315329,] caller=agentd/service.go:182:guard: service exited with code -1. service state: running fields: service=ocp_monagent
2025-09-23T15:13:20.36015+08:00 WARN [2315329,] caller=agentd/service.go:205:guard: service exited too quickly. live time: 2513959564, MinLiveTime: 3000000000, count:3 fields:, service=ocp_monagent
2025-09-23T15:13:20.36021+08:00 ERROR [2315329,] caller=agentd/service.go:207:guard: service exited too quickly. live time: 2513959564, MinLiveTime: 3000000000, count: 3 fields:, service=ocp_monagent
2025-09-23T15:13:20.36034+08:00 INFO [2315329,] caller=agentd/service.go:350:removePid: remove pid file /home/admin/ocp_agent/run/ocp_monagent.pid
2025-09-23T15:13:31.54292+08:00 INFO [2315329,bfb197683c077ff8] caller=http/http_command.go:47:func1: handling command request /api/v1/status <nil>
2025-09-23T15:13:31.54465+08:00 INFO [2315329,19f58780e08b2ca3] caller=http/http_command.go:90:writeOk: command request /api/v1/status succeed
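To make the restart loop easier to read, here is a minimal sketch that pulls the numbers out of the `guard` lines above. It assumes `live time` and `MinLiveTime` are in nanoseconds; that assumption matches the timestamps (started at 15:13:13.104, exited at 15:13:15.608, about 2.5 s). The regular expression and the `parse_guard` helper are illustrative, not part of ocp_agent.

```python
import re

# Matches the "service exited too quickly" lines logged by agentd/service.go.
GUARD_RE = re.compile(r"live time: (\d+), MinLiveTime: (\d+), count:\s*(\d+)")


def parse_guard(line):
    """Return live time and threshold in seconds, or None if the line does not match."""
    m = GUARD_RE.search(line)
    if m is None:
        return None
    live_ns, min_ns, count = (int(g) for g in m.groups())
    return {
        "live_s": live_ns / 1e9,        # assumption: value is in nanoseconds
        "min_live_s": min_ns / 1e9,
        "count": count,
        "too_quick": live_ns < min_ns,
    }


# Values copied from the three WARN lines in the log above.
samples = [
    "service exited too quickly. live time: 2504560377, MinLiveTime: 3000000000, count:1",
    "service exited too quickly. live time: 2234643145, MinLiveTime: 3000000000, count:2",
    "service exited too quickly. live time: 2513959564, MinLiveTime: 3000000000, count:3",
]

for line in samples:
    info = parse_guard(line)
    print(f"attempt {info['count']}: lived {info['live_s']:.2f}s "
          f"(threshold {info['min_live_s']:.0f}s), too quick: {info['too_quick']}")
```

Under that reading, each of the three attempts lived between roughly 2.2 and 2.5 seconds, below the 3 second threshold, and after the third attempt the log shows an ERROR and no further restart.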
System log from dmesg:
[Tue Sep 23 14:59:49 2025] ocp_monagent invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=0
[Tue Sep 23 14:59:49 2025] CPU: 0 PID: 2344745 Comm: ocp_monagent Not tainted 4.19.0-240.23.15.el8_2.bclinux.x86_64 #1
[Tue Sep 23 14:59:49 2025] Hardware name: RDO KVM, BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[Tue Sep 23 14:59:49 2025] Call Trace:
[Tue Sep 23 14:59:49 2025] dump_stack+0x5c/0x80
[Tue Sep 23 14:59:49 2025] dump_header+0x51/0x314
[Tue Sep 23 14:59:49 2025] oom_kill_process.cold.28+0xb/0x10
[Tue Sep 23 14:59:49 2025] out_of_memory+0x1c1/0x4a0
[Tue Sep 23 14:59:49 2025] mem_cgroup_out_of_memory+0xbe/0xd0
[Tue Sep 23 14:59:49 2025] try_charge+0x6f4/0x780
[Tue Sep 23 14:59:49 2025] ? mem_cgroup_commit_charge+0x7a/0x550
[Tue Sep 23 14:59:49 2025] mem_cgroup_try_charge+0x8b/0x190
[Tue Sep 23 14:59:49 2025] __add_to_page_cache_locked+0x274/0x350
[Tue Sep 23 14:59:49 2025] ? __mod_lruvec_state+0x44/0x110
[Tue Sep 23 14:59:49 2025] ? scan_shadow_nodes+0x30/0x30
[Tue Sep 23 14:59:49 2025] add_to_page_cache_lru+0x4a/0xc0
[Tue Sep 23 14:59:49 2025] iomap_readpages_actor+0x103/0x230
[Tue Sep 23 14:59:49 2025] iomap_apply+0xff/0x310
[Tue Sep 23 14:59:49 2025] ? iomap_ioend_try_merge+0xe0/0xe0
[Tue Sep 23 14:59:49 2025] ? __blk_mq_delay_run_hw_queue+0x141/0x160
[Tue Sep 23 14:59:49 2025] ? iomap_ioend_try_merge+0xe0/0xe0
[Tue Sep 23 14:59:49 2025] iomap_readpages+0xa8/0x1e0
[Tue Sep 23 14:59:49 2025] ? iomap_ioend_try_merge+0xe0/0xe0
[Tue Sep 23 14:59:49 2025] read_pages+0x6b/0x190
[Tue Sep 23 14:59:49 2025] ? xfrm_timer_handler+0x1e0/0x380
[Tue Sep 23 14:59:49 2025] __do_page_cache_readahead+0x16f/0x1e0
[Tue Sep 23 14:59:49 2025] filemap_fault+0x2de/0x840
[Tue Sep 23 14:59:49 2025] ? do_truncate+0x86/0xc0
[Tue Sep 23 14:59:49 2025] ? __mod_lruvec_state+0x44/0x110
[Tue Sep 23 14:59:49 2025] ? page_add_file_rmap+0x103/0x170
[Tue Sep 23 14:59:49 2025] ? alloc_set_pte+0xbf/0x4a0
[Tue Sep 23 14:59:49 2025] ? _cond_resched+0x15/0x30
[Tue Sep 23 14:59:49 2025] __xfs_filemap_fault+0x6d/0x200 [xfs]
[Tue Sep 23 14:59:49 2025] __do_fault+0x38/0xc0
[Tue Sep 23 14:59:49 2025] do_fault+0x191/0x3d0
[Tue Sep 23 14:59:49 2025] __handle_mm_fault+0x3e6/0x7c0
[Tue Sep 23 14:59:49 2025] handle_mm_fault+0xc2/0x1d0
[Tue Sep 23 14:59:49 2025] __do_page_fault+0x21b/0x4d0
[Tue Sep 23 14:59:49 2025] do_page_fault+0x32/0x110
[Tue Sep 23 14:59:49 2025] ? async_page_fault+0x8/0x30
[Tue Sep 23 14:59:49 2025] async_page_fault+0x1e/0x30
[Tue Sep 23 14:59:49 2025] RIP: 0033:0xbf0360
[Tue Sep 23 14:59:49 2025] Code: Bad RIP value.
[Tue Sep 23 14:59:49 2025] RSP: 002b:000000c001296900 EFLAGS: 00010202
[Tue Sep 23 14:59:49 2025] RAX: 00000000017d7cc0 RBX: 000000c0000460e0 RCX: 000000c001961b98
[Tue Sep 23 14:59:49 2025] RDX: 0000000000bf3500 RSI: 00000000023a9d40 RDI: 000000c001961b98
[Tue Sep 23 14:59:49 2025] RBP: 000000c001296920 R08: 000000c0008f78c0 R09: 0000000000000001
[Tue Sep 23 14:59:49 2025] R10: 0000000001623aa8 R11: 00007fe92a53efff R12: 0000000000000000
[Tue Sep 23 14:59:49 2025] R13: 0000000000000032 R14: 000000c000603040 R15: 00007fe92a43b787
[Tue Sep 23 14:59:49 2025] memory: usage 2097152kB, limit 2097152kB, failcnt 17058805
[Tue Sep 23 14:59:49 2025] memory+swap: usage 2097152kB, limit 9007199254740988kB, failcnt 0
[Tue Sep 23 14:59:49 2025] kmem: usage 1905836kB, limit 9007199254740988kB, failcnt 0
[Tue Sep 23 14:59:49 2025] Memory cgroup stats for /ocp_agent/ocp_monagent:
[Tue Sep 23 14:59:49 2025] anon 190922752
file 4194304
kernel_stack 258048
slab 1950138368
sock 0
shmem 0
file_mapped 1081344
file_dirty 0
file_writeback 540672
anon_thp 85983232
inactive_anon 0
active_anon 190988288
inactive_file 3022848
active_file 610304
unevictable 0
slab_reclaimable 423550976
slab_unreclaimable 1526587392
pgfault 11918213208
pgmajfault 75306
workingset_refault 36474306
workingset_activate 7658673
workingset_nodereclaim 0
pgrefill 8842497
pgscan 55700801
pgsteal 50964855
pgactivate 1083819
pgdeactivate 7780478
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 229944
thp_collapse_alloc 9900
[Tue Sep 23 14:59:49 2025] Tasks state (memory values in pages):
[Tue Sep 23 14:59:49 2025] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Tue Sep 23 14:59:49 2025] [2344696] 0 2344696 20082860 52429 671744 0 0 ocp_monagent
[Tue Sep 23 14:59:49 2025] [2344718] 0 2344718 5826 629 86016 0 0 chronyc
[Tue Sep 23 14:59:49 2025] [2344733] 0 2344733 5826 674 90112 0 0 chronyc
[Tue Sep 23 14:59:49 2025] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ocp_monagent,mems_allowed=0-1,oom_memcg=/ocp_agent/ocp_monagent,task_memcg=/ocp_agent/ocp_monagent,task=ocp_monagent,pid=2344696,uid=0
[Tue Sep 23 14:59:49 2025] Memory cgroup out of memory: Killed process 2344696 (ocp_monagent) total-vm:80331440kB, anon-rss:189804kB, file-rss:19912kB, shmem-rss:0kB,UID:0
[Tue Sep 23 14:59:49 2025] oom_reaper: reaped process 2344696 (ocp_monagent), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Tue Sep 23 14:59:51 2025] ocp_monagent invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=0
[Tue Sep 23 14:59:51 2025] CPU: 0 PID: 2344753 Comm: ocp_monagent Not tainted 4.19.0-240.23.15.el8_2.bclinux.x86_64 #1
[Tue Sep 23 14:59:51 2025] Hardware name: RDO KVM, BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
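As a rough aid for reading the dump above, the sketch below expresses the main counters as a share of the cgroup limit. It assumes the "Memory cgroup stats" counters are in bytes (the `anon` value is close to the `anon-rss:189804kB` reported for the killed process, which fits that reading). The helper name `share_of_limit` is illustrative only.

```python
# Figures copied from the dmesg dump above.
LIMIT_KB = 2097152        # "memory: usage 2097152kB, limit 2097152kB"
KMEM_USAGE_KB = 1905836   # "kmem: usage 1905836kB"

# Assumption: these counters are reported in bytes.
STATS_BYTES = {
    "anon": 190922752,
    "file": 4194304,
    "slab": 1950138368,
    "slab_reclaimable": 423550976,
    "slab_unreclaimable": 1526587392,
}


def share_of_limit(stats_bytes, limit_kb):
    """Return each counter as a fraction of the cgroup memory limit."""
    limit_bytes = limit_kb * 1024
    return {name: value / limit_bytes for name, value in stats_bytes.items()}


shares = share_of_limit(STATS_BYTES, LIMIT_KB)
for name, frac in shares.items():
    print(f"{name:20s} {STATS_BYTES[name] / 2**20:10.1f} MiB  {frac:6.1%}")
print(f"{'kmem usage':20s} {KMEM_USAGE_KB / 1024:10.1f} MiB  {KMEM_USAGE_KB / LIMIT_KB:6.1%}")
```

If the byte assumption holds, kernel slab memory charged to `/ocp_agent/ocp_monagent` accounts for about nine tenths of the 2 GiB limit, while the process's own anonymous memory is under a tenth of it.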
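【Reproduction steps】Operations performed before and after the problem occurred
【Attachments and logs】We recommend collecting diagnostic information with obdiag, the OceanBase agile diagnostic tool. For details, see the link (right-click to open):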