Monitoring exporter exception on the server && server agent memory over limit

【Environment】Production
【OB or other component】OB
【Version】OBCE 4.2.1
【Problem description】The server reports a monitoring exporter exception && the server agent memory is over its limit
【Reproduction path】Operations performed before and after the problem appeared
【Attachments and logs】
The memory-over-limit problem had been occurring for a while, so I ran "/home/admin/ocp_agent/bin/ocp_agentctl config -u monagent.limit.memory.quota=4096MB" and then restarted ocp_agent. After about a week of observation the problem reappeared, this time accompanied by both alarms at once. Are the two directly related?

  1. Pull back the ocp_agent log from the node whose agent memory exceeded the limit. It is at /home/admin/ocp_agent/log/monagent-xxx.log by default; find the entries around the time of the alarm (a grep sketch for this follows the commands below).
  2. Check ocp_monagent's memory usage, e.g. with top | grep ocp_monagent. Is the usage fairly high, say over 1.5GB? If so, run the following on the host to capture the agent's heap profile, and send it over so we can analyze where the memory is going:
MON_PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)
curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$MON_PID.sock http://unix-socket-server/debug/pprof/heap --output /tmp/heap.pb.gz
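If it helps, one rough way to pull the log lines around the alarm time is a plain grep (the timestamp pattern below is only a placeholder; adjust it to the alarm time and to the log's actual time format, and the glob to the actual monagent log file names):

grep '2024-03-25 17:' /home/admin/ocp_agent/log/monagent-*.log > /tmp/monagent_around_alarm.log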

Also:

  1. These two alarms are usually related. ocp_agentd is the daemon process of ocp_agent; if ocp_monagent's memory exceeds 2GB, ocp_agentd automatically restarts ocp_monagent, and the monitoring-exporter-exception alarm may fire during that window. (A simple way to watch how close the process gets to that threshold is sketched after this list.)
  2. ocp_monagent exceeding its memory limit has no impact on OB itself. The possible impact: the agent is restarted when it exceeds the limit, and a few seconds of monitoring data may be lost during the restart.
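A minimal sketch of such a watcher, reusing the pid file path from the commands above (a plain shell loop for illustration, not an OCP tool):

# print ocp_monagent's RSS in KB once a minute, so you can see how close it gets to the restart threshold
while true; do
  echo "$(date '+%F %T') rss_kb=$(ps -o rss= -p "$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)")"
  sleep 60
done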


So this means there was not enough memory and the data was not collected, right?

1. I had previously run the following on this node to increase the memory:
/home/admin/ocp_agent/bin/ocp_agentctl config -u monagent.limit.memory.quota=4096MB
/home/admin/ocp_agent/bin/ocp_agentctl restart
2. Checking ocp_monagent's memory usage on the node, it looks like it exceeds 4G.


3. Memory consumption:

heap.pb.gz (196.3 KB)

How many nodes does your OCP manage, and how many tenants in total?

You could also consider upgrading to OCP 4.2.2, which includes some optimizations in this area.

In the top output you screenshotted, you need to look at column 6, RES, which is the memory actually in use; right now the actual usage is only around 60MB. Column 5 is virtual memory.
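If you want to see both numbers without a screenshot, a plain ps call against the pid file mentioned earlier works too (just an illustration):

# VSZ = virtual size, RSS = resident (actual) memory, both in KB
ps -o pid,vsz,rss,comm -p "$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)"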
I also analyzed the heap.pb.gz file you provided; it shows the memory usage.

Since this file was not captured while memory was high, it is not of much value.

7 machines, 51 tenants (including the sys tenant).

I captured the profile again while the ocp_monagent process was using a lot of memory. Please help analyze it again, thanks!


heap (1).pb.gz (198.1 KB)

I analyzed the second heap file you sent. First, the results:

      File: ocp_monagent
Build ID: 9dd5b81a810ce122c7d9b2ecd215f48a830d4c0d
Type: inuse_space
Time: Mar 25, 2024 at 5:12pm (CST)
Showing nodes accounting for 1459.42MB, 100% of 1459.42MB total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                          660.68MB 99.85% |   github.com/oceanbase/obagent/monitor/plugins/inputs/oceanbase.(*SqlAuditInput).Collect /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/oceanbase/sql_audit.go:295 (inline)
                                               1MB  0.15% |   github.com/oceanbase/obagent/lib/log_analyzer.ParseLines /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/lib/log_analyzer/log_analyzer.go:86 (inline)
  661.68MB 45.34% 45.34%   661.68MB 45.34%                | github.com/oceanbase/obagent/monitor/message.(*Message).AddField /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/message/message.go:99
----------------------------------------------------------+-------------
                                          120.06MB 67.42% |   github.com/oceanbase/obagent/monitor/plugins/processors/retag.(*RetagProcessor).Process /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/processors/retag/retag.go:108 (inline)
                                           29.01MB 16.29% |   github.com/oceanbase/obagent/monitor/plugins/processors/retag.(*RetagProcessor).Process /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/processors/retag/retag.go:125 (inline)
                                           13.50MB  7.58% |   github.com/oceanbase/obagent/monitor/common.(*addTags).Apply /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/common/transformer.go:126 (inline)
                                              13MB  7.30% |   github.com/oceanbase/obagent/lib/log_analyzer.(*ObLogAnalyzer).ParseLine /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/lib/log_analyzer/ob_log_analyzer.go:148 (inline)
                                            2.50MB  1.40% |   github.com/oceanbase/obagent/monitor/plugins/inputs/log_tailer.(*LogTailerExecutor).processLogByLine.func2 /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/log_tailer/log_tailer_executor.go:139 (inline)
  178.08MB 12.20% 57.54%   178.08MB 12.20%                | github.com/oceanbase/obagent/monitor/message.(*Message).AddTag /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/message/message.go:161
----------------------------------------------------------+-------------
                                              95MB   100% |   github.com/oceanbase/obagent/monitor/plugins/inputs/log_tailer.(*LogTailerExecutor).processLogByLine.func1 /workspace/code-repo/rpm/.rpm_create/SOURCES/ocp-agent-ce/monitor/plugins/inputs/log_tailer/log_tailer_executor.go:124 (inline)
      95MB  6.51% 64.05%       95MB  6.51%                | bufio.(*Scanner).Text /home/admin/go/src/bufio/scan.go:113
----------------------------------------------------------+-------------
                                           49.54MB   100% |   database/sql.(*Rows).Scan /home/admin/go/src/database/sql/sql.go:3287
   49.54MB  3.39% 67.44%    49.54MB  3.39%                | database/sql.convertAssignRows /home/admin/go/src/database/sql/convert.go:261
----------------------------------------------------------+-------------
                                           44.08MB   100% |   strings.(*Builder).Grow /home/admin/go/src/strings/builder.go:82 (inline)
   44.08MB  3.02% 70.47%    44.08MB  3.02%                | strings.(*Builder).grow /home/admin/go/src/strings/builder.go:68
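For reference, the table above is pprof's peek report; assuming Go is installed, a similar view can be reproduced from the heap file with something like:

go tool pprof -peek 'AddField|AddTag' heap.pb.gz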

Now the conclusion:
The biggest memory consumer right now is the collection done by the sql_audit plugin. sql_audit data scales with business traffic, which suggests the traffic on the business side is quite heavy.
Please also run a few queries against OCP's monitordb:

select cluster_name,count(1) from ob_hist_sql_audit_stat_0 partition(P20240324) group by cluster_name;
select cluster_name,count(1) from ob_hist_sql_audit_stat_0 partition(P20240325) group by cluster_name;

select cluster_name,count(1) from ob_hist_sql_audit_sample partition(P20240324) group by cluster_name;
select cluster_name,count(1) from ob_hist_sql_audit_sample partition(P20240325) group by cluster_name;

select cluster_name,count(1) from ob_hist_sqltext partition(P20240324) group by cluster_name;
select cluster_name,count(1) from ob_hist_sqltext partition(P20240325) group by cluster_name;

Also: once you have the agent's heap.pb.gz file, you can open it yourself with the go tool and take a look:

go tool pprof -http=127.0.0.1:8081 heap.pb.gz
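If the machine has no browser, a plain-text summary of the top memory consumers also works:

go tool pprof -top heap.pb.gz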
mysql> select cluster_name,count(1) from ob_hist_sql_audit_stat_0 partition(P20240324) group by cluster_name;
+-----------------+----------+
| cluster_name    | count(1) |
+-----------------+----------+
| catl_cloudDB_ce |  6219530 |
| XMdbcloud       |  1405060 |
+-----------------+----------+
2 rows in set (1.19 sec)

mysql> select cluster_name,count(1) from ob_hist_sql_audit_stat_0 partition(P20240325) group by cluster_name;
+-----------------+----------+
| cluster_name    | count(1) |
+-----------------+----------+
| catl_cloudDB_ce |  3668364 |
| XMdbcloud       |   742093 |
+-----------------+----------+
2 rows in set (1.25 sec)

mysql> select cluster_name,count(1) from ob_hist_sql_audit_sample partition(P20240324) group by cluster_name;
+-----------------+----------+
| cluster_name    | count(1) |
+-----------------+----------+
| catl_cloudDB_ce |  4672981 |
| XMdbcloud       |     1494 |
+-----------------+----------+
2 rows in set (0.78 sec)

mysql> select cluster_name,count(1) from ob_hist_sql_audit_sample partition(P20240325) group by cluster_name;
+-----------------+----------+
| cluster_name    | count(1) |
+-----------------+----------+
| catl_cloudDB_ce |  1613409 |
| XMdbcloud       |     1031 |
+-----------------+----------+
2 rows in set (0.47 sec)

mysql> select cluster_name,count(1) from ob_hist_sqltext partition(P20240324) group by cluster_name;
+-----------------+----------+
| cluster_name    | count(1) |
+-----------------+----------+
| catl_cloudDB_ce |   110518 |
| XMdbcloud       |     1373 |
+-----------------+----------+
2 rows in set (0.22 sec)

mysql> select cluster_name,count(1) from ob_hist_sqltext partition(P20240325) group by cluster_name;
+-----------------+----------+
| cluster_name    | count(1) |
+-----------------+----------+
| catl_cloudDB_ce |    79958 |
| XMdbcloud       |     1247 |
+-----------------+----------+
2 rows in set (0.18 sec)

It looks like the catl_cloudDB_ce cluster generates a rather large amount of collected SQL data per day; note that both ob_hist_sql_audit_stat_0 and ob_hist_sql_audit_sample are at the million-row level. ob_hist_sql_audit_stat_0 holds the base data for the SQL diagnostics pages, while ob_hist_sql_audit_sample mainly stores sampled SQL related to slow SQL, parallel SQL and transactions. The data that can potentially be reduced is ob_hist_sql_audit_sample, so we need to dig one level deeper to see what the sampling types (cause_type) actually are:

select cluster_name,cause_type, count(1) from ob_hist_sql_audit_sample partition(P20240324) group by cluster_name, cause_type;

select cluster_name,cause_type, count(1) from ob_hist_sql_audit_sample partition(P20240325) group by cluster_name, cause_type;

For example, you can raise the threshold at which a SQL is collected as a slow SQL.
Run the following on the nodes of the corresponding cluster:

# The default slow-SQL collection threshold monagent.ob.slow.sql.threshold is 100ms; raising it to 1000ms is generally still enough for analyzing SQL problems
/home/admin/ocp_agent/bin/ocp_agentctl config -u monagent.ob.slow.sql.threshold=1000ms
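If the new threshold does not seem to take effect right away, restarting the agent the same way you did earlier should pick it up (this is an assumption on my side; I have not verified whether this particular key is reloaded on the fly):

/home/admin/ocp_agent/bin/ocp_agentctl restart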

Of course, simply stopping SQL collection is the most direct and effective way to reduce the agent's memory. With SQL collection disabled, Top SQL has no data, so you can no longer analyze SQL problems in OCP, but all other monitoring features are unaffected. It's your trade-off to make; I'm just laying out the options.

Can we work around this problem by increasing ocp_monagent's memory?

If the host still has memory to spare, yes, you can increase the agent's memory further to work around it.
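For example, reusing the same command you ran before (the 6144MB value below is only an illustration; size it according to the host's free memory), followed by a restart:

/home/admin/ocp_agent/bin/ocp_agentctl config -u monagent.limit.memory.quota=6144MB
/home/admin/ocp_agent/bin/ocp_agentctl restart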

You mentioned that exceeding a threshold causes ocp_monagent to be restarted. Where can that threshold be adjusted?

Before I run go tool pprof -http=127.0.0.1:8081 heap.pb.gz, do I need to install the Go language package first?