ObAgent error: error="input plugin mysqlTableInput collect time out"

【Environment】
Deployment plan

Role       IP              Notes
obd        172.118.81.156  OceanBase automated deployment tool
observer   172.118.81.151  zone1, listens on ports 2881 and 2882
observer   172.118.81.152  zone2, listens on ports 2881 and 2882
observer   172.118.81.153  zone3, listens on ports 2881 and 2882
obproxy    172.118.81.151  listens on ports 2883 and 2884
obproxy    172.118.81.156  listens on ports 2883 and 2884
ansible    172.118.81.156  automated configuration environment
obagent    172.118.81.151  zone1, listens on ports 8088 and 8089
obagent    172.118.81.152  zone2, listens on ports 8088 and 8089
obagent    172.118.81.153  zone3, listens on ports 8088 and 8089

【OB or other component】
【Version】

obagent    | 1.2.0   | 4.el7 

【Problem description】
After ObAgent was deployed yesterday, the following requests

curl --user admin:root -L 'http://172.118.81.151:8088/metrics/ob/basic'
curl --user admin:root -L 'http://172.118.81.152:8088/metrics/ob/basic'
curl --user admin:root -L 'http://172.118.81.153:8088/metrics/ob/basic'

all returned data normally. Today, after deploying Prometheus, the browser page reported an error:

Get "http://172.118.81.151:8088/metrics/stat": dial tcp 172.118.81.151:8088: connect: connection refused

On the obd server, I tried the following two approaches, but the error persists:

obd cluster restart obagent
obd cluster redeploy obagent -c obagent-only.yaml

Both approaches do restart the agent service and open port 8088, yet curl now fails:

# curl --user admin:root -L 'http://172.118.81.151:8088/metrics/ob/basic'
curl: (56) Recv failure: Connection reset by peer

monagent.log is full of repeated entries like these:

# grep ERR /home/admin/obagent/log/monagent.log
2023-07-21T16:52:18.90808+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-21T16:52:19.19118+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-21T16:52:19.25093+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-21T16:52:19.20057+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-21T16:52:19.29594+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin nodeExporterInput collect time out"
2023-07-21T16:52:19.29576+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-21T16:52:19.84277+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-21T16:52:19.93953+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-21T16:52:20.40661+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-21T16:52:21.59825+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin nodeExporterInput collect time out"
2023-07-21T16:52:21.88406+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-21T16:52:20.13801+08:00 ERROR [23376,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
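To see at a glance which input plugins are timing out and how often, the repeated errors can be tallied with a small grep pipeline (a sketch; the log path below assumes this deployment's layout, override `LOG` if yours differs):

```shell
#!/bin/sh
# Tally "collect time out" errors per input plugin in monagent.log.
LOG="${LOG:-/home/admin/obagent/log/monagent.log}"

summarize_timeouts() {
    # Extract the plugin name from each timeout error and count occurrences,
    # most frequent first.
    grep -o 'input plugin [A-Za-z]* collect time out' "$1" 2>/dev/null \
        | sort | uniq -c | sort -rn
}

summarize_timeouts "$LOG"
```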

【Reproduction steps】
【Symptoms and impact】
Prometheus and obagent cannot be accessed normally.
【Attachments】
monagent.log (1.2 MB)
monagent_stdout.log (61.9 KB)

Hi, could you check whether the obagent process is running normally, and whether the observer can execute SQL normally?

Hi, both the observer and obproxy are fine.

$ obd cluster display obce-3zones 
Get local repositories and plugins ok
Open ssh connection ok
Cluster status check ok
Connect to observer ok
Wait for observer init ok
+--------------------------------------------------+
|                     observer                     |
+----------------+---------+------+-------+--------+
| ip             | version | port | zone  | status |
+----------------+---------+------+-------+--------+
| 172.118.81.151 | 3.1.4   | 2881 | zone1 | active |
| 172.118.81.152 | 3.1.4   | 2881 | zone2 | active |
| 172.118.81.153 | 3.1.4   | 2881 | zone3 | active |
+----------------+---------+------+-------+--------+
obclient -h172.118.81.151 -P2881 -uroot -p'cloudadmin' -Doceanbase -A

Connect to obproxy ok
+--------------------------------------------------+
|                     obproxy                      |
+----------------+------+-----------------+--------+
| ip             | port | prometheus_port | status |
+----------------+------+-----------------+--------+
| 172.118.81.151 | 2883 | 2884            | active |
| 172.118.81.156 | 2883 | 2884            | active |
+----------------+------+-----------------+--------+
obclient -h172.118.81.151 -P2883 -uroot -p'cloudadmin' -Doceanbase -A
Trace ID: e1eec470-27c9-11ee-8e52-0050568d699e

obclient [oceanbase]> show full processlist;
+------------+---------+--------+----------------------+-----------+---------+------+--------+-----------------------+----------------+------+----------------------+
| Id         | User    | Tenant | Host                 | db        | Command | Time | State  | Info                  | Ip             | Port | Proxy_sessid         |
+------------+---------+--------+----------------------+-----------+---------+------+--------+-----------------------+----------------+------+----------------------+
| 3221718176 | proxyro | sys    | 172.118.81.156:42358 | oceanbase | Sleep   |    4 | SLEEP  | NULL                  | 172.118.81.151 | 2881 | 12427209952421150738 |
| 3221516384 | root    | sys    | 172.118.81.156:42362 | oceanbase | Query   |    0 | ACTIVE | show full processlist | 172.118.81.151 | 2881 | 12427209952421150739 |
| 3221777289 | proxyro | sys    | 172.118.81.151:40524 | oceanbase | Sleep   |    5 | SLEEP  | NULL                  | 172.118.81.152 | 2881 | 12427209930946314254 |
+------------+---------+--------+----------------------+-----------+---------+------+--------+-----------------------+----------------+------+----------------------+
3 rows in set (0.406 sec)

The obagent process is not healthy; it exits on its own.

-bash-4.2$ obd cluster display obagent 
Get local repositories and plugins ok
Open ssh connection ok
Cluster status check ok
[WARN] 172.118.81.151 obagent is stopped
[WARN] 172.118.81.152 obagent is stopped
[WARN] 172.118.81.153 obagent is stopped
Trace ID: d83e7524-27c9-11ee-9d8e-0050568d699e
If you want to view detailed obd logs, please run: obd display-trace d83e7524-27c9-11ee-9d8e-0050568d699e

Could you share the obd logs?

obd.2023-07-20.log (533.9 KB)
obd.2023-07-21.log (825.6 KB)
obd.2023-07-22.log (6.3 KB)
obd.log (2.1 KB)
These are the obd logs from deployment until today.

From the obd logs, the deployment was around midnight on the 21st, but monagent.log only goes back to around 4 p.m. Could you send the complete monagent.log? Also, can the user configured as monagent.ob.monitor.user in the obagent config log in to the observer?

151-monagent.log (2.0 MB)
151-monagent_stdout.log (103.8 KB)
152-monagent.log (1.9 MB)
152-monagent_stdout.log (144.4 KB)
153-monagent.log (1.8 MB)
153-monagent_stdout.log (106.9 KB)
These six files are all the files under the obagent log directories of the three nodes.
In obagent-only.yaml, monitor_user is root:

$ grep monitor_user obagent-only.yaml 
    monitor_user: root

The root user can log in to the observer:

$ obclient -h172.118.81.151 -uroot@sys -P2881 -p -c -A oceanbase
Enter password: 
Welcome to the OceanBase.  Commands end with ; or \g.
Your OceanBase connection id is 3221722687
Server version: OceanBase 3.1.4 (r103000102023020719-16544a206f00dd3ceb4ca3011a625fbb24568154) (Built Feb  7 2023 19:32:02)

Copyright (c) 2000, 2018, OceanBase and/or its affiliates. All rights reserved.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

obclient [oceanbase]> show full processlist;
+------------+------+--------+----------------------+-----------+---------+------+--------+-----------------------+----------------+------+--------------+
| Id         | User | Tenant | Host                 | db        | Command | Time | State  | Info                  | Ip             | Port | Proxy_sessid |
+------------+------+--------+----------------------+-----------+---------+------+--------+-----------------------+----------------+------+--------------+
| 3221722687 | root | sys    | 172.118.81.156:43388 | oceanbase | Query   |    0 | ACTIVE | show full processlist | 172.118.81.151 | 2881 |         NULL |
+------------+------+--------+----------------------+-----------+---------+------+--------+-----------------------+----------------+------+--------------+
1 row in set (0.666 sec)

It looks like `show full processlist` is taking a bit long; obagent's queries against OB may be timing out, which could explain the startup problem. Please run this SQL and check how long it takes: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ group_concat(svr_ip SEPARATOR ',') as servers, status, count(1) as cnt from __all_server group by status

obclient [oceanbase]> select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ group_concat(svr_ip SEPARATOR ',') as servers, status, count(1) as cnt from __all_server group by status;
+----------------------------------------------+--------+------+
| servers                                      | status | cnt  |
+----------------------------------------------+--------+------+
| 172.118.81.151,172.118.81.152,172.118.81.153 | active |    3 |
+----------------------------------------------+--------+------+
1 row in set (0.079 sec)

From the logs, the endpoint responses are very slow; goroutines may be piling up.


Please start obagent and then fetch the goroutine information with curl 'http://localhost:8089/debug/pprof/goroutine?debug=1' --output /tmp/goroutine.txt

Restarting obagent now fails:

-bash-4.2$ obd cluster restart obagent
Get local repositories and plugins ok
Load cluster param plugin ok
Open ssh connection ok
Cluster status check ok
Search plugins ok
Load cluster param plugin ok
Cluster status check ok
Check before start obagent ok
Start obagent ok
obagent program health check ok
Stop obagent ok
Start obagent ok
obagent program health check ok
+------------------------------------------------------+
|                       obagent                        |
+----------------+-------------+------------+----------+
| ip             | server_port | pprof_port | status   |
+----------------+-------------+------------+----------+
| 172.118.81.151 | 8088        | 8089       | inactive |
| 172.118.81.152 | 8088        | 8089       | inactive |
| 172.118.81.153 | 8088        | 8089       | inactive |
+----------------+-------------+------------+----------+
obagent restart
Trace ID: 9aa5c880-2acf-11ee-b2e4-0050568d699e
If you want to view detailed obd logs, please run: obd display-trace 9aa5c880-2acf-11ee-b2e4-0050568d699e

monagent.log keeps reporting errors:

2023-07-25T17:52:59.11463+08:00 ERROR [1923,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-25T17:52:55.28598+08:00 ERROR [1923,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-25T17:53:10.87926+08:00 ERROR [1923,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-25T17:53:11.89254+08:00 ERROR [1923,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-25T17:53:11.89277+08:00 ERROR [1923,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-25T17:53:12.13514+08:00 ERROR [1923,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"
2023-07-25T17:52:56.71686+08:00 ERROR [1923,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin nodeExporterInput collect time out"
2023-07-25T17:53:12.14129+08:00 ERROR [1923,] caller=engine/pipeline.go:154:collectAndProcess: input plugin collect failed fields: error="input plugin mysqlTableInput collect time out"

Yes, the presence of logs means the obagent process did start; once it is up, requests to the endpoint should go through. Please capture the goroutine data so we can analyze where the problem lies.

I captured it several times; the goroutine data always looks like this:

# tail -f /tmp/goroutine.txt 
{"successful":false,"timestamp":"2023-07-25T19:33:29.237204185+08:00","duration":0,"status":500,"traceId":"","server":"","error":{"code":1002,"message":"Unexpected error: invalid header authorization: , should contain 2 content.","subErrors":null}}

You didn't specify a username and password when curling the endpoint; add them with -u username:password.

Restarted obagent again:

-bash-4.2$ obd cluster restart obagent 
Get local repositories and plugins ok
Load cluster param plugin ok
Open ssh connection ok
Cluster status check ok
Stop obagent ok
Start obagent ok
obagent program health check ok
+------------------------------------------------------+
|                       obagent                        |
+----------------+-------------+------------+----------+
| ip             | server_port | pprof_port | status   |
+----------------+-------------+------------+----------+
| 172.118.81.151 | 8088        | 8089       | active   |
| 172.118.81.152 | 8088        | 8089       | active   |
| 172.118.81.153 | 8088        | 8089       | inactive |
+----------------+-------------+------------+----------+

Now the obagent on node 151 is up and port 8089 is listening:

[root@cpe-172-118-81-151 admin]# netstat -tnlp |grep 8089
tcp6       0      0 :::8089                 :::*                    LISTEN      6086/monagent

But running curl gives the following:

# curl -u root:cloudadmin http://localhost:8089/debug/pprof/goroutine?debug=1 --output /tmp/goroutine.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:04:37 --:--:--     0
curl: (52) Empty reply from server

I've confirmed that the user root with password cloudadmin can connect to the observer. Is something else wrong?

Also, while I was replying to you, the obagent service on node 151 died on its own. Running curl again reports:

Failed connect to localhost:8089; Connection refused

Because you are curling obagent's own endpoint, you need obagent's username and password, i.e. {pprof_basic_auth_user}:{pprof_basic_auth_password}. It now looks like the obagent process gets killed roughly four minutes after starting because of an excessive number of goroutines; we need to find where the goroutine leak is.

How do I track down the goroutine leak?

After starting obagent, before the process dies, run curl -u {pprof_basic_auth_user}:{pprof_basic_auth_password} 'http://localhost:8089/debug/pprof/goroutine?debug=1' --output /tmp/goroutine.txt and then share the resulting goroutine.txt.
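Since the process only survives a few minutes, the dump is easier to catch with a small polling loop started right after obagent comes up (a sketch; PPROF_USER and PPROF_PASS are placeholders for the deployment's pprof_basic_auth_user / pprof_basic_auth_password values):

```shell
#!/bin/sh
# Poll the pprof endpoint every 30s while monagent is alive, keeping
# timestamped goroutine dumps so a snapshot survives the process being killed.
PPROF_USER="${PPROF_USER:-changeme}"   # placeholder, not a real credential
PPROF_PASS="${PPROF_PASS:-changeme}"   # placeholder, not a real credential
OUT_DIR="${OUT_DIR:-/tmp/goroutine-dumps}"
mkdir -p "$OUT_DIR"

while pgrep -x monagent >/dev/null 2>&1; do
    ts=$(date +%Y%m%dT%H%M%S)
    curl -s -u "$PPROF_USER:$PPROF_PASS" \
        'http://localhost:8089/debug/pprof/goroutine?debug=1' \
        --output "$OUT_DIR/goroutine.$ts.txt"
    sleep 30
done
echo "monagent exited; dumps saved under $OUT_DIR"
```

The last dump written before the loop exits is the one closest to the moment the process was killed.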

{"successful":false,"timestamp":"2023-07-26T11:36:38.035074121+08:00","duration":0,"status":500,"traceId":"","server":"","error":{"code":1002,"message":"Unexpected error: auth failed for user: root","subErrors":null}}

The user and password settings in obagent-only.yaml:

-bash-4.2$ grep 'monitor_' obagent-only.yaml 
    monitor_user: root
    monitor_password: cloudadmin

Confirmed that this account and password can log in:

-bash-4.2$ obclient -h172.118.81.151 -uroot@sys -P2881 -pcloudadmin -c -A oceanbase
Welcome to the OceanBase.  Commands end with ; or \g.
Your OceanBase connection id is 3221709270
Server version: OceanBase 3.1.4 (r103000102023020719-16544a206f00dd3ceb4ca3011a625fbb24568154) (Built Feb  7 2023 19:32:02)

Copyright (c) 2000, 2018, OceanBase and/or its affiliates. All rights reserved.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

obclient [oceanbase]>

It's not the username and password for logging in to the observer; it's the username and password for obagent's own interface.
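That distinction can be double-checked directly in the deploy yaml; the key names below follow the {pprof_basic_auth_user}/{pprof_basic_auth_password} placeholders used earlier in this thread, so verify them against what your obagent-only.yaml actually contains:

```shell
#!/bin/sh
# Look up obagent's own pprof basic-auth settings (NOT the observer account).
# Key names are assumed from the placeholders above; verify against your yaml.
show_pprof_auth() {
    grep -E 'pprof_basic_auth_(user|password)' "$1" 2>/dev/null || true
}

show_pprof_auth "${YAML:-obagent-only.yaml}"

# Then pass those values to curl, e.g.:
#   curl -u <pprof_user>:<pprof_password> \
#     'http://localhost:8089/debug/pprof/goroutine?debug=1' \
#     --output /tmp/goroutine.txt
```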