ocp-server fails to start

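【 Environment 】Production
【 OB or other component 】
【 Version 】ocp:4.3.2, oceanbase-ce:4.3.3.1
【 Problem description 】After restarting the cluster with obd cluster restart xx, the OCP web UI cannot be accessed and ocp-server fails to start. This is the second time I have hit this. The first time the cluster only held test data, so I simply redeployed.
【 Steps to reproduce 】Before the restart, I had enabled the 2 parameters for full-link tracing.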


obd log

ocp

The obd log and the ocp log are attached.

obd.txt (379.6 KB)
ocp-server.log (25.1 MB)

Please share the yaml file under ~/.obd/cluster/xxxx/

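What does your deployment architecture look like? Is OCP deployed on 153? The log shows a large number of data source connection failures: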

172.16.207.152:8883
172.16.207.153:8881

Can you connect to the ocp_meta and ocp_monitor tenants manually?

Please also send observer.log.

2024-11-12 10:53:49.131 ERROR 49711 --- [Druid-ConnectionPool-Create-1400974072,,] com.alibaba.druid.pool.DruidDataSource   : create connection SQLException, url: jdbc:oceanbase://172.16.207.153:8881/oceanbase?useUnicode=true&characterEncoding=UTF8&encloseParamInParentheses=false, errorCode -1, state 08000

java.sql.SQLNonTransientConnectionException: Could not connect to HostAddress{host='172.16.207.153', port=8881}. 拒绝连接 (Connection refused)
	at com.oceanbase.jdbc.internal.util.exceptions.ExceptionFactory.createException(ExceptionFactory.java:122)
	at com.oceanbase.jdbc.internal.util.exceptions.ExceptionFactory.create(ExceptionFactory.java:225)
	at com.oceanbase.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:1735)
	at com.oceanbase.jdbc.internal.util.Utils.retrieveProxy(Utils.java:1431)
	at com.oceanbase.jdbc.OceanBaseConnection.newConnection(OceanBaseConnection.java:311)
	at com.oceanbase.jdbc.Driver.connect(Driver.java:89)
	at com.alibaba.druid.pool.DruidAbstractDataSource.createPhysicalConnection(DruidAbstractDataSource.java:1657)
	at com.alibaba.druid.pool.DruidAbstractDataSource.createPhysicalConnection(DruidAbstractDataSource.java:1723)
	at com.alibaba.druid.pool.DruidDataSource$CreateConnectionThread.run(DruidDataSource.java:2838)
Caused by: java.net.ConnectException: 拒绝连接 (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at com.oceanbase.jdbc.internal.protocol.AbstractConnectProtocol.createSocket(AbstractConnectProtocol.java:285)
	at com.oceanbase.jdbc.internal.protocol.AbstractConnectProtocol.createConnection(AbstractConnectProtocol.java:560)
	at com.oceanbase.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:1715)
	... 6 common frames omitted

……


2024-11-12 10:59:36.774 ERROR 49711 --- [ocp-async-6,0c703dd89381762a,0516bff573ac5c96] o.h.engine.jdbc.spi.SqlExceptionHelper   : Could not connect to 172.16.207.152:8883 : (conn=1210075265) Server is initializing
2024-11-12 10:59:36.775  WARN 49711 --- [ocp-async-2,b0439f6850cf1555,e8070a105b44bdb3] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 0, SQLState: 08004
2024-11-12 10:59:36.775 ERROR 49711 --- [ocp-async-2,b0439f6850cf1555,e8070a105b44bdb3] o.h.engine.jdbc.spi.SqlExceptionHelper   : metadb-connect-pool - Connection is not available, request timed out after 2000ms.
2024-11-12 10:59:36.775  WARN 49711 --- [ocp-async-2,b0439f6850cf1555,e8070a105b44bdb3] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 8001, SQLState: 08004
2024-11-12 10:59:36.775 ERROR 49711 --- [ocp-async-2,b0439f6850cf1555,e8070a105b44bdb3] o.h.engine.jdbc.spi.SqlExceptionHelper   : Could not connect to 172.16.207.152:8883 : (conn=1210075265) Server is initializing
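
With a 25 MB ocp-server.log, it can help to tally which endpoints the "Could not connect to" errors point at before reading individual stack traces. The following is only a minimal sketch: it assumes the two message shapes seen in the excerpt above are the relevant ones, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Two message shapes taken from the excerpt above:
#   Could not connect to HostAddress{host='172.16.207.153', port=8881}
#   Could not connect to 172.16.207.152:8883 : (conn=...) Server is initializing
HOST_ADDRESS = re.compile(
    r"Could not connect to HostAddress\{host='([^']+)', port=(\d+)\}"
)
PLAIN = re.compile(r"Could not connect to (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def count_failed_endpoints(lines):
    """Count 'Could not connect to' errors per host:port."""
    counts = Counter()
    for line in lines:
        match = HOST_ADDRESS.search(line) or PLAIN.search(line)
        if match:
            counts[f"{match.group(1)}:{match.group(2)}"] += 1
    return counts


def summarize_log(path):
    """Read a log file and return (endpoint, count) pairs, most frequent first."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_failed_endpoints(handle).most_common()
```

Calling `summarize_log("ocp-server.log")` returns the endpoints sorted by error count. The count only shows where connections failed, not why: "Connection refused" and "Server is initializing" are different conditions and still need to be read in context.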


config.yaml.txt (2.4 KB)

Manual connections work. There are currently three servers: 152, 153, and 154. OCP is deployed on 153.

observer.rar (3.0 MB)

image
image

Try starting that component on its own:
obd cluster start xxxx -c ocp-server-ce
How much memory does each node have? I see the proxy is configured with 80G, which is somewhat wasteful.

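Restarting it on its own gives the same result. I had read earlier posts about similar problems and already tried restarting the component separately, but it still does not come up. Each server has 185G of memory.

Should I lower the memory setting and then try restarting it separately again?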

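Manually test connecting as root@ocp_meta both through obproxy (172.16.207.152:8883) and directly to the database (172.16.207.153:8881), and see whether each succeeds. The password may also be wrong.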
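
Before logging in with a client, a plain TCP probe can separate "nothing is listening on the port" (the "Connection refused" in the stack trace) from "the port is open but the server is not ready". This is a rough sketch using only the Python standard library. It checks reachability only and says nothing about the root@ocp_meta credentials. The `check_tcp` and `report` names are illustrative.

```python
import socket

# The two endpoints named in this thread: obproxy and the direct observer port.
ENDPOINTS = [("172.16.207.152", 8883), ("172.16.207.153", 8881)]


def check_tcp(host, port, timeout=3.0):
    """Return 'open', 'refused', 'timeout', or 'error: ...' for a TCP connect."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. a firewall drop or an unreachable host.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def report(endpoints, timeout=3.0):
    """Probe every (host, port) pair and return a dict keyed by 'host:port'."""
    return {f"{h}:{p}": check_tcp(h, p, timeout) for h, p in endpoints}
```

Running `report(ENDPOINTS)` on the OCP host (153 in this thread) shows whether both ports accept connections from where ocp-server actually runs. An "open" result still leaves the login test above to be done with a real client.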



Both connect.

That looks normal now. Try restarting ocp-server on its own:

obd cluster start xxxx -c ocp-server-ce

ocp-server.log (6.8 MB)
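I ran it. It has been almost 5 minutes and it still has not finished. The early part of the run looked normal, but the later log entries show the same errors as before.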

Check the hosts configuration on 153 and on the other two machines. It needs entries with both the hostname and the IP.
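
The hosts check can be done by eye, but across three machines a small script makes it repeatable. The sketch below parses hosts-file text and lists the expected hostname/IP pairs that are missing. The hostnames in the usage note are hypothetical, since the thread does not give the real ones.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to the list of IPs it appears with."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, []).append(ip)
    return mapping


def missing_entries(text, expected):
    """Return (hostname, ip) pairs from `expected` that the hosts text lacks."""
    mapping = parse_hosts(text)
    return [
        (name, ip)
        for name, ip in expected.items()
        if ip not in mapping.get(name, [])
    ]
```

For example, with the placeholder names `ob-node-152`, `ob-node-153`, and `ob-node-154` mapped to the three IPs, `missing_entries(open("/etc/hosts").read(), expected)` returns an empty list on a machine whose hosts file is complete.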

image
The entries are there. Nobody has touched hosts or changed a hostname in the past few days.

I will ask an OCP colleague to analyze this and will reply when there is progress.

OK, thanks.

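Thank you, this is resolved and the service has recovered. After restarting several times back and forth, I found that it now works normally.
Solution: I changed ocp.analyze.enabled and ocp.analyze.ob.trace.enabled. Both had been enabled earlier. I turned them off, and after restarting the cluster OCP is accessible again.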

We will keep analyzing this issue. Thanks for the feedback.

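OK. Please also take a look at these 2 parameters. As far as I know, they are required to enable the tenant full-link tracing configuration.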

Does ocp fail to start whenever these two parameters are set to True? Can you reproduce it in your environment?