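After deploying OCP, the obcluster cluster is not under OCP management; the takeover subtask "Check observer process user" fails, even though the observer processes are running on the OCP host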

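【Environment】Test environment
【OB or other component】OCP
【Version】OCP 3.3.0 bp2
【Problem description】After deploying OCP, the obcluster cluster has not been taken over. The takeover subtask "Check observer process user" fails, although the observer processes are present on the OCP host.
【Steps to reproduce】Retrying gives the same error, and so does removing everything and redeploying.
【Symptoms and impact】The obcluster cluster cannot be viewed in OCP.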




Log output
2023-02-07 17:02:20.177  INFO 68 --- [pool-manual-subtask-executor2,e700d3a68bac4717,7e6f8a943e89] c.a.o.c.m.t.model.SubtaskInstanceEntity  : Run subtask, id=76, context=Context{parallelIdx=0, stringMap={cluster_version=3.1.4, cluster_name=obcluster, target_server_status=RUNNING, ssh_port=22, prohibit_rollback=false, service_name=obcluster:1, target_zone_status=RUNNING, task_instance_id=45, ob_connect_address=172.16.234.18:2881, task_operation=execute, cluster_type=PRIMARY, service_version=3.1.4, cluster_id=1, root_sys_password=******, service_type=OB_CLUSTER, ob_data_dir=/data/1, connection_mode=direct, target_cluster_status=RUNNING, latest_execution_start_time=2023-02-07T17:02:20.150+08:00, sub_task_instance_id=76, credential_id=1}, listMap={add_region_ids=[2], server_ids=[1], add_idc_ids=[1], all_host_ids=[1], add_host_ids=[1], host_ids=[1], zone_names=[zone1]}}, executor=172.16.234.18

2023-02-07 17:02:20.260  INFO 68 --- [pool-manual-subtask-executor2,e700d3a68bac4717,7e6f8a943e89] c.o.o.e.internal.template.HttpTemplate   : POST request to agent, url:http://172.16.234.18:62888/api/v1/process/info, request body:GetProcessInfoRequest(processName=observer), params:null

2023-02-07 17:02:20.424 ERROR 68 --- [pool-manual-subtask-executor2,e700d3a68bac4717,7e6f8a943e89] com.alipay.ocp.core.util.ExceptionUtils  : Checked Exception: com.alipay.ocp.core.exception.UnexpectedException occurred with code error.ob.cluster.takeover.wrong.user, and args [root]

2023-02-07 17:02:20.431  INFO 68 --- [pool-manual-subtask-executor2,e700d3a68bac4717,7e6f8a943e89] c.a.o.c.m.t.model.SubtaskInstanceEntity  : Set state for subtask: 76, current state: RUNNING, new state: FAILED

2023-02-07 17:02:20.437  WARN 68 --- [pool-manual-subtask-executor2,e700d3a68bac4717,7e6f8a943e89] c.a.o.c.t.engine.runner.RunnerFactory    : Execute task failed, subtask=SubtaskInstanceEntity{id=76, name=Check observer process user, state=FAILED, operation=EXECUTE, className=com.alipay.ocp.service.task.business.host.CheckObserverProcessUserTask, seriesId=13, startTime=2023-02-07T17:02:20.150+08:00, endTime=2023-02-07T17:02:20.436+08:00}, failedMessage=The user of observer process must be admin, current is root

com.alipay.ocp.core.exception.UnexpectedException: [OCP UnexpectedException]: status=500 INTERNAL_SERVER_ERROR, errorCode=OB_CLUSTER_TAKEOVER_OBSERVER_WRONG_USER, args=root
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_312]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_312]
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_312]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_312]
	at com.alipay.ocp.core.util.ExceptionUtils.newException(ExceptionUtils.java:96) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.core.util.ExceptionUtils.throwException(ExceptionUtils.java:90) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.core.util.ExceptionUtils.unExpected(ExceptionUtils.java:77) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.service.task.business.host.CheckObserverProcessUserTask.run(CheckObserverProcessUserTask.java:56) ~[ocp-service-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.core.metadb.task.model.SubtaskInstanceEntity.run(SubtaskInstanceEntity.java:221) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.core.task.engine.runner.JavaTaskRunner.doExecute(JavaTaskRunner.java:26) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.core.task.engine.runner.JavaTaskRunner.run(JavaTaskRunner.java:20) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.core.task.engine.runner.RunnerFactory.doRun(RunnerFactory.java:113) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.core.task.engine.runner.RunnerFactory.redirectOutputIfNotSysSchedule(RunnerFactory.java:185) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.core.task.engine.runner.RunnerFactory.run(RunnerFactory.java:102) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at com.alipay.ocp.core.task.engine.coordinator.worker.subtask.ReadySubtaskWorker.lambda$submitTask$3(ReadySubtaskWorker.java:123) ~[ocp-core-3.3.0-20220427.jar!/:3.3.0-20220427]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_312]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_312]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_312]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_312]
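
The key information in the trace above is the `errorCode=..., args=...` pair on the exception line. As a minimal sketch (the helper name `parse_ocp_exception` and the regular expression are assumptions made for illustration, not part of OCP), the pair can be pulled out of a log line like this:

```python
import re

# Assumed pattern, based only on the exception line shown in this thread.
ERROR_PATTERN = re.compile(r"errorCode=(?P<code>\w+), args=(?P<args>.*)$")


def parse_ocp_exception(line):
    """Return (error_code, args) from an OCP exception line, or None if absent."""
    match = ERROR_PATTERN.search(line.strip())
    if match is None:
        return None
    return match.group("code"), match.group("args")


# Exception line copied from the log above.
EXCEPTION_LINE = (
    "com.alipay.ocp.core.exception.UnexpectedException: [OCP UnexpectedException]: "
    "status=500 INTERNAL_SERVER_ERROR, "
    "errorCode=OB_CLUSTER_TAKEOVER_OBSERVER_WRONG_USER, args=root"
)

print(parse_ocp_exception(EXCEPTION_LINE))
# prints ('OB_CLUSTER_TAKEOVER_OBSERVER_WRONG_USER', 'root')
```

Here `args=root` is the user that the check found, which matches the `failedMessage` in the WARN line: the observer process must run as admin, but currently runs as root.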
Process check
[root@sc413-ocp01 ocp-3.3.0-ce-bp2-x86_64]# ps -ef | grep -E "observer|obproxy*|ocp"
root      24316      1  0 16:39 ?        00:00:01 bash /home/admin/obproxy/obproxyd.sh /home/admin/obproxy 172.16.234.18 2883 daemon
root      24331      1 13 16:39 ?        00:02:48 /home/admin/obproxy/bin/obproxy --listen_port 2883
root      24448      1 99 16:39 ?        02:21:48 /home/admin/oceanbase/bin/observer -r 172.16.234.18:2882:2881 -o __min_full_resource_pool_memory=268435456,enable_syslog_recycle=True,enable_syslog_wf=True,max_syslog_file_count=4,memory_limit=52G,system_memory=26G,cpu_count=24,datafile_size=224G -z zone1 -p 2881 -P 2882 -n obcluster -c 1 -d /data/1 -l INFO
admin     27762  27730  0 16:51 ?        00:00:00 bash /home/admin/ocp-server/bin/ocp-server
admin     27763  27730  0 16:51 ?        00:00:00 bash /home/admin/bin/ocp_obproxyd.sh
admin     27823  27762 82 16:51 ?        00:07:46 /usr/lib/jvm/java-1.8.0/bin/java -server -XX:+UseG1GC -Xms45875m -Xmx45875m -Xss512k -XX:+PrintCommandLineFlags -XX:MetaspaceSize=1024m -XX:MaxMetaspaceSize=1024m -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/home/admin/ocp-server/bin/../log/gc.log -XX:+UseGCLogFileRotation -XX:GCLogFileSize=50M -XX:NumberOfGCLogFiles=2 -XX:ErrorFile=/home/admin/ocp-server/bin/../log/hs_err_pid%p.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/admin/ocp-server/bin/../log/ -Dfile.encoding=UTF-8 -jar /home/admin/ocp-server/bin/../lib/ocp-server-3.3.0-20220427.jar
admin     28932  27730 15 16:52 ?        00:01:13 ./bin/obproxy -p2888 -n ocp_obproxy -o obproxy_config_server_url=http://127.0.0.1:8080/services?Action=GetObProxyConfig&User_ID=alibaba&UID=admin,syslog_level=INFO,skip_proxyro_check=true,skip_proxy_sys_private_check=true,enable_strict_kernel_release=false,enable_metadb_used=false,enable_proxy_scramble=true,proxy_mem_limited=1G,log_dir_size_threshold=10G
root      35065  12445  0 17:00 pts/0    00:00:00 grep --color=auto -E observer|obproxy*|ocp
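
The `ps` output above already shows the cause: the observer process is owned by root, while the failed subtask expects admin. A minimal sketch of the same check, run against `ps -ef` text, might look like the following. The function name `observer_process_users` is hypothetical, the sample is abbreviated from the output above with a header line added for illustration, and this is not how OCP itself implements the subtask.

```python
import os


def observer_process_users(ps_output, process_name="observer"):
    """Return (user, pid) pairs for `ps -ef` lines whose executable is process_name."""
    found = []
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0] == "UID":
            continue  # skip blank, short, and header lines
        user, pid, cmd = fields[0], fields[1], fields[7]
        executable = cmd.split()[0]
        # Compare the executable's basename so that `grep ... observer` is ignored.
        if os.path.basename(executable) == process_name:
            found.append((user, int(pid)))
    return found


# Abbreviated from the ps output in this thread.
SAMPLE_PS_OUTPUT = """\
UID         PID   PPID  C STIME TTY          TIME CMD
root      24316      1  0 16:39 ?        00:00:01 bash /home/admin/obproxy/obproxyd.sh /home/admin/obproxy 172.16.234.18 2883 daemon
root      24331      1 13 16:39 ?        00:02:48 /home/admin/obproxy/bin/obproxy --listen_port 2883
root      24448      1 99 16:39 ?        02:21:48 /home/admin/oceanbase/bin/observer -z zone1 -p 2881 -P 2882 -n obcluster
root      35065  12445  0 17:00 pts/0    00:00:00 grep --color=auto -E observer|obproxy*|ocp
"""

EXPECTED_USER = "admin"  # taken from the failedMessage in the log
for user, pid in observer_process_users(SAMPLE_PS_OUTPUT):
    status = "ok" if user == EXPECTED_USER else "wrong user"
    print(f"observer pid={pid} user={user}: {status}")
# prints "observer pid=24448 user=root: wrong user"
```

Matching on the executable's basename rather than on a substring keeps the `grep` line and the obproxy processes out of the result.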

Listening ports
[root@sc413-ocp01 ocp-3.3.0-ce-bp2-x86_64]# netstat -tunlp | grep -E "2881|2882|2883"
tcp        0      0 0.0.0.0:2881            0.0.0.0:*               LISTEN      24448/observer      
tcp        0      0 0.0.0.0:2882            0.0.0.0:*               LISTEN      24448/observer      
tcp        0      0 0.0.0.0:2883            0.0.0.0:*               LISTEN      24331/obproxy  

【Attachments】
ocp.log.tar.gz (219.1 KB)



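Judging from your environment, OCP is trying to take over its own MetaDB. The subtask fails because the MetaDB observer process runs as a user other than admin (the log shows root), and OCP currently only supports taking over OB clusters deployed under the admin user. See 【SOP 系列 07】如何使用 OCP 接管 OBD 部署的 OceanBase 集群 (SOP Series 07: How to use OCP to take over an OceanBase cluster deployed by OBD).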


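A question: I deployed directly through OCP. From the "Change the operating system user used for deployment to admin" section of the article you linked, I understand that the user has to be changed to admin, but I did not deploy with OBD. Also, the config.yaml used for the OCP deployment does not contain the following part shown in the article: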

Note that key_file here is the private key file.

## Only need to configure when remote login is required
user:
   username: admin
#   password: your password if need
   key_file: /home/admin/.ssh/id_rsa
#   port: your ssh port, default 22
#   timeout: ssh connection timeout (second), default 30

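The OCP deployment has a parameter, create_metadb_cluster: true, which tells the installer to create an OB cluster. That cluster is deployed by an OBD tool that OCP invokes internally, so it is not visible from outside. The user in the user block is only used for communication and transfer, not as the deployment user. The processes run with the permissions of whichever user you used to deploy OCP.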

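Thanks for the explanation. The official documentation did not emphasize this when I first read it. I had also granted admin sudo rights in /etc/sudoers, but deploying OCP from the command line as admin with sudo failed too. There is no obd command on the OCP host. Should I install OBD and then follow the article you linked, or is there another way to change the observer process user for a command-line OCP deployment?

To add: I only switched to root after the deployment under admin failed. The current environment was deployed as root.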

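Run the installation as admin; sudo is not needed.
The OCP deployment only uses OBD's auto deployment mode; it is not the complete OBD tool.
There is no need to install OBD, because it cannot take over the existing OB cluster. Follow the article and skip the OBD steps.

The OCP installation files may not be owned by admin. Do you have the specific error message?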

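In the article, the only part that changes the user that starts the process is that section, and it does so through OBD. For a command-line OCP deployment like mine, how do I change the observer process user from root to admin without OBD?

That output has already scrolled off the screen. When I first switched to admin and ran the installer directly, it failed at the Docker image loading step. After I added sudo, another error appeared later, and only then did I switch to root to deploy OCP.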

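You can take over the MetaDB this way: deploy a standalone single-node OB with OBD first, then set create_metadb_cluster: false when deploying OCP so that it uses the existing OB as its MetaDB (the required tenants and users must be created beforehand). After that, use the takeover procedure.

I will remove OCP and redeploy as admin. If that still fails, I will try the approach you suggested.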
