Manually stopping the cluster always gets stuck at the stop zone step

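【Environment】Test environment
【OB or other component】observer
【Version】3.1
【Problem description】Every time I stop the cluster through the OCP platform, the task gets stuck at the stop zone step, and I have to keep retrying the stop-cluster task before it finally goes through.
【Relevant logs】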
2022-08-23 11:46:36.760 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.m.t.model.SubtaskInstanceEntity : Run subtask, id=2090411, context=Context{parallelIdx=-1, stringMap={cluster_name=ymob_prod, target_server_status=RUNNING, prohibit_rollback=false, former_cluster_status=RUNNING, target_zone_status=RUNNING, task_instance_id=2079591, task_operation=execute, zone_name=zone2, ob_cluster_id=1660632206, cluster_id=1000002, ocpagent_service_name=agent, target_cluster_status=RUNNING, latest_execution_start_time=2022-08-23T11:46:36.755+08:00, sub_task_instance_id=2090411}, listMap={1660632206.ymob_prod.zone3.host_ids=[1000003], server_ids=[1000005], 1660632206.ymob_prod.zone1.server_ids=[1000004], 1660632206.ymob_prod.zone3.server_ids=[1000006], 1660632206.ymob_prod.zone2.server_ids=[1000005], 1660632206.ymob_prod.zone1.host_ids=[1000002], host_ids=[1000001], 1660632206.ymob_prod.zone2.host_ids=[1000001], zone_names=[zone1, zone2, zone3]}}, executor=10.221.22.221

2022-08-23 11:46:36.768 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.s.t.business.zone.StopObZoneTask : try to stop zone by ob cmd, clusterId=1000002, obClusterId=1660632206, clusterType=PRIMARY, zoneName=zone2

2022-08-23 11:46:36.768 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.s.t.b.zone.ObZoneTaskHandler : begin to stop zone, clusterId=1000002, zone=zone2

2022-08-23 11:46:36.780 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:46:36.782 INFO 464 --- [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: select max(value) value from oceanbase.__all_virtual_sys_parameter_stat where name = 'min_observer_version'

2022-08-23 11:46:36.803 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:46:36.804 INFO 464 --- [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: SELECT zone, MAX(CASE name WHEN 'region' THEN info ELSE '' END ) region, MAX(CASE name WHEN 'idc' THEN info ELSE '' END ) idc, MAX(CASE name WHEN 'status' THEN info ELSE '' END ) status, MAX(CASE name WHEN 'merge_status' THEN info ELSE '' END ) merge_status, MAX(CASE name WHEN 'broadcast_version' THEN value ELSE 0 END ) broadcast_version, MAX(CASE name WHEN 'all_merged_version' THEN value ELSE 0 END ) all_merged_version, MAX(CASE name WHEN 'last_merged_version' THEN value ELSE 0 END ) last_merged_version, MAX(CASE name WHEN 'merge_start_time' THEN value ELSE 0 END ) merge_start_time, MAX(CASE name WHEN 'last_merged_time' THEN value ELSE 0 END ) last_merged_time, MAX(CASE name WHEN 'is_merge_timeout' THEN value ELSE 0 END ) merge_timeout FROM oceanbase.__all_zone WHERE zone <> '' GROUP BY zone

2022-08-23 11:46:36.806 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:46:36.808 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [1800000000]

2022-08-23 11:46:36.809 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: alter system stop zone ?, args: [zone2]

2022-08-23 11:46:36.823 ERROR 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] update failed, sql:[alter system stop zone ?], error message:[PreparedStatementCallback; SQL [alter system stop zone ?]; (conn=196738) not enough member or quorum mismatch, stop zone not allowed; nested exception is java.sql.SQLTransientConnectionException: (conn=196738) not enough member or quorum mismatch, stop zone not allowed]

2022-08-23 11:46:36.824 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:46:36.825 WARN 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.s.t.b.zone.ObZoneTaskHandler : operate zone failed, exception msg=SQL [alter system stop zone ?; args:zone2]; SQL state [HY000]; error code [4179]; message [(conn=196738) not enough member or quorum mismatch, stop zone not allowed]

2022-08-23 11:46:36.826 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] com.alipay.ocp.common.pattern.Retry : wait for 15 seconds

2022-08-23 11:46:51.828 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:46:51.830 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [1800000000]

2022-08-23 11:46:51.831 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: alter system stop zone ?, args: [zone2]

2022-08-23 11:46:51.844 ERROR 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] update failed, sql:[alter system stop zone ?], error message:[PreparedStatementCallback; SQL [alter system stop zone ?]; (conn=196738) not enough member or quorum mismatch, stop zone not allowed; nested exception is java.sql.SQLTransientConnectionException: (conn=196738) not enough member or quorum mismatch, stop zone not allowed]

2022-08-23 11:46:51.846 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:46:51.847 WARN 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.s.t.b.zone.ObZoneTaskHandler : operate zone failed, exception msg=SQL [alter system stop zone ?; args:zone2]; SQL state [HY000]; error code [4179]; message [(conn=196738) not enough member or quorum mismatch, stop zone not allowed]

2022-08-23 11:46:51.848 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] com.alipay.ocp.common.pattern.Retry : wait for 15 seconds

2022-08-23 11:47:06.849 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:47:06.851 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [1800000000]

2022-08-23 11:47:06.853 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: alter system stop zone ?, args: [zone2]

2022-08-23 11:47:06.963 ERROR 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] update failed, sql:[alter system stop zone ?], error message:[PreparedStatementCallback; SQL [alter system stop zone ?]; (conn=196738) log is not sync, cannot stop zone not allowed; nested exception is java.sql.SQLTransientConnectionException: (conn=196738) log is not sync, cannot stop zone not allowed]

2022-08-23 11:47:06.965 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:47:06.966 WARN 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.s.t.b.zone.ObZoneTaskHandler : operate zone failed, exception msg=SQL [alter system stop zone ?; args:zone2]; SQL state [HY000]; error code [4179]; message [(conn=196738) log is not sync, cannot stop zone not allowed]

2022-08-23 11:47:06.967 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] com.alipay.ocp.common.pattern.Retry : wait for 15 seconds

2022-08-23 11:47:21.969 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:47:21.971 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [1800000000]

2022-08-23 11:47:21.973 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: alter system stop zone ?, args: [zone2]

2022-08-23 11:47:22.084 ERROR 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] update failed, sql:[alter system stop zone ?], error message:[PreparedStatementCallback; SQL [alter system stop zone ?]; (conn=196738) log is not sync, cannot stop zone not allowed; nested exception is java.sql.SQLTransientConnectionException: (conn=196738) log is not sync, cannot stop zone not allowed]

2022-08-23 11:47:22.085 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:47:22.087 WARN 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.s.t.b.zone.ObZoneTaskHandler : operate zone failed, exception msg=SQL [alter system stop zone ?; args:zone2]; SQL state [HY000]; error code [4179]; message [(conn=196738) log is not sync, cannot stop zone not allowed]

2022-08-23 11:47:22.088 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] com.alipay.ocp.common.pattern.Retry : wait for 15 seconds

2022-08-23 11:47:37.090 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:47:37.091 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [1800000000]

2022-08-23 11:47:37.093 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: alter system stop zone ?, args: [zone2]

2022-08-23 11:47:37.203 ERROR 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] update failed, sql:[alter system stop zone ?], error message:[PreparedStatementCallback; SQL [alter system stop zone ?]; (conn=196738) log is not sync, cannot stop zone not allowed; nested exception is java.sql.SQLTransientConnectionException: (conn=196738) log is not sync, cannot stop zone not allowed]

2022-08-23 11:47:37.204 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.obsdk.connector.ConnectTemplate : [obsdk] sql: set ob_query_timeout = ?, args: [10000000]

2022-08-23 11:47:37.206 WARN 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.s.t.b.zone.ObZoneTaskHandler : operate zone failed, exception msg=SQL [alter system stop zone ?; args:zone2]; SQL state [HY000]; error code [4179]; message [(conn=196738) log is not sync, cannot stop zone not allowed]

2022-08-23 11:47:37.207 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] com.alipay.ocp.common.pattern.Retry : wait for 15 seconds

2022-08-23 11:47:52.208 INFO 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.m.t.model.SubtaskInstanceEntity : Set state for subtask: 2090411, current state: RUNNING, new state: FAILED

2022-08-23 11:47:52.209 WARN 464 — [pool-subtask-executor-thread-1,17b22052ec594397,22d60764fdc6] c.a.o.c.t.engine.runner.RunnerFactory : Execute task failed, subtask=SubtaskInstanceEntity{id=2090411, name=Stop zone, state=FAILED, operation=EXECUTE, className=com.alipay.ocp.servi

【Symptoms and impact】

【Attachments】

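This usually happens when several servers are deleted at once: when the cluster is stopped, multiple stop observer tasks are pending, and each one can start only after the previous one has finished. In other words, in this flow the next STOP SERVER operation can be issued only after the previous DELETE SERVER operation has completed.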

But this is a newly created cluster, and I have not deleted any servers.

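The stop-cluster operation stops every observer. How many observers are there in each of your zones? Does it take a long time every time? If so, could you post the detailed logs so we can look at them together?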

The deployment is 1-1-1, and it gets stuck every time I restart the cluster.
log_task_2079109.zip (46.7 KB)

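The task log shows that the retry eventually succeeded. The error is returned by the observer when it executes the stop zone command, and the message indicates that log synchronization has not finished. If you hit this failure, wait a moment and then retry the task.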
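The wait-and-retry advice lines up with what the posted log shows: five `alter system stop zone` attempts roughly 15 seconds apart, after which the subtask is marked FAILED. Below is a minimal Python sketch of that fixed-interval retry pattern. It is only a reading of the log, not OCP's actual `Retry` implementation; `StopZoneNotAllowed`, `retry_fixed_interval`, and the default values are hypothetical names and numbers chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class StopZoneNotAllowed(Exception):
    """Hypothetical stand-in for the observer's error 4179 (stop zone not allowed)."""


def retry_fixed_interval(
    action: Callable[[], T],
    attempts: int = 5,            # the log shows five attempts before FAILED
    wait_seconds: float = 15.0,   # "wait for 15 seconds" in the log
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, waiting a fixed interval after each failure.

    Like the log, this waits after every failed attempt, including the last
    one, and only then gives up by re-raising the most recent error.
    """
    last_error: Optional[StopZoneNotAllowed] = None
    for _ in range(attempts):
        try:
            return action()
        except StopZoneNotAllowed as exc:
            last_error = exc
            sleep(wait_seconds)
    assert last_error is not None  # attempts must be >= 1
    raise last_error
```

With these defaults the task gives up after about 75 seconds, so if the logs need longer than that to catch up, retrying the whole task by hand is the only way forward, which matches the behaviour described in the question.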
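Also, you can trigger a merge (major compaction) before restarting the cluster, which reduces the time the observers need to synchronize logs after they restart.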
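As a rough illustration of that pre-restart step, the Python sketch below holds the two statements involved and a small helper that decides whether the merge has finished. `ALTER SYSTEM MAJOR FREEZE` is the 3.x command for starting a merge, and `merge_status` is read from `oceanbase.__all_zone` in the same way as the query in the log above. The helper name `zones_not_idle` and the assumption that `IDLE` means "merge finished" are mine, so please check them against your version before relying on this.

```python
from typing import Iterable, List, Tuple

# Statement that starts a merge (run it in the sys tenant).
MAJOR_FREEZE_SQL = "ALTER SYSTEM MAJOR FREEZE"

# Per-zone merge status, read from the same table the OCP log queries.
MERGE_STATUS_SQL = (
    "SELECT zone, info FROM oceanbase.__all_zone "
    "WHERE name = 'merge_status' AND zone <> ''"
)


def zones_not_idle(rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the zones whose merge_status is anything other than IDLE.

    `rows` are (zone, merge_status) pairs as returned by MERGE_STATUS_SQL.
    Assumption: a status of IDLE means the zone has finished merging.
    """
    return sorted(
        zone for zone, status in rows if status.strip().upper() != "IDLE"
    )
```

For example, `zones_not_idle([("zone1", "IDLE"), ("zone2", "MERGING"), ("zone3", "IDLE")])` returns `["zone2"]`; once the list is empty, it should be a reasonable time to start the restart.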