Running an OBSERVER replacement task: the old OBSERVER has been stuck in DELETING for 16 hours and still hasn't finished

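The unit migration has not completed yet. The unit's IOPS limit is set too low; I suggest removing that limit and checking again. Also, how much data does tenant 1074 hold?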

[oceanbase]> select * from dba_ob_units where MIGRATE_FROM_SVR_IP<>"" \G
*************************** 1. row ***************************
              UNIT_ID: 1038
            TENANT_ID: 1074
               STATUS: ACTIVE
     RESOURCE_POOL_ID: 1038
        UNIT_GROUP_ID: 1037
          CREATE_TIME: 2025-01-02 17:38:26.739191
          MODIFY_TIME: 2025-01-06 16:03:37.853346
                 ZONE: zone01
               SVR_IP: 192.168.180.34
             SVR_PORT: 2882
  MIGRATE_FROM_SVR_IP: 192.168.180.31
MIGRATE_FROM_SVR_PORT: 2882
       MANUAL_MIGRATE: NO
       UNIT_CONFIG_ID: 1038
              MAX_CPU: 2
              MIN_CPU: 1
          MEMORY_SIZE: 6442450944
        LOG_DISK_SIZE: 19327352832
             MAX_IOPS: 1024                     ---- IOPS too low
             MIN_IOPS: 1024
          IOPS_WEIGHT: 1
1 row in set (0.022 sec)
select * from gv$ob_units where svr_ip='192.168.180.31' \G
*************************** 1. row ***************************
          SVR_IP: 192.168.180.31
        SVR_PORT: 2882
         UNIT_ID: 1038
       TENANT_ID: 1073
            ZONE: zone01
       ZONE_TYPE: ReadWrite
          REGION: nanjing
         MAX_CPU: NULL
         MIN_CPU: NULL
     MEMORY_SIZE: 1073741824
        MAX_IOPS: NULL
        MIN_IOPS: NULL
     IOPS_WEIGHT: NULL
   LOG_DISK_SIZE: 1932735283
 LOG_DISK_IN_USE: 0
DATA_DISK_IN_USE: 0
          STATUS: MIGRATE OUT
     CREATE_TIME: 2025-01-02 17:39:39.858023
*************************** 2. row ***************************
          SVR_IP: 192.168.180.31
        SVR_PORT: 2882
         UNIT_ID: 1038
       TENANT_ID: 1074
            ZONE: zone01
       ZONE_TYPE: ReadWrite
          REGION: nanjing
         MAX_CPU: 2
         MIN_CPU: 1
     MEMORY_SIZE: 5368709120
        MAX_IOPS: 1024
        MIN_IOPS: 1024
     IOPS_WEIGHT: 1
   LOG_DISK_SIZE: 17394617549
 LOG_DISK_IN_USE: 13849401319
DATA_DISK_IN_USE: 4540051456
          STATUS: MIGRATE OUT
     CREATE_TIME: 2025-01-02 17:39:39.858023
2 rows in set (0.016 sec)
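
For anyone following along, the suggestion above (check how much data tenant 1074 holds, then raise the unit's IOPS limit) could be sketched roughly as follows. This is a sketch rather than a verified procedure: it assumes OceanBase 4.x and a session in the sys tenant, `your_unit_config_name` is a placeholder for whatever name the lookup returns, and 30000 is only the value that was tried later in this thread.

```sql
-- Data and log volume of tenant 1074 on each server, in GiB.
-- The columns are the ones visible in the gv$ob_units output above.
SELECT svr_ip,
       svr_port,
       status,
       ROUND(data_disk_in_use / 1024 / 1024 / 1024, 2) AS data_gib,
       ROUND(log_disk_in_use  / 1024 / 1024 / 1024, 2) AS log_gib
FROM   oceanbase.gv$ob_units
WHERE  tenant_id = 1074;

-- Find the name of the unit config behind UNIT_CONFIG_ID 1038.
SELECT unit_config_id, name, max_iops, min_iops, iops_weight
FROM   oceanbase.dba_ob_unit_configs
WHERE  unit_config_id = 1038;

-- Raise the IOPS limit. Replace your_unit_config_name (placeholder)
-- with the NAME returned by the previous query.
ALTER RESOURCE UNIT your_unit_config_name MAX_IOPS 30000, MIN_IOPS 30000;
```

After the change, rerunning the `dba_ob_units` query with the `MIGRATE_FROM_SVR_IP` filter shows whether the unit is still migrating.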

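After a few days of observation, this is where things stand. On Monday I replaced OBSERVER 180.32 in ZONE02. Since then, every tenant's replica on 180.32 has been migrated to other nodes except the one for tenant 1074. That replica has been stuck the whole time, for reasons I can't determine, and this is what eventually caused the replacement to time out.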

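Afterwards I manually simulated adding a new observer in zone03, and that worked without problems. But when I tried to stop the existing OBSERVER service, I got the error shown below.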

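Based on this test, the previous replacement that hung for 16 hours was most likely also stuck at the "stop OBSERVER service" step. I have already changed the IOPS limit to 30000 manually, but the situation is still abnormal. Please help analyze the cause further and suggest a solution. Thanks.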

Replica distribution of tenant 1074:
FULL{1}@zone01, FULL{1}@zone02, FULL{1}@zone03

Error when stopping the observer service:
[alter system stop server ?], error message:[PreparedStatementCallback; SQL [alter system stop server ?]; (conn=3222067135) Tenant(1074) LS(1002) has no leader, stop server not allowed; nested exception is java.sql.SQLTransientConnectionException: (conn=3222067135) Tenant(1074) LS(1002) has no leader, stop server not allowed
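
The error says LS 1002 of tenant 1074 has no leader, so a natural next check is where that log stream's replicas live and what role each one reports. A minimal sketch, assuming OceanBase 4.x and a session in the sys tenant. The view and column names (`CDB_OB_LS_LOCATIONS`, `GV$OB_LOG_STAT`) are assumptions based on the 4.x system views and should be verified against your version before relying on them.

```sql
-- Where are the replicas of LS 1002 for tenant 1074, and is any of them LEADER?
SELECT tenant_id, ls_id, svr_ip, svr_port, zone, role, replica_type,
       member_list, paxos_replica_number
FROM   oceanbase.cdb_ob_ls_locations
WHERE  tenant_id = 1074
  AND  ls_id = 1002;

-- Paxos view of the same log stream: member list, replica count,
-- and whether each replica's log is in sync.
SELECT svr_ip, svr_port, role, paxos_member_list, paxos_replica_num, in_sync
FROM   oceanbase.gv$ob_log_stat
WHERE  tenant_id = 1074
  AND  ls_id = 1002;
```

If no row reports `LEADER`, comparing the member list with the servers that are actually alive may show whether the log stream has lost its majority, which would be consistent with the stop-server check refusing to proceed.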

OK, we'll keep looking into it.

It's been a long journey. Hoping to see a result.