【布道师经验分享】节点故障后恢复(obd&非节点替换)

使用场景说明:

1、使用obd部署的ob集群且没有接管到ocp。

2、无空闲的机器可以用来替换故障节点,希望通过下线故障节点处理后重新加入到ob集群。

故障前节点分布

MySQL [oceanbase]> select * from dba_ob_servers;
+----------------+----------+----+-------+----------+-----------------+--------+----------------------------+-----------+-----------------------+----------------------------+----------------------------+--------------------------------------------------------------------------+-------------------+
| SVR_IP         | SVR_PORT | ID | ZONE  | SQL_PORT | WITH_ROOTSERVER | STATUS | START_SERVICE_TIME         | STOP_TIME | BLOCK_MIGRATE_IN_TIME | CREATE_TIME                | MODIFY_TIME                | BUILD_VERSION                                                            | LAST_OFFLINE_TIME |
+----------------+----------+----+-------+----------+-----------------+--------+----------------------------+-----------+-----------------------+----------------------------+----------------------------+--------------------------------------------------------------------------+-------------------+
| 11.xxx.xxx.191 |    12882 |  1 | zone1 |    12881 | YES             | ACTIVE | 2024-07-31 08:14:53.119695 | NULL      | NULL                  | 2024-07-24 10:50:42.478616 | 2024-07-31 08:14:55.115175 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
| 11.xxx.xxx.191 |    22882 |  4 | zone1 |    22881 | NO              | ACTIVE | 2024-07-31 08:14:07.606076 | NULL      | NULL                  | 2024-07-24 10:51:01.744417 | 2024-07-31 08:14:10.278981 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
| 11.xxx.xxx.191 |    32882 |  2 | zone2 |    32881 | NO              | ACTIVE | 2024-08-05 10:06:37.915423 | NULL      | NULL                  | 2024-07-24 10:50:42.523764 | 2024-08-05 10:06:39.509455 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
| 11.xxx.xxx.192 |    33332 |  7 | zone3 |    33331 | NO              | ACTIVE | 2024-08-05 17:23:09.258153 | NULL      | NULL                  | 2024-08-05 17:22:59.938657 | 2024-08-05 17:23:11.080485 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
| 11.xxx.xxx.192 |    42882 |  5 | zone2 |    42881 | NO              | ACTIVE | 2024-07-31 08:14:03.193183 | NULL      | NULL                  | 2024-07-24 10:51:01.792832 | 2024-07-31 08:14:06.075483 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
| 11.xxx.xxx.192 |    52882 |  3 | zone3 |    52881 | NO              | ACTIVE | 2024-07-31 08:14:57.594518 | NULL      | NULL                  | 2024-07-24 10:50:42.562099 | 2024-07-31 08:14:59.182698 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
+----------------+----------+----+-------+----------+-----------------+--------+----------------------------+-----------+-----------------------+----------------------------+----------------------------+--------------------------------------------------------------------------+-------------------+
6 rows in set (0.00 sec)

模拟故障

本次模拟的故障节点 11.xxx.xxx.192:32882

[heshun.lxd@sqaobnoxdn011162217192.sa128 /home/heshun.lxd]
$ps -ef | grep observer7
heshun.+  86129      1 80 Aug05 ?        1-09:22:50 /home/heshun.lxd/observer7/bin/observer -r 11.162.217.191:12882:12881;11.162.217.191:32882:32881;11.162.217.192:52882:52881 -p 33331 -P 33332 -z zone3 -n obcluster -c 1 -d /obdata/data/data7 -I 11.162.217.192 -o __min_full_resource_pool_memory=2147483648,memory_limit=12G,system_memory=2G,datafile_size=10G,log_disk_size=36G,cpu_count=8,enable_syslog_wf=False,enable_syslog_recycle=True,max_syslog_file_count=30
heshun.+  87058      1  0 Aug05 ?        00:00:16 /home/heshun.lxd/observer7/bin/obshell daemon --ip 11.162.217.192 --port 33333
heshun.+  87097  87058  0 Aug05 ?        00:01:34 /home/heshun.lxd/observer7/bin/obshell server --ip 11.162.217.192 --port 33333
heshun.+  93073  92972  0 10:46 pts/0    00:00:00 grep --color=auto observer7

[heshun.lxd@sqaobnoxdn011162217192.sa128 /home/heshun.lxd]
$

[heshun.lxd@sqaobnoxdn011162217192.sa128 /home/heshun.lxd]
$kill -9 86129 87058 87097

[heshun.lxd@sqaobnoxdn011162217192.sa128 /home/heshun.lxd]
$ps -ef | grep observer7
heshun.+  93636  92972  0 10:47 pts/0    00:00:00 grep --color=auto observer7


MySQL [oceanbase]> select * from dba_ob_servers 
where svr_ip='11.xxx.xxx.192' and svr_port=33332 \G
*************************** 1. row ***************************
               SVR_IP: 11.xxx.xxx.192
             SVR_PORT: 33332
                   ID: 7
                 ZONE: zone3
             SQL_PORT: 33331
      WITH_ROOTSERVER: NO
               STATUS: INACTIVE
   START_SERVICE_TIME: NULL
            STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
          CREATE_TIME: 2024-08-05 17:22:59.938657
          MODIFY_TIME: 2024-08-07 12:10:19.770275
        BUILD_VERSION: 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03)
    LAST_OFFLINE_TIME: 2024-08-07 10:47:46.794891
1 row in set (0.00 sec)

调小永久下线时间

alter system set server_permanent_offline_time = '60s';

MySQL [oceanbase]> select * from DBA_OB_ROOTSERVICE_EVENT_HISTORY 
where event like "%permanent_offline%" order by 1 desc limit 1 \G
*************************** 1. row ***************************
  TIMESTAMP: 2024-08-07 12:06:14.704733
     MODULE: server
      EVENT: permanent_offline
      NAME1: server
     VALUE1: "11.xxx.xxx.192:33332"
      NAME2:
     VALUE2:
      NAME3:
     VALUE3:
      NAME4:
     VALUE4:
      NAME5:
     VALUE5:
      NAME6:
     VALUE6:
 EXTRA_INFO:
  RS_SVR_IP: 11.xxx.xxx.191
RS_SVR_PORT: 12882
1 row in set (0.00 sec)

确认故障节点涉及的租户和zone信息

select a.tenant_name,b.zone
from dba_ob_tenants a,dba_ob_units b
where a.tenant_id=b.tenant_id
and (b.svr_ip='11.xxx.xxx.192' and b.svr_port=33332);

+-------------+-------+
| tenant_name | zone  |
+-------------+-------+
| test1       | zone3 |
| test2       | zone3 |
+-------------+-------+
2 rows in set (0.14 sec)

确认选择哪种方式处理

ob从 4.0 版本开始,要求每个租户在各 zone 的 unit_num 必须保持一致,从下面的sql 可以看出来涉及的2个租户的unit_num等于单zone内的节点数。

MySQL [oceanbase]> select tenant_name,tenant_id,locality,unit_num 
from dba_ob_tenants where tenant_name in ('test1','test2');
+-------------+-----------+---------------------------------------------+----------+
| tenant_name | tenant_id | locality                                    | unit_num |
+-------------+-----------+---------------------------------------------+----------+
| test1       |      1002 | FULL{1}@zone1, FULL{1}@zone2, FULL{1}@zone3 |        2 |
| test2       |      1004 | FULL{1}@zone1, FULL{1}@zone2, FULL{1}@zone3 |        2 |
+-------------+-----------+---------------------------------------------+----------+
2 rows in set (0.08 sec)

这里有2种思路:

调小unit_num

将涉及的租户 unit_num大小减去故障节点数,本次是unit_num从2调整为1。

优先评估磁盘空间是否够用:从下面的sql查看结果来看各zone内的磁盘评估使用率超过50%了,unit_num从2调整为1,单个节点无法容纳当前的数据量

select b.zone,a.svr_ip,a.svr_port,
round(total_size/1024/1024/1024,2) total_size, 
round(used_size/1024/1024/1024,2) used_size,
round(free_size/1024/1024/1024,2) free_size,
round(used_size/total_size,2)*100 used_pct
from __all_virtual_disk_stat a, dba_ob_servers b
where a.svr_ip=b.svr_ip
and a.svr_port=b.svr_port
order by zone desc;

+-------+----------------+----------+------------+-----------+-----------+----------+
| zone  | svr_ip         | svr_port | total_size | used_size | free_size | used_pct |
+-------+----------------+----------+------------+-----------+-----------+----------+
| zone3 | 11.xxx.xxx.192 |    52882 |      10.00 |      7.21 |      2.79 |    72.00 |
| zone2 | 11.xxx.xxx.191 |    32882 |      10.00 |      4.12 |      5.88 |    41.00 |
| zone2 | 11.xxx.xxx.192 |    42882 |      10.00 |      7.95 |      2.04 |    80.00 |
| zone1 | 11.xxx.xxx.191 |    12882 |      10.00 |      7.05 |      2.95 |    71.00 |
| zone1 | 11.xxx.xxx.191 |    22882 |      10.00 |      5.03 |      4.96 |    50.00 |
+-------+----------------+----------+------------+-----------+-----------+----------+
5 rows in set (0.11 sec)

如果在我们实际的环境中磁盘容量满足,在没有ocp express等有效监控评估对应的租户在各节点的cpu/memory/io 使用情况,避免因 unit_num 缩容后造成节点负载高等其他异常。

如果评估unit_num 缩容之后资源够用:

# 1、关闭涉及租户的 enable_transfer
alter system set enable_transfer='false' tenant='test1';
alter system set enable_transfer='false' tenant='test2';

# 2、查询故障节点涉及的租户及对应的 unit_group_id
select a.tenant_name,b.unit_group_id
from dba_ob_tenants a,dba_ob_units b
where a.tenant_id=b.tenant_id
and (svr_ip='11.xxx.xxx.192' and svr_port=33332);
+-------------+---------------+
| tenant_name | unit_group_id |
+-------------+---------------+
| test1       |          1007 |
| test2       |          1009 |
+-------------+---------------+
2 rows in set (0.15 sec)


# 3、缩小租户的unit_num 并删除故障节点对应的unit_group_id
alter resource tenant test1 unit_num=1 DELETE unit_group = ( 1007 );
alter resource tenant test2 unit_num=1 DELETE unit_group = ( 1009 );

等待机器修复好了之后再把unit_num改回来,这样只需要每个租户个数个日志流的两次迁移操作即可。

调小副本数

将涉及的租户的副本从3副本调整为2副本。

1、调整 locality

从每个租户的 locality 里剔除 zone3

MySQL [oceanbase]> alter tenant test1 locality='f@zone1,f@zone2';
Query OK, 0 rows affected (1.33 sec)

MySQL [oceanbase]> alter tenant test2 locality='f@zone1,f@zone2';
Query OK, 0 rows affected (1.17 sec)

# 对应的记录 JOB_STATUS 为success 说明locality变更完成
select * from DBA_OB_TENANT_JOBS
where job_type='alter_tenant_locality';

MySQL [oceanbase]> select * from DBA_OB_TENANT_JOBS where job_type='alter_tenant_locality' \G
*************************** 1. row ***************************
     JOB_ID: 14
   JOB_TYPE: ALTER_TENANT_LOCALITY
 JOB_STATUS: SUCCESS
RESULT_CODE: 0
   PROGRESS: 100
 START_TIME: 2024-08-07 18:40:08.206965
MODIFY_TIME: 2024-08-07 18:47:12.775644
  TENANT_ID: 1002
   SQL_TEXT: alter tenant test1 locality='f@zone1,f@zone2'
 EXTRA_INFO: FROM: 'FULL{1}@zone1, FULL{1}@zone2, FULL{1}@zone3', TO: 'FULL{1}@zone1, FULL{1}@zone2'
  RS_SVR_IP: 11.xxx.xxx.191
RS_SVR_PORT: 12882
*************************** 2. row ***************************
     JOB_ID: 15
   JOB_TYPE: ALTER_TENANT_LOCALITY
 JOB_STATUS: SUCCESS
RESULT_CODE: 0
   PROGRESS: 100
 START_TIME: 2024-08-07 18:40:22.269937
MODIFY_TIME: 2024-08-07 18:40:55.905170
  TENANT_ID: 1004
   SQL_TEXT: alter tenant test2 locality='f@zone1,f@zone2'
 EXTRA_INFO: FROM: 'FULL{1}@zone1, FULL{1}@zone2, FULL{1}@zone3', TO: 'FULL{1}@zone1, FULL{1}@zone2'
  RS_SVR_IP: 11.xxx.xxx.191
RS_SVR_PORT: 12882
2 rows in set (0.01 sec)

2、调整 resource_pool_list

从每个租户的 resource_pool_list 里剔除zone3

# 确认涉及租户的在各zone种的 resource pool name
select * from dba_ob_resource_pools where tenant_id in (1002,1004);
+------------------+------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
| RESOURCE_POOL_ID | NAME | TENANT_ID | CREATE_TIME                | MODIFY_TIME                | UNIT_COUNT | UNIT_CONFIG_ID | ZONE_LIST         | REPLICA_TYPE |
+------------------+------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
|             1001 | p1   |      1002 | 2024-07-24 17:03:25.453796 | 2024-08-01 10:35:15.150609 |          2 |           1001 | zone1;zone2;zone3 | FULL         |
|             1002 | p2_1 |      1004 | 2024-08-06 20:11:07.067122 | 2024-08-07 10:30:55.220775 |          2 |           1002 | zone1             | FULL         |
|             1003 | p2_2 |      1004 | 2024-08-06 20:11:28.524490 | 2024-08-07 10:30:56.059745 |          2 |           1002 | zone2             | FULL         |
|             1004 | p2_3 |      1004 | 2024-08-06 20:11:29.307956 | 2024-08-07 10:30:56.801430 |          2 |           1002 | zone3             | FULL         |
+------------------+------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
4 rows in set (0.02 sec)

# 删除 test2 租户 在zone3上的resource_pool,记录删除的为 p2_3
MySQL [oceanbase]> alter tenant test2 resource_pool_list=('p2_1', 'p2_2');
Query OK, 0 rows affected (0.59 sec)

#租户 test1 的 resource pool 在所有的3个 zone 上存在复用的情况,需要先执行split 操作。
MySQL [oceanbase]> alter resource pool  p1  SPLIT INTO ('p1_1','p1_2','p1_3') ON ('zone1','zone2','zone3');
Query OK, 0 rows affected (0.10 sec)

# 删除 test1 租户 在zone3上的resource_pool,记录删除的为 p1_3
alter tenant test1 resource_pool_list=('p1_1', 'p1_2');

#删除后确认
MySQL [oceanbase]> select * from dba_ob_resource_pools where tenant_id in (1002,1004);
+------------------+------+-----------+----------------------------+----------------------------+------------+----------------+-----------+--------------+
| RESOURCE_POOL_ID | NAME | TENANT_ID | CREATE_TIME                | MODIFY_TIME                | UNIT_COUNT | UNIT_CONFIG_ID | ZONE_LIST | REPLICA_TYPE |
+------------------+------+-----------+----------------------------+----------------------------+------------+----------------+-----------+--------------+
|             1002 | p2_1 |      1004 | 2024-08-06 20:11:07.067122 | 2024-08-07 10:30:55.220775 |          2 |           1002 | zone1     | FULL         |
|             1003 | p2_2 |      1004 | 2024-08-06 20:11:28.524490 | 2024-08-07 10:30:56.059745 |          2 |           1002 | zone2     | FULL         |
|             1005 | p1_1 |      1002 | 2024-08-07 18:57:10.471703 | 2024-08-07 18:57:10.471703 |          2 |           1001 | zone1     | FULL         |
|             1006 | p1_2 |      1002 | 2024-08-07 18:57:10.481192 | 2024-08-07 18:57:10.481192 |          2 |           1001 | zone2     | FULL         |
+------------------+------+-----------+----------------------------+----------------------------+------------+----------------+-----------+--------------+
4 rows in set (0.00 sec)

3、删除 resource_pool

删除每个租户在 zone3 里的 resource_pool

MySQL [oceanbase]> select * from dba_ob_resource_pools where tenant_id is null;
+------------------+------+-----------+----------------------------+----------------------------+------------+----------------+-----------+--------------+
| RESOURCE_POOL_ID | NAME | TENANT_ID | CREATE_TIME                | MODIFY_TIME                | UNIT_COUNT | UNIT_CONFIG_ID | ZONE_LIST | REPLICA_TYPE |
+------------------+------+-----------+----------------------------+----------------------------+------------+----------------+-----------+--------------+
|             1004 | p2_3 |      NULL | 2024-08-06 20:11:29.307956 | 2024-08-07 18:53:48.789520 |          2 |           1002 | zone3     | FULL         |
|             1007 | p1_3 |      NULL | 2024-08-07 18:57:10.488961 | 2024-08-07 18:58:58.294668 |          2 |           1001 | zone3     | FULL         |
+------------------+------+-----------+----------------------------+----------------------------+------------+----------------+-----------+--------------+
2 rows in set (0.01 sec)

MySQL [oceanbase]> drop resource pool p1_3;
Query OK, 0 rows affected (0.02 sec)

MySQL [oceanbase]> drop resource pool p2_3;
Query OK, 0 rows affected (0.01 sec)

MySQL [oceanbase]> select * from dba_ob_resource_pools where tenant_id is null;
Empty set (0.01 sec)

4、删除故障 observer

MySQL [oceanbase]> alter system delete server '11.xxx.xxx.192:33332' zone='zone3';
Query OK, 0 rows affected (0.24 sec)


MySQL [oceanbase]> select * from dba_ob_servers order by zone;
+----------------+----------+----+-------+----------+-----------------+--------+----------------------------+-----------+-----------------------+----------------------------+----------------------------+--------------------------------------------------------------------------+-------------------+
| SVR_IP         | SVR_PORT | ID | ZONE  | SQL_PORT | WITH_ROOTSERVER | STATUS | START_SERVICE_TIME         | STOP_TIME | BLOCK_MIGRATE_IN_TIME | CREATE_TIME                | MODIFY_TIME                | BUILD_VERSION                                                            | LAST_OFFLINE_TIME |
+----------------+----------+----+-------+----------+-----------------+--------+----------------------------+-----------+-----------------------+----------------------------+----------------------------+--------------------------------------------------------------------------+-------------------+
| 11.xxx.xxx.191 |    12882 |  1 | zone1 |    12881 | YES             | ACTIVE | 2024-07-31 08:14:53.119695 | NULL      | NULL                  | 2024-07-24 10:50:42.478616 | 2024-07-31 08:14:55.115175 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
| 11.xxx.xxx.191 |    22882 |  4 | zone1 |    22881 | NO              | ACTIVE | 2024-07-31 08:14:07.606076 | NULL      | NULL                  | 2024-07-24 10:51:01.744417 | 2024-07-31 08:14:10.278981 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
| 11.xxx.xxx.191 |    32882 |  2 | zone2 |    32881 | NO              | ACTIVE | 2024-08-05 10:06:37.915423 | NULL      | NULL                  | 2024-07-24 10:50:42.523764 | 2024-08-05 10:06:39.509455 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
| 11.xxx.xxx.192 |    42882 |  5 | zone2 |    42881 | NO              | ACTIVE | 2024-07-31 08:14:03.193183 | NULL      | NULL                  | 2024-07-24 10:51:01.792832 | 2024-07-31 08:14:06.075483 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
| 11.xxx.xxx.192 |    52882 |  3 | zone3 |    52881 | NO              | ACTIVE | 2024-07-31 08:14:57.594518 | NULL      | NULL                  | 2024-07-24 10:50:42.562099 | 2024-07-31 08:14:59.182698 | 4.3.1.0_1-bad90e897a7f0f56b8ce5c43e186aa8f4bab03c4(Jun 28 2024 23:14:03) | NULL              |
+----------------+----------+----+-------+----------+-----------------+--------+----------------------------+-----------+-----------------------+----------------------------+----------------------------+--------------------------------------------------------------------------+-------------------+
5 rows in set (0.02 sec)

5、修复机器

删除相关目录结构模拟修复后的机器

# home_path: /home/heshun.lxd/observer7
# data_dir: /obdata/data/data7
# redo_dir: /obdata/log/log7

rm -rf /home/heshun.lxd/observer7
rm -rf /obdata/data/data7
rm -rf /obdata/log/log7

6、重新加入集群

由于我们已经在ob集群里删除了 11.xxx.xxx.192:33332 节点,obd的配置文件里还是存在的,现在需要重新通过obd将 11.162.217.192:33332 节点扩容加入到 ob集群里,需要先手动删除:

cd ~/.obd/cluster/${deploy_name}

# 修改前备份
cp -ap config.yaml bak_config.yaml
cp -ap inner_config.yaml bak_inner_config.yaml


# 分别删除已经从ob集群里删除到节点 11.xxx.xxx.192:33332 和对应的配置
vim config.yaml
vim inner_config.yaml

在故障节点所在的 zone 上添加新节点的配置文件 add_zone3_server7.yaml

oceanbase-ce:
  servers:
    - name: server7
      ip: 11.xxx.xxx.192
  server7:
    obshell_port: 33333
    mysql_port: 33331
    rpc_port: 33332
    home_path: /home/heshun.lxd/observer7
    data_dir: /obdata/data/data7
    redo_dir: /obdata/log/log7
    zone: zone3

将修复后的机器添扩容加入到ob集群

obd cluster scale_out ob431 -c add_zone3_server7.yaml -v 

7、给所有租户添加副本

在增加副本之前,首先需要确认租户在目标 zone 上是否有资源池

MySQL [oceanbase]> select * from dba_ob_resource_pools;
+------------------+----------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
| RESOURCE_POOL_ID | NAME     | TENANT_ID | CREATE_TIME                | MODIFY_TIME                | UNIT_COUNT | UNIT_CONFIG_ID | ZONE_LIST         | REPLICA_TYPE |
+------------------+----------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
|                1 | sys_pool |         1 | 2024-07-24 10:50:37.618503 | 2024-07-24 10:50:37.671789 |          1 |              1 | zone1;zone2;zone3 | FULL         |
|             1002 | p2_1     |      1004 | 2024-08-06 20:11:07.067122 | 2024-08-07 10:30:55.220775 |          2 |           1002 | zone1             | FULL         |
|             1003 | p2_2     |      1004 | 2024-08-06 20:11:28.524490 | 2024-08-07 10:30:56.059745 |          2 |           1002 | zone2             | FULL         |
|             1005 | p1_1     |      1002 | 2024-08-07 18:57:10.471703 | 2024-08-07 18:57:10.471703 |          2 |           1001 | zone1             | FULL         |
|             1006 | p1_2     |      1002 | 2024-08-07 18:57:10.481192 | 2024-08-07 18:57:10.481192 |          2 |           1001 | zone2             | FULL         |
+------------------+----------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
5 rows in set (0.02 sec)

## 确认各 resource pool 使用的规格名称
MySQL [oceanbase]> select * from dba_ob_unit_configs;
+----------------+-----------------+----------------------------+----------------------------+---------+---------+-------------+---------------+---------------------+---------------------+-------------+
| UNIT_CONFIG_ID | NAME            | CREATE_TIME                | MODIFY_TIME                | MAX_CPU | MIN_CPU | MEMORY_SIZE | LOG_DISK_SIZE | MAX_IOPS            | MIN_IOPS            | IOPS_WEIGHT |
+----------------+-----------------+----------------------------+----------------------------+---------+---------+-------------+---------------+---------------------+---------------------+-------------+
|              1 | sys_unit_config | 2024-07-24 10:50:37.599549 | 2024-07-24 10:50:37.599549 |       2 |       2 |  2147483648 |    3221225472 | 9223372036854775807 | 9223372036854775807 |           2 |
|           1001 | u1              | 2024-07-24 17:03:25.421358 | 2024-08-06 20:08:01.205235 |       3 |       3 |  9663676416 |   25769803776 |               10000 |               10000 |           0 |
|           1002 | u2              | 2024-08-06 20:09:21.183164 | 2024-08-06 20:09:21.183164 |       2 |       2 |  2147483648 |    6442450944 | 9223372036854775807 | 9223372036854775807 |           2 |
+----------------+-----------------+----------------------------+----------------------------+---------+---------+-------------+---------------+---------------------+---------------------+-------------+
3 rows in set (0.02 sec)

由于我们将 test1 租户(1001)和 test2 租户(1002) 分别在 zone3 上删除了对应的 resource pool,因此我们需要给 zone3 上给上述的租户拥有对应的 resource pool,可以有2种方式:

  • 调整现有资源池的 zone 列表,该方法适用于租户在所有zone使用同一个 resource pool name, 在缩小副本数->调整 resource_pool_list的步骤中因无法使用同一个resource pool name,当时已经执行了split操作,所以这里不使用该方法。
  • 新增一个资源池给租户,要求资源池的 unit_num 与租户已有的资源池一致。

创建 resource pool

# 给 test1 租户在zone上创建 resource pool
create resource pool p1_3 unit='u1' ,unit_num=2,zone_list=('zone3');


# 给 test2 租户在zone上创建 resource pool
create resource pool p2_3 unit='u2' ,unit_num=2,zone_list=('zone3');


# 最后2条记录,tenant_id 等于NULL的说明刚新建的resource pool还没有关联给任何租户使用。
MySQL [oceanbase]> select * from dba_ob_resource_pools;
+------------------+----------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
| RESOURCE_POOL_ID | NAME     | TENANT_ID | CREATE_TIME                | MODIFY_TIME                | UNIT_COUNT | UNIT_CONFIG_ID | ZONE_LIST         | REPLICA_TYPE |
+------------------+----------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
|                1 | sys_pool |         1 | 2024-07-24 10:50:37.618503 | 2024-07-24 10:50:37.671789 |          1 |              1 | zone1;zone2;zone3 | FULL         |
|             1002 | p2_1     |      1004 | 2024-08-06 20:11:07.067122 | 2024-08-07 10:30:55.220775 |          2 |           1002 | zone1             | FULL         |
|             1003 | p2_2     |      1004 | 2024-08-06 20:11:28.524490 | 2024-08-07 10:30:56.059745 |          2 |           1002 | zone2             | FULL         |
|             1005 | p1_1     |      1002 | 2024-08-07 18:57:10.471703 | 2024-08-07 18:57:10.471703 |          2 |           1001 | zone1             | FULL         |
|             1006 | p1_2     |      1002 | 2024-08-07 18:57:10.481192 | 2024-08-07 18:57:10.481192 |          2 |           1001 | zone2             | FULL         |
|             1008 | p1_3     |      NULL | 2024-08-08 11:07:34.387838 | 2024-08-08 11:07:34.387838 |          2 |           1001 | zone3             | FULL         |
|             1010 | p2_3     |      NULL | 2024-08-08 11:12:49.406296 | 2024-08-08 11:12:49.406296 |          2 |           1002 | zone3             | FULL         |
+------------------+----------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
7 rows in set (0.01 sec)

修改租户的 resource_pool_list

MySQL [oceanbase]>  alter tenant test1 resource_pool_list=('p1_1','p1_2','p1_3');
Query OK, 0 rows affected (2.34 sec)

MySQL [oceanbase]>  alter tenant test2 resource_pool_list=('p2_1','p2_2','p2_3');
Query OK, 0 rows affected (1.29 sec)

MySQL [oceanbase]> select * from dba_ob_resource_pools;
+------------------+----------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
| RESOURCE_POOL_ID | NAME     | TENANT_ID | CREATE_TIME                | MODIFY_TIME                | UNIT_COUNT | UNIT_CONFIG_ID | ZONE_LIST         | REPLICA_TYPE |
+------------------+----------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
|                1 | sys_pool |         1 | 2024-07-24 10:50:37.618503 | 2024-07-24 10:50:37.671789 |          1 |              1 | zone1;zone2;zone3 | FULL         |
|             1002 | p2_1     |      1004 | 2024-08-06 20:11:07.067122 | 2024-08-07 10:30:55.220775 |          2 |           1002 | zone1             | FULL         |
|             1003 | p2_2     |      1004 | 2024-08-06 20:11:28.524490 | 2024-08-07 10:30:56.059745 |          2 |           1002 | zone2             | FULL         |
|             1005 | p1_1     |      1002 | 2024-08-07 18:57:10.471703 | 2024-08-07 18:57:10.471703 |          2 |           1001 | zone1             | FULL         |
|             1006 | p1_2     |      1002 | 2024-08-07 18:57:10.481192 | 2024-08-07 18:57:10.481192 |          2 |           1001 | zone2             | FULL         |
|             1008 | p1_3     |      1002 | 2024-08-08 11:07:34.387838 | 2024-08-08 11:44:49.982360 |          2 |           1001 | zone3             | FULL         |
|             1010 | p2_3     |      1004 | 2024-08-08 11:12:49.406296 | 2024-08-08 11:45:14.888533 |          2 |           1002 | zone3             | FULL         |
+------------------+----------+-----------+----------------------------+----------------------------+------------+----------------+-------------------+--------------+
7 rows in set (0.00 sec)

修改租户的 locality

MySQL [oceanbase]>  alter tenant test1 locality='f@zone1,f@zone2,f@zone3';
Query OK, 0 rows affected (0.59 sec)

MySQL [oceanbase]>  alter tenant test2 locality='f@zone1,f@zone2,f@zone3';
Query OK, 0 rows affected (0.45 sec)


# 等 JOB_STATUS 是 SUCCESS 说明修改租户的locality操作执行成功了
select * from dba_ob_tenant_jobs 
where job_type='alter_tenant_locality' order by start_time desc limit 2 \G

*************************** 1. row ***************************
     JOB_ID: 18
   JOB_TYPE: ALTER_TENANT_LOCALITY
 JOB_STATUS: SUCCESS
RESULT_CODE: 0
   PROGRESS: 100
 START_TIME: 2024-08-08 11:48:56.763835
MODIFY_TIME: 2024-08-08 11:53:19.692981
  TENANT_ID: 1004
   SQL_TEXT: alter tenant test2 locality='f@zone1,f@zone2,f@zone3'
 EXTRA_INFO: FROM: 'FULL{1}@zone1, FULL{1}@zone2', TO: 'FULL{1}@zone1, FULL{1}@zone2, FULL{1}@zone3'
  RS_SVR_IP: 11.xxx.xxx.191
RS_SVR_PORT: 12882
*************************** 2. row ***************************
     JOB_ID: 17
   JOB_TYPE: ALTER_TENANT_LOCALITY
 JOB_STATUS: SUCCESS
RESULT_CODE: 0
   PROGRESS: 100
 START_TIME: 2024-08-08 11:48:49.527044
MODIFY_TIME: 2024-08-08 11:53:19.129475
  TENANT_ID: 1002
   SQL_TEXT: alter tenant test1 locality='f@zone1,f@zone2,f@zone3'
 EXTRA_INFO: FROM: 'FULL{1}@zone1, FULL{1}@zone2', TO: 'FULL{1}@zone1, FULL{1}@zone2, FULL{1}@zone3'
  RS_SVR_IP: 11.xxx.xxx.191
RS_SVR_PORT: 12882
2 rows in set (0.00 sec)

到这里,没有通过替换节点的方式将故障的节点处理后,重新再加入到ob集群。