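[Environment] Production
[OB or other component] OB Community Edition
[Version] v4.3.2
[Problem description] We plan to upgrade the production environment from v4.3.2 to v4.3.3. The cluster has 3 zones, with obproxy co-deployed with 2 observers. We plan to upgrade with obd; please help confirm the following upgrade questions: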
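- During the rolling upgrade, what is the impact on business latency and QPS, for example by what percentage does QPS drop? The official documentation is vague on this point.
- In a test environment I saw QPS drop to 0. The tenant's primary zone is set to "Random" and the target table is a hash-partitioned table, so this is not what I expected. Am I missing some required configuration?
- Obproxy has a connection-keeping feature, so what does the application side need to do during the upgrade? My understanding is that anything exceeding 100ms gets killed, so the application still needs a mechanism to catch exceptions and retry, right?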
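For the last question, a minimal sketch of what "catch the exception and retry" could look like on the application side. Everything here is an assumption for illustration: `TransientDBError` stands in for whatever exception your MySQL driver raises on a dropped connection or killed request, and `run_with_retry`, `max_attempts`, and `base_delay` are hypothetical names. In a real application the retried unit should be the whole transaction (including reconnecting), not a single statement.

```python
import random
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver error such as a dropped connection."""


def run_with_retry(operation, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call operation(); on TransientDBError, back off and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDBError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_insert():
        # Fails twice, then succeeds, imitating a leader switch during an upgrade.
        calls["n"] += 1
        if calls["n"] < 3:
            raise TransientDBError("connection lost")
        return "ok"

    print(run_with_retry(flaky_insert, sleep=lambda s: None))
```

The `sleep` parameter is injectable only so the demo and tests do not actually wait; production code would leave it at the default.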
辞霜
#3
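On the upgrade questions: machine specs, cluster size, and business volume all affect the result, so we cannot give a definite QPS figure from here.
3 zones, with obproxy co-deployed with 2 observers: what does this cluster architecture look like?
In my view, even though OB upgrades in a rolling manner, you should still do it during off-peak hours; this is a major operational change.
obproxy has a liveness-probing mechanism and selects nodes automatically.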
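Thanks for the reply and suggestions. My architecture is roughly as follows:
I plan to do the version upgrade during off-peak hours.
The test uses a Python script that mostly runs simple insert and update operations. The script catches exceptions but does not handle them; it just continues with the next loop iteration. Only the observers were upgraded. After the test, monitoring showed QPS dropping to 0, which seems strange and quite far from the "slight" impact described on the official site.
So how should I adjust the architecture, the Python test script, or even the table schema? Thanks ~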
辞霜
#5
Do you have OCP deployed? QPS dropping to 0 does not look right; you can use OCP to observe it.
We also recommend using OCP to perform the upgrade.
辞霜
#6
We will run an upgrade test on our side to see whether the QPS drop to 0 occurs.
Hello, I upgraded the observers with obd. Is the OCP you mentioned ocp express?
辞霜
#8
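No. ocp-express is no longer recommended; it will receive little technical support going forward.
Hello, I am on the community edition. If I want to use OCP, I can just install one through the obd web GUI, right?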
辞霜
#14
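That does not look right.
In an OB high-availability cluster, even if a node disconnects, the leader switches automatically and the business impact is small; it should not drive QPS to 0. Also, the upgrade is currently rolling, so it can basically be understood as one node being unavailable at a time.
Try putting obproxy in front of the OB cluster.
Understood. Summarizing your suggestions, I need to make the following two adjustments:
- Upgrade the observers with OCP
- Connect the Python test script to an obproxy
One more question: if the test cluster is already on v4.3.3, it can only be upgraded further and cannot be downgraded, right?
You mentioned that the business impact is fairly small. Is there an approximate range for reference, for example how much latency rises or how much TPS drops? I understand that exact values are hard to give and depend on the workload and cluster size, but do you have reference values from your own experience?
Thanks ~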
辞霜
#16
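Yes, downgrading is currently not supported; you can upgrade to 433bp1.
As for the scope of business impact, there are currently no official reference figures.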
辞霜
#17
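Is the drop to 0 in the chart above monitored at the cluster level or the tenant level? And this monitoring is from ocp-express, right?
Yes, ocp express. I retested with obd, v4.3.3.0 → v4.3.3.1, and ocp express still shows a drop to zero. The test procedure was as follows:
- Confirm the tenant's primary zone setting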
SELECT * FROM oceanbase.DBA_OB_TENANTS WHERE TENANT_NAME = 'tenants_part'\G
*************************** 1. row ***************************
TENANT_ID: 1008
TENANT_NAME: tenants_part
TENANT_TYPE: USER
CREATE_TIME: 2024-10-17 15:42:26.215499
MODIFY_TIME: 2024-10-17 16:06:56.862331
PRIMARY_ZONE: RANDOM
LOCALITY: FULL{1}@zone1, FULL{1}@zone4, FULL{1}@zone5
PREVIOUS_LOCALITY: NULL
COMPATIBILITY_MODE: MYSQL
STATUS: NORMAL
IN_RECYCLEBIN: NO
LOCKED: NO
TENANT_ROLE: PRIMARY
SWITCHOVER_STATUS: NORMAL
SWITCHOVER_EPOCH: 0
SYNC_SCN: 1730788332392798004
REPLAYABLE_SCN: 1730788332392798004
READABLE_SCN: 1730788332392798004
RECOVERY_UNTIL_SCN: 4611686018427387903
LOG_MODE: NOARCHIVELOG
ARBITRATION_SERVICE_STATUS: DISABLED
UNIT_NUM: 1
COMPATIBLE: 4.3.3.0
MAX_LS_ID: 1003
RESTORE_DATA_MODE: NORMAL
1 row in set (0.172 sec)
- Create a hash-partitioned table and confirm how the leaders are distributed across zones
obclient [oceanbase]> select DATABASE_NAME,TABLE_NAME,PARTITION_NAME,ZONE,SVR_IP,SVR_PORT,ROLE from DBA_OB_TABLE_LOCATIONS where table_name='t_part_tab' and role='leader';
+---------------+------------+----------------+-------+-------------+----------+--------+
| DATABASE_NAME | TABLE_NAME | PARTITION_NAME | ZONE | SVR_IP | SVR_PORT | ROLE |
+---------------+------------+----------------+-------+-------------+----------+--------+
| python_db | t_part_tab | p0 | zone1 | x.x.x.106 | 2882 | LEADER |
| python_db | t_part_tab | p1 | zone1 | x.x.x.106 | 2882 | LEADER |
| python_db | t_part_tab | p2 | zone1 | x.x.x.106 | 2882 | LEADER |
| python_db | t_part_tab | p3 | zone1 | x.x.x.106 | 2882 | LEADER |
| python_db | t_part_tab | p4 | zone1 | x.x.x.106 | 2882 | LEADER |
| python_db | t_part_tab | p5 | zone1 | x.x.x.106 | 2882 | LEADER |
| python_db | t_part_tab | p6 | zone4 | x.x.x.214 | 2882 | LEADER |
| python_db | t_part_tab | p7 | zone4 | x.x.x.214 | 2882 | LEADER |
| python_db | t_part_tab | p8 | zone4 | x.x.x.214 | 2882 | LEADER |
| python_db | t_part_tab | p9 | zone4 | x.x.x.214 | 2882 | LEADER |
| python_db | t_part_tab | p10 | zone4 | x.x.x.214 | 2882 | LEADER |
| python_db | t_part_tab | p11 | zone4 | x.x.x.214 | 2882 | LEADER |
| python_db | t_part_tab | p12 | zone5 | x.x.x.177 | 2882 | LEADER |
| python_db | t_part_tab | p13 | zone5 | x.x.x.177 | 2882 | LEADER |
| python_db | t_part_tab | p14 | zone5 | x.x.x.177 | 2882 | LEADER |
| python_db | t_part_tab | p15 | zone5 | x.x.x.177 | 2882 | LEADER |
| python_db | t_part_tab | p16 | zone5 | x.x.x.177 | 2882 | LEADER |
| python_db | t_part_tab | p17 | zone5 | x.x.x.177 | 2882 | LEADER |
+---------------+------------+----------------+-------+-------------+----------+--------+
18 rows in set (0.040 sec)
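- Start the Python script and verify the data
1) Connect to an obproxy
2) Query the data distribution of each partition: partitions p0, p3, p6, p9, p12, and p15 are empty, and all other partitions contain data, so writes landed on every zone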
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p0);
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
1 row in set (0.001 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p1);
+----------+
| COUNT(*) |
+----------+
| 44599 |
+----------+
1 row in set (0.024 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p2);
+----------+
| COUNT(*) |
+----------+
| 44740 |
+----------+
1 row in set (0.023 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p3);
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
1 row in set (0.002 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p4);
+----------+
| COUNT(*) |
+----------+
| 45008 |
+----------+
1 row in set (0.023 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p5);
+----------+
| COUNT(*) |
+----------+
| 45144 |
+----------+
1 row in set (0.024 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p6);
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
1 row in set (0.004 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p7);
+----------+
| COUNT(*) |
+----------+
| 45555 |
+----------+
1 row in set (0.025 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p8);
+----------+
| COUNT(*) |
+----------+
| 45693 |
+----------+
1 row in set (0.025 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p9);
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
1 row in set (0.004 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p10);
+----------+
| COUNT(*) |
+----------+
| 46021 |
+----------+
1 row in set (0.026 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p11);
+----------+
| COUNT(*) |
+----------+
| 46177 |
+----------+
1 row in set (0.025 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p12);
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
1 row in set (0.012 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p13);
+----------+
| COUNT(*) |
+----------+
| 46916 |
+----------+
1 row in set (0.031 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p14);
+----------+
| COUNT(*) |
+----------+
| 47071 |
+----------+
1 row in set (0.026 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p15);
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
1 row in set (0.012 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p16);
+----------+
| COUNT(*) |
+----------+
| 47314 |
+----------+
1 row in set (0.026 sec)
obclient [python_db]> SELECT COUNT(*) from t_part_tab PARTITION (p17);
+----------+
| COUNT(*) |
+----------+
| 47487 |
+----------+
1 row in set (0.026 sec)
Query the data dictionary:
obclient [information_schema]> select TABLE_SCHEMA,TABLE_NAME,PARTITION_NAME,PARTITION_METHOD,PARTITION_EXPRESSION,TABLE_ROWS,CREATE_TIME from PARTITIONS;
+--------------+------------+----------------+------------------+----------------------+------------+----------------------------+
| TABLE_SCHEMA | TABLE_NAME | PARTITION_NAME | PARTITION_METHOD | PARTITION_EXPRESSION | TABLE_ROWS | CREATE_TIME |
+--------------+------------+----------------+------------------+----------------------+------------+----------------------------+
| python_db | t_part_tab | p0 | HASH | id | NULL | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p1 | HASH | id | 5422 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p2 | HASH | id | 5385 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p3 | HASH | id | NULL | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p4 | HASH | id | 5332 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p5 | HASH | id | 5298 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p6 | HASH | id | NULL | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p7 | HASH | id | 5380 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p8 | HASH | id | 5359 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p9 | HASH | id | NULL | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p10 | HASH | id | 5297 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p11 | HASH | id | 5271 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p12 | HASH | id | NULL | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p13 | HASH | id | 5387 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p14 | HASH | id | 5360 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p15 | HASH | id | NULL | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p16 | HASH | id | 5315 | 2024-11-05 10:36:00.603746 |
| python_db | t_part_tab | p17 | HASH | id | 5295 | 2024-11-05 10:36:00.603746 |
+--------------+------------+----------------+------------------+----------------------+------------+----------------------------+
18 rows in set (0.034 sec)
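To double-check the claim that every zone received writes, the `COUNT(*)` results above can be summed per zone. This is only a sketch: `ROW_COUNTS`, `leader_zone`, and `rows_per_zone` are illustrative names, the numbers are copied from the query output in this post, and the partition-to-zone mapping (p0 to p5 on zone1, p6 to p11 on zone4, p12 to p17 on zone5) is taken from the leader listing.

```python
from collections import defaultdict

# COUNT(*) per partition p0..p17, copied from the obclient output in this post.
ROW_COUNTS = [
    0, 44599, 44740, 0, 45008, 45144,   # p0-p5
    0, 45555, 45693, 0, 46021, 46177,   # p6-p11
    0, 46916, 47071, 0, 47314, 47487,   # p12-p17
]


def leader_zone(partition_index):
    """Map a partition index to its leader zone, per the leader listing."""
    return ("zone1", "zone4", "zone5")[partition_index // 6]


def rows_per_zone(row_counts):
    """Sum the per-partition row counts by leader zone."""
    totals = defaultdict(int)
    for index, rows in enumerate(row_counts):
        totals[leader_zone(index)] += rows
    return dict(totals)


if __name__ == "__main__":
    for zone, rows in rows_per_zone(ROW_COUNTS).items():
        print(zone, rows)
    # zone1 179491
    # zone4 183446
    # zone5 188788
```

The three totals are non-zero, so with these leader placements each zone was serving writes before the upgrade started.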
- Pre-upgrade preparation
1) Stop the Python script
2) Truncate the target hash-partitioned table
- Perform the upgrade
$ obd cluster upgrade myoceanbase -c oceanbase-ce -V 4.3.3.1 --usable d505f601a931f82cbfba8960ddd0a4146a58b6305e223cd2060d69c237a9a31a
- Monitoring as displayed in ocp express
君野
#19
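I suggest adding some QPS statistics on the client side; the monitoring data on the platform may be averaged and is not necessarily accurate. Alternatively, try sysbench as the client and ignore certain error codes so that the client does not exit under load: --mysql-ignore-errors=1062,2013,4265,5066,6002,6213,6224,6222,4746,4012,4009,4250,4038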
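A rough sketch of what client-side QPS statistics could look like in the Python test script. The class and method names (`QpsCounter`, `record`, `zero_qps_seconds`) are made up for illustration; the idea is just to bucket every request by wall-clock second and then list the seconds in which nothing succeeded, so a real drop to 0 can be told apart from an averaged chart.

```python
import time
from collections import Counter


class QpsCounter:
    """Counts successful and failed requests per wall-clock second."""

    def __init__(self, clock=time.time):
        self._clock = clock       # injectable so the demo can use fake time
        self.ok = Counter()       # second -> number of successful requests
        self.failed = Counter()   # second -> number of failed requests

    def record(self, success):
        """Record one finished request in the bucket for the current second."""
        second = int(self._clock())
        if success:
            self.ok[second] += 1
        else:
            self.failed[second] += 1

    def zero_qps_seconds(self):
        """Seconds between first and last request with no successful request."""
        seen = set(self.ok) | set(self.failed)
        if not seen:
            return []
        return [s for s in range(min(seen), max(seen) + 1) if self.ok[s] == 0]


if __name__ == "__main__":
    # Fake timestamps: two successes, one failure, a silent second, one success.
    ticks = iter([100.1, 100.5, 101.2, 103.7])
    counter = QpsCounter(clock=lambda: next(ticks))
    for success in (True, True, False, True):
        counter.record(success)
    print(counter.zero_qps_seconds())  # [101, 102]
```

In the test loop, `record(True)` would go after each successful statement and `record(False)` inside the `except` branch, which also makes it visible whether the errors during the upgrade are bursts of failures or a client that is simply blocked.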