OB Community Edition 4.2.2: question about a standby tenant whose lag keeps growing abnormally

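OB information

    1. Production environment, version 4.2.2.1, package: oceanbase-ce-4.2.2.1-101000012024030709.el7.x86_64.rpm
    2. The primary cluster topology is 1-1-1, and the standby cluster is a single-node cluster (strictly speaking, 4.2 no longer has the concept of a standby cluster; the term is used here for convenience). The primary cluster servers' CPU, memory, and NVMe SSDs meet OB production requirements. The standby cluster has the same CPU and memory, but its disk is a SATA SSD, whose IO performance is somewhat worse than the primary cluster's but still much better than SAS disks.
    3. Some tenants on the primary cluster have standby tenants on the standby cluster. All of them are normal except one standby tenant whose lag keeps growing.
    4. The problem tenant holds less than 100 GB of data. The primary tenant only takes writes, and the write volume is very low.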

Symptoms

Primary tenant resource specification

+-----------------------+-------------------------------+---------+---------+-------------+------------------+---------------------+---------------------+---------+-------+------------------+-----------+-------------+
| resource_pool_name    | unit_config_name              | max_cpu | min_cpu | mem_size_gb | log_disk_size_gb | max_iops            | min_iops            | unit_id | zone  | observer         | tenant_id | tenant_name |
+-----------------------+-------------------------------+---------+---------+-------------+------------------+---------------------+---------------------+---------+-------+------------------+-----------+-------------+
| pool_ten005_zone1_gtu | config_ten005_zone1_U4C4G_gtu |       4 |       4 |        4.00 |            12.00 | 9223372036854775807 | 9223372036854775807 |    1022 | zone1 | 10.0.0.36:2882 |      1016 | ten005      |
| pool_ten005_zone3_xld | config_ten005_zone3_U4C4G_xld |       4 |       4 |        4.00 |            12.00 | 9223372036854775807 | 9223372036854775807 |    1023 | zone3 | 10.0.0.38:2882 |      1016 | ten005      |
| pool_ten005_zone2_tqe | config_ten005_zone2_U4C4G_tqe |       4 |       4 |        4.00 |            12.00 | 9223372036854775807 | 9223372036854775807 |    1024 | zone2 | 10.0.0.37:2882 |      1016 | ten005      |
+-----------------------+-------------------------------+---------+---------+-------------+------------------+---------------------+---------------------+---------+-------+------------------+-----------+-------------+
3 rows in set (0.00 sec)

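  • Standby tenant performance information.

The standby tenant's resource specification was adjusted in the afternoon: memory was lowered from 8G to 2G and then raised to 4G.

Standby tenant resource specification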

+-----------------------+-------------------------------+---------+---------+-------------+------------------+---------------------+---------------------+---------+-------+------------------+-----------+-------------+
| resource_pool_name    | unit_config_name              | max_cpu | min_cpu | mem_size_gb | log_disk_size_gb | max_iops            | min_iops            | unit_id | zone  | observer         | tenant_id | tenant_name |
+-----------------------+-------------------------------+---------+---------+-------------+------------------+---------------------+---------------------+---------+-------+------------------+-----------+-------------+
| pool_ten005_zone1_gfj | config_ten005_zone1_U4C4G_hkk |       4 |       4 |        4.00 |            12.00 | 9223372036854775807 | 9223372036854775807 |    1003 | zone1 | 10.0.0.41:2882 |      1006 | ten005      |
+-----------------------+-------------------------------+---------+---------+-------------+------------------+---------------------+---------------------+---------+-------+------------------+-----------+-------------+
1 row in set (0.02 sec)

  • Primary tenant information.
MySQL [oceanbase]> select tenant_id, tenant_name, tenant_type, primary_zone,tenant_role, scn_to_timestamp(sync_scn) sync_ts, scn_to_timestamp(replayable_scn) replayable_ts, scn_to_timestamp(readable_scn) readable_ts, scn_to_timestamp(recovery_until_scn) recovery_until_ts, log_mode,max_ls_id from oceanbase.dba_ob_tenants where tenant_type='USER' and tenant_id in (1016);
+-----------+-------------+-------------+-------------------+-------------+----------------------------+----------------------------+----------------------------+----------------------------+------------+-----------+
| tenant_id | tenant_name | tenant_type | primary_zone      | tenant_role | sync_ts                    | replayable_ts              | readable_ts                | recovery_until_ts          | log_mode   | max_ls_id |
+-----------+-------------+-------------+-------------------+-------------+----------------------------+----------------------------+----------------------------+----------------------------+------------+-----------+
|      1016 | ten005      | USER        | zone2;zone1;zone3 | PRIMARY     | 2024-05-22 18:31:58.495893 | 2024-05-22 18:31:58.495893 | 2024-05-22 18:31:58.495893 | 2116-02-21 07:53:38.427387 | ARCHIVELOG |      1003 |
+-----------+-------------+-------------+-------------------+-------------+----------------------------+----------------------------+----------------------------+----------------------------+------------+-----------+
1 row in set (0.01 sec)

MySQL [oceanbase]> select tenant_id,ls_id,svr_ip,role,access_mode,in_sync, scn_to_timestamp(begin_scn) begin_timestamp,scn_to_timestamp(end_scn) end_timestamp,scn_to_timestamp(max_scn) max_timestamp from oceanbase.gv$ob_log_stat where role='LEADER' and tenant_id in (1016);
+-----------+-------+-------------+--------+-------------+---------+----------------------------+----------------------------+----------------------------+
| tenant_id | ls_id | svr_ip      | role   | access_mode | in_sync | begin_timestamp            | end_timestamp              | max_timestamp              |
+-----------+-------+-------------+--------+-------------+---------+----------------------------+----------------------------+----------------------------+
|      1016 |     1 | 10.0.0.37 | LEADER | APPEND      | YES     | 2024-05-15 19:05:47.830719 | 2024-05-22 18:27:58.041650 | 2024-05-22 18:27:58.041650 |
|      1016 |  1001 | 10.0.0.37 | LEADER | APPEND      | YES     | 2024-05-15 18:05:00.916525 | 2024-05-22 18:27:58.041650 | 2024-05-22 18:27:58.041650 |
+-----------+-------+-------------+--------+-------------+---------+----------------------------+----------------------------+----------------------------+
2 rows in set (0.00 sec)
  • Standby tenant information.
MySQL [oceanbase]> select tenant_id, tenant_name, tenant_type, primary_zone,tenant_role, scn_to_timestamp(sync_scn) sync_ts, scn_to_timestamp(replayable_scn) replayable_ts, scn_to_timestamp(readable_scn) readable_ts, scn_to_timestamp(recovery_until_scn) recovery_until_ts, log_mode,max_ls_id from oceanbase.dba_ob_tenants where tenant_type='USER' and tenant_id in (1006);
+-----------+-------------+-------------+--------------+-------------+----------------------------+----------------------------+----------------------------+----------------------------+--------------+-----------+
| tenant_id | tenant_name | tenant_type | primary_zone | tenant_role | sync_ts                    | replayable_ts              | readable_ts                | recovery_until_ts          | log_mode     | max_ls_id |
+-----------+-------------+-------------+--------------+-------------+----------------------------+----------------------------+----------------------------+----------------------------+--------------+-----------+
|      1006 | ten005      | USER        | RANDOM       | STANDBY     | 2024-05-22 12:48:11.425191 | 2024-05-22 12:48:11.425191 | 2024-05-22 12:48:11.425191 | 2116-02-21 07:53:38.427387 | NOARCHIVELOG |      1001 |
+-----------+-------------+-------------+--------------+-------------+----------------------------+----------------------------+----------------------------+----------------------------+--------------+-----------+
1 row in set (0.02 sec)

MySQL [oceanbase]> select tenant_id,ls_id,svr_ip,role,access_mode,in_sync, scn_to_timestamp(begin_scn) begin_timestamp,scn_to_timestamp(end_scn) end_timestamp,scn_to_timestamp(max_scn) max_timestamp from oceanbase.gv$ob_log_stat where role='LEADER' and tenant_id in (1006);
+-----------+-------+-------------+--------+-------------+---------+----------------------------+----------------------------+----------------------------+
| tenant_id | ls_id | svr_ip      | role   | access_mode | in_sync | begin_timestamp            | end_timestamp              | max_timestamp              |
+-----------+-------+-------------+--------+-------------+---------+----------------------------+----------------------------+----------------------------+
|      1006 |     1 | 10.0.0.41 | LEADER | RAW_WRITE   | YES     | 2024-05-21 20:45:47.970381 | 2024-05-22 18:33:34.639532 | 2024-05-22 18:33:34.639532 |
|      1006 |  1001 | 10.0.0.41 | LEADER | RAW_WRITE   | YES     | 2024-05-22 03:33:03.317788 | 2024-05-22 12:48:11.425191 | 2024-05-22 12:48:11.425191 |
+-----------+-------+-------------+--------+-------------+---------+----------------------------+----------------------------+----------------------------+
2 rows in set (0.01 sec)

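The standby tenant's business data sync time appears to be stuck at 2024-05-22 12:48:11.425191 (log stream 1001), while log stream 1 keeps advancing. The cause is unknown.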
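
To put a number on the lag, the two sync_ts values from the dba_ob_tenants outputs above can be subtracted. The following is only a minimal sketch: the function name `sync_lag_seconds` is made up for illustration, and it assumes both timestamps are rendered in the same time zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def sync_lag_seconds(primary_sync_ts: str, standby_sync_ts: str) -> float:
    """Return how far the standby's sync point trails the primary's, in seconds."""
    primary = datetime.strptime(primary_sync_ts, FMT)
    standby = datetime.strptime(standby_sync_ts, FMT)
    return (primary - standby).total_seconds()


# sync_ts values copied from the two dba_ob_tenants outputs above
lag = sync_lag_seconds("2024-05-22 18:31:58.495893", "2024-05-22 12:48:11.425191")
print(f"standby trails primary by {lag:.0f} s (~{lag / 3600:.1f} h)")
```

With the values above the gap is about 5 hours 43 minutes, far beyond the few seconds of lag seen on the healthy standby tenants.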

Check the standby tenant's log restore source information.

MySQL [oceanbase]> select tenant_id, id, type, substr(value,1,30) value_, scn_to_timestamp(recovery_until_scn) from oceanbase.cdb_ob_log_restore_source where tenant_id=1006;
+-----------+----+---------+--------------------------------+--------------------------------------+
| tenant_id | id | type    | value_                         | scn_to_timestamp(recovery_until_scn) |
+-----------+----+---------+--------------------------------+--------------------------------------+
|      1006 |  1 | SERVICE | IP_LIST=10.0.0.36:2881;10.10 | 2116-02-21 07:53:38.427387           |
+-----------+----+---------+--------------------------------+--------------------------------------+
1 row in set (0.01 sec)

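Network-based log replication is in use. The IP list above is truncated by the substr call.

Attempted fixes

On the standby cluster, pause and then resume log synchronization for the standby tenant.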

MySQL [oceanbase]> SELECT TENANT_NAME, TENANT_ID, TENANT_ROLE, SCN_TO_TIMESTAMP(SYNC_SCN) 
    -> FROM oceanbase.DBA_OB_TENANTS WHERE TENANT_NAME ='ten005';
+-------------+-----------+-------------+----------------------------+
| TENANT_NAME | TENANT_ID | TENANT_ROLE | SCN_TO_TIMESTAMP(SYNC_SCN) |
+-------------+-----------+-------------+----------------------------+
| ten005      |      1006 | STANDBY     | 2024-05-22 12:48:11.425191 |
+-------------+-----------+-------------+----------------------------+
1 row in set (0.04 sec)

MySQL [oceanbase]> SELECT LS_ID, SCN_TO_TIMESTAMP(END_SCN) FROM oceanbase.GV$OB_LOG_STAT WHERE TENANT_ID =1006 and role='LEADER';
+-------+----------------------------+
| LS_ID | SCN_TO_TIMESTAMP(END_SCN)  |
+-------+----------------------------+
|     1 | 2024-05-22 17:35:32.897059 |
|  1001 | 2024-05-22 12:48:11.425191 |
+-------+----------------------------+
2 rows in set (0.01 sec)

MySQL [oceanbase]> alter system recover standby tenant=ten005 cancel;
Query OK, 0 rows affected (0.16 sec)

MySQL [oceanbase]> SELECT TENANT_NAME, TENANT_ID, TENANT_ROLE, SCN_TO_TIMESTAMP(SYNC_SCN)  FROM oceanbase.DBA_OB_TENANTS WHERE TENANT_NAME ='ten005';
+-------------+-----------+-------------+----------------------------+
| TENANT_NAME | TENANT_ID | TENANT_ROLE | SCN_TO_TIMESTAMP(SYNC_SCN) |
+-------------+-----------+-------------+----------------------------+
| ten005      |      1006 | STANDBY     | 2024-05-22 12:48:11.425191 |
+-------------+-----------+-------------+----------------------------+
1 row in set (0.04 sec)

MySQL [oceanbase]> SELECT LS_ID, SCN_TO_TIMESTAMP(END_SCN) FROM oceanbase.GV$OB_LOG_STAT WHERE TENANT_ID =1006 and role='LEADER';                                                                                                           
+-------+----------------------------+
| LS_ID | SCN_TO_TIMESTAMP(END_SCN)  |
+-------+----------------------------+
|     1 | 2024-05-22 17:39:33.258304 |
|  1001 | 2024-05-22 12:48:11.425191 |
+-------+----------------------------+
2 rows in set (0.00 sec)

MySQL [oceanbase]> alter system recover standby tenant=ten005 until unlimited;
Query OK, 0 rows affected (0.05 sec)

MySQL [oceanbase]> SELECT LS_ID, SCN_TO_TIMESTAMP(END_SCN) FROM oceanbase.GV$OB_LOG_STAT WHERE TENANT_ID =1006 and role='LEADER';
+-------+----------------------------+
| LS_ID | SCN_TO_TIMESTAMP(END_SCN)  |
+-------+----------------------------+
|     1 | 2024-05-22 17:39:33.258304 |
|  1001 | 2024-05-22 12:48:11.425191 |
+-------+----------------------------+
2 rows in set (0.00 sec)

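The time shown in the end_scn column still does not move.

To rule out the possibility that nothing had changed on the source side, I created a new table in the primary tenant's test database and inserted one row. Checking the standby tenant's replication time again, it is still unchanged.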
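
The comparison done by eye above can be written down as a small check: sample END_SCN per log stream twice and report the streams that did not advance. This is a sketch only; `stalled_log_streams` is a hypothetical helper, and the samples are typed in by hand from the first two GV$OB_LOG_STAT outputs rather than fetched from the database.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def stalled_log_streams(before: dict, after: dict) -> list:
    """Return the ls_ids whose end_scn timestamp did not advance between two samples."""
    stalled = []
    for ls_id, ts in after.items():
        if ls_id not in before:
            continue  # log stream was not present in the first sample
        if datetime.strptime(ts, FMT) <= datetime.strptime(before[ls_id], FMT):
            stalled.append(ls_id)
    return sorted(stalled)


# LS_ID -> SCN_TO_TIMESTAMP(END_SCN), copied from the first two GV$OB_LOG_STAT outputs above
first_sample = {1: "2024-05-22 17:35:32.897059", 1001: "2024-05-22 12:48:11.425191"}
second_sample = {1: "2024-05-22 17:39:33.258304", 1001: "2024-05-22 12:48:11.425191"}

print(stalled_log_streams(first_sample, second_sample))
```

For these two samples only log stream 1001 is reported, which matches the observation that log stream 1 was still moving at that point while the user log stream was not.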

We have received your question and will reply shortly.

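Hello, we see that the standby tenant's resource specification was adjusted in the afternoon. With network-based log synchronization, standby tenant sync is affected by network, CPU, memory, and other resources. Is the standby tenant's lag still growing? You can refer to the document linked here: provided that the bandwidth between the standby tenant and the restore source is sufficient, try adjusting the tenant-level parameter log_restore_concurrency to change the concurrency of log restore on the standby tenant and thereby improve its log synchronization performance. (Document link: OceanBase分布式数据库-海量数据 笔笔算数)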

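Update on the troubleshooting:

  • The disk hosting the standby cluster (tenant) is confirmed to be a SATA mechanical disk; the earlier information (SATA SSD) was wrong.

The log on the OB node hosting the standby tenant contains the following suspicious entries: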

[2024-05-23 21:03:12.601009] WDIAG [STORAGE] ~ObOccamTimeGuard (ob_occam_time_guard.h:270) [14182][T1_MDS_TABLE_ME][T1][YB420A0A0A29-00061917A310BB9E-0-0] [lt=16][errcode=0] cost too much time:ob_tablet_slog_helper.cpp:static int oceanbase::storage::ObTabletSlogHelper::write_update_tablet_slog(const share::ObLSID &, const common::ObTabletID &, const oceanbase::storage::ObMetaDiskAddr &), (*this=|threshold=10.00ms|start at 21:03:12.533|44=0us|52=16.96ms|56=50.53ms|total=67.59ms)

[2024-05-23 21:03:12.601152] INFO  [MDS.EVENT]try_advance_rec_scn (mds_table_base.cpp:234) [14175][T1_MDS_TABLE_ME][T1][YB420A0A0A29-00061917A310BB9C-0-0] [lt=34] ADVANCE_REC_SCN(key={tenant_id:1, ls_id:{id:1}, tablet_id:{id:382}}, event={alloc:null, timestamp:"2024-05-23 21:03:18.263276", event:"ADVANCE_REC_SCN", info_str:"{val:1716442247717646001, v:0} -> {val:4611686018427387903, v:0}", unit_id:255, key_str:"", writer_type:0, writer_id:0, seq_no:0, redo_scn:{val:18446744073709551615, v:3}, end_scn:{val:18446744073709551615, v:3}, trans_version:{val:18446744073709551615, v:3}, node_type:0, state:5})

[2024-05-23 21:03:12.601208] INFO  [MDS.EVENT]on_flush_ (mds_table_impl.ipp:1087) [14168][T1_MDS_TABLE_ME][T1][YB420A0A0A29-00061917A310BB9D-0-0] [lt=32] ON_FLUSH(key={tenant_id:1, ls_id:{id:1}, tablet_id:{id:60344}}, event={alloc:null, timestamp:"2024-05-23 21:03:10.757720", event:"ON_FLUSH", info_str:"flush_scn:{val:1716469381385051000, v:0}", unit_id:255, key_str:"", writer_type:0, writer_id:0, seq_no:0, redo_scn:{val:18446744073709551615, v:3}, end_scn:{val:18446744073709551615, v:3}, trans_version:{val:18446744073709551615, v:3}, node_type:0, state:5})

[2024-05-23 21:03:12.601622] INFO  [MDS.EVENT]operator() (mds_row.ipp:510) [14197][T1_MDS_TABLE_ME][T1][YB420A0A0A29-00061917A310BBA1-0-0] [lt=36] DUMP_NODE_FOR_FLUSH(key={tenant_id:1, ls_id:{id:1}, tablet_id:{id:215}}, event={alloc:null, timestamp:"2024-05-23 21:03:17.12203", event:"DUMP_NODE_FOR_FLUSH", info_str:"{compaction_type:"MAJOR_COMPACTION", medium_merge_reason:"TENANT_MAJOR", medium_snapshot:1716442214995732000, last_medium_snapshot:1716434472218307000, tenant_id:1, cluster_id:1711351486, medium_compat_version:4, data_version:17180000769, is_schema_changed:0, storage_schema:{this:0x7fd308cab5e0, storage_schema_version:2, version:0, is_use_bloomfilter:0, column_info_simplified:0, compat_mode:0, table_type:0, index_type:0, index_status:1, row_store_type:1, schema_version:1715736145754520, column_cnt:4, store_column_cnt:4, tablet_size:134217728, pctfree:10, block_size:16384, progressive_merge_round:1, master_key_id:18446744073709551615, compressor_type:1, encryption:"", encrypt_key:"", rowkey_cnt:1, rowkey_array:[{column_idx:18, meta_type:{type:"BIGINT", collation:"binary", coercibility:"NUMERIC"}, order:0}], column_array:[{meta_type:{type:"TIMESTAMP", collation:"binary", coercibility:"NUMERIC"}, is_column_stored_in_sstable:1, is_rowkey_column:0, is_generated_column:0, orig_default_value:{"NULL":"NULL"}}, {meta", unit_id:3, key_str:"", writer_type:3, writer_id:10000, seq_no:0, redo_scn:{val:1716442247979611002, v:0}, end_scn:{val:1716442247979611002, v:0}, trans_version:{val:1716442247979611002, v:0}, node_type:1, state:3})

[2024-05-23 21:21:26.076405] WDIAG [RECOVERY_LS] ~ObOccamFastTimeGuard (ob_occam_time_guard.h:403) [14532][T1006_RecLSSer][T1006][YB420A0A0A29-00061917A0042A72-0-0] [lt=87][errcode=0] cost too much time:ob_recovery_ls_service.cpp:int oceanbase::rootserver::ObRecoveryLSService::report_sys_ls_recovery_stat_in_trans_(const share::SCN &, const bool, common::ObMySQLTransaction &, const char *), (*this=|threshold=100.00ms|start at 21:21:25.898|1036=63us|1049=177.92ms|total=178.05ms)

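Checking the IO performance of the disk that holds the CLOG shows an average IO latency of roughly 3~10 ms.

To confirm whether IO is the cause, I dropped all other standby tenants on the standby cluster; these warning logs still appear.
I also created a new standby tenant at the same time, and it was created successfully.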

MySQL [oceanbase]> select tenant_id,ls_id,svr_ip,role,in_sync,scn_to_timestamp(begin_scn) begin_scn_ts,scn_to_timestamp(end_scn) end_scn_ts,scn_to_timestamp(max_scn) max_scn_ts from oceanbase.gv$ob_log_stat where tenant_id in (1006,1010);
+-----------+-------+-------------+--------+---------+----------------------------+----------------------------+----------------------------+
| tenant_id | ls_id | svr_ip      | role   | in_sync | begin_scn_ts               | end_scn_ts                 | max_scn_ts                 |
+-----------+-------+-------------+--------+---------+----------------------------+----------------------------+----------------------------+
|      1006 |     1 | 10.0.0.41 | LEADER | YES     | 2024-05-21 20:45:47.970381 | 2024-05-22 19:58:40.317789 | 2024-05-22 19:58:40.317789 |
|      1006 |  1001 | 10.0.0.41 | LEADER | YES     | 2024-05-22 03:33:03.317788 | 2024-05-22 12:48:11.425191 | 2024-05-22 12:48:11.425191 |
|      1010 |     1 | 10.0.0.41 | LEADER | YES     | 2024-05-23 02:03:04.114085 | 2024-05-23 21:26:13.000313 | 2024-05-23 21:26:13.000313 |
|      1010 |  1001 | 10.0.0.41 | LEADER | YES     | 2024-05-22 16:11:38.894014 | 2024-05-23 21:26:13.000313 | 2024-05-23 21:26:13.000313 |
+-----------+-------+-------------+--------+---------+----------------------------+----------------------------+----------------------------+
4 rows in set (0.00 sec)

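1010 is the new standby tenant. The end_scn_ts of log stream 1001 on the old standby tenant 1006 is still frozen at 2024-05-22 12:48:11.425191.

A network throughput bottleneck can be ruled out, because the primary tenant's write volume is very low. Poor disk performance may have some effect, but it did not prevent the new standby tenant from being created.

  • Check whether the primary tenant's transaction logs were overwritten (recycled) before they were shipped to the standby tenant.
    I am not sure how to confirm this, and I do not know whether the SQL below is adequate.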
MySQL [oceanbase]> select tenant_id,ls_id,svr_ip,role,in_sync,scn_to_timestamp(begin_scn) begin_scn_ts,scn_to_timestamp(end_scn) end_scn_ts,scn_to_timestamp(max_scn) max_scn_ts from oceanbase.gv$ob_log_stat where tenant_id in (1016);
+-----------+-------+-------------+----------+---------+----------------------------+----------------------------+----------------------------+
| tenant_id | ls_id | svr_ip      | role     | in_sync | begin_scn_ts               | end_scn_ts                 | max_scn_ts                 |
+-----------+-------+-------------+----------+---------+----------------------------+----------------------------+----------------------------+
|      1016 |     1 | 10.0.0.36 | FOLLOWER | YES     | 2024-05-17 00:27:15.752917 | 2024-05-23 21:11:20.116482 | 2024-05-23 21:11:20.116482 |
|      1016 |  1001 | 10.0.0.36 | FOLLOWER | YES     | 2024-05-17 09:44:10.709447 | 2024-05-23 21:11:20.116482 | 2024-05-23 21:11:20.116482 |
|      1016 |     1 | 10.0.0.37 | LEADER   | YES     | 2024-05-17 00:27:15.752917 | 2024-05-23 21:11:20.116482 | 2024-05-23 21:11:20.116482 |
|      1016 |  1001 | 10.0.0.37 | LEADER   | YES     | 2024-05-17 09:44:10.709447 | 2024-05-23 21:11:20.116482 | 2024-05-23 21:11:20.116482 |
|      1016 |     1 | 10.0.0.38 | FOLLOWER | YES     | 2024-05-17 00:27:15.752917 | 2024-05-23 21:11:20.116482 | 2024-05-23 21:11:20.116482 |
|      1016 |  1001 | 10.0.0.38 | FOLLOWER | YES     | 2024-05-17 09:44:10.709447 | 2024-05-23 21:11:20.116482 | 2024-05-23 21:11:20.116482 |
+-----------+-------+-------------+----------+---------+----------------------------+----------------------------+----------------------------+
6 rows in set (0.01 sec)

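Assuming begin_scn marks the oldest log still retained, the primary's begin_scn_ts values (2024-05-17) are earlier than the standby's stall point (2024-05-22 12:48:11), which suggests the logs the standby still needs have not been recycled. A sync performance problem can also be ruled out: this tenant is not syncing slowly, its sync has stopped entirely.

I also restarted the observer node of the standby cluster, with no change.

If anything the primary tenant did stands out, it is that major compactions (merges) were run on it. The standby tenant's major compaction currently also appears to be stuck.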

MySQL [oceanbase]> select tenant_id, frozen_time, scn_to_timestamp(global_broadcast_scn) global_broadcast_ts, last_finish_time, start_time,status,is_error,is_suspended,info from oceanbase.cdb_ob_major_compaction where tenant_id=1006 ;
+-----------+----------------------------+----------------------------+----------------------------+----------------------------+--------+----------+--------------+------+
| tenant_id | frozen_time                | global_broadcast_ts        | last_finish_time           | start_time                 | status | is_error | is_suspended | info |
+-----------+----------------------------+----------------------------+----------------------------+----------------------------+--------+----------+--------------+------+
|      1006 | 2024-05-22 02:00:04.031659 | 2024-05-22 02:00:04.031659 | 1970-01-01 08:00:00.000000 | 1970-01-01 08:00:00.000000 | IDLE   | NO       | NO           |      |
+-----------+----------------------------+----------------------------+----------------------------+----------------------------+--------+----------+--------------+------+
1 row in set (0.01 sec)

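This standby tenant's last_finish_time (1970-01-01 08:00:00) is not a real finish time, which indicates that the major compaction never completed.
I have also found that a 4.2 standby tenant cannot initiate a major compaction itself; it follows the primary tenant's compactions. The primary tenant has initiated several compactions, but none of them is progressing on the standby tenant.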
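
The value 1970-01-01 08:00:00.000000 is what a zero timestamp looks like when rendered in UTC+8, so it most likely means "never set". The sketch below only demonstrates that rendering. It assumes the column is stored as microseconds since the Unix epoch and that the session time zone is UTC+8; neither is confirmed by the thread, and `render_usec_timestamp` is a name invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the session time zone is UTC+8, as the 08:00:00 offset suggests.
UTC8 = timezone(timedelta(hours=8))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def render_usec_timestamp(usec: int) -> str:
    """Format microseconds since the Unix epoch in UTC+8, in the layout used by the query output."""
    moment = (EPOCH + timedelta(microseconds=usec)).astimezone(UTC8)
    return moment.strftime("%Y-%m-%d %H:%M:%S.%f")


print(render_usec_timestamp(0))
```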

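It is unclear whether the standby tenant's compaction is blocking recovery, recovery is blocking compaction, or something else is blocking both. The disk on the standby tenant's node (SATA) is indeed poor, but other standby tenants with heavier write load than this one were created successfully and have lag within a few seconds. A standby tenant created after this problem tenant also succeeded.

So I suspect that some operations performed on this problem tenant's primary tenant, combined with the standby tenant's poor disk performance, together produced this situation.

The log is attached below. To reduce the IO impact I set syslog_level=ERROR, but WDIAG logs are still being written.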
observer.log.gz (8.3 MB)

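For this primary tenant, I created another new standby tenant on the standby cluster (so called for convenience), and it was created successfully.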

MySQL [oceanbase]> select tenant_id, tenant_name, tenant_type, primary_zone,tenant_role, scn_to_timestamp(sync_scn) sync_ts, scn_to_timestamp(replayable_scn) replayable_ts, scn_to_timestamp(readable_scn) readable_ts, scn_to_timestamp(recovery_until_scn) recovery_until_ts, log_mode,max_ls_id from oceanbase.dba_ob_tenants where tenant_type='USER' ;
+-----------+-------------+-------------+--------------+-------------+----------------------------+----------------------------+----------------------------+----------------------------+--------------+-----------+
| tenant_id | tenant_name | tenant_type | primary_zone | tenant_role | sync_ts                    | replayable_ts              | readable_ts                | recovery_until_ts          | log_mode     | max_ls_id |
+-----------+-------------+-------------+--------------+-------------+----------------------------+----------------------------+----------------------------+----------------------------+--------------+-----------+
|      1006 | ten005      | USER        | RANDOM       | STANDBY     | 2024-05-22 12:48:11.425191 | 2024-05-22 12:48:11.425191 | 2024-05-22 12:48:11.425191 | 2116-02-21 07:53:38.427387 | NOARCHIVELOG |      1001 |
|      1010 | ten002      | USER        | RANDOM       | STANDBY     | 2024-05-24 08:39:16.859044 | 2024-05-24 08:39:16.859044 | 2024-05-24 08:39:16.357898 | 2116-02-21 07:53:38.427387 | NOARCHIVELOG |      1001 |
|      1012 | ten005b     | USER        | RANDOM       | STANDBY     | 2024-05-24 08:37:47.065172 | 2024-05-24 08:37:47.065172 | 2024-05-24 08:37:47.065172 | 2116-02-21 07:53:38.427387 | NOARCHIVELOG |      1001 |
+-----------+-------------+-------------+--------------+-------------+----------------------------+----------------------------+----------------------------+----------------------------+--------------+-----------+
3 rows in set (0.09 sec)

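Above, ten005 is the old problem standby tenant and ten005b is the new standby tenant. So the environment outside the process should not be the reason the problem standby tenant is not syncing.
This is probably hard to investigate further, and it does not need to be. I would just like to confirm the following two how-to questions:

    1. How can I confirm whether the primary tenant's log stream had a gap (GAP) while being shipped to the standby tenant? How large is the gap?
    2. If the replication is interrupted on the standby tenant side, how can I view the recovery details of the standby tenant's partitions or tablets?

The following two views are empty on the standby tenant.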

MySQL [oceanbase]> select * from CDB_OB_RECOVER_TABLE_JOBS;
Empty set (0.01 sec)

MySQL [oceanbase]> select * from CDB_OB_RECOVER_TABLE_JOB_HISTORY ;
Empty set (0.01 sec)

You can check whether this matches the symptoms of Problem 4 (问题四) in the document below.
https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000818726#12-title-问题四:备租户同步位点推进卡住

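None of the earlier troubleshooting methods revealed the cause. Later log analysis suggests that it is a bug in 4.2.2.1.

Analysis

After creating and restoring the standby tenant more than ten times, I found that sometimes the standby tenant succeeds and its lag does not grow, and sometimes its lag keeps growing. For the failing standby tenants, the log repeatedly prints the following warnings.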

[2024-06-18 12:18:59.113356] WDIAG [SHARE] parse_service_attr_from_str (ob_log_restore_struct.cpp:119) [10924][T1016_LogRessvr][T1016][YB420A0A0A29-00061AA923BA85DD-0-0] [lt=47][errcode=-4007] fail to parse service attr str(token="V9X")
[2024-06-18 12:18:59.113372] WDIAG [CLOG] add_service_source_ (ob_remote_location_adaptor.cpp:246) [10924][T1016_LogRessvr][T1016][YB420A0A0A29-00061AA923BA85DD-0-0] [lt=14][errcode=-4007] parse service attr failed(value=IP_LIST=10.0.0.36:2881;10.0.0.37:2881;10.0.0.38:2881,USER=STANDBYRO@ten005,PASSWORD=5t)Ufw,V9X,TENANT_ID=1016,CLUSTER_ID=1711351485,COMPATIBILITY_MODE=MYSQL,IS_ENCRYPTED=false)
[2024-06-18 12:18:59.113390] WDIAG [CLOG] update_upstream (ob_remote_location_adaptor.cpp:96) [10924][T1016_LogRessvr][T1016][YB420A0A0A29-00061AA923BA85DD-0-0] [lt=12][errcode=-4007] do update failed(source_exist=true, source={tenant_id:1016, id:1, until_scn:{val:4611686018427387903, v:0}, type:1, value:"IP_LIST=10.0.0.36:2881;10.0.0.37:2881;10.0.0.38:2881,USER=STANDBYRO@ten005,PASSWORD=5t)Ufw,V9X,TENANT_ID=1016,CLUSTER_ID=1711351485,COMPATIBILITY_MODE=MYSQL,IS_ENCRYPTED=false"})
[2024-06-18 12:18:59.113415] WDIAG [CLOG] do_thread_task_ (ob_log_restore_service.cpp:189) [10924][T1016_LogRessvr][T1016][YB420A0A0A29-00061AA923BA85DD-0-0] [lt=23][errcode=-4007] update_upstream_ failed

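Across the repeated failures, the logs all look like this. The odd part is the error: token="V9X"
This V9X is part of the configured password: 5t)Ufw,V9X

My guess is that the comma in this random password makes the code fail when it parses the log source.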
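
The guess can be illustrated with a toy parser. This is not OceanBase's actual code: `naive_parse_service_attr` is a hypothetical stand-in that assumes the attribute string is split on commas and that every token must then look like KEY=VALUE. Under that assumption, the value string from the log fails on exactly the token the log reports.

```python
def naive_parse_service_attr(value: str) -> dict:
    """Split on ',' and then on the first '='; raise if a token is not KEY=VALUE."""
    attrs = {}
    for token in value.split(","):
        key, sep, val = token.partition("=")
        if not sep:
            # Mirrors the wording of the WDIAG message; the real parser may differ.
            raise ValueError(f"fail to parse service attr str(token={token!r})")
        attrs[key] = val
    return attrs


# value string copied from the add_service_source_ log line above
logged_value = (
    "IP_LIST=10.0.0.36:2881;10.0.0.37:2881;10.0.0.38:2881,USER=STANDBYRO@ten005,"
    "PASSWORD=5t)Ufw,V9X,TENANT_ID=1016,CLUSTER_ID=1711351485,"
    "COMPATIBILITY_MODE=MYSQL,IS_ENCRYPTED=false"
)

try:
    naive_parse_service_attr(logged_value)
except ValueError as err:
    print(err)
```

The comma inside the password cuts it into `PASSWORD=5t)Ufw` and a stray `V9X`, which has no `=` and is rejected. Whether a standby tenant hits the problem would then depend only on whether its password happens to contain a comma.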

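Solution

Change the password of the standbyro user in the primary tenant and avoid random strings that contain special characters such as commas. Then set the log source again in the standby tenant.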

ALTER SYSTEM SET LOG_RESTORE_SOURCE = 'SERVICE=10.0.0.36:2881;10.0.0.37:2881;10.0.0.38:2881 USER=STANDBYRO@ten005 PASSWORD=stb@RO#123' TENANT = ten005b;

The sync lag disappeared quickly.
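
A pre-check along these lines could catch such passwords before they are written into LOG_RESTORE_SOURCE. It is a sketch under stated assumptions: `risky_chars_for_log_source` is a hypothetical helper, the comma is the only character this thread actually shows to be a problem, and whitespace is included purely as a guess because the ALTER SYSTEM statement separates attributes with spaces.

```python
def risky_chars_for_log_source(password: str) -> list:
    """Return the characters in `password` that may clash with the attribute delimiters."""
    # ',' is the separator that broke parsing in the logs above.
    # Whitespace is an assumption: the ALTER SYSTEM statement separates attributes with spaces.
    risky = {",", " ", "\t"}
    return sorted({ch for ch in password if ch in risky})


print(risky_chars_for_log_source("5t)Ufw,V9X"))  # the password that failed
print(risky_chars_for_log_source("stb@RO#123"))  # the replacement password
```

The original password is flagged for its comma, while the replacement, which still contains `@` and `#`, passes.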
