【 Environment 】Production or test environment
【 OB or other component 】
After a node triggered permanent offline on OB 4.x, I deleted the clog, slog, and sstable files on that node, kept the directory structure, and restarted the process. In this state the log stream sync status reports errors, and the partition count on this replica does not match the other replicas. I remember that 3.x would backfill the data automatically; does 4.x no longer have this capability?
【 Version 】
4.2.1
Normally, as long as the permanent-offline window has not been exceeded, the node will re-run clog sync detection after being brought back up. For disk-replacement or server-replacement scenarios, where the original data is discarded, the recommended path is the delete server / add server flow or the node replacement procedure, letting the newly joined node backfill the data.
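For reference, the delete/add server flow mentioned above can be sketched roughly as follows. This is only a sketch: the IP, port, and zone values are placeholders, and the exact syntax should be verified against the documentation for your version.

```sql
-- Optionally widen the permanent-offline window before planned maintenance,
-- so a short outage does not trigger replica removal (default is 3600s):
ALTER SYSTEM SET server_permanent_offline_time = '7200s';

-- When the original data on a node is discarded (disk/server replacement):
-- 1) remove the old node record from the cluster:
ALTER SYSTEM DELETE SERVER 'xx.xx.xx.105:2882' ZONE 'zone3';

-- 2) start a fresh observer process on the machine, then add it back;
--    the cluster will replicate data onto the newly joined node:
ALTER SYSTEM ADD SERVER 'xx.xx.xx.105:2882' ZONE 'zone3';
```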
Deleting files and then rejoining to backfill is not part of the standard official operations workflow.
I lowered the permanent-offline time and confirmed from __all_virtual_event_history that permanent offline was triggered. What I mainly want to know is why the partition count is not backfilled in this scenario (clear the files, restart the process), and how this mechanism differs from 3.x.
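As an illustration, the permanent-offline event can also be confirmed from the sys tenant with a query along these lines (the view and column names match the event-history output pasted later in this thread):

```sql
-- Look for the 'permanent_offline' / 'lease_expire' events for the node:
SELECT TIMESTAMP, MODULE, EVENT, NAME1, VALUE1
FROM oceanbase.DBA_OB_ROOTSERVICE_EVENT_HISTORY
WHERE EVENT IN ('permanent_offline', 'lease_expire')
ORDER BY TIMESTAMP DESC
LIMIT 10;
```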
Once the permanent-offline window is exceeded, the node is kicked out of the cluster; even if the process is brought back up it will not rejoin the cluster, and there is no automatic data backfill.
The node's record was not removed from __all_server, and after being brought back up it did rejoin the cluster, but the data was indeed not backfilled.
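Whether the node record still exists can be checked from the sys tenant. A sketch, assuming the 4.x dictionary view `DBA_OB_SERVERS` (column names are from memory and may differ slightly by version):

```sql
-- If the server still appears here (e.g. STATUS = 'INACTIVE'), a restarted
-- observer with a matching ip:port will rejoin the cluster.
SELECT SVR_IP, SVR_PORT, ZONE, STATUS, STOP_TIME, LAST_OFFLINE_TIME
FROM oceanbase.DBA_OB_SERVERS;
```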
As described, that does not match expectations. Which version are you on? I'll verify it internally.
4.2.1.6
OK, I'll run a test and take a look.
Thanks.
Hello, the conclusions above were partly incorrect; please go by the following test results:
1) When the permanent-offline time is exceeded, the cluster does not delete the node. What it does is remove the node's replicas and switch the leader.
obclient [oceanbase]> select * from DBA_OB_ROOTSERVICE_EVENT_HISTORY order by TIMESTAMP desc limit 100;
+----------------------------+-------------------+--------------------------------+------------+-----------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+-----------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------------------------------------------------------------------------------------------------------------+----------------------+-----------------------------------+---------------------------------------------------------+----------------+-------------+
| TIMESTAMP | MODULE | EVENT | NAME1 | VALUE1 | NAME2 | VALUE2 | NAME3 | VALUE3 | NAME4 | VALUE4 | NAME5 | VALUE5 | NAME6 | VALUE6 | EXTRA_INFO | RS_SVR_IP | RS_SVR_PORT |
+----------------------------+-------------------+--------------------------------+------------+-----------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+-----------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------------------------------------------------------------------------------------------------------------+----------------------+-----------------------------------+---------------------------------------------------------+----------------+-------------+
| 2024-05-24 11:35:26.598202 | disaster_recovery | finish_remove_ls_paxos_replica | tenant_id | 1002 | ls_id | 1003 | task_id | YB420BA1CC62-0006192AA91D54CF-0-0 | leader | "xx.xx.xx.99:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | execute_result | ret:0, OB_SUCCESS; elapsed:54311; | remove permanent offline replica | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.589069 | disaster_recovery | finish_remove_ls_paxos_replica | tenant_id | 1002 | ls_id | 1002 | task_id | YB420BA1CC62-0006192AA91D54CE-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | execute_result | ret:0, OB_SUCCESS; elapsed:54182; | remove permanent offline replica | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.581312 | disaster_recovery | finish_remove_ls_paxos_replica | tenant_id | 1002 | ls_id | 1001 | task_id | YB420BA1CC62-0006192AA91D54CD-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | execute_result | ret:0, OB_SUCCESS; elapsed:55138; | remove permanent offline replica | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.573853 | disaster_recovery | finish_remove_ls_paxos_replica | tenant_id | 1002 | ls_id | 1 | task_id | YB420BA1CC62-0006192AA91D54CC-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | execute_result | ret:0, OB_SUCCESS; elapsed:57380; | remove permanent offline replica | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.563089 | disaster_recovery | finish_remove_ls_paxos_replica | tenant_id | 1001 | ls_id | 1 | task_id | YB420BA1CC62-0006192AA91D54CB-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | execute_result | ret:0, OB_SUCCESS; elapsed:57380; | remove permanent offline replica | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.552679 | disaster_recovery | finish_remove_ls_paxos_replica | tenant_id | 1 | ls_id | 1 | task_id | YB420BA1CC62-0006192AA91D54CA-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | execute_result | ret:0, OB_SUCCESS; elapsed:57884; | remove permanent offline replica | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.546558 | disaster_recovery | disaster_recovery_start | start_time | 1716521726546557 | | | | | | | | | | | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.543917 | disaster_recovery | start_remove_ls_paxos_replica | tenant_id | 1002 | ls_id | 1003 | task_id | YB420BA1CC62-0006192AA91D54CF-0-0 | leader | "xx.xx.xx.99:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | comment | remove permanent offline replica | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.534903 | disaster_recovery | start_remove_ls_paxos_replica | tenant_id | 1002 | ls_id | 1002 | task_id | YB420BA1CC62-0006192AA91D54CE-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | comment | remove permanent offline replica | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.526190 | disaster_recovery | start_remove_ls_paxos_replica | tenant_id | 1002 | ls_id | 1001 | task_id | YB420BA1CC62-0006192AA91D54CD-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | comment | remove permanent offline replica | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.516492 | disaster_recovery | start_remove_ls_paxos_replica | tenant_id | 1002 | ls_id | 1 | task_id | YB420BA1CC62-0006192AA91D54CC-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | comment | remove permanent offline replica | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.505729 | disaster_recovery | start_remove_ls_paxos_replica | tenant_id | 1001 | ls_id | 1 | task_id | YB420BA1CC62-0006192AA91D54CB-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | comment | remove permanent offline replica | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:26.494841 | disaster_recovery | start_remove_ls_paxos_replica | tenant_id | 1 | ls_id | 1 | task_id | YB420BA1CC62-0006192AA91D54CA-0-0 | leader | "xx.xx.xx.98:2882" | remove_server | {server:"xx.xx.xx.105:2882", timestamp:1, flag:0, replica_type:0, region:"default_region", memstore_percent:100} | comment | remove permanent offline replica | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:35:17.595611 | server | permanent_offline | server | "xx.xx.xx.105:2882" | | | | | | | | | | | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:34:27.538115 | root_service | admin_set_config | ret | 0 | arg | {items:[{name:"rootservice_list", value:"xx.xx.xx.98:2882:2881;xx.xx.xx.99:2882:2881", comment:"", zone:"", server:"0.0.0.0:0", tenant_name:"", exec_tenant_id:1, tenant_ids:[], want_to_set_tenant_config:false}], is_inner:false} | | | | | | | | | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:34:27.528222 | server | last_offline_time set | server | "xx.xx.xx.105:2882" | | | | | | | | | | | | xx.xx.xx.98 | 2882 |
| 2024-05-24 11:34:27.521227 | server | lease_expire | server | "xx.xx.xx.105:2882" | | | | | | | | | | | | xx.xx.xx.98 | 2882 |
+----------------------------+-------------------+--------------------------------+------------+-----------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+-----------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------------------------------------------------------------------------------------------------------------+----------------------+-----------------------------------+---------------------------------------------------------+----------------+-------------+
100 rows in set (0.001 sec)
2) After the node is brought back up, its replicas are backfilled automatically. I also tested deleting the data files and the clog directory, and they were backfilled as well.
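To verify backfill, one can compare the log-stream replica distribution across servers. A hypothetical check from the sys tenant, assuming the 4.x view `CDB_OB_LS_LOCATIONS`:

```sql
-- Count log-stream replicas per server; after backfill the restarted node
-- should hold the same log streams as its peers in the zone.
SELECT SVR_IP, SVR_PORT, COUNT(*) AS ls_replica_cnt
FROM oceanbase.CDB_OB_LS_LOCATIONS
GROUP BY SVR_IP, SVR_PORT;
```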
So 4.x does backfill automatically? My test showed no backfill even after a long wait. There is one more behavior I forgot to mention: with the cluster healthy, I first deleted all directories under clog on one machine; the leader switched away immediately, and only after that did I let the machine trigger permanent offline. Later I recreated the directories under clog and restarted the process, and found the data was not backfilled.
What I suspect now is that I created too few directories; I only created the log stream ID directories that were missing in that zone. Let me try to reproduce it again. Thanks, 秃蛙老师.
Did the data get backfilled in the end on your side?
Yes, it backfilled.
I have a test cluster in which one node's network cable got unplugged. More than a day had passed, and the logs already showed lease-expiration and permanent-offline messages. Why did the node rejoin automatically a short while after the cable was plugged back in? Version 4.3.2.0.
Probably because the server's record was never deleted from the cluster. If __all_server still contains the machine's info, it is normal for the machine to rejoin the cluster after it comes back up.
I was about to go look up how to recover it, and then it recovered by itself. It felt strange that it had already reported permanent offline and could still come back.
Triggering permanent offline only makes the replicas on that machine unavailable and migrates them elsewhere; it does not mean the machine cannot rejoin the cluster.