ob-operator restarted during an OB scale-out and did not resume the scale-out

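【Environment】Production or test environment
【Component (OB or other)】ob, ob-operator
【Versions】ob 4.4.2, ob-operator 2.3.4
【Problem description】ob-operator restarted while it was horizontally scaling out the number of replicas in each zone. After the restart, it did not resume the scale-out.
【Steps to reproduce】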

  1. Deploy an OB cluster by following this document:
    https://www.oceanbase.com/docs/community-ob-operator-doc-1000000005169787
    Use the following zone and observer counts:
    topology:
    - zone: zone1
      replica: 1
    - zone: zone2
      replica: 1
    - zone: zone3
      replica: 1

Check the observer CRs:
kubectl get observers.oceanbase.oceanbase.com -n test-oceanbase
NAME PODIP STATUS AGE CLUSTERNAME ZONENAME
obcluster-1-zone1-6p7k78 10.244.151.187 running 2d21h obcluster zone1
obcluster-1-zone2-m7fw9d 10.244.157.38 running 2d19h obcluster zone2
obcluster-1-zone3-rq8sb8 10.244.160.31 running 2d19h obcluster zone3

  2. After the deployment completes, change the number of observers in each zone to trigger a scale-out:
    topology:
    - zone: zone1
      replica: 2
    - zone: zone2
      replica: 2
    - zone: zone3
      replica: 2
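
One way to apply the replica change above without editing the manifest by hand is a JSON Patch against the OBCluster resource. The post does not say how the change was actually applied, so this is only a sketch: the helper name `build_replica_patch` is hypothetical, `<obcluster-cr-name>` is a placeholder for the real resource name, and the patch assumes the zones sit at indexes 0..N-1 of `spec.topology`, as in the snippet above.

```shell
# Hypothetical helper: build a JSON Patch that sets spec.topology[i].replica
# for the first $1 zones to the value $2. It only prints the patch.
build_replica_patch() {
  zones="$1"
  replica="$2"
  i=0
  patch=""
  while [ "$i" -lt "$zones" ]; do
    # Separate patch operations with commas after the first one.
    if [ -n "$patch" ]; then
      patch="$patch,"
    fi
    patch="$patch{\"op\":\"replace\",\"path\":\"/spec/topology/$i/replica\",\"value\":$replica}"
    i=$((i + 1))
  done
  printf '[%s]\n' "$patch"
}

# Usage against a live cluster (resource name is a placeholder, not from the post):
#   kubectl patch obclusters.oceanbase.oceanbase.com <obcluster-cr-name> \
#     -n test-oceanbase --type json -p "$(build_replica_patch 3 2)"
```

Printing the patch first makes it easy to review before it is sent to the API server.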

Check the CRs while the scale-out is in progress:
kubectl get observers.oceanbase.oceanbase.com -n test-oceanbase
NAME PODIP STATUS AGE CLUSTERNAME ZONENAME
obcluster-1-zone1-5p84x4 new 38s obcluster zone1
obcluster-1-zone1-6p7k78 10.244.151.187 running 2d21h obcluster zone1
obcluster-1-zone2-m7fw9d 10.244.157.38 running 2d19h obcluster zone2
obcluster-1-zone2-vvqjp5 new 38s obcluster zone2
obcluster-1-zone3-k7vp48 new 38s obcluster zone3
obcluster-1-zone3-rq8sb8 10.244.160.31 running 2d19h obcluster zone3

  3. Restart ob-operator while the scale-out is in progress:
    kubectl delete pod -n oceanbase-system oceanbase-controller-manager-7676569996-2ff8c

  4. Check the scale-out result. One of the new observers is in the Failed state (in practice, all of the new observers may end up Failed):
    kubectl get observers.oceanbase.oceanbase.com -n test-oceanbase
    NAME PODIP STATUS AGE CLUSTERNAME ZONENAME
    obcluster-1-zone1-5p84x4 Failed 2m47s obcluster zone1
    obcluster-1-zone1-6p7k78 10.244.151.187 running 2d21h obcluster zone1
    obcluster-1-zone2-m7fw9d 10.244.157.38 running 2d19h obcluster zone2
    obcluster-1-zone2-vvqjp5 10.244.157.61 running 2m47s obcluster zone2
    obcluster-1-zone3-k7vp48 10.244.160.42 running 2m47s obcluster zone3
    obcluster-1-zone3-rq8sb8 10.244.160.31 running 2d19h obcluster zone3
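
To pick the failed observers out of a listing like the one above, the table can be filtered by its STATUS column. This is a minimal sketch that assumes the six-column layout shown in this post; the function name `list_failed_observers` is hypothetical. Because PODIP is empty for observers that never received an address, STATUS is located by counting from the right rather than from the left.

```shell
# Print the names of observers whose STATUS is "failed" (case-insensitive).
# Input on stdin: the plain-text table printed by `kubectl get observers...`.
# Columns: NAME [PODIP] STATUS AGE CLUSTERNAME ZONENAME, so STATUS is field NF-3
# whether or not PODIP is present.
list_failed_observers() {
  awk 'NR > 1 && NF >= 5 && tolower($(NF-3)) == "failed" { print $1 }'
}

# Usage against a live cluster (namespace taken from the post):
#   kubectl get observers.oceanbase.oceanbase.com -n test-oceanbase | list_failed_observers
```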

【Attachments and logs】

operator log before the restart:
ob-operator重启前.log (26.0 KB)
operator log after the restart:
ob-operator重启后.log (39.1 KB)


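This behavior matches what the code is expected to do. After ob-operator restarts, it can no longer find the task in memory, so it marks the task as failed in the observer's custom resource.
There are two ways to resolve it:

  1. Retry the task as described in https://oceanbase.github.io/ob-operator/docs/manual/appendix/FAQ#retry.
  2. Provided doing so has no impact, delete the observer that failed to be created; the corresponding pod will then be recreated.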
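
For the second option, the deletion is an ordinary `kubectl delete` on the observer resource. The sketch below uses a hypothetical helper, `delete_commands_for`, that only prints one delete command per observer name it reads, so the commands can be reviewed before anything is removed. The namespace and observer name in the usage example are the ones from this thread.

```shell
# Hypothetical helper: for each observer name read from stdin, print the
# `kubectl delete` command that would remove it from namespace $1.
# Nothing is deleted here; the commands are only printed.
delete_commands_for() {
  ns="$1"
  while IFS= read -r name; do
    # Skip blank lines so no malformed command is emitted.
    if [ -n "$name" ]; then
      printf 'kubectl delete observers.oceanbase.oceanbase.com %s -n %s\n' "$name" "$ns"
    fi
  done
  return 0
}

# Usage: review the printed commands, then pipe them to `sh` to run them.
#   echo obcluster-1-zone1-5p84x4 | delete_commands_for test-oceanbase
#   echo obcluster-1-zone1-5p84x4 | delete_commands_for test-oceanbase | sh
```

Only delete observers that are confirmed to be the newly created, failed ones; the running observers hold data and must be left alone.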