OceanBase becomes unreachable after the IP inside the Docker container changes

【Environment】Test environment
【OB or other component】oceanbase/oceanbase-ce
【Version】4.3.5.0-100000202024123117
【Problem description】
Because available server resources are tight, I deployed an OceanBase instance with docker compose for development and testing.

However, whenever the IP inside the Docker container changes (for example after a container restart or after moving the container to a different network), OceanBase becomes unreachable.

According to the hints in the following post, OceanBase does not support changing the server IP; an IP change will trigger a failure:

In observer.conf.bin I did indeed find that parameters such as all_server_list did not follow the container's IP change, which is what causes this failure.
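
For reference, the comparison can be made roughly like this (the host-side path follows the volume mount in the attached docker-compose.yml; reading the binary config with grep -a is only a rough check of my own, not an official tool):

# current IP inside the container
docker compose exec oceanbase ip -4 addr show eth0
# IP-related parameters recorded in the config; -a treats the binary file as text
grep -a -E 'local_ip|all_server_list|rootservice_list' ./root/ob/etc/observer.conf.bin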

My questions:

1. When this failure occurs, is forcing the container to use the container IP it received at first initialization the only way to restore service? (A possible IP-pinning sketch follows this list.)

2. Since container IPs can change freely, should I assume that even in a test environment there is a risk of losing data when running OceanBase in Docker?

3. Will support for changing the IP of a single-node OceanBase deployment be considered in the future?

4. I see that the official Kubernetes examples use the oceanbase-cloud-native image. Can I assume that Kubernetes does not suffer from this kind of failure caused by container IP changes? (That said, given the operational complexity of Kubernetes, we will not move to it for now.)
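
Regarding question 1, the workaround I am considering is to pin the container to the address it received at first initialization, so that restarts and network re-attachments reuse it. A minimal sketch, assuming the external network was created with a matching subnet (for example 172.28.0.0/16, as in the example under step 1 below) and with 172.28.0.10 standing in for the originally assigned IP; both values are placeholders and this has not been verified against this image:

services:
  oceanbase:
    networks:
      database:
        ipv4_address: 172.28.0.10   # placeholder for the IP recorded at first initialization
networks:
  database:
    external: true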

【Steps to reproduce】

1. Create two separate Docker networks:

docker network create -d bridge database
docker network create -d bridge database-test1
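
As an aside, if the networks are created with explicit subnets, the assigned addresses become predictable and the pinning sketch above can target a known range. The subnets below are arbitrary examples and were not part of my original reproduction:

docker network create -d bridge --subnet 172.28.0.0/16 database
docker network create -d bridge --subnet 172.29.0.0/16 database-test1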

2. Following the official document on deploying OceanBase Database in a container environment (https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000002013494 ), write a docker-compose.yml by hand (see the attachments and logs below); at this point the networks configuration points to database.

3. Start the container:

docker compose up -d
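
Optionally, wait for bootstrap to finish before continuing; as far as I know this image prints "boot success!" when initialization completes, so something like the following can be used to watch for it:

docker compose logs -f oceanbase | grep -m 1 "boot success!"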

4. After it starts successfully, stop the container:

docker compose down

5. Manually change the networks configuration in docker-compose.yml to point to database-test1.
This is a shortcut to reproduce the various situations (an unexpected container IP issue, Docker being reconfigured, and so on) in which the container comes back up with a different IP after a restart.

    networks:
      - database-test1
networks:
  database-test1:
    external: true

6. Restart the container:

docker compose up -d
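
At this point the container's new address can be compared with the one still recorded in the config; the inspect template below prints the current address, and the grep reads the mounted config on the host:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker compose ps -q oceanbase)
grep -a all_server_list ./root/ob/etc/observer.conf.bin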

Expected behavior:

After steps 5 and 6, OceanBase can still be connected to successfully.

Faulty behavior:

After steps 5 and 6, OceanBase cannot be connected to. DBeaver reports "Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.", and connecting with obclient inside the container fails as well:

term@dev:~/docker-compose/oceanbase$ docker compose exec oceanbase /bin/bash
[root@7e83c781626d ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if52: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet [changes with the Docker container IP]/16 brd xxx.xxx.255.255 scope global eth0
       valid_lft forever preferred_lft forever
[root@7e83c781626d ~]# obclient -h127.0.0.1 -uroot@sys -A -Doceanbase -P2881 -p
Enter password: 
ERROR 2002 (HY000): Can't connect to OceanBase server on '127.0.0.1' (115)
[root@7e83c781626d ~]# 
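
For completeness, a few checks that can be run inside the container to confirm the picture (tool availability may vary with the image; paths follow the volume mounts above):

ps -ef | grep [o]bserver                                   # is the observer process still running?
ss -lntp | grep -E '2881|2882'                             # which addresses is it listening on?
grep -a -E 'local_ip|all_server_list|rootservice_list' /root/ob/etc/observer.conf.bin
grep -m 3 'self is not in memberlist' /root/ob/log/election.log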

【Attachments and logs】
docker-compose.yml (before the failure):

services:
  oceanbase:
    image: oceanbase/oceanbase-ce:4.3.5.0-100000202024123117
    env_file: ./oceanbase.env
    volumes:
      - ./root/ob:/root/ob:rw
      - ./root/.obd/cluster:/root/.obd/cluster:rw
    ports:
      - 0.0.0.0:2881:2881
    networks:
      - database
networks:
  database:
    external: true

oceanbase.env:

MODE=MINI
EXIT_WHILE_ERROR=true
OB_CLUSTER_NAME=obcluster01
OB_TENANT_NAME=mysql_tena
OB_SYS_PASSWORD=112233445566
OB_TENANT_PASSWORD=112233445566
OB_TENANT_MIN_CPU=6
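
For context, these values map to client connections as follows once the cluster is healthy (host side, through the published 2881 port), assuming the image wires OB_SYS_PASSWORD and OB_TENANT_PASSWORD to the root users of the sys and mysql_tena tenants as its documentation describes:

obclient -h127.0.0.1 -P2881 -uroot@sys -p112233445566
obclient -h127.0.0.1 -P2881 -uroot@mysql_tena -p112233445566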

./root/ob/etc/observer.conf.bin under the docker-compose.yml directory:

observer_id=1
local_ip=[changes with the Docker container IP]
_bloom_filter_ratio=3
all_server_list=[IP from the first container start, no longer valid]:2882
__min_full_resource_pool_memory=2147483648
log_disk_size=5G
min_observer_version=4.3.5.0
enable_syslog_recycle=True
enable_syslog_wf=False
max_syslog_file_count=4
syslog_level=WDIAG
cluster_id=1
cluster=obcluster01
rootservice_list=[IP from the first container start, no longer valid]:2882:2881

./root/ob/log/observer.log under the docker-compose.yml directory:

[2025-02-25 16:00:54.049204] WDIAG [RPC] post (ob_poc_rpc_proxy.h:235) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=4][errcode=-4122] check_blacklist failed(addr="[IP from the first container start, no longer valid]:2882")
[2025-02-25 16:00:54.049212] WDIAG [RPC] call_rpc (ob_async_rpc_proxy.h:358) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=4][errcode=-4122] call rpc func failed(server="[IP from the first container start, no longer valid]:2882", timeout=2000000, arg={addr:"[IP from the first container start, no longer valid]:2882", cluster_id:1}, cluster_id=1, tenant_id=1, group_id=9, ret=-4122, ret="OB_RPC_POST_ERROR")
[2025-02-25 16:00:54.049217] WDIAG [RPC] call (ob_async_rpc_proxy.h:292) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4122] call rpc func failed(server="[IP from the first container start, no longer valid]:2882", timeout=2000000, cluster_id=1, tenant_id=1, arg={addr:"[IP from the first container start, no longer valid]:2882", cluster_id:1}, group_id=9, ret=-4122, ret="OB_RPC_POST_ERROR")
[2025-02-25 16:00:54.049228] WDIAG [SHARE.PT] find_leader (ob_ls_info.cpp:848) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=0][errcode=-4018] fail to get leader replica(ret=-4018, ret="OB_ENTRY_NOT_EXIST", *this={tenant_id:1, ls_id:{id:1}, replicas:[]}, replica count=0)
[2025-02-25 16:00:54.049234] WDIAG [SHARE.PT] find_leader (ob_ls_info.cpp:848) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4018] fail to get leader replica(ret=-4018, ret="OB_ENTRY_NOT_EXIST", *this={tenant_id:1, ls_id:{id:1}, replicas:[]}, replica count=0)
[2025-02-25 16:00:54.049240] INFO  [SHARE.PT] get_ls_info_ (ob_rpc_ls_table.cpp:140) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=4] leader doesn't exist, try use all_server_list(tmp_ret=-4018, tmp_ret="OB_ENTRY_NOT_EXIST", ls_info={tenant_id:1, ls_id:{id:1}, replicas:[]})
[2025-02-25 16:00:54.049249] INFO  [SHARE.PT] get_ls_info_ (ob_rpc_ls_table.cpp:151) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=5] server_list is empty, do nothing(ret=0, ret="OB_SUCCESS", server_list=[])
[2025-02-25 16:00:54.049256] INFO  [SHARE.LOCATION] batch_update_caches_ (ob_ls_location_service.cpp:962) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3] [LS_LOCATION]ls location cache has changed(ret=0, ret="OB_SUCCESS", old_location={cache_key:{tenant_id:0, ls_id:{id:-1}, cluster_id:-1}, renew_time:0, replica_locations:[]}, new_location={cache_key:{tenant_id:1, ls_id:{id:1}, cluster_id:1}, renew_time:1740499254049255, replica_locations:[]})
[2025-02-25 16:00:54.049263] WDIAG [SHARE.LOCATION] renew_location_ (ob_ls_location_service.cpp:1026) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=4][errcode=-4721] get empty location from meta table(ret=-4721, ret="OB_LS_LOCATION_NOT_EXIST", location={cache_key:{tenant_id:0, ls_id:{id:-1}, cluster_id:-1}, renew_time:0, replica_locations:[]})
[2025-02-25 16:00:54.049269] WDIAG [SHARE.LOCATION] get (ob_ls_location_service.cpp:291) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] renew location failed(ret=-4721, ret="OB_LS_LOCATION_NOT_EXIST", cluster_id=1, tenant_id=1, ls_id={id:1})
[2025-02-25 16:00:54.049275] WDIAG [SHARE.LOCATION] get (ob_location_service.cpp:58) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] fail to get log stream location(ret=-4721, ret="OB_LS_LOCATION_NOT_EXIST", cluster_id=1, tenant_id=1, ls_id={id:1}, expire_renew_time=9223372036854775807, is_cache_hit=false, location={cache_key:{tenant_id:0, ls_id:{id:-1}, cluster_id:-1}, renew_time:0, replica_locations:[]})
[2025-02-25 16:00:54.049281] WDIAG [SQL.DAS] block_renew_tablet_location (ob_das_location_router.cpp:1299) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=4][errcode=-4721] failed to get location(ls_id={id:1}, ret=-4721)
[2025-02-25 16:00:54.049284] WDIAG [SQL.DAS] nonblock_get (ob_das_location_router.cpp:927) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=2][errcode=-4721] block renew tablet location failed(tmp_ret=-4721, tmp_ret="OB_LS_LOCATION_NOT_EXIST", tablet_id={id:1})
[2025-02-25 16:00:54.049288] WDIAG [SQL.DAS] nonblock_get_candi_tablet_locations (ob_das_location_router.cpp:960) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=2][errcode=-4721] Get partition error, the location cache will be renewed later(ret=-4721, tablet_id={id:1}, candi_tablet_loc={opt_tablet_loc:{partition_id:-1, tablet_id:{id:0}, ls_id:{id:-1}, replica_locations:[]}, selected_replica_idx:-1, priority_replica_idxs:[]})
[2025-02-25 16:00:54.049296] WDIAG [SQL.OPT] calculate_candi_tablet_locations (ob_table_location.cpp:1531) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=5][errcode=-4721] Failed to set partition locations(ret=-4721, partition_ids=[1], tablet_ids=[{id:1}])
[2025-02-25 16:00:54.049302] WDIAG [SQL.OPT] calculate_phy_table_location_info (ob_table_partition_info.cpp:96) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=4][errcode=-4721] Failed to calculate table location(ret=-4721)
[2025-02-25 16:00:54.049307] WDIAG [SQL.JO] compute_table_location (ob_join_order.cpp:254) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] failed to calculate table location(ret=-4721)
[2025-02-25 16:00:54.049312] WDIAG [SQL.JO] compute_base_table_property (ob_join_order.cpp:9496) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] failed to calc table location(ret=-4721)
[2025-02-25 16:00:54.049315] WDIAG [SQL.JO] generate_base_table_paths (ob_join_order.cpp:9437) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=2][errcode=-4721] failed to compute base path property(ret=-4721)
[2025-02-25 16:00:54.049319] WDIAG [SQL.JO] generate_normal_base_table_paths (ob_join_order.cpp:9423) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] failed to generate access paths(ret=-4721)
[2025-02-25 16:00:54.049324] WDIAG [SQL.OPT] generate_plan_tree (ob_log_plan.cpp:6603) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] failed to generate the access path for the single-table query(ret=-4721, get_optimizer_context().get_query_ctx()->get_sql_stmt()=SELECT column_value FROM __all_core_table WHERE TABLE_NAME = '__all_global_stat' AND COLUMN_NAME = 'snapshot_gc_scn')
[2025-02-25 16:00:54.049330] WDIAG [SQL.OPT] generate_raw_plan_for_plain_select (ob_select_log_plan.cpp:5250) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=4][errcode=-4721] failed to generate plan tree for plain select(ret=-4721)
[2025-02-25 16:00:54.049335] WDIAG [SQL.OPT] generate_raw_plan (ob_log_plan.cpp:10963) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] fail to generate normal raw plan(ret=-4721)
[2025-02-25 16:00:54.049339] WDIAG [SQL.OPT] generate_plan (ob_log_plan.cpp:10920) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=2][errcode=-4721] fail to generate raw plan(ret=-4721)
[2025-02-25 16:00:54.049344] WDIAG [SQL.OPT] optimize (ob_optimizer.cpp:65) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=2][errcode=-4721] failed to perform optimization(ret=-4721)
[2025-02-25 16:00:54.049347] WDIAG [SQL] optimize_stmt (ob_sql.cpp:3846) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=1][errcode=-4721] Failed to optimize logical plan(ret=-4721)
[2025-02-25 16:00:54.049351] WDIAG [SQL] generate_plan (ob_sql.cpp:3481) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] Failed to optimizer stmt(ret=-4721)
[2025-02-25 16:00:54.049356] WDIAG [SQL] generate_physical_plan (ob_sql.cpp:3268) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=2][errcode=-4721] failed to generate plan(ret=-4721)
[2025-02-25 16:00:54.049361] WDIAG [SQL] handle_physical_plan (ob_sql.cpp:5162) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] Failed to generate plan(ret=-4721, result.get_exec_context().need_disconnect()=false)
[2025-02-25 16:00:54.049365] WDIAG [SQL] handle_text_query (ob_sql.cpp:2823) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] fail to handle physical plan(ret=-4721)
[2025-02-25 16:00:54.049370] WDIAG [SQL] stmt_query (ob_sql.cpp:232) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=3][errcode=-4721] fail to handle text query(stmt=SELECT column_value FROM __all_core_table WHERE TABLE_NAME = '__all_global_stat' AND COLUMN_NAME = 'snapshot_gc_scn' , ret=-4721)
[2025-02-25 16:00:54.049378] WDIAG [SERVER] do_query (ob_inner_sql_connection.cpp:785) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=5][errcode=-4721] executor execute failed(ret=-4721)
[2025-02-25 16:00:54.049382] WDIAG [SERVER] query (ob_inner_sql_connection.cpp:944) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=2][errcode=-4721] execute failed(ret=-4721, tenant_id=1, executor={ObIExecutor:, sql:"SELECT column_value FROM __all_core_table WHERE TABLE_NAME = '__all_global_stat' AND COLUMN_NAME = 'snapshot_gc_scn' "}, retry_cnt=65, local_sys_schema_version=1, local_tenant_schema_version=1)
[2025-02-25 16:00:54.049393] INFO  [SERVER] sleep_before_local_retry (ob_query_retry_ctrl.cpp:92) [487][T1_TimerWK0][T1][YB42AC130002-00062EF96197BCF4-0-0] [lt=4] will sleep(sleep_us=65000, remain_us=27869139, base_sleep_us=1000, retry_sleep_type=1, v.stmt_retry_times_=65, v.err_=-4721, timeout_timestamp=1740499281918532)
[2025-02-25 16:00:54.051770] INFO  [SQL.QRR] runTimerTask (ob_udr_mgr.cpp:93) [489][T1_TimerWK2][T1][Y0-0000000000000000-0-0] [lt=0] run rewrite rule refresh task(rule_mgr_->tenant_id_=1)
[2025-02-25 16:00:54.052459] INFO  [SQL.RESV] check_table_exist_or_not (ob_dml_resolver.cpp:9990) [489][T1_TimerWK2][T1][YB42AC130002-00062EF961E7BCB7-0-0] [lt=8] table not exist(tenant_id=1, database_id=201001, table_name=__all_sys_stat, table_name.ptr()="data_size:14, data:5F5F616C6C5F7379735F73746174", ret=-5019)

./root/ob/log/election.log under the docker-compose.yml directory:

[2025-02-25 16:03:58.004677] INFO  [ELECT] prepare (election_proposer.cpp:356) [735][T1002_Occam][T1002][Y0-0000000000000000-0-0] [lt=5] [phase]self is not in memberlist, give up do prepare(ret=0, ret="OB_SUCCESS", role=2, *this={ls_id:{id:1001}, addr:"[current container IP]:2882", role:Follower, ballot_number:-1, lease_interval:-0.00s, memberlist_with_states:{member_list:{addr_list:["[IP from the first container start, no longer valid]:2882"], membership_version:{proposal_id:2, config_seq:3}, replica_num:1}, prepare_ok:False, accept_ok_promised_ts:invalid, follower_promise_membership_version:{proposal_id:9223372036854775807, config_seq:-1}}, priority_seed:0x1000, restart_counter:1, last_do_prepare_ts:1970-01-01 00:00:00.-1, self_priority:{priority:{is_valid:false, is_observer_stopped:false, is_server_stopped:false, is_zone_stopped:false, fatal_failures:[], is_primary_region:false, serious_failures:[], is_in_blacklist:false, in_blacklist_reason:, scn:{val:0, v:0}, is_manual_leader:false, zone_priority:9223372036854775807}}, p_election:0x7f48cb1a7830})

1. Docker shouldn't be able to change the IP, right?
2. For a Docker deployment that is only for testing, deploy directly with the 127.0.0.1 IP instead of the local IP.
3. There is currently no support for changing the IP; let me first check with the team on how feasible that requirement is.
4. You can use ob-operator to deploy OceanBase on Kubernetes, which simplifies OceanBase operations:
https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000002013291

Regarding "2. For a Docker deployment that is only for testing, deploy directly with the 127.0.0.1 IP instead of the local IP.": this IP is obtained inside the Docker container and is normally not set by hand; moreover, an application inside a container generally does not get 127.0.0.1 as its address, otherwise it could not serve clients.

On our side, Docker-based deployments come with no guarantees of stability or performance.
We recommend the operator-based deployment instead.