Based on the information you provided, OCP got stuck while upgrading OB from 4.2.2 to 4.2.5.2, and the log shows the error Lock wait timeout exceeded; try restarting transaction. Here is an analysis of the problem and some suggestions:
Problem analysis
1. Error code
According to the documentation, the error "Lock wait timeout exceeded; try restarting transaction" corresponds to OceanBase error code 6003 or 6006 (MySQL-compatible error code 1205). It means a transaction timed out while waiting for a lock. Possible causes include:
- A long-running transaction in the database.
- Lock conflicts caused by concurrent operations.
2. Key points in the log
- The task "Execute upgrade health checker script" was first marked SUCCESSFUL and then marked FAILED, which indicates something went wrong while the task state was being updated.
- The OCP log explicitly mentions PessimisticLockingFailureException and HibernateJpaDialect.convertHibernateAccessException, so the problem is likely related to database locks or concurrent access.
3. Version differences
You mentioned that the standby cluster upgraded successfully while the primary cluster got stuck. This may be related to differences in load, transaction state, or configuration between the two clusters.
Solution steps
1. Check for and clean up long-running transactions
- Query the currently running transactions:
SELECT * FROM oceanbase.__all_virtual_transaction_info;
- If there are transactions that have been open for a long time without committing, try terminating them manually:
KILL TRANSACTION <transaction_id>;
2. Adjust the lock wait timeout
- By default, OceanBase's lock wait timeout is 1 second. It can be adjusted via the lock_wait_timeout parameter:
ALTER SYSTEM SET lock_wait_timeout = 10;
Note: this parameter requires an OB cluster restart to take effect.
3. Check the OCP database connection pool configuration
- Make sure the OCP connection pool is configured sensibly, to avoid lock waits caused by pool exhaustion. Check the following parameters: maxPoolSize, idleTimeout, connectionTimeout.
4. Retry the upgrade task
- After cleaning up the potential issues, re-trigger the OCP upgrade task. If it still fails, try skipping the "Execute upgrade health checker script" step (if that is supported).
5. Compare the primary and standby clusters
- Compare the configuration files of the two clusters (e.g. config.ini) and make sure they are consistent.
- Check the load on the primary cluster; high load can also cause lock waits.
Notes
- If the steps above do not solve the problem, collect the following information and contact OceanBase technical support:
- The complete log files of the primary cluster.
- The current system parameter configuration (obtainable via SHOW PARAMETERS;).
- Database performance metrics (CPU, memory, disk I/O).
Summary
According to the documentation, "Lock wait timeout exceeded" is usually caused by lock conflicts or long-running transactions. Work through the steps above one by one. If the problem persists, please provide more context for further analysis.
The log extracted from ocp-server.log around the failed task retry is as follows:
2025-04-01 08:13:32.794 WARN 5479 --- [manual-subtask-executor4,a0327b7a330a68f8,42fea0700a7acce3] com.zaxxer.hikari.pool.ProxyConnection : metadb-connect-pool - Connection com.oceanbase.jdbc.OceanBaseConnection@405b9708 marked as broken because of SQLSTATE(08000), ErrorCode(0)
java.sql.SQLNonTransientConnectionException: (conn=655429) Connection timed out
	at com.oceanbase.jdbc.internal.util.exceptions.ExceptionFactory.createException(ExceptionFactory.java:122)
	at com.oceanbase.jdbc.internal.util.exceptions.ExceptionFactory.create(ExceptionFactory.java:202)
	at com.oceanbase.jdbc.OceanBaseStatement.executeExceptionEpilogue(OceanBaseStatement.java:320)
	at com.oceanbase.jdbc.JDBC4PreparedStatement.executeInternal(JDBC4PreparedStatement.java:257)
	at com.oceanbase.jdbc.JDBC4PreparedStatement.execute(JDBC4PreparedStatement.java:171)
This log contains cs_id=655429. This is a connection from OCP to the metadb, not to the business OB cluster.
Next, search the obproxy.log on the OCP server for entries related to this connection.
[admin@RS-OBOPC-P1 log]$ grep 655429 obproxy.log
[2025-04-01 07:02:26.655429] INFO [PROXY.TXN] ob_mysql_transact.cpp:5761 [4835][Y0-00002B9910CA2BA0] [lt=4] [dc=0] [ObMysqlTransact::handle_server_connection_break](client_ip={10.10.10.37:15248}, server_ip={10.10.10.39:2881}, cs_id=917549, proxy_sessid=723401852955330437, ss_id=0, server_sessid=0, sm_id=4868, proxy_user_name=root@ocp_monitor#ocpcluster, database_name=monitor_database, server_state="INTERNAL_ERROR", request_cmd="Quit", sql_cmd="Quit", sql=)[2025-04-01 07:48:17.228104] INFO [PROXY.CS] ob_mysql_client_session.cpp:370 [4833][Y0-00002B9910CA2D20] [lt=7] [dc=0] client session born(cs_id=655429, proxy_sessid=0, is_local_connection=false, client_vc=0x2b9910c94580, client_fd=188, client_addr="10.10.10.39:48578")[2025-04-01 07:48:17.228124] INFO [PROXY.CS] ob_mysql_client_session.cpp:241 [4833][Y0-00002B9910CA2D20] [lt=9] [dc=0] Starting new transaction using sm(cs_id=655429, get_transact_count()=0, sm_id=5632)[2025-04-01 07:48:17.228659] INFO [PROXY.SS] ob_mysql_server_session.cpp:106 [4833][Y0-00002B9910CA2D20] [lt=6] [dc=0] server session born(ss_id=5640, server_ip={10.10.10.39:2881}, cs_id=655429, proxy_sessid=0, server_type=1)[2025-04-01 07:48:17.228720] INFO [PROXY.TXN] ob_mysql_transact.cpp:4726 [4833][Y0-00002B9910CA2D20] [lt=8] [dc=0] succ to set proxy_sessid(cs_id=655429, proxy_sessid=723401852955330709, server_ip={10.10.10.39:2881}, ss_id=5640, server_sessid=3221745283, is_proxy_mysql_client=false, ss_fd=195, client_addr="10.10.10.39:48578")[2025-04-01 08:13:32.793781] WARN [PROXY.SM] set_client_abort (ob_mysql_sm.cpp:8124) [4833][Y0-00002B9910CA2D20] [lt=45] [dc=0] client will abort soon(sm_id=5632, cs_id=655429, proxy_sessid=723401852955330709, ss_id=5640, server_sessid=3221745283, client_ip={10.10.10.39:48578}, server_ip={10.10.10.39:2881}, cluster_name=ocpcluster, tenant_name=ocp_meta, user_name=root, db=meta_database, event="VC_EVENT_EOS", request_cmd="Query", sql_cmd="Query", sql=update `subtask_instance` set `state`='PAUSED' where `id` in 
(9171685))[2025-04-01 08:13:32.793910] WARN [PROXY.SM] update_cmd_stats (ob_mysql_sm.cpp:10239) [4833][Y0-00002B9910CA2D20] [lt=5] [dc=0] Slow Query: ((client_ip={10.10.10.39:48578}, server_ip={10.10.10.39:2881}, obproxy_client_port={10.10.10.39:62722}, server_trace_id=, route_type=ROUTE_TYPE_LEADER, user_name=root, tenant_name=ocp_meta, cluster_name=ocpcluster, logic_database_name=, logic_tenant_name=, ob_proxy_protocol=2, cs_id=655429, proxy_sessid=723401852955330709, ss_id=5640, server_sessid=3221745283, sm_id=5632, cmd_size_stats={client_request_bytes:75, server_request_bytes:191, server_response_bytes:0, client_response_bytes:0}, cmd_time_stats={client_transaction_idle_time_us=0, client_request_read_time_us=7, client_request_analyze_time_us=6, cluster_resource_create_time_us=0, pl_lookup_time_us=3, pl_process_time_us=2, bl_lookup_time_us=0, bl_process_time_us=0, congestion_control_time_us=13, congestion_process_time_us=0, do_observer_open_time_us=1, server_connect_time_us=0, server_sync_session_variable_time_us=0, server_send_saved_login_time_us=0, server_send_use_database_time_us=0, server_send_session_variable_time_us=0, server_send_session_user_variable_time_us=0, server_send_all_session_variable_time_us=0, server_send_start_trans_time_us=0, server_send_xa_start_time_us=0, server_send_init_sql_time_us=0, build_server_request_time_us=4, plugin_compress_request_time_us=0, prepare_send_request_to_server_time_us=37, server_request_write_time_us=14, server_process_request_time_us=0, server_response_read_time_us=0, plugin_decompress_response_time_us=0, server_response_analyze_time_us=0, ok_packet_trim_time_us=0, client_response_write_time_us=0, request_total_time_us=30029398}, sql=update `subtask_instance` set `state`='PAUSED' where `id` in (9171685), trans_internal_routing_state=not in trans)[2025-04-01 08:13:32.794009] WARN [PROXY.SM] update_stats (ob_mysql_sm.cpp:10366) [4833][Y0-00002B9910CA2D20] [lt=16] [dc=0] Slow transaction: 
((client_ip={10.10.10.39:48578}, server_ip={10.10.10.39:2881}, server_trace_id=, route_type=ROUTE_TYPE_LEADER, ob_proxy_protocol=2, cs_id=655429, proxy_sessid=723401852955330709, ss_id=5640, server_sessid=3221745283, sm_id=5632, state=1, trans_stats_={client_requests:1, server_responses:0, pl_lookup_retries:0, server_retries:0, client_request_bytes:75, server_request_bytes:191, server_response_bytes:0, client_response_bytes:0, client_transaction_idle_time_us=0, client_process_request_time_us=21, client_request_read_time_us=7, client_request_analyze_time_us=6, cluster_resource_create_time_us=0, pl_lookup_time_us=3, pl_process_time_us=2, congestion_control_time_us=13, congestion_process_time_us=0, do_observer_open_time_us=1, server_connect_time_us=0, sync_session_variable_time_us=0, send_saved_login_time_us=0, send_use_database_time_us=0, send_session_vars_time_us=0, send_session_user_vars_time_us=0, send_all_session_vars_time_us=0, send_start_trans_time_us=0, send_xa_start_time_us=0, build_server_request_time_us=4, plugin_compress_request_time_us=0, prepare_send_request_to_server_time_us=37, server_request_write_time_us=14, server_process_request_time_us=0, server_response_read_time_us=0, plugin_decompress_response_time_us=0, server_response_analyze_time_us=0, ok_packet_trim_time_us=0, client_response_write_time_us=0, trans_time_us=30029630}, last_sql=update `subtask_instance` set `state`='PAUSED' where `id` in (9171685))[2025-04-01 08:13:32.794063] INFO [PROXY.SS] ob_mysql_server_session.cpp:153 [4833][Y0-00002B9910CA2D20] [lt=21] [dc=0] server session do_io_close((*this={ss_id:5640, server_sessid:3221745283, server_ip:{10.10.10.39:2881}, is_inited:true, magic:19136237, state:1, server_vc:0x2b991246bbd0, client_session:{this:0x2b9912089120, is_proxy_mysql_client:false, is_waiting_trans_first_request:false, need_delete_cluster:false, is_first_dml_sql_got:false, vc_ready_killed:false, active:true, magic:19132429, conn_decrease:true, current_tid:4833, cs_id:655429, 
proxy_sessid:723401852955330709, session_info:{is_inited:true, priv_info:{has_all_privilege:false, cs_id:655429, user_priv_set:133009965054, cluster_name:"ocpcluster", tenant_name:"ocp_meta", user_name:"root"}, version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:6968, sys_var_version:6801, user_var_version:1, db_name_version:1, last_insert_id_version:6, sess_info_version:11099}, hash_version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:0, sys_var_version:0, user_var_version:0, db_name_version:0, last_insert_id_version:0, sess_info_version:0}, val_hash:{common_hot_sys_var_hash:0, common_cold_sys_var_hash:0, mysql_hot_sys_var_hash:0, mysql_cold_sys_var_hash:0, hot_sys_var_hash:0, cold_sys_var_hash:0, user_var_hash:0}, global_vars_version:0, is_global_vars_changed:false, is_trans_specified:false, is_user_idc_name_set:false, is_read_consistency_set:false, idc_name:"", cluster_id:0, real_meta_cluster_name:"", safe_read_snapshot:1, syncing_safe_read_snapshot:0, route_policy:1, proxy_route_policy:3, user_identity:0, global_vars_version:0, is_read_only_user:false, is_request_follower_user:false, obproxy_force_parallel_query_dop:1, ob20_request:{remain_payload_len:0, ob20_request_received_done:false, ob20_header:{ob 20 protocol header:{compressed_len:0, seq:0, non_compressed_len:0}, magic_num:0, header_checksum:0, connection_id:0, request_id:0, pkt_seq:0, payload_len:0, version:0, flag_.flags:0, reserved:0}}, client_cap:72, server_cap:916303, last_server_addr:{*Not IP address [0]*:0}, last_server_sess_id:0, init_sql:""}, dummy_ldc:{use_ldc:false, idc_name:"", item_count:1, site_start_index_array:[[0]0, [1]1, [2]1, [3]1], item_array:[[0]{idc_type:"SAME_IDC", zone_type:"ReadWrite", replica:{server:"10.10.10.39:2881", is_dup_replica:false, role:"FOLLOWER", type:"FULL"}, is_merging:false, 
is_partition_server:false, is_force_congested:false, is_used:false}], pl:null, ts:0x2b9903f00030, readonly_exist_status:"READONLY_ZONE_NOT_EXIST"}, dummy_entry:0x2b9903f9bbe0, server_state_version:3, cur_ss:0x2b9912204be0, bound_ss:null, lii_ss:0x2b9912204be0, cluster_resource:{this:0x2b9910023080, ref_count:182, is_inited:true, cluster_info_key:{cluster_name:{config_string:"ocpcluster"}, cluster_id:0}, cr_state:"CR_AVAIL", version:2, last_access_time_ns:1743466412669165417, deleting_completed_thread_num:0, fetch_rslist_task_count:0, fetch_idc_list_task_count:0, last_idc_list_refresh_time_ns:1743437552565904769, last_rslist_refresh_time_ns:0, server_state_version:3}, client_vc:0x2b9910c94580, using_ldg:false, trace_stats:NULL}, transact_count:26716, server_trans_stat:1, session_info:{cap:916303, ob_server:, version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:6968, sys_var_version:6801, user_var_version:1, db_name_version:1, last_insert_id_version:6, sess_info_version:11099}, val_hash:{common_hot_sys_var_hash:0, common_cold_sys_var_hash:0, mysql_hot_sys_var_hash:0, mysql_cold_sys_var_hash:0, hot_sys_var_hash:0, cold_sys_var_hash:0, user_var_hash:0}, checksum_switch:1, server_type:1}}, server_vc_=0x2b991246bbd0, this=0x2b9912204be0)
[2025-04-01 08:13:32.794176] INFO [PROXY.SS] ob_mysql_server_session.cpp:190 [4833][Y0-00002B9910CA2D20] [lt=84] [dc=0] server session is closing(ss_id=5640, server_sessid=3221745283, server_ip={10.10.10.39:2881}, cs_id=655429, proxy_sessid=723401852955330709)
[2025-04-01 08:13:32.794206] INFO [PROXY.SS] ob_mysql_client_session.cpp:652 [4833][Y0-00002B9910CA2D20] [lt=8] [dc=0] client session do_io_close((*this={this:0x2b9912089120, is_proxy_mysql_client:false, is_waiting_trans_first_request:false, need_delete_cluster:false, is_first_dml_sql_got:false, vc_ready_killed:false, active:true, magic:19132429, conn_decrease:true, current_tid:4833, cs_id:655429, proxy_sessid:723401852955330709, session_info:{is_inited:true, priv_info:{has_all_privilege:false, cs_id:655429, user_priv_set:133009965054, cluster_name:"ocpcluster", tenant_name:"ocp_meta", user_name:"root"}, version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:6968, sys_var_version:6801, user_var_version:1, db_name_version:1, last_insert_id_version:6, sess_info_version:11099}, hash_version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:0, sys_var_version:0, user_var_version:0, db_name_version:0, last_insert_id_version:0, sess_info_version:0}, val_hash:{common_hot_sys_var_hash:0, common_cold_sys_var_hash:0, mysql_hot_sys_var_hash:0, mysql_cold_sys_var_hash:0, hot_sys_var_hash:0, cold_sys_var_hash:0, user_var_hash:0}, global_vars_version:0, is_global_vars_changed:false, is_trans_specified:false, is_user_idc_name_set:false, is_read_consistency_set:false, idc_name:"", cluster_id:0, real_meta_cluster_name:"", safe_read_snapshot:1, syncing_safe_read_snapshot:0, route_policy:1, proxy_route_policy:3, user_identity:0, global_vars_version:0, is_read_only_user:false, is_request_follower_user:false, obproxy_force_parallel_query_dop:1, ob20_request:{remain_payload_len:0, ob20_request_received_done:false, ob20_header:{ob 20 protocol header:{compressed_len:0, seq:0, non_compressed_len:0}, magic_num:0, header_checksum:0, connection_id:0, request_id:0, pkt_seq:0, payload_len:0, version:0, flag_.flags:0, reserved:0}}, 
client_cap:72, server_cap:916303, last_server_addr:{*Not IP address [0]*:0}, last_server_sess_id:0, init_sql:""}, dummy_ldc:{use_ldc:false, idc_name:"", item_count:1, site_start_index_array:[[0]0, [1]1, [2]1, [3]1], item_array:[[0]{idc_type:"SAME_IDC", zone_type:"ReadWrite", replica:{server:"10.10.10.39:2881", is_dup_replica:false, role:"FOLLOWER", type:"FULL"}, is_merging:false, is_partition_server:false, is_force_congested:false, is_used:false}], pl:null, ts:0x2b9903f00030, readonly_exist_status:"READONLY_ZONE_NOT_EXIST"}, dummy_entry:0x2b9903f9bbe0, server_state_version:3, cur_ss:null, bound_ss:null, lii_ss:null, cluster_resource:{this:0x2b9910023080, ref_count:182, is_inited:true, cluster_info_key:{cluster_name:{config_string:"ocpcluster"}, cluster_id:0}, cr_state:"CR_AVAIL", version:2, last_access_time_ns:1743466412669165417, deleting_completed_thread_num:0, fetch_rslist_task_count:0, fetch_idc_list_task_count:0, last_idc_list_refresh_time_ns:1743437552565904769, last_rslist_refresh_time_ns:0, server_state_version:3}, client_vc:0x2b9910c94580, using_ldg:false, trace_stats:NULL}, client_vc_=0x2b9910c94580, this=0x2b9912089120)
[2025-04-01 08:13:32.794257] INFO [PROXY.CS] ob_mysql_client_session.cpp:94 [4833][Y0-00002B9910CA2D20] [lt=44] [dc=0] client session destroy(cs_id=655429, proxy_sessid=723401852955330709, client_vc=NULL)
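To reconstruct the timeline of a single connection from a dump like the one above, the obproxy.log lines can be filtered by cs_id and reduced to timestamp plus event. A minimal sketch (the helper name timeline and the truncated sample lines are ours; the regex assumes the default obproxy log layout shown above):

```python
import re

# Extract (timestamp, event) pairs for one cs_id from obproxy.log lines.
# Assumes the layout seen above: [timestamp] LEVEL [MODULE] ... [dc=N] event(...)
LINE_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} [\d:.]+)\].*\[dc=\d+\] ([^(]*)\(")

def timeline(lines, cs_id):
    needle = f"cs_id={cs_id}"
    out = []
    for line in lines:
        if needle not in line:
            continue
        m = LINE_RE.search(line)
        if m:
            out.append((m.group(1), m.group(2).strip()))
    return out

# Truncated stand-ins for two of the log lines above.
sample = [
    '[2025-04-01 07:48:17.228104] INFO [PROXY.CS] ob_mysql_client_session.cpp:370 [4833][Y0] [lt=7] [dc=0] client session born(cs_id=655429, proxy_sessid=0)',
    '[2025-04-01 08:13:32.794257] INFO [PROXY.CS] ob_mysql_client_session.cpp:94 [4833][Y0] [lt=44] [dc=0] client session destroy(cs_id=655429, client_vc=NULL)',
]
for ts, event in timeline(sample, 655429):
    print(ts, "-", event)
```

Run against the full log this gives a born/abort/close/destroy timeline for the connection at a glance.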
From OCP's obproxy.log we can see that this connection was established well before the task retry (2025-04-01 07:48:17.228104); the retry happened at 2025-04-01 08:12:47.581, and the connection broke abnormally at 2025-04-01 08:13:32.794. The SQL being executed at the moment of the disconnect was:
sql_cmd="Query", sql=update `subtask_instance` set `state`='PAUSED' where `id` in (9171685))
Immediately afterwards the same SQL was also flagged as a slow query, with an execution time of about request_total_time_us=30029398.
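As a quick sanity check on the units (an aside, not from the logs themselves): request_total_time_us is in microseconds, so the reported value is almost exactly the 30-second net_read_timeout shown below.

```python
# Convert the proxy's request_total_time_us (microseconds) to seconds
# and compare it with net_read_timeout (seconds).
request_total_time_us = 30_029_398
net_read_timeout_s = 30

elapsed_s = request_total_time_us / 1_000_000
print(f"elapsed: {elapsed_s:.2f} s")                 # elapsed: 30.03 s
print(abs(elapsed_s - net_read_timeout_s) < 1)       # True: within 1 s of the timeout
```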
The SQL timeout on the metadb is 100 s:
obclient(root@ocp_meta)[meta_database]> show global variables like '%timeout%';
+---------------------+------------------+
| Variable_name | Value |
+---------------------+------------------+
| connect_timeout | 10 |
| interactive_timeout | 28800 |
| lock_wait_timeout | 31536000 |
| net_read_timeout | 30 |
| net_write_timeout | 60 |
| ob_pl_block_timeout | 3216672000000000 |
| ob_query_timeout | 100000000 |
| ob_trx_idle_timeout | 86400000000 |
| ob_trx_lock_timeout | -1 |
| ob_trx_timeout | 86400000000 |
| wait_timeout | 28800 |
+---------------------+------------------+
11 rows in set (0.003 sec)
The metadb is under little load, and this SQL cannot plausibly be a slow SQL. 30 s is exactly net_read_timeout. OCP and the metadb run on the same server, so the problem is most likely somewhere inside OCP itself. @论坛小助手
Based on the detailed information and logs you provided, here is an analysis and some possible solutions:
Problem analysis
1. Why the task is stuck:
- The primary cluster's upgrade task fails at the "Execute upgrade health checker script" step and cannot be skipped.
- The log shows the task state is first marked SUCCESSFUL and then marked FAILED.
2. Lock wait timeout:
- The OCP log contains the error "Lock wait timeout exceeded; try restarting transaction", which indicates a transaction lock conflict or a long-uncommitted transaction.
- The specific cause may be that some SQL statements run for too long, preventing other transactions from acquiring locks.
3. Abnormal disconnect:
- The OBProxy log shows that connection cs_id=655429 was flagged as a slow query while executing update subtask_instance set state='PAUSED' where id in (9171685), and the connection was eventually closed.
- The disconnect happened almost exactly at net_read_timeout (30 seconds), which suggests a network-layer or client-configuration problem.
4. SQL execution time:
- Although you mentioned the metadb is under little load, this SQL took about 30 seconds (request_total_time_us=30029398), far beyond the normal range.
- This could be related to lock conflicts, a missing index, or a table scan.
Solutions
1. Check for lock conflicts
According to documentation fragments 6 and 10, the OceanBase database may have a lock conflict. Troubleshooting steps:
- Query the current transaction state and lock information:
SELECT * FROM oceanbase.__all_virtual_transaction_info;
- If there are transactions that have been open for a long time without committing, try rolling them back manually:
ROLLBACK;
2. Adjust timeout parameters
- net_read_timeout is currently 30 seconds, which may cause the connection to be cut during long-running operations. Consider raising it:
SET GLOBAL net_read_timeout = 120;
- Also check whether ob_query_timeout is reasonable. The default is 10 seconds; it may need to be adjusted for your workload:
SET GLOBAL ob_query_timeout = 300000000; -- in microseconds; 300 seconds here
3. Optimize SQL performance
- Check the indexes on the subtask_instance table and make sure the id column is properly indexed:
SHOW INDEX FROM subtask_instance;
- If the index is missing, add one to speed up the update:
ALTER TABLE subtask_instance ADD INDEX idx_id (id);
4. Investigate OCP itself
- According to the logs, OCP itself may have a performance bottleneck or be short on resources. Check OCP's system resource usage (CPU, memory, disk I/O, etc.).
- Check the OCP task logs to see whether other concurrent tasks are interfering with the upgrade flow:
tail -f /path/to/ocp/tmp/upgrade_*_log
5. Retry the upgrade task
- After making the adjustments above, start the upgrade task again. If it still fails, try cleaning up the failed task records and retrying:
obd cluster restart <cluster_name>
Other suggestions
- Version compatibility: make sure the OCP and OB versions are compatible. Although the standby cluster upgraded successfully, the primary cluster may hit a compatibility problem in some specific scenario.
- Contact technical support: if the problem persists, provide the complete log files (including ocp-server.log and obproxy.log) plus task screenshots, and contact the OceanBase technical support team for further analysis.
I hope these suggestions help! If you have more details or new logs, feel free to add them.
Every retry produces a success record immediately followed, some ten-plus seconds later, by a failure record. The metadb SQL audit shows the same pattern:
| 2025-04-01 08:50:08.571955 | root | update `subtask_instance` set `class_name`='com.oceanbase.ocp.service.task.business.cluster.ExecUpgradeHealthCheckerScriptTask', `context`='eyJjb25jdXJyZW5jeSI6LTEsImxpc3RNYXAiOnsic3RhbmRieV9
jbHVzdGVyX2lkcyI6W10sIjE3MTEzNTE0ODUub2Jycy56b25lMy5zZXJ2ZXJfaWRzIjpbIjQiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUxLnNlcnZlcl9pZHMiOlsiMiJdLCJjbHVzdGVyX2lkcyI6WyIyIl0sInByaW1hcnlfZmlyc3RfY2x1c3Rlcl9pZHMiOlsiMiJdLCIxNzExMzUxNDg1Lm9icnMuem9uZTMuaG9zdF9pZHMiOlsiNCJdLCJzdGFuZGJ5X2ZpcnN0X2NsdXN0ZXJfaWRzIjpbIjIiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUyLmhvc3RfaWRzIjpbIjMiXSwic3RhZ2VkX3ZlcnNpb25zIjpbIjQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwIl0sIjE3MTEzNTE0ODUub2Jycy56b25lMi5zZXJ2ZXJfaWRzIjpbIjMiXSwiZGlyZWN0X3VwZ3JhZGVfdmVyc2lvbnMiOltdLCIxNzExMzUxNDg1Lm9icnMuem9uZV9uYW1lcyI6WyJ6b25lMSIsInpvbmUyIiwiem9uZTMiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUxLmhvc3RfaWRzIjpbIjIiXSwiem9uZV9uYW1lcyI6WyJ6b25lMSIsInpvbmUyIiwiem9uZTMiXSwic2VydmVyX2lkcyI6WyIyIiwiMyIsIjQiXSwiaG9zdF9pZHMiOlsiMiIsIjMiLCI0Il19LCJwYXJhbGxlbCI6ZmFsc2UsInBhcmFsbGVsSWR4IjotMSwic3RyaW5nTWFwIjp7InN1YnRhc2tfc3BsaXR0ZXIiOiJwcmltYXJ5X2ZpcnN0X2NsdXN0ZXJfaWRzIiwiY2x1c3Rlcl92ZXJzaW9uIjoiNC4yLjIuMSIsImNsdXN0ZXJfbmFtZSI6Im9icnMiLCJjbHVzdGVyX2lkIjoiMiIsIm9jcGFnZW50X3NlcnZpY2VfbmFtZSI6ImFnZW50IiwicHJpbWFyeV9yb290X3N5c19wYXNzd29yZCI6IlJTcnNAIzIwMjQiLCJvcGVyYXRpbmdfc3lzdGVtIjoiZWw3IiwidGFyZ2V0X2NsdXN0ZXJfc3RhdHVzIjoiUlVOTklORyIsIndhdGNoZWRfc2NlbmFya
W8iOiJVUEdSQURFIiwicHJpbWFyeV9jbHVzdGVyX2lkIjoiMiIsInRhcmdldF92ZXJzaW9uIjoiNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAiLCJ0YXNrX2luc3RhbmNlX2lkIjoiNzk0MTYyNyIsImxhdGVzdF9leGVjdXRpb25fc3RhcnRfdGltZSI6IjIwMjUtMDQtMDFUMDg6NDk6NTMuMzgxKzA4OjAwIiwib2Jycy4xNzExMzUxNDg1LndhdGNoZWRfcGFyYW1ldGVycyI6IntcInNlcnZlcl9wZXJtYW5lbnRfb2ZmbGluZV90aW1lXCI6XCIzNjAwc1wiLFwiZW5hYmxlX3JlYmFsYW5jZVwiOlwie1xcXCJ0ZW4wMDFcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDAyXFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAwM1xcXCI6XFxcIlRydWVcXFwiLFxcXCJ0ZW4wMDRcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDEwXFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAxMVxcXCI6XFxcIlRydWVcXFwiLFxcXCJvZGNfbWV0YVxcXCI6XFxcIlRydWVcXFwiLFxcXCJvbXNfbWV0YVxcXCI6XFxcIlRydWVcXFwiLFxcXCJzeXNcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDA1XFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAwNlxcXCI6XFxcIlRydWVcXFwiLFxcXCJ0ZW4wMDdcXFwiOlxcXCJUcnVlXFxcIn1cIixcImVuYWJsZV9yZXJlcGxpY2F0aW9uXCI6XCJUcnVlXCJ9Iiwib2JfY2x1c3Rlcl9pZCI6IjE3MTEzNTE0ODUiLCJtaXhfcnBtX25hbWVzIjoie1wiem9uZTNcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwiLFwiem9uZTJcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwiLFwiem9uZTFcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwifSIsImRlcF9ycG1fbmFtZXMiOiJ7XCJ6b25lM1wiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl0sXCJ6b25lMlwiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl0sXCJ6b25lMVwiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl19Iiwib2JfaW5zdGFsbF9wYXRoIjoiL2hvbWUvYWRtaW4vb2NlYW5iYXNlIiwib2JfcnVuX3BhdGgiOiIvaG9tZS9hZG1pbi9vY2VhbmJhc2UiLCJvYl9ydW5fdXNlciI6ImFkbWluIiwib2JfZGF0YV9kaXNrX3BhdGgiOiIvZGF0YS8xIiwib2JfbG9nX2Rpc
2tfcGF0aCI6Ii9kYXRhL2xvZzEiLCJvYl9kaXNrX3BhdGhfc3R5bGUiOiJERUZBVUxUIiwib2Jfc3FsX3BvcnQiOiIyODgxIiwib2Jfc3Z
yX3BvcnQiOiIyODgyIn19', `end_time`='2025-04-01 08:50:08.57', `executor`='10.10.10.39:8080', `name`='Execute upgrade health checker script', `node_type`='JAVA_TASK', `operation`='RETRY', `parallel_idx`=-1, `run_time`=23, `series_id`=30, `start_time`='2025-03-31 22:33:41.724', `state`='SUCCESSFUL', `task_id`=7941627, `timeout`=1800, `version`=NULL where `id`=9171685 | 0 || 2025-04-01 08:50:08.573149 | root | INSERT INTO `subtask_log`(`subtask_id`, `log_content`, `run_time`) VALUES (9171685, _binary 'Set state for subtask: 9171685, operation:RETRY, state: SUCCESSFUL', 23)
......
| 2025-04-01 08:50:38.612806 | root | update `subtask_instance` set `class_name`='com.oceanbase.ocp.service.task.business.cluster.ExecUpgradeHealthCheckerScriptTask', `context`='eyJjb25jdXJyZW5jeSI6LTEsImxpc3RNYXAiOnsic3RhbmRieV9
jbHVzdGVyX2lkcyI6W10sIjE3MTEzNTE0ODUub2Jycy56b25lMy5zZXJ2ZXJfaWRzIjpbIjQiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUxLnNlcnZlcl9pZHMiOlsiMiJdLCJjbHVzdGVyX2lkcyI6WyIyIl0sInByaW1hcnlfZmlyc3RfY2x1c3Rlcl9pZHMiOlsiMiJdLCIxNzExMzUxNDg1Lm9icnMuem9uZTMua
G9zdF9pZHMiOlsiNCJdLCJzdGFuZGJ5X2ZpcnN0X2NsdXN0ZXJfaWRzIjpbIjIiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUyLmhvc3RfaWRzIjpbIjMiXSwic3RhZ2VkX3ZlcnNpb25zIjpbIjQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwIl0sIjE3MTEzNTE0ODUub2Jycy56b25lMi5zZXJ2ZXJfaWRzIjpbIjM
iXSwiZGlyZWN0X3VwZ3JhZGVfdmVyc2lvbnMiOltdLCIxNzExMzUxNDg1Lm9icnMuem9uZV9uYW1lcyI6WyJ6b25lMSIsInpvbmUyIiwiem9uZTMiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUxLmhvc3RfaWRzIjpbIjIiXSwiem9uZV9uYW1lcyI6WyJ6b25lMSIsInpvbmUyIiwiem9uZTMiXSwic2VydmVyX2lkc
yI6WyIyIiwiMyIsIjQiXSwiaG9zdF9pZHMiOlsiMiIsIjMiLCI0Il19LCJwYXJhbGxlbCI6ZmFsc2UsInBhcmFsbGVsSWR4IjotMSwic3RyaW5nTWFwIjp7InN1YnRhc2tfc3BsaXR0ZXIiOiJwcmltYXJ5X2ZpcnN0X2NsdXN0ZXJfaWRzIiwiY2x1c3Rlcl92ZXJzaW9uIjoiNC4yLjIuMSIsImNsdXN0ZXJfbmF
tZSI6Im9icnMiLCJjbHVzdGVyX2lkIjoiMiIsIm9jcGFnZW50X3NlcnZpY2VfbmFtZSI6ImFnZW50IiwicHJpbWFyeV9yb290X3N5c19wYXNzd29yZCI6IlJTcnNAIzIwMjQiLCJvcGVyYXRpbmdfc3lzdGVtIjoiZWw3IiwidGFyZ2V0X2NsdXN0ZXJfc3RhdHVzIjoiUlVOTklORyIsIndhdGNoZWRfc2NlbmFya
W8iOiJVUEdSQURFIiwicHJpbWFyeV9jbHVzdGVyX2lkIjoiMiIsInRhcmdldF92ZXJzaW9uIjoiNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAiLCJ0YXNrX2luc3RhbmNlX2lkIjoiNzk0MTYyNyIsImxhdGVzdF9leGVjdXRpb25fc3RhcnRfdGltZSI6IjIwMjUtMDQtMDFUMDg6NDk6NTMuMzgxKzA4OjAwIiw
ib2Jycy4xNzExMzUxNDg1LndhdGNoZWRfcGFyYW1ldGVycyI6IntcInNlcnZlcl9wZXJtYW5lbnRfb2ZmbGluZV90aW1lXCI6XCIzNjAwc1wiLFwiZW5hYmxlX3JlYmFsYW5jZVwiOlwie1xcXCJ0ZW4wMDFcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDAyXFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAwM1xcX
CI6XFxcIlRydWVcXFwiLFxcXCJ0ZW4wMDRcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDEwXFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAxMVxcXCI6XFxcIlRydWVcXFwiLFxcXCJvZGNfbWV0YVxcXCI6XFxcIlRydWVcXFwiLFxcXCJvbXNfbWV0YVxcXCI6XFxcIlRydWVcXFwiLFxcXCJzeXNcXFwiOlxcXCJ
UcnVlXFxcIixcXFwidGVuMDA1XFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAwNlxcXCI6XFxcIlRydWVcXFwiLFxcXCJ0ZW4wMDdcXFwiOlxcXCJUcnVlXFxcIn1cIixcImVuYWJsZV9yZXJlcGxpY2F0aW9uXCI6XCJUcnVlXCJ9Iiwib2JfY2x1c3Rlcl9pZCI6IjE3MTEzNTE0ODUiLCJtaXhfcnBtX25hbWVzI
joie1wiem9uZTNcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwiLFwiem9uZTJcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwiLFwiem9uZTFcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTE
wMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwifSIsImRlcF9ycG1fbmFtZXMiOiJ7XCJ6b25lM1wiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2M
TAuZWw3Lng4Nl82NC5ycG1cIl0sXCJ6b25lMlwiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl0sXCJ6b25lMVwiOltcIm9
jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl19Iiwib2JfaW5zdGFsbF9wYXRoIjoiL2hvbWUvYWRtaW4vb2NlYW5iYXNlIiwib2Jfc
nVuX3BhdGgiOiIvaG9tZS9hZG1pbi9vY2VhbmJhc2UiLCJvYl9ydW5fdXNlciI6ImFkbWluIiwib2JfZGF0YV9kaXNrX3BhdGgiOiIvZGF0YS8xIiwib2JfbG9nX2Rpc2tfcGF0aCI6Ii9kYXRhL2xvZzEiLCJvYl9kaXNrX3BhdGhfc3R5bGUiOiJERUZBVUxUIiwib2Jfc3FsX3BvcnQiOiIyODgxIiwib2Jfc3Z
yX3BvcnQiOiIyODgyIn19', `end_time`='2025-04-01 08:50:38.611', `executor`='10.10.10.39:8080', `name`='Execute upgrade health checker script', `node_type`='JAVA_TASK', `operation`='RETRY', `parallel_idx`=-1, `run_time`=23, `series_id`=3
0, `start_time`='2025-03-31 22:33:41.724', `state`='FAILED', `task_id`=7941627, `timeout`=1800, `version`=NULL where `id`=9171685 | 0 |
| 2025-04-01 08:50:38.613989 | root | INSERT INTO `subtask_log`(`subtask_id`, `log_content`, `run_time`) VALUES (9171685, _binary 'Set state for subtask: 9171685, operation:RETRY, state: FAILED', 23)
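As an aside, the context column in these audit records is base64-encoded JSON, so it can be decoded for inspection. A sketch; the sample string below is a tiny stand-in with the same structure, not the full payload from the audit rows:

```python
import base64
import json

# Build a small stand-in for a subtask `context` value (the real one is the
# long base64 string in the audit record above), then decode it back.
sample_context = base64.b64encode(
    json.dumps({"stringMap": {"cluster_version": "4.2.2.1"}}).encode()
).decode()

decoded = json.loads(base64.b64decode(sample_context))
print(decoded["stringMap"]["cluster_version"])  # 4.2.2.1
```

Decoding the real payload is how details such as cluster_version, target_version, and task_instance_id can be read out of the audit row.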
This suggests that OCP performed the same check twice.
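The "success followed shortly by failure" pattern can also be spotted mechanically across the audit stream: for each subtask id, look for a SUCCESSFUL state update followed within a short window by a FAILED one. A minimal sketch over (timestamp, subtask_id, state) tuples (the helper name find_flips is ours):

```python
from datetime import datetime, timedelta

# Flag subtasks whose state flips SUCCESSFUL -> FAILED within a short window,
# as seen in the SQL audit records above.
def find_flips(updates, window=timedelta(seconds=60)):
    last_success = {}
    flips = []
    for ts, subtask_id, state in sorted(updates):
        if state == "SUCCESSFUL":
            last_success[subtask_id] = ts
        elif state == "FAILED" and subtask_id in last_success:
            if ts - last_success[subtask_id] <= window:
                flips.append((subtask_id, last_success[subtask_id], ts))
    return flips

# The two audit records above, reduced to tuples.
updates = [
    (datetime(2025, 4, 1, 8, 50, 8, 571955), 9171685, "SUCCESSFUL"),
    (datetime(2025, 4, 1, 8, 50, 38, 612806), 9171685, "FAILED"),
]
print(find_flips(updates))  # one ~30 s SUCCESSFUL -> FAILED flip on subtask 9171685
```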
The OceanBase community has received your post and is following up.
With the support of the OB community experts, the problem is solved. Many thanks!
To summarize:
- Symptom: a subtask's log shows it succeeding first and then failing.
Set state for subtask: 9171685, operation:RETRY, state: SUCCESSFUL
Set state for subtask: 9171685, operation:RETRY, state: FAILED
Afterwards several other subtasks showed the same behavior; the affected task was not fixed to any one subtask.
- First, check whether the subtask upgrade_health_checker really failed.
Based on the log information, log in to an OB node and run the check manually:
python upgrade_health_checker.py -h 127.0.0.1 -P 2881 -u root@sys -pxxxxxxx
The result shows no errors.
So this step confirms the problem has nothing to do with the business OB cluster.
That turns the suspicion toward OCP itself.
- Analyze the OCP logs.
ocp-server.log contains some exception messages. There are a lot of logs and they are rather messy; in the end they were not useful to me (though they are useful to the product developers), so I set them aside.
- Analyze the OCP metadata tenant ocp_meta and its metadata database meta_database.
Capturing the SQL audit view revealed the two update records related to this task.
This confirms that OCP first considered the task successful and then, about 20 seconds later, considered it failed. Judging by the session_id field, the two updates came from two different client connections.
Every retry behaved the same way, so the inference is that the task framework in the new OCP version has a problem: it may use some concurrency technique (coroutines?) that is not handled correctly, causing this error to appear at random.
Later, other subtasks randomly showed the same behavior as well.
This also explains why records such as "lock wait timeout" occasionally appear in the OCP log: these parallel updates may be blocking one another. (The OCP log also contains a third update, the one setting state to PAUSED, but it does not show up in the SQL audit; presumably that update was blocked and then rolled back on timeout.)
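The inferred failure mode, two connections racing to update the same subtask row, with the loser blocking on the winner's row lock until it times out, can be reproduced in miniature with any transactional store. A toy illustration with SQLite (not OCP code, just the same conflict pattern):

```python
import os
import sqlite3
import tempfile

# Two connections race to update the same row: the second one waits for the
# first connection's write lock and times out -- the same pattern as the
# "Lock wait timeout exceeded" errors seen in the OCP metadb.
path = os.path.join(tempfile.mkdtemp(), "toy.db")
c1 = sqlite3.connect(path, timeout=0.2, isolation_level=None)  # autocommit
c2 = sqlite3.connect(path, timeout=0.2, isolation_level=None)
c1.execute("CREATE TABLE subtask_instance (id INTEGER PRIMARY KEY, state TEXT)")
c1.execute("INSERT INTO subtask_instance VALUES (9171685, 'RUNNING')")

# Connection 1 marks the subtask SUCCESSFUL but has not committed yet...
c1.execute("BEGIN")
c1.execute("UPDATE subtask_instance SET state='SUCCESSFUL' WHERE id=9171685")

# ...so connection 2's attempt to mark it PAUSED blocks, then times out.
try:
    c2.execute("UPDATE subtask_instance SET state='PAUSED' WHERE id=9171685")
    conflict = False
except sqlite3.OperationalError:  # "database is locked"
    conflict = True
c1.execute("COMMIT")
print("second writer hit a lock timeout:", conflict)
```

In OCP's case the two writers were supposed to be serialized by the task framework, which is exactly what the bug broke.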
- Workaround
Correct the parent and child task records:
begin;
update task_instance set state='RUNNING' where id=7941627;
update subtask_instance set state='SUCCESSFUL' where id=9171678;
commit;
- Refresh the task page in OCP and click "Resume".
Thanks for reporting this problem; we will share the root cause with you once it is confirmed.
Root cause: the suspend/resume handling in the upgrade scenario is faulty, so the transaction that marks a subtask successful conflicts with the transaction that marks it suspended; the conflicting transaction then times out and the subtask is marked as failed.
Temporary workaround: manually correct the subtask state in the metadb.
Bug fix plan: correct the subtask state handling logic to avoid the transaction conflict.
Fix version: OCP 4.3.6
Practice brings true knowledge. Thanks for sharing the hands-on experience.
Valuable experience, thank you!
Great content.
Thanks to the author for organizing and sharing this.
Looking forward to more posts.
A very detailed write-up.




