OCP (4.3.5-20250303105901): task error when upgrading OB 4.2.2.1 to 4.2.5.2

  1. Version info:
    ocp: 4.3.5-20250303105901
    ob: 4.2.2.1-101000012024030709, target version: oceanbase-ce-4.2.5.2-102010012025022610.el7.x86_64

  2. The standby cluster upgraded successfully.

  3. The upgrade task on the 3-node primary cluster is stuck.

Stuck subtask: Execute upgrade health checker script
Every retry fails, and the step cannot be skipped either.

Log:
subtask_9171685.log (76.1 KB)
The last two lines of the log are the strangest:


2025-03-31 23:33:09.920  INFO 25034 --- [manual-subtask-executor12,a0327b7a330a68f8,7969b8d17f14f688] c.o.o.s.task.util.AgentAsyncTaskHelper   : try to request task result(EXECUTE), result:false,null,,<null>

2025-03-31 23:33:09.922  INFO 25034 --- [manual-subtask-executor12,a0327b7a330a68f8,7969b8d17f14f688] c.o.ocp.common.lang.pattern.Retry        : wait for 15 seconds

2025-03-31 23:33:24.924  INFO 25034 --- [manual-subtask-executor12,a0327b7a330a68f8,7969b8d17f14f688] c.o.o.e.internal.template.HttpTemplate   : POST request to agent, url:http://10.10.10.38:62888/api/v1/task/status, request body:GetTaskStatusRequest(taskToken=08470b2b-c8b0-40dc-9b82-9a4dc4181152), params:null

2025-03-31 23:33:24.930  INFO 25034 --- [manual-subtask-executor12,a0327b7a330a68f8,7969b8d17f14f688] c.o.o.s.task.util.AgentAsyncTaskHelper   : try to request task result(EXECUTE), result:true,null,,<null>

2025-03-31 23:33:24.933  INFO 25034 --- [manual-subtask-executor12,a0327b7a330a68f8,7969b8d17f14f688] c.o.o.s.t.b.c.helper.UpgradeTaskHelper   : execute script upgrade_health_checker.py finished

Set state for subtask: 9171685, operation:RETRY, state: SUCCESSFUL
Set state for subtask: 9171685, operation:RETRY, state: FAILED
 

It first says SUCCESSFUL, then FAILED. I suspect a bug in this OCP version.
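A quick way to confirm the flip is to pull just the state-transition lines out of the subtask log. A minimal sketch (the scratch file below only reproduces the two lines quoted above; point grep at the real subtask_9171685.log):

```shell
# Reproduce the two state lines quoted above in a scratch file,
# then extract every state transition recorded for the subtask.
cat > /tmp/subtask_9171685_excerpt.log <<'EOF'
Set state for subtask: 9171685, operation:RETRY, state: SUCCESSFUL
Set state for subtask: 9171685, operation:RETRY, state: FAILED
EOF
grep -n 'Set state for subtask' /tmp/subtask_9171685_excerpt.log
```

On the real log this makes the SUCCESSFUL-then-FAILED pattern for each retry visible at a glance.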

Checking ocp.log, I found the following entries:

2025-03-31 23:51:15.370  WARN 2696 --- [manual-subtask-executor1,a0327b7a330a68f8,9444d552e6180363] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 1205, SQLState: HY000
2025-03-31 23:51:15.370 ERROR 2696 --- [manual-subtask-executor1,a0327b7a330a68f8,9444d552e6180363] o.h.engine.jdbc.spi.SqlExceptionHelper   : (conn=1051017) Lock wait timeout exceeded; try restarting transaction
2025-03-31 23:51:15.383  WARN 2696 --- [manual-subtask-executor1,a0327b7a330a68f8,9444d552e6180363] c.o.o.c.t.e.c.w.subtask.SubtaskExecutor  : Subtask execute failed.

org.springframework.dao.PessimisticLockingFailureException: could not execute statement; SQL [n/a]; nested exception is org.hibernate.PessimisticLockException: could not execute statement
        at org.springframework.orm.jpa.vendor.HibernateJpaDialect.convertHibernateAccessException(HibernateJpaDialect.java:271)
        at org.springframework.orm.jpa.vendor.HibernateJpaDialect.translateExceptionIfPossible(HibernateJpaDialect.java:233)
        at org.springframework.orm.jpa.AbstractEntityManagerFactoryBean.translateExceptionIfPossible(AbstractEntityManagerFactoryBean.java:551)
        at org.springframework.dao.support.ChainedPersistenceExceptionTranslator.translateExceptionIfPossible(ChainedPersistenceExceptionTranslator.java:61)
        at org.springframework.dao.support.DataAccessUtils.translateIfNecessary(DataAccessUtils.java:243)
        at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:152)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.data.jpa.repository.support.CrudMethodMetadataPostProcessor$CrudMethodMetadataPopulatingMethodInterceptor.invoke(CrudMethodMetadataPostProcessor.java:145)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:241)
        at com.sun.proxy.$Proxy426.updateStateByIdIn(Unknown Source)
        at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskFinishedProcessor.lambda$handleExecuteTask$3(SubtaskFinishedProcessor.java:84)
        at org.springframework.transaction.support.TransactionOperations.lambda$executeWithoutResult$0(TransactionOperations.java:68)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
        at org.springframework.transaction.support.TransactionOperations.executeWithoutResult(TransactionOperations.java:67)
        at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskFinishedProcessor.transactionWithoutResult(SubtaskFinishedProcessor.java:139)
        at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskFinishedProcessor.handleExecuteTask(SubtaskFinishedProcessor.java:84)
        at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskFinishedProcessor.markRelatedSubtask(SubtaskFinishedProcessor.java:40)
        at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskAccessor.setExecuteEnded(SubtaskAccessor.java:134)
        at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskAccessor$$FastClassBySpringCGLIB$$5fd21405.invoke(<generated>)
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:792)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:762)
        at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:762)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:707)
        at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskAccessor$$EnhancerBySpringCGLIB$$5facfbef.setExecuteEnded(<generated>)
        at com.oceanbase.ocp.core.task.engine.coordinator.worker.subtask.SubtaskExecutor.lambda$submit$2(SubtaskExecutor.java:143)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: org.hibernate.PessimisticLockException: could not execute statement
        at org.hibernate.dialect.MySQLDialect$3.convert(MySQLDialect.java:547)
        at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:37)
        at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:113)
        at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:99)
        at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.executeUpdate(ResultSetReturnImpl.java:200)
        at org.hibernate.hql.internal.ast.exec.BasicExecutor.doExecute(BasicExecutor.java:80)
        at org.hibernate.hql.internal.ast.exec.BasicExecutor.execute(BasicExecutor.java:50)
        at org.hibernate.hql.internal.ast.QueryTranslatorImpl.executeUpdate(QueryTranslatorImpl.java:458)
        at org.hibernate.engine.query.spi.HQLQueryPlan.performExecuteUpdate(HQLQueryPlan.java:376)
        at org.hibernate.internal.SessionImpl.executeUpdate(SessionImpl.java:1483)


@论坛小助手


Based on the information you provided, the OCP task upgrading OB 4.2.2.1 to 4.2.5.2 is stuck, and the log contains the error Lock wait timeout exceeded; try restarting transaction. Here is an analysis and some suggestions:


Problem Analysis

  1. Error code
    Per the documentation, the error Lock wait timeout exceeded; try restarting transaction corresponds to OceanBase error code 6003/6006 (compatible with MySQL error code 1205): a transaction timed out waiting for a lock. Possible causes:

    • A long-running transaction in the database.
    • Lock conflicts from concurrent operations.
  2. Key points in the logs

    • The task Execute upgrade health checker script is first marked SUCCESSFUL and then marked FAILED, which indicates something went wrong while updating the task state.
    • The OCP log shows PessimisticLockingFailureException raised via HibernateJpaDialect.convertHibernateAccessException, pointing at database locking or concurrent access.
  3. Version differences
    You mention the standby cluster upgraded successfully while the primary cluster is stuck. This may come down to differences in load, transaction state, or configuration on the primary cluster.


Resolution Steps

1. Check for and clean up long-running transactions

  • Query the transactions currently running:
    SELECT * FROM oceanbase.__all_virtual_transaction_info;
    
  • If a transaction has been open for a long time without committing, try terminating it manually:
    KILL TRANSACTION <transaction_id>;
    

2. Adjust the lock wait timeout

  • The lock wait timeout can be raised via the lock_wait_timeout variable:
    SET GLOBAL lock_wait_timeout = 10;
    
  • Note: this is a global system variable; it takes effect for new sessions without restarting the OB cluster.

3. Check the OCP metadb connection pool configuration

  • Make sure the OCP connection pool is sized sensibly so that lock waits are not caused by pool exhaustion. Check parameters such as:
    • maximumPoolSize
    • idleTimeout
    • connectionTimeout

4. Retry the upgrade task

  • After clearing the potential issues, re-trigger the OCP upgrade task. If it still fails, try skipping the Execute upgrade health checker script step (if supported).

5. Check differences between the primary and standby clusters

  • Compare the configuration files of the primary and standby clusters (e.g. config.ini) and make sure they match.
  • Check the load on the primary cluster; high load can also cause lock waits.

Notes

  • If the above does not resolve the issue, collect the following and contact OceanBase technical support:
    • The complete log files from the primary cluster.
    • The current system parameters (via SHOW PARAMETERS;).
    • Database performance metrics (CPU, memory, disk I/O).

Summary

Per the documentation, Lock wait timeout exceeded is usually caused by lock conflicts or long-running transactions. Work through the steps above to narrow it down. If the problem persists, please share more context for further analysis.

Details can be found in these documents:

  1. 0001 ~ 3999
  2. 4500 ~ 4999
  3. 10000 ~ 12000
  4. 7000 ~ 7999
  5. 4000 ~ 4499
  6. 5000 ~ 5999

Logs extracted from ocp-server.log around the failed task retry:

2025-04-01 08:13:32.794  WARN 5479 --- [manual-subtask-executor4,a0327b7a330a68f8,42fea0700a7acce3] com.zaxxer.hikari.pool.ProxyConnection   : metadb-connect-pool - Connection com.oceanbase.jdbc.OceanBaseConnection@405b9708 marked as broken because of SQLSTATE(08000), ErrorCode(0)
java.sql.SQLNonTransientConnectionException: (conn=655429) Connection timed out        at com.oceanbase.jdbc.internal.util.exceptions.ExceptionFactory.createException(ExceptionFactory.java:122)        at com.oceanbase.jdbc.internal.util.exceptions.ExceptionFactory.create(ExceptionFactory.java:202)        at com.oceanbase.jdbc.OceanBaseStatement.executeExceptionEpilogue(OceanBaseStatement.java:320)
        at com.oceanbase.jdbc.JDBC4PreparedStatement.executeInternal(JDBC4PreparedStatement.java:257)        at com.oceanbase.jdbc.JDBC4PreparedStatement.execute(JDBC4PreparedStatement.java:171)

This log contains cs_id=655429. That is a connection from OCP to the metadb, not to the business OB cluster.
Next, search the obproxy.log on the OCP server for entries related to this connection.
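obproxy packs several records onto one physical line, which makes plain grep output hard to read. A small sketch (GNU sed; the excerpt file below is a stand-in for the real obproxy.log) that splits records on the timestamp prefix before filtering:

```shell
# Two concatenated obproxy records on one physical line, as in the real log.
cat > /tmp/obproxy_excerpt.log <<'EOF'
[2025-04-01 07:48:17.228104] INFO client session born(cs_id=655429)[2025-04-01 08:13:32.793781] WARN client will abort soon(cs_id=655429)
EOF
# Insert a newline before each "[2025-" timestamp, then filter by cs_id.
sed 's/\[2025-/\n[2025-/g' /tmp/obproxy_excerpt.log | grep 'cs_id=655429'
```

Each record then lands on its own line, so the session's timeline reads top to bottom.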

[admin@RS-OBOPC-P1 log]$ grep 655429 obproxy.log
[2025-04-01 07:02:26.655429] INFO  [PROXY.TXN] ob_mysql_transact.cpp:5761 [4835][Y0-00002B9910CA2BA0] [lt=4] [dc=0] [ObMysqlTransact::handle_server_connection_break](client_ip={10.10.10.37:15248}, server_ip={10.10.10.39:2881}, cs_id=917549, proxy_sessid=723401852955330437, ss_id=0, server_sessid=0, sm_id=4868, proxy_user_name=root@ocp_monitor#ocpcluster, database_name=monitor_database, server_state="INTERNAL_ERROR", request_cmd="Quit", sql_cmd="Quit", sql=)[2025-04-01 07:48:17.228104] INFO  [PROXY.CS] ob_mysql_client_session.cpp:370 [4833][Y0-00002B9910CA2D20] [lt=7] [dc=0] client session born(cs_id=655429, proxy_sessid=0, is_local_connection=false, client_vc=0x2b9910c94580, client_fd=188, client_addr="10.10.10.39:48578")[2025-04-01 07:48:17.228124] INFO  [PROXY.CS] ob_mysql_client_session.cpp:241 [4833][Y0-00002B9910CA2D20] [lt=9] [dc=0] Starting new transaction using sm(cs_id=655429, get_transact_count()=0, sm_id=5632)[2025-04-01 07:48:17.228659] INFO  [PROXY.SS] ob_mysql_server_session.cpp:106 [4833][Y0-00002B9910CA2D20] [lt=6] [dc=0] server session born(ss_id=5640, server_ip={10.10.10.39:2881}, cs_id=655429, proxy_sessid=0, server_type=1)[2025-04-01 07:48:17.228720] INFO  [PROXY.TXN] ob_mysql_transact.cpp:4726 [4833][Y0-00002B9910CA2D20] [lt=8] [dc=0] succ to set proxy_sessid(cs_id=655429, proxy_sessid=723401852955330709, server_ip={10.10.10.39:2881}, ss_id=5640, server_sessid=3221745283, is_proxy_mysql_client=false, ss_fd=195, client_addr="10.10.10.39:48578")[2025-04-01 08:13:32.793781] WARN  [PROXY.SM] set_client_abort (ob_mysql_sm.cpp:8124) [4833][Y0-00002B9910CA2D20] [lt=45] [dc=0] client will abort soon(sm_id=5632, cs_id=655429, proxy_sessid=723401852955330709, ss_id=5640, server_sessid=3221745283, client_ip={10.10.10.39:48578}, server_ip={10.10.10.39:2881}, cluster_name=ocpcluster, tenant_name=ocp_meta, user_name=root, db=meta_database, event="VC_EVENT_EOS", request_cmd="Query", sql_cmd="Query", sql=update `subtask_instance` set `state`='PAUSED' where 
`id` in (9171685))[2025-04-01 08:13:32.793910] WARN  [PROXY.SM] update_cmd_stats (ob_mysql_sm.cpp:10239) [4833][Y0-00002B9910CA2D20] [lt=5] [dc=0] Slow Query: ((client_ip={10.10.10.39:48578}, server_ip={10.10.10.39:2881}, obproxy_client_port={10.10.10.39:62722}, server_trace_id=, route_type=ROUTE_TYPE_LEADER, user_name=root, tenant_name=ocp_meta, cluster_name=ocpcluster, logic_database_name=, logic_tenant_name=, ob_proxy_protocol=2, cs_id=655429, proxy_sessid=723401852955330709, ss_id=5640, server_sessid=3221745283, sm_id=5632, cmd_size_stats={client_request_bytes:75, server_request_bytes:191, server_response_bytes:0, client_response_bytes:0}, cmd_time_stats={client_transaction_idle_time_us=0, client_request_read_time_us=7, client_request_analyze_time_us=6, cluster_resource_create_time_us=0, pl_lookup_time_us=3, pl_process_time_us=2, bl_lookup_time_us=0, bl_process_time_us=0, congestion_control_time_us=13, congestion_process_time_us=0, do_observer_open_time_us=1, server_connect_time_us=0, server_sync_session_variable_time_us=0, server_send_saved_login_time_us=0, server_send_use_database_time_us=0, server_send_session_variable_time_us=0, server_send_session_user_variable_time_us=0, server_send_all_session_variable_time_us=0, server_send_start_trans_time_us=0, server_send_xa_start_time_us=0, server_send_init_sql_time_us=0, build_server_request_time_us=4, plugin_compress_request_time_us=0, prepare_send_request_to_server_time_us=37, server_request_write_time_us=14, server_process_request_time_us=0, server_response_read_time_us=0, plugin_decompress_response_time_us=0, server_response_analyze_time_us=0, ok_packet_trim_time_us=0, client_response_write_time_us=0, request_total_time_us=30029398}, sql=update `subtask_instance` set `state`='PAUSED' where `id` in (9171685), trans_internal_routing_state=not in trans)[2025-04-01 08:13:32.794009] WARN  [PROXY.SM] update_stats (ob_mysql_sm.cpp:10366) [4833][Y0-00002B9910CA2D20] [lt=16] [dc=0] Slow transaction: 
((client_ip={10.10.10.39:48578}, server_ip={10.10.10.39:2881}, server_trace_id=, route_type=ROUTE_TYPE_LEADER, ob_proxy_protocol=2, cs_id=655429, proxy_sessid=723401852955330709, ss_id=5640, server_sessid=3221745283, sm_id=5632, state=1, trans_stats_={client_requests:1, server_responses:0, pl_lookup_retries:0, server_retries:0, client_request_bytes:75, server_request_bytes:191, server_response_bytes:0, client_response_bytes:0, client_transaction_idle_time_us=0, client_process_request_time_us=21, client_request_read_time_us=7, client_request_analyze_time_us=6, cluster_resource_create_time_us=0, pl_lookup_time_us=3, pl_process_time_us=2, congestion_control_time_us=13, congestion_process_time_us=0, do_observer_open_time_us=1, server_connect_time_us=0, sync_session_variable_time_us=0, send_saved_login_time_us=0, send_use_database_time_us=0, send_session_vars_time_us=0, send_session_user_vars_time_us=0, send_all_session_vars_time_us=0, send_start_trans_time_us=0, send_xa_start_time_us=0, build_server_request_time_us=4, plugin_compress_request_time_us=0, prepare_send_request_to_server_time_us=37, server_request_write_time_us=14, server_process_request_time_us=0, server_response_read_time_us=0, plugin_decompress_response_time_us=0, server_response_analyze_time_us=0, ok_packet_trim_time_us=0, client_response_write_time_us=0, trans_time_us=30029630}, last_sql=update `subtask_instance` set `state`='PAUSED' where `id` in (9171685))[2025-04-01 08:13:32.794063] INFO  [PROXY.SS] ob_mysql_server_session.cpp:153 [4833][Y0-00002B9910CA2D20] [lt=21] [dc=0] server session do_io_close((*this={ss_id:5640, server_sessid:3221745283, server_ip:{10.10.10.39:2881}, is_inited:true, magic:19136237, state:1, server_vc:0x2b991246bbd0, client_session:{this:0x2b9912089120, is_proxy_mysql_client:false, is_waiting_trans_first_request:false, need_delete_cluster:false, is_first_dml_sql_got:false, vc_ready_killed:false, active:true, magic:19132429, conn_decrease:true, current_tid:4833, cs_id:655429, 
proxy_sessid:723401852955330709, session_info:{is_inited:true, priv_info:{has_all_privilege:false, cs_id:655429, user_priv_set:133009965054, cluster_name:"ocpcluster", tenant_name:"ocp_meta", user_name:"root"}, version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:6968, sys_var_version:6801, user_var_version:1, db_name_version:1, last_insert_id_version:6, sess_info_version:11099}, hash_version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:0, sys_var_version:0, user_var_version:0, db_name_version:0, last_insert_id_version:0, sess_info_version:0}, val_hash:{common_hot_sys_var_hash:0, common_cold_sys_var_hash:0, mysql_hot_sys_var_hash:0, mysql_cold_sys_var_hash:0, hot_sys_var_hash:0, cold_sys_var_hash:0, user_var_hash:0}, global_vars_version:0, is_global_vars_changed:false, is_trans_specified:false, is_user_idc_name_set:false, is_read_consistency_set:false, idc_name:"", cluster_id:0, real_meta_cluster_name:"", safe_read_snapshot:1, syncing_safe_read_snapshot:0, route_policy:1, proxy_route_policy:3, user_identity:0, global_vars_version:0, is_read_only_user:false, is_request_follower_user:false, obproxy_force_parallel_query_dop:1, ob20_request:{remain_payload_len:0, ob20_request_received_done:false, ob20_header:{ob 20 protocol header:{compressed_len:0, seq:0, non_compressed_len:0}, magic_num:0, header_checksum:0, connection_id:0, request_id:0, pkt_seq:0, payload_len:0, version:0, flag_.flags:0, reserved:0}}, client_cap:72, server_cap:916303, last_server_addr:{*Not IP address [0]*:0}, last_server_sess_id:0, init_sql:""}, dummy_ldc:{use_ldc:false, idc_name:"", item_count:1, site_start_index_array:[[0]0, [1]1, [2]1, [3]1], item_array:[[0]{idc_type:"SAME_IDC", zone_type:"ReadWrite", replica:{server:"10.10.10.39:2881", is_dup_replica:false, role:"FOLLOWER", type:"FULL"}, is_merging:false, 
is_partition_server:false, is_force_congested:false, is_used:false}], pl:null, ts:0x2b9903f00030, readonly_exist_status:"READONLY_ZONE_NOT_EXIST"}, dummy_entry:0x2b9903f9bbe0, server_state_version:3, cur_ss:0x2b9912204be0, bound_ss:null, lii_ss:0x2b9912204be0, cluster_resource:{this:0x2b9910023080, ref_count:182, is_inited:true, cluster_info_key:{cluster_name:{config_string:"ocpcluster"}, cluster_id:0}, cr_state:"CR_AVAIL", version:2, last_access_time_ns:1743466412669165417, deleting_completed_thread_num:0, fetch_rslist_task_count:0, fetch_idc_list_task_count:0, last_idc_list_refresh_time_ns:1743437552565904769, last_rslist_refresh_time_ns:0, server_state_version:3}, client_vc:0x2b9910c94580, using_ldg:false, trace_stats:NULL}, transact_count:26716, server_trans_stat:1, session_info:{cap:916303, ob_server:, version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:6968, sys_var_version:6801, user_var_version:1, db_name_version:1, last_insert_id_version:6, sess_info_version:11099}, val_hash:{common_hot_sys_var_hash:0, common_cold_sys_var_hash:0, mysql_hot_sys_var_hash:0, mysql_cold_sys_var_hash:0, hot_sys_var_hash:0, cold_sys_var_hash:0, user_var_hash:0}, checksum_switch:1, server_type:1}}, server_vc_=0x2b991246bbd0, this=0x2b9912204be0)
[2025-04-01 08:13:32.794176] INFO  [PROXY.SS] ob_mysql_server_session.cpp:190 [4833][Y0-00002B9910CA2D20] [lt=84] [dc=0] server session is closing(ss_id=5640, server_sessid=3221745283, server_ip={10.10.10.39:2881}, cs_id=655429, proxy_sessid=723401852955330709)
[2025-04-01 08:13:32.794206] INFO  [PROXY.SS] ob_mysql_client_session.cpp:652 [4833][Y0-00002B9910CA2D20] [lt=8] [dc=0] client session do_io_close((*this={this:0x2b9912089120, is_proxy_mysql_client:false, is_waiting_trans_first_request:false, need_delete_cluster:false, is_first_dml_sql_got:false, vc_ready_killed:false, active:true, magic:19132429, conn_decrease:true, current_tid:4833, cs_id:655429, proxy_sessid:723401852955330709, session_info:{is_inited:true, priv_info:{has_all_privilege:false, cs_id:655429, user_priv_set:133009965054, cluster_name:"ocpcluster", tenant_name:"ocp_meta", user_name:"root"}, version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:6968, sys_var_version:6801, user_var_version:1, db_name_version:1, last_insert_id_version:6, sess_info_version:11099}, hash_version:{common_hot_sys_var_version:0, common_sys_var_version:0, mysql_hot_sys_var_version:0, mysql_sys_var_version:0, hot_sys_var_version:0, sys_var_version:0, user_var_version:0, db_name_version:0, last_insert_id_version:0, sess_info_version:0}, val_hash:{common_hot_sys_var_hash:0, common_cold_sys_var_hash:0, mysql_hot_sys_var_hash:0, mysql_cold_sys_var_hash:0, hot_sys_var_hash:0, cold_sys_var_hash:0, user_var_hash:0}, global_vars_version:0, is_global_vars_changed:false, is_trans_specified:false, is_user_idc_name_set:false, is_read_consistency_set:false, idc_name:"", cluster_id:0, real_meta_cluster_name:"", safe_read_snapshot:1, syncing_safe_read_snapshot:0, route_policy:1, proxy_route_policy:3, user_identity:0, global_vars_version:0, is_read_only_user:false, is_request_follower_user:false, obproxy_force_parallel_query_dop:1, ob20_request:{remain_payload_len:0, ob20_request_received_done:false, ob20_header:{ob 20 protocol header:{compressed_len:0, seq:0, non_compressed_len:0}, magic_num:0, header_checksum:0, connection_id:0, request_id:0, pkt_seq:0, payload_len:0, version:0, flag_.flags:0, reserved:0}}, 
client_cap:72, server_cap:916303, last_server_addr:{*Not IP address [0]*:0}, last_server_sess_id:0, init_sql:""}, dummy_ldc:{use_ldc:false, idc_name:"", item_count:1, site_start_index_array:[[0]0, [1]1, [2]1, [3]1], item_array:[[0]{idc_type:"SAME_IDC", zone_type:"ReadWrite", replica:{server:"10.10.10.39:2881", is_dup_replica:false, role:"FOLLOWER", type:"FULL"}, is_merging:false, is_partition_server:false, is_force_congested:false, is_used:false}], pl:null, ts:0x2b9903f00030, readonly_exist_status:"READONLY_ZONE_NOT_EXIST"}, dummy_entry:0x2b9903f9bbe0, server_state_version:3, cur_ss:null, bound_ss:null, lii_ss:null, cluster_resource:{this:0x2b9910023080, ref_count:182, is_inited:true, cluster_info_key:{cluster_name:{config_string:"ocpcluster"}, cluster_id:0}, cr_state:"CR_AVAIL", version:2, last_access_time_ns:1743466412669165417, deleting_completed_thread_num:0, fetch_rslist_task_count:0, fetch_idc_list_task_count:0, last_idc_list_refresh_time_ns:1743437552565904769, last_rslist_refresh_time_ns:0, server_state_version:3}, client_vc:0x2b9910c94580, using_ldg:false, trace_stats:NULL}, client_vc_=0x2b9910c94580, this=0x2b9912089120)
[2025-04-01 08:13:32.794257] INFO  [PROXY.CS] ob_mysql_client_session.cpp:94 [4833][Y0-00002B9910CA2D20] [lt=44] [dc=0] client session destroy(cs_id=655429, proxy_sessid=723401852955330709, client_vc=NULL)

From OCP's obproxy.log we can see that this connection was established well before the task retry (2025-04-01 07:48:17.228104). The retry ran at 2025-04-01 08:12:47.581, and the connection broke abnormally at 2025-04-01 08:13:32.793. The SQL executing at the moment of the break was:

sql_cmd="Query", sql=update `subtask_instance` set `state`='PAUSED' where `id` in (9171685))

Right after that, the same SQL is flagged as a slow query, with an execution time of about 30 s (request_total_time_us=30029398).
The metadb SQL timeout is 100 s:

obclient(root@ocp_meta)[meta_database]> show global variables like '%timeout%';
+---------------------+------------------+
| Variable_name       | Value            |
+---------------------+------------------+
| connect_timeout     | 10               |
| interactive_timeout | 28800            |
| lock_wait_timeout   | 31536000         |
| net_read_timeout    | 30               |
| net_write_timeout   | 60               |
| ob_pl_block_timeout | 3216672000000000 |
| ob_query_timeout    | 100000000        |
| ob_trx_idle_timeout | 86400000000      |
| ob_trx_lock_timeout | -1               |
| ob_trx_timeout      | 86400000000      |
| wait_timeout        | 28800            |
+---------------------+------------------+
11 rows in set (0.003 sec)
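A quick sanity check on the numbers (constants hard-coded from the slow-query record and the variable listing above):

```shell
# request_total_time_us comes from the obproxy slow-query record;
# the timeouts come from the `show global variables` output above.
request_total_time_us=30029398
net_read_timeout_s=30
ob_query_timeout_us=100000000
echo "query hung for $((request_total_time_us / 1000000)) s"    # 30 s
echo "net_read_timeout: ${net_read_timeout_s} s"                # matches
echo "ob_query_timeout: $((ob_query_timeout_us / 1000000)) s"   # 100 s, never reached
```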

The metadb is under little load, and this SQL could not plausibly be slow. 30 s is exactly net_read_timeout. OCP and the metadb are on the same server, so the problem is very likely somewhere in OCP itself. @论坛小助手


Based on the detailed information and logs you provided, here is an analysis and some possible solutions:

Problem Analysis

  1. Why the task is stuck

    • The primary-cluster upgrade fails at the Execute upgrade health checker script step and cannot be skipped.
    • The log shows the task state is first marked SUCCESSFUL and then marked FAILED.
  2. Lock wait timeout

    • The OCP log contains Lock wait timeout exceeded; try restarting transaction, indicating a transaction lock conflict or a long-uncommitted transaction.
    • The lock timeout is likely caused by some SQL running long enough that other transactions cannot acquire the lock.
  3. Abnormal disconnect

    • The OBProxy log shows connection cs_id=655429 being flagged as a slow query while executing update `subtask_instance` set `state`='PAUSED' where `id` in (9171685), then disconnecting.
    • The disconnect happens right around net_read_timeout (30 seconds), which suggests a network-layer or client-configuration issue.
  4. SQL execution time

    • Although you note the metadb is not under pressure, this SQL took about 30 seconds (request_total_time_us=30029398), far beyond the normal range.
    • That is typically caused by lock conflicts, a missing index, or a full table scan.

Solutions

1. Check for lock conflicts

Per documentation fragments 6 and 10, there may be a lock conflict in the OceanBase database. To investigate:

  • Query the current transaction and lock information:
    SELECT * FROM oceanbase.__all_virtual_transaction_info;
    
  • If there is a long-uncommitted transaction, try rolling it back manually:
    ROLLBACK;
    

2. Adjust timeout parameters

  • net_read_timeout is currently 30 seconds, which can cut off connections during long operations. Consider raising it:
    SET GLOBAL net_read_timeout = 120;
    
  • Also check whether ob_query_timeout is reasonable (the default is 10 s) and adjust it to the actual workload if needed:
    SET GLOBAL ob_query_timeout = 300000000; -- in microseconds; 300 seconds here
    

3. Optimize SQL performance

  • Check the indexes on subtask_instance and make sure the id column has a suitable index:
    SHOW INDEX FROM subtask_instance;
    
  • If the index is missing, add one to speed up the update:
    ALTER TABLE subtask_instance ADD INDEX idx_id (id);
    

4. Investigate OCP itself

  • Per the logs, OCP itself may have a performance bottleneck or be short on resources. Check OCP's system resource usage (CPU, memory, disk I/O, etc.).
  • Check the OCP task logs to see whether other concurrent tasks are interfering with the upgrade:
    tail -f /path/to/ocp/tmp/upgrade_*_log
    

5. Retry the upgrade task

  • After making the adjustments above, relaunch the upgrade task. If it still fails, try clearing the failed task records and retrying:
    obd cluster restart <cluster_name>

Other Suggestions

  • Version compatibility: make sure the OCP and OB versions are compatible. The standby cluster upgraded successfully, but the primary cluster may still hit a compatibility issue in a specific scenario.
  • Contact support: if the problem persists, provide the complete log files (including ocp-server.log and obproxy.log) plus task screenshots, and contact the OceanBase technical support team for further analysis.

Hope this helps! Feel free to add more details or new logs.

More information:

  1. 10000 ~ 12000
  2. Connection diagnostics
  3. Concurrency control overview
  4. 4500 ~ 4999
  5. Lock mechanisms
  6. How to ask technical support when you hit a problem
  7. Analysis, diagnosis & decision workflow

Each retry writes a success record, followed about thirty seconds later by a failure record. The same pattern shows up in the metadb SQL audit:
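The `context` column in these audit rows is base64-encoded JSON. A minimal sketch for inspecting such a blob (the string below is a short made-up stand-in, not the real task context; `base64 -d` as in GNU coreutils):

```shell
# Decode a base64-encoded task context back to readable JSON.
# This sample blob is fabricated for illustration.
ctx='eyJwYXJhbGxlbCI6ZmFsc2UsInBhcmFsbGVsSWR4IjotMX0='
echo "$ctx" | base64 -d
# -> {"parallel":false,"parallelIdx":-1}
```

For the real rows, paste the whole `context` value into a file and pipe it through `base64 -d` the same way.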

| 2025-04-01 08:50:08.571955 | root      | update `subtask_instance` set `class_name`='com.oceanbase.ocp.service.task.business.cluster.ExecUpgradeHealthCheckerScriptTask', `context`='eyJjb25jdXJyZW5jeSI6LTEsImxpc3RNYXAiOnsic3RhbmRieV9
jbHVzdGVyX2lkcyI6W10sIjE3MTEzNTE0ODUub2Jycy56b25lMy5zZXJ2ZXJfaWRzIjpbIjQiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUxLnNlcnZlcl9pZHMiOlsiMiJdLCJjbHVzdGVyX2lkcyI6WyIyIl0sInByaW1hcnlfZmlyc3RfY2x1c3Rlcl9pZHMiOlsiMiJdLCIxNzExMzUxNDg1Lm9icnMuem9uZTMuaG9zdF9pZHMiOlsiNCJdLCJzdGFuZGJ5X2ZpcnN0X2NsdXN0ZXJfaWRzIjpbIjIiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUyLmhvc3RfaWRzIjpbIjMiXSwic3RhZ2VkX3ZlcnNpb25zIjpbIjQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwIl0sIjE3MTEzNTE0ODUub2Jycy56b25lMi5zZXJ2ZXJfaWRzIjpbIjMiXSwiZGlyZWN0X3VwZ3JhZGVfdmVyc2lvbnMiOltdLCIxNzExMzUxNDg1Lm9icnMuem9uZV9uYW1lcyI6WyJ6b25lMSIsInpvbmUyIiwiem9uZTMiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUxLmhvc3RfaWRzIjpbIjIiXSwiem9uZV9uYW1lcyI6WyJ6b25lMSIsInpvbmUyIiwiem9uZTMiXSwic2VydmVyX2lkcyI6WyIyIiwiMyIsIjQiXSwiaG9zdF9pZHMiOlsiMiIsIjMiLCI0Il19LCJwYXJhbGxlbCI6ZmFsc2UsInBhcmFsbGVsSWR4IjotMSwic3RyaW5nTWFwIjp7InN1YnRhc2tfc3BsaXR0ZXIiOiJwcmltYXJ5X2ZpcnN0X2NsdXN0ZXJfaWRzIiwiY2x1c3Rlcl92ZXJzaW9uIjoiNC4yLjIuMSIsImNsdXN0ZXJfbmFtZSI6Im9icnMiLCJjbHVzdGVyX2lkIjoiMiIsIm9jcGFnZW50X3NlcnZpY2VfbmFtZSI6ImFnZW50IiwicHJpbWFyeV9yb290X3N5c19wYXNzd29yZCI6IlJTcnNAIzIwMjQiLCJvcGVyYXRpbmdfc3lzdGVtIjoiZWw3IiwidGFyZ2V0X2NsdXN0ZXJfc3RhdHVzIjoiUlVOTklORyIsIndhdGNoZWRfc2NlbmFya
W8iOiJVUEdSQURFIiwicHJpbWFyeV9jbHVzdGVyX2lkIjoiMiIsInRhcmdldF92ZXJzaW9uIjoiNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAiLCJ0YXNrX2luc3RhbmNlX2lkIjoiNzk0MTYyNyIsImxhdGVzdF9leGVjdXRpb25fc3RhcnRfdGltZSI6IjIwMjUtMDQtMDFUMDg6NDk6NTMuMzgxKzA4OjAwIiwib2Jycy4xNzExMzUxNDg1LndhdGNoZWRfcGFyYW1ldGVycyI6IntcInNlcnZlcl9wZXJtYW5lbnRfb2ZmbGluZV90aW1lXCI6XCIzNjAwc1wiLFwiZW5hYmxlX3JlYmFsYW5jZVwiOlwie1xcXCJ0ZW4wMDFcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDAyXFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAwM1xcXCI6XFxcIlRydWVcXFwiLFxcXCJ0ZW4wMDRcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDEwXFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAxMVxcXCI6XFxcIlRydWVcXFwiLFxcXCJvZGNfbWV0YVxcXCI6XFxcIlRydWVcXFwiLFxcXCJvbXNfbWV0YVxcXCI6XFxcIlRydWVcXFwiLFxcXCJzeXNcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDA1XFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAwNlxcXCI6XFxcIlRydWVcXFwiLFxcXCJ0ZW4wMDdcXFwiOlxcXCJUcnVlXFxcIn1cIixcImVuYWJsZV9yZXJlcGxpY2F0aW9uXCI6XCJUcnVlXCJ9Iiwib2JfY2x1c3Rlcl9pZCI6IjE3MTEzNTE0ODUiLCJtaXhfcnBtX25hbWVzIjoie1wiem9uZTNcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwiLFwiem9uZTJcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwiLFwiem9uZTFcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwifSIsImRlcF9ycG1fbmFtZXMiOiJ7XCJ6b25lM1wiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl0sXCJ6b25lMlwiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl0sXCJ6b25lMVwiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl19Iiwib2JfaW5zdGFsbF9wYXRoIjoiL2hvbWUvYWRtaW4vb2NlYW5iYXNlIiwib2JfcnVuX3BhdGgiOiIvaG9tZS9hZG1pbi9vY2VhbmJhc2UiLCJvYl9ydW5fdXNlciI6ImFkbWluIiwib2JfZGF0YV9kaXNrX3BhdGgiOiIvZGF0YS8xIiwib2JfbG9nX2Rpc
2tfcGF0aCI6Ii9kYXRhL2xvZzEiLCJvYl9kaXNrX3BhdGhfc3R5bGUiOiJERUZBVUxUIiwib2Jfc3FsX3BvcnQiOiIyODgxIiwib2Jfc3Z
yX3BvcnQiOiIyODgyIn19', `end_time`='2025-04-01 08:50:08.57', `executor`='10.10.10.39:8080', `name`='Execute upgrade health checker script', `node_type`='JAVA_TASK', `operation`='RETRY', `parallel_idx`=-1, `run_time`=23, `series_id`=30, `start_time`='2025-03-31 22:33:41.724', `state`='SUCCESSFUL', `task_id`=7941627, `timeout`=1800, `version`=NULL where `id`=9171685 |        0 || 2025-04-01 08:50:08.573149 | root      | INSERT INTO `subtask_log`(`subtask_id`, `log_content`, `run_time`) VALUES (9171685, _binary 'Set state for subtask: 9171685, operation:RETRY, state: SUCCESSFUL', 23)                                                                                                                                                                                                                                            

......

| 2025-04-01 08:50:38.612806 | root      | update `subtask_instance` set `class_name`='com.oceanbase.ocp.service.task.business.cluster.ExecUpgradeHealthCheckerScriptTask', `context`='eyJjb25jdXJyZW5jeSI6LTEsImxpc3RNYXAiOnsic3RhbmRieV9
jbHVzdGVyX2lkcyI6W10sIjE3MTEzNTE0ODUub2Jycy56b25lMy5zZXJ2ZXJfaWRzIjpbIjQiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUxLnNlcnZlcl9pZHMiOlsiMiJdLCJjbHVzdGVyX2lkcyI6WyIyIl0sInByaW1hcnlfZmlyc3RfY2x1c3Rlcl9pZHMiOlsiMiJdLCIxNzExMzUxNDg1Lm9icnMuem9uZTMua
G9zdF9pZHMiOlsiNCJdLCJzdGFuZGJ5X2ZpcnN0X2NsdXN0ZXJfaWRzIjpbIjIiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUyLmhvc3RfaWRzIjpbIjMiXSwic3RhZ2VkX3ZlcnNpb25zIjpbIjQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwIl0sIjE3MTEzNTE0ODUub2Jycy56b25lMi5zZXJ2ZXJfaWRzIjpbIjM
iXSwiZGlyZWN0X3VwZ3JhZGVfdmVyc2lvbnMiOltdLCIxNzExMzUxNDg1Lm9icnMuem9uZV9uYW1lcyI6WyJ6b25lMSIsInpvbmUyIiwiem9uZTMiXSwiMTcxMTM1MTQ4NS5vYnJzLnpvbmUxLmhvc3RfaWRzIjpbIjIiXSwiem9uZV9uYW1lcyI6WyJ6b25lMSIsInpvbmUyIiwiem9uZTMiXSwic2VydmVyX2lkc
yI6WyIyIiwiMyIsIjQiXSwiaG9zdF9pZHMiOlsiMiIsIjMiLCI0Il19LCJwYXJhbGxlbCI6ZmFsc2UsInBhcmFsbGVsSWR4IjotMSwic3RyaW5nTWFwIjp7InN1YnRhc2tfc3BsaXR0ZXIiOiJwcmltYXJ5X2ZpcnN0X2NsdXN0ZXJfaWRzIiwiY2x1c3Rlcl92ZXJzaW9uIjoiNC4yLjIuMSIsImNsdXN0ZXJfbmF
tZSI6Im9icnMiLCJjbHVzdGVyX2lkIjoiMiIsIm9jcGFnZW50X3NlcnZpY2VfbmFtZSI6ImFnZW50IiwicHJpbWFyeV9yb290X3N5c19wYXNzd29yZCI6IlJTcnNAIzIwMjQiLCJvcGVyYXRpbmdfc3lzdGVtIjoiZWw3IiwidGFyZ2V0X2NsdXN0ZXJfc3RhdHVzIjoiUlVOTklORyIsIndhdGNoZWRfc2NlbmFya
W8iOiJVUEdSQURFIiwicHJpbWFyeV9jbHVzdGVyX2lkIjoiMiIsInRhcmdldF92ZXJzaW9uIjoiNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAiLCJ0YXNrX2luc3RhbmNlX2lkIjoiNzk0MTYyNyIsImxhdGVzdF9leGVjdXRpb25fc3RhcnRfdGltZSI6IjIwMjUtMDQtMDFUMDg6NDk6NTMuMzgxKzA4OjAwIiw
ib2Jycy4xNzExMzUxNDg1LndhdGNoZWRfcGFyYW1ldGVycyI6IntcInNlcnZlcl9wZXJtYW5lbnRfb2ZmbGluZV90aW1lXCI6XCIzNjAwc1wiLFwiZW5hYmxlX3JlYmFsYW5jZVwiOlwie1xcXCJ0ZW4wMDFcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDAyXFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAwM1xcX
CI6XFxcIlRydWVcXFwiLFxcXCJ0ZW4wMDRcXFwiOlxcXCJUcnVlXFxcIixcXFwidGVuMDEwXFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAxMVxcXCI6XFxcIlRydWVcXFwiLFxcXCJvZGNfbWV0YVxcXCI6XFxcIlRydWVcXFwiLFxcXCJvbXNfbWV0YVxcXCI6XFxcIlRydWVcXFwiLFxcXCJzeXNcXFwiOlxcXCJ
UcnVlXFxcIixcXFwidGVuMDA1XFxcIjpcXFwiVHJ1ZVxcXCIsXFxcInRlbjAwNlxcXCI6XFxcIlRydWVcXFwiLFxcXCJ0ZW4wMDdcXFwiOlxcXCJUcnVlXFxcIn1cIixcImVuYWJsZV9yZXJlcGxpY2F0aW9uXCI6XCJUcnVlXCJ9Iiwib2JfY2x1c3Rlcl9pZCI6IjE3MTEzNTE0ODUiLCJtaXhfcnBtX25hbWVzI
joie1wiem9uZTNcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwiLFwiem9uZTJcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTEwMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwiLFwiem9uZTFcIjpcIm9jZWFuYmFzZS1jZS00LjIuNS4yLTE
wMjAxMDAxMjAyNTAyMjYxMC5lbDcueDg2XzY0LnJwbVwifSIsImRlcF9ycG1fbmFtZXMiOiJ7XCJ6b25lM1wiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2M
TAuZWw3Lng4Nl82NC5ycG1cIl0sXCJ6b25lMlwiOltcIm9jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl0sXCJ6b25lMVwiOltcIm9
jZWFuYmFzZS1jZS1saWJzLTQuMi41LjItMTAyMDEwMDEyMDI1MDIyNjEwLmVsNy54ODZfNjQucnBtXCIsXCJvY2VhbmJhc2UtY2UtdXRpbHMtNC4yLjUuMi0xMDIwMTAwMTIwMjUwMjI2MTAuZWw3Lng4Nl82NC5ycG1cIl19Iiwib2JfaW5zdGFsbF9wYXRoIjoiL2hvbWUvYWRtaW4vb2NlYW5iYXNlIiwib2Jfc
nVuX3BhdGgiOiIvaG9tZS9hZG1pbi9vY2VhbmJhc2UiLCJvYl9ydW5fdXNlciI6ImFkbWluIiwib2JfZGF0YV9kaXNrX3BhdGgiOiIvZGF0YS8xIiwib2JfbG9nX2Rpc2tfcGF0aCI6Ii9kYXRhL2xvZzEiLCJvYl9kaXNrX3BhdGhfc3R5bGUiOiJERUZBVUxUIiwib2Jfc3FsX3BvcnQiOiIyODgxIiwib2Jfc3Z
yX3BvcnQiOiIyODgyIn19', `end_time`='2025-04-01 08:50:38.611', `executor`='10.10.10.39:8080', `name`='Execute upgrade health checker script', `node_type`='JAVA_TASK', `operation`='RETRY', `parallel_idx`=-1, `run_time`=23, `series_id`=3
0, `start_time`='2025-03-31 22:33:41.724', `state`='FAILED', `task_id`=7941627, `timeout`=1800, `version`=NULL where `id`=9171685 |        0 |
| 2025-04-01 08:50:38.613989 | root      | INSERT INTO `subtask_log`(`subtask_id`, `log_content`, `run_time`) VALUES (9171685, _binary 'Set state for subtask: 9171685, operation:RETRY, state: FAILED', 23) 

This suggests that OCP may have run the same check twice, writing a state for each run.

Have you had a chance to look into this?
Version upgrades are tightly bound to a specific release, including the upgrade procedure itself. OCP's upgrade feature is not a cure-all.

https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000001502708

That is the documentation link for the version; please take a look.

See this: "OceanBase Database CE Release Notes" on the OceanBase site.

The OceanBase community has received your post and is following up.

Running OB's check script by itself works fine; the upgrade_health_checker step passes without any issue.

The problem is simply that the OCP task cannot be skipped.

With the support of OceanBase community experts, the problem has been solved. Many thanks!

To summarize:

  1. The symptom: a subtask's log first reports success, then failure.
Set state for subtask: 9171685, operation:RETRY, state: SUCCESSFUL
Set state for subtask: 9171685, operation:RETRY, state: FAILED

Afterwards, several other subtasks showed the same behavior; which subtask was affected was not fixed.

  2. First, check whether the upgrade_health_checker subtask really failed.
    Based on the log, log in to an OB node and run the check manually:
python upgrade_health_checker.py -h 127.0.0.1 -P 2881 -u root@sys -pxxxxxxx

The result shows no errors, which confirms that this step is unrelated to the business OB cluster.
Suspicion therefore turns to OCP itself.

  3. Analyze OCP's log ocp-server.log, which contains some exception messages.
    The log is voluminous and messy; in the end it was not useful to me (though it would be useful to the product developers), so I set it aside.

  4. Analyze the metadata database meta_database in the OCP meta tenant ocp_meta. Capture the SQL audit view and locate the two UPDATE records related to this task.

This confirms that OCP first judged the task successful, then about 20 seconds later judged it failed; the session_id field shows the two updates came from two different client connections.
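For reference, the two UPDATE records can be fished out of the SQL audit view with a query along these lines (a sketch only: run from the sys tenant, view and column names are per OceanBase 4.x GV$OB_SQL_AUDIT, and the filter values are specific to this case):

```sql
SELECT request_time, sid, ret_code, query_sql
FROM oceanbase.GV$OB_SQL_AUDIT
WHERE tenant_name = 'ocp_meta'
  AND query_sql LIKE '%subtask_instance%9171685%'
ORDER BY request_time;
```

Comparing the `sid` values of the two matching rows is what shows the updates came from different connections.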

Every retry behaved the same way, which suggests the task framework in this new OCP version is at fault: it may use some concurrency mechanism (coroutines?) that is not handled correctly, causing the error to surface at random.
Other subtasks later exhibited the same behavior, also at random.

This also explains why entries such as lock wait timeout occasionally appear in OCP's log: the parallel updates may be blocking one another. (A third UPDATE, setting the state to PAUSE, can be found in OCP's log but not in the SQL audit; presumably it was blocked, then timed out and was rolled back.)
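The suspected race can be sketched as follows. This is illustrative only: the table and column names are simplified stand-ins, not OCP's actual schema, and the "guarded" variant is just one common way (a compare-and-set UPDATE) such a race is usually prevented:

```python
import sqlite3

# Simplified stand-in for the subtask state table (not OCP's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subtask_instance (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO subtask_instance VALUES (9171685, 'RUNNING')")

def report_unguarded(state):
    # Unconditional write: whichever executor reports last wins.
    conn.execute("UPDATE subtask_instance SET state=? WHERE id=9171685", (state,))

def report_guarded(state):
    # Compare-and-set: only one transition out of RUNNING can succeed.
    cur = conn.execute(
        "UPDATE subtask_instance SET state=? WHERE id=9171685 AND state='RUNNING'",
        (state,))
    return cur.rowcount == 1

# Unguarded: a slower second writer overwrites SUCCESSFUL with FAILED,
# matching what the audit records above show.
report_unguarded("SUCCESSFUL")
report_unguarded("FAILED")
print(conn.execute("SELECT state FROM subtask_instance").fetchone()[0])  # FAILED

# Guarded: the first terminal write wins; the second becomes a no-op.
conn.execute("UPDATE subtask_instance SET state='RUNNING' WHERE id=9171685")
assert report_guarded("SUCCESSFUL") is True
assert report_guarded("FAILED") is False
print(conn.execute("SELECT state FROM subtask_instance").fetchone()[0])  # SUCCESSFUL
```

Whether OCP's framework actually lacks such a guard is the open question for the product team; the sketch only shows why two connections writing terminal states independently would produce exactly the SUCCESSFUL-then-FAILED sequence seen in subtask_log.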

  5. The fix: correct the records in the parent and child task tables.
begin;
update task_instance set state='RUNNING' where id=7941627;
update subtask_instance set state='SUCCESSFUL' where id=9171678;
commit;
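Before resuming, it may be worth confirming the correction took effect (same tables and id values as in the fix above; the expected states are those the fix just wrote):

```sql
SELECT id, state FROM task_instance WHERE id = 7941627;    -- expect RUNNING
SELECT id, state FROM subtask_instance WHERE id = 9171678; -- expect SUCCESSFUL
```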
  6. Refresh the task in the OCP UI and click "Continue".

Thanks for reporting this issue; we will share the root cause with you once we have confirmed it.
