OceanBase SQL execution unexpectedly fails with 4016 - Internal Error

【Environment】Test environment
【OB or other component】OceanBase_CE
【Version】OceanBase_CE_V4.2.1.7
【Problem description】

INSERT INTO gs_merchant_msg ( msg_id, merchant_id, emp_id, add_time, read_time, state )
SELECT m.id, 2, 4, m.send_time, NOW( 3 ), 3 
FROM gs_msg m
  INNER JOIN gs_merchant m2 ON m2.id = 2
  LEFT JOIN gs_merchant_msg t ON t.merchant_id = 2
  AND t.msg_id = m.id 
  AND t.state = 3 
WHERE
  m.merchant_id = 0 
  AND m.scope = 100 
  AND t.id IS NULL 
  AND m.end_time >= m2.add_time

The application executes the SQL above and unexpectedly fails with [ 4016 - Internal Error ]. A few additional notes:

  1. We have multiple environments with exactly the same OB version and identical table structures (only the data differs).
  2. In one of these environments, this SQL reliably reproduces the error. Running the same SQL in the other environments does not produce it.
  3. In that environment the error occurs regardless of whether the SQL is executed by the application or via a Navicat connection.
  4. If the INSERT INTO is dropped and only the trailing SELECT is executed, it actually returns an empty result set (no rows).

【Attachments and logs】
Below are the possibly relevant error entries found in the log files:

observer.log.20240819134900148:[2024-08-19 13:47:53.125276] WDIAG update_qc_error_code (ob_px_util.h:591) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=9][errcode=-4016] 
observer.log.20240819134900148:[2024-08-19 13:47:53.125284] WDIAG [SQL.ENG] process_sqc_finish_msg_once (ob_px_scheduler.cpp:410) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=7][errcode=-4016] sqc fail, abort qc(pkt={dfo_id:1, sqc_id:0, rc:-4016, das_retry_rc:0, sqc_affected_rows:0}, ret=-4016, sqc_addr="172.31.44.197:2882")
observer.log.20240819134900148:[2024-08-19 13:47:53.125295] WDIAG [SQL.ENG] on_sqc_finish_msg (ob_px_scheduler.cpp:296) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=10][errcode=-4016] failed to process_sqc_finish_msg_once
observer.log.20240819134900148:[2024-08-19 13:47:53.125300] WDIAG [SQL.DTL] process (ob_dtl_channel_loop.cpp:209) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=4][errcode=-4016] process message in channel fail(header={hlen:16, type:12, nbody:63, checksum:0}, ret=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125316] WDIAG [SQL.DTL] process_base (ob_dtl_channel_loop.cpp:272) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=15][errcode=-4016] fail process channel(ret=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125319] WDIAG [SQL.ENG] fetch_rows (ob_px_fifo_coord_op.cpp:212) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=3][errcode=-4016] fail process message(ret=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125328] WDIAG [SQL.ENG] fetch_rows (ob_px_fifo_coord_op.cpp:244) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=7][errcode=-4016] QC get error code(ret=-4016, ret_terminate=0)
observer.log.20240819134900148:[2024-08-19 13:47:53.125333] WDIAG [SQL.ENG] get_next_batch (ob_operator.cpp:1194) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=4][errcode=-4016] get next batch failed(ret=-4016, eval_ctx={batch_idx:0, batch_size:16, max_batch_size:16, frames_:0x7f224f0166a0}, id=0, op_name="PHY_PX_FIFO_COORD")
observer.log.20240819134900148:[2024-08-19 13:47:53.125341] WDIAG [SQL.ENG] get_next_row (ob_operator.cpp:1499) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=7][errcode=-4016] get next batch failed(ret=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125346] WDIAG [SQL] inner_get_next_row (ob_result_set.cpp:411) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=4][errcode=-4016] get next row from exec result failed(ret=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125350] WDIAG [SQL] drive_dml_query (ob_result_set.cpp:1448) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=4][errcode=-4016] do dml query failed(ret=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125354] WDIAG [SQL] open_result (ob_result_set.cpp:214) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=4][errcode=-4016] fail to drive dml query(ret=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125358] WDIAG [SQL] open (ob_result_set.cpp:160) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=3][errcode=-4016] open result set failed(ret=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125364] WDIAG [SERVER] after_func (ob_query_retry_ctrl.cpp:947) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=3][errcode=-4016] [RETRY] check if need retry(v={force_local_retry:false, stmt_retry_times:0, local_retry_times:0, err_:-4016, err_:"OB_ERR_UNEXPECTED", retry_type:0, client_ret:-4016}, need_retry=false, THIS_WORKER.can_retry()=true, v.ctx_.multi_stmt_item_={is_part_of_multi_stmt:false, seq_num:0, sql:"", batched_queries:NULL, is_ps_mode:false, ab_cnt:0})
observer.log.20240819134900148:[2024-08-19 13:47:53.125374] WDIAG [SERVER] response_result (ob_sync_plan_driver.cpp:82) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=10][errcode=-4016] result set open failed, check if need retry(ret=-4016, cli_ret=-4016, retry_ctrl_.need_retry()=0)
observer.log.20240819134900148:[2024-08-19 13:47:53.125405] INFO  [DETECT] maintain_deadlock_info_when_end_stmt (ob_trans_deadlock_adapter.cpp:646) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=5] try unregister deadlock detecotr cause conflict array is empty(step=9, ret=0, ret="OB_SUCCESS", exec_ctx.get_errcode()=-4016, exec_ctx.get_errcode()="OB_ERR_UNEXPECTED", session={this:0x7f25fd5d01f8, id:3224512736, deser:false, tenant:"myapp", tenant_id:1002, effective_tenant:"myapp", effective_tenant_id:1002, database:"my_db", user:"myapp@%", consistency_level:3, session_state:2, autocommit:false, tx:0x7f2acb2b3ce0}, desc={this:0x7f2acb2b3ce0, tx_id:{txid:1358185025}, state:1, addr:"172.31.32.214:2882", tenant_id:1002, session_id:3224512736, assoc_session_id:3224512736, xid:NULL, xa_mode:"", xa_start_addr:"0.0.0.0:0", access_mode:0, tx_consistency_type:0, isolation:1, snapshot_version:{val:18446744073709551615, v:3}, snapshot_scn:0, active_scn:0, op_sn:2, alloc_ts:1724046472114271, active_ts:-1, commit_ts:-1, finish_ts:-1, timeout_us:86400000000, lock_timeout_us:-1, expire_ts:9223372036854775807, coord_id:{id:-1}, parts:[], exec_info_reap_ts:0, commit_version:{val:18446744073709551615, v:3}, commit_times:0, commit_cb:null, cluster_id:1719287282, cluster_version:17180000519, flags_.SHADOW:false, flags_.INTERRUPTED:false, flags_.BLOCK:false, flags_.REPLICA:false, can_elr:true, cflict_txs:[], abort_cause:0, commit_expire_ts:-1, commit_task_.is_registered():false, ref:2}, is_rollback=true, conflict_txs=[])
observer.log.20240819134900148:[2024-08-19 13:47:53.125435] INFO  [DETECT] maintain_deadlock_info_when_end_stmt (ob_trans_deadlock_adapter.cpp:670) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=30] maintain deadlock info(step=9, ret=0, ret="OB_SUCCESS", exec_ctx.get_errcode()=-4016, exec_ctx.get_errcode()="OB_ERR_UNEXPECTED", session={this:0x7f25fd5d01f8, id:3224512736, deser:false, tenant:"myapp", tenant_id:1002, effective_tenant:"myapp", effective_tenant_id:1002, database:"my_db", user:"myapp@%", consistency_level:3, session_state:2, autocommit:false, tx:0x7f2acb2b3ce0}, desc={this:0x7f2acb2b3ce0, tx_id:{txid:1358185025}, state:1, addr:"172.31.32.214:2882", tenant_id:1002, session_id:3224512736, assoc_session_id:3224512736, xid:NULL, xa_mode:"", xa_start_addr:"0.0.0.0:0", access_mode:0, tx_consistency_type:0, isolation:1, snapshot_version:{val:18446744073709551615, v:3}, snapshot_scn:0, active_scn:0, op_sn:2, alloc_ts:1724046472114271, active_ts:-1, commit_ts:-1, finish_ts:-1, timeout_us:86400000000, lock_timeout_us:-1, expire_ts:9223372036854775807, coord_id:{id:-1}, parts:[], exec_info_reap_ts:0, commit_version:{val:18446744073709551615, v:3}, commit_times:0, commit_cb:null, cluster_id:1719287282, cluster_version:17180000519, flags_.SHADOW:false, flags_.INTERRUPTED:false, flags_.BLOCK:false, flags_.REPLICA:false, can_elr:true, cflict_txs:[], abort_cause:0, commit_expire_ts:-1, commit_task_.is_registered():false, ref:2}, is_rollback=true, conflict_txs=[])
observer.log.20240819134900148:[2024-08-19 13:47:53.125450] INFO  [SQL.EXE] end_stmt (ob_sql_trans_control.cpp:1129) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=12] end stmt(ret=0, plain_select=false, stmt_type=2, savepoint=1724046473122673, tx_desc={this:0x7f2acb2b3ce0, tx_id:{txid:1358185025}, state:1, addr:"172.31.32.214:2882", tenant_id:1002, session_id:3224512736, assoc_session_id:3224512736, xid:NULL, xa_mode:"", xa_start_addr:"0.0.0.0:0", access_mode:0, tx_consistency_type:0, isolation:1, snapshot_version:{val:18446744073709551615, v:3}, snapshot_scn:0, active_scn:0, op_sn:3, alloc_ts:1724046472114271, active_ts:-1, commit_ts:-1, finish_ts:-1, timeout_us:86400000000, lock_timeout_us:-1, expire_ts:9223372036854775807, coord_id:{id:-1}, parts:[], exec_info_reap_ts:0, commit_version:{val:18446744073709551615, v:3}, commit_times:0, commit_cb:null, cluster_id:1719287282, cluster_version:17180000519, flags_.SHADOW:false, flags_.INTERRUPTED:false, flags_.BLOCK:false, flags_.REPLICA:false, can_elr:true, cflict_txs:[], abort_cause:0, commit_expire_ts:-1, commit_task_.is_registered():false, ref:2}, trans_result={incomplete:false, parts:[], touched_ls_list:[], cflict_txs:[]}, rollback=true, session={this:0x7f25fd5d01f8, id:3224512736, deser:false, tenant:"myapp", tenant_id:1002, effective_tenant:"myapp", effective_tenant_id:1002, database:"my_db", user:"myapp@%", consistency_level:3, session_state:2, autocommit:false, tx:0x7f2acb2b3ce0}, exec_ctx.get_errcode()=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125473] WDIAG [SERVER] response_result (ob_sync_plan_driver.cpp:92) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=22][errcode=-4016] close result set fail(cret=-4016)
observer.log.20240819134900148:[2024-08-19 13:47:53.125482] INFO  [SERVER] send_error_packet (obmp_packet_sender.cpp:319) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=4] sending error packet(ob_error=-4016, client error=-4016, extra_err_info=NULL, lbt()="0x1203501c 0x9db8660 0x9d65e29 0x4c573b5 0x9ddb062 0x9ddc2f2 0x9de342f 0x9de69cc 0x4c300c7 0x4c296a0 0x9a8d0d4 0x123077c3 0x7f2b65f6844b 0x7f2b65ca352f")
observer.log.20240819134900148:[2024-08-19 13:47:53.125504] INFO  [SERVER] send_error_packet (obmp_packet_sender.cpp:502) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=4] dump txn free route audit_record(value=1048581, session->get_sessid()=3224512736, session->get_proxy_sessid()=12402670931653377457)
observer.log.20240819134900148:[2024-08-19 13:47:53.125508] WDIAG [SERVER] execute_response (obmp_stmt_execute.cpp:1175) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=3][errcode=-4016] execute query fail(ret=-4016, timeout_timestamp=1724046533123356)
observer.log.20240819134900148:[2024-08-19 13:47:53.125537] WDIAG [SERVER] process_execute_stmt (obmp_stmt_execute.cpp:1741) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=5][errcode=-4016] fail to do process(ret=-4016, ctx_.cur_sql_=INSERT INTO gs_merchant_msg ( msg_id, merchant_id, emp_id, add_time, read_time, state)
observer.log.20240819134900148:[2024-08-19 13:47:53.125545] WDIAG [SERVER] process (obmp_stmt_execute.cpp:1898) [10304][T1002_L0_G0][T1002][YB42AC1F20D6-00061D78E29DC6D6-0-0] [lt=6][errcode=-4016] fail execute sql(sql_id="1D338D993E93BBB49A64B2DBEA4AEAAB", stmt_id=15, ret=-4016)

Could you compare the execution plans between the normal and the failing environments? Are they the same?
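As a sketch, the plans can be dumped like this (EXPLAIN EXTENDED per OceanBase 4.x MySQL mode; run it in both environments and diff the output):

```sql
-- EXPLAIN EXTENDED also prints the rewritten query and the outline,
-- which makes plan differences easier to spot than plain EXPLAIN.
EXPLAIN EXTENDED
SELECT m.id, 2, 4, m.send_time, NOW(3), 3
FROM gs_msg m
  INNER JOIN gs_merchant m2 ON m2.id = 2
  LEFT JOIN gs_merchant_msg t ON t.merchant_id = 2
    AND t.msg_id = m.id
    AND t.state = 3
WHERE m.merchant_id = 0
  AND m.scope = 100
  AND t.id IS NULL
  AND m.end_time >= m2.add_time;
```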


I compared them; they do not appear to be exactly the same (the left side is the problematic environment).

Try turning parallel execution off and see what happens?
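One way to try this, as a sketch (PARALLEL hint syntax per OceanBase's MySQL mode; verify on your version): force a degree of parallelism of 1 so the plan avoids PX worker threads. If the error then disappears, the problem is in the parallel-execution layer.

```sql
-- Sketch: the hint goes right after the leading verb.
INSERT /*+ PARALLEL(1) */ INTO gs_merchant_msg ( msg_id, merchant_id, emp_id, add_time, read_time, state )
SELECT m.id, 2, 4, m.send_time, NOW(3), 3
FROM gs_msg m
  INNER JOIN gs_merchant m2 ON m2.id = 2
  LEFT JOIN gs_merchant_msg t ON t.merchant_id = 2 AND t.msg_id = m.id AND t.state = 3
WHERE m.merchant_id = 0 AND m.scope = 100 AND t.id IS NULL AND m.end_time >= m2.add_time;
```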

When the statement fails, capture the logs for that exact execution like this, upload them, and we can see what the problem is:
alter system set enable_rich_error_msg=true;
obclient [test]> select count(*) from t2;
ERROR 1146 (42S02): Table 'test.t2' doesn't exist
[xx.xx.xx.1:2882] [2024-04-13 20:10:20.292087] [YB420BA1CC68-000615A0A8EA5E38-0-0]
[root@x.x.x.1 ~]$ grep "YB420BA1CC68-000615A0A8EA5E38-0-0" rootservice.log
[root@x.x.x.1 ~]$ grep "YB420BA1CC68-000615A0A8EA5E38-0-0" observer.log
alter system set enable_rich_error_msg=false;

These are two databases with identical structure under the same tenant; one works and the other does not.

rootservice.log has no matching entries.
The observer.log entries have been uploaded; please take a look.
observer.log (14.3 KB)

This is just to rule out whether it is a parallelism-related problem.

OK. Thanks again, everyone, for the warm help and replies~

Prompted by a hint from an OceanBase developer, I realized there is one subtle difference between the failing environment and the others:

In the failing environment, the leader replicas of gs_merchant and gs_merchant_msg both sit in Zone3 and those tables belong to no table group, while the leader replica of gs_msg sits in Zone2 and that table belongs to table group tg2 (that is the table group's name).

In the environments that do not report the error, none of the three tables belongs to any table group (and their leaders are not all in the same zone either).

After I added the first two tables to table group tg2 as well and waited for the migration to finish (their leader replicas moved to Zone2, where tg2 lives), I ran the SQL again and it no longer failed. Of course, I will keep observing for a while.
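For reference, the change described above amounts to something like the following (ALTER TABLEGROUP syntax and the DBA_OB_TABLE_LOCATIONS view per OceanBase 4.x; verify on your version):

```sql
-- Put the other two tables into the same table group as gs_msg.
ALTER TABLEGROUP tg2 ADD gs_merchant, gs_merchant_msg;

-- Then watch the leader locations (role = 'LEADER') until all three
-- tables have their leaders in the same zone.
SELECT table_name, zone, role
FROM oceanbase.DBA_OB_TABLE_LOCATIONS
WHERE table_name IN ('gs_merchant', 'gs_merchant_msg', 'gs_msg');
```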

Still, I hope this case can be fed back to the OB developers so that the underlying problem gets fixed for good.

Thanks a lot for the feedback; we will collect the information and try to reproduce it. Could you share the table structures? Also, is the statement executed over a direct connection? What is your deployment architecture? And could you list your test steps as completely as possible?

CREATE TABLE `gs_merchant` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(20) NOT NULL,
  `site_name` varchar(20) NOT NULL DEFAULT '',
  `currency` char CHARACTER SET latin1 NOT NULL DEFAULT '',
  `country_codes` varchar(2) CHARACTER SET latin1 NOT NULL,
  `languages` varchar NOT NULL,
  `level` tinyint unsigned NOT NULL DEFAULT '0',
  `dealer_mode` tinyint unsigned DEFAULT NULL,
  `vendor_num` smallint unsigned NOT NULL,
  `agent_mode` tinyint unsigned NOT NULL DEFAULT '1',
  `biz_flag` smallint unsigned NOT NULL DEFAULT '0',
  `risk_status` tinyint NOT NULL DEFAULT '1',
  `status` tinyint NOT NULL,
  `state` tinyint NOT NULL,
  `add_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `update_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `group_id` int unsigned DEFAULT NULL,
  `remark` varchar(500) DEFAULT NULL,
  `org_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_name_groupId` (`name`, `group_id`) BLOCK_SIZE 16384 LOCAL,
  KEY `idx_addTime` (`add_time`) BLOCK_SIZE 16384 LOCAL,
  KEY `idx_currency` (`currency`) BLOCK_SIZE 16384 LOCAL
) AUTO_INCREMENT = 490 AUTO_INCREMENT_MODE = 'ORDER' DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 3 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0;

CREATE TABLE `gs_merchant_msg` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `msg_id` int unsigned NOT NULL,
  `merchant_id` int unsigned NOT NULL,
  `emp_id` int unsigned DEFAULT NULL,
  `read_time` datetime DEFAULT NULL,
  `state` tinyint unsigned NOT NULL DEFAULT '3',
  `add_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_merchantId_msgId` (`merchant_id`, `msg_id`) BLOCK_SIZE 16384 LOCAL
) AUTO_INCREMENT = 6176 AUTO_INCREMENT_MODE = 'ORDER' DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 3 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0;


CREATE TABLE `gs_msg` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `type` tinyint(4) NOT NULL,
  `merchant_id` int(10) unsigned NOT NULL,
  `emp_id` int(10) unsigned NOT NULL,
  `title` varchar(100) NOT NULL DEFAULT '',
  `content` text DEFAULT NULL,
  `scope` tinyint(3) unsigned NOT NULL,
  `ranges` varchar(2560) CHARACTER SET latin1 DEFAULT NULL,
  `send_time` datetime(3) NOT NULL,
  `end_time` datetime(3) DEFAULT NULL,
  `pop_up` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `status` tinyint(4) NOT NULL DEFAULT '0',
  `state` tinyint(3) unsigned NOT NULL DEFAULT '3',
  `add_time` datetime(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  `update_time` datetime(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  PRIMARY KEY (`id`),
  KEY `idx_merchantId_type` (`merchant_id`, `type`) BLOCK_SIZE 16384 LOCAL
) AUTO_INCREMENT = 186566 AUTO_INCREMENT_MODE = 'ORDER' DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 3 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 TABLEGROUP = 'tg2';

The SQL is executed either by the application or via a Navicat connection (both go through OBProxy); before the fix, both triggered the error.

The cluster deployment is the most common 1-1-1 full-featured replica layout (version OceanBase_CE_V4.2.1.7).

Could you compress and upload the observer logs from all three nodes for that time window? Thanks a lot.

You can find the specific error log above; it was uploaded earlier.
(There was a screenshot here; scroll up in the thread for the error log.)


What I mainly want is a more complete set of logs: the observer logs from all three observer nodes covering the time window when you ran the SQL.

:rofl: We could not reproduce it in testing.

sqc fail, abort qc(pkt={dfo_id:1, sqc_id:0, rc:-4016, das_retry_rc:0, sqc_affected_rows:0}, ret=-4016, sqc_addr="172.31.44.197:2882")
A log line like this means a PX worker thread hit an error and sent an interrupt to the query coordinator. You then need to go to the node the PX thread ran on, i.e. 172.31.44.197, and search its logs for this trace_id to find the original cause of the error.
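A sketch of that search (paths are assumptions: on a real node LOG_DIR would be the observer's log directory, e.g. /home/admin/oceanbase/log; here a temporary directory with one fabricated line stands in so the commands run anywhere):

```shell
# Search current and rotated observer logs for the trace id from the
# "sqc fail, abort qc" line; the earliest WDIAG/EDIAG hit is the root cause.
TRACE_ID="YB42AC1F20D6-00061D78E29DC6D6-0-0"

# Stand-in for the real log directory on 172.31.44.197 (demo assumption).
LOG_DIR=$(mktemp -d)
echo "[2024-08-19 13:47:53.125276] WDIAG update_qc_error_code [${TRACE_ID}] [errcode=-4016]" \
  > "$LOG_DIR/observer.log.20240819134900148"

# -h suppresses filenames so the lines can be sorted by their timestamps.
MATCHES=$(grep -h "$TRACE_ID" "$LOG_DIR"/observer.log* | head -n 20)
echo "$MATCHES"
```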

Sorry, I just had our ops team search for it, but it seems the logs can no longer be found. That error log was from yesterday, and since we have already fixed the problem by putting all three tables into the same table group, we cannot reproduce it anymore.

Please look in the archived logs, e.g. observer.log.20220427154619; the rotated files keep the history, so you should be able to locate the matching time window there.

I checked the server; it seems there are only two log files there, and both are from today.