【BUG】疑似多列联合主键导致偶发数据库连接失败

AntTech_ZAGSOK · 2024 年7 月 31 日 17:17

【使用环境】测试环境 OceanBase_CE_V4.2.1.7

我们有2个汇总统计表，表结构大致如下：

CREATE TABLE `order_stat_hourly` (
    `merchant_id` int unsigned NOT NULL,
    `period` timestamp NOT NULL COMMENT '每个周期的开始时间点，也就是每个小时的整点',
    `user_id` int unsigned NOT NULL,
    `pay_amount_sum` int unsigned NOT NULL DEFAULT '0',    
    `vendor` varchar(4) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
    `product_type` tinyint unsigned NOT NULL,
    `product_id` smallint unsigned NOT NULL,
    `status` tinyint NOT NULL DEFAULT '1',
    UNIQUE KEY `uk_period_merchantId_userId_vendor_productId` (`period`,`merchant_id`,`user_id`,`vendor`,`product_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
    PARTITION BY RANGE (UNIX_TIMESTAMP( `period` ))
        (
        PARTITION P20240326 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-27'))
        , PARTITION P20240327 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-28'))
        , PARTITION P20240328 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-29'))
    );


CREATE TABLE `order_stat_daily` (
    `id` int unsigned NOT NULL AUTO_INCREMENT,
    `merchant_id` int unsigned NOT NULL,
    `period` timestamp NOT NULL COMMENT '每个周期的开始时间点，也就是每天的0点',
    `user_id` int unsigned NOT NULL,
    `pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
    `status` tinyint NOT NULL DEFAULT '1',
    PRIMARY KEY (`id`),
    UNIQUE KEY `uk_period_merchantId_userId` (`period`,`merchant_id`,`user_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

order_stat_hourly 表用于按小时汇总订单数据，该表使用 period, merchant_id, user_id, vendor, product_id 作为联合主键，并按天分区。

SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?

当我们使用该表进行如上查询时，应用程序偶发会触发以下异常，偶发的频率大概是每分钟 1~2 次。
然而，在其他所有条件都不变的情况下，我们将上述SQL中的数据表改为 order_stat_daily 时，该异常就不再触发，也就是恢复正常了。

根据两个表的结构差异，我们怀疑大概率是因为 order_stat_hourly 表是用 5个字段 作为联合主键的缘故。

org.springframework.dao.RecoverableDataAccessException: 
### Error querying database.  Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
### The error occurred while setting parameters
### SQL: SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?
### Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
; Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.; nested exception is com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
	at org.springframework.jdbc.support.SQLExceptionSubclassTranslator.doTranslate(SQLExceptionSubclassTranslator.java:100)
	at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)	
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
	at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:165)
	at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:55)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
	at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:58)
	at com.mysql.cj.protocol.a.SimplePacketReader.readHeaderLocal(SimplePacketReader.java:72)

【补充说明】
应用程序使用的是 HikariCP 的连接池，3个 OBServer 构成 1-1-1 的集群，3个OBProxy分别部署在与3个OBServer相同的机器上，并使用 ELB 实现负载均衡。

BTW，使用 order_stat_hourly 表进行上述查询，比使用 order_stat_daily 会消耗更多的数据库 CPU 资源（只查询 1-3天内的数据，但是并不算多的 order_stat_hourly 表查询就会让 CPU 满载）。

此外，我们还发现，在应用程序的 SQL 中添加 弱一致性读 的 SQL Hint，例如：SELECT /*+ READ_CONSISTENCY(WEAK) */ FROM table_name （任何表都可能会触发），也大概率会触发相同的错误。

order_stat_hourly 和 order_stat_daily 两表的数据量比例约为 4 : 1 。

30分钟内查询小时表 1000 多次，CPU占比就达到了 90+%
30分钟内查询天表 3000 多次，CPU占比才 3+%

【更新说明】

我们取消表分区后，这个偶发的连接失败报错就没有再出现了。
我们增加 id 作为自增主键后，响应时间从 300+ 降低到 80+ ms，优化索引后，响应时间进一步降低到 0.57ms（作为对比，使用 order_stat_daily 表的响应时间是 0.37ms）。

旭辉 · 2024 年7 月 31 日 17:55

感谢反馈，从您的描述看：

order_stat_hourly 表查询就会让 CPU 满载，order_stat_daily查询消耗cpu相对少，

我们取消表分区后，这个偶发的连接失败报错就没有再出现了。
我们增加 id 作为自增主键后，响应时间从 300+ 降低到 80+ ms，优化索引后，响应时间进一步降低到 0.57ms（作为对比，使用 order_stat_daily 表的响应时间是 0.37ms）。

以及结合应用报错日志 “Caused by: java.io.EOFException: Can not read response from server. ”分析

应该是服务器cpu负载高，导致数据库响应慢，从而出现这个错误

建议：
1.优化SQL，降低资源消耗
2.扩容cpu，内存资源
3.cpu高时建议运行obdiag 搜集下信息，便于根因分析
obdiag gather scene run --scene=observer.cpu_high
https://www.oceanbase.com/docs/common-obdiag-cn-1000000001102519

AntTech_ZAGSOK · 2024 年7 月 31 日 18:04

我感觉应该不是 SQL 的问题，而是因为我按照OB官方的建议，使用了 5个业务列 作为唯一键，加上采用表分区，导致的这个问题。

我只要把表分区去掉，SQL不作任何改动，这个偶发报错就消失了。
我推测，是不是 OB 对这种 使用5个列作联合主键+但只查询前3个列 的分区表查询支持有bug。

我使用 a、b、c、d、e 作为联合主键，查询 WHERE a=? AND b=? AND c=? 会很耗费性能。
但如果我使用 id 作为单列主键，再把 a、b、c、d、e 作为唯一索引，查询条件保持不变，CPU的负载就大幅下降了。

AntTech_ZAGSOK · 2024 年7 月 31 日 18:22

该数据库节点本来负载只有 10~20%，就是因为这个查询，负载才升上去的。
单次查询，响应时间也不是特别长。
您看我的截图就知道，单次响应才 0.3+秒，也不至于响应超时。但就是这种情况下的 多列联合主键+分区表 非常吃 CPU 资源。
我把这两个因素去掉，负载就大幅降低了。
使用 分区表 还有偶发报错的问题。

我们的其他表也有分区表，数据量更大（是该表数据量的50倍以上），响应时间更长，但都是单列主键，就没有类似的偶发报错问题。

我怀疑问题的关键可能在于：使用 5个列作联合主键，但只查询了前3个列，并没有包含完整的主键，虽然 5列是唯一索引，但OB只将其当主键使用，没有将其当索引使用（也没有再定义其他索引），导致 OB 的执行和分区扫描产生性能问题。

旭辉 · 2024 年8 月 1 日 10:10

“我怀疑问题的关键可能在于：使用 5个列作联合主键，但只查询了前3个列，并没有包含完整的主键，虽然 5列是唯一索引，但OB只将其当主键使用，没有将其当索引使用（也没有再定义其他索引），导致 OB 的执行和分区扫描产生性能问题。”

–你的怀疑有道理，需要执行计划来验证下

–麻烦取一下做原始执行计划及做改动之后的执行计划

1.通过 gv$ob_sql_audit 查询某次 SQL 执行对应的 svr_ip、svr_port、tenant_id、plan_id
SELECT svr_ip,svr_port,tenant_id,user_name, db_name, sql_id, plan_id, elapsed_time, execute_time, query_sql FROM gv$ob_sql_audit WHERE tenant_id=xxx and user_name=‘xxxx’ and query_sql like…

2.根据上述得到的svr_ip、svr_port、tenant_id、plan_id，查询 GV$OB_PLAN_CACHE_PLAN_EXPLAIN 视图，展示对应的执行计划

select * from GV$OB_PLAN_CACHE_PLAN_EXPLAIN where tenant_id=xxx and svr_ip=‘xx.xx.xx.xx’ and svr_port=xxx and plan_id=xxx;

AntTech_ZAGSOK · 2024 年8 月 1 日 10:18

稍等，我先去看看，还能不能找到原始的执行计划。
如果找不到了的话，我再想办法构造数据和查询，重新模拟出对应的场景。

感谢您的热心回复。

旭辉 · 2024 年8 月 1 日 11:10

好的，如果能复现出来，麻烦反馈下，也可以借助obdiag 这个工具来收集sql plan monitor报告，这份报告的信息很全面：OceanBase 社区
感谢