【 使用环境 】测试环境 OceanBase_CE_V4.2.1.7
我们有2个汇总统计表,表结构大致如下:
CREATE TABLE `order_stat_hourly` (
`merchant_id` int unsigned NOT NULL,
`period` timestamp NOT NULL COMMENT '每个周期的开始时间点,也就是每个小时的整点',
`user_id` int unsigned NOT NULL,
`pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
`vendor` varchar(4) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
`product_type` tinyint unsigned NOT NULL,
`product_id` smallint unsigned NOT NULL,
`status` tinyint NOT NULL DEFAULT '1',
UNIQUE KEY `uk_period_merchantId_userId_vendor_productId` (`period`,`merchant_id`,`user_id`,`vendor`,`product_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (UNIX_TIMESTAMP( `period` ))
(
PARTITION P20240326 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-27'))
, PARTITION P20240327 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-28'))
, PARTITION P20240328 VALUES LESS THAN(UNIX_TIMESTAMP('2024-03-29'))
);
CREATE TABLE `order_stat_daily` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`merchant_id` int unsigned NOT NULL,
`period` timestamp NOT NULL COMMENT '每个周期的开始时间点,也就是每天的0点',
`user_id` int unsigned NOT NULL,
`pay_amount_sum` int unsigned NOT NULL DEFAULT '0',
`status` tinyint NOT NULL DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `uk_period_merchantId_userId` (`period`,`merchant_id`,`user_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
order_stat_hourly
表 用于按小时汇总订单数据,该表使用 period, merchant_id, user_id, vendor, product_id
作为联合主键,并按天分区。
SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?
当我们使用该表进行如上查询时,应用程序偶发会触发以下异常,偶发的频率大概是每分钟 1~2 次。
然而,在其他所有条件都不变的情况下,我们将上述SQL中的数据表改为 order_stat_daily
时,该异常就不再触发,也就是恢复正常了。
根据两个表的结构差异,我们怀疑大概率是因为 order_stat_hourly
表是用 5个字段 作为联合主键 的缘故。
org.springframework.dao.RecoverableDataAccessException:
### Error querying database. Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
### The error occurred while setting parameters
### SQL: SELECT SUM(pay_amount_sum) FROM order_stat_hourly WHERE period BETWEEN ? AND ? AND merchant_id = ? AND user_id = ?
### Cause: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
; Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.; nested exception is com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at org.springframework.jdbc.support.SQLExceptionSubclassTranslator.doTranslate(SQLExceptionSubclassTranslator.java:100)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:165)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:55)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
The last packet successfully received from the server was 69 milliseconds ago. The last packet sent successfully to the server was 70 milliseconds ago.
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:58)
at com.mysql.cj.protocol.a.SimplePacketReader.readHeaderLocal(SimplePacketReader.java:72)
【补充说明】
应用程序使用的是 HikariCP 的连接池,3个 OBServer 构成 1-1-1
的集群,3个OBProxy分别部署在与3个OBServer相同的机器上,并使用 ELB 实现 负载均衡。
BTW, 使用 order_stat_hourly
表进行上述查询,比使用 order_stat_daily
会消耗更多的 数据库 CPU 资源( 只查询 1-3天内的数据, 但是 并不算多的 order_stat_hourly
表查询 就会让 CPU 满载 )。
此外,我们还发现,在应用程序的 SQL 中添加 弱一致性读 的 SQL Hint,例如:SELECT /*+ READ_CONSISTENCY(WEAK) */ FROM table_name
( 任何表都可能会触发 ),也大概率会触发相同的错误。
order_stat_hourly
和 order_stat_daily
两表的数据量比例约为 4 : 1 。
30分钟内 查询 小时表 1000 多次,CPU占比就达到了 90+%
30分钟内 查询 天表 3000 多次,CPU占比才 3+%
【更新说明】
- 我们取消 表分区 后,这个偶发的连接失败报错 就没有再出现了。
- 我们增加
id
作为 自增主键后,响应时间从 300+ 降低到 80+ ms,优化索引后,响应时间进一步降低到 0.57ms(作为对比,使用order_stat_daily
表的响应时间是 0.37ms)。