从数据库崩溃案例看懂 SQL 逻辑运算优先级的重要性

最近遇到一个SQL语句把数据库弄奔溃了,分享一下数据库的逻辑运算的优先级。

1. 存在问题的SQL

下面的SQL中每个表的总数据量是2797行,且数据一样,并且两个表上都没有索引(因为懒)。

explain select  count(*) from t1,t2 where t1.object_name = t2.object_name or t1.DATA_OBJECT_ID = t2.DATA_OBJECT_ID and t1.OBJECT_ID = 123 and rownum = 1\G
*************************** 1. row ***************************
Query Plan: ============================================================
|ID|OPERATOR                    |NAME    |EST. ROWS|COST   |
------------------------------------------------------------
|0 |SCALAR GROUP BY             |        |1        |3085179|
|1 | COUNT                      |        |76706    |3082251|
|2 |  NESTED-LOOP JOIN CARTESIAN|        |7820412  |2865184|
|3 |   TABLE SCAN               |T1      |2796     |1082   |
|4 |   MATERIAL                 |        |2797     |3218   |
|5 |    PX COORDINATOR          |        |2797     |2956   |
|6 |     EXCHANGE OUT DISTR     |:EX10000|2797     |1082   |
|7 |      TABLE SCAN            |T2      |2797     |1082   |
============================================================

Outputs & filters: 
-------------------------------------
  0 - output([T_FUN_COUNT(*)]), filter(nil), 
      group(nil), agg_func([T_FUN_COUNT(*)])
  1 - output([1]), filter([T1.OBJECT_NAME = T2.OBJECT_NAME OR (T_OP_AND, T1.DATA_OBJECT_ID = T2.DATA_OBJECT_ID, T1.OBJECT_ID = 123, rownum() = 1)])
  2 - output([T1.OBJECT_NAME], [T2.OBJECT_NAME], [T1.DATA_OBJECT_ID], [T2.DATA_OBJECT_ID], [T1.OBJECT_ID]), filter(nil), 
      conds(nil), nl_params_(nil)
  3 - output([T1.OBJECT_NAME], [T1.DATA_OBJECT_ID], [T1.OBJECT_ID]), filter(nil), 
      access([T1.OBJECT_NAME], [T1.DATA_OBJECT_ID], [T1.OBJECT_ID]), partitions(p0)
  4 - output([T2.OBJECT_NAME], [T2.DATA_OBJECT_ID]), filter(nil)
  5 - output([T2.OBJECT_NAME], [T2.DATA_OBJECT_ID]), filter(nil)
  6 - output([T2.OBJECT_NAME], [T2.DATA_OBJECT_ID]), filter(nil), is_single, dop=1
  7 - output([T2.OBJECT_NAME], [T2.DATA_OBJECT_ID]), filter(nil), 
      access([T2.OBJECT_NAME], [T2.DATA_OBJECT_ID]), partitions(p0)
      

看似正常的SQL语句,执行计划中ID等于2的步骤,没有过滤筛选,使用了嵌套循环连接,结果集的总数据量有7820412行。ID等于1的步骤中有过滤条件,过滤条件是 filter([T1.OBJECT_NAME = T2.OBJECT_NAME OR (T_OP_AND, T1.DATA_OBJECT_ID = T2.DATA_OBJECT_ID, T1.OBJECT_ID = 123, rownum() = 1)]) ,业务逻辑是先取两个表中object_name相同或DATA_OBJECT_ID 相同的数据,得到结果集后再根据T1.OBJECT_ID得到相关数据,最后保留一条。SQL的实际执行结果并没有按照预期进行。

结合执行计划中的过滤条件和官方的逻辑运算符的优先级说明看到,AND的优先级是比OR优先级要高,SQL等价于select count(*) from t1,t2 where t1.object_name = t2.object_name or (t1.DATA_OBJECT_ID = t2.DATA_OBJECT_ID and t1.OBJECT_ID = 123 and rownum = 1),因此导致了SQL的执行不符合预期且造成大量的资源消耗。

2. 问题解决

知道了原因,解决起来就很简单,把OR的条件用括号包含起来,让OR条件先计算,这让就符合业务的预期。

explain select  count(*) from t1 t1,t2 t2 where (t1.object_name = t2.object_name or t1.DATA_OBJECT_ID = t2.DATA_OBJECT_ID) and t1.OBJECT_ID = 123 and rownum = 1\G
*************************** 1. row ***************************
Query Plan: ====================================================
|ID|OPERATOR               |NAME    |EST. ROWS|COST|
----------------------------------------------------
|0 |SCALAR GROUP BY        |        |1        |1199|
|1 | SUBPLAN SCAN          |VIEW1   |1        |1199|
|2 |  LIMIT                |        |1        |1199|
|3 |   NESTED-LOOP JOIN    |        |1        |1199|
|4 |    PX COORDINATOR     |        |2        |47  |
|5 |     EXCHANGE OUT DISTR|:EX10000|2        |46  |
|6 |      TABLE SCAN       |T2      |2        |46  |
|7 |    MATERIAL           |        |28       |1144|
|8 |     TABLE SCAN        |T1      |28       |1141|
====================================================

Outputs & filters: 
-------------------------------------
  0 - output([T_FUN_COUNT(*)]), filter(nil), 
      group(nil), agg_func([T_FUN_COUNT(*)])
  1 - output([1]), filter(nil), 
      access(nil)
  2 - output([1]), filter(nil), limit(?), offset(nil)
  3 - output([1]), filter(nil), 
      conds([T1.OBJECT_NAME = T2.OBJECT_NAME OR T1.DATA_OBJECT_ID = T2.DATA_OBJECT_ID]), nl_params_(nil)
  4 - output([T2.OBJECT_NAME], [T2.DATA_OBJECT_ID]), filter(nil)
  5 - output([T2.OBJECT_NAME], [T2.DATA_OBJECT_ID]), filter(nil), is_single, dop=1
  6 - output([T2.OBJECT_NAME], [T2.DATA_OBJECT_ID]), filter(nil), 
      access([T2.OBJECT_NAME], [T2.DATA_OBJECT_ID]), partitions(p0)
  7 - output([T1.OBJECT_NAME], [T1.DATA_OBJECT_ID]), filter(nil)
  8 - output([T1.OBJECT_NAME], [T1.DATA_OBJECT_ID]), filter([T1.OBJECT_ID = 123]), 
      access([T1.OBJECT_NAME], [T1.DATA_OBJECT_ID], [T1.OBJECT_ID]), partitions(p0)

执行计划中ID等于8的T1表的过滤条件是OBJECT_ID = 123,两个表连接时,使用了连接条件T1.OBJECT_NAME = T2.OBJECT_NAME OR T1.DATA_OBJECT_ID = T2.DATA_OBJECT_ID,这样的SQL数据量非常小,执行特别快。

3. 总结

本次问题原因是因为一个逻辑运算的优先级判断错误。对于自己拿不准的SQL语句,查阅官方文档是最权威的参考依据。其次,在测试环境就算只有少量数据也可以看到执行计划是否合理。最后,上线时SQL语句经过把关,可以最大限度减少SQL语句引发的线上故障。