【有问必答】如何调整 OceanBase 的 DDL 超时时间？

兹拉坦 · 2024 年8 月 26 日 20:07

在 OceanBase 的社区论坛问答区里，总是能看到不断有同学发帖询问类似于 “执行 DDL 超时，为何调大超时时间不生效？” 之类的问题。

其中有很多帖子里的问题，因为各种原因没有完全解决，最终都不了了之了，很是可惜。OceanBase 里的超时时间确实比较多，上周看用户发的问答帖里，又有人在问一样的问题。

所以趁热打铁，学习了一下这个最让用户困扰的 DDL 超时时间，在这里和大家分享：《如何调整 OceanBase 的 DDL 超时时间？》

大家如果对 OceanBase 的各类超时时间，有任何问题，或者任何槽点，都欢迎在本帖中进行留言评论，我们会第一时间对大家的疑问或者建议进行回复。

兹拉坦 · 2024 年8 月 26 日 20:10

写博客的过程中想尝试总结一下如何排查超时时间的相关问题，但是发现 OceanBase 超时报错之后，通过报错信息和日志都很难排查具体是哪个超时时间导致的超时。

看用户发的问答帖，发现大家在遇到超时问题之后，基本都是尝试性地逐个调大自己知道的几个超时时间，但是 OceanBase 里的超时时间很多，下面列出冰山一角。

obclient [test]> show global variables like '%timeout%';
+---------------------+------------------+
| Variable_name       | Value            |
+---------------------+------------------+
| connect_timeout     | 10               |
| interactive_timeout | 28800            |
| lock_wait_timeout   | 31536000         |
| net_read_timeout    | 30               |
| net_write_timeout   | 60               |
| ob_pl_block_timeout | 3216672000000000 |
| ob_query_timeout    | 10000000         |
| ob_trx_idle_timeout | 86400000000      |
| ob_trx_lock_timeout | -1               |
| ob_trx_timeout      | 86400000000      |
| wait_timeout        | 28800            |
+---------------------+------------------+
11 rows in set (0.064 sec)

obclient [test]> show parameters like '%timeout%';
+-------+----------+--------------+----------+-----------------------------------------------+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+---------+---------+-------------------+---------------+-----------+
| zone  | svr_type | svr_ip       | svr_port | name                                          | data_type | value | info                                                                                                                                                                                                                                                             | section        | scope   | source  | edit_level        | default_value | isdefault |
+-------+----------+--------------+----------+-----------------------------------------------+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+---------+---------+-------------------+---------------+-----------+
| zone1 | observer | 11.158.31.20 |    22602 | sys_bkgd_migration_change_member_list_timeout | NULL      | 20s   | the timeout for migration change member list retry. The default value is 20s. Range: [0s,24h]                                                                                                                                                                    | OBSERVER       | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 20s           |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | location_cache_refresh_sql_timeout            | NULL      | 1s    | The timeout used for refreshing location cache by SQL. Range: [1ms, +∞)                                                                                                                                                                                          | LOCATION_CACHE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 1s            |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | location_cache_refresh_rpc_timeout            | NULL      | 500ms | The timeout used for refreshing location cache by RPC. Range: [1ms, +∞)                                                                                                                                                                                          | LOCATION_CACHE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 500ms         |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | rpc_timeout                                   | NULL      | 2s    | the time during which a RPC request is permitted to execute before it is terminated                                                                                                                                                                              | RPC            | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 2s            |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | balancer_task_timeout                         | NULL      | 20m   | the time to execute the load-balancing task before it is terminated. Range: [1s, +∞)                                                                                                                                                                             | LOAD_BALANCE   | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 20m           |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | dead_socket_detection_timeout                 | NULL      | 3s    | specify a tcp_user_timeout for RFC5482. A zero value makes the option disabled, Range: [0, 2h]                                                                                                                                                                   | OBSERVER       | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 3s            |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | debug_sync_timeout                            | NULL      | 0     | Enable the debug sync facility and optionally specify a default wait timeout in micro seconds. A zero value keeps the facility disabled, Range: [0, +∞]                                                                                                          | OBSERVER       | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 0             |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | internal_sql_execute_timeout                  | NULL      | 30s   | the number of microseconds an internal DML request is permitted to execute before it is terminated. Range: [1000us, 1h]                                                                                                                                          | OBSERVER       | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 30s           |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | arbitration_timeout                           | NULL      | 5s    | The timeout before automatically degrading when arbitration member exists. Range: [3s,+∞]                                                                                                                                                                        | TRANS          | TENANT  | DEFAULT | DYNAMIC_EFFECTIVE | 5s            |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | ob_query_switch_leader_retry_timeout          | NULL      | 0ms   | max time spend on retry caused by leader swith or network disconnectionRange: [0ms, +∞)                                                                                                                                                                          | OBSERVER       | TENANT  | DEFAULT | DYNAMIC_EFFECTIVE | 0ms           |         1 |
| zone1 | observer | 11.158.31.20 |    22602 | standby_db_fetch_log_rpc_timeout              | NULL      | 15s   | The threshold for detecting the RPC timeout for the standby tenant to fetch log from the log restore source tenant. When the rpc timeout, the log transport service switches to another server of the log restore source tenant to fetch logs. Range: [2s, +∞)   | LOGSERVICE     | TENANT  | DEFAULT | DYNAMIC_EFFECTIVE | 15s           |         1 |
+-------+----------+--------------+----------+-----------------------------------------------+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+---------+---------+-------------------+---------------+-----------+
11 rows in set (0.104 sec)

如果遇到了 SQL 超时，想知道这条 SQL 到底受到了哪个超时时间的限制，看报错日志好像也看不出来，因为无论由于什么超时报错，错误码好像都是 4012。

我后面会联系 OceanBase 的产品同学，提出相关需求，希望未来 OceanBase 能够把由于不同超时时间导致的报错，匹配上不同的错误码，提升一些日志的易用性。例如 ob_query_timeout 超时报错 40120，ob_trx_timeout 超时报错 40121，_ob_ddl_timeout 超时报错 40122。这样大家看到错误码后，马上就可以知道要调整哪个超时时间了。