今天写了一篇博客:OceanBase 社区
写博客的过程中想尝试总结一下如何排查超时时间的相关问题,但是发现 OceanBase 超时报错之后,通过报错信息和日志都很难排查具体是哪个超时时间导致的超时,希望能对错误码做些调整。
看用户发的问答帖,发现大家在遇到超时问题之后,基本都是尝试性地逐个调大自己知道的几个超时时间,但是 OceanBase 里的超时时间很多,下面列出冰山一角。
obclient [test]> show global variables like '%timeout%';
+---------------------+------------------+
| Variable_name | Value |
+---------------------+------------------+
| connect_timeout | 10 |
| interactive_timeout | 28800 |
| lock_wait_timeout | 31536000 |
| net_read_timeout | 30 |
| net_write_timeout | 60 |
| ob_pl_block_timeout | 3216672000000000 |
| ob_query_timeout | 10000000 |
| ob_trx_idle_timeout | 86400000000 |
| ob_trx_lock_timeout | -1 |
| ob_trx_timeout | 86400000000 |
| wait_timeout | 28800 |
+---------------------+------------------+
11 rows in set (0.064 sec)
obclient [test]> show parameters like '%timeout%';
+-------+----------+--------------+----------+-----------------------------------------------+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+---------+---------+-------------------+---------------+-----------+
| zone | svr_type | svr_ip | svr_port | name | data_type | value | info | section | scope | source | edit_level | default_value | isdefault |
+-------+----------+--------------+----------+-----------------------------------------------+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+---------+---------+-------------------+---------------+-----------+
| zone1 | observer | 11.158.31.20 | 22602 | sys_bkgd_migration_change_member_list_timeout | NULL | 20s | the timeout for migration change member list retry. The default value is 20s. Range: [0s,24h] | OBSERVER | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 20s | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | location_cache_refresh_sql_timeout | NULL | 1s | The timeout used for refreshing location cache by SQL. Range: [1ms, +∞) | LOCATION_CACHE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 1s | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | location_cache_refresh_rpc_timeout | NULL | 500ms | The timeout used for refreshing location cache by RPC. Range: [1ms, +∞) | LOCATION_CACHE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 500ms | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | rpc_timeout | NULL | 2s | the time during which a RPC request is permitted to execute before it is terminated | RPC | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 2s | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | balancer_task_timeout | NULL | 20m | the time to execute the load-balancing task before it is terminated. Range: [1s, +∞) | LOAD_BALANCE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 20m | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | dead_socket_detection_timeout | NULL | 3s | specify a tcp_user_timeout for RFC5482. A zero value makes the option disabled, Range: [0, 2h] | OBSERVER | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 3s | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | debug_sync_timeout | NULL | 0 | Enable the debug sync facility and optionally specify a default wait timeout in micro seconds. A zero value keeps the facility disabled, Range: [0, +∞] | OBSERVER | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 0 | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | internal_sql_execute_timeout | NULL | 30s | the number of microseconds an internal DML request is permitted to execute before it is terminated. Range: [1000us, 1h] | OBSERVER | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE | 30s | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | arbitration_timeout | NULL | 5s | The timeout before automatically degrading when arbitration member exists. Range: [3s,+∞] | TRANS | TENANT | DEFAULT | DYNAMIC_EFFECTIVE | 5s | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | ob_query_switch_leader_retry_timeout | NULL | 0ms | max time spend on retry caused by leader swith or network disconnectionRange: [0ms, +∞) | OBSERVER | TENANT | DEFAULT | DYNAMIC_EFFECTIVE | 0ms | 1 |
| zone1 | observer | 11.158.31.20 | 22602 | standby_db_fetch_log_rpc_timeout | NULL | 15s | The threshold for detecting the RPC timeout for the standby tenant to fetch log from the log restore source tenant. When the rpc timeout, the log transport service switches to another server of the log restore source tenant to fetch logs. Range: [2s, +∞) | LOGSERVICE | TENANT | DEFAULT | DYNAMIC_EFFECTIVE | 15s | 1 |
+-------+----------+--------------+----------+-----------------------------------------------+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+---------+---------+-------------------+---------------+-----------+
11 rows in set (0.104 sec)
如果遇到了 SQL 超时,想知道这条 SQL 到底受到了哪个超时时间的限制,看报错日志好像也看不出来,因为无论由于什么超时报错,错误码好像都是 4012。
希望未来 OceanBase 能够把由于不同超时时间导致的报错,匹配上不同的错误码,提升一些日志的易用性。例如 ob_query_timeout 超时报错 40120,ob_trx_timeout 超时报错 40121,_ob_ddl_timeout 超时报错 40122。这样大家看到错误码后,马上就可以知道要调整哪个超时时间了。