4.0bp2版本sysbench压测write场景异常退出

【 使用环境 】 测试环境
配置:40C128G3.5T *3台
sysbench版本:1.0.17
【 OB or 其他组件 】 OB集群
【 使用版本 】4.0.0-bp2
【问题描述】
sysbench压测write场景正常压测试一段时间后,异常退出:
命令:
sysbench --db-driver=mysql --threads=50 --time=3000 --mysql-host=10.99.1.62 --mysql-port=2883 --mysql-user=‘root@t1’ --mysql-password=’’ --report-interval=2 --tables=100 /usr/share/sysbench/oltp_write_only.lua --db-ps-mode=disable run
报错:
FATAL: mysql_drv_query() returned error 6002 (Transaction rollbacked) for query ‘BEGIN’
FATAL: `thread_run’ function failed: /usr/share/sysbench/oltp_common.lua:405: SQL error, errno = 6002, state = ‘40000’: Transaction rollbacked
Error in my_thread_global_end(): 1 threads didn’t exit
日志:

 observer.log中日志太多,大概看了下,没有看到和sysbench相关的报错

但是压测insert场景是正常的,且性能非常不错

【复现路径】问题出现前后相关操作
完全重启后集群状态正常的情况下,压测依然会复现

obd cluster display obtest

Get local repositories and plugins ok
Open ssh connection ok
Cluster status check ok
Connect to observer ok
Wait for observer init ok
±---------------------------------------------+
| observer |
±-----------±--------±-----±------±-------+
| ip | version | port | zone | status |
±-----------±--------±-----±------±-------+
| 10.99.1.61 | 4.0.0.0 | 2881 | zone1 | ACTIVE |
| 10.99.1.62 | 4.0.0.0 | 2881 | zone2 | ACTIVE |
| 10.99.1.63 | 4.0.0.0 | 2881 | zone3 | ACTIVE |
±-----------±--------±-----±------±-------+
obclient -h10.99.1.61 -P2881 -uroot -Doceanbase -A

Connect to obproxy ok
±---------------------------------------------+
| obproxy |
±-----------±-----±----------------±-------+
| ip | port | prometheus_port | status |
±-----------±-----±----------------±-------+
| 10.99.1.62 | 2883 | 2884 | active |
±-----------±-----±----------------±-------+
obclient -h10.99.1.62 -P2883 -uroot -Doceanbase -A
±-----------------------------------------------+
| obagent |
±-----------±------------±-----------±-------+
| ip | server_port | pprof_port | status |
±-----------±------------±-----------±-------+
| 10.99.1.61 | 8088 | 8089 | active |
| 10.99.1.62 | 8088 | 8089 | active |
| 10.99.1.63 | 8088 | 8089 | active |
±-----------±------------±-----------±-------+
Connect to Prometheus ok
±--------------------------------------------------+
| prometheus |
±-----------------------±-----±---------±-------+
| url | user | password | status |
±-----------------------±-----±---------±-------+
| http://10.99.1.63:9090 | | | active |
±-----------------------±-----±---------±-------+
Connect to grafana ok
±----------------------------------------------------------------+
| grafana |
±-----------------------------------±------±----------±-------+
| url | user | password | status |
±-----------------------------------±------±----------±-------+
| http://10.99.1.63:3000/d/oceanbase | admin | oceanbase | active |
±-----------------------------------±------±----------±-------+

【问题现象及影响】

异常退出,由于是测试无影响

【附件】

sysbench --db-driver=mysql --threads=50 --time=3000 --mysql-host=10.99.1.62 --mysql-port=2883 --mysql-user=‘root@t1’ --mysql-password=’’ --report-interval=2 --tables=100 /usr/share/sysbench/oltp_write_only.lua --db-ps-mode=disable run

sysbench 1.0.17 (using system LuaJIT 2.0.4)

Running the test with following options:
Number of threads: 50
Report intermediate results every 2 second(s)
Initializing random number generator from current time

Initializing worker threads…

Threads started!

[ 2s ] thds: 50 tps: 8343.21 qps: 50134.69 (r/w/o: 0.00/33423.79/16710.90) lat (ms,95%): 8.74 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 50 tps: 8500.09 qps: 51019.53 (r/w/o: 0.00/34018.85/17000.68) lat (ms,95%): 8.58 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 50 tps: 8613.75 qps: 51677.02 (r/w/o: 0.00/34449.51/17227.51) lat (ms,95%): 8.28 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 50 tps: 7941.72 qps: 47640.79 (r/w/o: 0.00/31757.86/15882.93) lat (ms,95%): 9.56 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 50 tps: 7864.73 qps: 47189.40 (r/w/o: 0.00/31459.93/15729.47) lat (ms,95%): 9.56 err/s: 0.00 reconn/s: 0.00
[ 12s ] thds: 50 tps: 8264.43 qps: 49591.61 (r/w/o: 0.00/33062.24/16529.37) lat (ms,95%): 8.74 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 50 tps: 8221.60 qps: 49323.60 (r/w/o: 0.00/32880.90/16442.70) lat (ms,95%): 8.74 err/s: 0.00 reconn/s: 0.00
[ 16s ] thds: 50 tps: 7793.93 qps: 46766.06 (r/w/o: 0.00/31178.70/15587.35) lat (ms,95%): 9.73 err/s: 0.00 reconn/s: 0.00
[ 18s ] thds: 50 tps: 7865.17 qps: 47169.01 (r/w/o: 0.00/31439.67/15729.34) lat (ms,95%): 9.56 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 50 tps: 7612.33 qps: 45702.46 (r/w/o: 0.00/30475.81/15226.65) lat (ms,95%): 10.09 err/s: 0.00 reconn/s: 0.00
[ 22s ] thds: 50 tps: 7491.11 qps: 44954.17 (r/w/o: 0.00/29973.44/14980.72) lat (ms,95%): 10.65 err/s: 0.00 reconn/s: 0.00
[ 24s ] thds: 50 tps: 7803.23 qps: 46806.90 (r/w/o: 0.00/31198.44/15608.47) lat (ms,95%): 9.91 err/s: 0.00 reconn/s: 0.00
[ 26s ] thds: 50 tps: 8253.37 qps: 49528.72 (r/w/o: 0.00/33022.48/16506.24) lat (ms,95%): 8.90 err/s: 0.00 reconn/s: 0.00
[ 28s ] thds: 50 tps: 8308.94 qps: 49832.13 (r/w/o: 0.00/33214.25/16617.88) lat (ms,95%): 8.90 err/s: 0.00 reconn/s: 0.00
FATAL: mysql_drv_query() returned error 6002 (Transaction rollbacked) for query ‘BEGIN’
FATAL: `thread_run’ function failed: /usr/share/sysbench/oltp_common.lua:405: SQL error, errno = 6002, state = ‘40000’: Transaction rollbacked
Error in my_thread_global_end(): 1 threads didn’t exit

insert场景跑的挺好:

sysbench --db-driver=mysql --threads=600 --time=3000 --mysql-host=10.99.1.62 --mysql-port=2883 --mysql-user=‘root@t1’ --mysql-password=’’ --report-interval=10 --tables=
100 /usr/share/sysbench/oltp_insert.lua --db-ps-mode=disable run

SQL statistics:
queries performed:
read: 0
write: 149664933
other: 0
total: 149664933
transactions: 149664933 (49886.71 per sec.)
queries: 149664933 (49886.71 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)

General statistics:
total time: 3000.0947s
total number of events: 149664933

Latency (ms):
min: 0.63
avg: 12.02
max: 560.57
95th percentile: 37.56
sum: 1799589775.70

Threads fairness:
events (avg/stddev): 249441.5550/1058.10
execution time (avg/stddev): 2999.3163/0.02

补充集群config.yml

Only need to configure when remote login is required

user:
username: root
password: root
oceanbase-ce:
servers:

  • name: server1
    ip: 10.99.1.61
  • name: server2
    ip: 10.99.1.62
  • name: server3
    ip: 10.99.1.63

global:
devname: bond0
cluster_id: 1
memory_limit: 64G # The maximum running memory for an observer
system_memory: 45G
datafile_size: 3072G # Size of the data file.
log_disk_size: 192G # The size of disk space used by the clog files.
syslog_level: INFO # System log level. The default value is INFO.
enable_syslog_wf: false # Print system logs whose levels are higher than WARNING to a separate log file. The default value is true.
enable_syslog_recycle: true # Enable auto system log recycling or not. The default value is false.
max_syslog_file_count: 4 # The maximum number of reserved log files before enabling auto recycling. The default value is 0.
skip_proxy_sys_private_check: true
enable_strict_kernel_release: false

In this example , support multiple ob process in single node, so different process use different ports.

If deploy ob cluster in multiple nodes, the port and path setting can be same.

appname: obcluster

server1:
mysql_port: 2881 # External port for OceanBase Database. The default value is 2881. DO NOT change this value after the cluster is started.
rpc_port: 2882 # Internal port for OceanBase Database. The default value is 2882. DO NOT change this value after the cluster is started.
home_path: /tidata01/ob/observer
zone: zone1
system_memory: 10G
cpu_count: 38
server2:
mysql_port: 2881 # External port for OceanBase Database. The default value is 2881. DO NOT change this value after the cluster is started.
rpc_port: 2882 # Internal port for OceanBase Database. The default value is 2882. DO NOT change this value after the cluster is started.
home_path: /tidata01/ob/observer
zone: zone2
system_memory: 10G
cpu_count: 38
server3:
mysql_port: 2881 # External port for OceanBase Database. The default value is 2881. DO NOT change this value after the cluster is started.
rpc_port: 2882 # Internal port for OceanBase Database. The default value is 2882. DO NOT change this value after the cluster is started.
home_path: /tidata01/ob/observer
zone: zone3
system_memory: 10G
cpu_count: 38
obproxy-ce:

Set dependent components for the component.

When the associated configurations are not done, OBD will automatically get the these configurations from the dependent components.

depends:

  • oceanbase-ce
    servers:
  • 10.99.1.62
    global:
    listen_port: 2883 # External port. The default value is 2883.
    prometheus_listen_port: 2884 # The Prometheus port. The default value is 2884.
    home_path: /tidata01/ob/obproxy
    enable_cluster_checkout: false
    skip_proxy_sys_private_check: true
    enable_strict_kernel_release: false
    10.99.1.62:
    enable_cluster_checkout: false
    obagent:
    depends:
  • oceanbase-ce
    servers:
  • name: server1
    ip: 10.99.1.61
  • name: server2
    ip: 10.99.1.62
  • name: server3
    ip: 10.99.1.63
    global:
    home_path: /tidata01/ob/obagent
    ob_monitor_status: active
    prometheus:
    depends:
  • obagent
    servers:
  • 10.99.1.63
    global:
    home_path: /tidata01/ob/prometheus
    grafana:
    depends:
  • prometheus
    servers:
  • 10.99.1.63
    global:
    home_path: /tidata01/ob/grafana
    login_password: oceanbase

补充日志

##obproxy_error.log

2023-01-12 09:22:20.640739,undefined,obcluster:t1:sbtest,OB_MYSQL,sbtest65,sbtest65,OB_MYSQL_COM_QUERY,UPDATE,failed,1317,UPDATE sbtest65 SET c=‘41445603386-23814668590-50752016844-47645735530-26052180208-91425821062-58428280024-53836823266-24077507483-10094755523’ WHERE id=5040,11219us,12us,0us,0us,Y0-00007FC7195639B0,0,10.99.1.63:2881,Query execution was interrupted,
2023-01-12 09:22:20.640799,undefined,obcluster:t1:sbtest,OB_MYSQL,sbtest78,sbtest78,OB_MYSQL_COM_QUERY,UPDATE,failed,1317,UPDATE sbtest78 SET c=‘88628438601-69803453343-45019764880-53517412986-77963722934-97584611429-74690784558-57633731737-13122800479-46637828063’ WHERE id=5049,11137us,10us,0us,0us,Y0-00007FC71937DBA0,0,10.99.1.61:2881,Query execution was interrupted,
2023-01-12 09:22:29.886166,undefined,obcluster:t1:sbtest,OB_MYSQL,sbtest94,sbtest94,OB_MYSQL_COM_QUERY,DELETE,failed,1213,DELETE FROM sbtest94 WHERE id=5017,551489us,9us,0us,551412us,Y0-00007FC708E9DC30,YB420A63013E-0005F1FC0F48ABF4-0-0,0,10.99.1.62:2881,Deadlock,YB420A63013E-0005F1FC0F48ABF4-0-0
2023-01-12 09:22:29.887720,undefined,obcluster:t1:sbtest,OB_MYSQL,OB_MYSQL_COM_QUERY,BEGIN,failed,6002,BEGIN,1037us,17us,0us,918us,Y0-00007FC708E9DC30,YB420A63013E-0005F1FC13685F47-0-0,0,10.99.1.62:2881,Transaction rollbacked,YB420A63013E-0005F1FC13685F47-0-0

##observer.log

#YB420A63013E-0005F1FC0F48ABF4 检索结果如下:
[2023-01-12 09:22:29.339279] WARN setup_next_scanner (ob_direct_receive_op.cpp:202) [79271][T1004_TNT_L0_G0][T1004][YB420A63013E-0005F1FC0F48ABF4-0-0] [lt=0]
[2023-01-12 09:22:29.339303] WARN [SQL.EXE] setup_next_scanner (ob_direct_receive_op.cpp:205) [79271][T1004_TNT_L0_G0][T1004][YB420A63013E-0005F1FC0F48ABF4-0-0] [lt=21] while fetching first scanner, the remote rcode is not OB_SUCCESS(ret=-6005, err_msg="", dst_addr="“10.99.1.63:2882"”)
#YB420A63013E-0005F1FC13685F47 这个搜不到

#补充信息
sysbench 的thread从10到500 都会报同样的问题,越大出现的更快。 10以下未测

根据YB420A63013E-0005F1FC13685F47搜索不到信息,这个有几种原因:
1.搜索的机器不对,这里显示执行的是10.99.1.62这台机器,所以要到这台机器上去搜索。
2.ob的日志会按时间段进行拆分,需要找到对应时间段的日志文件进行搜索。可以尝试搜索下对应时间段上下几个日志文件的内容
3.可以尝试下按照https://www.oceanbase.com/docs/community-observer-cn-10000000000901379进行参数调优,调大下事务的超时间。如果已经调整了参数,可以将syslog_level改回’info’, 这样打印的日志会更全一些

1.机器是对的,就是在62上查的
2.是刚测就搜的, YB420A63013E-0005F1FC0F48ABF4也是同一测试、同一时间发生的问题,如果有应该是一起。 全量搜索也查了,没搜索到;
3. 不知道问题的情况下改参数这个方案我不认可。 sysbench是个很基础的测试,默认参数也应该能正常测试,除非说明哪个参数有问题,为什么有问题?为什么有问题的参数不默认就调整好? 是否有存在其他风险? 测一个这么基础的场景就要调整参数,那是不是说明:要是能达到生产运行,或者测一些其他场景,是不是还要调整其他参数?

像事务超时时间这种参数,跟机器性能、网络、io都有关系,不可能说默认参数解决所有的场景。这里由于暂时没有有效日志可以分析,所以只是一个猜测可能是超时了导致的,如果没有调整日志级别的话,可以把全量日志打个包上传下,我们分析下看

以下参数修改后,仍然报错:
mysql> show variables like ‘%ob_trx_timeout%’;
±---------------±------------+
| Variable_name | Value |
±---------------±------------+
| ob_trx_timeout | 86400000000 |
±---------------±------------+
1 row in set (0.00 sec)

mysql> set global ob_trx_timeout=86400000000000;
Query OK, 0 rows affected (0.02 sec)

日志级,就是INFO,没改过
mysql> show parameters like ‘%syslog_level%’ \G;
*************************** 1. row ***************************
zone: zone2
svr_type: observer
svr_ip: 10.99.1.62
svr_port: 2882
name: syslog_level
data_type: NULL
value: INFO
info: specifies the current level of logging. There are DEBUG, TRACE, INFO, WARN, USER_ERR, ERROR, six different log levels.
section: OBSERVER
scope: CLUSTER
source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
*************************** 2. row ***************************
zone: zone1
svr_type: observer
svr_ip: 10.99.1.61
svr_port: 2882
name: syslog_level
data_type: NULL
value: INFO
info: specifies the current level of logging. There are DEBUG, TRACE, INFO, WARN, USER_ERR, ERROR, six different log levels.
section: OBSERVER
scope: CLUSTER
source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
*************************** 3. row ***************************
zone: zone3
svr_type: observer
svr_ip: 10.99.1.63
svr_port: 2882
name: syslog_level
data_type: NULL
value: INFO
info: specifies the current level of logging. There are DEBUG, TRACE, INFO, WARN, USER_ERR, ERROR, six different log levels.
section: OBSERVER
scope: CLUSTER
source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
3 rows in set (0.01 sec)

ERROR:
No query specified

我刚又测了一次,最后一行的”0005F1FC10B63ABF“日志还是找不到。

obproxy_error.log

2023-01-12 13:58:51.595151,undefined,obcluster:t1:sbtest,OB_MYSQL,sbtest35,sbtest35,OB_MYSQL_COM_QUERY,UPDATE,failed,1213,UPDATE sbtest35 SET c=‘79923627564-25260140511-98172154472-33559289457-30404402788-17942984824-00270691994-64741400460-61211964308-72714772725’ WHERE id=4995,493152us,15us,0us,493097us,Y0-00007FC708E9E030,YB420A63013F-0005F1FC0E05FD48-0-0,0,10.99.1.63:2881,Deadlock,YB420A63013F-0005F1FC0E05FD48-0-0
2023-01-12 13:58:51.595736,undefined,obcluster:t1:sbtest,OB_MYSQL,OB_MYSQL_COM_QUERY,BEGIN,failed,6002,BEGIN,391us,13us,0us,344us,Y0-00007FC708E9E030,YB420A63013F-0005F1FC10B63ABF-0-0,0,10.99.1.63:2881,Transaction rollbacked,YB420A63013F-0005F1FC10B63ABF-0-0

observer.log

[root@scq03-802A12U0809-performance_testing-app-1-63-msxf log]# cat observer.log.* |grep 0005F1FC10B63ABF
[root@scq03-802A12U0809-performance_testing-app-1-63-msxf log]# cat observer.log.* |grep 0005F1FC0E05FD48
[2023-01-12 13:58:51.112321] INFO [STORAGE.TRANS] sync_rollback_savepoint__ (ob_tx_api.cpp:1513) [100343][T1004_TNT_L0_G0][T1004][YB420A63013F-0005F1FC0E05FD48-0-0] [lt=0] all savepoint rollback succeed(tx.tx_id={txid:746435122}, remain_cnt=1, waittime=1000000, retries=0)
[2023-01-12 13:58:51.175251] INFO [STORAGE.TRANS] sync_rollback_savepoint__ (ob_tx_api.cpp:1513) [100395][T1004_TNT_L0_G0][T1004][YB420A63013F-0005F1FC0E05FD48-0-0] [lt=0] all savepoint rollback succeed(tx.tx_id={txid:746435122}, remain_cnt=1, waittime=1000000, retries=0)
[2023-01-12 13:58:51.250901] WARN setup_next_scanner (ob_direct_receive_op.cpp:202) [100401][T1004_TNT_L0_G0][T1004][YB420A63013F-0005F1FC0E05FD48-0-0] [lt=1]
[2023-01-12 13:58:51.252330] INFO [STORAGE.TRANS] sync_rollback_savepoint__ (ob_tx_api.cpp:1513) [100401][T1004_TNT_L0_G0][T1004][YB420A63013F-0005F1FC0E05FD48-0-0] [lt=1] all savepoint rollback succeed(tx.tx_id={txid:746435122}, remain_cnt=1, waittime=1000000, retries=0)
[2023-01-12 13:58:51.257245] INFO [STORAGE.TRANS] sync_rollback_savepoint__ (ob_tx_api.cpp:1513) [100401][T1004_TNT_L0_G0][T1004][YB420A63013F-0005F1FC0E05FD48-0-0] [lt=0] all savepoint rollback succeed(tx.tx_id={txid:746435122}, remain_cnt=1, waittime=1000000, retries=0)
[2023-01-12 13:58:51.371121] WARN setup_next_scanner (ob_direct_receive_op.cpp:202) [100380][T1004_TNT_L0_G0][T1004][YB420A63013F-0005F1FC0E05FD48-0-0] [lt=0]
[2023-01-12 13:58:51.374360] WARN setup_next_scanner (ob_direct_receive_op.cpp:202) [100380][T1004_TNT_L0_G0][T1004][YB420A63013F-0005F1FC0E05FD48-0-0] [lt=0]
[2023-01-12 13:58:51.484838] WARN setup_next_scanner (ob_direct_receive_op.cpp:202) [100322][T1004_TNT_L0_G0][T1004][YB420A63013F-0005F1FC0E05FD48-0-0] [lt=61]
[root@scq03-802A12U0809-performance_testing-app-1-63-msxf log]#

搜不到日志还有点奇怪,理论上不会没有日志。从当前的信息判断,大概率是锁冲突导致的,你的压测命令里应该没有指定table_size,这个默认20000,有点小。另外压测write only和read write需要在压测命令里加上–rand-type=uniform ,客户端请求会均衡一些,不然会有大量的锁冲突问题

我的测试目的就是为了找问题,不是为了一个报告,有问题我们解决问题就行,我们OB是一个通用数据库,能处理各种复杂的场景,而且这只是一个最基础的测试,我也相信OB是欢迎大家找出问题的。所以压测方式我暂时不做调整; 另外同样的压测,找一个配置非常低的mysql也不会是这样的表现,所以我坚持认为这是OB应该解决的一个问题,而不是调整压测方式。

目前提供的日志信息里,没有有效的信息,所以只能从经验上去给出建议。可以把报错期间的所有日志都以附件的形式贴上来,我们这边分析下看。另外可以查下GV$OB_SQL_AUDIT表,看下ret_code为6002对应的TRACE_ID以及TX_ID再去搜下日志

1.全量日志,由于环境原因发不出来;
2.ret_code=6002 没有结果