ERROR 4184 (53100): Server out of disk space

OB version: 4.3.0
Modifying a column default on a large table fails with: ERROR 4184 (53100): Server out of disk space
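
For context, the DDL that hit the error, as later captured in the rootservice log further down in this thread (table name xx and column name yy are redacted placeholders), is of this form:

-- Per the thread's later diagnosis, this statement runs as an offline DDL (table rebuild with a sort phase).
alter /*+ parallel(4) */ table xx modify `yy` varchar(14) NOT NULL default '';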

select /*+ query_timeout(30000000) */ a.TENANT_ID, a.DATABASE_NAME, a.TABLE_NAME, a.TABLE_ID,
sum(
  case
    when b.nested_offset = 0 then IFNULL(b.data_block_count+b.index_block_count+b.linked_block_count, 0) * 2 * 1024 * 1024
    else IFNULL(b.size, 0)
  end
) /1024/1024/1024 as data_size_in_GB
from CDB_OB_TABLE_LOCATIONS a left join __all_virtual_table_mgr b
on a.svr_ip = b.svr_ip and a.svr_port=b.svr_port and a.tenant_id = b.tenant_id and a.LS_ID = b.LS_ID and a.TABLET_ID = b.TABLET_ID
where a.role = 'LEADER' and a.tenant_id = 'xxxx' and a.DATABASE_NAME= 'xxxx' and a.TABLE_NAME = 'xxxx'
and b.table_type >= 10 and b.size > 0 group by a.TABLE_ID;
+----------+-----------------+
| TABLE_ID | data_size_in_GB |
+----------+-----------------+
|   501370 | 28.355468750000 |
+----------+-----------------+

The table is only about 28 GB, yet the disk space view shows plenty of free space.
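
The per-server output below matches the columns of GV$OB_SERVERS; presumably it was produced by a query along the following lines (the exact statement is not shown in the post, so this is an assumption):

-- Dump the capacity/usage row of one observer; the svr_ip filter is illustrative.
SELECT * FROM oceanbase.GV$OB_SERVERS WHERE SVR_IP = '10.xx.xx.xx'\G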

*************************** 1. row ***************************
                 SVR_IP: 10.xx.xx.xx
               SVR_PORT: 2882
                   ZONE: zone3
               SQL_PORT: 2881
           CPU_CAPACITY: 32
       CPU_CAPACITY_MAX: 32
           CPU_ASSIGNED: 25.5
       CPU_ASSIGNED_MAX: 25.5
           MEM_CAPACITY: 64424509440
           MEM_ASSIGNED: 60129542144
      LOG_DISK_CAPACITY: 206158430208
      LOG_DISK_ASSIGNED: 162403450880
        LOG_DISK_IN_USE: 129989869568
     DATA_DISK_CAPACITY: 322122547200
       DATA_DISK_IN_USE: 133297078272
DATA_DISK_HEALTH_STATUS: NORMAL
           MEMORY_LIMIT: 96636764160
    DATA_DISK_ALLOCATED: 322122547200
DATA_DISK_ABNORMAL_TIME: NULL
  SSL_CERT_EXPIRED_TIME: NULL

Disk space shows plenty remaining, yet the error still occurs.

Parameter query

show parameters where name in ('memory_limit','memory_limit_percentage','system_memory','log_disk_size','log_disk_percentage','datafile_size','datafile_disk_percentage');

Server-level resource allocation across the cluster

select zone,concat(SVR_IP,':',SVR_PORT) observer,
cpu_capacity_max cpu_total,cpu_assigned_max cpu_assigned,
cpu_capacity-cpu_assigned_max as cpu_free,
round(memory_limit/1024/1024/1024,2) as memory_total,
round((memory_limit-mem_capacity)/1024/1024/1024,2) as system_memory,
round(mem_assigned/1024/1024/1024,2) as mem_assigned,
round((mem_capacity-mem_assigned)/1024/1024/1024,2) as memory_free,
round(log_disk_capacity/1024/1024/1024,2) as log_disk_capacity,
round(log_disk_assigned/1024/1024/1024,2) as log_disk_assigned,
round((log_disk_capacity-log_disk_assigned)/1024/1024/1024,2) as log_disk_free,
round((data_disk_capacity/1024/1024/1024),2) as data_disk,
round((data_disk_in_use/1024/1024/1024),2) as data_disk_used,
round((data_disk_capacity-data_disk_in_use)/1024/1024/1024,2) as data_disk_free
from oceanbase.gv$ob_servers;

Tenant-level resource allocation and disk usage across the cluster

select a.zone,a.svr_ip,b.tenant_name,b.tenant_type, a.max_cpu, a.min_cpu,
round(a.memory_size/1024/1024/1024,2) memory_size_gb,
round(a.log_disk_size/1024/1024/1024,2) log_disk_size,
round(a.log_disk_in_use/1024/1024/1024,2) log_disk_in_use,
round(a.data_disk_in_use/1024/1024/1024,2) data_disk_in_use
from oceanbase.gv$ob_units a join oceanbase.dba_ob_tenants b on a.tenant_id=b.tenant_id order by b.tenant_name;

Check the disk size occupied by a single replica of a given table

SELECT sum(size)/1024/1024/1024 FROM (SELECT DATABASE_NAME,TABLE_NAME,TABLE_ID,PARTITION_NAME,TABLET_ID,ROLE
FROM oceanbase.DBA_OB_TABLE_LOCATIONS ) AA full join
(SELECT distinct(TABLET_ID) ,size
FROM oceanbase.GV$OB_SSTABLES ) BB on AA.TABLET_ID=BB.TABLET_ID
WHERE AA.role='leader' and AA.table_name='${table_name}';

Calculate the size of each tenant

SELECT t.tenant_name,
round(sum(t2.data_size)/1024/1024/1024,2) as data_size_gb,
round(sum(t2.required_size)/1024/1024/1024,2) as required_size_gb
FROM oceanbase.dba_ob_tenants t,oceanbase.cdb_ob_table_locations t1,cdb_ob_tablet_replicas t2
WHERE t.tenant_id=t1.tenant_id and t1.svr_ip=t2.svr_ip and t1.tenant_id=t2.tenant_id and t1.ls_id=t2.ls_id and t1.tablet_id=t2.tablet_id
group by t.tenant_name
order by 3 desc;

Please run the queries above and share the output.


The corresponding query output is as follows:

*************************** 15. row ***************************
      zone: zone3
  svr_type: observer
    svr_ip: 10. xx.xx.xx
  svr_port: 2882
      name: log_disk_percentage
 data_type: INT
     value: 0
      info: the percentage of disk space used by the log files. Range: [0,99] in integer;only effective when parameter log_disk_size is 0;when log_disk_percentage is 0: a) if the data and the log are on the same disk, means log_disk_percentage = 30 b) if the data and the log are on the different disks, means log_disk_perecentage = 90
   section: LOGSERVICE
     scope: CLUSTER
    source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
*************************** 16. row ***************************
      zone: zone3
  svr_type: observer
    svr_ip: 10. xx.xx.xx
  svr_port: 2882
      name: log_disk_size
 data_type: CAPACITY
     value: 160GB
      info: the size of disk space used by the log files. Range: [0, +∞)
   section: LOGSERVICE
     scope: CLUSTER
    source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
*************************** 17. row ***************************
      zone: zone3
  svr_type: observer
    svr_ip: 10. xx.xx.xx
  svr_port: 2882
      name: memory_limit_percentage
 data_type: INT
     value: 80
      info: the size of the memory reserved for internal use(for testing purpose). Range: [10, 95]
   section: OBSERVER
     scope: CLUSTER
    source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
*************************** 18. row ***************************
      zone: zone3
  svr_type: observer
    svr_ip: 10. xx.xx.xx
  svr_port: 2882
      name: system_memory
 data_type: CAPACITY
     value: 30G
      info: the memory reserved for internal use which cannot be allocated to any outer-tenant, and should be determined to guarantee every server functions normally. Range: [0M,)
   section: OBSERVER
     scope: CLUSTER
    source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
*************************** 19. row ***************************
      zone: zone3
  svr_type: observer
    svr_ip: 10. xx.xx.xx
  svr_port: 2882
      name: memory_limit
 data_type: CAPACITY
     value: 90G
      info: the size of the memory reserved for internal use(for testing purpose), 0 means follow memory_limit_percentage. Range: 0, [1G,).
   section: OBSERVER
     scope: CLUSTER
    source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
*************************** 20. row ***************************
      zone: zone3
  svr_type: observer
    svr_ip: 10. xx.xx.xx
  svr_port: 2882
      name: datafile_disk_percentage
 data_type: INT
     value: 0
      info: the percentage of disk space used by the data files. Range: [0,99] in integer
   section: SSTABLE
     scope: CLUSTER
    source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
*************************** 21. row ***************************
      zone: zone3
  svr_type: observer
    svr_ip: 10. xx.xx.xx
  svr_port: 2882
      name: datafile_size
 data_type: CAPACITY
     value: 300G
      info: size of the data file. Range: [0, +∞)
   section: SSTABLE
     scope: CLUSTER
    source: DEFAULT
edit_level: DYNAMIC_EFFECTIVE
21 rows in set (0.01 sec)

@淇铭

Please capture the log information via the steps below:

  1. If the SQL fails immediately, the recommended approach is to capture the trace_id via the sys tenant.
    a. Log in to the sys tenant and enable the enable_rich_error_msg parameter:
    alter system set enable_rich_error_msg=true;
    b. Log in to the business tenant and run the failing SQL; the error message now returns the executing node's IP and the trace_id directly.
    obclient [test]> select count(*) from t2;
    ERROR 1146 (42S02): Table 'test.t2' doesn't exist
    [xx.xx.xx.1:2882] [2024-04-13 20:10:20.292087] [YB420BA1CC68-000615A0A8EA5E38-0-0]
    c. Go to node xx.xx.xx.1 and grep the logs for the trace_id; if the latest log file has no match, grep across the rotated log files as well.
    [root@x.x.x.1 ~]$ grep "YB420BA1CC68-000615A0A8EA5E38-0-0" rootservice.log
    [root@x.x.x.1 ~]$ grep "YB420BA1CC68-000615A0A8EA5E38-0-0" observer.log
    d. After collecting the log information, disable the enable_rich_error_msg parameter:
    alter system set enable_rich_error_msg=false;

Follow the steps above to locate the log entries for the statement you executed, and then share the complete observer.log and rootservice.log.

[2025-01-02 18:32:57.120221] INFO  [RS] process_ (ob_rs_rpc_processor.h:232) [41185][DDLQueueTh0][T0][YB420A0C4B42-00061AE1F38ECB8F-0-0] [lt=23] [DDL] execute ddl like stmt(ret=0, cost=233833, ddl_arg=ddl_
stmt_str:"alter /*+ parallel(4) */  table  xx  modify `yy` varchar(14)   NOT NULL default ''", exec_tenant_id:1002, ddl_id_str:"", sync_from_primary:false, based_schema_object_infos:[{
schema_id:501370, schema_type:5, schema_version:1735812219219024, schema_tenant_id:0}], parallelism:4, task_id:350233035, consumer_group_id:0)
[2025-01-02 18:32:57.121286] WDIAG [RS] report_error_code (ob_ddl_task.cpp:1199) [44536][DDLTaskExecutor][T0][YB420A0C4B42-00061AE1F38ECB8F-0-0] [lt=14][errcode=-4008] load ddl user error failed(ret=-4008,
 dst_tenant_id_=1002, task_id_=350233035, object_id_=501370)

Please try to provide the entire observer.log; the excerpt above is too little to work with. Also provide the trace_id-filtered log entries collected via the steps above.

Result of the investigation: this is an offline DDL, which involves a sort phase; the temporary space inflated during the sort exhausted the data disk. After the user increased datafile_size, the DDL completed successfully.
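
The thread does not show the exact command used to enlarge the data file; a minimal sketch of that fix, assuming the disk has physical headroom (the 500G value is purely illustrative):

-- datafile_size is a cluster-level, dynamically effective parameter (see the SHOW PARAMETERS output above).
-- Enlarging it gives the offline DDL room for its temporary sort output; 500G is a placeholder value.
alter system set datafile_size = '500G';
-- Verify the new allocation afterwards:
select SVR_IP, DATA_DISK_CAPACITY, DATA_DISK_ALLOCATED, DATA_DISK_IN_USE from oceanbase.GV$OB_SERVERS;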


The best practice is, at OB deployment time, to set a reasonable initial size for the data file (datafile_size), then an auto-grow increment of about 1 GB (datafile_nextsize), and finally a maximum size the filesystem can accommodate. That way, in cases like this, the data file grows automatically instead of erroring out.
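
A minimal sketch of that deployment-time setup. The auto-extend parameter names below are assumptions based on my reading of OceanBase 4.x (the post above spells one of them datafile_nextsize); confirm the exact names on your build before running anything:

-- Check which datafile parameters your version actually exposes.
show parameters like 'datafile%';
-- Illustrative values only; tune them to your disk.
alter system set datafile_size = '300G';     -- reasonable initial allocation
alter system set datafile_next = '1G';       -- assumed auto-grow step (the post calls it datafile_nextsize)
alter system set datafile_maxsize = '800G';  -- assumed upper bound that the filesystem can actually hold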

Indeed, that is also a viable solution.

That is pretty brutal: the table is under 30 GB, with 50 columns and an average row size of about 3 KB, yet a DDL that changes one column's default needs on the order of 300 GB of temporary space.

OB is still optimizing this area; later versions will keep getting better.

4.3.0 was a freshly released version and probably had not yet merged the relevant optimization. Try the latest 4.3.3 and check again; it is much better. I have tested it.