怎么处理和预防OceanBase的"内存使用率持续增长,最终触发OOM(Out of Memory)"的问题?

怎么处理和预防OceanBase的"内存使用率持续增长,最终触发OOM(Out of Memory)"的问题?
最好有实践方案。谢谢。

一般出现OOM时一般为现象有这些

1、memstore_used_percent指标持续上升超过85%;

2、客户反应查询响应时间明显变慢,出现大量等待事件;

3、触发OOM导致节点重启或服务不可用;

4、后台日志中出现"memory not enough"或"allocate memory failed"警告。

查询方法基本通用,解决优化办法仅供参考!
查询方法:
– 1.1 查看全局内存状态(系统租户执行)
SELECT
svr_ip,
zone,
hold/1024/1024/1024 as hold_gb,
free/1024/1024/1024 as free_gb,
total/1024/1024/1024 as total_gb,
(hold/total)*100 as used_percent
FROM __all_virtual_memory_info
WHERE ctx_name = ‘OB_SERVER’
ORDER BY used_percent DESC;

– 1.2 查看MemStore使用情况
SELECT
t.tenant_name,
s.svr_ip,
s.memstore_used/1024/1024/1024 as used_gb,
s.memstore_limit/1024/1024/1024 as limit_gb,
(s.memstore_used/s.memstore_limit)*100 as used_percent,
s.freeze_cnt,
s.active_memstore_used/1024/1024 as active_mb
FROM oceanbase.__all_virtual_tenant_memstore_info s
JOIN oceanbase.__all_tenant t ON s.tenant_id = t.tenant_id
WHERE s.memstore_used_percent > 70
ORDER BY used_percent DESC;

– 1.3 查看各模块内存使用明细
SELECT
ctx_name,
mod_name,
hold/1024/1024 as hold_mb,
used/1024/1024 as used_mb,
count,
alloc_count,
free_count
FROM __all_virtual_memory_info
WHERE hold > 10010241024 – 大于100MB的模块
ORDER BY hold DESC
LIMIT 20;
– 1.4 查看触发冻结的情况
SELECT
tenant_id,
svr_ip,
freeze_cnt,
last_freeze_time,
memstore_freeze_trigger_percentage,
active_memstore_used/1024/1024 as active_mb
FROM __all_virtual_tenant_memstore_info
WHERE freeze_cnt > 0
ORDER BY freeze_cnt DESC;

– 1.5 查看内存泄漏嫌疑对象
SELECT
tenant_id,
sql_id,
plan_id,
exec_count,
elapsed_time,
cpu_time,
get_plan,
hit_plan,
disk_reads,
retry_cnt,
row_cache_hit,
bloom_filter_cache_hit
FROM oceanbase.gv$sql_audit
WHERE tenant_id = 1001 – 指定租户
AND elapsed_time > 1000000 – 执行时间>1秒
ORDER BY elapsed_time DESC
LIMIT 50;

– 1.6 查看工作线程内存使用
SELECT
svr_ip,
thread_id,
req_id,
session_id,
trace_id,
retry_cnt,
total_time,
queue_time,
exec_time
FROM __all_virtual_thread
WHERE status = ‘RUNNING’
AND total_time > 10000000 – 运行超过10秒
ORDER BY total_time DESC;

处理办法:
– 1.1 手动触发Major Freeze(全局合并)
– 在系统租户执行
ALTER SYSTEM MAJOR FREEZE;

– 或指定租户触发
ALTER SYSTEM MAJOR FREEZE TENANT = ‘your_tenant’;

– 1.2 强制冻结MemStore
– 降低冻结触发阈值(临时措施)
ALTER SYSTEM SET memstore_freeze_trigger_percentage = 70;
ALTER SYSTEM SET freeze_trigger_percentage = 70;

– 1.3 终止高内存消耗会话
– 查找高内存消耗会话
SELECT
session_id,
tenant_id,
user_id,
sql_id,
status,
total_memory_used/1024/1024 as memory_mb,
active_time
FROM __all_virtual_processlist
WHERE total_memory_used > 10010241024 – 内存使用>100MB
ORDER BY total_memory_used DESC;

– 1.4 终止指定会话
KILL ‘session_id’;
– 或通过系统视图
ALTER SYSTEM KILL SESSION ‘server_ip:session_id’;
– 1.5 动态调整Unit内存配置(需要4.0+版本)
– 查看当前Unit配置
SELECT * FROM __all_unit_config
WHERE unit_config_id = <your_unit_id>;

– 1.6 临时增加内存(如果物理内存充足)
ALTER RESOURCE UNIT your_unit
MIN_MEMORY = ‘32G’,
MAX_MEMORY = ‘64G’,
MEMORY_SIZE = ‘48G’;

– 1.7 调整租户内存比例
– 查看租户内存配置
SELECT
tenant_id,
tenant_name,
unit_config_id,
unit_count,
max_cpu,
min_memory/1024/1024/1024 as min_memory_gb,
max_memory/1024/1024/1024 as max_memory_gb
FROM __all_tenant
WHERE tenant_id > 1000;

– 1.8 修改租户内存配额(系统租户执行)
ALTER TENANT your_tenant
SET RESOURCE_MEMORY_SIZE = ‘100G’;

预防优化方案
– 2.1 调整MemStore相关参数
– 设置合理的冻结阈值
ALTER SYSTEM SET memstore_freeze_trigger_percentage = 80; – 默认85
ALTER SYSTEM SET freeze_trigger_percentage = 70; – 默认70

– 调整冻结后保留比例
ALTER SYSTEM SET memstore_limit_percentage = 50; – 默认50%
ALTER SYSTEM SET writing_limit_threshold = 85; – 默认85%

– 2.2 优化工作线程内存
– 调整并行查询内存
ALTER SYSTEM SET parallel_servers_target = 32; – 默认CPU核数
ALTER SYSTEM SET parallel_max_servers = 64; – 默认CPU核数*2
ALTER SYSTEM SET _parallel_threads_per_cpu = 2; – 默认2

– 设置PGA内存限制
ALTER SYSTEM SET pga_aggregate_limit = ‘10G’; – 默认无限制
ALTER SYSTEM SET pga_aggregate_target = ‘5G’; – 默认1G

– 2.3 缓存大小优化
– 调整行缓存
ALTER SYSTEM SET row_cache_capacity = ‘2G’; – 默认100M

– 调整Bloom Filter缓存
ALTER SYSTEM SET bloom_filter_cache_size = ‘1G’; – 默认100M

– 调整块索引缓存
ALTER SYSTEM SET block_index_cache_size = ‘2G’; – 默认100M

2 个赞

如果在故意刷积分 会给你们禁言

1 个赞