The Effect of the NUMA Memory Policy preferred

2020-02-20

My colleague Nie and I have been stuck in the swamp of how to run as many virtual machines as possible on limited hardware.

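We want the benefits of NUMA, but memory is always short and a strict policy can lead to OOM, so we have to let processes use memory across NUMA nodes.
On a NUMA-enabled physical host, after giving a virtual machine the memory policy preferred through its libvirt XML, the expectation is that it uses memory on the preferred node first and turns to other nodes only when that node runs short.
In practice, however, many processes end up with memory on both nodes, even though numastat -m shows that the preferred node often still holds plenty of reclaimable buffer/cache.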

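It turned out that this behavior depends on zone_reclaim_mode. When zone_reclaim_mode is 0 and the free memory of the current zone is below the low watermark, the allocator prefers to take memory from another zone. All zones are ordered into a list by NUMA access cost; the allocator walks that list in order, compares each zone's free pages with its low watermark, and tries to allocate.
When zone_reclaim_mode is 1, the allocator prefers to reclaim memory in the current zone first; if reclaiming some buffer/cache is enough to satisfy the request, it allocates from the current zone.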
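
The difference between the two modes can be illustrated with a toy model. The sketch below is a simplification and not kernel code: struct toy_zone, toy_pick_zone and their fields are hypothetical names, the watermark test is reduced to a single comparison, the reclaim-distance check is ignored, and reclaim is assumed to free exactly as many pages as are missing.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a zone; every field is illustrative, not a kernel structure. */
struct toy_zone {
    int node;                  /* NUMA node the zone belongs to */
    unsigned long free;        /* free pages */
    unsigned long low;         /* low watermark, in pages */
    unsigned long reclaimable; /* page cache that reclaim could free */
};

/*
 * Walk the zones in fallback order (index 0 is the preferred node) and
 * return the index of the first zone that can serve a request of
 * `pages` pages, or -1 if none can.
 */
int toy_pick_zone(struct toy_zone *zones, size_t n, unsigned long pages,
                  int reclaim_mode)
{
    assert(zones != NULL);

    for (size_t i = 0; i < n; i++) {
        struct toy_zone *z = &zones[i];

        /* Simplified watermark check: the request must fit above low. */
        if (z->free >= z->low + pages)
            return (int)i;

        /* Mode 0: do not reclaim, fall through to the next zone. */
        if (reclaim_mode == 0)
            continue;

        /* Mode 1: reclaim the missing pages from this zone's cache. */
        unsigned long need = z->low + pages - z->free;
        unsigned long got = need < z->reclaimable ? need : z->reclaimable;
        z->reclaimable -= got;
        z->free += got;

        if (z->free >= z->low + pages)
            return (int)i;
    }
    return -1;
}
```

For example, take a preferred zone with 100 free pages, a low watermark of 90 and 500 reclaimable pages, followed by a remote zone with 1000 free pages. A 32-page request lands on the remote zone (index 1) in mode 0 and on the preferred zone (index 0) in mode 1.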

In the code, an allocation through __alloc_pages_nodemask calls get_page_from_freelist, and when that function cannot produce a page, the allocator enters __alloc_pages_slowpath.

// mm/page_alloc.c
/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
							nodemask_t *nodemask)
{
    ...
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
    ...
    if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))         // builds the zonelist from the preferred node id and stores it in ac
		return NULL;
    ...
    page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
    if (likely(page))
		goto out;
    ...
	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
out:
    ...
}

Looking only at get_page_from_freelist:

 // mm/page_alloc.c
 /*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
						const struct alloc_context *ac)
{
    ...
    	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,// walk the zonelist, which is already sorted by access cost
								ac->nodemask) {
            ...
    		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);           // alloc_flags is passed in from __alloc_pages_nodemask: ALLOC_WMARK_LOW OR'ed with other flags
		    if (!zone_watermark_fast(zone, order, mark,                         // checks free pages against the low watermark; if the check passes, allocate from this zone directly
				       ac_classzone_idx(ac), alloc_flags)) {            
                ...
                if (node_reclaim_mode == 0 ||                                   // below the low watermark: node_reclaim_mode is the variable behind the zone_reclaim_mode sysctl; if it is 0 or the zone does not allow reclaim, skip this zone
                    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))    
                    continue;

                ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);          // otherwise, run reclaim on this zone's node
                switch (ret) {
                case NODE_RECLAIM_NOSCAN:
                    /* did not scan */
                    continue;
                case NODE_RECLAIM_FULL:
                    /* scanned but unreclaimable */
                    continue;
                default:
                    /* did we reclaim enough */
                    if (zone_watermark_ok(zone, order, mark,                    // if enough memory was reclaimed, try to allocate from this zone
                            ac_classzone_idx(ac), alloc_flags))
                        goto try_this_zone;

                    continue;
                }
            }
try_this_zone:
		    page = rmqueue(ac->preferred_zoneref->zone, zone, order,            // rmqueue is the actual attempt to allocate from the current zone
			    gfp_mask, alloc_flags, ac->migratetype);
            ...
		}

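So with the preferred policy, buffer/cache eats into the free pages, and once free drops below the low watermark with zone_reclaim_mode set to 0, the allocation goes to another zone. That is why most processes appear to have some memory on every node. With zone_reclaim_mode set to 1, the allocator prefers to reclaim the local zone's buffer/cache first, and this shows up as a rise in pgscand in sar.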
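
One way to see how close each zone sits to its low watermark is to read /proc/zoneinfo. The following is a minimal sketch, assuming the usual layout of that file (a "Node N, zone NAME" header followed by "pages free" and "low" lines); struct zone_wmark, parse_zoneinfo and zone_below_low are hypothetical names introduced here, not part of any existing tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Free pages and low watermark of one zone (hypothetical helper type). */
struct zone_wmark {
    int node;
    char name[32];
    unsigned long free;
    unsigned long low;
};

/*
 * Parse text in /proc/zoneinfo format from `fp` and fill at most `max`
 * entries of `out`. Returns the number of zones found.
 */
size_t parse_zoneinfo(FILE *fp, struct zone_wmark *out, size_t max)
{
    char line[256];
    size_t n = 0;

    assert(fp != NULL && out != NULL);

    while (fgets(line, sizeof line, fp)) {
        int node;
        char name[32];
        unsigned long v;

        if (sscanf(line, "Node %d, zone %31s", &node, name) == 2) {
            /* A new zone section starts here. */
            if (n == max)
                break;
            out[n].node = node;
            strcpy(out[n].name, name);
            out[n].free = 0;
            out[n].low = 0;
            n++;
        } else if (n > 0 && sscanf(line, " pages free %lu", &v) == 1) {
            out[n - 1].free = v;
        } else if (n > 0 && sscanf(line, " low %lu", &v) == 1) {
            out[n - 1].low = v;
        }
    }
    return n;
}

/* Returns 1 if the zone's free pages are below its low watermark. */
int zone_below_low(const struct zone_wmark *z)
{
    return z->free < z->low;
}
```

To use it on a live system, pass the stream returned by fopen("/proc/zoneinfo", "r") and print the zones of the preferred node for which zone_below_low returns 1. With zone_reclaim_mode set to 0, those are the zones the allocator would skip.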

Reference: https://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases#reproduce