Buddy system

While the page allocator serves as the interface for memory allocations (in multiples of the page size), the buddy system operates at the back end to manage physical pages. The algorithm manages all physical pages for each zone, and is optimized for allocating large physically contiguous blocks of pages while minimizing external fragmentation. Let's explore its operational details.

The zone descriptor structure contains an array of struct free_area instances; the size of the array is defined by the kernel macro MAX_ORDER, whose default value is 11:

  struct zone {
        ...
        ...
        struct free_area free_area[MAX_ORDER];
        ...
        ...
  };

Each offset contains an instance of the free_area structure. All free pages are split into 11 (MAX_ORDER) lists, each containing blocks of 2^order pages, with order values in the range 0 to 10 (that is, assuming 4 KB pages, the order-2 list holds 16 KB blocks, the order-3 list holds 32 KB blocks, and so on). This strategy ensures that each block is naturally aligned. Blocks in each list are exactly double the size of blocks in the next lower list, resulting in faster allocation and deallocation operations. It also provides the allocator with the capability to handle contiguous allocations of up to 4 MB (the order-10 list).

When an allocation request is made for a particular size, the buddy system looks into the appropriate list for a free block and returns its address, if available. However, if it cannot find a free block, it moves on to check the next higher-order list for a larger block. If one is available, it splits the higher-order block into equal halves called buddies, returns one to the caller, and queues the other on the lower-order list. When both buddy blocks become free at some future time, they are coalesced to create a larger block. The algorithm identifies buddy blocks through their aligned addresses, which makes it possible to coalesce them.

Let's consider an example to comprehend this better. Assume there is a request to allocate an 8 KB block (through the page allocator routines). The buddy system looks for a free block in the 8 KB list of the free_area array (the order-1 offset, containing 2^1-page blocks), and returns the start linear address of the block, if available. However, if there are no free blocks in the 8 KB list, it moves on to the next higher-order list, of 16 KB blocks (the order-2 offset), to find a free block. Let's further assume that there are no free blocks in this list either. It then moves ahead to the next higher-order list, of 32 KB blocks (the order-3 offset); if a block is available there, it splits the 32 KB block into two equal halves of 16 KB each (buddies). The first 16 KB chunk is further split into two halves of 8 KB (buddies), of which one is allocated to the caller and the other is put on the 8 KB list. The second 16 KB chunk is put on the 16 KB free list. When the lower-order (8 KB) buddies become free at some future time, they are coalesced to form a higher-order 16 KB block, and when both 16 KB buddies become free, they are again coalesced to arrive at a 32 KB block, which is put back on the free list.

When a request for allocation from a desired zone cannot be processed, the buddy system uses a fallback mechanism to look into other zones and nodes.

The buddy system has a long history, with extensive implementations across various *nix operating systems, each with appropriate optimizations. As discussed earlier, it enables fast memory allocation and deallocation, and it also minimizes external fragmentation to some degree. With the advent of huge pages, which provide much-needed performance benefits, it has become all the more important to further anti-fragmentation efforts. To accomplish this, the Linux kernel's implementation of the buddy system is equipped with anti-fragmentation capability through page migration.

Page migration is a process of moving data of a virtual page from one physical memory region to another. This mechanism helps create larger blocks with contiguous pages. To realize this, pages are categorized into the following types:

1. Unmovable pages: Physical pages which are pinned and reserved for a specific allocation are considered unmovable. Pages pinned for the core kernel fall into this category. These pages are non-reclaimable.

2. Reclaimable pages: Physical pages mapped to a dynamic allocation that can be evicted to a backing store, and those which can be regenerated, are considered reclaimable. Pages held for file caching, anonymous page mappings, and those held by the kernel's slab caches fall into this category. Reclaim operations are carried out in two modes: periodic and direct reclaim. The former is achieved through a kthread called kswapd; when the system runs exceedingly short of memory, the kernel enters direct reclaim.

3. Movable pages: Physical pages that can be moved to different regions through the page migration mechanism. Pages mapped to the virtual address space of a user-mode process are considered movable, since all the VM subsystem needs to do is copy the data and change the relevant page table entries. This works because all access operations from a user-mode process go through page table translations.

The buddy system groups pages on the basis of their movability into independent lists, and uses them for appropriate allocations. This is achieved by organizing each 2^n list in struct free_area as a set of autonomous lists based on the mobility of pages. Each free_area instance holds an array of lists of size MIGRATE_TYPES; each offset holds the list_head of the respective group of pages:

 struct free_area {
        struct list_head free_list[MIGRATE_TYPES];
        unsigned long nr_free;
 };

nr_free is a counter that holds the total number of free pages for this free_area (all migration lists put together). The following diagram depicts free lists for each migration type:

The following enum defines page migration types:

enum {
        MIGRATE_UNMOVABLE,
        MIGRATE_MOVABLE,
        MIGRATE_RECLAIMABLE,
        MIGRATE_PCPTYPES,       /* the number of types on the pcp lists */
        MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
        MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
        MIGRATE_ISOLATE,        /* can't allocate from here */
#endif
        MIGRATE_TYPES
};

We have discussed the key migration types MIGRATE_MOVABLE, MIGRATE_UNMOVABLE, and MIGRATE_RECLAIMABLE. MIGRATE_PCPTYPES is a special type introduced to improve system performance; each zone maintains a list of cache-hot pages in a per-CPU page cache. These pages are used to serve allocation requests raised by the local CPU. The zone descriptor structure's pageset element points to pages in the per-CPU cache:

/* include/linux/mmzone.h */

struct per_cpu_pages {
        int count;              /* number of pages in the list */
        int high;               /* high watermark, emptying needed */
        int batch;              /* chunk size for buddy add/remove */

        /* Lists of pages, one per migrate type stored on the pcp-lists */
        struct list_head lists[MIGRATE_PCPTYPES];
};

struct per_cpu_pageset {
        struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
        s8 expire;
#endif
#ifdef CONFIG_SMP
        s8 stat_threshold;
        s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
};

struct zone {
        ...
        ...
        struct per_cpu_pageset __percpu *pageset;
        ...
        ...
};

struct per_cpu_pageset is an abstraction that represents unmovable, reclaimable, and movable page lists. MIGRATE_PCPTYPES is the count of per-CPU page lists sorted by page mobility. MIGRATE_CMA is the list of pages for the contiguous memory allocator, which we shall discuss in further sections.


The buddy system is implemented to fall back on alternate lists to process an allocation request when pages of the desired mobility are not available. The following array defines the fallback order for the various migration types; we will not elaborate further, as it is self-explanatory:

static int fallbacks[MIGRATE_TYPES][4] = {
        [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
        [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
        [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
#ifdef CONFIG_CMA
        [MIGRATE_CMA]         = { MIGRATE_TYPES }, /* Never used */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
        [MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
#endif
};