The XNU kernel provides a rich set of tools for allocating memory. Kernel memory allocation is not as trivial and straightforward as the malloc() / free() interface found in user space libraries. Kernel memory allocation facilities range from high-level mechanisms analogous to the user space malloc() interface to direct allocation of raw pages. There are dozens of functions for obtaining memory. Which one to use depends on the subsystem you are working within (for example, Mach, BSD, or the I/O Kit) as well as on the requirements for the memory, such as size or alignment. Memory is arguably one of the most limited resources on a computer system, especially on the iOS platform, whose devices have less physical memory than most Mac OS X-based computers.
At the fundamental level, the kernel keeps track of physical memory using the structure vm_page. A vm_page structure exists for every physical page of memory. Available pages are part of one of the following page lists:
Getting a free page from the free list is done with the vm_page_grab() function or its higher-level interface vm_page_alloc(), which, unlike the former, places the page in a vm_object as opposed to merely removing it from the free list. The kernel will signal the pageout daemon if it detects that the number of free pages falls below a threshold. In this case, the pager will evict pages from the inactive list in a least recently used (LRU) fashion. Pages that are mapped from an on-disk file are prime candidates because they can simply be discarded. The VM page cache and file system cache are combined on Mac OS X and iOS, which avoids duplication, and are collectively referred to as the Universal Buffer Cache (UBC). Pages originating from the file system are managed by the vnode pager, while pages in the VM cache are managed by the default pager.
The following sections will provide an overview of the various mechanisms for memory allocation available to kernel developers, as well as of their use and restrictions.
The kernel has several families of memory allocation routines. Each major subsystem, such as Mach, BSD, or I/O Kit, has its own family of functions. The VM subsystem lives in the Mach portion of the kernel, which implements the fundamental interfaces for allocating memory. These interfaces are in turn used to form higher-level memory allocation mechanisms for use in other subsystems such as BSD and I/O Kit.
For working in the Mach sections of the kernel, the kmem_alloc*() family of functions is used. These functions are fairly low-level and are only a few levels away from the raw vm_page_alloc() function. The following functions are available:
kern_return_t kmem_alloc(vm_map_t map, vm_offset_t* addrp, vm_size_t size);
kern_return_t kmem_alloc_aligned(vm_map_t map, vm_offset_t* addrp, vm_size_t size);
kern_return_t kmem_alloc_wired(vm_map_t map, vm_offset_t* addrp, vm_size_t size);
kern_return_t kmem_alloc_pageable(vm_map_t map, vm_offset_t* addrp, vm_size_t size);
kern_return_t kmem_alloc_contig(vm_map_t map, vm_offset_t* addrp, vm_size_t size,
vm_offset_t mask, int flags);
void kmem_free(vm_map_t map, vm_offset_t addr, vm_size_t size);
All the functions require you to specify a VM map belonging to either a user space task or kernel_map. All the above functions allocate wired memory, which cannot be paged out, with the exception of kmem_alloc_pageable().
The Mach zone allocator is an allocation mechanism that hands out fixed-size blocks of memory from pools called zones. A zone usually represents a commonly used kernel data structure, such as a file descriptor or a task descriptor, but can also point to blocks of memory for more general use. Examples of data structures allocated by the zone allocator include:
struct task
As a kernel programmer, you can create your own zones with the zinit() function if you need frequent and fast allocation and deallocation of data objects of the same type. To create a new zone, you need to tell the allocator the size of the object, the maximum amount of memory the zone may use, and the allocation size, which specifies how much memory will be added when the zone is exhausted.
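To make the idea of a zone concrete, the following is a minimal user-space sketch in C of a pool of fixed-size elements kept on a free list. It is not the XNU implementation: the names toy_zone, toy_zinit(), toy_zalloc(), and toy_zfree() are hypothetical, malloc() stands in for the kernel's page-level allocator, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space model of a zone: a pool of equal-sized elements. */
struct toy_zone {
    size_t elem_size;   /* size of one element */
    size_t max_size;    /* maximum memory the zone may use */
    size_t alloc_size;  /* memory added each time the zone runs dry */
    size_t cur_size;    /* memory currently owned by the zone */
    void  *free_list;   /* list threaded through the free elements */
};

static void toy_zinit(struct toy_zone *z, size_t elem_size,
                      size_t max_size, size_t alloc_size)
{
    /* Round the element size up so that every free element can hold an
     * aligned next-pointer. */
    if (elem_size == 0)
        elem_size = 1;
    elem_size = (elem_size + sizeof(void *) - 1) / sizeof(void *) * sizeof(void *);
    assert(alloc_size >= elem_size);

    z->elem_size = elem_size;
    z->max_size = max_size;
    z->alloc_size = alloc_size;
    z->cur_size = 0;
    z->free_list = NULL;
}

/* Adds alloc_size bytes to the zone; chunks are never given back in this sketch. */
static int toy_zone_grow(struct toy_zone *z)
{
    if (z->cur_size + z->alloc_size > z->max_size)
        return -1;                          /* the zone has reached its limit */
    char *chunk = malloc(z->alloc_size);    /* stand-in for grabbing pages */
    if (chunk == NULL)
        return -1;
    z->cur_size += z->alloc_size;

    /* Carve the chunk into elements and push each one onto the free list. */
    for (size_t off = 0; off + z->elem_size <= z->alloc_size; off += z->elem_size) {
        void *elem = chunk + off;
        *(void **)elem = z->free_list;
        z->free_list = elem;
    }
    return 0;
}

static void *toy_zalloc(struct toy_zone *z)
{
    if (z->free_list == NULL && toy_zone_grow(z) != 0)
        return NULL;
    void *elem = z->free_list;              /* pop the first free element */
    z->free_list = *(void **)elem;
    return elem;
}

static void toy_zfree(struct toy_zone *z, void *elem)
{
    *(void **)elem = z->free_list;          /* push the element back */
    z->free_list = elem;
}
```

Because every element in a zone has the same size, allocation and deallocation reduce to popping and pushing a list head, which is what makes this style of allocator attractive for frequently used objects.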
The kalloc family provides a slightly higher-level interface for fast memory allocation. The API would be familiar to those who have used the malloc() interface in user space. In fact, the kernel also has a malloc() function defined by the libkern kernel library, which again uses memory sourced by kalloc().
void* kalloc(vm_size_t size);
void* kalloc_noblock(vm_size_t size);
void* kalloc_canblock(vm_size_t size, boolean_t canblock);
void* krealloc(void** addrp, vm_size_t old_size, vm_size_t new_size);
void kfree(void *data, vm_size_t size);
Memory for the kalloc family of functions is obtained via the Mach zone allocator discussed in the previous section. Larger memory allocations are handled by the kmem_alloc() function. Because memory can come from two sources, the kfree() function needs to know the size of the original allocation to determine its origin and to free the memory in the appropriate place. The kalloc family provides the API upon which fundamental memory functions in I/O Kit and the BSD layer are built. It is also used to provide memory for the C++ new and new[] operators.
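The role of the size argument can be illustrated with a small user-space model in C. This is only a sketch under stated assumptions: toy_kalloc() and toy_kfree() are hypothetical names, malloc() and free() stand in for both kernel back ends, the two counters merely record which back end would have been used, and the 8 KB cutoff is assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_ZONE_MAX 8192u     /* assumed cutoff between the two back ends */

static size_t toy_zone_live;   /* outstanding blocks attributed to zones */
static size_t toy_kmem_live;   /* outstanding blocks attributed to kmem_alloc() */

static void *toy_kalloc(size_t size)
{
    void *p = malloc(size);    /* stand-in for both kernel back ends */
    if (p != NULL) {
        if (size <= TOY_ZONE_MAX)
            toy_zone_live++;
        else
            toy_kmem_live++;
    }
    return p;
}

static void toy_kfree(void *p, size_t size)
{
    assert(p != NULL);
    /* Nothing is stored alongside the block, so the caller-supplied size
     * is the only clue to where the block came from. */
    if (size <= TOY_ZONE_MAX)
        toy_zone_live--;
    else
        toy_kmem_live--;
    free(p);
}
```

Passing a size to toy_kfree() that differs from the one given to toy_kalloc() would charge the release to the wrong back end in this model, which mirrors why the kernel interface insists on the original allocation size.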
The kalloc functions and variants, except kalloc_noblock(), may block (sleep) to obtain memory. The same is true for the kfree() function. Therefore, you must use kalloc_noblock() if you need memory in an interrupt context or while holding a simple lock.
The available zones can be queried; following is the trimmed output of the zprint command showing the zones used by the kalloc functions.
                elem     cur     max     cur     max     cur alloc alloc
zone name       size    size    size   #elts   #elts   inuse  size count
---------------------------------------------------------------------------
kalloc.16         16    660K    922K   42240   59049   30284    4K   256  C
kalloc.32         32   3356K   4920K  107392  157464   73407    4K   128  C
kalloc.64         64   4792K   6561K   76672  104976   75837    4K    64  C
kalloc.128       128   2732K   3888K   21856   31104   20571    4K    32  C
kalloc.256       256   4248K   5184K   16992   20736   15950    4K    16  C
kalloc.512       512    968K   1152K    1936    2304    1870    4K     8  C
kalloc.1024     1024    784K   1024K     784    1024     735    4K     4  C
kalloc.2048     2048   3396K   4608K    1698    2304    1586    4K     2  C
kalloc.4096     4096   2204K   4096K     551    1024     508    4K     1  C
kalloc.8192     8192   3160K  32768K     395    4096     383    8K     1  C
kalloc.large   41375   5697K   6743K     141     166     141   40K     1  C
There is one zone for each power-of-two size from 16 bytes up to 8 KB. Allocations of up to 8 KB return an element from the smallest matching zone. It is not possible to partially allocate an element, so, for example, if you need 5000 bytes of memory, you will actually be allocated 8192 bytes (3192 bytes wasted per allocation!). Allocations greater than 8 KB are handled by the appropriate kmem_alloc() function instead of the zone allocator, but are nevertheless recorded in the virtual zone kalloc.large.
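The rounding rule can be sketched in a few lines of user-space C. The helper names toy_kalloc_zone_size() and toy_kalloc_waste() are hypothetical, and the size classes are taken from the zprint output shown above; other kernel versions may use a different set of zones.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_KALLOC_MIN 16u     /* smallest zone in the zprint output */
#define TOY_KALLOC_MAX 8192u   /* largest zone in the zprint output */

/* Returns the element size of the smallest zone that fits the request,
 * or 0 if the request is too large for any zone. */
static size_t toy_kalloc_zone_size(size_t size)
{
    if (size > TOY_KALLOC_MAX)
        return 0;
    size_t zone = TOY_KALLOC_MIN;
    while (zone < size)
        zone <<= 1;            /* step through 16, 32, 64, ... 8192 */
    return zone;
}

/* Returns the number of bytes lost to rounding for a zone-backed request. */
static size_t toy_kalloc_waste(size_t size)
{
    size_t zone = toy_kalloc_zone_size(size);
    assert(zone == 0 || zone >= size);
    return zone != 0 ? zone - size : 0;
}
```

For the 5000-byte request discussed in the text, toy_kalloc_zone_size() selects the 8192-byte zone and toy_kalloc_waste() reports 3192 bytes of overhead. If many objects of such an awkward size are needed, a dedicated zone sized for the object avoids that loss.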
Memory allocation in the BSD subsystem is implemented by the following functions and macros:
#define MALLOC(space, cast, size, type, flags) \
    (space) = (cast)_MALLOC(size, type, flags)
#define FREE(addr, type) _FREE((void *)addr, type)
#define MALLOC_ZONE(space, cast, size, type, flags) \
    (space) = (cast)_MALLOC_ZONE(size, type, flags)
#define FREE_ZONE(addr, size, type) _FREE_ZONE((void *)addr, size, type)
void* _MALLOC(size_t size, int type, int flags);
void _FREE(void *addr, int type);
void* _MALLOC_ZONE(size_t size, int type, int flags);
void _FREE_ZONE(void *elem, size_t size, int type);
Under the hood, the _MALLOC() function allocates memory using some variant of kalloc(), depending on the flags that are passed; for example, if non-blocking allocation is required (M_NOWAIT), kalloc_noblock() is called. The _MALLOC_ZONE() function invokes the zone allocator directly instead of indirectly through kalloc(). Instead of using the general-purpose kalloc.X zones, it allows you to access zones of commonly used object types, such as file descriptors, network sockets, or the mbuf descriptors used by the networking subsystem. The type argument is used to determine which zone to access. Although _MALLOC() also takes a type argument, it is ignored, except to check that the value is less than the maximum allowed. There are over a hundred different types defined. The flags parameter is built from the following values:
#define M_WAITOK 0x0000
#define M_NOWAIT 0x0001
#define M_ZERO 0x0004 /* bzero the allocation */
Tip: The MALLOC family of functions, along with the zone types, is defined in sys/malloc.h.
The M_ZERO flag, if specified, will use the bzero() function to overwrite the memory with zeros before the memory is returned to the caller. If not, the memory will still have the contents written there by the last user or will contain random garbage if never used.
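How the flag values are interpreted can be shown with a user-space sketch in C. The three M_* values are copied from the definitions above; toy_malloc() and toy_may_block() are hypothetical names, malloc() stands in for the kernel allocator, and memset() plays the role of bzero().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define M_WAITOK 0x0000
#define M_NOWAIT 0x0001
#define M_ZERO   0x0004   /* zero the allocation */

/* M_WAITOK is 0, so it cannot be tested with '&'. Blocking is allowed
 * whenever M_NOWAIT is absent. */
static int toy_may_block(int flags)
{
    return (flags & M_NOWAIT) == 0;
}

static void *toy_malloc(size_t size, int flags)
{
    assert(size > 0);
    void *p = malloc(size);          /* stand-in for the kalloc() variants */
    if (p != NULL && (flags & M_ZERO))
        memset(p, 0, size);          /* what bzero() does in the kernel */
    return p;
}
```

A caller that needs zeroed memory from a context that must not sleep would combine the flags as M_NOWAIT | M_ZERO and must be prepared for a NULL return.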
The I/O Kit provides a full set of functions for memory allocation. All the following functions return kernel virtual addresses, which can be accessed directly:
void* IOMalloc(vm_size_t size);
void* IOMallocAligned(vm_size_t size, vm_size_t alignment);
void* IOMallocPageable(vm_size_t size, vm_size_t alignment);
The corresponding functions for freeing memory are as follows.
void IOFree(void* address, vm_size_t size);
void IOFreeAligned(void* address, vm_size_t size);
void IOFreePageable(void* address, vm_size_t size);
The first function, IOMalloc(), is a wrapper for kalloc() and is subject to the same restrictions. Specifically, it cannot be used in an atomic context, such as a primary interrupt handler, as it may block (sleep) to obtain memory. Nor can IOMalloc() be used if aligned memory is required, as no guarantees are made. IOFree() is a wrapper for the kfree() function and may also block (sleep). It is also possible to deadlock the system if you call either IOMalloc() or IOFree() while holding a simple lock, such as OSSpinLock, because the thread may be descheduled if either function sleeps; an interrupt handler that then attempted to claim the same lock would deadlock. Furthermore, memory from IOMalloc() is intended for small and fast allocations and is not suitable for mapping into user space. Because the memory reserved for IOMalloc() comes from a small fixed-size pool, excessive use of IOMalloc() can drain this pool and panic the kernel if the pool is exhausted.
Caution: It is a bug to free memory allocated by, for example, IOMallocAligned() with IOFree(). Always use the free function corresponding to the original allocation function. Even if it works now (by accident), the mechanism could change in a future update and cause a crash.
IOMallocAligned() is subject to the same restrictions as IOMalloc(), but unlike IOMalloc(), it will return memory addresses aligned to a specific value. For example, if you need page-aligned memory you can pass in 4096 to get an address aligned to the beginning of a page. Following are some reasons for requesting aligned memory.
IOMallocPageable() allocates memory that can be paged, unlike the other variants, which always create memory that is wired and cannot be paged out. The restrictions that apply to IOMalloc() and IOMallocAligned() are also valid for IOMallocPageable(). Memory obtained from it cannot be used for device I/O such as DMA, or in a code path that is not able to block (sleep), unless it is wired down first.
There is also a last variant, IOMallocContiguous(), that allocates memory that is physically contiguous. Its use is now deprecated. Apple recommends using IOBufferMemoryDescriptor instead.
Each of the memory allocation functions has a corresponding function to free the memory. It is important to call the free function that matches the function you used for allocating the memory. Each of the variants sources memory from a different low-level mechanism, hence they are not interchangeable. In fact, IOMalloc() may source its memory from more than one source. Larger allocations (>8 KB) may be allocated with kmem_alloc(); however, smaller allocations come from the zone allocator.
This is the reason why you must pass in the size of the original allocation to the IOFree*() functions, as it is used to determine where the memory came from.
The libkern library implements a basic C++ runtime, upon which I/O Kit is built. Memory allocation in C++ is typically done with the new and new[] operators for single objects and arrays, respectively. In libkern, the new operator is implemented internally by calling kalloc() to obtain memory. Because kfree() requires the size of the original allocation, libkern modifies the size passed to the new operator to include space for a small structure that can hold the size of the allocation, so that when the delete operator calls kfree(), it can retrieve the size in the four bytes preceding the address returned by new.
Memory allocated by new or new[] is always zeroed out, unlike most implementations of these operators in user space.
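The size-header technique can be modeled in user-space C. This sketch is not the libkern code: toy_new(), toy_delete(), and toy_header are hypothetical names, calloc() and free() stand in for kalloc() and kfree(), and the header is padded to the platform's maximum alignment rather than being the four bytes described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header placed in front of every object. The union pads it
 * so that the object that follows stays suitably aligned. */
typedef union {
    size_t      size;   /* total size of the allocation, header included */
    max_align_t pad;
} toy_header;

static void *toy_new(size_t size)
{
    if (size > SIZE_MAX - sizeof(toy_header))
        return NULL;                            /* request would overflow */
    toy_header *h = calloc(1, sizeof(*h) + size); /* zeroed, as in libkern */
    if (h == NULL)
        return NULL;
    h->size = sizeof(*h) + size;                /* remember the size for later */
    return h + 1;                               /* caller sees the object only */
}

/* Frees the object and returns the recorded size, which is the value a
 * sized free routine such as kfree() would need. */
static size_t toy_delete(void *obj)
{
    assert(obj != NULL);
    toy_header *h = (toy_header *)obj - 1;      /* step back to the header */
    size_t total = h->size;
    free(h);
    return total;
}
```

The caller never sees the header, so the delete side can recover the allocation size from the object pointer alone, at the cost of a few extra bytes per object.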
Tip: The implementation of the new, new[], delete, and delete[] operators can be found in the XNU source distribution under libkern/c++/OSRuntime.cpp.