The XNU Kernel

The XNU kernel is large and complex, and a full architectural description is beyond the scope of this book (there are other books that fill this need), but in the following sections we will outline some of the major components that make up XNU and offer a brief description of their responsibilities and mode of operation. In most cases when programming for the kernel you will be writing extensions rather than modifying the core kernel itself (unless you happen to be an Apple engineer or a contributor to Darwin), but it is useful to have a basic understanding of the kernel as a whole, as it will give you a better sense of how a kernel extension fits within the bigger picture. Subsequent chapters will focus on some of the more important programming frameworks the kernel provides, such as I/O Kit.

The XNU kernel is the core of Mac OS X and iOS. XNU has a layered architecture consisting of three major components. The inner ring of the kernel is referred to as the Mach layer, derived from the Mach 3.0 kernel developed at Carnegie Mellon University. References to Mach throughout the book refer to Mach as it is implemented in OS X and iOS, not the original project. Mach was developed as a microkernel: a thin layer providing only fundamental services, such as processor management and scheduling, as well as inter-process communication (IPC), a core concept of the Mach kernel. Because of the layered architecture, there are minimal differences between the iOS and Mac OS X versions of XNU.

While the Mach layer in XNU has the same responsibilities as in the original project, other operating system services, such as file systems and networking, run in the same memory space as Mach. Apple cites performance as the key reason for doing this, as switching between address spaces (context switching) is an expensive operation.

Because the Mach layer is still, to some degree, an isolated component, many refer to XNU as a hybrid kernel, as opposed to a microkernel or a monolithic kernel, where all OS services run in the same context. Figure 2-3 shows a simplified view of XNU's architecture.


Figure 2-3. The XNU kernel architecture

The second major component of XNU is the BSD layer, which can be thought of as an outer ring around the Mach layer. The BSD layer provides the programming interface used by end-user applications; its responsibilities include process management, file systems, and networking.

The last major component is the I/O Kit, which provides an object-oriented framework for device drivers.

While it would be nice if each layer had clear responsibilities, reality is somewhat more complicated and the lines between each layer are blurred, as many OS services and tasks span the borders of multiple components.

Tip: You can download the full source code for XNU at Apple's open source website: http://www.opensource.apple.com.

Kernel Extensions (KEXTs)

The XNU kernel, like most, if not all, modern operating systems, supports dynamically loading code into the kernel's address space at runtime. This allows extra functionality, such as drivers, to be loaded and unloaded while the kernel is running. A main focus of this book will be the development of such kernel extensions, with a particular focus on drivers, as this is the most common reason to implement a kernel extension. There are two principal classes of kernel extensions. The first class is for I/O Kit-based kernel extensions, which are used for hardware drivers. These extensions are written in C++. The second class is for generic kernel extensions, which are typically written in C (though C++ is possible here, too). These extensions can implement anything from new network protocols to file systems. Generic kernel extensions usually interface with the BSD or Mach layers.
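
To make this concrete, the following is a minimal sketch of a generic kernel extension's entry and exit points. The identifier and function names are hypothetical; in a real KEXT the start and stop routines are declared in the extension's build settings and Info.plist.

#include <mach/mach_types.h>
#include <libkern/libkern.h>

/* Called when the extension is loaded into the kernel. */
kern_return_t com_example_kext_start(kmod_info_t *ki, void *d)
{
    printf("com.example.kext: loaded\n");
    return KERN_SUCCESS;
}

/* Called before the extension is unloaded. */
kern_return_t com_example_kext_stop(kmod_info_t *ki, void *d)
{
    printf("com.example.kext: unloading\n");
    return KERN_SUCCESS;
}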

Mach

The Mach layer can be seen as the core of the kernel, a provider of lower-level services to higher-level components like the BSD layer and I/O Kit. It is responsible for hardware abstraction, hiding the differences between the PowerPC architecture and the Intel x86 and x86-64 architectures. This includes details for handling traps and interrupts, as well as managing memory, including virtual memory and paging. This design allows the kernel to be easily adapted to new hardware architectures, as proven with Apple's move to Intel x86, and later to ARM for iOS. In addition to hardware abstraction, Mach is responsible for the scheduling of threads. It supports symmetric multiprocessing (SMP), which refers to the ability to schedule processes between multiple CPUs or CPU cores. In fact, the difficulty of implementing proper SMP support in the existing BSD Unix kernel was instrumental in the development of Mach.

Interprocess communication (IPC) is the core tenet of Mach's design. IPC in Mach is implemented as a client/server system. A task (the client) is able to request services from another task (the server). The endpoints in this system are known as ports. A port has associated rights, which determine if a client has access to a particular service. This IPC mechanism is used internally throughout the XNU kernel. The following sections will outline the key abstractions and services provided by the Mach layer.

Tip: Mach API documentation can be found in the osfmk/man directory of the XNU source package.

Tasks and Threads

A task is a group consisting of zero or more executable threads that share resources and memory address space. A task needs at least one thread to be executed. A Mach task maps one to one to a Unix (BSD layer) process. The XNU kernel is also a task (known as the kernel_task) consisting of multiple threads. Task resources are private and cannot normally be accessed by the threads of another task.

Unlike a task, a thread is an executable entity that can be scheduled and run by the CPU. A thread shares resources, such as open files or network sockets, with other threads in the same task. Threads of the same task can execute on different CPUs concurrently. A thread has its own state, which includes a copy of the processor state (registers and program counter) and its own stack; the state of a thread is restored when it is scheduled to run on a CPU. Mach supports preemptive multitasking, which means that a thread's execution can be interrupted before its allocated time slice (10 ms in XNU) is up. Preemption happens under a variety of circumstances, such as when a high-priority OS event occurs, when a higher-priority thread needs to run, or when the thread must wait for a long I/O operation to complete. A thread can also voluntarily preempt itself by going to sleep. A Mach thread is scheduled independently of other threads, regardless of the task to which it belongs. The scheduler is also unaware of the parent-child process relationships traditional in Unix systems (the BSD layer, however, is aware).
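
As an illustration of how kernel threads are created, the following sketch uses the kernel_thread_start() KPI; the worker function and its workload are hypothetical.

#include <kern/thread.h>

/* Hypothetical worker; a kernel thread runs until it terminates itself. */
static void my_worker(void *parameter, wait_result_t wait)
{
    /* ... perform work ... */
    thread_terminate(current_thread());
}

static void spawn_worker(void)
{
    thread_t thread;
    if (kernel_thread_start(my_worker, NULL, &thread) == KERN_SUCCESS) {
        /* Drop the extra reference returned by kernel_thread_start(). */
        thread_deallocate(thread);
    }
}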

Scheduling

The scheduler is responsible for coordinating the access of threads to the CPU. Most modern kernels, including XNU, use a timesharing scheduler, where each thread is allocated a finite (10ms in XNU, as we've seen) time quantum in which the thread is allowed to execute. Upon expiration of the thread's quantum, it is put to sleep so that other threads can run. While it may seem reasonable and fair that each thread gets to run for an equal amount of time, this is impractical, as some threads have a greater need for low latencies, for example to perform audio and video playback. The XNU scheduler employs a priority-based algorithm to schedule threads. Table 2-3 shows the priority levels used by the scheduler.

Table 2-3. Priority levels used by the scheduler

The kernel organizes threads in doubly-linked lists. This collection of lists is known as the run queue. There is one list per priority level (currently 0–127). Each processor (core) in the system maintains its own run queue structure (osfmk/kern/sched.h):

  struct run_queue {
        int                         highq;                  /* highest runnable queue */
        int                         bitmap[NRQBM];          /* run queue bitmap array */
        int                         count;                  /* # of threads total */
        int                         urgency;                /* level of preemption urgency */
        queue_head_t                queues[NRQS];           /* one for each priority */
  };

A regular application thread starts with a priority of 31. Its priority may decrease over time as a side effect of the scheduling algorithm; this will happen, for example, if a thread is highly compute intensive. Lowering the priority of such threads improves the scheduling latency of I/O-bound threads, which spend most of their time sleeping between I/O requests and thus usually go back to sleep before their quantum expires, in turn allowing compute-intensive threads access to the CPU again. The end result is improved system responsiveness.

To avoid a situation where a thread's priority drops too low for it ever to run, the Mach scheduler decays a thread's recorded processor usage over time, eventually resetting it; a thread's priority therefore fluctuates over time.

The Mach scheduler provides support for real-time threads. It does not offer guaranteed latency, but every effort is made to ensure the thread runs for the required number of CPU cycles. A real-time thread may be downgraded to normal priority if it does not block/sleep frequently enough, for example if it is highly compute bound.
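
From user space, a thread can request real-time treatment through the time constraint policy. The following sketch is illustrative only; the period, computation, and constraint values are placeholder numbers (expressed in Mach absolute-time units) that a real application would tune to its workload.

#include <stdint.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
#include <mach/thread_policy.h>

/* Ask the scheduler to treat the calling thread as a real-time thread. */
static kern_return_t make_thread_realtime(void)
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);

    /* Convert one millisecond to Mach absolute-time units. */
    uint64_t ms = 1000000ULL * tb.denom / tb.numer;

    thread_time_constraint_policy_data_t policy;
    policy.period      = (uint32_t)(10 * ms); /* cycle length */
    policy.computation = (uint32_t)(2 * ms);  /* CPU time needed per cycle */
    policy.constraint  = (uint32_t)(5 * ms);  /* max time from start to finish */
    policy.preemptible = TRUE;

    return thread_policy_set(mach_thread_self(),
                             THREAD_TIME_CONSTRAINT_POLICY,
                             (thread_policy_t)&policy,
                             THREAD_TIME_CONSTRAINT_POLICY_COUNT);
}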

Mach IPC: Ports and Messages

A port is a unidirectional communication endpoint that represents a resource, referred to as an object. If you are familiar with TCP/IP networking, many parallels can be drawn between Mach IPC and the UDP protocol, though unlike UDP, Mach IPC is used for more than just data transfers; it can also be used to provide synchronization or to send notifications between tasks. An IPC client can send messages to a port, and the owner of the port receives them. For bidirectional communication, two ports are needed. A port is implemented as a message queue (though other mechanisms exist); messages for the port are queued until a thread is available to service them. A port can receive messages from multiple senders, but there can be only one receiver per port.

Ports have protection mechanisms known as port rights. A task must have the proper permissions in order to interact with a port. Port rights are associated with a task; therefore, all threads in a task share the same privileges to a port. The following are examples of port rights: send, send once, and receive. The rights can be copied or moved between tasks. Unlike Unix permissions, port rights are not inherited from parent to child processes (Mach tasks do not have this concept). Table 2-4 shows the available port right types.

Table 2-4. Port right types

A group of ports is collectively known as a port set. The message queue is shared between all ports in a set. Ports are addressed by a 32-bit integer; there is no global registry or namespace for ports.

The Mach IPC system is also available in user space programs and can be used to pass messages between tasks or from a task to the kernel. It offers an alternative to system calls, though the mechanism uses system calls under the hood.
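
The following user space sketch illustrates the basic mechanics: it allocates a port with a receive right, inserts a send right, and then sends and receives a trivial message within the same task. The message layout and msgh_id value are hypothetical; real clients and servers usually live in different tasks and use MIG-generated code.

#include <stdio.h>
#include <mach/mach.h>

typedef struct {
    mach_msg_header_t header;
    int               payload;
} simple_msg_t;

int main(void)
{
    mach_port_t port;

    /* Allocate a port with a receive right, then add a send right. */
    mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);
    mach_port_insert_right(mach_task_self(), port, port,
                           MACH_MSG_TYPE_MAKE_SEND);

    simple_msg_t msg = {0};
    msg.header.msgh_bits        = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
    msg.header.msgh_size        = sizeof(msg);
    msg.header.msgh_remote_port = port;
    msg.header.msgh_local_port  = MACH_PORT_NULL;
    msg.header.msgh_id          = 1000;
    msg.payload                 = 42;

    /* Enqueue the message on the port... */
    mach_msg(&msg.header, MACH_SEND_MSG, sizeof(msg), 0,
             MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);

    /* ...and dequeue it again; the receive buffer must leave room
       for the trailer the kernel appends. */
    struct {
        simple_msg_t       msg;
        mach_msg_trailer_t trailer;
    } rcv;
    mach_msg(&rcv.msg.header, MACH_RCV_MSG, 0, sizeof(rcv), port,
             MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    printf("received payload: %d\n", rcv.msg.payload);
    return 0;
}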

Mach Exceptions

Exceptions are interrupts sent by a CPU when certain (exceptional) events or conditions occur during the execution of a thread. An exception will result in the interruption of a thread's execution, while the OS (Mach) processes the exception. The task may resume afterwards, depending on the type of exception that occurred. Common causes for exceptions include access to invalid or non-existing memory, execution of an invalid processor instruction, passing invalid arguments, or division by zero. These exceptions usually result in the termination of the offending task, but there are also a number of non-erroneous exceptions that can occur.

A system call is one such exception. A user space application may issue a system call exception when it needs to perform a low-level operation involving the kernel, such as writing to a file or receiving data on a network socket. When the OS handles the system call, it inspects a register for the system call number, which is then used to look up the handler for that call, for example read() or recv(). A task may also generate an exception if it attempts to access paged-out memory. In this case, a page fault exception is generated, which will be handled by retrieving the missing page from the backing store, or will result in an invalid memory access. A task may also issue deliberate exceptions with the EXC_BREAKPOINT exception, which is typically used by debugging or tracing applications, such as Xcode, to temporarily halt the execution of a thread.

It is possible, of course, for the kernel itself to misbehave and cause exceptions. In this case, the OS will be halted and the grey screen of death will be shown (unless the kernel debugger is activated), informing the user to reboot the computer. Table 2-5 shows a subset of defined Mach exceptions.

Table 2-5. Mach exception types (subset)

When an exception occurs, the kernel will suspend the thread which caused the exception, and send an IPC message to the thread's exception port. If the thread does not handle the exception, it's forwarded to the containing task's exception port, and finally to the system's (host) exception port. The following structure encapsulates a thread, task, or processor's (host) exception ports:

struct exception_action {
        struct ipc_port*               port;               /* exception port */
        thread_state_flavor_t          flavor;             /* state flavor to send */
        exception_behavior_t           behavior;           /* exception type to raise */
        boolean_t                      privileged;         /* survives ipc_task_reset */
};

Each thread, task, and host has an array of exception_action structures, one for each exception type (as defined in Table 2-5). The flavor and behavior fields specify the type of information that should be sent with the exception message, such as the state of general-purpose or other specialized CPU registers, and which handler should be executed. The handler will be either catch_mach_exception_raise(), catch_mach_exception_raise_state(), or catch_mach_exception_raise_state_identity(). Once an exception has been dispatched, the kernel waits for a reply in order to determine the course of action. A return value of KERN_SUCCESS means the exception was handled, and the thread will be allowed to resume.

A thread's exception port defaults to PORT_NULL; unless a port is explicitly allocated, exceptions will be handled by the task's exception port instead. When a process issues the fork() system call to spawn a child process, the child inherits the exception ports of the parent task. The Unix signaling mechanism is implemented on top of Mach's exception system.
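
For illustration, a user space task could route its bad-access exceptions to a port it owns roughly as follows. This is only a sketch; a real program would then run a handler loop servicing the exception messages that arrive on the port, typically via MIG-generated routines.

#include <mach/mach.h>

/* Direct EXC_BAD_ACCESS exceptions for this task to a port we own. */
static kern_return_t install_exception_port(mach_port_t *out_port)
{
    mach_port_t   port;
    kern_return_t kr;

    kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);
    if (kr != KERN_SUCCESS)
        return kr;
    mach_port_insert_right(mach_task_self(), port, port,
                           MACH_MSG_TYPE_MAKE_SEND);

    kr = task_set_exception_ports(mach_task_self(), EXC_MASK_BAD_ACCESS,
                                  port, EXCEPTION_DEFAULT, THREAD_STATE_NONE);
    *out_port = port;
    return kr;
}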

Time Management

Proper timekeeping is a vital responsibility of any OS, not only to serve user applications, but also to serve other important kernel functions such as scheduling processes. In Mach, the abstraction for time management is known as a clock. A clock object in Mach represents time in nanoseconds as a monotonically increasing value. There are three main clocks defined: the real-time clock, the calendar clock, and the high-resolution clock. The real-time clock keeps the time since the last boot, while the calendar clock is typically battery backed, so its value is persistent across system reboots, or in periods when the computer is powered off. It has a resolution of seconds and as the name implies, it is used to keep track of the current time. The Mach time KPI consists of three functions:

void clock_get_uptime(uint64_t* result);
void clock_get_system_nanotime(uint32_t* secs, uint32_t* nanosecs);
void clock_get_calendar_nanotime(uint32_t* secs, uint32_t* nanosecs);

The calendar clock is typically only used by applications, as the kernel itself rarely needs to concern itself with the current time or date, and doing so, in fact, is considered poor design. The kernel uses the relative time provided by the real-time clock. The time from the real-time clock typically comes from a circuit on the computer's motherboard that contains an oscillating crystal. The real-time clock circuit (RTC) is programmable and is wired to the interrupt pins of every CPU/core. XNU programs the RTC to fire at a rate of 100 Hz (using clock_set_timer_deadline()).
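
As a brief sketch of the KPI in use, elapsed time in the kernel can be measured with clock_get_uptime(). Its result is in machine-dependent absolute-time units; the conversion below assumes the absolutetime_to_nanoseconds() helper.

#include <kern/clock.h>

static uint64_t measure_elapsed_ns(void)
{
    uint64_t start, end, elapsed_ns;

    clock_get_uptime(&start);
    /* ... the work being timed ... */
    clock_get_uptime(&end);

    /* Convert the machine-dependent delta to nanoseconds. */
    absolutetime_to_nanoseconds(end - start, &elapsed_ns);
    return elapsed_ns;
}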

Memory Management

The Mach layer is responsible for coordinating the use of physical memory in a machine independent manner, providing a consistent interface to higher-level components. The virtual memory subsystem of Mach, the Mach VM, provides protected memory and facilities to applications, and the kernel itself, for allocating, sharing, and mapping memory. A solid understanding of memory management is essential to a successful kernel programmer.

Task Address Space

Each Mach task has its own virtual address (VM) space. For a 32-bit task, the address space is 4 GB, while for a 64-bit task it is substantially larger, with 51 bits (approximately 2 petabytes) of usable address space. Specialized applications, such as video editing or effects software, often exceed the 32-bit address space. Support for a 64-bit virtual address space became available in OS X 10.4.

Note: While 32-bit applications are limited to a 4 GB address space, this does not correlate with the amount of physical memory that can be used in a system. Technologies such as Physical Address Extension (PAE) are supported by OS X and allow 32-bit x86 processors (or 64-bit processors running in 32-bit mode) to address up to 36 bits (64 GB) of physical memory; however, a task's address space remains limited to 4 GB.

A task's address space is fundamental to the concept of protected memory. A task is not allowed to access the address space, and thus the underlying physical memory containing the data of another task, unless explicitly allowed to do so, through the use of shared memory or other mechanisms.

KERNEL ADDRESS SPACE MANAGEMENT

VM Maps and Entries

The virtual memory (VM) map is the actual representation of a task's address space; each task has its own VM map, represented by the vm_map structure. There is no map associated with a thread, as threads share the VM map of the task that owns them.

A VM map contains a doubly-linked list of memory regions mapped into the process address space. Each region is a virtually contiguous range of memory addresses (not necessarily backed by contiguous physical memory) described by a start and end address, as well as other metadata, such as protection flags, which can be any combination of read, write, and execute. The regions are represented by the vm_map_entry structure. A VM map entry may be merged with an adjacent entry when more memory is allocated before or after an existing entry, or split into smaller regions. Splitting will occur if the protection flags are modified for a range of addresses described by an entry, as protection flags can only be set on whole VM map entries. Figure 2-4 shows a VM map with two VM map entries.


Figure 2-4. Relationship between VM subsystem structures

Tip: The relevant structures pertaining to task address spaces are defined in mach/vm_map.h and mach/vm_region.h in the XNU source package.

The Physical Map

Each VM map has an associated physical map, or pmap structure, which holds information on the virtual-to-physical memory mappings in use by the task. The portion of the Mach VM that deals with physical mappings is machine dependent, as it interacts with the memory management unit (MMU), a specialized hardware component of the system that takes care of address translation.

VM Objects

A VM map entry can point to either a VM object or a VM submap. A submap is a container for other (VM map) mappings and is used to share memory between address spaces. The VM object is a representation of where, or rather how, the described memory is accessed. Memory pages underlying the object may not be present in physical memory, but could be located on an external backing store (a hard drive on OS X). In this case, the VM object will have information on how to page in the external pages. Transfers to or from a backing store are handled by the pager, discussed shortly.

A VM object describes memory in units of pages. A page in XNU is currently 4096 bytes. A virtual page is described by the vm_page structure. A VM object may contain many pages, but a page is only ever associated with one VM object.

PAGES

When memory needs to be shared between tasks, a VM map entry will point into the foreign address space via a submap, as opposed to a VM object. This commonly happens when a shared library is used. The shared library gets mapped into the task's address space.

Let's consider another example. When a Unix process issues the fork() system call to create a child process, the new process is created as a copy of the parent. To avoid having to copy the parent's memory to the child, an optimization known as copy-on-write (COW) is employed. Read accesses to the child's memory simply reference the same pages as the parent's. If the child process modifies its memory, the page containing that memory is copied, and a shadow VM object is created. On subsequent reads to that memory region, a check is performed to see if the shadow object has a copy of the page; if not, the original shared page is referenced. The behavior just described applies only when the inheritance property of the original VM map entry from the parent is set to copy. Another possible value is shared, in which case the child's reads and writes both go to the original memory location. If the setting is none, the memory pages referenced by the map entry will not be mapped into the child's address space at all. The fourth possible value is copy and delete, where the memory is copied to the child and deleted from the parent.
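
The inheritance property can be set explicitly from user space with the vm_inherit() routine. The following sketch marks a hypothetical region so that a future fork() shares it with the child instead of marking it copy-on-write.

#include <mach/mach.h>
#include <mach/vm_map.h>

/* Share this region with future children rather than copying it. */
static kern_return_t share_with_children(void *addr, size_t size)
{
    return vm_inherit(mach_task_self(),
                      (vm_address_t)addr, (vm_size_t)size,
                      VM_INHERIT_SHARE);
}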

Note: Copy-on-write is also used by Mach IPC to optimize the transfer of data between tasks.

Examining a Task's Address Space

The vmmap command line utility allows you to inspect a process's virtual memory map and its VM map entries, clearly illustrating how memory regions are mapped into a task's VM address space. The vmmap command takes a process identifier (PID) as an argument. The following shows the output of vmmap executed with the PID of a simple Hello World C application (a.out), which prints a message and then goes to sleep:


==== Non-writable regions for process 46874
__PAGEZERO          00000000-00001000 [     4K] ---/--- SM=NUL  /Users/ole/a.out
__TEXT              00001000-00002000 [     4K] r-x/rwx SM=COW  /Users/ole/a.out
__LINKEDIT          00003000-00004000 [     4K] r--/rwx SM=COW  /Users/ole/a.out
MALLOC guard page   00004000-00005000 [     4K] ---/rwx SM=NUL

MALLOC metadata     00021000-00022000 [     4K] r--/rwx SM=PRV
__TEXT              8fe00000-8fe42000 [   264K] r-x/rwx SM=COW  /usr/lib/dyld
__LINKEDIT          8fe70000-8fe84000 [    80K] r--/rwx SM=COW  /usr/lib/dyld
__TEXT              9703b000-971e3000 [  1696K] r-x/r-x SM=COW  /usr/lib/libSystem.B.dylib
STACK GUARD         bc000000-bf800000 [  56.0M] ---/rwx SM=NUL  stack guard for thread 0
==== Writable regions for process 46874
__DATA              00002000-00003000 [     4K] rw-/rwx SM=PRV  /Users/ole/a.out
MALLOC metadata     00015000-00020000 [    44K] rw-/rwx SM=PRV
MALLOC_TINY         00100000-00200000 [  1024K] rw-/rwx SM=PRV  DefaultMallocZone_0x5000
MALLOC_SMALL        00800000-01000000 [  8192K] rw-/rwx SM=PRV  DefaultMallocZone_0x5000
__DATA              8fe42000-8fe6f000 [   180K] rw-/rwx SM=PRV  /usr/lib/dyld
__IMPORT            8fe6f000-8fe70000 [     4K] rwx/rwx SM=COW  /usr/lib/dyld
shared pmap         a0800000-a093a000 [  1256K] rw-/rwx SM=COW
__DATA              a093a000-a0952000 [    96K] rw-/rwx SM=COW  /usr/lib/libSystem.B.dylib
shared pmap         a0952000-a0a00000 [   696K] rw-/rwx SM=COW
Stack               bf800000-bffff000 [  8188K] rw-/rwx SM=ZER  thread 0
Stack               bffff000-c0000000 [     4K] rw-/rwx SM=COW  thread 0

The result has been trimmed for readability. The output is divided between non-writable regions and writable regions. The former, as you can see, includes the page zero mapping, which has no access permissions (---/---) and will generate an exception if an application tries to read from or write to memory addresses 0-4096 (4096 decimal = 0x1000 hex). This is why your application will crash if you try to dereference a null pointer. The next map entry is the text segment of the application, which contains the executable code of the application. You will see that the text segment is marked as having a share mode (SM) of COW, which means that if this process spawns a child, it will inherit this mapping from the parent, thus avoiding a copy until pages in that segment are modified.

In addition to the text segment for the a.out program itself, you will also see a mapping for libSystem.B.dylib. On Mac OS X and iOS, libSystem implements the standard C Library and the POSIX thread API, as well as other system APIs. The a.out process inherited the mapping for libSystem from its parent process /sbin/launchd, the parent of all user space processes. This ensures the library is only loaded once, saving memory and improving the launch speed of applications, as fetching a library from secondary storage, such as a hard drive, is usually slow.

In the writable regions you can see the data segments of a.out and libSystem. These segments contain variables defined by the program/library. Obviously, these can be modified, so each process needs its own copy of the data segment of a shared library; however, the segment is COW, so no copying overhead is incurred until a process modifies the mapping.

Tip: If you want to inspect the virtual memory map of a system process, such as launchd, you need to run vmmap with sudo, as by default your user will only be able to inspect its own processes.

Pagers

Virtual memory allows a process to have a virtual address space larger than the available physical memory, and it is possible for the tasks running on the system to consume, in combination, more than the available amount of physical memory. The mechanism that makes this possible is known as a pager. The pager controls the transfer of memory pages between system memory (RAM) and a secondary backing store, usually a hard drive. When a task with high memory requirements needs to run, the pager can temporarily transfer (page out) memory pages belonging to inactive tasks to the backing store, freeing up enough memory to allow the demanding task to execute. Similarly, if a process is found to be largely idle, the system can opt to page out the task's memory to free memory for current or future tasks. When a running application tries to access memory that has been paged out, an exception known as a page fault occurs; this is the same exception that occurs if a task tries to access an invalid memory address. When the page fault occurs, the kernel will attempt to transfer back (page in) the page corresponding to the memory address; if the page cannot be transferred back, the access is treated as an invalid memory access, and the task is aborted. The XNU kernel supports three different pagers:

  • Default Pager: Performs traditional paging and transfers between the main memory and a swap file on the system hard drive (/var/vm/swapfile*).
  • Vnode Pager: Ties in with the Unified Buffer Cache (UBC) used by file systems and is used to cache files in memory.
  • Device Pager: Used for managing memory mappings of hardware devices, such as PCI devices that map registers into memory. Mapped memory is commonly used by I/O Kit drivers, and I/O Kit provides abstractions for working with such memory.

Which pager is in use is more or less transparent to higher-level components, such as the VM object. Each VM object has an associated memory object, which provides (via ports) an interface to the current pager.

Memory Allocation in Mach

Some fundamental routines for memory allocation in Mach are:

kern_return_t kmem_alloc(vm_map_t map, vm_offset_t *addrp, vm_size_t size);
kern_return_t kmem_alloc_contig(vm_map_t map, vm_offset_t *addrp,
                                vm_size_t size, vm_offset_t mask, int flags);
void kmem_free(vm_map_t map, vm_offset_t addr, vm_size_t size);

kmem_alloc() provides the main interface for obtaining memory in Mach. In order to allocate memory, you must provide a VM map. For most work within the kernel, kernel_map is defined and points to the VM map of kernel_task. The second variant, kmem_alloc_contig(), attempts to allocate memory that is physically contiguous, as opposed to the former, which allocates virtually contiguous memory. Apple recommends against making this type of allocation, as there is a significant penalty incurred in searching for free contiguous blocks. Mach also provides the kmem_alloc_aligned() function, which allocates memory aligned to a power of two, as well as a few other variants that are less commonly used. The kmem_free() function is provided to free allocated memory. Take care to pass the same VM map that was used for the allocation, as well as the size of the original allocation, as shown in the sketch below.
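
Putting the routines together, an allocation inside the kernel might look like the following sketch; PAGE_SIZE is used as an arbitrary example size.

#include <vm/vm_kern.h>    /* kernel_map, kmem_alloc(), kmem_free() */

static void allocation_example(void)
{
    vm_offset_t   buf;
    vm_size_t     size = PAGE_SIZE;
    kern_return_t kr;

    kr = kmem_alloc(kernel_map, &buf, size);
    if (kr != KERN_SUCCESS)
        return;

    /* ... use the memory ... */

    /* Free with the same map and size used for the allocation. */
    kmem_free(kernel_map, buf, size);
}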

The BSD Layer

Unlike Mach, which only provides a few fundamental services, the BSD layer sits between Mach and the user applications and implements many core OS functions, building on the services provided by Mach. In OS X and iOS, the BSD layer runs with the processor in privileged mode, and not as a user task as originally intended by the Mach project. The layer therefore does not have memory protection and runs in the same address space as Mach and the I/O Kit. The BSD layer refers to the portion of the kernel derived from the FreeBSD 5 operating system; it is not a complete system in itself, but rather a body of code originating from FreeBSD.

The BSD layer provides services such as process management, system calls, file systems, and networking. Table 2-6 shows a brief overview of the services provided by the BSD layer.

Table 2-6. Services provided by the BSD layer

The BSD layer provides abstractions on top of the services provided by Mach. For example, its process management and memory management are implemented on top of Mach services.

System Calls

When an application needs services from the file system, or wishes to access the network, it issues a system call to the kernel. The BSD layer implements all system calls. When a system call handler executes, the processor switches from user mode to kernel mode to service the application's request, such as reading a file. This API is referred to as the syscall API, and it is the traditional Unix API for calling functions in the kernel from user space. There are hundreds of system calls available, ranging from process control calls, such as fork() and execve(), to file management calls, such as open(), close(), read(), and write().

The BSD layer also provides the ioctl() function (itself a system call), which is short for I/O control and is typically used to send commands to device drivers. The sysctl() function is provided to set or get a variety of kernel parameters, including but not limited to those of the scheduler, memory, and networking subsystems.
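
For example, a user space program can query a kernel parameter through the sysctl interface; the following sketch reads the number of CPUs with sysctlbyname().

#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void)
{
    int    ncpu;
    size_t len = sizeof(ncpu);

    /* Ask the kernel for the number of CPUs. */
    if (sysctlbyname("hw.ncpu", &ncpu, &len, NULL, 0) == 0)
        printf("CPUs: %d\n", ncpu);
    return 0;
}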

Tip: Available system calls are defined in /usr/include/sys/syscall.h.

Mach traps are a mechanism similar to system calls, used for crossing the kernel/user space boundary. Unlike system calls, which provide direct services to an application, Mach traps are used to carry IPC messages from a user space client to a kernel server.

Networking

Networking is a major subsystem of the BSD portion of XNU. BSD handles most aspects of networking, such as the details of socket communication and the implementation of protocols like TCP/IP, except for low-level communication with actual hardware devices, which is typically handled by an I/O Kit driver. An I/O Kit network driver interfaces with the network stack, which is responsible for handling buffers received from the networking device, inspecting them, and ensuring they make their way up to the initiator, for example your web browser. Similarly, the BSD networking stack will accept outgoing data from an application, format the data into packets, and then route or dispatch them to the appropriate network interface. BSD also implements the IPFW firewall, which filters packets to/from the computer according to the policy set by the system administrator.

The BSD networking layer supports a wide range of network and transport layer protocols, including IPv4 and IPv6, TCP, and UDP. At the higher level we find support for BOOTP, DHCP, and ICMP, among others. Other networking-related functions include routing, bridging, and Network Address Translation (NAT), as well as device level packet filtering with Berkeley Packet Filter (BPF).

NETWORK KERNEL EXTENSIONS (NKE)

File Systems

The kernel has built-in support for a range of different file systems, as shown in Table 2-7. The primary file system used by Mac OS X and iOS is HFS+, which was developed as a replacement for the older Mac OS file system, HFS.

Table 2-7. File systems supported by the kernel

HFS+ gained support for journaling in Mac OS X 10.2.2. Journaling improves the reliability of a file system by recording transactions in a journal prior to carrying them out. This makes the file system resilient to events such as a power failure or a crash of the kernel, as the data can be replayed after reboot in order to bring the file system to a consistent state.

HFS+ supports very large files, up to 8 EiB in size (1 exbibyte = 2^60 bytes), which is also the maximum possible volume size. The file system has full support for Unicode characters in file names and is case insensitive by default. Support for both Unix-style file permissions and access control lists (ACLs) exists.

The Virtual File System

The virtual file system, or VFS, provides an abstraction over specific file systems, such as HFS+ and AFP, and makes it possible for applications to access them through a single consistent interface. The VFS allows support for new file systems to be added easily as kernel extensions through the VFS Kernel Programming Interface (KPI), without the OS as a whole knowing anything about their implementation. The fundamental data structure of the VFS is the vnode, which is how both files and directories are represented in the kernel. A vnode structure exists for every active file in the kernel.

Unified Buffer Cache

The Unified Buffer Cache (UBC) is a cache for files. When a file is written to, or read from, it is loaded into physical memory from a backing store, such as a hard drive. The UBC is intimately linked with the VM subsystem and also caches VM objects. The structure used to cache a vnode is shown in Listing 2-1.

Listing 2-1. The ubc_info Structure

struct ubc_info {
      memory_object_t             ui_pager;       /* pager */
      memory_object_control_t     ui_control;     /* VM control for the pager */
      uint32_t                    ui_flags;       /* flags */
      vnode_t                     ui_vnode;       /* vnode for this ubc_info */
      kauth_cred_t                ui_ucred;       /* holds credentials for NFS paging */
      off_t                       ui_size;        /* file size for the vnode */
   
      struct  cl_readahead*       cl_rahead;      /* cluster read ahead context */
      struct  cl_writebehind*     cl_wbehind;     /* cluster write behind context */
   
      struct  cs_blob*            cs_blobs;       /* for CODE SIGNING */
};

Prior to the introduction of the UBC, the system had two caches, a page cache and a buffer cache. The buffer cache was indexed by a device and block number that addressed a chunk of data on the physical device, whereas the page cache performed caching of memory mappings.

The size of the UBC shrinks and grows dynamically depending on the needs of the system. If a file in the cache is modified, it is marked as dirty, to indicate that the cached copy differs from the original found on disk. Dirty entries are periodically flushed to disk. It is possible for a user space program to bypass the UBC and go directly to disk by using the F_NOCACHE option of the fcntl() system call, which may improve I/O performance for workloads that do not benefit from caching, such as large sets of data that are unlikely to be reused.
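
As a short sketch, a user space program could disable caching for a file descriptor as follows; the path is hypothetical.

#include <fcntl.h>

/* Open a file and ask the kernel not to cache its data in the UBC. */
int open_uncached(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd >= 0)
        fcntl(fd, F_NOCACHE, 1);   /* 1 disables caching, 0 re-enables it */
    return fd;
}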

The I/O Kit

The last major component that makes up XNU is the I/O Kit, an object-oriented framework for writing device drivers and other kernel extensions. It provides an abstraction of system hardware, with pre-defined base classes for many types of hardware, making it simple to implement a new driver, which can inherit much of its functionality from a base class driver, achieving a high degree of code reuse. The I/O Kit consists of the kernel-level framework, as well as a user space framework called IOKit.framework. The kernel framework is written in Embedded C++, a subset of C++, whereas the user space framework is C-based.

The I/O Kit maintains a database known as the I/O Catalog, a registry of all available I/O Kit classes. Another database, the I/O Registry, tracks object instances of classes in the I/O Catalog. Objects in the I/O Registry typically represent devices, drivers, or supporting classes, and are structured in a hierarchical manner that mimics the way hardware devices are physically connected to each other. For example, a USB device is a child of the USB controller it is connected to. The ioreg command line utility allows you to inspect the I/O Registry.

The I/O Kit is based around three major concepts:

  • Families
  • Drivers
  • Nubs

Families represent common abstractions for devices of a particular type. For example, the IOUSBFamily handles many of the technicalities of implementing support for USB devices. Drivers are responsible for managing a specific device or bus. A driver may have a relationship with more than one family: a USB-based storage device might depend on the IOUSBFamily as well as the IOStorageFamily. Nubs are interfaces to a controllable entity, such as a PCI or USB device, which a higher-level driver may use to communicate with the device.

As a kernel programmer, you will probably spend most of your time working with the I/O Kit, and thus much of this book will be devoted to it, and a full description of I/O Kit is provided in Chapter 4.

The Libkern Library

Unlike Mach and BSD, which provide APIs for interacting with the system, the libkern library provides supporting routines and classes to the rest of the kernel, and in particular to the I/O Kit: building blocks and utilities useful to the kernel itself, as well as to extensions. libkern implements the kernel's limited C++ runtime, providing services such as the new and delete operators.

In addition to the C++ runtime, libkern provides a number of useful classes, the most fundamental being OSObject, the superclass of every class in the I/O Kit. It provides support for reference counting, which works conceptually like NSObject in Cocoa or Cocoa Touch in user space. Other classes of interest include OSDictionary, OSArray, OSString, and OSNumber. These classes, and others, are also used to provide a dictionary of values from the kernel extension's Info.plist.

The libkern library is not all about core C++ classes and runtime; it also provides implementations of many functions normally found in the standard C library. Examples are the printf() and sscanf() functions, as well as others such as strtol() and strsep(). Other facilities provided by libkern include cryptographic hash algorithms (MD5 and SHA-1), UUID generation, and the zlib compression library. The library is also home to kxld, the library used to manage dynamically loaded kernel extensions.

Last but not least, we find functions, such as OSMalloc(), for allocating memory, as well as implementations of locking mechanisms and synchronization primitives.
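
As a sketch of the allocation API, OSMalloc() allocations are grouped under a tag, which supports accounting; the tag name below is hypothetical.

#include <libkern/OSMalloc.h>

static void osmalloc_example(void)
{
    /* A tag is typically created once, in the extension's start routine. */
    OSMallocTag tag = OSMalloc_Tagalloc("com.example.kext", OSMT_DEFAULT);

    void *buf = OSMalloc(1024, tag);
    if (buf != NULL) {
        /* ... use the buffer ... */
        OSFree(buf, 1024, tag);   /* size must match the allocation */
    }
    OSMalloc_Tagfree(tag);
}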

Note: The sources for libkern are found in the libkern/ and bsd/libkern/ directories in the XNU source distribution.

The Platform Expert

The platform expert contains an abstraction layer for the system. Parts of it are available in the public XNU source code distribution, but the remainder is implemented in the com.apple.driver.AppleACPIPlatform KEXT, for which no source code is available. The platform expert handles device enumeration and detection for the system bus; it can be seen as the driver for the motherboard. The platform expert is responsible for the initial construction of the I/O Kit device tree (known as the I/O Registry) after the system boots, and itself forms the root node of the tree, IOPlatformExpertDevice.
