Creating Processes

Unix operating systems rely heavily on process creation to satisfy user requests. For example, the shell creates a new process that executes another copy of the shell whenever the user enters a command.

Traditional Unix systems treat all processes in the same way: resources owned by the parent process are duplicated in the child process. This approach makes process creation very slow and inefficient, since it requires copying the entire address space of the parent process. The child process rarely needs to read or modify all the resources inherited from the parent; in many cases, it issues an immediate execve( ) and wipes out the address space that was so carefully copied.

Modern Unix kernels solve this problem by introducing three different mechanisms:

  • The Copy On Write technique allows both the parent and the child to read the same physical pages. Whenever either one tries to write on a physical page, the kernel copies its contents into a new physical page that is assigned to the writing process. The implementation of this technique in Linux is fully explained in Chapter 8.

  • Lightweight processes allow both the parent and the child to share many per-process kernel data structures, such as the paging tables (and therefore the entire User Mode address space), the open file tables, and the signal dispositions.

  • The vfork( ) system call creates a process that shares the memory address space of its parent. To prevent the parent from overwriting data needed by the child, the parent’s execution is blocked until the child exits or executes a new program. We’ll learn more about the vfork( ) system call in the following section.
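
The visible effect of Copy On Write can be checked from User Mode. The following is a minimal sketch, not kernel code: the function name parent_value_after_child_write( ) is hypothetical, and it only shows that a write performed by the child lands in a private copy of the page, so the parent keeps seeing the original value.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's view of `value` after a forked child has modified
 * its own copy. Because the pages are private (Copy On Write), the parent
 * still sees the original value. Returns -1 if fork() or waitpid() fails. */
int parent_value_after_child_write(int initial)
{
    volatile int value = initial;   /* page is shared read-only after fork() */
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        value = initial + 1;        /* the write goes to the child's private copy */
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return value;                   /* unchanged in the parent */
}
```

Calling parent_value_after_child_write(41) in the parent yields 41, even though the child stored 42 in its own copy.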

The clone( ), fork( ), and vfork( ) System Calls

Lightweight processes are created in Linux by using a function named clone( ), which uses four parameters:

fn

Specifies a function to be executed by the new process; when the function returns, the child terminates. The function returns an integer, which represents the exit code for the child process.

arg

Points to data passed to the fn( ) function.

flags

Miscellaneous information. The low byte specifies the signal number to be sent to the parent process when the child terminates; the SIGCHLD signal is generally selected. The remaining three bytes encode a group of clone flags, which specify the resources to be shared between the parent and the child process as follows:

CLONE_VM

Shares the memory descriptor and all Page Tables (see Chapter 8).

CLONE_FS

Shares the table that identifies the root directory and the current working directory, as well as the value of the bitmask used to mask the initial file permissions of a new file (the so-called file umask).

CLONE_FILES

Shares the table that identifies the open files (see Chapter 12).

CLONE_PARENT

Sets the parent of the child (p_pptr and p_opptr fields in the process descriptor) to the parent of the calling process.

CLONE_PID

Shares the PID.[22]

CLONE_PTRACE

If a ptrace( ) system call is causing the parent process to be traced, the child will also be traced.

CLONE_SIGHAND

Shares the table that identifies the signal handlers (see Chapter 10).

CLONE_THREAD

Inserts the child into the same thread group as the parent, and the child’s tgid field is set accordingly. If this flag is set, it implicitly enforces CLONE_PARENT.

CLONE_SIGNAL

Equivalent to setting both CLONE_SIGHAND and CLONE_THREAD, so that it is possible to send a signal to all threads of a multithreaded application.

CLONE_VFORK

Used for the vfork( ) system call (see later in this section).

child_stack

Specifies the User Mode stack pointer to be assigned to the esp register of the child process. If it is equal to 0, the kernel assigns the current parent stack pointer to the child. Therefore, the parent and child temporarily share the same User Mode stack. But thanks to the Copy On Write mechanism, they usually get separate copies of the User Mode stack as soon as one tries to change the stack. However, this parameter must have a non-null value if the child process shares the same address space as the parent.
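
The layout of the flags parameter can be illustrated with a short sketch. It assumes the C library headers, which define the CLONE_* constants and the CSIGNAL mask (0xff) when _GNU_SOURCE is set; the three helper names are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE
#include <sched.h>    /* CLONE_* flags and the CSIGNAL mask */
#include <signal.h>   /* SIGCHLD */

/* Signal sent to the parent when the child terminates (low byte of flags). */
int clone_exit_signal(unsigned long flags)
{
    return (int)(flags & CSIGNAL);
}

/* Clone flags encoded in the remaining bytes of flags. */
unsigned long clone_sharing_flags(unsigned long flags)
{
    return flags & ~(unsigned long)CSIGNAL;
}

/* Nonzero if parent and child would share the memory descriptor. */
int shares_address_space(unsigned long flags)
{
    return (flags & CLONE_VM) != 0;
}
```

For instance, with flags equal to CLONE_VM | CLONE_FS | SIGCHLD, the first helper returns SIGCHLD and the second returns CLONE_VM | CLONE_FS.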

clone( ) is actually a wrapper function defined in the C library (see Section 9.1), which in turn uses a clone( ) system call hidden to the programmer. This system call receives only the flags and child_stack parameters; the new process always starts its execution from the instruction following the system call invocation. When the system call returns to the clone( ) function, it determines whether it is in the parent or the child and forces the child to execute the fn( ) function.
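
As an illustration of the wrapper, here is a minimal sketch that creates a lightweight process sharing the caller's address space. It assumes the glibc prototype, whose argument order is fn, child stack, flags, arg (different from the order used in the list above). The names child_fn( ), run_shared_child( ), and CHILD_STACK_SIZE are hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* fn: executed by the child; its return value becomes the exit code. */
static int child_fn(void *arg)
{
    int *shared = arg;
    *shared = 42;       /* visible to the parent because of CLONE_VM */
    return 7;
}

/* Creates a lightweight process that shares the caller's address space.
 * On success stores the child's exit code in *exit_code and returns the
 * value the child wrote into the shared variable; returns -1 on failure. */
int run_shared_child(int *exit_code)
{
    int shared = 0;
    int status;
    char *stack = malloc(CHILD_STACK_SIZE);

    if (stack == NULL)
        return -1;

    /* CLONE_VM requires a separate stack. The stack grows downward on
     * x86, so the top of the allocated area is passed. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      CLONE_VM | SIGCHLD, &shared);
    if (pid < 0) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;      /* child may still be using the stack: do not free it */

    free(stack);
    *exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return shared;
}
```

The non-null child stack reflects the rule stated for the child_stack parameter: without Copy On Write, the parent and the child cannot safely run on the same User Mode stack.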

The traditional fork( ) system call is implemented by Linux as a clone( ) system call whose flags parameter specifies the SIGCHLD signal with all the clone flags cleared, and whose child_stack parameter is 0.

The vfork( ) system call, described in the previous section, is implemented by Linux as a clone( ) system call whose first parameter specifies both a SIGCHLD signal and the flags CLONE_VM and CLONE_VFORK, and whose second parameter is equal to 0.
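
The typical use of vfork( ) is a child that immediately executes a new program. The sketch below assumes a hypothetical helper named spawn_and_wait( ); it relies only on the standard vfork( ), execve( ), and waitpid( ) calls, and returns 127 when the program cannot be executed, following the shell convention.

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Runs `path` in a child created with vfork() and returns its exit code.
 * Returns 127 if the program cannot be executed, -1 on other failures. */
int spawn_and_wait(const char *path, char *const argv[])
{
    int status;
    pid_t pid = vfork();    /* parent is suspended until the child execs or exits */

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: it borrows the parent's address space, so it does nothing
         * except replace its image or terminate. */
        execve(path, argv, environ);
        _exit(127);         /* reached only if execve() failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the child runs in the parent's address space, it calls _exit( ) rather than exit( ), so that no user-level cleanup handlers of the parent are run on the shared memory.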

When either a clone( ), fork( ), or vfork( ) system call is issued, the kernel invokes the do_fork( ) function, which executes the following steps:

  1. If the CLONE_PID flag is specified, the do_fork( ) function checks whether the PID of the parent process is not 0; if so, it returns an error code. Only the swapper process is allowed to set CLONE_PID; this is required when initializing a multiprocessor system.

  2. The alloc_task_struct( ) function is invoked to get a new 8 KB union task_union memory area to store the process descriptor and the Kernel Mode stack of the new process.

  3. The function follows the current pointer to obtain the parent process descriptor and copies it into the new process descriptor in the memory area just allocated.

  4. A few checks occur to make sure the user has the resources necessary to start a new process. First, the function checks whether current->rlim[RLIMIT_NPROC].rlim_cur is smaller than or equal to the current number of processes owned by the user. If so, an error code is returned, unless the process has root privileges. The function gets the current number of processes owned by the user from a per-user data structure named user_struct. This data structure can be found through a pointer in the user field of the process descriptor.

  5. The function checks that the number of processes is smaller than the value of the max_threads variable. The initial value of this variable depends on the amount of RAM in the system. The general rule is that the space taken by all process descriptors and Kernel Mode stacks cannot exceed 1/8 of the physical memory. However, the system administrator may change this value by writing in the /proc/sys/kernel/threads-max file.

  6. If the parent process uses any kernel modules, the function increments the corresponding reference counters. As we shall see in Appendix B, each kernel module has its own reference counter, which ensures that the module will not be unloaded while it is being used.

  7. The function then updates some of the flags included in the flags field that have been copied from the parent process:

    1. It clears the PF_SUPERPRIV flag, which indicates whether the process has used any of its superuser privileges.

    2. It clears the PF_USEDFPU flag.

    3. It sets the PF_FORKNOEXEC flag, which indicates that the child process has not yet issued an execve( ) system call.

  8. Now the function has taken almost everything that it can use from the parent process; the rest of its activities focus on setting up new resources in the child and letting the kernel know that this new process has been born. First, the function invokes the get_pid( ) function to obtain a new PID, which will be assigned to the child process (unless the CLONE_PID flag is set).

  9. The function then updates all the process descriptor fields that cannot be inherited from the parent process, such as the fields that specify the process parenthood relationships.

  10. Unless specified differently by the flags parameter, it invokes copy_files( ), copy_fs( ), copy_sighand( ), and copy_mm( ) to create new data structures and copy into them the values of the corresponding parent process data structures.

  11. The do_fork( ) function invokes copy_thread( ) to initialize the Kernel Mode stack of the child process with the values contained in the CPU registers when the clone( ) call was issued (these values have been saved in the Kernel Mode stack of the parent, as described in Chapter 9). However, the function forces the value 0 into the field corresponding to the eax register. The thread.esp field in the descriptor of the child process is initialized with the base address of the child’s Kernel Mode stack, and the address of an assembly language function (ret_from_fork( )) is stored in the thread.eip field. The copy_thread( ) function also invokes unlazy_fpu( ) on the parent and duplicates the contents of the thread.i387 field.

  12. If either CLONE_THREAD or CLONE_PARENT is set, the function copies the value of the p_opptr and p_pptr fields of the parent into the corresponding fields of the child. The parent of the child thus appears as the parent of the current process. Otherwise, the function stores the process descriptor address of current into the p_opptr and p_pptr fields of the child.

  13. If the CLONE_PTRACE flag is not set, the function sets the ptrace field in the child process descriptor to 0. This field stores a few flags used when a process is being traced by another process. Thus, even if the current process is being traced, the child will not be.

  14. Conversely, if the CLONE_PTRACE flag is set, the function checks whether the parent process is being traced because in this case, the child should be traced too. Therefore, if PT_PTRACED is set in current->ptrace, the function copies the current->p_pptr field into the corresponding field of the child.

  15. The do_fork( ) function checks the value of CLONE_THREAD. If the flag is set, the function inserts the child in the thread group of the parent and copies in the tgid field the value of the parent’s tgid; otherwise, the function sets the tgid field to the value of the pid field.

  16. The function uses the SET_LINKS macro to insert the new process descriptor in the process list.

  17. The function invokes hash_pid( ) to insert the new process descriptor in the pidhash hash table.

  18. The function increments the values of nr_threads and current->user->processes.

  19. If the child is being traced, the function sends a SIGSTOP signal to it so that the debugger has a chance to look at it before it starts the execution.

  20. It invokes wake_up_process( ) to set the state field of the child process descriptor to TASK_RUNNING and to insert the child in the runqueue list.

  21. If the CLONE_VFORK flag is specified, the function inserts the parent process in a wait queue and suspends it until the child releases its memory address space (that is, until the child either terminates or executes a new program).

  22. The do_fork( ) function returns the PID of the child, which is eventually read by the parent process in User Mode.
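
The resource checks of steps 4 and 5 reduce to simple arithmetic. The following sketch is a User Mode illustration under stated assumptions, not the kernel's code: the helper names and the TASK_UNION_SIZE constant are hypothetical, and the 8 KB figure is taken from step 2.

```c
#include <sys/resource.h>

#define TASK_UNION_SIZE 8192UL  /* process descriptor + Kernel Mode stack */

/* Largest number of processes whose descriptors and Kernel Mode stacks
 * fit in 1/8 of physical memory (the rule behind max_threads). */
unsigned long default_max_threads(unsigned long ram_bytes)
{
    return ram_bytes / 8 / TASK_UNION_SIZE;
}

/* Mirrors the test of step 4: nonzero means that the soft limit is smaller
 * than or equal to the number of processes the user already owns, so an
 * unprivileged request must be refused. */
int over_process_limit(unsigned long user_processes, const struct rlimit *lim)
{
    return lim->rlim_cur <= user_processes;
}
```

With 128 MB of RAM, default_max_threads( ) yields 2048. From User Mode, the limit used in step 4 can be obtained with getrlimit(RLIMIT_NPROC, &lim).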

Now we have a complete child process in the runnable state. But it isn’t actually running. It is up to the scheduler to decide when to give the CPU to this child. At some future process switch, the scheduler bestows this favor on the child process by loading a few CPU registers with the values of the thread field of the child’s process descriptor. In particular, esp is loaded with thread.esp (that is, with the address of the child’s Kernel Mode stack), and eip is loaded with the address of ret_from_fork( ). This assembly language function, in turn, invokes the ret_from_sys_call( ) function (see Chapter 9), which reloads all other registers with the values stored in the stack and forces the CPU back to User Mode. The new process then starts its execution right at the end of the fork( ), vfork( ), or clone( ) system call. The value returned by the system call is contained in eax: the value is 0 for the child and equal to the child’s PID for the parent.

The child process executes the same code as the parent, except that fork( ) returns 0 in the child. The developer of the application can exploit this fact, in a manner familiar to Unix programmers, by inserting a conditional statement in the program, based on the returned PID value, that forces the child to behave differently from the parent process.
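
The conditional statement mentioned above usually takes the following shape. This is a minimal sketch; the function name fork_branches_ok( ) is hypothetical and simply reports whether the parent observed the expected PID and exit code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks once. The child (fork() returned 0) exits with `child_code`.
 * The parent (fork() returned the child's PID) waits for it and returns 1
 * if waitpid() reports the same PID and the expected exit code, else 0. */
int fork_branches_ok(int child_code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return 0;               /* fork failed: no child was created */
    if (pid == 0)
        _exit(child_code);      /* child branch */

    /* parent branch: pid holds the PID of the child */
    pid_t reaped = waitpid(pid, &status, 0);
    return reaped == pid && WIFEXITED(status)
        && WEXITSTATUS(status) == child_code;
}
```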

Kernel Threads

Traditional Unix systems delegate some critical tasks to intermittently running processes, including flushing disk caches, swapping out unused page frames, servicing network connections, and so on. Indeed, it is not efficient to perform these tasks in strict linear fashion; both their functions and the end user processes get better responses if they are scheduled in the background. Since some of the system processes run only in Kernel Mode, modern operating systems delegate their functions to kernel threads, which are not encumbered with the unnecessary User Mode context. In Linux, kernel threads differ from regular processes in the following ways:

  • Each kernel thread executes a single specific kernel C function, while regular processes execute kernel functions only through system calls.

  • Kernel threads run only in Kernel Mode, while regular processes run alternately in Kernel Mode and in User Mode.

  • Since kernel threads run only in Kernel Mode, they use only linear addresses greater than PAGE_OFFSET. Regular processes, on the other hand, use all four gigabytes of linear addresses, in either User Mode or Kernel Mode.
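
The address split in the last item can be expressed as a one-line test. The sketch below assumes the default 3 GB / 1 GB split of 32-bit x86 Linux, in which PAGE_OFFSET is 0xc0000000 and the kernel mapping starts at that address; the helper name is hypothetical.

```c
#include <stdint.h>

/* Assumption: default 3 GB / 1 GB split of 32-bit x86 Linux. */
#define PAGE_OFFSET 0xC0000000u

/* Nonzero if the 32-bit linear address lies in the range reserved for
 * Kernel Mode, which is the only range a kernel thread uses. */
int is_kernel_linear_address(uint32_t addr)
{
    return addr >= PAGE_OFFSET;
}
```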

Creating a kernel thread

The kernel_thread( ) function creates a new kernel thread and can be executed only by another kernel thread. The function contains mostly inline assembly language code, but it is roughly equivalent to the following:

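int kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
    int p;

    /* clone( ) system call: child_stack is 0, and CLONE_VM makes the
       new thread share the creator's address space. */
    p = clone(0, flags | CLONE_VM);
    if (p)          /* parent: p is the child's PID (or an error code) */
        return p;
    else {          /* child: p is 0 */
        fn(arg);
        exit( );
    }
}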

Process 0

The ancestor of all processes, called process 0 or, for historical reasons, the swapper process, is a kernel thread created from scratch during the initialization phase of Linux by the start_kernel( ) function (see Appendix A). This ancestor process uses the following data structures:

  • A process descriptor and a Kernel Mode stack stored in the init_task_union variable. The init_task and init_stack macros yield the addresses of the process descriptor and the stack, respectively.

  • The following tables, which the process descriptor points to:

    • init_mm

    • init_fs

    • init_files

    • init_signals

    The tables are initialized, respectively, by the following macros:

    • INIT_MM

    • INIT_FS

    • INIT_FILES

    • INIT_SIGNALS

  • The master kernel Page Global Directory stored in swapper_pg_dir (see Section 2.5.5).

The start_kernel( ) function initializes all the data structures needed by the kernel, enables interrupts, and creates another kernel thread, named process 1 (more commonly referred to as the init process):

kernel_thread(init, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL);

The newly created kernel thread has PID 1 and shares all per-process kernel data structures with process 0. Moreover, when selected by the scheduler, the init process starts executing the init( ) function.

After having created the init process, process 0 executes the cpu_idle( ) function, which essentially consists of repeatedly executing the hlt assembly language instruction with the interrupts enabled (see Chapter 4). Process 0 is selected by the scheduler only when there are no other processes in the TASK_RUNNING state.

Process 1

The kernel thread created by process 0 executes the init( ) function, which in turn completes the initialization of the kernel. Then init( ) invokes the execve( ) system call to load the executable program init. As a result, the init kernel thread becomes a regular process having its own per-process kernel data structures (see Chapter 20). The init process stays alive until the system is shut down, since it creates and monitors the activity of all processes that implement the outer layers of the operating system.

Other kernel threads

Linux uses many other kernel threads. Some of them are created in the initialization phase and run until shutdown; others are created “on demand,” when the kernel must execute a task that is better performed in its own execution context.

The most important kernel threads (besides process 0 and process 1) are:

keventd

Executes the tasks in the tq_context task queue (see Section 4.7.3).

kapm

Handles the events related to the Advanced Power Management (APM).

kswapd

Performs memory reclaiming, as described in Section 16.7.7.

kflushd (also bdflush)

Flushes “dirty” buffers to disk to reclaim memory, as described in Section 14.2.4.

kupdated

Flushes old “dirty” buffers to disk to reduce risks of filesystem inconsistencies, as described in Section 14.2.4.

ksoftirqd

Runs the tasklets (see Section 4.7); there is one kernel thread for each CPU in the system.



[22] As we shall see later, the CLONE_PID flag can be used only by a process having a PID of 0; in a uniprocessor system, no two lightweight processes have the same PID.
