Monday, June 15, 2009

linux: creation of the new process

System developers know that you need fork system call to start a new process and execve system call to start a new program. While fork creates a copy of the current process(actually not a complete copy - please refer to the man page of fork to get information what is being duplicated and currently COW is used to postpone the copying of the memory regions), execve replaces current running process with a new executable binary.

Here I'll try to describe how the new process is being created in the kernel.

First of all the kernel has to handle a system call and switch from the userspace into the kernelspace. On different architectures this is done in different ways. On x86 it's most likely 0x80 software interrupt or swi instruction on ARM. The system call interface, SCI, demultiplexes the system call into the call of a routine in the kernel kingdom: gets the pointer of the routine in the syscall table.
fork system call is multiplexed into sys_fork kernel function. Its prototype is:

int sys_fork(struct pt_regs *regs)
. This is a arch-dependent routine. struct pt_regs is a set of CPU registers which are saved in the process' memory region. When userspace makes a system call CPU registers of current process are taken. Arch-dependent sys_fork makes initial checks(for example it returns -EINVAL on ARM without MMU) and finally calls system-independent do_fork:
long do_fork(unsigned long clone_flags,
       unsigned long stack_start,
       struct pt_regs *regs,
       unsigned long stack_size,
       int __user *parent_tidptr,
       int __user *child_tidptr)
clone_flags represent policy of what should be copied and what should be shared between the process that called sys_fork(parent process) and newly created process(child process). stack_start and stack_size point to the start of the stack and its size respectively. These values are taken from the information obtained about the current process. regs is a pointer to CPU registers of the current process. This structure represents the state of the CPU registers. For x86 it's defined as
struct pt_regs {
 long ebx;
 long ecx;
 long edx;
 long esi;
 long edi;
 long ebp;
 long eax;
 int  xds;
 int  xes;
 int  xfs;
 long orig_eax;
 long eip;
 int  xcs;
 long eflags;
 long esp;
 int  xss;
};
and for ARM as
struct pt_regs {
 long uregs[18];
};

#define ARM_cpsr uregs[16]
#define ARM_pc  uregs[15]
#define ARM_lr  uregs[14]
#define ARM_sp  uregs[13]
#define ARM_ip  uregs[12]
#define ARM_fp  uregs[11]
#define ARM_r10  uregs[10]
#define ARM_r9  uregs[9]
#define ARM_r8  uregs[8]
#define ARM_r7  uregs[7]
#define ARM_r6  uregs[6]
#define ARM_r5  uregs[5]
#define ARM_r4  uregs[4]
#define ARM_r3  uregs[3]
#define ARM_r2  uregs[2]
#define ARM_r1  uregs[1]
#define ARM_r0  uregs[0]
#define ARM_ORIG_r0 uregs[17]
parent_tidptr and child_tidptr are pointers which help userspace libraries to handle threads(NPTL respectively).
do_fork again does some checks and calls copy_process which is responsible for make a copy of the process. copy_process again does some checks of supplied flags and does the copying of the process:
* duplicates task structure;
* allocates memory for kernel stack and puts instance of struct thread_info on the bottom of the kernel stack which holds arch-specific information about the process;
* copies process information: fs, opened files, IPC, signal handling, mm, etc.;
* generates new PID and other IDs in respect of namespace information;
* does some further actions according to passed flags.
Interesting how memory copying is managed by copy_process. mm structure described by struct mm_struct is being copied by copy_mm function. If CLONE_VM was supplied it doesn't copy the memory management information of the parent process but shares it with child and adjusts reference counter of the users of this mm. If no CLONE_VM was set dup_mm is called. This function makes a copy of the mm struct but doesn't copy the contents of the memory pages - copy-on-write is used to make this process run faster and do not waste system memory. When either parent or child process attempts to write to the memory page page fault occurs and kernel recognizes that the page should be copied for each process and write operation could be later satisfied: the result of this operation would be visible only for initiator.
Another interesting function called by copy_process is copy_thread. This routine is architecture dependent and in general among other management tasks it copies CPU registers of current process to the new process and adjusts some of their values(stack pointer, etc.). Also it sets pc(for ARM)/ip(for x86) to ret_from_fork. ret_from_fork will be called next time the newly created process will be scheduled to run. This function does some cleanups and returns control to the userspace. Linux saves CPU registers in the top of the kernel stack of the process. This information helps kernel to do a context switch.
When the process is ready to run do_fork calls wake_up_new_task(or sets TASK_STOPPED if process is being traced) which informs process scheduler that task is ready.
At this point kernel returns execution point into the userland.