Wednesday, July 22, 2009

linux: execution of the binary

In the previous post I touched on the internals of the fork system call in the Linux kernel.
Here I want to complete the picture by describing the execve system call.
execve is a system call that replaces the current process image with a new one constructed from an executable binary. Usually execve complements fork: with fork a program creates a new process, and with execve the new process loads a new image. This lets the parent continue to run in parallel with the new program. Without this combination a process would have no way to start a different binary while continuing to run itself.
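The fork-then-execve pattern described above can be sketched from the userspace side roughly as follows. This is only a minimal illustration: the helper name spawn_and_wait is made up for this post, and the exit code 127 for a failed execve is a shell convention rather than anything the kernel requires.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run `path` with `argv` in a child process and
 * return its exit status, or -1 on error. */
int spawn_and_wait(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;              /* fork failed */

    if (pid == 0) {
        /* Child: replace this process image with the new binary. */
        execve(path, argv, environ);
        /* execve returns only on failure. */
        _exit(127);
    }

    /* Parent: still runs its own image; here it simply waits. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent is free to do other work between fork and waitpid; waiting immediately just keeps the sketch short.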
When a userspace program calls one of the exec family functions defined by POSIX:

int execl(const char *path, const char *arg0, ... /*, (char *)0 */);
int execv(const char *path, char *const argv[]);
int execle(const char *path, const char *arg0, ... /*,
           (char *)0, char *const envp[] */);
int execve(const char *path, char *const argv[], char *const envp[]);
int execlp(const char *file, const char *arg0, ... /*, (char *)0 */);
int execvp(const char *file, char *const argv[]);
the C library eventually issues the execve system call:
int execve(const char *filename, char *const argv[], char *const envp[]);
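To illustrate how a variadic wrapper such as execl can reduce to execve, here is a minimal sketch. It is not the actual C library implementation: my_execl and the MY_EXECL_MAX_ARGS limit are hypothetical names chosen for this example, and a real libc handles argument counts and memory differently.

```c
#include <errno.h>
#include <stdarg.h>
#include <stddef.h>
#include <unistd.h>

extern char **environ;

#define MY_EXECL_MAX_ARGS 64   /* arbitrary limit for this sketch */

/* Hypothetical execl look-alike: gather the variadic arguments into an
 * argv array, then hand over to execve with the current environment. */
int my_execl(const char *path, const char *arg0, ...)
{
    char *argv[MY_EXECL_MAX_ARGS + 1];
    size_t n = 0;
    va_list ap;

    va_start(ap, arg0);
    for (const char *a = arg0; a != NULL; a = va_arg(ap, const char *)) {
        if (n == MY_EXECL_MAX_ARGS) {
            va_end(ap);
            errno = E2BIG;      /* too many arguments for this sketch */
            return -1;
        }
        argv[n++] = (char *)a;
    }
    va_end(ap);
    argv[n] = NULL;             /* argv must be NULL-terminated */

    /* On success this never returns; on failure it returns -1. */
    return execve(path, argv, environ);
}
```

A call such as my_execl("/bin/ls", "ls", "-l", (char *)NULL) would then behave like the corresponding execl call.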
In kernel mode the execve system call ends up calling do_execve from fs/exec.c.
First, do_execve checks whether it is safe to execute the binary by checking the process credentials. If it is safe to run the program, do_execve opens the file and fills in a linux_binprm structure. This structure holds the information needed to execute the binary:
struct linux_binprm{
 ....
#ifdef CONFIG_MMU
 struct vm_area_struct *vma;
#else
# define MAX_ARG_PAGES 32
 struct page *page[MAX_ARG_PAGES];
#endif
 struct mm_struct *mm;
 unsigned long p;  /* current top of mem */
 ....
 struct file * file;
 struct cred *cred; /* new credentials */
 ....
 int argc, envc;
 char * filename; /* Name of binary as seen by procps */
 char * interp;  /* Name of the binary really executed. Most
       of the time same as filename, but could be
       different for binfmt_{misc,script} */
 ....
};
A new mm_struct is initialized with the help of int bprm_mm_init(struct linux_binprm *bprm); on x86, for example, a new LDT (Local Descriptor Table) is set up as well. Architecture-independent code allocates a struct vm_area_struct and fills in its fields with appropriate values, such as the area for the stack. It is worth noting that do_execve tries to migrate the task to another processor on an SMP system if balancing is needed. do_execve is a good place for load balancing because at this point the task has its smallest memory and cache footprint.
Once linux_binprm is filled with the values gathered from the current task and the executable file's metadata, do_execve tries to find a proper handler with int search_binary_handler(struct linux_binprm *bprm, struct pt_regs *regs), which finishes the setup of the new image. It iterates over all registered binary format handlers until a suitable one is found. Inside the loop, the load_binary callback of each binary format handler is used to probe the image. If the callback finds the image suitable for its format handler, the outer loop stops iterating.
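The probing loop can be modelled in userspace roughly like this. It is a heavily simplified sketch, not the kernel code: the mini_* types and functions are invented for this post, the handlers only look at magic bytes, and the real loop also deals with locking, module loading and other error codes.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct linux_binprm: only the first bytes
 * of the file, which handlers use to recognise their format. */
struct mini_binprm {
    const unsigned char *buf;
    size_t len;
};

/* Simplified stand-in for a binary format handler. */
struct mini_binfmt {
    const char *name;
    /* Returns 0 on success, -ENOEXEC if the image is not ours. */
    int (*load_binary)(const struct mini_binprm *bprm);
};

static int load_elf(const struct mini_binprm *bprm)
{
    if (bprm->len < 4 || memcmp(bprm->buf, "\177ELF", 4) != 0)
        return -ENOEXEC;
    return 0;   /* a real handler would map the image here */
}

static int load_script(const struct mini_binprm *bprm)
{
    if (bprm->len < 2 || bprm->buf[0] != '#' || bprm->buf[1] != '!')
        return -ENOEXEC;
    return 0;   /* a real handler would switch to the interpreter */
}

static const struct mini_binfmt formats[] = {
    { "elf",    load_elf },
    { "script", load_script },
};

/* Try every handler in turn; stop at the first one that does not
 * reject the image with -ENOEXEC. */
int mini_search_binary_handler(const struct mini_binprm *bprm,
                               const char **matched)
{
    for (size_t i = 0; i < sizeof(formats) / sizeof(formats[0]); i++) {
        int ret = formats[i].load_binary(bprm);
        if (ret != -ENOEXEC) {
            if (matched)
                *matched = formats[i].name;
            return ret;
        }
    }
    return -ENOEXEC;    /* no registered handler accepted the image */
}
```

The point of the design is that probing and loading are one callback: a handler that recognises the image goes on to load it, and one that does not simply reports -ENOEXEC so the next handler gets a chance.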
load_binary is responsible not only for probing the image but also for loading it and setting up the environment. It receives everything it needs through its function arguments.
Concretely, load_binary in a binary format handler loads (maps) the image into memory, does everything else needed for normal program execution (relocation, etc.), and sets up the stack, bss and environment, including putting the argv and envp arrays onto the userspace stack. Finally it calls start_thread, which sets the process' registers to their new values: stack pointer, program entry point, arguments for the main function of the executable, etc. Here is how it looks for the ARM architecture:
#define start_thread(regs,pc,sp)                                \
 ({                                                             \
  unsigned long *stack = (unsigned long *)sp;                   \
  set_fs(USER_DS);                                              \
  memset(regs->uregs, 0, sizeof(regs->uregs));                  \
  if (current->personality & ADDR_LIMIT_32BIT)                  \
   regs->ARM_cpsr = USR_MODE;                                   \
  else                                                          \
   regs->ARM_cpsr = USR26_MODE;                                 \
  if (elf_hwcap & HWCAP_THUMB && pc & 1)                        \
   regs->ARM_cpsr |= PSR_T_BIT;                                 \
  regs->ARM_cpsr |= PSR_ENDSTATE;                               \
  regs->ARM_pc = pc & ~1;  /* pc */                             \
  regs->ARM_sp = sp;       /* sp */                             \
  regs->ARM_r2 = stack[2]; /* r2 (envp) */                      \
  regs->ARM_r1 = stack[1]; /* r1 (argv) */                      \
  regs->ARM_r0 = stack[0]; /* r0 (argc) */                      \
  nommu_start_thread(regs);                                     \
 })
On ARM the first four function arguments are passed in registers r0-r3, so argc, argv and envp are written into r0, r1 and r2 respectively.
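The register setup done by the macro can be mimicked in plain C to see what values end up where. This is a userspace model only: mini_regs and mini_start_thread are hypothetical names, the two boolean parameters stand in for the personality and elf_hwcap checks, PSR_ENDSTATE is left out, and the three constants are my assumption of the values in the ARM kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the ARM kernel header values. */
#define USR26_MODE 0x00000000u
#define USR_MODE   0x00000010u
#define PSR_T_BIT  0x00000020u

/* Hypothetical, cut-down register set for illustration only. */
struct mini_regs {
    uint32_t r0, r1, r2;
    uint32_t sp, pc, cpsr;
};

void mini_start_thread(struct mini_regs *regs, uint32_t pc,
                       const uint32_t *stack, uint32_t sp,
                       bool addr_limit_32bit, bool cpu_has_thumb)
{
    /* Pick the user mode, as the if/else in the macro does. */
    regs->cpsr = addr_limit_32bit ? USR_MODE : USR26_MODE;

    /* An odd entry address requests Thumb state if the CPU has it. */
    if (cpu_has_thumb && (pc & 1))
        regs->cpsr |= PSR_T_BIT;

    regs->pc = pc & ~1u;    /* bit 0 is a mode flag, not an address bit */
    regs->sp = sp;
    regs->r2 = stack[2];    /* envp */
    regs->r1 = stack[1];    /* argv */
    regs->r0 = stack[0];    /* argc */
}
```

For example, an entry point of 0x8001 on a Thumb-capable CPU ends up as pc = 0x8000 with the T bit set in cpsr.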
Now the current process is ready to run the new program. Once the context switch to the process is done, usually by returning to userland, the new register values take effect and the code of the new image starts executing.