Linux: Anatomy of a System Call

In this post, we shall look at how a system call request is handled by the Linux kernel. We shall also see how the library functions liaise between the kernel and the user code to get some work done.

A system call is a request that a process makes to the Linux kernel to get some work done. It is a controlled entry point into the kernel that executes instructions on the calling process’s behalf, and it is the fundamental interface between the kernel and the application. The kernel exposes a set of system call APIs that a process can use during its execution, covering processes and threads, files, inter-process communication, sockets, and so on.

Why does the kernel need to execute instructions on behalf of the application? The CPU executes code in different modes, each with an associated privilege level that controls access to memory regions, I/O ports, and privileged instructions. The number of levels depends on the CPU, ranging from the most privileged to the least privileged. The kernel runs at the most privileged level and applications run at the least privileged one. This separation keeps each piece of code within its permitted bounds. When less privileged code tries to execute a privileged instruction, the CPU raises an exception, which the kernel handles as an error. A system call is the way for an application to have a privileged operation performed, albeit through the kernel.

At a high level, the following happens when a system call is made. The call switches the CPU from user mode to kernel mode, so privileged instructions can now be executed. The system call to run is then identified: each system call has a unique number, and the kernel exposes a limited set of them. The arguments need to be transferred from application space to kernel space, which is done by placing them in registers. The kernel then executes the requested work in privileged mode, hands the result back to the application context, and switches the mode back to user.

Generally, system calls are not invoked directly. The GNU C library, glibc (and similarly other C libraries), provides wrapper functions for the various system calls. For example, the open() wrapper in turn invokes the sys_open() routine provided by the Linux kernel. The list of system calls provided by the Linux kernel can be viewed here.

Steps Involved in a System Call Invocation

Let us look at what happens when a system call is invoked through a glibc wrapper. For example, let us see what happens when the below call is made:

int fd = open("some_file", O_RDONLY);

  • The arguments to the system call now need to be passed to the kernel context. This essentially means that the arguments need to be moved from the application’s stack to the CPU registers, where the kernel expects them
  • The wrapper then copies the system call number to the %eax register. Each system call is identified by a unique number
  • The wrapper then executes a CPU trap instruction (a software interrupt, int 0x80). This causes the CPU to switch from user mode to privileged mode. Newer wrappers use the sysenter instruction to enter kernel mode rather than the slower interrupt mechanism
  • The kernel invokes the system_call() routine, found in arch/i386/kernel/entry.S or similar
  • The system_call routine first saves the register values into the kernel stack. It then checks if the %eax register holds a valid system call identifier
  • The system_call routine then calls the ‘actual’ system call which is of the form - sys_xxxxxx(). The mapping of the system call identifier to the actual system call is maintained in a kernel variable - sys_call_table
  • The result of sys_xxxxxx() is returned back to the system_call routine
  • The system_call routine then transfers the result of the ‘actual’ system call to the stack and returns to the glibc wrapper function

The wrapper function then has a last bit of work to do. In Linux, a system call that fails returns a negative number to the system_call routine, namely the negated error number, and this value is passed back to the wrapper unchanged. When control returns to the glibc wrapper function, it checks the value: if it indicates an error, the wrapper negates it, stores the resulting positive number in the global variable errno, and returns -1 to indicate that the call failed. This is why we are advised to check whether each call passed or failed by testing for a return value of -1.

The above can be visualized as:

Fig 1: Simplified flow of a Linux System Call

As a side note, you can list the error numbers from the command line with the command: $ errno -l. This requires the moreutils package to be installed on your system. You can also use the strerror function to get a string describing an error by passing it the errno value.

That’s it. This is a very small post that tries to explain what happens, behind the scenes, when a system call is invoked. As you can see above, system calls do incur some cost from switching between user mode and kernel mode, plus various other bookkeeping. An application should try to minimize system call invocations wherever possible. For any discussion, tweet here.

[1] https://en.wikipedia.org/wiki/Protection_ring
[2] http://asm.sourceforge.net/syscall.html
[3] http://man7.org/tlpi/