Isolation and System Calls

Why This Matters

Every program you run shares the same physical hardware as the OS kernel and every other running process. Without strong isolation guarantees, a buggy or malicious user program could overwrite kernel memory, read another process's secrets, or seize control of hardware devices. Understanding how the kernel enforces these boundaries—and how programs legitimately cross them—is foundational to kernel programming, security auditing, and performance engineering.


1. The Unit of Isolation: The Process

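The kernel isolates each process from other processes and from the kernel itself, so that a fault or compromise in one process cannot corrupt the rest of the system.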

Three mechanisms collaborate to enforce this: CPU privilege levels, virtual address spaces, and time-slicing. This lecture focuses on the first and how system calls bridge the resulting gap.


2. Hardware Privilege Rings on x86

x86 CPUs define four privilege rings (0–3), though Linux uses only two:

Ring   Name          Who runs here
0      Kernel mode   Linux kernel
3      User mode     All user processes

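The Current Privilege Level (CPL) is stored in the two low-order bits (bits 0-1) of the %cs (code segment) register. Ring 0 code can execute privileged instructions that ring 3 code cannot, for example hlt, lgdt, wrmsr, and writes to control registers such as %cr3.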

Segmentation and Privilege Checks

In x86 protected/long mode, each segment descriptor carries a Descriptor Privilege Level (DPL). Segment registers (%cs, %ds, %ss, etc.) hold 16-bit selectors that index into a segment descriptor table and also carry a Requested Privilege Level (RPL).

Access to a segment is granted when:

max(CPL, RPL) ≤ DPL

This means a user-mode program (CPL=3) can never access a kernel segment (DPL=0) directly.


3. System Calls: The Controlled Bridge

Because user programs run at ring 3 and cannot execute privileged code, they need a controlled mechanism to ask the kernel to act on their behalf. That mechanism is the system call.

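A system call is the only legitimate way for a user-space application to request a service from the kernel. (Interrupts and exceptions also enter the kernel, but they are not a request interface.)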

System calls provide:

Categories of System Calls

Category             Examples
Process management   fork, exit, execve, getpid
Memory management    brk, mmap
File system          open, read, write, lseek, stat
IPC                  pipe, shmget
Time                 gettimeofday, settimeofday
Identity             getuid, setuid
Networking           connect

4. The Linux Syscall Table

Every syscall has a unique integer syscall ID assigned sequentially. For x86_64, the table lives at:

arch/x86/entry/syscalls/syscall_64.tbl

At kernel build time, the script scripts/syscalltbl.sh transforms this table into a generated header asm/syscalls_64.h, which populates the sys_call_table array:

/* arch/x86/entry/syscall_64.c */
asmlinkage const sys_call_ptr_t sys_call_table[] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
    [0] = sys_read,
    [1] = sys_write,
    /* ... */
};

The default entry sys_ni_syscall returns -ENOSYS for unimplemented IDs.


5. Invoking a System Call

Via the C Library (Normal Path)

Most syscalls are wrapped by libc, which exposes them through the POSIX API. When you call write(), the library wrapper loads the syscall ID and arguments into registers and then executes the CPU's syscall instruction.

Directly via syscall()

You can bypass the wrapper using the syscall() C library function:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    const char msg[] = "Hello, world!\n";
    /* syscall(syscall_id, arg1, arg2, arg3); SYS_write is ID 1 on x86-64 */
    long bytes_written = syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1);
    if (bytes_written < 0) {
        /* syscall() returns -1 and sets errno on failure */
        perror("write");
        return 1;
    }
    return 0;
}

CPU Instructions for Syscalls

Instruction   Usage
int 0x80      Legacy 32-bit software interrupt
sysenter      Fast syscall on x86-32
syscall       Fast syscall on x86-64 (current standard)

Register Conventions (x86-64)

Register   Role
%rax       Syscall ID
%rdi       Argument 1
%rsi       Argument 2
%rdx       Argument 3
%r10       Argument 4
%r8        Argument 5
%r9        Argument 6

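Linux system calls take at most six arguments, so nothing is passed on the stack. The return value comes back in %rax (a negative errno value on failure), and the syscall instruction itself overwrites %rcx and %r11.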


6. How the Kernel Handles a Syscall

  1. Entry point registration — at boot, syscall_init() (arch/x86/kernel/cpu/common.c) writes the address of entry_SYSCALL_64 into the IA32_LSTAR MSR. The CPU jumps to this address whenever syscall is executed.

  2. entry_SYSCALL_64 (arch/x86/entry/entry_64.S) — saves registers, switches to the kernel stack, then calls do_syscall_64.

  3. do_syscall_64 — looks up the handler:

    regs->ax = sys_call_table[nr](regs);
    
  4. Return — the kernel uses sysret (x86-64) to restore CPL=3 and jump back to user space. Legacy paths use sysexit (x86-32) or iret.


7. User-Space vs. Kernel-Space Memory

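User space cannot access kernel memory (enforced by page-table permissions). The reverse is also dangerous: kernel code must never blindly dereference a user-space pointer, because the pointer may be NULL or otherwise invalid, may refer to a page that is not currently resident, or may have been crafted to point into kernel memory.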

The kernel provides two safe helpers:

/* copy n bytes from user-space pointer `from` to kernel buffer `to` */
static inline unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);

/* copy n bytes from kernel buffer `from` to user-space pointer `to` */
static inline unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);

These validate the user pointer, handle faults if the page must be swapped in, and return the number of bytes not copied (0 on success).


8. Adding a New System Call

The standard four-step process:

  1. Write the implementation — add a function (often with SYSCALL_DEFINE macros) in an existing or new .c file; update the kernel Makefile if adding a new file.
  2. Register it — add an entry to arch/x86/entry/syscalls/syscall_64.tbl and assign the next available ID.
  3. Declare the prototype — add it to include/linux/syscalls.h.
  4. Compile, reboot, test — touching the syscall table triggers a full kernel recompile.

Should you add a syscall? Probably not for new kernel features:

Pros           Cons
Simple, fast   Needs an official number
               Interface is frozen once merged
               Must be registered per architecture
               Overkill for small data exchanges

Better alternatives: a character device with read()/write(), or ioctl().


9. Improving Syscall Performance

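Every system call pays for a user-to-kernel mode switch and the return trip, which is expensive compared with an ordinary function call. Several techniques reduce the overhead: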

Hardware

Replacing int 0x80 with the syscall instruction cut latency significantly on modern CPUs.

vDSO (Virtual Dynamic Shared Object)

The kernel maps a small read-only page into every process's address space containing routines that don't need a privilege transition. gettimeofday() is the canonical example: the kernel keeps the current time in a page that user space can read directly, avoiding any ring switch.

FlexSC (Exception-less System Calls)

A research technique (OSDI 2010) in which user threads post syscall requests to shared memory and dedicated kernel threads execute them asynchronously, so that calls can be batched and a mode switch is no longer needed for each one, which pays off particularly on multi-core systems. The authors report improvements of up to 116% for Apache, 40% for MySQL, and 105% for BIND.


Key Takeaways

Practice

  1. On x86, what determines whether a running program is currently in kernel mode or user mode?
  2. Given DPL=0 (kernel segment), CPL=3 (user mode), RPL=0, what does x86 do when the program tries to load this segment selector?
  3. Which file in the Linux kernel source defines the syscall IDs for the x86_64 architecture?
  4. A user-space program on x86-64 wants to invoke the write system call (ID=1) directly using the syscall instruction to write 5 bytes from buffer buf to file descriptor 3. Which register setup is correct before executing syscall?
  5. What is the role of entry_SYSCALL_64 in the Linux kernel?
  6. Why must kernel code use copy_from_user() instead of directly dereferencing a pointer received from user space?
  7. List the four steps required to add a new system call to the Linux kernel on x86-64.
  8. What is vDSO and why does it improve gettimeofday() performance?
  9. A colleague suggests adding a new syscall to exchange a small 16-byte status struct between a kernel module and a user-space daemon. Give two reasons to prefer an alternative approach, and name one concrete alternative.
  10. On x86-64, when the kernel finishes handling a system call and needs to return to user space, which instruction does it use?