Isolation and System Calls
Why This Matters
Every program you run shares the same physical hardware as the OS kernel and every other running process. Without strong isolation guarantees, a buggy or malicious user program could overwrite kernel memory, read another process's secrets, or seize control of hardware devices. Understanding how the kernel enforces these boundaries—and how programs legitimately cross them—is foundational to kernel programming, security auditing, and performance engineering.
1. The Unit of Isolation: The Process
The kernel isolates each process from:
- Other processes — a process cannot read or write another process's memory, file descriptors, CPU registers, or address space.
- The kernel itself — a process cannot directly invoke kernel code or modify kernel data structures.
Three mechanisms collaborate to enforce this: CPU privilege levels, virtual address spaces, and time-slicing. This lecture focuses on the first and how system calls bridge the resulting gap.
2. Hardware Privilege Rings on x86
x86 CPUs define four privilege rings (0–3), though Linux uses only two:
| Ring | Name | Who runs here |
|---|---|---|
| 0 | Kernel mode | Linux kernel |
| 3 | User mode | All user processes |
The Current Privilege Level (CPL) is stored in bits 0:1 of the %cs (code segment) register. Ring 0 code can execute privileged instructions that ring 3 code cannot:
- Writing to %cs (which would change the CPL itself)
- Accessing I/O ports directly
- Modifying control registers (eflags, %cr3 for page tables, etc.)
Segmentation and Privilege Checks
In x86 protected/long mode, each segment descriptor carries a Descriptor Privilege Level (DPL). Segment registers (%cs, %ds, %ss, etc.) hold 16-bit selectors that index into a segment descriptor table and also carry a Requested Privilege Level (RPL).
Access to a segment is granted when:
max(CPL, RPL) ≤ DPL
This means a user-mode program (CPL=3) can never access a kernel segment (DPL=0) directly.
3. System Calls: The Controlled Bridge
Because user programs run at ring 3 and cannot execute privileged code, they need a controlled mechanism to ask the kernel to act on their behalf. That mechanism is the system call.
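A system call is the only legitimate way for a user-space application to deliberately request a kernel service. (Interrupts and exceptions also enter the kernel, but they are not requests the program makes.)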
System calls provide:
- An abstract hardware interface (processes don't talk to devices directly)
- Security and stability guarantees (the kernel validates every request)
Categories of System Calls
| Category | Examples |
|---|---|
| Process management | fork, exit, execve, getpid |
| Memory management | brk, mmap |
| File system | open, read, write, lseek, stat |
| IPC | pipe, shmget |
| Time | gettimeofday, settimeofday |
| Identity | getuid, setuid |
| Networking | connect |
4. The Linux Syscall Table
Every syscall has a unique integer syscall ID assigned sequentially. For x86_64, the table lives at:
arch/x86/entry/syscalls/syscall_64.tbl
At kernel build time, the script scripts/syscalltbl.sh transforms this table into a generated header asm/syscalls_64.h, which populates the sys_call_table array:
/* arch/x86/entry/syscall_64.c */
asmlinkage const sys_call_ptr_t sys_call_table[] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
    [0] = sys_read,
    [1] = sys_write,
    /* ... */
};
The default entry sys_ni_syscall returns -ENOSYS for unimplemented IDs.
5. Invoking a System Call
Via the C Library (Normal Path)
Most syscalls are wrapped by libc (POSIX API). You call write(); the library loads the registers and issues the CPU instruction.
Directly via syscall()
You can bypass the wrapper using the syscall() C library function:
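#define _GNU_SOURCE          /* for the syscall() declaration */
#include <sys/syscall.h>     /* SYS_write */
#include <unistd.h>          /* syscall(), STDOUT_FILENO */

int main(void) {
    const char msg[] = "Hello, world!\n";
    ssize_t bytes_written;

    /* syscall(syscall_id, arg1, arg2, arg3); on x86-64, SYS_write is ID 1 */
    bytes_written = syscall(SYS_write, STDOUT_FILENO, msg, sizeof msg - 1);

    /* write returns the byte count, or -1 with errno set on failure */
    return bytes_written == (ssize_t)(sizeof msg - 1) ? 0 : 1;
}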
CPU Instructions for Syscalls
| Instruction | Usage |
|---|---|
| int 0x80 | Legacy 32-bit software interrupt |
| sysenter | Fast syscall on x86-32 |
| syscall | Fast syscall on x86-64 (current standard) |
Register Conventions (x86-64)
| Register | Role |
|---|---|
| %rax | Syscall ID |
| %rdi | Argument 1 |
| %rsi | Argument 2 |
| %rdx | Argument 3 |
| %r10 | Argument 4 |
| %r8 | Argument 5 |
| %r9 | Argument 6 |
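Linux system calls take at most six arguments, so nothing is passed on the stack; a call that needs more data takes a pointer to a structure in user memory. The return value comes back in %rax, and the syscall instruction itself overwrites %rcx and %r11.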
6. How the Kernel Handles a Syscall
- Entry point registration: at boot, syscall_init() (arch/x86/kernel/cpu/common.c) writes the address of entry_SYSCALL_64 into the IA32_LSTAR MSR. The CPU jumps to this address whenever syscall is executed.
- entry_SYSCALL_64 (arch/x86/entry/entry_64.S): saves registers, switches to the kernel stack, then calls do_syscall_64.
- do_syscall_64: looks up the handler and stores its result: regs->ax = sys_call_table[nr](regs);
- Return: the kernel uses sysret (x86-64) to restore CPL=3 and jump back to user space. Legacy paths use sysexit (x86-32) or iret.
7. User-Space vs. Kernel-Space Memory
User space cannot access kernel memory (enforced by page-table permissions). The reverse is also dangerous: kernel code must never blindly dereference a user-space pointer, because:
- The pointer could be invalid or point to a kernel address (privilege escalation).
- The page might be swapped out.
The kernel provides two safe helpers:
/* copy n bytes from user-space pointer `from` to kernel buffer `to` */
static inline unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
/* copy n bytes from kernel buffer `from` to user-space pointer `to` */
static inline unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
These validate the user pointer, handle faults if the page must be swapped in, and return the number of bytes not copied (0 on success).
8. Adding a New System Call
The standard four-step process:
- Write the implementation: add a function (often with SYSCALL_DEFINE macros) in an existing or new .c file; update the kernel Makefile if adding a new file.
- Register it: add an entry to arch/x86/entry/syscalls/syscall_64.tbl and assign the next available ID.
- Declare the prototype: add it to include/linux/syscalls.h.
- Compile, reboot, test: touching the syscall table triggers a full kernel recompile.
Should you add a syscall? Probably not for new kernel features:
| Pros | Cons |
|---|---|
| Simple, fast | Needs an official number |
| | Interface is frozen once merged |
| | Must be registered per architecture |
| | Overkill for small data exchanges |
Better alternatives: a character device with read()/write(), or ioctl().
9. Improving Syscall Performance
Every system call crosses the user/kernel boundary, and that mode switch is expensive. Several techniques reduce the overhead:
Hardware
Replacing int 0x80 with the syscall instruction cut latency significantly on modern CPUs.
vDSO (virtual dynamic shared object)
The kernel maps a small read-only page into every process's address space containing routines that don't need a privilege transition. gettimeofday() is the canonical example: the kernel keeps the current time in a page that user space can read directly, avoiding any ring switch.
FlexSC (Exception-less System Calls)
A research technique (OSDI 2010) that batches syscalls through shared memory rather than issuing them synchronously, eliminating context-switch overhead on multi-core systems. Benchmarks show up to 116% improvement for Apache, 40% for MySQL, 105% for BIND.
Key Takeaways
- CPU rings (0 = kernel, 3 = user) enforce hardware isolation; the CPL in %cs is the gatekeeper.
- max(CPL, RPL) ≤ DPL is the x86 access-control rule for segments.
- System calls are the single controlled path from user space into the kernel, mediated by the sys_call_table indexed by syscall ID.
- On x86-64, the syscall instruction (not int 0x80) is the fast path; %rax carries the ID and %rdi–%r9 carry up to six arguments.
- copy_from_user / copy_to_user are mandatory for kernel↔user data transfer; raw pointer dereferences are unsafe.
- Adding a new syscall is straightforward but costly in maintenance; prefer device nodes or ioctl for new interfaces.
- vDSO eliminates the ring transition entirely for suitable read-only operations like gettimeofday.