Isolation and System Calls

Why This Matters

Every program you run shares the same physical hardware as the OS kernel and every other running process. Without strong isolation guarantees, a buggy or malicious user program could overwrite kernel memory, read another process's secrets, or seize control of hardware devices. Understanding how the kernel enforces these boundaries—and how programs legitimately cross them—is foundational to kernel programming, security auditing, and performance engineering.

1. The Unit of Isolation: The Process

The kernel isolates each process from:

Other processes — a process cannot read or write another process's memory, file descriptors, CPU registers, or address space.
The kernel itself — a process cannot directly invoke kernel code or modify kernel data structures.

Three mechanisms collaborate to enforce this: CPU privilege levels, virtual address spaces, and time-slicing. This lecture focuses on the first and how system calls bridge the resulting gap.

2. Hardware Privilege Rings on x86

x86 CPUs define four privilege rings (0–3), though Linux uses only two:

Ring	Name	Who runs here
0	Kernel mode	Linux kernel
3	User mode	All user processes

The Current Privilege Level (CPL) is stored in bits 0:1 of the %cs (code segment) register. Ring 0 code can execute privileged instructions that ring 3 code cannot:

Writing to %cs (which would change the CPL itself)
Accessing I/O ports directly
Modifying control registers (%cr3 for page tables, etc.) and privileged flag bits such as IOPL in %eflags

Segmentation and Privilege Checks

In x86 protected/long mode, each segment descriptor carries a Descriptor Privilege Level (DPL). Segment registers (%cs, %ds, %ss, etc.) hold 16-bit selectors that index into a segment descriptor table and also carry a Requested Privilege Level (RPL).

Access to a segment is granted when:

max(CPL, RPL) ≤ DPL

This means a user-mode program (CPL=3) can never access a kernel segment (DPL=0) directly.

3. System Calls: The Controlled Bridge

Because user programs run at ring 3 and cannot execute privileged code, they need a controlled mechanism to ask the kernel to act on their behalf. That mechanism is the system call.

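A system call is the only legitimate way for a user-space application to request a service from the kernel. (Interrupts and exceptions also transfer control to the kernel, but they are not a request interface.)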

System calls provide:

An abstract hardware interface (processes don't talk to devices directly)
Security and stability guarantees (the kernel validates every request)

Categories of System Calls

Category	Examples
Process management	`fork`, `exit`, `execve`, `getpid`
Memory management	`brk`, `mmap`
File system	`open`, `read`, `write`, `lseek`, `stat`
IPC	`pipe`, `shmget`
Time	`gettimeofday`, `settimeofday`
Identity	`getuid`, `setuid`
Networking	`connect`

4. The Linux Syscall Table

Every syscall has a unique integer syscall ID assigned sequentially. For x86_64, the table lives at:

arch/x86/entry/syscalls/syscall_64.tbl

At kernel build time, the script scripts/syscalltbl.sh transforms this table into a generated header asm/syscalls_64.h, which populates the sys_call_table array:

/* arch/x86/entry/syscall_64.c */
asmlinkage const sys_call_ptr_t sys_call_table[] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
    [0] = sys_read,
    [1] = sys_write,
    /* ... */
};

The default entry sys_ni_syscall returns -ENOSYS for unimplemented IDs.

5. Invoking a System Call

Via the C Library (Normal Path)

Most syscalls are wrapped by libc, which exposes them through the POSIX API. You call write(); the library places the syscall ID and arguments in registers and issues the CPU instruction.

Directly via `syscall()`

You can bypass the wrapper using the syscall() C library function:

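#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    char msg[] = "Hello, world!\n";
    long bytes_written;
    /* syscall(syscall_id, arg1, arg2, arg3) */
    /* SYS_write is syscall ID 1 on x86-64; STDOUT_FILENO is fd 1 */
    bytes_written = syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1);
    if (bytes_written < 0) {
        /* like the wrappers, syscall() returns -1 and sets errno on error */
        perror("write");
        return 1;
    }
    return 0;
}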

CPU Instructions for Syscalls

Instruction	Usage
`int 0x80`	Legacy 32-bit software interrupt
`sysenter`	Fast syscall on x86-32
`syscall`	Fast syscall on x86-64 (current standard)

Register Conventions (x86-64)

Register	Role
`%rax`	Syscall ID
`%rdi`	Argument 1
`%rsi`	Argument 2
`%rdx`	Argument 3
`%r10`	Argument 4
`%r8`	Argument 5
`%r9`	Argument 6

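Linux system calls take at most six arguments; nothing is passed on the stack. A call that needs more data takes a pointer to a structure in user memory instead. The return value comes back in %rax, and the syscall instruction itself overwrites %rcx and %r11.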

6. How the Kernel Handles a Syscall

Entry point registration — at boot, syscall_init() (arch/x86/kernel/cpu/common.c) writes the address of entry_SYSCALL_64 into the IA32_LSTAR MSR. The CPU jumps to this address whenever syscall is executed.
entry_SYSCALL_64 (arch/x86/entry/entry_64.S) — saves registers, switches to the kernel stack, then calls do_syscall_64.
do_syscall_64 — looks up the handler:
```
regs->ax = sys_call_table[nr](regs);
```
Return — the kernel uses sysret (x86-64) to restore CPL=3 and jump back to user space. Legacy paths use sysexit (x86-32) or iret.

7. User-Space vs. Kernel-Space Memory

User space cannot access kernel memory (enforced by page-table permissions). The reverse is also dangerous: kernel code must never blindly dereference a user-space pointer, because:

The pointer could be invalid or point to a kernel address (privilege escalation).
The page might be swapped out.

The kernel provides two safe helpers:

/* copy n bytes from user-space pointer `from` to kernel buffer `to`;
   returns the number of bytes that could NOT be copied */
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);

/* copy n bytes from kernel buffer `from` to user-space pointer `to`;
   returns the number of bytes that could NOT be copied */
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);

These validate the user pointer, handle faults if the page must be swapped in, and return the number of bytes not copied (0 on success).

8. Adding a New System Call

The standard four-step process:

Write the implementation — add a function (often with SYSCALL_DEFINE macros) in an existing or new .c file; update the kernel Makefile if adding a new file.
Register it — add an entry to arch/x86/entry/syscalls/syscall_64.tbl and assign the next available ID.
Declare the prototype — add it to include/linux/syscalls.h.
Compile, reboot, test — touching the syscall table triggers a full kernel recompile.

Should you add a syscall? Probably not for new kernel features:

Pros	Cons
Simple, fast	Needs an official number
	Interface is frozen once merged
	Must be registered per architecture
	Overkill for small data exchanges

Better alternatives: a character device with read()/write(), or ioctl().

9. Improving Syscall Performance

Every system call forces a switch from user mode to kernel mode and back, and that transition is expensive. Several techniques reduce the overhead:

Hardware

Replacing int 0x80 with the syscall instruction cut latency significantly on modern CPUs.

vDSO (Virtual Dynamic Shared Object)

The kernel maps a small read-only page into every process's address space containing routines that don't need a privilege transition. gettimeofday() is the canonical example: the kernel keeps the current time in a page that user space can read directly, avoiding any ring switch.

FlexSC (Exception-less System Calls)

A research technique (OSDI 2010) that batches syscalls through shared memory rather than issuing them synchronously, avoiding a mode switch per call on multi-core systems. The paper reports improvements of up to 116% for Apache, 40% for MySQL, and 105% for BIND.

Key Takeaways

CPU rings (0 = kernel, 3 = user) enforce hardware isolation; the CPL in %cs is the gatekeeper.
max(CPL, RPL) ≤ DPL is the x86 access-control rule for segments.
System calls are the single controlled path from user space into the kernel, mediated by the sys_call_table indexed by syscall ID.
On x86-64, the syscall instruction (not int 0x80) is the fast path; %rax carries the ID and %rdi, %rsi, %rdx, %r10, %r8, and %r9 carry up to six arguments.
copy_from_user / copy_to_user are mandatory for kernel↔user data transfer; raw pointer dereferences are unsafe.
Adding a new syscall is straightforward but costly in maintenance; prefer device nodes or ioctl for new interfaces.
vDSO eliminates the ring transition entirely for suitable read-only operations like gettimeofday.
