xv6 Comprehensive Review
Why This Matters
xv6 is the teaching OS you have been living inside all semester. Every mechanism you touched — fork, exec, page faults, context switches, file writes — passes through code you can read in its entirety. This review ties those mechanisms into a single coherent picture so you can reason about how the pieces interact, not just recall isolated facts.
1. OS Architecture Recap
xv6 uses a monolithic kernel: all OS services (memory management, file system, device drivers) share one kernel address space. This contrasts with a microkernel, where each service lives in its own protected process and communicates via IPC.
Monolithic kernels are faster (no IPC overhead) but have a larger trusted computing base — a bug anywhere in the kernel can compromise the whole system.
Isolation mechanisms
- Each process has its own page table; the kernel maps itself into the upper portion of every process's address space (the kernel half).
- User code runs in ring 3 (unprivileged); kernel code runs in ring 0.
- Transitions between user and kernel happen only through controlled entry points (traps).
2. Processes and the Process Lifecycle
struct proc
Every process is described by struct proc in proc.c. Key fields:
| Field | Purpose |
|---|---|
pgdir |
Pointer to the process's page table |
kstack |
Per-process kernel stack (used during syscalls/traps) |
state |
UNUSED, EMBRYO, SLEEPING, RUNNABLE, RUNNING, ZOMBIE |
ofile[] |
Open file descriptors |
cwd |
Current working directory inode |
tf |
Saved trapframe (registers at last trap) |
context |
Saved context (registers for swtch) |
chan |
Sleep channel (what the process is waiting on) |
killed |
Set to 1 when a signal should terminate the process |
Why does
kill 2 2not immediately kill the xv6 shell?kill()only setsp->killed = 1. The shell only checks that flag when it returns from a trap (system call or interrupt). If the shell is blocked inread(), it won't checkkilleduntil the next trap fires.
Process creation: fork()
allocproc()— allocate astruct proc, set up the kernel stack with a fake return address pointing atforkret.- Copy the parent's page table and all mapped pages into the child.
- Copy the parent's trapframe so the child returns from
fork()at the same instruction — but with return value 0 (child) vs. child PID (parent). - Copy open file descriptors and
cwd. - Set state to
RUNNABLE; the scheduler will pick it up.
First process: userinit() and initcode
The very first user process cannot be created by fork() — there is no parent yet. Instead:
main()callsuserinit().userinit()callsallocproc(), then builds a minimal page table withsetupkvm()+inituvm().- The page table is loaded with
initcode— a tiny hand-assembled binary (compiled frominitcode.S) that is embedded directly in the kernel binary. initcode.Sexecutesexec("/init", argv)as its very first action, replacing itself with the realinitprogram.
# initcode.S (simplified)
start:
mov $init, %rdi # argv[0] = "/init"
mov $argv, %rsi
mov $SYS_exec, %rax
syscall
Zombie processes and resource reclaim
- When a process calls
exit(), it transitions toZOMBIE— it retains itsstruct procso the parent can read the exit status. - The parent calls
wait()to collect the status and callfreeproc(). - If a parent exits before its children,
init(PID 1) is reparented to be their new parent, preventing orphaned zombies.
3. Scheduling and Context Switching
Round-robin scheduler
xv6 runs scheduler() in an infinite loop on each CPU core (called from mpmain() at boot). The loop scans ptable for the next RUNNABLE process and switches to it:
scheduler() loop:
for each proc p in ptable:
if p->state == RUNNABLE:
switchuvm(p) // install p's page table + TSS
swtch(&cpu->scheduler, p->context) // jump to p
switchuvm(p) installs the process's page table into CR3 and updates the TSS so that hardware knows where the kernel stack is on the next trap.
swtch() — saving and restoring context
swtch is written in assembly (swtch.S). It saves the callee-saved registers (%rbx, %rbp, %r12–%r15, and %rip implicitly via push/ret) to the old context struct and restores them from the new context struct.
These registers are saved because the C calling convention only guarantees that callee-saved registers survive a function call — the scheduler must preserve whatever the kernel was doing before the switch.
swtch is called in three situations:
- Voluntary yield — a process calls
sleep()oryield(), which callssched()→swtchback to the scheduler. - Timer interrupt — the timer handler calls
yield()→sched()→swtch. - Exit —
exit()callssched()→swtchto switch away permanently.
CFS fairness (Linux contrast)
Linux's Completely Fair Scheduler tracks vruntime (virtual runtime weighted by priority) for each task. It picks the task with the smallest vruntime. CPU-bound tasks accumulate vruntime fast; I/O-bound tasks (mostly sleeping) accumulate it slowly and are prioritized when they wake up — achieving fairness without starvation.
4. Traps, Interrupts, and Device Drivers
A trap is any CPU-forced transfer of control into the kernel. xv6 handles three kinds:
Path 1 — System calls
User syscall instruction
→ trapasm.S: syscall_entry (saves trapframe)
→ proc.c / syscall.c: syscall()
→ individual handler: sys_read(), sys_write(), …
→ iret back to user
The syscall entry point address is written into the MSR_LSTAR model-specific register at boot.
Path 2 — Hardware interrupts (e.g., timer, keyboard)
Hardware asserts interrupt line
→ APIC routes to CPU
→ vector.S → trapasm.S: alltraps (saves trapframe, calls trap())
→ trap.c: trap(struct trapframe *tf)
• timer: ticks++; wakeup(&ticks); yield()
• keyboard: kbdintr()
→ iret back to interrupted code
Path 3 — Exceptions (e.g., page fault)
Handled similarly to hardware interrupts via alltraps → trap(). The trap number (T_PGFLT = 14) tells the handler what happened.
In hw3 (Copy-on-Write), you used page faults productively:
fork()marks shared pages read-only instead of copying them.- On the first write, the CPU raises
T_PGFLT. copyonwrite()allocates a new page, copies the content, and remaps it writable.
5. Kernel Synchronization
Why locks are needed
With multiple CPUs and preemptible kernel code, two threads of execution can interleave reads and writes to shared data structures (e.g., ptable, the file table). Locks enforce mutual exclusion.
xv6 lock types
| Lock | Use case | Behavior while waiting |
|---|---|---|
spinlock |
Short critical sections, interrupt handlers | Spins (busy-wait) |
sleeplock |
Long operations (disk I/O) | Calls sleep(), yields CPU |
When to use a lock — checklist
Ask: "Is this data structure accessed by more than one CPU or interrupt handler concurrently?"
- If yes: protect reads and writes.
- Hold the lock for the shortest possible time.
- Never call
sleep()while holding aspinlock(deadlock via interrupt re-entry).
Deadlock
Deadlock occurs when two threads each hold a lock the other needs. xv6 avoids it by enforcing a global lock ordering (e.g., always acquire ptable.lock before p->lock).
Advanced lock types (Linux)
rwlock— multiple concurrent readers; writers get exclusive access. Good when reads dominate.seqlock— writers never block; readers retry if they observe an odd sequence number (write in progress). Good for frequently-read, rarely-written data (e.g.,jiffies).
6. File System
Disk layout
| boot | superblock | log | inodes | free bitmap | data blocks |
- Superblock — stores filesystem metadata: total blocks, inode count, log start, etc.
dinode— on-disk inode: file type, size, link count, 12 direct block pointers + 1 indirect pointer.
Inodes (in-memory struct inode)
An inode is the kernel's handle to a file. iget(dev, inum) looks up or allocates a cached inode. Inodes are reference-counted; iput() drops a reference and, when the count reaches zero, frees the inode.
Directories and path resolution
A directory is just a file whose data blocks hold dirent records:
struct dirent {
ushort inum; // inode number (0 = free slot)
char name[14];
};
Path resolution (namei) walks the directory tree, calling dirlookup at each component.
Virtual File System (VFS)
VFS is an abstraction layer that makes multiple concrete filesystem types (ext4, FAT, tmpfs, …) look identical to the kernel. Operations like read, write, open dispatch through function pointers in the inode operations table. xv6's design is a simplified version of this concept.
Buffer cache (bio.c)
The buffer cache keeps recently-used disk blocks in memory to avoid slow disk I/O.
| Function | Role |
|---|---|
bget(dev, blockno) |
Find or allocate a cache slot (LRU replacement) |
bread(dev, blockno) |
bget + read from disk if not cached |
bwrite(buf) |
Write block to disk |
Eviction uses LRU (least recently used): the least-recently-accessed buffer is recycled when the cache is full.
Journaling and crash recovery
xv6 uses a write-ahead log (journal) to provide atomic multi-block updates:
begin_op()— start a transaction.log_write(buf)— mark a buffer as part of the current transaction (dirty, logged).end_op()— when all operations in the transaction complete, callcommit(): a. Write all dirty blocks to the log region on disk. b. Write the log header (commit record). c. Install — copy log blocks to their real disk locations. d. Clear the log header.
If the system crashes between steps (a) and (c), recovery replays the log on the next boot. If it crashes before the commit record is written, the partial transaction is simply ignored — the filesystem remains consistent.
Key Takeaways
struct procis the central data structure — everything about a process lives here.userinit+initcodebootstrap the first user process by embedding a tiny binary directly in the kernel.swtchperforms context switching by saving/restoring callee-saved registers; it is called fromsched()in yield, sleep, and exit paths.- Three trap paths: system call (via MSR entry), hardware interrupt (via APIC + vector table), and exception (e.g., page fault) — all funnel through
alltraps/trap(). - Spinlocks are for short, interrupt-safe critical sections; sleeplocks are for long I/O waits.
- The filesystem layers build on each other: buffer cache → inode → directory → VFS.
- Journaling (write-ahead log) guarantees crash consistency by committing all-or-nothing multi-block updates.