Virtual Memory III: Page Table Walk and VM Usage
Why This Matters
The last two modules introduced virtual memory theory and the x86-64 four-level paging structure. This module grounds all of that in real code: the xv6 kernel's vm.c. You'll trace exactly how a kernel page table is built from scratch, how a user process gets its first mapped page, and how the OS traverses four levels of page tables to install or look up a mapping. Understanding this code makes page faults, fork, exec, and memory-mapped I/O intuitive rather than mysteriousβand it's the foundation for the homework assignments on page deduplication and copy-on-write.
1. Building the Kernel Page Table: kvmalloc
When xv6 boots, main() calls kvmalloc() to create a single page table that maps the kernel's virtual address space. Here is a simplified version:
void kvmalloc(void) {
kpml4 = (pml4e_t*) kalloc();
kpdpt = (pde_t*) kalloc();
kpml4[PMX(KERNBASE)] = v2p(kpdpt) | PTE_P | PTE_W;
// Direct-map first GB of physical memory to KERNBASE
kpdpt[0] = 0 | PTE_PS | PTE_P | PTE_W;
kpdpt[3] = 0xC0000000 | PTE_PS | PTE_P | PTE_W | PTE_PWT | PTE_PCD;
lcr3(v2p(kpml4));
}
Key points
| Line | What it does |
|---|---|
kalloc() |
Allocates one 4 KB physical page for the PML4 table, then one for the PDPT |
kpml4[PMX(KERNBASE)] = β¦ |
Plants a PML4 entry pointing to the PDPT; PMX extracts bits [47:39] of the virtual address |
PTE_PS |
Page-Size flag β makes the PDPT entry point directly to a 1 GB page instead of a Page Directory; skips two levels of walking |
v2p(kpdpt) / V2P(mem) |
Converts a kernel virtual address to a physical address, because page table entries must contain physical addresses |
lcr3(v2p(kpml4)) |
Writes the physical address of PML4 into CR3, activating the new page table hierarchy and flushing the TLB |
Why PTE_PS matters: Without it, the hardware would try to interpret kpdpt[0] as a pointer to a Page Directory, not a 1 GB frame. The kernel would take a page fault on its very first memory access during early boot.
2. Setting Up the First User Process: userinit β inituvm
After the kernel is running, main() calls userinit() to launch the first user process, which eventually execs the shell. Part of that setup is mapping the initial user code into the new process's page table:
// proc.c / vm.c (simplified)
userinit(void) {
setupkvm(); // copy kernel mappings into new process page table
inituvm(p->pgdir, _binary_initcode_start,
(addr_t)_binary_initcode_size);
}
inituvm(...) {
mappages(pgdir, (void*)PGSIZE, PGSIZE, V2P(mem), PTE_W | PTE_U);
}
mappages iterates over a range of virtual addresses, one page at a time, calling walkpgdir at each step to find (or create) the leaf page-table entry, then writing the physical address + permissions into it.
mappages(pde_t *pgdir, void *va, addr_t size, addr_t pa, int perm) {
a = PGROUNDDOWN(va);
last = PGROUNDDOWN(va + size - 1);
for (;;) {
pte = walkpgdir(pgdir, a, 1); // 1 = allocate missing tables
*pte = pa | perm | PTE_P;
if (a == last) break;
a += PGSIZE;
pa += PGSIZE;
}
}
3. The Page Table Walk: walkpgdir
walkpgdir is the heart of xv6's VM code. It accepts a virtual address and traverses the four-level table hierarchy, allocating intermediate tables on the fly when alloc == 1.
pte_t* walkpgdir(pde_t *pml4, const void *va, int alloc) {
pml4e = &pml4[PMX(va)]; // index into PML4
if (*pml4e & PTE_P)
pdp = (pdpe_t*) P2V(PTE_ADDR(*pml4e)); // table exists
else {
if (!alloc || (pdp = kalloc()) == 0)
return 0; // fail or no alloc
memset(pdp, 0, PGSIZE);
*pml4e = V2P(pdp) | PTE_P | PTE_W | PTE_U;
}
// β¦ same pattern repeated for PDP β PD β PT β¦
return &pt[PTX(va)]; // pointer to the leaf PTE
}
Walk summary
Virtual address bits: [63:48 unused][47:39 PML4][38:30 PDP][29:21 PD][20:12 PT][11:0 offset]
CR3 β PML4[PMX(va)] β PDP[PDPX(va)] β PD[PDX(va)] β PT[PTX(va)] β physical frame
The function returns a pointer to the PTE (not the PTE value), so mappages can write the physical address directly into it with *pte = pa | perm | PTE_P.
The alloc flag controls whether missing intermediate tables are created. Passing alloc=0 (used in vmprint) makes the walk read-onlyβit returns NULL rather than allocate.
4. Process Address Space
Each process owns its own page table, giving it a private virtual address space from address 0 up to some size sz. Above KERNBASE, every process maps the same kernel pages (installed by setupkvm). This means:
- System calls and interrupts can access kernel data without switching CR3.
- The kernel is invisible to user code because those PTEs lack
PTE_U. - Physical kernel pages are sharedβno duplication cost.
High VA βββββββββββββββββββββββ
β Kernel (KERNBASE+) β β shared across all processes
βββββββββββββββββββββββ€
β User stack β
β Heap (sbrk growsβ) β
β Code / data β
Low VA βββββββββββββββββββββββ address 0
5. Growing a Process: sbrk
sbrk(n) increases (or decreases) a process's data segment by n bytes. Internally:
addr_t sys_sbrk(void) {
argaddr(0, &n);
if (growproc(n) < 0) return -1;
return old_sz;
}
int growproc(int64 n) {
sz = proc->sz;
if (n > 0)
sz = allocuvm(proc->pgdir, sz, sz + n); // add pages
else if (n < 0)
sz = deallocuvm(proc->pgdir, sz, sz + n); // remove pages
proc->sz = sz;
switchuvm(proc); // reload CR3 / flush TLB
return 0;
}
allocuvm calls mappages (which calls walkpgdir) for each new page, allocating physical memory and inserting PTEs. deallocuvm removes PTEs and frees the physical pages.
6. VM Use Case: Page Deduplication
Because page tables add a level of indirection between virtual and physical addresses, the OS can point multiple virtual pages at the same physical page. This is page deduplication:
Process A VA 0x1000 ββ
βββ Physical page 0x5000 (read-only copy of libc)
Process B VA 0x3000 ββ
Benefits:
- Reduces physical memory pressureβimportant in cloud VMs and container environments where many guests run the same OS image.
- Each guest's page table entry just points at the shared frame; no extra copying needed until a write occurs.
The prerequisite is that shared pages must be read-only (or that the OS detects writes and reacts). That reaction is copy-on-write.
Copy-on-Write (CoW)
When a process with a deduplicated (shared) page tries to write:
- The write triggers a page fault (the PTE has
PTE_Wcleared). - The OS allocates a fresh physical page, copies the content, and updates the faulting process's PTE to point to the new writable copy.
- The other process's mapping is unchanged.
This is the principle behind efficient fork(): instead of copying all pages immediately, the kernel marks them shared and read-only, deferring copies until absolutely necessary.
Key Takeaways
lcr3()loads CR3 with the physical address of PML4, activating a page table and flushing the TLBβit is how the CPU switches from one address space to another.walkpgdir(va, alloc)traverses all four levels of the x86-64 page table hierarchy, optionally allocating missing intermediate tables, and returns a pointer to the leaf PTE.mappagesloops over a VA range and callswalkpgdir+ storespa | perm | PTE_Pin each PTE.- The kernel page table is shared across all processes (mapped above KERNBASE, no
PTE_U), enabling cheap system calls without CR3 switches. sbrkgrows/shrinks the user heap by callingallocuvm/deallocuvm, which ultimately invokewalkpgdirto add or remove PTEs.- Page deduplication exploits the virtualβphysical indirection to map many VAs to one physical frame, saving memory in cloud and container workloads.
- Copy-on-write handles writes to deduplicated pages via page faults, making
fork()efficient.