Page Cache and Page Fault
Why This Matters
Disk I/O is orders of magnitude slower than memory access — a disk seek takes milliseconds while a RAM read takes nanoseconds. Without caching, every read() or write() would stall waiting for physical storage. The Linux page cache is the kernel's primary mechanism for bridging this gap, and understanding it is essential for diagnosing performance problems, reasoning about memory pressure, and understanding how page faults connect virtual memory to real storage.
The Page Cache
The page cache holds physical pages in RAM whose contents come from disk (the backing store). It caches data for:
- Regular files
- Memory-mapped files
- Block device files
The cache grows dynamically to consume memory that is otherwise idle, and shrinks when processes or the kernel need more RAM. This is why free often shows most RAM as "cached" on a busy Linux system — that is intentional, not waste.
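One way to check how much RAM the page cache currently holds is to read the Cached field of /proc/meminfo. The following is a minimal sketch, assuming the usual "Name: value kB" line layout; the helper name parse_meminfo and the sample figures are illustrative, not taken from a real machine.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> kibibytes."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Illustrative sample; on a Linux host, pass open("/proc/meminfo").read() instead.
SAMPLE = """MemTotal:       16384000 kB
MemFree:         1024000 kB
Cached:          9216000 kB
"""

info = parse_meminfo(SAMPLE)
cached_share = info["Cached"] / info["MemTotal"]   # fraction of RAM used as cache
```

A large share here is expected on a long-running system and does not by itself indicate memory pressure.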
Buffered I/O Flow
Every read() and write() that does not use O_DIRECT goes through the page cache:
| Situation | What Happens |
|---|---|
| Cache hit | Data is already in a page cache page; copy directly to/from user memory |
| Cache miss | VFS asks the concrete filesystem (e.g., ext4) to read the block from disk, populate the page cache, then serve the request |
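The hit and miss paths can be modelled in a few lines. This is a toy sketch, not kernel code: the class ToyPageCache, the 4096-byte page size, and the bytes object standing in for the disk are all assumptions made for illustration.

```python
PAGE_SIZE = 4096


class ToyPageCache:
    """Toy model of the buffered read path: page index -> page contents."""

    def __init__(self, disk):
        self.disk = disk          # bytes object standing in for the backing store
        self.pages = {}           # cached pages keyed by page index
        self.hits = 0
        self.misses = 0

    def _get_page(self, index):
        if index in self.pages:
            self.hits += 1
        else:
            # Cache miss: read the block from "disk" and populate the cache.
            self.misses += 1
            start = index * PAGE_SIZE
            self.pages[index] = self.disk[start:start + PAGE_SIZE]
        return self.pages[index]

    def read(self, offset, length):
        """Serve a read by copying from cached pages, loading them as needed."""
        out = bytearray()
        end = min(offset + length, len(self.disk))
        while offset < end:
            index, within = divmod(offset, PAGE_SIZE)
            chunk = self._get_page(index)[within:within + (end - offset)]
            out += chunk
            offset += len(chunk)
        return bytes(out)


disk = bytes(range(256)) * 64          # 16 KiB of fake file data (4 pages)
cache = ToyPageCache(disk)
first = cache.read(100, 5000)          # touches pages 0 and 1: two misses
second = cache.read(100, 5000)         # same range again: two hits
```

The second read never touches the simulated disk, which is the whole point of the cache.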
Write Caching Policies
Three strategies exist for handling writes:
| Policy | Behavior | Trade-off |
|---|---|---|
| No-write | Writes go straight to disk and invalidate any cached copy | Cache never holds dirty data; the next read of that data must go back to disk |
| Write-through | Writes update the cache and immediately write to disk | Cache always coherent; high write latency |
| Write-back | Writes update the cache; disk write is deferred | Best performance; risk of data loss on crash |
Linux uses write-back. When a page is written, it is marked dirty (using a tag in the radix tree / xarray). Dirty pages are eventually flushed to disk by the flusher daemon. This absorbs temporal locality in writes: if the same page is written many times in quick succession, only the final version hits the disk.
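The coalescing effect of write-back can be shown with a toy model. This sketch assumes a dictionary as the backing store and a Python set as the dirty tag; the names ToyWriteBackCache, write, and flush are hypothetical and do not correspond to kernel functions.

```python
class ToyWriteBackCache:
    """Toy write-back cache: writes dirty a cached page; flush() writes it out."""

    def __init__(self):
        self.pages = {}        # page index -> latest contents
        self.dirty = set()     # indexes of pages modified since the last flush
        self.disk = {}         # stand-in for the backing store
        self.disk_writes = 0

    def write(self, index, data):
        self.pages[index] = data
        self.dirty.add(index)  # mark dirty; no disk I/O happens here

    def flush(self):
        """Write every dirty page to 'disk' once, then mark it clean."""
        for index in sorted(self.dirty):
            self.disk[index] = self.pages[index]
            self.disk_writes += 1
        self.dirty.clear()


wb = ToyWriteBackCache()
for version in (b"v1", b"v2", b"v3"):
    wb.write(7, version)       # three writes to the same page
wb.flush()                     # one disk write carrying the final version
```

Three logical writes turn into a single disk write. If the process crashed before flush(), all three would be lost, which is the trade-off listed in the table.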
Cache Eviction
The page cache is smaller than the disk, so pages must eventually be evicted to make room.
Naive LRU
The simplest policy is Least Recently Used (LRU): track the last access time for every page and evict the oldest. LRU works well for repeated-access patterns, but fails for streaming workloads — files that are read once and never again flood the LRU list and push out genuinely hot pages.
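The streaming failure mode is easy to reproduce. Below is a minimal sketch of plain LRU built on collections.OrderedDict; the capacity of four pages and the page names are arbitrary choices for the demonstration.

```python
from collections import OrderedDict


class LRUCache:
    """Plain LRU over page ids: the least recently used page is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # oldest access first, newest last

    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)        # refresh recency on a hit
            return
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)      # evict the least recently used
        self.pages[page] = True


lru = LRUCache(capacity=4)
for _ in range(3):
    for hot in ("hot-a", "hot-b"):
        lru.access(hot)                          # a small, frequently used set
for i in range(4):
    lru.access(f"scan-{i}")                      # one-time streaming read
survivors = list(lru.pages)
```

After the scan, both hot pages are gone and the cache holds only pages that will never be read again.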
The Two-List Strategy
Linux solves this with two lists:
| List | Description | Eviction Eligible? |
|---|---|---|
| Inactive list | Pages accessed recently but not confirmed hot | Yes |
| Active list | Pages confirmed hot (accessed more than once) | No |
How pages move:
- A page accessed for the first time goes onto the inactive list, and its page-table access bit is cleared so future accesses can be detected.
- If the page is accessed again while still on the inactive list, it is promoted to the active list.
- If the active list grows much larger than the inactive list, the oldest pages at the tail of the active list are demoted back to the inactive list.
- Pages are evicted from the tail of the inactive list.
This means one-time-access pages cycle through the inactive list and are evicted without ever polluting the active list.
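The same workload behaves differently under a two-list scheme. This is a simplified sketch, not the kernel's algorithm: the class TwoListCache is hypothetical, and the balance rule (the active list may hold at most half the capacity) is an assumption chosen to keep the example short.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy active/inactive lists; each OrderedDict keeps its oldest entry first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()
        self.active = OrderedDict()

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)        # already hot: refresh position
        elif page in self.inactive:
            del self.inactive[page]              # second access: promote
            self.active[page] = True
        else:
            self.inactive[page] = True           # first access: inactive list
        self._balance_and_evict()

    def _balance_and_evict(self):
        # Assumed balance rule for this sketch only.
        while len(self.active) > self.capacity // 2:
            page, _ = self.active.popitem(last=False)   # demote oldest active page
            self.inactive[page] = True
        while len(self.active) + len(self.inactive) > self.capacity:
            self.inactive.popitem(last=False)    # evict from the inactive tail


cache = TwoListCache(capacity=4)
for _ in range(3):
    for hot in ("hot-a", "hot-b"):
        cache.access(hot)                        # accessed repeatedly: promoted
for i in range(4):
    cache.access(f"scan-{i}")                    # accessed once: stays inactive
```

The scan pages only ever compete with each other for inactive slots, so the two hot pages survive.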
Linux Page Cache Internals
The kernel represents the page cache of each file with an address_space structure:
file → inode → address_space → xarray of pages
                     ↑
       one or more vm_area_struct (VMAs)
Key relationships:
- One address_space per file (per inode)
- One file can be mapped into multiple VMAs (different processes, different offsets)
- Each VMA is covered by page table entries in its owning process, and those entries point at the same physical pages held in the address_space
Recent kernels (4.20 and later) use the XArray, which replaced the radix tree, to index pages within an address_space by file offset.
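The sharing of physical pages between mappings can be observed from user space. The sketch below assumes a Unix-like system, where Python's mmap.mmap defaults to a shared mapping; it maps one scratch file twice and checks that a store through one mapping is visible through the other.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Create a small two-page scratch file to map.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\0" * (2 * PAGE))
    # Two independent shared mappings of the same file.
    view_a = mmap.mmap(fd, 2 * PAGE)
    view_b = mmap.mmap(fd, 2 * PAGE)
    view_a[PAGE:PAGE + 5] = b"hello"          # store through the first mapping
    seen_by_b = bytes(view_b[PAGE:PAGE + 5])  # load through the second mapping
    view_a.close()
    view_b.close()
finally:
    os.close(fd)
    os.unlink(path)
```

No read() or write() call moves the data between the two views; both simply resolve to the same page in the file's page cache.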
Page Fault Handling
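When a process accesses a virtual address that has no valid PTE, or whose PTE forbids the attempted access, the CPU raises a page fault. The architecture-specific fault handler looks up the faulting VMA and calls handle_mm_fault(), which walks the page tables down to handle_pte_fault (mm/memory.c). That function inspects the PTE and the VMA and dispatches to a fault handler: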
| Handler | Trigger |
|---|---|
| do_anonymous_page | No PTE, no backing file (heap/stack) |
| filemap_fault | PTE absent but VMA is backed by a file |
| do_wp_page | PTE is read-only but VMA is writable → Copy-on-Write |
| do_swap_page | Page was swapped out |
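The dispatch decision in the table above can be written as a small function. This is a deliberately simplified model: ToyPTE, ToyVMA, and pick_handler are hypothetical names, and the real kernel checks considerably more state (permissions, huge pages, NUMA hints, and so on).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPTE:
    present: bool = False
    writable: bool = False
    swapped: bool = False       # non-present entry that records a swap location


@dataclass
class ToyVMA:
    writable: bool
    file: Optional[str] = None  # None models an anonymous mapping


def pick_handler(pte, vma, is_write):
    """Return the name of the handler the table associates with this fault."""
    if not pte.present:
        if pte.swapped:
            return "do_swap_page"
        return "filemap_fault" if vma.file else "do_anonymous_page"
    if is_write and not pte.writable and vma.writable:
        return "do_wp_page"
    return "other (not covered here)"


heap = pick_handler(ToyPTE(), ToyVMA(writable=True), is_write=True)
mapped = pick_handler(ToyPTE(), ToyVMA(writable=False, file="libc.so"), is_write=False)
cow = pick_handler(ToyPTE(present=True), ToyVMA(writable=True), is_write=True)
swap = pick_handler(ToyPTE(swapped=True), ToyVMA(writable=True), is_write=False)
```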
File-Mapped Page Fault (filemap_fault)
Occurs when the PTE is empty (nothing mapped yet) but the VMA permits the access and has an associated file (vm_file):
- Look up the faulting page offset in the file's address_space.
- Cache hit: map the existing page cache page into the PTE.
- Cache miss: call mapping->a_ops->readpage(file, page) (replaced by read_folio in kernels 5.19 and later) to load it from disk, then map it.
This is the mechanism that makes mmap() work: pages appear on demand as you touch them.
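Demand paging of a mapped file can be observed with the fault counters that getrusage exposes. The sketch below assumes a Unix-like system; the helper names fault_count and touch_mapped_pages are illustrative. No particular fault count is asserted, because the kernel may map several neighbouring cached pages in a single fault.

```python
import mmap
import os
import resource
import tempfile


def fault_count():
    """Minor plus major page faults charged to this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt + usage.ru_majflt


def touch_mapped_pages(path, npages):
    """Map a file read-only, read one byte per page, and report faults taken."""
    with open(path, "rb") as f:
        view = mmap.mmap(f.fileno(), npages * mmap.PAGESIZE, prot=mmap.PROT_READ)
        before = fault_count()
        first_bytes = bytes(view[i * mmap.PAGESIZE] for i in range(npages))
        after = fault_count()
        view.close()
    return first_bytes, after - before


# Build an 8-page scratch file, then touch each page through the mapping.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * (8 * mmap.PAGESIZE))
os.close(fd)
data, faults = touch_mapped_pages(path, 8)
os.unlink(path)
```

Creating the mapping is cheap because no pages are mapped until the loop touches them.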
Copy-on-Write (do_wp_page)
Triggered when a PTE is marked read-only but the VMA allows writes. This mismatch means CoW is in effect (e.g., after fork()):
- Allocate a new physical page.
- Copy the content of the original page into the new page.
- Update the PTE to point to the new page.
- Flush the TLB entry for the address.
The original page is unaffected; the child (or parent) now has an independent copy.
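The observable effect of CoW can be reproduced without fork() by using a private file mapping, which relies on the same copy-on-write machinery. This sketch assumes a Unix-like system where mmap.MAP_PRIVATE is available. Reading the page first is intended to get it mapped read-only, so that the later store would take the write-protect path described above.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"original" + b"\0" * (mmap.PAGESIZE - 8))
    # MAP_PRIVATE: stores go to a private copy, never to the file.
    private = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_PRIVATE,
                        prot=mmap.PROT_READ | mmap.PROT_WRITE)
    _ = private[0]                            # read first: page is mapped read-only
    private[0:8] = b"modified"                # write: kernel copies the page
    in_mapping = bytes(private[0:8])
    os.lseek(fd, 0, os.SEEK_SET)
    in_file = os.read(fd, 8)                  # the file still has the old bytes
    private.close()
finally:
    os.close(fd)
    os.unlink(path)
```

The mapping sees the new bytes while the file, and any other process mapping it, still sees the original page.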
Flusher Daemon
Write-back means RAM can diverge from disk. The flusher daemon (multiple threads) is responsible for syncing dirty pages back to disk.
Writeback is triggered by three conditions:
| Trigger | Details |
|---|---|
| Memory pressure | Free memory drops below a threshold → wakeup_flusher_threads() is called; threads write until memory recovers |
| Age threshold | Dirty data older than a configurable age (vm.dirty_expire_centisecs) is written out the next time the flusher threads wake up (every vm.dirty_writeback_centisecs) |
| Explicit sync | A process calls sync() or fsync() |
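The explicit sync trigger is the one an application controls directly. The sketch below shows the usual pattern from Python; the helper name durable_write is hypothetical. Note the two separate steps: flush() only moves bytes from the process into the page cache, while os.fsync() asks the kernel to write that file's dirty pages to storage.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data, then block until the kernel reports it has been written out."""
    with open(path, "wb") as f:
        f.write(data)          # lands in Python's userspace buffer
        f.flush()              # hands the bytes to the kernel (dirty page cache pages)
        os.fsync(f.fileno())   # requests writeback of this file's dirty pages now


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "journal.bin")
    durable_write(target, b"committed record\n")
    with open(target, "rb") as f:
        stored = f.read()
```

Without the fsync call the data would still reach disk eventually, through one of the other two triggers in the table.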
The threshold that triggers background writeback is tunable via:
/proc/sys/vm/dirty_background_ratio
This is the percentage of available memory (free plus reclaimable pages, not total installed RAM) that may be dirty before the flusher threads wake up and start background writeback.
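To get a feel for the numbers, the tunable can be read and turned into a byte figure. This is a rough sketch: the helper names are hypothetical, the memory figure the percentage applies to is passed in by the caller rather than computed the way the kernel does, and the reader returns None on systems where the file does not exist.

```python
def read_dirty_background_ratio(path="/proc/sys/vm/dirty_background_ratio"):
    """Return the tunable as an int, or None if the file is unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def background_threshold_bytes(reference_bytes, ratio_percent):
    """Approximate dirty-byte level at which background writeback would start."""
    return reference_bytes * ratio_percent // 100


ratio = read_dirty_background_ratio()
example = background_threshold_bytes(8 * 1024**3, 10)   # 8 GiB at 10 percent
```

With 8 GiB as the reference figure and a ratio of 10, background writeback would begin at roughly 0.8 GiB of dirty data.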
Key Takeaways
- The page cache is Linux's unified disk buffer, using write-back policy for best throughput.
- Two-list (active/inactive) eviction solves LRU's weakness against one-time access patterns by only promoting pages that are accessed more than once.
- Every file's page cache is represented by an address_space, which can be shared across multiple VMAs.
- Page faults are the mechanism by which virtual addresses gain physical backing; different handlers cover anonymous, file-backed, CoW, and swap scenarios.
- Flusher threads periodically drain dirty pages to disk based on memory pressure, age, or explicit sync calls.