Page Cache and Page Fault

Why This Matters

Disk I/O is orders of magnitude slower than memory access — a disk seek takes milliseconds while a RAM read takes nanoseconds. Without caching, every read() or write() would stall waiting for physical storage. The Linux page cache is the kernel's primary mechanism for bridging this gap, and understanding it is essential for diagnosing performance problems, reasoning about memory pressure, and understanding how page faults connect virtual memory to real storage.


The Page Cache

The page cache stores physical pages in RAM whose content comes from disk (the backing store).

The cache grows dynamically to consume memory that is otherwise idle, and shrinks when processes or the kernel need more RAM. This is why the free command often reports most RAM as cache (the buff/cache column) on a busy Linux system. That is intentional, not waste.
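
To see how much RAM the page cache currently holds, one option is to read /proc/meminfo directly. The sketch below assumes a Linux system that exposes the MemTotal and Cached fields; the helper names parse_meminfo and cached_fraction are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cached_fraction(fields):
    """Fraction of MemTotal that the 'Cached' field accounts for."""
    return fields["Cached"] / fields["MemTotal"]


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print(f"page cache: {cached_fraction(info):.1%} of RAM")
    except FileNotFoundError:
        print("/proc/meminfo is not available on this system")
```

A high percentage here is usually a sign of a healthy, busy system rather than a memory leak, since cached pages can be reclaimed when applications need the RAM.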

Buffered I/O Flow

Every read() and write() that does not use O_DIRECT goes through the page cache:

| Situation  | What Happens |
|------------|--------------|
| Cache hit  | Data is already in a page cache page; copy directly to/from user memory |
| Cache miss | VFS asks the concrete filesystem (e.g., ext4) to read the block from disk, populate the page cache, then serve the request |
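
The hit/miss distinction can be sketched with a toy model. This is not kernel code: ToyPageCache is a hypothetical class, the "disk" is just a bytes object, and the 4 KiB page size is an assumption.

```python
PAGE_SIZE = 4096  # assumed page size


class ToyPageCache:
    """Toy model of the buffered read path; not kernel code."""

    def __init__(self, disk):
        self.disk = disk      # stand-in for the backing store
        self.pages = {}       # page index -> page contents
        self.hits = 0
        self.misses = 0

    def read(self, offset, length):
        out = bytearray()
        end = offset + length
        while offset < end:
            index, start = divmod(offset, PAGE_SIZE)
            page = self.pages.get(index)
            if page is None:
                # Cache miss: fetch the whole page from "disk" and keep it.
                self.misses += 1
                base = index * PAGE_SIZE
                page = self.disk[base:base + PAGE_SIZE]
                self.pages[index] = page
            else:
                # Cache hit: serve the request from memory.
                self.hits += 1
            chunk = page[start:start + (end - offset)]
            if not chunk:
                break  # read past the end of the file
            out += chunk
            offset += len(chunk)
        return bytes(out)
```

Note that a miss always loads a full page even when the caller asked for a few bytes, which is why a later read of a neighbouring offset becomes a hit.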

Write Caching Policies

Three strategies exist for handling writes:

| Policy        | Behavior | Trade-off |
|---------------|----------|-----------|
| No-write      | Writes bypass the cache entirely | Cache stays clean; poor write performance |
| Write-through | Writes update the cache and immediately write to disk | Cache always coherent; high write latency |
| Write-back    | Writes update the cache; disk write is deferred | Best performance; risk of data loss on crash |

Linux uses write-back. When a page is written, it is marked dirty (using a tag in the radix tree / xarray). Dirty pages are eventually flushed to disk by the flusher daemon. This exploits temporal locality in writes: if the same page is written many times in quick succession, only the version that is current at flush time hits the disk.
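
The coalescing effect of write-back can be sketched with a toy cache that tracks a dirty set. ToyWriteBackCache and its disk_writes counter are hypothetical names used only for this illustration; the kernel tracks dirtiness per page with tags, not with a Python set.

```python
class ToyWriteBackCache:
    """Toy write-back cache: writes touch memory, flush() touches 'disk'."""

    def __init__(self):
        self.pages = {}        # page index -> latest contents
        self.dirty = set()     # indexes modified since the last flush
        self.disk = {}         # stand-in for the backing store
        self.disk_writes = 0   # number of pages written to "disk"

    def write(self, index, data):
        # Update only the cached copy and remember that it is dirty.
        self.pages[index] = data
        self.dirty.add(index)

    def flush(self):
        # Roughly what a flusher thread does: write each dirty page once.
        for index in sorted(self.dirty):
            self.disk[index] = self.pages[index]
            self.disk_writes += 1
        self.dirty.clear()
```

Writing the same page three times and then flushing results in one disk write. The same model also shows the risk: anything still in the dirty set when the machine loses power never reaches the disk.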


Cache Eviction

The page cache is smaller than the disk, so pages must eventually be evicted to make room.

Naive LRU

The simplest policy is Least Recently Used (LRU): track the last access time for every page and evict the oldest. LRU works well for repeated-access patterns, but fails for streaming workloads — files that are read once and never again flood the LRU list and push out genuinely hot pages.
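
The streaming failure is easy to reproduce in a small simulation. The function lru_survivors and the page names below are invented for the example; the cache holds four pages, the workload touches two hot pages repeatedly and then scans four pages once.

```python
from collections import OrderedDict


def lru_survivors(accesses, capacity):
    """Replay page accesses through a plain LRU cache; return the pages left."""
    cache = OrderedDict()  # least recently used entry first
    for page in accesses:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return set(cache)


hot = ["h1", "h2"]
scan = [f"s{i}" for i in range(4)]
workload = hot * 5 + scan  # hot set accessed repeatedly, then a one-time scan
survivors = lru_survivors(workload, capacity=4)
```

After the scan, survivors contains only the scanned pages: both hot pages were evicted even though they had been used ten times between them.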

The Two-List Strategy

Linux solves this with two lists:

| List          | Description | Eviction Eligible? |
|---------------|-------------|--------------------|
| Inactive list | Pages accessed recently but not confirmed hot | Yes |
| Active list   | Pages confirmed hot (accessed more than once) | No |

How pages move:

  1. A page accessed for the first time goes onto the inactive list, and its page-table access bit is cleared so future accesses can be detected.
  2. If the page is accessed again while still on the inactive list, it is promoted to the active list.
  3. If the active list grows much larger than the inactive list, the oldest pages are demoted from the tail of the active list back to the inactive list.
  4. Pages are evicted from the tail of the inactive list.

This means one-time-access pages cycle through the inactive list and are evicted without ever polluting the active list.
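
A toy model of the two lists shows the difference on a scan-heavy workload. TwoListCache is a hypothetical class, and its rebalancing rule (keep the active list at no more than half of capacity) is an assumption made for the sketch, not the kernel's actual heuristic. In this model new entries join at the head of a list and the oldest entry sits at the tail.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy model of the inactive/active lists; not the kernel's algorithm."""

    def __init__(self, capacity):
        self.capacity = capacity
        # In each OrderedDict the first key is the tail (oldest entry).
        self.inactive = OrderedDict()
        self.active = OrderedDict()

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)      # already hot; refresh position
        elif page in self.inactive:
            del self.inactive[page]            # second access: promote
            self.active[page] = True
        else:
            if len(self.inactive) + len(self.active) >= self.capacity:
                self._evict()
            self.inactive[page] = True         # first access: inactive list
        self._rebalance()

    def _rebalance(self):
        # Assumed rule: demote from the active tail when active grows too large.
        while len(self.active) > self.capacity // 2:
            page, _ = self.active.popitem(last=False)
            self.inactive[page] = True

    def _evict(self):
        if self.inactive:
            self.inactive.popitem(last=False)  # evict from the inactive tail
        else:
            self.active.popitem(last=False)    # fallback if inactive is empty


cache = TwoListCache(capacity=4)
for page in ["h1", "h2"] * 5 + [f"s{i}" for i in range(4)]:
    cache.access(page)
```

With the same workload that defeats plain LRU (two hot pages, then a four-page scan), the hot pages end up on the active list and the scan only displaces other scanned pages.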


Linux Page Cache Internals

The kernel represents the page cache of each file with an address_space structure:

file → inode → address_space → xarray of pages
                      ↑
              one or more vm_area_struct (VMAs)

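Key relationships: each inode has one address_space, which indexes that file's cached pages by file offset, and one or more VMAs can map the same address_space (for example, when several processes mmap() the same file).

Kernels since 4.20 index the pages within an address_space with an xarray; earlier kernels used a radix tree.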
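
The indexing scheme can be sketched with plain data classes. These are toy stand-ins, not the kernel structures: the field names mapping, pages and vmas are chosen for the example, a dict plays the role of the xarray, and 4 KiB pages are assumed.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12              # assumes 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


@dataclass
class AddressSpace:
    """Toy stand-in for struct address_space."""
    pages: dict = field(default_factory=dict)  # page index -> page (the "xarray")
    vmas: list = field(default_factory=list)   # mappings that share these pages


@dataclass
class Inode:
    """Toy stand-in for struct inode: one address_space per file."""
    mapping: AddressSpace = field(default_factory=AddressSpace)


def page_index(file_offset):
    """Index of the page that holds a given byte offset within the file."""
    return file_offset >> PAGE_SHIFT


def find_page(inode, file_offset):
    """Return the cached page for an offset, or None on a cache miss."""
    return inode.mapping.pages.get(page_index(file_offset))
```

Under these assumptions byte 4095 falls in page 0 and byte 4096 in page 1, so a lookup only needs the offset shifted right by PAGE_SHIFT.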


Page Fault Handling

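When a process accesses a virtual address that has no valid PTE, or whose PTE does not permit the access, the CPU raises a page fault. The architecture-specific fault handler identifies the faulting VMA and calls handle_mm_fault(), which walks the page tables and reaches handle_pte_fault (mm/memory.c). That function inspects the PTE and the VMA and dispatches to a fault handler: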

| Handler           | Trigger |
|-------------------|---------|
| do_anonymous_page | No PTE, no backing file (heap/stack) |
| filemap_fault     | PTE absent but VMA is backed by a file |
| do_wp_page        | PTE is read-only but VMA is writable → Copy-on-Write |
| do_swap_page      | Page was swapped out |
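
The dispatch decision can be written out as a small function. This is a deliberately simplified model, not the logic in mm/memory.c: Pte, Vma and pick_handler are hypothetical names, and the real kernel handles more cases (and, for file-backed VMAs, reaches filemap_fault through the VMA's fault operation rather than calling it directly).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pte:
    """Toy page-table entry."""
    writable: bool = False
    swapped: bool = False   # entry records a swap location, not a RAM page


@dataclass
class Vma:
    """Toy virtual memory area."""
    file_backed: bool
    writable: bool


def pick_handler(pte: Optional[Pte], vma: Vma, is_write: bool) -> str:
    """Return the name of the handler a fault would be routed to."""
    if pte is None:
        # No PTE at all: choose by whether the VMA has a backing file.
        return "filemap_fault" if vma.file_backed else "do_anonymous_page"
    if pte.swapped:
        return "do_swap_page"
    if is_write and not pte.writable and vma.writable:
        # Read-only PTE inside a writable VMA: Copy-on-Write.
        return "do_wp_page"
    return "other"  # cases this simplified model does not cover
```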

File-Mapped Page Fault (filemap_fault)

This fault occurs when the PTE is missing (---) but the VMA is accessible and has an associated file (vm_file):

  1. Look up the file's address_space for the faulting page offset.
  2. Cache hit: map the existing page cache page into the PTE.
  3. Cache miss: call mapping->a_ops->readpage(file, page) to load it from disk, then map it. (Kernels from 5.19 onward name this operation read_folio.)

This is the mechanism that makes mmap() work: pages appear on demand as you touch them.
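
From user space, demand paging looks like ordinary memory access. The sketch below maps a small temporary file and reads the first byte of its second page; the helper names are invented for the example, and the file layout (one page of "A" followed by one page of "B") is arbitrary.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def make_demo_file():
    """Create a two-page temporary file: page 0 is 'A's, page 1 is 'B's."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"A" * PAGE + b"B" * PAGE)
    return path


def read_second_page_byte(path):
    """Map the file read-only and touch the first byte of its second page."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Creating the mapping does not by itself guarantee that pages
            # are mapped; the kernel can fill them in when they are touched.
            return m[PAGE]
```

Because the demo file was just written, its pages are most likely still in the page cache, so the access would typically be served as a cache hit rather than a disk read.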

Copy-on-Write (do_wp_page)

Triggered when a PTE is marked read-only but the VMA allows writes. This mismatch means CoW is in effect (e.g., after fork()):

  1. Allocate a new physical page.
  2. Copy the content of the original page into the new page.
  3. Update the PTE to point to the new page and mark it writable.
  4. Flush the TLB entry for the address.

The original page is unaffected; the child (or parent) now has an independent copy.
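
Copy-on-Write can also be observed without fork(). The sketch below swaps in a private file mapping (Python's mmap.ACCESS_COPY) as the demonstration: a write changes what the process sees but never reaches the file. Which kernel handler services the write depends on kernel version and on whether the page was read first, so treat this as an illustration of the CoW effect rather than a trace of do_wp_page. The function name is made up for the example.

```python
import mmap
import os
import tempfile


def private_write_leaves_file_untouched():
    """Write through a private mapping and report what each view contains."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.write(b"original" + b"\0" * (mmap.PAGESIZE - 8))
            f.flush()
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
                before = bytes(m[:8])   # read through the mapping first
                m[:8] = b"modified"     # write: this mapping gets its own copy
                after = bytes(m[:8])
            f.seek(0)
            on_disk = f.read(8)         # the file itself is unchanged
        return before, after, on_disk
    finally:
        os.remove(path)
```

The function returns the mapping's contents before and after the write, plus the bytes read back from the file, which stay at their original value.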


Flusher Daemon

Write-back means RAM can diverge from disk. The flusher daemon (multiple threads) is responsible for syncing dirty pages back to disk.

Writeback is triggered by three conditions:

| Trigger         | Details |
|-----------------|---------|
| Memory pressure | Free memory drops below a threshold → wakeup_flusher_threads() is called; threads write until memory recovers |
| Age threshold   | Dirty data that has not been written after a configurable interval is flushed |
| Explicit sync   | A process calls sync() or fsync() |
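
The explicit-sync trigger is the one applications control directly. A minimal sketch, assuming the caller wants a single file's data to be flushed before continuing; durable_write is a hypothetical helper name.

```python
import os


def durable_write(path, data):
    """Write data, then ask the kernel to flush this file's dirty pages."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # A buffered write lands in the page cache as dirty pages.
            written = os.write(fd, view)
            view = view[written:]
        # Explicit sync: block until the file's dirty pages are written back.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Without the os.fsync() call the write would still succeed, but the data would sit in RAM until one of the other two triggers fired.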

The threshold that triggers background writeback is tunable via:

/proc/sys/vm/dirty_background_ratio

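This is the percentage of total available memory (free pages plus reclaimable pages, not all installed RAM) that may be dirty before flusher threads wake up and start background writeback.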
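
The tunable can be read like any other file. The sketch below assumes a Linux /proc layout; read_vm_tunable and background_threshold_kib are hypothetical helpers, and the dirtyable-memory figure passed to the second one is an input supplied by the caller, not the kernel's own accounting.

```python
def read_vm_tunable(name, root="/proc/sys/vm"):
    """Return an integer tunable from /proc/sys/vm, or None if unreadable."""
    try:
        with open(f"{root}/{name}") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def background_threshold_kib(ratio_percent, dirtyable_kib):
    """Approximate dirty level (KiB) at which background writeback starts."""
    return dirtyable_kib * ratio_percent // 100


if __name__ == "__main__":
    ratio = read_vm_tunable("dirty_background_ratio")
    if ratio is None:
        print("dirty_background_ratio is not readable on this system")
    else:
        # 8 GiB of dirtyable memory is an arbitrary example figure.
        kib = background_threshold_kib(ratio, 8 * 1024 * 1024)
        print(f"ratio={ratio}% -> about {kib} KiB dirty before writeback")
```

With a ratio of 10 and roughly 8 GiB of dirtyable memory, background writeback would begin at about 800 MiB of dirty data. If dirty_background_bytes has been set to a nonzero value, that byte limit applies instead of the ratio.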


Key Takeaways

Practice

  1. Which property of real-world workloads most directly motivates the existence of the Linux page cache?
  2. Which write caching policy does the Linux page cache use?
  3. A process opens a large log file, reads it sequentially once, and never touches it again. Why does pure LRU eviction perform poorly for this workload, and how does the two-list strategy improve things?
  4. In the two-list strategy, when a page on the inactive list is accessed a second time, what happens?
  5. A process calls mmap() on a file and then reads byte 4096 (the first byte of the second page). The PTE for that page is absent. Which page fault handler is invoked, and what does it do on a cache miss?
  6. Explain what the address_space structure represents in the Linux page cache and describe its relationship to inode, vm_area_struct, and physical pages.
  7. After a fork(), both parent and child share physical pages with read-only PTEs, even though the VMA is marked writable. The child then writes to one of these shared pages. Which handler fires, and what are the exact steps it takes?
  8. Describe the three conditions that trigger dirty page writeback in Linux and explain what /proc/sys/vm/dirty_background_ratio controls.
  9. A process opens a file with O_DIRECT. What is different about how reads are handled compared to normal buffered I/O?