Buffer Cache and Write-Ahead Logging

Why This Matters

A single user-visible operation like echo hello > test.txt translates into several separate disk writes inside the kernel: allocate an inode, update it, add a directory entry, write data blocks. Each write touches a different disk location. If power fails between any two of them, the filesystem ends up in a partially-updated state — an orphaned inode, a directory entry with no inode, or corrupted metadata. This module explains how xv6 prevents that using a Write-Ahead Log (WAL).


The Problem: Partial Writes

When create() runs inside sys_open(), it executes roughly this sequence:

ialloc()    — allocate a free inode
iupdate()   — write the inode to disk        ← power failure here?
dirlink()   — add a directory entry

If the machine crashes after iupdate() but before dirlink(), disk holds a live inode that no directory entry points to. The filesystem is now inconsistent. The fix: make the whole sequence atomic — either all writes commit, or none do.


Write-Ahead Logging (WAL)

WAL groups all the disk writes belonging to one logical operation into a transaction. The core guarantee is:

Write a description of what you are about to do before you do it. On recovery, replay any complete description; discard any incomplete one.

In xv6, every filesystem system call wraps its disk work in begin_op() / end_op():

begin_op();
ilock(f->ip);
writei(f->ip, addr + i, f->off, n1);
iunlock(f->ip);
end_op();

Key Data Structures

buf (buf.h)

An in-memory copy of one 512-byte disk block. Fields include the block number, a reference count, flags (including B_DIRTY), and the data array.

bcache (bio.c)

A doubly-linked list of buf structs that acts as an LRU disk-block cache. Before issuing a disk read the kernel always checks bcache first; a hit avoids the I/O entirely.

logheader (log.c)

Tracks which blocks are part of the current uncommitted transaction:

struct logheader {
  int n;               // number of logged blocks
  int block[LOGSIZE];  // disk block numbers of those blocks
};

logheader exists both in memory (fast access) and as a dedicated header block on disk (survives crashes).


How a Write Works: log_write()

Instead of writing a dirty buffer directly to its home disk block, xv6 calls log_write(), which:

  1. Sets the B_DIRTY flag on the buffer — this pins it in the cache (it cannot be evicted).
  2. Searches the log header; if the same block is already logged, reuses that slot (absorption).
  3. Increments n in the log header (in memory only at this point).
  4. Leaves the data sitting in the buffer cache — no disk I/O yet.

This defers all expensive disk writes until commit time.


Transaction Lifecycle

begin_op()

acquire(&log.lock);
while (1) {
    if (another thread is committing) → sleep
    if (not enough log slots)        → sleep
    log.outstanding++;
    release(&log.lock);
    break;
}

log.outstanding counts how many system calls are currently inside a transaction.

end_op()

log.outstanding--;
if (log.outstanding == 0)
    commit();

The last caller to finish triggers the commit.

commit() — step by step

Step What happens Why
1. write_log() Copy each dirty buffer from the cache to the log area on disk Persist the data safely before touching home blocks
2. write_head() Write the log header (with n and block numbers) to disk The commit point — once this is on disk the transaction is permanent
3. Install writes Copy each block from the log area to its home disk block Apply the changes to their actual locations
4. Clear log Set n = 0, write header to disk Mark the log as empty; recovery will ignore it

The commit point is step 2. If power fails before write_head(), recovery sees n = 0 (or a stale header with wrong data) and discards the log — as if the operation never happened. If power fails after write_head(), recovery replays all the logged blocks — the operation fully completes.


Group Commit

Because commit() only fires when log.outstanding reaches zero, multiple concurrent system calls can accumulate dirty buffers under a single commit. This is called group commit:

Transaction X:  begin_op → dirty blocks 2, 3 → end_op  (outstanding: 1→0 → commit!)
Transaction Y:  begin_op → dirty blocks 2, 3 → end_op  (outstanding: 2→1 → waits)

When both X and Y overlap, Y's end_op decrements outstanding to 1, not 0, so no commit fires yet. X's end_op gets it to 0 and commits all pending dirty blocks in one pass. This amortizes the disk-write overhead of a commit across many system calls.


Recovery

On boot, the xv6 log layer reads the on-disk log header. If n > 0, a commit was in progress or completed before the power failed. Recovery replays the log: it re-copies all n logged blocks from the log area to their home locations, then clears the log. This is safe because the writes are idempotent — replaying a block that was already installed just overwrites it with the same data.


Key Takeaways

Practice

  1. When the create() function is called in the xv6 kernel, what is its first primary task regarding path resolution?
  2. What is the purpose of the dirlink() call inside create()?
  3. During the xv6 commit process, what is the 'critical moment' that permanently records a transaction?
  4. Why does the xv6 logging layer call log_write() instead of writing dirty data directly to the home disk block?
  5. What does xv6 use the B_DIRTY flag on a buffer cache entry for in the context of logging?
  6. Describe, in order, the four phases of xv6's commit() function and explain why the phases must happen in exactly that order.
  7. Power fails immediately after write_log() completes but before write_head() runs. What happens when the system reboots and the log recovery code runs? Is the filesystem consistent?
  8. Explain what 'group commit' means in xv6 and why it improves performance.
  9. Which xv6 data structure exists both in memory and as a dedicated block on disk, and is checked by recovery code on every boot?
  10. A student proposes skipping the log and writing dirty blocks directly to their home locations, but doing so in a fixed order (inode before directory entry). Explain why this does NOT solve the crash-consistency problem.