Buffer Cache and Write-Ahead Logging
Why This Matters
A single user-visible operation like echo hello > test.txt translates into several separate disk writes inside the kernel: allocate an inode, update it, add a directory entry, write data blocks. Each write touches a different disk location. If power fails between any two of them, the filesystem ends up in a partially-updated state — an orphaned inode, a directory entry with no inode, or corrupted metadata. This module explains how xv6 prevents that using a Write-Ahead Log (WAL).
The Problem: Partial Writes
When create() runs inside sys_open(), it executes roughly this sequence:
ialloc() — allocate a free inode
iupdate() — write the inode to disk ← power failure here?
dirlink() — add a directory entry
If the machine crashes after iupdate() but before dirlink(), disk holds a live inode that no directory entry points to. The filesystem is now inconsistent. The fix: make the whole sequence atomic — either all writes commit, or none do.
Write-Ahead Logging (WAL)
WAL groups all the disk writes belonging to one logical operation into a transaction. The core guarantee is:
Write a description of what you are about to do before you do it. On recovery, replay any complete description; discard any incomplete one.
In xv6, every filesystem system call wraps its disk work in begin_op() / end_op():
begin_op();
ilock(f->ip);
writei(f->ip, addr + i, f->off, n1);
iunlock(f->ip);
end_op();
Key Data Structures
buf (buf.h)
An in-memory copy of one 512-byte disk block. Fields include the block number, a reference count, flags (including B_DIRTY), and the data array.
bcache (bio.c)
A doubly-linked list of buf structs that acts as an LRU disk-block cache. Before issuing a disk read the kernel always checks bcache first; a hit avoids the I/O entirely.
logheader (log.c)
Tracks which blocks are part of the current uncommitted transaction:
struct logheader {
int n; // number of logged blocks
int block[LOGSIZE]; // disk block numbers of those blocks
};
logheader exists both in memory (fast access) and as a dedicated header block on disk (survives crashes).
How a Write Works: log_write()
Instead of writing a dirty buffer directly to its home disk block, xv6 calls log_write(), which:
- Sets the
B_DIRTYflag on the buffer — this pins it in the cache (it cannot be evicted). - Searches the log header; if the same block is already logged, reuses that slot (absorption).
- Increments
nin the log header (in memory only at this point). - Leaves the data sitting in the buffer cache — no disk I/O yet.
This defers all expensive disk writes until commit time.
Transaction Lifecycle
begin_op()
acquire(&log.lock);
while (1) {
if (another thread is committing) → sleep
if (not enough log slots) → sleep
log.outstanding++;
release(&log.lock);
break;
}
log.outstanding counts how many system calls are currently inside a transaction.
end_op()
log.outstanding--;
if (log.outstanding == 0)
commit();
The last caller to finish triggers the commit.
commit() — step by step
| Step | What happens | Why |
|---|---|---|
1. write_log() |
Copy each dirty buffer from the cache to the log area on disk | Persist the data safely before touching home blocks |
2. write_head() |
Write the log header (with n and block numbers) to disk |
The commit point — once this is on disk the transaction is permanent |
| 3. Install writes | Copy each block from the log area to its home disk block | Apply the changes to their actual locations |
| 4. Clear log | Set n = 0, write header to disk |
Mark the log as empty; recovery will ignore it |
The commit point is step 2. If power fails before write_head(), recovery sees n = 0 (or a stale header with wrong data) and discards the log — as if the operation never happened. If power fails after write_head(), recovery replays all the logged blocks — the operation fully completes.
Group Commit
Because commit() only fires when log.outstanding reaches zero, multiple concurrent system calls can accumulate dirty buffers under a single commit. This is called group commit:
Transaction X: begin_op → dirty blocks 2, 3 → end_op (outstanding: 1→0 → commit!)
Transaction Y: begin_op → dirty blocks 2, 3 → end_op (outstanding: 2→1 → waits)
When both X and Y overlap, Y's end_op decrements outstanding to 1, not 0, so no commit fires yet. X's end_op gets it to 0 and commits all pending dirty blocks in one pass. This amortizes the disk-write overhead of a commit across many system calls.
Recovery
On boot, the xv6 log layer reads the on-disk log header. If n > 0, a commit was in progress or completed before the power failed. Recovery replays the log: it re-copies all n logged blocks from the log area to their home locations, then clears the log. This is safe because the writes are idempotent — replaying a block that was already installed just overwrites it with the same data.
Key Takeaways
- A single filesystem syscall spans multiple disk writes; WAL makes them atomic.
log_write()pins a buffer dirty in the cache; no data reaches its home block untilcommit().- The commit point is
write_head()— once the log header is on disk, the transaction is permanent. logheader.nis the only thing that matters for recovery:n = 0→ nothing to replay;n > 0→ replay all logged blocks.- Group commit batches multiple concurrent transactions, reducing disk I/O overhead.
- The buffer cache (
bcache) serves both as a read cache and as the staging area for uncommitted writes.