File Systems and Block I/O

Why This Matters

Every time a program reads or writes a file, the kernel must translate that operation into physical disk accesses—correctly, quickly, and without losing data if the power fails mid-write. This module explains two complementary subsystems: the ext file system family (how data is organized on disk) and the block I/O layer (how the kernel orchestrates the actual hardware transfers). Together they sit at the bottom of the storage stack, and understanding them reveals why choices like journaling, I/O scheduling, and request merging exist.


A Brief History of Unix File Systems

UFS — The Original (1974)

Dennis Ritchie and Ken Thompson designed the original Unix File System, later called UFS. The first Linux file system (ext) and Minix FS used a very similar layout. The fundamental weakness was seek latency: inodes and their associated data blocks were scattered across the entire disk, forcing the read/write head to travel long distances during ordinary file access.

FFS — Fast File System (BSD, 1984)

Marshall Kirk McKusick and colleagues responded with FFS, the file system of BSD UNIX. The key idea is the cylinder group: one or more consecutive disk cylinders treated as a unit.


The ext Family

ext2

ext2 is the second extended Linux file system, directly inspired by FFS. Its analog of the cylinder group is the block group.

On-disk layout:

Each block group contains:

Structure Purpose
Superblock Global metadata: block size, total inode/block counts, free block counts. Each block group keeps a copy for crash recovery.
Group Descriptor Per-group metadata: free block count, free inode count, number of directories in the group.
Block Bitmap One bit per block — is it in use?
Inode Bitmap One bit per inode — is it allocated?
Inode Table The actual inode structures for this group.
Data Blocks File data and directory entries.

ext3

ext3 adds journaling to ext2 to guarantee consistency after a crash or sudden power loss.

Journaling = Write-Ahead Logging (WAL): before overwriting any on-disk structure, the kernel first writes a description of the intended change to a well-known journal region. If power fails mid-operation, the journal is replayed on the next mount to bring the file system to a consistent state.

The journal is implemented as a fixed-size, circular array stored in a special file with a hard-coded inode number. Each transaction in the journal consists of:

  1. A begin marker
  2. The log data (the pending changes)
  3. An end marker (commit)

If a crash happens before the end marker is written, that incomplete transaction is discarded on replay. This "all-or-nothing" guarantee is the same atomic-commit model used in relational databases.

ext4

ext4 is the current default for most Linux distributions (and Android). It extends ext3 with:

Feature ext3 ext4
Addressing width 32-bit 64-bit
Max file size 16 GB 16 TB
Max file system size 2³² blocks 2⁶⁴ blocks
Block mapping Indirect block map Extent tree (contiguous runs)
Directory indexing Linked list HTree (hash tree, B-tree variant)
Max subdirectories 32,000 64,000

The extent tree replaces the old multi-level indirect block pointer scheme with a structure that records contiguous runs of blocks (extents). This is far more efficient for large files because a single extent entry can describe millions of bytes.


The Linux Block I/O Layer

The block I/O layer is the kernel subsystem that sits between file systems (and other block device users) and the actual hardware drivers. It is more complex and performance-critical than character device I/O because reads and writes to rotating media are expensive and must be carefully ordered.

Anatomy of a Block Device

Term Definition
Sector The smallest addressable unit, a physical property of the device. Typically 512 bytes (HDD), 2 KB (CD-ROM), or 4 KB (SSD).
Block The unit of file system access — a contiguous group of sectors. Usually 4 KB or 8 KB. Defined by the file system, not the hardware.

Key rule: the block size is always a multiple of the sector size and never larger than the page size.

Buffers and Buffer Heads

When a file system reads or writes a block, that block is cached in memory as a buffer. The associated metadata (which device, which block number, its state — uptodate, dirty, locked, etc.) is stored in a buffer head (struct buffer_head). The buffer head is the kernel's handle to an in-memory block.

The bio Structure

struct bio represents a single block I/O operation in flight.

struct bio {
    sector_t        bi_sector;   // first sector of the I/O
    struct block_device *bi_bdev;
    unsigned short  bi_vcnt;     // number of bio_vec segments
    struct bio_vec  *bi_io_vec;  // the array of segments
    ...
};

A bio can hold multiple memory segments (a bio_vec array). Each segment is a page + offset + length tuple. This scatter-gather design means the data for a single I/O request does not need to be contiguous in physical memory—only the on-disk range needs to be contiguous.

Request Queues

Block devices maintain a request queue (struct request_queue, defined in include/linux/blkdev.h) to hold pending I/O operations:

  1. File systems (or other callers) add requests to the queue.
  2. The block device driver pulls requests from the queue and submits them to hardware.

A single struct request can cover multiple consecutive disk blocks, so it may contain one or more bio objects chained together.


I/O Schedulers

Sending requests to the disk in arrival order is inefficient — it causes many random seeks. The kernel solves this by merging (combining adjacent requests) and sorting (reordering requests by on-disk location) in the request queue. The policy that governs these operations is the I/O scheduler.

The I/O scheduler virtualizes the disk the same way the process scheduler virtualizes the CPU.

Linus Elevator (default until kernel 2.4)

The Linus elevator applies four rules in order when a new request arrives:

  1. Front/back merge: If an existing request covers an adjacent on-disk sector, merge the new request into it.
  2. Anti-starvation: If any request in the queue is old enough (approaching a starvation threshold), insert the new request at the tail of the queue to let the old request proceed.
  3. Sorted insertion: If a suitable position exists that keeps the queue sorted by on-disk sector, insert there.
  4. Tail insertion: Otherwise, insert at the tail.

Goal: minimize total disk seek time (best global throughput). Weakness: a request to a far sector can starve if traffic to nearby sectors is heavy.

Deadline Scheduler

Adds per-request expiration timestamps on top of sorting:

Request type Deadline
Read now + 0.5 seconds
Write now + 5 seconds

Reads have a much tighter deadline because processes typically block waiting for reads, while writes can usually proceed asynchronously. If any request's deadline expires, it is promoted to the front of the queue, guaranteeing forward progress even for starved requests.

Noop Scheduler

Performs no reordering — only merges strictly sequential requests. Intended for devices where seeking is irrelevant (flash/SSD, RAM disks, virtual disks). On such devices, sorting by on-disk position wastes CPU time for zero benefit.

Configuring the I/O Scheduler

Select the scheduler at boot time with a kernel parameter:

elevator=<cfq|deadline|noop>

Or change it per device at runtime:

# See current scheduler
cat /sys/block/<device>/queue/scheduler

# Switch to noop
echo noop > /sys/block/<device>/queue/scheduler

Key Takeaways

Practice

  1. What is the smallest addressable unit of a block device?
  2. What problem did the Fast File System (FFS) solve that UFS had?
  3. In ext3 journaling, what is the correct sequence of actions for a single transaction?
  4. What does the struct bio in the Linux block I/O layer represent, and what is the significance of its scatter-gather design?
  5. The Linus elevator checks four conditions in order when inserting a new I/O request. Which condition is checked second?
  6. Why does the ext4 extent tree improve performance compared to ext2's indirect block map for large files?
  7. Which I/O scheduler should you choose for a solid-state drive (SSD), and why?
  8. The Deadline I/O scheduler assigns read requests a deadline of 0.5 seconds but write requests a deadline of 5 seconds. Explain why reads are given a much tighter deadline than writes, and what happens when a deadline expires.
  9. A system running ext2 loses power while writing a large file. When it reboots, the file system is inconsistent. Explain specifically what ext3 journaling adds to prevent this, and why an incomplete journal transaction does not cause corruption.
  10. Describe the role of the request queue (struct request_queue) in the Linux block I/O path, and explain how multiple bio objects can be associated with a single struct request.