File Systems and Block I/O

Why This Matters

Every time a program reads or writes a file, the kernel must translate that operation into physical disk accesses—correctly, quickly, and without losing data if the power fails mid-write. This module explains two complementary subsystems: the ext file system family (how data is organized on disk) and the block I/O layer (how the kernel orchestrates the actual hardware transfers). Together they sit at the bottom of the storage stack, and understanding them reveals why choices like journaling, I/O scheduling, and request merging exist.

A Brief History of Unix File Systems

UFS — The Original (1974)

Dennis Ritchie and Ken Thompson designed the original Unix File System, later called UFS. The first Linux file system (ext) and Minix FS used a very similar layout. The fundamental weakness was seek latency: inodes and their associated data blocks were scattered across the entire disk, forcing the read/write head to travel long distances during ordinary file access.

FFS — Fast File System (BSD, 1984)

Marshall Kirk McKusick and colleagues responded with FFS, the file system of BSD UNIX. The key idea is the cylinder group: one or more consecutive disk cylinders treated as a unit.

Inodes and their data blocks are allocated in the same cylinder group, dramatically cutting seek time.
Disk block allocation heuristics further reduce head movement.
Almost every modern in-place-update file system (including ext2/3/4) descends from this design.

The ext Family

ext2

ext2 is the second extended Linux file system, directly inspired by FFS. Its analog of the cylinder group is the block group.

On-disk layout:

The very first block of the partition is reserved for the boot sector (not managed by the file system).
The rest is split into equally-sized block groups stored sequentially.

Each block group contains:

Structure	Purpose
Superblock	Global metadata: block size, total inode/block counts, free block counts. Each block group keeps a copy for crash recovery.
Group Descriptor	Per-group metadata: free block count, free inode count, number of directories in the group.
Block Bitmap	One bit per block — is it in use?
Inode Bitmap	One bit per inode — is it allocated?
Inode Table	The actual inode structures for this group.
Data Blocks	File data and directory entries.

ext3

ext3 adds journaling to ext2 to guarantee consistency after a crash or sudden power loss.

Journaling = Write-Ahead Logging (WAL): before overwriting any on-disk structure, the kernel first writes a description of the intended change to a well-known journal region. If power fails mid-operation, the journal is replayed on the next mount to bring the file system to a consistent state.

The journal is implemented as a fixed-size, circular array stored in a special file with a hard-coded inode number. Each transaction in the journal consists of:

A begin marker
The log data (the pending changes)
An end marker (commit)

If a crash happens before the end marker is written, that incomplete transaction is discarded on replay. This "all-or-nothing" guarantee is the same atomic-commit model used in relational databases.

ext4

ext4 is the current default for most Linux distributions (and Android). It extends ext3 with:

Feature	ext3	ext4
Addressing width	32-bit	64-bit
Max file size	16 GB	16 TB
Max file system size	2³² blocks	2⁶⁴ blocks
Block mapping	Indirect block map	Extent tree (contiguous runs)
Directory indexing	Linked list	HTree (hash tree, B-tree variant)
Max subdirectories	32,000	64,000

The extent tree replaces the old multi-level indirect block pointer scheme with a structure that records contiguous runs of blocks (extents). This is far more efficient for large files because a single extent entry can describe millions of bytes.

The Linux Block I/O Layer

The block I/O layer is the kernel subsystem that sits between file systems (and other block device users) and the actual hardware drivers. It is more complex and performance-critical than character device I/O because reads and writes to rotating media are expensive and must be carefully ordered.

Anatomy of a Block Device

Term	Definition
Sector	The smallest addressable unit, a physical property of the device. Typically 512 bytes (HDD), 2 KB (CD-ROM), or 4 KB (SSD).
Block	The unit of file system access — a contiguous group of sectors. Usually 4 KB or 8 KB. Defined by the file system, not the hardware.

Key rule: the block size is always a multiple of the sector size and never larger than the page size.

Buffers and Buffer Heads

When a file system reads or writes a block, that block is cached in memory as a buffer. The associated metadata (which device, which block number, its state — uptodate, dirty, locked, etc.) is stored in a buffer head (struct buffer_head). The buffer head is the kernel's handle to an in-memory block.

The `bio` Structure

struct bio represents a single block I/O operation in flight.

struct bio {
    sector_t        bi_sector;   // first sector of the I/O
    struct block_device *bi_bdev;
    unsigned short  bi_vcnt;     // number of bio_vec segments
    struct bio_vec  *bi_io_vec;  // the array of segments
    ...
};

A bio can hold multiple memory segments (a bio_vec array). Each segment is a page + offset + length tuple. This scatter-gather design means the data for a single I/O request does not need to be contiguous in physical memory—only the on-disk range needs to be contiguous.

Request Queues

Block devices maintain a request queue (struct request_queue, defined in include/linux/blkdev.h) to hold pending I/O operations:

File systems (or other callers) add requests to the queue.
The block device driver pulls requests from the queue and submits them to hardware.

A single struct request can cover multiple consecutive disk blocks, so it may contain one or more bio objects chained together.

I/O Schedulers

Sending requests to the disk in arrival order is inefficient — it causes many random seeks. The kernel solves this by merging (combining adjacent requests) and sorting (reordering requests by on-disk location) in the request queue. The policy that governs these operations is the I/O scheduler.

The I/O scheduler virtualizes the disk the same way the process scheduler virtualizes the CPU.

Linus Elevator (default until kernel 2.4)

The Linus elevator applies four rules in order when a new request arrives:

Front/back merge: If an existing request covers an adjacent on-disk sector, merge the new request into it.
Anti-starvation: If any request in the queue is old enough (approaching a starvation threshold), insert the new request at the tail of the queue to let the old request proceed.
Sorted insertion: If a suitable position exists that keeps the queue sorted by on-disk sector, insert there.
Tail insertion: Otherwise, insert at the tail.

Goal: minimize total disk seek time (best global throughput). Weakness: a request to a far sector can starve if traffic to nearby sectors is heavy.

Deadline Scheduler

Adds per-request expiration timestamps on top of sorting:

Request type	Deadline
Read	now + 0.5 seconds
Write	now + 5 seconds

Reads have a much tighter deadline because processes typically block waiting for reads, while writes can usually proceed asynchronously. If any request's deadline expires, it is promoted to the front of the queue, guaranteeing forward progress even for starved requests.

Noop Scheduler

Performs no reordering — only merges strictly sequential requests. Intended for devices where seeking is irrelevant (flash/SSD, RAM disks, virtual disks). On such devices, sorting by on-disk position wastes CPU time for zero benefit.

Configuring the I/O Scheduler

Select the scheduler at boot time with a kernel parameter:

elevator=<cfq|deadline|noop>

Or change it per device at runtime:

# See current scheduler
cat /sys/block/<device>/queue/scheduler

# Switch to noop
echo noop > /sys/block/<device>/queue/scheduler

Key Takeaways

ext4 is the dominant Linux file system. It descends from FFS (1984) through ext2 (block groups) and ext3 (journaling), adding 64-bit addressing, extent trees, and HTree directories.
Journaling (WAL) is the mechanism that prevents file system corruption after crashes: changes are written to the journal first and committed only when the end marker is present.
The block I/O layer abstracts hardware differences. A sector is a hardware unit; a block is a file system unit; a bio is the kernel's in-flight I/O descriptor supporting scatter-gather memory.
I/O schedulers exist because seek time on rotating media is expensive. They merge and sort requests, with different schedulers making different throughput vs. latency trade-offs. Flash storage typically uses the noop scheduler.

File Systems and Block I/O

Why This Matters

A Brief History of Unix File Systems

UFS — The Original (1974)

FFS — Fast File System (BSD, 1984)

The ext Family

ext2

ext3

ext4

The Linux Block I/O Layer

Anatomy of a Block Device

Buffers and Buffer Heads

The `bio` Structure

Request Queues

I/O Schedulers

Linus Elevator (default until kernel 2.4)

Deadline Scheduler

Noop Scheduler

Configuring the I/O Scheduler

Key Takeaways

Practice

Model answer

Model answer

Model answer

Results

File Systems and Block I/O

Why This Matters

A Brief History of Unix File Systems

UFS — The Original (1974)

FFS — Fast File System (BSD, 1984)

The ext Family

ext2

ext3

ext4

The Linux Block I/O Layer

Anatomy of a Block Device

Buffers and Buffer Heads

The bio Structure

Request Queues

I/O Schedulers

Linus Elevator (default until kernel 2.4)

Deadline Scheduler

Noop Scheduler

Configuring the I/O Scheduler

Key Takeaways

Practice

Model answer

Model answer

Model answer

Results

The `bio` Structure