File Systems and Block I/O
Why This Matters
Every time a program reads or writes a file, the kernel must translate that operation into physical disk accesses—correctly, quickly, and without losing data if the power fails mid-write. This module explains two complementary subsystems: the ext file system family (how data is organized on disk) and the block I/O layer (how the kernel orchestrates the actual hardware transfers). Together they sit at the bottom of the storage stack, and understanding them reveals why choices like journaling, I/O scheduling, and request merging exist.
A Brief History of Unix File Systems
UFS — The Original (1974)
Dennis Ritchie and Ken Thompson designed the original Unix File System, later called UFS. The first Linux file system (ext) and Minix FS used a very similar layout. The fundamental weakness was seek latency: inodes and their associated data blocks were scattered across the entire disk, forcing the read/write head to travel long distances during ordinary file access.
FFS — Fast File System (BSD, 1984)
Marshall Kirk McKusick and colleagues responded with FFS, the file system of BSD UNIX. The key idea is the cylinder group: one or more consecutive disk cylinders treated as a unit.
- Inodes and their data blocks are allocated in the same cylinder group, dramatically cutting seek time.
- Disk block allocation heuristics further reduce head movement.
- Almost every modern in-place-update file system (including ext2/3/4) descends from this design.
The ext Family
ext2
ext2 is the second extended Linux file system, directly inspired by FFS. Its analog of the cylinder group is the block group.
On-disk layout:
- The very first block of the partition is reserved for the boot sector (not managed by the file system).
- The rest is split into equally-sized block groups stored sequentially.
Each block group contains:
| Structure | Purpose |
|---|---|
| Superblock | Global metadata: block size, total inode/block counts, free block counts. Each block group keeps a copy for crash recovery. |
| Group Descriptor | Per-group metadata: free block count, free inode count, number of directories in the group. |
| Block Bitmap | One bit per block — is it in use? |
| Inode Bitmap | One bit per inode — is it allocated? |
| Inode Table | The actual inode structures for this group. |
| Data Blocks | File data and directory entries. |
ext3
ext3 adds journaling to ext2 to guarantee consistency after a crash or sudden power loss.
Journaling = Write-Ahead Logging (WAL): before overwriting any on-disk structure, the kernel first writes a description of the intended change to a well-known journal region. If power fails mid-operation, the journal is replayed on the next mount to bring the file system to a consistent state.
The journal is implemented as a fixed-size, circular array stored in a special file with a hard-coded inode number. Each transaction in the journal consists of:
- A begin marker
- The log data (the pending changes)
- An end marker (commit)
If a crash happens before the end marker is written, that incomplete transaction is discarded on replay. This "all-or-nothing" guarantee is the same atomic-commit model used in relational databases.
ext4
ext4 is the current default for most Linux distributions (and Android). It extends ext3 with:
| Feature | ext3 | ext4 |
|---|---|---|
| Addressing width | 32-bit | 64-bit |
| Max file size | 16 GB | 16 TB |
| Max file system size | 2³² blocks | 2⁶⁴ blocks |
| Block mapping | Indirect block map | Extent tree (contiguous runs) |
| Directory indexing | Linked list | HTree (hash tree, B-tree variant) |
| Max subdirectories | 32,000 | 64,000 |
The extent tree replaces the old multi-level indirect block pointer scheme with a structure that records contiguous runs of blocks (extents). This is far more efficient for large files because a single extent entry can describe millions of bytes.
The Linux Block I/O Layer
The block I/O layer is the kernel subsystem that sits between file systems (and other block device users) and the actual hardware drivers. It is more complex and performance-critical than character device I/O because reads and writes to rotating media are expensive and must be carefully ordered.
Anatomy of a Block Device
| Term | Definition |
|---|---|
| Sector | The smallest addressable unit, a physical property of the device. Typically 512 bytes (HDD), 2 KB (CD-ROM), or 4 KB (SSD). |
| Block | The unit of file system access — a contiguous group of sectors. Usually 4 KB or 8 KB. Defined by the file system, not the hardware. |
Key rule: the block size is always a multiple of the sector size and never larger than the page size.
Buffers and Buffer Heads
When a file system reads or writes a block, that block is cached in memory as a buffer. The associated metadata (which device, which block number, its state — uptodate, dirty, locked, etc.) is stored in a buffer head (struct buffer_head). The buffer head is the kernel's handle to an in-memory block.
The bio Structure
struct bio represents a single block I/O operation in flight.
struct bio {
sector_t bi_sector; // first sector of the I/O
struct block_device *bi_bdev;
unsigned short bi_vcnt; // number of bio_vec segments
struct bio_vec *bi_io_vec; // the array of segments
...
};
A bio can hold multiple memory segments (a bio_vec array). Each segment is a page + offset + length tuple. This scatter-gather design means the data for a single I/O request does not need to be contiguous in physical memory—only the on-disk range needs to be contiguous.
Request Queues
Block devices maintain a request queue (struct request_queue, defined in include/linux/blkdev.h) to hold pending I/O operations:
- File systems (or other callers) add requests to the queue.
- The block device driver pulls requests from the queue and submits them to hardware.
A single struct request can cover multiple consecutive disk blocks, so it may contain one or more bio objects chained together.
I/O Schedulers
Sending requests to the disk in arrival order is inefficient — it causes many random seeks. The kernel solves this by merging (combining adjacent requests) and sorting (reordering requests by on-disk location) in the request queue. The policy that governs these operations is the I/O scheduler.
The I/O scheduler virtualizes the disk the same way the process scheduler virtualizes the CPU.
Linus Elevator (default until kernel 2.4)
The Linus elevator applies four rules in order when a new request arrives:
- Front/back merge: If an existing request covers an adjacent on-disk sector, merge the new request into it.
- Anti-starvation: If any request in the queue is old enough (approaching a starvation threshold), insert the new request at the tail of the queue to let the old request proceed.
- Sorted insertion: If a suitable position exists that keeps the queue sorted by on-disk sector, insert there.
- Tail insertion: Otherwise, insert at the tail.
Goal: minimize total disk seek time (best global throughput). Weakness: a request to a far sector can starve if traffic to nearby sectors is heavy.
Deadline Scheduler
Adds per-request expiration timestamps on top of sorting:
| Request type | Deadline |
|---|---|
| Read | now + 0.5 seconds |
| Write | now + 5 seconds |
Reads have a much tighter deadline because processes typically block waiting for reads, while writes can usually proceed asynchronously. If any request's deadline expires, it is promoted to the front of the queue, guaranteeing forward progress even for starved requests.
Noop Scheduler
Performs no reordering — only merges strictly sequential requests. Intended for devices where seeking is irrelevant (flash/SSD, RAM disks, virtual disks). On such devices, sorting by on-disk position wastes CPU time for zero benefit.
Configuring the I/O Scheduler
Select the scheduler at boot time with a kernel parameter:
elevator=<cfq|deadline|noop>
Or change it per device at runtime:
# See current scheduler
cat /sys/block/<device>/queue/scheduler
# Switch to noop
echo noop > /sys/block/<device>/queue/scheduler
Key Takeaways
- ext4 is the dominant Linux file system. It descends from FFS (1984) through ext2 (block groups) and ext3 (journaling), adding 64-bit addressing, extent trees, and HTree directories.
- Journaling (WAL) is the mechanism that prevents file system corruption after crashes: changes are written to the journal first and committed only when the end marker is present.
- The block I/O layer abstracts hardware differences. A sector is a hardware unit; a block is a file system unit; a bio is the kernel's in-flight I/O descriptor supporting scatter-gather memory.
- I/O schedulers exist because seek time on rotating media is expensive. They merge and sort requests, with different schedulers making different throughput vs. latency trade-offs. Flash storage typically uses the noop scheduler.