Introduction to Filesystems

Why This Matters

Every program you write eventually needs to store data that survives a reboot, share it with other processes, or address it by a human-readable name. Without a filesystem, each program would have to manage raw disk blocks on its own — a maintenance nightmare. Understanding filesystems connects the abstract OS concepts you've studied (virtual memory, processes, synchronization) to the concrete question: how does data actually persist?


Why Do We Need Filesystems?

Three core motivations:

Need                    What it means
Persistence             Data survives process exit, crashes, and reboots
Naming & organization   Humans (and programs) refer to data by name, not disk address
Sharing                 Multiple processes and users can access the same data

Without these guarantees, building almost any real application would be impractical.


What Is a File?

A file is simply a named sequence of bytes. A few important subtleties:

Block-Oriented vs. Stream-Oriented Files

Property          Block-oriented                                  Stream-oriented
Unit of access    Fixed-size block                                Byte (character)
Random access     Yes (blocks can be addressed in any order)      Typically no
Typical example   Disk file                                       Network socket, mouse input

Disk-based files are block-oriented; the filesystem manages blocks of fixed size (512 bytes in xv6). Stream-oriented files are common for I/O devices and network connections where you process data as it arrives.
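
The difference in random access can be observed from user space. The following is a minimal sketch, assuming a POSIX system; the helper name can_seek is hypothetical. It asks the kernel whether a descriptor accepts lseek(), first for a regular disk file and then for a pipe, which stands in for a stream-oriented device.

```python
import errno
import os
import tempfile


def can_seek(fd: int) -> bool:
    """Return True if the descriptor supports random access via lseek()."""
    try:
        os.lseek(fd, 0, os.SEEK_CUR)
        return True
    except OSError as exc:
        if exc.errno == errno.ESPIPE:  # "Illegal seek": a stream, not a disk file
            return False
        raise


# A regular disk file: any byte offset can be addressed directly.
with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    file_seekable = can_seek(f.fileno())
    os.lseek(f.fileno(), 7, os.SEEK_SET)
    byte_at_7 = os.read(f.fileno(), 1)

# A pipe: bytes can only be consumed in the order they arrived.
r, w = os.pipe()
pipe_seekable = can_seek(r)
os.write(w, b"abc")
first_from_pipe = os.read(r, 1)
os.close(r)
os.close(w)
```

The file allows a jump straight to offset 7, while the pipe rejects the seek and hands back its bytes strictly in arrival order.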


Unix File Names: Three Layers

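Unix (and xv6) names a file at three levels, keeping its identity (the inode), its human-readable name (the path), and a process's open handle to it (the file descriptor) separate: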

1. Inodes (Index Nodes)

Each file is assigned a unique inode number. The inode stores the file's metadata and pointers to its data blocks. The inode number is the true, stable identity of a file.
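
To make "metadata plus pointers to data blocks" concrete, here is a toy model rather than xv6's actual on-disk format. The Inode class, the read_file helper, and the block numbers 70 and 12 are all hypothetical, and the sketch assumes the inode holds a flat list of block numbers in file order.

```python
from dataclasses import dataclass, field

BSIZE = 512  # block size in bytes, matching the xv6 figure used in the text


@dataclass
class Inode:
    inum: int                                  # stable identity of the file
    size: int = 0                              # file length in bytes
    addrs: list = field(default_factory=list)  # data block numbers, in file order


def read_file(disk: dict, ip: Inode, offset: int, n: int) -> bytes:
    """Read up to n bytes at offset by following the inode's block pointers."""
    out = bytearray()
    end = min(offset + n, ip.size)             # never read past end of file
    while offset < end:
        blockno = ip.addrs[offset // BSIZE]    # which data block holds this byte
        start = offset % BSIZE                 # position inside that block
        take = min(BSIZE - start, end - offset)
        out += disk[blockno][start:start + take]
        offset += take
    return bytes(out)


# The file's blocks need not be contiguous or in order on disk.
disk = {70: b"A" * BSIZE, 12: b"B" * BSIZE}
ip = Inode(inum=5, size=BSIZE + 10, addrs=[70, 12])
straddling = read_file(disk, ip, BSIZE - 2, 4)  # crosses a block boundary
```

The caller works only with byte offsets; the inode translates each offset into a (block number, offset within block) pair.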

2. Paths

Paths like /usr/bin/gcc are human-convenient names in a hierarchical namespace. The filesystem resolves a path one component at a time; each directory entry that maps a name to an inode number is called a link (a hard link in Unix terminology). Multiple paths can point to the same inode.
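
That several paths can name one inode is easy to check on a POSIX system. The sketch below uses Python's os.link and os.stat; the file names a.txt and b.txt are arbitrary, and it assumes the temporary directory's filesystem supports hard links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "a.txt")
    b = os.path.join(d, "b.txt")
    with open(a, "w") as f:
        f.write("hello")

    os.link(a, b)  # add a second directory entry for the same inode
    same_inode = os.stat(a).st_ino == os.stat(b).st_ino
    links_after_link = os.stat(a).st_nlink

    os.unlink(a)   # remove one name; the inode and its data remain
    with open(b) as f:
        survived = f.read()
    links_after_unlink = os.stat(b).st_nlink
```

Both names report the same inode number, and removing one name only decrements the inode's link count.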

3. File Descriptors

When a process calls open(), the OS:

  1. Resolves the path to an inode.
  2. Creates an open file description (tracking current position and access mode) in a kernel table.
  3. Returns an integer file descriptor to the process.

Subsequent read(), write(), and lseek() calls pass this integer, which indexes the process's file descriptor table. File descriptors are process-local; the open file descriptions they reference can be shared (e.g., after fork()).
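
The split between a descriptor and its open file description shows up in how the file offset behaves. The sketch below, assuming a POSIX system, uses os.dup() instead of fork() to keep everything in one process; a child created by fork() shares the parent's open file descriptions in the same way.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"abcdefgh")
    tmp.flush()

    fd1 = os.open(tmp.name, os.O_RDONLY)
    fd2 = os.dup(fd1)                     # new descriptor, same open file description
    fd3 = os.open(tmp.name, os.O_RDONLY)  # separate open(): a new description

    first = os.read(fd1, 3)     # advances the offset shared by fd1 and fd2
    via_dup = os.read(fd2, 3)   # continues where fd1 stopped
    via_open = os.read(fd3, 3)  # has its own offset, so starts at byte 0

    for fd in (fd1, fd2, fd3):
        os.close(fd)
```

fd1 and fd2 are different integers that share one offset, while fd3 refers to the same inode through an independent description.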


Layered Filesystem Design

xv6 implements the filesystem as a stack of layers, each providing a cleaner abstraction to the layer above:

User syscall layer   (sysfile.c: sys_read, sys_write, sys_link, …)
        ↓
Inode layer          (path resolution, inode read/write)
        ↓
Logging layer        (crash recovery via write-ahead log)
        ↓
Buffer cache         (in-memory cache of disk blocks)
        ↓
Disk driver          (raw block reads/writes)

A common quiz point: the virtual memory manager is NOT a layer of the xv6 filesystem. The filesystem depends on virtual memory only indirectly (the buffer cache lives in kernel memory); virtual memory is not one of the filesystem's own layers.
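
To illustrate how one layer hides detail from the layer above it, here is a toy version of the bottom two layers only. It is a sketch, not xv6's code: the Disk and BufferCache classes, the write-through policy, and the capacity of 2 are assumptions made for illustration (the method names bread and bwrite are borrowed from xv6's buffer cache).

```python
from collections import OrderedDict


class Disk:
    """Bottom layer: raw block reads and writes. Counts physical reads."""

    def __init__(self, nblocks, bsize=512):
        self.blocks = [bytes(bsize) for _ in range(nblocks)]
        self.reads = 0

    def read(self, n):
        self.reads += 1
        return self.blocks[n]

    def write(self, n, data):
        self.blocks[n] = data


class BufferCache:
    """Layer above the disk: keeps recently used blocks in memory (LRU)."""

    def __init__(self, disk, capacity=2):
        self.disk = disk
        self.capacity = capacity
        self.cache = OrderedDict()  # block number -> data, oldest first

    def _install(self, n, data):
        if n not in self.cache and len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used block
        self.cache[n] = data
        self.cache.move_to_end(n)

    def bread(self, n):
        if n in self.cache:
            self.cache.move_to_end(n)       # hit: no disk access needed
        else:
            self._install(n, self.disk.read(n))
        return self.cache[n]

    def bwrite(self, n, data):
        self.disk.write(n, data)            # write-through, for simplicity
        self._install(n, data)


disk = Disk(nblocks=8)
cache = BufferCache(disk, capacity=2)
cache.bread(3)
cache.bread(3)
reads_after_repeat = disk.reads  # the second read was served from memory
```

Code above the cache calls bread and bwrite without knowing whether a block came from memory or from the device, which is the kind of abstraction each layer in the stack provides.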


Essential Questions Every Filesystem Must Answer

  1. How do we keep track of disk/file metadata? Inodes and the superblock
  2. How do we keep track of free space? Free space bitmap
  3. How do we track which disk blocks belong to a given file? Block pointers inside the inode
  4. How do we map paths to files? Directory entries (dirent)

On-Disk Metadata Structures

Superblock

A single block near the start of the disk holding global metadata, such as the total number of blocks, the number of inodes, and the number of log blocks. When the OS mounts a filesystem, it reads the superblock first to learn the layout.

Free Space Bitmap

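A compact array of bits, one bit per disk block, indicating whether each block is free or allocated. Testing or flipping the bit for a given block is O(1). Finding a free block requires scanning the bitmap, but the bitmap is small (one 512-byte bitmap block covers 4096 disk blocks), so allocation stays fast.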
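
A minimal sketch of such a bitmap follows. The Bitmap class and its method names are hypothetical, and the first-fit scan is one simple allocation policy among several possible ones.

```python
class Bitmap:
    """One bit per disk block: 1 = allocated, 0 = free."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.bits = bytearray((nblocks + 7) // 8)  # 8 blocks tracked per byte

    def is_used(self, b):
        return bool(self.bits[b // 8] & (1 << (b % 8)))

    def alloc(self):
        """Scan for the first free block, mark it allocated, and return it."""
        for b in range(self.nblocks):
            if not self.is_used(b):
                self.bits[b // 8] |= 1 << (b % 8)
                return b
        raise RuntimeError("out of blocks")

    def free(self, b):
        """Clear the bit for block b, rejecting a double free."""
        if not self.is_used(b):
            raise ValueError("freeing a free block")
        self.bits[b // 8] &= ~(1 << (b % 8)) & 0xFF


bm = Bitmap(10)
allocated = [bm.alloc() for _ in range(3)]
bm.free(1)
reused = bm.alloc()  # the freed block is the first free bit again
```

Ten blocks need only two bytes of bitmap, which is why the structure stays compact even for large disks.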

Inodes

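One inode per file, stored in dedicated inode blocks. Each inode contains the file's metadata (in xv6: file type, link count, and size in bytes) and the block pointers that locate its data.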

Directories (dirent)

A directory is itself a file whose content is an array of directory entries. Each entry pairs a name string with an inode number. This is how /usr/bin/gcc is resolved: look up usr in /, then bin in /usr, then gcc in /usr/bin.
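
The lookup described above can be sketched with a toy directory table. The resolve function and the inode numbers 2, 3, and 42 are hypothetical; the root directory is assumed to be inode 1, as in xv6.

```python
ROOT_INUM = 1  # assumed root inode number (xv6 uses 1)

# Toy "disk": inode number of a directory -> its entries {name: inode number}
directories = {
    1: {"usr": 2},   # /
    2: {"bin": 3},   # /usr
    3: {"gcc": 42},  # /usr/bin
}


def resolve(path):
    """Walk the path one component at a time, starting at the root inode."""
    inum = ROOT_INUM
    for name in path.split("/"):
        if not name:                  # skip empty parts from leading or doubled "/"
            continue
        entries = directories.get(inum)
        if entries is None:
            raise NotADirectoryError(path)
        if name not in entries:
            raise FileNotFoundError(path)
        inum = entries[name]          # follow the link to the next inode
    return inum


gcc_inum = resolve("/usr/bin/gcc")
```

Each loop iteration performs one directory lookup, so a path with three components costs three lookups.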


Disk as an Array of Blocks

At the hardware level, a disk is just an array of fixed-size blocks (512 bytes in xv6). The filesystem imposes structure on this flat array:

[ boot | superblock | log blocks | inode blocks | bitmap | data blocks … ]

mkfs.c in xv6 creates this layout: it writes the superblock, allocates inode blocks, writes the free space bitmap, and copies any initial files (utilities) into the data region.
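
The starting block of each region follows from a handful of sizes. The sketch below is an illustration, not the code in mkfs.c: the layout function is hypothetical, the 64-byte on-disk inode size and the "+1" rounding are assumptions, and the example parameters (1000 blocks, 200 inodes, 30 log blocks) are chosen only to make the arithmetic visible.

```python
def layout(fssize, ninodes, nlog, bsize=512, inode_size=64):
    """Return {region: (first block, block count)} for a disk of fssize blocks."""
    inodes_per_block = bsize // inode_size
    ninodeblocks = ninodes // inodes_per_block + 1  # assumed rounding
    nbitmap = fssize // (bsize * 8) + 1             # one bit per disk block

    regions = [
        ("boot", 1),
        ("superblock", 1),
        ("log", nlog),
        ("inodes", ninodeblocks),
        ("bitmap", nbitmap),
    ]
    out, start = {}, 0
    for name, count in regions:
        out[name] = (start, count)
        start += count
    out["data"] = (start, fssize - start)           # everything left over
    return out


regions = layout(fssize=1000, ninodes=200, nlog=30)
```

With these example parameters the metadata occupies the first 59 blocks, leaving 941 blocks for data.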


Key Takeaways

Practice

  1. Which of the following is NOT a major layer in the xv6 filesystem stack?
  2. In Unix, what is the primary identifier that uniquely names a file on disk?
  3. What does a directory entry (dirent) store?
  4. Which on-disk structure does xv6 use to efficiently track which disk blocks are free?
  5. A mouse delivers input one byte at a time and does not support random access. What file orientation does this represent?
  6. Explain the three distinct levels of naming that Unix uses for a file, and describe what happens at each level when a process calls open("/usr/bin/gcc", O_RDONLY).
  7. What is the superblock, what information does it hold, and when is it read?
  8. After a process calls open() and receives file descriptor 5, it then calls fork(). Which statement best describes what happens to the file descriptor?
  9. Sketch the xv6 disk layout (the regions in order from block 0 onward) and explain the purpose of each region.
  10. A student claims: 'Deleting a file in Unix just removes the inode.' Is this correct? Explain what actually happens and what role the link count plays.