Memory Management

Why This Matters

Every byte your kernel code touches has to come from somewhere. Understanding how the kernel tracks, allocates, and releases memory is essential for writing correct, efficient, and secure kernel modules. Get it wrong and you face memory leaks, use-after-free bugs, kernel panics, or—worst of all—silent information disclosure to user space.

Pages: The Basic Unit

Physical memory is divided into fixed-size chunks called pages (also called frames). The size is determined by the CPU's Memory Management Unit (MMU):

| Common page size | Use case |
|---|---|
| 4 KB | Default on most architectures |
| 2 MB | "Huge pages" for reduced TLB pressure |
| 1 GB | "Gigantic pages" for very large working sets |

Run getconf PAGESIZE on any Linux box to see the default.
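The same value can be read programmatically. The sketch below uses the POSIX sysconf() call from user space; the helper names default_page_size and is_power_of_two are illustrative, not part of any kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Returns the default page size in bytes, as reported by the C library. */
long default_page_size(void)
{
    long size = sysconf(_SC_PAGESIZE);
    assert(size > 0);   /* sysconf() returns -1 on error */
    return size;
}

/* Page sizes are powers of two, so exactly one bit is set. */
bool is_power_of_two(long n)
{
    return n > 0 && (n & (n - 1)) == 0;
}
```

Inside the kernel itself the equivalent constant is PAGE_SIZE, so no runtime query is needed there.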

Every physical page has exactly one struct page object associated with it, defined in include/linux/mm_types.h. At 64 bytes per struct page (a typical size on 64-bit builds), this bookkeeping is cheap but not free: an 8 GB system with 4 KB pages has about 2 million pages, so the struct page array consumes roughly 128 MB, or about 1.6% of RAM.
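The arithmetic behind that estimate can be checked with a few lines of C. This is a sketch under the text's assumptions (4 KB pages, 64-byte struct page); the constant and function names are made up for illustration.

```c
#include <stdint.h>

#define DEMO_PAGE_BYTES        4096ULL  /* 4 KB pages, as in the text */
#define DEMO_STRUCT_PAGE_BYTES 64ULL    /* assumed size; varies with kernel config */

/* Number of physical pages needed to cover ram_bytes of memory. */
uint64_t page_count(uint64_t ram_bytes)
{
    return ram_bytes / DEMO_PAGE_BYTES;
}

/* Bytes spent on struct page bookkeeping for ram_bytes of memory. */
uint64_t struct_page_overhead(uint64_t ram_bytes)
{
    return page_count(ram_bytes) * DEMO_STRUCT_PAGE_BYTES;
}
```

Under these assumptions the overhead is always 64/4096 of RAM, that is 1/64, regardless of how much memory is installed.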

struct page tells the kernel who owns the page: a user-space process, a statically allocated kernel data structure, the page cache, and so on.

Zones: Partitioning Physical Memory

Not all physical pages are equal. Hardware constraints force the kernel to group pages into zones:

| Zone | Purpose |
|---|---|
| `ZONE_DMA` | Lowest 16 MB; legacy ISA devices that can only DMA here |
| `ZONE_NORMAL` | Directly mapped into the kernel address space |
| `ZONE_HIGHMEM` | x86-32 only; above 896 MB, not permanently mapped |

On x86-64 there is effectively no high memory problem because the 64-bit address space is large enough to map all of RAM directly. On x86-32 the kernel/user split (1 GB kernel / 3 GB user) limits the kernel to 896 MB of direct mapping, so pages above that threshold land in ZONE_HIGHMEM and must be temporarily mapped before use.

Each zone is described by struct zone in include/linux/mmzone.h. The page allocator consults zone constraints every time it satisfies an allocation request.
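To make the x86-32 boundaries concrete, here is a small user-space sketch that classifies a physical address using the 16 MB and 896 MB thresholds from the table. The enum and function names are hypothetical; the real kernel derives zone membership from per-architecture setup code, not from a function like this.

```c
#include <stdint.h>

/* Illustrative zone labels; not the kernel's enum zone_type. */
enum demo_zone { DEMO_ZONE_DMA, DEMO_ZONE_NORMAL, DEMO_ZONE_HIGHMEM };

#define DEMO_DMA_LIMIT    (16ULL << 20)   /* 16 MB: legacy ISA DMA limit  */
#define DEMO_NORMAL_LIMIT (896ULL << 20)  /* 896 MB: end of direct mapping */

/* Classify a physical address using the x86-32 layout described above. */
enum demo_zone zone_for_phys_addr(uint64_t phys)
{
    if (phys < DEMO_DMA_LIMIT)
        return DEMO_ZONE_DMA;
    if (phys < DEMO_NORMAL_LIMIT)
        return DEMO_ZONE_NORMAL;
    return DEMO_ZONE_HIGHMEM;
}
```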

The Buddy System: Low-Level Page Allocation

The kernel's lowest-level allocator is the buddy system, whose API lives in include/linux/gfp.h. It allocates memory in page-granularity blocks that are always a power-of-two in size (1, 2, 4, 8, … pages).

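The key insight is that every block of size 2ⁿ pages has a unique "buddy" at a predictable address: the adjacent block of the same size that, together with it, forms an aligned block of size 2ⁿ⁺¹. When you free a block, the allocator checks whether its buddy is also free; if so, they merge into a block of size 2ⁿ⁺¹, and the check repeats at the next size up. This coalescing keeps large contiguous regions available and limits external fragmentation, although it cannot eliminate it.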
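The "predictable address" comes down to one bit flip. The sketch below works on page indices rather than real addresses; a block of order n starts at an index that is a multiple of 2ⁿ, and its buddy differs only in bit n. The function names are illustrative.

```c
#include <assert.h>

/* Index of the buddy of the order-`order` block that starts at page_idx. */
unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    /* A block must be aligned to its own size. */
    assert((page_idx & ((1UL << order) - 1)) == 0);
    return page_idx ^ (1UL << order);
}

/* Start index of the order+1 block formed when the two buddies merge. */
unsigned long merged_index(unsigned long page_idx, unsigned int order)
{
    return page_idx & ~(1UL << order);
}
```

For example, the order-2 block at page 12 has its buddy at page 8, and the two merge into an order-3 block starting at page 8.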

gfp_t: Controlling Allocation Behavior

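Every allocation call takes a gfp_t (Get Free Page flags) bitmask. The flag names below follow older kernels; since Linux 4.4, __GFP_WAIT has been replaced by __GFP_DIRECT_RECLAIM and __GFP_KSWAPD_RECLAIM, so check include/linux/gfp.h for your kernel version. Flags fall into three categories: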

Action modifiers — how to allocate:

| Flag | Meaning |
|---|---|
| `__GFP_WAIT` | Allocator may sleep/block |
| `__GFP_IO` | May start disk I/O |
| `__GFP_FS` | May invoke filesystem operations |
| `__GFP_HIGH` | Use emergency reserves |

Zone modifiers — where to allocate from:

| Flag | Zone preference |
|---|---|
| (none) | `ZONE_NORMAL` (fallback to `ZONE_DMA`) |
| `__GFP_HIGHMEM` | `ZONE_HIGHMEM` |
| `__GFP_DMA` | `ZONE_DMA` |

Type flags — convenient combinations you should reach for first:

| Flag | Expands to / behavior | Use when |
|---|---|---|
| `GFP_KERNEL` | `__GFP_WAIT \| __GFP_IO \| __GFP_FS` | Normal kernel code; may sleep |
| `GFP_ATOMIC` | `__GFP_HIGH` | Interrupt handlers, spinlock-held sections; never sleeps |
| `GFP_NOWAIT` | Like `GFP_ATOMIC` but no emergency-pool fallback | Soft-IRQ, tasklets |
| `GFP_NOIO` | May block, no disk I/O | Block-layer code (avoids recursion) |
| `GFP_NOFS` | May block and do I/O, no filesystem ops | Filesystem internals |
| `GFP_USER` | Normal user-space allocation | Allocating for user processes |
| `GFP_HIGHUSER` | `GFP_USER \| __GFP_HIGHMEM` | User-space; highmem OK |
| `GFP_DMA` | Allocate from `ZONE_DMA` | DMA-capable buffers |
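The way type flags are built from action modifiers can be modelled in user space. In this sketch the DEMO_ bit values are invented for illustration (the real values live in the kernel headers and differ between versions); only the composition mirrors the table above.

```c
#include <stdbool.h>

/* Made-up bit values; only the way they combine follows the table above. */
#define DEMO_GFP_WAIT 0x01u  /* may sleep              */
#define DEMO_GFP_IO   0x02u  /* may start disk I/O     */
#define DEMO_GFP_FS   0x04u  /* may call into the FS   */
#define DEMO_GFP_HIGH 0x08u  /* may use emergency pool */

#define DEMO_GFP_KERNEL (DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_GFP_ATOMIC (DEMO_GFP_HIGH)
#define DEMO_GFP_NOIO   (DEMO_GFP_WAIT)
#define DEMO_GFP_NOFS   (DEMO_GFP_WAIT | DEMO_GFP_IO)

/* True if an allocation with these flags is allowed to block. */
bool demo_may_sleep(unsigned int flags)
{
    return (flags & DEMO_GFP_WAIT) != 0;
}

/* Pick a type flag from the calling context, as the table recommends. */
unsigned int demo_flags_for_context(bool atomic_context)
{
    return atomic_context ? DEMO_GFP_ATOMIC : DEMO_GFP_KERNEL;
}
```

The point of the model is that "may this allocation sleep?" is a single bit test, which is why choosing the right type flag for the calling context matters so much.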

Security note: By default, newly allocated pages retain whatever data a previous owner left behind. Use get_zeroed_page(gfp_mask) when allocating pages destined for user space to prevent information leakage.

kmalloc() and vmalloc(): Byte-Granularity Allocation

The buddy system works at page granularity. For arbitrary byte-sized allocations the kernel provides two families:

kmalloc() / kfree()

void *kmalloc(size_t size, gfp_t flags);
void  kfree(const void *ptr);

Returns physically contiguous memory (backed by the buddy system).
Works in any context (even interrupt context with GFP_ATOMIC).
Required for DMA buffers and anything that depends on physical contiguity.
Maximum allocation size is typically a few MB (architecture-dependent).

vmalloc() / vfree()

void *vmalloc(unsigned long size);
void  vfree(const void *addr);

Returns virtually contiguous memory that may be physically scattered.
Cannot be used for I/O buffers that require physical contiguity.
May sleep—never call from interrupt context.
Useful when you need a large contiguous region and physical contiguity is not required.

When to choose which

| Criterion | `kmalloc` | `vmalloc` |
|---|---|---|
| Physical contiguity | Guaranteed | Not guaranteed |
| Performance | Faster (direct mapping, no extra page-table setup) | Slower |
| Interrupt-safe | Yes (with `GFP_ATOMIC`) | No |
| Large allocations | Hard | Easy |
| DMA-capable | Yes | No |

Most kernel code prefers kmalloc for performance. Use vmalloc only when kmalloc cannot satisfy the size.
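The decision rule can be written down as a small function. This is a user-space sketch, not kernel code: the enum, the function name, and the kmalloc_limit parameter are all hypothetical, and the actual upper bound for kmalloc depends on architecture and configuration.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_allocator { DEMO_USE_KMALLOC, DEMO_USE_VMALLOC, DEMO_NO_FIT };

/*
 * Prefer kmalloc whenever the request fits. Fall back to vmalloc only if
 * the caller can sleep and does not need physically contiguous memory.
 */
enum demo_allocator demo_choose_allocator(size_t size, size_t kmalloc_limit,
                                          bool needs_phys_contig,
                                          bool atomic_context)
{
    if (size <= kmalloc_limit)
        return DEMO_USE_KMALLOC;
    if (needs_phys_contig || atomic_context)
        return DEMO_NO_FIT;   /* vmalloc cannot satisfy either requirement */
    return DEMO_USE_VMALLOC;
}
```

The DEMO_NO_FIT case is the one to design around: a large, physically contiguous buffer usually has to be reserved early or split into smaller pieces.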

The Slab Allocator: Object Caching

The kernel frequently allocates and frees the same types of data structures (e.g., task_struct, inode, dentry). Calling the buddy system every time would be slow and wasteful. The slab allocator solves this with object caching:

At module/subsystem init time, create a cache for a specific struct type.
The cache pre-allocates slabs of those objects.
Allocation = pull a free object off the free list (O(1)).
Deallocation = push the object back (O(1), no zeroing by default).
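The free-list mechanics in the steps above can be sketched in user space. This toy cache holds one fixed "slab" of four objects; all names are invented for illustration, and a real slab allocator adds per-CPU caches, multiple slabs, and debugging support on top of this idea.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLAB_OBJECTS 4

struct demo_obj {
    struct demo_obj *next_free;  /* link used only while the object is free */
    int payload;
};

struct demo_cache {
    struct demo_obj slab[DEMO_SLAB_OBJECTS];  /* pre-allocated objects */
    struct demo_obj *free_list;               /* head of the free list */
};

/* Init: thread every object in the slab onto the free list. */
void demo_cache_init(struct demo_cache *c)
{
    c->free_list = NULL;
    for (size_t i = 0; i < DEMO_SLAB_OBJECTS; i++) {
        c->slab[i].next_free = c->free_list;
        c->free_list = &c->slab[i];
    }
}

/* Alloc: pop the head of the free list, O(1). Returns NULL when empty. */
struct demo_obj *demo_cache_alloc(struct demo_cache *c)
{
    struct demo_obj *obj = c->free_list;
    if (obj)
        c->free_list = obj->next_free;
    return obj;
}

/* Free: push the object back, O(1). The payload is not cleared. */
void demo_cache_free(struct demo_cache *c, struct demo_obj *obj)
{
    assert(obj != NULL);
    obj->next_free = c->free_list;
    c->free_list = obj;
}
```

Because freeing does not clear the payload, the next caller to allocate the same object sees the old contents, which is why objects must be initialized after allocation.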

The slab allocator also handles:

Correct object alignment for the architecture's cache line size.
NUMA awareness.
Cache coloring to spread objects across cache sets and reduce conflicts.

Creating and Using a Cache

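struct kmem_cache *my_cache;

// Init: one cache per object type; align = 0 lets the allocator choose
my_cache = kmem_cache_create("my_struct_cache", sizeof(struct my_struct),
                             0, SLAB_HWCACHE_ALIGN, NULL);
if (!my_cache)
        return -ENOMEM;

// Alloc: GFP_KERNEL may sleep, so use it in process context only
struct my_struct *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
if (!obj)
        return -ENOMEM;

// Free: return the object to its cache
kmem_cache_free(my_cache, obj);

// Cleanup: destroy the cache only after every object has been freed
kmem_cache_destroy(my_cache);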

Useful Slab Flags

| Flag | Effect |
|---|---|
| `SLAB_HWCACHE_ALIGN` | Align objects to cache-line boundary; prevents false sharing (costs memory) |
| `SLAB_POISON` | Fill slabs with `0xa5a5a5a5`; helps detect uninitialized-memory access |
| `SLAB_RED_ZONE` | Add padding around objects; helps detect buffer overflows |
| `SLAB_PANIC` | Panic the kernel if cache creation fails |
| `SLAB_CACHE_DMA` | Allocate slab memory from `ZONE_DMA` |

Per-CPU Data: Lock-Free Per-Core State

Some kernel data is naturally per-core (e.g., run-queue statistics, per-CPU counters). Sharing one variable across all CPUs requires locking. Per-CPU variables give each core its own private copy:

No locking required — each core only touches its own copy.
Reduced cache thrashing — no false sharing across cores.

Conceptually, a per-CPU variable behaves like an array indexed by CPU number, with each CPU using only its own slot. The API is defined in include/linux/percpu.h:

// Define a per-CPU integer
DEFINE_PER_CPU(int, my_counter);

// get_cpu_var() disables preemption and yields this CPU's copy as an lvalue
get_cpu_var(my_counter)++;

// Re-enable preemption once the access is done
put_cpu_var(my_counter);

get_cpu_var disables preemption (to prevent migration to another core mid-access) and returns a reference to the current CPU's copy. Always pair it with put_cpu_var.
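The "one slot per CPU" idea can be modelled in user space with a plain array. This sketch is only an analogy: the DEMO_ names are invented, the CPU number is passed in explicitly, and nothing here disables preemption the way get_cpu_var does.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4

/* One private slot per CPU; no slot is ever written by another CPU. */
static long demo_counter[DEMO_NR_CPUS];

/* Fast path: a CPU increments only its own slot, so no lock is needed. */
void demo_counter_inc(unsigned int cpu)
{
    assert(cpu < DEMO_NR_CPUS);
    demo_counter[cpu]++;
}

/* Slow path: a reader sums all slots to get the global value. */
long demo_counter_sum(void)
{
    long total = 0;
    for (unsigned int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        total += demo_counter[cpu];
    return total;
}
```

The trade-off is visible in the two functions: updates are cheap and local, while reading the global total requires visiting every CPU's copy.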

Key Takeaways

struct page tracks every physical page; it's the foundation of all memory management.
Zones (DMA, Normal, HighMem) reflect hardware constraints; the allocator picks the right zone automatically when you use the correct gfp_t flag.
Use GFP_KERNEL in process context, GFP_ATOMIC in interrupt/spinlock context—never sleep in atomic context.
kmalloc gives physically contiguous memory and is the default; vmalloc gives virtually contiguous memory for large allocations that don't need DMA.
The slab allocator makes frequent allocation/deallocation of fixed-size structs fast through object caching; consider a dedicated cache when a hot path repeatedly allocates objects of one type.
Per-CPU variables eliminate locking overhead for truly per-core state; always access them with preemption disabled.
