Memory Management
Why This Matters
Every byte your kernel code touches has to come from somewhere. Understanding how the kernel tracks, allocates, and releases memory is essential for writing correct, efficient, and secure kernel modules. Get it wrong and you face memory leaks, use-after-free bugs, kernel panics, or—worst of all—silent information disclosure to user space.
Pages: The Basic Unit
Physical memory is divided into fixed-size chunks called pages (also called frames). The size is determined by the CPU's Memory Management Unit (MMU):
| Common page size | Use case |
|---|---|
| 4 KB | Default on most architectures |
| 2 MB | "Huge pages" for reduced TLB pressure |
| 1 GB | "Gigantic pages" for very large working sets |
Run getconf PAGESIZE on any Linux box to see the default.
Every physical page has exactly one struct page object associated with it, defined in include/linux/mm_types.h. At 64 bytes per struct page, this bookkeeping is cheap but not free—on an 8 GB system with 4 KB pages you have about 2 million pages, consuming roughly 128 MB (~1.5% of RAM).
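The overhead figure above is simple arithmetic, and a small user-space sketch makes it easy to re-run for other RAM sizes. The constants below are the ones quoted in the text (4 KB pages, 64 bytes per struct page); the names `PAGE_SIZE_BYTES`, `STRUCT_PAGE_BYTES`, `page_count`, and `struct_page_overhead` are made up for this illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values taken from the text; real values depend on the
 * architecture and kernel configuration. */
#define PAGE_SIZE_BYTES   4096ULL
#define STRUCT_PAGE_BYTES 64ULL

/* Number of page frames needed to cover ram_bytes of physical memory. */
static uint64_t page_count(uint64_t ram_bytes)
{
    assert(ram_bytes % PAGE_SIZE_BYTES == 0);
    return ram_bytes / PAGE_SIZE_BYTES;
}

/* Bytes spent on bookkeeping: one struct page per page frame. */
static uint64_t struct_page_overhead(uint64_t ram_bytes)
{
    return page_count(ram_bytes) * STRUCT_PAGE_BYTES;
}
```

For 8 GB (`8ULL << 30` bytes), `page_count` gives 2,097,152 pages and `struct_page_overhead` gives 128 MB, matching the estimate in the paragraph.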
struct page tells the kernel who owns the page: a user-space process, a statically allocated kernel data structure, the page cache, and so on.
Zones: Partitioning Physical Memory
Not all physical pages are equal. Hardware constraints force the kernel to group pages into zones:
| Zone | Purpose |
|---|---|
| ZONE_DMA | Lowest 16 MB; legacy ISA devices that can only DMA here |
| ZONE_NORMAL | Directly mapped into the kernel address space |
| ZONE_HIGHMEM | x86-32 only; above 896 MB, not permanently mapped |
On x86-64 there is effectively no high memory problem because the 64-bit address space is large enough to map all of RAM directly. On x86-32 the kernel/user split (1 GB kernel / 3 GB user) limits the kernel to 896 MB of direct mapping, so pages above that threshold land in ZONE_HIGHMEM and must be temporarily mapped before use.
Each zone is described by struct zone in include/linux/mmzone.h. The page allocator consults zone constraints every time it satisfies an allocation request.
The Buddy System: Low-Level Page Allocation
The kernel's lowest-level allocator is the buddy system, whose API lives in include/linux/gfp.h. It allocates memory in page-granularity blocks that are always a power-of-two in size (1, 2, 4, 8, … pages).
The key insight is that every block of size 2ⁿ pages has a unique "buddy" at a predictable address. When you free a block, the allocator checks whether its buddy is also free; if so, the two merge into a block of size 2ⁿ⁺¹, and the check repeats at the next size up. This coalescing keeps large contiguous regions available and limits external fragmentation, although it cannot eliminate it.
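The "predictable address" can be made concrete: for a block of 2ⁿ pages aligned to its own size, the buddy's page frame number differs only in bit n. The following user-space sketch shows that calculation. The function names and `MAX_ORDER_SKETCH` are hypothetical, and this models only the address arithmetic, not the kernel's free lists or locking.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ORDER_SKETCH 10u /* hypothetical cap on block order */

/* Page frame number of the buddy of the 2^order-page block at pfn.
 * The block must be aligned to its own size. */
static uint64_t buddy_pfn(uint64_t pfn, unsigned order)
{
    assert(order <= MAX_ORDER_SKETCH);
    assert((pfn & ((1ULL << order) - 1)) == 0);
    return pfn ^ (1ULL << order); /* flip bit 'order' */
}

/* Start of the 2^(order+1)-page block formed when a block and its
 * buddy merge: the lower of the two addresses. */
static uint64_t merged_pfn(uint64_t pfn, unsigned order)
{
    return pfn & buddy_pfn(pfn, order);
}

/* Smallest order whose block holds at least n pages. */
static unsigned order_for_pages(uint64_t n)
{
    unsigned order = 0;

    assert(n > 0);
    while ((1ULL << order) < n)
        order++;
    return order;
}
```

For example, the order-2 block at page frame 12 has its buddy at frame 8, and the two merge into an order-3 block starting at frame 8. A request for 5 pages rounds up to order 3 (8 pages), which is the internal waste that power-of-two sizing accepts in exchange for cheap merging.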
gfp_t: Controlling Allocation Behavior
Every allocation call takes a gfp_t (Get Free Page flags) bitmask. Flags fall into three categories:
Action modifiers — how to allocate:
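| Flag | Meaning |
|---|---|
| __GFP_WAIT | Allocator may sleep/block |
| __GFP_IO | May start disk I/O |
| __GFP_FS | May invoke filesystem operations |
| __GFP_HIGH | Use emergency reserves |

Note: __GFP_WAIT is the name used by older kernels. Since Linux 4.4 the "may sleep" behavior is expressed through the reclaim flags (__GFP_DIRECT_RECLAIM, combined as __GFP_RECLAIM), so check the headers of the kernel you are targeting.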
Zone modifiers — where to allocate from:
| Flag | Zone preference |
|---|---|
| (none) | ZONE_NORMAL (fallback to ZONE_DMA) |
| __GFP_HIGHMEM | ZONE_HIGHMEM |
| __GFP_DMA | ZONE_DMA |
Type flags — convenient combinations you should reach for first:
| Flag | Expands to / behavior | Use when |
|---|---|---|
| GFP_KERNEL | __GFP_WAIT \| __GFP_IO \| __GFP_FS | Normal kernel code; may sleep |
| GFP_ATOMIC | __GFP_HIGH | Interrupt handlers, spinlock-held sections; never sleeps |
| GFP_NOWAIT | Like GFP_ATOMIC but no emergency-pool fallback | Soft-IRQ, tasklets |
| GFP_NOIO | May block, no disk I/O | Block-layer code (avoids recursion) |
| GFP_NOFS | May block and do I/O, no filesystem ops | Filesystem internals |
| GFP_USER | Normal user-space allocation | Allocating for user processes |
| GFP_HIGHUSER | GFP_USER \| __GFP_HIGHMEM | User-space; highmem OK |
| GFP_DMA | Allocate from ZONE_DMA | DMA-capable buffers |
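Since the type flags are just bitwise ORs of action modifiers, the rule "never use a sleeping flag in atomic context" reduces to a bit test. The sketch below models that in user space. Every name prefixed `SK_` or `sk_` is hypothetical and the bit values are arbitrary; they are not the kernel's real `gfp_t` encoding, only a mirror of the combinations listed in the table.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int sketch_gfp_t;

/* Hypothetical bit values for the action modifiers. */
#define SK_WAIT 0x01u /* may sleep */
#define SK_IO   0x02u /* may start disk I/O */
#define SK_FS   0x04u /* may call into the filesystem */
#define SK_HIGH 0x08u /* may use emergency reserves */

/* Type flags, composed as in the table. */
#define SK_GFP_KERNEL (SK_WAIT | SK_IO | SK_FS)
#define SK_GFP_ATOMIC (SK_HIGH)
#define SK_GFP_NOIO   (SK_WAIT)
#define SK_GFP_NOFS   (SK_WAIT | SK_IO)

static bool sk_may_sleep(sketch_gfp_t flags)
{
    return (flags & SK_WAIT) != 0;
}

static bool sk_may_enter_fs(sketch_gfp_t flags)
{
    return (flags & SK_FS) != 0;
}

/* Debug-style check: abort if a sleeping allocation is requested
 * from a context that must not sleep. */
static void sk_assert_context_ok(sketch_gfp_t flags, bool in_atomic_context)
{
    assert(!(in_atomic_context && sk_may_sleep(flags)));
}
```

In this model `sk_assert_context_ok(SK_GFP_ATOMIC, true)` passes, while passing `SK_GFP_KERNEL` with `in_atomic_context` set to true would trip the assertion, which is the mistake the table's "Use when" column is meant to prevent.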
Security note: By default, newly allocated pages retain whatever data a previous owner left behind. Use get_zeroed_page(gfp_mask) when allocating pages destined for user space to prevent information leakage.
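To see why the zeroing matters, consider a toy user-space model in which a single buffer stands in for a recycled page frame. All names here (`sk_frame`, `sk_get_page`, `sk_get_zeroed_page`, `sk_put_page`) are invented for the illustration; only the idea of clearing before handing out corresponds to what `get_zeroed_page()` does.

```c
#include <assert.h>
#include <string.h>

#define SK_PAGE_SIZE 4096

/* One "page frame" that is handed out again and again. */
static unsigned char sk_frame[SK_PAGE_SIZE];

/* Hand out the frame as-is: old contents remain visible. */
static unsigned char *sk_get_page(void)
{
    return sk_frame;
}

/* Hand out the frame after clearing it. */
static unsigned char *sk_get_zeroed_page(void)
{
    memset(sk_frame, 0, sizeof sk_frame);
    return sk_frame;
}

/* Return the frame; note that nothing is scrubbed on free. */
static void sk_put_page(unsigned char *page)
{
    assert(page == sk_frame);
}
```

If one owner writes a secret into the frame and releases it, the next caller of `sk_get_page` can read that secret back, whereas a caller of `sk_get_zeroed_page` sees only zero bytes.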
kmalloc() and vmalloc(): Byte-Granularity Allocation
The buddy system works at page granularity. For arbitrary byte-sized allocations the kernel provides two families:
kmalloc() / kfree()
void *kmalloc(size_t size, gfp_t flags);
void kfree(const void *ptr);
- Returns physically contiguous memory (backed by the buddy system).
- Works in any context (even interrupt context with GFP_ATOMIC).
- Required for DMA buffers and anything that depends on physical contiguity.
- Maximum allocation size is typically a few MB (architecture-dependent).
vmalloc() / vfree()
void *vmalloc(unsigned long size);
void vfree(const void *addr);
- Returns virtually contiguous memory that may be physically scattered.
- Cannot be used for I/O buffers that require physical contiguity.
- May sleep—never call from interrupt context.
- Useful when you need a large virtually contiguous region and physical contiguity is not required.
When to choose which
| Criterion | kmalloc | vmalloc |
|---|---|---|
| Physical contiguity | Guaranteed | Not guaranteed |
| Performance | Faster (direct mapping, no extra page-table setup) | Slower |
| Interrupt-safe | Yes (with GFP_ATOMIC) | No |
| Large allocations | Hard | Easy |
| DMA-capable | Yes | No |
Most kernel code prefers kmalloc for performance. Use vmalloc only when kmalloc cannot satisfy the size.
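The decision in the table can be written down as a small rule. The following user-space sketch encodes it; the `sk_` names are hypothetical, and `SK_KMALLOC_LIMIT` is an assumed 4 MB stand-in for the architecture-dependent "few MB" ceiling mentioned earlier, not a value read from any kernel header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed ceiling for physically contiguous allocations. */
#define SK_KMALLOC_LIMIT ((size_t)4 * 1024 * 1024)

enum sk_allocator { SK_USE_KMALLOC, SK_USE_VMALLOC, SK_NO_FIT };

struct sk_request {
    size_t size;            /* bytes wanted */
    bool needs_phys_contig; /* e.g. a DMA buffer */
    bool atomic_context;    /* caller must not sleep */
};

/* Prefer the kmalloc-style path; fall back to the vmalloc-style path
 * only when the size is too large and nothing rules it out. */
static enum sk_allocator sk_choose(const struct sk_request *r)
{
    assert(r != NULL);
    if (r->size <= SK_KMALLOC_LIMIT)
        return SK_USE_KMALLOC;
    if (r->needs_phys_contig || r->atomic_context)
        return SK_NO_FIT; /* vmalloc may sleep and is not contiguous */
    return SK_USE_VMALLOC;
}
```

A 64 MB lookup table requested from process context selects `SK_USE_VMALLOC`, while the same size for a DMA buffer yields `SK_NO_FIT`, signalling that the design needs a different approach rather than a different allocator.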
The Slab Allocator: Object Caching
The kernel frequently allocates and frees the same types of data structures (e.g., task_struct, inode, dentry). Calling the buddy system every time would be slow and wasteful. The slab allocator solves this with object caching:
- At module/subsystem init time, create a cache for a specific struct type.
- The cache pre-allocates slabs of those objects.
- Allocation = pull a free object off the free list (O(1)).
- Deallocation = push the object back (O(1), no zeroing by default).
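The four steps above can be sketched as a tiny user-space object cache. This is a simplified model, assuming a single fixed-size slab and no locking, NUMA handling, or coloring; `struct my_struct`'s fields and all `sk_` names are invented for the example and do not reflect the kernel's slab internals.

```c
#include <assert.h>
#include <stddef.h>

#define SK_OBJS_PER_SLAB 4

/* Hypothetical object type the cache is dedicated to. */
struct my_struct {
    int id;
    char name[24];
};

/* A slot holds either a live object or a free-list link. */
union sk_slot {
    union sk_slot *next_free;
    struct my_struct obj;
};

struct sk_cache {
    union sk_slot slab[SK_OBJS_PER_SLAB]; /* pre-allocated objects */
    union sk_slot *free_list;
    size_t in_use;
};

/* Thread every slot of the slab onto the free list. */
static void sk_cache_init(struct sk_cache *c)
{
    c->free_list = NULL;
    c->in_use = 0;
    for (size_t i = 0; i < SK_OBJS_PER_SLAB; i++) {
        c->slab[i].next_free = c->free_list;
        c->free_list = &c->slab[i];
    }
}

/* O(1): pop the head of the free list; NULL when the slab is full. */
static struct my_struct *sk_cache_alloc(struct sk_cache *c)
{
    union sk_slot *s = c->free_list;

    if (s == NULL)
        return NULL;
    c->free_list = s->next_free;
    c->in_use++;
    return &s->obj;
}

/* O(1): push the object back; its contents are not zeroed. */
static void sk_cache_free(struct sk_cache *c, struct my_struct *obj)
{
    union sk_slot *s = (union sk_slot *)obj;

    assert(c->in_use > 0);
    s->next_free = c->free_list;
    c->free_list = s;
    c->in_use--;
}
```

Because the free list is last-in, first-out, an object that was just freed is the next one handed out, which tends to keep recently used memory warm in the CPU cache.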
The slab allocator also handles:
- Correct object alignment for the architecture's cache line size.
- NUMA awareness.
- Cache coloring to spread objects across cache sets and reduce conflicts.
Creating and Using a Cache
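#include <linux/slab.h>

struct kmem_cache *my_cache;
// Init (in your init function): create the cache once; NULL means failure
my_cache = kmem_cache_create("my_struct_cache", sizeof(struct my_struct),
                             0, SLAB_HWCACHE_ALIGN, NULL);
if (!my_cache)
        return -ENOMEM;
// Alloc: may sleep because of GFP_KERNEL; check the result for NULL
struct my_struct *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
// Free: return the object to the cache it came from
kmem_cache_free(my_cache, obj);
// Cleanup: destroy the cache only after every object has been freed
kmem_cache_destroy(my_cache);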
Useful Slab Flags
| Flag | Effect |
|---|---|
| SLAB_HWCACHE_ALIGN | Align objects to cache-line boundary; prevents false sharing (costs memory) |
| SLAB_POISON | Fill slab memory with known poison bytes (0x5a for uninitialized, 0x6b for freed, 0xa5 as the end marker in current kernels); helps detect uninitialized-memory access and use-after-free |
| SLAB_RED_ZONE | Add padding around objects; helps detect buffer overflows |
| SLAB_PANIC | Panic the kernel if cache creation fails |
| SLAB_CACHE_DMA | Allocate slab memory from ZONE_DMA |
Per-CPU Data: Lock-Free Per-Core State
Some kernel data is naturally per-core (e.g., run-queue statistics, per-CPU counters). Sharing one variable across all CPUs requires locking. Per-CPU variables give each core its own private copy:
- No locking required — each core only touches its own copy.
- Reduced cache thrashing — no false sharing across cores.
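Conceptually, a per-CPU variable behaves like an array indexed by CPU number (modern kernels implement this with a separate per-CPU memory area for each core rather than a literal array). The API is defined in include/linux/percpu.h: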
// Define a per-CPU integer
DEFINE_PER_CPU(int, my_counter);
// get_cpu_var() disables preemption and yields this CPU's copy as an lvalue;
// never touch my_counter directly by name
get_cpu_var(my_counter)++;
// Re-enable preemption
put_cpu_var(my_counter);
get_cpu_var disables preemption (to prevent migration to another core mid-access) and returns a reference to the current CPU's copy. Always pair it with put_cpu_var.
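The "array indexed by CPU number" picture can be modeled in user space. In the sketch below the CPU index is passed explicitly, because a user-space program has no equivalent of disabling preemption; `SK_NR_CPUS`, the 64-byte cache-line size, and all `sk_` names are assumptions made for the illustration.

```c
#include <assert.h>

#define SK_NR_CPUS    4  /* hypothetical core count */
#define SK_CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Pad each copy to a full cache line so that two cores never
 * write to the same line (no false sharing). */
struct sk_percpu_long {
    long count;
    char pad[SK_CACHE_LINE - sizeof(long)];
};

static struct sk_percpu_long sk_counter[SK_NR_CPUS];

/* Fast path: each core increments only its own copy, so no lock. */
static void sk_counter_inc(unsigned cpu)
{
    assert(cpu < SK_NR_CPUS);
    sk_counter[cpu].count++;
}

/* Slow path: a reader adds up all copies to get the global value. */
static long sk_counter_sum(void)
{
    long total = 0;

    for (unsigned cpu = 0; cpu < SK_NR_CPUS; cpu++)
        total += sk_counter[cpu].count;
    return total;
}
```

The trade-off is visible in the two functions: updates are cheap and private, while reading the global total costs a pass over every core's copy and yields only an approximate snapshot if the cores are still counting.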
Key Takeaways
- struct page tracks every physical page; it's the foundation of all memory management.
- Zones (DMA, Normal, HighMem) reflect hardware constraints; the allocator picks the right zone automatically when you use the correct gfp_t flag.
- Use GFP_KERNEL in process context and GFP_ATOMIC in interrupt/spinlock context; never sleep in atomic context.
- kmalloc gives physically contiguous memory and is the default; vmalloc gives virtually contiguous memory for large allocations that don't need DMA.
- The slab allocator makes frequent allocation/deallocation of fixed-size structs fast through object caching; always prefer a dedicated cache over repeated kmalloc for hot paths.
- Per-CPU variables eliminate locking overhead for truly per-core state; always access them with preemption disabled.