Memory Management

Why This Matters

Every byte your kernel code touches has to come from somewhere. Understanding how the kernel tracks, allocates, and releases memory is essential for writing correct, efficient, and secure kernel modules. Get it wrong and you face memory leaks, use-after-free bugs, kernel panics, or—worst of all—silent information disclosure to user space.


Pages: The Basic Unit

Physical memory is divided into fixed-size chunks called pages (also called frames). The size is determined by the CPU's Memory Management Unit (MMU):

Common page size  Use case
4 KB              Default on most architectures
2 MB              "Huge pages" for reduced TLB pressure
1 GB              "Gigantic pages" for very large working sets

Run getconf PAGESIZE on any Linux box to see the default.
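
The same value can be read from a program. The following is a minimal user-space sketch, assuming a POSIX system; the helper name default_page_size is made up for illustration.

```c
#include <unistd.h>

/* Hypothetical helper: returns the default page size in bytes,
 * the same value that `getconf PAGESIZE` prints. */
long default_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```

On most machines this returns 4096, but the value is architecture-dependent, so code should query it rather than hard-code it.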

Every physical page has exactly one struct page object associated with it, defined in include/linux/mm_types.h. At 64 bytes per struct page (a typical figure; the exact size depends on the kernel configuration), this bookkeeping is cheap but not free: on an 8 GB system with 4 KB pages you have about 2 million pages, consuming roughly 128 MB (about 1.6% of RAM).
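
The overhead figure is plain arithmetic and can be checked with a short user-space sketch. The function name is hypothetical, and the descriptor size is a parameter because 64 bytes is only the figure assumed in the text.

```c
#include <stdint.h>

/* Bytes of struct page bookkeeping for a given amount of RAM.
 * struct_page_size is an input because the real size depends on
 * kernel version and configuration; 64 is the figure used above. */
uint64_t struct_page_overhead(uint64_t ram_bytes, uint64_t page_size,
                              uint64_t struct_page_size)
{
    uint64_t nr_pages = ram_bytes / page_size;

    return nr_pages * struct_page_size;
}
```

For 8 GiB of RAM, 4 KiB pages, and 64-byte descriptors this gives 2,097,152 pages and 134,217,728 bytes (128 MiB) of bookkeeping.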

struct page tells the kernel who owns the page: a user-space process, a statically allocated kernel data structure, the page cache, and so on.


Zones: Partitioning Physical Memory

Not all physical pages are equal. Hardware constraints force the kernel to group pages into zones:

Zone          Purpose
ZONE_DMA      Lowest 16 MB; legacy ISA devices that can only DMA here
ZONE_NORMAL   Directly mapped into the kernel address space
ZONE_HIGHMEM  x86-32 only; above 896 MB, not permanently mapped

On x86-64 there is effectively no high memory problem because the 64-bit address space is large enough to map all of RAM directly. On x86-32 the kernel/user split (1 GB kernel / 3 GB user) limits the kernel to 896 MB of direct mapping, so pages above that threshold land in ZONE_HIGHMEM and must be temporarily mapped before use.
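
As a rough illustration of the x86-32 layout, the sketch below classifies a physical address using the 16 MB and 896 MB boundaries from the table. It is a simplified user-space model, not kernel code: the enum and function names are invented here, and real zone boundaries are set per architecture at boot.

```c
#include <stdint.h>

/* Hypothetical names; the kernel's own zone identifiers differ. */
enum sketch_zone { SKETCH_ZONE_DMA, SKETCH_ZONE_NORMAL, SKETCH_ZONE_HIGHMEM };

#define MiB (1024ULL * 1024ULL)

/* Simplified x86-32 model: classify a physical address by the
 * boundaries given in the zone table (16 MB and 896 MB). */
enum sketch_zone zone_for_phys_addr(uint64_t phys)
{
    if (phys < 16 * MiB)
        return SKETCH_ZONE_DMA;
    if (phys < 896 * MiB)
        return SKETCH_ZONE_NORMAL;
    return SKETCH_ZONE_HIGHMEM;
}
```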

Each zone is described by struct zone in include/linux/mmzone.h. The page allocator consults zone constraints every time it satisfies an allocation request.


The Buddy System: Low-Level Page Allocation

The kernel's lowest-level allocator is the buddy system, whose API lives in include/linux/gfp.h. It allocates memory in page-granularity blocks that are always a power-of-two in size (1, 2, 4, 8, … pages).
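
Because blocks come only in power-of-two page counts, a request is rounded up to the next such size. The user-space sketch below shows that rounding and the memory it wastes; it assumes 4 KB pages, and the function names are hypothetical (inside the kernel, the get_order() helper plays a similar role).

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumption: 4 KB pages */

/* Smallest order n such that 2^n pages cover `size` bytes. */
unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Bytes allocated but not requested (internal fragmentation). */
size_t wasted_bytes(size_t size)
{
    return (SKETCH_PAGE_SIZE << order_for_size(size)) - size;
}
```

A 5-page request (20,480 bytes) is served by an order-3 block of 8 pages, leaving 12,288 bytes unused.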

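The key insight is that every block of size 2ⁿ has a unique "buddy" at a predictable address: the two buddies are the halves of one aligned block of size 2ⁿ⁺¹. When you free a block, the allocator checks whether its buddy is also free; if so, they merge into a block of size 2ⁿ⁺¹, and the check repeats at the next size up. This coalescing keeps large contiguous regions available and limits external fragmentation, although it cannot eliminate it, and rounding every request up to a power of two wastes memory inside the block (internal fragmentation).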
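
The "predictable address" comes down to flipping one bit of the block's page-frame number. The sketch below is a user-space illustration with hypothetical function names; it assumes the block start is aligned to its own size, which holds for every block the buddy system hands out.

```c
#include <stdint.h>

/* Page-frame number of the buddy of the block starting at `pfn`.
 * Assumes `pfn` is aligned to 2^order pages. */
uint64_t buddy_pfn(uint64_t pfn, unsigned int order)
{
    return pfn ^ (UINT64_C(1) << order);
}

/* Start of the order+1 block formed when a block and its buddy merge. */
uint64_t merged_pfn(uint64_t pfn, unsigned int order)
{
    return pfn & ~(UINT64_C(1) << order);
}
```

For example, the order-2 block at page frame 12 has its buddy at page frame 8, and merging them yields the order-3 block that starts at page frame 8.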

gfp_t: Controlling Allocation Behavior

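Every allocation call takes a gfp_t (Get Free Page flags) bitmask. Flags fall into three categories. The names below follow the classic layout; since Linux 4.4, __GFP_WAIT has been replaced by __GFP_DIRECT_RECLAIM and __GFP_KSWAPD_RECLAIM, so check the gfp headers in your own kernel tree before relying on an individual bit.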

Action modifiers (how to allocate):

Flag        Meaning
__GFP_WAIT  Allocator may sleep/block
__GFP_IO    May start disk I/O
__GFP_FS    May invoke filesystem operations
__GFP_HIGH  Use emergency reserves

Zone modifiers (where to allocate from):

Flag           Zone preference
(none)         ZONE_NORMAL (fallback to ZONE_DMA)
__GFP_HIGHMEM  ZONE_HIGHMEM
__GFP_DMA      ZONE_DMA

Type flags — convenient combinations you should reach for first:

Flag          Expands to                                      Use when
GFP_KERNEL    __GFP_WAIT | __GFP_IO | __GFP_FS                Normal kernel code; may sleep
GFP_ATOMIC    __GFP_HIGH                                      Interrupt handlers, spinlock-held sections; never sleeps
GFP_NOWAIT    Like GFP_ATOMIC but no emergency-pool fallback  Soft-IRQ, tasklets
GFP_NOIO      May block, no disk I/O                          Block-layer code (avoids recursion)
GFP_NOFS      May block and do I/O, no filesystem ops         Filesystem internals
GFP_USER      Normal user-space allocation                    Allocating for user processes
GFP_HIGHUSER  GFP_USER | __GFP_HIGHMEM                        User-space; highmem OK
GFP_DMA       Allocate from ZONE_DMA                          DMA-capable buffers
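
The type flags are ordinary bitmask combinations, which the user-space model below tries to make concrete. Every SKETCH_ name and bit value here is invented for illustration; the kernel's real definitions use different values and, in recent versions, different modifier names.

```c
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real values differ. */
typedef unsigned int sketch_gfp_t;

#define SKETCH_GFP_WAIT 0x01u  /* may sleep */
#define SKETCH_GFP_IO   0x02u  /* may start disk I/O */
#define SKETCH_GFP_FS   0x04u  /* may call into the filesystem */
#define SKETCH_GFP_HIGH 0x08u  /* may use emergency reserves */

/* Type flags built from the modifiers, following the table. */
#define SKETCH_GFP_KERNEL (SKETCH_GFP_WAIT | SKETCH_GFP_IO | SKETCH_GFP_FS)
#define SKETCH_GFP_ATOMIC (SKETCH_GFP_HIGH)
#define SKETCH_GFP_NOIO   (SKETCH_GFP_WAIT)
#define SKETCH_GFP_NOFS   (SKETCH_GFP_WAIT | SKETCH_GFP_IO)

bool sketch_may_sleep(sketch_gfp_t flags)
{
    return (flags & SKETCH_GFP_WAIT) != 0;
}

bool sketch_may_do_io(sketch_gfp_t flags)
{
    return (flags & SKETCH_GFP_IO) != 0;
}

bool sketch_may_enter_fs(sketch_gfp_t flags)
{
    return (flags & SKETCH_GFP_FS) != 0;
}
```

Read this way, the difference between GFP_KERNEL, GFP_NOFS, and GFP_NOIO is just which permission bits are cleared, while GFP_ATOMIC clears the "may sleep" bit entirely.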

Security note: By default, newly allocated pages retain whatever data a previous owner left behind. Use get_zeroed_page(gfp_mask) when allocating pages destined for user space to prevent information leakage.
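
To see why stale contents matter, the user-space model below reuses a single buffer as a stand-in for a recycled page. It is only an analogy with hypothetical names; in the kernel the zeroing is done by get_zeroed_page() itself.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* A one-page "allocator": the same buffer is handed out every time,
 * so whatever the previous owner wrote is still there on reuse. */
static unsigned char sketch_page[SKETCH_PAGE_SIZE];

unsigned char *sketch_get_page(bool zeroed)
{
    if (zeroed)
        memset(sketch_page, 0, sizeof(sketch_page));
    return sketch_page;
}

/* True if every byte in the page is zero. */
bool sketch_page_is_clean(const unsigned char *page)
{
    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
        if (page[i] != 0)
            return false;
    return true;
}
```

If one owner writes a secret into the page and a later caller asks for a non-zeroed page, the secret is still readable; requesting the zeroed variant removes it before the page changes hands.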


kmalloc() and vmalloc(): Byte-Granularity Allocation

The buddy system works at page granularity. For arbitrary byte-sized allocations the kernel provides two families:

kmalloc() / kfree()

void *kmalloc(size_t size, gfp_t flags);
void  kfree(const void *ptr);

vmalloc() / vfree()

void *vmalloc(unsigned long size);
void  vfree(const void *addr);

When to choose which

Criterion            kmalloc                                             vmalloc
Physical contiguity  Guaranteed                                          Not guaranteed
Performance          Faster (direct mapping, no extra page-table setup)  Slower
Interrupt-safe       Yes (with GFP_ATOMIC)                               No
Large allocations    Hard                                                Easy
DMA-capable          Yes                                                 No

Most kernel code prefers kmalloc for performance. Use vmalloc only when kmalloc cannot satisfy the size.


The Slab Allocator: Object Caching

The kernel frequently allocates and frees the same types of data structures (e.g., task_struct, inode, dentry). Calling the buddy system every time would be slow and wasteful. The slab allocator solves this with object caching:

  1. At module/subsystem init time, create a cache for a specific struct type.
  2. The cache pre-allocates slabs of those objects.
  3. Allocation = pull a free object off the free list (O(1)).
  4. Deallocation = push the object back (O(1), no zeroing by default).
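
The four steps above can be sketched in user space as a fixed-size object cache with a free list threaded through the unused objects. This is a toy model under simplifying assumptions (one slab, no locking, no per-CPU layer), and all names are hypothetical rather than the kernel's.

```c
#include <stddef.h>
#include <stdlib.h>

struct obj_cache {
    void *slab;       /* one contiguous block holding every object */
    void *free_list;  /* head of a list threaded through free objects */
    size_t obj_size;
};

/* Steps 1 and 2: create the cache and pre-allocate its objects. */
int cache_init(struct obj_cache *c, size_t obj_size, size_t nr_objs)
{
    /* Each free object stores the "next" pointer in its own first
     * bytes, so round the size up to a multiple of the pointer size. */
    size_t align = sizeof(void *);

    obj_size = (obj_size + align - 1) / align * align;
    if (obj_size == 0)
        obj_size = align;

    c->slab = malloc(obj_size * nr_objs);
    if (!c->slab)
        return -1;
    c->obj_size = obj_size;
    c->free_list = NULL;
    for (size_t i = 0; i < nr_objs; i++) {
        void *obj = (char *)c->slab + i * obj_size;

        *(void **)obj = c->free_list;
        c->free_list = obj;
    }
    return 0;
}

/* Step 3: pop the first free object; O(1). NULL when the cache is empty. */
void *cache_alloc(struct obj_cache *c)
{
    void *obj = c->free_list;

    if (obj)
        c->free_list = *(void **)obj;
    return obj;
}

/* Step 4: push the object back; O(1), contents are not zeroed. */
void cache_free(struct obj_cache *c, void *obj)
{
    *(void **)obj = c->free_list;
    c->free_list = obj;
}

void cache_destroy(struct obj_cache *c)
{
    free(c->slab);
    c->slab = NULL;
    c->free_list = NULL;
}
```

Because the free list is last-in, first-out, a freed object is the next one handed out, which tends to keep recently used memory warm in the CPU cache.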

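The slab allocator also handles cache-line alignment, debugging aids such as poisoning and red zones, and DMA-zone placement; these are selected with the creation flags listed under Useful Slab Flags.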

Creating and Using a Cache

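#include <linux/slab.h>

struct kmem_cache *my_cache;

// Init: create the cache once, at module/subsystem init time
my_cache = kmem_cache_create("my_struct_cache", sizeof(struct my_struct),
                              0, SLAB_HWCACHE_ALIGN, NULL);
if (!my_cache)
        return -ENOMEM;

// Alloc: may sleep with GFP_KERNEL; returns NULL on failure
struct my_struct *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
if (!obj)
        return -ENOMEM;

// Free: return the object to its cache
kmem_cache_free(my_cache, obj);

// Cleanup: free every object before destroying the cache
kmem_cache_destroy(my_cache);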

Useful Slab Flags

Flag                Effect
SLAB_HWCACHE_ALIGN  Align objects to cache-line boundary; prevents false sharing (costs memory)
SLAB_POISON         Fill slab memory with a known poison pattern; helps detect use of uninitialized or freed memory
SLAB_RED_ZONE       Add guard padding around objects; helps detect buffer overflows
SLAB_PANIC          Panic the kernel if cache creation fails
SLAB_CACHE_DMA      Allocate slab memory from ZONE_DMA
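
The red-zone idea can be modeled in user space by surrounding each object with guard bytes and checking them later. This is a sketch of the general technique, not the kernel's implementation; the names, guard size, and marker value are arbitrary choices made for the example.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_RZ_BYTES   16    /* guard bytes on each side (arbitrary) */
#define SKETCH_RZ_PATTERN 0xBB  /* arbitrary marker value */

/* Allocate `size` usable bytes with a guard region before and after. */
void *rz_alloc(size_t size)
{
    unsigned char *raw = malloc(size + 2 * SKETCH_RZ_BYTES);

    if (!raw)
        return NULL;
    memset(raw, SKETCH_RZ_PATTERN, SKETCH_RZ_BYTES);
    memset(raw + SKETCH_RZ_BYTES + size, SKETCH_RZ_PATTERN, SKETCH_RZ_BYTES);
    return raw + SKETCH_RZ_BYTES;
}

/* True if both guard regions still hold the marker pattern. */
bool rz_intact(const void *obj, size_t size)
{
    const unsigned char *raw = (const unsigned char *)obj - SKETCH_RZ_BYTES;

    for (size_t i = 0; i < SKETCH_RZ_BYTES; i++) {
        if (raw[i] != SKETCH_RZ_PATTERN)
            return false;
        if (raw[SKETCH_RZ_BYTES + size + i] != SKETCH_RZ_PATTERN)
            return false;
    }
    return true;
}

/* Release an object obtained from rz_alloc(). */
void rz_free(void *obj)
{
    if (obj)
        free((unsigned char *)obj - SKETCH_RZ_BYTES);
}
```

A write one byte past the end of the object lands in the trailing guard region, so the next rz_intact() check reports the overrun instead of letting it silently corrupt a neighbor.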

Per-CPU Data: Lock-Free Per-Core State

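Some kernel data is naturally per-core (e.g., run-queue statistics, per-CPU counters). Sharing one variable across all CPUs requires locking or atomic operations, and the cache line holding it bounces between cores. Per-CPU variables give each core its own private copy.

Conceptually, per-CPU variables behave like arrays indexed by CPU number (current kernels implement them with a separate per-CPU memory area for each CPU rather than a literal array). The API is defined in include/linux/percpu.h: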

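#include <linux/percpu.h>

// Define a per-CPU integer
DEFINE_PER_CPU(int, my_counter);

// get_cpu_var() disables preemption and yields this CPU's copy as an
// lvalue; the bare name my_counter must not be accessed directly
get_cpu_var(my_counter)++;

// put_cpu_var() re-enables preemption
put_cpu_var(my_counter);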

get_cpu_var disables preemption (to prevent migration to another core mid-access) and returns a reference to the current CPU's copy. Always pair it with put_cpu_var.
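
The array view of per-CPU data can be modeled in user space as one counter slot per CPU, with a reader that sums the slots. The names and the fixed CPU count are assumptions for this sketch; in the kernel, the reading side would walk the CPUs with for_each_possible_cpu() and per_cpu().

```c
#define SKETCH_NR_CPUS 4  /* assumption: a 4-core machine */

/* One slot per CPU; each CPU only ever writes its own slot. */
static long sketch_counter[SKETCH_NR_CPUS];

/* Called "on" CPU `cpu` (must be below SKETCH_NR_CPUS); no lock is
 * needed because no other CPU writes this slot. */
void sketch_counter_inc(unsigned int cpu)
{
    sketch_counter[cpu]++;
}

/* The global total is the sum of every CPU's private count. */
long sketch_counter_sum(void)
{
    long total = 0;

    for (unsigned int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        total += sketch_counter[cpu];
    return total;
}
```

The model also shows the trade-off: updates are cheap and lock-free, but a total read while other CPUs are still incrementing is only a snapshot.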


Key Takeaways

  1. struct page tracks every physical page; it's the foundation of all memory management.
  2. Zones (DMA, Normal, HighMem) reflect hardware constraints; the allocator picks the right zone automatically when you use the correct gfp_t flag.
  3. Use GFP_KERNEL in process context, GFP_ATOMIC in interrupt/spinlock context—never sleep in atomic context.
  4. kmalloc gives physically contiguous memory and is the default; vmalloc gives virtually contiguous memory for large allocations that don't need DMA.
  5. The slab allocator makes frequent allocation/deallocation of fixed-size structs fast through object caching; for hot paths that churn through one struct type, a dedicated cache is usually a better fit than repeated kmalloc calls.
  6. Per-CPU variables eliminate locking overhead for truly per-core state; always access them with preemption disabled.

Practice

  1. On a machine with 8 GB of RAM and 4 KB pages, approximately how much memory does the kernel reserve for struct page objects (at 64 bytes each)?
  2. Which gfp_t type flag must you use when allocating memory from within an interrupt handler?
  3. A kernel developer needs to allocate a 64 MB buffer that will NOT be used for DMA. Which allocator is the best choice?
  4. Which slab allocator flag should you set to detect buffer overruns that write past the end of an allocated object?
  5. On x86-32 Linux, physical memory above which boundary is placed in ZONE_HIGHMEM and requires temporary mapping before the kernel can access it?
  6. Explain why get_zeroed_page() should be used instead of a plain page allocation when the page will be handed to a user-space process.
  7. Describe how the buddy allocator limits external memory fragmentation, and give one downside of its approach.
  8. A kernel module allocates many instances of struct my_node throughout its lifetime. Why is creating a dedicated slab cache better than calling kmalloc(sizeof(struct my_node), GFP_KERNEL) each time?
  9. A filesystem driver calls kmalloc(size, GFP_KERNEL) while holding no locks and running in process context. Inside that allocation path the kernel tries to reclaim memory by writing dirty pages to disk. Which flag change would prevent the kernel from initiating disk I/O during this reclaim attempt, while still allowing the allocation to block?
  10. A developer writes code that increments a global counter from multiple CPUs using a per-CPU variable. They notice the global total is sometimes incorrect when they sum the per-CPU values. What are the two things they must do correctly to read the per-CPU variable safely, and why?