Process Management in the Linux Kernel

Why this matters

Every system call you write, every kernel module you build, and every driver you debug runs in the context of a process. Understanding how the kernel tracks, schedules, creates, and destroys processes is foundational — it explains why current works, why threads behave the way they do, and what really happens when a program crashes or exits.


What Is a Process?

A process is a program currently executing in the system. The kernel tracks four kinds of per-process state:

Component         Examples
CPU registers     instruction pointer, stack pointer, general-purpose registers
Program code      the text section mapped from the ELF binary
Memory segments   data, BSS, heap, stack
Kernel resources  open file descriptors, pending signals, address space descriptor

Processes also own one or more threads: concurrent flows of execution within the same address space. As explained below, the kernel represents a thread with the same structure it uses for a process.

From user space, three system calls drive the process lifecycle: fork() creates a process, exec() loads a new program into it, and exit() terminates it.


The Process Descriptor: task_struct

Every process is represented by a struct task_struct (defined in include/linux/sched.h). This is one of the largest structs in the kernel — it holds everything about a process.

Allocation and access

task_struct is heap-allocated (from a dedicated slab cache), not placed on the kernel stack. This is a deliberate security decision: a descriptor living on the stack could be corrupted by a kernel stack overflow.

To access the current process efficiently, the kernel uses a per-CPU variable:

/* arch/x86/include/asm/current.h */
DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
    return this_cpu_read_stable(current_task);
}

#define current get_current()

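The current macro is meaningful only when the kernel runs in process context (e.g., inside a system call). In interrupt context, current still points at whichever task happened to be interrupted, but that task has nothing to do with the interrupt, so code running there must not rely on it.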

Process ID (PID)

Each process gets a unique identifier of type pid_t, stored in task_struct.pid.
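From user space, the same identifiers are visible through getpid() and getppid(). The following is a minimal sketch, not a definitive test: the helper name child_sees_parent_pid is hypothetical, and it assumes a POSIX system where fork() is available.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a forked child sees the caller's PID
 * as its parent PID, 0 if it does not, and -1 on error. */
int child_sees_parent_pid(void)
{
    pid_t parent = getpid();   /* PID of the calling process */
    pid_t child = fork();      /* child's PID in the parent, 0 in the child */

    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: compare its parent PID and report through the exit status. */
        _exit(getppid() == parent ? 0 : 1);
    }

    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;

    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling child_sees_parent_pid() from any ordinary process should return 1, since the child is reaped before the parent can exit.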

Process states (task->__state)

State                  Meaning
TASK_RUNNING           Runnable: either executing on a CPU or waiting in the scheduler run queue. Can be in user space or kernel space.
TASK_INTERRUPTIBLE     Sleeping, waiting for a condition. Wakes on the condition or on a signal.
TASK_UNINTERRUPTIBLE   Sleeping, waiting for a condition. Does not wake on signals; used for I/O that must not be interrupted (e.g., disk reads).
__TASK_TRACED          Being traced by another process (e.g., a debugger via ptrace).
__TASK_STOPPED         Execution paused by a signal such as SIGSTOP. Neither running nor waiting to run.
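Tools such as ps and the /proc/<pid>/stat file report these states as single letters. As a rough sketch, the mapping can be written down as a lookup; the helper name task_state_name is hypothetical, and the letter codes are the ones documented for ps and /proc (Z refers to the EXIT_ZOMBIE exit state discussed under termination, not to a value of task->__state).

```c
/* Hypothetical helper: map the one-letter state code shown by ps and
 * /proc/<pid>/stat to the kernel state name it stands for. */
const char *task_state_name(char code)
{
    switch (code) {
    case 'R': return "TASK_RUNNING";          /* running or runnable */
    case 'S': return "TASK_INTERRUPTIBLE";    /* sleeping, wakes on signals */
    case 'D': return "TASK_UNINTERRUPTIBLE";  /* sleeping, ignores signals */
    case 'T': return "__TASK_STOPPED";        /* stopped by a job-control signal */
    case 't': return "__TASK_TRACED";         /* stopped by a tracer */
    case 'Z': return "EXIT_ZOMBIE";           /* exited, not yet reaped */
    default:  return "unknown";               /* codes this sketch does not cover */
    }
}
```

A process stuck in the D state for a long time is usually blocked on I/O, which is why it cannot be interrupted with SIGINT.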

The Process Family Tree

The Linux process tree is rooted at init (PID 1), launched by the kernel as the final boot step. On most modern distributions, the init program is systemd. The statically allocated global init_task is the descriptor of the boot-time idle task (PID 0), which spawns both init and kthreadd.

Every task_struct carries links for navigating the tree:

current->parent      // my parent
current->children    // list of my children
current->sibling     // my siblings under the same parent
current->tasks       // node in the global list of all tasks

Helper macros next_task(t) and for_each_process(t) make it easy to walk the full task list.


Process Creation

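The Linux kernel has no single system call that creates a process running a new program (nothing like Windows' CreateProcess; posix_spawn() is a library wrapper). Instead, creation takes two steps: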

  1. fork() creates a child — a near-copy of the parent (different PID, PPID, and a few resource counters).
  2. exec() replaces the child's copied address space with a new program image loaded from a binary.
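The two steps above can be sketched from user space as follows. This is an illustration under stated assumptions rather than a reference implementation: the helper name run_shell_command is hypothetical, and the code assumes a shell is installed at /bin/sh.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `sh -c <command>` in a child process and return
 * its exit status, or -1 on failure. Assumes /bin/sh exists. */
int run_shell_command(const char *command)
{
    pid_t pid = fork();                  /* step 1: duplicate the caller */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* step 2: replace the child's program image with the shell */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                      /* reached only if exec failed */
    }

    /* Parent: wait for the child and decode its exit status. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_shell_command("exit 7") should return 7. Note that exec() does not return on success, so the _exit(127) line only runs when the new program could not be loaded.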

Copy-on-Write (CoW)

Naively copying all parent pages on fork() would be expensive. Linux uses Copy-on-Write:

  1. fork() duplicates only the page tables, marking writable private pages read-only in both parent and child.
  2. When either process tries to write a page, a page fault fires.
  3. The kernel copies just that page, remaps it as read-write in the faulting process, and resumes.

Result: fork() is fast (no data is copied upfront) and memory-efficient (read-only pages are shared until a write occurs).
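CoW itself is invisible to programs; what can be observed is its effect, namely that a write in the child never shows up in the parent. The sketch below checks only that visible behaviour (it cannot prove when the kernel copied the page). The names private_value and child_write_is_private are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int private_value = 1;   /* lives in the data segment */

/* Hypothetical helper: returns 1 if a write made by the child stayed
 * private to the child, 0 if the parent saw it, and -1 on error. */
int child_write_is_private(void)
{
    private_value = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* If the page is still shared, this write faults and the kernel
         * gives the child its own copy before the store completes. */
        private_value = 2;
        _exit(private_value);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 2)
        return -1;               /* the child did not see its own write */

    return private_value == 1;   /* the parent's copy is untouched */
}
```

The function should return 1: the child observes the value 2, while the parent still reads 1 after the child has exited.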

Inside fork() / clone()

fork() is implemented via the clone() system call. The kernel path is:

fork()  →  clone()  →  kernel_clone()  →  copy_process()  →  wake_up_new_task()

copy_process() does the heavy lifting:

  1. dup_task_struct() allocates a new task_struct and kernel stack for the child and copies the parent's descriptor into it
  2. Checks process count limits
  3. Clears fields that must not be inherited (e.g., pending signals)
  4. sched_fork() — sets child state to TASK_NEW
  5. Copies parent's files, signal handlers, address space metadata, etc.
  6. alloc_pid() — assigns a new PID
  7. Returns a pointer to the child task_struct

wake_up_new_task() then transitions the child to TASK_RUNNING.


Threads

In most operating systems, threads and processes are distinct kernel objects. Linux has no separate thread concept.

A thread is simply another process that shares resources with its creator. The clone() flags control what is shared:

// Flags for a thread-like clone (simplified; glibc's pthread_create() also passes CLONE_THREAD and others)
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

Flag           What is shared
CLONE_VM       Virtual address space (same page tables)
CLONE_FS       Filesystem context (root, cwd, umask)
CLONE_FILES    File descriptor table
CLONE_SIGHAND  Signal handlers

Each thread still has its own task_struct and is scheduled independently.
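The effect of sharing the address space is easy to observe from user space: a write made by one thread is visible to its creator, unlike a write made by a fork()ed child. The following is a minimal sketch using the POSIX threads API; the names shared_value, set_shared_value, and thread_write_is_shared are hypothetical, and older toolchains may need the -pthread compiler flag.

```c
#include <pthread.h>

static int shared_value;   /* one copy, visible to every thread in the process */

/* Thread body: overwrite the shared variable. */
static void *set_shared_value(void *arg)
{
    (void)arg;
    shared_value = 2;
    return NULL;
}

/* Hypothetical helper: returns 1 if a write made by a second thread is
 * visible to the caller, 0 if not, and -1 on error. */
int thread_write_is_shared(void)
{
    pthread_t tid;

    shared_value = 1;
    if (pthread_create(&tid, NULL, set_shared_value, NULL) != 0)
        return -1;

    /* Joining guarantees the thread's write happens before the read below. */
    if (pthread_join(tid, NULL) != 0)
        return -1;

    return shared_value == 2;
}
```

The function should return 1, because both threads run on the same page tables and no copy is ever made.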


Kernel Threads

Kernel threads perform background kernel work (e.g., kworker for work queues, migration for CPU load balancing). They differ from user processes in one key way: they have no user-space address space (mm in task_struct is NULL).

All kernel threads are children of kthreadd (PID 2). You can list them with:

ps --ppid 2

Kernel thread API

// Create a thread in the stopped state (returns an ERR_PTR() value on failure)
struct task_struct *t = kthread_create(threadfn, data, "name-%d", i);

// Make it runnable
wake_up_process(t);

// Create and immediately start (combines the two above)
kthread_run(threadfn, data, "name-%d", i);

// Ask the thread to stop: makes kthread_should_stop() return true,
// wakes the thread, and blocks until threadfn returns
kthread_stop(t);

The threadfn should periodically call kthread_should_stop() and return when it is true.


Process Termination

A process terminates when it calls exit() (explicitly, or implicitly when main() returns). The kernel path is:

exit()  →  sys_exit()  →  do_exit()   (kernel/exit.c)

do_exit() steps:

  1. exit_signals() sets the PF_EXITING flag; do_exit() then stores the exit code in task_struct.exit_code
  2. exit_mm() — releases the mm_struct (address space)
  3. exit_sem() — dequeues from any semaphore wait
  4. exit_files() / exit_fs() — decrements reference counts on file descriptors and filesystem objects; frees any that reach zero
  5. exit_notify() — sends signals to the parent, reparents children, sets exit_state = EXIT_ZOMBIE
  6. do_task_dead() — sets state to TASK_DEAD and calls schedule() — the process never returns

Zombie state and cleanup

After do_exit(), the process is a zombie: its task_struct, kernel stack, and thread_info persist so the parent can call wait() to retrieve the exit code. Once the parent calls wait(), release_task() removes the task from the task list and frees the remaining memory.
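The parent's side of this handshake can be sketched in user space. In the example below, the child is a zombie from the moment it exits until the first waitpid() returns; a second waitpid() then fails with ECHILD because nothing is left to reap. The helper name reap_child is hypothetical, and the sketch assumes SIGCHLD has its default disposition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, reap it, and
 * return the exit code the parent collected, or -1 on error. */
int reap_child(int code)
{
    assert(code >= 0 && code <= 255);   /* exit statuses are 8 bits wide */

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)
        _exit(code);    /* child: becomes a zombie until the parent waits */

    int status;
    if (waitpid(pid, &status, 0) != pid)    /* collects the code, frees the zombie */
        return -1;

    /* The child is gone now, so waiting again must fail with ECHILD. */
    if (waitpid(pid, NULL, 0) != -1 || errno != ECHILD)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, reap_child(42) should return 42. A parent that never calls wait() leaves its exited children in the zombie state for as long as the parent itself lives.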

Orphan / zombie prevention

If a parent exits before its children, those children must be reparented:

  1. exit_notify() calls forget_original_parent(), which in turn calls find_new_reaper()
  2. find_new_reaper() returns another thread in the thread group if one exists; otherwise it falls back to a designated subreaper ancestor, if any, and finally to init (PID 1)
  3. All children are reparented to the reaper, which will call wait() for them

This guarantees that every orphaned child has a parent that will reap it, so zombies cannot accumulate indefinitely.
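Reparenting can be observed from user space by orphaning a grandchild and watching its parent PID change. The sketch below is illustrative only: the helper name orphan_is_reparented is hypothetical, the polling interval and the roughly five-second timeout are arbitrary choices, and the new parent may be init or a subreaper depending on the environment.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if an orphaned grandchild observed a new
 * parent PID, 0 if it timed out waiting, and -1 on error. */
int orphan_is_reparented(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t middle = fork();
    if (middle < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (middle == 0) {
        close(fds[0]);
        pid_t original_parent = getpid();
        pid_t grandchild = fork();
        if (grandchild < 0)
            _exit(1);
        if (grandchild == 0) {
            char result = '0';
            /* Poll until the parent PID changes (about 5 s at most). */
            for (int i = 0; i < 5000; i++) {
                if (getppid() != original_parent) {
                    result = '1';
                    break;
                }
                usleep(1000);
            }
            if (write(fds[1], &result, 1) != 1)
                _exit(1);
            _exit(0);
        }
        _exit(0);   /* middle process exits, orphaning the grandchild */
    }

    close(fds[1]);
    int status;
    if (waitpid(middle, &status, 0) != middle) {   /* reap the middle process */
        close(fds[0]);
        return -1;
    }

    char result;
    ssize_t n = read(fds[0], &result, 1);   /* blocks until the grandchild reports */
    close(fds[0]);
    if (n != 1)
        return -1;

    return result == '1';
}
```

The function should return 1 on a normal Linux system. The grandchild itself is later reaped by whichever process adopted it, not by the caller.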


Key Takeaways

Practice

  1. Why is task_struct allocated on the heap rather than the kernel stack?
  2. A process is waiting for data from a slow NFS mount and must not be woken by signals (e.g., SIGINT). Which task state should it be placed in?
  3. What does Copy-on-Write (CoW) mean in the context of fork()?
  4. In Linux, how is a POSIX thread different from a regular process at the kernel level?
  5. Which kernel function is the direct implementation of the fork() system call, and what does it call first?
  6. What distinguishes a kernel thread from a regular user process in terms of task_struct?
  7. Explain the zombie process state: when does it occur, what data is retained, and how is it cleaned up?
  8. You are writing a kernel module that needs to periodically flush a cache in the background. Sketch (in pseudocode or prose) how you would create, run, and cleanly stop a kernel thread for this purpose, naming the relevant API functions.
  9. A process calls exit(). Which sequence correctly describes what do_exit() does with the process's memory and file resources before setting the zombie state?
  10. Why is the current macro only meaningful in process context, and what alternative context exists in the kernel where it is not valid?