Linux Process Scheduler

Why This Matters

Every line of kernel code you write will run under the scheduler's control, and every user-space program you profile is subject to its decisions. Understanding the scheduler helps you reason about latency, throughput, priority inversion, and why your kernel module might not fire exactly when you expect. The scheduler is also a prime example of an elegant, extensible kernel subsystem — studying it teaches broader kernel design patterns.


What Is Process Scheduling?

The processor scheduler decides which process runs next, when it runs, and for how long. Its goals are:


Multitasking: Cooperative vs. Preemptive

Modern Linux is a preemptive multitasking OS, but it helps to understand both models:

Model Who controls the CPU Used by
Cooperative The running process (yields voluntarily) Old Windows (3.1), some language runtimes (Go's old runtime)
Preemptive The OS (can interrupt at any time) All modern OSes including Linux

In preemptive scheduling the OS sets a timer interrupt. When it fires, the kernel can switch to a higher-priority runnable process — even if the current one is in an infinite loop. This is the key insight: the hardware timer gives the kernel a guaranteed re-entry point.


I/O-Bound vs. CPU-Bound Processes

Scheduling policy needs to account for the character of each workload:

Characteristic I/O-bound (e.g. vim, shell) CPU-bound (e.g. MATLAB, video encoder)
Where time is spent Blocked waiting for I/O Executing on the CPU
Run duration per burst Short Long
What matters most Low latency (responsiveness) High throughput (cache warmth)

A good scheduler gives interactive tasks the CPU quickly when they wake up, while letting batch jobs run long enough to keep CPU caches hot.


Process Priority in Linux

Linux uses two separate priority spaces:

Nice values (time-sharing processes)

Real-time priority

You can inspect both with:

ps ax -eo pid,ni,rtprio,cmd

Time Slice: The Challenge

A time slice is the period a process runs uninterrupted before the scheduler considers preempting it.

Fixed time slices create a dilemma:

Real-time Round Robin (SCHED_RR) still uses a fixed slice:

$ cat /proc/sys/kernel/sched_rr_timeslice_ms
100

CFS solves the dilemma differently (see below).


The Completely Fair Scheduler (CFS)

CFS is the default scheduler for normal (non-real-time) processes in Linux.

Core idea

Rather than a fixed time slice, CFS gives each process a proportional share of the CPU:

At any moment, every runnable process of the same priority should have received the same total CPU time.

If we could run N tasks simultaneously, each would get 1/N of the CPU. CFS approximates this by running one process at a time, always picking the one that has accumulated the least CPU time so far.

Dynamic time slice formula

time_slice = (weight of task) / (total weight of all runnable tasks)
             × targeted latency

Preemption rule

When a process P becomes runnable (e.g., wakes from I/O), CFS compares P's accumulated CPU time to the currently running process C. If P has consumed less CPU time than C, P preempts C immediately.

Text editor vs. video encoder example

With both processes at equal priority (50% each):

  1. The video encoder runs while the text editor sleeps waiting for keystrokes.
  2. When the user presses a key, the text editor wakes up.
  3. CFS observes the text editor has used far less CPU time than the encoder.
  4. The text editor immediately preempts the encoder.
  5. After handling the keypress, the text editor sleeps again; the encoder resumes.

Result: excellent interactive performance and good CPU-bound throughput — no fixed priority needed.


Scheduler Class Architecture

Linux uses a modular, pluggable scheduler design. Each scheduler class implements a specific algorithm and has a fixed priority relative to other classes.

Available classes

Class Type Algorithm Key use
SCHED_FIFO Real-time First-in, first-out Hard real-time, runs until it blocks or yields
SCHED_RR Real-time Round-robin with fixed slice Soft real-time, fairer among RT tasks
SCHED_DEADLINE Real-time Sporadic task / EDF Tasks with explicit deadlines
SCHED_NORMAL Time-sharing CFS Default for all user processes
SCHED_BATCH Time-sharing CFS (batch variant) Background batch jobs

The base scheduler (kernel/sched/core.c) iterates classes in priority order — real-time classes are checked before CFS, so any runnable RT task always wins.

sched_class — the abstract interface

Each scheduler class exposes a sched_class struct of function pointers. Important callbacks include:

CFS's implementation lives in kernel/sched/fair.c.

task_struct scheduler fields

Each process descriptor (task_struct) carries scheduler-related data:

When does scheduling happen?

The base scheduler fires in exactly two situations:

  1. Timer interruptscheduler_tick() — checks whether the current process has exhausted its share; sets a TIF_NEED_RESCHED flag if so.
  2. Explicit kernel callschedule() — invoked when a process blocks (e.g., waits on I/O) or voluntarily yields.

This two-path design means scheduling is both periodic (driven by the hardware clock) and event-driven (driven by blocking operations).


Key Takeaways

Practice

  1. What is the primary responsibility of the Linux process scheduler?
  2. In Linux, a process has a nice value of -10. Which statement is correct?
  3. Which of the following best describes cooperative multitasking?
  4. In the Linux Completely Fair Scheduler, when does process P preempt the currently running process C?
  5. CFS sets a 'targeted latency'. What does this parameter control?
  6. Why does CFS use a proportional, dynamic time slice rather than a fixed time slice? What problem does this solve?
  7. A system has two scheduler classes active: SCHED_FIFO (real-time) and SCHED_NORMAL (CFS). A SCHED_NORMAL process is currently running. A SCHED_FIFO process becomes runnable. What happens?
  8. Which two events trigger the base scheduler code in the Linux kernel?
  9. You are writing a kernel module (as in A6) that hooks pick_next_task_fair() with Kprobes to count how often each PID is scheduled. Explain why hooking this specific function gives you accurate per-process scheduling counts, and name one limitation of this approach.
  10. A process descriptor (task_struct) stores a field called 'policy'. Which value would you expect for a normal interactive shell process running on Linux?