Linux Process Scheduler

Why This Matters

Every line of kernel code you write will run under the scheduler's control, and every user-space program you profile is subject to its decisions. Understanding the scheduler helps you reason about latency, throughput, priority inversion, and why your kernel module might not fire exactly when you expect. The scheduler is also a prime example of an elegant, extensible kernel subsystem — studying it teaches broader kernel design patterns.

What Is Process Scheduling?

The process scheduler decides which process runs next, when it runs, and for how long. Its goals are:

Utilization: keep the CPU busy with runnable work instead of idling while a process is blocked.
Fairness: don't starve low-priority work.
Responsiveness: honor higher-priority processes promptly.

Multitasking: Cooperative vs. Preemptive

Modern Linux is a preemptive multitasking OS, but it helps to understand both models:

Model	Who controls the CPU	Used by
Cooperative	The running process (yields voluntarily)	Windows 3.1, some language runtimes (Go before 1.14)
Preemptive	The OS (can interrupt at any time)	All modern OSes including Linux

In preemptive scheduling the OS programs a hardware timer to interrupt the CPU periodically. When the interrupt fires, the kernel can switch to a higher-priority runnable process, even if the current one is in an infinite loop. This is the key insight: the hardware timer gives the kernel a guaranteed re-entry point.

I/O-Bound vs. CPU-Bound Processes

Scheduling policy needs to account for the character of each workload:

Characteristic	I/O-bound (e.g. `vim`, shell)	CPU-bound (e.g. MATLAB, video encoder)
Where time is spent	Blocked waiting for I/O	Executing on the CPU
Run duration per burst	Short	Long
What matters most	Low latency (responsiveness)	High throughput (cache warmth)

A good scheduler gives interactive tasks the CPU quickly when they wake up, while letting batch jobs run long enough to keep CPU caches hot.

Process Priority in Linux

Linux uses two separate priority spaces:

Nice values (time-sharing processes)

Range: −20 to +19 (default 0)
Lower nice = higher priority (think of "nice" as how much you yield to others)
Set with nice -n 5 vim; change a running process with sudo renice -n -5 -p $(pidof vim)
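The same nice values can be read and changed from inside a program. The following is a minimal Python sketch, assuming a Unix system; `os.getpriority` and `os.nice` are standard-library calls, while the helper names `current_nice` and `lower_priority` are made up for this example.

```python
import os

def current_nice() -> int:
    """Return the nice value of the calling process."""
    return os.getpriority(os.PRIO_PROCESS, 0)

def lower_priority(increment: int) -> int:
    """Raise our own nice value by `increment` (hypothetical helper).

    Raising niceness needs no privileges; lowering it usually requires root.
    The kernel clamps the result to the top of the allowed range.
    """
    if increment < 0:
        raise ValueError("this sketch only lowers priority")
    return os.nice(increment)

if __name__ == "__main__":
    print("nice before:", current_nice())
    print("nice after :", lower_priority(5))
```

Only the direction that needs no privileges is shown; going back down toward negative nice values is the case that requires sudo in the renice command above.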

Real-time priority

Range: 1 to 99 for SCHED_FIFO and SCHED_RR tasks (0 is used for tasks under non-real-time policies)
Higher value = higher real-time priority
Real-time processes always run before any normal (nice-based) process

You can inspect both with:

ps -eo pid,ni,rtprio,cmd
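A program can query the same information about itself. This is a sketch assuming Linux and Python 3, where the `os.sched_*` functions are available; the function names `describe` and `rt_priority_range` and the `POLICY_NAMES` table are illustrative, not part of any API.

```python
import os

# Human-readable names for the policy constants Python exposes on Linux.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER (SCHED_NORMAL)",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
}

def describe(pid: int = 0) -> dict:
    """Report policy, real-time priority, and nice value (pid 0 = ourselves)."""
    policy = os.sched_getscheduler(pid)
    return {
        "policy": POLICY_NAMES.get(policy, f"policy {policy}"),
        "rtprio": os.sched_getparam(pid).sched_priority,
        "nice": os.getpriority(os.PRIO_PROCESS, pid),
    }

def rt_priority_range(policy: int = os.SCHED_FIFO) -> tuple:
    """Return the (min, max) static priority the kernel accepts for a policy."""
    return os.sched_get_priority_min(policy), os.sched_get_priority_max(policy)

if __name__ == "__main__":
    print(describe())
    print("SCHED_FIFO priority range:", rt_priority_range())
```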

Time Slice: The Challenge

A time slice is the period a process runs uninterrupted before the scheduler considers preempting it.

Fixed time slices create a dilemma:

Too long → poor interactive performance (long wait before the text editor gets the CPU back)
Too short → high context-switch overhead (the system spends more time switching than working)
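The trade-off can be made concrete with a little arithmetic. The sketch below assumes a fixed cost per context switch and that every runnable task uses its full slice; the 0.01 ms switch cost and the function names are illustrative assumptions, not measured values.

```python
def switch_overhead(slice_ms: float, switch_cost_ms: float) -> float:
    """Fraction of CPU time spent switching rather than working."""
    return switch_cost_ms / (slice_ms + switch_cost_ms)

def worst_case_wait(slice_ms: float, switch_cost_ms: float, runnable: int) -> float:
    """Wait for one task if every other runnable task runs a full slice first."""
    return (runnable - 1) * (slice_ms + switch_cost_ms)

if __name__ == "__main__":
    cost = 0.01  # assumed context-switch cost in ms
    for slice_ms in (0.1, 1, 10, 100):
        print(
            f"slice={slice_ms:>5} ms  "
            f"overhead={switch_overhead(slice_ms, cost):.2%}  "
            f"wait with 10 tasks={worst_case_wait(slice_ms, cost, 10):.1f} ms"
        )
```

Shrinking the slice lowers the wait but raises the overhead fraction, and growing it does the opposite.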

Real-time Round Robin (SCHED_RR) still uses a fixed slice:

$ cat /proc/sys/kernel/sched_rr_timeslice_ms
100
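The same value can be read from a program. This is a small sketch that assumes the procfs path shown above exists on the running kernel; the helper name `rr_timeslice_ms` is made up here, and it returns None rather than guessing when the file is missing.

```python
from pathlib import Path

RR_SYSCTL = Path("/proc/sys/kernel/sched_rr_timeslice_ms")

def rr_timeslice_ms(path: Path = RR_SYSCTL):
    """Return the SCHED_RR time slice in milliseconds, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

if __name__ == "__main__":
    print("SCHED_RR slice (ms):", rr_timeslice_ms())
```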

CFS solves the dilemma differently (see below).

The Completely Fair Scheduler (CFS)

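CFS is the default scheduler for normal (non-real-time) processes in Linux kernels from 2.6.23 through 6.5; kernel 6.6 replaced it with the EEVDF scheduler, which keeps the same fair-share goal.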

Core idea

Rather than a fixed time slice, CFS gives each process a proportional share of the CPU:

At any moment, every runnable process of the same priority should have received the same total CPU time.

If we could run N tasks simultaneously, each would get 1/N of the CPU. CFS approximates this by running one process at a time, always picking the one that has accumulated the least CPU time so far.
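The "pick the task with the least accumulated CPU time" rule can be sketched in a few lines. This toy model assumes equal-priority tasks that are always runnable and a fixed one-unit tick; it is an illustration of the idea, not the kernel's data structure (which the text has not described at this point).

```python
def simulate_fair(n_tasks: int, ticks: int) -> list:
    """Run `ticks` time units, always choosing the task with the least runtime.

    Returns the CPU time each task has accumulated.
    """
    runtime = [0] * n_tasks
    for _ in range(ticks):
        # Ties are broken by the lowest task index.
        nxt = min(range(n_tasks), key=runtime.__getitem__)
        runtime[nxt] += 1
    return runtime

if __name__ == "__main__":
    print(simulate_fair(3, 9))
    print(simulate_fair(4, 10))
```

At any point in the run, no two tasks differ by more than one tick, which is the discrete approximation of each task getting 1/N of the CPU.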

Dynamic time slice formula

time_slice = (weight of task) / (total weight of all runnable tasks)
             × targeted latency

Targeted latency: the period within which every runnable process gets scheduled at least once (a tunable kernel parameter).
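Minimum granularity: a floor on the time slice, on the order of 1 ms by default (the exact value depends on kernel version and CPU count), which prevents excessive switching when there are many tasks.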
Weight: derived from the nice value — lower nice → higher weight → larger share.
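The formula can be worked through numerically. The sketch below makes several assumptions that are not stated in the text: a weight of 1024 for nice 0 with roughly a 1.25x change per nice step (an approximation of the kernel's lookup table), a targeted latency of 20 ms, and a 1 ms minimum granularity. The function names are illustrative.

```python
NICE_0_WEIGHT = 1024  # weight assigned to a nice-0 task

def weight(nice: int) -> float:
    """Approximate load weight: each nice step changes it by about 1.25x."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in -20..19")
    return NICE_0_WEIGHT / (1.25 ** nice)

def time_slices(nices, target_latency_ms=20.0, min_granularity_ms=1.0):
    """Slice per task: its share of the targeted latency, floored at the minimum."""
    weights = [weight(n) for n in nices]
    total = sum(weights)
    return [max(w / total * target_latency_ms, min_granularity_ms) for w in weights]

if __name__ == "__main__":
    print("two nice-0 tasks:   ", time_slices([0, 0]))
    print("nice 0 vs nice 5:   ", [round(s, 2) for s in time_slices([0, 5])])
    print("100 nice-0 tasks[0]:", time_slices([0] * 100)[0])
```

With 100 equal tasks the proportional share would be 0.2 ms, so the minimum granularity takes over and the effective period stretches beyond the targeted latency.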

Preemption rule

When a process P becomes runnable (e.g., wakes from I/O), CFS compares P's accumulated CPU time (tracked as a weighted virtual runtime, vruntime) to that of the currently running process C. If P has consumed less than C, P preempts C right away rather than waiting for C's slice to end.

Text editor vs. video encoder example

With both processes at equal priority (50% each):

The video encoder runs while the text editor sleeps waiting for keystrokes.
When the user presses a key, the text editor wakes up.
CFS observes the text editor has used far less CPU time than the encoder.
The text editor immediately preempts the encoder.
After handling the keypress, the text editor sleeps again; the encoder resumes.

Result: excellent interactive performance and good CPU-bound throughput — no fixed priority needed.
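The scenario can be replayed as a toy simulation. This sketch assumes one-tick time units, an encoder that is always runnable, and an editor that needs exactly one tick per keypress; the function name and the tick numbers are invented for illustration.

```python
def simulate(ticks: int, keypress_at: set) -> list:
    """Return which task ran on each tick: 'editor' or 'encoder'."""
    runtime = {"editor": 0, "encoder": 0}
    pending = False          # True while the editor has a keypress to handle
    timeline = []
    for t in range(ticks):
        if t in keypress_at:
            pending = True   # the editor wakes up and becomes runnable
        runnable = ["editor", "encoder"] if pending else ["encoder"]
        # Preemption rule: run whichever runnable task has used the least CPU.
        chosen = min(runnable, key=runtime.get)
        runtime[chosen] += 1
        if chosen == "editor":
            pending = False  # keypress handled; the editor sleeps again
        timeline.append(chosen)
    return timeline

if __name__ == "__main__":
    for tick, task in enumerate(simulate(10, {3, 7})):
        print(tick, task)
```

The editor runs on the very tick its keypress arrives because its accumulated time is far below the encoder's, and the encoder gets every other tick.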

Scheduler Class Architecture

Linux uses a modular, pluggable scheduler design. Each scheduler class implements a specific algorithm and has a fixed priority relative to other classes.

Available classes

Class	Type	Algorithm	Key use
`SCHED_FIFO`	Real-time	First-in, first-out	Strictest real-time policy; a task runs until it blocks, yields, or is preempted by a higher-priority real-time task
`SCHED_RR`	Real-time	Round-robin with fixed slice	Soft real-time, fairer among RT tasks
`SCHED_DEADLINE`	Real-time	Sporadic task / EDF	Tasks with explicit deadlines
`SCHED_NORMAL`	Time-sharing	CFS	Default for all user processes
`SCHED_BATCH`	Time-sharing	CFS (batch variant)	Background batch jobs

The base scheduler (kernel/sched/core.c) iterates classes in priority order — real-time classes are checked before CFS, so any runnable RT task always wins.
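The priority-ordered iteration can be sketched as a simple loop. This is a toy model, not the kernel code: `ToyClass` is a hypothetical stand-in for a scheduler class with its own queue, and the class names "rt" and "fair" are labels chosen for this example.

```python
from collections import deque

class ToyClass:
    """Hypothetical stand-in for a scheduler class with its own runqueue."""

    def __init__(self, name: str):
        self.name = name
        self.queue = deque()

    def pick_next_task(self):
        """Return the next task of this class, or None if it has none runnable."""
        return self.queue.popleft() if self.queue else None

def pick_next_task(classes):
    """Ask each class in priority order; the first one with a task wins."""
    for cls in classes:
        task = cls.pick_next_task()
        if task is not None:
            return cls.name, task
    return "idle", None

if __name__ == "__main__":
    rt, fair = ToyClass("rt"), ToyClass("fair")
    fair.queue.extend(["bash", "make"])
    rt.queue.append("audio")
    for _ in range(4):
        print(pick_next_task([rt, fair]))
```

Because the real-time class is listed first, its task is returned before any fair-class task, regardless of how long the fair tasks have been waiting.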

`sched_class` — the abstract interface

Each scheduler class exposes a sched_class struct of function pointers. Important callbacks include:

enqueue_task / dequeue_task — add/remove a task from the runqueue
pick_next_task — choose the next process to run
task_tick — called on every timer tick; decides if preemption is needed
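set_curr_task (set_next_task in newer kernels) — called when a task changes its scheduling class or task group; the newer set_next_task is also called when the task is scheduled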

CFS's implementation lives in kernel/sched/fair.c.
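To show how a table of callbacks lets the core scheduler stay algorithm-agnostic, here is a Python analogue of the interface. It is a sketch only: the real sched_class is a C struct of function pointers with different signatures, and `ToyRoundRobin` with its `slice_ticks` parameter is a made-up class used to exercise the callbacks.

```python
from abc import ABC, abstractmethod
from collections import deque

class SchedClass(ABC):
    """Python analogue of the callback table; signatures are simplified."""

    @abstractmethod
    def enqueue_task(self, task): ...

    @abstractmethod
    def dequeue_task(self, task): ...

    @abstractmethod
    def pick_next_task(self): ...

    @abstractmethod
    def task_tick(self, task) -> bool:
        """Return True if the running task should be preempted."""

class ToyRoundRobin(SchedClass):
    """Hypothetical fixed-slice round-robin class."""

    def __init__(self, slice_ticks: int = 3):
        self.slice_ticks = slice_ticks
        self.queue = deque()
        self.used = {}

    def enqueue_task(self, task):
        self.queue.append(task)
        self.used[task] = 0

    def dequeue_task(self, task):
        self.queue.remove(task)
        self.used.pop(task, None)

    def pick_next_task(self):
        return self.queue[0] if self.queue else None

    def task_tick(self, task) -> bool:
        self.used[task] += 1
        if self.used[task] >= self.slice_ticks:
            self.used[task] = 0
            self.queue.rotate(-1)  # move the head of the queue to the tail
            return True
        return False
```

Code that only calls these four methods works unchanged with any class that implements them, which is the property the function-pointer struct provides in the kernel.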

`task_struct` scheduler fields

Each process descriptor (task_struct) carries scheduler-related data:

policy — which scheduler class to use (SCHED_NORMAL, SCHED_FIFO, etc.)
prio / static_prio / normal_prio — priority values
sched_class — pointer to the active scheduler class
se — a sched_entity embedded struct; CFS uses this to track accumulated runtime (vruntime)
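The relationship between these fields can be mirrored in a small data model. This is a simplified sketch: the real structs hold many more members, the field subset and the `account` method are chosen for illustration, and the weighting shown (runtime scaled by the nice-0 weight of 1024 over the task's weight) is a simplification of how CFS charges virtual runtime.

```python
from dataclasses import dataclass, field

NICE_0_WEIGHT = 1024  # weight of a nice-0 task

@dataclass
class SchedEntity:
    """Simplified stand-in for the embedded sched_entity."""
    weight: int = NICE_0_WEIGHT
    vruntime: float = 0.0

    def account(self, delta_exec_ns: int) -> None:
        """Charge real runtime, scaled so heavier tasks age more slowly."""
        self.vruntime += delta_exec_ns * NICE_0_WEIGHT / self.weight

@dataclass
class Task:
    """Simplified stand-in for the scheduler-related part of task_struct."""
    comm: str
    policy: str = "SCHED_NORMAL"
    static_prio: int = 120  # corresponds to nice 0
    se: SchedEntity = field(default_factory=SchedEntity)

if __name__ == "__main__":
    normal = Task("vim")
    favored = Task("encoder", se=SchedEntity(weight=2048))
    for task in (normal, favored):
        task.se.account(1_000_000)  # both run for 1 ms of real time
        print(task.comm, task.se.vruntime)
```

After the same real runtime, the task with double the weight has accumulated half the virtual runtime, so a least-vruntime picker will favor it.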

When does scheduling happen?

The base scheduler is entered through two main paths:

Timer interrupt → scheduler_tick(): checks whether the current process has exhausted its share and, if so, sets the TIF_NEED_RESCHED flag; the actual switch happens when the kernel next calls schedule() at a safe point, such as the return from the interrupt.
Explicit kernel call → schedule(): invoked when a process blocks (e.g., waits on I/O) or voluntarily yields.

This two-path design means scheduling is both periodic (driven by the hardware clock) and event-driven (driven by blocking operations).
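The two paths can be modeled in a toy single-CPU simulation. Everything here is an assumption made for illustration: the `ToyCPU` class, the fixed `slice_ticks` budget, and the `return_from_interrupt` hook that stands in for the point where the kernel acts on the need-resched flag.

```python
class ToyCPU:
    """Toy single-CPU model with a periodic path and an explicit path."""

    def __init__(self, tasks, slice_ticks: int = 2):
        self.runqueue = list(tasks)
        self.current = self.runqueue.pop(0)
        self.slice_ticks = slice_ticks
        self.used = 0
        self.need_resched = False

    def scheduler_tick(self):
        """Periodic path: only marks that a reschedule is wanted."""
        self.used += 1
        if self.used >= self.slice_ticks and self.runqueue:
            self.need_resched = True

    def return_from_interrupt(self):
        """Safe point where a pending reschedule request is acted on."""
        if self.need_resched:
            self.schedule()

    def schedule(self, blocking: bool = False):
        """Explicit path: switch tasks; a blocking task leaves the runqueue."""
        prev = self.current
        if not blocking and prev is not None:
            self.runqueue.append(prev)
        self.current = self.runqueue.pop(0) if self.runqueue else None
        self.used = 0
        self.need_resched = False

if __name__ == "__main__":
    cpu = ToyCPU(["A", "B"])
    for _ in range(2):
        cpu.scheduler_tick()
        cpu.return_from_interrupt()
    print("after two ticks:", cpu.current)
    cpu.schedule(blocking=True)  # the current task blocks on I/O
    print("after blocking: ", cpu.current)
```

The tick handler never switches tasks itself; it only records the request, and the switch is performed by the same schedule() routine that a blocking task calls directly.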

Key Takeaways

The scheduler's job is to decide who runs, when, and for how long, balancing responsiveness, throughput, and fairness.
Linux is preemptive: the timer interrupt gives the OS guaranteed control even over infinite loops.
CFS avoids fixed time slices; it tracks vruntime (virtual runtime) per task and always runs the task with the least accumulated CPU time, weighted by nice value.
The targeted latency bounds worst-case scheduling delay; minimum granularity prevents excessive context switching under high load.
The scheduler is modular: real-time classes (SCHED_FIFO, SCHED_RR, SCHED_DEADLINE) always outrank CFS classes.
Scheduling is triggered by the timer interrupt (scheduler_tick) or by explicit schedule() calls when a task blocks.
The pick_next_task_fair() function in kernel/sched/fair.c is the heart of CFS — hooking it with Kprobes (as in A6) gives direct visibility into scheduler decisions.
