Process Scheduling: CFS Deep Dive, Real-Time Policies, and EEVDF
Why This Matters
Every time your terminal feels snappy while a background compile is grinding away, the Linux scheduler is making dozens of decisions per second about which task gets the CPU and for how long. Understanding these decisions — from the red-black tree that CFS uses to pick the next task, to the "lag" value that the EEVDF scheduler tracks — gives you the mental model to reason about performance, latency, and starvation in kernel code you write or debug.
1. CFS Quick Recap
CFS (Completely Fair Scheduler) does not assign fixed time slices. Instead:
- Each runnable process receives a proportion of the CPU based on total system load.
- That proportion is further weighted by priority (nice value / weight).
- A newly-runnable process preempts the current one if its virtual runtime (vruntime) is smaller — i.e., it has consumed less CPU time so far.
This design lets a text editor (low CPU consumer) stay responsive while a video encoder (high CPU consumer) runs in the background, both getting their fair share.
2. CFS Implementation: Four Components
2.1 Time Accounting — Virtual Runtime
The core accounting unit in CFS is vruntime: how much CPU time a task has virtually consumed, normalized by its weight.
On every timer interrupt, CFS updates the running task's vruntime. The field lives inside sched_entity, which is embedded in task_struct.
task_struct
└─ se (struct sched_entity)
└─ vruntime ← updated every tick
2.2 Process Selection — The Red-Black Tree
CFS maintains a red-black tree (rb-tree) of all runnable tasks, keyed by vruntime:
- The leftmost node always holds the task with the smallest vruntime.
- Picking the next task is O(1) — the kernel caches
rb_leftmost.
/* kernel/sched/fair.c */
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *left = cfs_rq->rb_leftmost;
if (!left)
return NULL;
return rb_entry(left, struct sched_entity, run_node);
}
Enqueue (task wakes up): insert into the tree in O(log n).
Dequeue (task sleeps): remove from the tree in O(log n).
2.3 Scheduler Entry Points
| Function | Role |
|---|---|
schedule() |
Main entry point; calls __schedule() → pick_next_task() |
scheduler_tick() |
Called on every timer interrupt; updates vruntime, checks if preemption is needed |
2.4 Sleeping and Waking Up
- When a task blocks (e.g., waiting for I/O), it is dequeued from the rb-tree.
- When it wakes, it is enqueued back. Its vruntime is adjusted so it doesn't get a huge burst of CPU time after a long sleep.
3. Scheduler Classes and Policies
Linux has a layered hierarchy of scheduler classes, checked in order from highest to lowest priority:
Stop → Deadline → Real-Time → Fair (CFS) → Idle
schedule() calls pick_next_task(), which iterates through each class via for_each_class() and returns the first task found. A real-time task will always run before any CFS task.
| Class | Policies |
|---|---|
| Deadline | SCHED_DEADLINE |
| Real-Time | SCHED_FIFO, SCHED_RR |
| Fair | SCHED_NORMAL, SCHED_BATCH |
| Idle | SCHED_IDLE |
4. Real-Time Scheduling Policies
Linux provides soft real-time scheduling — best effort, no hard guarantees.
SCHED_FIFO
- A task runs until it blocks or yields.
- Only a higher-priority RT task can preempt it.
- Tasks at the same priority are scheduled round-robin.
SCHED_RR
- Same as SCHED_FIFO, but each task gets a fixed time slice.
- Default: 100 ms (
/proc/sys/kernel/sched_rr_timeslice_ms).
SCHED_DEADLINE (mainlined in v3.14)
- Implements Earliest Deadline First (EDF) scheduling.
- Each task specifies a period and a worst-case execution time (WCET).
- The task with the earliest deadline runs first.
Setting an RT Policy in User Space
#include <sched.h>
struct sched_param param;
param.sched_priority = 99; // max for SCHED_FIFO
sched_setscheduler(0, SCHED_FIFO, ¶m);
// Requires root privileges
RT Starvation
An infinite loop in an RT task can starve all lower-priority tasks. The kernel has a throttling mechanism (/proc/sys/kernel/sched_rt_runtime_us and sched_rt_period_us) that limits how much CPU RT tasks collectively consume, preventing complete starvation by default. Disabling this (writing -1 to sched_rt_runtime_us) allows starvation — useful to test in a QEMU VM, not production.
5. Multi-Core Scheduling
Per-CPU Runqueues
Each CPU has its own runqueue (rb-tree). This avoids contention on a single shared data structure — a major scalability win.
Load Balancing
Runqueues can become unbalanced. A load balancer runs periodically (triggered by the scheduler tick or idle entry) and migrates tasks between CPUs based on priority and current load.
Example imbalance: one CPU has a long queue of high-priority tasks; the other has one low-priority task. Without balancing, the high-priority tasks would each receive less CPU time than the single low-priority task on the idle CPU.
6. Preemption and Context Switch
A context switch swaps the currently-running process for another. It is performed by context_switch(), called from __schedule():
schedule() → __schedule() → context_switch()
├─ switch_mm() # swap address space (page tables)
└─ switch_to() # swap CPU registers (PC, SP, ...)
When Does schedule() Get Called?
- Voluntary: a task calls
schedule()directly (e.g., blocks on I/O). - Involuntary preemption:
- The running task's vruntime is no longer the smallest (another task caught up).
- A task with higher priority wakes up.
7. EEVDF Scheduler
Merged into the Linux kernel as a replacement (and enhancement) of CFS, Earliest Eligible Virtual Deadline First (EEVDF) adds a finer notion of fairness.
Core Idea: Lag
Instead of just comparing vruntimes, EEVDF tracks lag for each task:
lag = virtual running time − actual running time
| Lag | Meaning |
|---|---|
| Positive | Task is owed CPU time |
| Negative | Task has received more than its share |
| Zero | Perfectly balanced |
Eligibility
A task is eligible only if its lag ≥ 0. The scheduler picks the task with the earliest virtual deadline among all eligible tasks.
Worked Example
Three equal-priority tasks A, B, C. A runs a 30 ms time slice (others are virtually running at 10 ms each):
| Step | A lag | B lag | C lag | Notes |
|---|---|---|---|---|
| Initially | 0 ms | 0 ms | 0 ms | Pick A |
| After A runs 30 ms | −20 ms | +10 ms | +10 ms | A overran; B and C eligible |
| After B runs | −10 ms | −10 ms | +20 ms | C is still owed the most |
| After C runs | 0 ms | 0 ms | 0 ms | Back to balanced |
A's negative lag means it won't be selected until B and C catch up.
Key Takeaways
- CFS uses vruntime to approximate fairness; the rb-tree makes next-task selection O(1).
- Scheduler classes are checked in strict priority order; any RT task runs before any CFS task.
- SCHED_FIFO runs until blocked/yielded; SCHED_RR adds a fixed time slice; SCHED_DEADLINE uses EDF with period + WCET.
- RT tasks can starve normal tasks — the kernel's RT throttle prevents this by default.
- Per-CPU runqueues avoid lock contention; a load balancer compensates for imbalance.
- A context switch calls
switch_mm()(address space) thenswitch_to()(registers). - EEVDF extends CFS with a lag metric and eligibility test, giving tighter fairness guarantees and better latency.