Process Scheduling: CFS Deep Dive, Real-Time Policies, and EEVDF

Why This Matters

Every time your terminal feels snappy while a background compile is grinding away, the Linux scheduler is making dozens of decisions per second about which task gets the CPU and for how long. Understanding these decisions — from the red-black tree that CFS uses to pick the next task, to the "lag" value that the EEVDF scheduler tracks — gives you the mental model to reason about performance, latency, and starvation in kernel code you write or debug.


1. CFS Quick Recap

CFS (Completely Fair Scheduler) does not assign fixed time slices. Instead:

This design lets a text editor (low CPU consumer) stay responsive while a video encoder (high CPU consumer) runs in the background, both getting their fair share.


2. CFS Implementation: Four Components

2.1 Time Accounting — Virtual Runtime

The core accounting unit in CFS is vruntime: how much CPU time a task has virtually consumed, normalized by its weight.

On every timer interrupt, CFS updates the running task's vruntime. The field lives inside sched_entity, which is embedded in task_struct.

task_struct
  └─ se (struct sched_entity)
       └─ vruntime   ← updated every tick

2.2 Process Selection — The Red-Black Tree

CFS maintains a red-black tree (rb-tree) of all runnable tasks, keyed by vruntime:

/* kernel/sched/fair.c */
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;
    if (!left)
        return NULL;
    return rb_entry(left, struct sched_entity, run_node);
}

Enqueue (task wakes up): insert into the tree in O(log n).
Dequeue (task sleeps): remove from the tree in O(log n).

2.3 Scheduler Entry Points

Function Role
schedule() Main entry point; calls __schedule()pick_next_task()
scheduler_tick() Called on every timer interrupt; updates vruntime, checks if preemption is needed

2.4 Sleeping and Waking Up


3. Scheduler Classes and Policies

Linux has a layered hierarchy of scheduler classes, checked in order from highest to lowest priority:

Stop → Deadline → Real-Time → Fair (CFS) → Idle

schedule() calls pick_next_task(), which iterates through each class via for_each_class() and returns the first task found. A real-time task will always run before any CFS task.

Class Policies
Deadline SCHED_DEADLINE
Real-Time SCHED_FIFO, SCHED_RR
Fair SCHED_NORMAL, SCHED_BATCH
Idle SCHED_IDLE

4. Real-Time Scheduling Policies

Linux provides soft real-time scheduling — best effort, no hard guarantees.

SCHED_FIFO

SCHED_RR

SCHED_DEADLINE (mainlined in v3.14)

Setting an RT Policy in User Space

#include <sched.h>

struct sched_param param;
param.sched_priority = 99; // max for SCHED_FIFO
sched_setscheduler(0, SCHED_FIFO, &param);
// Requires root privileges

RT Starvation

An infinite loop in an RT task can starve all lower-priority tasks. The kernel has a throttling mechanism (/proc/sys/kernel/sched_rt_runtime_us and sched_rt_period_us) that limits how much CPU RT tasks collectively consume, preventing complete starvation by default. Disabling this (writing -1 to sched_rt_runtime_us) allows starvation — useful to test in a QEMU VM, not production.


5. Multi-Core Scheduling

Per-CPU Runqueues

Each CPU has its own runqueue (rb-tree). This avoids contention on a single shared data structure — a major scalability win.

Load Balancing

Runqueues can become unbalanced. A load balancer runs periodically (triggered by the scheduler tick or idle entry) and migrates tasks between CPUs based on priority and current load.

Example imbalance: one CPU has a long queue of high-priority tasks; the other has one low-priority task. Without balancing, the high-priority tasks would each receive less CPU time than the single low-priority task on the idle CPU.


6. Preemption and Context Switch

A context switch swaps the currently-running process for another. It is performed by context_switch(), called from __schedule():

schedule() → __schedule() → context_switch()
                                  ├─ switch_mm()    # swap address space (page tables)
                                  └─ switch_to()    # swap CPU registers (PC, SP, ...)

When Does schedule() Get Called?

  1. Voluntary: a task calls schedule() directly (e.g., blocks on I/O).
  2. Involuntary preemption:
    • The running task's vruntime is no longer the smallest (another task caught up).
    • A task with higher priority wakes up.

7. EEVDF Scheduler

Merged into the Linux kernel as a replacement (and enhancement) of CFS, Earliest Eligible Virtual Deadline First (EEVDF) adds a finer notion of fairness.

Core Idea: Lag

Instead of just comparing vruntimes, EEVDF tracks lag for each task:

lag = virtual running time − actual running time

Lag Meaning
Positive Task is owed CPU time
Negative Task has received more than its share
Zero Perfectly balanced

Eligibility

A task is eligible only if its lag ≥ 0. The scheduler picks the task with the earliest virtual deadline among all eligible tasks.

Worked Example

Three equal-priority tasks A, B, C. A runs a 30 ms time slice (others are virtually running at 10 ms each):

Step A lag B lag C lag Notes
Initially 0 ms 0 ms 0 ms Pick A
After A runs 30 ms −20 ms +10 ms +10 ms A overran; B and C eligible
After B runs −10 ms −10 ms +20 ms C is still owed the most
After C runs 0 ms 0 ms 0 ms Back to balanced

A's negative lag means it won't be selected until B and C catch up.


Key Takeaways

Practice

  1. In Linux CFS, how does the scheduler decide which runnable task to execute next?
  2. A task's vruntime is updated on every timer interrupt. What purpose does this serve?
  3. Which statement correctly distinguishes SCHED_FIFO from SCHED_RR?
  4. What is the correct order of scheduler class priority in Linux, from highest to lowest?
  5. A context switch is performed when schedule() selects a new task. Name the two kernel functions called inside context_switch() and explain what each does.
  6. Why does Linux use per-CPU runqueues instead of a single global runqueue shared across all CPUs?
  7. In EEVDF, a task's 'lag' is currently +15 ms. What does this mean, and is the task eligible to be scheduled?
  8. A SCHED_FIFO task runs an infinite loop on a single-CPU system. Under what conditions will other tasks still get CPU time, and what kernel mechanism prevents total starvation by default?
  9. Walk through the EEVDF lag example from the lecture: three tasks A, B, C each with equal priority. A runs a 30 ms time slice while B and C are runnable. After A finishes its slice, what is the lag of each task, which tasks are eligible, and which will run next?