Kernel Synchronization III: Seqlocks, Barriers, and RCU
Why this matters
By now you know the core primitives—spinlocks, mutexes, semaphores, and rwlocks. This module covers the tools the kernel uses when those primitives are either too slow or the wrong semantic fit. Seqlocks give writers priority without starving them. Memory barriers let you reason about hardware and compiler reordering. And RCU is the crown jewel: a lock-free read path used throughout the kernel for read-mostly data structures. Understanding these will help you both write efficient kernel code and reason about why the kernel is designed the way it is.
Completion Variables
A completion variable solves a simple but common problem: one kernel thread needs to wait until another thread (or interrupt handler) signals that something has happened. Semaphores can do this, but completion variables are the idiomatic way in Linux.
| API | Meaning |
|---|---|
DECLARE_COMPLETION(x) |
Statically declare and initialize |
init_completion(&x) |
Dynamic initialization |
wait_for_completion(&x) |
Block until signaled |
complete(&x) |
Wake up one waiter |
complete_all(&x) |
Wake up all waiters |
Think of a completion as a one-shot gate: the waiter sleeps at the gate; the producer opens it exactly once (or for all waiters with complete_all).
Sequential Lock (Seqlock)
The problem with rwlock
A standard reader-writer lock (rwlock) is reader-preferred: as long as readers hold the lock, a writer must wait. Under continuous read traffic, a writer can starve indefinitely. Seqlocks flip this preference.
How seqlock works
A seqlock maintains an integer sequence counter initialized to zero.
- When a writer acquires the lock, it increments the counter (making it odd).
- When the writer releases the lock, it increments it again (making it even).
- A reader snapshots the counter before reading, does its work, then snapshots it after. If the two values differ, or if the before-snapshot was odd, a write was in progress and the reader retries.
/* Reader side */
unsigned seq;
do {
seq = read_seqbegin(&lock);
/* read shared data */
} while (read_seqretry(&lock, seq));
/* Writer side */
write_seqlock(&lock);
/* modify shared data */
write_sequnlock(&lock);
When to use seqlock
- Many readers, few writers — readers never block writers.
- Writers must be preferred — writers are never starved.
- The data being protected must be small enough that readers can retry cheaply, and the data must not contain pointers that would be unsafe to dereference during a retry.
Preemption Disabling
When a spinlock is held, kernel preemption is automatically disabled on that CPU. But sometimes you need to work with per-CPU data without a spinlock. If you are preempted mid-operation and rescheduled on a different CPU, your per-CPU data is suddenly the wrong CPU's data.
The fix is to explicitly disable preemption:
int cpu;
cpu = get_cpu(); /* disable preemption, return current CPU id */
/* safely access per_cpu_data[cpu] here */
put_cpu(); /* re-enable preemption */
After put_cpu(), the cpu variable is no longer valid—you may have been rescheduled.
Memory Ordering and Barriers
The reordering problem
Both the compiler (at compile time) and the CPU (at run time) may reorder memory operations for performance:
- The compiler can move loads and stores around as long as the single-threaded behavior is unchanged.
- Modern CPUs use models like TSO (Total Store Ordering) or PSO (Partial Store Ordering) that allow reordering visible to other CPUs.
In concurrent kernel code, this means a sequence like:
a = 1;
b = 2;
might be observed by another CPU as b = 2 happening before a = 1.
Barrier types
| Barrier | What it prevents |
|---|---|
barrier() |
Compiler reordering only (no CPU instruction emitted) |
rmb() |
CPU from reordering reads across the barrier |
wmb() |
CPU from reordering writes across the barrier |
mb() |
CPU from reordering both reads and writes |
Example
/* Initial: a = 1, b = 2 */
a = 3;
wmb(); /* ensures a is written to memory before b */
b = 4;
On the reader side:
b_val = b;
rmb(); /* ensures b is read before a */
a_val = a;
With these barriers, any CPU that sees b == 4 is guaranteed to also see a == 3.
Read-Copy-Update (RCU)
Motivation: the limits of rwlock and seqlock
Both rwlock and seqlock require readers to do some coordination with writers (acquire a lock or check a counter). For read-mostly data structures accessed on every CPU on every context switch or system call, even that small overhead adds up.
The RCU insight
RCU decouples reading from writing entirely:
- Readers take no locks, execute no atomic operations, and see no overhead. They are simply marked as being in an RCU read-side critical section.
- Writers work on a copy of the data, modify it, then atomically publish it by updating a single pointer. They delay freeing the old copy until all pre-existing readers have finished.
The copy-modify-publish pattern
For a linked list:
- Writer copies the node to be updated.
- Writer modifies the copy.
- Writer atomically updates the
nextpointer of the preceding node to point to the new copy (new readers now see the updated node). - Writer waits until all existing readers finish (a grace period), then frees the old node.
This is safe because a reader either sees the old pointer (and finishes before the free) or sees the new pointer (and never touches the old node).
Core RCU API
| API | Side | Purpose |
|---|---|---|
rcu_read_lock() |
Reader | Marks entry into RCU critical section (disables preemption) |
rcu_read_unlock() |
Reader | Marks exit; if no more readers on this CPU, ends grace period |
rcu_dereference(p) |
Reader | Safely load an RCU-protected pointer (includes read barrier) |
rcu_assign_pointer(p, v) |
Writer | Atomically publish a new pointer (includes write barrier) |
synchronize_rcu() |
Writer | Block until all pre-existing readers finish (grace period) |
call_rcu(&head, func) |
Writer | Asynchronous alternative: call func after grace period |
Replacing rwlock with RCU
/* Old rwlock reader */
read_lock(&list_lock);
list_for_each_entry(p, &list, node) { ... }
read_unlock(&list_lock);
/* New RCU reader */
rcu_read_lock();
list_for_each_entry_rcu(p, &list, node) { ... }
rcu_read_unlock();
/* Old rwlock writer */
write_lock(&list_lock);
list_del(&entry->node);
write_unlock(&list_lock);
free(entry);
/* New RCU writer */
spin_lock(&writer_lock); /* RCU does NOT coordinate writers */
list_del_rcu(&entry->node);
spin_unlock(&writer_lock);
synchronize_rcu(); /* wait for readers */
free(entry);
RCU limitations
- Writers still need mutual exclusion among themselves. RCU only protects readers from writers; most RCU-based code uses a spinlock to serialize concurrent writes.
- Updates must be a single-pointer swap. Complex multi-field updates are hard to express as a single atomic pointer change. This constrains the data structures and algorithms that can use RCU effectively.
Where RCU is used in Linux
RCU is pervasive in the Linux kernel: the directory-entry cache (dcache), routing tables, network protocol handlers, and the process credential structure all rely on it. Paul McKenney (Meta/IBM) developed and championed RCU in Linux.
Key Takeaways
- Completion variables are the idiomatic way for one thread to wait for an event signaled by another.
- Seqlocks give writers priority over readers via a sequence counter; readers retry if a write races them.
- Preemption disabling with
get_cpu()/put_cpu()is needed when accessing per-CPU data without a spinlock. - Memory barriers (
rmb,wmb,mb,barrier) prevent the compiler and CPU from reordering loads and stores in ways that break concurrent correctness. - RCU achieves near-zero reader overhead by using a copy-modify-publish protocol and deferring reclamation until a grace period expires; writers still need their own mutual exclusion.
- The right primitive depends on the read/write ratio, whether writers can be delayed, and how complex the update is.