Kernel Synchronization: Spinlocks, Sleeplocks, Deadlocks, and Multi-core Boot

Why This Matters

A single-core OS can get away with disabling interrupts to protect shared data. The moment a second CPU enters the picture, that trick fails — two processors can race on the same kernel data structure simultaneously. Kernel synchronization is what keeps that from corrupting state. Understanding it also unlocks everything that comes after: process scheduling, file systems, and device drivers all depend on locking done right.


1. Atomic Instructions — The Hardware Foundation

Locks are only possible because CPUs provide instructions that read and write memory atomically: no other CPU can observe a half-finished operation.

Two workhorses in x86:

Instruction What it does
LOCK XADD dst, src Atomically: old = dst; dst = dst + src; return old
LOCK XCHG dst, src Atomically swaps dst and src

The LOCK prefix causes the CPU to assert a bus lock (or cache-line lock) for the duration of the instruction, ensuring the read-modify-write is indivisible.

xv6 wraps LOCK XCHG in a helper:

static inline uint xchg(volatile uint *addr, uint newval) {
    uint result;
    asm volatile("lock; xchgl %0, %1"
        : "+m" (*addr), "=a" (result)
        : "1" (newval)
        : "cc");
    return result;   // returns the *old* value of *addr
}

The +m constraint tells the compiler that *addr is both read and written ("read-modify-write operand"). The return value is the value *addr held before the swap.


2. xv6 Spinlock

A spinlock is a busy-wait lock: a thread trying to acquire it loops ("spins") until it succeeds. This is cheap when critical sections are very short (a few instructions), because the overhead of putting a thread to sleep and waking it up would cost more than just spinning.

Structure

struct spinlock {
    uint locked;   // 0 = free, 1 = held
    char *name;
    struct cpu *cpu;
    // ...
};

Acquire

void acquire(struct spinlock *lk) {
    pushcli();                         // disable interrupts
    if (holding(lk)) panic("acquire"); // detect self-deadlock

    while (xchg(&lk->locked, 1) != 0) // spin until we swap in 1
        ;

    __sync_synchronize();              // memory barrier
}

Key points:

Release

void release(struct spinlock *lk) {
    __sync_synchronize();              // barrier before clearing the lock

    asm volatile("movl $0, %0" : "+m" (lk->locked) : );

    popcli();                          // re-enable interrupts
}

The barrier ensures any writes inside the critical section are visible to other CPUs before the lock word is cleared. A plain lk->locked = 0 would be optimizable by the compiler in ways that break this guarantee.

Usage Pattern

struct {
    struct spinlock lock;
    struct proc proc[NPROC];
} ptable;

initlock(&ptable.lock, "ptable");

acquire(&ptable.lock);
for (p = ptable.proc; p < &ptable.proc[NPROC]; p++)
    // ... examine/modify process table
release(&ptable.lock);

When to Use a Lock — Checklist

Ask these questions about any piece of data:

  1. Is the data global (shared across CPUs or contexts)?
  2. Can another thread access it concurrently?
  3. What happens if this function runs simultaneously on another CPU?
  4. Is the data shared between process context and interrupt context?
  5. If a process is preempted while accessing this data, can the newly scheduled process touch the same data?

If any answer is "yes", you need a lock.


3. xv6 Sleeplock

Spinlocks are wrong for long waits — spinning wastes a whole CPU. If a critical section involves disk I/O or another slow operation, the thread should sleep instead.

Structure

struct sleeplock {
    uint locked;         // Is the lock held?
    struct spinlock lk;  // protects this sleeplock's fields
    char *name;
    int pid;
};

A sleeplock embeds a spinlock to protect its own locked field during the brief moment of inspection and update.

Acquire

void acquiresleep(struct sleeplock *lk) {
    acquire(&lk->lk);           // protect the check below
    while (lk->locked)
        sleep(lk, &lk->lk);    // atomically releases lk->lk and sleeps
    lk->locked = 1;
    lk->pid = proc->pid;
    release(&lk->lk);
}

sleep(chan, lk) atomically releases the spinlock and puts the process to sleep on channel chan. This avoids the lost-wakeup race: the wakeup cannot arrive between checking lk->locked and calling sleep.

Release

void releasesleep(struct sleeplock *lk) {
    acquire(&lk->lk);
    lk->locked = 0;
    lk->pid = 0;
    wakeup(lk);          // wake all sleepers on this channel
    release(&lk->lk);
}

Usage (inode locking in fs.c)

for (i = 0; i < NINODE; i++)
    initsleeplock(&icache.inode[i].lock, "inode");

acquiresleep(&ip->lock);
// Read blocks from disk — this may block for milliseconds
releasesleep(&ip->lock);

Spinlock vs. Sleeplock

Property Spinlock Sleeplock
Waiting thread Spins (wastes CPU) Sleeps (yields CPU)
Critical section Very short (no blocking) Can be long (disk I/O OK)
Can hold across sleep()? No Yes
Interrupts Disabled while held Not disabled

4. Deadlocks

A deadlock occurs when a set of threads are each waiting for a resource held by another thread in the set — nobody can proceed.

Self-Deadlock

struct spinlock mylock;

void low_level_fn() {
    acquire(&mylock);     // (2) tries to acquire — DEADLOCK: it's already held
    // ...
    release(&mylock);
}

void high_level_fn() {
    acquire(&mylock);     // (1) acquires successfully
    low_level_fn();       // (2) calls into a function that also tries to acquire
    release(&mylock);
}

xv6 spinlocks are not reentrant. Calling acquire on a lock you already hold panics (the holding() check in acquire). Without that check, you'd spin forever.

Circular Deadlock (Two Locks, Two Threads)

Thread 1:           Thread 2:
acquire(lock1)      acquire(lock2)
acquire(lock2)      acquire(lock1)   ← each waits for the other

Thread 1 holds lock1 and waits for lock2. Thread 2 holds lock2 and waits for lock1. Neither can progress.

Prevention Strategies

Strategy How it works
Fixed global lock ordering All code acquires locks in the same global order (e.g., always lock1 before lock2) — circular waits are impossible
Trylock Attempt to acquire; if it fails, release all held locks and retry from scratch
Lock hierarchy Assign a level number; only acquire locks at strictly increasing levels

Note: disabling interrupts globally before acquiring any lock does not prevent multi-CPU deadlocks; it only prevents interrupt-handler deadlocks on a single CPU.


5. Memory Ordering and Barriers

Modern compilers and CPUs reorder memory operations to improve performance. This is safe within a single thread (the CPU preserves the illusion of in-order execution for that thread), but it can break assumptions between CPUs.

Example: without barriers, a compiler might move the lk->locked = 0 in release() before the stores inside the critical section, making other CPUs observe a "released" lock while the critical section's writes haven't landed yet.

__sync_synchronize();

This GCC built-in emits a full memory barrier (on x86: mfence, or a lock orq $0, (%rsp)). It tells both the compiler and the CPU:

xv6 places barriers in both acquire (after spinning) and release (before clearing the lock) to sandwich the critical section.


6. Multi-core Boot in xv6

Booting additional CPUs requires careful synchronization.

Roles

Boot Sequence

BSP:
  main()
    mpinit()         ← reads MP configuration table, discovers APs
    startothers()    ← wakes one AP at a time
      lapicstartap() ← sends INIT/SIPI inter-processor interrupt to AP
      while(c->started == 0) ;  ← busy-waits until AP is up
    mpmain()         ← BSP finishes its own per-CPU init, enters scheduler()

Each AP:
  entryother.S      ← trampoline at physical address 0x7000
  mpenter()         ← per-CPU C initialization
  mpmain()
    xchg(&cpu->started, 1)   ← signal BSP that this CPU is ready
    scheduler()              ← start scheduling

The xchg on cpu->started is both the signal to the BSP and an implicit memory barrier, ensuring the AP's initialization is visible before the flag is set.

Scalability Concern

With 64 CPUs all issuing system calls simultaneously:

A single global lock means at most one CPU does useful kernel work at a time. Real kernels address this with fine-grained locking (per-CPU structures, per-inode locks) and lock-free data structures.


7. Read/Write Locks

Many data structures are read far more often than written. A plain spinlock serializes readers unnecessarily.

A read/write lock allows either:

xv6-style Implementation

struct rwlock { int n; };
// n == 0:   unlocked
// n == -1:  locked by one writer
// n > 0:    locked by n readers

Writer lock:

w_lock(l):
    while (1):
        if comp_xchg(&l->n, 0, -1)  // only succeeds if n == 0
            return

Reader lock:

r_lock(l):
    while (1):
        x = l->n
        if x < 0: continue           // writer holds it
        if comp_xchg(&l->n, x, x+1) // atomically increment reader count
            return

comp_xchg (compare-and-swap, CMPXCHG) atomically checks that the value equals an expected value, then replaces it — this is the same primitive as xchg but conditional.

Linux's rwlock is reader-preferred: readers can always enter as long as no writer holds the lock. A downside is that a continuous stream of readers can starve writers.

Seqlock (Brief)

A seqlock avoids writer starvation by letting writers proceed unconditionally. Readers detect a concurrent write by checking a sequence counter before and after reading — if the counter changed, the read is retried. Writers never block.


Key Takeaways

  1. Atomic instructions (LOCK XCHG, CMPXCHG) are the hardware building blocks for all kernel locks.
  2. Spinlocks are for short critical sections; they disable interrupts and busy-wait. They must not be held across operations that block.
  3. Sleeplocks are for long critical sections (e.g., disk I/O); the waiting thread sleeps instead of spinning.
  4. Deadlocks arise from cycles in the lock-wait graph. The most reliable prevention is a fixed global lock ordering.
  5. Memory barriers (__sync_synchronize) are essential around lock acquire/release to prevent compilers and CPUs from reordering stores out of critical sections.
  6. Multi-core boot uses startothers() with a flag polled via xchg to safely bring up each AP in sequence.
  7. Lock contention is the primary scalability bottleneck in a kernel with many CPUs; fine-grained locking and per-CPU data structures mitigate it.
  8. RW locks improve throughput for read-heavy workloads; seqlocks eliminate writer starvation at the cost of reader retries.

Practice

  1. What does the xv6 xchg(&lk->locked, 1) call return when the spinlock is successfully acquired (i.e., the lock was free)?
  2. Why does xv6's acquire() call pushcli() (disable interrupts) before spinning on the lock?
  3. Which of the following correctly distinguishes a sleeplock from a spinlock in xv6?
  4. Consider two threads: Thread A acquires lock1 then tries to acquire lock2. Thread B acquires lock2 then tries to acquire lock1. Both are blocked. Which deadlock-prevention strategy directly addresses this scenario?
  5. In acquiresleep, the code calls sleep(lk, &lk->lk) while holding lk->lk (the internal spinlock). Why is this safe — doesn't holding a spinlock while sleeping violate the rule that spinlocks must not be held across blocking operations?
  6. What role does __sync_synchronize() play in xv6's spinlock release()?
  7. In xv6's multi-core boot, why does each AP (Application Processor) use xchg(&cpu->started, 1) to signal readiness rather than a plain assignment cpu->started = 1?
  8. You are writing a kernel function that reads a per-inode reference count (ip->ref) and, if it is zero, frees the inode. Identify what type of lock is appropriate here and explain what would go wrong if you used the wrong type.
  9. A read/write lock implementation stores the lock state in a single integer n where n=0 means free, n=-1 means writer-held, and n>0 means n readers hold it. A reader acquires the lock by reading x = l->n and then calling comp_xchg(&l->n, x, x+1). Why is it incorrect to replace this with two separate operations: if (l->n >= 0) { l->n++; }?
  10. A server runs 64 threads on 64 CPU cores. Profiling shows that as you add more cores beyond 8, throughput barely increases. All threads spend most of their time in kernel code. What is the most likely cause, and what are two concrete mitigation strategies?