Kernel Tracing, eBPF, and Interrupt Handling

Why This Matters

Production systems break in ways that cannot be reproduced in a lab. Tracing tools let you observe the live kernel without reboots or recompilation. Meanwhile, every device you interact with — keyboard, NIC, disk — relies on the interrupt mechanism to talk to the CPU efficiently. Understanding both topics unlocks the ability to diagnose real system behavior and to write correct, high-performance kernel code.


Part 1: Kernel Tracing

What is Kernel Tracing?

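Kernel tracing is dynamic instrumentation: attaching probes to running kernel functions or instructions without modifying or recompiling the kernel. Typical uses include inspecting function arguments, counting calls, capturing return values, and measuring latency on a live system.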

Kprobes and Kretprobes

Probe type   Where it fires                          Typical use
kprobe       Before almost any kernel instruction    Inspect arguments, count calls
kretprobe    When the probed function returns        Capture return value, measure latency

Kprobes work via x86's one-byte int3 breakpoint instruction. When the kernel registers a kprobe, it saves a copy of the probed instruction and overwrites its first byte with int3 (opcode 0xCC). When execution reaches that address, the CPU raises a breakpoint exception (vector 3), the kprobe pre-handler runs, the saved original instruction is executed out of line, and normal execution resumes.
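The byte-patching step can be pictured with a small user-space simulation. This is only a sketch on an ordinary byte array, not kernel code: struct fake_probe, arm_probe, and disarm_probe are hypothetical names chosen for illustration, and the real kprobes implementation additionally decodes the instruction, executes the saved copy, and synchronizes across CPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC  /* x86 one-byte breakpoint instruction */

/* Hypothetical bookkeeping for one probe in this simulation. */
struct fake_probe {
    uint8_t *addr;       /* location of the probed instruction */
    uint8_t saved_byte;  /* original first byte, kept for restore */
    int armed;
};

/* Save the first byte of the target instruction and overwrite it with int3. */
static void arm_probe(struct fake_probe *p, uint8_t *text, size_t offset)
{
    assert(!p->armed);
    p->addr = text + offset;
    p->saved_byte = *p->addr;
    *p->addr = INT3_OPCODE;
    p->armed = 1;
}

/* Put the original byte back, as unregistering a probe would. */
static void disarm_probe(struct fake_probe *p)
{
    assert(p->armed);
    *p->addr = p->saved_byte;
    p->armed = 0;
}
```

Arming a probe at offset 0 of a buffer that starts with 0x55 (push rbp) leaves 0xCC there until disarm_probe restores the saved byte.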

Writing a kprobe Kernel Module

A kprobe module fills in a struct kprobe and registers it:

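#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/uaccess.h>

static struct kprobe kp;

static int handler_pre(struct kprobe *p, struct pt_regs *regs) {
    char filename[256];
    /* Kprobe handlers must not sleep, so use the non-faulting copy
     * (copy_from_user_nofault(), Linux 5.8+). On x86-64, RSI = 2nd argument. */
    if (regs->si && copy_from_user_nofault(filename,
            (const char __user *)regs->si, sizeof(filename) - 1) == 0) {
        filename[sizeof(filename) - 1] = '\0';  /* guarantee termination */
        printk(KERN_INFO "[kprobe] File opened: %s\n", filename);
    }
    return 0;
}

static int __init kprobe_init(void) {
    kp.symbol_name = "do_sys_openat2";
    kp.pre_handler  = handler_pre;
    return register_kprobe(&kp);
}

static void __exit kprobe_exit(void) {
    unregister_kprobe(&kp);
}

module_init(kprobe_init);
module_exit(kprobe_exit);
MODULE_LICENSE("GPL");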

Key points:

Profiling with perf and kprobes

You can attach kprobes without writing a module using perf probe:

sudo perf probe -a do_sys_openat2          # register the probe
sudo cat /sys/kernel/debug/kprobes/list    # verify it's active
sudo perf record -e probe:do_sys_openat2 -aR sleep 1
sudo perf report                           # analyze call sites
sudo perf probe -d do_sys_openat2          # remove the probe

Part 2: eBPF

What is eBPF?

eBPF (Extended Berkeley Packet Filter) lets you run sandboxed programs inside the kernel in response to events, without writing a kernel module. Before a program is allowed to run, an in-kernel verifier checks it and rejects code that could crash the kernel or loop forever. eBPF is now the foundation for many Linux tracing, networking, and security tools.

Key properties:

bpftrace

bpftrace is a high-level eBPF frontend with a language similar to AWK. It compiles scripts to BPF bytecode and loads them automatically.

# List all syscall tracepoints
sudo bpftrace -l 'tracepoint:syscalls:*'

# Print every execve call with the process name and binary path
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
    printf("Process %s executed %s\n", comm, str(args->filename)); }'

# Trace malloc calls in libc with size
sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
    printf("PID %d (%s) called malloc(size=%llu)\n", pid, comm, arg0); }'

bpftrace supports:


Part 3: Interrupts

Why Interrupts Exist

Devices are slow compared to CPUs. A spinning hard disk takes roughly 10–15 ms for a random read (4–10 ms seek plus ~5.6 ms average rotational latency at 5400 RPM). Two strategies exist for the CPU to learn when I/O finishes:

Strategy    Description                                         Trade-off
Polling     CPU repeatedly checks the device status register    Burns CPU cycles while waiting
Interrupt   Device signals the CPU on completion                CPU is free until the signal arrives

Interrupts let the CPU run other processes while I/O is in flight, making them essential for multiprogramming.
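The rotational-latency figure for a 5400 RPM disk can be checked with a few lines of arithmetic. The sketch below assumes the usual model in which the target sector is, on average, half a rotation away; the function names are illustrative only.

```c
#include <assert.h>

/* Time for one full platter rotation, in milliseconds. */
static double rotation_ms(double rpm)
{
    assert(rpm > 0.0);
    return 60.0 * 1000.0 / rpm;
}

/* On average the target sector is half a rotation away. */
static double avg_rotational_latency_ms(double rpm)
{
    return rotation_ms(rpm) / 2.0;
}
```

At 5400 RPM one rotation takes about 11.1 ms, so the average wait is about 5.6 ms. A CPU executing billions of instructions per second would waste millions of cycles polling for that long.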

Interrupt Controller Hardware

Devices send electrical signals to the CPU through an interrupt controller:

Interrupt Request Lines (IRQ)

Each device is identified by an IRQ number. Classic 8259A examples:

IRQ     Device
0       System timer
1       Keyboard controller
3, 4    Serial ports
5       Parallel port / sound card

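Legacy PCI devices often share a small number of IRQ lines, so handlers must be written to cope with sharing. Modern PCIe devices typically use MSI or MSI-X (Message Signaled Interrupts) instead: the device raises an interrupt by writing to a special memory address, and each device can be given its own vector(s), which largely removes the need to share lines.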

Exceptions: Software Interrupts

Exceptions are interrupts raised by the CPU itself during instruction execution:

The int N instruction triggers a software interrupt for vector N (0–255). int 0x80 is the classic Linux 32-bit syscall mechanism. iret returns from any interrupt or exception.

Maskable vs. Non-Maskable Interrupts

Type                 Can be disabled?                           Examples
Non-Maskable (NMI)   No; always delivered                       Power failure, uncorrectable memory error
Maskable             Yes; cli disables them, sti re-enables     All normal device interrupts

The EFLAGS.IF flag controls masking. Clearing it with cli disables all maskable interrupts on the local CPU.
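The effect of cli and sti on the flag can be modeled in user space. The sketch below only manipulates an ordinary integer standing in for EFLAGS (IF is bit 9); sim_cli, sim_sti, and maskable_irqs_enabled are hypothetical helper names, and the real instructions are privileged and cannot be run from a normal process.

```c
#include <assert.h>
#include <stdint.h>

#define X86_EFLAGS_IF (1u << 9)  /* interrupt-enable flag, bit 9 of EFLAGS */

/* Model of cli: clear IF, leaving every other flag untouched. */
static uint32_t sim_cli(uint32_t eflags)
{
    return eflags & ~X86_EFLAGS_IF;
}

/* Model of sti: set IF, leaving every other flag untouched. */
static uint32_t sim_sti(uint32_t eflags)
{
    return eflags | X86_EFLAGS_IF;
}

/* Maskable interrupts are delivered only while IF is set. */
static int maskable_irqs_enabled(uint32_t eflags)
{
    return (eflags & X86_EFLAGS_IF) != 0;
}
```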

Interrupt Descriptor Table (IDT)

The IDT is the hardware dispatch table for interrupts and exceptions:

Predefined Vectors

Vector    Meaning
0         Divide Error
1         Debug Exception
2         NMI
3         Breakpoint (int3)
13        General Protection Fault
14        Page Fault
32–255    User-defined (device interrupts)

What Happens When int N Executes

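  1. CPU looks up vector N in the IDT (base address from IDTR + N × 8 in 32-bit protected mode, which the steps below assume; gate descriptors are 16 bytes in 64-bit mode, giving N × 16).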
  2. Checks CPL ≤ DPL in the gate descriptor (privilege check).
  3. Saves current SS:ESP to a CPU-internal register.
  4. Loads the new kernel SS:ESP from the TSS (Task State Segment) — switches to the kernel stack.
  5. Pushes user SS, ESP, EFLAGS, CS, EIP onto the kernel stack.
  6. Clears certain EFLAGS bits (e.g., IF for interrupt gates).
  7. Jumps to the handler address from the IDT descriptor.
  8. Handler returns with iret, which pops all saved state and resumes user mode.
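The first two steps, locating the gate descriptor and checking privilege, reduce to simple arithmetic and a comparison. The following is a minimal user-space sketch, assuming 8-byte gates in 32-bit protected mode and 16-byte gates in 64-bit mode; gate_address and int_n_allowed are hypothetical names, and the real checks are performed by the CPU, not by software.

```c
#include <assert.h>
#include <stdint.h>

/* Gate descriptor sizes: 8 bytes in 32-bit protected mode, 16 in 64-bit mode. */
enum { GATE_SIZE_32 = 8, GATE_SIZE_64 = 16 };

/* Address of the gate descriptor for `vector`, given the IDT base from IDTR. */
static uint64_t gate_address(uint64_t idt_base, unsigned vector, unsigned gate_size)
{
    assert(vector <= 255);
    return idt_base + (uint64_t)vector * gate_size;
}

/* A software `int N` is permitted only if CPL <= DPL of the gate
 * (numerically; ring 3 is the least privileged). */
static int int_n_allowed(unsigned cpl, unsigned gate_dpl)
{
    return cpl <= gate_dpl;
}
```

With this model, user code at CPL 3 may execute int 0x80 only because that gate's DPL is 3, while int 14 from user mode fails the check because the page-fault gate has DPL 0.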

Interrupt Service Routines (ISR)

An ISR is a normal C function matching the irq_handler_t prototype:

typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);
// Return IRQ_HANDLED if this device generated the interrupt,
// IRQ_NONE otherwise (important for shared lines)

Register and free handlers with:

int request_irq(unsigned int irq, irq_handler_t handler,
                unsigned long flags, const char *devname, void *dev_id);
void free_irq(unsigned int irq, void *dev_id);

For shared IRQ lines (IRQF_SHARED flag), dev_id must be unique per handler so the kernel can identify who to call and who to remove.
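The role of dev_id on a shared line can be illustrated with a user-space model. Everything prefixed sim_ below is hypothetical and only mimics the shape of the kernel API: each registered action carries its own dev_id, dispatch calls every handler on the line, and removal finds the right action by dev_id.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SIM_IRQ_NONE = 0, SIM_IRQ_HANDLED = 1 } sim_irqreturn_t;
typedef sim_irqreturn_t (*sim_handler_t)(int irq, void *dev_id);

struct sim_action { sim_handler_t handler; void *dev_id; };

#define SIM_MAX_ACTIONS 4
struct sim_irq_line { struct sim_action actions[SIM_MAX_ACTIONS]; size_t count; };

/* Register a handler; a shared line needs a non-NULL, unique dev_id. */
static int sim_request_irq(struct sim_irq_line *line, sim_handler_t h, void *dev_id)
{
    assert(line != NULL && h != NULL);
    if (line->count == SIM_MAX_ACTIONS || dev_id == NULL)
        return -1;
    line->actions[line->count++] = (struct sim_action){ h, dev_id };
    return 0;
}

/* Remove the action whose dev_id matches; order is not preserved. */
static int sim_free_irq(struct sim_irq_line *line, void *dev_id)
{
    for (size_t i = 0; i < line->count; i++) {
        if (line->actions[i].dev_id == dev_id) {
            line->actions[i] = line->actions[--line->count];
            return 0;
        }
    }
    return -1;
}

/* Call every handler on the line; return how many claimed the interrupt. */
static int sim_dispatch(struct sim_irq_line *line, int irq)
{
    int handled = 0;
    for (size_t i = 0; i < line->count; i++)
        if (line->actions[i].handler(irq, line->actions[i].dev_id) == SIM_IRQ_HANDLED)
            handled++;
    return handled;
}

/* Hypothetical device: `pending` stands in for its interrupt status register. */
struct sim_dev { int pending; };

static sim_irqreturn_t sim_dev_isr(int irq, void *dev_id)
{
    struct sim_dev *dev = dev_id;
    (void)irq;
    if (!dev->pending)
        return SIM_IRQ_NONE;   /* not ours: let the other handlers look */
    dev->pending = 0;          /* acknowledge */
    return SIM_IRQ_HANDLED;
}
```

Two devices can register the same sim_dev_isr with different dev_id pointers; only the one whose pending flag is set returns SIM_IRQ_HANDLED.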

Interrupt Context Constraints

ISRs run in interrupt context (one form of atomic context), not process context. There is no process to put to sleep and reschedule later, which has important consequences:

Forbidden in ISR          Reason                             Alternative
kmalloc(…, GFP_KERNEL)    May sleep waiting for memory       Use GFP_ATOMIC
mutex_lock()              May sleep                          Use a spinlock
printk() in hot paths     Too slow / unsafe on some paths    Use trace_printk()
Sleeping / blocking       ISR is not a schedulable entity    Defer work to bottom half

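Interrupt stacks are small and fixed in size: one page (4 KB) on 32-bit x86 kernels built with 4 KB stacks, and a 16 KB per-CPU IRQ stack on typical x86-64 kernels. Avoid large on-stack buffers and deep call chains in ISRs.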

Top Half vs. Bottom Half

Because ISRs must be fast but device work can be substantial, Linux splits interrupt handling:

Hardware interrupt arrives
       │
       ▼
  ┌─────────────────────────────────────┐
  │  TOP HALF (ISR — runs immediately)  │
  │  • Acknowledge hardware             │
  │  • Copy data to kernel memory       │
  │  • Re-arm the device                │
  └─────────────────┬───────────────────┘
                    │ schedules
                    ▼
  ┌─────────────────────────────────────┐
  │  BOTTOM HALF (deferred)             │
  │  • Softirq / Tasklet / Work Queue   │
  │  • Runs with interrupts enabled     │
  │  • Does the heavy processing        │
  └─────────────────────────────────────┘

Network example: The top half copies the packet from the NIC into main memory (urgent — NIC buffer is small). The bottom half parses protocol headers, routes the packet, and hands it to a socket (can wait).
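The split can be modeled in user space with a small ring buffer: the top half only enqueues, and the bottom half drains the queue later. This is a single-threaded sketch under stated assumptions, not driver code; struct pkt_ring, top_half, and bottom_half are hypothetical names, integers stand in for packets, and summing them stands in for protocol processing.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8  /* hypothetical queue depth (power of two) */

struct pkt_ring {
    int slots[RING_SIZE];    /* stand-in for packet buffers */
    size_t head, tail;       /* head: next to process, tail: next free */
    unsigned long dropped;   /* packets lost because the ring was full */
    long processed_sum;      /* visible result of the bottom half's work */
};

/* Top half: do the minimum. Stash the packet and return immediately. */
static int top_half(struct pkt_ring *r, int pkt)
{
    assert(r != NULL);
    if (r->tail - r->head == RING_SIZE) {
        r->dropped++;
        return -1;
    }
    r->slots[r->tail % RING_SIZE] = pkt;
    r->tail++;
    return 0;
}

/* Bottom half: runs later, drains the queue, does the expensive work.
 * Returns the number of packets processed. */
static size_t bottom_half(struct pkt_ring *r)
{
    size_t n = 0;
    while (r->head != r->tail) {
        r->processed_sum += r->slots[r->head % RING_SIZE];
        r->head++;
        n++;
    }
    return n;
}
```

If the bottom half falls behind, the ring fills and top_half starts dropping, which mirrors what happens when deferred processing cannot keep up with the arrival rate.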

Interrupt Control in the Kernel

Kernel code sometimes needs to run atomically with respect to interrupts:

local_irq_disable();   // cli on this CPU
/* critical section */
local_irq_enable();    // sti on this CPU

Warning: local_irq_disable() is not reference-counted. Calling it twice and then local_irq_enable() once re-enables interrupts immediately — a bug. Use local_irq_save(flags) / local_irq_restore(flags) when nesting is possible.

Disabling local interrupts does not protect against other CPU cores. Pair with spinlocks for SMP safety.
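The nesting bug is easy to reproduce in a user-space model where a single integer stands in for EFLAGS.IF. All sim_ functions below are hypothetical stand-ins; in the kernel, local_irq_save(flags) is a macro that takes an unsigned long variable by name rather than returning a value.

```c
#include <assert.h>
#include <stddef.h>

static int sim_irqs_enabled = 1;  /* stand-in for EFLAGS.IF on this CPU */

static void sim_local_irq_disable(void) { sim_irqs_enabled = 0; }
static void sim_local_irq_enable(void)  { sim_irqs_enabled = 1; }

/* Save the previous state, then disable. */
static int sim_local_irq_save(void)
{
    int flags = sim_irqs_enabled;
    sim_irqs_enabled = 0;
    return flags;
}

/* Restore whatever state the matching save observed. */
static void sim_local_irq_restore(int flags) { sim_irqs_enabled = flags; }

/* Buggy helper: unconditionally re-enables interrupts on exit. */
static void inner_buggy(void)
{
    sim_local_irq_disable();
    sim_local_irq_enable();
}

/* Safe helper: leaves the interrupt state as it found it. */
static void inner_safe(void)
{
    int flags = sim_local_irq_save();
    sim_local_irq_restore(flags);
}

/* Runs `inner` inside an outer critical section and reports whether
 * interrupts were (wrongly) enabled when it returned. */
static int outer_sees_enabled(void (*inner)(void))
{
    int seen;
    assert(inner != NULL);
    sim_local_irq_disable();
    inner();
    seen = sim_irqs_enabled;   /* should still be 0 here */
    sim_local_irq_enable();
    return seen;
}
```

Calling outer_sees_enabled(inner_buggy) reports that interrupts came back on in the middle of the outer critical section, while outer_sees_enabled(inner_safe) reports that they stayed off.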

To disable a specific IRQ line (e.g., while reinitializing a device):

disable_irq(irq);      // waits for any running handler to finish
disable_irq_nosync(irq);  // returns immediately
enable_irq(irq);

Interrupt Handling Flow in Linux

Each interrupt vector has a specific entry point in the kernel that:

  1. Saves the interrupt vector number and all registers.
  2. Calls common_interrupt(struct pt_regs *regs, u32 vector).
  3. common_interrupt maps the vector to its IRQ descriptor and hands off to the generic IRQ layer, which acknowledges the interrupt at the controller and invokes every ISR registered for that line.

You can inspect interrupt counts and which handlers are registered at /proc/interrupts.
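Each row of /proc/interrupts starts with an IRQ label and a colon, followed by one counter per CPU and then descriptive text. As a rough sketch of how a tool might total those counters, the helper below parses a single row; sum_irq_counts is a hypothetical name, and the caller is assumed to know the number of CPU columns from the header line.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one /proc/interrupts row such as
 * "  1:   9   5   IO-APIC   1-edge   i8042" (made-up sample values).
 * Returns -1 if the row has no "label:" prefix. */
static long long sum_irq_counts(const char *line, int ncpus)
{
    const char *p = strchr(line, ':');
    long long total = 0;

    assert(ncpus > 0);
    if (!p)
        return -1;
    p++;  /* skip past the colon */
    for (int cpu = 0; cpu < ncpus; cpu++) {
        char *end;
        unsigned long long n = strtoull(p, &end, 10);
        if (end == p)   /* fewer numeric columns than CPUs */
            break;
        total += (long long)n;
        p = end;
    }
    return total;
}
```

Comparing totals from two reads a second apart gives an interrupt rate per line, which is often enough to spot a device that is interrupting far more than expected.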


Key Takeaways

Practice

  1. How does the kernel implement a kprobe on an x86 system?
  2. You write a kprobe module targeting do_sys_openat2. In handler_pre, which argument register on x86-64 holds the second argument (the filename pointer passed by user space)?
  3. Which statement best describes the safety guarantee eBPF provides that a kernel module does not?
  4. Write a one-line bpftrace command that prints the PID and command name every time any process calls the write syscall.
  5. Why do modern systems use interrupt-driven I/O instead of polling for disk reads?
  6. A system has the Local APIC and I/O APIC. Which component is responsible for generating the timer interrupt used by the kernel scheduler?
  7. Explain the difference between the top half and bottom half of interrupt processing in Linux, and give a concrete example using network packet reception.
  8. Your ISR needs to allocate a small buffer. Which of the following is correct?
  9. A kernel developer calls local_irq_disable() inside a function, then calls another function that also calls local_irq_disable() and then local_irq_enable(). After both functions return, are interrupts enabled or disabled? Why is this a bug?
  10. When the CPU executes int 0x80, which of the following correctly describes the privilege transition?