Kernel Tracing, eBPF, and Interrupt Handling
Why This Matters
Production systems break in ways that cannot be reproduced in a lab. Tracing tools let you observe the live kernel without reboots or recompilation. Meanwhile, every device you interact with — keyboard, NIC, disk — relies on the interrupt mechanism to talk to the CPU efficiently. Understanding both topics unlocks the ability to diagnose real system behavior and to write correct, high-performance kernel code.
Part 1: Kernel Tracing
What is Kernel Tracing?
Kernel tracing is dynamic instrumentation: attaching probes to running kernel functions or instructions without modifying or recompiling the kernel. Primary use cases:
- Debugging and performance analysis
- Identifying bottlenecks and security threats
- Observability in production systems
Kprobes and Kretprobes
| Probe type | Where it fires | Typical use |
|---|---|---|
| kprobe | Before any kernel instruction | Inspect arguments, count calls |
| kretprobe | When the probed function returns | Capture return value, measure latency |
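Kprobes work via x86's one-byte int3 breakpoint instruction. When a kprobe is registered, the kernel saves the first byte of the target instruction and overwrites it with int3. When execution reaches that address, the CPU raises a breakpoint exception (vector 3) and the kprobe pre-handler runs. The kernel then executes the saved original instruction (single-stepped from a copy) and normal execution resumes.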
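The patching step can be modelled in user space. The sketch below is a toy, not the kernel's implementation: struct toy_probe, toy_arm() and toy_disarm() are hypothetical names, and the "kernel text" is an ordinary byte array. The only fact taken from x86 itself is that int3 encodes as the single byte 0xCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC  /* x86 one-byte breakpoint instruction */

/* Hypothetical bookkeeping for one probe; not the kernel's struct kprobe. */
struct toy_probe {
    uint8_t *addr;         /* first byte of the probed instruction */
    uint8_t  saved_opcode; /* original byte, needed to run the instruction later */
    int      armed;
};

/* Arm: remember the original first byte, then overwrite it with int3. */
void toy_arm(struct toy_probe *p, uint8_t *text, size_t offset)
{
    p->addr = text + offset;
    p->saved_opcode = *p->addr;
    *p->addr = INT3_OPCODE;
    p->armed = 1;
}

/* Disarm: put the original byte back so the code runs unmodified again. */
void toy_disarm(struct toy_probe *p)
{
    assert(p->armed);
    *p->addr = p->saved_opcode;
    p->armed = 0;
}
```

Because only one byte is replaced, the probe can sit on an instruction of any length, and restoring that byte is all it takes to remove the probe.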
Writing a kprobe Kernel Module
A kprobe module fills in a struct kprobe and registers it:
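#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/uaccess.h>
static struct kprobe kp;
/* Pre-handler: runs just before the probed instruction. x86-64 only (uses regs->si). */
static int handler_pre(struct kprobe *p, struct pt_regs *regs) {
    char filename[256];
    /* do_sys_openat2(dfd, filename, how): the user pointer is the 2nd argument (rsi). */
    /* Kprobe handlers must not sleep, so use the non-faulting copy helper (Linux 5.8+). */
    if (regs->si && strncpy_from_user_nofault(filename,
            (const char __user *)regs->si, sizeof(filename)) > 0) {
        printk(KERN_INFO "[kprobe] File opened: %s\n", filename);
    }
    return 0;
}
static int __init kprobe_init(void) {
    kp.symbol_name = "do_sys_openat2";
    kp.pre_handler = handler_pre;
    return register_kprobe(&kp);
}
static void __exit kprobe_exit(void) {
    unregister_kprobe(&kp);
}
module_init(kprobe_init);
module_exit(kprobe_exit);
MODULE_LICENSE("GPL");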
Key points:
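- symbol_name targets a kernel function by name. The symbol need not be exported, but functions on the kprobes blacklist (for example, those marked NOKPROBE_SYMBOL) cannot be probed.
- pre_handler receives a struct pt_regs *, giving access to all CPU registers at the probe site. On x86-64 the arguments to do_sys_openat2 live in rdi, rsi, rdx, etc.
- Always call unregister_kprobe() in your module's exit function.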
Profiling with perf and kprobes
You can attach kprobes without writing a module using perf probe:
sudo perf probe -a do_sys_openat2 # register the probe
sudo cat /sys/kernel/debug/kprobes/list # verify it's active
sudo perf record -e probe:do_sys_openat2 -aR sleep 1
sudo perf report # analyze call sites
sudo perf probe -d do_sys_openat2 # remove the probe
Part 2: eBPF
What is eBPF?
eBPF (Extended Berkeley Packet Filter) lets you run sandboxed programs inside the kernel in response to events — without writing a kernel module. Programs are verified by an in-kernel verifier before execution, preventing crashes or infinite loops. eBPF is now the foundation for many Linux tracing, networking, and security tools.
Key properties:
- Safe: The verifier rejects programs that could corrupt kernel memory or loop forever.
- Efficient: Programs are JIT-compiled to native machine code.
- Widely applicable: kprobes, tracepoints, network hooks, perf events, and more.
bpftrace
bpftrace is a high-level eBPF frontend with a language similar to AWK. It compiles scripts to BPF bytecode and loads them automatically.
# List all syscall tracepoints
sudo bpftrace -l 'tracepoint:syscalls:*'
# Print every execve call with the process name and binary path
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
printf("Process %s executed %s\n", comm, str(args->filename)); }'
# Trace malloc calls in libc with size
sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
printf("PID %d (%s) called malloc(size=%llu)\n", pid, comm, arg0); }'
bpftrace supports:
- tracepoint: for stable kernel tracepoints
- kprobe: / kretprobe: for dynamic function entry/return probes
- uprobe: for user-space function probes
Part 3: Interrupts
Why Interrupts Exist
Devices are slow compared to CPUs. A single random read from a spinning hard disk takes on the order of 10 ms or more: 4–10 ms of seek time plus about 5.6 ms of average rotational latency at 5400 RPM (half of one 11.1 ms revolution). Two strategies exist for the CPU to learn when I/O finishes:
| Strategy | Description | Problem |
|---|---|---|
| Polling | CPU repeatedly checks the device status register | Burns CPU cycles while waiting |
| Interrupt | Device signals the CPU on completion | CPU is free until the signal arrives |
Interrupts let the CPU run other processes while I/O is in flight, making them essential for multiprogramming.
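The rotational-latency figure quoted for a 5400 RPM disk, and the cost of polling through such a wait, can be checked with a little arithmetic. This is a minimal sketch: the function names are made up for illustration, and the 3 GHz clock used in the note afterwards is an assumed value, not one from the text.

```c
#include <assert.h>

/* Average rotational latency: on average the platter must turn half a revolution. */
double avg_rotational_latency_ms(double rpm)
{
    assert(rpm > 0.0);
    double ms_per_revolution = 60.0 * 1000.0 / rpm;
    return ms_per_revolution / 2.0;
}

/* Clock cycles a CPU would burn if it busy-polled for the whole wait. */
double cycles_spent_polling(double wait_ms, double clock_ghz)
{
    return wait_ms * 1e-3 * clock_ghz * 1e9;
}
```

avg_rotational_latency_ms(5400.0) comes out at roughly 5.56 ms. At an assumed 3 GHz, cycles_spent_polling(10.0, 3.0) is about 30 million cycles for a single 10 ms read, which is the work an interrupt-driven design hands to other processes.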
Interrupt Controller Hardware
Devices send electrical signals to the CPU through an interrupt controller:
- I/O APIC (Advanced Programmable Interrupt Controller): sits on the system chipset, receives interrupts from devices, and routes them to the appropriate processor.
- Local APIC: one per CPU core. Handles the routed interrupt and can also generate a timer interrupt (the heartbeat that drives scheduling) and inter-processor interrupts (IPI) for SMP coordination.
Interrupt Request Lines (IRQ)
Each device is identified by an IRQ number. Classic 8259A examples:
| IRQ | Device |
|---|---|
| 0 | System timer |
| 1 | Keyboard controller |
| 3, 4 | Serial ports |
| 5 | Parallel port / sound card |
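Legacy PCI devices share a small number of INTx interrupt lines, so handlers on those lines must cope with sharing. Modern PCIe devices typically use MSI or MSI-X (Message Signaled Interrupts) instead: the device signals an interrupt by writing a message to a special memory address, and each device (or each queue) gets its own vector, so sharing is largely avoided.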
Exceptions: Software Interrupts
Exceptions are interrupts raised by the CPU itself during instruction execution:
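- Faults: reported before the faulting instruction completes, so the handler can fix the cause and the CPU retries the instruction. Examples: page fault, general protection fault.
- Traps: reported after the instruction completes; execution resumes at the next instruction, with no retry. Examples: breakpoint (int3), overflow.
- Aborts: unrecoverable; the interrupted program cannot be reliably resumed. Example: machine check.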
The int N instruction triggers a software interrupt for vector N (0–255). int 0x80 is the classic Linux 32-bit syscall mechanism. iret returns from any interrupt or exception.
Maskable vs. Non-Maskable Interrupts
| Type | Can be disabled? | Examples |
|---|---|---|
| Non-Maskable (NMI) | No — always handled | Power failure, uncorrectable memory error |
| Maskable | Yes: masked with cli, unmasked with sti | All normal device interrupts |
The EFLAGS.IF flag controls masking. Clearing it with cli disables all maskable interrupts on the local CPU.
Interrupt Descriptor Table (IDT)
The IDT is the hardware dispatch table for interrupts and exceptions:
- 256 entries, each 16 bytes on x86-64
- Each entry (a gate descriptor) contains:
  - Offset: 64-bit destination instruction pointer (split across the entry)
  - Segment selector: the kernel code segment (CS) to load
  - Present flag: marks the entry as valid
- The IDTR register holds the base address and size of the IDT.
- The lidt instruction loads the IDTR.
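The way the 64-bit offset is split across a 16-byte entry is easier to see as a C struct. The sketch below follows the architectural layout of an x86-64 gate descriptor, but the struct, field, and helper names are illustrative rather than the kernel's own, and the selector value 0x10 in the usage note is just an example value.

```c
#include <assert.h>
#include <stdint.h>

/* One x86-64 IDT gate descriptor. Field names are illustrative. */
struct idt_gate {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* code segment selector to load into CS */
    uint8_t  ist;          /* bits 0..2: Interrupt Stack Table index, rest zero */
    uint8_t  type_attr;    /* bit 7: Present, bits 5..6: DPL, bits 0..3: gate type */
    uint16_t offset_mid;   /* handler address bits 16..31 */
    uint32_t offset_high;  /* handler address bits 32..63 */
    uint32_t reserved;     /* must be zero */
};
_Static_assert(sizeof(struct idt_gate) == 16, "x86-64 gate is 16 bytes");

#define GATE_TYPE_INTERRUPT 0xE   /* IF is cleared on entry */
#define GATE_TYPE_TRAP      0xF   /* IF is left unchanged */
#define GATE_PRESENT        0x80

/* Build a present gate pointing at `handler`. */
struct idt_gate make_gate(uint64_t handler, uint16_t selector,
                          uint8_t dpl, uint8_t type)
{
    struct idt_gate g = {0};
    assert(dpl <= 3);
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    g.selector    = selector;
    g.type_attr   = (uint8_t)(GATE_PRESENT | (dpl << 5) | (type & 0x0F));
    return g;
}

/* Reassemble the 64-bit handler address from its three pieces. */
uint64_t gate_offset(const struct idt_gate *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_mid << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

For example, make_gate(handler, 0x10, 0, GATE_TYPE_INTERRUPT) yields a type_attr byte of 0x8E: present, DPL 0, interrupt gate.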
Predefined Vectors
| Vector | Meaning |
|---|---|
| 0 | Divide Error |
| 1 | Debug Exception |
| 2 | NMI |
| 3 | Breakpoint (int3) |
| 13 | General Protection Fault |
| 14 | Page Fault |
| 32–255 | User-defined (device interrupts) |
What Happens When int N Executes
- The CPU looks up vector N in the IDT (IDTR base address + N × 16).
- It checks CPL ≤ DPL against the gate descriptor (the privilege check for a software-invoked int N).
- If the interrupt arrives in user mode, it saves the current SS:ESP to a CPU-internal register.
- It loads the new kernel SS:ESP from the TSS (Task State Segment), switching to the kernel stack.
- It pushes the user SS, ESP, EFLAGS, CS, and EIP onto the kernel stack.
- It clears certain EFLAGS bits (e.g., IF for interrupt gates).
- It jumps to the handler address from the IDT descriptor.
- The handler returns with iret, which pops the saved state and resumes user mode.
The register names above are the 32-bit ones; on x86-64 the same sequence uses RSP, RFLAGS, and RIP, and the return instruction is iretq.
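The first two steps, the table lookup and the privilege check, reduce to simple arithmetic and a comparison. The following is a minimal user-space sketch with hypothetical function names; the base address in the usage note is an arbitrary example.

```c
#include <assert.h>
#include <stdint.h>

#define IDT_ENTRY_SIZE 16u    /* bytes per gate on x86-64 */
#define IDT_ENTRIES    256u

/* Address of the gate for a vector: IDTR base + N * 16. */
uint64_t idt_entry_address(uint64_t idtr_base, unsigned vector)
{
    assert(vector < IDT_ENTRIES);
    return idtr_base + (uint64_t)vector * IDT_ENTRY_SIZE;
}

/* A software int N is allowed only if CPL <= gate DPL
 * (numerically lower means more privileged). */
int int_n_allowed(unsigned cpl, unsigned gate_dpl)
{
    return cpl <= gate_dpl;
}
```

With an example base of 0x1000, vector 0x80 resolves to 0x1800. A gate with DPL 3 can be invoked from user mode (CPL 3), which is how int 0x80 is reachable; the same instruction aimed at a DPL 0 gate fails the check, and the CPU raises a general protection fault instead.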
Interrupt Service Routines (ISR)
An ISR is a normal C function matching the irq_handler_t prototype:
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);
// Return IRQ_HANDLED if this device generated the interrupt,
// IRQ_NONE otherwise (important for shared lines)
Register and free handlers with:
int request_irq(unsigned int irq, irq_handler_t handler,
unsigned long flags, const char *devname, void *dev_id);
void free_irq(unsigned int irq, void *dev_id);
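For shared IRQ lines (IRQF_SHARED flag), dev_id must be non-NULL and unique per handler. The kernel passes it back to the handler so the driver can find its own device, and free_irq() uses it to decide which handler on the line to remove.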
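How a shared line behaves can be sketched as a user-space model. Nothing here is kernel API: irqreturn_t and irq_handler_t are redefined locally with the same shape so the code compiles outside the kernel, and struct toy_dev, toy_isr(), toy_request_irq(), toy_free_irq() and toy_dispatch() are hypothetical names. The model covers a single IRQ line.

```c
#include <assert.h>
#include <stddef.h>

/* Same shapes as the kernel types, redefined so this builds in user space. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

#define MAX_SHARED 4
struct toy_action { irq_handler_t handler; void *dev_id; };
static struct toy_action line[MAX_SHARED];  /* handlers sharing one IRQ line */

/* Hypothetical device: `pending` stands in for its interrupt status register. */
struct toy_dev { int pending; int serviced; };

irqreturn_t toy_isr(int irq, void *dev_id)
{
    struct toy_dev *dev = dev_id;
    (void)irq;
    if (!dev->pending)
        return IRQ_NONE;       /* not ours: another device on the line fired */
    dev->pending = 0;          /* acknowledge the device */
    dev->serviced++;
    return IRQ_HANDLED;
}

int toy_request_irq(irq_handler_t handler, void *dev_id)
{
    if (dev_id == NULL)
        return -1;             /* a shared handler needs a dev_id */
    for (size_t i = 0; i < MAX_SHARED; i++) {
        if (line[i].handler == NULL) {
            line[i].handler = handler;
            line[i].dev_id = dev_id;
            return 0;
        }
    }
    return -1;                 /* line is full */
}

/* dev_id is the only thing that tells two registrations apart. */
void toy_free_irq(void *dev_id)
{
    for (size_t i = 0; i < MAX_SHARED; i++) {
        if (line[i].handler != NULL && line[i].dev_id == dev_id) {
            line[i].handler = NULL;
            line[i].dev_id = NULL;
            return;
        }
    }
}

/* Every handler on the line is called; returns how many claimed the interrupt. */
int toy_dispatch(int irq)
{
    int handled = 0;
    for (size_t i = 0; i < MAX_SHARED; i++)
        if (line[i].handler != NULL &&
            line[i].handler(irq, line[i].dev_id) == IRQ_HANDLED)
            handled++;
    return handled;
}
```

Both devices here register the same toy_isr function, so the function pointer alone could not identify which registration to remove. That is the role dev_id plays.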
Interrupt Context Constraints
ISRs run in interrupt context (also called atomic context), not process context. This has important consequences:
| Forbidden in ISR | Reason | Alternative |
|---|---|---|
| kmalloc(…, GFP_KERNEL) | May sleep waiting for memory | Use GFP_ATOMIC |
| mutex_lock() | May sleep | Use spinlock |
| printk() in hot paths | Too slow / unsafe on some paths | Use trace_printk() |
| Sleeping / blocking | ISR is not a schedulable entity | Defer work to bottom half |
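Stack space is limited: interrupts run on a small, fixed-size stack (historically a single 4 KB page on 32-bit x86; x86-64 uses a larger per-CPU IRQ stack), so avoid large local variables and deep call chains in an ISR.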
Top Half vs. Bottom Half
Because ISRs must be fast but device work can be substantial, Linux splits interrupt handling:
Hardware interrupt arrives
│
▼
┌─────────────────────────────────────┐
│ TOP HALF (ISR — runs immediately) │
│ • Acknowledge hardware │
│ • Copy data to kernel memory │
│ • Re-arm the device │
└─────────────────┬───────────────────┘
│ schedules
▼
┌─────────────────────────────────────┐
│ BOTTOM HALF (deferred) │
│ • Softirq / Tasklet / Work Queue │
│ • Runs with interrupts enabled │
│ • Does the heavy processing │
└─────────────────────────────────────┘
Network example: The top half copies the packet from the NIC into main memory (urgent — NIC buffer is small). The bottom half parses protocol headers, routes the packet, and hands it to a socket (can wait).
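The split can be modelled in user space with a small ring buffer. This is a sketch of the pattern only, not of any real driver: struct toy_ring, toy_top_half() and toy_bottom_half() are hypothetical names, packets are reduced to their lengths, and setting a flag stands in for scheduling a softirq, tasklet, or work item.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8   /* small on purpose, like a NIC's on-board buffer */

struct toy_ring {
    int    slots[RING_SIZE];
    size_t head;                 /* next slot the bottom half reads */
    size_t tail;                 /* next slot the top half writes */
    size_t count;
    int    bottom_half_scheduled;
};

/* Top half: do the minimum. Queue one packet and defer everything else. */
int toy_top_half(struct toy_ring *r, int packet_len)
{
    if (r->count == RING_SIZE)
        return -1;                    /* ring full: the packet is dropped */
    r->slots[r->tail] = packet_len;
    r->tail = (r->tail + 1) % RING_SIZE;
    r->count++;
    r->bottom_half_scheduled = 1;     /* stands in for scheduling deferred work */
    return 0;
}

/* Bottom half: runs later and drains everything queued so far.
 * Returns the total number of bytes "processed". */
long toy_bottom_half(struct toy_ring *r)
{
    long total = 0;
    while (r->count > 0) {
        total += r->slots[r->head];
        r->head = (r->head + 1) % RING_SIZE;
        r->count--;
    }
    r->bottom_half_scheduled = 0;
    return total;
}
```

If the bottom half falls behind, the ring fills and the top half starts dropping packets. That is why the top half must stay short and why the queue between the two halves is bounded.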
Interrupt Control in the Kernel
Kernel code sometimes needs to run atomically with respect to interrupts:
local_irq_disable(); // cli on this CPU
/* critical section */
local_irq_enable(); // sti on this CPU
Warning: local_irq_disable() is not reference-counted. Calling it twice and then local_irq_enable() once re-enables interrupts immediately — a bug. Use local_irq_save(flags) / local_irq_restore(flags) when nesting is possible.
Disabling local interrupts does not protect against other CPU cores. Pair with spinlocks for SMP safety.
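The nesting problem can be reproduced with a user-space model of the interrupt flag. Everything prefixed toy_ or helper_ is hypothetical. In the kernel, local_irq_save() is a macro that writes into its flags argument, whereas this model returns the saved flags to stay plain C. The one fact taken from the architecture is that IF is bit 9 of EFLAGS.

```c
#include <assert.h>

#define X86_EFLAGS_IF (1UL << 9)   /* interrupt-enable flag: bit 9 of EFLAGS */

static unsigned long toy_eflags = X86_EFLAGS_IF;  /* modelled CPU: interrupts on */

void toy_irq_disable(void)  { toy_eflags &= ~X86_EFLAGS_IF; }  /* like cli */
void toy_irq_enable(void)   { toy_eflags |= X86_EFLAGS_IF; }   /* like sti */
int  toy_irqs_enabled(void) { return (toy_eflags & X86_EFLAGS_IF) != 0; }

/* Save: remember the old flags, then disable. */
unsigned long toy_irq_save(void)
{
    unsigned long flags = toy_eflags;
    toy_irq_disable();
    return flags;
}

/* Restore: put IF back to whatever it was when the flags were saved. */
void toy_irq_restore(unsigned long flags)
{
    toy_eflags = (toy_eflags & ~X86_EFLAGS_IF) | (flags & X86_EFLAGS_IF);
}

/* Buggy helper: re-enables unconditionally, even if the caller had interrupts off. */
void helper_unsafe(void)
{
    toy_irq_disable();
    /* critical section */
    toy_irq_enable();
}

/* Safe helper: leaves the interrupt state exactly as it found it. */
void helper_safe(void)
{
    unsigned long flags = toy_irq_save();
    /* critical section */
    toy_irq_restore(flags);
}
```

Calling helper_unsafe() from a region that already disabled interrupts leaves them enabled on return, so the caller's critical section is silently exposed. helper_safe() preserves whichever state it was entered with.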
To disable a specific IRQ line (e.g., while reinitializing a device):
disable_irq(irq); // waits for any running handler to finish
disable_irq_nosync(irq); // returns immediately
enable_irq(irq);
Interrupt Handling Flow in Linux
Each interrupt vector has a specific entry point in the kernel that:
- Saves the interrupt vector number and all registers.
- Calls common_interrupt(struct pt_regs *regs, u32 vector).
- common_interrupt acknowledges the interrupt and calls architecture-specific dispatch logic to invoke the registered ISR.
You can inspect interrupt counts and which handlers are registered at /proc/interrupts.
Key Takeaways
- kprobes attach dynamically to almost any kernel instruction using the int3 breakpoint trap; kretprobes fire on function return.
- eBPF programs run sandboxed in the kernel, verified before execution, and JIT-compiled for efficiency; bpftrace is the easiest entry point.
- Devices are slow relative to the CPU; interrupts let the CPU work on other tasks until a device signals completion (vs. wasteful polling).
- The IDT is a 256-entry hardware table mapping interrupt vectors to handler addresses; IDTR holds its base address and size.
- Exceptions (divide-by-zero, page fault, int3) are CPU-generated interrupts, dispatched through the same IDT as hardware interrupts.
- ISRs run in interrupt context: no sleeping, no blocking locks, limited stack, GFP_ATOMIC for allocations.
- Linux splits interrupt processing into a fast top half (ISR) and a deferred bottom half (softirq/tasklet/work queue) to balance latency against throughput.
- local_irq_disable() is not reference-counted: save and restore flags when nesting is possible. It does not protect against other cores.