OS Virtualization
Why This Matters
Every major cloud provider (AWS, Google Cloud, Azure) runs your code inside a virtual machine. Virtualization is the foundational technology that makes cloud computing, secure sandboxing, and live migration possible. As a Linux kernel programmer, you need to understand not just what a hypervisor is, but how it tricks an unmodified OS into thinking it owns the hardware, and where the real engineering challenges lie.
What Is OS Virtualization?
OS virtualization is the use of software (and hardware extensions) to run multiple operating systems simultaneously on one physical machine. The central component is the Virtual Machine Monitor (VMM), also called the hypervisor. The hypervisor presents each guest OS with an illusion of dedicated hardware while actually multiplexing the real hardware underneath.
Two key properties drive the design:
- Same code, near-native performance: guest code should run as fast as it would on bare metal.
- Strong isolation: a bug or compromise in one VM must not affect others.
A Brief History
| Era | State of virtualization |
|---|---|
| 1960s–70s | Popular on mainframes to share expensive hardware; IBM VM/370 |
| 1980s–90s | Interest declined as cheap personal computers and multi-user OSes (Unix) reduced the need |
| 2000s–present | Revival driven by server consolidation, cloud computing, and hardware support (Intel VT-x, AMD SVM) |
Use Cases
Server Consolidation
Running X VMs on Y physical hosts (Y < X) allows you to keep physical CPUs busy rather than idle. This was the original commercial driver for modern virtualization (VMware, Xen).
Software Development
- Spin up any OS on your laptop without rebooting.
- VMs are self-contained snapshots of an environment, making reproducible builds and automated testing easy.
Migration and Checkpoint/Restart
A VM's entire state (CPU registers, memory, device state) is captured in software, making two operations straightforward that are hard for plain processes:
- Live migration: move a running VM across hosts transparently (for maintenance, load balancing, or fault avoidance).
- Checkpoint/restart: dump state to disk and resume later.
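To make "state captured in software" concrete, here is a minimal sketch of checkpoint/restart in C. The toy_vm_state layout and the toy_checkpoint/toy_restore helpers are hypothetical names invented for this illustration; a real hypervisor also serializes device state and uses a versioned, portable format rather than a raw struct dump.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_MEM_BYTES 256

/* Hypothetical, drastically simplified VM state: registers plus guest memory. */
struct toy_vm_state {
    uint64_t regs[4];
    uint64_t rip;
    uint8_t  memory[TOY_MEM_BYTES];
};

/* Checkpoint: dump the whole state to a stream. Returns 0 on success, -1 on error. */
int toy_checkpoint(const struct toy_vm_state *vm, FILE *f)
{
    assert(vm != NULL && f != NULL);
    return fwrite(vm, sizeof *vm, 1, f) == 1 ? 0 : -1;
}

/* Restart: read the state back. The caller's struct is only overwritten
 * if a complete image was read, so a short read cannot leave it half-updated. */
int toy_restore(struct toy_vm_state *vm, FILE *f)
{
    struct toy_vm_state tmp;
    assert(vm != NULL && f != NULL);
    if (fread(&tmp, sizeof tmp, 1, f) != 1)
        return -1;
    memcpy(vm, &tmp, sizeof tmp);
    return 0;
}
```

Live migration follows the same idea, except that the serialized state is sent to another host over the network instead of being written to disk.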
Hardware Emulation
Cross-ISA emulators (e.g., QEMU in full-emulation mode) let you run ARM binaries on an x86 host, or prototype hardware that doesn't exist yet. Architecture simulators like Gem5 interpret every guest instruction in software for detailed performance/power analysis (5× to 1000× slowdown).
Cloud Computing
Virtualization enables the three cloud service models:
- IaaS: rent VMs (e.g., AWS EC2).
- PaaS: deploy apps on managed infrastructure (e.g., Google App Engine).
- SaaS: use hosted services (e.g., Gmail).
Security
- Sandboxing: isolate untrusted code (malware analysis, honeypots, browser tabs).
- VM introspection: analyze guest behavior from the higher-privileged VMM without the guest's cooperation.
Categories of Virtual Machines
System-Level VMs (Hypervisor-Based)
Run an unmodified OS image on top of virtualized hardware. The hypervisor exposes a virtual CPU, virtual memory, and virtual devices.
Machine simulators (e.g., QEMU in full-emulation mode, Gem5) emulate a different architecture entirely; every instruction is interpreted, so they are slow but portable.
Hypervisor-based VMs virtualize the same architecture as the host, using hardware extensions to execute guest code directly on the CPU. Examples: Xen-HVM, KVM, VMware ESXi, Hyper-V, VirtualBox.
Type 1 vs. Type 2 Hypervisors
| | Type 1 (bare-metal) | Type 2 (hosted) |
|---|---|---|
| Runs on | Raw hardware | Host OS |
| Examples | Xen, VMware ESX, Hyper-V | KVM/QEMU, VMware Workstation |
| Performance | Higher (no host OS overhead) | Slightly lower |
| Deployment | Servers/datacenters | Developer laptops |
Note: KVM is a kernel module that turns Linux itself into a Type-1-like hypervisor, but it is conventionally listed as Type 2 because it relies on a host Linux kernel.
OS-Level Lightweight VMs (Containers)
Instead of virtualizing hardware, containers use OS mechanisms (chroot, cgroups, namespaces) to isolate processes. They share the host kernel, so:
- Pros: fast startup, low overhead, small image.
- Cons: cannot run a different OS type (e.g., Windows on a Linux host); weaker isolation boundary.
Docker and LXC are well-known container runtimes. You can write a basic container in ~100 lines of Go using the primitives directly.
The x86 Virtualization Challenge
To virtualize a CPU cleanly, every instruction that touches privileged state must trap to the hypervisor when executed at unprivileged level, so the VMM can emulate it. This property is called strict virtualizability.
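x86 is not strictly virtualizable. Robin & Irvine (2000) identified 17 sensitive, non-privileged instructions: instructions that read or modify protected state but do not cause a trap when executed in user mode. Two examples:
- popf: restores the flags register from the stack; executed without sufficient privilege, it silently leaves the interrupt flag (IF) unchanged instead of trapping, so a deprivileged guest kernel's attempt to toggle interrupts is simply lost.
- sgdt / sidt: store the descriptor-table registers to memory; historically they execute in any ring without trapping, so a guest can observe the host's real values.
By contrast, in/out and hlt raise a general-protection fault (#GP) when executed without sufficient privilege (for in/out, subject to IOPL and the I/O Permission Bitmap), so a VMM can catch them. They are sensitive, but they are not among the problematic 17.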
Because these instructions can leak or corrupt VM state without the VMM's knowledge, naive "run and hope for a trap" doesn't work.
CPU Virtualization: Three Solutions
Solution 1: Trap & Emulate via Dynamic Binary Translation (DBT)
Before executing a block of guest code, scan it for sensitive instructions and replace them with calls into the VMM (stored in a code cache). The translated code runs natively; only the sensitive spots are intercepted. This approach requires no guest OS changes and no special hardware, but the scanning overhead can be significant.
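The scan-and-replace step can be sketched in C as follows. This is a toy under loud assumptions: every instruction is treated as a single byte (real x86 instructions are variable-length and need a proper decoder), only four opcodes are treated as sensitive, and OP_VMM_CALL is a made-up marker standing in for "call into the VMM". The function and constant names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One-byte x86 opcodes used for flavor; OP_VMM_CALL is a stand-in marker. */
enum {
    OP_NOP      = 0x90,
    OP_POPF     = 0x9D,
    OP_HLT      = 0xF4,
    OP_CLI      = 0xFA,
    OP_STI      = 0xFB,
    OP_VMM_CALL = 0xCC
};

static int is_sensitive(uint8_t op)
{
    return op == OP_POPF || op == OP_HLT || op == OP_CLI || op == OP_STI;
}

/* Copy n one-byte instructions from the guest block into the code cache,
 * replacing each sensitive opcode with OP_VMM_CALL. The guest's own code is
 * left untouched. Returns the number of instructions that were patched. */
size_t translate_block(const uint8_t *guest, uint8_t *cache, size_t n)
{
    size_t patched = 0;
    assert(guest != NULL && cache != NULL);
    for (size_t i = 0; i < n; i++) {
        if (is_sensitive(guest[i])) {
            cache[i] = OP_VMM_CALL;
            patched++;
        } else {
            cache[i] = guest[i];
        }
    }
    return patched;
}
```

The VMM then runs the copy in the code cache instead of the original block; a real translator would also remember which original instruction each patched slot replaced so that it can emulate it.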
Solution 2: Paravirtualization
Modify the guest OS so it is aware it is running inside a VM. Replace sensitive instructions with hypercalls, which are explicit requests to the VMM:
/* Guest kernel, before paravirtualization */
cli /* clear interrupt flag: sensitive! */
/* After paravirtualization (Xen style) */
hypercall(HYPERCALL_DISABLE_INTERRUPTS);
- Pro: near-native performance, no hardware support required.
- Con: requires a specially patched guest; you cannot run an unmodified Windows.
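Solution 3: Hardware-Assisted Virtualization (Intel VT-x / AMD SVM)
Intel VT-x adds Virtual Machine Extensions (VMX) to x86. Guest code, including the guest kernel in ring 0, runs in a new non-root mode. In that mode the problematic instructions either operate on the guest's own copy of the relevant state or trap to the VMM, depending on how the VMM has configured the hardware. The hardware handles the trapping, removing the need for binary translation or guest modifications.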
VT-x Operating Modes
┌─────────────────────────────────────────────────────────┐
│                  VMX Non-Root (Guest)                   │
│   Ring 3: User Apps          Ring 0: Guest OS kernel    │
│       ↓ VM Exit                 ↑ VM Entry              │
├─────────────────────────────────────────────────────────┤
│                     VMX Root (VMM)                      │
│             Ring 0: Virtual Machine Monitor             │
└─────────────────────────────────────────────────────────┘
          Executes on the physical CPU hardware
- VM Entry: transition from VMM (root) into the guest (non-root).
- VM Exit: any sensitive operation in the guest causes a trap back to the VMM.
The VM Control Structure (VMCS) is a per-vCPU data structure (one for each virtual CPU of each VM) that holds:
- Which events trigger a VM Exit (execution controls).
- The guest's saved CPU state (registers, segment descriptors, etc.).
- The host's state to restore on exit.
Minimizing unnecessary VM Exits is a key performance optimization.
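The effect of execution controls on the exit count can be sketched with a toy model in C. Nothing here is the real VMCS layout: toy_vmcs, toy_event, and count_vm_exits are hypothetical names, and the three events merely mirror kinds of exits that VT-x lets the VMM switch on or off (HLT, port I/O, RDTSC).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Guest events that the toy VMM may or may not intercept. */
enum toy_event { EV_HLT = 0, EV_IO = 1, EV_RDTSC = 2 };

/* Toy stand-in for a VMCS: execution controls plus a little saved state. */
struct toy_vmcs {
    uint32_t exit_controls;  /* bit i set => event i causes a VM Exit */
    uint64_t guest_rip;      /* guest state saved/restored across exits */
    uint64_t host_rip;       /* where the VMM resumes on a VM Exit */
};

/* Replay a trace of guest events and count how many of them leave the guest.
 * Events whose control bit is clear are handled by the (imaginary) hardware
 * without involving the VMM. */
size_t count_vm_exits(struct toy_vmcs *vmcs, const enum toy_event *trace, size_t n)
{
    size_t exits = 0;
    assert(vmcs != NULL && trace != NULL);
    for (size_t i = 0; i < n; i++) {
        vmcs->guest_rip++;  /* pretend each event is one instruction */
        if (vmcs->exit_controls & (1u << trace[i]))
            exits++;
    }
    return exits;
}
```

Clearing the EV_RDTSC bit for a guest that reads the timestamp counter in a tight loop removes those exits entirely, which is the kind of tuning the paragraph above refers to.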
KVM in Practice
KVM (Kernel-based Virtual Machine) is a Linux kernel module that exposes hardware virtualization through /dev/kvm. A minimal setup in C:
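// Error checks omitted for brevity; every call below can fail and return -1.
int kvm_fd = open("/dev/kvm", O_RDWR);         // handle to the KVM subsystem
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);   // new, empty VM
// mmap 4 KB guest memory, load guest binary, describe the mapping in `region`
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0); // vCPU with id 0
// Set RIP=0, RFLAGS=0x2 (bit 1 is reserved and always 1), then:
ioctl(vcpu_fd, KVM_RUN, 0); // runs until VM Exit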
When the guest executes hlt, KVM_RUN returns with the exit reason in the vCPU's shared kvm_run structure set to KVM_EXIT_HLT, and the host process decides what to do.
Memory Virtualization
A guest OS manages its own page tables mapping Guest Virtual Addresses (GVA) → Guest Physical Addresses (GPA). But GPA is not real physical memory; the hypervisor owns the real Host Physical Addresses (HPA). Two approaches bridge this gap:
| Approach | How it works | Notes |
|---|---|---|
| Shadow Page Tables | VMM maintains a shadow PT that maps GVA → HPA directly; write-protects guest PT pages to intercept updates | Older approach; high overhead on PT-heavy workloads |
| Nested (Extended) Page Tables | Hardware walks two levels of page tables automatically (Intel EPT / AMD NPT): GVA→GPA handled by guest PT, GPA→HPA handled by the VMM's nested PT | Modern standard; reduces VM exits for PT updates |
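The two approaches compute the same GVA → HPA mapping, which a toy model in C can make concrete. This sketch assumes flat, single-level tables of 16 entries and 4 KB pages; real page tables are multi-level trees, and with EPT/NPT the walk is done by hardware, not by a C function. All names (toy_walk, translate_nested, build_shadow) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT    12                        /* 4 KB pages */
#define PAGE_MASK     ((1ull << PAGE_SHIFT) - 1)
#define TOY_PAGES     16                        /* entries per toy table */
#define INVALID_FRAME UINT32_MAX                /* "not present" marker */
#define TOY_FAULT     UINT64_MAX                /* translation failed */

/* One table walk: the page number indexes the table, the offset is kept. */
uint64_t toy_walk(const uint32_t *pt, uint64_t addr)
{
    uint64_t page = addr >> PAGE_SHIFT;
    assert(pt != 0);
    if (page >= TOY_PAGES || pt[page] == INVALID_FRAME)
        return TOY_FAULT;
    return ((uint64_t)pt[page] << PAGE_SHIFT) | (addr & PAGE_MASK);
}

/* Nested paging: GVA -> GPA via the guest's table, then GPA -> HPA via the VMM's. */
uint64_t translate_nested(const uint32_t *guest_pt, const uint32_t *nested_pt,
                          uint64_t gva)
{
    uint64_t gpa = toy_walk(guest_pt, gva);
    return gpa == TOY_FAULT ? TOY_FAULT : toy_walk(nested_pt, gpa);
}

/* Shadow paging: the VMM precomputes GVA -> HPA by composing the two tables.
 * It must redo this whenever the guest edits guest_pt, which is why guest
 * page-table pages are write-protected. */
void build_shadow(const uint32_t *guest_pt, const uint32_t *nested_pt,
                  uint32_t *shadow_pt)
{
    for (int i = 0; i < TOY_PAGES; i++) {
        uint32_t gfn = guest_pt[i];
        shadow_pt[i] = (gfn == INVALID_FRAME || gfn >= TOY_PAGES)
                           ? INVALID_FRAME
                           : nested_pt[gfn];
    }
}
```

With guest page 1 mapped to guest frame 3 and guest frame 3 mapped to host frame 7, both translate_nested and a walk of the shadow table turn GVA 0x1234 into HPA 0x7234.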
I/O Virtualization
Emulation
The VMM presents a virtual device to the guest (e.g., an e1000 NIC). The guest uses its normal driver. Every I/O access causes a VM Exit; the VMM intercepts it and calls an emulated-device module (often QEMU) that performs the real I/O. Simple and compatible, but high overhead (many exits per packet).
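The cost of emulation is easiest to see with a device far simpler than an e1000. The following C sketch models a hypothetical emulated serial port: each guest port write is assumed to cause one VM Exit, which the VMM routes to the device model. The toy_serial type and both functions are illustrative names, not QEMU or KVM APIs; 0x3F8 is the conventional COM1 port on PCs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SERIAL_PORT 0x3F8   /* conventional COM1 I/O port */

/* Hypothetical emulated device: collects bytes the guest writes to it. */
struct toy_serial {
    char   buf[64];
    size_t len;
    size_t exits;   /* VM Exits spent on this device model */
};

/* Called by the VMM after a VM Exit caused by a guest port write.
 * Returns 0 if the device claimed the access, -1 if the port is not ours. */
int handle_io_exit(struct toy_serial *dev, uint16_t port, uint8_t value)
{
    assert(dev != NULL);
    dev->exits++;
    if (port != TOY_SERIAL_PORT)
        return -1;
    if (dev->len + 1 < sizeof dev->buf) {   /* keep room for the terminator */
        dev->buf[dev->len++] = (char)value;
        dev->buf[dev->len] = '\0';
    }
    return 0;
}

/* Models an unmodified guest driver: one port write, hence one exit, per byte.
 * Returns the number of VM Exits the string cost. */
size_t guest_write_string(struct toy_serial *dev, const char *s)
{
    size_t before;
    assert(dev != NULL && s != NULL);
    before = dev->exits;
    for (; *s != '\0'; s++)
        handle_io_exit(dev, TOY_SERIAL_PORT, (uint8_t)*s);
    return dev->exits - before;
}
```

Writing a four-byte string costs four exits here; a NIC driver touching several device registers per packet pays the same kind of price on every packet.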
Front-End / Back-End Driver Model (Xen)
Used in paravirtualized guests:
- Front-end driver: runs in the guest (DomU), sends requests through a shared memory ring buffer.
- Back-end driver: runs in Dom0 (the privileged management domain), receives requests and performs actual I/O with the native driver.
No per-operation VM Exit once the ring is set up; much lower overhead than full emulation.
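A shared ring of this kind can be sketched in a few lines of C. This is a single-threaded toy with hypothetical names (toy_ring, frontend_push, backend_pop); Xen's actual rings also carry responses, and real shared-memory rings need memory barriers and a notification mechanism that are omitted here.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8   /* number of slots; a power of two */

/* Hypothetical block I/O request placed in the ring by the guest. */
struct toy_request {
    uint32_t id;
    uint32_t sector;
};

/* Lives in memory shared between guest (front end) and Dom0 (back end). */
struct toy_ring {
    uint32_t prod;    /* total requests produced by the front end */
    uint32_t cons;    /* total requests consumed by the back end */
    struct toy_request slots[RING_SIZE];
};

/* Front end: enqueue a request. Returns 0, or -1 if the ring is full. */
int frontend_push(struct toy_ring *r, struct toy_request req)
{
    assert(r != 0);
    if (r->prod - r->cons == RING_SIZE)
        return -1;
    r->slots[r->prod % RING_SIZE] = req;
    r->prod++;
    return 0;
}

/* Back end: dequeue the oldest request. Returns 0, or -1 if the ring is empty. */
int backend_pop(struct toy_ring *r, struct toy_request *out)
{
    assert(r != 0 && out != 0);
    if (r->cons == r->prod)
        return -1;
    *out = r->slots[r->cons % RING_SIZE];
    r->cons++;
    return 0;
}
```

The front end can queue several requests before the back end looks at the ring, so the cost of crossing between domains is amortized over a batch instead of being paid per operation.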
SR-IOV (Single-Root I/O Virtualization)
Hardware-level solution. An SR-IOV-capable NIC exposes:
- One Physical Function (PF): controlled by the hypervisor/Dom0.
- Multiple Virtual Functions (VFs): lightweight PCIe functions, each assignable to a different VM.
The VM accesses the VF directly (with IOMMU protection), bypassing the VMM on the data path entirely. Near-native network performance.
Device Drivers: Character vs. Block
Linux exposes devices as files under /dev. There are two major types:
| Type | Data model | Random access | Examples |
|---|---|---|---|
| Character device | Byte stream | No | Serial ports, terminals, /dev/null, /dev/zero |
| Block device | Fixed-size blocks (512 B or multiples) | Yes | Hard drives, USB drives, SSDs |
Common misconception: /dev/null and /dev/zero are character devices, not block devices, despite feeling like "storage."
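You can check a device's type from user space. The sketch below, in C, uses the standard stat call and the S_ISCHR macro, plus the glibc major/minor macros to recover the device numbers that mknod takes. The helper names are hypothetical, and the code assumes a Linux system where /dev/zero exists.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

/* Returns 1 if path is a character device, 0 for any other file type,
 * and -1 if it cannot be stat'ed. */
int is_char_device(const char *path)
{
    struct stat st;
    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return S_ISCHR(st.st_mode) ? 1 : 0;
}

/* Fetches the major/minor numbers of a device node. Returns 0 or -1. */
int device_numbers(const char *path, unsigned *maj, unsigned *min)
{
    struct stat st;
    assert(path != NULL && maj != NULL && min != NULL);
    if (stat(path, &st) != 0)
        return -1;
    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);
    return 0;
}

/* Reads up to n bytes from /dev/zero into buf; the stream never ends and
 * every byte is 0. Returns the number of bytes read, or -1 on error. */
ssize_t read_dev_zero(unsigned char *buf, size_t n)
{
    ssize_t got;
    int fd;
    assert(buf != NULL);
    fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return -1;
    got = read(fd, buf, n);
    close(fd);
    return got;
}
```

On Linux, /dev/null is conventionally major 1, minor 3 and /dev/zero is major 1, minor 5, which is what device_numbers should report on a typical system.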
Writing a kernel character device driver follows the same module skeleton seen in earlier lectures:
- Define a major/minor number and file_operations struct.
- Implement open, release, read, write.
- Register with register_chrdev.
- Create the device file: sudo mknod /dev/my_dev c <major> 0.
Key Takeaways
- A hypervisor virtualizes hardware so multiple OSes share one physical machine with strong isolation and near-native performance.
- x86 has 17 sensitive non-privileged instructions that break naive virtualization; the three fixes are binary translation, paravirtualization, and hardware support (VT-x/SVM).
- VT-x adds VMX non-root/root modes: guest code runs natively; sensitive operations cause a VM Exit handled by the VMM, controlled via VMCS.
- Memory virtualization evolved from software-maintained shadow page tables to hardware-walked nested/extended page tables (EPT/NPT).
- I/O virtualization ranges from slow-but-compatible device emulation, through paravirtualized front/back-end rings, to near-native SR-IOV hardware pass-through.
- Containers are OS-level VMs: they share the host kernel via namespaces and cgroups, offering speed at the cost of weaker isolation and no cross-OS support.