OS Virtualization

Why This Matters

Every major cloud provider (AWS, Google Cloud, Azure) runs your code inside a virtual machine. Virtualization is the foundational technology that makes cloud computing, secure sandboxing, and live migration possible. As a Linux kernel programmer, you need to understand not just what a hypervisor is, but how it tricks an unmodified OS into thinking it owns the hardware, and where the real engineering challenges lie.


What Is OS Virtualization?

OS virtualization is the use of software (and hardware extensions) to run multiple operating systems simultaneously on one physical machine. The central component is the Virtual Machine Monitor (VMM), also called the hypervisor. The hypervisor presents each guest OS with an illusion of dedicated hardware while actually multiplexing the real hardware underneath.

Two key properties drive the design:


A Brief History

Era | State of virtualization
1960s–70s | Popular on mainframes to share expensive hardware (e.g., IBM VM/370)
1980s–90s | Interest declined as cheap personal computers and multi-user OSes (Unix) reduced the need
2000s–present | Revival driven by server consolidation, cloud computing, and hardware support (Intel VT-x, AMD SVM)

Use Cases

Server Consolidation

Running X VMs on Y physical hosts (Y < X) allows you to keep physical CPUs busy rather than idle. This was the original commercial driver for modern virtualization (VMware, Xen).

Software Development

Migration and Checkpoint/Restart

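A VM's entire state (CPU registers, memory, device state) is captured in software, making two operations straightforward that are hard for plain processes: migrating a running VM to another physical host, and checkpointing a VM so that it can be restarted later from the saved state.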
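
Because the state is plain data, checkpoint and restart reduce to copying it. The following is a minimal sketch under strong assumptions: `struct vm_state`, `vm_checkpoint`, and `vm_restore` are hypothetical names, device state is omitted, and a real hypervisor would also have to pause the vCPUs and write the snapshot to stable storage.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUEST_MEM_SIZE 4096   /* hypothetical: a single 4 KB page of guest RAM */

/* Hypothetical, heavily simplified VM state. */
struct vm_state {
    uint64_t regs[16];              /* general-purpose registers */
    uint64_t rip;                   /* instruction pointer */
    uint8_t  mem[GUEST_MEM_SIZE];   /* guest physical memory */
};

/* Checkpoint: copy the whole state into a freshly allocated snapshot.
 * Returns NULL if the allocation fails. */
struct vm_state *vm_checkpoint(const struct vm_state *vm)
{
    struct vm_state *snap = malloc(sizeof(*snap));
    if (snap)
        memcpy(snap, vm, sizeof(*snap));
    return snap;
}

/* Restart: overwrite the live state with a previously taken snapshot. */
void vm_restore(struct vm_state *vm, const struct vm_state *snap)
{
    assert(vm != NULL && snap != NULL);
    memcpy(vm, snap, sizeof(*vm));
}
```

Migration follows the same idea, except that the snapshot is sent over the network to another host instead of being kept locally.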

Hardware Emulation

Cross-ISA emulators (e.g., QEMU in full-emulation mode) let you run ARM binaries on an x86 host, or prototype hardware that doesn't exist yet. Architecture simulators like Gem5 interpret every guest instruction in software for detailed performance/power analysis (5× to 1000× slowdown).

Cloud Computing

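Virtualization enables the three cloud service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).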

Security


Categories of Virtual Machines

System-Level VMs (Hypervisor-Based)

Run an unmodified OS image on top of virtualized hardware. The hypervisor exposes a virtual CPU, virtual memory, and virtual devices.

Machine simulators (e.g., QEMU in full-emulation mode, Gem5) emulate a different architecture entirely; every guest instruction is interpreted or translated in software, so they are slow but portable.

Hypervisor-based VMs virtualize the same architecture as the host, using hardware extensions to execute guest code directly on the CPU. Examples: Xen-HVM, KVM, VMware ESXi, Hyper-V, VirtualBox.

Type 1 vs. Type 2 Hypervisors

Property | Type 1 (bare-metal) | Type 2 (hosted)
Runs on | Raw hardware | Host OS
Examples | Xen, VMware ESXi, Hyper-V | KVM/QEMU, VMware Workstation
Performance | Higher (no host OS overhead) | Slightly lower
Deployment | Servers/datacenters | Developer laptops

Note: KVM is a kernel module that turns Linux itself into a Type-1-like hypervisor, but it is conventionally listed as Type 2 because it relies on a host Linux kernel.

OS-Level Lightweight VMs (Containers)

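Instead of virtualizing hardware, containers use OS mechanisms (chroot, cgroups, namespaces) to isolate processes. They share the host kernel, so they start quickly and add little overhead, but they provide weaker isolation than a hypervisor and cannot run a different OS kernel.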

Docker and LXC are well-known container runtimes. You can write a basic container in ~100 lines of Go using the primitives directly.
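
To give a feel for those primitives, here is a minimal sketch in C (rather than Go) that uses only one of them: a UTS namespace, which gives a process its own hostname. The function name `uts_namespace_demo` and the hostname "container" are illustrative choices. Creating the namespace normally requires root or CAP_SYS_ADMIN, so the sketch reports failure instead of assuming it succeeds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, move it into a new UTS namespace, and change the hostname
 * there. Returns 0 if the child renamed itself and the parent's hostname is
 * unchanged, 1 if the parent's hostname changed, and -1 if the namespace
 * could not be created (for example, missing CAP_SYS_ADMIN).
 */
int uts_namespace_demo(void)
{
    char before[256] = {0}, after[256] = {0};

    if (gethostname(before, sizeof(before) - 1) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: get a private copy of the hostname, then change it. */
        if (unshare(CLONE_NEWUTS) != 0)
            _exit(1);
        if (sethostname("container", strlen("container")) != 0)
            _exit(1);
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Parent: its namespace was never touched, so the name should match. */
    if (gethostname(after, sizeof(after) - 1) != 0)
        return -1;
    return strcmp(before, after) == 0 ? 0 : 1;
}
```

A real container runtime combines several namespaces (PID, mount, network, user) with cgroups for resource limits and a private root filesystem; this sketch isolates only the hostname.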


The x86 Virtualization Challenge

To virtualize a CPU cleanly, every instruction that touches privileged state must trap to the hypervisor when executed at unprivileged level, so the VMM can emulate it. This property is called strict virtualizability.

x86 is not strictly virtualizable. Robin & Irvine (2000) identified 17 sensitive, non-privileged instructions, that is, instructions that read or modify protected state but do not cause a trap when executed in user mode. Two examples:

Because these instructions can leak or corrupt VM state without the VMM's knowledge, naive "run and hope for a trap" doesn't work.
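
The problem can be observed from an ordinary user process. The sketch below uses `popf` as the illustration (my choice of example; it is one of the instructions in the Robin & Irvine list): user code tries to clear the interrupt-enable flag, the CPU silently ignores the change, and no trap is raised that a VMM could intercept. It assumes an x86-64 build with GCC or Clang, whose `<x86intrin.h>` provides `__readeflags` and `__writeeflags`; the function name is hypothetical.

```c
#include <assert.h>
#if defined(__x86_64__)
#include <x86intrin.h>   /* __readeflags(), __writeeflags() (GCC/Clang) */
#endif

#define EFLAGS_IF (1ULL << 9)   /* interrupt-enable flag in RFLAGS */

/*
 * Returns 1 if IF is still set after user-mode code tried to clear it,
 * 0 if the write took effect, and -1 on builds that are not x86-64.
 */
int popf_keeps_if_set(void)
{
#if defined(__x86_64__)
    unsigned long long before = __readeflags();
    assert(before & EFLAGS_IF);            /* user code runs with IF = 1 */

    __writeeflags(before & ~EFLAGS_IF);    /* push + popf: try to clear IF */

    unsigned long long after = __readeflags();
    return (after & EFLAGS_IF) ? 1 : 0;
#else
    return -1;
#endif
}
```

A guest kernel that is run deprivileged would execute the same instruction expecting interrupts to be disabled, and neither it nor the VMM would learn that nothing happened.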


CPU Virtualization: Three Solutions

Solution 1: Trap & Emulate via Dynamic Binary Translation (DBT)

Before executing a block of guest code, the VMM scans it for sensitive instructions and replaces them with calls into the VMM. The translated block is kept in a code cache so that it is translated only once. The translated code runs natively; only the sensitive spots are intercepted. This approach requires no guest OS changes and no special hardware, but the scanning and translation overhead can be significant.
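
The scan-and-replace step can be sketched on a toy instruction set. Everything here is hypothetical: the `enum opcode` values stand in for decoded x86 instructions, and a real translator also has to handle variable-length encodings, branch targets, and cache invalidation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy ISA; a real DBT engine decodes x86 machine code. */
enum opcode { OP_ADD, OP_LOAD, OP_STORE, OP_POPF, OP_SGDT, OP_CALL_VMM, OP_RET };

/* Sensitive but non-privileged: would not trap if run directly. */
static int is_sensitive(enum opcode op)
{
    return op == OP_POPF || op == OP_SGDT;
}

/*
 * Translate one block of n guest instructions into the code cache,
 * replacing each sensitive instruction with a call into the VMM.
 * Returns the number of instructions that were replaced.
 */
size_t translate_block(const enum opcode *guest, enum opcode *cache, size_t n)
{
    size_t patched = 0;

    assert(guest != NULL && cache != NULL);
    for (size_t i = 0; i < n; i++) {
        if (is_sensitive(guest[i])) {
            cache[i] = OP_CALL_VMM;   /* the VMM emulates this one */
            patched++;
        } else {
            cache[i] = guest[i];      /* safe: copied and run natively */
        }
    }
    return patched;
}
```

The guest's original code is never modified; only the copy in the code cache is executed.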

Solution 2: Paravirtualization

Modify the guest OS so it is aware it is running inside a VM. Replace sensitive instructions with hypercalls, which are explicit requests to the VMM:

/* Guest kernel, before paravirtualization */
cli   /* clear interrupt flag: sensitive! */

/* After paravirtualization (Xen style) */
hypercall(HYPERCALL_DISABLE_INTERRUPTS);
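
What happens on the other side of that call can be sketched as a dispatch function in the VMM. This is a simulation in ordinary C, not Xen's actual ABI: the hypercall numbers, `struct vcpu`, and `vmm_handle_hypercall` are hypothetical, and a real hypercall crosses the privilege boundary through a dedicated instruction or entry page rather than a direct function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hypercall numbers; a real VMM defines its own ABI. */
enum hypercall_nr {
    HYPERCALL_DISABLE_INTERRUPTS,
    HYPERCALL_ENABLE_INTERRUPTS
};

/* State the VMM keeps on behalf of one virtual CPU. */
struct vcpu {
    bool virtual_if;   /* the guest's view of the interrupt flag */
};

/* VMM side: update virtual state instead of touching the real flag.
 * Returns 0 on success, -1 for an unknown hypercall number. */
int vmm_handle_hypercall(struct vcpu *v, int nr)
{
    assert(v != NULL);
    switch (nr) {
    case HYPERCALL_DISABLE_INTERRUPTS:
        v->virtual_if = false;
        return 0;
    case HYPERCALL_ENABLE_INTERRUPTS:
        v->virtual_if = true;
        return 0;
    default:
        return -1;
    }
}

/* Guest side: stands in for the privilege-crossing hypercall. */
int hypercall(struct vcpu *v, int nr)
{
    return vmm_handle_hypercall(v, nr);
}
```

The physical interrupt flag never changes; the VMM simply stops delivering virtual interrupts to this vCPU while `virtual_if` is false.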

Solution 3: Hardware-Assisted Virtualization (Intel VT-x / AMD SVM)

Intel VT-x adds Virtual Machine Extensions (VMX) to x86. When executed in non-root mode, the 17 problematic instructions either cause a VM Exit or operate only on guest state that the hardware keeps separate from the host's. The hardware handles the trapping, removing the need for binary translation or guest modifications.

VT-x Operating Modes

┌─────────────────────────────────────────────────────────┐
│                  VMX Non-Root (Guest)                   │
│   Ring 3: User Apps       Ring 0: Guest OS kernel       │
│                          ↓ VM Exit  ↑ VM Entry          │
├─────────────────────────────────────────────────────────┤
│                    VMX Root (VMM)                       │
│             Ring 0: Virtual Machine Monitor             │
└─────────────────────────────────────────────────────────┘
          Executes on the physical CPU hardware

The Virtual Machine Control Structure (VMCS) is a per-vCPU data structure (one for each virtual CPU of each VM) that holds:

Minimizing unnecessary VM Exits is a key performance optimization.

KVM in Practice

KVM (Kernel-based Virtual Machine) is a Linux kernel module that exposes hardware virtualization through /dev/kvm. A minimal setup in C:

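#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

int kvm_fd  = open("/dev/kvm", O_RDWR | O_CLOEXEC);  // handle to the KVM subsystem
int vm_fd   = ioctl(kvm_fd, KVM_CREATE_VM, 0);       // new, empty VM
// mmap 4 KB guest memory, load guest binary, describe it in `region`
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);      // vCPU 0
// Set RIP=0, RFLAGS=0x2 (KVM_SET_REGS), then:
ioctl(vcpu_fd, KVM_RUN, 0);  // runs until a VM Exit that userspace must handle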

When the guest executes hlt, KVM reports KVM_EXIT_HLT and the host process decides what to do.


Memory Virtualization

A guest OS manages its own page tables mapping Guest Virtual Addresses (GVA) → Guest Physical Addresses (GPA). But GPA is not real physical memory: the hypervisor owns the real Host Physical Addresses (HPA). Two approaches bridge this gap:

Approach | How it works | Notes
Shadow Page Tables | VMM maintains a shadow PT that maps GVA → HPA directly; write-protects guest PT pages to intercept updates | Older approach; high overhead on PT-heavy workloads
Nested (Extended) Page Tables | Hardware walks two sets of page tables automatically (Intel EPT / AMD NPT): GVA → GPA handled by the guest PT, GPA → HPA handled by the VMM's nested PT | Modern standard; reduces VM exits for PT updates
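
Both approaches compute the same composition of two mappings; they differ in who performs it and when. The sketch below is a toy model: the tables are single-level arrays indexed by page number, their contents are invented, and `NOT_MAPPED` stands in for a page fault. Real page tables are multi-level trees walked by the MMU.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
#define NPAGES     4
#define NOT_MAPPED UINT32_MAX

/* Toy tables with made-up contents, indexed by page number. */
static const uint32_t guest_pt[NPAGES]  = { 2, 0, NOT_MAPPED, 1 };  /* GVA page -> GPA page */
static const uint32_t nested_pt[NPAGES] = { 7, 5, 9, NOT_MAPPED };  /* GPA page -> HPA page */

/* Translate one address through one table, keeping the page offset. */
static uint32_t walk(const uint32_t *pt, uint32_t addr)
{
    uint32_t page = addr >> PAGE_SHIFT;

    if (page >= NPAGES || pt[page] == NOT_MAPPED)
        return NOT_MAPPED;
    return (pt[page] << PAGE_SHIFT) | (addr & PAGE_MASK);
}

/* Nested paging: the hardware performs both stages on every TLB miss. */
uint32_t gva_to_hpa(uint32_t gva)
{
    uint32_t gpa = walk(guest_pt, gva);

    return gpa == NOT_MAPPED ? NOT_MAPPED : walk(nested_pt, gpa);
}

/* Shadow paging: the VMM precomputes GVA page -> HPA page in software
 * and must redo this whenever the guest edits its own page table. */
void build_shadow(uint32_t shadow[NPAGES])
{
    assert(shadow != 0);
    for (uint32_t p = 0; p < NPAGES; p++) {
        uint32_t hpa = gva_to_hpa(p << PAGE_SHIFT);

        shadow[p] = (hpa == NOT_MAPPED) ? NOT_MAPPED : hpa >> PAGE_SHIFT;
    }
}
```

With these tables, guest virtual address 0x0123 lies in GVA page 0, which maps to GPA page 2 and then to HPA page 9, giving host physical address 0x9123.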

I/O Virtualization

Emulation

The VMM presents a virtual device to the guest (e.g., an e1000 NIC). The guest uses its normal driver. Every I/O access causes a VM Exit; the VMM intercepts it and calls an emulated-device module (often QEMU) that performs the real I/O. Simple and compatible, but high overhead (many exits per packet).
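
The cost model is easy to see in a simulation. In the sketch below, a hypothetical emulated serial port receives one byte per trapped port write, so a string of n characters costs n exits. The structure and function names are illustrative; the port number 0x3f8 is the conventional COM1 data port and is used only as an example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SERIAL_PORT 0x3f8   /* conventional COM1 data port (example only) */

/* Hypothetical emulated device state kept by the VMM. */
struct emulated_serial {
    char     out[64];   /* bytes the guest has "transmitted" */
    size_t   len;
    unsigned exits;     /* number of VM Exits taken so far */
};

/* VMM side: called once for every trapped guest port write.
 * Returns 0 if the byte was accepted, -1 otherwise. */
int handle_pio_exit(struct emulated_serial *dev, uint16_t port, uint8_t value)
{
    assert(dev != NULL);
    dev->exits++;
    if (port != SERIAL_PORT || dev->len + 1 >= sizeof(dev->out))
        return -1;
    dev->out[dev->len++] = (char)value;
    dev->out[dev->len] = '\0';
    return 0;
}

/* Stand-in for an unmodified guest driver: one port write per byte,
 * and therefore one VM Exit per byte. */
void guest_puts(struct emulated_serial *dev, const char *s)
{
    for (; *s != '\0'; s++)
        handle_pio_exit(dev, SERIAL_PORT, (uint8_t)*s);
}
```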

Front-End / Back-End Driver Model (Xen)

Used in paravirtualized guests:

No per-operation VM Exit once the ring is set up; much lower overhead than full emulation.
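
The ring itself can be sketched as a fixed-size producer/consumer queue in memory that both sides can see. This is a single-threaded illustration under stated assumptions: the names `shared_ring`, `frontend_submit`, and `backend_poll` are hypothetical, and a real implementation additionally needs memory barriers, a response path, and a notification mechanism between the two domains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8   /* power of two, so wrapping an index is a mask */

/* Hypothetical request format for a block device. */
struct io_request {
    uint64_t sector;
    uint32_t nbytes;
};

/* Lives in a memory page shared between front-end and back-end. */
struct shared_ring {
    struct io_request slots[RING_SIZE];
    uint32_t prod;   /* advanced only by the front-end (guest) */
    uint32_t cons;   /* advanced only by the back-end (driver domain) */
};

/* Front-end: queue a request without leaving the guest.
 * Returns false if the ring is full. */
bool frontend_submit(struct shared_ring *r, struct io_request req)
{
    assert(r != NULL);
    if (r->prod - r->cons == RING_SIZE)
        return false;
    r->slots[r->prod & (RING_SIZE - 1)] = req;
    r->prod++;
    return true;
}

/* Back-end: take the oldest pending request, if any. */
bool backend_poll(struct shared_ring *r, struct io_request *out)
{
    assert(r != NULL && out != NULL);
    if (r->cons == r->prod)
        return false;
    *out = r->slots[r->cons & (RING_SIZE - 1)];
    r->cons++;
    return true;
}
```

Because many requests can be queued before the other side is notified, the cost of crossing between domains is amortized over a batch rather than paid per operation.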

SR-IOV (Single-Root I/O Virtualization)

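Hardware-level solution. An SR-IOV-capable NIC exposes a Physical Function (PF), which the host uses to configure the device, and multiple lightweight Virtual Functions (VFs), each of which can be assigned to a VM.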

The VM accesses the VF directly (with IOMMU protection), bypassing the VMM on the data path entirely. Near-native network performance.


Device Drivers: Character vs. Block

Linux exposes devices as files under /dev. There are two major types:

Type | Data model | Random access | Examples
Character device | Byte stream | No | Serial ports, terminals, /dev/null, /dev/zero
Block device | Fixed-size blocks (512 B or multiples) | Yes | Hard drives, USB drives, SSDs

Common misconception: /dev/null and /dev/zero are character devices, not block devices, despite feeling like "storage."
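
You can check a device's type from user space with `stat`. The sketch below assumes Linux with glibc or musl, where `<sys/sysmacros.h>` provides `major()` and `minor()`; the helper name `is_char_device` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */

/*
 * Returns 1 if path is a character device, 0 if it is something else,
 * and -1 if stat() fails. For a character device, the major and minor
 * numbers are stored through maj and min when those are non-NULL.
 */
int is_char_device(const char *path, unsigned *maj, unsigned *min)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode))
        return 0;
    if (maj)
        *maj = major(st.st_rdev);
    if (min)
        *min = minor(st.st_rdev);
    return 1;
}
```

Calling `is_char_device("/dev/null", &maj, &min)` on a typical Linux system should return 1; the same information appears as a leading `c` in the output of `ls -l /dev/null`.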

Writing a kernel character device driver follows the same module skeleton seen in earlier lectures:

  1. Define a major/minor number and file_operations struct.
  2. Implement open, release, read, write.
  3. Register with register_chrdev.
  4. Create the device file: sudo mknod /dev/my_dev c <major> 0.

Key Takeaways

  1. A hypervisor virtualizes hardware so multiple OSes share one physical machine with strong isolation and near-native performance.
  2. x86 has 17 sensitive non-privileged instructions that break naive virtualization; the three fixes are binary translation, paravirtualization, and hardware support (VT-x/SVM).
  3. VT-x adds VMX non-root/root modes: guest code runs natively; sensitive operations cause a VM Exit handled by the VMM, controlled via VMCS.
  4. Memory virtualization evolved from software-maintained shadow page tables to hardware-walked nested/extended page tables (EPT/NPT).
  5. I/O virtualization ranges from slow-but-compatible device emulation, through paravirtualized front/back-end rings, to near-native SR-IOV hardware pass-through.
  6. Containers are OS-level VMs: they share the host kernel and are isolated by namespaces and cgroups, offering speed at the cost of weaker isolation and no cross-OS support.

Practice

  1. What is the primary role of a Virtual Machine Monitor (VMM)?
  2. Which of the following best distinguishes a Type-1 hypervisor from a Type-2 hypervisor?
  3. Why is x86 considered 'not strictly virtualizable' in the classic sense?
  4. In Intel VT-x, what triggers a VM Exit?
  5. A guest OS running under Xen with paravirtualization replaces the 'cli' (clear interrupt flag) instruction with a hypercall. What is the main advantage of this approach over pure binary translation?
  6. What is the key difference between shadow page tables and nested (extended) page tables for memory virtualization?
  7. Which of the following is a character device rather than a block device?
  8. Describe the front-end/back-end driver model used by Xen for I/O virtualization. What roles do Dom0 and DomU play, and why is this more efficient than full device emulation?
  9. SR-IOV allows a physical NIC to present multiple 'Virtual Functions' to VMs. Explain how this achieves near-native network performance compared to emulation or the front-end/back-end model.
  10. A classmate says: 'Containers are basically just lightweight virtual machines; they give you a different OS inside, just like KVM does.' Identify what is wrong with this claim and explain the key architectural difference.