OS Virtualization

Why This Matters

Every major cloud provider — AWS, Google Cloud, Azure — runs your code inside a virtual machine. Virtualization is the foundational technology that makes cloud computing, secure sandboxing, and live migration possible. As a Linux kernel programmer, you need to understand not just what a hypervisor is, but how it tricks an unmodified OS into thinking it owns the hardware, and where the real engineering challenges lie.

What Is OS Virtualization?

OS virtualization is the use of software (and hardware extensions) to run multiple operating systems simultaneously on one physical machine. The central component is the Virtual Machine Monitor (VMM), also called the hypervisor. The hypervisor presents each guest OS with an illusion of dedicated hardware while actually multiplexing the real hardware underneath.

Two key properties drive the design:

Same code, near-native performance — guest code should run as fast as it would on bare metal.
Strong isolation — a bug or compromise in one VM must not affect others.

A Brief History

Era	State of virtualization
1960s–70s	Popular on mainframes to share expensive hardware; IBM VM/370
1980s–90s	Interest declined as cheap personal computers and multi-user OSes (Unix) reduced the need
2000s–present	Revival driven by server consolidation, cloud computing, and hardware support (Intel VT-x, AMD SVM)

Use Cases

Server Consolidation

Running X VMs on Y physical hosts (Y < X) allows you to keep physical CPUs busy rather than idle. This was the original commercial driver for modern virtualization (VMware, Xen).

Software Development

Spin up any OS on your laptop without rebooting.
VMs are self-contained snapshots of an environment, making reproducible builds and automated testing easy.

Migration and Checkpoint/Restart

A VM's entire state (CPU registers, memory, device state) is captured in software, making two operations straightforward that are hard for plain processes:

Live migration: move a running VM across hosts transparently (for maintenance, load balancing, or fault avoidance).
Checkpoint/restart: dump state to disk and resume later.

Hardware Emulation

Cross-ISA emulators (e.g., QEMU in full-emulation mode) let you run ARM binaries on an x86 host, or prototype hardware that doesn't exist yet. Architecture simulators like Gem5 interpret every guest instruction in software for detailed performance/power analysis (5× to 1000× slowdown).

Cloud Computing

Virtualization enables the three cloud service models:

IaaS — rent VMs (e.g., AWS EC2).
PaaS — deploy apps on managed infrastructure (e.g., Google App Engine).
SaaS — use hosted services (e.g., Gmail).

Security

Sandboxing: isolate untrusted code (malware analysis, honeypots, browser tabs).
VM introspection: analyze guest behavior from the higher-privileged VMM without the guest's cooperation.

Categories of Virtual Machines

System-Level VMs (Hypervisor-Based)

Run an unmodified OS image on top of virtualized hardware. The hypervisor exposes a virtual CPU, virtual memory, and virtual devices.

Machine simulators (e.g., QEMU in full-emulation mode, Gem5) emulate a different architecture entirely. Every guest instruction is interpreted or translated in software rather than run directly on the host CPU, so they are slow but portable.

Hypervisor-based VMs virtualize the same architecture as the host, using hardware extensions to execute guest code directly on the CPU. Examples: Xen-HVM, KVM, VMware ESXi, Hyper-V, VirtualBox.

Type 1 vs. Type 2 Hypervisors

	Type 1 (bare-metal)	Type 2 (hosted)
Runs on	Raw hardware	Host OS
Examples	Xen, VMware ESX, Hyper-V	KVM/QEMU, VMware Workstation
Performance	Higher (no host OS overhead)	Slightly lower
Deployment	Servers/datacenters	Developer laptops

Note: KVM is a kernel module that turns Linux itself into a Type-1-like hypervisor, but it is conventionally listed as Type 2 because it relies on a host Linux kernel.

OS-Level Lightweight VMs (Containers)

Instead of virtualizing hardware, containers use OS mechanisms — chroot, cgroups, namespaces — to isolate processes. They share the host kernel, so:

Pros: fast startup, low overhead, small image.
Cons: cannot run a different OS type (e.g., Windows on a Linux host); weaker isolation boundary.

Docker and LXC are well-known container runtimes. You can write a basic container in ~100 lines of Go using the primitives directly.
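To make the namespace primitive concrete, here is a minimal sketch in C that isolates only the hostname (UTS namespace). The function name demo_uts_isolation and the hostname "sandbox" are illustrative choices, not part of any container runtime. It uses the raw unshare system call so it does not depend on feature-test macros; the glibc unshare() wrapper is the more common spelling. On hosts that forbid unprivileged user namespaces it reports failure instead of isolating anything.

```c
#include <linux/sched.h>   /* CLONE_NEWUSER, CLONE_NEWUTS */
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child changed its hostname inside new namespaces while the
 * parent's hostname stayed the same, 0 if the change leaked to the parent,
 * and -1 if the namespaces could not be created (e.g. blocked by policy). */
int demo_uts_isolation(void)
{
    char before[256] = {0}, after[256] = {0};

    if (gethostname(before, sizeof before - 1) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: a new user namespace grants the capabilities needed to
         * own a new UTS (hostname) namespace without being root. */
        if (syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWUTS) != 0)
            _exit(2);
        if (sethostname("sandbox", 7) != 0)
            _exit(3);
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    if (gethostname(after, sizeof after - 1) != 0)
        return -1;
    return strcmp(before, after) == 0 ? 1 : 0;
}
```

A full container would add further namespaces (PID, mount, network), a cgroup for resource limits, and a private root filesystem; this sketch shows only that the kernel, not a hypervisor, provides the isolation.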

The x86 Virtualization Challenge

To virtualize a CPU cleanly, every instruction that touches privileged state must trap to the hypervisor when executed at unprivileged level, so the VMM can emulate it. This property is called strict virtualizability.

x86 is not strictly virtualizable. Robin & Irvine (2000) identified 17 sensitive, non-privileged instructions — instructions that read or modify protected state but do not cause a trap when executed in user mode. Two examples:

popf: restores the flags register from the stack; in ring 3 it silently leaves the interrupt flag (IF) unchanged instead of trapping, so a deprivileged guest kernel that tries to re-enable interrupts this way fails without the VMM noticing.
sgdt/sidt: store the GDT and IDT register contents to memory; they execute in ring 3 without trapping, so a guest can read the host's real descriptor-table addresses.

Because these instructions can leak or corrupt VM state without the VMM's knowledge, naive "run and hope for a trap" doesn't work.

CPU Virtualization: Three Solutions

Solution 1 — Trap & Emulate via Dynamic Binary Translation (DBT)

Before executing a block of guest code, scan it for sensitive instructions and replace them with calls into the VMM (stored in a code cache). The translated code runs natively; only the sensitive spots are intercepted. This approach requires no guest OS changes and no special hardware, but the scanning overhead can be significant.
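As a rough illustration of the scan-and-patch step, the sketch below uses a made-up toy instruction set (one opcode byte plus one operand byte) rather than real x86 encodings. The opcode names, translate_block, and run_translated are all hypothetical; a real translator also has to deal with variable-length instructions, branch targets, and code-cache invalidation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA: OP_CLI and OP_STI stand in for sensitive instructions. */
enum { OP_ADD, OP_CLI, OP_STI, OP_RET, OP_VMM_CALL };

struct insn { uint8_t op, arg; };

struct vcpu {
    int acc;         /* guest accumulator */
    int virt_if;     /* virtual interrupt flag kept by the VMM */
    int vmm_calls;   /* how many times the VMM was entered */
};

static int is_sensitive(uint8_t op)
{
    return op == OP_CLI || op == OP_STI;
}

/* Translate one basic block into the code cache: innocuous instructions are
 * copied unchanged, sensitive ones become calls into the VMM.
 * Returns the number of instructions written. */
size_t translate_block(const struct insn *guest, size_t n, struct insn *cache)
{
    for (size_t i = 0; i < n; i++) {
        if (is_sensitive(guest[i].op))
            cache[i] = (struct insn){ OP_VMM_CALL, guest[i].op };
        else
            cache[i] = guest[i];
    }
    return n;
}

/* VMM side: emulate the original sensitive instruction on virtual state. */
static void vmm_emulate(struct vcpu *v, uint8_t original_op)
{
    v->vmm_calls++;
    if (original_op == OP_CLI)
        v->virt_if = 0;
    else if (original_op == OP_STI)
        v->virt_if = 1;
}

/* Stand-in for native execution of the translated block. */
void run_translated(struct vcpu *v, const struct insn *cache, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        switch (cache[i].op) {
        case OP_ADD:      v->acc += cache[i].arg;        break;
        case OP_VMM_CALL: vmm_emulate(v, cache[i].arg);  break;
        default:          return;  /* OP_RET or anything unexpected ends the block */
        }
    }
}
```

In this model the guest's OP_CLI never executes directly: the code cache holds an OP_VMM_CALL in its place, and only the VMM's virtual interrupt flag changes.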

Solution 2 — Paravirtualization

Modify the guest OS so it is aware it is running inside a VM. Replace sensitive instructions with hypercalls — explicit requests to the VMM:

/* Guest kernel, before paravirtualization */
cli   /* clear interrupt flag — sensitive! */

/* After paravirtualization (Xen style) */
hypercall(HYPERCALL_DISABLE_INTERRUPTS);

Pro: near-native performance, no hardware support required.
Con: requires a specially patched guest; you cannot run an unmodified Windows.
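On the VMM side, a hypercall interface is usually a numbered dispatch table, much like a system-call table. The sketch below is a simplified model, not Xen's actual interface: the names vcpu_state, hypercall_table, and HYPERCALL_ENABLE_INTERRUPTS are assumptions, and the hypercall function takes an explicit vCPU pointer where a real guest would execute a trapping instruction and the VMM would find the current vCPU itself.

```c
#include <stddef.h>

enum hypercall_nr {
    HYPERCALL_DISABLE_INTERRUPTS,
    HYPERCALL_ENABLE_INTERRUPTS,
    HYPERCALL_COUNT
};

/* Per-vCPU state owned by the VMM; the guest never touches the real IF. */
struct vcpu_state { int virtual_if; };

typedef long (*hypercall_fn)(struct vcpu_state *);

static long hc_disable_irq(struct vcpu_state *v) { v->virtual_if = 0; return 0; }
static long hc_enable_irq(struct vcpu_state *v)  { v->virtual_if = 1; return 0; }

static const hypercall_fn hypercall_table[HYPERCALL_COUNT] = {
    [HYPERCALL_DISABLE_INTERRUPTS] = hc_disable_irq,
    [HYPERCALL_ENABLE_INTERRUPTS]  = hc_enable_irq,
};

/* VMM entry point. The number comes from the guest, so it is untrusted
 * and must be range-checked before it is used as an index. */
long hypercall(struct vcpu_state *v, unsigned long nr)
{
    if (nr >= HYPERCALL_COUNT || hypercall_table[nr] == NULL)
        return -1;
    return hypercall_table[nr](v);
}
```

The range check matters for isolation: the hypercall number is guest-controlled input crossing a privilege boundary.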

Solution 3 — Hardware-Assisted Virtualization (Intel VT-x / AMD SVM)

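Intel VT-x adds Virtual Machine Extensions (VMX) to x86. The guest kernel runs in ring 0 of a new non-root mode, where each sensitive instruction either operates on guest-private state or traps to the VMM, depending on how the VMM has configured the hardware. The hardware handles the trapping, removing the need for binary translation or guest modifications.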

VT-x Operating Modes

┌─────────────────────────────────────────────────────────┐
│                  VMX Non-Root (Guest)                   │
│   Ring 3: User Apps       Ring 0: Guest OS kernel       │
│                          ↑ VM Exit  ↓ VM Entry          │
├─────────────────────────────────────────────────────────┤
│                    VMX Root (VMM)                       │
│             Ring 0: Virtual Machine Monitor             │
└─────────────────────────────────────────────────────────┘
          Executes on the physical CPU hardware

VM Entry: transition from VMM (root) into the guest (non-root).
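VM Exit: transition from the guest back to the VMM, triggered when the guest performs a sensitive operation or hits an event that the VMM has configured to trap.

The Virtual Machine Control Structure (VMCS) is a per-vCPU data structure that holds: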

Which events trigger a VM Exit (execution controls).
The guest's saved CPU state (registers, segment descriptors, etc.).
The host's state to restore on exit.

Minimizing unnecessary VM Exits is a key performance optimization.
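The role of the execution controls can be modeled in a few lines. This is a toy sketch only: the real VMCS has a hardware-defined layout accessed through VMREAD/VMWRITE, and the struct toy_vmcs fields, event names, and bit positions below are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical guest events; bit n of exit_controls covers event n. */
enum guest_event { EV_HLT, EV_IO, EV_RDTSC };

struct toy_vmcs {
    uint32_t      exit_controls;  /* which events trigger a VM Exit */
    uint64_t      guest_rip;      /* guest state saved on exit */
    uint64_t      host_rip;       /* host state to resume at on exit */
    unsigned long exits;          /* bookkeeping for this sketch only */
};

/* Returns 1 if the event forces a VM Exit (guest state is saved and the
 * VMM would resume at host_rip), 0 if the guest keeps running natively. */
int guest_event(struct toy_vmcs *vmcs, enum guest_event ev, uint64_t rip)
{
    if (!(vmcs->exit_controls & (1u << ev)))
        return 0;
    vmcs->guest_rip = rip;
    vmcs->exits++;
    return 1;
}
```

Clearing a bit in exit_controls is the toy equivalent of letting the guest handle an event itself, which is how a VMM trades interception for fewer exits.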

KVM in Practice

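KVM (Kernel-based Virtual Machine) is a Linux kernel module that exposes hardware virtualization through /dev/kvm. A minimal setup in C, with error checks omitted:

int kvm_fd  = open("/dev/kvm", O_RDWR);
int vm_fd   = ioctl(kvm_fd, KVM_CREATE_VM, 0);
// mmap 4 KB of guest memory, copy the guest binary into it, describe it in `region`
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
// mmap the vCPU's struct kvm_run (size from KVM_GET_VCPU_MMAP_SIZE) to read exit reasons
// Set CS base=0 (KVM_SET_SREGS), RIP=0 and RFLAGS=0x2 (KVM_SET_REGS), then:
ioctl(vcpu_fd, KVM_RUN, 0);  // returns when an exit needs handling in userspace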

When the guest executes hlt, KVM reports KVM_EXIT_HLT and the host process decides what to do.

Memory Virtualization

A guest OS manages its own page tables mapping Guest Virtual Addresses (GVA) → Guest Physical Addresses (GPA). But GPA is not real physical memory — the hypervisor owns the real Host Physical Addresses (HPA). Two approaches bridge this gap:

Approach	How it works	Notes
Shadow Page Tables	VMM maintains a shadow PT that maps GVA → HPA directly; write-protects guest PT pages to intercept updates	Older approach; high overhead on PT-heavy workloads
Nested (Extended) Page Tables	Hardware walks two levels of page tables automatically (Intel EPT / AMD NPT): GVA→GPA handled by guest PT, GPA→HPA handled by the VMM's nested PT	Modern standard; reduces VM exits for PT updates
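The two approaches compute the same GVA → HPA mapping and differ in who does the composing. The sketch below models this with single-level toy tables of eight 4 KB pages; struct toy_pt, walk, nested_translate, and build_shadow are illustrative names, and real page tables are multi-level with permission bits that this model ignores.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
#define NPAGES     8
#define UNMAPPED   UINT32_MAX

/* Toy table indexed by page number; each entry holds a frame number. */
struct toy_pt { uint32_t frame[NPAGES]; };

/* One translation stage. Returns UNMAPPED to model a page fault. */
static uint32_t walk(const struct toy_pt *pt, uint32_t addr)
{
    uint32_t page = addr >> PAGE_SHIFT;
    if (page >= NPAGES || pt->frame[page] == UNMAPPED)
        return UNMAPPED;
    return (pt->frame[page] << PAGE_SHIFT) | (addr & PAGE_MASK);
}

/* Nested paging: the guest table gives GVA -> GPA, then the VMM's nested
 * table gives GPA -> HPA. Hardware performs both stages on a TLB miss. */
uint32_t nested_translate(const struct toy_pt *guest_pt,
                          const struct toy_pt *nested_pt, uint32_t gva)
{
    uint32_t gpa = walk(guest_pt, gva);
    return gpa == UNMAPPED ? UNMAPPED : walk(nested_pt, gpa);
}

/* Shadow paging: the VMM precomputes the composed GVA -> HPA table in
 * software and must redo this whenever the guest edits its own table. */
void build_shadow(const struct toy_pt *guest_pt,
                  const struct toy_pt *nested_pt, struct toy_pt *shadow)
{
    for (uint32_t page = 0; page < NPAGES; page++) {
        uint32_t hpa = nested_translate(guest_pt, nested_pt,
                                        page << PAGE_SHIFT);
        shadow->frame[page] = hpa == UNMAPPED ? UNMAPPED : hpa >> PAGE_SHIFT;
    }
}
```

The need to call something like build_shadow after every guest page-table write is the overhead the table attributes to shadow paging; with nested paging the guest edits its table freely and no VMM work is needed.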

I/O Virtualization

Emulation

The VMM presents a virtual device to the guest (e.g., an e1000 NIC). The guest uses its normal driver. Every I/O access causes a VM Exit; the VMM intercepts it and calls an emulated-device module (often QEMU) that performs the real I/O. Simple and compatible, but high overhead (many exits per packet).
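A minimal sketch of the emulation path, assuming a made-up one-register serial device: struct toy_uart and emulate_port_write are hypothetical names, and 0x3f8 is used only because it is the conventional COM1 data port. The point is the cost model, where each guest port write is one exit and one call into the device model.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_UART_PORT 0x3f8

struct toy_uart {
    char          tx[64];   /* bytes the guest has "transmitted" */
    size_t        tx_len;
    unsigned long exits;    /* one per intercepted port write */
};

/* Called by the VMM for each port-write VM Exit. Returns 0 if the emulated
 * device handled the access, -1 if no device claims the port. */
int emulate_port_write(struct toy_uart *dev, uint16_t port, uint8_t value)
{
    dev->exits++;
    if (port != TOY_UART_PORT)
        return -1;
    if (dev->tx_len < sizeof dev->tx)
        dev->tx[dev->tx_len++] = (char)value;  /* real code would forward this to a host device */
    return 0;
}
```

Writing an N-byte message through this device costs N exits, which is why emulated devices are slow for high-rate I/O.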

Front-End / Back-End Driver Model (Xen)

Used in paravirtualized guests:

Front-end driver: runs in the guest (DomU), sends requests through a shared memory ring buffer.
Back-end driver: runs in Dom0 (the privileged management domain), receives requests and performs actual I/O with the native driver.

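Requests are batched in the shared ring, and the two sides signal each other through lightweight notifications (Xen event channels) instead of trapping on every device-register access, so overhead is much lower than with full emulation.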
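The shared ring can be sketched as a single-producer, single-consumer queue with free-running indices. This is a simplified model rather than Xen's actual ring layout: struct shared_ring, frontend_push, and backend_drain are illustrative names, responses are omitted, and the memory barriers a real cross-domain ring needs appear only as comments.

```c
#include <stdint.h>

#define RING_SIZE 8u   /* must be a power of two */

struct request { uint32_t id; uint32_t sector; };

/* Lives in a page shared between the guest and the driver domain. */
struct shared_ring {
    uint32_t       req_prod;          /* advanced only by the front-end */
    uint32_t       req_cons;          /* advanced only by the back-end */
    struct request ring[RING_SIZE];
};

/* Front-end: queue one request. Returns 0 on success, -1 if the ring is full. */
int frontend_push(struct shared_ring *r, struct request req)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return -1;
    r->ring[r->req_prod & (RING_SIZE - 1)] = req;
    /* A real ring needs a write barrier here so the back-end never sees
     * the new index before the request contents. */
    r->req_prod++;
    return 0;
}

/* Back-end: drain everything queued so far in one pass. Returns the count. */
unsigned backend_drain(struct shared_ring *r, struct request *out, unsigned max)
{
    unsigned n = 0;
    while (r->req_cons != r->req_prod && n < max)
        out[n++] = r->ring[r->req_cons++ & (RING_SIZE - 1)];
    return n;
}
```

Because the back-end drains many requests per wake-up, one notification can cover a whole batch, which is where the savings over per-access emulation come from.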

SR-IOV (Single-Root I/O Virtualization)

Hardware-level solution. An SR-IOV-capable NIC exposes:

One Physical Function (PF) — controlled by the hypervisor/Dom0.
Multiple Virtual Functions (VFs) — lightweight PCIe functions, each assignable to a different VM.

The VM accesses the VF directly (with IOMMU protection), bypassing the VMM on the data path entirely. Near-native network performance.

Device Drivers: Character vs. Block

Linux exposes devices as files under /dev. There are two major types:

Type	Data model	Random access	Examples
Character device	Byte stream	No	Serial ports, terminals, `/dev/null`, `/dev/zero`
Block device	Fixed-size blocks (512 B or multiples)	Yes	Hard drives, USB drives, SSDs

Common misconception: /dev/null and /dev/zero are character devices, not block devices, despite feeling like "storage."
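The device type is recorded in the file's mode bits, so it can be checked from userspace with stat(2). The helper below is a small sketch; device_kind is a name made up for this example.

```c
#include <sys/stat.h>

/* Returns 'c' for a character device, 'b' for a block device,
 * '-' for any other file type, and '?' if the path cannot be examined. */
char device_kind(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return '?';
    if (S_ISCHR(st.st_mode))
        return 'c';
    if (S_ISBLK(st.st_mode))
        return 'b';
    return '-';
}
```

On a typical Linux system, device_kind("/dev/null") reports a character device, matching the leading c that ls -l prints for that file.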

Writing a kernel character device driver follows the same module skeleton seen in earlier lectures:

Define a major/minor number and file_operations struct.
Implement open, release, read, write.
Register with register_chrdev.
Create the device file: sudo mknod /dev/my_dev c <major> 0.

Key Takeaways

A hypervisor virtualizes hardware so multiple OSes share one physical machine with strong isolation and near-native performance.
x86 has 17 sensitive non-privileged instructions that break naive virtualization; the three fixes are binary translation, paravirtualization, and hardware support (VT-x/SVM).
VT-x adds VMX non-root/root modes: guest code runs natively; sensitive operations cause a VM Exit handled by the VMM, controlled via VMCS.
Memory virtualization evolved from software-maintained shadow page tables to hardware-walked nested/extended page tables (EPT/NPT).
I/O virtualization ranges from slow-but-compatible device emulation, through paravirtualized front/back-end rings, to near-native SR-IOV hardware pass-through.
Containers are OS-level VMs — they share the host kernel via namespaces and cgroups, offering speed at the cost of weaker isolation and no cross-OS support.
