Software Testing, Fuzzing & Symbolic Execution

The previous modules showed you how to exploit a stack buffer overflow once you have one. Now flip the question: how does a defender (or a security researcher) find the overflow in the first place? Programs ship with millions of lines of C/C++, and manual code review at that scale is impractical. This module covers the automated tools that are actually used to discover memory-safety bugs before attackers do — the same tools that have turned up thousands of CVEs in widely-deployed software.

Why Automated Bug Finding?

Finding a security bug takes two steps:

  1. Locate the bug — identify the dangerous code (e.g., an unchecked memcpy).
  2. Trigger the bug — find an input that reaches that code with the right values to cause a crash.

Manual code review handles step 1 on small programs, but becomes unreliable at scale. Step 2 is often even harder: consider if (magic == 0x31323334) { memcpy(dst, src, len); } — how do you find the exact input that makes magic equal 0x31323334? That is the central problem all three techniques below attack.

Software Testing Basics

Testing runs a program on a set of inputs and checks whether observed behavior matches a specification. Two key dimensions classify testing approaches:

Black-Box White-Box
Manual Hand-crafted inputs, no source needed Code review + manual test design
Automated Random/generational fuzzing Coverage-guided fuzzing, symbolic execution

Black-box testing treats the program as an opaque binary. It works even when source is unavailable (malware analysis, proprietary software) and handles any input format. Its weakness is that it misses deep paths guarded by specific checks.

White-box testing uses knowledge of the source or binary structure to create more targeted inputs. It can be far more thorough but requires access to — and analysis of — the code.

Code Coverage

How do you know your test suite is good? Code coverage measures the fraction of the program actually exercised by the tests. Common granularities:

Coverage is given as a percentage. 100% coverage is rarely achievable in practice (unreachable code, error handlers that require hardware faults) but is often required for safety-critical systems (avionics, medical devices). For security testing, branch coverage is the most relevant: an unexercised branch might be the one hiding the overflow.

Fuzzing

Fuzzing (fuzz testing) automates the generation of test inputs and feeds them to the target program, watching for crashes, hangs, or sanitizer alerts. The key insight: crashes almost always mean a bug; if you can make the program crash, you likely found something exploitable.

Black-Box vs. Coverage-Guided

Pure random / black-box fuzzing generates entirely random bytes. It is trivially fast and requires nothing about the target, but it rarely penetrates past the first few input-validation checks.

Coverage-guided (grey-box) fuzzing — the state of the art — instruments the binary at compile time and tracks which branches each input exercises. Inputs that hit new branches are kept and mutated further; inputs that cover already-seen paths are discarded. This feedback loop lets the fuzzer learn the program's structure without doing full symbolic analysis.

AFL: American Fuzzy Lop

AFL (by Michal Zalewski) is the most influential coverage-guided fuzzer. The workflow is:

# 1. Compile with AFL instrumentation
./afl-2.52b/afl-gcc prog.c -o prog.afl

# 2. Provide a seed input
echo AAAA > input/test

# 3. Run the fuzzer
./afl-2.52b/afl-fuzz -i input -o output ./prog.afl

AFL's dashboard shows live statistics: total paths discovered, unique crashes, execution speed (commonly hundreds or thousands of executions per second), and the mutation strategies currently being tried (bit flips, byte flips, arithmetic, known-integer substitution, havoc, splice).

How AFL finds "487": Suppose a program crashes only on input starting with '4', '8', '7'. AFL starts from a seed like "AAAA". After a few hundred mutations it happens to flip the first byte to '4' — branch coverage increases, so that input is saved and queued. From that input it eventually hits '8' in position 1, extending the coverage again. Finally it reaches '7' and triggers the crash. Each step is guided by the fact that new coverage = new interesting input.

AFL++ is the modern successor: faster execution, better mutation strategies (including custom mutators), and improved instrumentation (LTO mode, COMPCOV for comparison splitting). It is the version used in practice today.

libFuzzer

libFuzzer is LLVM/Clang's in-process fuzzer. Instead of forking a new process per input, it links a fuzzing harness directly into the target and calls a LLVMFuzzerTestOneInput(data, size) function in a tight loop. This eliminates fork overhead and makes it feasible to fuzz at tens of thousands of executions per second.

libFuzzer is almost always paired with AddressSanitizer (ASan):

clang -fsanitize=address,fuzzer target.c -o target_fuzz

ASan instruments every memory access at compile time and detects:

Without a sanitizer, many memory bugs cause silent corruption rather than an immediate crash, so the fuzzer would miss them. With ASan, even a one-byte out-of-bounds read is caught and reported with a full stack trace and the exact address. This combination — coverage-guided fuzzing + sanitizers — is how Google's OSS-Fuzz service has found thousands of bugs in open-source C/C++ projects.

Symbolic Execution

Fuzzing is fast but fundamentally probabilistic: it might never generate the exact value 0x31323334 needed to reach a deep bug. Symbolic execution attacks this problem head-on.

Instead of running the program with concrete values (e.g., x = 42), a symbolic executor assigns symbols to inputs (e.g., x = α) and propagates those symbols through the computation. At every branch, both outcomes are explored, and the constraint that would lead down each path is recorded.

Input: x = α (symbolic)

if (x > 10):          path constraint: α > 10
    if (x == 0x31323334):   path constraint: α > 10 AND α == 0x31323334
        memcpy(dst, src, len);   <-- BUG REACHABLE

An SMT solver (Satisfiability Modulo Theories — tools like Z3) is then asked: "Is this set of constraints satisfiable?" If yes, the solver hands back a concrete input that reaches the bug. This approach is complete in theory: given enough time it will find every reachable bug.

Concolic Execution

Pure symbolic execution becomes expensive quickly. Concolic execution (concrete + symbolic) is a practical hybrid:

  1. Run the program on a real (concrete) seed input, recording the path taken.
  2. Collect the path constraints for that run.
  3. Negate one constraint, ask the SMT solver for a new input that takes a different branch.
  4. Run again on the new concrete input, repeat.

The Godefroid et al. "Automated Whitebox Fuzz Testing" (NDSS 2008) paper describes this loop as the foundation of automated white-box fuzz testing.

Path Explosion

The core limitation of symbolic execution is path explosion: the number of paths through a program grows exponentially with the number of branches. Loops and recursion make this worse. Practical tools mitigate this by:

Fuzzing vs. Symbolic Execution

Property Coverage-Guided Fuzzing (AFL) Symbolic / Concolic Execution
Speed Very fast (millions of inputs/day) Slow (SMT solving is expensive)
Deep path penetration Poor for magic-value checks Excellent — solver derives the value
Handles complex programs Good at scale Struggles with large programs (path explosion)
Tool maturity Excellent (AFL, AFL++, libFuzzer) Moderate (KLEE, angr, S2E)
Best use Broad coverage, file parsers, network input Targeted analysis, protocol parsing, test-case generation

In practice, the strongest pipelines combine both: fuzzing explores the easy paths quickly, while symbolic execution is invoked selectively on branches the fuzzer cannot penetrate.

Key Takeaways

Practice

  1. Which of the following best describes a black-box testing approach?
  2. A test suite exercises 80 out of 100 reachable branches in a C program. What coverage metric does this describe, and what does the remaining 20% imply for security?
  3. AFL keeps an input in its queue only if it increases code coverage compared to all previously seen inputs. Why is this feedback loop central to AFL's effectiveness?
  4. libFuzzer is paired with AddressSanitizer (ASan) in almost every real-world fuzzing campaign. What specific problem does ASan solve that fuzzing alone cannot?
  5. Consider this code fragment: c if (magic == 0x31323334) { memcpy(dst, src, len); // potential overflow } Why does coverage-guided fuzzing (AFL) struggle to trigger the memcpy, while symbolic execution handles it easily?
  6. What is path explosion in symbolic execution, and which program construct most aggressively triggers it?
  7. Concolic execution is described as a 'concrete + symbolic' hybrid. Which sequence correctly describes one iteration of the concolic loop?
  8. A security engineer is fuzzing a network protocol parser written in C. After 24 hours of AFL, the fuzzer has not increased branch coverage in 6 hours and no new crashes have appeared. Suggest two concrete next steps the engineer could take, drawing on the techniques covered in this module.
  9. Explain why AFL's coverage-guided fuzzing is placed in the automated white-box quadrant of the software testing classification matrix, rather than automated black-box.