70024: Added cmpiler sanitizer notes

OliverKillane · Feb 22, 2024 · 61d22c3 · 61d22c3
1 parent f418ed9
commit 61d22c3
Show file tree

Hide file tree

Showing 11 changed files with 209 additions and 35 deletions.
diff --git a/70024 - Software Reliability/Compiler Sanitizers.md b/70024 - Software Reliability/Compiler Sanitizers.md
@@ -1,6 +1,64 @@
 ## Definition
-[[TODO]]
+Instrument code at compile time (static instrumentation) to analyse at runtime.
+- Some cannot be combined (e.g. MSan and ASan)
+- Much faster than dynamic instrumentation of binaries (e.g. [[Valgrind]])
+## Clang/GCC Sanitizers
+### Address Sanitizer
+```bash
+-fsanitize=address
+```
+- Detects out of bounds accesses, use after free, and memory leaks. 
+- Typically $\approx2\times$ slowdown.
+- Adds red-zones around all allocations (marked as poisoned), and after frees. If poisoned addresses are accessed, then throws an error, backtrace and printout of accessed poisoned buffer.
+
+#### Poisoning
+Poisoning is done with their own version of malloc on the heap, and by adding to stack frames (extra pushes/pops added at compile time). BSS has redzones added at compile time also.
+
+When memory is freed it is put in *quarantine* (a limited size FIFO), when accessed.
+- Limited size $\to$ large distance means access not detected.
+#### Shadow Memory
+(Also used in [[Valgrind]]), each byte is associated with some memory containing metadata.
+- On each memory access, check its shadow exists (create if not)
+Each 8 bytes (assumes 8 byte alignment) encoded with 9 states held in a single byte:
+
+| State | Meaning |
+| ---- | ---- |
+| $=0$ | All bytes addressable |
+| $<0$ | No bytes Addressable |
+| $1 \leq k \leq 7$ | $k$ bytes addressable |
+```rust
+unsafe fn to_shadow(addr: *u8) -> *u8 {
+	SHADOWN_REGION_OFFSET + (addr / 8)
+}
+```
+#### Results
+No false positives, some false negatives (overflow bypasses redzones, large time distance between free and use after free). 
+### Memory Sanitizer
+```bash
+-fsanitize=memory
+```
+Detects uninitialized memory accesses. Typically $\approx3\times$ slowdown.
+#### Shadow Memory
+Each byte is mapped to a single *validity bit*, declaring if it is initialised.
+#### Results
+Not all reads uninitialized are bad.
+```c
+int a, b;
+a = b; // value ignored, so no need to report 
+a = 1;
+```
+
+Hence it is restricted to:
+- reads in branch conditions
+- reads for determining memory accesses
+- reads passed to system calls
+It's reports have:
+- false positives when interacting with uninstrumented code 
+- allows for annotations to declare safe objects in source, to avoid reporting
 
-## Address Sanitizer
 ## Undefined Behaviour Sanitizer
+Detects a set of undefined behaviours.
+[UndefinedBehaviorSanitizer — Clang 19.0.0git documentation (llvm.org)](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html)
+
 ## Thread Sanitizer
+Detects data races and deadlocks. Typically $\approx5-15\times$ slowdown.
diff --git a/70024 - Software Reliability/Concolic Execution.md b/70024 - Software Reliability/Concolic Execution.md
@@ -1,5 +1,5 @@
 ## Definition
-A type of [[Symbolic Execution 1]],  that obtains a set of test cases that hit all possible branches in a program.
+A type of [[Symbolic Execution]],  that obtains a set of test cases that hit all possible branches in a program.
 1. Start with concrete input.
 2. run it, and collect path conditions.
 3. choose a given path condition, and negate it.

diff --git a/70024 - Software Reliability/Constraint Solving.md b/70024 - Software Reliability/Constraint Solving.md
@@ -1 +1,72 @@
-## Aims
+## Summary
+
+| Aim | Why |
+| ---- | ---- |
+| Bit Level Modelling | System level languages allow this granularity of access, and define behaviours at this level. |
+| Performance | Real programs generate many complex constraints, need to analyse these programs quickly enough to be sueful. |
+[[SAT]] solvers could be used to implement this, however most applications generate constraints at a higher level than just boolean formulae (e.g. integers, integer comparison). 
+
+Hence we use [[Satisfiability Modulo Theories]] and [[BitVectors]].
+- Each block of memory is represented by an array of 8-bit [[BitVectors]] to allow constraints with bit-level accuracy.
+- Operations on [[BitVectors]] include concatenation, extraction, bitwise operations and arithmetic.
+- Eagre translation is used to simplify formulas at the theory level (e.g. $x - x \equiv 0$)
+- Arithmetic can be encoded as a formula representing a circuit (e.g. a ripple-carry adder)
+## Arrays
+
+```rust
+/// Read a value
+fn read(a: &Array, i: Index) -> Value;
+
+/// return a new array that is a copy of a, with v at index i
+fn write(a: &Array, i: Index, v: Value) -> Array;
+```
+$$\begin{matrix*}[l r l l] 
+\forall a, i, j. & i = j & \Rightarrow read(a,i) = read(a,j) & \text{Array Congruence} \\
+\forall a, i, j, v. & i = j & \Rightarrow read(write(a, i, v), j) = v & \text{Read-Over-Write 1}\\
+\forall a, i, j, v. & i \neq j & \Rightarrow read(write(a, i, v), j) = read(a, j) & \text{Read-Over-Write 2} \\
+\end{matrix*}$$
+### Eliminating Arrays
+#### Eliminate Writes
+$$\begin{matrix*} 
+read(write(a, v, i), j) \Leftrightarrow ite(i = j, v, read(a, j))
+\end{matrix*}$$
+The write can be eliminated, instead just check if the $read$ reads from the index of the write, in which case just return the new value.
+#### Eliminate Reads
+Can replace each syntactically-unique $read$ with a fresh variable. For each array access with the same indices, we can imply the same variable is used (pairwise compare all indexes).
+- Expensive as it expands formulas by $\cfrac{n(n-1)}{2}$ terms (where $n$ are the syntactically different indexes) to compare all indexes used.
+
+For example:
+$$\begin{split} 
+& (a[i_1] = e_1) \land (a[i_2] = e_2) \land (a[i_3] = e_3) \land (i_1 + i_2 + i_3 = 6) \\
+\to \ & (v_1 = e_1) \land (v_2 = e_2) \land (v_3 = e_3) \land (i_1 + i_2 + i_3 = 6) \\
+\to \ & (i_1 = i_2 \Rightarrow v_1 = v_2) \land (i_1 = i_3 \Rightarrow v_1 = v_3) \land (i_2 = i_3 \Rightarrow v_2 = v_3)\\
+\end{split}$$
+### Array-Based Refinement
+When we cannot eliminate the array, we can 
+[[TODO]]
+
+## Performance
+[[Satisfiability Modulo Theories]] solving is NP-Complete and hence expensive, it is also invoked at every branch.
+- [[70024 - Software Reliability/Constraint Solving#Eliminating Arrays|Elimination]] of constraints.
+- Caching constraints and the values satisfying them.
+- Constraint simplification
+- Summarizing loops
+### Use Different SMT/SAT Solvers
+Different solvers perform differently on different formulas, so we can invoke multiple and see which is the fastest.
+### Eliminating Irrelevant Constraints
+Many conditions only rely on a small number of variables. We only need to include relevant constraints to a branch condition.
+### Caching Solutions
+Lots of similar constraint sets for branches (e.g. in loops, the same constraints used many times).
+### Array Transformation
+#### Index-Based
+```rust
+b[k] > 0
+```
+We can add in constraints on the indices of `b` that are in some value range.
+#### Value-Based
+```rust
+b[k] > x
+```
+we can turn this into:
+$$ite(0 \leq k \leq \dots, ite( k = \dots,  ite(\dots, ))) > x$$
+Effectively converting the array access into a set of conditionals in an expression.
diff --git a/70024 - Software Reliability/Memory Safety Violations Based Attacks.md b/70024 - Software Reliability/Memory Safety Violations Based Attacks.md
@@ -0,0 +1,30 @@
+## Attacks
+### Code Corruption
+An attempt to modify code in memory.
+- For elf binaries pages containing code are marked as readonly, and the heap and stack marked as non-executable. But this does not prevent modification of interpreted bytecodes stored on the heap.
+- Systems using code generation (e.g. JIT) need pages that are readable, writable and executable, so cannot use the same protections.
+### Control Flow Hijack
+Corrupt pointers used for control flow (e.g. return address, function pointers, vtables) to direct control flow.
+### Non-Control-Data Attacks
+Corrupt security related data (not code).
+```c
+void authenticate(int input) {
+	char auth = 0; 
+	char buf[128] = {0};
+	// depending on layout of local vars on stack, an overflow to 
+	// buf can overwrite auth
+	buf[input] = 1;
+	if (auth) {
+		set_privileged_mode();
+	}
+}
+```
+This also includes dangling pointers left after freeing, that can be used to access data allocated in the same location later, and potentially to corrupt.
+### Leak Confidential Memory
+Out of bounds reads on buffers. 
+[Heartbleed - Wikipedia](https://en.wikipedia.org/wiki/Heartbleed)
+## Defences
+### Stack Canaries
+Compiler adds canaries (a magic value) into stack frames, and verifies integrity at start and end of call to ensure no corruption has occurred. If the canary is valid, it is likely no stack corruption occurred.
+
+Common values include `\0` / null (e.g. to prevent null terminated string copy/access from overflowing).
diff --git a/70024 - Software Reliability/Path Explosion.md b/70024 - Software Reliability/Path Explosion.md
@@ -1,6 +1,6 @@
 ## Definition
 Real world programs contain many conditions resulting in a huge number of (often exponential in terms of conditions) paths.
-- When using [[Symbolic Execution 1]]to find bugs, we should prioritise the most important paths
+- When using [[Symbolic Execution]]to find bugs, we should prioritise the most important paths
 ## Solutions
 ### Search Heuristics
 

diff --git a/70024 - Software Reliability/Satisfiability Modulo Theories.md b/70024 - Software Reliability/Satisfiability Modulo Theories.md
@@ -0,0 +1,5 @@
+## Definition
+A generalisation of [[SAT]] theory to more complex formulas including:
+- real numbers
+- integers
+- data structures (arrays, lists)
diff --git a/70024 - Software Reliability/Software Reliability Techniques.md b/70024 - Software Reliability/Software Reliability Techniques.md
@@ -7,4 +7,4 @@
 ## Automated
 ### [[Fuzzing]]
 ### [[Static Analysis]]
-### [[Symbolic Execution 1]]
+### [[Symbolic Execution]]
diff --git a/70024 - Software Reliability/Static Analysis.md b/70024 - Software Reliability/Static Analysis.md
@@ -14,6 +14,7 @@ Reasoning about behaviour of programs without executing them.
 | No test Cases | Analysis often does not provide easily applicable test cases for developers. |
 
 Imprecise (but faster) analyses can result in false positives.
+## [[Data Flow Analysis]]
 ## Examples
 ### Coverity ([[A Few Billion Lines of Code Later]])
 ### Theory of [[Program Analysis]]
diff --git a/70024 - Software Reliability/Symbolic Execution 1.md b/70024 - Software Reliability/Symbolic Execution 1.md
diff --git a/70024 - Software Reliability/Symbolic Execution.md b/70024 - Software Reliability/Symbolic Execution.md
@@ -1 +1,36 @@
-[[TODO]]
+## Definition
+By running a program on symbolic input (representing any value, can be logically constrained), every path through a program is analysed by constraining symbolic values to symbols representing the sets of possible values on the paths.
+- Program instructions act on symbolic values.
+- At conditionals, fork the symbolic values and constrain by the branch condition.
+- On termination generate a test case by solving the constraints on the path.
+### Advantages
+| Advantage | Description |
+| ---- | ---- |
+| Automatic | Requires no input test cases, initial seeds, configuration (like [[Fuzzing]]/[[Swarm Testing]]). |
+| Systematic | Can reach all possible paths in the program, and reason about all possible values on these paths. |
+| Deep Bugs | Cam reason about all values, hence including extremely rare edge case values & memory layouts. |
+| [[Functional Bugs]] | Can reason about logical statements given appropriate oracles. e.g. checking for crashes, given assert statements cause crashes. |
+| Test Cases | Generates test cases for developers to allow them to reproduce and debug. |
+### Disadvantages
+| Disadvantage | Description |
+| ---- | ---- |
+| Complex Toolchain | Requires the source to be available, and compiled to some form the symbolic execution tool can interpret. |
+| Constrain Solving | Constraint solving is expensive. |
+## Mixed Concrete & Symbolic Execution
+Many values are concrete (e.g. constants in the code, or concrete inputs set by the user).
+- Only operations that include symbolic values need to be symbolically executed, the rest can be executed as normal.
+- Allows interaction with outside environment (e.g. operating system, un-instrumented libraries, etc.)
+- 
+## Challenges
+### [[Path Explosion]]
+### [[70024 - Software Reliability/Constraint Solving|Constraint Solving]]
+### [[Environment Modelling]]
+
+## Applications
+- Automatically generating high coverage test suites.
+- High coverage regression testing (test cases for all paths, generate test suite, check new versions)
+- [[Differential Testing]] between clear and highly correct implementations, and optimised implementations. (can check behaviour is identical)
+## Examples
+KLEE, CREST, SPF, FuzzBall
+### [[KLEE]]
+### [[EXE]]
diff --git a/70024 - Software Reliability/Valgrind.md b/70024 - Software Reliability/Valgrind.md
@@ -0,0 +1,2 @@
+## Definition
+Dynamic instrumentation of binaries to detect memory related [[Generic Bugs]].
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,2 @@
		## Definition
		Dynamic instrumentation of binaries to detect memory related [[Generic Bugs]].