Skip to content

Commit

Permalink
70024: Added cmpiler sanitizer notes
Browse files Browse the repository at this point in the history
  • Loading branch information
OliverKillane committed Feb 22, 2024
1 parent f418ed9 commit 61d22c3
Show file tree
Hide file tree
Showing 11 changed files with 209 additions and 35 deletions.
62 changes: 60 additions & 2 deletions 70024 - Software Reliability/Compiler Sanitizers.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,64 @@
## Definition
[[TODO]]
Instrument code at compile time (static instrumentation) to analyse at runtime.
- Some cannot be combined (e.g. MSan and ASan)
- Much faster than dynamic instrumentation of binaries (e.g. [[Valgrind]])
## Clang/GCC Sanitizers
### Address Sanitizer
```bash
-fsanitize=address
```
- Detects out of bounds accesses, use after free, and memory leaks.
- Typically $\approx2\times$ slowdown.
- Adds red-zones around all allocations (marked as poisoned), and after frees. If poisoned addresses are accessed, then throws an error, backtrace and printout of accessed poisoned buffer.

#### Poisoning
Poisoning is done with their own version of malloc on the heap, and by adding to stack frames (extra pushes/pops added at compile time). BSS has redzones added at compile time also.

When memory is freed it is put in *quarantine* (a limited size FIFO), when accessed.
- Limited size $\to$ large distance means access not detected.
#### Shadow Memory
(Also used in [[Valgrind]]), each byte is associated with some memory containing metadata.
- On each memory access, check its shadow exists (create if not)
Each 8 bytes (assumes 8 byte alignment) encoded with 9 states held in a single byte:

| State | Meaning |
| ---- | ---- |
| $=0$ | All bytes addressable |
| $<0$ | No bytes Addressable |
| $1 \leq k \leq 7$ | $k$ bytes addressable |
```rust
unsafe fn to_shadow(addr: *u8) -> *u8 {
SHADOWN_REGION_OFFSET + (addr / 8)
}
```
#### Results
No false positives, some false negatives (overflow bypasses redzones, large time distance between free and use after free).
### Memory Sanitizer
```bash
-fsanitize=memory
```
Detects uninitialized memory accesses. Typically $\approx3\times$ slowdown.
#### Shadow Memory
Each byte is mapped to a single *validity bit*, declaring if it is initialised.
#### Results
Not all reads uninitialized are bad.
```c
int a, b;
a = b; // value ignored, so no need to report
a = 1;
```

Hence it is restricted to:
- reads in branch conditions
- reads for determining memory accesses
- reads passed to system calls
It's reports have:
- false positives when interacting with uninstrumented code
- allows for annotations to declare safe objects in source, to avoid reporting

## Address Sanitizer
## Undefined Behaviour Sanitizer
Detects a set of undefined behaviours.
[UndefinedBehaviorSanitizer — Clang 19.0.0git documentation (llvm.org)](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html)

## Thread Sanitizer
Detects data races and deadlocks. Typically $\approx5-15\times$ slowdown.
2 changes: 1 addition & 1 deletion 70024 - Software Reliability/Concolic Execution.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
## Definition
A type of [[Symbolic Execution 1]], that obtains a set of test cases that hit all possible branches in a program.
A type of [[Symbolic Execution]], that obtains a set of test cases that hit all possible branches in a program.
1. Start with concrete input.
2. run it, and collect path conditions.
3. choose a given path condition, and negate it.
Expand Down
73 changes: 72 additions & 1 deletion 70024 - Software Reliability/Constraint Solving.md
Original file line number Diff line number Diff line change
@@ -1 +1,72 @@
## Aims
## Summary

| Aim | Why |
| ---- | ---- |
| Bit Level Modelling | System level languages allow this granularity of access, and define behaviours at this level. |
| Performance | Real programs generate many complex constraints, need to analyse these programs quickly enough to be sueful. |
[[SAT]] solvers could be used to implement this, however most applications generate constraints at a higher level than just boolean formulae (e.g. integers, integer comparison).

Hence we use [[Satisfiability Modulo Theories]] and [[BitVectors]].
- Each block of memory is represented by an array of 8-bit [[BitVectors]] to allow constraints with bit-level accuracy.
- Operations on [[BitVectors]] include concatenation, extraction, bitwise operations and arithmetic.
- Eagre translation is used to simplify formulas at the theory level (e.g. $x - x \equiv 0$)
- Arithmetic can be encoded as a formula representing a circuit (e.g. a ripple-carry adder)
## Arrays

```rust
/// Read a value
fn read(a: &Array, i: Index) -> Value;

/// return a new array that is a copy of a, with v at index i
fn write(a: &Array, i: Index, v: Value) -> Array;
```
$$\begin{matrix*}[l r l l]
\forall a, i, j. & i = j & \Rightarrow read(a,i) = read(a,j) & \text{Array Congruence} \\
\forall a, i, j, v. & i = j & \Rightarrow read(write(a, i, v), j) = v & \text{Read-Over-Write 1}\\
\forall a, i, j, v. & i \neq j & \Rightarrow read(write(a, i, v), j) = read(a, j) & \text{Read-Over-Write 2} \\
\end{matrix*}$$
### Eliminating Arrays
#### Eliminate Writes
$$\begin{matrix*}
read(write(a, v, i), j) \Leftrightarrow ite(i = j, v, read(a, j))
\end{matrix*}$$
The write can be eliminated, instead just check if the $read$ reads from the index of the write, in which case just return the new value.
#### Eliminate Reads
Can replace each syntactically-unique $read$ with a fresh variable. For each array access with the same indices, we can imply the same variable is used (pairwise compare all indexes).
- Expensive as it expands formulas by $\cfrac{n(n-1)}{2}$ terms (where $n$ are the syntactically different indexes) to compare all indexes used.

For example:
$$\begin{split}
& (a[i_1] = e_1) \land (a[i_2] = e_2) \land (a[i_3] = e_3) \land (i_1 + i_2 + i_3 = 6) \\
\to \ & (v_1 = e_1) \land (v_2 = e_2) \land (v_3 = e_3) \land (i_1 + i_2 + i_3 = 6) \\
\to \ & (i_1 = i_2 \Rightarrow v_1 = v_2) \land (i_1 = i_3 \Rightarrow v_1 = v_3) \land (i_2 = i_3 \Rightarrow v_2 = v_3)\\
\end{split}$$
### Array-Based Refinement
When we cannot eliminate the array, we can
[[TODO]]

## Performance
[[Satisfiability Modulo Theories]] solving is NP-Complete and hence expensive, it is also invoked at every branch.
- [[70024 - Software Reliability/Constraint Solving#Eliminating Arrays|Elimination]] of constraints.
- Caching constraints and the values satisfying them.
- Constraint simplification
- Summarizing loops
### Use Different SMT/SAT Solvers
Different solvers perform differently on different formulas, so we can invoke multiple and see which is the fastest.
### Eliminating Irrelevant Constraints
Many conditions only rely on a small number of variables. We only need to include relevant constraints to a branch condition.
### Caching Solutions
Lots of similar constraint sets for branches (e.g. in loops, the same constraints used many times).
### Array Transformation
#### Index-Based
```rust
b[k] > 0
```
We can add in constraints on the indices of `b` that are in some value range.
#### Value-Based
```rust
b[k] > x
```
we can turn this into:
$$ite(0 \leq k \leq \dots, ite( k = \dots, ite(\dots, ))) > x$$
Effectively converting the array access into a set of conditionals in an expression.
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
## Attacks
### Code Corruption
An attempt to modify code in memory.
- For elf binaries pages containing code are marked as readonly, and the heap and stack marked as non-executable. But this does not prevent modification of interpreted bytecodes stored on the heap.
- Systems using code generation (e.g. JIT) need pages that are readable, writable and executable, so cannot use the same protections.
### Control Flow Hijack
Corrupt pointers used for control flow (e.g. return address, function pointers, vtables) to direct control flow.
### Non-Control-Data Attacks
Corrupt security related data (not code).
```c
void authenticate(int input) {
char auth = 0;
char buf[128] = {0};
// depending on layout of local vars on stack, an overflow to
// buf can overwrite auth
buf[input] = 1;
if (auth) {
set_privileged_mode();
}
}
```
This also includes dangling pointers left after freeing, that can be used to access data allocated in the same location later, and potentially to corrupt.
### Leak Confidential Memory
Out of bounds reads on buffers.
[Heartbleed - Wikipedia](https://en.wikipedia.org/wiki/Heartbleed)
## Defences
### Stack Canaries
Compiler adds canaries (a magic value) into stack frames, and verifies integrity at start and end of call to ensure no corruption has occurred. If the canary is valid, it is likely no stack corruption occurred.
Common values include `\0` / null (e.g. to prevent null terminated string copy/access from overflowing).
2 changes: 1 addition & 1 deletion 70024 - Software Reliability/Path Explosion.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
## Definition
Real world programs contain many conditions resulting in a huge number of (often exponential in terms of conditions) paths.
- When using [[Symbolic Execution 1]]to find bugs, we should prioritise the most important paths
- When using [[Symbolic Execution]]to find bugs, we should prioritise the most important paths
## Solutions
### Search Heuristics

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
## Definition
A generalisation of [[SAT]] theory to more complex formulas including:
- real numbers
- integers
- data structures (arrays, lists)
Original file line number Diff line number Diff line change
Expand Up @@ -7,4 +7,4 @@
## Automated
### [[Fuzzing]]
### [[Static Analysis]]
### [[Symbolic Execution 1]]
### [[Symbolic Execution]]
1 change: 1 addition & 0 deletions 70024 - Software Reliability/Static Analysis.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ Reasoning about behaviour of programs without executing them.
| No test Cases | Analysis often does not provide easily applicable test cases for developers. |

Imprecise (but faster) analyses can result in false positives.
## [[Data Flow Analysis]]
## Examples
### Coverity ([[A Few Billion Lines of Code Later]])
### Theory of [[Program Analysis]]
28 changes: 0 additions & 28 deletions 70024 - Software Reliability/Symbolic Execution 1.md

This file was deleted.

37 changes: 36 additions & 1 deletion 70024 - Software Reliability/Symbolic Execution.md
Original file line number Diff line number Diff line change
@@ -1 +1,36 @@
[[TODO]]
## Definition
By running a program on symbolic input (representing any value, can be logically constrained), every path through a program is analysed by constraining symbolic values to symbols representing the sets of possible values on the paths.
- Program instructions act on symbolic values.
- At conditionals, fork the symbolic values and constrain by the branch condition.
- On termination generate a test case by solving the constraints on the path.
### Advantages
| Advantage | Description |
| ---- | ---- |
| Automatic | Requires no input test cases, initial seeds, configuration (like [[Fuzzing]]/[[Swarm Testing]]). |
| Systematic | Can reach all possible paths in the program, and reason about all possible values on these paths. |
| Deep Bugs | Cam reason about all values, hence including extremely rare edge case values & memory layouts. |
| [[Functional Bugs]] | Can reason about logical statements given appropriate oracles. e.g. checking for crashes, given assert statements cause crashes. |
| Test Cases | Generates test cases for developers to allow them to reproduce and debug. |
### Disadvantages
| Disadvantage | Description |
| ---- | ---- |
| Complex Toolchain | Requires the source to be available, and compiled to some form the symbolic execution tool can interpret. |
| Constrain Solving | Constraint solving is expensive. |
## Mixed Concrete & Symbolic Execution
Many values are concrete (e.g. constants in the code, or concrete inputs set by the user).
- Only operations that include symbolic values need to be symbolically executed, the rest can be executed as normal.
- Allows interaction with outside environment (e.g. operating system, un-instrumented libraries, etc.)
-
## Challenges
### [[Path Explosion]]
### [[70024 - Software Reliability/Constraint Solving|Constraint Solving]]
### [[Environment Modelling]]

## Applications
- Automatically generating high coverage test suites.
- High coverage regression testing (test cases for all paths, generate test suite, check new versions)
- [[Differential Testing]] between clear and highly correct implementations, and optimised implementations. (can check behaviour is identical)
## Examples
KLEE, CREST, SPF, FuzzBall
### [[KLEE]]
### [[EXE]]
2 changes: 2 additions & 0 deletions 70024 - Software Reliability/Valgrind.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
## Definition
Dynamic instrumentation of binaries to detect memory related [[Generic Bugs]].

0 comments on commit 61d22c3

Please sign in to comment.