Prepare L06 flipped note
h365chen committed Jan 8, 2024
1 parent 95d6412 commit 9872d76
43 changes: 29 additions & 14 deletions lectures/flipped/L06.md
# Lecture 6 — Modern Processors

## Roadmap

We will talk about some techniques to speed up CPU execution.

## Mini-lecture

Going from CISC to RISC led to impressive scaling of CPU clock frequency for a
while, but we eventually hit a wall: clock speeds stopped getting faster around
2005, plateauing at around 3 GHz. That's why we look for other techniques.

- Pipelining

This is straightforward: overlap the stages of consecutive instructions so a
new instruction can start before the previous one finishes (see the cycle-count
sketch after this list).

- Register renaming

```asm
MOV R2, R7 + 32 ; writes R2
ADD R1, R2      ; reads R2 (a true dependency on the MOV above)
MOV R2, R9 + 64 ; reuses R2 for an unrelated value; we can rename R2 to say RY
ADD R3, R2      ; reads the renamed register, so the two pairs don't interfere
```

- Speculation

```asm
ld rax, rbx+16 ; assume cache miss
add rbx, 16 ; carry on anyway, ADD doesn’t need rax value from LD
; register renaming => LD (read rbx)/ADD (write to renamed rbx) don’t interfere
cmp rax, 0 ; needs rax value, queue till available
jeq null_chk ; oops! need cmp result
; speculate: assume branch not taken
st rbx-16, rcx ; speculatively store to store buf (not L1)
ld rcx, rdx ; unrelated cache miss: 2 misses now active, 1 speculative
ld rax, rax+8 ; now must wait for result of first LD since we need rax
; but we still almost cut the time in half
```
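
As a rough illustration of the pipelining bullet above, here is a cycle-count
sketch in Rust. It assumes an idealized 5-stage pipeline with no stalls or
hazards; both the stage count and the instruction count are made-up numbers for
illustration only.

```rust
fn main() {
    let stages: u32 = 5;         // assumed pipeline depth (fetch, decode, execute, memory, writeback)
    let instructions: u32 = 100; // assumed number of instructions to run

    // Without pipelining, each instruction occupies the processor for all of
    // its stages before the next one can start.
    let sequential_cycles = instructions * stages;

    // With pipelining, the first instruction takes `stages` cycles to fill the
    // pipeline; after that, one instruction completes every cycle.
    let pipelined_cycles = stages + (instructions - 1);

    println!("sequential: {sequential_cycles} cycles");
    println!("pipelined:  {pipelined_cycles} cycles");
}
```

Note that the latency of any single instruction stays the same; pipelining
improves throughput, i.e., how many instructions complete per cycle.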

## Calculation

### q1

Assume we can always find the data in the L3 cache, cache miss rates are 40 per
1000 for L1D (L1 data) and 4 per 1000 for L2, and cache miss penalties are 5
cycles for L1D and 300 cycles for L2. What is the average running time of an
instruction?
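
A minimal sketch of one way to set up the expected-cost calculation, assuming a
base cost of 1 cycle per instruction and that both miss rates are counted per
1000 instructions; neither assumption is stated in the question, so adjust as
needed.

```rust
fn main() {
    // Assumed base cost when everything hits in L1D; the question doesn't give this.
    let base_cycles = 1.0;

    // Miss rates, read here as "per 1000 instructions" and turned into probabilities.
    let l1d_miss_rate = 40.0 / 1000.0;
    let l2_miss_rate = 4.0 / 1000.0;

    // Penalties in cycles; an L2 miss is served from L3, which always hits per the question.
    let l1d_penalty = 5.0;
    let l2_penalty = 300.0;

    // Expected cycles per instruction = base cost plus the expected miss penalties.
    let average = base_cycles + l1d_miss_rate * l1d_penalty + l2_miss_rate * l2_penalty;
    println!("average cycles per instruction: {average}");
}
```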

### q2

… if you have a page fault?

Talked about frequency scaling.

Pipelining: Put 5 instructions on post-it notes. First, had a student acting out
executing the stages of the instructions sequentially. Then, had 4 more students
come up, and acted out pipelining the instructions. Just a bit of chaos here.

Did an illustration of waiting for cache/working in the miss shadow.

