Reference
- Berkeley CS 267, Lecture 7
- UIUC 408, Lecture 1
- Programming Massively Parallel Processors (3rd edition), Chapter 1
- What is a GPU
A GPU is a heterogeneous chip: it contains compute modules responsible for different functions.
SMs: streaming multiprocessors
SPs: streaming processors: each SM has multiple SPs that share control logic and an instruction cache
- What is it designed for
GPUs are designed for high throughput; they don't care about latency so much.
CPUs are designed for low latency.
- CPU vs GPU
CPU: multicore system, latency-oriented
GPU: manycore / many-thread system, throughput-oriented
- Idea 1: remove the hardware that makes serial code run fast on a CPU
CPUs include out-of-order execution, branch prediction, memory prefetching, and similar mechanisms that make serial code run fast, but these parts consume a large amount of chip area.
GPUs remove these components.
- Idea 2: a large number of smaller, simpler cores
Compared with a small number of complex cores, GPU workloads can usually be handled by simple cores.
The trade-off is a challenge for the programmer, who must expose large amounts of parallelism to fully utilize all the cores.
- Idea 3: let simple cores share an instruction stream, shrinking the chip area spent on fetch/decode
Because much of the work is data-parallel, many small simple cores can share one instruction stream, reducing the fetch/decode logic on the chip.
This model is SIMT: single instruction, multiple threads.
SIMT differs somewhat from SIMD: SIMT runs many threads in parallel, each with its own register state, while SIMD applies a single instruction across multiple data lanes within one thread.
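The distinction can be sketched in plain Python (my own toy model, not a real GPU API): in the SIMD view, one instruction operates on a whole vector of data lanes inside one thread; in the SIMT view, many threads, each with its own register state, execute the same instruction stream in lockstep, so fetch/decode is paid once per instruction rather than once per thread.

```python
def simd_add(a, b):
    # SIMD view: a single instruction operates on a vector of data lanes.
    return [x + y for x, y in zip(a, b)]

def simt_run(program, n_threads):
    # SIMT view: each thread has its own registers (modeled as a dict),
    # but all threads execute the same instruction stream in lockstep.
    regs = [{"tid": t, "acc": 0} for t in range(n_threads)]
    for instr in program:           # fetched/decoded once per step
        for r in regs:              # executed by every thread
            instr(r)
    return [r["acc"] for r in regs]

# A tiny two-instruction "program"; each thread sees its own tid.
program = [
    lambda r: r.update(acc=r["tid"] * 2),   # acc = tid * 2
    lambda r: r.update(acc=r["acc"] + 1),   # acc += 1
]

print(simd_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(simt_run(program, 4))               # [1, 3, 5, 7]
```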
- Idea 4: use masks to handle branching
CPUs handle branches with branch prediction.
GPUs instead handle branching with masks: threads execute both sides of the branch, and a mask controls which threads' results take effect.
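A toy Python sketch of this mask-based (predicated) execution, with all names my own: every lane computes both sides of the branch, and a per-lane mask selects which result each lane keeps.

```python
def branch_with_mask(xs):
    # Per-lane predicate: which lanes take the "if" side.
    mask = [x >= 0 for x in xs]
    # Both sides are computed by ALL lanes, regardless of the mask.
    then_vals = [x * 2 for x in xs]   # "if" side
    else_vals = [-x for x in xs]      # "else" side
    # The mask selects, per lane, which result is kept.
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

print(branch_with_mask([3, -1, 0, -5]))  # [6, 1, 0, 5]
```

Note the cost this models: when lanes diverge, the hardware spends cycles on both paths, which is why divergent branches within a warp are expensive.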
- Idea 5: hide latency instead of reducing latency
CPUs avoid stalls with fancy caches and prefetch logic.
GPUs hide latency with lots of threads: when one thread stalls on memory, the hardware switches to another. This relies on fast switching between threads, which in turn requires keeping lots of threads alive on the chip.
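A toy scheduling simulation (my own illustration, with made-up numbers: each warp alternates 1 cycle of compute with a 4-cycle memory stall) shows the effect. With one resident warp the machine is stall-bound; with eight, the round-robin scheduler always finds a ready warp and issues one instruction every cycle.

```python
def cycles_to_finish(n_warps, steps=8, mem_latency=4):
    ready_at = [0] * n_warps   # first cycle at which each warp can issue again
    done = [0] * n_warps       # compute steps completed per warp
    rr = 0                     # round-robin scheduling pointer
    cycle = 0
    while min(done) < steps:
        for i in range(n_warps):
            w = (rr + i) % n_warps
            if done[w] < steps and ready_at[w] <= cycle:
                done[w] += 1                           # issue 1 compute cycle
                ready_at[w] = cycle + 1 + mem_latency  # then stall on memory
                rr = (w + 1) % n_warps
                break
        cycle += 1
    return cycle

print(cycles_to_finish(1))  # 36: 8 steps take 36 cycles (stall-bound)
print(cycles_to_finish(8))  # 64: 64 steps take 64 cycles (stalls fully hidden)
```

With one warp, throughput is 8 steps in 36 cycles; with eight warps it is one step per cycle, even though each individual warp still waits the full memory latency. Latency is hidden, not reduced.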
- GPU registers
The GPU register file is usually very large: on V100, the combined L1 cache + shared memory per SM is only half the size of the register file.
This is often called an inverted memory hierarchy.
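Some back-of-envelope arithmetic, using the commonly quoted V100 per-SM figures (assumptions worth verifying against the spec: a 256 KB register file, 128 KB of combined L1 cache + shared memory, and at most 2048 resident threads per SM):

```python
# Assumed V100 per-SM figures (not taken from this note's sources directly).
REGFILE_BYTES = 256 * 1024          # register file per SM
L1_PLUS_SHARED_BYTES = 128 * 1024   # combined L1 cache + shared memory per SM
MAX_THREADS_PER_SM = 2048           # maximum resident threads per SM

# Registers dwarf the cache: the "inverted" part of the hierarchy.
ratio = REGFILE_BYTES / L1_PLUS_SHARED_BYTES            # 2.0

# 32-bit registers available per thread at full occupancy.
regs_per_thread = REGFILE_BYTES // 4 // MAX_THREADS_PER_SM  # 32

print(ratio, regs_per_thread)  # 2.0 32
```

The second number explains a practical tuning rule: a kernel that needs more than ~32 registers per thread reduces how many threads can stay resident, which in turn reduces the latency hiding of Idea 5.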