Reference
- Berkeley CS 267, Lecture 7
- UIUC 408, Lecture 1
- Programming Massively Parallel Processors (3rd edition), Chapter 1
- What is a GPU
A GPU is a heterogeneous chip: it contains compute modules responsible for different functions.
SMs: streaming multiprocessors
SPs: streaming processors: each SM has multiple SPs that share control logic and an instruction cache
- What is it designed for
GPUs are designed for high throughput; they don't care about latency so much.
CPUs are designed for low latency.
- CPU vs GPU
CPU: multicore system, latency-oriented
GPU: manycore / many-thread system, throughput-oriented
- Idea 1: remove the hardware that makes serial code run fast on a CPU
CPUs include out-of-order execution, branch prediction, memory prefetching, and similar mechanisms that make serial code run fast, but these parts consume a large amount of chip area.
GPUs remove these components.
- Idea 2: a large number of smaller, simpler cores
Compared with a small number of complex cores, GPU workloads can usually be handled by simple cores.
The trade-off is a challenge for the programmer, who must expose large amounts of parallelism to fully utilize all the cores.
- Idea 3: let simple cores share an instruction stream, shrinking the chip area spent on fetch/decode
Because much of the work is data-parallel, many small simple cores can share one instruction stream, reducing the fetch/decode logic on the chip.
This model is SIMT: single instruction, multiple threads.
SIMT differs somewhat from SIMD: SIMT runs many threads in parallel, each with its own register state, while SIMD applies a single instruction across multiple data lanes within one thread.
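The distinction can be sketched in plain Python (my own toy model, not a real GPU API): in the SIMD view, one instruction operates on a whole vector of data lanes inside one thread; in the SIMT view, many threads, each with its own register state, execute the same instruction stream in lockstep, so fetch/decode is paid once per instruction rather than once per thread.

```python
def simd_add(a, b):
    # SIMD view: a single instruction operates on a vector of data lanes.
    return [x + y for x, y in zip(a, b)]

def simt_run(program, n_threads):
    # SIMT view: each thread has its own registers (modeled as a dict),
    # but all threads execute the same instruction stream in lockstep.
    regs = [{"tid": t, "acc": 0} for t in range(n_threads)]
    for instr in program:           # fetched/decoded once per step
        for r in regs:              # executed by every thread
            instr(r)
    return [r["acc"] for r in regs]

# A tiny two-instruction "program"; each thread sees its own tid.
program = [
    lambda r: r.update(acc=r["tid"] * 2),   # acc = tid * 2
    lambda r: r.update(acc=r["acc"] + 1),   # acc += 1
]

print(simd_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(simt_run(program, 4))               # [1, 3, 5, 7]
```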
- Idea 4: use masks to handle branching
CPUs handle branches with branch prediction.
GPUs instead handle branching with masks: threads execute both sides of the branch, and a mask controls which threads' results take effect.
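A toy Python sketch of this mask-based (predicated) execution, with all names my own: every lane computes both sides of the branch, and a per-lane mask selects which result each lane keeps.

```python
def branch_with_mask(xs):
    # Per-lane predicate: which lanes take the "if" side.
    mask = [x >= 0 for x in xs]
    # Both sides are computed by ALL lanes, regardless of the mask.
    then_vals = [x * 2 for x in xs]   # "if" side
    else_vals = [-x for x in xs]      # "else" side
    # The mask selects, per lane, which result is kept.
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

print(branch_with_mask([3, -1, 0, -5]))  # [6, 1, 0, 5]
```

Note the cost this models: when lanes diverge, the hardware spends cycles on both paths, which is why divergent branches within a warp are expensive.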
- Idea 5: hide latency instead of reducing latency
CPUs avoid stalls with fancy caches and prefetch logic.
GPUs hide latency with lots of threads: when one thread stalls on memory, the hardware switches to another. This relies on fast switching between threads, which in turn requires keeping lots of threads alive on the chip.
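A toy scheduling simulation (my own illustration, with made-up numbers: each warp alternates 1 cycle of compute with a 4-cycle memory stall) shows the effect. With one resident warp the machine is stall-bound; with eight, the round-robin scheduler always finds a ready warp and issues one instruction every cycle.

```python
def cycles_to_finish(n_warps, steps=8, mem_latency=4):
    ready_at = [0] * n_warps   # first cycle at which each warp can issue again
    done = [0] * n_warps       # compute steps completed per warp
    rr = 0                     # round-robin scheduling pointer
    cycle = 0
    while min(done) < steps:
        for i in range(n_warps):
            w = (rr + i) % n_warps
            if done[w] < steps and ready_at[w] <= cycle:
                done[w] += 1                           # issue 1 compute cycle
                ready_at[w] = cycle + 1 + mem_latency  # then stall on memory
                rr = (w + 1) % n_warps
                break
        cycle += 1
    return cycle

print(cycles_to_finish(1))  # 36: 8 steps take 36 cycles (stall-bound)
print(cycles_to_finish(8))  # 64: 64 steps take 64 cycles (stalls fully hidden)
```

With one warp, throughput is 8 steps in 36 cycles; with eight warps it is one step per cycle, even though each individual warp still waits the full memory latency. Latency is hidden, not reduced.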
- GPU registers
The GPU register file is usually very large: on V100, the combined L1 cache + shared memory per SM is only half the size of the register file.
This is often called an inverted memory hierarchy.
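Some back-of-envelope arithmetic, using the commonly quoted V100 per-SM figures (assumptions worth verifying against the spec: a 256 KB register file, 128 KB of combined L1 cache + shared memory, and at most 2048 resident threads per SM):

```python
# Assumed V100 per-SM figures (not taken from this note's sources directly).
REGFILE_BYTES = 256 * 1024          # register file per SM
L1_PLUS_SHARED_BYTES = 128 * 1024   # combined L1 cache + shared memory per SM
MAX_THREADS_PER_SM = 2048           # maximum resident threads per SM

# Registers dwarf the cache: the "inverted" part of the hierarchy.
ratio = REGFILE_BYTES / L1_PLUS_SHARED_BYTES            # 2.0

# 32-bit registers available per thread at full occupancy.
regs_per_thread = REGFILE_BYTES // 4 // MAX_THREADS_PER_SM  # 32

print(ratio, regs_per_thread)  # 2.0 32
```

The second number explains a practical tuning rule: a kernel that needs more than ~32 registers per thread reduces how many threads can stay resident, which in turn reduces the latency hiding of Idea 5.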