Day 02 Paper Review: ResNet

Deep Residual Learning for Image Recognition

Author: 김지중 (YBIGTA, 10th cohort)
References
- Deep Residual Learning for Image Recognition
- Coursera - Convolutional Neural Networks WEEK 2
Winner of the 2015 ImageNet competition!
- VGG models have 16 or 19 layers.
- The network this team used could be stacked up to 152 layers deep.

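1. Background: degradation problem

The problem
- In theory, training error should go down as layers are added and the model gets deeper.
- In practice, when layers are simply stacked (plain networks), training error starts to go up once the model passes a certain depth.
- To stress the point again: this is about training error increasing. It is not about test error increasing because of overfitting.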

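Cause
- vanishing or exploding gradients
Existing remedies
- normalized initialization
- intermediate normalization layers (batch normalization), placed before the nonlinearity
The paper notes that these remedies let deep plain networks start converging, yet the degradation problem still appears. To obtain training curves like the graph below, it proposes Deep Residual Learning.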

2. Residual Block

Let's look at the Residual Block, the foundation of ResNet.

2-1. Plain Network

2-2. Residual Block

2-3. Interpreting the Residual Block - Identity Mapping

So what does this actually buy us? Let's work through the example below.

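Assumption 1: the input ( $a^{[l]}$ ) and the output ( $a^{[l+2]}$ ) have the same dimension.
Assumption 2: the activation function (g) is ReLU (x if x $\ge$ 0 else 0)
Assumption 3:
- Regularization such as weight decay has shrunk the weights until they converged to 0.
- I read this as a worst-case assumption about the training process.

Earlier, the formula for $a^{[l+2]}$ in the Residual Block was given as follows.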
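
The figure that carried the formula is not reproduced in these notes. A reconstruction in the same notation as the projection form in Section 3 might read as follows; the names $W^{[l+2]}$ and $b^{[l+2]}$ for the second layer's weights and bias are assumptions that follow the Coursera lecture's convention.

$$
z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}, \qquad a^{[l+2]} = g\left(z^{[l+2]} + a^{[l]}\right)
$$

Under Assumption 3, and assuming the bias shrinks to 0 as well, $z^{[l+2]} \to 0$, so

$$
a^{[l+2]} = g\left(0 + a^{[l]}\right) = g\left(a^{[l]}\right) = a^{[l]}
$$

The last step relies on Assumption 2: $a^{[l]}$ is itself the output of a ReLU, so $a^{[l]} \ge 0$ and applying ReLU again leaves it unchanged.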

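Conclusion

At the very least, a Residual Block passes its input straight through as its output.
So stacking more layers does not hurt training.
Unlike a plain network, then, adding layers does not push training error up, i.e. the degradation problem does not occur.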
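
The worst case described above can be checked numerically. The following is a minimal pure-Python sketch, not the paper's implementation: it uses fully connected layers instead of convolutions, and the helper names (`relu`, `linear`, `plain_block`, `residual_block`) are made up for this illustration. All weights and biases are set to 0 to mimic Assumption 3.

```python
def relu(v):
    """Element-wise ReLU: x if x >= 0 else 0."""
    return [x if x >= 0 else 0.0 for x in v]


def linear(W, b, a):
    """Compute z = W a + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + b_i for row, b_i in zip(W, b)]


def plain_block(a_l, W1, b1, W2, b2):
    """Two stacked layers with no shortcut."""
    a_l1 = relu(linear(W1, b1, a_l))
    return relu(linear(W2, b2, a_l1))


def residual_block(a_l, W1, b1, W2, b2):
    """The same two layers, but a_l is added to z^{[l+2]} before the last ReLU."""
    a_l1 = relu(linear(W1, b1, a_l))
    z_l2 = linear(W2, b2, a_l1)
    return relu([z + a for z, a in zip(z_l2, a_l)])


n = 3
zeros_W = [[0.0] * n for _ in range(n)]  # weights decayed all the way to 0
zeros_b = [0.0] * n                      # assumption: biases are 0 as well
a_l = [0.5, 0.0, 2.0]                    # non-negative, as if produced by an earlier ReLU

plain_out = plain_block(a_l, zeros_W, zeros_b, zeros_W, zeros_b)
residual_out = residual_block(a_l, zeros_W, zeros_b, zeros_W, zeros_b)

print(plain_out)     # the plain block has lost the input
print(residual_out)  # the residual block passes the input through
```

With zero weights the plain block collapses to all zeros, while the residual block returns `a_l` unchanged, which is the identity mapping the conclusion refers to.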

3. Variations

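In the example above, the shortcut spans two layers (from $a^{[l]}$ to $a^{[l+2]}$). The paper reports that shortcuts skipping two or three layers are effective.
- A shortcut that skips only a single layer, i.e. adding the immediately preceding layer straight back in, reportedly shows no advantage.
- Another variation that was tried: inserting the shortcut between the linear function and the ReLU.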
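A Residual Block can also be designed so that its input and output have different dimensions.
- In the form $a^{[l+2]} = g(z^{[l+2]} + W a^{[l]})$, the input is multiplied by a weight matrix to adjust its dimension.
- Here W can be a model parameter (i.e. it can be learned), ~~or pre-trained weights can be brought in,~~ or it can simply be a fixed projection matrix that implements zero-padding.
- The paper calls the learned form a Projection Shortcut and describes the zero-padding form as an identity shortcut with extra zero entries. Some people call a Residual Block with matching dimensions an "Identity Block" and a Residual Block that uses a projection a "Convolutional Block".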
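
To make the fixed zero-padding choice of W concrete, here is a small sketch under simplifying assumptions: vectors instead of feature maps, and `n_out >= n_in`. The helper names (`zero_padding_matrix`, `matvec`, `shortcut_add`) and the numbers are hypothetical, chosen only for illustration. A learned W would have the same shape but would be updated during training.

```python
def zero_padding_matrix(n_out, n_in):
    """Fixed (n_out x n_in) matrix: identity in the top n_in rows, zeros below."""
    return [[1.0 if i == j else 0.0 for j in range(n_in)] for i in range(n_out)]


def matvec(W, a):
    """Compute W a, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) for row in W]


def shortcut_add(z_l2, a_l, W_s):
    """a^{[l+2]} = g(z^{[l+2]} + W_s a^{[l]}) with g = ReLU."""
    projected = matvec(W_s, a_l)
    return [s if s >= 0 else 0.0 for s in (z + p for z, p in zip(z_l2, projected))]


a_l = [1.0, 2.0]                 # block input, dimension 2
z_l2 = [0.5, -3.0, 0.25, -1.0]   # pre-activation of the block's last layer, dimension 4

W_s = zero_padding_matrix(n_out=4, n_in=2)
padded = matvec(W_s, a_l)        # a_l followed by two zeros
a_l2 = shortcut_add(z_l2, a_l, W_s)

print(padded)
print(a_l2)
```

Multiplying by this W only appends zeros to `a_l`, so the shortcut adds no trainable parameters; the dimension change is handled entirely by the padding.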

4. Residual Network

A Residual Network is a network built by stacking the Residual Blocks defined above.

Example - ResNet-50
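
The architecture figure is not reproduced here, but the "50" in the name can be recovered with a bit of arithmetic. The sketch below assumes the bottleneck design from the paper's architecture table (three convolutions per block, four stages) and counts only weighted layers on the main path, so convolutions on projection shortcuts are left out. The names `BLOCKS_PER_STAGE` and `weighted_layer_count` are made up for this example.

```python
# Number of bottleneck blocks in each of the four stages.
BLOCKS_PER_STAGE = {
    "ResNet-50": [3, 4, 6, 3],
    "ResNet-101": [3, 4, 23, 3],
    "ResNet-152": [3, 8, 36, 3],
}

CONVS_PER_BOTTLENECK = 3  # 1x1, 3x3, 1x1 convolutions in each block


def weighted_layer_count(blocks_per_stage):
    """Initial conv layer + conv layers inside the blocks + final FC layer."""
    return 1 + CONVS_PER_BOTTLENECK * sum(blocks_per_stage) + 1


for name, blocks in BLOCKS_PER_STAGE.items():
    print(name, weighted_layer_count(blocks))
```

For ResNet-50 this gives 1 + 3 × 16 + 1 = 50, and the same count yields 101 and 152 for the deeper variants.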

End.

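Q1. Why was there no effect?
Q2. Why bring in pretrained weights? Are they taken from a CNN that was first trained as a plain network?
- A) Listening to the lecture again, it says "it could be a matrix of parameters that we learned, it could be a fixed matrix that just implements zero-padding". In other words, W is either a model parameter or a projection matrix that applies zero-padding. The "pre-trained" part apparently just came out of my own head, hahaha.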
Q3. Why average pooling rather than max pooling?
- A) Using multiple FC layers introduces several hidden layers that might obscure the "reasoning" behind decisions made by the final classification layer. GAP (Global Average Pooling) followed by a single FC layer makes it easier to understand what decisions are being made, as examination of the GAP outputs can give some insight into those decisions that several hidden FC layers would otherwise obscure.
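
The GAP-then-single-FC pipeline in the answer above can be sketched in a few lines. This is an illustrative toy, not ResNet's actual head: the feature maps, `W_fc`, and `b_fc` are hypothetical values, and the helper names are made up for this example.

```python
def global_average_pool(feature_maps):
    """Reduce C feature maps of shape H x W to a length-C vector of channel means."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fully_connected(W, b, x):
    """Single FC layer: one logit per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0
    [[0.0, 0.0], [2.0, 2.0]],  # channel 1
]
W_fc = [[1.0, 0.0], [0.5, 1.0]]  # hypothetical classifier weights for 2 classes
b_fc = [0.0, 0.0]

gap = global_average_pool(feature_maps)
logits = fully_connected(W_fc, b_fc, gap)

print(gap)
print(logits)
```

Each logit is a direct weighted sum of the per-channel averages, so inspecting `gap` and one row of `W_fc` shows which channels drove a given class score. With several hidden FC layers in between, that link would be much harder to read off.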