1
00:00:00,090 --> 00:00:01,798
In the previous video, we talked about
在上一个视频里 我们讲解了
(字幕整理:中国海洋大学 黄海广,[email protected] )
2
00:00:01,857 --> 00:00:03,868
a cost function for the neural network.
神经网络的代价函数
3
00:00:04,139 --> 00:00:07,079
In this video, let's start to talk about an algorithm,
在这个视频里 让我们来说说
4
00:00:07,200 --> 00:00:09,062
for trying to minimize the cost function.
让代价函数最小化的算法
5
00:00:09,240 --> 00:00:12,735
In particular, we'll talk about the back propagation algorithm.
具体来说 我们将主要讲解反向传播算法
6
00:00:13,834 --> 00:00:15,380
Here's the cost function that
这个就是我们上一个视频里写好的
7
00:00:15,520 --> 00:00:17,905
we wrote down in the previous video.
代价函数
8
00:00:17,972 --> 00:00:19,438
What we'd like to do is
我们要做的就是
9
00:00:19,484 --> 00:00:21,161
try to find parameters theta
设法找到参数
10
00:00:21,246 --> 00:00:23,440
to try to minimize j of theta.
使得J(θ)取到最小值
11
00:00:23,530 --> 00:00:25,782
In order to use either gradient descent
为了使用梯度下降法或者
12
00:00:25,832 --> 00:00:28,625
or one of the advance optimization algorithms.
其他某种高级优化算法
13
00:00:28,675 --> 00:00:30,206
What we need to do therefore is
我们需要做的就是
14
00:00:30,249 --> 00:00:31,598
to write code that takes
写好一个可以通过输入 参数 θ
15
00:00:31,645 --> 00:00:33,487
as input the parameters theta
然后计算 J(θ)
16
00:00:33,540 --> 00:00:34,965
and computes j of theta
和这些
17
00:00:35,014 --> 00:00:37,364
and these partial derivative terms.
偏导数项的代码
18
00:00:37,425 --> 00:00:38,763
Remember, that the parameters
记住 这些神经网络里
19
00:00:38,790 --> 00:00:40,710
in the neural network are these things,
对应的参数
20
00:00:40,760 --> 00:00:43,435
theta superscript l subscript ij,
也就是 θ 上标 (l) 下标 ij 的参数
21
00:00:43,492 --> 00:00:44,868
those are real numbers
这些都是实数
22
00:00:44,930 --> 00:00:47,185
and so, these are the partial derivative terms
所以这些都是我们需要计算的
23
00:00:47,249 --> 00:00:48,869
we need to compute.
偏导数项
24
00:00:48,900 --> 00:00:50,077
In order to compute the
为了计算代价函数
25
00:00:50,115 --> 00:00:51,840
cost function j of theta,
J(θ)
26
00:00:51,883 --> 00:00:53,986
we just use this formula up here
我们就是用上面这个公式
27
00:00:54,042 --> 00:00:55,617
and so, what I want to do
所以我们在本节视频里大部分时间
28
00:00:55,655 --> 00:00:56,850
for most of this video is
想要做的都是
29
00:00:56,897 --> 00:00:58,595
focus on talking about
重点关注
30
00:00:58,636 --> 00:00:59,952
how we can compute these
如何计算这些
31
00:00:59,989 --> 00:01:01,994
partial derivative terms.
偏导数项
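(For reference, the cost function referred to above, as it is usually written in this course, is reproduced below together with the gradient terms we are after; K is the number of output units, L the number of layers, and s_l the number of units in layer l.)

$$
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)}\,\log\big(h_\Theta(x^{(i)})\big)_k + \big(1-y_k^{(i)}\big)\,\log\Big(1-\big(h_\Theta(x^{(i)})\big)_k\Big)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{ji}^{(l)}\big)^2
$$

$$
\text{and the terms to compute are}\quad \frac{\partial}{\partial \Theta_{ij}^{(l)}}\,J(\Theta).
$$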
32
00:01:02,031 --> 00:01:03,812
Let's start by talking about
我们从只有一个
33
00:01:03,858 --> 00:01:05,512
the case of when we have only
训练样本的情况
34
00:01:05,556 --> 00:01:06,839
one training example,
开始说起
35
00:01:06,872 --> 00:01:09,385
so imagine, if you will that our entire
假设 我们整个训练集
36
00:01:09,432 --> 00:01:11,301
training set comprises only one
只包含一个训练样本
37
00:01:11,351 --> 00:01:14,006
training example which is a pair xy.
也就是实数对
38
00:01:14,049 --> 00:01:15,591
I'm not going to write x1y1
我这里不写成x(1) y(1)
39
00:01:15,629 --> 00:01:16,375
just write this.
就写成这样
40
00:01:16,410 --> 00:01:17,665
I'll write our one training example
把这一个训练样本记为 (x, y)
41
00:01:17,718 --> 00:01:19,980
as (x, y), and let's step through
让我们粗看一遍
42
00:01:20,031 --> 00:01:21,423
the sequence of calculations
使用这一个训练样本
43
00:01:21,462 --> 00:01:24,332
we would do with this one training example.
来计算的顺序
44
00:01:25,754 --> 00:01:27,129
The first thing we do is
首先我们
45
00:01:27,167 --> 00:01:29,175
we apply forward propagation in
应用前向传播方法来
46
00:01:29,212 --> 00:01:31,773
order to compute what our hypothesis
计算一下在给定输入的时候
47
00:01:31,813 --> 00:01:34,238
actually outputs given the input.
假设函数实际输出的结果是什么
48
00:01:34,272 --> 00:01:36,734
Concretely, we call
具体地说 这里的
49
00:01:36,769 --> 00:01:39,025
a(1) the activation values
a(1) 就是第一层的激励值
50
00:01:39,071 --> 00:01:41,541
of this first layer, that is, the input layer.
也就是输入层所在的地方
51
00:01:41,600 --> 00:01:43,452
So, I'm going to set that to x
所以我准备把它设为 x
52
00:01:43,505 --> 00:01:45,389
and then we're going to compute
然后我们来计算
53
00:01:45,435 --> 00:01:47,506
z(2) equals theta(1) a(1)
z(2) 等于 θ(1) 乘以 a(1)
54
00:01:47,552 --> 00:01:49,919
and a(2) equals g, the sigmoid
然后 a(2) 就等于 g(z(2)) 函数
55
00:01:49,980 --> 00:01:52,250
activation function applied to z(2)
其中g是一个S型激励函数
56
00:01:52,310 --> 00:01:53,753
and this would give us our
这就会计算出第一个
57
00:01:53,800 --> 00:01:56,115
activations for the first middle layer.
隐藏层的激励值
58
00:01:56,162 --> 00:01:58,208
That is for layer two of the network
也就是神经网络的第二层
59
00:01:58,241 --> 00:02:00,649
and we also add those bias terms.
我们还增加这个偏差项
60
00:02:01,315 --> 00:02:03,132
Next we apply 2 more steps
接下来我们再用2次
61
00:02:03,176 --> 00:02:04,966
of this forward propagation
前向传播
62
00:02:05,013 --> 00:02:08,328
to compute a(3) and a(4)
来计算出 a(3) 和 最后的 a(4)
63
00:02:08,360 --> 00:02:11,458
which is also the output
同样也就是假设函数
64
00:02:11,505 --> 00:02:14,089
of our hypothesis h of x.
h(x) 的输出
65
00:02:14,711 --> 00:02:18,103
So this is our vectorized implementation of
所以这里我们实现了把前向传播
66
00:02:18,145 --> 00:02:19,228
forward propagation
向量化
67
00:02:19,276 --> 00:02:20,888
and it allows us to compute
这使得我们可以计算
68
00:02:20,938 --> 00:02:22,280
the activation values
神经网络结构里的
69
00:02:22,345 --> 00:02:24,056
for all of the neurons
每一个神经元的
70
00:02:24,110 --> 00:02:25,948
in our neural network.
激励值
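As a rough illustration only, here is a minimal Python/NumPy sketch of this vectorized forward pass for the four-layer example; the weight matrices Theta1, Theta2, Theta3 are hypothetical placeholders, and the bias units the lecture adds are omitted to keep the sketch short.

```python
import numpy as np

def sigmoid(z):
    # The S-shaped (logistic) activation function g(z).
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2, Theta3):
    # Hypothetical weight matrices for a 4-layer network; bias units omitted.
    a1 = x                    # a(1) = x, the input layer activations
    z2 = Theta1 @ a1          # z(2) = Theta(1) * a(1)
    a2 = sigmoid(z2)          # a(2) = g(z(2)), first hidden layer
    z3 = Theta2 @ a2          # z(3) = Theta(2) * a(2)
    a3 = sigmoid(z3)          # a(3) = g(z(3)), second hidden layer
    z4 = Theta3 @ a3          # z(4) = Theta(3) * a(3)
    a4 = sigmoid(z4)          # a(4) = h_Theta(x), the output layer
    return a2, a3, a4
```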
71
00:02:27,934 --> 00:02:29,608
Next, in order to compute
接下来
72
00:02:29,650 --> 00:02:30,967
the derivatives, we're going to use
为了计算导数项 我们将
73
00:02:31,026 --> 00:02:33,589
an algorithm called back propagation.
采用一种叫做反向传播(Backpropagation)的算法
74
00:02:34,904 --> 00:02:37,765
The intuition of the back propagation algorithm
反向传播算法从直观上说
75
00:02:37,807 --> 00:02:38,430
is that for each node
就是对每一个结点
76
00:02:38,430 --> 00:02:41,065
we're going to compute the term
我们计算这样一项
77
00:02:41,126 --> 00:02:43,642
delta superscript l subscript j
δ下标 j 上标(l)
78
00:02:43,676 --> 00:02:45,130
that's going to somehow
这就用某种形式
79
00:02:45,171 --> 00:02:46,310
represent the error
代表了第 l 层的第 j 个结点的
80
00:02:46,361 --> 00:02:48,511
of node j in layer l.
误差
81
00:02:48,552 --> 00:02:49,682
So, recall that
我们还记得
82
00:02:49,716 --> 00:02:52,313
a superscript l subscript j
a 上标 (l) 下标 j
83
00:02:52,355 --> 00:02:54,138
that is the activation of
表示的是第 l 层第 j 个单元的
84
00:02:54,185 --> 00:02:56,182
the j-th unit in layer l
激励值
85
00:02:56,224 --> 00:02:58,001
and so, this delta term
所以这个 δ 项
86
00:02:58,045 --> 00:02:59,037
is in some sense
在某种程度上
87
00:02:59,082 --> 00:03:00,978
going to capture our error
就捕捉到了我们
88
00:03:01,012 --> 00:03:03,618
in the activation of that neural node.
在这个神经节点的激励值的误差
89
00:03:03,650 --> 00:03:05,798
That is, how much we might wish the activation
所以我们可能希望这个节点的
90
00:03:05,823 --> 00:03:07,975
of that node were slightly different.
激励值稍微不一样
91
00:03:08,047 --> 00:03:09,670
Concretely, taking the example
具体地讲 我们用
92
00:03:10,270 --> 00:03:11,100
neural network that we have
右边这个有四层
93
00:03:11,360 --> 00:03:12,700
on the right which has four layers.
的神经网络结构做例子
94
00:03:13,440 --> 00:03:15,710
And so capital L is equal to 4.
所以这里大写 L 等于4
95
00:03:16,060 --> 00:03:17,120
For each output unit, we're going to compute this delta term.
对于每一个输出单元 我们准备计算δ项
96
00:03:17,400 --> 00:03:19,130
So, delta for the j-th unit in the fourth layer is equal to
所以第四层的第j个单元的δ就等于
97
00:03:23,380 --> 00:03:24,490
just the activation of that
这个单元的激励值
98
00:03:24,720 --> 00:03:26,350
unit minus what was
减去训练样本里的
99
00:03:26,490 --> 00:03:28,650
the actual value of y in our training example.
真实值 y
100
00:03:29,900 --> 00:03:32,420
So, this term here can
所以这一项可以
101
00:03:32,580 --> 00:03:34,510
also be written h of
同样可以写成
102
00:03:34,710 --> 00:03:38,040
x subscript j, right.
h(x) 下标 j
103
00:03:38,330 --> 00:03:39,640
So this delta term is just
所以 δ 这一项就是
104
00:03:39,930 --> 00:03:40,900
the difference between what our
假设输出
105
00:03:41,290 --> 00:03:43,200
hypothesis outputs and what
和训练集y值
106
00:03:43,370 --> 00:03:44,870
was the value of y
之间的差
107
00:03:45,570 --> 00:03:46,900
in our training set whereas
这里
108
00:03:47,060 --> 00:03:48,610
y subscript j is
y 下标 j 就是
109
00:03:48,750 --> 00:03:49,910
the j-th element of the
我们标记训练集里向量
110
00:03:50,090 --> 00:03:53,340
vector value y in our labeled training set.
的第j个元素的值
111
00:03:56,200 --> 00:03:57,790
And by the way, if you
顺便说一句
112
00:03:57,970 --> 00:04:00,460
think of delta, a, and
如果你把 δ a 和 y 这三个
113
00:04:01,000 --> 00:04:02,350
y as vectors then you can
都看做向量
114
00:04:02,520 --> 00:04:03,760
also take those and come
那么你可以同样这样写
115
00:04:04,030 --> 00:04:05,890
up with a vectorized implementation of
并且得出一个向量化的表达式
116
00:04:06,010 --> 00:04:07,310
it, which is just
也就是
117
00:04:07,690 --> 00:04:09,840
delta 4 gets set as
δ(4)等于
118
00:04:10,700 --> 00:04:14,330
a4 minus y. Where
a(4) 减去 y 这里
119
00:04:14,560 --> 00:04:15,820
here, each of these delta
每一个变量
120
00:04:16,540 --> 00:04:18,080
4 a4 and y, each of
也就是 δ(4) a(4) 和 y
121
00:04:18,180 --> 00:04:19,860
these is a vector whose
都是一个向量
122
00:04:20,640 --> 00:04:22,040
dimension is equal to
并且向量维数等于
123
00:04:22,250 --> 00:04:24,150
the number of output units in our network.
输出单元的数目
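Putting these last few lines together, the output-layer error for this four-layer example is, per unit and in vectorized form:

$$
\delta_j^{(4)} = a_j^{(4)} - y_j = \big(h_\Theta(x)\big)_j - y_j,
\qquad
\delta^{(4)} = a^{(4)} - y.
$$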
124
00:04:25,210 --> 00:04:26,880
So we've now computed the
所以现在我们计算出
125
00:04:27,320 --> 00:04:28,670
error term delta
网络结构的
126
00:04:29,020 --> 00:04:30,170
4 for our network.
误差项 δ(4)
127
00:04:31,440 --> 00:04:32,950
What we do next is compute
我们下一步就是计算
128
00:04:33,620 --> 00:04:36,280
the delta terms for the earlier layers in our network.
网络中前面几层的误差项 δ
129
00:04:37,210 --> 00:04:38,690
Here's the formula for computing delta 3:
这个就是计算 δ(3) 的公式
130
00:04:39,010 --> 00:04:39,830
delta 3 is equal
δ(3) 等于
131
00:04:40,310 --> 00:04:42,050
to theta 3 transpose times delta 4.
θ(3) 的转置乘以 δ(4)
132
00:04:42,560 --> 00:04:44,190
And this dot times, this
然后这里的点乘
133
00:04:44,390 --> 00:04:46,390
is the element-wise multiplication operation
也就是我们在 MATLAB 里用过的
134
00:04:47,580 --> 00:04:48,380
that we know from MATLAB.
元素间的乘法操作
135
00:04:49,160 --> 00:04:50,760
So theta 3 transpose delta
所以 θ(3) 转置乘以
136
00:04:51,020 --> 00:04:52,860
4, that's a vector; g prime
δ(4) 这是一个向量
137
00:04:53,480 --> 00:04:55,080
z3 that's also a vector
g'(z(3)) 同样也是一个向量
138
00:04:55,800 --> 00:04:57,370
and so dot times is
所以点乘就是
139
00:04:57,530 --> 00:04:59,670
an element-wise multiplication between these two vectors.
两个向量的元素间对应相乘
140
00:05:01,460 --> 00:05:02,650
This term g prime of
其中这一项 g'(z(3))
141
00:05:02,740 --> 00:05:04,560
z3, that formally is actually
其实是对激励函数 g
142
00:05:04,950 --> 00:05:06,420
the derivative of the activation
在输入值为 z(3) 的时候
143
00:05:06,720 --> 00:05:08,740
function g evaluated at
所求的
144
00:05:08,890 --> 00:05:10,620
the input values given by z3.
导数
145
00:05:10,760 --> 00:05:12,620
If you know calculus, you
如果你掌握微积分的话
146
00:05:12,710 --> 00:05:13,470
can try to work it out yourself
你可以试着自己解出来
147
00:05:13,850 --> 00:05:16,100
and see that you can simplify it to the same answer that I get.
然后可以简化得到我这里的结果
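For completeness, here is the calculus step being alluded to: for the sigmoid activation g(z) = 1/(1 + e^{-z}),

$$
g'(z) = \frac{e^{-z}}{\big(1+e^{-z}\big)^2} = g(z)\,\big(1-g(z)\big),
\qquad\text{so}\qquad
g'\big(z^{(3)}\big) = a^{(3)} \;.\!*\; \big(1-a^{(3)}\big).
$$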
148
00:05:16,860 --> 00:05:19,690
But I'll just tell you pragmatically what that means.
但是我只是从实际角度告诉你这是什么意思
149
00:05:20,000 --> 00:05:21,260
What you do to compute this g
你计算这个 g'
150
00:05:21,460 --> 00:05:23,310
prime, these derivative terms is
这个导数项其实是
151
00:05:23,510 --> 00:05:25,660
just a(3) dot times 1
a(3) 点乘 (1-a(3))
152
00:05:26,010 --> 00:05:27,900
minus a(3), where a(3)
这里a(3)是
153
00:05:28,160 --> 00:05:29,420
is the vector of activations.
激励向量
154
00:05:30,150 --> 00:05:31,440
1 is the vector of
1是以1为元素的向量
155
00:05:31,600 --> 00:05:33,240
ones and a(3) is
a(3) 又是
156
00:05:34,020 --> 00:05:35,970
again the vector of
一个对那一层的
157
00:05:36,290 --> 00:05:38,850
activation values for that layer.
激励向量
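In the notation used here (with .* denoting the element-wise product), that formula reads:

$$
\delta^{(3)} = \big(\Theta^{(3)}\big)^{T}\,\delta^{(4)} \;.\!*\; g'\big(z^{(3)}\big),
\qquad
g'\big(z^{(3)}\big) = a^{(3)} \;.\!*\; \big(1 - a^{(3)}\big).
$$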
158
00:05:39,170 --> 00:05:40,210
Next you apply a similar
接下来你应用一个相似的公式
159
00:05:40,540 --> 00:05:42,850
formula to compute delta 2
来计算 δ(2)
160
00:05:43,220 --> 00:05:45,230
where again that can be
同样这里可以利用一个
161
00:05:45,670 --> 00:05:47,410
computed using a similar formula.
相似的公式
162
00:05:48,450 --> 00:05:49,950
Only now it is a2
只是在这里
163
00:05:50,120 --> 00:05:53,850
like so, and I won't
是 a(2)
164
00:05:53,960 --> 00:05:55,020
prove it here, but you
这里我并没有证明
165
00:05:55,110 --> 00:05:56,400
can actually, it's possible to
但是如果你懂微积分的话
166
00:05:56,490 --> 00:05:57,520
prove it if you know calculus
证明是完全可以做到的
167
00:05:58,240 --> 00:05:59,520
that this expression is equal
那么这个表达式从数学上讲
168
00:05:59,860 --> 00:06:02,010
to mathematically, the derivative of
就等于激励函数
169
00:06:02,190 --> 00:06:03,570
the activation function
g 函数的导数
170
00:06:04,040 --> 00:06:05,460
g, which I'm denoting
这里我用
171
00:06:05,910 --> 00:06:08,540
by g prime. And finally,
g' 来表示
172
00:06:09,270 --> 00:06:10,690
that's it and there is
最后 就到这儿结束了
173
00:06:10,860 --> 00:06:13,650
no delta1 term, because the
这里没有 δ(1) 项 因为
174
00:06:13,720 --> 00:06:15,590
first layer corresponds to the
第一层对应输入层
175
00:06:15,630 --> 00:06:16,940
input layer and those are just the
那只是表示
176
00:06:17,000 --> 00:06:18,200
features we observed in our
我们在训练集观察到的
177
00:06:18,300 --> 00:06:20,380
training sets, so there isn't any error associated with them.
所以不会存在误差
178
00:06:20,600 --> 00:06:22,080
It's not like, you know,
这就是说
179
00:06:22,120 --> 00:06:23,680
we don't really want to try to change those values.
我们是不想改变这些值的
180
00:06:24,320 --> 00:06:25,240
And so we have delta
所以这个例子中我们的 δ 项就只有
181
00:06:25,510 --> 00:06:28,090
terms only for layers 2 and 3 for this example.
第2层和第3层
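Continuing the hypothetical forward-propagation sketch above, the backward pass just described could look like this in Python/NumPy (bias units again omitted, so this illustrates the delta formulas rather than a complete implementation).

```python
def backward_deltas(y, Theta2, Theta3, a2, a3, a4):
    # y is the label vector for this single training example (same length as a4).
    delta4 = a4 - y                                  # delta(4) = a(4) - y
    delta3 = (Theta3.T @ delta4) * (a3 * (1 - a3))   # (Theta(3))' * delta(4) .* g'(z(3))
    delta2 = (Theta2.T @ delta3) * (a2 * (1 - a2))   # same formula, one layer earlier
    # No delta1: the input layer is just the observed features.
    return delta2, delta3, delta4
```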
182
00:06:30,170 --> 00:06:32,120
The name back propagation comes from
反向传播法这个名字
183
00:06:32,170 --> 00:06:33,260
the fact that we start by
源于我们从
184
00:06:33,350 --> 00:06:34,720
computing the delta term for
输出层开始计算
185
00:06:34,740 --> 00:06:36,190
the output layer and then
δ项
186
00:06:36,370 --> 00:06:37,480
we go back a layer and
然后我们返回到上一层
187
00:06:37,880 --> 00:06:39,670
compute the delta terms for the
计算第三隐藏层的
188
00:06:39,850 --> 00:06:41,050
third hidden layer and then we
δ项 接着我们
189
00:06:41,180 --> 00:06:42,540
go back another step to compute
再往前一步来计算
190
00:06:42,770 --> 00:06:44,070
delta 2 and so, we're sort of
δ(2) 所以说
191
00:06:44,660 --> 00:06:46,060
back propagating the errors from
我们是类似于把输出层的误差
192
00:06:46,280 --> 00:06:47,270
the output layer to layer 3
反向传播给了第3层
193
00:06:47,650 --> 00:06:50,180
to layer 2, and hence the name back propagation.
然后是再传到第二层 这就是反向传播的意思
194
00:06:51,270 --> 00:06:53,120
Finally, the derivation is
最后 这个推导过程是出奇的麻烦的
195
00:06:53,340 --> 00:06:56,510
surprisingly complicated, surprisingly involved but
出奇的复杂
196
00:06:56,820 --> 00:06:58,100
if you just do these few steps
但是如果你按照
197
00:06:58,280 --> 00:07:00,130
of computation it is possible
这样几个步骤计算
198
00:07:00,680 --> 00:07:02,540
to prove, via a frankly somewhat
就有可能通过一个坦白说有些
199
00:07:02,810 --> 00:07:04,440
complicated mathematical proof,
复杂的数学证明来证明
200
00:07:05,200 --> 00:07:07,410
It's possible to prove that if
如果你忽略正则化所产生的项