srt/12 - 6 - Using An SVM (21 min).srt

﻿1
00:00:00,140 --> 00:00:01,310
So far we've been talking about
目前为止 我们已经讨论了

2
00:00:01,640 --> 00:00:03,290
SVMs in a fairly abstract level.
SVM比较抽象的层面

3
00:00:03,980 --> 00:00:05,030
In this video I'd like to
在这个视频中 我将要

4
00:00:05,200 --> 00:00:06,460
talk about what you actually need
讨论到为了运行或者运用SVM

5
00:00:06,740 --> 00:00:09,410
to do in order to run or to use an SVM.
你实际上所需要的一些东西

6
00:00:11,320 --> 00:00:12,300
The support vector machine algorithm
支持向量机算法

7
00:00:12,850 --> 00:00:14,870
poses a particular optimization problem.
提出了一个特别优化的问题

8
00:00:15,530 --> 00:00:16,940
But as I briefly mentioned in
但是就如在之前的

9
00:00:17,120 --> 00:00:18,150
an earlier video, I really
视频中我简单提到的

10
00:00:18,380 --> 00:00:20,570
do not recommend writing your
我真的不建议你自己写

11
00:00:20,630 --> 00:00:22,810
own software to solve for the parameter's theta yourself.
软件来求解参数θ

12
00:00:23,950 --> 00:00:26,110
So just as today, very
因此由于今天

13
00:00:26,420 --> 00:00:27,730
few of us, or maybe almost essentially
我们中的很少人 或者其实

14
00:00:28,090 --> 00:00:29,400
none of us would think of
没有人考虑过

15
00:00:29,530 --> 00:00:31,680
writing code ourselves to invert a matrix
自己写代码来转换矩阵

16
00:00:31,950 --> 00:00:33,940
or take a square root of a number, and so on.
或求一个数的平方根等

17
00:00:34,190 --> 00:00:36,570
We just, you know, call some library function to do that.
我们只是知道如何去调用库函数来实现这些功能

18
00:00:36,700 --> 00:00:38,090
In the same way, the
同样的

19
00:00:38,850 --> 00:00:40,310
software for solving the SVM
用以解决SVM

20
00:00:40,620 --> 00:00:42,200
optimization problem is very
最优化问题的软件很

21
00:00:42,440 --> 00:00:43,880
complex, and there have
复杂 且已经有

22
00:00:43,990 --> 00:00:44,960
been researchers that have been
研究者做了

23
00:00:45,110 --> 00:00:47,560
doing essentially numerical optimization research for many years.
很多年数值优化了

24
00:00:47,850 --> 00:00:48,960
So you come up with good
因此你提出好的

25
00:00:49,150 --> 00:00:50,550
software libraries and good software
软件库和好的软件

26
00:00:50,930 --> 00:00:52,270
packages to do this.
包来做这样一些事儿

27
00:00:52,470 --> 00:00:53,480
And then strongly recommend just using
然后强烈建议使用

28
00:00:53,860 --> 00:00:55,260
one of the highly optimized software
高优化软件库中的一个

29
00:00:55,710 --> 00:00:57,780
libraries rather than trying to implement something yourself.
而不是尝试自己落实一些数据

30
00:00:58,730 --> 00:01:00,680
And there are lots of good software libraries out there.
有许多好的软件库

31
00:01:00,970 --> 00:01:02,060
The two that I happen to
我正好用得最多的

32
00:01:02,210 --> 00:01:03,220
use the most often are the
两个是

33
00:01:03,400 --> 00:01:05,000
linear SVM but there are really
线性SVM 但是真的有

34
00:01:05,410 --> 00:01:06,860
lots of good software libraries for
很多软件库可以用来

35
00:01:07,030 --> 00:01:08,430
doing this that you know, you can
做这件事儿 你可以

36
00:01:08,600 --> 00:01:10,190
link to many of the
连接许多

37
00:01:10,450 --> 00:01:11,860
major programming languages that you
你可能会用来编写学习算法的

38
00:01:11,950 --> 00:01:14,410
may be using to code up learning algorithm.
主要编程语言

39
00:01:15,280 --> 00:01:16,460
Even though you shouldn't be writing
尽管你不去写

40
00:01:16,730 --> 00:01:18,330
your own SVM optimization software,
你自己的SVM（支持向量机）的优化软件

41
00:01:19,120 --> 00:01:20,680
there are a few things you need to do, though.
但是你也需要做几件事儿

42
00:01:21,420 --> 00:01:23,130
First is to come up
首先是提出

43
00:01:23,130 --> 00:01:24,230
with with some choice of the
参数C的选择

44
00:01:24,320 --> 00:01:25,640
parameter's C. We talked a
我们在之前的视频中

45
00:01:25,940 --> 00:01:26,930
little bit of the bias/variance properties of
讨论过误差/方差在

46
00:01:27,040 --> 00:01:28,850
this in the earlier video.
这方面的性质

47
00:01:30,290 --> 00:01:31,480
Second, you also need to
第二 你也需要

48
00:01:31,630 --> 00:01:33,040
choose the kernel or the
选择内核参数或

49
00:01:33,410 --> 00:01:34,880
similarity function that you want to use.
你想要使用的相似函数

50
00:01:35,730 --> 00:01:37,080
So one choice might
其中一个选择是

51
00:01:37,280 --> 00:01:38,980
be if we decide not to use any kernel.
我们选择不需要任何内核参数

52
00:01:40,560 --> 00:01:41,510
And the idea of no kernel
没有内核参数的理念

53
00:01:41,910 --> 00:01:43,600
is also called a linear kernel.
也叫线性核函数

54
00:01:44,130 --> 00:01:45,320
So if someone says, I use
因此 如果有人说他使用

55
00:01:45,530 --> 00:01:46,760
an SVM with a linear kernel,
了线性核的SVM（支持向量机）

56
00:01:47,180 --> 00:01:48,330
what that means is you know, they use
这就意味这 他使用了

57
00:01:48,490 --> 00:01:50,690
an SVM without using without
不带有

58
00:01:51,020 --> 00:01:52,250
using a kernel and it
核函数的SVM（支持向量机）

59
00:01:52,360 --> 00:01:53,410
was a version of the SVM
这是一个

60
00:01:54,120 --> 00:01:55,870
that just uses theta transpose X, right,
只是用了θTX

61
00:01:56,140 --> 00:01:57,620
that predicts 1 theta 0
预测1 θ0

62
00:01:57,850 --> 00:01:59,420
plus theta 1 X1
+θ1X1

63
00:01:59,740 --> 00:02:01,000
plus so on plus theta
+...+θnXn

64
00:02:01,690 --> 00:02:04,160
N, X N is greater than equals 0.
这个式子大于等于0

65
00:02:05,520 --> 00:02:06,830
This term linear kernel, you
这个内核线性参数

66
00:02:06,950 --> 00:02:08,250
can think of this as you know this
你可以把它想象成

67
00:02:08,480 --> 00:02:09,290
is the version of the SVM
SVM的一个版本

68
00:02:10,340 --> 00:02:12,320
that just gives you a standard linear classifier.
它只是给你一个标准的线性分类器

69
00:02:13,940 --> 00:02:14,700
So that would be one
因此它可以成为一个

70
00:02:15,040 --> 00:02:16,160
reasonable choice for some problems,
解决一些问题的合理选择

71
00:02:17,130 --> 00:02:18,080
and you know, there would be many software
且你知道的 有许多软件

72
00:02:18,470 --> 00:02:20,900
libraries, like linear, was
库 如线性的是

73
00:02:21,210 --> 00:02:22,320
one example, out of many,
其中的一个例子

74
00:02:22,840 --> 00:02:23,880
one example of a software library
一个软件库的例子

75
00:02:24,560 --> 00:02:25,620
that can train an SVM
可以用来训练不带

76
00:02:25,980 --> 00:02:27,410
without using a kernel, also
内核参数的SVM 也

77
00:02:27,760 --> 00:02:29,470
called a linear kernel.
叫线性内核函数

78
00:02:29,850 --> 00:02:31,340
So, why would you want to do this?
那么你为什么想要做这样一件事儿呢？

79
00:02:31,410 --> 00:02:32,820
If you have a large number of
如果你有大量的

80
00:02:33,150 --> 00:02:34,280
features, if N is
特征值 如果N

81
00:02:34,430 --> 00:02:37,800
large, and M the
很大 且M

82
00:02:37,990 --> 00:02:39,590
number of training examples is
训练的样本数

83
00:02:39,670 --> 00:02:41,050
small, then you know
很小 那么

84
00:02:41,230 --> 00:02:42,300
you have a huge number of
你有大量的

85
00:02:42,360 --> 00:02:43,630
features that if X, this is
特征值 如果X是

86
00:02:43,710 --> 00:02:45,850
an X is an Rn, Rn +1.
X属于Rn+1

87
00:02:46,010 --> 00:02:46,940
So if you have a
那么如果你已经有

88
00:02:47,080 --> 00:02:48,700
huge number of features already, with
大量的特征值 而只有

89
00:02:48,800 --> 00:02:50,540
a small training set, you know, maybe you
很小的训练数据集 也许你

90
00:02:50,610 --> 00:02:51,430
want to just fit a linear
就只想牛和一个线性

91
00:02:51,710 --> 00:02:52,890
decision boundary and not try
的判定边界 而不会去

92
00:02:53,060 --> 00:02:54,420
to fit a very complicated nonlinear
拟合一个非常复杂的非线性

93
00:02:54,860 --> 00:02:56,980
function, because might not have enough data.
函数 因为没有足够的数据

94
00:02:57,560 --> 00:02:59,330
And you might risk overfitting, if
你可能会过度拟合 如果

95
00:02:59,470 --> 00:03:00,530
you're trying to fit a very complicated function
你试着拟合非常复杂的函数的话

96
00:03:01,540 --> 00:03:03,220
in a very high dimensional feature space,
在一个非常高维的特征空间中

97
00:03:03,980 --> 00:03:04,990
but if your training set sample
但是如果你的训练集样本

98
00:03:05,040 --> 00:03:07,120
is small. So this
很小的话 因此

99
00:03:07,340 --> 00:03:08,600
would be one reasonable setting where
这将是一个合理的设置 在此

100
00:03:08,740 --> 00:03:09,950
you might decide to just
你可以决定

101
00:03:10,700 --> 00:03:11,960
not use a kernel, or
不使用内核参数 或

102
00:03:12,250 --> 00:03:15,580
equivalents to use what's called a linear kernel.
一些被叫做线性内核函数的等价物

103
00:03:15,740 --> 00:03:16,740
A second choice for the kernel that
对于内核函数的第二个选择是

104
00:03:16,820 --> 00:03:18,010
you might make, is this Gaussian
你可以构建 这是一个高斯

105
00:03:18,370 --> 00:03:19,920
kernel, and this is what we had previously.
内核函数 这个是我们之前有的

106
00:03:21,270 --> 00:03:22,350
And if you do this, then the
如果你选择这个 那么

107
00:03:22,440 --> 00:03:23,130
other choice you need to make
你需要做的另外一个选择是

108
00:03:23,420 --> 00:03:25,980
is to choose this parameter sigma squared
选择一个参数σ的平方

109
00:03:26,850 --> 00:03:29,800
when we also talk a little bit about the bias variance tradeoffs
当我们开始讨论一些如何权衡偏差方差的时候

110
00:03:30,820 --> 00:03:32,360
of how, if sigma squared is
如果σ2

111
00:03:32,600 --> 00:03:33,890
large, then you tend
很大 那么你就很有可能

112
00:03:34,160 --> 00:03:35,580
to have a higher bias, lower
会有一个较大的误差 较低

113
00:03:35,770 --> 00:03:37,650
variance classifier, but if
方差的分类器 但是如果

114
00:03:37,800 --> 00:03:39,700
sigma squared is small, then you
σ2很小 那么你

115
00:03:40,060 --> 00:03:42,360
have a higher variance, lower bias classifier.
就会有较大的方差 较低误差的分类器

116
00:03:43,940 --> 00:03:45,350
So when would you choose a Gaussian kernel?
那么什么时候选择高斯内核函数呢？

117
00:03:46,210 --> 00:03:48,050
Well, if your omission
如果你忽略了

118
00:03:48,310 --> 00:03:49,540
of features X, I mean
特征值X 我的意思是

119
00:03:49,820 --> 00:03:51,370
Rn, and if N
Rn 如果N

120
00:03:51,570 --> 00:03:53,890
is small, and, ideally, you know,
值很小 很理想地

121
00:03:55,660 --> 00:03:57,110
if n is large, right,
如果n值很大

122
00:03:58,470 --> 00:04:00,170
so that's if, you know, we have
那么如果我们有

123
00:04:00,550 --> 00:04:02,340
say, a two-dimensional training set,
如一个二维的训练集

124
00:04:03,130 --> 00:04:04,880
like the example I drew earlier.
就像我前面讲到的例子一样

125
00:04:05,470 --> 00:04:08,320
So n is equal to 2, but we have a pretty large training set.
那么n等于2 但是我们有相当大的训练集

126
00:04:08,680 --> 00:04:09,770
So, you know, I've drawn in a
我已经有了一个

127
00:04:09,950 --> 00:04:10,890
fairly large number of training examples,
相当大的训练样本了

128
00:04:11,650 --> 00:04:12,410
then maybe you want to use
那么可能你想用

129
00:04:12,540 --> 00:04:14,400
a kernel to fit a more
一个内核函数去拟合一个更加

130
00:04:14,910 --> 00:04:16,260
complex nonlinear decision boundary,
复杂的非线性的判定边界

131
00:04:16,650 --> 00:04:18,750
and the Gaussian kernel would be a fine way to do this.
那么高斯内核函数是一个不错的选择

132
00:04:19,480 --> 00:04:20,610
I'll say more towards the end
我会在这个视频的后面

133
00:04:20,720 --> 00:04:22,570
of the video, a little bit
部分讲到更多 一些关于

134
00:04:22,660 --> 00:04:23,760
more about when you might choose a
什么时候你可以选择

135
00:04:23,970 --> 00:04:26,310
linear kernel, a Gaussian kernel and so on.
线性内核函数 高斯内核函数等

136
00:04:27,860 --> 00:04:29,740
But if concretely, if you
但是如果具体地你

137
00:04:30,040 --> 00:04:31,210
decide to use a Gaussian
决定用高斯

138
00:04:31,720 --> 00:04:33,910
kernel, then here's what you need to do.
内核函数的话 那么这里就是你需要做的

139
00:04:35,380 --> 00:04:36,550
Depending on what support vector machine
根据你所要用的支持向量机

140
00:04:37,280 --> 00:04:38,990
software package you use, it
软件包 这

141
00:04:39,100 --> 00:04:40,960
may ask you to implement a
可能需要你实现一个

142
00:04:41,070 --> 00:04:42,200
kernel function, or to implement
核函数 或者实现

143
00:04:43,060 --> 00:04:43,880
the similarity function.
相似的函数

144
00:04:45,020 --> 00:04:46,750
So if you're using an
因此 如果你用

145
00:04:47,010 --> 00:04:49,820
octave or MATLAB implementation of
octave或者Matlab来实现

146
00:04:50,000 --> 00:04:50,720
an SVM, it may ask you
支持向量机的话 那么就需要你

147
00:04:50,810 --> 00:04:52,560
to provide a function to
提供一个函数来

148
00:04:52,690 --> 00:04:54,680
compute a particular feature of the kernel.
计算核函数的特征值

149
00:04:55,110 --> 00:04:56,480
So this is really computing f
因此这个是在一个特定值i

150
00:04:56,770 --> 00:04:57,890
subscript i for one
的情况下来

151
00:04:58,220 --> 00:04:59,560
particular value of i, where
计算fi

152
00:05:00,570 --> 00:05:02,310
f here is just a
这里的f只是一个

153
00:05:02,330 --> 00:05:03,570
single real number, so maybe
简单的实数

154
00:05:03,840 --> 00:05:05,060
I should move this better written
也许最好是写成

155
00:05:05,250 --> 00:05:07,230
f(i), but what you
f(i) 但是你所

156
00:05:07,510 --> 00:05:08,130
need to do is to write a kernel
需要做的是写一个核

157
00:05:08,480 --> 00:05:09,530
function that takes this input, you know,
函数 让它把这个作为输入 你知道的

158
00:05:10,610 --> 00:05:11,910
a training example or a
一个训练样本 或者一个

159
00:05:12,020 --> 00:05:13,140
test example whatever it takes
测试样本 不管是什么它

160
00:05:13,280 --> 00:05:14,640
in some vector X and takes
把向量X作为输入

161
00:05:14,990 --> 00:05:16,220
as input one of the
把输入作为一种

162
00:05:16,370 --> 00:05:18,270
landmarks and but
标识 不过

163
00:05:18,880 --> 00:05:20,750
only I've come down X1 and
在这里我只写了X1和

164
00:05:20,950 --> 00:05:21,810
X2 here, because the
X2 因为这些

165
00:05:21,900 --> 00:05:23,750
landmarks are really training examples as well.
标识也是训练样本

166
00:05:24,470 --> 00:05:26,160
But what you
但是你所

167
00:05:26,400 --> 00:05:27,490
need to do is write software that
需要做的是写一个

168
00:05:27,670 --> 00:05:28,960
takes this input, you know, X1, X2
可以将这些X1,X2进行输入的软

169
00:05:29,150 --> 00:05:30,320
and computes this sort
并用它们来计算

170
00:05:30,580 --> 00:05:31,950
of similarity function between them
这个相似函数

171
00:05:32,530 --> 00:05:33,470
and return a real number.
之后返回一个实数

172
00:05:36,180 --> 00:05:37,430
And so what some support vector machine
因此一些支持向量机的

173
00:05:37,580 --> 00:05:39,040
packages do is expect
包所做的是期望

174
00:05:39,510 --> 00:05:40,860
you to provide this kernel function
你能提供一个核函数

175
00:05:41,410 --> 00:05:44,580
that take this input you know, X1, X2 and returns a real number.
能够输入X1, X2 并返回一个实数

176
00:05:45,580 --> 00:05:46,460
And then it will take it from there
从这里开始

177
00:05:46,850 --> 00:05:49,070
and it will automatically generate all the features, and
它将自动地生成所有特征变量

178
00:05:49,410 --> 00:05:51,480
so automatically take X and
自动利用特征变量X

179
00:05:51,600 --> 00:05:53,370
map it to f1,
并用你写的函数对应到f1

180
00:05:53,420 --> 00:05:54,420
f2, down to f(m) using
f2 一直到f(m)

181
00:05:54,750 --> 00:05:56,200
this function that you write, and
并且

182
00:05:56,310 --> 00:05:57,190
generate all the features and
生成所有特征变量

183
00:05:57,650 --> 00:05:59,080
train the support vector machine from there.
并从这儿开始训练支持向量机

184
00:05:59,870 --> 00:06:00,800
But sometimes you do need to
但是有些时候你却一定要

185
00:06:00,880 --> 00:06:04,710
provide this function yourself.
自己提供这个函数

186
00:06:05,680 --> 00:06:06,770
Other if you are using the Gaussian kernel, some SVM implementations will also include the Gaussian kernel
如果你使用高斯核函数 一些SVM的函数实现也会包括高斯核函数

187
00:06:06,980 --> 00:06:09,950
and a
和一

188
00:06:10,040 --> 00:06:10,990
few other kernels as well, since
些其他的核函数 这是因为

189
00:06:11,230 --> 00:06:13,580
the Gaussian kernel is probably the most common kernel.
高斯核函数可能是最常见的核函数

190
00:06:14,880 --> 00:06:16,290
Gaussian and linear kernels are
到目前为止高斯核函数和线性核函数是

191
00:06:16,380 --> 00:06:18,210
really the two most popular kernels by far.
最普遍的核函数

192
00:06:19,130 --> 00:06:20,230
Just one implementational note.
一个实现函数的注意事项

193
00:06:20,750 --> 00:06:21,820
If you have features of very
如果你有大小很不一样

194
00:06:22,080 --> 00:06:23,620
different scales, it is important
的特征变量

195
00:06:24,700 --> 00:06:26,270
to perform feature scaling before
在使用高斯函数之前

196
00:06:26,600 --> 00:06:27,780
using the Gaussian kernel.
将这些特征变量的大小按比例归一化

197
00:06:28,580 --> 00:06:29,180
And here's why.
这就是原因

198
00:06:30,150 --> 00:06:31,600
If you imagine the computing
如果假设你在计算

199
00:06:32,290 --> 00:06:33,570
the norm between X and
X和l之间的标量

200
00:06:33,790 --> 00:06:34,890
l, right, so this term here,
就是这样一个式子

201
00:06:35,390 --> 00:06:37,150
and the numerator term over there.
这里是一个计算的式子

202
00:06:38,300 --> 00:06:39,780
What this is doing, the norm
这个式子所算的就是

203
00:06:40,070 --> 00:06:40,930
between X and l, that's really
X和l之间的标量

204
00:06:41,130 --> 00:06:42,140
saying, you know, let's compute the vector
换句话说，计算一个向量

205
00:06:42,450 --> 00:06:43,290
V, which is equal to
V 这个向量V等于

206
00:06:43,410 --> 00:06:44,980
X minus l. And then
X减l 然后

207
00:06:45,250 --> 00:06:47,940
let's compute the norm does
计算向量V的标量

208
00:06:48,130 --> 00:06:49,080
vector V, which is the
这就是

209
00:06:49,170 --> 00:06:50,510
difference between X. So the
与X不同的地方 因此

210
00:06:50,580 --> 00:06:51,510
norm of V is really
V的标量等于

211
00:06:53,360 --> 00:06:54,140
equal to V1 squared
V1的平方

212
00:06:54,250 --> 00:06:55,610
plus V2 squared plus
加v2的平方加

213
00:06:55,830 --> 00:06:58,290
dot dot dot, plus Vn squared.
点点点 加Vn的平方

214
00:06:58,900 --> 00:07:00,320
Because here X is in
因为这里的X属于

215
00:07:01,060 --> 00:07:02,200
Rn, or Rn
Rn 或者

216
00:07:02,290 --> 00:07:05,180
plus 1, but I'm going to ignore, you know, X0.
Rn+1 但是我容易忽略x0

217
00:07:06,540 --> 00:07:08,420
So, let's pretend X is
因此我们假设X是属于

218
00:07:08,510 --> 00:07:10,800
an Rn, square on
Rn的 在

219
00:07:10,950 --> 00:07:12,320
the left side is what makes this correct.
左边方的平方就是正确的了

220
00:07:12,570 --> 00:07:14,090
So this is equal
因此这部分就等于

221
00:07:14,400 --> 00:07:16,120
to that, right?
那个部分

222
00:07:17,210 --> 00:07:18,710
And so written differently, this is
那么另一种不同的写法就是

223
00:07:18,850 --> 00:07:20,100
going to be X1 minus l1
X1减l1

224
00:07:20,290 --> 00:07:22,600
squared, plus x2
的平方加X2

225
00:07:22,910 --> 00:07:24,590
minus l2 squared, plus
减l2的平方 加

226
00:07:24,910 --> 00:07:26,580
dot dot dot plus Xn minus
点点点 加Xn减

227
00:07:27,130 --> 00:07:28,540
ln squared.
ln的平方

228
00:07:29,720 --> 00:07:30,790
And now if your features
现在如果你的特征向量

229
00:07:31,850 --> 00:07:33,460
take on very different ranges of value.
的值的范围很不一样

230
00:07:33,940 --> 00:07:35,150
So take a housing
就拿房价预测来

231
00:07:35,360 --> 00:07:37,180
prediction, for example, if
举例 如果

232
00:07:38,020 --> 00:07:40,490
your data is some data about houses.
你的数据是一些房价数据

233
00:07:41,420 --> 00:07:43,000
And if X is in the
如果X在

234
00:07:43,140 --> 00:07:44,660
range of thousands of square
成千上万平方英尺的范围内

235
00:07:44,950 --> 00:07:47,190
feet, for the
对于

236
00:07:48,010 --> 00:07:48,840
first feature, X1.
第一个特征变量X1

237
00:07:49,700 --> 00:07:51,630
But if your second feature, X2 is the number of bedrooms.
但是如果你的第二个特征向量X2是卧室的数量

238
00:07:52,540 --> 00:07:53,610
So if this is in the
且如果它在

239
00:07:53,730 --> 00:07:56,720
range of one to five bedrooms, then
一到五个卧室范围内 那么

240
00:07:57,810 --> 00:07:59,320
X1 minus l1 is going to be huge.
X1减l1将会很大

241
00:07:59,780 --> 00:08:00,820
This could be like a thousand squared,
这有可能上千数值的平方

242
00:08:01,000 --> 00:08:02,880
whereas X2 minus l2
然而X2减l2

243
00:08:03,200 --> 00:08:04,620
is going to be much smaller and if
将会变得很小 如果

244
00:08:04,750 --> 00:08:06,800
that's the case, then in this term,
是在这样的情况下的话 那么在这个式子中

245
00:08:08,320 --> 00:08:09,660
those distances will be almost
这些间距将几乎

246
00:08:10,060 --> 00:08:12,060
essentially dominated by the
都是由

247
00:08:12,570 --> 00:08:13,280
sizes of the houses
房子的大小来决定的

248
00:08:14,390 --> 00:08:15,760
and the number of bathrooms would be largely ignored.
从而护绿了卧室的数量

249
00:08:16,950 --> 00:08:18,060
As so as, to avoid this in
为了避免这种情况

250
00:08:18,230 --> 00:08:19,070
order to make a machine work
让机器算法得以

251
00:08:19,360 --> 00:08:21,890
well, do perform future scaling.
很好的实现 就需要进一步地缩放比例

252
00:08:23,420 --> 00:08:24,830
And that will sure that the SVM
这将会保证SVM

253
00:08:25,810 --> 00:08:27,020
gives, you know, comparable amount of attention
能考虑到

254
00:08:27,950 --> 00:08:28,870
to all of your different features,
所有不同的特征变量

255
00:08:29,190 --> 00:08:30,450
and not just to in
而不只是像

256
00:08:30,600 --> 00:08:31,870
this example to size of
例子中的那样

257
00:08:32,150 --> 00:08:33,440
houses were big movement here the features.
房子的大小影响特别大 这就是特征变量

258
00:08:34,700 --> 00:08:35,810
When you try a support vector
当你尝试支持向量机

259
00:08:36,110 --> 00:08:38,760
machines chances are by
时 目前为止你最有

260
00:08:38,970 --> 00:08:40,000
far the two most common
可能用到的两个最常用

261
00:08:40,460 --> 00:08:41,750
kernels you use will
的核函数是

262
00:08:41,850 --> 00:08:43,120
be the linear kernel, meaning no
线性核函数 意思就是

263
00:08:43,320 --> 00:08:45,600
kernel, or the Gaussian kernel that we talked about.
没有核参数的函数 或是我们所讨论到的高斯核函数

264
00:08:46,520 --> 00:08:47,390
And just one note of warning
这里有一个警告

265
00:08:47,900 --> 00:08:49,070
which is that not all similarity
不是所有你可能提出来

266
00:08:49,580 --> 00:08:50,590
functions you might come up
的相似函数

267
00:08:50,770 --> 00:08:52,520
with are valid kernels.
都是有效的核函数

268
00:08:53,450 --> 00:08:54,840
And the Gaussian kernel and the linear
高斯核函数和线性

269
00:08:55,090 --> 00:08:56,410
kernel and other kernels that you
核函数以及其你有时

270
00:08:56,710 --> 00:08:57,850
sometimes others will use, all
可能会用到的核函数 所有的

271
00:08:58,030 --> 00:08:59,840
of them need to satisfy a technical condition.
这些函数都需要满足一个技术条件

272
00:09:00,380 --> 00:09:02,510
It's called Mercer's Theorem and
它叫作默塞尔定理

273
00:09:02,630 --> 00:09:03,560
the reason you need to this
需要满足这个条件的原因是

274
00:09:03,710 --> 00:09:05,430
is because support vector machine
因为支持向量机

275
00:09:06,380 --> 00:09:08,140
algorithms or implementations of the
算法或者

276
00:09:08,480 --> 00:09:09,560
SVM have lots of clever
SVM的实现函数有许多熟练的

277
00:09:10,050 --> 00:09:11,380
numerical optimization tricks.
数值优化技巧

278
00:09:12,110 --> 00:09:13,270
In order to solve for the
为了有效地求解

279
00:09:13,340 --> 00:09:15,650
parameter's theta efficiently and
参数θ

280
00:09:16,590 --> 00:09:18,840
in the original design envisaged,
在最初的设想里

281
00:09:19,470 --> 00:09:21,010
those are decision made to restrict
这些决策都用以

282
00:09:21,540 --> 00:09:22,900
our attention only to kernels
将我们的注意力仅仅限制在

283
00:09:23,510 --> 00:09:25,860
that satisfy this technical condition called Mercer's Theorem.
可以满足默塞尔定理的核函数上

284
00:09:26,280 --> 00:09:27,360
And what that does is, that
这个定理所做的是

285
00:09:27,570 --> 00:09:28,540
makes sure that all of these
确保所有的

286
00:09:28,820 --> 00:09:30,270
SVM packages, all of these SVM
SVM包 所有的SVM

287
00:09:30,500 --> 00:09:32,210
software packages can use the
软件包能够用

288
00:09:32,310 --> 00:09:34,740
large class of optimizations and
大类的优化方法并

289
00:09:35,280 --> 00:09:37,470
get the parameter theta very quickly.
很快得到参数θ

290
00:09:39,320 --> 00:09:40,340
So, what most people end up doing
大多数人最后要做的是

291
00:09:40,840 --> 00:09:42,470
is using either the linear
用线性核函数

292
00:09:42,610 --> 00:09:44,210
or Gaussian kernel, but there
或者高斯核函数 但是

293
00:09:44,430 --> 00:09:45,610
are a few other kernels that also
有几个其他的几个核函数也

294
00:09:45,940 --> 00:09:47,460
satisfy Mercer's theorem and
满足默塞尔定理

295
00:09:47,560 --> 00:09:48,690
that you may run across other
你可能会遇到其他人使用这些核函数

296
00:09:48,850 --> 00:09:50,050
people using, although I personally
然而我个人

297
00:09:50,880 --> 00:09:53,780
end up using other kernels you know, very, very rarely, if at all.
最后是很少很少使用其他核函数

298
00:09:54,160 --> 00:09:56,990
Just to mention some of the other kernels that you may run across.
只是简单提及一下你可能会遇到的其他核函数

299
00:09:57,990 --> 00:10:00,300
One is the polynomial kernel.
一个是多项式核函数

300
00:10:01,570 --> 00:10:03,350
And for that the similarity between
X和l之间的

301
00:10:03,800 --> 00:10:05,520
X and l is
相似值

302
00:10:05,730 --> 00:10:06,760
defined as, there are
定义为 有

303
00:10:06,830 --> 00:10:07,880
a lot of options, you can
很多种选择 你可以

304
00:10:08,640 --> 00:10:10,370
take X transpose l squared.
用X的转置I的平方

305
00:10:10,960 --> 00:10:13,410
So, here's one measure of how similar X and l are.
那么这里就有一个估计X和l相似度的估量

306
00:10:13,610 --> 00:10:14,930
If X and l are very close with
如果X和l相互之间很接近

307
00:10:15,500 --> 00:10:18,260
each other, then the inner product will tend to be large.
那么这个内积就会很大

308
00:10:20,200 --> 00:10:21,870
And so, you know, this is a slightly
这是一个有些

309
00:10:23,080 --> 00:10:23,520
unusual kernel.
不寻常的核函数

310
00:10:24,000 --> 00:10:25,130
That is not used that often, but
它并不那么常用 但是

311
00:10:26,490 --> 00:10:29,190
you may run across some people using it.
你可能会见到有些人使用它

312
00:10:30,050 --> 00:10:31,810
This is one version of a polynomial kernel.
这是一个多项式核函数的变体

313
00:10:32,330 --> 00:10:35,090
Another is X transpose l cubed.
另外一个是X转置I 的立方

314
00:10:36,690 --> 00:10:38,780
These are all examples of the polynomial kernel.
这些是多项式核函数的所有例子

315
00:10:39,040 --> 00:10:41,270
X transpose l plus 1 cubed.
X的转置l加1的立方

316
00:10:42,560 --> 00:10:43,620
X transpose l plus maybe
X的转置加 可以

317
00:10:43,910 --> 00:10:44,930
a number different then one 5
是一个不同的数 然后一个5

318
00:10:44,970 --> 00:10:46,680
and, you know, to the power of 4 and
的4次方

319
00:10:47,700 --> 00:10:49,840
so the polynomial kernel actually has two parameters.
多项式内核函数实际上有两个参数

320
00:10:50,610 --> 00:10:53,020
One is, what number do you add over here?
一个是 你需要在这里加上一个什么样的数字

321
00:10:53,520 --> 00:10:53,920
It could be 0.
可能是0

322
00:10:54,430 --> 00:10:58,660
This is really plus 0 over there, as well as what's the degree of the polynomial over there.
这里就是一个0 这些是多项式的次数

323
00:10:58,680 --> 00:11:01,670
So the degree power and these numbers.
这些数字就是多项式的次数

324
00:11:02,250 --> 00:11:04,140
And the more general form of the
多项式核函数更一般的

325
00:11:04,280 --> 00:11:05,530
polynomial kernel is X
形式是X

326
00:11:05,720 --> 00:11:07,620
transpose l, plus some
转置I 加上一些

327
00:11:07,940 --> 00:11:11,510
constant and then
常数项 然后

328
00:11:11,800 --> 00:11:14,850
to some degree in the
是指数部分

329
00:11:15,060 --> 00:11:16,720
X1 and so both
（在X1上）这两个

330
00:11:16,940 --> 00:11:19,650
of these are parameters for the polynomial kernel.
都是多项式核函数的参数

331
00:11:20,510 --> 00:11:22,820
So the polynomial kernel almost always
多项式核函数几乎总是

332
00:11:23,350 --> 00:11:24,440
or usually performs worse.
或者经常执行的效果都比较差

333
00:11:24,820 --> 00:11:25,950
And the Gaussian kernel does not
高斯核函数用得比不是

334
00:11:26,270 --> 00:11:28,370
use that much, but this is just something that you may run across.
那么多 但是你有可能会碰到

335
00:11:29,320 --> 00:11:30,480
Usually it is used only for
通常它用在

336
00:11:30,750 --> 00:11:31,710
data where X and l
当X和l

337
00:11:32,000 --> 00:11:33,180
are all strictly non negative,
都是严格的非负数时

338
00:11:33,740 --> 00:11:34,720
and so that ensures that these
这样以保证这些

339
00:11:34,910 --> 00:11:36,710
inner products are never negative.
内积值永远不会是负数

340
00:11:37,850 --> 00:11:40,010
And this captures the intuition that
这扑捉到了

341
00:11:40,390 --> 00:11:41,340
X and l are very similar
X和l之间非常相似

342
00:11:41,540 --> 00:11:44,110
to each other, then maybe the inter product between them will be large.
也许它们之间的内积会很大

343
00:11:44,420 --> 00:11:45,590
They have some other properties as well
它们也有其他的一些性质

344
00:11:46,260 --> 00:11:48,080
but people tend not to use it much.
但是人们通常用得不多

345
00:11:49,130 --> 00:11:50,150
And then, depending on what you're
那么 根据你所做的

346
00:11:50,260 --> 00:11:51,210
doing, there are other, sort of more
也有其他一些更加

347
00:11:52,330 --> 00:11:54,950
esoteric kernels as well, that you may come across.
难懂的核函数 这你也有可能会碰到

348
00:11:55,670 --> 00:11:57,180
You know, there's a string kernel, this
有字符串核函数 这个

349
00:11:57,340 --> 00:11:58,430
is sometimes used if your
在你的

350
00:11:58,550 --> 00:12:01,350
input data is text strings or other types of strings.
输入数据是文本字符串或者其他类型的字符串时 有时会用到

351
00:12:02,270 --> 00:12:02,940
There are things like the
还有一些函数 如

352
00:12:03,260 --> 00:12:06,000
chi-square kernel, the histogram intersection kernel, and so on.
卡方核函数 直方相交核函数 等等

353
00:12:06,690 --> 00:12:08,420
There are sort of more esoteric kernels that
有一些难懂的核函数

354
00:12:08,660 --> 00:12:09,840
you can use to measure similarity
这样一些函数你可以用来估量

355
00:12:10,760 --> 00:12:12,030
between different objects.
不同对象之间的相似性

356
00:12:12,660 --> 00:12:13,800
So for example, if you're trying to
例如 如果你在尝试

357
00:12:14,380 --> 00:12:15,840
do some sort of text classification
做一些文本分类的

358
00:12:16,170 --> 00:12:17,060
problem, where the input
问题 这个问题中输入

359
00:12:17,200 --> 00:12:19,300
x is a string then
变量X是一个字符串 然后

360
00:12:19,490 --> 00:12:20,490
maybe we want to find the
我们也许想要通过字符串核函数来找到

361
00:12:20,550 --> 00:12:22,050
similarity between two strings
两个字符串间的相似度

362
00:12:22,430 --> 00:12:24,240
using the string kernel, but I
但是我

363
00:12:24,520 --> 00:12:26,440
personally you know end up very rarely,
个人很少用这个

364
00:12:26,990 --> 00:12:29,340
if at all, using these more esoteric kernels. I
如果真要用的话 就用这些更加难懂的核函数 我

365
00:12:29,880 --> 00:12:30,970
think I might have use the chi-square
认为我可以用卡方

366
00:12:31,170 --> 00:12:32,270
kernel, may be once in
核函数 也许是我

367
00:12:32,340 --> 00:12:33,670
my life and the histogram kernel,
人生唯一一次 和直方核函数

368
00:12:34,240 --> 00:12:35,580
may be once or twice in my life. I've
也许是我人生中的一次或者两次用它 我

369
00:12:35,630 --> 00:12:38,500
actually never used the string kernel myself. But in
实际上从来不用字符串核函数 但是

370
00:12:39,350 --> 00:12:41,560
case you've run across this in other applications. You know, if
以防万一你已经在其他应用中碰到了这样的情况 如果

371
00:12:42,700 --> 00:12:43,640
you do a quick web
你快速搜索网页

372
00:12:43,860 --> 00:12:44,850
search we do a quick Google
用快速的谷歌

373
00:12:45,040 --> 00:12:46,000
search or quick Bing search
搜索或者快速的Bing搜索

374
00:12:46,590 --> 00:12:48,240
you should have found definitions that these are the kernels as well. So
你应该已经发现这些定义也是核函数

375
00:12:51,480 --> 00:12:55,680
just two last details I want to talk about in this video. One in multiclass classification. So, you
我想要在这个视频里讨论的最后两个细节 一个是在多类分类中

376
00:12:56,370 --> 00:12:59,510
have four classes or more generally
你有4个类别或者更一般的是

377
00:12:59,800 --> 00:13:01,880
3 classes output some appropriate
3（K）个类别  输出的是一些在你的多个类别间恰当的

378
00:13:02,530 --> 00:13:06,860
decision bounday between your multiple classes. Most SVM, many SVM
判定边界 大部分的SVM 许多SVM

379
00:13:07,220 --> 00:13:08,750
packages already have built-in
包已经建立了

380
00:13:09,030 --> 00:13:10,430
multiclass classification functionality. So
多类分类的函数了 因此

381
00:13:11,100 --> 00:13:12,060
if your using a pattern like
如果你使用一个类似与这样的一个模式

382
00:13:12,270 --> 00:13:13,320
that, you just use the
你只是用了

383
00:13:13,540 --> 00:13:15,370
both that functionality and that
这样的函数 且

384
00:13:15,490 --> 00:13:16,940
should work fine. Otherwise,
应该会工作得很好 此外

385
00:13:17,790 --> 00:13:18,790
one way to do this
实现这个的一个方式

386
00:13:19,000 --> 00:13:19,880
is to use the one
是用 one-vs

387
00:13:20,000 --> 00:13:21,280
versus all method that we
.all method（一对多） 的方法 这个我们在

388
00:13:21,370 --> 00:13:23,690
talked about when we are developing logistic regression. So
讲解逻辑回归的时候讨论过

389
00:13:24,680 --> 00:13:25,410
what you do is you trade
所以你要做的是训练

390
00:13:26,160 --> 00:13:27,550
kSVM's if you have
kSVM 如果你有

391
00:13:27,700 --> 00:13:29,190
k classes, one to distinguish
k个类别 用以将

392
00:13:29,900 --> 00:13:31,060
each of the classes from the rest.
每个类别从其他的类别中区分开来

393
00:13:31,850 --> 00:13:32,930
And this would give you k parameter
这个会给你K参数的

394
00:13:33,520 --> 00:13:34,530
vectors, so this will
向量 这个会

395
00:13:34,680 --> 00:13:36,210
give you, upi lmpw. theta 1, which
给你θ1 这会

396
00:13:36,530 --> 00:13:38,170
is trying to distinguish class y equals
尝试从所有其他

397
00:13:38,630 --> 00:13:39,980
one from all of
类别中识别出y=1的类别

398
00:13:40,130 --> 00:13:41,340
the other classes, then you
之后你

399
00:13:41,420 --> 00:13:42,910
get the second parameter, vector
得到第二个参数 向量

400
00:13:42,970 --> 00:13:43,910
theta 2, which is what
θ2 这个是

401
00:13:44,020 --> 00:13:45,420
you get when you, you know, have
在把

402
00:13:45,720 --> 00:13:47,080
y equals 2 as the positive class
y=2作为正类别

403
00:13:47,460 --> 00:13:48,680
and all the others as negative class
其他的作为负类别时 得到的

404
00:13:49,260 --> 00:13:50,550
and so on up to
以此类推

405
00:13:50,800 --> 00:13:52,400
a parameter vector theta k,
参数向量θk

406
00:13:52,750 --> 00:13:54,520
which is the parameter vector for
是用于

407
00:13:54,600 --> 00:13:56,770
distinguishing the final class
识别最后一个类别参数向量

408
00:13:57,360 --> 00:13:59,380
key from anything else, and
最后

409
00:13:59,490 --> 00:14:00,590
then lastly, this is exactly
这就与

410
00:14:01,270 --> 00:14:02,040
the same as the one versus
我们在逻辑回归中用到的

411
00:14:02,420 --> 00:14:04,230
all method we have for logistic regression.
一对多（one-vs-all）的方法一样

412
00:14:04,760 --> 00:14:05,910
Where we you just predict the class
在逻辑回归中我们只是

413
00:14:06,390 --> 00:14:07,690
i with the largest theta
用最大的θ转置X来预测类别i

414
00:14:08,030 --> 00:14:11,840
transpose X.  So let's multiclass classification designate.
定义一下多类分类

415
00:14:12,440 --> 00:14:13,750
For the more common cases
对于更为常见的情况

416
00:14:14,300 --> 00:14:15,090
that there is a good
很有

417
00:14:15,180 --> 00:14:16,460
chance that whatever software package
可能不过你使用神马软件包

418
00:14:16,780 --> 00:14:18,010
you use, you know, there will be
都有可能

419
00:14:18,340 --> 00:14:19,650
a reasonable chance that are already
已经建立了

420
00:14:19,920 --> 00:14:21,740
have built in multiclass classification functionality,
多类分类的函数功能

421
00:14:21,920 --> 00:14:24,410
and so you don't need to worry about this result.
因此你不用去当心这种资源

422
00:14:25,280 --> 00:14:27,010
Finally, we developed support vector
最后 我们开展

423
00:14:27,210 --> 00:14:28,650
machines starting off with logistic
以逻辑回归开始的支持向量机

424
00:14:29,090 --> 00:14:31,500
regression and then modifying the cost function a little bit.
然后改造一下代价函数

425
00:14:31,910 --> 00:14:34,900
The last thing we want to do in this video is, just say a little bit about.
最后我想要在这个视频中讨论一点的是

426
00:14:35,550 --> 00:14:36,570
when you will use one of
当你要使用

427
00:14:36,660 --> 00:14:38,840
these two algorithms, so let's
这两个算法中的一个时

428
00:14:39,080 --> 00:14:40,000
say n is the number
假设n是一个

429
00:14:40,160 --> 00:14:42,000
of features and m is the number of training examples.
特征变量 m是一个训练样本

430
00:14:43,190 --> 00:14:45,250
So, when should we use one algorithm versus the other?
那么我们什么时候用哪一个呢？

431
00:14:47,130 --> 00:14:48,430
Well, if n is larger
如果n

432
00:14:48,980 --> 00:14:50,140
relative to your training set
相对于你的训练数据集大小较大的时候

433
00:14:50,360 --> 00:14:51,390
size, so for example,
比如

434
00:14:52,810 --> 00:14:53,990
if you take a business
如果你有一个

435
00:14:54,250 --> 00:14:55,180
with a number of features this is
特征变量的数值 这个数值

436
00:14:55,330 --> 00:14:56,870
much larger than m and this
比m要大很多

437
00:14:57,120 --> 00:14:58,210
might be, for example, if you
这可以是 比如说 如果你

438
00:14:58,320 --> 00:15:00,590
have a text classification problem, where
有一个文本分类的问题 在这里

439
00:15:01,550 --> 00:15:02,430
you know, the dimension of the feature
特征向量的维数

440
00:15:02,700 --> 00:15:04,160
vector is I don't know, maybe, 10 thousand.
我是不知道的 有可能是1万（10千）

441
00:15:05,370 --> 00:15:06,350
And if your training
且如果你的训练

442
00:15:06,720 --> 00:15:08,290
set size is maybe 10
集的大小有可能是10

443
00:15:08,510 --> 00:15:10,250
you know, maybe, up to 1000.
到1000范围内

444
00:15:10,500 --> 00:15:12,140
So, imagine a spam
想象一下垃圾邮件

445
00:15:12,320 --> 00:15:14,250
classification problem, where email
的分类问题 在这个问题中

446
00:15:14,510 --> 00:15:15,840
spam, where you have 10,000
你有 （1万）10,000

447
00:15:16,150 --> 00:15:18,010
features corresponding to 10,000 words
个与10,000（1万）个单词对应的特征向量

448
00:15:18,190 --> 00:15:19,550
but you have, you know, maybe 10
但是你可能有10

449
00:15:19,780 --> 00:15:21,150
training examples or maybe up to 1,000 examples.
训练样本 也可能有多达 1000个训练样本

450
00:15:22,450 --> 00:15:23,750
So if n is large relative to
如果n相对于

451
00:15:23,890 --> 00:15:25,090
m, then what I
m足够大的话 那么我

452
00:15:25,250 --> 00:15:26,480
would usually do is use logistic
通常所做的就是使用逻辑

453
00:15:26,850 --> 00:15:27,990
regression or use it
回归 或者

454
00:15:28,100 --> 00:15:29,030
as the m without a kernel or
用不带核函数的m 或者

455
00:15:29,460 --> 00:15:30,790
use it with a linear kernel.
使用线性核函数

456
00:15:31,620 --> 00:15:32,430
Because, if you have so many
因为 如果你有许多

457
00:15:32,580 --> 00:15:33,830
features with smaller training sets, you know,
特征变量 而有相对较小的训练集

458
00:15:34,530 --> 00:15:35,870
a linear function will probably
线性函数可能会

459
00:15:36,330 --> 00:15:37,380
do fine, and you don't have
工作得很好 而且你也没有

460
00:15:37,640 --> 00:15:38,790
really enough data to
足够的数据

461
00:15:38,910 --> 00:15:40,760
fit a very complicated nonlinear function.
来拟合非常复杂的非线性函数

462
00:15:41,340 --> 00:15:42,410
Now if is n is
现在如果n

463
00:15:42,520 --> 00:15:44,020
small and m is
较小 m

464
00:15:44,350 --> 00:15:45,890
intermediate what I mean
大小适中 我的意思是

465
00:15:45,940 --> 00:15:47,450
by this is n is
在这里n

466
00:15:48,040 --> 00:15:50,350
maybe anywhere from 1 - 1000, 1 would be very small.
可能是1-1000之间的任何数 1会很小

467
00:15:50,530 --> 00:15:51,470
But maybe up to 1000
也许也会到1000个

468
00:15:51,700 --> 00:15:54,270
features and if
变量 如果

469
00:15:54,590 --> 00:15:56,180
the number of training
训练样本的数量

470
00:15:56,330 --> 00:15:57,700
examples is maybe anywhere from
可能是从

471
00:15:58,210 --> 00:16:00,750
10, you know, 10 to maybe up to 10,000 examples.
10 也许是从10到10,000中的任何一个数值

472
00:16:01,350 --> 00:16:03,160
Maybe up to 50,000 examples.
也许多达（5万）50,000个样本

473
00:16:03,630 --> 00:16:06,490
If m is pretty big like maybe 10,000 but not a million.
如果m做够大 如可能是（一万）10,000 但是不是一百万

474
00:16:06,760 --> 00:16:08,100
Right? So if m is an
因此如果m是一个

475
00:16:08,300 --> 00:16:09,950
intermediate size then often
大小合适的数值 那么通常

476
00:16:10,790 --> 00:16:12,980
an SVM with a linear kernel will work well.
线性核函数的SVM会工作得很好

477
00:16:13,530 --> 00:16:14,580
We talked about this early as
这个我们在这之前也讨论过

478
00:16:14,710 --> 00:16:15,800
well, with the one concrete example,
一个具体的例子

479
00:16:16,350 --> 00:16:17,100
this would be if you have
是这样一个例子 如果你有

480
00:16:17,520 --> 00:16:19,720
a two dimensional training set. So, if n
一个二维的训练集 如果n

481
00:16:19,900 --> 00:16:21,010
is equal to 2 where you
等于2 在这里你

482
00:16:21,320 --> 00:16:23,710
have, you know, drawing in a pretty large number of training examples.
可以画相当多的训练样本

483
00:16:24,710 --> 00:16:25,860
So Gaussian kernel will do
高斯核函数可以

484
00:16:26,130 --> 00:16:28,160
a pretty good job separating positive and negative classes.
很好地把正类和负类区分开来

485
00:16:29,770 --> 00:16:30,890
One third setting that's of
第三种有趣的情况是

486
00:16:30,980 --> 00:16:32,420
interest is if n is
如果n很

487
00:16:32,520 --> 00:16:34,270
small but m is large.
小 但是m很大

488
00:16:34,890 --> 00:16:36,560
So if n is you know, again maybe
如果n也是

489
00:16:37,390 --> 00:16:39,280
1 to 1000, could be larger.
1到1000之间的数 可能会更大一点

490
00:16:40,200 --> 00:16:42,750
But if m was, maybe
但是如果m是

491
00:16:43,320 --> 00:16:46,400
50,000 and greater to millions.
（5万）50,000 或者更大 大到上百万

492
00:16:47,520 --> 00:16:50,270
So, 50,000, a 100,000, million, trillion.
（5万） 50,000, （10万）100,000, 百万，万亿

493
00:16:51,290 --> 00:16:54,020
You have very very large training set sizes, right.
你有很大很大的数据训练集

494
00:16:55,240 --> 00:16:56,160
So if this is the case,
如果是这样的情况

495
00:16:56,380 --> 00:16:57,630
then a SVM of the
那么高斯核函数的支持向量机

496
00:16:57,900 --> 00:16:59,850
Gaussian Kernel will be somewhat slow to run.
运行起来就会很慢

497
00:17:00,160 --> 00:17:02,300
Today's SVM packages, if you're
如果你用高斯核函数的话 你就会知道今天的SVM包

498
00:17:02,410 --> 00:17:04,900
using a Gaussian Kernel, tend to struggle a bit.
运行这样的函数会很慢

499
00:17:05,050 --> 00:17:06,250
If you have, you know, maybe 50
如果你有50 千（5万）

500
00:17:06,590 --> 00:17:07,530
thousands okay, but if you
但是你

501
00:17:07,620 --> 00:17:10,250
have a million training examples, maybe
有百万个训练样本 也许

502
00:17:10,450 --> 00:17:11,950
or even a 100,000 with a
甚至是100,000（十万）个

503
00:17:12,170 --> 00:17:13,730
massive value of m. Today's
m值很大的训练样本 今天的

504
00:17:14,180 --> 00:17:15,590
SVM packages are very good,
SVM包会工作得很好

505
00:17:15,870 --> 00:17:17,100
but they can still struggle
但是它们仍然会有

506
00:17:17,600 --> 00:17:18,400
a little bit when you have a
一些慢 当你有

507
00:17:19,010 --> 00:17:20,940
massive, massive trainings that size when using a Gaussian Kernel.
非常非常大的训练集 且用高斯核函数是

508
00:17:22,050 --> 00:17:23,150
So in that case, what I
在这种情况下 我

509
00:17:23,350 --> 00:17:24,960
would usually do is try to just
经常会做的是尝试

510
00:17:25,330 --> 00:17:26,660
manually create have more
手动地创建 拥有更多

511
00:17:26,800 --> 00:17:28,600
features and then use
的特征变量 然后用

512
00:17:28,930 --> 00:17:30,340
logistic regression or an SVM
逻辑回归或者

513
00:17:30,630 --> 00:17:32,060
without the Kernel.
不带核函数的支持向量机

514
00:17:33,140 --> 00:17:34,030
And in case you look at this
如果你看到这张

515
00:17:34,230 --> 00:17:35,900
slide and you see logistic regression
幻灯片 看到了逻辑回归

516
00:17:36,460 --> 00:17:37,750
or SVM without a kernel.
或者不带核函数的支持向量机

517
00:17:38,510 --> 00:17:39,890
In both of these places, I
在这个两个地方

518
00:17:39,980 --> 00:17:41,750
kind of paired them together. There's
我把它们放在一起

519
00:17:42,060 --> 00:17:43,050
a reason for that, is that
是有原因的 原因是

520
00:17:43,900 --> 00:17:45,640
logistic regression and SVM without
逻辑回归和不带核函数的支持向量机

521
00:17:46,000 --> 00:17:47,130
the kernel, those are really pretty
它们都是非常

522
00:17:47,350 --> 00:17:49,450
similar algorithms and, you know, either
相似的算法 不管是

523
00:17:49,680 --> 00:17:51,170
logistic regression or SVM
逻辑回归还是

524
00:17:51,500 --> 00:17:53,230
without a kernel will usually do
不带核函数的SVM通常都会做

525
00:17:53,380 --> 00:17:54,780
pretty similar things and give
相似的事情 并给出

526
00:17:54,900 --> 00:17:56,690
pretty similar performance, but depending
相似的结果 但是根据

527
00:17:57,060 --> 00:18:00,340
on your implementational details, one may be more efficient than the other.
你实现的情况 其中一个可能会比另一个更加有效

528
00:18:00,930 --> 00:18:02,220
But, where one of
但是在其中一个

529
00:18:02,310 --> 00:18:03,530
these algorithms applies, logistic
算法应用的地方 逻辑

530
00:18:03,740 --> 00:18:05,190
regression where SVM without a
回归或不带

531
00:18:05,420 --> 00:18:05,840
kernel, the other one is to likely
核函数的SVM 另一个也很有可能

532
00:18:06,650 --> 00:18:07,600
to work pretty well as well.
很有效

533
00:18:08,540 --> 00:18:09,660
But along with the power of
但是随着

534
00:18:09,720 --> 00:18:11,610
the SVM is when you
SVM的复杂度增加 当你

535
00:18:11,810 --> 00:18:14,100
use different kernels to learn
使用不同的内核函数来学习

536
00:18:14,430 --> 00:18:15,860
complex nonlinear functions.
复杂的非线性函数时

537
00:18:16,680 --> 00:18:20,300
And this regime, you know, when you
这个体系 你知道的 当你

538
00:18:20,550 --> 00:18:22,530
have maybe up to 10,000 examples, maybe up to 50,000.
有多达1万（10,000）的样本时 也可能是5万（50,000）

539
00:18:22,610 --> 00:18:25,010
And your number of features,
你特征变量的数量

540
00:18:26,580 --> 00:18:27,540
this is reasonably large.
这是相当大的

541
00:18:27,840 --> 00:18:29,230
That's a very common regime
那是一个非常常见的体系

542
00:18:29,670 --> 00:18:30,910
and maybe that's a regime
也许在这个体系里

543
00:18:31,430 --> 00:18:33,830
where a support vector machine with a kernel kernel will shine.
不带核函数的支持向量机就会表现得相当突出

544
00:18:34,320 --> 00:18:35,640
You can do things that are much
你可以做比这

545
00:18:35,860 --> 00:18:39,850
harder to do that will need logistic regression.
困难得多需要逻辑回归的事情

546
00:18:40,100 --> 00:18:40,930
And finally, where do neural networks fit in?
最后 神经网络使用于什么时候呢？

547
00:18:41,120 --> 00:18:42,230
Well for all of these
对于所有的这些

548
00:18:42,440 --> 00:18:43,890
problems, for all of
问题 对于所有的

549
00:18:43,960 --> 00:18:46,310
these different regimes, a well
这些不同体系 一个

550
00:18:46,630 --> 00:18:49,110
designed neural network is likely to work well as well.
设计得很好的神经网络也很有可能会非常有效

551
00:18:50,320 --> 00:18:51,700
The one disadvantage, or the one
有一个缺点是 或者说是

552
00:18:51,830 --> 00:18:52,980
reason that might not sometimes use
有时可能不会使用

553
00:18:53,220 --> 00:18:54,690
the neural network is that,
神经网络的原因是

554
00:18:54,920 --> 00:18:56,080
for some of these problems, the
对于许多这样的问题

555
00:18:56,180 --> 00:18:57,640
neural network might be slow to train.
神经网络训练起来可能会特别慢

556
00:18:58,250 --> 00:18:59,080
But if you have a very good
但是如果你有一个非常好的

557
00:18:59,350 --> 00:19:01,190
SVM implementation package, that
SVM实现包 它

558
00:19:01,400 --> 00:19:04,120
could run faster, quite a bit faster than your neural network.
可能会运行得比较快 比神经网络快很多

559
00:19:05,130 --> 00:19:06,130
And, although we didn't show this
尽管我们在此之前没有展示

560
00:19:06,350 --> 00:19:07,520
earlier, it turns out that
但是事实证明

561
00:19:07,630 --> 00:19:09,800
the optimization problem that the
SVM具有的优化问题

562
00:19:10,070 --> 00:19:11,120
SVM has is a convex
是一种凸

563
00:19:12,320 --> 00:19:13,830
optimization problem and so the
优化问题 因此

564
00:19:14,410 --> 00:19:15,800
good SVM optimization software
好的SVM优化软件

565
00:19:16,160 --> 00:19:17,870
packages will always find
包总是会找到

566
00:19:18,240 --> 00:19:21,370
the global minimum or something close to it.
全局最小值 或者接近它的值

567
00:19:21,720 --> 00:19:24,100
And so for the SVM you don't need to worry about local optima.
对于SVM 你不需要担心局部最优

568
00:19:25,280 --> 00:19:26,440
In practice local optima aren't
在实际应用中 局部最优不是

569
00:19:26,580 --> 00:19:27,920
a huge problem for neural networks
神经网络所需要解决的一个重大问题

570
00:19:28,090 --> 00:19:29,120
but they all solve, so this
所以这是

571
00:19:29,310 --> 00:19:31,520
is one less thing to worry about if you're using an SVM.
你在使用SVM的时候不需要太去担心的一个问题

572
00:19:33,350 --> 00:19:34,560
And depending on your problem, the neural
根据你的问题 神经

573
00:19:34,910 --> 00:19:37,050
network may be slower, especially
网络可能会比SVM慢 尤其是

574
00:19:37,580 --> 00:19:41,020
in this sort of regime than the SVM.
在这样一个体系中

575
00:19:41,420 --> 00:19:42,200
In case the guidelines they gave
至于这里给出的参考

576
00:19:42,520 --> 00:19:43,500
here, seem a little bit vague
看上去有些模糊

577
00:19:43,860 --> 00:19:44,600
and if you're looking at some problems, you know,
如果你在考虑一些问题

578
00:19:46,930 --> 00:19:48,050
the guidelines are a bit
这些参考会有一些

579
00:19:48,170 --> 00:19:49,190
vague, I'm still not entirely
模糊 但是我仍然不能完全

580
00:19:49,570 --> 00:19:50,730
sure, should I use this
确定 我是该用这个

581
00:19:50,780 --> 00:19:52,690
algorithm or that algorithm, that's actually okay.
算法还是改用那个算法 这个没有太大关系

582
00:19:52,950 --> 00:19:54,100
When I face a machine learning
当我遇到机器学习

583
00:19:54,330 --> 00:19:55,570
problem, you know, sometimes its actually
问题的时候 有时它确实

584
00:19:55,730 --> 00:19:57,010
just not clear whether that's the
不清楚这是否是

585
00:19:57,150 --> 00:19:58,700
best algorithm to use, but as
最好的算法 但是就如

586
00:19:59,540 --> 00:20:00,590
you saw in the earlier videos, really,
在之前的视频中看到的

587
00:20:01,200 --> 00:20:02,470
you know, the algorithm does
算法确实

588
00:20:02,700 --> 00:20:03,920
matter, but what often matters
很重要 但是经常

589
00:20:04,250 --> 00:20:06,400
even more is things like, how much data do you have.
更加重要的是 你有多少数据

590
00:20:07,090 --> 00:20:08,280
And how skilled are you, how
你有多熟练

591
00:20:08,450 --> 00:20:09,500
good are you at doing error
是否擅长做误差

592
00:20:09,750 --> 00:20:11,450
analysis and debugging learning
分析和排除学习

593
00:20:11,660 --> 00:20:13,090
algorithms, figuring out how
算法 指出如何

594
00:20:13,220 --> 00:20:15,120
to design new features and
设定新的特征变量

595
00:20:15,280 --> 00:20:17,540
figuring out what other features to give you learning algorithms and so on.
和找出其他能决定你学习算法的变量等方面

596
00:20:17,960 --> 00:20:19,110
And often those things will matter
通常这些方面会比

597
00:20:19,660 --> 00:20:20,700
more than what you are
你使用

598
00:20:20,840 --> 00:20:22,370
using logistic regression or an SVM.
逻辑回归还是SVM这方面更加重要

599
00:20:23,280 --> 00:20:24,650
But having said that,
但是 已经说过了

600
00:20:25,010 --> 00:20:26,180
the SVM is still widely
SVM仍然被广泛

601
00:20:26,630 --> 00:20:27,890
perceived as one of
认为是一种

602
00:20:27,950 --> 00:20:29,600
the most powerful learning algorithms, and
最强大的学习算法

603
00:20:29,740 --> 00:20:31,570
there is this regime of when there's
这是一个体系 包含了什么时候

604
00:20:31,790 --> 00:20:34,340
a very effective way to learn complex non linear functions.
一个有效的方法去学习复杂的非线性函数

605
00:20:35,150 --> 00:20:36,840
And so I actually, together with
因此 实际上与

606
00:20:37,040 --> 00:20:38,930
logistic regressions, neural networks, SVM's,
逻辑回归 神经网络 SVM一起

607
00:20:39,090 --> 00:20:40,630
using those to speed
使用这些方法来提高

608
00:20:40,760 --> 00:20:42,170
learning algorithms you're I think
学习算法 我认为

609
00:20:42,440 --> 00:20:43,610
very well positioned to build
你会很好地建立

610
00:20:44,120 --> 00:20:45,120
state of the art you know,
很有技术的状态

611
00:20:45,310 --> 00:20:46,710
machine learning systems for a wide
机器学习系统对于一个宽泛的

612
00:20:46,960 --> 00:20:49,110
region for applications and this
应用领域来说 这是

613
00:20:49,330 --> 00:20:52,460
is another very powerful tool to have in your arsenal.
另一个在你军械库里非常强大的工具

614
00:20:53,160 --> 00:20:54,270
One that is used all
你可以把它应用到

615
00:20:54,460 --> 00:20:55,850
over the place in Silicon Valley,
很多地方 在硅谷

616
00:20:56,390 --> 00:20:58,030
or in industry and in
在工业 在

617
00:20:58,310 --> 00:20:59,860
the Academia, to build many
学术等领域 建立许多

618
00:21:00,120 --> 00:21:01,680
high performance machine learning system.
高性能的机器学习系统