Sometimes people talk about support vector machines as large margin classifiers. In this video I'd like to tell you what that means, and this will also give us a useful picture of what an SVM hypothesis may look like.

Here's my cost function for the support vector machine, where on the left I've plotted my cost_1(z) function that I use for positive examples, and on the right I've plotted my cost_0(z) function, with z on the horizontal axis. Now, let's think about what it takes to make these cost functions small. If you have a positive example, so if y is equal to 1, then cost_1(z) is zero only when z is greater than or equal to 1. So in other words, if you have a positive example, we really want theta transpose x to be greater than or equal to 1. Conversely, if y is equal to zero, look at this cost_0(z) function: it's only in the region where z is less than or equal to -1 that the cost is zero. And this is an interesting property of the support vector machine, which is that if you have a positive example, so if y is equal to one, then all we really need is for theta transpose x to be greater than or equal to zero. That would mean we classify the example correctly, because if theta transpose x is greater than or equal to zero, our hypothesis will predict one. And similarly, if you have a negative example, then really all you need is for theta transpose x to be less than zero, and that will make sure we got the example right. But the support vector machine wants a bit more than that. It says, don't just barely get the example right; don't have theta transpose x just a little bit bigger than zero.
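As a rough illustration of the two curves being described (this is my own sketch, not code from the course, and the exact slope of each piece is an assumption; only the regions where the cost is zero are fixed by the lecture), the two hinge-style pieces look something like this:

```python
import numpy as np

def cost_1(z):
    """Cost used for positive examples (y = 1): zero once z >= 1."""
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    """Cost used for negative examples (y = 0): zero once z <= -1."""
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3, 3, 7)
print(cost_1(z))  # zero wherever z >= 1
print(cost_0(z))  # zero wherever z <= -1
```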
What I really want is for theta transpose x to be quite a lot bigger than zero, say greater than or equal to one, and for negative examples I want it to be much less than zero, maybe less than or equal to -1. And so this builds an extra safety factor, or safety margin factor, into the support vector machine. Logistic regression does something similar too, of course, but let's see what the consequences of this are in the context of the support vector machine.

Concretely, what I'd like to do next is consider the case where we set this constant C to be a very large value. So let's imagine we set C to a very large value, maybe a hundred thousand, some huge number. Let's see what the support vector machine will do. If C is very, very large, then when minimizing this optimization objective, we're going to be highly motivated to choose parameters so that this first term is equal to zero. So let's try to understand the optimization problem in terms of what it would take to make this first term in the objective equal to zero, because, you know, maybe we'll set C to some huge constant, and this should give us additional intuition about what sort of hypotheses a support vector machine learns. We saw already that whenever you have a training example with a label of y = 1, if you want to make that first term zero, what you need is to find a value of theta so that theta transpose x(i) is greater than or equal to 1.
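For reference, the objective being discussed here, written out the way it appears on the slide as far as I can reconstruct it from the description (with m training examples and n features):

$$
\min_{\theta}\; C \sum_{i=1}^{m}\left[ y^{(i)}\,\mathrm{cost}_1\!\left(\theta^{T}x^{(i)}\right) + \left(1-y^{(i)}\right)\mathrm{cost}_0\!\left(\theta^{T}x^{(i)}\right) \right] \;+\; \frac{1}{2}\sum_{j=1}^{n}\theta_j^{2}
$$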
And similarly, whenever we have an example with label zero, in order to make sure that cost_0(z) is zero, we need theta transpose x(i) to be less than or equal to -1. So if we think of our optimization problem as now really choosing parameters so that this first term is equal to zero, what we're left with is the following optimization problem. We're going to minimize that first term, which is zero — C times zero, because we're going to choose parameters so that it's equal to zero — plus one half times that second term. And since the first term is C times zero, let's just cross it out, because I know it's going to be zero. This will be subject to the constraints that theta transpose x(i) is greater than or equal to one whenever y(i) is equal to one, and theta transpose x(i) is less than or equal to minus one whenever you have a negative example.

It turns out that when you solve this optimization problem, when you minimize this as a function of the parameters theta, you get a very interesting decision boundary. Concretely, if you look at a data set like this with positive and negative examples, this data is linearly separable. By that, I mean that there exists a straight line — although there are many different straight lines — that can separate the positive and negative examples perfectly. For example, here is one decision boundary that separates the positive and negative examples, but somehow that doesn't look like a very natural one, right? Or, drawing an even worse one, here's another decision boundary that separates the positive and negative examples, but just barely.
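So, under the assumption that a huge C drives the first term to zero, the problem described here reduces to the following (again my transcription of what's on the slide):

$$
\begin{aligned}
\min_{\theta}\;& \frac{1}{2}\sum_{j=1}^{n}\theta_j^{2} \\
\text{s.t. }\;& \theta^{T}x^{(i)} \ge 1 \quad \text{if } y^{(i)} = 1, \\
& \theta^{T}x^{(i)} \le -1 \quad \text{if } y^{(i)} = 0.
\end{aligned}
$$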
But neither of those seems like a particularly good choice. The support vector machine will instead choose this decision boundary, which I'm drawing in black. And that seems like a much better decision boundary than either of the ones that I drew in magenta or in green. The black line seems like a more robust separator; it does a better job of separating the positive and negative examples. Mathematically, what that means is that this black decision boundary has a larger distance from the training examples.

That distance is called the margin. When I draw these two extra blue lines, we see that the black decision boundary has a larger minimum distance from any of my training examples, whereas the magenta and the green lines come awfully close to the training examples, and so they seem to do a less good job of separating the positive and negative classes than my black line. This distance is called the margin of the support vector machine, and this gives the SVM a certain robustness, because it tries to separate the data with as large a margin as possible. So the support vector machine is sometimes also called a large margin classifier, and this is actually a consequence of the optimization problem we wrote down on the previous slide. I know you might be wondering how the optimization problem I wrote down on the previous slide leads to this large margin classifier. I haven't explained that yet, and in the next video I'm going to sketch a little bit of the intuition about why that optimization problem gives us this large margin classifier. But this is a useful feature to keep in mind if you are trying to understand what sorts of hypotheses an SVM will choose: that is, it tries to separate the positive and negative examples with as big a margin as possible.
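To make the notion of margin a little more concrete, here is a small sketch of my own (not from the lecture) that measures the smallest geometric distance from a set of points to a given linear boundary theta transpose x = 0; the toy points and the particular theta are made-up values, and for simplicity there is no intercept term:

```python
import numpy as np

def minimum_margin(theta, X):
    """Smallest distance from the rows of X to the hyperplane theta^T x = 0."""
    distances = np.abs(X @ theta) / np.linalg.norm(theta)
    return distances.min()

# Toy points and a toy boundary -- purely illustrative values.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
theta = np.array([1.0, 1.0])
print(minimum_margin(theta, X))  # the margin of this particular boundary
```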
I want to say one last thing about large margin classifiers and this intuition. We wrote out this large margin classification setting for the case where C, that regularization parameter, was very large — I think I set it to a hundred thousand or something. So given a dataset like this, maybe we'll choose the decision boundary that separates the positive and negative examples with a large margin. Now, the SVM is actually slightly more sophisticated than this large margin view might suggest. In particular, if all you're doing is using a large margin classifier, then your learning algorithm can be sensitive to outliers. So let's just add an extra positive example like the one shown on the screen. Having added that one example, it seems as if, to separate the data with a large margin, maybe I'll end up learning a decision boundary like that, right? That is the magenta line, and it's really not clear that, based on that single outlier, that single example, it's actually a good idea to change my decision boundary from the black one over to the magenta one.

So if the regularization parameter C were very large, then this is actually what the SVM will do: it will change the decision boundary from the black one to the magenta one. But if C were reasonably small — if you use a C that is not too large — then you still end up with this black decision boundary. And of course, if the data were not linearly separable, so if you had some positive examples in here, or if you had some negative examples in here, then the SVM will also do the right thing.
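If you want to see the effect of C for yourself, here is a minimal sketch using scikit-learn's linear-kernel SVM (scikit-learn is not part of the course; the toy data with its single positive outlier is made up, and the exact boundaries you get will depend on those values):

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data plus one positive "outlier" near the negatives.
X = np.array([[2, 2], [3, 3], [4, 2],          # positives
              [-2, -2], [-3, -1], [-2, -3],    # negatives
              [-1.5, -1.5]])                   # outlier labeled positive
y = np.array([1, 1, 1, 0, 0, 0, 1])

for C in (100000.0, 1.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # With a huge C the boundary bends to accommodate the outlier;
    # with a modest C it stays closer to the "black line" described above.
    print(C, clf.coef_, clf.intercept_)
```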
And so this picture of a large margin classifier really gives the right intuition only for the case where the regularization parameter C is very large. Just to remind you, C plays a role similar to one over lambda, where lambda is the regularization parameter we had previously. So it's only if one over lambda is very large, or equivalently if lambda is very small, that you end up with things like this magenta decision boundary. In practice, when applying support vector machines with a C that is not very, very large like that, the SVM can do a better job of ignoring the few outliers like the one here, and it will also do fine and do reasonable things even if your data is not linearly separable. When we talk about bias and variance in the context of support vector machines, which we'll do a little bit later, hopefully all of these trade-offs involving the regularization parameter will become clearer then.

So I hope that gives some intuition about how the support vector machine functions as a large margin classifier that tries to separate the data with a large margin. Technically, this picture is true only when the parameter C is very large, but it is a useful way to think about support vector machines. There was one missing step in this video, which is why the optimization problem we wrote down on these slides actually leads to the large margin classifier. I didn't do that in this video; in the next video I will sketch a little bit more of the math behind it, to explain the separate reasoning of how the optimization problem we wrote out results in a large margin classifier.
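To spell out that correspondence between C and lambda: if A denotes the training-cost term and B the regularization term (the letters are just shorthand, as in the earlier comparison of the two objectives), the earlier regularized objectives had the form A + lambda B while the SVM objective has the form CA + B, and dividing the first form through by lambda shows the two have the same minimizer when C = 1/lambda:

$$
\min_{\theta}\; A(\theta) + \lambda B(\theta)
\;\Longleftrightarrow\;
\min_{\theta}\; \tfrac{1}{\lambda}A(\theta) + B(\theta),
\qquad\text{so } C \approx \tfrac{1}{\lambda}.
$$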