In this video, I'd like to tell you a bit about the math behind large margin classification. This video is optional, so please feel free to skip it. But it may give you better intuition about how the optimization problem of the support vector machine leads to large margin classifiers.

To get started, let me first remind you of a couple of properties of vector inner products. Let's say I have two vectors U and V, both two-dimensional. Let's see what U transpose V looks like. The quantity U transpose V is also called the inner product between the vectors U and V.

U is a two-dimensional vector, so I can plot it on this figure. What I mean by that is: on the horizontal axis it takes whatever value U1 is, and on the vertical axis the height is whatever U2, the second component of the vector U, is. Now, one quantity that will be nice to have is the norm of the vector U, written with double bars on the left and right: ||U|| denotes the norm, or length, of U, meaning the Euclidean length of the vector U. By the Pythagorean theorem, this is equal to the square root of U1 squared plus U2 squared:

||U|| = sqrt(U1^2 + U2^2)

This is the length of the vector U, and it's a real number. It just says how long the vector is, the length of the arrow that I just drew.
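To make this concrete, here is a minimal NumPy sketch; the vector values are made-up examples, not from the lecture:

```python
import numpy as np

u = np.array([2.0, 1.0])  # a made-up example vector u = [u1, u2]

# Pythagorean theorem: the Euclidean length of u
norm_u = np.sqrt(u[0]**2 + u[1]**2)

# NumPy's built-in norm agrees with the formula
assert np.isclose(norm_u, np.linalg.norm(u))
print(norm_u)  # ~2.236, a real number: the length of the arrow
```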
Now let's look at the vector V, because we want to compute the inner product. So V will be some other vector with components V1 and V2, and the vector V will look like so.

Now let's see how to compute the inner product between U and V. Here's how you can do it. Take the vector V and project it down onto the vector U, using an orthogonal projection, that is, a 90-degree projection. I'm going to measure the length of the red line segment this produces, and I'm going to call that length P. So P is the length, or the magnitude, of the projection of the vector V onto the vector U. Let me just write that down: P is the length of the projection of the vector V onto the vector U. And it is possible to show that the inner product U transpose V is equal to P times the norm, or length, of the vector U:

U'V = P * ||U||

So this is one way to compute the inner product. If you actually do the geometry to figure out what P is and what the norm of U is, this should give you the same answer as the other way of computing the inner product: taking U transpose V, where U transpose is [U1 U2], a one-by-two matrix, times V, which gives you U1 V1 plus U2 V2. It's a theorem of linear algebra that these two formulas give you the same answer.

And by the way, U transpose V is also equal to V transpose U. So if you were to do the same process in reverse, projecting U onto V instead of V onto U, with the roles of U and V reversed, you should get the same number, whatever that number is.
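Here's a quick sketch checking that the two formulas agree, again with made-up vectors; the projection length is computed geometrically from the angle between the vectors:

```python
import numpy as np

u = np.array([2.0, 1.0])  # made-up example vectors
v = np.array([1.0, 2.0])

# Way 1: the component formula  u1*v1 + u2*v2
ip_components = u[0]*v[0] + u[1]*v[1]

# Way 2: the projection formula  p * ||u||, where p = ||v|| * cos(angle)
# is the length of the orthogonal projection of v onto u
angle = np.arctan2(v[1], v[0]) - np.arctan2(u[1], u[0])
p = np.linalg.norm(v) * np.cos(angle)
ip_projection = p * np.linalg.norm(u)

assert np.isclose(ip_components, ip_projection)  # both equal 4.0
```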
Just to clarify what's going on in this equation: the norm of U is a real number and P is also a real number, so U transpose V is the regular multiplication of two real numbers, the length P times the norm of U.

Just one last detail: P is actually signed, so it can be either positive or negative. Let me say what I mean by that. Suppose U is a vector that looks like this and V is a vector that looks like this, so that the angle between U and V is greater than ninety degrees. Then if I project V onto U, what I get is a projection that looks like this, with some length P. In this case I will still have that U transpose V is equal to P times the norm of U, except that in this example P will be negative.

So, for inner products, if the angle between U and V is less than ninety degrees, then P is the positive length of that red line, whereas if the angle is greater than ninety degrees, then P is the negative of the length of that little line segment. So the inner product between two vectors can also be negative, if the angle between them is greater than 90 degrees.

So that's how vector inner products work. We're going to use these properties of the vector inner product to understand the support vector machine optimization objective.
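Before moving on, here's a small check of that sign behavior, with a made-up pair of vectors whose angle exceeds 90 degrees:

```python
import numpy as np

u = np.array([2.0, 1.0])
v = np.array([-1.0, 1.0])  # angle between u and v is greater than 90 degrees

# Signed projection length of v onto u (standard formula p = u'v / ||u||)
p = np.dot(u, v) / np.linalg.norm(u)
print(p)                   # ~ -0.447: negative, as expected

# u'v = p * ||u|| still holds, and both sides are negative
assert np.isclose(np.dot(u, v), p * np.linalg.norm(u))  # both equal -1.0
```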
Here is the optimization objective for the support vector machine that we worked out earlier. Just for the purpose of this slide, I am going to make one simplification to make the objective easier to analyze: I'm going to ignore the intercept term. So we'll just ignore theta 0 and set it equal to 0. To make things easier to plot, I'm also going to set n, the number of features, equal to 2, so we have only two features, x1 and x2.

Now let's look at the objective function, the optimization objective of the SVM. When we have only two features, when n equals 2, this can be written as one half of theta 1 squared plus theta 2 squared, because we only have two parameters, theta 1 and theta 2.

What I'm going to do is rewrite this a bit. I'm going to write it as one half of the quantity: the square root of theta 1 squared plus theta 2 squared, all squared. The reason I can do that is that for any nonnegative number w, the square root of w, then squared, is just equal to w; taking a square root and then squaring gives you back the same thing.

What you may notice is that the term inside is equal to the norm, or the length, of the vector theta. What I mean by that is that if we write out the vector theta as theta 1, theta 2, then this term that I've just underlined in red is exactly the length, or the norm, of the vector theta, recalling the definition of the norm from the previous slide. In fact this is equal to the length of the vector theta whether you write it as theta 0, theta 1, theta 2 (with theta 0 equal to zero, as I assume here), or just as theta 1, theta 2; but for this slide I am going to ignore theta 0.
So let me just write theta, and the norm of theta, in terms of theta 1 and theta 2 only; the math works out either way, whether we include theta 0 here or not, so it's not going to matter for the rest of our derivation.

And so, finally, this means that my optimization objective is equal to one half of the norm of theta, squared. So all the support vector machine is doing in the optimization objective is minimizing the squared norm, the squared length, of the parameter vector theta.
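As a sanity check, here's a tiny sketch, with a made-up theta, that the three ways of writing the objective agree:

```python
import numpy as np

theta = np.array([3.0, 4.0])  # made-up parameters [theta1, theta2], theta0 = 0

obj_a = 0.5 * (theta[0]**2 + theta[1]**2)              # original form
obj_b = 0.5 * np.sqrt(theta[0]**2 + theta[1]**2) ** 2  # square root, then squared
obj_c = 0.5 * np.linalg.norm(theta) ** 2               # 1/2 * ||theta||^2

assert np.isclose(obj_a, obj_b) and np.isclose(obj_a, obj_c)  # all equal 12.5
```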
Now what I'd like to do is look at the terms theta transpose x(i) and understand better what they're doing. Given the parameter vector theta and an example x(i), what is this quantity equal to? On the previous slide we figured out what U transpose V looks like for different vectors U and V, so we're going to take those definitions, with theta and x(i) playing the roles of U and V, and see what that picture looks like.

Let's say I look at just a single training example. Say I have a positive example, drawn as a cross here, and let's say that is my example x(i). What that really means is that it's plotted with its first component x(i)1 on the horizontal axis and its second component x(i)2 on the vertical axis. That's how I plot my training examples. And although we haven't really been thinking of this as a vector, what this really is, is a vector from the origin, from (0, 0), out to the location of this training example.

Now let's say we have a parameter vector, and I'm going to plot that as a vector as well: theta 1 here and theta 2 there. So what is the inner product theta transpose x(i)? Using our earlier method, the way we compute it is to take my example and project it onto my parameter vector theta, and then look at the length of the segment that I'm coloring in red. I'm going to call that p(i), to denote that it is the projection of the i-th training example onto the parameter vector theta. And so what we have, following what was on the previous slide, is that theta transpose x(i) is equal to p(i) times the norm of the vector theta, and this is of course also equal to theta 1 times x(i)1 plus theta 2 times x(i)2. Each of these is an equally valid way of computing the inner product between theta and x(i).

Okay, so where does this leave us? The constraints that theta transpose x(i) be greater than or equal to one, or less than or equal to minus one, can be replaced by the constraints that p(i) times the norm of theta be greater than or equal to one, or less than or equal to minus one, because theta transpose x(i) is equal to p(i) times the norm of theta. Writing that into our optimization problem, this is what we get: instead of theta transpose x(i), I now have p(i) times the norm of theta. And just to remind you, we worked out earlier that the optimization objective can be written as one half times the norm of theta squared.
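Here's the same identity in code, with a made-up theta and training example:

```python
import numpy as np

theta = np.array([3.0, 4.0])  # made-up parameter vector
x_i = np.array([1.0, 2.0])    # made-up training example x(i)

# Signed projection of x(i) onto theta
p_i = np.dot(theta, x_i) / np.linalg.norm(theta)

lhs = theta[0]*x_i[0] + theta[1]*x_i[1]  # theta1*x1 + theta2*x2
rhs = p_i * np.linalg.norm(theta)        # p(i) * ||theta||
assert np.isclose(lhs, rhs)              # both equal 11.0
```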
So now let's consider the training examples that we have at the bottom and, continuing to use the simplification that theta 0 equals 0, let's see what decision boundary the support vector machine will choose.

Here's one option: let's say the support vector machine were to choose this decision boundary. This is not a very good choice, because it has very small margins; the decision boundary comes very close to the training examples. Let's see why the support vector machine will not do this.

For this choice of parameters, it's possible to show that the parameter vector theta is actually at 90 degrees to the decision boundary. So that green decision boundary corresponds to a parameter vector theta that points in that direction. And by the way, the simplification that theta 0 equals 0 just means that the decision boundary must pass through the origin, (0, 0).

So now let's look at what this implies for the optimization objective. Say this example here is my first example, x(1). If we look at the projection of this example onto my parameter vector theta, that's the projection, and that little red line segment is p(1), which is going to be pretty small. Similarly, if this example here happens to be x(2), my second example, and I look at the projection of this example onto theta, let me draw this one in magenta: this little magenta line segment is p(2), the projection of the second example onto the direction of my parameter vector theta. This projection is also pretty short, and p(2) will actually be a negative number: it points in the opposite direction, because this vector makes a greater than 90 degree angle with my parameter vector theta, so p(2) is going to be less than 0.
And so what we're finding is that these projections p(i) are going to be pretty small numbers. If we look at the optimization objective, for positive examples we need p(i) times the norm of theta to be greater than or equal to one. But if p(1) here is pretty small, that means we need the norm of theta to be pretty large, right? If p(1) is small and we want p(1) times the norm of theta to be at least one, the only way for the product of these two numbers to be that large when p(1) is small is for the norm of theta to be large. And similarly, for our negative example we need p(2) times the norm of theta to be less than or equal to minus one. We saw in this example that p(2) is going to be a small negative number, so the only way for that to happen is, again, for the norm of theta to be large. But what we are doing in the optimization objective is trying to find a setting of the parameters where the norm of theta is small. So this doesn't seem like such a good direction for the parameter vector theta.
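To see the effect numerically, here's a tiny sketch with made-up projection values; the constraint p(i) * ||theta|| >= 1 forces the norm up exactly when the projection is small:

```python
# Made-up projection lengths for a positive example
for p in (0.1, 0.5, 2.0):
    # p * ||theta|| >= 1  implies  ||theta|| >= 1 / p
    print(f"p(i) = {p}: constraint requires ||theta|| >= {1.0 / p}")

# p(i) = 0.1: constraint requires ||theta|| >= 10.0
# p(i) = 0.5: constraint requires ||theta|| >= 2.0
# p(i) = 2.0: constraint requires ||theta|| >= 0.5
```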
In contrast, let's look at a different decision boundary. Let's say the SVM chooses this decision boundary instead. Now the picture is going to be very different. If that is the decision boundary, here is the corresponding direction for theta: with the decision boundary being that vertical line, it is possible to show using linear algebra that the way to get that green decision boundary is to have the vector theta be at 90 degrees to it. And now if you look at the projection of your data onto theta: let's say, as before, that this example is my example x(1); when I project it onto theta, what I find is that this length here is p(1). For the other example, x(2), I do the same projection, and what I find is that this length here is p(2), which again is going to be less than 0.

And you'll notice that now p(1) and p(2), these lengths of the projections, are much bigger. So if we still need to enforce the constraint that p(1) times the norm of theta is at least one, then because p(1) is so much bigger now, the norm of theta can be smaller. And so, what this means is that by choosing the decision boundary shown on the right instead of the one on the left, the SVM can make the norm of the parameter vector theta much smaller, and therefore make the squared norm of theta smaller, which is why the SVM would choose the hypothesis on the right instead.

And this is how the SVM gives rise to the large margin classification effect. Namely, if you look at this green line, this green hypothesis, we want the projections of my positive and negative examples onto theta to be large, and the only way for that to hold is if, surrounding the green line,
there is a large margin, a large gap that separates the positive and negative examples. The magnitude of this margin is exactly the values of p(1), p(2), p(3), and so on. So by making the margin large, by making these projections p(1), p(2), p(3), and so on large, the SVM can end up with a smaller value for the norm of theta, which is what it is trying to do in the objective. And this is why the support vector machine ends up with large margin classifiers: because it's trying to make these p(i), the distances from the training examples to the decision boundary, large.
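Here's a compact sketch of that trade-off, with made-up data: two candidate directions for theta (the boundary is the line through the origin perpendicular to theta); the direction whose boundary skims past the examples forces a much larger ||theta||:

```python
import numpy as np

# Made-up data: one positive and one negative example
X = np.array([[ 1.0,  1.0],    # x(1), label +1
              [-1.0, -1.0]])   # x(2), label -1
y = np.array([1.0, -1.0])

for theta in (np.array([1.0, -0.9]),   # boundary skims past the examples
              np.array([1.0,  1.0])):  # boundary leaves a wide gap
    p = X @ theta / np.linalg.norm(theta)  # signed projections p(i)
    margin = np.min(y * p)                 # smallest margin over the examples
    # y(i) * p(i) * ||theta|| >= 1 for all i requires ||theta|| >= 1 / margin
    print(f"theta direction {theta}: margin {margin:.3f}, "
          f"needs ||theta|| >= {1.0 / margin:.2f}")
# The skimming boundary needs ||theta|| >= ~13.5; the wide one only ~0.71.
```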
Finally, we did this whole derivation using the simplification that the parameter theta 0 is equal to 0. The effect of that, as I mentioned briefly, is that if theta 0 equals 0, we are only entertaining decision boundaries that pass through the origin. If you allow theta 0 to be non-zero, then you are entertaining decision boundaries that do not pass through the origin, like the one I just drew. I'm not going to do the full derivation of that, but it turns out that the same large margin proof works in pretty much exactly the same way. There is a generalization of the argument we just went through which shows that even when theta 0 is non-zero, what the SVM is trying to do, given this optimization objective, which again corresponds to the case where C is very large, is still to find the large margin separator between the positive and negative examples. So that explains how the support vector machine is a large margin classifier.

In the next video, we will start to talk about how to take some of these SVM ideas and apply them to build complex nonlinear classifiers.