In this video, I'd like to start adapting support vector machines in order to develop complex nonlinear classifiers. The main technique for doing that is something called kernels. Let's see what kernels are and how to use them.

If you have a training set that looks like this, and you want to find a nonlinear decision boundary to distinguish the positive and negative examples, maybe a decision boundary that looks like that, one way to do so is to come up with a set of complex polynomial features. So, a set of features that looks like this, so that you end up with a hypothesis that predicts 1 if theta 0 plus theta 1 x1 plus dot dot dot, all those polynomial features, is greater than or equal to 0, and predicts 0 otherwise.

Another way of writing this, to introduce a bit of new notation that I'll use later, is that we can think of the hypothesis as computing a decision boundary using theta 0 plus theta 1 f1 plus theta 2 f2 plus theta 3 f3 and so on, where I'm going to use this new notation f1, f2, f3 and so on to denote these new features that I'm computing. So f1 is just x1, f2 is equal to x2, f3 is equal to this one here, x1 x2.
Then f4 is equal to x1 squared, f5 is x2 squared, and so on. We've seen previously that coming up with these high order polynomials is one way to come up with lots more features. The question is: is there a different choice of features, a better sort of features, than these high order polynomials? Because it's not clear that these high order polynomials are what we want, and as we talked about with computer vision, when the input is an image with lots of pixels, we also saw how using high order polynomials becomes very computationally expensive, because there are a lot of these higher order polynomial terms. So, is there a different or better choice of features that we can use to plug into this sort of hypothesis form?

Here is one idea for how to define new features f1, f2, f3. On this slide I am going to define only three new features, but for real problems we can define a much larger number. Here's what I'm going to do: in this space of features x1, x2 (and I'm going to leave the intercept term x0 out of this), I'm going to just manually pick a few points. I'll call the first of these points l1, I'm going to pick a different point and call that l2, and let's pick a third one and call this one l3. For now, let's just say that I'm going to choose these three points manually. I'm going to call these three points landmarks: landmark one, two, three.
What I'm going to do is define my new features as follows. Given an example x, let me define my first feature f1 to be some measure of the similarity between my training example x and my first landmark, and the specific formula that I'm going to use to measure similarity is going to be f1 = exp(-||x - l1||² / (2 sigma²)). Depending on whether or not you watched the previous optional video, this notation ||w|| is the length (norm) of a vector w. And so this thing here, ||x - l1||, is just the Euclidean distance between the point x and the landmark l1, which we then square. We will see more about this later.

But that's my first feature, and my second feature f2 is going to be a similarity function that measures how similar x is to l2, and it is going to be defined the same way: f2 = exp(-||x - l2||² / (2 sigma²)). This is e to the minus of the squared Euclidean distance between x and the second landmark, that's what the numerator is, divided by 2 sigma squared. And similarly, f3 is the similarity between x and l3, which is equal to, again, a similar formula.

What this similarity function is, the mathematical term for it, is a kernel function. And the specific kernel I'm using here is actually called a Gaussian kernel. So this formula, this particular choice of similarity function, is called a Gaussian kernel. But the way the terminology goes is that, in the abstract, these different similarity functions are called kernels, and we can have different similarity functions; the specific example I'm giving here is called the Gaussian kernel. We'll see other examples of other kernels.
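If it helps to make this concrete, here is a minimal sketch in Python of that similarity computation; the example point and the landmark coordinates are made-up values for illustration only.

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity between x and a landmark: exp(-||x - l||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

# Hypothetical example x and three manually chosen landmarks (illustrative values).
x  = np.array([3.0, 4.0])
l1 = np.array([3.0, 5.0])
l2 = np.array([0.0, 1.0])
l3 = np.array([5.0, 1.0])

f1 = gaussian_kernel(x, l1)
f2 = gaussian_kernel(x, l2)
f3 = gaussian_kernel(x, l3)
print(f1, f2, f3)
```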
But for now, just think of these as similarity functions. And so, instead of writing the similarity between x and l, sometimes we also write this as a kernel, denoted lowercase k, between x and one of my landmarks: k(x, l).

So let's see what these kernels actually do, and why these sorts of similarity functions, why these expressions, might make sense.

Let's take my first landmark, l1, which is one of those points I chose on my figure just now. The similarity, the kernel, between x and l1 is given by this expression. Just to make sure we are on the same page about what the numerator term is, the numerator can also be written as a sum from j equals 1 through n of (xj minus lj) squared. So this is the component-wise distance between the vector x and the vector l. And again, for the purpose of these slides, I'm ignoring x0, just ignoring the intercept term x0, which is always equal to 1. So this is how you compute the kernel, the similarity, between x and a landmark.

Let's see what this function does. Suppose x is close to one of the landmarks. Then the Euclidean distance in the numerator will be close to 0, and so f1, this first feature, will be approximately e to the minus 0 over 2 sigma squared, and e to the minus 0 is going to be close to one. I'll put the approximation symbol here because the distance may not be exactly 0, but if x is close to the landmark, this term will be close to 0, and so f1 will be close to 1.
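As a quick numerical check, with made-up coordinates: the numerator is just the component-wise sum of squared differences, and a point sitting almost on top of the landmark gets a feature value close to 1.

```python
import numpy as np

x, l1 = np.array([3.0, 4.0]), np.array([3.0, 5.0])
diff = x - l1
# The numerator ||x - l1||^2 equals the component-wise sum of (x_j - l_j)^2.
print(np.sum(diff ** 2), np.linalg.norm(diff) ** 2)   # both 1.0

# A point sitting almost on top of the landmark gets f1 close to e^0 = 1.
x_near = np.array([3.0, 5.1])
print(np.exp(-np.sum((x_near - l1) ** 2) / 2.0))      # about 0.995
```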
Conversely, if x is far from l1, then this first feature f1 will be e to the minus of some large number squared, divided by two sigma squared, and e to the minus of a large number is going to be close to 0.

So what these features do is measure how similar x is to one of your landmarks: the feature f is going to be close to one when x is close to your landmark, and is going to be 0, or close to zero, when x is far from your landmark. On the previous slide, I drew three landmarks, l1, l2, l3. Each of these landmarks defines a new feature f1, f2, and f3. That is, given a training example x, we can now compute three new features f1, f2, and f3, given the three landmarks that I drew just now.

But first, let's look at this exponential function, this similarity function, plot it in some figures, and just understand better what it really looks like. For this example, let's say I have two features x1 and x2, let's say my first landmark l1 is at the location (3, 5), and let's say I set sigma squared equal to one for now. If I plot what this feature looks like, what I get is this figure: the vertical axis, the height of the surface, is the value of f1, and the horizontal axes are x1 and x2.
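For reference, a minimal sketch, assuming NumPy and matplotlib, of how one could reproduce this kind of surface and contour plot for l1 = (3, 5) and sigma squared = 1:

```python
import numpy as np
import matplotlib.pyplot as plt

sigma_sq = 1.0   # the width of the bump is controlled by sigma squared
x1_grid, x2_grid = np.meshgrid(np.linspace(0, 6, 200), np.linspace(2, 8, 200))
f1_surface = np.exp(-((x1_grid - 3.0) ** 2 + (x2_grid - 5.0) ** 2) / (2 * sigma_sq))

fig = plt.figure(figsize=(10, 4))
ax3d = fig.add_subplot(1, 2, 1, projection="3d")
ax3d.plot_surface(x1_grid, x2_grid, f1_surface, cmap="viridis")   # bump peaking at l1 = (3, 5)
ax2d = fig.add_subplot(1, 2, 2)
ax2d.contour(x1_grid, x2_grid, f1_surface)                        # contour plot of the same surface
plt.show()
```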
Given a certain training example, a training example with particular values of x1 and x2, the height of the surface above that point shows the corresponding value of f1. Down below is the same figure shown as a contour plot, with x1 on one horizontal axis and x2 on the other, so the figure on the bottom is just a contour plot of the 3D surface. You notice that when x is equal to (3, 5) exactly, f1 takes on the value 1, because that's the maximum, and as x moves further away, this feature takes on values that are close to 0. So this feature f1 really measures how close x is to the first landmark; it varies between 0 and 1 depending on how close x is to the first landmark l1.

Now, the other thing I want to do on this slide is show the effect of varying this parameter sigma squared. Sigma squared is the parameter of the Gaussian kernel, and as you vary it, you get slightly different effects. Let's set sigma squared to be equal to 0.5 and see what we get. If we set sigma squared to 0.5, what you find is that the kernel looks similar, except that the width of the bump becomes narrower; the contours shrink a bit too. So if sigma squared equals 0.5, then as you start from x equals (3, 5) and move away, the feature f1 falls to zero much more rapidly. Conversely, suppose you increase sigma squared, say to sigma squared equals 3, and consider moving away from l. This point here really is l, that is, l1 is at location (3, 5), as shown up here.
And if sigma squared is large, then as you move away from l1, the value of the feature falls off much more slowly.

So, given this definition of the features, let's see what sort of hypothesis we can learn. Given a training example x, we are going to compute these features f1, f2, f3, and the hypothesis is going to predict one when theta 0 plus theta 1 f1 plus theta 2 f2, and so on, is greater than or equal to 0. For this particular example, let's say that I've already run a learning algorithm, and let's say that somehow I ended up with these values of the parameters: theta 0 equals minus 0.5, theta 1 equals 1, theta 2 equals 1, and theta 3 equals 0. What I want to do is consider what happens if we have a training example located at this magenta dot, right where I just drew this dot over here. So let's say I have that training example x; what would my hypothesis predict? Well, if I look at this formula: because my training example x is close to l1, we have that f1 is going to be close to 1, and because my training example x is far from l2 and l3, we have that f2 will be close to 0 and f3 will be close to 0. So, if I look at that formula, I have theta 0 plus theta 1 times 1 plus theta 2 times some value that is not exactly 0 but close to 0, plus theta 3 times something close to 0. Plugging in these values, that gives minus 0.5 plus 1 times 1, which comes to about 0.5, and that is greater than or equal to 0. So, at this point, we're going to predict y equals 1, because that's greater than or equal to zero.
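As a rough sketch of that calculation in code, with the same parameter values; the coordinates of the magenta point and of the landmarks are hypothetical, chosen only so the point lands near l1:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

# Landmarks as before (illustrative positions) and the parameters from the example.
l1, l2, l3 = np.array([3.0, 5.0]), np.array([0.0, 1.0]), np.array([5.0, 1.0])
theta = np.array([-0.5, 1.0, 1.0, 0.0])   # theta0, theta1, theta2, theta3

x_magenta = np.array([3.2, 4.8])          # a point close to l1 (hypothetical)
f = np.array([1.0] + [gaussian_kernel(x_magenta, l) for l in (l1, l2, l3)])
score = theta @ f                         # theta0 + theta1*f1 + theta2*f2 + theta3*f3
print(score, int(score >= 0))             # roughly 0.5, so predict y = 1
```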
Now let's take a different point. I'm going to draw this one in a different color, in cyan, say, for a point out there. If that were my training example x, then if you make a similar computation, you find that f1, f2, f3 are all going to be close to 0. And so we have theta 0 plus theta 1 f1 plus and so on, and this will be about equal to minus 0.5, because theta 0 is minus 0.5 and f1, f2, f3 are all close to zero. So this will be about minus 0.5, which is less than zero, and so at this point out there we're going to predict y equals zero.

And if you do this yourself for a range of different points, be sure to convince yourself that if you have a training example that's close to l2, say, then at that point we'll also predict y equals one. In fact, what you end up finding is that, if you look around this space, for points near l1 and l2 we end up predicting positive, and for points far away from l1 and l2, that is, far away from these two landmarks, we end up predicting that the class is equal to 0. So what we end up with is that the decision boundary of this hypothesis looks something like this, where inside this red decision boundary we predict y equals 1 and outside we predict y equals 0. And so this is how, with this definition of the landmarks and of the kernel function, we can learn a pretty complex nonlinear decision boundary, like the one I just drew, where we predict positive when we're close to either one of the two landmarks, and we predict negative when we're very far away from any of the landmarks.
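Here is a small sketch of that prediction rule using the same hypothetical setup; points near l1 or l2 come out positive, and points far from both come out negative (theta 3 is zero, so l3 does not contribute here):

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

l1, l2, l3 = np.array([3.0, 5.0]), np.array([0.0, 1.0]), np.array([5.0, 1.0])
theta = np.array([-0.5, 1.0, 1.0, 0.0])

def predict(x):
    """Predict 1 when theta0 + theta1*f1 + theta2*f2 + theta3*f3 >= 0, else 0."""
    f = np.array([1.0] + [gaussian_kernel(x, l) for l in (l1, l2, l3)])
    return int(theta @ f >= 0)

print(predict(np.array([3.0, 5.0])))   # near l1 -> 1
print(predict(np.array([0.5, 1.0])))   # near l2 -> 1
print(predict(np.array([6.0, 8.0])))   # far from both -> 0
```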
And so this is part of the idea of kernels and how we use them with the support vector machine: we define these extra features using landmarks and similarity functions, in order to learn more complex nonlinear classifiers. So hopefully that gives you a sense of the idea of kernels and how we could use them to define new features for the support vector machine.

But there are a couple of questions that we haven't answered yet. One is, how do we get these landmarks? How do we choose these landmarks? And another is, what other similarity functions, if any, can we use, other than the one we talked about, which is called the Gaussian kernel? In the next video we give answers to these questions and put everything together to show how support vector machines with kernels can be a powerful way to learn complex nonlinear functions.