By now you've seen a range of different learning algorithms. Within supervised learning, the performance of many algorithms will be pretty similar, and what matters will less often be whether you use learning algorithm A or learning algorithm B; what will often matter more are things like the amount of data you train these algorithms on, your skill in applying these algorithms, your choice of the features you design to give the learning algorithms, how you choose the regularization parameter, and things like that. But there's one more algorithm that is very powerful and very widely used, both in industry and in academia, and that's called the support vector machine. Compared to both logistic regression and neural networks, the support vector machine, or SVM, sometimes gives a cleaner and sometimes a more powerful way of learning complex nonlinear functions. So I'd like to take the next videos to talk about that. Later in this course, I will do a quick survey of a range of different supervised learning algorithms, just to very briefly describe them, but the support vector machine, given its popularity, will be the last of the supervised learning algorithms that I'll spend a significant amount of time on in this course.

As with our development of other learning algorithms, we're going to start by talking about the optimization objective. So let's get started on this algorithm.

In order to describe the support vector machine, I'm actually going to start with logistic regression and show how we can modify it a bit to get what is essentially the support vector machine.
So, in logistic regression we have our familiar form of the hypothesis there, and the sigmoid activation function shown on the right. In order to explain some of the math, I'm going to use z to denote theta transpose x here.

Now let's think about what we would like logistic regression to do. If we have an example with y equal to 1, and by this I mean an example in either the training set, the test set, or the cross validation set where y is equal to 1, then we're hoping that h of x will be close to 1; that is, we're hoping to correctly classify that example. Having h of x close to 1 means that theta transpose x must be much larger than 0. The double greater-than sign there means much, much greater than 0, and that's because it's when z, that is, theta transpose x, is much bigger than 0, far to the right of this figure, that the output of logistic regression becomes close to 1.

Conversely, if we have an example where y is equal to 0, then what we're hoping for is that the hypothesis will output a value close to 0, and that corresponds to theta transpose x, or z, being much less than 0.

If you look at the cost function of logistic regression, what you find is that each example (x, y) contributes a term like this to the overall cost function. For the overall cost function we would also have a sum over all the training examples and a 1 over m term, but this expression here is the term that a single training example contributes to the overall objective function for logistic regression.
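For reference, here is the standard logistic regression setup this discussion assumes, written out explicitly since the slide itself isn't reproduced in the transcript: the hypothesis, the shorthand z, and the term a single example (x, y) contributes to the cost.

\[
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad z = \theta^T x
\]
\[
\mathrm{cost}(x, y) = -\,y \log h_\theta(x) \;-\; (1 - y)\log\bigl(1 - h_\theta(x)\bigr)
\]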
Now, if I take the definition of my hypothesis and plug it in over here, what I get is that each training example contributes this term, ignoring the 1 over m, to my overall cost function for logistic regression. Now let's consider the two cases of when y is equal to 1 and when y is equal to 0.

In the first case, let's suppose that y is equal to 1. In that case, only this first term in the objective matters, because this 1 minus y term will be equal to 0 if y is equal to 1. So, for an example (x, y) with y equal to 1, what we get is this term: minus log 1 over 1 plus e to the negative z, where, as on the last slide, I'm using z to denote theta transpose x. And of course, in the cost we actually had this factor y in front, but we just said that y is equal to 1, so that factor equals 1 and I've simplified it away in the expression written down here.

If we plot this function as a function of z, what you find is that you get the curve shown on the lower left of the slide, and we also see that when z is large, that is, when theta transpose x is large, we get a very small value, a very small contribution to the cost function. This kind of explains why, when logistic regression sees a positive example with y equals 1, it tries to set theta transpose x to be very large, because that corresponds to this term in the cost function being small.
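To spell out the algebra behind that curve: substituting the sigmoid into the y = 1 term gives

\[
-\log \frac{1}{1 + e^{-z}} = \log\bigl(1 + e^{-z}\bigr),
\]

which tends to 0 as z grows large and positive, and grows roughly linearly as z becomes very negative.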
Now, to build the support vector machine, here is what we are going to do. We are going to take this cost function, this minus log 1 over 1 plus e to the negative z, and modify it a little bit. Let me take this point 1 over here, and let me draw the cost function that I'm going to use. The new cost function is going to be flat from here on out, and then I'm going to draw something that grows as a straight line, similar to logistic regression, but a straight line in this portion. So that's the curve I just drew in magenta, and it's a pretty close approximation to the cost function used by logistic regression, except that it is now made out of two line segments: this flat portion on the right, and this straight line portion on the left. Don't worry too much about the slope of the straight line portion; it doesn't matter that much. But that's the new cost function we're going to use when y is equal to 1, and you can imagine it does something pretty similar to logistic regression. It turns out, though, that this will give the support vector machine computational advantages: it will give us, later on, an easier optimization problem that will be easier for software to solve.

We just talked about the case of y equal to 1. The other case is when y is equal to 0. In that case, if you look at the cost, then only this second term will apply, because the first term goes away: if y is equal to 0, then that first term is 0 here. So you're left with only the second term of the expression above, and the cost of an example, that is, its contribution to the cost function, is going to be given by this term over here.
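The y = 0 term simplifies the same way; substituting the sigmoid into minus log of 1 minus h gives

\[
-\log\Bigl(1 - \frac{1}{1 + e^{-z}}\Bigr) = -\log \frac{1}{1 + e^{z}} = \log\bigl(1 + e^{z}\bigr),
\]

the mirror image: close to 0 when z is very negative, and growing roughly linearly when z is large.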
If you plot that as a function of z, with z here on the horizontal axis, you end up with this curve. For the support vector machine, once again, we're going to replace this blue line with something similar: a new cost that is flat out here, that is 0 out here, and that then grows as a straight line, like so.

So let me give these two functions names. This function on the left I'm going to call cost subscript 1 of z, and this function on the right I'm going to call cost subscript 0 of z. The subscript just refers to the cost corresponding to y equal to 1 versus y equal to 0.

Armed with these definitions, we are now ready to build the support vector machine. Here is the cost function J of theta that we have for logistic regression. In case this equation looks a bit unfamiliar, it's because previously we had a minus sign outside, but here I've instead moved the minus signs inside these expressions; that just makes the equations look a little different. For the support vector machine, what we are going to do is essentially take this and replace it with cost 1 of z, that is, cost 1 of theta transpose x, and take this and replace it with cost 0 of z, that is, cost 0 of theta transpose x, where the cost 1 function is what we had on the previous slide, that looks like this, and the cost 0 function, again from the previous slide, looks like this.
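Here is a minimal sketch of these two functions in Python. The lecture fixes the kink for cost 1 at z = 1 but deliberately leaves the slope unspecified, so a unit slope is assumed here, along with the symmetric kink at z = -1 for cost 0:

```python
import numpy as np

def cost1(z):
    """SVM surrogate cost for a y = 1 example: zero once z >= 1,
    then growing as a straight line as z decreases (unit slope assumed)."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """SVM surrogate cost for a y = 0 example: zero once z <= -1,
    then growing as a straight line as z increases (unit slope assumed)."""
    return np.maximum(0.0, 1.0 + z)

# Compare against the smooth logistic term each one approximates:
z = np.linspace(-3.0, 3.0, 7)
print(cost1(z))                # piecewise-linear cost for y = 1
print(np.log(1 + np.exp(-z)))  # smooth logistic counterpart for y = 1
```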
So, what we have for the support vector machine is a minimization problem of 1 over m times the sum over my training examples of y(i) times cost 1 of theta transpose x(i), plus 1 minus y(i) times cost 0 of theta transpose x(i), and then plus my usual regularization term, like so. Now, by convention for the support vector machine, we actually write things slightly differently; we parameterize this just very slightly differently.

First, we're going to get rid of the 1 over m term. This just happens to be a slightly different convention that people use for support vector machines compared to logistic regression. Here's what I mean: I'm just going to get rid of this 1 over m term, and this should give me the same optimal value of theta, because 1 over m is just a constant. So whether I solve this minimization problem with the 1 over m in front or not, I should end up with the same optimal value of theta.

To give you a concrete example, suppose I had a minimization problem: minimize over a real number u of u minus 5 squared, plus 1. Well, the minimum of this happens to be u equals 5. Now, if I take this objective function and multiply it by 10, so that my minimization problem is now the minimum over u of 10 times u minus 5 squared, plus 10, well, the value of u that minimizes this is still u equals 5. So multiplying something that you are minimizing by some constant, 10 in this case, does not change the value of u that minimizes the function.
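To see that concretely in code, here is a tiny sketch of that u example; scaling the objective by 10 changes the minimum value but not the minimizing u:

```python
import numpy as np

u = np.linspace(0.0, 10.0, 1001)

f = (u - 5) ** 2 + 1           # original objective, minimized at u = 5
g = 10 * (u - 5) ** 2 + 10     # same objective scaled by 10

# Both curves bottom out at the same u, even though the minimum values differ.
print(u[np.argmin(f)], u[np.argmin(g)])  # -> 5.0 5.0
print(f.min(), g.min())                  # -> 1.0 10.0
```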
In the same way, what I've done by crossing out the m is just multiply my objective function by some constant m, and that doesn't change the value of theta that achieves the minimum.

The second bit of notational change is just the standard convention when using the SVM instead of logistic regression, and it is the following. For logistic regression, we had two terms in our objective function. The first is this term, which is the cost that comes from the training set, and the second is this term, which is the regularization term. And the way we controlled the trade-off between these was to minimize A plus lambda times B, where I'm using A to denote the first term and B to denote the second term, without the lambda. By setting different values for the regularization parameter lambda, we could trade off the relative weight between how much we want to fit the training set well, that is, minimizing A, versus how much we care about keeping the values of the parameters small, that is, the term B. For the support vector machine, just by convention, we're going to use a different parameter. Instead of using lambda to control the relative weighting between the first and second terms, we're going to use a parameter which by convention is called C, and we will instead minimize C times A plus B.
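Written out, the objective we end up with for the support vector machine is:

\[
\min_\theta \; C \sum_{i=1}^{m} \Bigl[\, y^{(i)} \, \mathrm{cost}_1\bigl(\theta^T x^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \, \mathrm{cost}_0\bigl(\theta^T x^{(i)}\bigr) \Bigr] \;+\; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2
\]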
So, for logistic regression, if we set a very large value of lambda, that means giving B a very high weight; here, if we set C to be a very small value, that corresponds to giving B a much larger weight than A. So this is just a different way of controlling the trade-off, just a different way of parameterizing how much we care about optimizing the first term versus how much we care about optimizing the second term.

If you want, you can think of the parameter C as playing a role similar to 1 over lambda. It's not that these two equations or these two expressions will be equal, that C equals 1 over lambda; rather, if C were equal to 1 over lambda, then these two optimization objectives would give you the same optimal value of theta. So, just filling that in, I'm going to cross out the lambda here and write in the constant C there. That gives us our overall optimization objective function for the support vector machine, and when you minimize that function, what you get are the parameters learned by the SVM.

Finally, unlike logistic regression, the support vector machine doesn't output a probability. Instead, we have this cost function, which we minimize to get the parameters theta, and what the support vector machine does is make a prediction of y being equal to 1 or 0 directly. So the hypothesis will predict 1 if theta transpose x is greater than or equal to 0, and predict 0 otherwise. And so, having learned the parameters theta, this is the form of the hypothesis for the support vector machine.
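As a minimal sketch of that objective and hypothesis in Python, reusing the unit-slope cost 1 and cost 0 assumed earlier, and assuming X carries an intercept column and y holds 0/1 labels (the intercept is left out of the regularization sum, following the usual convention in this course):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    """The SVM objective built in this video: C times the summed
    per-example costs, plus one half the sum of squared parameters."""
    z = X @ theta
    cost1 = np.maximum(0.0, 1.0 - z)  # applies to y = 1 examples
    cost0 = np.maximum(0.0, 1.0 + z)  # applies to y = 0 examples
    data_term = np.sum(y * cost1 + (1 - y) * cost0)
    reg_term = 0.5 * np.sum(theta[1:] ** 2)  # skip the intercept theta_0
    return C * data_term + reg_term

def svm_predict(theta, X):
    """The SVM hypothesis: predict 1 if theta^T x >= 0, else 0."""
    return (X @ theta >= 0).astype(int)
```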
So, that was a mathematical definition of what a support vector machine does. In the next few videos, let's try to get back to intuition about what this optimization objective leads to and what sorts of hypotheses an SVM will learn, and also talk about how to modify this just a little bit so it can learn complex nonlinear functions.