In this video, we'll talk about the normal equation, which for some linear regression problems gives us a much better way to solve for the optimal value of the parameters theta. Concretely, so far the algorithm we've been using for linear regression is gradient descent, where in order to minimize the cost function J of theta we run an iterative algorithm that takes many steps, multiple iterations of gradient descent, to converge to the global minimum. In contrast, the normal equation gives us a method to solve for theta analytically, so that rather than needing to run this iterative algorithm, we can instead just solve for the optimal value of theta all in one go: in basically one step you get to the optimal value right there.

It turns out the normal equation has some advantages and some disadvantages, but before we get to that and talk about when you should use it, let's get some intuition about what this method does. As a motivating example, let's take a very simplified cost function J of theta that's just a function of a real number theta. So for now, imagine that theta is just a scalar, a real value; it's just a number, rather than a vector. Imagine that we have a cost function J that's a quadratic function of this real-valued parameter theta, so J of theta looks like a parabola. Well, how do you minimize a quadratic function? For those of you who know a little bit of calculus, you may know that the way to minimize a function is to take its derivative and set the derivative equal to zero. So you take the derivative of J with respect to the parameter theta, you get some formula which I am not going to derive, you set that derivative equal to zero, and this allows you to solve for the value of theta that minimizes J of theta.
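As a small worked sketch of that one-dimensional case (my own illustration, not from the lecture slide), if the quadratic were

$$ J(\theta) = a\theta^2 + b\theta + c, \quad a > 0, \qquad \frac{dJ}{d\theta} = 2a\theta + b = 0 \;\;\Rightarrow\;\; \theta = -\frac{b}{2a}, $$

then setting the derivative to zero hands you the minimizing theta directly, with no iteration.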
That was the simpler case, where theta was just a real number. In the problem that we are actually interested in, theta is no longer just a real number; instead, it is this (n+1)-dimensional parameter vector, and the cost function J is a function of this vector, that is, of theta 0 through theta n, and the cost function is the squared-error cost function shown on the right. How do we minimize this cost function J? Calculus tells us that one way to do so is to take the partial derivative of J with respect to every parameter theta j in turn, and then set all of these to zero. If you do that and you solve for the values of theta 0, theta 1, up to theta n, then this gives you the values of theta that minimize the cost function J. If you actually work through the calculus and work through the solution for the parameters theta 0 through theta n, the derivation ends up being somewhat involved. What I am going to do in this video is not go through that derivation, which is kind of long and kind of involved; what I want to do is just tell you what you need to know in order to implement this process, so you can solve for the values of theta that correspond to where the partial derivatives are equal to zero, or equivalently, the values of theta that minimize the cost function J of theta. I realize that some of the comments I just made may make sense only to those of you who are already familiar with calculus.
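To make that condition explicit, using the squared-error cost function from the earlier videos (with the usual hypothesis h of theta of x equal to theta transpose x), the requirement is

$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta^T x^{(i)} - y^{(i)}\bigr)^2, \qquad \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\bigl(\theta^T x^{(i)} - y^{(i)}\bigr)\, x_j^{(i)} = 0 \quad \text{for } j = 0, 1, \ldots, n, $$

that is, a system of n+1 equations in the n+1 unknowns theta 0 through theta n.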
But if you're less familiar with calculus, don't worry about it. I'm just going to tell you what you need to know in order to implement this algorithm and get it to work. For the running example, let's say that I have m = 4 training examples. In order to implement this normal equation, what I'm going to do is the following. I'm going to take my data set, so here are my four training examples; in this case, let's assume that these four examples are all the data I have. What I am going to do is take my data set and add an extra column that corresponds to my extra feature, x0, which always takes on the value of 1. Then I'm going to construct a matrix called X that basically contains all of the features from my training data. So, concretely, here are all my features, and we're going to take all those numbers and put them into this matrix X, just copying the data over one column at a time. Then I am going to do something similar for the y's: I am going to take the values that I'm trying to predict and stack them into a vector, which I'll call y. So X is going to be an m by (n+1)-dimensional matrix, and y is going to be an m-dimensional vector, where m is the number of training examples and n is the number of features; the extra dimension, n+1, comes from this extra feature x0 that I added.
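As a minimal Octave sketch of that construction, assuming some illustrative feature values and prices (placeholders standing in for the numbers on the slide, not taken from this transcript):

    % Sketch: building X and y from m = 4 training examples.
    % The feature values and prices below are illustrative placeholders.
    sizes    = [2104; 1416; 1534; 852];   % feature x1, e.g. size of the house
    bedrooms = [5; 3; 3; 2];              % feature x2
    prices   = [460; 232; 315; 178];      % values y we want to predict

    m = length(prices);
    X = [ones(m, 1) sizes bedrooms];      % prepend the extra column x0 = 1
    y = prices;                           % m-dimensional vector of labels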
Finally, if you take your matrix X and your vector y, and you just compute theta as X transpose X inverse times X transpose y, this gives you the value of theta that minimizes your cost function. There was a lot that happened on that slide, and I worked through it using one specific example of one data set. Let me write this out in a slightly more general form, and later on in this video let me explain this equation a little bit more, since it is not yet entirely clear how to do this in general. In the general case, let's say we have m training examples, x(1), y(1) up to x(m), y(m), and n features. So each training example x(i) looks like an (n+1)-dimensional feature vector. The way I'm going to construct the matrix X, which is also called the design matrix, is as follows. Each training example gives me an (n+1)-dimensional feature vector. To construct my design matrix X, I take the first training example, which is a vector, take its transpose so it ends up being a long, flat row, and make x(1) transpose the first row of my design matrix. Then I take my second training example, x(2), take the transpose of that, and put that as the second row of X, and so on, down until my last training example; take the transpose of that, and that's the last row of my matrix X. And so that makes my matrix X an m by (n+1)-dimensional matrix.
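In symbols, the construction and the formula just described are

$$ X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}, \qquad \theta = (X^T X)^{-1} X^T y. $$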
As a concrete example, let's say I have only one feature, really, only one feature other than x0, which is always equal to 1. So if my feature vectors x(i) consist of this 1, which is x0, and then one real feature, like maybe the size of the house, then my design matrix X would be as follows. For the first row, I basically take the first example and take its transpose, so I end up with 1 and then x1(1). For the second row, I end up with 1 and then x1(2), and so on, down to 1 and then x1(m). And thus, this will be an m by 2-dimensional matrix. So, that's how to construct the matrix X. And the vector y, and sometimes I might write an arrow on top to denote that it is a vector, but very often I'll just write this as y, either way, is obtained by taking all the labels, all the correct prices of houses in my training set, and just stacking them up into an m-dimensional vector, and that's y. Finally, having constructed the matrix X and the vector y, we then just compute theta as (X transpose X) inverse times X transpose y. I just want to make sure that this equation makes sense to you and that you know how to implement it. So, concretely, what is (X transpose X) inverse? Well, it is the inverse of the matrix X transpose X. Concretely, if you were to set A equal to X transpose times X, so X transpose is a matrix, and X transpose times X gives you another matrix, and we call that matrix A, then (X transpose X) inverse is just the inverse of that matrix A, let's say A inverse. And so that's how you compute this thing: you compute X transpose X and then you compute its inverse.
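Written out, the one-feature case looks like this:

$$ x^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \end{bmatrix} \quad\Rightarrow\quad X = \begin{bmatrix} 1 & x_1^{(1)} \\ 1 & x_1^{(2)} \\ \vdots & \vdots \\ 1 & x_1^{(m)} \end{bmatrix} \;\in \mathbb{R}^{m \times 2}, \qquad A = X^T X, \quad (X^T X)^{-1} = A^{-1}. $$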
We haven't yet talked about Octave; we'll do so in a later set of videos, but in the Octave programming language, or the very similar MATLAB programming language, the command to compute this quantity, X transpose X inverse times X transpose y, is as follows. In Octave, X prime is the notation that you use to denote X transpose. And so the expression that's boxed in red computes X transpose times X; pinv is a function for computing the inverse of a matrix, so this computes (X transpose X) inverse; and then you multiply that by X transpose, and you multiply that by y. So you end up computing that formula, which I didn't prove, but it is possible to show mathematically, even though I'm not going to do so here, that this formula gives you the optimal value of theta, in the sense that if you set theta equal to this, that's the value of theta that minimizes the cost function J of theta for linear regression. One last detail: in an earlier video, I talked about feature scaling and the idea of getting features to be on similar ranges of values of each other.
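Putting the pieces together, a minimal self-contained Octave sketch of that command, again assuming illustrative placeholder data rather than the lecture's own numbers, might look like this:

    % Minimal Octave sketch of the normal equation on toy data
    % (the values are illustrative placeholders, not from the lecture slide).
    sizes  = [2104; 1416; 1534; 852];     % single feature: size of the house
    prices = [460; 232; 315; 178];        % labels y

    m = length(prices);
    X = [ones(m, 1) sizes];               % m x 2 design matrix (x0 = 1 column)
    y = prices;

    theta = pinv(X' * X) * X' * y;        % (X^T X)^(-1) X^T y

    % Use the fitted parameters to predict a hypothetical 1650 sq-ft house:
    predicted = [1 1650] * theta;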
If you are using this normal equation method, then feature scaling isn't actually necessary, and it is actually okay if, say, some feature x1 is between zero and one, some feature x2 ranges from zero to one thousand, and some feature x3 ranges from zero to ten to the minus five. If you are using the normal equation method, this is okay and there is no need to do feature scaling, although of course if you are using gradient descent, then feature scaling is still important. Finally, when should you use gradient descent and when should you use the normal equation method? Here are some of their advantages and disadvantages. Let's say you have m training examples and n features. One disadvantage of gradient descent is that you need to choose the learning rate alpha, and often this means running it a few times with different learning rates alpha and then seeing what works best, so that is sort of extra work and extra hassle. Another disadvantage of gradient descent is that it needs many more iterations. So, depending on the details, that could make it slower, although there's more to the story, as we'll see in a second. As for the normal equation, you don't need to choose any learning rate alpha. So that makes it really convenient, makes it simple to implement: you just run it and it usually just works. And you don't need to iterate, so you don't need to plot J of theta or check for convergence or take all those extra steps. So far, the balance seems to favor the normal equation. Here are some disadvantages of the normal equation, and some advantages of gradient descent. Gradient descent works pretty well even when you have a very large number of features. So, even if you have millions of features, you can run gradient descent and it will be reasonably efficient.
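To make the "choose alpha and iterate many times" contrast concrete, here is a rough, self-contained Octave sketch of batch gradient descent for linear regression; the data, learning rate, and iteration count are illustrative choices that you would normally have to tune, not values from the lecture:

    % Rough Octave sketch of batch gradient descent for linear regression,
    % to contrast with the one-line normal equation above.
    sizes  = [2104; 1416; 1534; 852];     % placeholder data
    prices = [460; 232; 315; 178];
    m = length(prices);

    % Unlike the normal equation, gradient descent benefits from feature scaling:
    x1 = (sizes - mean(sizes)) / std(sizes);
    X = [ones(m, 1) x1];
    y = prices;

    alpha = 0.1;                          % learning rate you would have to tune
    theta = zeros(2, 1);
    for iter = 1:400                      % many iterations, unlike one step
        theta = theta - (alpha / m) * (X' * (X * theta - y));
    end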
It will do something reasonable. In contrast, with the normal equation, in order to solve for the parameters theta, we need to compute this term, (X transpose X) inverse. This matrix X transpose X is an n by n matrix, if you have n features, because if you look at the dimension of X transpose and the dimension of X, and you figure out what the dimension of the product is, the matrix X transpose X is an n by n matrix, where n is the number of features. And for most implementations, the cost of inverting a matrix grows roughly as the cube of the dimension of the matrix. So, computing this inverse costs roughly on the order of n cubed time. Sometimes it's slightly faster than n cubed, but it's close enough for our purposes. So if n, the number of features, is very large, then computing this quantity can be slow, and the normal equation method can actually be much slower. So if n is large, then I might usually use gradient descent, because we don't want to pay this order n cubed cost. But if n is relatively small, then the normal equation might give you a better way to solve for the parameters. What do small and large mean? Well, if n is on the order of a hundred, then inverting a hundred-by-hundred matrix is no problem by modern computing standards. If n is a thousand, I would still use the normal equation method; inverting a thousand-by-thousand matrix is actually really fast on a modern computer. If n is ten thousand, then I might start to wonder. Inverting a ten-thousand-by-ten-thousand matrix starts to get kind of slow, and I might then start to lean in the direction of gradient descent, but maybe not quite: at n equals ten thousand, you can still sort of invert a ten-thousand-by-ten-thousand matrix. But if it gets much bigger than that, then I would probably use gradient descent.
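If you want to see that roughly cubic growth for yourself, a quick and unscientific timing experiment in Octave might look like the sketch below; the matrix sizes are arbitrary, and the actual timings will depend entirely on your machine and linear algebra libraries:

    % Quick-and-dirty timing sketch: the time to invert an n x n matrix
    % grows roughly like n^3. Sizes and timings are illustrative only.
    for n = [500 1000 2000]
        A = rand(n, n) + n * eye(n);      % random matrix, kept well-conditioned
        tic;
        Ainv = pinv(A);
        fprintf('n = %4d: %.2f seconds\n', n, toc);
    end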
So if n equals ten to the sixth, with a million features, then inverting a million-by-million matrix is going to be very expensive, and I would definitely favor gradient descent if you have that many features. So exactly how large the set of features has to be before you switch to gradient descent is hard to pin down with a strict number, but for me it is usually around ten thousand that I might start to consider switching over to gradient descent, or maybe some other algorithms that we'll talk about later in this class. To summarize, so long as the number of features is not too large, the normal equation gives us a great alternative method to solve for the parameter theta. Concretely, so long as the number of features is less than about a thousand, I would usually use the normal equation method rather than gradient descent. To preview some ideas that we'll talk about later in this course, as we get to more complex learning algorithms, for example when we talk about classification algorithms like the logistic regression algorithm, we'll see that for those algorithms the normal equation method actually does not work, and we will have to resort to gradient descent. So gradient descent is a very useful algorithm to know, both for linear regression with a large number of features and for some of the other algorithms that we'll see in this course, because for them the normal equation method just doesn't apply and doesn't work. But for this specific model of linear regression, the normal equation can give you an alternative that can be much faster than gradient descent.
So, depending on the details of your algorithm, depending on the details of the problem and how many features you have, both of these algorithms are well worth knowing about.