[MUSIC] Okay, so now we're onto the final important step of the derivation, which is taking the gradient. Because as we saw in the simple regression case, the gradient was important both for our closed-form solution as well as, of course, for the gradient descent algorithm.

So what's the gradient of our residual sum of squares in this multiple regression case? Well, it's the gradient of this matrix notation that we use for representing the residual sum of squares. And if you know gradients of vectors and matrices, which we're not assuming you do, so please don't think that you need to know this, the result is -2H^T(y - Hw): taking that big H matrix and turning it on its side, times y - Hw, which again is that vector of residuals.

And why is this the result? Well, I'm not gonna give a complete proof of this, I'm just gonna give some motivation. I'm going to walk through an analogy to the 1D case, and we'll see some patterns, and maybe you'll believe that that's the result in the matrix case. So, in particular, let's think about taking the derivative with respect to w of a function (y - hw)(y - hw), where these things are all scalars.
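As a quick sanity check on the matrix result (this is not from the lecture, just a numerical sketch with made-up data), we can compare -2H^T(y - Hw) against a finite-difference approximation of the gradient of the residual sum of squares:

```python
import numpy as np

# Check that the gradient of RSS(w) = (y - Hw)^T (y - Hw)
# matches the claimed closed form -2 H^T (y - Hw).
rng = np.random.default_rng(0)
H = rng.normal(size=(20, 3))   # feature matrix: 20 observations, 3 features
y = rng.normal(size=20)        # observed outputs
w = rng.normal(size=3)         # an arbitrary coefficient vector

def rss(w):
    r = y - H @ w              # vector of residuals
    return r @ r               # residual sum of squares

analytic = -2 * H.T @ (y - H @ w)

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
numeric = np.array([
    (rss(w + eps * e) - rss(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

assert np.allclose(analytic, numeric, atol=1e-4)
```

The two gradients agree to numerical precision, which is the pattern the 1D analogy below is meant to make plausible.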
So this is the 1D analog to this equation here, where the gradient is just the derivative with respect to this one parameter w. (That arrow is not quite pointing to w.)

Well, what's the derivative of this? It's equivalent to the derivative with respect to w of (y - hw) squared. And, like we've done multiple times in this course now, when I take the derivative with respect to w of some function raised to a power, by the chain rule, I bring that power down. Then I'm gonna multiply by the function (y - hw) raised to the power 2 minus 1. And then I'm gonna take the derivative of the inside. And what's the derivative of this function with respect to w? It's minus h. And so the result here is -2h(y - hw).

So we have the -2 in both cases, this little scalar h is this big matrix H in our case, and y - hw in the scalar case corresponds to this big vector y - Hw in matrix notation here. Okay, so just believe that this is the gradient. We didn't wanna bog you down in too much linear algebra, or too much in terms of derivatives. But if we have this notation, then we can derive everything we need for our two different solutions to fitting this model. [MUSIC]
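The two solutions mentioned here can be sketched from this one gradient. Setting -2H^T(y - Hw) to zero gives the normal equations H^T H w = H^T y for the closed form, and stepping opposite the gradient gives gradient descent. A minimal sketch with synthetic data (the step size and iteration count below are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 4))            # feature matrix
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = H @ w_true + 0.01 * rng.normal(size=50)  # outputs with small noise

# Solution 1, closed form: set the gradient -2 H^T (y - Hw) to zero,
# which gives the normal equations H^T H w = H^T y.
w_closed = np.linalg.solve(H.T @ H, H.T @ y)

# Solution 2, gradient descent: repeatedly step opposite the gradient.
w = np.zeros(4)
eta = 1e-3                               # step size (illustrative)
for _ in range(20000):
    grad = -2 * H.T @ (y - H @ w)
    w -= eta * grad

assert np.allclose(w_closed, w_true, atol=0.1)   # near the true coefficients
assert np.allclose(w, w_closed, atol=1e-3)       # both solutions agree
```

Both routes land on essentially the same coefficients, which is why deriving this single gradient expression is the key step.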