For now, let's assume that somebody gives you the lambda value that we wanna use in our ridge regression objective, and let's discuss algorithmically how we fit this model. So in particular, in this part, we're describing this gray box, our machine learning algorithm.

And to start with, just like we've done many times before, we're first gonna rewrite our objective in matrix notation, where we assume we have some N observations and we wanna write jointly what the objective is for all N of these observations.

So let's just recall what we did for our residual sum of squares term, where, for our model, we thought about stacking up all N observations and all the features associated with those N observations in this green matrix. And that matrix got multiplied by a vector of our regression coefficients, and then we had this additive noise per observation. So we wrote this in matrix notation as y = Hw + epsilon. Then, when we went to form our residual sum of squares, we showed that it was equivalent to the following, where we have (y - Hw)^T (y - Hw).

Okay, so this is our matrix notation for our residual sum of squares term, but now we wanna do a similar thing for our model complexity penalty, the one we added to our original objective to get our resulting ridge regression objective. So in particular, we want to write this two norm of w squared in vector notation. So this two norm of our w vector squared, we said, was equal to w0 squared + w1 squared + w2 squared + all the way up to our Dth feature weight squared. And this is equivalent to taking our w vector, transposed, meaning putting it as a row, and multiplying by the w vector itself. Because if we think about doing this multiplication, we're gonna get w0 * w0 + w1 * w1 + w2 * w2, and so on, and that's exactly equivalent. And so we can write this as w transpose w (here I'm trying to draw a thick w for the vector).

Okay, so this is our vector notation for this model complexity term. So putting it all together, our ridge regression total cost, for some N observations, can be written as follows, where we have (y - Hw)^T (y - Hw) + lambda * w^T w.
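To make this matrix notation concrete, here is a minimal NumPy sketch of the total cost above. The function name ridge_cost and the toy values for H, y, w, and lambda are just illustrative assumptions, not anything from the course materials.

```python
import numpy as np

def ridge_cost(H, y, w, lam):
    """Ridge regression total cost: (y - Hw)^T (y - Hw) + lambda * w^T w.
    H: (N, D+1) feature matrix, y: (N,) outputs,
    w: (D+1,) coefficients, lam: penalty strength lambda."""
    residual = y - H @ w        # one residual per observation
    rss = residual @ residual   # (y - Hw)^T (y - Hw)
    penalty = w @ w             # w^T w, the squared two norm of w
    return rss + lam * penalty

# Toy usage with made-up numbers (constant feature plus one input feature).
H = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 2.5])
w = np.array([0.3, 0.9])
print(ridge_cost(H, y, w, lam=0.1))
```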
Okay, now that we have this, we can start doing what we've done in the past, which is take the gradient, and we can think about either setting the gradient to zero to get a closed-form solution, or doing our gradient descent algorithm. And we're gonna walk through both of these approaches now.

So the first step is computing the gradient of this objective. Here I'm just writing exactly what we had on the previous slide, but now with these gradient signs in front. And as we've seen before, the gradient distributes across a sum, so we have the gradient of this first term plus the gradient of our model complexity term, where the first term is our measure of fit, that residual sum of squares term. And we know that the gradient of the residual sum of squares has the following form: -2 H^T (y - Hw). The question is, what's the gradient of this model complexity term?

What we see is that the gradient of this term is 2 * w. Why is it 2w? Well, instead of deriving it, I'll leave that as a little mini challenge for you guys. It's fairly straightforward to derive, just taking partials with respect to each w component. Just write w^T w as w0 squared + w1 squared + and so on, all the way up to wD squared, and then take a derivative just with respect to one of the ws. But for now, what I'm gonna do is just draw an analogy to the 1D case, where w^T w is analogous to just w squared, if w weren't a vector and were instead just a scalar. And what's the derivative of w squared? It's 2w. Okay, so proof by analogy here.
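As a rough sanity check on the gradient we just wrote down, here's a small NumPy sketch that compares the analytic expression -2 H^T (y - Hw) + 2 * lambda * w against a finite-difference approximation of the cost. The ridge_cost function is restated so the block runs on its own, and all names and toy inputs are illustrative assumptions rather than part of the lecture.

```python
import numpy as np

def ridge_cost(H, y, w, lam):
    """Ridge total cost: (y - Hw)^T (y - Hw) + lambda * w^T w."""
    residual = y - H @ w
    return residual @ residual + lam * (w @ w)

def ridge_gradient(H, y, w, lam):
    """Gradient of the ridge total cost:
    -2 H^T (y - Hw)   (measure of fit term)
    + 2 * lambda * w  (model complexity term)."""
    return -2.0 * H.T @ (y - H @ w) + 2.0 * lam * w

# Toy data and a central finite-difference check, one coordinate of w at a time.
H = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 2.5])
w = np.array([0.3, 0.9])
lam, eps = 0.1, 1e-6

numeric = np.array([
    (ridge_cost(H, y, w + eps * np.eye(len(w))[j], lam) -
     ridge_cost(H, y, w - eps * np.eye(len(w))[j], lam)) / (2 * eps)
    for j in range(len(w))
])
print(ridge_gradient(H, y, w, lam))  # analytic gradient
print(numeric)                       # should agree to roughly 1e-6
```

The same gradient expression is what would feed both approaches mentioned above: setting it to zero for a closed-form solution, or stepping in its negative direction for gradient descent.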