Okay, so let's consider the resulting objective, where I'm going to search over all possible w vectors to find the one that minimizes the residual sum of squares plus the square of the two norm of w. That's going to be my w hat, my estimated model parameters. But really what I'd like is to control how much I'm weighing the complexity of the model, as measured by the magnitude of my coefficients, relative to the fit of the model. I'd like to balance between these two terms, so I'm going to introduce another parameter, called a tuning parameter. In the model it's written as lambda, and it balances between this fit term and the magnitude term.

So let's see what happens if I choose lambda to be 0. Well, if I choose lambda to be 0, the magnitude term that we've introduced completely disappears, and my objective reduces to just minimizing the residual sum of squares, which is exactly the same as my objective before. So this reduces to minimizing the residual sum of squares of w, as before. That's our old solution, which leads to some w hat that I'm going to call w hat superscript LS, for least squares, because what we were doing before is commonly referred to as the least squares solution. So I'm going to specifically denote the parameters associated with that old procedure as the least squares parameters.

On the other hand, what if I crank that tuning parameter all the way up to infinity? Then I have a really, really massively large weight on this magnitude term, massively large being infinitely large, as large as you can possibly imagine. So what happens to any solution where w hat is not equal to 0? For solutions where w hat does not equal 0, what is the total cost? Well, I get something non-zero times infinity, plus my residual sum of squares, whatever that happens to be, and the sum of that is infinity. So my total cost is infinite.
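To make the lambda = 0 limit concrete, here is a minimal numpy sketch. It is not from the lecture; the synthetic data, the feature matrix H, and the function names are made up for illustration. It writes down the ridge cost RSS(w) + lambda * ||w||_2^2, solves it in closed form as (H^T H + lambda I)^{-1} H^T y, and checks that setting lambda = 0 returns exactly the ordinary least squares fit.

    import numpy as np

    # Hypothetical synthetic data, for illustration only (not from the lecture).
    rng = np.random.default_rng(0)
    H = rng.normal(size=(100, 5))                      # feature matrix
    y = H @ np.array([3.0, -2.0, 0.5, 1.0, -1.5]) + 0.1 * rng.normal(size=100)

    def ridge_objective(w, H, y, lam):
        """RSS(w) + lambda * ||w||_2^2: the total cost being minimized."""
        resid = y - H @ w
        return resid @ resid + lam * (w @ w)

    def ridge_solution(H, y, lam):
        """Closed-form minimizer: (H^T H + lambda * I)^{-1} H^T y."""
        d = H.shape[1]
        return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

    w_ridge0 = ridge_solution(H, y, lam=0.0)           # lambda = 0
    w_ls, *_ = np.linalg.lstsq(H, y, rcond=None)       # ordinary least squares
    print(np.allclose(w_ridge0, w_ls))                 # True: same solution

With lambda = 0 the penalty contributes nothing, so the closed form collapses to the usual least squares normal equations, which is exactly the point being made here.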
On the other hand, what if w hat is exactly equal to 0? If w hat equals 0, then the total cost is just the residual sum of squares of that zero vector. And that's some number, but it's not infinity. So the minimizing solution here is always going to be w hat equals 0, because that's the thing that minimizes the total cost over all possible w's.

Okay, so just to recap: if we put that tuning parameter all the way down to 0, we return to our previous least squares solution, and if we crank that parameter all the way up to infinity, then in that limit we get all of our coefficients being exactly 0.

But we're going to be operating in a regime where lambda is somewhere in between 0 and infinity. And in this case, we know that the magnitude of our estimated coefficients is going to be less than or equal to the magnitude of our least squares coefficients; in particular, their two norm will be less than or equal to the two norm of the least squares solution. But we also know it's going to be greater than or equal to 0. So we're going to be somewhere in between these two extremes.

And a key question is, what lambda do we actually want? How much do we want to bias away from our least squares solution, which was subject to potential overfitting, down to this really simple, most trivial model you can consider, which is nothing, no model? Well, not quite no model, but no coefficients in the model. What's the model if all the coefficients are 0? Just noise: we just have y equals epsilon, that noise term.

Okay, so we're going to think about somehow trading off between these two extremes. I also wanted to mention that this procedure is referred to as ridge regression, and it's also known as L2 regularization, because, for reasons we'll describe a bit more later in this module, we're regularizing the solution to our old objective using this L2 norm term.
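As a small numerical illustration of that in-between regime, here is a hedged sketch with made-up data, not from the lecture: sweeping lambda from 0 up to a very large value and printing the two norm of the closed-form ridge estimate shows it starting at the least squares magnitude and shrinking toward 0 as lambda grows.

    import numpy as np

    # Hypothetical data again, to trace how ||w_hat(lambda)||_2 moves from the
    # least squares magnitude (lambda near 0) down toward 0 (lambda very large).
    rng = np.random.default_rng(0)
    H = rng.normal(size=(100, 5))
    y = H @ np.array([3.0, -2.0, 0.5, 1.0, -1.5]) + 0.1 * rng.normal(size=100)

    for lam in [0.0, 0.1, 1.0, 10.0, 100.0, 1e4, 1e8]:
        # Closed-form ridge estimate for this value of lambda.
        w_hat = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
        print(f"lambda = {lam:>8g}   ||w_hat||_2 = {np.linalg.norm(w_hat):.4f}")

The printed norms interpolate between the two extremes discussed above, which is exactly the range of solutions the choice of lambda lets us trade off between.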