[MUSIC] Now we've seen multiple ways in which overfitting can be bad for classification, especially for logistic regression, and how very large parameters can be a really bad thing. So what we're going to do next is introduce a notion of regularization, just like we did in regression, to penalize these really large parameters in order to get a more reasonable outcome.

We're still talking about the same logistic regression model, where we take data, do some feature extraction, and fit this model: one over one plus e to the minus w transpose x. But the quality metric for this machine learning algorithm is going to change, to push us away from really large coefficients. In particular, we're going to balance how well we fit the data with the magnitude of the coefficients, so as to avoid these massive coefficients.

In the context of logistic regression, we're balancing two things to measure total quality. The measure of fit, which is the data likelihood, the thing where bigger is better, how well I fit the data; and then the magnitude of the coefficients, where coefficients that are too big are problematic. So we have one thing that we want to be big, the likelihood, and another thing we want to be small, the magnitude of the coefficients, and we're going to optimize the quality minus this complexity metric. And so we want to balance between the two.

So what do those mean? Let's instantiate them more clearly in the context of logistic regression. The quality metric in logistic regression is the data likelihood, and we talked about it quite a bit in the previous module. Now, one little side note that we're going to use in this module: we don't typically optimize the data likelihood directly. We optimize the log of the data likelihood, because that makes the math a lot simpler and it makes the gradients behave a lot better. In the optional section of the previous module, we talked about this quite a bit and explored it in detail. If you skipped that section, just think of the log as a way to make those numbers less extreme. So we take the log. So the measure of quality is going to be the log of the data likelihood, and we're going to make that log as big as possible.
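To make the model and its quality metric concrete, here is a minimal sketch of the logistic function and the log of the data likelihood for binary labels in {+1, -1}. The function names (`sigmoid`, `log_likelihood`) and the NumPy setup are illustrative assumptions for this sketch, not code from the course itself.

```python
import numpy as np

def sigmoid(scores):
    # P(y = +1 | x, w) = 1 / (1 + exp(-w^T x))
    return 1.0 / (1.0 + np.exp(-scores))

def log_likelihood(X, y, w):
    # X: (N, D) feature matrix, y: (N,) labels in {+1, -1}, w: (D,) coefficients.
    # Log of the data likelihood: sum over data points of log P(y_i | x_i, w).
    scores = X @ w
    # For labels in {+1, -1}, P(y_i | x_i, w) = sigmoid(y_i * w^T x_i),
    # so the per-point contribution is log(sigmoid(y_i * score_i)).
    return np.sum(np.log(sigmoid(y * scores)))
```

This is the quantity we want to be as big as possible; the regularized objective discussed next subtracts a penalty on the coefficients from it.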
So we see that the likelihood is what we're going to try to make big when we optimize. But at the same time, we're trying to make something small, which is the magnitude of the coefficients. There are different metrics for the magnitude of the coefficients, just like we explored in regression, and there are two that we're going to use in this module. One is the sum of the squares, also called the square of the L2 norm. It's denoted by ||w||_2^2, and it's very simple: it's the square of the first coefficient, plus the square of the second coefficient, plus the square of the third coefficient, and so on, plus the square of the last coefficient, w_D squared. That's if you use the L2 norm.

We can also use the sum of the absolute values, also called the L1 norm, and it's denoted by ||w||_1. Instead of the squares, it's the absolute value of w_0, plus the absolute value of w_1, plus the absolute value of w_2, all the way to the absolute value of the last coefficient.

Now, in the regression course we explored these notions quite a bit, but the main reason we take the square or the absolute value is that we want to make sure to penalize highly positive and highly negative coefficients in the same way; squaring a value or taking its absolute value makes the contribution positive, and we want to make these norms as small as possible. So both of these approaches penalize large weights. Actually, I should say penalize large coefficients.

However, as we saw in the regression course, by using the L1 norm I'm also going to get what's called a sparse solution. Sparsity doesn't only play a role in regression; it also plays a role in classification. In this module we're going to explore a little bit of both of these concepts, and we're going to start with the L2 norm, the sum of the squares.

So now that we've reviewed these concepts, we can formalize the problem, the quality that we're trying to maximize. I want to maximize, over my choice of parameters w, a trade-off between two things: the likelihood of my data, actually the log of it, so the log of the data likelihood, and some notion of penalty for the magnitude of the coefficients, and we'll start with this L2 penalty notion.
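As a small sketch of the two penalties and of the trade-off just described, the code below reuses the `log_likelihood` sketch above. The helper names and the trade-off parameter `l2_weight` (the knob that balances fit against coefficient magnitude, which this section has not yet introduced by name) are assumptions for illustration, not the course's own notation.

```python
def l2_penalty(w):
    # Square of the L2 norm: w_0^2 + w_1^2 + ... + w_D^2
    return np.sum(w ** 2)

def l1_penalty(w):
    # L1 norm: |w_0| + |w_1| + ... + |w_D|
    return np.sum(np.abs(w))

def regularized_quality(X, y, w, l2_weight):
    # Total quality to maximize: log of the data likelihood
    # minus a weighted L2 penalty on the coefficients.
    # (Whether to penalize the intercept is a modeling choice not covered here.)
    return log_likelihood(X, y, w) - l2_weight * l2_penalty(w)
```

Swapping `l2_penalty` for `l1_penalty` in the objective gives the L1-regularized variant, which tends to produce the sparse solutions mentioned above.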
[MUSIC]