In this video, we'll discuss how to regularize our models, that is, how to reduce their complexity so they don't overfit.

As you remember, there was an example with eight data points and eight parameters in our linear model. That model overfitted to the data, and its parameters were very large. But if we use an appropriate model for the same problem, in this case a model with three features, x, x squared, and x cubed, then the model is good: it fits the target function, the green line, and its parameters are not very large.

We can exploit this property, that overfitted models have large weights and good models do not, to fight overfitting. To do it, we modify our loss function: we take the initial loss L(w) and add a regularizer R(w) that penalizes the model for large weights. The regularizer enters with a coefficient lambda, the regularization strength, which controls the trade-off between model quality on the training set and model complexity. We then minimize this new loss, L(w) + lambda * R(w).

For example, we can use the L2 penalty as a regularizer: it simply sums the squares of the parameters, not including the bias, which is important. This is a very simple penalty, and it is differentiable, so we can use any gradient descent method to optimize the new loss. The L2 regularizer drives all the coefficients closer to zero, penalizing the model for very large weights.

Actually, it can be shown that this unconstrained optimization problem is equivalent to a constrained one: we minimize the initial loss L(w) with respect to w under the constraint that the squared L2 norm of the weight vector is no larger than C, where there is a one-to-one correspondence between C and the regularization strength lambda. Geometrically, we select the point closest to the minimum of the loss that lies inside a ball centered at zero whose radius is determined by C.
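As a concrete illustration, here is a minimal NumPy sketch of an L2-regularized mean squared error and one gradient descent step on it. The function names and the choice of MSE as the base loss are assumptions for this sketch, not something fixed by the lecture; the point is that the penalty adds a 2 * lambda * w term to the weight gradient while the bias is left unpenalized.

```python
import numpy as np

def l2_regularized_mse(w, b, X, y, lam):
    """Mean squared error plus an L2 penalty on the weights (bias excluded)."""
    residuals = X @ w + b - y
    data_loss = np.mean(residuals ** 2)
    penalty = lam * np.sum(w ** 2)          # bias b is not penalized
    return data_loss + penalty

def gradient_step(w, b, X, y, lam, lr=0.1):
    """One gradient descent step on the regularized loss."""
    n = len(y)
    residuals = X @ w + b - y
    grad_w = 2 * X.T @ residuals / n + 2 * lam * w   # penalty contributes 2*lam*w
    grad_b = 2 * np.mean(residuals)                  # no penalty term for the bias
    return w - lr * grad_w, b - lr * grad_b

# Usage sketch on random data (shapes only; not the lecture's example):
X = np.random.randn(8, 3)
y = np.random.randn(8)
w, b = np.zeros(3), 0.0
for _ in range(100):
    w, b = gradient_step(w, b, X, y, lam=1.0)
```

Larger lambda shrinks the weights more aggressively; lambda = 0 recovers the unregularized loss.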
If we return to our example with eight data points and a model of eighth degree and apply the L2 penalty with regularization coefficient one, we get this model. It's much simpler than the previous one, it fits the true target function well, and the coefficients are not very large. So the L2 penalty does its job.

There is another penalty function, the L1 penalty: we take the absolute values of all weights and sum them, and once again we don't include the bias in this sum. This regularizer is not differentiable, because the absolute value has no derivative at zero, so we need more advanced optimization techniques to minimize the loss. But this penalty has a nice property: it leads to sparse solutions. It drives some coefficients, some parameters, exactly to zero, so the model depends only on a subset of the features. Once again, we can show that this unconstrained optimization problem is equivalent to a constrained one, where we minimize the initial loss L(w) under the constraint that the L1 norm of the weight vector is no larger than C.

In our example with eight data points, if we use the L1 penalty with coefficient 0.01, we get this solution. It, too, is good: it fits the data well, and four of the eight coefficients are exactly zero, so the solution is indeed sparse. A short code sketch reproducing both penalties on this kind of example appears below.

Of course, there are other regularization techniques. For example, we can reduce the dimensionality of our data: remove some redundant features, or apply principal component analysis to get new, more useful features. We can augment our data: if we work with images, we can distort, flip, or rotate them, so that we have more data and it's harder for the model to overfit. We can use dropout, which we'll discuss later in the course. We can use early stopping: if we use gradient descent, we can stop, for example, at the hundredth iteration, so the model doesn't get the chance to overfit; it stops early and underfits slightly instead. And of course, we can just collect more data: the more data we have, the harder it is for the model to overfit, and on large samples it should generalize and learn the real dependencies in the data.
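Here is the promised sketch of how one might reproduce the two penalized fits with scikit-learn's Ridge (L2) and Lasso (L1). The eight data points and the cubic target are made-up stand-ins for the lecture's example; both estimators fit an unpenalized intercept by default, matching the rule that the bias is excluded from the penalty.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso

# Hypothetical stand-in for the lecture's eight points: a noisy cubic target.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 8)
y = x ** 3 - 0.5 * x + rng.normal(scale=0.05, size=8)

# Degree-8 polynomial features: as many parameters as points, prone to overfitting.
X = PolynomialFeatures(degree=8, include_bias=False).fit_transform(x[:, None])

ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty, strength 1
lasso = Lasso(alpha=0.01).fit(X, y)    # L1 penalty, strength 0.01

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Zeroed by Lasso:", int(np.sum(lasso.coef_ == 0)), "of", len(lasso.coef_))
```

With L2 all coefficients shrink but stay nonzero, while L1 typically sets several of them exactly to zero; the exact number depends on the data and on the penalty strength.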
In this video, we discussed regularization techniques: the L2 and L1 penalties, which work well for linear models, and we mentioned some other regularization techniques that are better suited to larger models, for example neural networks. We'll discuss those techniques in detail in the following weeks.