In this video, we'll discuss how to regularize our models, that is, how to reduce their complexity so they don't overfit.

As you remember, there was an example with eight data points and eight parameters in our linear model. That model overfitted to the data, and its parameters were very large. But if we use an appropriate model for the same problem, in this case a model with three features, x, x squared, and x cubed, then the model is good: it fits the target function, the green line, and its parameters are not very large.

We can exploit this property, that overfitted models have large weights and good models do not, to fight overfitting. To do it, we modify our loss function: we take the initial loss L(w) and add a regularizer R(w) that penalizes the model for large weights. The regularizer enters with a coefficient lambda, the regularization strength, which controls the trade-off between model quality on the training set and model complexity. We then minimize this new loss, L(w) + lambda * R(w).

For example, we can use the L2 penalty as a regularizer: it simply sums the squares of the parameters, not including the bias, which is important. This is a very simple penalty, and it is differentiable, so we can use any gradient descent method to optimize the new loss. The L2 regularizer drives all the coefficients closer to zero, penalizing the model for very large weights.

Actually, it can be shown that this unconstrained optimization problem is equivalent to a constrained one: we minimize the initial loss L(w) with respect to w under the constraint that the squared L2 norm of the weight vector is no larger than C, where there is a one-to-one correspondence between C and the regularization strength lambda. Geometrically, we select the point closest to the minimum of the loss that lies inside a ball centered at zero whose radius is determined by C.
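As a concrete illustration, here is a minimal NumPy sketch of an L2-regularized mean squared error and one gradient descent step on it. The function names and the choice of MSE as the base loss are assumptions for this sketch, not something fixed by the lecture; the point is that the penalty adds a 2 * lambda * w term to the weight gradient while the bias is left unpenalized.

```python
import numpy as np

def l2_regularized_mse(w, b, X, y, lam):
    """Mean squared error plus an L2 penalty on the weights (bias excluded)."""
    residuals = X @ w + b - y
    data_loss = np.mean(residuals ** 2)
    penalty = lam * np.sum(w ** 2)          # bias b is not penalized
    return data_loss + penalty

def gradient_step(w, b, X, y, lam, lr=0.1):
    """One gradient descent step on the regularized loss."""
    n = len(y)
    residuals = X @ w + b - y
    grad_w = 2 * X.T @ residuals / n + 2 * lam * w   # penalty contributes 2*lam*w
    grad_b = 2 * np.mean(residuals)                  # no penalty term for the bias
    return w - lr * grad_w, b - lr * grad_b

# Usage sketch on random data (shapes only; not the lecture's example):
X = np.random.randn(8, 3)
y = np.random.randn(8)
w, b = np.zeros(3), 0.0
for _ in range(100):
    w, b = gradient_step(w, b, X, y, lam=1.0)
```

Larger lambda shrinks the weights more aggressively; lambda = 0 recovers the unregularized loss.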
If we return to our example with eight data points and a model of eighth degree and apply the L2 penalty with regularization coefficient one, we get this model. It's much simpler than the previous one, it fits the true target function well, and the coefficients are not very large. So the L2 penalty does its job.

There is another penalty function, the L1 penalty: we take the absolute values of all weights and sum them, and once again we don't include the bias in this sum. This regularizer is not differentiable, because the absolute value has no derivative at zero, so we need more advanced optimization techniques to minimize the loss. But this penalty has a nice property: it leads to sparse solutions. It drives some coefficients, some parameters, exactly to zero, so the model depends only on a subset of the features. Once again, we can show that this unconstrained optimization problem is equivalent to a constrained one, where we minimize the initial loss L(w) under the constraint that the L1 norm of the weight vector is no larger than C.

In our example with eight data points, if we use the L1 penalty with coefficient 0.01, we get this solution. It, too, is good: it fits the data well, and four of the eight coefficients are exactly zero, so the solution is indeed sparse. A short code sketch reproducing both penalties on this kind of example appears below.

Of course, there are other regularization techniques. For example, we can reduce the dimensionality of our data: remove some redundant features, or apply principal component analysis to get new, more useful features. We can augment our data: if we work with images, we can distort, flip, or rotate them, so that we have more data and it's harder for the model to overfit. We can use dropout, which we'll discuss later in the course. We can use early stopping: if we use gradient descent, we can stop, for example, at the hundredth iteration, so the model doesn't get the chance to overfit; it stops early and underfits slightly instead. And of course, we can just collect more data: the more data we have, the harder it is for the model to overfit, and on large samples it should generalize and learn the real dependencies in the data.
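Here is the promised sketch of how one might reproduce the two penalized fits with scikit-learn's Ridge (L2) and Lasso (L1). The eight data points and the cubic target are made-up stand-ins for the lecture's example; both estimators fit an unpenalized intercept by default, matching the rule that the bias is excluded from the penalty.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso

# Hypothetical stand-in for the lecture's eight points: a noisy cubic target.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 8)
y = x ** 3 - 0.5 * x + rng.normal(scale=0.05, size=8)

# Degree-8 polynomial features: as many parameters as points, prone to overfitting.
X = PolynomialFeatures(degree=8, include_bias=False).fit_transform(x[:, None])

ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty, strength 1
lasso = Lasso(alpha=0.01).fit(X, y)    # L1 penalty, strength 0.01

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Zeroed by Lasso:", int(np.sum(lasso.coef_ == 0)), "of", len(lasso.coef_))
```

With L2 all coefficients shrink but stay nonzero, while L1 typically sets several of them exactly to zero; the exact number depends on the data and on the penalty strength.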
In this video, we discussed regularization techniques: the L2 and L1 penalties, which work well for linear models, and we mentioned some other regularization techniques that are better suited to larger models, for example neural networks. We'll discuss those techniques in detail in the following weeks.