Why does regularization help with overfitting? Why does it help with reducing variance problems? Let's go through a couple of examples to gain some intuition about how it works.

Recall the high bias and high variance pictures from our earlier video, which looked something like this. Now let's fit a large and deep neural network. I know I haven't drawn this one too large or too deep, but think of it as some neural network that is currently overfitting. You have some cost function J of W, b equal to the sum of the losses, and what we did for regularization was add this extra term that penalizes the weight matrices for being too large. That was the Frobenius norm. So why is it that shrinking the L2 norm, or the Frobenius norm, of the parameters might cause less overfitting?

One piece of intuition is that if you crank the regularization parameter lambda up to be really, really big, you'll be really incentivized to set the weight matrices W reasonably close to zero. So one piece of intuition is that maybe it sets the weights so close to zero for a lot of hidden units that it basically zeroes out a lot of the impact of those hidden units. If that's the case, then this much simplified neural network becomes a much smaller neural network. In fact, it is almost like a logistic regression unit, but stacked multiple layers deep. And so that would take you from this overfitting case much closer to the high bias case on the left. But hopefully there's an intermediate value of lambda that results in something closer to the just-right case in the middle.

So the intuition is that by cranking up lambda to be really big, you set W close to zero. In practice that isn't exactly what happens, but you can think of it as zeroing out, or at least reducing, the impact of a lot of the hidden units, so you end up with what might feel like a simpler network, one that gets closer and closer to logistic regression. The intuition that a bunch of hidden units get completely zeroed out isn't quite right, though. What actually happens is that the network still uses all of its hidden units, but each of them just has a much smaller effect.
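To make that regularized cost concrete, here is a minimal NumPy sketch of J(W, b) = (1/m) * sum of the losses + (lambda / 2m) * sum over layers of the squared Frobenius norm of W. The binary cross-entropy loss and the parameter dictionary layout (keys "W1", "b1", ..., "WL", "bL") are illustrative assumptions, not code from the course.

```python
import numpy as np

def l2_regularized_cost(AL, Y, parameters, lambd):
    """Cross-entropy cost plus the L2 (Frobenius norm) penalty on all weight matrices."""
    m = Y.shape[1]

    # Unregularized part: average binary cross-entropy loss over the m examples.
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m

    # Regularization part: (lambda / 2m) * sum_l ||W^[l]||_F^2.
    L = len(parameters) // 2  # number of layers, assuming keys W1..WL and b1..bL
    frobenius_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                        for l in range(1, L + 1))
    l2_penalty = (lambd / (2 * m)) * frobenius_sum

    return cross_entropy + l2_penalty
```

A larger lambd makes the second term dominate, which is exactly what pushes gradient descent toward smaller weight matrices.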
But you do end up with a simpler network, as if you had a smaller network, and that smaller network is therefore less prone to overfitting. A lot of this intuition will feel more concrete when you implement regularization in the programming exercise, where you'll actually see some of these variance reduction results yourself.

Here's another attempt at additional intuition for why regularization helps prevent overfitting. For this, I'm going to assume that we're using the tanh activation function, which looks like this: g of z equals tanh of z. If that's the case, notice that so long as z is quite small, so z takes on only a smallish range of values, maybe around here, then you're just using the linear regime of the tanh function. It's only if z is allowed to wander up to larger values, or down to more negative values, that the activation function starts to become less linear.

So the intuition you might take away from this is that if lambda, the regularization parameter, is large, then your parameters will be relatively small, because they are penalized for being large in the cost function. And if the weights W are small, then because z equals W times the previous layer's activations, plus b, if W tends to be very small, then z will also be relatively small. In particular, if z ends up taking relatively small values, just in this narrow range, then g of z will be roughly linear. So it's as if every layer is roughly linear, as if it were just linear regression. And we saw in course one that if every layer is linear, then your whole network is just a linear network. So even a very deep network with a linear activation function is, in the end, only able to compute a linear function. It's not able to fit those very complicated, highly non-linear decision boundaries that let it really overfit to data sets, like the overfitting, high-variance case we saw on the previous slide.
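Here is a small numerical sketch of that "linear regime" point: for small inputs tanh(z) is almost exactly z, and heavily regularized (small) weights keep z in that range. The specific values and shapes below are illustrative assumptions.

```python
import numpy as np

# Near z = 0, tanh(z) is almost exactly z, so the unit behaves linearly.
small_z = np.linspace(-0.1, 0.1, 5)
large_z = np.linspace(-3.0, 3.0, 5)

print(np.max(np.abs(np.tanh(small_z) - small_z)))  # ~3e-4: effectively linear
print(np.max(np.abs(np.tanh(large_z) - large_z)))  # ~2.0: strongly non-linear (saturated)

# With small weights, as if lambda were large, z = W a stays in that linear regime.
rng = np.random.default_rng(0)
a_prev = rng.standard_normal((4, 1))
W_small = 0.01 * rng.standard_normal((3, 4))
z = W_small @ a_prev                     # ignoring b here, as in the summary that follows
print(np.max(np.abs(np.tanh(z) - z)))    # tiny: this layer is nearly a linear map
```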
So just to summarize: if the regularization parameter lambda becomes very large, the parameters W become very small, so z will be relatively small, ignoring the effects of b for now; really, I should say z takes on a small range of values. And so the activation function, if it's tanh, say, will be relatively linear. So your whole neural network will be computing something not too far from a big linear function, which is therefore a pretty simple function, rather than a very complex, highly non-linear function, and so it is also much less able to overfit. Again, when you implement regularization for yourself in the programming exercise, you'll be able to see some of these effects.

Before wrapping up our discussion of regularization, I just want to give you one implementational tip. When implementing regularization, we took our definition of the cost function J and modified it by adding this extra term that penalizes the weights for being too large. So if you implement gradient descent, one of the steps you use to debug it is to plot the cost function J as a function of the number of iterations of gradient descent, and you want to see that J decreases monotonically after every iteration. If you're implementing regularization, then please remember that J now has this new definition. If you plot the old definition of J, just the first term, then you might not see it decrease monotonically. So to debug gradient descent, make sure you're plotting the new definition of J that includes the second term as well; otherwise you might not see J decrease monotonically on every single iteration. There's a short sketch of this check below.

So that's it for L2 regularization, which is actually the regularization technique I use the most in training deep learning models. In deep learning there is another sometimes-used regularization technique called dropout regularization. Let's take a look at that in the next video.
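As a sketch of that debugging tip: record the full regularized cost J at each iteration and plot it against the iteration number. The helpers forward_propagation, backward_propagation_with_regularization, and update_parameters below are hypothetical stand-ins for whatever training code you already have; l2_regularized_cost is the cost sketch from earlier in this section.

```python
import matplotlib.pyplot as plt

costs = []
for i in range(num_iterations):
    AL, caches = forward_propagation(X, parameters)                         # hypothetical helper
    cost = l2_regularized_cost(AL, Y, parameters, lambd)                    # full J, including the L2 term
    grads = backward_propagation_with_regularization(AL, Y, caches, lambd)  # hypothetical helper
    parameters = update_parameters(parameters, grads, learning_rate)        # hypothetical helper
    costs.append(cost)

# The regularized cost J should decrease monotonically with each iteration;
# plotting only the unregularized first term may not show a monotonic decrease.
plt.plot(costs)
plt.xlabel("iteration of gradient descent")
plt.ylabel("regularized cost J")
plt.show()
```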