Dropout does this seemingly crazy thing of randomly knocking out units in your network. Why does it work so well as a regularizer? Let's gain some better intuition.

In the previous video, I gave the intuition that dropout randomly knocks out units in your network, so it's as if on every iteration you're working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect.

Here's a second intuition: let's look at it from the perspective of a single unit, say this one. For this unit to do its job, it has four inputs, and it needs to generate some meaningful output. Now, with dropout, those inputs can get randomly eliminated. Sometimes these two units will get eliminated, sometimes a different unit will get eliminated. So what this means is that this unit, which I'm circling in purple, can't rely on any one feature, because any one of its inputs could go away at random. So this unit would be reluctant to put all of its bets on, say, just this one input; it will be reluctant to put too much weight on any single input, because that input could go away. This unit will therefore be more motivated to spread out its weights and give a little bit of weight to each of its four inputs. And by spreading out the weights, this tends to have the effect of shrinking the squared norm of the weights. So, similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights and performs some of that regularization, which helps prevent overfitting. In fact, dropout can formally be shown to be an adaptive form of L2 regularization, where the L2 penalty on different weights is different, depending on the size of the activations being multiplied by those weights. But to summarize, it is possible to show that dropout has a similar effect to L2 regularization; it's only that the L2 penalty applied to different weights can be a little different, and is even more adaptive to the scale of the different inputs.
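To make this intuition a bit more concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture) of a single unit whose four inputs are hit by an inverted-dropout mask like the one from the previous video; the names keep_prob, a_prev, w, and b, and the particular numbers, are just placeholders:

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8                             # chance of keeping each input unit

a_prev = np.array([0.5, 1.2, -0.3, 0.9])    # the four inputs to the circled unit
w = np.array([0.1, 0.4, -0.2, 0.3])         # the unit's weights, one per input
b = 0.05

# Inverted dropout: randomly zero out inputs, then divide by keep_prob
# so the expected value of the weighted sum stays roughly unchanged.
mask = np.random.rand(4) < keep_prob
a_dropped = (a_prev * mask) / keep_prob

z = np.dot(w, a_dropped) + b
print("kept inputs:", mask, " pre-activation z:", z)
```

Because any entry of the mask can be zero on a given iteration, the unit is pushed to spread its weight across all four inputs rather than leaning heavily on one of them.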
One more detail for when you're implementing dropout. Here's a network where you have three input features; this first hidden layer has seven units, and then the following layers have seven, three, two, and one units. One of the parameters we had to choose was keep_prob, which is the probability of keeping a unit in each layer. It is also feasible to vary keep_prob by layer. For the first layer, your weight matrix W1 will be 3 by 7, your second weight matrix W2 will be 7 by 7, W3 will be 7 by 3, and so on. So W2 is actually the biggest weight matrix, because the largest set of parameters is in W2, which is 7 by 7. To reduce overfitting of that matrix, maybe for this layer, I guess this is layer two, you might have a keep_prob that's relatively low, say 0.5, whereas for other layers, where you might worry less about overfitting, you could have a higher keep_prob, maybe 0.7. And for layers where you don't worry about overfitting at all, you can have a keep_prob of 1.0. For clarity, the numbers I'm drawing in the purple boxes are possible keep_probs for the different layers. Notice that a keep_prob of 1.0 means that you're keeping every unit, so you're really not using dropout for that layer. But for layers where you're more worried about overfitting, really the layers with a lot of parameters, you can set keep_prob to be smaller to apply a more powerful form of dropout. It's a bit like cranking up the regularization parameter lambda in L2 regularization, where you try to regularize some layers more than others. Technically, you can also apply dropout to the input layer, where you have some chance of just zeroing out one or more of the input features, although in practice you usually don't do that very often. A keep_prob of 1.0 is quite common for the input layer; you could also use a very high value, maybe 0.9, but it's much less likely that you'd want to eliminate half of the input features. So usually keep_prob, if you apply dropout to the input layer at all, will be a number close to one.
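As a rough sketch of varying keep_prob by layer (again my own illustration, not code from the course; I'm assuming the convention from earlier videos that W[l] has shape (n[l], n[l-1]), ReLU hidden activations, and a sigmoid output, and the particular keep_prob values are just examples):

```python
import numpy as np

np.random.seed(1)
layer_sizes = [3, 7, 7, 3, 2, 1]               # input, hidden layers, output
keep_probs  = [1.0, 0.7, 0.5, 0.7, 1.0, 1.0]   # 1.0 means no dropout on that layer

# Small random parameters, assuming W[l] has shape (n[l], n[l-1]).
params = [(np.random.randn(n, n_prev) * 0.01, np.zeros((n, 1)))
          for n_prev, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def dropout(a, keep_prob):
    """Inverted dropout: zero units with probability 1 - keep_prob, then rescale."""
    if keep_prob >= 1.0:
        return a
    mask = np.random.rand(*a.shape) < keep_prob
    return (a * mask) / keep_prob

def forward_with_dropout(x):
    a = dropout(x, keep_probs[0])              # input layer: keep_prob usually 1.0
    for l, (W, b) in enumerate(params, start=1):
        z = W @ a + b
        a = np.maximum(0, z) if l < len(params) else 1 / (1 + np.exp(-z))
        a = dropout(a, keep_probs[l])          # this layer's own keep_prob
    return a

x = np.random.randn(3, 5)                      # 3 features, 5 examples
print(forward_with_dropout(x).shape)           # (1, 5)
```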
So, just to summarize: if you're more worried about some layers overfitting than others, you can set a lower keep_prob for those layers than for the others. The downside is that this gives you even more hyperparameters to search over using cross-validation. One alternative is to have some layers where you apply dropout and some layers where you don't, and then have just one hyperparameter, which is the keep_prob for the layers where you do apply dropout.

Before we wrap up, just a couple of implementational tips. Many of the first successful implementations of dropout were in computer vision. In computer vision, the input size is so big, since you're inputting all these pixels, that you almost never have enough data, so dropout is very frequently used in computer vision, and there are some computer vision researchers who pretty much always use it, almost as a default. But really, the thing to remember is that dropout is a regularization technique; it helps prevent overfitting. So unless my algorithm is overfitting, I wouldn't actually bother to use dropout, which is why it's used somewhat less often in other application areas. It's just that with computer vision you usually don't have enough data, so you're almost always overfitting, which is why some computer vision researchers swear by dropout. But their intuition doesn't always generalize, I think, to other disciplines.

One big downside of dropout is that the cost function J is no longer well defined. On every iteration, you are randomly killing off a bunch of nodes, so if you are double-checking the performance of gradient descent, it's actually harder to verify that you have a well-defined cost function J that is going downhill on every iteration, because the cost function J that you're optimizing is actually less well defined, or is certainly harder to calculate. So you lose this debugging tool of plotting a graph like this. What I usually do is turn off dropout, that is, set keep_prob equal to one, run my code and make sure that J is monotonically decreasing, and then turn on dropout and hope that I didn't introduce any bugs into my code while adding dropout. Because you need other ways, I guess, besides plotting these figures, to make sure that your code is working with gradient descent, and that it's working even with dropout.
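To turn that tip into something mechanical, here is a small sketch of the kind of sanity check I mean (my own illustration; the `train` call shown in the comment is hypothetical): run gradient descent with every keep_prob set to 1.0, record the cost after each iteration, and confirm it never goes up before switching dropout back on.

```python
def check_costs_decrease(costs, tolerance=1e-8):
    """With dropout off (all keep_probs = 1.0), the cost J should go
    downhill on essentially every iteration of gradient descent."""
    for i in range(1, len(costs)):
        if costs[i] > costs[i - 1] + tolerance:
            raise AssertionError(
                f"cost rose at iteration {i}: {costs[i-1]:.6f} -> {costs[i]:.6f}")
    print(f"OK: cost decreased monotonically over {len(costs)} iterations")

# Hypothetical usage with a training loop that returns per-iteration costs:
#   costs = train(X, Y, keep_probs=[1.0] * 6, num_iterations=1000)
#   check_costs_decrease(costs)
check_costs_decrease([0.693, 0.651, 0.640, 0.612])   # toy example
```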
So with that, there are still a few more regularization techniques that are worth knowing about. Let's talk about a few more such techniques in the next video.