In the previous videos, we put together almost all the pieces you need in order to implement and train a neural network. There's just one last idea I need to share with you, which is the idea of random initialization.

When you're running an algorithm like gradient descent, or also the advanced optimization algorithms, we need to pick some initial value for the parameters theta. The advanced optimization algorithms assume that you will pass them some initial value for the parameters theta. Now let's consider gradient descent. For that, we also need to initialize theta to something, and then we can slowly take steps downhill, using gradient descent, to minimize the function J of theta. So what do we set the initial value of theta to? Is it possible to set the initial value of theta to the vector of all zeros? Whereas this worked okay when we were using logistic regression, initializing all of your parameters to zero actually does not work when you're training a neural network.

Consider training the following neural network, and let's say we initialize all of the parameters in the network to zero. If you do that, then at initialization this weight that I'm coloring in blue is going to be equal to that weight, so they're both zero. And this weight that I'm coloring in red is equal to that weight, which I'm also coloring in red. And this weight, which I'm coloring in green, is going to be equal to the value of that weight. What that means is that both of your hidden units, a1 and a2, are going to be computing the same function of your inputs, and thus, for every one of your training examples, you end up with a(2)1 equal to a(2)2.
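To make this concrete, here is a minimal Octave sketch. It is my own illustration rather than code from the lecture, and it assumes a tiny network with two inputs, two hidden units, and a sigmoid activation; with all the weights set to zero, the two hidden units produce identical activations for any input.

% Hypothetical example: two inputs, two hidden units, all-zero weights
Theta1 = zeros(2, 3);        % weights from the input layer (including the bias unit) into the two hidden units
x  = [1; 0.5; -1.2];         % one training example, with the bias unit x0 = 1 prepended
z2 = Theta1 * x;             % both rows of Theta1 are identical, so z2(1) == z2(2)
a2 = 1 ./ (1 + exp(-z2));    % sigmoid activations: a2(1) == a2(2), no matter what the input is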
Moreover, and I'm not going to show this in too much detail, but because these outgoing weights are the same, you can also show that the delta values are going to be the same. So concretely, you end up with delta(2)1 equal to delta(2)2. And if you work through the math further, what you can show is that the partial derivatives with respect to your parameters will satisfy the following: writing out the derivatives of the cost function with respect to these two blue weights in the neural network, you'll find that those two partial derivatives are going to be equal to each other.

And so what this means is that even after, say, one gradient descent update, you're going to update this first blue weight with the learning rate times this, and you're going to update the second blue weight with the learning rate times this. But even after one gradient descent update, those two blue weights, those two blue-colored parameters, will end up the same as each other. They'll be some non-zero value now, but this value will be equal to that value. And similarly, even after one gradient descent update, this value will be equal to that value; there will be some non-zero values, it's just that the two red values will be equal to each other. And similarly, the two green weights will both change values, but they'll both end up with the same value as each other. So after each update, the parameters corresponding to the inputs going into each of the two hidden units are identical.
That's just saying that the two green weights are still the same, the two red weights are still the same, and the two blue weights are still the same, and what that means is that even after one iteration of, say, gradient descent, you find that your two hidden units are still computing exactly the same function of the input; so you still have a(2)1 equal to a(2)2, and you're back to this case. And as you keep running gradient descent, the two blue weights will stay the same as each other, the two red weights will stay the same as each other, and the two green weights will stay the same as each other. What this means is that your neural network really can't compute very interesting functions. Imagine that you had not only two hidden units, but many, many hidden units. Then what this is saying is that all of your hidden units are computing the exact same feature; all of your hidden units are computing the exact same function of the input. And this is a highly redundant representation, because it means that your final logistic regression unit really only gets to see one feature, since all of these are the same, and this prevents your neural network from learning something interesting.
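To see this numerically, here is a small Octave sketch, again my own illustration rather than the lecture's code, assuming a 2-2-1 sigmoid network trained on a single made-up example with the cross-entropy cost. It runs a few gradient descent steps starting from all-zero weights; the two rows of Theta1, which are the weights going into the two hidden units, receive identical updates at every iteration, so the symmetry is never broken.

sigmoid = @(z) 1 ./ (1 + exp(-z));
Theta1 = zeros(2, 3);   Theta2 = zeros(1, 3);    % symmetric (all-zero) initialization
x = [0.4; -0.7];  y = 1;  alpha = 1;             % one hypothetical training example and learning rate
for iter = 1:10
  a1 = [1; x];                                   % input with bias unit
  z2 = Theta1 * a1;   a2 = [1; sigmoid(z2)];     % hidden layer with bias unit
  a3 = sigmoid(Theta2 * a2);                     % output unit
  delta3 = a3 - y;                               % output-layer error
  delta2 = (Theta2(:, 2:end)' * delta3) .* sigmoid(z2) .* (1 - sigmoid(z2));
  Theta1 = Theta1 - alpha * (delta2 * a1');      % rows 1 and 2 get the same update
  Theta2 = Theta2 - alpha * (delta3 * a2');
end
disp(Theta1)                                     % the two rows are still equal to each other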
In order to get around this problem, the way we initialize the parameters of a neural network, therefore, is with random initialization. Concretely, the problem we saw on the previous slide is sometimes called the problem of symmetric weights, that is, the weights all being the same, and this random initialization is how we perform symmetry breaking. What we do is initialize each value of theta to a random number between minus epsilon and epsilon; that's the notation meaning a number between minus epsilon and plus epsilon.

So my weights, my parameters, are all going to be randomly initialized between minus epsilon and plus epsilon. The way I write code to do this in Octave is to set Theta1 equal to this. This rand(10, 11) is how you compute a random 10 by 11 dimensional matrix, where all of the values are between 0 and 1; these are real numbers that take on any continuous value between 0 and 1. And so, if you take a number between 0 and 1, multiply it by 2 times epsilon, and subtract epsilon, then you end up with a number that's between minus epsilon and plus epsilon. Incidentally, this epsilon here has nothing to do with the epsilon that we were using when we were doing gradient checking; when we were doing numerical gradient checking, there we were adding some value of epsilon to theta, whereas this is an unrelated value of epsilon, which is why I'm calling it epsilon, just to distinguish it from the value we were using in gradient checking. Similarly, if you want to initialize Theta2 to a random 1 by 11 matrix, you can do so using this piece of code here.
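In code, the commands just described would look something like the following. This is a sketch of what the lecture describes rather than the exact slide; the matrix sizes 10 by 11 and 1 by 11 are the ones mentioned above, and INIT_EPSILON, along with its value 0.12, is just an example name and choice for the initialization epsilon.

INIT_EPSILON = 0.12;                                         % some small value close to zero (example choice)
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % 10 x 11 matrix with entries in (-epsilon, +epsilon)
Theta2 = rand(1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;    % 1 x 11 matrix with entries in (-epsilon, +epsilon)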
So, to summarize: to train a neural network, what you should do is randomly initialize the weights to small values close to zero, between minus epsilon and plus epsilon, say; then implement back-propagation, do gradient checking, and use either gradient descent or one of the advanced optimization algorithms to try to minimize J of theta as a function of the parameters theta, starting from randomly chosen initial values for the parameters. And by doing symmetry breaking, which is this process, hopefully gradient descent or the advanced optimization algorithms will be able to find a good value of theta.