So, it's taken us a lot of videos to get through the neural network learning algorithm. In this video, what I'd like to do is put all the pieces together, to give an overall summary, or a bigger-picture view, of how all the pieces fit together and of the overall process of how to implement a neural network learning algorithm.

When training a neural network, the first thing you need to do is pick some network architecture, and by architecture I just mean the connectivity pattern between the neurons. So we might choose between, say, a neural network with three input units, five hidden units, and four output units, versus one with three input units, two hidden layers of five units each, and four output units, versus one with three input units, three hidden layers of five units each, and four output units. These choices of how many hidden units in each layer, and how many hidden layers, are the architecture choices.

So, how do you make these choices? Well, first, the number of input units is pretty well defined: once you decide on the set of features x, the number of input units will just be the dimension of your features x(i). And if you are doing multiclass classification, the number of output units will be determined by the number of classes in your classification problem. Just a reminder: if you have a multiclass classification problem where y takes on, say, values between 1 and 10, so that you have ten possible classes, then remember to rewrite your output y as a vector. So instead of class one, you recode it as a vector with a one in the first position, and for the second class you recode it as a vector with a one in the second position. And if one of these examples takes on the fifth class, so y equals 5, then what you show your neural network is not actually the value y equals 5; instead, at the output layer, which would have ten output units, you feed it the vector with a one in the fifth position and a bunch of zeros everywhere else.
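Just to make that recoding concrete, here's a minimal sketch of it in Octave; the names Y and num_labels are mine, introduced only for illustration, and I'm assuming the labels y are stored as a column vector of class indices.

```octave
% One-hot recoding of the labels: each y(i) in 1..num_labels becomes a row
% vector with a 1 in position y(i) and zeros elsewhere (illustrative names).
num_labels = 10;               % ten possible classes
m = length(y);                 % number of training examples
Y = zeros(m, num_labels);
for i = 1:m
  Y(i, y(i)) = 1;              % e.g. y(i) = 5 becomes [0 0 0 0 1 0 0 0 0 0]
end
```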
So the choice of the number of input units and the number of output units is reasonably straightforward. As for the number of hidden units and the number of hidden layers, a reasonable default is to use a single hidden layer, and so this type of neural network shown on the left, with just one hidden layer, is probably the most common. Or, if you use more than one hidden layer, again a reasonable default is to have the same number of hidden units in every single layer. So here we have two hidden layers, and each of these hidden layers has the same number, five, of hidden units; and here we have three hidden layers, and each of them again has five hidden units. But, as I said, the network architecture on the left, with a single hidden layer, would be a perfectly reasonable default. As for the number of hidden units: usually, the more hidden units the better. It's just that if you have a lot of hidden units, it can become more computationally expensive; but very often, having more hidden units is a good thing. And usually the number of hidden units in each layer will be comparable to the dimension of x, comparable to the number of features: it could be anywhere from the same as the number of input features to maybe three or four times that. So having the number of hidden units comparable to, or somewhat bigger than, the number of input features is often a useful thing to do.
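As a minimal sketch of these architecture choices, in the same Octave notation, the layer sizes might be set up like this; the specific values are illustrative assumptions rather than fixed rules.

```octave
% Architecture choices for a one-hidden-layer network (illustrative values),
% assuming a design matrix X with one row per example and one column per feature.
input_layer_size  = size(X, 2);   % number of input units = number of features
hidden_layer_size = 25;           % e.g. comparable to, or a few times, the input dimension
% num_labels, the number of output units, was set above from the number of classes.
```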
So, hopefully this gives you one reasonable set of default choices for neural network architecture, and if you follow these guidelines, you will probably get something that works well. But in a later set of videos, where I will talk specifically about advice for how to apply these algorithms, I will actually say a lot more about how to choose a neural network architecture; I actually have quite a lot I want to say later about making good choices for the number of hidden units, the number of hidden layers, and so on.

Next, here's what we need to implement in order to train a neural network. There are actually six steps; I have four on this slide and two more steps on the next slide. The first step is to set up the neural network and to randomly initialize the values of the weights, and we usually initialize the weights to small values near zero. Then we implement forward propagation, so that we can input any x to the neural network and compute h(x), which is the output vector of the y values. We then also implement code to compute the cost function J(Theta).
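To illustrate those first two steps concretely, here's a minimal sketch in Octave for the one-hidden-layer network set up above; the initialization constant is an illustrative assumption.

```octave
% Step 1: randomly initialize the weights to small values near zero, so the
% hidden units don't all start out computing the same function.
epsilon_init = 0.12;                 % illustrative size of the initial weights
Theta1 = rand(hidden_layer_size, input_layer_size + 1) * 2 * epsilon_init - epsilon_init;
Theta2 = rand(num_labels, hidden_layer_size + 1) * 2 * epsilon_init - epsilon_init;

% Step 2: forward propagation, computing h(x) for every training example at once.
sigmoid = @(z) 1 ./ (1 + exp(-z));
a1 = [ones(m, 1), X];                % add the bias unit to the inputs
a2 = [ones(m, 1), sigmoid(a1 * Theta1')];
h  = sigmoid(a2 * Theta2');          % m x num_labels matrix of output vectors
```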
And next we implement back-prop, the back-propagation algorithm, to compute those partial derivative terms: the partial derivatives of J(Theta) with respect to the parameters. Concretely, to implement back-prop, usually we will do that with a for-loop over the training examples.

Some of you may have heard of advanced, and frankly very advanced, vectorization methods where you don't have a for-loop over the m training examples, but the first time you implement back-prop there should almost certainly be a for-loop in your code, where you iterate over the examples: you do forward prop and back prop on the first example (x(1), y(1)), then in the second iteration of the for-loop you do forward propagation and back propagation on the second example, and so on, until you get through the final example. So there should be a for-loop in your implementation of back-prop, at least the first time you implement it. There are, frankly, somewhat complicated ways to do this without a for-loop, but I definitely do not recommend trying that much more complicated version the first time you implement back-prop.

So concretely, we have a for-loop over the m training examples, and inside the for-loop we perform forward prop and back prop using just that one example. What that means is that we take x(i) and feed it to the input layer, perform forward prop and then back prop, and that will give us all of the activations and all of the delta terms for all of the layers and all of the units in the neural network. Then, still inside this for-loop (let me draw some curly braces just to show the scope of the for-loop; this is Octave code, of course, but really it's more like pseudo-code, and the for-loop encompasses all of this), we compute those capital-Delta accumulation terms using the formula we gave earlier: Delta(l) := Delta(l) + delta(l+1) * (a(l))^T.
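Here's a minimal sketch of that loop in Octave for the one-hidden-layer network, reusing Theta1, Theta2, sigmoid, and the one-hot matrix Y from the earlier sketches; it's meant to show the structure of the loop, not to be a definitive implementation.

```octave
% Accumulators for the partial derivative terms, one per weight matrix.
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
  % Forward propagation on example i alone.
  a1 = [1, X(i, :)];                     % input activations plus bias unit
  a2 = [1, sigmoid(a1 * Theta1')];
  a3 = sigmoid(a2 * Theta2');            % the network's output h(x(i)), 1 x num_labels
  % Back propagation: output-layer error, then hidden-layer error.
  d3 = a3 - Y(i, :);
  d2 = (d3 * Theta2) .* a2 .* (1 - a2);  % using g'(z) = a .* (1 - a) for the sigmoid
  d2 = d2(2:end);                        % drop the error term for the bias unit
  % Accumulate: Delta(l) := Delta(l) + delta(l+1) * (a(l))^T.
  Delta1 = Delta1 + d2' * a1;
  Delta2 = Delta2 + d3' * a2;
end
% The partial derivative terms are then computed from Delta1 and Delta2
% outside the loop, as described next.
```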
And then finally, outside the for-loop, having computed these capital-Delta accumulation terms, we have some other code that allows us to compute the partial derivative terms. And these partial derivative terms have to take into account the regularization term lambda as well; those formulas were given in the earlier video. So, having done that, you now hopefully have code to compute these partial derivative terms.

Next is step five: what I do then is use gradient checking to compare the partial derivative terms computed using back propagation against the partial derivatives computed using numerical estimates of the derivatives. So, I do gradient checking to make sure that both of these give very similar values. Having done gradient checking reassures us that our implementation of back propagation is correct; it is then very important that we disable gradient checking, because the gradient checking code is computationally very slow.

And finally, step six: we use an optimization algorithm, such as gradient descent or one of the advanced optimization methods such as L-BFGS or conjugate gradient, as embodied in fminunc or other optimization routines. We use these together with back propagation; back propagation is the thing that computes the partial derivatives for us. And so, since we know how to compute the cost function and we know how to compute the partial derivatives using back propagation, we can use one of these optimization methods to try to minimize J(Theta) as a function of the parameters Theta.
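As a rough sketch of steps five and six, here's what the code might look like, assuming a hypothetical helper nnCostFunction that returns the cost and the back-propagation gradient for an unrolled parameter vector (that helper itself isn't shown), with lambda as the regularization parameter.

```octave
% Wrap the cost and gradient computation in a single function handle.
costFunc = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                               num_labels, X, y, lambda);
theta = [Theta1(:); Theta2(:)];        % unroll the weight matrices into one vector

% Step 5: gradient checking. Numerically estimate a few partial derivatives and
% compare them with the values produced by back propagation.
[J0, grad] = costFunc(theta);
eps_check = 1e-4;
for i = 1:10                           % spot-check a handful of entries
  perturb = zeros(size(theta));
  perturb(i) = eps_check;
  numgrad = (costFunc(theta + perturb) - costFunc(theta - perturb)) / (2 * eps_check);
  fprintf('backprop: %g   numerical: %g\n', grad(i), numgrad);
end
% Remember to disable this check before training; it is computationally very slow.

% Step 6: minimize J(Theta) with an advanced optimizer that uses the gradient.
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta_opt, J_opt] = fminunc(costFunc, theta, options);
```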
And by the way, for neural networks, this cost function J(Theta) is non-convex, so it can theoretically be susceptible to local minima, and in fact algorithms like gradient descent and the advanced optimization methods can, in theory, get stuck in local optima. But it turns out that in practice this is not usually a huge problem, and even though we can't guarantee that these algorithms will find a global optimum, usually algorithms like gradient descent will do a very good job minimizing this cost function J(Theta) and will get to a very good local minimum, even if they don't reach the global optimum.

Finally, gradient descent for a neural network might still seem a little bit magical, so let me show one more figure to try to give you some intuition about what gradient descent for a neural network is doing. This is actually similar to the figure I used earlier to explain gradient descent. So, we have some cost function, and we have a number of parameters in our neural network; here I've just written down two of the parameter values. In reality, of course, in a neural network we can have lots of parameters: Theta 1, Theta 2, and so on are all matrices, right? So we can have very high-dimensional parameters, but because of the limitations of what we can plot, I'm pretending that we have only two parameters in this neural network, although obviously we have a lot more in practice.

Now, this cost function J(Theta) measures how well the neural network fits the training data. So, if you take a point like this one, down here, that's a point where J(Theta) is pretty low, and so this corresponds to a particular setting of the parameters.
That is, there's a setting of the parameters Theta where, for most of the training examples, the output of my hypothesis is pretty close to y(i), and when that is true, the cost function is pretty low. Whereas, in contrast, if you were to take a value like that one, that point corresponds to a setting where, for many training examples, the output of the neural network is far from the actual value y(i) that was observed in the training set. So points like this correspond to where the hypothesis, where the neural network, is outputting values on the training set that are far from y(i); it's not fitting the training set well. Whereas points like this, with low values of the cost function, correspond to where J(Theta) is low, and therefore to where the neural network happens to be fitting the training set well, because that is what needs to be true in order for J(Theta) to be small.

So what gradient descent does is start from some random initial point, like that one over there, and repeatedly go downhill. What back propagation is doing is computing the direction of the gradient, and what gradient descent is doing is taking little steps downhill, until hopefully it gets to, in this case, a pretty good local optimum. So, when you implement back propagation and use gradient descent or one of the advanced optimization methods, this picture sort of explains what the algorithm is doing: it's trying to find a value of the parameters where the output values of the neural network closely match the values of the y(i)'s observed in your training set.
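If it helps to see those little steps downhill as code, here's a minimal sketch of the plain gradient descent version, using the costFunc handle from before; the learning rate and number of iterations are illustrative assumptions.

```octave
% Plain gradient descent on J(Theta): back propagation supplies the direction.
alpha = 0.1;                       % illustrative learning rate
num_iters = 1000;                  % illustrative number of steps
for iter = 1:num_iters
  [J, grad] = costFunc(theta);     % cost and gradient via back propagation
  theta = theta - alpha * grad;    % take a small step downhill
end
```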
So, hopefully this gives you a better sense of how the many different pieces of neural network learning fit together. In case, even after this video, you still feel like there are a lot of different pieces and it's not entirely clear what some of them do or how all of these pieces come together, that's actually okay. Neural network learning and back propagation is a complicated algorithm. And even though I've seen the math behind back propagation for many years, and I've used back propagation, I think very successfully, for many years, even today I still feel like I don't always have a great grasp of exactly what back propagation is doing, or of what the optimization process of minimizing J(Theta) looks like. This is a much harder algorithm to feel like I have a good handle on, compared to, say, linear regression or logistic regression, which were mathematically and conceptually much simpler and cleaner algorithms. So in case you feel the same way, that's actually perfectly okay. But if you do implement back propagation, hopefully what you'll find is that this is one of the most powerful learning algorithms; if you implement this algorithm, implementing back propagation and one of these optimization methods, you'll find that back propagation is able to fit very complex, powerful, non-linear functions to your data, and this is one of the most effective learning algorithms we have today.