In the previous video, we talked about the back propagation algorithm. To a lot of people seeing it for the first time, the first impression is often: wow, this is a very complicated algorithm with all these different steps, I'm not quite sure how they fit together, and it's kind of a black box. In case that's how you are feeling about back propagation, that's actually okay. Back propagation is, unfortunately, a less mathematically clean and less mathematically simple algorithm compared to linear regression or logistic regression. I've actually used back propagation pretty successfully for many years, and even today I sometimes feel like I don't have a very good sense of just what it's doing, or much intuition about what back propagation is doing. For those of you doing the programming exercises, those will at least mechanically step you through the different steps of how to implement back prop, so you will be able to get it to work for yourself. What I want to do in this video is look a little bit more at the mechanical steps of back propagation and try to give you a little more intuition about what those mechanical steps are doing, to hopefully convince you that it is at least a reasonable algorithm. If, even after this video, back propagation still seems very black box, with too many complicated steps, and a little bit magical to you, that's actually okay. Even though I have used back prop for many years, sometimes it's a difficult algorithm to understand. But hopefully this video will help a little bit. In order to better understand back propagation, let's take another, closer look at what forward propagation is doing.
Here's a neural network with two input units (not counting the bias unit), two hidden units in this layer, two hidden units in the next layer, and finally one output unit; again, these counts of 2, 2, 2 are not counting the bias units on top. In order to illustrate forward propagation, I'm going to draw this network a little bit differently. In particular, I'm going to draw the nodes as these very fat ellipses, so that I can write text in them. When performing forward propagation, we might have some particular example, say some example (x(i), y(i)), and it is this x(i) that we feed into the input layer, so x(i)1 and x(i)2 are the values we set the input layer to. When we forward propagate to the first hidden layer, what we do is compute z(2)1 and z(2)2; these are the weighted sums of inputs from the input units. Then we apply the sigmoid, or logistic, activation function to the z values, and that gives us the activation values a(2)1 and a(2)2. Then we forward propagate again to get z(3)1, apply the sigmoid, the logistic activation function, to that to get a(3)1, and similarly, like so, until we get z(4)1; applying the activation function gives us a(4)1, which is the final output value of the network.
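To make this forward pass a bit more concrete, here is a minimal sketch of it in Python/NumPy for the 2-2-2-1 network described above. The names sigmoid, forward_propagate, Theta1, Theta2, and Theta3 are illustrative assumptions rather than anything defined in the course materials, and each Theta matrix is assumed to hold its layer's bias weights in its first column.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2, Theta3):
    """Forward pass through a 2-2-2-1 network, adding a bias unit at each layer.

    x      : length-2 input vector (x1, x2)
    Theta1 : 2x3 weights from layer 1 to layer 2 (column 0 holds the bias weights)
    Theta2 : 2x3 weights from layer 2 to layer 3
    Theta3 : 1x3 weights from layer 3 to the output layer
    """
    a1 = np.concatenate(([1.0], x))            # input activations plus bias unit
    z2 = Theta1 @ a1                           # weighted sums for the first hidden layer
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # apply sigmoid, then prepend the bias unit
    z3 = Theta2 @ a2                           # weighted sums for the second hidden layer
    a3 = np.concatenate(([1.0], sigmoid(z3)))
    z4 = Theta3 @ a3                           # weighted sum for the single output unit
    a4 = sigmoid(z4)                           # h_Theta(x), the network's output
    return z2, a2, z3, a3, z4, a4
```

Calling forward_propagate with a training example's features x(i) reproduces the sequence of z and a values described above.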
Let me erase this arrow to give myself some space. If you look at what this computation really is doing, focusing on this hidden unit, let's say that this weight, shown in magenta there, is my weight theta(2)10 (the indexing is not important), this weight here, which I'm highlighting in red, is theta(2)11, and this weight here, which I'm drawing in cyan, is theta(2)12. So the way the network computes the value z(3)1 is: z(3)1 is equal to this magenta weight times this value, that's theta(2)10 times 1, plus this red weight times this value, that's theta(2)11 times a(2)1, and finally plus this cyan weight times this value, which is theta(2)12 times a(2)2. And so that's forward propagation.
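Written out as an equation (treating the bias activation as a^{(2)}_0 = 1), the single-unit computation just described is:

    z^{(3)}_1 = \Theta^{(2)}_{10} \cdot 1 + \Theta^{(2)}_{11} \, a^{(2)}_1 + \Theta^{(2)}_{12} \, a^{(2)}_2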
And it turns out that, as we'll see later in this video, what back propagation is doing is a process very similar to this, except that instead of the computations flowing from the left to the right of the network, the computations flow from the right to the left, using a very similar computation; I'll say in two slides exactly what I mean by that. To better understand what back propagation is doing, let's look at the cost function. This is just the cost function that we had for the case of only one output unit; if we have more than one output unit, we just have a summation over the output unit index, but with only one output unit this is the cost function, and we do forward propagation and back propagation on one example at a time. So let's just focus on the single example (x(i), y(i)), and focus on the case of having one output unit, so y(i) here is just a real number. Let's also ignore regularization, so lambda equals zero, and that final regularization term goes away. Now, if you look inside this summation, you find that the cost term associated with the i-th training example, that is, the cost associated with the training example (x(i), y(i)), is given by this expression; the cost of training example i is written as follows. What this cost function does is play a role similar to the squared error. So rather than looking at this complicated expression, if you want, you can think of cost(i) as being approximately the square of the difference between the neural network's output and the actual value. Just as in logistic regression, we actually prefer to use the slightly more complicated cost function using the log, but for the purpose of intuition, feel free to think of the cost function as being sort of the squared-error cost function. And so this cost(i) measures how well the network is doing on correctly predicting example i, that is, how close the output is to the actually observed label y(i). Now let's look at what back propagation is doing. One useful intuition is that back propagation is computing these delta superscript l, subscript j terms, and we can think of these as the "error" of the activation value that we got for unit j in the l-th layer.
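For reference, with a single output unit and regularization switched off (lambda = 0), the per-example cost term just described is the logistic cost

    \text{cost}(i) = -\, y^{(i)} \log h_\Theta(x^{(i)}) - \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\Theta(x^{(i)})\bigr)

and the squared-error intuition mentioned above amounts to cost(i) ≈ (h_\Theta(x^{(i)}) - y^{(i)})^2.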
More formally, and this is maybe only for those of you who are familiar with calculus, what the delta terms actually are is this: they're the partial derivatives of the cost function with respect to z(l)j, that is, with respect to the weighted sums of inputs that we compute as the z terms. So concretely, the cost function is a function of the label y and of the value h(x) output by the neural network. If we could go inside the neural network and just change those z(l)j values a little bit, then that would affect the values the neural network outputs, and so that would end up changing the cost function. And again, this is really only for those of you who are comfortable with partial derivatives: what these delta terms turn out to be is the partial derivatives of the cost function with respect to these intermediate terms that we're computing. So they're a measure of how much we would like to change the neural network's weights in order to affect these intermediate values of the computation, so as to affect the final output of the neural network, h(x), and therefore affect the overall cost. In case this last bit of partial-derivative intuition didn't make sense, don't worry about it; the rest of this we can do without really talking about partial derivatives. But let's look in more detail at what back propagation is doing. For the output layer, if we're doing forward propagation and back propagation on this training example i, back propagation first sets the delta term delta(4)1 to y(i) minus a(4)1.
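In symbols, again only for those comfortable with partial derivatives, the delta terms just described, and the output-layer delta, are:

    \delta^{(l)}_j = \frac{\partial}{\partial z^{(l)}_j} \, \text{cost}(i), \qquad \delta^{(4)}_1 = y^{(i)} - a^{(4)}_1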
So it's really the error: the difference between the actual value of y and the value that was predicted. And so we're going to compute delta(4)1 like so. Next, we're going to propagate these values backwards (I'll explain this in a second) and end up computing the delta terms of the previous layer; we're going to end up with delta(3)1 and delta(3)2. Then we're going to propagate this further backward and end up computing delta(2)1 and delta(2)2. Now, the back propagation calculation is a lot like running the forward propagation algorithm, but doing it backwards. So here's what I mean. Let's look at how we end up with this value of delta(2)2. So we have delta(2)2, and, similar to forward propagation, let me label a couple of the weights. This weight, shown in magenta, let's say that weight is theta(2) of 1, 2, and this weight down here, let me highlight it in red, that's going to be, let's say, theta(2) of 2, 2. So if we look at how delta(2)2 is computed for this node, it turns out that what we're going to do is take this value, multiply it by this weight, and add it to this value multiplied by that weight. So it's really a weighted sum of these delta values, weighted by the corresponding edge strengths. So concretely, let me fill this in: this delta(2)2 is going to be equal to theta(2)12, which is that magenta weight, times delta(3)1, plus the thing I have in red, that's theta(2)22, times delta(3)2.
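Written out, that weighted sum for this node is:

    \delta^{(2)}_2 = \Theta^{(2)}_{12} \, \delta^{(3)}_1 + \Theta^{(2)}_{22} \, \delta^{(3)}_2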
So it is really, literally, this red weight times this value plus this magenta weight times its value, and that's how we wind up with that value of delta. And just as another example, let's look at this value: how did we get that value? Well, it's a similar process. If this weight, which I'm going to highlight in green, is equal to, say, theta(3)12, then we have that delta(3)2 is going to be equal to that green weight, theta(3)12, times delta(4)1. And by the way, so far I've been writing the delta values only for the hidden units, excluding the bias units. Depending on how you define the back propagation algorithm, or depending on how you implement it, you may end up implementing something that computes delta values for these bias units as well. The bias units always output the value plus one; they are just what they are, and there's no way for us to change their value. So, depending on your implementation of back prop, you may compute them or not; the way I usually implement it, I do end up computing these delta values, but we just discard them and don't use them, because they don't end up being part of the calculation needed to compute the derivatives. So hopefully that gives you a little bit of intuition about what back propagation is doing. In case all of this still seems sort of magical and sort of black box, in a later video, the "putting it together" video, I'll try to give a little more intuition about what back propagation is doing. But, unfortunately, this is a difficult algorithm to try to visualize and understand what it is really doing.
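As a companion to the forward-propagation sketch earlier, here is a minimal sketch of this backward pass for the same 2-2-2-1 network, again in Python/NumPy with illustrative names. It implements exactly the weighted-sum intuition described in this video; note that the full back propagation algorithm also multiplies each hidden-layer delta element-wise by the derivative of the activation function, g'(z), a factor this simplified picture leaves out.

```python
import numpy as np

def backward_deltas(y, a4, Theta2, Theta3):
    """Backward pass for the 2-2-2-1 network, following the intuition in this
    video: each delta is a weighted sum of the next layer's deltas, weighted
    by the corresponding edge strengths. (The full algorithm would also
    multiply the hidden-layer deltas element-wise by the sigmoid derivative.)
    """
    delta4 = y - a4                    # output-layer "error", as defined in this video
    # Dropping column 0 of each Theta skips the weights attached to the bias
    # units, so we never compute the bias deltas that would just be discarded.
    delta3 = Theta3[:, 1:].T @ delta4  # e.g. delta3_2 = Theta3_12 * delta4_1
    delta2 = Theta2[:, 1:].T @ delta3  # e.g. delta2_2 = Theta2_12*delta3_1 + Theta2_22*delta3_2
    return delta2, delta3, delta4
```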
But fortunately, I guess, many people have been using it very successfully for many years, and if you implement the algorithm, you will have a very effective learning algorithm, even though the inner workings of exactly how it works can be harder to visualize.