In this video, I'd like to work through an example to show how a neural network can compute complex nonlinear hypotheses.

In the last video, we saw how a neural network can be used to compute the functions x1 AND x2, and x1 OR x2, when x1 and x2 are binary, that is, when they take on values of 0 or 1. We can also have a network compute negation, that is, compute the function "NOT x1". Let me just write down the weights associated with this network. We have only one input feature, x1, in this case, plus the bias unit +1, and if I associate these with the weights +10 and -20, then my hypothesis computes h(x) = g(10 - 20*x1). So when x1 is equal to 0, my hypothesis computes g(10 - 20*0), which is just g(10), and that's approximately 1. And when x1 is equal to 1, this is g(-10), which is approximately equal to 0. And if you look at what these values are, that's essentially the "NOT x1" function.

So to include negations, the general idea is to put a large negative weight in front of the variable you want to negate. So it's -20 multiplied by x1, and that's the general idea of how you end up negating x1. And so, as an example that I hope you will figure out yourself: if you want to compute a function like "NOT x1 AND NOT x2", part of that will involve putting large negative weights in front of x1 and x2, but it should be feasible to get a neural network with just one output unit to compute this as well. All right?
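As a minimal sketch (not part of the original lecture), here is the "NOT x1" unit in Python, assuming the usual logistic sigmoid as the activation function; the helper name not_gate is just an illustrative choice:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def not_gate(x1):
    """Single unit with bias weight +10 and input weight -20, as in the lecture."""
    return sigmoid(10 - 20 * x1)

for x1 in (0, 1):
    print(x1, round(not_gate(x1), 4))  # ~1.0 when x1 = 0, ~0.0 when x1 = 1
```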
So this logical function, "NOT x1 AND NOT x2", is going to be equal to 1 if, and only if, x1 = x2 = 0, right? So this is a logical function: "NOT x1" means x1 must be zero, and "NOT x2" means x2 must be equal to zero as well. So this logical function is equal to 1 if, and only if, both x1 and x2 are equal to zero. And hopefully you should be able to figure out how to make a small neural network that computes this logical function as well.

Now, taking the three pieces that we have put together, the network for computing x1 AND x2, the network for computing NOT x1 AND NOT x2, and one last network for computing x1 OR x2, we should be able to put these three pieces together to compute the x1 XNOR x2 function. And just to remind you, if these axes are x1 and x2, the function that we want to compute has negative examples here and here, and positive examples there and there. So clearly we'll need a nonlinear decision boundary in order to separate the positive and negative examples.

Let's draw the network. I'm going to take my inputs, +1, x1, and x2, and create my first hidden unit here. I'm going to call it a(2)1 because it's my first hidden unit. And I'm going to copy the weights over from the red network, the x1 AND x2 network, so that's -30, 20, 20. Next, let me create a second hidden unit, which I'm going to call a(2)2; that is the second hidden unit of layer two. And I'm going to copy over the cyan network in the middle, so I'm going to have the weights 10, -20, -20. And now let's fill in some of the truth table values. A short sketch of these two hidden units follows below.
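Here is a small Python sketch (again, not from the lecture itself) of the two hidden units just described, using the weights copied from the red (AND) and cyan (NOR) networks; hidden_layer is a hypothetical helper name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(x1, x2):
    # a(2)1: red network, x1 AND x2, weights -30, 20, 20
    a21 = sigmoid(-30 + 20 * x1 + 20 * x2)
    # a(2)2: cyan network, (NOT x1) AND (NOT x2), weights 10, -20, -20
    a22 = sigmoid(10 - 20 * x1 - 20 * x2)
    return a21, a22

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a21, a22 = hidden_layer(x1, x2)
    print(x1, x2, round(a21), round(a22))  # a21: 0,0,0,1  a22: 1,0,0,0
```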
For the red network, we know that it computes x1 AND x2, so a(2)1 is going to be approximately 0, 0, 0, 1, depending on the values of x1 and x2. And a(2)2, that's the cyan network; we know the function NOT x1 AND NOT x2 outputs 1, 0, 0, 0 for the four values of x1 and x2. Finally, I'm going to create my output node, my output unit, that is a(3)1; this is what outputs h(x). I'm going to copy over the OR network for that, and I'm going to need a +1 bias unit here, so let me draw that in and copy over the weights from the green network: -10, 20, 20. And we saw earlier that this computes the OR function.

So, let's fill in the truth table entries. The first entry is 0 OR 1, which is going to be 1; the next is 0 OR 0, which is 0; then 0 OR 0, which is 0; and then 1 OR 0, which evaluates to 1. And thus h(x) is equal to 1 when either both x1 and x2 are 0, or when x1 and x2 are both 1. Concretely, h(x) outputs 1 at exactly these two locations and outputs 0 otherwise. And thus, with this neural network, which has an input layer, one hidden layer, and one output layer, we end up with a nonlinear decision boundary that computes this XNOR function.

And the more general intuition is that in the input layer, we just had our raw inputs; then we had a hidden layer, which computed some slightly more complex functions of the inputs, as shown here; and then by adding yet another layer, we end up with an even more complex nonlinear function.
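Putting the three pieces together, a minimal Python sketch of the full forward pass (assuming the same sigmoid activation and the weight values quoted in the lecture; xnor_network is an illustrative name) reproduces the XNOR truth table:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xnor_network(x1, x2):
    """Forward pass through the two-layer network built from AND, NOR, and OR units."""
    a21 = sigmoid(-30 + 20 * x1 + 20 * x2)    # red network:  x1 AND x2
    a22 = sigmoid(10 - 20 * x1 - 20 * x2)     # cyan network: (NOT x1) AND (NOT x2)
    h = sigmoid(-10 + 20 * a21 + 20 * a22)    # green network: a21 OR a22
    return h

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xnor_network(x1, x2)))  # prints 1, 0, 0, 1: x1 XNOR x2
```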
And this is the sort of intuition for why neural networks can compute pretty complicated functions: when you have multiple layers, you have relatively simple functions of the inputs in the second layer, but the third layer can build on that to compute even more complex functions, and then the layer after that can compute functions that are more complex still.

To wrap up this video, I want to show you a fun example of an application of a neural network that captures this intuition of the deeper layers computing more complex features. I want to show you a video that I got from a good friend of mine, Yann LeCun. Yann is a professor at New York University, at NYU, and he was one of the early pioneers of neural network research; he's something of a legend in the field now, and his ideas are used in all sorts of products and applications throughout the world.

So, I want to show you a video from some of his early work in which he was using a neural network to recognize handwriting, to do handwritten digit recognition. You might remember that early in this class, at the start of this class, I said that one of the early successes of neural networks was trying to use them to read zip codes, to help us send mail along; so, to read postal codes. So this is one of the attempts, one of the algorithms used to try to address that problem. In the video I'll show you, this area here is the input area that shows a handwritten character presented to the network.
This column here shows a visualization of the features computed by the first hidden layer of the network, so for the first hidden layer this visualization shows the different features, different edges and lines and so on, being detected. This is a visualization of the next hidden layer; it's harder to understand what the deeper hidden layers are doing, and that's the visualization of what the next hidden layer is computing. You'll probably have a hard time seeing what's going on much beyond the first hidden layer. But then finally, all of these learned features get fed to the output layer, and shown over here is the final answer, the final predicted value for what handwritten digit the neural network thinks it is being shown. So, let's take a look at the video.

So, I hope you enjoyed the video, and that it gave you some intuition about the sorts of pretty complicated functions neural networks can learn, in which the network takes as input an image, just the raw pixels, and the first hidden layer computes some set of features, the next layer computes even more complex features, and then even more complex features, and these features can then be used by what is essentially the final layer, a logistic regression classifier, to make accurate predictions about what are the numbers that the network sees.