In this and in the next set of videos, I'd like to tell you about a learning algorithm called a Neural Network. We're going to first talk about the representation, and then in the next set of videos talk about learning algorithms for it. Neural networks are actually a pretty old idea, but they had fallen out of favor for a while. Today, though, they are the state-of-the-art technique for many different machine learning problems. So why do we need yet another learning algorithm? We already have linear regression and we have logistic regression, so why do we need neural networks? In order to motivate the discussion of neural networks, let me start by showing you a few examples of machine learning problems where we need to learn complex non-linear hypotheses. Consider a supervised learning classification problem where you have a training set like this. If you want to apply logistic regression to this problem, one thing you could do is apply logistic regression with a lot of non-linear features like that. So here, g as usual is the sigmoid function, and we can include lots of polynomial terms like these. And if you include enough polynomial terms, then maybe you can get a hypothesis that separates the positive and negative examples. This particular method works well when you have only, say, two features - x1 and x2 - because you can then include all those polynomial terms of x1 and x2. But many interesting machine learning problems have a lot more features than just two.
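As a rough illustration of what "logistic regression with a lot of polynomial terms" looks like in code, here is a minimal sketch assuming a degree-6 expansion of two features x1 and x2; the degree and the parameter vector theta are illustrative choices, not something specified in the lecture:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), the logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-z))

def polynomial_features(x1, x2, degree=6):
    # All terms x1^i * x2^j with i + j <= degree, including the constant (bias) term.
    return np.array([x1**i * x2**j
                     for i in range(degree + 1)
                     for j in range(degree + 1 - i)])

def hypothesis(theta, x1, x2):
    # h_theta(x) = g(theta^T * features(x)); interpreted as P(y = 1 | x).
    return sigmoid(theta @ polynomial_features(x1, x2))
```

With enough such terms (28 of them for degree 6) the decision boundary can curve around the positive examples, which is exactly the point being made here.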
We've been talking for a while about housing prediction. Suppose you have a housing classification problem rather than a regression problem: maybe you have different features of a house, and you want to predict the odds that your house will be sold within the next six months, so that would be a classification problem. And as we saw, we can come up with quite a lot of features - maybe a hundred different features of different houses. For a problem like this, if you were to include all the quadratic terms - that is, all the second-order polynomial terms - there would be a lot of them. There would be terms like x1 squared, x1x2, x1x3, x1x4, up to x1x100, and then you have x2 squared, x2x3, and so on. And if you include just the second-order terms - that is, the terms that are a product of two of these features, like x1 times x2, and so on - then for the case of n equals 100 you end up with about five thousand features. Asymptotically, the number of quadratic features grows roughly as order n squared, where n is the number of original features, like the x1 through x100 that we had. It's actually closer to n squared over two. So including all the quadratic features doesn't seem like a good idea, because that is a lot of features: you might end up overfitting the training set, and it can also be computationally expensive to work with that many features. One thing you could do is include only a subset of these. If you include only the features x1 squared, x2 squared, x3 squared, up to maybe x100 squared, then the number of features is much smaller.
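To make the "about five thousand" and "roughly n squared over two" claims concrete, here is a quick sketch that counts every product x_i * x_j with i <= j (so the squares are included), which is one standard way to enumerate the quadratic terms:

```python
from itertools import combinations_with_replacement

n = 100  # number of original features, x1 ... x100
# Every quadratic term is a product x_i * x_j with i <= j.
quadratic_terms = list(combinations_with_replacement(range(1, n + 1), 2))
print(len(quadratic_terms))   # 5050, i.e. n*(n+1)/2 -- roughly n^2 / 2
```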
Here you would have only 100 such quadratic features, but this is not enough features, and it certainly won't let you fit a data set like the one on the upper left. In fact, if you include only these quadratic features together with the original features x1, and so on up to x100, then you can actually fit some interesting hypotheses - you can fit things like axis-aligned ellipses like these - but you certainly cannot fit a more complex data set like the one shown here. So 5000 features seems like a lot, and if you were to include the cubic, or third-order, terms as well - terms like x1x2x3, x1 squared x2, x10x11x17, and so on - you can imagine there are going to be a lot of these features. In fact, there are going to be on the order of n cubed such features, and if n is 100, you can compute that you end up with on the order of about 170,000 such cubic features. So including these higher-order polynomial features when your original number of features n is large really dramatically blows up your feature space, and this doesn't seem like a good way to come up with additional features with which to build non-linear classifiers when n is large. For many machine learning problems, n will be pretty large. Here's an example. Let's consider the problem of computer vision. Suppose you want to use machine learning to train a classifier to examine an image and tell us whether or not the image is a car. Many people wonder why computer vision could be difficult. I mean, when you and I look at this picture, it is so obvious what this is. You wonder how it is that a learning algorithm could possibly fail to know what this picture is.
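As a quick sanity check on the "about 170,000" figure, here is the same counting sketch extended to cubic terms, assuming we count every product x_i * x_j * x_k with i <= j <= k (so squares times a feature and pure cubes are included):

```python
from itertools import combinations_with_replacement

n = 100
cubic_terms = list(combinations_with_replacement(range(1, n + 1), 3))
print(len(cubic_terms))   # 171700 = C(n + 2, 3), which grows on the order of n^3
```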
To understand why computer vision is hard, let's zoom in to a small part of the image, like the area where the little red rectangle is. It turns out that where you and I see a car, the computer sees this. What it sees is this matrix, or this grid, of pixel intensity values that tells us the brightness of each pixel in the image. So the computer vision problem is to look at this matrix of pixel intensity values and tell us that these numbers represent the door handle of a car. Concretely, when we use machine learning to build a car detector, what we do is come up with a labeled training set with, let's say, a few labeled examples of cars and a few labeled examples of things that are not cars. Then we give our training set to the learning algorithm to train a classifier, and then we can test it by showing it a new image and asking, "What is this new thing?" And hopefully it will recognize that that is a car. To understand why we need non-linear hypotheses, let's take a look at some of the images of cars, and maybe non-cars, that we might feed to our learning algorithm. Let's pick a couple of pixel locations in our images - so that's pixel 1's location and pixel 2's location - and let's plot this car at a point whose position depends on the intensities of pixel 1 and pixel 2. And let's do this with a few other images. So let's take a different example of a car and look at the same two pixel locations. That image has a different intensity for pixel 1 and a different intensity for pixel 2, so it ends up at a different location on the figure. And then let's plot some negative examples as well. That's a non-car, that's a non-car.
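A rough sketch of this idea in code - the tiny image array and the two pixel locations below are made-up placeholders, not taken from the lecture slides:

```python
import numpy as np

# Pretend this is a small grayscale image patch: a grid of pixel intensities, 0-255.
image = np.array([[ 12,  40,  55],
                  [200, 180,  90],
                  [230, 220, 110]], dtype=np.uint8)

# Fix two pixel locations (row, column): the "pixel 1" and "pixel 2" of the plot.
pixel1_loc, pixel2_loc = (0, 1), (2, 2)

# Each image is then summarized by the 2-D point
# (intensity at pixel 1, intensity at pixel 2).
point = (int(image[pixel1_loc]), int(image[pixel2_loc]))
print(point)   # (40, 110); cars and non-cars land in different regions of this plane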
And if we do this for more and more examples, using pluses to denote cars and minuses to denote non-cars, what we'll find is that the cars and non-cars end up lying in different regions of the space, and what we need, therefore, is some sort of non-linear hypothesis to try to separate out the two classes. What is the dimension of the feature space? Suppose we were to use just 50 by 50 pixel images - so suppose our images are pretty small ones, just 50 pixels on a side. Then we would have 2500 pixels, and so the dimension of our feature space would be n equals 2500, where our feature vector x is a list of all the pixel intensities: the brightness of pixel 1, the brightness of pixel 2, and so on, down to the brightness of the last pixel. In a typical computer representation, each of these may be a value between, say, 0 and 255, giving us the grayscale value. So we have n equals 2500, and that's if we were using grayscale images. If we were using RGB images with separate red, green, and blue values, we would have n equals 7500. So, if we were to try to learn a non-linear hypothesis by including all the quadratic features - that is, all the terms of the form xi times xj - then with the 2500 pixels we would end up with a total of about three million features. And that's just too large to be reasonable; the computation would be very expensive to find and to represent all of these three million features per training example.
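The arithmetic behind these numbers, as a short sketch using the same i <= j counting convention as before:

```python
n_gray = 50 * 50          # 2500 grayscale pixel intensities
n_rgb  = 3 * n_gray       # 7500 values with separate red, green, and blue channels

# All quadratic terms x_i * x_j with i <= j over the grayscale features:
n_quadratic = n_gray * (n_gray + 1) // 2
print(n_gray, n_rgb, n_quadratic)   # 2500 7500 3126250 -- roughly three million
```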
So simple logistic regression, together with adding in maybe the quadratic or the cubic features, is just not a good way to learn complex non-linear hypotheses when n is large, because you just end up with too many features. In the next few videos, I'd like to tell you about Neural Networks, which turn out to be a much better way to learn complex hypotheses - complex non-linear hypotheses - even when your input feature space is large, even when n is large. And along the way, I'll also get to show you a couple of fun videos of historically important applications of neural networks, which I hope will be fun for you to watch as well.