1 00:00:00,090 --> 00:00:01,330 In this video, I'd like to 2 00:00:01,500 --> 00:00:02,560 start talking about a second 3 00:00:03,030 --> 00:00:04,620 type of unsupervised learning problem 4 00:00:04,950 --> 00:00:06,320 called dimensionality reduction. 5 00:00:07,600 --> 00:00:08,460 There are a couple of different 6 00:00:08,660 --> 00:00:09,710 reasons why one might 7 00:00:09,890 --> 00:00:11,270 want to do dimensionality reduction. 8 00:00:12,220 --> 00:00:14,420 One is data compression, and as 9 00:00:14,600 --> 00:00:15,860 we'll see later, a few videos 10 00:00:16,570 --> 00:00:18,200 later, data compression not only 11 00:00:18,490 --> 00:00:19,660 allows us to compress the 12 00:00:19,970 --> 00:00:20,940 data and have it therefore 13 00:00:21,330 --> 00:00:22,670 use up less computer memory 14 00:00:23,050 --> 00:00:24,410 or disk space, but it will 15 00:00:24,730 --> 00:00:26,960 also allow us to speed up our learning algorithms. 16 00:00:27,980 --> 00:00:29,490 But first, let's start by 17 00:00:29,620 --> 00:00:31,840 talking about what is dimensionality reduction. 18 00:00:33,490 --> 00:00:35,800 As a motivating example, let's say 19 00:00:35,990 --> 00:00:37,440 that we've collected a data set 20 00:00:37,680 --> 00:00:38,700 with many, many, many features, 21 00:00:39,290 --> 00:00:40,600 and I've plotted just two of them here. 22 00:00:41,600 --> 00:00:42,770 And let's say that unknown to 23 00:00:42,890 --> 00:00:44,000 us two of the 24 00:00:44,070 --> 00:00:45,730 features were actually the length 25 00:00:45,860 --> 00:00:47,150 of something in centimeters, and 26 00:00:47,550 --> 00:00:48,920 a different feature, x2, is 27 00:00:49,460 --> 00:00:51,150 the length of the same thing in inches. 28 00:00:52,250 --> 00:00:53,030 So, this gives us a highly 29 00:00:53,500 --> 00:00:55,910 redundant representation and maybe 30 00:00:56,170 --> 00:00:57,920 instead of having two separate features x1 31 00:00:58,430 --> 00:00:58,820 then x2, 32 00:00:59,090 --> 00:01:00,240 both of which basically measure the 33 00:01:00,370 --> 00:01:01,490 length, maybe what we 34 00:01:01,640 --> 00:01:03,340 want to do is reduce the 35 00:01:03,430 --> 00:01:06,800 data to one-dimensional and 36 00:01:06,920 --> 00:01:08,840 just have one number measuring this length. 37 00:01:09,620 --> 00:01:11,080 In case this example seems a 38 00:01:11,150 --> 00:01:13,920 bit contrived, this centimeter and 39 00:01:14,030 --> 00:01:15,850 inches example is actually not that 40 00:01:16,220 --> 00:01:17,140 unrealistic, and not that different 41 00:01:17,510 --> 00:01:18,870 from things that I see happening in industry. 42 00:01:19,970 --> 00:01:21,320 If you have hundreds 43 00:01:21,790 --> 00:01:23,160 or thousands of features, it is 44 00:01:23,240 --> 00:01:24,450 often this easy to 45 00:01:24,680 --> 00:01:26,580 lose track of exactly what features you have. 46 00:01:26,930 --> 00:01:28,190 And sometimes may have 47 00:01:28,420 --> 00:01:29,650 a few different engineering teams, maybe 48 00:01:30,110 --> 00:01:31,090 one engineering team gives you 49 00:01:31,200 --> 00:01:32,500 two hundred features, a second 50 00:01:32,770 --> 00:01:34,000 engineering team gives you another 51 00:01:34,340 --> 00:01:35,420 three hundred features, and a 52 00:01:35,550 --> 00:01:36,640 third engineering team gives you five 53 00:01:36,940 --> 00:01:38,150 hundred features so you have 54 00:01:38,290 --> 00:01:39,220 a thousand features all together, 55 00:01:39,940 --> 00:01:40,910 and it actually becomes hard to 56 00:01:41,040 --> 00:01:42,820 keep track of you know, exactly which features 57 00:01:43,200 --> 00:01:44,540 you got from which team, and 58 00:01:44,860 --> 00:01:47,310 it's actually not that want to have highly redundant features like these. 59 00:01:47,530 --> 00:01:49,440 And so if the 60 00:01:50,090 --> 00:01:51,520 length in centimeters were rounded 61 00:01:51,940 --> 00:01:53,920 off to the nearest centimeter and 62 00:01:54,060 --> 00:01:56,480 lengthened inches was rounded off to the nearest inch. 63 00:01:57,070 --> 00:01:58,050 Then, that's why these examples 64 00:01:58,720 --> 00:01:59,900 don't lie perfectly on a 65 00:02:00,100 --> 00:02:01,270 straight line, because of, you know, round-off 66 00:02:01,740 --> 00:02:03,420 error to the nearest centimeter or the nearest inch. 67 00:02:04,260 --> 00:02:05,160 And if we can reduce 68 00:02:05,610 --> 00:02:06,680 the data to one dimension 69 00:02:07,130 --> 00:02:10,320 instead of two dimensions, that reduces the redundancy. 70 00:02:11,590 --> 00:02:14,030 For a different example, again maybe when there seems fairly less contrives. 71 00:02:14,590 --> 00:02:16,560 For may years I've 72 00:02:16,920 --> 00:02:19,920 been working with autonomous helicopter pilots. 73 00:02:20,990 --> 00:02:22,610 Or I've been working with pilots that fly helicopters. 74 00:02:23,950 --> 00:02:24,040 And so. 75 00:02:25,080 --> 00:02:28,090 If you were to measure--if you 76 00:02:28,250 --> 00:02:29,100 were to, you know, do a survey 77 00:02:29,590 --> 00:02:30,500 or do a test of these different 78 00:02:30,770 --> 00:02:32,200 pilots--you might have one 79 00:02:32,440 --> 00:02:33,780 feature, x1, which is maybe 80 00:02:34,050 --> 00:02:35,600 the skill of these 81 00:02:35,820 --> 00:02:38,190 helicopter pilots, and maybe 82 00:02:38,460 --> 00:02:41,810 "x2" could be the pilot enjoyment. 83 00:02:42,700 --> 00:02:43,770 That is, you know, how 84 00:02:43,870 --> 00:02:45,050 much they enjoy flying, and maybe 85 00:02:45,280 --> 00:02:46,810 these two features will be highly correlated. And 86 00:02:48,310 --> 00:02:49,730 what you really care about might 87 00:02:49,940 --> 00:02:52,530 be this sort of 88 00:02:53,610 --> 00:02:55,120 this sort of, this direction, a different feature that really 89 00:02:55,370 --> 00:02:57,190 measures pilot aptitude. 90 00:03:00,450 --> 00:03:01,240 And I'm making up the name 91 00:03:01,590 --> 00:03:03,220 aptitude of course, but again, if 92 00:03:03,320 --> 00:03:04,780 you highly correlated features, maybe 93 00:03:04,990 --> 00:03:06,500 you really want to reduce the dimension. 94 00:03:07,570 --> 00:03:08,760 So, let me say a 95 00:03:09,040 --> 00:03:09,950 little bit more about what it 96 00:03:10,060 --> 00:03:11,390 really means to reduce the 97 00:03:11,520 --> 00:03:12,950 dimension of the data from 98 00:03:13,150 --> 00:03:14,400 2 dimensions down from 2D 99 00:03:14,600 --> 00:03:16,300 to 1 dimensional or to 1D. 100 00:03:16,840 --> 00:03:18,660 Let me color in 101 00:03:18,830 --> 00:03:19,940 these examples by using different 102 00:03:21,600 --> 00:03:21,600 colors. 103 00:03:21,730 --> 00:03:22,890 And in this case 104 00:03:23,370 --> 00:03:24,740 by reducing the dimension what 105 00:03:25,010 --> 00:03:26,320 I mean is that I would 106 00:03:26,540 --> 00:03:28,400 like to find maybe this 107 00:03:28,660 --> 00:03:30,560 line, this, you know, direction on 108 00:03:30,710 --> 00:03:31,700 which most of the data seems 109 00:03:31,910 --> 00:03:33,150 to lie and project all 110 00:03:33,380 --> 00:03:34,740 the data onto that line which 111 00:03:34,910 --> 00:03:36,230 is true, and by doing 112 00:03:36,510 --> 00:03:37,430 so, what I can do 113 00:03:37,970 --> 00:03:39,420 is just measure the 114 00:03:39,580 --> 00:03:41,480 position of each of the examples on that line. 115 00:03:42,010 --> 00:03:42,820 And what I can do is come 116 00:03:43,100 --> 00:03:45,080 up with a new feature, z1, 117 00:03:46,830 --> 00:03:48,200 and to specify the position 118 00:03:48,730 --> 00:03:49,530 on the line I need only 119 00:03:49,890 --> 00:03:50,940 one number, so it says 120 00:03:51,200 --> 00:03:51,980 z1 is a new feature 121 00:03:52,750 --> 00:03:54,630 that specifies the location of 122 00:03:54,830 --> 00:03:57,610 each of those points on this green line. 123 00:03:58,060 --> 00:03:59,300 And what this means, is 124 00:03:59,400 --> 00:04:00,680 that where as previously if i 125 00:04:00,930 --> 00:04:02,540 had an example x1, maybe 126 00:04:03,430 --> 00:04:04,740 this was my first example, x1. 127 00:04:05,040 --> 00:04:06,480 So in order to 128 00:04:06,820 --> 00:04:08,550 represent x1 originally x1. 129 00:04:09,620 --> 00:04:10,760 I needed a two dimensional number, 130 00:04:11,570 --> 00:04:12,800 or a two dimensional feature vector. 131 00:04:13,700 --> 00:04:14,920 Instead now I can represent 132 00:04:18,120 --> 00:04:20,330 z1. I could 133 00:04:20,520 --> 00:04:22,170 use just z1 to represent my first 134 00:04:23,270 --> 00:04:25,380 example, and that's going to be a real number. 135 00:04:25,940 --> 00:04:29,260 And similarly x2 you know, if x2 136 00:04:29,590 --> 00:04:31,400 is my second example there, 137 00:04:32,690 --> 00:04:35,110 then previously, whereas this required 138 00:04:35,830 --> 00:04:37,520 two numbers to represent if I 139 00:04:37,720 --> 00:04:39,930 instead compute the projection 140 00:04:40,930 --> 00:04:42,730 of that black cross 141 00:04:43,130 --> 00:04:44,250 onto the line. 142 00:04:44,700 --> 00:04:45,980 And now I only need one 143 00:04:46,210 --> 00:04:47,350 real number which is 144 00:04:47,550 --> 00:04:49,580 z2 to represent the 145 00:04:49,620 --> 00:04:51,230 location of this point 146 00:04:51,790 --> 00:04:53,070 z2 on the line. 147 00:04:54,300 --> 00:04:56,730 And so on through my M examples. 148 00:04:57,790 --> 00:04:59,560 So, just to summarize, if 149 00:04:59,810 --> 00:05:01,310 we allow ourselves to approximate 150 00:05:02,340 --> 00:05:03,800 the original data set by 151 00:05:04,000 --> 00:05:05,270 projecting all of my 152 00:05:05,590 --> 00:05:07,690 original examples onto this green 153 00:05:07,880 --> 00:05:10,260 line over here, then I 154 00:05:10,360 --> 00:05:12,090 need only one number, I 155 00:05:12,170 --> 00:05:13,700 need only real number to 156 00:05:13,820 --> 00:05:15,270 specify the position of 157 00:05:15,370 --> 00:05:16,710 a point on the line, 158 00:05:17,080 --> 00:05:18,220 and so what I can 159 00:05:18,300 --> 00:05:19,730 do is therefore use just 160 00:05:20,070 --> 00:05:21,850 one number to represent the 161 00:05:21,930 --> 00:05:23,170 location of each of 162 00:05:23,280 --> 00:05:26,520 my training examples after they've been projected onto that green line. 163 00:05:27,570 --> 00:05:29,060 So this is an approximation to 164 00:05:29,210 --> 00:05:30,300 the original training self because 165 00:05:30,570 --> 00:05:32,770 I have projected all of my training examples onto a line. 166 00:05:33,630 --> 00:05:34,790 But 167 00:05:35,130 --> 00:05:36,140 now, I need to keep around 168 00:05:36,530 --> 00:05:39,800 only one number for each of my examples. 169 00:05:41,220 --> 00:05:42,960 And so this halves the memory 170 00:05:43,340 --> 00:05:44,640 requirement, or a space requirement, 171 00:05:45,090 --> 00:05:47,760 or what have you, for how to store my data. 172 00:05:49,100 --> 00:05:50,530 And perhaps more interestingly, more 173 00:05:50,700 --> 00:05:51,940 importantly, what we'll see 174 00:05:52,200 --> 00:05:53,520 later, in the later 175 00:05:53,780 --> 00:05:55,730 video as well is that this 176 00:05:55,930 --> 00:05:56,940 will allow us to make 177 00:05:57,200 --> 00:05:59,170 our learning algorithms run more quickly as well. 178 00:05:59,480 --> 00:06:00,600 And that is actually, 179 00:06:00,920 --> 00:06:02,060 perhaps, even the more interesting 180 00:06:02,140 --> 00:06:03,800 application of this data compression 181 00:06:04,580 --> 00:06:06,220 rather than reducing the memory 182 00:06:06,680 --> 00:06:08,620 or disk space requirement for storing the data. 183 00:06:10,250 --> 00:06:11,490 On the previous slide we 184 00:06:11,580 --> 00:06:13,140 showed an example of reducing 185 00:06:13,620 --> 00:06:15,060 data from 2D to 1D. 186 00:06:15,210 --> 00:06:16,290 On this slide, I'm going 187 00:06:16,660 --> 00:06:18,010 to show another example of reducing 188 00:06:18,450 --> 00:06:21,080 data from three dimensional 3D to two dimensional 2D. 189 00:06:22,590 --> 00:06:23,360 By the way, in the more typical 190 00:06:23,750 --> 00:06:25,570 example of dimensionality reduction 191 00:06:26,390 --> 00:06:27,790 we might have a thousand dimensional 192 00:06:28,230 --> 00:06:30,330 data or 1000D data that 193 00:06:30,720 --> 00:06:31,880 we might want to reduce to 194 00:06:32,150 --> 00:06:34,080 let's say a hundred dimensional or 195 00:06:34,110 --> 00:06:35,590 100D, but because of 196 00:06:35,700 --> 00:06:37,760 the limitations of what I can plot on the slide. 197 00:06:38,460 --> 00:06:41,520 I'm going to use examples of 3D to 2D, or 2D to 1D. 198 00:06:43,160 --> 00:06:45,830 So, let's have a data set like that shown here. 199 00:06:46,050 --> 00:06:47,420 And so, I would have a set of examples 200 00:06:48,110 --> 00:06:49,430 x(i) which are points 201 00:06:49,800 --> 00:06:51,790 in r3. So, I have three dimension examples. 202 00:06:52,740 --> 00:06:53,300 I know it might be a little 203 00:06:53,690 --> 00:06:54,610 bit hard to see this on the slide, 204 00:06:54,920 --> 00:06:55,980 but I'll show a 3D point 205 00:06:56,310 --> 00:06:58,190 cloud in a little bit. 206 00:06:59,050 --> 00:07:00,280 And it might be hard to see 207 00:07:00,380 --> 00:07:01,970 here, but all of this 208 00:07:02,230 --> 00:07:04,020 data maybe lies roughly on 209 00:07:04,130 --> 00:07:05,700 the plane, like so. 210 00:07:07,110 --> 00:07:08,130 And so what we can do 211 00:07:08,380 --> 00:07:09,970 with dimensionality reduction, is take 212 00:07:10,210 --> 00:07:11,960 all of this data and 213 00:07:12,110 --> 00:07:13,800 project the data down onto 214 00:07:14,630 --> 00:07:15,350 a two dimensional plane. 215 00:07:15,700 --> 00:07:16,670 So, here what I've done is, 216 00:07:16,730 --> 00:07:18,060 I've taken all the data and I've 217 00:07:18,300 --> 00:07:19,250 projected all of the data, 218 00:07:19,770 --> 00:07:21,390 so that it all lies on the plane. 219 00:07:22,590 --> 00:07:23,910 Now, finally, in order to 220 00:07:24,040 --> 00:07:25,580 specify the location of a 221 00:07:25,750 --> 00:07:27,810 point within a plane, we need two numbers, right? 222 00:07:28,000 --> 00:07:29,150 We need to, maybe, specify the 223 00:07:29,290 --> 00:07:30,660 location of a point along 224 00:07:30,970 --> 00:07:32,370 this axis, and then also 225 00:07:32,650 --> 00:07:35,090 specify it's location along that axis. 226 00:07:35,730 --> 00:07:37,470 So, we need two numbers, maybe called 227 00:07:37,850 --> 00:07:39,900 z1 and z2 to specify 228 00:07:40,600 --> 00:07:42,450 the location of a point within a plane. 229 00:07:43,290 --> 00:07:44,730 And so, what that means, 230 00:07:44,890 --> 00:07:45,910 is that we can now represent 231 00:07:46,690 --> 00:07:48,310 each example, each training example, 232 00:07:48,740 --> 00:07:50,310 using two numbers that 233 00:07:50,630 --> 00:07:52,950 I've drawn here, z1, and z2. 234 00:07:53,990 --> 00:07:55,890 So, our data can be represented 235 00:07:56,610 --> 00:07:59,130 using vector z which are in r2. 236 00:08:00,580 --> 00:08:02,110 And these subscript, z subscript 237 00:08:02,350 --> 00:08:03,990 1, z subscript 2, what 238 00:08:04,560 --> 00:08:05,440 I just mean by that is that my 239 00:08:05,500 --> 00:08:07,520 vectors here, z, you know, are two 240 00:08:07,750 --> 00:08:09,680 dimensional vectors, z1, z2. 241 00:08:10,600 --> 00:08:11,580 And so if I have some 242 00:08:11,790 --> 00:08:13,690 particular examples, z(i), or 243 00:08:13,760 --> 00:08:15,700 that's the two dimensional vector, z(i)1, 244 00:08:16,350 --> 00:08:19,110 z(i)2. 245 00:08:20,580 --> 00:08:21,990 And on the previous slide when 246 00:08:22,230 --> 00:08:23,750 I was reducing data to one 247 00:08:23,950 --> 00:08:25,270 dimensional data then I 248 00:08:25,360 --> 00:08:27,500 had only z1, right? 249 00:08:27,760 --> 00:08:28,610 And that is what a z1 subscript 1 250 00:08:28,700 --> 00:08:29,830 on the previous slide was, 251 00:08:30,550 --> 00:08:31,720 but here I have two dimensional data, 252 00:08:32,100 --> 00:08:32,730 so I have z1 and z2 as 253 00:08:33,040 --> 00:08:34,940 the two components of the data. 254 00:08:36,690 --> 00:08:37,830 Now, let me just make sure 255 00:08:38,020 --> 00:08:39,200 that these figures make sense. So 256 00:08:39,290 --> 00:08:40,790 let me just reshow these exact 257 00:08:41,600 --> 00:08:45,080 three figures again but with 3D plots. 258 00:08:45,540 --> 00:08:46,570 So the process we went through was that 259 00:08:47,040 --> 00:08:48,110 shown in the lab is the optimal 260 00:08:48,480 --> 00:08:49,520 data set, in the middle the 261 00:08:49,590 --> 00:08:50,540 data set projects on the 2D, 262 00:08:51,040 --> 00:08:52,140 and on the right the 2D 263 00:08:52,820 --> 00:08:54,900 data sets with z1 and z2 as the axis. 264 00:08:55,780 --> 00:08:56,610 Let's look at them a little 265 00:08:56,820 --> 00:08:57,960 bit further. Here's my original 266 00:08:58,270 --> 00:08:59,210 data set, shown on the 267 00:08:59,410 --> 00:09:00,680 left, and so I had started 268 00:09:01,380 --> 00:09:02,420 off with a 3D point 269 00:09:02,660 --> 00:09:04,000 cloud like so, where the 270 00:09:04,360 --> 00:09:05,390 axis are labeled x1, 271 00:09:05,570 --> 00:09:07,410 x2, x3, and so there's a 3D 272 00:09:07,960 --> 00:09:08,970 point but most of the data, 273 00:09:09,500 --> 00:09:10,750 maybe roughly lies on some, 274 00:09:10,850 --> 00:09:12,800 you know, not too far from some 2D plain. 275 00:09:13,930 --> 00:09:14,950 So, what we can 276 00:09:15,040 --> 00:09:17,460 do is take this data and here's my middle figure. 277 00:09:17,800 --> 00:09:19,110 I'm going to project it onto 2D. 278 00:09:19,370 --> 00:09:20,790 So, I've projected this data so 279 00:09:20,900 --> 00:09:23,220 that all of it now lies on this 2D surface. 280 00:09:23,750 --> 00:09:25,330 As you can see all the data 281 00:09:26,190 --> 00:09:27,470 lies on a plane, 'cause we've 282 00:09:27,700 --> 00:09:30,520 projected everything onto a 283 00:09:30,570 --> 00:09:31,490 plane, and so what this means is that 284 00:09:31,800 --> 00:09:33,190 now I need only two numbers, 285 00:09:33,820 --> 00:09:35,090 z1 and z2, to represent 286 00:09:35,620 --> 00:09:37,470 the location of point on the plane. 287 00:09:40,530 --> 00:09:41,480 And so that's the process that 288 00:09:41,810 --> 00:09:42,990 we can go through to reduce our 289 00:09:43,500 --> 00:09:45,180 data from three dimensional to 290 00:09:45,340 --> 00:09:48,520 two dimensional. So that's 291 00:09:49,230 --> 00:09:50,850 dimensionality reduction and how 292 00:09:51,070 --> 00:09:52,740 we can use it to compress our data. 293 00:09:54,010 --> 00:09:55,400 And as we'll see 294 00:09:55,580 --> 00:09:56,970 later this will allow us to 295 00:09:57,110 --> 00:09:58,020 make some of our learning algorithms 296 00:09:58,580 --> 00:09:59,670 run much later as well, but 297 00:09:59,740 --> 00:10:01,210 we'll get to that only in a later video.