1 00:00:00,130 --> 00:00:01,140 In the last video, we talked 2 00:00:01,470 --> 00:00:03,380 about dimensionality reduction for 3 00:00:03,530 --> 00:00:05,090 the purpose of compressing the data. 4 00:00:05,830 --> 00:00:06,770 In this video, I'd like 5 00:00:06,910 --> 00:00:08,140 to tell you about a second application 6 00:00:08,870 --> 00:00:12,490 of dimensionality reduction, and that is to visualize the data. 7 00:00:13,440 --> 00:00:14,210 For a lot of machine learning 8 00:00:14,560 --> 00:00:15,890 applications, it really helps 9 00:00:16,220 --> 00:00:17,650 us to develop effective learning 10 00:00:17,990 --> 00:00:20,260 algorithms if we can understand our data better, 11 00:00:20,610 --> 00:00:21,890 if there is some way of visualizing 12 00:00:22,100 --> 00:00:23,790 the data better, and so 13 00:00:24,080 --> 00:00:25,810 dimensionality reduction often offers us 14 00:00:25,990 --> 00:00:27,870 another useful tool to do so. 15 00:00:28,700 --> 00:00:29,290 Let's start with an example. 16 00:00:30,840 --> 00:00:31,370 Let's say we've collected a large 17 00:00:31,720 --> 00:00:33,190 data set of many statistics 18 00:00:33,840 --> 00:00:35,730 and facts about different countries around the world. 19 00:00:36,030 --> 00:00:37,190 So, maybe the first feature, X1, 20 00:00:38,090 --> 00:00:39,530 is the country's GDP, or the 21 00:00:39,720 --> 00:00:41,710 Gross Domestic Product, and 22 00:00:41,850 --> 00:00:43,210 X2 is the per capita GDP, meaning 23 00:00:43,600 --> 00:00:45,770 the per person GDP, X3 24 00:00:46,080 --> 00:00:48,340 is the human development index, X4 is life 25 00:00:48,530 --> 00:00:51,290 expectancy, X5, X6 and so on. 26 00:00:51,560 --> 00:00:52,670 And we may have a huge data set 27 00:00:52,880 --> 00:00:54,080 like this, where, you know, 28 00:00:54,290 --> 00:00:56,890 there are maybe 50 features for 29 00:00:57,650 --> 00:00:59,660 every country, and we have a huge set of countries.
30 00:01:01,310 --> 00:01:02,300 So is there something 31 00:01:02,810 --> 00:01:05,210 we can do to try to understand our data better? 32 00:01:05,490 --> 00:01:07,200 Given this huge table of numbers, 33 00:01:07,850 --> 00:01:11,010 how do you visualize this data? 34 00:01:11,510 --> 00:01:12,420 If you have 50 features, it's 35 00:01:12,600 --> 00:01:13,970 very difficult to plot 50-dimensional 36 00:01:15,620 --> 00:01:15,620 data. 37 00:01:16,470 --> 00:01:19,060 What is a good way to examine this data? 38 00:01:20,750 --> 00:01:22,820 Using dimensionality reduction, what 39 00:01:22,960 --> 00:01:24,920 we can do is, instead of 40 00:01:25,200 --> 00:01:27,240 having each country represented by 41 00:01:27,430 --> 00:01:30,220 this feature vector, xi, which 42 00:01:30,460 --> 00:01:33,140 is 50-dimensional, so instead 43 00:01:33,410 --> 00:01:34,800 of, say, having a country 44 00:01:35,330 --> 00:01:37,260 like Canada, instead of 45 00:01:37,380 --> 00:01:38,880 having 50 numbers to represent the features 46 00:01:39,320 --> 00:01:41,030 of Canada, let's say we 47 00:01:41,240 --> 00:01:42,350 can come up with a different feature 48 00:01:42,900 --> 00:01:44,930 representation, that is, these 49 00:01:45,320 --> 00:01:47,650 z vectors, which are in R2.
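In symbols, the change of representation described here can be written as below. This is only a notational sketch: the superscript indexing and the symbol m for the number of countries are assumptions added for illustration, not notation stated in this video.

```latex
x^{(i)} \in \mathbb{R}^{50}
\;\longmapsto\;
z^{(i)} = \begin{bmatrix} z^{(i)}_1 \\ z^{(i)}_2 \end{bmatrix} \in \mathbb{R}^{2},
\qquad i = 1, \dots, m
```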
50 00:01:49,590 --> 00:01:50,780 If that's the case, if we 51 00:01:50,910 --> 00:01:51,930 can have just a pair of 52 00:01:52,230 --> 00:01:53,640 numbers, z1 and z2, that 53 00:01:53,790 --> 00:01:55,500 somehow summarize my 50 54 00:01:55,640 --> 00:01:56,730 numbers, maybe what we 55 00:01:56,810 --> 00:01:57,880 can do is to plot 56 00:01:58,190 --> 00:01:59,750 these countries in R2 and 57 00:01:59,970 --> 00:02:01,500 use that to try to 58 00:02:01,590 --> 00:02:03,810 understand the space 59 00:02:03,950 --> 00:02:05,630 of features of different 60 00:02:05,900 --> 00:02:08,250 countries better. And 61 00:02:08,520 --> 00:02:10,690 so, here, what you 62 00:02:10,780 --> 00:02:11,980 can do is reduce the 63 00:02:12,070 --> 00:02:14,630 data from 50 64 00:02:14,850 --> 00:02:16,580 D, from 50 dimensions, 65 00:02:17,470 --> 00:02:18,380 to 2D, so you can 66 00:02:18,740 --> 00:02:19,960 plot this as a 2 67 00:02:20,170 --> 00:02:21,470 dimensional plot, and, when 68 00:02:21,610 --> 00:02:23,060 you do that, it turns out 69 00:02:23,270 --> 00:02:24,110 that, if you look at the 70 00:02:24,280 --> 00:02:25,770 output of the dimensionality reduction algorithms, 71 00:02:26,720 --> 00:02:28,650 it usually doesn't ascribe a 72 00:02:28,920 --> 00:02:32,340 physical meaning to these new features, z1 and z2. 73 00:02:32,710 --> 00:02:35,210 It's often up to us to figure out, you know, roughly what these features mean. 74 00:02:36,810 --> 00:02:39,440 But if you plot those features, here is what you might find. 75 00:02:39,750 --> 00:02:41,090 So, here, every country 76 00:02:41,760 --> 00:02:43,060 is represented by a point 77 00:02:43,820 --> 00:02:44,640 zi, which is in R2, 78 00:02:44,990 --> 00:02:46,440 and so each of those
79 00:02:46,790 --> 00:02:47,780 dots in this figure 80 00:02:48,050 --> 00:02:48,980 represents a country, and so, 81 00:02:49,200 --> 00:02:50,830 here's Z1 and here's 82 00:02:51,200 --> 00:02:53,380 Z2, and [xx] [xx] of these. 83 00:02:54,090 --> 00:02:55,310 So, you might find, 84 00:02:55,680 --> 00:02:57,270 for example, that the horizontal 85 00:02:57,690 --> 00:02:59,240 axis, the Z1 axis, 86 00:03:00,270 --> 00:03:01,980 corresponds roughly to the 87 00:03:02,260 --> 00:03:05,150 overall country size, or 88 00:03:05,230 --> 00:03:07,410 the overall economic activity of a country. 89 00:03:07,800 --> 00:03:09,950 So the overall GDP, overall 90 00:03:10,750 --> 00:03:13,490 economic size of a country. 91 00:03:14,350 --> 00:03:15,860 Whereas the vertical axis in our 92 00:03:15,920 --> 00:03:18,250 data might correspond to the 93 00:03:18,390 --> 00:03:21,430 per person GDP, 94 00:03:22,290 --> 00:03:23,900 or the per person well being, 95 00:03:24,160 --> 00:03:30,730 or the per person economic activity, and 96 00:03:31,030 --> 00:03:32,370 you might find that, given these 97 00:03:32,570 --> 00:03:33,540 50 features, you know, these 98 00:03:34,040 --> 00:03:35,160 are really the 2 main dimensions 99 00:03:35,800 --> 00:03:37,760 of the variation, and so, out 100 00:03:38,170 --> 00:03:39,140 here you may have a country 101 00:03:39,820 --> 00:03:41,220 like the U.S.A., which 102 00:03:41,500 --> 00:03:43,370 has a relatively large GDP, 103 00:03:43,690 --> 00:03:44,990 you know, has a very 104 00:03:45,270 --> 00:03:46,490 large GDP and a relatively 105 00:03:46,710 --> 00:03:48,760 high per-person GDP as well.
106 00:03:49,470 --> 00:03:50,710 Whereas here you might have 107 00:03:51,410 --> 00:03:53,720 a country like Singapore, which 108 00:03:53,970 --> 00:03:55,040 actually has a very 109 00:03:55,390 --> 00:03:56,760 high per person GDP as well, 110 00:03:57,030 --> 00:03:58,010 but because Singapore is a much 111 00:03:58,100 --> 00:03:59,820 smaller country, the overall 112 00:04:01,030 --> 00:04:02,230 economy size of Singapore 113 00:04:03,460 --> 00:04:05,060 is much smaller than the US. 114 00:04:06,270 --> 00:04:08,140 And, over here, you would 115 00:04:08,290 --> 00:04:10,880 have countries where individuals 116 00:04:12,020 --> 00:04:13,320 are unfortunately somewhat less 117 00:04:13,430 --> 00:04:14,660 well off, maybe shorter life expectancy, 118 00:04:15,820 --> 00:04:17,000 less health care, less economic 119 00:04:18,290 --> 00:04:19,370 maturity, as well as smaller 120 00:04:19,700 --> 00:04:21,950 countries, whereas a point 121 00:04:22,280 --> 00:04:23,780 like this will correspond to 122 00:04:24,450 --> 00:04:26,000 a country that has a 123 00:04:26,160 --> 00:04:27,870 fair, has a substantial amount of 124 00:04:28,090 --> 00:04:29,540 economic activity, but where individuals 125 00:04:30,520 --> 00:04:32,520 tend to be somewhat less well off. 126 00:04:32,600 --> 00:04:33,700 So you might find that 127 00:04:33,840 --> 00:04:35,610 the axes Z1 and Z2 128 00:04:35,680 --> 00:04:37,140 can help you to most succinctly 129 00:04:37,670 --> 00:04:39,010 capture really what are the 130 00:04:39,120 --> 00:04:40,120 two main dimensions of the variation 131 00:04:41,360 --> 00:04:42,120 amongst different countries.
132 00:04:43,430 --> 00:04:44,910 Such as the overall economic 133 00:04:45,400 --> 00:04:46,850 activity of the country, reflected 134 00:04:47,390 --> 00:04:48,800 by the size of the 135 00:04:49,090 --> 00:04:50,770 country's overall economy, as well 136 00:04:51,320 --> 00:04:53,440 as the per-person individual 137 00:04:54,040 --> 00:04:55,290 well-being, measured by per-person 138 00:04:56,960 --> 00:04:58,470 GDP, per-person healthcare, and things like that. 139 00:05:00,930 --> 00:05:02,130 So that's how you can 140 00:05:02,290 --> 00:05:04,410 use dimensionality reduction in 141 00:05:04,540 --> 00:05:06,230 order to reduce data from 142 00:05:06,470 --> 00:05:07,860 50 dimensions, or whatever, down 143 00:05:08,150 --> 00:05:09,520 to two dimensions, or maybe 144 00:05:09,680 --> 00:05:11,270 down to three dimensions, so that 145 00:05:11,380 --> 00:05:13,740 you can plot it and understand your data better. 146 00:05:14,840 --> 00:05:16,010 In the next video, we'll start 147 00:05:16,440 --> 00:05:17,580 to develop a specific algorithm, 148 00:05:18,200 --> 00:05:19,500 called PCA, or Principal Component 149 00:05:20,010 --> 00:05:21,360 Analysis, which will allow 150 00:05:21,550 --> 00:05:22,630 us to do this and also 151 00:05:23,820 --> 00:05:26,690 do the earlier application I talked about of compressing the data.
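As a rough illustration of the reduction described in this video, here is a minimal sketch in plain Python that maps a small table of country features down to two numbers per country. It assumes a PCA-style approach (the algorithm the video only names), approximated with power iteration; the function names, the four features, and every number in `rows` are invented for illustration and are not real statistics or the lecture's own code.

```python
import math
import random


def standardize(rows):
    """Scale every feature (column) to zero mean and unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]


def top_direction(matrix, iterations=500, seed=0):
    """Power iteration: approximate the dominant eigenvector and eigenvalue."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return v, eigenvalue


def reduce_to_2d(rows):
    """Project each row onto the two directions of largest variation."""
    x = standardize(rows)
    cov = covariance(x)
    directions = []
    for _ in range(2):
        u, lam = top_direction(cov)
        directions.append(u)
        # Deflation: remove the direction just found before searching again.
        d = len(u)
        cov = [[cov[i][j] - lam * u[i] * u[j] for j in range(d)]
               for i in range(d)]
    return [[sum(a * b for a, b in zip(row, u)) for u in directions]
            for row in x]


# Invented example: total GDP, per-person GDP, life expectancy, population.
names = ["A", "B", "C", "D", "E"]
rows = [
    [20000.0, 60.0, 79.0, 330.0],
    [400.0, 65.0, 83.0, 6.0],
    [3000.0, 2.0, 68.0, 1400.0],
    [50.0, 1.0, 60.0, 40.0],
    [1800.0, 45.0, 82.0, 38.0],
]
points = reduce_to_2d(rows)
for name, (z1, z2) in zip(names, points):
    print(f"country {name}: z1={z1:+.3f}, z2={z2:+.3f}")
```

Each printed pair could then be drawn as one dot on a 2D scatter plot. The features are standardized first because quantities such as total GDP and life expectancy live on very different scales; as the video notes, what z1 and z2 mean still has to be worked out by inspecting the plot.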