So far, whatever we've done in machine learning has assumed that data is labeled in the training set with some classes: positive or negative, buy or browse, or types of animals, for example. But in the real world, there's nobody always telling you what classes there are. In some cases you can measure it, by figuring out which users actually bought something and which users didn't. But in the case of sentiment, for example, it's difficult to imagine how one would get that input, whether a sentence is positive or negative, without some human actually doing the labeling. So the question does arise: how do classes emerge from the data without any external human input? Or, in the real world, how do we figure out that an animal is an animal, a table is an object, a chair is a chair, and so on, without someone explicitly telling us what is an object and what is not? Well, the answer is clustering.

In clustering, groups of similar users or user queries are clumped together, or clustered together, based on the terms that they contain. So if a large number of queries contained "red flowers", "yellow flowers", "cheap flowers", etcetera, then all of these queries would get clumped into a cluster which essentially talks about flowers and their colors. Similarly, comments which often use words like "great", "love", and "excellent" would naturally go together and form a cluster of positive sentiment, whereas others with words like "hate", "uncomfortable", and "very difficult" would end up in another cluster, which would hopefully be the negative ones. Now, those are ideal situations; in practice, the clusters that emerge directly from the data need not be so nicely aligned with the classes that we actually expect.

In the real world, for example, we might be looking at observations of animals having features like the number of legs, the size of the head, and things like that. We might group them based on which animals appear to be similar in terms of their head size, the number of legs they have, or the noises that they make, and perhaps we'd get clusters which look like the groups representing the animal classes that we would naturally assign. That's probably how we actually assign classes to objects in the real world in the first place.
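To make the query-clustering idea concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the example queries, the TF-IDF term representation, and the choice of two clusters are illustrative assumptions, not something prescribed in the lecture.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical user queries: some about flowers, some sentiment-like comments.
queries = [
    "red flowers",
    "yellow flowers",
    "cheap flowers delivery",
    "great product, love it, excellent",
    "love the excellent service",
]

# Represent each query by the terms it contains (TF-IDF weights).
X = TfidfVectorizer().fit_transform(queries)

# Group similar queries; two clusters is an arbitrary choice for this sketch.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for query, label in zip(queries, labels):
    print(f"cluster {label}: {query}")
```

In practice one would cluster far more queries and tune the number of clusters; the point is only that grouping happens from the terms alone, with no labels provided.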
An important part of all of this, of figuring out what classes there are from the data, is the fact that we assume the features, on the basis of which classes emerge, are given. This is very important to note, because clustering is often highly dependent on which features one chooses. Later on this week, we will see how we might be able to find the features themselves from the data itself. But for the moment, let's assume that the features are given.

So the goal of clustering is to find regions of our space X, remember our space X with features being the elements X1 to XN, that are more populated than random data. Now let's look at this statement a little bit carefully. "Regions" means that things which are in the same region are close together, or similar to each other, because their values of the features, that is, the X values, are close in some respects. Some may agree and some may not, but even where they don't agree in every respect they are still close. For example, large and extra large are closer to each other than either is, perhaps, to small.

Secondly, these regions are not just regions which have a few instances in them, but are more populated than one would otherwise expect if the data were totally random, in the sense that the X values for each of the features were chosen at random from all the possible values that they could have.

So what we're trying to find are regions where the probability of X falling in that particular region, given the data that we have, is larger than the probability that the data would take that value if it were generated uniformly, that is, if it were completely random data, with every element of X chosen at random.

Another way of looking at this is the following. Suppose we have all our data, which is the black dots in the space here. In clustering we don't have the output variable, since we don't have the class, but we want to find these classes, so let's set y equal to one for all the dots which are actually there in our data. That means the actual observations that we experience in the real world
are all set to one. Then imagine that we add to this data some other data which we choose at random: we just pick X values at random and throw them into the data set. Those are the lighter-colored dots; we throw a lot of them in and assign them values y equal to zero.

Now, with this definition of our data, which is no longer just the old data but the old data plus this random data, we want to figure out what the clusters are. If we now define our f of X just like before, as the expected value of Y given X, well, Y is one for the data that we actually observed and zero for the data that we added at random. Then the expected value of Y given X is essentially the probability of X under the real data, divided by the probability of X under the real data plus the probability of X under the random data. If you just work this out, it's the ratio R over one plus R, where R is the ratio of the real-data probability to the random-data probability P0, because you can divide the numerator and the denominator by P0.

This particular function will have extreme values. There will be some areas where it's very large, because a lot of the dots in that region have a one associated with them. In other areas it will not be large: there are no black dots in that area, but there are a lot of random dots. So finding the regions where this function is large gives us clusters.

Now, this is quite important when we have big data, because with big data we can actually afford to do this kind of clustering, which can sometimes even be done efficiently. But that's not normally done, because big data is quite new. The traditional means of clustering are k-means clustering, agglomerative or hierarchical clustering, and even the locality sensitive hashing that we discussed in week one is a form of clustering, because it groups similar items together in an unsupervised manner. Of course, it doesn't care whether the clusters are big or small, so often it doesn't give us great clusters. But it can definitely be used as a first step towards more careful clustering, where we try to find regions which actually have large values of f, that is, which contain a large number of points from our data, all of which are close together.
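To make the random-data construction concrete, here is a minimal sketch of the idea, assuming a two-dimensional feature space, made-up "real" observations, and scikit-learn's RandomForestClassifier as one possible way to estimate f of X; the lecture does not prescribe any particular classifier or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical "real" observations: two dense blobs in a 2-D feature space (y = 1).
real = np.vstack([
    rng.normal(loc=[0.2, 0.2], scale=0.05, size=(200, 2)),
    rng.normal(loc=[0.8, 0.7], scale=0.05, size=(200, 2)),
])

# Background points drawn uniformly over the same range (y = 0).
background = rng.uniform(low=0.0, high=1.0, size=(400, 2))

X = np.vstack([real, background])
y = np.concatenate([np.ones(len(real)), np.zeros(len(background))])

# A probabilistic classifier estimates f(x) = E[Y | X = x] = R / (1 + R).
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Points where f(x) is large lie in regions denser than uniform data: clusters.
f = clf.predict_proba(real)[:, 1]
print("fraction of real points with f(x) > 0.8:", np.mean(f > 0.8))
```

Real observations sitting in dense regions get predicted values of f close to one, while isolated points look no better than the uniform background; a traditional algorithm like k-means would instead find such groups directly from distances between points.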
Once you have a cluster, one can assign a class label to each cluster and say, okay, cluster one is class A, cluster two is class B. We can't really give them names, because we don't really know how to name them. As human beings we figure out how to name them through language and collective agreement, but, you know, the computer doesn't know how to do that.

So clustering is one way of having classes emerge from data in an unsupervised manner, and we've seen one means of clustering, locality sensitive hashing, even if it doesn't give a great set of clusters. We won't really study the other means in this course; there are other courses where you can get into all kinds of clustering algorithms, as well as supervised classification algorithms. But we're going to move on to look at other kinds of learning. So the message here is that clustering allows us to get classes from data without having to do any explicit labeling, but an important aspect is that one needs to have the features, because otherwise you don't have any basis on which to cluster, and if you choose your features wrong, then you'll get the wrong clusters.

Another nice point that I'd like you to note is that we used the same formulation of f of X equal to the expected value of Y given X, with an appropriate definition of our space as including not only the original data points but also some random data. So we can use the same mechanism of defining the problem in terms of the function f of X. While this doesn't normally yield any practical benefit, it certainly allows us to understand both classification and clustering, and some of the other techniques that we will study very soon, with the same formalism.

It also allows us to imagine a situation where, if one were able to find decision boundaries of functions like f of X, or efficiently find regions where f of X is large or small, one could solve classification, clustering, and a whole bunch of other machine learning problems with one set of methods. However, this remains a research area, even though much work has been done in this direction in the past, and it assumes even more importance now with big data, where using the data directly often yields much better results than if one actually had only very small amounts of data.