So far, whatever we've done in machine learning has assumed that the data in the training set is labeled with some class: positive or negative, buy or browse, the type of animal, and so on. But in the real world, nobody is always telling you what the classes are. In some cases you can measure it, for example by figuring out which users actually bought something and which didn't. But in cases like sentiment, it's difficult to imagine how one would decide whether a sentence is positive or negative without some human actually doing the labeling. So the question does arise: how do classes emerge from the data without any external human input? In the real world, how do we figure out that an animal is an animal, a table is a table, a chair is a chair, and so on, without someone explicitly telling us what is an object and what is not?

Well, the answer is clustering. In clustering, groups of similar users or user queries are clumped, or clustered, together based on the terms that they contain. So if a large number of queries contained "red flowers", "yellow flowers", "cheap flowers", et cetera, then all of these queries would get clumped into a cluster which essentially talks about flowers and their color. Similarly, comments which often use words like "great", "love" and "excellent" would naturally go together and form a cluster of positive sentiment, whereas others with words like "hate", "uncomfortable" and "very difficult" would fall into another cluster, which would hopefully be the negative ones.

Now, those are ideal situations. In practice, the clusters that emerge directly from the data need not map so nicely onto the classes that we actually expect. In the real world, for example, we might be looking at observations of animals with features like the number of legs, the size of the head and things like that, and we might group the animals which appear similar in terms of head size, the number of legs they have, or the noises that they make. Perhaps we would get clusters which roughly correspond to the animal classes that we would naturally name, and that is probably how we actually assign classes to objects in the real world in the first place.

An important point in all of this, in figuring out what classes there are from the data, is that we assume the features on the basis of which classes emerge are given. This is very important to note, because clustering is highly dependent on which features one chooses. Later on this week, we will see how we might be able to find the features themselves from the data itself. But for the moment, let's assume that the features are given.
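Just to make this concrete before we go on, here is a minimal sketch of the kind of term-based query clustering described above. It is only an illustration: the query list, the bag-of-words term counts used as features, the choice of two clusters, and the use of scikit-learn are all assumptions made for this example, not something fixed in the lecture.

```python
# A minimal sketch of clustering queries by the terms they contain.
# Assumptions (not from the lecture): scikit-learn is installed, term counts
# are the features, and we ask for exactly two clusters.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

queries = [
    "red flowers", "yellow flowers", "cheap flowers",
    "great hotel, loved it", "excellent food, love this place",
    "hate the room, very uncomfortable",
]

# Features: one dimension per term; each query becomes a vector of term counts.
X = CountVectorizer().fit_transform(queries)

# Group the queries purely by which terms they share.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, query in zip(labels, queries):
    print(label, query)
```

Notice that the clusters depend entirely on the features we chose: with term counts, the flower queries end up together because they share vocabulary, and a different choice of features would give different clusters.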
So the goal of clustering is to find regions of our space X, remember our space X with features X1 to XN, that are more populated than random data would be. Let's look at this statement a little carefully. First, "regions" means that things in the same region are close together, or similar to each other, simply because their feature values, the X values, are close in some respects. Some features agree, some don't, and even where they don't agree the values are near one another; for example, "large" and "extra large" are closer to each other than either is to "small". Secondly, these regions are not just regions which have a few instances in them, but are more populated than one would otherwise expect if the data were totally random, in the sense that the X values for each of the features were chosen at random from all the possible values they could take.

So what we are trying to find are regions where the probability of X falling in that particular region, given the data that we have, is larger than the probability that X would fall there if the data were generated uniformly, that is, if it were completely random data with every element of X chosen at random.

Another way of looking at this is the following. Suppose we have all our data, which is the black dots in the space here. In clustering we don't have an output variable, since we don't have classes; we want to find the classes. So let's set y equal to one for all the dots which are actually there in our data, that is, the actual observations that we experience in the real world. Now imagine that we add to this data some other data which we choose at random: we just choose X values at random and throw them into the data set, and those are the lighter-colored dots. We throw a lot of them in, and we assign them the value y equal to zero.

With this definition, our data is no longer just the old data but the old data plus this random data, and we can ask how to figure out what the clusters are. If we define our F of X just like before, as the expected value of Y given X, where Y is one for the data that we actually observed and zero for the data that we added at random, then the expected value of Y given X is essentially the probability of X under our data divided by the probability of X under our data plus the probability of X under the random data. If you just work this out, it is the ratio R over one plus R, because you can divide the numerator and the denominator by P0, the probability under the random data, and you get R over one plus R, where R is the ratio of the two probabilities.

This function will have extreme values. It will be very large in some areas, because a lot of the dots in that part of the space have a one associated with them, and it will not be large in other areas, where there are no black dots but there are a lot of random dots. So finding the regions where this function is large gives us our clusters.

Now, this is quite important when we have big data, because with big data we can actually afford to do this kind of clustering, and it can sometimes even be efficient. But it is not normally done, because big data is quite new. The traditional means of clustering are k-means clustering, agglomerative or hierarchical clustering, and even the locality-sensitive hashing that we discussed in week one is a form of clustering, because it groups similar items together in an unsupervised manner. Of course, it doesn't care whether the clusters are big or small, so it often doesn't give us great clusters. But it can definitely be used as a first step towards more careful clustering, where we try to find regions which actually have large values of F, that is, which contain a large number of points from our data, all of which are close together.
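The picture just described, real observations labeled one and uniformly random points labeled zero, can be sketched in a few lines. This is only an illustration under assumptions of my own: NumPy and scikit-learn are available, the "real" data is two Gaussian clumps in two dimensions, a k-nearest-neighbour classifier stands in for whatever estimator of F of X one prefers, and 0.5 is an arbitrary threshold.

```python
# Sketch: find dense regions by contrasting real data (y = 1) against
# uniformly random data (y = 0) and estimating F(x) = E[Y | X = x].
# The data, the classifier, and the threshold are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# "Real" observations: two dense clumps (the black dots in the picture).
real = np.vstack([
    rng.normal(loc=(2.0, 2.0), scale=0.3, size=(200, 2)),
    rng.normal(loc=(7.0, 7.0), scale=0.3, size=(200, 2)),
])

# Random background: X values drawn uniformly over the whole space
# (the lighter-colored dots), labeled zero.
background = rng.uniform(low=0.0, high=10.0, size=(400, 2))

X = np.vstack([real, background])
y = np.concatenate([np.ones(len(real)), np.zeros(len(background))])

# Any classifier that outputs probabilities estimates F(x) = E[Y | X = x],
# which is large exactly where the real data is denser than the background.
model = KNeighborsClassifier(n_neighbors=25).fit(X, y)

# Regions where F(x) is large are the clusters; here we just check that the
# real points, which lie in the two clumps, get high values of F.
F = model.predict_proba(real)[:, 1]
print("fraction of real points with F(x) > 0.5:", (F > 0.5).mean())
```

Thresholding F, or looking for connected regions where F stays large, would recover the two clumps as clusters without any labels ever being supplied, and the same estimate could equally serve as a first step towards the more careful clustering mentioned above.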
Once we have clusters, one can assign a class label to each cluster and say, okay, cluster one is class A, cluster two is class B. We can't really give them meaningful names, because we don't really know how to name them. As human beings, we figure out how to name them through language and collective agreement, but the computer doesn't know how to do that. So clustering is one way for classes to emerge from data in an unsupervised manner, and we have already seen one means of clustering, locality-sensitive hashing, even if it doesn't give a great set of clusters. We won't really study the other means in this course; there are other courses where you can get into all kinds of clustering algorithms, as well as supervised classification algorithms. We are going to move on to look at other kinds of learning.

The message here is that clustering allows us to get classes from data without having to do any explicit labeling, but an important aspect is that one needs to have the features, because otherwise you don't have any basis on which to cluster, and if you choose your features wrongly then you will get the wrong clusters.

Another nice point that I'd like you to note is that we used the same formulation, F of X equal to the expected value of Y given X, with an appropriate definition of our space as including not only the original data points but also some random data. So we can use the same mechanism of defining the problem in terms of the function F of X. While this doesn't normally yield any practical benefit, it certainly allows us to understand both classification and clustering, and some of the other techniques that we will study very soon, within the same formalism. It also allows us to imagine a situation where, if one were able to efficiently find decision boundaries of functions like F of X, or find regions where F of X is large or small, one could solve classification, clustering, and a whole bunch of other machine learning techniques with one set of methods. However, this remains a research area, even though much work has been done in this direction in the past, and it assumes even more importance now with big data, where learning directly from the data often yields much better results than if one had only very small amounts of data.
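For reference, the unified view can be written out in one line. This is just a restatement of the derivation from earlier in this lecture, with p1 denoting the density of the observed data, p0 the density of the random data we added, and the simplifying assumption that we add roughly as many random points as we have real ones:

```latex
F(x) = \mathbb{E}[Y \mid X = x] = P(Y = 1 \mid X = x)
     = \frac{p_1(x)}{p_1(x) + p_0(x)}
     = \frac{R(x)}{1 + R(x)},
\qquad \text{where } R(x) = \frac{p_1(x)}{p_0(x)}.
```

In supervised classification the same F of X is fit against labels a human provided; in clustering it is fit against labels we manufactured by adding random data, which is exactly the sense in which the two problems share one formalism.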