We're gonna assume that no labels are provided, and we're gonna aim to infer groups of related articles, which are clusters. Here the input is gonna be a vector, so every observation that we're plotting here is a word count vector. In this case, we're just looking at a very simple example with a vocabulary that only has two words, so each vector has two entries: a count for word 1 and a count for word 2. This axis here is word 2 and this axis is word 1. Of course, remember that in reality we tend to have very large vocabularies, so we have these big high-dimensional vectors. So when we're plotting our observations, they're really in this high-dimensional space, but for visualization, let's just look at this 2D representation. So we have a whole bunch of documents here, all represented by their word counts over these two words in the vocabulary.
Okay, so that's the input to a clustering algorithm, and the output is gonna be cluster labels. What I mean is that this observation and all of these observations here get labeled as red, so they all get assigned some cluster label. Let's just call this cluster 1. So every document is gonna get some label, and these will be labeled 1. Then all of these observations here are gonna get some other label, and let's assume that this is cluster 2, so this observation gets the label 2. And all of these observations here would get the label 3. So that's gonna be the output of this algorithm.
And maybe what you could do, post facto, is go through and look at some articles in cluster 1, and you could say that this cluster is really a cluster about sports. I just wanna write down explicitly that this label is provided post facto. Okay, well, this is an example of an unsupervised learning task, because we're operating without any labels. All we have are observations, and we're trying to uncover some structure in these observations.
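To make the input concrete, here is a minimal sketch of turning documents into word count vectors over a two-word vocabulary. The vocabulary words, the toy documents, and the function name `word_count_vector` are all made up for illustration and do not come from the lecture.

```python
from collections import Counter

# Hypothetical two-word vocabulary (word 1, word 2), chosen only for illustration.
VOCAB = ["goal", "vote"]

def word_count_vector(text, vocab=VOCAB):
    """Count how often each vocabulary word appears in the text, in vocab order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

# Toy documents, also hypothetical.
docs = [
    "goal goal vote",
    "vote vote vote goal",
    "goal goal goal goal",
]

# One 2D vector per document: this is what would be plotted on the two axes.
vectors = [word_count_vector(doc) for doc in docs]
print(vectors)
```

A clustering algorithm would take `vectors` as its input and return one cluster label per document. Any human-readable name for a cluster, such as "sports", would only be attached afterwards by inspecting the documents in it.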
So, again, just to reiterate, the input is our word count vectors, and the output is that, for every document in the corpus, we're gonna associate some cluster label with that document.
Okay, well, what defines a cluster? Every cluster is defined by a cluster center, so maybe I'll mark the cluster centers with Xs. And then there's the shape of the cluster, and these ellipses represent the shapes of each of these clusters. So when we think about whether this observation here should be assigned to the green cluster or the red cluster, what we're doing is looking at how similar this article is to other articles, based on the shape of the cluster. So we score every observation based on the cluster center as well as the shape of that cluster. And in this case, because this cluster has this kind of oblong, skewed shape, the observation actually gets assigned to the green cluster instead of the red cluster.
But another approach, which is very common, is that instead of looking at the shape of the cluster, we just look at the cluster centers. So, maybe let me change colors here, in this alternative approach we would just measure the distance from this observation to the green cluster center versus the red cluster center. And in this case, it would be very challenging to decide whether that article should go with the green cluster or the red cluster. But there are other cases, like this observation here, where it's pretty obvious with this metric that it would get assigned to the red cluster.
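The two ways of scoring an observation can be sketched as follows. This is only an illustration under stated assumptions: the lecture does not say how shape is scored, so the sketch swaps in Mahalanobis distance with a per-cluster covariance matrix as one possible choice, and the cluster centers, covariances, and test points are hypothetical numbers picked to mimic the picture.

```python
import math

def euclidean(x, center):
    """Distance to the cluster center only; the cluster shape is ignored."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, center)))

def mahalanobis(x, center, cov):
    """Shape-aware distance in 2D for a symmetric covariance [[a, b], [b, d]].

    Assumption: Mahalanobis distance stands in for 'scoring by shape'.
    """
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - center[0], x[1] - center[1]
    # Quadratic form with the inverse covariance (1/det) * [[d, -b], [-b, a]].
    q = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

# Hypothetical clusters: red is round, green is stretched along the first axis.
clusters = {
    "red": {"center": (2.0, 2.0), "cov": ((1.0, 0.0), (0.0, 1.0))},
    "green": {"center": (8.0, 2.0), "cov": ((9.0, 0.0), (0.0, 1.0))},
}

def assign_by_center(x):
    return min(clusters, key=lambda k: euclidean(x, clusters[k]["center"]))

def assign_by_shape(x):
    return min(
        clusters,
        key=lambda k: mahalanobis(x, clusters[k]["center"], clusters[k]["cov"]),
    )

ambiguous = (5.0, 2.0)  # equally far from both centers
obvious = (1.0, 2.5)    # sits right next to the red center

for name, c in clusters.items():
    print(name, euclidean(ambiguous, c["center"]),
          mahalanobis(ambiguous, c["center"], c["cov"]))
print(assign_by_shape(ambiguous), assign_by_center(obvious))
```

With these made-up numbers, the point `ambiguous` is the same Euclidean distance from both centers, so the center-only rule cannot separate them, while the shape-aware score favors the elongated green cluster. The point `obvious` goes to red under either rule.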