We're gonna assume that
no labels are provided. And we're gonna aim to infer groups of
related articles, which are clusters. Here the input is gonna be a vector. So every observation that we're
plotting here is our word count vector. And in this case, we're just looking at a
very simple example with a vocabulary that only has two words. So each vector has two entries,
the count of word 1 and the count of word 2. And this axis here is word 2 and
this axis is word 1. Of course, remember that in reality we
tend to have very large vocabularies and we have these big high
dimensional vectors. So, when we're plotting our observations,
they're really in this high dimensional space but for visualization,
let's just look at this 2D representation. So we have a whole bunch of
documents here, all represented by their word counts over these two
different words in the vocabulary. Okay, so that's the input to
a clustering algorithm, and the output is gonna be cluster labels. And so
what I mean is that this observation and all of these observations here,
all of these get labeled as red. So maybe they get assigned
some cluster label, say 1. So let's just call this cluster 1. So every document
is gonna get some label. So that one will be labeled 1. Then all of these observations here
they're gonna get some other label. And let's assume that this is cluster 2. So this observation gets the label 2. And all of these observations
here would get the label 3. So that's gonna be the output
of this algorithm. And maybe what you could do is, post facto, you could go through and
look at some articles in cluster 1 and you could say that this cluster
is really a cluster about sports. And I just wanna write down explicitly
that this label is provided post facto. Okay, well, this is an example of an unsupervised learning task, because
we're operating without any labels. All we have are observations and we're trying to uncover some
structure in these observations. So, again, just to reiterate,
the input is our word count vectors, and the output is, for
every document in the corpus, we're gonna associate some
cluster label with that document. Okay, well, what defines a cluster? Well, every cluster is defined
by a cluster center, so maybe I'll mark the cluster
centers with Xs. And then there's the shape of the cluster,
and these ellipses represent
the shape of each of these clusters. And so, when we think about
whether this observation here should be assigned
to the green cluster or the red cluster, what we're doing is we're
looking at how similar this article is to other articles
based on the shape of this cluster. So we score every observation based on the cluster center as
well as the shape of that cluster. And in this case, because this cluster
has this kind of oblong, skewed shape, the observation actually gets assigned to the green
cluster instead of this red cluster. But another approach, which is very common, is that instead of looking at the shape of the
cluster, we just look at cluster centers. So we just measure the distance
of this observation to each cluster center. So maybe let me change colors here.
In this alternative approach, we would just look at
the distance of this observation to the green cluster center
versus the red cluster center. And in this case, it would be very
challenging to decide whether that article should go with the green cluster or
the red cluster. But, there are other cases like this
observation here where pretty obvious with this metric that it would
get assigned to this red cluster.
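
To make the two scoring ideas concrete, here is a minimal sketch in Python with NumPy. Everything in it is made up for illustration: the two cluster centers, the covariance matrices standing in for the ellipses, and the two test observations are all hypothetical. The lecture does not name a specific shape-aware score, so the squared Mahalanobis distance used here is just one assumed choice; a full probabilistic model would typically also account for the size of each ellipse and how common each cluster is.

```python
import numpy as np

# Hypothetical cluster centers in the 2D word-count space (word 1, word 2).
centers = {
    "green": np.array([2.0, 2.0]),
    "red": np.array([8.0, 2.0]),
}

# Hypothetical cluster shapes, written as covariance matrices (the ellipses).
shapes = {
    "green": np.array([[9.0, 0.0], [0.0, 1.0]]),  # oblong, stretched along word 1
    "red": np.array([[1.0, 0.0], [0.0, 1.0]]),    # round
}


def center_distances(x, centers):
    """Euclidean distance from observation x to each cluster center."""
    return {name: float(np.linalg.norm(x - c)) for name, c in centers.items()}


def shape_scores(x, centers, shapes):
    """Squared Mahalanobis distance from x to each cluster (assumed score)."""
    scores = {}
    for name, c in centers.items():
        diff = x - c
        scores[name] = float(diff @ np.linalg.solve(shapes[name], diff))
    return scores


def assign(scores):
    """Pick the cluster with the smallest score."""
    return min(scores, key=scores.get)


# An observation halfway between the two centers, and one close to red.
borderline = np.array([5.0, 2.0])
obvious = np.array([9.0, 3.0])

print(center_distances(borderline, centers))      # centers alone cannot separate it
print(assign(shape_scores(borderline, centers, shapes)))
print(assign(center_distances(obvious, centers)))
```

In this sketch the borderline observation is equally far from both centers, so the center-only approach has no basis for a choice, while the shape-aware score favors the oblong green cluster. The second observation goes to the red cluster under either approach.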