[MUSIC] So far we've been focusing on a document
retrieval task where somebody is reading an article and we just want to search over
all other articles in the corpus to find one that's most similar or very similar to
the article that the person is reading. But in a lot of cases, we're actually
interested in discovering a structured representation of our data,
of all the articles out there. So, one example of a structured
representation we might be interested in is discovering
a clustering of articles, so discovering groups of articles
that are related to one another. And we're going to discuss both
reasons why we would want to perform such clustering as well as
algorithms for performing this clustering. So let's dig into
the objective of clustering, as well as some motivating
applications for performing clustering within the context
of our document application. So the goal of clustering is to
discover groups of related articles. So, for example,
maybe there's a group where all the articles within this group
are about sports. And maybe there's another group where
all the articles are about world news. If we can discover this type of structure
in our data, then for something like the document retrieval task
that we talked about before, when a person is reading
an article, we can look at what group that article falls in. And then when we go to
retrieve another article for that reader, we can simply search
over the articles within that given group. So if a person's reading a sports article, maybe we can recommend another
sports article just by searching for the nearest neighbor within
this group of sports articles. But we can actually use clustering
to do even fancier things. So, as an example, maybe we're interested
in learning the preferences of a user over a set of different topics. In this case, we're assuming that the
person isn't just interested in sports. They might have other interests as well. And when we're going to present
an article to that user, maybe we'd want to explore some
of these other preferences. So in this case, if we imagine that we
have a clustering of our articles into these groups of related articles, then a user is going to read some
subset of the articles in the corpus, and so, these are the articles
that are not grayed out here. So these are all the articles
that the user read, and then we can imagine that the user gives
us some feedback about the articles, whether they liked the article or not. So, a plus sign is going to say,
yep, they liked the article, and a minus sign is going to mean that
they did not like the article. So this user liked those
two articles in Cluster 1, that green cluster, as well as this article
in Cluster 4, the orange cluster. They didn't like the article that they
read in this blue cluster, Cluster 2, did not like this other
article they read in Cluster 4, liked this article in this blue cluster,
Cluster 2, and so on. And after we get this feedback, or
as we're getting it over time, we can use this to learn this type
of preference factor over topics. One thing that we're going to discuss
is that we don't actually have topic labels like sports, world news, and so on. But we do know that there
are groups of articles, and we can use that to select
articles from these groups. Or we can go in after the fact,
dig into these groups, and put labels on them ourselves. [MUSIC]
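
As a rough illustration of the two uses of clustering described above, here is a minimal Python sketch. It assumes the cluster assignments are already known; how to compute them is not covered here. The article ids, the 2-D feature vectors, the Euclidean distance, the function names, and the "fraction liked" score are all assumptions made for this example, not something specified in the lecture. The feedback dictionary mirrors only the likes and dislikes mentioned above.

```python
import math
from collections import defaultdict

# Hypothetical toy data: article id -> feature vector, and article id -> cluster.
vectors = {
    "a1": (0.0, 0.0), "a2": (1.0, 0.0),   # Cluster 1
    "a3": (0.1, 0.1), "a6": (0.2, 0.9),   # Cluster 2
    "a4": (5.0, 5.0), "a5": (5.0, 6.0),   # Cluster 4
}
cluster_of = {"a1": 1, "a2": 1, "a3": 2, "a6": 2, "a4": 4, "a5": 4}

# +1 means the user liked the article, -1 means they did not.
feedback = {"a1": +1, "a2": +1, "a4": +1, "a3": -1, "a5": -1, "a6": +1}


def nearest_in_cluster(query_id, vectors, cluster_of):
    """Return the nearest article to query_id, searching only its own cluster."""
    candidates = [
        other for other in vectors
        if other != query_id and cluster_of[other] == cluster_of[query_id]
    ]
    if not candidates:
        return None  # the query article is alone in its cluster
    return min(candidates, key=lambda o: math.dist(vectors[query_id], vectors[o]))


def cluster_preferences(feedback, cluster_of):
    """Return cluster -> fraction of rated articles in that cluster that were liked."""
    liked = defaultdict(int)
    rated = defaultdict(int)
    for article, vote in feedback.items():
        cluster = cluster_of[article]
        rated[cluster] += 1
        if vote > 0:
            liked[cluster] += 1
    return {cluster: liked[cluster] / rated[cluster] for cluster in rated}


# "a3" is closer to "a1" overall, but it is in another cluster, so it is skipped.
print(nearest_in_cluster("a1", vectors, cluster_of))
print(cluster_preferences(feedback, cluster_of))
```

Note that the preference scores are keyed by cluster number rather than by a topic name like sports or world news, since the clustering itself does not come with labels.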