[MUSIC] Welcome to the Clustering and Retrieval
course in the Machine Learning Specialization. And as we're going to see, clustering and
retrieval are both widely used and broadly applicable tools. These tools can be used for anything from finding products similar to
the one that you're currently browsing, to discovering groups of patients
with related medical conditions. And in the Foundations course, we provided a high-level introduction to
the concepts of clustering and retrieval. But now in this course, we're going to
take a deep dive into the models and algorithms. This course is part of a specialization
that was designed to be taken in a specific sequence. So, although you can take this
course as a standalone course, to get the experience that we intended, we strongly suggest that you
take the entire sequence. In particular, in the Foundations course,
we provided a high-level overview of the concepts that we're going to cover
throughout the entire specialization. Then we went into much more detailed
courses on regression and classification. And in those courses, we learned
general machine learning concepts that are going to appear again in this
course on clustering and retrieval. And we'll go through a specific list of
the concepts that we expect you to know later on in this
introduction to the course. But first, let's provide an overview
of what this course is about. And remember that in the machine
learning Foundations course, we said that machine learning was about
extracting intelligence from data. Well, in the first part of this course,
the machine learning method that we're going to go through is nearest
neighbor search for retrieval tasks. So in particular here, the inputs are going to be the features
of a specific query point, as well as the features of all other
datapoints in a specific dataset. And then the output is going to be
the nearest neighbor to the query point, that is, the point within this entire dataset
that's most similar to this query point. So, what's needed for our nearest neighbor search is to
search over all these other datapoints, compute the similarity
of the query point to each of them, or the distance between them, and
then return the one that's closest. So as a running example that we're
going to use throughout this course, let's imagine that we had
a whole bunch of documents and somebody's reading a specific document. So this is the query
article in gray here, and let's say that the user
likes this article. And we want to retrieve another article
that's similar that they might also like. So this would be the task of
finding the nearest neighbor. But maybe we also want to retrieve
a set of nearest neighbors, so a set of similar documents
to display to this user. And we're going to be providing this
search over all these other documents to retrieve these nearest neighbors just
based on the text of the document. So just based on the features
of our datapoints. And we see retrieval
applications almost everywhere. So perhaps we have some image and
want to find a set of similar images. Or maybe we're shopping for some product
and we want to be provided with a set of other similar items that we
might want to consider instead. Or like we've discussed, maybe we're reading some article and
want to read a related article on a topic. And this kind of idea appears in a lot
of streaming contexts where you might be listening to a song,
watching a movie or watching a TV show. And you want to be provided with
a set of other songs or movies or TV shows that you might
also be interested in. Or perhaps you're a user
on some social network and, based on features of you as a user, you want to be presented with other people
that you might want to connect with. So in general, this idea of retrieving
similar items can be a really, really powerful tool. And throughout this course we're going to
discuss ways to provide this type of search. Then in the second part of
the course we're going to turn to clustering as our machine learning method. And here, our inputs are features associated with
each of our datapoints within a dataset. And the outputs are going to be
cluster labels associated with each of these datapoints, where the goal of the clustering method
is to discover these disjoint sets of datapoints. And we're going to look at a couple of
case studies when we're talking about clustering. But one of them is going to be the same
idea of a document retrieval task where we have an entire
corpus of documents. And here, though, as opposed
to our simple retrieval setting where we're just performing
nearest neighbor search, our goal is to discover a structured
representation of this input space. So a structured grouping of
our corpus into groups that might relate to things like
topics present in this corpus. And just like we saw through retrieval, clustering has applications
almost everywhere as well. So maybe when we're doing image search, we
don't just want to take a given image and provide a set of related images, but we might want to actually structure
our dataset by discovering groups of related images. Or maybe we want to discover groups
of Coursera learners like yourselves, where, based on features of the user and
past behavior on Coursera, we can find users who
are related to one another and might have similar interests, and use this
to better target courses to the learners. So these very simple ideas of retrieval and
clustering actually have a very,
very significant impact in the world and are very much taken for granted. You just go on to your app or
your device and you search for something. And you just assume, okay, of course,
it's going to be able to provide a set of other things that might be related
to the given query that you've made. Or maybe we can very easily just
discover from data that we've seen, what are related groups of people or
items. Well, how do we actually do this? So this is what this course is about and we're also going to touch
upon how we do this at scale. And throughout this course,
like always we're going to learn general machine learning concepts
that are very useful. For example, in this course we're
going to talk about the task of unsupervised learning. And we're also going to talk about
things like the MapReduce framework for scaling up algorithms by
parallelizing their computations. [MUSIC]
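
As a rough illustration of the nearest neighbor search described earlier in this introduction, here is a minimal sketch: it searches over every datapoint, computes a distance to the query point, and returns the closest ones. The lecture does not say which distance to use, so Euclidean distance is an assumption here, and the function names and the tiny word-count feature vectors are hypothetical, made up only for this example.

```python
import math


def euclidean_distance(a, b):
    """Distance between two feature vectors of equal length (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, dataset, k=1):
    """Return the indices of the k datapoints closest to the query point.

    This is a brute-force search: it computes the distance from the query
    to every datapoint in the dataset and keeps the k smallest.
    """
    ranked = sorted(
        range(len(dataset)),
        key=lambda i: euclidean_distance(query, dataset[i]),
    )
    return ranked[:k]


# Hypothetical feature vectors (e.g. word counts) for four documents.
documents = [
    [3.0, 0.0, 1.0],
    [0.0, 4.0, 2.0],
    [2.0, 1.0, 1.0],
    [0.0, 0.0, 5.0],
]

# Hypothetical features of the article the user is currently reading.
query = [3.0, 0.5, 1.0]

closest = nearest_neighbors(query, documents)          # single nearest neighbor
closest_two = nearest_neighbors(query, documents, k=2)  # a set of neighbors
```

Passing `k=1` corresponds to retrieving the single most similar article, and a larger `k` corresponds to retrieving a set of similar articles to display. Because this sketch compares the query against every datapoint, its cost grows with the size of the dataset.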