[MUSIC] So that was our deep dive into retrieval, where a lot of our focus was not only on how we represent our data and compute distances, which are critical to the performance of the nearest neighbor algorithm, but also on how we scale up nearest neighbor search to really large data sets, using KD-trees and locality-sensitive hashing, and using both of these to perform approximate nearest neighbor search.

For the remainder of the course, we turn to the idea of clustering. In particular, in our second module, we first discussed what clustering means, and we returned to our document analysis, where the goal was to discover groups of related articles. The first algorithm we presented for performing clustering was k-means, which is probably the most widely used clustering algorithm. Remember that in this algorithm we initialize by placing a set of cluster centers, and then the algorithm iterates between making hard assignments of data points to the nearest cluster center and updating the cluster centers. This procedure repeats until convergence; a minimal code sketch of this loop appears after this recap.

In particular, we cast k-means as a coordinate descent algorithm, and through this we know that it converges to a local optimum of the objective. So, importantly, we are not necessarily going to converge to the global optimum; the algorithm only converges to a local one. We went through this in pictures, showing that different initializations can converge to quite different solutions. In practice, people therefore often run multiple initializations of the algorithm and choose the best one. And we discussed how initialization matters significantly, both for the quality of the solution found and for the convergence rate.

But, just like in our first module on retrieval, one of our big emphases in this second module was the scalability of the procedure. For this, we introduced a generic framework called MapReduce for parallelizing computations across multiple machines under certain specific setups. In particular, we talked about the MapReduce abstraction in terms of a map step that has to be data-parallel over elements, and a reduce step that performs some aggregation and has to be data-parallel over keys. If your algorithm has this structure, then you can use MapReduce to parallelize and distribute the computations.

And we showed that k-means indeed has this type of structure: the classify step, which makes the hard assignments of data points to cluster centers, is data-parallel over elements, and the recenter step, which recomputes the center of each cluster, is data-parallel over keys. So we have a map and a reduce step here, although in k-means we have to iterate between these steps again and again, which is a non-standard use of MapReduce. But it can definitely be applied in this context and gives us really good scaling properties for k-means. A map/reduce-style sketch of one k-means iteration also appears below. [MUSIC]
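
To make the recap concrete, here is a minimal sketch of the k-means loop described above: alternate hard assignments and center updates until convergence, and compare multiple random initializations by their objective value. This assumes NumPy; the function name `kmeans` and its parameters are illustrative, not the course's reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: alternate hard assignments and center updates."""
    rng = np.random.default_rng(seed)
    # Initialize by placing centers at k randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Classify step: hard-assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Recenter step: move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty).
        new_centers = np.array([
            X[z == j].mean(axis=0) if np.any(z == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged to a local optimum
            break
        centers = new_centers
    # Objective: sum of squared distances to assigned centers,
    # useful for comparing runs from different initializations.
    obj = ((X - centers[z]) ** 2).sum()
    return centers, z, obj

# Illustrative usage: run several initializations, keep the lowest objective.
# X = np.random.default_rng(1).normal(size=(200, 2))
# centers, z, obj = min((kmeans(X, k=3, seed=s) for s in range(10)),
#                       key=lambda result: result[2])
```

Because each run only reaches a local optimum, the multiple-initialization loop in the usage comment is what gives you a shot at a better solution, exactly as discussed in the recap.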
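And here is a single-machine sketch of the map/reduce decomposition of one k-means iteration: the map step classifies points (data-parallel over elements) and the reduce step aggregates per cluster id (data-parallel over keys). In a real deployment, a MapReduce framework would distribute these steps across machines and the outer loop would re-launch the job each iteration; the names `kmeans_map`, `kmeans_reduce`, and `kmeans_iteration` are assumptions for this sketch.

```python
import numpy as np
from collections import defaultdict

def kmeans_map(x, centers):
    # Classify step (map, data-parallel over elements): emit
    # (nearest-center id, (point, 1)); the count of 1 lets the
    # reducer rebuild a mean from partial sums.
    j = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    return j, (x, 1)

def kmeans_reduce(emitted):
    # Recenter step (reduce, data-parallel over keys): sum points
    # and counts per cluster id, then divide to get the new mean.
    sums = defaultdict(lambda: 0.0)
    counts = defaultdict(int)
    for j, (x, n) in emitted:
        sums[j] = sums[j] + x
        counts[j] += n
    return {j: sums[j] / counts[j] for j in sums}

def kmeans_iteration(X, centers):
    # One k-means iteration is one full map+reduce pass; k-means
    # re-runs this pass with the updated centers until convergence,
    # which is the non-standard, iterated use of MapReduce above.
    emitted = [kmeans_map(x, centers) for x in X]   # parallel over elements
    new_centers = kmeans_reduce(emitted)            # parallel over keys
    return np.array([new_centers.get(j, centers[j])
                     for j in range(len(centers))])
```

Note that the reducer only needs per-key sums and counts, which is what makes the recenter step an aggregation that can run independently for each cluster id.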