So that was our deep dive into retrieval, where a lot of our focus was not only on how we think about representing our data and computing distances, which are critical to the performance of the nearest neighbor algorithm, but also, very heavily, on how we scale up nearest neighbor search to really large data sets, using KD-trees and locality-sensitive hashing, and using both of these to perform approximate nearest neighbor search.

But for the remainder of the course, we turn to this idea of clustering. In particular, in our second module, we first discussed what clustering means, and we returned to our document analysis, where the goal was to discover groups of related articles. The first algorithm we presented for performing clustering was something called k-means, which is probably the most widely used clustering algorithm. Remember that in this algorithm we somehow initialize by placing a set of cluster centers, and then the algorithm iterates between making hard assignments of data points to the nearest cluster center and updating the cluster centers. This procedure repeats until convergence. In particular, we cast k-means as a coordinate descent algorithm, and through this we know that it's going to converge to a local mode of the objective. So importantly, we know that we're not necessarily going to converge to the global mode; the algorithm only converges to a local mode.

We went through this in pictures, showing that for different initializations, we can converge to quite different solutions. So in practice, people often run multiple initializations of the algorithm and choose the best one. And we discussed how initialization matters really significantly both for the quality of the solution found and for the convergence rate.
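To make that iteration concrete, here is a minimal sketch of the k-means updates described above, written in plain NumPy. The function name kmeans, the random-point initialization, and the convergence check on the assignments are choices made for illustration, not code taken from the course.

```python
import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    """Plain k-means: classify, recenter, repeat until convergence."""
    rng = np.random.default_rng(seed)
    # Initialize by picking k data points at random as the starting cluster centers.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    assignments = np.full(len(data), -1)

    for _ in range(max_iters):
        # Classify step: hard-assign each point to its nearest cluster center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)

        # Converged when no assignment changes; this is a local mode of the objective,
        # not necessarily the global one.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments

        # Recenter step: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = data[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    return centers, assignments
```

Because the result depends on initialization, one would typically run this several times with different seeds and keep the run with the lowest within-cluster sum of squared distances.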
But, just like in our first module on retrieval, one of our big emphases in this second module was on the scalability of the procedure. And for this, we introduced a generic framework called MapReduce for parallelizing computations across multiple machines under certain specific setups.

In particular, we talked about the MapReduce abstraction in terms of a map step, which has to be data-parallel over elements, and a reduce step, which performs some aggregation and has to be data-parallel over keys. If your algorithm has this structure, then you can think about using MapReduce to parallelize and distribute the computations. And we showed that k-means indeed has this type of structure: we can think of the classify step, making the hard assignments of data points to cluster centers, as data-parallel over elements, and the recenter step, where we recompute the center of each cluster, as data-parallel over keys. So we have a map and a reduce step here, but in k-means we mentioned that we have to iterate between these steps again and again, which is a non-standard use of MapReduce. Still, it can definitely be applied in this context and gives us really good scaling properties for k-means.
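As a rough illustration of that structure, here is how the two k-means steps could be phrased as a map function and a reduce function. This is only a single-machine sketch of the abstraction; the function names and the (key, value) layout are assumptions for illustration and are not tied to any particular MapReduce system.

```python
import numpy as np
from collections import defaultdict

def classify_map(point, centers):
    """Map step (data-parallel over elements): emit (nearest-center id, point)."""
    dists = np.linalg.norm(centers - point, axis=1)
    return int(dists.argmin()), point

def recenter_reduce(cluster_id, points):
    """Reduce step (data-parallel over keys): new center = mean of the cluster's points."""
    return cluster_id, np.mean(points, axis=0)

def kmeans_iteration(data, centers):
    """One k-means iteration written as a map phase followed by a reduce phase."""
    # Map phase: every point is classified independently of the others.
    grouped = defaultdict(list)
    for point in data:
        cluster_id, value = classify_map(point, centers)
        grouped[cluster_id].append(value)

    # Reduce phase: each cluster's center is recomputed independently per key.
    new_centers = np.array(centers, dtype=float)
    for cluster_id, points in grouped.items():
        _, new_centers[cluster_id] = recenter_reduce(cluster_id, points)
    return new_centers
```

In an actual distributed setting, each iteration would be one MapReduce job, with the updated centers shipped back to the mappers before the next pass, which is the "iterate again and again" pattern mentioned above.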