In this module we've talked about a document retrieval task, and we've also talked about a notion of clustering, where we're trying to uncover some underlying structure in the data, and we discussed many different areas in which this notion of clustering can be really useful. So let's go through the workflow for a clustering algorithm. And if you feel like you already know this at this point, because you've seen this workflow in two other modules, well, wake up! This one is going to be a bit different.

Okay, so let's talk about our training data. Here, our training data for a document clustering task is a table of document IDs and document text. So we have a whole bunch of documents, and we have the text associated with each of them. Then we're going to extract some set of features. We talked about a lot of different ways to represent a document, but the one I'll use as an example here is our tf-idf, term frequency-inverse document frequency, representation.

Then we're going to try to cluster our documents based on this representation. So we put these features through some machine learning model, which in this case is a clustering model, and for each document we output a cluster label. So the output, y hat, is our cluster label.

Okay, here's where things get interesting, because we want to assess the accuracy of our cluster labels. Well, in this case we don't have true cluster labels, so I should say y hat is our predicted, or estimated, cluster label. We don't have a true cluster label y to compare against; that y does not exist. And that's because, as we talked about, we're in an unsupervised learning setting. Unsupervised.

Okay, so we don't have that, but somehow we want to assess some measure of accuracy of our clustering. So let's draw a little picture here, which is going to be the Voronoi tessellation from our k-means algorithm, where we have some set of cluster centers and some data points scattered around them.
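To make this first half of the workflow concrete, here is a minimal sketch of going from raw text to tf-idf features to predicted cluster labels. It assumes scikit-learn and a small hypothetical list of documents; the course notebooks use a different toolkit, so treat the library calls and parameters here as illustrative rather than as the course's own code.

# Sketch: documents -> tf-idf features -> k-means -> predicted labels y_hat.
# Assumes scikit-learn; the document list is a hypothetical stand-in for
# the doc-id / doc-text table described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "machine learning uncovers structure in data",
    "clustering groups similar documents together",
    "the game ended with a dramatic overtime goal",
]

# Step 1: represent each document as a tf-idf vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Step 2: feed the features to a clustering model (k-means here) and
# read off the predicted cluster label y_hat for each document.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
y_hat = kmeans.fit_predict(X)
print(y_hat)  # one predicted label per document; no true labels exist to compare against

Note that nothing in this sketch ever sees a true label y, which is exactly the unsupervised setting just described.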
And the measure of accuracy that we're going to use, the way we're going to measure quality, is to look at how coherent our clustering is. So we look at the distances from each observation to its assigned cluster center, and a good clustering algorithm makes those distances very small.

Okay, so the goal is to minimize these distances, and what we see is that to measure this quality, to measure these distances, we need our data, our tf-idf vectors. Those come in here, and then we also need the cluster centers. And W hat, our current estimate of the model parameters in the k-means algorithm, is our set of cluster centers; that's what W hat represents. So of course, to measure these distances, we also need W hat.

So instead of having actual cluster labels to assess accuracy, we take our document representation and our cluster centers and plug them into this quality measure, which looks at distances to cluster centers. That's our measure of error, though it's not really an error; it's just a measure of quality. Since that word is a little confusing, let's just write "distances to cluster centers."

And then what's our algorithm? We're talking about k-means as a method of doing clustering. Of course there are others, but let's focus on k-means. Well, what's k-means doing? Let's just redraw this diagram; actually, I can switch to another color, which will save us some time. K-means is trying to minimize this distance, or rather the sum of these distances, and the way it does that is iterative: it takes the W hat we had before and moves it to a new W hat that represents the center of mass of the points assigned to it. So the cluster centers get shifted, and this one will land directly on top of that observation. And so this is the workflow of clustering.
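Here is a small sketch of that quality measure and of a single k-means iteration, written with NumPy on dense toy vectors purely for illustration; the function names, the toy data, and the initial centers W_hat are all made up for this example, not part of the course code.

import numpy as np

def clustering_quality(X, centers, labels):
    # Sum of squared distances from each observation to its assigned
    # cluster center: the "distances to cluster centers" measure
    # (lower means a more coherent clustering).
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))

def kmeans_update(X, centers):
    # One k-means iteration: assign each point to its nearest center,
    # then move each center to the mean (center of mass) of its points.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    new_centers = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
        for k in range(len(centers))
    ])
    return labels, new_centers

# Toy 2-D data standing in for tf-idf vectors, plus current centers W_hat.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
W_hat = np.array([[1.0, 1.0], [4.0, 4.0]])
labels, W_hat = kmeans_update(X, W_hat)
print(clustering_quality(X, W_hat, labels))  # total squared distance to the updated centers

Each call to kmeans_update shifts the centers toward the center of mass of their assigned points, which is what drives the distances, and hence the quality measure, down.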
Let's just say it at a high level one more time. We take our documents and represent them in some way, using raw word counts, tf-idf, or normalizations of these; there are lots of options, bigrams, trigrams, and other things we can look at, for our document representation. Then our clustering algorithm, such as k-means, produces cluster labels, and we iteratively loop through here again and again, updating our cluster centers, the parameters of this clustering model, by looking at how far the assigned observations are from those cluster centers.

So in this module, in contrast to other modules, we actually presented you with some of the algorithmic details behind the methods we're looking at. Specifically, for clustering we talked about the k-means algorithm, and for our document retrieval task we also talked about doing nearest neighbor search, provided some of the algorithmic details for that, and had you explore it in an IPython notebook for doing Wikipedia entry retrieval.

So at this point, you really should be able to go out there and build a really cool retrieval system for news article retrieval, or any other really cool retrieval application that I can't think of right now. But of course there are lots of interesting examples, so go out there and think of ideas that I can't think of right now.
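As a companion to the nearest neighbor retrieval mentioned above, here is a minimal retrieval sketch over tf-idf vectors. It again assumes scikit-learn and a tiny hypothetical corpus; the Wikipedia notebook in the course uses a different library, so the specific classes and parameters here are illustrative only.

# Sketch: nearest neighbor retrieval over tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = [
    "the senate passed a new budget bill today",
    "the team won the championship in overtime",
    "astronomers discovered a new exoplanet nearby",
]  # hypothetical stand-ins for news articles or Wikipedia entries

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Cosine distance is a common choice for text, since it downweights
# differences in document length.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

query = vectorizer.transform(["a new law passed by the senate"])
distances, indices = nn.kneighbors(query)
print(indices[0])  # indices of the most similar documents to the query

Swapping the toy corpus for real articles gives you, in essence, the retrieval system described in this module.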