Let's now dig into a more detailed description of this course content.

The course overview that we just talked about is going to be presented in four different modules. So let's step through each of these modules and discuss, at a high level, what we will present.

In our first module, we're going to discuss the document retrieval task, where there is some document that a person is reading and we want to search over all the other documents to find a similar one that they might also be interested in reading. To do this, we're going to talk about nearest neighbor search, where we compute the distances between our query article and every other article out there, and then retrieve the document that's closest, the one with the smallest distance.

There are a couple of key components to doing this nearest neighbor search. One is how we're going to represent our documents, and the second is how we're going to compute the similarity between documents, or the distance between them. Both how we represent our data and how we compute these distances really influence what we think of as a nearest neighbor. We're going to go through various choices for each of these components and talk about some of the tradeoffs that are made.

Then we're going to turn to how to scale up nearest neighbor search to really large data sets, because our brute-force approach of computing distances to every single data point in our data set is computationally really, really intensive for just performing one nearest neighbor query. So instead we're going to talk about a really cool data structure, called KD-trees, that allows us, in the context of nearest neighbor search, to prune out a large portion of the search space when we compute our nearest neighbor. Well, at least, it can often do that, though you can get unlucky.

But KD-trees, although very cool, don't work very well in high-dimensional spaces. So if you think of our documents as having a very large vocabulary, and of our document representation as being based on the size of that vocabulary, KD-trees no longer seem like that attractive an approach for doing our nearest neighbor search.
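To make the brute-force baseline described above concrete, here is a minimal sketch, assuming documents have already been turned into word-count vectors over a shared vocabulary; the function names, the toy data, and the use of NumPy are illustrative choices, not part of the course materials. It compares two of the distance choices mentioned, Euclidean and cosine, to show how that choice can change which document comes back as the nearest neighbor.

```python
import numpy as np

def euclidean_distance(x, y):
    # Straight-line distance between two word-count vectors.
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_distance(x, y):
    # 1 - cosine similarity; insensitive to overall document length.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def brute_force_nearest_neighbor(query, corpus, distance):
    # Compute the distance from the query to every document (O(N) per query)
    # and return the index of the closest one along with its distance.
    distances = [distance(query, doc) for doc in corpus]
    return int(np.argmin(distances)), min(distances)

# Toy word-count vectors over a 4-word vocabulary (made-up data).
corpus = np.array([
    [10.0, 0.0, 2.0, 1.0],
    [1.0, 8.0, 0.0, 3.0],
    [90.0, 10.0, 20.0, 10.0],  # same word proportions as the query, but 10x longer
])
query = np.array([9.0, 1.0, 2.0, 1.0])

# Euclidean distance picks document 0; cosine distance picks document 2,
# because it ignores how long the document is.
print(brute_force_nearest_neighbor(query, corpus, euclidean_distance))
print(brute_force_nearest_neighbor(query, corpus, cosine_distance))
```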
So instead, we're going to talk about ways to do approximate nearest neighbor search using something called locality sensitive hashing. One of the concepts we're going to teach in this course is that in a lot of applications we don't actually care about finding the exact nearest neighbor; one that is pretty close to it is good enough. And when we're willing to accept that type of approximation, we can do things much more efficiently in some cases.

Then, in the second module, we're going to turn to this idea of clustering, where our goal is to discover groups of related articles based just on the content of the articles themselves. So this is going to be an example of an unsupervised learning task.

The first algorithm we're going to present for performing this type of clustering is k-means, where the goal is to minimize the sum of the squared distances between data points and the set of learned cluster centers. So we can think of this as a partitioning of our data set, where we're making hard assignments of data points to specific clusters. And again, this is an example of an unsupervised algorithm, because we're just taking the inputs themselves and trying to learn this structure from them without having any output labels; we don't know the cluster labels for any of the data points in this data set.

Then, later in the second module, we're going to turn to scaling up k-means to large data sets using something called MapReduce. MapReduce is a framework for parallelizing operations over a number of different machines. We're going to discuss the general MapReduce framework, which can be used in a wide range of different settings, as well as its specific application to k-means.
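Here is a minimal single-machine sketch of how one k-means iteration splits into a map phase and a reduce phase; the function names and the toy data are illustrative assumptions, not the course's actual implementation, and a real MapReduce system would run the map calls in parallel across machines.

```python
import numpy as np
from collections import defaultdict

# k-means tries to minimize the sum of squared distances
#   sum_i || x_i - mu_{z_i} ||^2,
# where z_i is the hard cluster assignment of data point x_i.

def kmeans_map(point, centers):
    # Map step: assign a single data point to its closest center and emit a
    # (cluster_id, point) pair. Each point is handled independently, which is
    # what makes this step easy to parallelize.
    distances = [np.sum((point - c) ** 2) for c in centers]
    return int(np.argmin(distances)), point

def kmeans_reduce(assigned_points):
    # Reduce step: for each cluster id, average the points assigned to it
    # to produce the updated cluster center.
    grouped = defaultdict(list)
    for cluster_id, point in assigned_points:
        grouped[cluster_id].append(point)
    return {cid: np.mean(points, axis=0) for cid, points in grouped.items()}

def kmeans_iteration(data, centers):
    # One full iteration = one map phase followed by one reduce phase.
    assigned = [kmeans_map(x, centers) for x in data]
    return kmeans_reduce(assigned)

# Toy 2-D data and two initial centers (made up for illustration).
data = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans_iteration(data, centers))
```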
Then, in our third module, we're going to turn to probabilistic models for performing clustering. The goal here is to capture uncertainty in the cluster assignments that we're making. So, for example, maybe there's an article where it's not really clear: is it about science, or is it about world news? Or really, since we don't have these labels, is it more related to the group of articles that have content about science, or to the group of articles that have content about world news? There are some articles where it might not be clear exactly what assignment to make, and we want the algorithm to present us with this uncertainty instead of just providing a hard assignment of a document to a given cluster.

One thing we can do with this type of output is use it to help learn a user's preferences over a set of different topics. If we have feedback from the user about articles that they liked or didn't like, and we also have the uncertainty about which cluster each article was associated with, we can put both of these pieces of information together to get a better description of what impact the user's feedback should have on our learned topic preferences for that user.

The probabilistic model that we're going to use for clustering is called a mixture model. Here, when we're thinking about assigning data points to clusters, it's not just the cluster center that comes into play, like in k-means, but also the shape of the cluster. And, again unlike in k-means, we're not going to make hard assignments of data points to clusters; we're going to make what are called soft assignments, where data points can have weights in multiple clusters, and those weights are based on our uncertainty about the assignment of the data point to each cluster.

So the goal here is to go from a set of unlabeled data points, like the gray cloud of points shown here, to a set of colored points. But now, instead of making hard assignments, there's going to be an entire spectrum of colors representing the allocation of data points to clusters and the uncertainty in that allocation. For example, if you look at the points lying between the blue and green clusters, there are points whose shading is somewhere in between those two colors, meaning there's uncertainty about whether those observations are assigned to the blue cluster or the green cluster. And likewise between the green and fuchsia clusters.

The algorithm we're going to go through for inferring these soft assignments of data points to clusters is referred to as expectation maximization, or the EM algorithm.
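To give a feel for what a soft assignment looks like, here is a minimal sketch of the responsibility computation (the E-step of EM) for a Gaussian mixture, assuming SciPy is available; the means, covariances, and mixture weights are made up for illustration and are not from the course.

```python
import numpy as np
from scipy.stats import multivariate_normal

def soft_assignments(X, weights, means, covariances):
    # For each data point x_i, compute the responsibility of each cluster k:
    #   r_ik = pi_k * N(x_i | mu_k, Sigma_k) / sum_j pi_j * N(x_i | mu_j, Sigma_j).
    # Unlike k-means, both the center (mean) and the shape (covariance) matter.
    likelihoods = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=S)
        for w, m, S in zip(weights, means, covariances)
    ])
    return likelihoods / likelihoods.sum(axis=1, keepdims=True)

# Two made-up 2-D clusters with different shapes.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covariances = [np.eye(2), np.array([[2.0, 0.0], [0.0, 0.5]])]
weights = [0.5, 0.5]

X = np.array([[0.0, 0.0],   # clearly closest to the first cluster
              [2.0, 0.0],   # in between: the assignment is genuinely uncertain
              [4.0, 0.0]])  # clearly closest to the second cluster
print(np.round(soft_assignments(X, weights, means, covariances), 3))
```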
And finally, in our fourth module, we're going to describe a more intricate probabilistic model for analyzing documents, called Latent Dirichlet Allocation, or LDA.

To motivate the ideas behind LDA, let's look at the specific document shown here. What we see is that this document has a bunch of science-related words. So maybe, based on its content, we might group this article with other documents that are also about science-related topics. But this article also has a lot of tech-related words, so maybe a better description is to group it with the tech articles. Really, though, this document is about both science and technology, and that's exactly what LDA allows us to capture. It provides what's called mixed membership, where a given document is allowed to belong to multiple different topics. And not only is a document associated with multiple topics, but LDA provides the relative proportions of those topics in the document.

In particular, every topic within the LDA model is described as a distribution over the words in the vocabulary. So what we see is that maybe the science topic puts more weight on science-related words, whereas the tech topic puts more weight on words related to technology, and likewise for sports and so on.

But the crazy thing is that we're going to be doing this in a completely unsupervised manner. We're just going to provide the words of each document, and do this for all documents in the corpus. So we have a whole bunch of words for every document in the corpus, and just from that alone we're going to infer these topic-specific distributions over the vocabulary, as well as the document-specific proportions of topics in each document. We're going to go through exactly how we can extract this kind of information, but what we see is that this is another example of discovering structure in data in an unsupervised manner.
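As a concrete illustration of those two outputs, here is a minimal sketch using scikit-learn's LatentDirichletAllocation on a tiny made-up corpus, assuming a recent version of scikit-learn; the corpus, the number of topics, and all parameter choices are illustrative only and do not come from the course assignments, which develop their own inference procedure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus: two rough themes (science-ish and tech-ish words).
docs = [
    "experiment data hypothesis lab measurement",
    "software code compiler hardware chip",
    "experiment lab data chip hardware code",   # mixes both themes
]

# Bag-of-words representation: each document becomes a vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Fit LDA with 2 topics (a deliberately small, illustrative choice).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)   # document-specific topic proportions

# Topic-specific word distributions: normalize the learned topic-word weights.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

vocab = vectorizer.get_feature_names_out()
for k, dist in enumerate(topic_word):
    top_words = [vocab[i] for i in dist.argsort()[::-1][:3]]
    print(f"topic {k}: {top_words}")
print("per-document topic proportions:\n", doc_topic.round(2))
```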